
A Survey of Large Language Models


Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence
by machine. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a
significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major
approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving
from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-
training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP)
tasks. Since the researchers have found that model scaling can lead to an improved model capacity, they further investigate the scaling
effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these
enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-
context learning) that are not present in small-scale language models (e.g., BERT). To discriminate the language models in different
parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g.,
containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia
and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has
attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI
community, which would revolutionize the way that we develop and use AI algorithms. Considering this rapid technical progress, in this
survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,
we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we
also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides
an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

• Version: v14 (major update on September 25, 2024).
• GitHub link: https://github.com/RUCAIBox/LLMSurvey
• Chinese book link: lmbook-zh.github.io
• * K. Zhou and J. Li contribute equally to this work.
• The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: [email protected]
• The authors of this survey paper reserve all the copyrights of the figures/tables, and any use of these materials for publication purposes must be officially granted by the survey authors.

1 INTRODUCTION

“The limits of my language mean the limits of my world.”
—Ludwig Wittgenstein

LANGUAGE is a prominent ability in human beings to express and communicate, which develops in early childhood and evolves over a lifetime [3, 4]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, to enable machines to read, write, and communicate like humans [5].

Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, which can be divided into four major development stages:

• Statistical language models (SLM). SLMs [6–9] are developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [10, 11] and natural language processing (NLP) [12–14]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [15] and Good–Turing estimation [16] have been introduced to alleviate the data sparsity problem.

• Neural language models (NLM). NLMs [1, 17, 18] characterize the probability of word sequences by neural networks, e.g., multi-layer perceptron (MLP) and recurrent neural networks (RNNs). As a remarkable contribution, the work in [1] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for text data, a general neural network approach was developed to build a unified, end-to-end solution for


 

[Figure 1 plots: x-axis is time, y-axis is the cumulative number of arXiv papers; landmark models such as BERT, GPT-3, Codex, InstructGPT, ChatGPT, GPT-4 and LLaMA are labeled on the curves.]
(a) Query=”Language Model” (b) Query=”Large Language Model”

Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018)
and “large language model” (since October 2019), respectively. The statistics are calculated using exact match by querying
the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because “language
models” have been explored at an earlier time. We label the points corresponding to important landmarks in the research
progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers
that contain “large language model” in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).

[Figure 2 diagram (task solving capacity over time): Statistical LM (1990s; statistical methods, probability estimation, n-gram models, assist in specific tasks) → Neural LM (2013; neural context modeling, static word representations via Word2vec (NPLM) and NLPS, solve typical NLP tasks) → Pre-trained LM (2018; context-aware representations via ELMo → BERT → GPT-1/2, pre-training + fine-tuning, solve various NLP tasks) → LLM (2020; scaling language models, GPT-3/4 → ChatGPT → Claude, prompt-based completion, general-purpose transferable task solver for various real-world tasks).]

Fig. 2: An evolution process of the four generations of language models (LM) from the perspective of task solving capacity.
Note that the time period for each stage may not be very accurate, and we set the time mainly according to the publish
date of the most representative studies at each stage. For neural language models, we abbreviate the paper titles of
two representative studies to name the two approaches: NPLM [1] (“A neural probabilistic language model”) and NLPS [2]
(“Natural language processing (almost) from scratch”). Due to the space limitation, we don’t list all representative studies in
this figure.

various NLP tasks [2]. Furthermore, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.

• Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Furthermore, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up work, which sets the “pre-training and fine-tuning” learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks.

• Large language models (LLM). Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies

have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community coins the term “large language models (LLM)”1 for these large-sized PLMs [32–35], which attract increasing research attention (see Figure 1). A remarkable application of LLMs is ChatGPT2 that adapts the LLMs from the GPT series for dialogue, which presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers that are related to LLMs after the release of ChatGPT in Figure 1.

As discussed before, language model is not a new technical concept specially for LLMs, but has evolved with the advance of artificial intelligence over the decades. Early language models mainly aim to model and generate text data, while latest language models (e.g., GPT-4) focus on complex task solving. From language modeling to task solving, it is an important leap in scientific thinking, which is the key to understanding the development of language models in the research history. From the perspective of task solving, the four generations of language models have exhibited different levels of model capacities. In Figure 2, we describe the evolution process of language models in terms of the task solving capacity. At first, statistical language models mainly assisted in some specific tasks (e.g., retrieval or speech tasks), in which the predicted or estimated probabilities can enhance the performance of task-specific approaches. Subsequently, neural language models focused on learning task-agnostic representations (e.g., features), aiming to reduce the efforts for human feature engineering. Furthermore, pre-trained language models learned context-aware representations that can be optimized according to downstream tasks. For the latest generation of language models, LLMs are enhanced by exploring the scaling effect on model capacity, and they can be considered as general-purpose task solvers. To summarize, in the evolution process, the task scope that can be solved by language models has been greatly extended, and the task performance attained by language models has been significantly enhanced.

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt that enhances the search results based on LLMs. In the field of CV, researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 [46] has supported multimodal input by integrating the visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs, instead of smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the “secrets” of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand of computation resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite the capacities, LLMs are also likely to produce toxic, fictitious, or harmful contents. It requires effective and efficient control approaches to eliminate the potential risks of the use of LLMs [46].

Faced with both opportunities and challenges, the research and development of LLMs needs more attention. In order to provide a basic understanding of LLMs, this survey

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/
3. https://www.bing.com/new

conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we present an overview about the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters4, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.

Formulation of Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

• KM scaling law5. In 2020, Kaplan et al. [30] (the OpenAI team) firstly proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. Given a compute budget C, they empirically presented three basic formulas for the scaling law6:

L(N) = (N_c / N)^{α_N}, α_N ∼ 0.076, N_c ∼ 8.8 × 10^{13},   (1)
L(D) = (D_c / D)^{α_D}, α_D ∼ 0.095, D_c ∼ 5.4 × 10^{13},
L(C) = (C_c / C)^{α_C}, α_C ∼ 0.050, C_c ∼ 3.1 × 10^{8},

where L(·) denotes the cross entropy loss in nats, and a follow-up study [58] from OpenAI has shown that the language modeling loss can be decomposed into two parts, namely irreducible loss (the entropy of the true data distribution) and reducible loss (an estimate of the KL divergence between the true and model distributions). The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768 to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that the model performance has a strong dependence relation on the three factors.

• Chinchilla scaling law. As another representative study, Hoffmann et al. [34] (the Google DeepMind team) proposed an alternative form for scaling laws to instruct the compute-optimal training for LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients as below [34]:

L(N, D) = E + A / N^{α} + B / D^{β},   (2)

where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the constraint C ≈ 6ND, they showed that the optimal allocation of compute budget to model size and data size can be derived as follows:

N_opt(C) = G · (C/6)^{a},  D_opt(C) = G^{-1} · (C/6)^{b},   (3)

where a = α/(α+β), b = β/(α+β) and G is a scaling coefficient that can be computed by A, B, α and β. As analyzed in [34], given an increase in compute budget, the KM scaling law favors a larger budget allocation in model size than the data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3).

4. In existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B.
5. Since there was not a model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law.
6. Here, N_c, D_c and C_c are measured in the number of non-embedding parameters, the number of training tokens and the number of FP-days, respectively. According to the original paper [30], C_c and C should be denoted by C_c^min and C_min, corresponding to the optimal use of compute. We use the simplified notations for ease of discussions.
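As a rough illustration of how these formulas can be used in practice, the sketch below plugs the fitted constants quoted above into Equation (1) and Equation (2), and searches numerically for the compute-optimal allocation of Equation (3) under the constraint C ≈ 6ND. The function names and the example budget are assumptions made for this illustration; they do not come from the original papers [30, 34].

```python
# Illustrative sketch only: evaluates the KM and Chinchilla scaling laws with the
# constants quoted in the text and searches the compute-optimal (N, D) of Eq. (3)
# numerically under C ≈ 6 * N * D. Function names and the example budget are
# assumptions made for this example, not part of either original paper.

def km_loss_N(N, alpha_N=0.076, Nc=8.8e13):
    """KM scaling law, Eq. (1): L(N) = (Nc / N)^alpha_N, loss in nats."""
    return (Nc / N) ** alpha_N

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla scaling law, Eq. (2): L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / N ** alpha + B / D ** beta

def compute_optimal(C, num=2000):
    """Sweep N on a log grid, set D = C / (6 * N), and keep the allocation with
    the lowest predicted loss (a numerical stand-in for Eq. (3))."""
    best = None
    for i in range(num):
        N = 10 ** (7 + 6 * i / (num - 1))   # 1e7 to 1e13 parameters
        D = C / (6 * N)
        loss = chinchilla_loss(N, D)
        if best is None or loss < best[0]:
            best = (loss, N, D)
    return best

if __name__ == "__main__":
    C = 5.76e23   # an illustrative budget in FLOPs, roughly Chinchilla-scale compute
    loss, N, D = compute_optimal(C)
    print(f"C={C:.2e} FLOPs -> N≈{N:.2e} params, D≈{D:.2e} tokens, L≈{loss:.3f} nats")
```

With the fitted coefficients above, sweeping the budget in this sketch reproduces the qualitative Chinchilla conclusion that model size and data size should grow together as the compute budget increases.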

Discussion on Scaling Laws. After introducing the formulations, we continue to discuss scaling laws in the following two aspects, to enhance the understanding:

• Predictable scaling. In practice, scaling laws can be used to instruct the training of LLMs, and it has been proven feasible to reliably estimate the performance of larger models based on that of smaller models, called predictable scaling [46]. The benefits of predictable scaling for training LLMs are mainly twofold. Firstly, for large models, it is infeasible to rigorously examine various training tricks or variants, and it would be very helpful if experiences gained from small models could also apply to large models. For instance, small proxy models can be trained to find the optimal schedule of the data mixture for large models [59]. Secondly, the training of large-scale models takes a long time, often suffering from issues such as training loss spikes, and scaling laws can be employed to monitor the training status of LLMs, e.g., identifying abnormal performance at an early time. Despite that scaling laws characterize a smooth trend of performance increase (or loss decrease), they also indicate that diminishing returns7 might occur as models scale up. An empirical study [58] from the OpenAI team has shown that representation quality or semantic content can still effectively improve even if approaching the point of diminishing returns (i.e., approaching the irreducible loss) [58]. This finding suggests that training large models is promising for improving the performance of downstream tasks. To further explore the scaling effect, a potential issue is that the amount of available data for training LLMs is actually limited. With the ever-increasing model scale, the public text data would be soon “exhausted” for LLMs [60]. Thus, it will be meaningful to study how scaling laws apply to a data-constrained regime [61], where data repetition or augmentation might be useful to alleviate data scarcity.

• Task-level predictability. Existing research of scaling laws is mostly conducted in terms of language modeling loss (e.g., per-token cross-entropy loss in nats [30]), while in practice we are more concerned about the performance of LLMs on actual tasks. Thus, a basic problem is how the decrease of language modeling loss translates into the improvement of task performance [58]. Intuitively, a model with a smaller language modeling loss tends to yield a better performance on downstream tasks, since language modeling loss can be considered as a general measure of the overall model capacity. GPT-4 [46] has reported that some capabilities (e.g., coding ability) can be accurately predicted via scaling law. Despite that, readers should be aware that a direct decrease in language modeling loss does not always indicate an improvement of model performance on downstream tasks. Specially, the phenomenon of inverse scaling would occur for some tasks, where task performance surprisingly becomes worse as the language modeling loss decreases [62]. Overall, it is more difficult to explore and characterize task-level scaling laws, since task performance might also depend on task-related information (task metric, task difficulty, etc.). Furthermore, some capacities (e.g., in-context learning [55]) are unpredictable according to the scaling law, which can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. It further introduces a notable characteristic when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 63]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 64], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability8.

• In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of input text, without requiring additional training or gradient update9 (a minimal illustrative prompt is sketched after this list of abilities). Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on the arithmetic tasks (e.g., the 3-digit addition and subtraction) for the 13B GPT-3, but 175B GPT-3 even cannot work well on the Persian QA task [31].

• Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [69] found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

7. https://en.wikipedia.org/wiki/Diminishing_returns
8. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Also, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B. It is unclear about the model performance of the untested sizes.
9. In a recent study [65], it also shows that in-context learning implicitly performs meta-optimization through the attention mechanism.

• Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing the prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over the standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting seems to also vary across tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM [33].
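To make the contrast between standard few-shot prompting (in-context learning) and CoT prompting concrete, below is a minimal illustrative sketch; the prompt wording is an assumption made for this example and is not quoted from the cited studies.

```python
# Illustrative prompt formats only; the wording is an assumption for this example
# and is not quoted from the cited papers.

# Standard few-shot (in-context learning) prompt: demonstrations give only final answers.
icl_prompt = """Q: A farmer has 3 baskets with 12 apples each. How many apples in total?
A: 36

Q: A shop sold 7 boxes of 8 pencils. How many pencils were sold?
A: 56

Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. How many does he have now?
A:"""

# Chain-of-thought prompt: demonstrations also spell out intermediate reasoning steps.
cot_prompt = """Q: A farmer has 3 baskets with 12 apples each. How many apples in total?
A: Each basket has 12 apples and there are 3 baskets, so 3 * 12 = 36. The answer is 36.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. How many does he have now?
A:"""

# Both prompts would be sent to the same frozen model; no gradient update is involved.
print(icl_prompt)
print(cot_prompt)
```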
two representative models, GPT-3 and PaLM explored the
How Emergent Abilities Relate to Scaling Laws. In existing scaling limits by increasing the model size to 175B and
literature [30, 31, 34], scaling laws and emergent abilities 540B, respectively. Since compute budget is usually limited,
provide two perspectives to understand the advantage of scaling laws can be further employed to conduct a more
large models over small models. In general, scaling law compute-efficient allocation of the compute resources. For
(often measured by language modeling loss) describes pre- example, Chinchilla (with more training tokens) outper-
dictable performance relation with the potential effect of forms its counterpart model Gopher (with a larger model
diminishing returns, while emergent abilities (often mea- size) by increasing the data scale with the same compute
sured by task performance) are unpredictable but very prof- budget [34]. In addition, data scaling should be with careful
itable once such abilities actually emerge. Since the two cleaning process, since the quality of pre-training data plays
perspectives reflect different performance trends (continu- a key role in the model capacity.
ous improvement v.s. sharp performance leap), they might • Training. Due to the huge model size, it is very chal-
lead to misaligned findings or observations. There are also lenging to successfully train a capable LLM. Distributed
extensive debates on the rationality of emergent abilities. training algorithms are needed to learn the network param-
A popular speculation is that emergent abilities might be eters of LLMs, in which various parallel strategies are of-
partially attributed to the evaluation setting for special tasks ten jointly utilized. To support distributed training, several
(e.g., the discontinuous evaluation metrics) [70, 71]: when optimization frameworks have been released to facilitate
evaluation metrics are altered accordingly, the sharpness of the implementation and deployment of parallel algorithms,
the emergent ability curve would disappear. However, the such as DeepSpeed [74] and Megatron-LM [75–77]. Also, op-
performance of LLMs on most tasks are perceived by users timization tricks are also important for training stability and
naturally in a discontinuous way. For instance, end users model performance, e.g., restart to overcome training loss
prefer a reliable code generated by LLMs that can success- spike [56] and mixed precision training [78]. More recently,
fully pass the test case, but are less interested in selecting a GPT-4 [46] proposes to develop special infrastructure and
better code with fewer errors between two failed ones. More optimization methods that reliably predict the performance
recently, a study [72] proposes a new evaluation setting of large models with much smaller models.
that can enlarge the resolution of task metrics, making task • Ability eliciting. After being pre-trained on large-scale
performance more predictable. Despite these efforts, more corpora, LLMs are endowed with potential abilities as
fundamental research (e.g., grokking10 ) about the working general-purpose task solvers. These abilities might not be
mechanism of LLMs is still in need to understand the emer- explicitly exhibited when LLMs perform some specific tasks.
gence of certain abilities. The subtle relation between scaling As the technical approach, it is useful to design suitable task
law and emergent abilities can be explained by analogy with instructions or specific in-context learning strategies to elicit
the ability acquisition of human11 . Take the speaking ability such abilities. For instance, chain-of-thought prompting has
as an example. For children, language development (espe- been shown to be useful to solve complex reasoning tasks
cially infants) can be also considered as a multi-level process by including intermediate reasoning steps. Furthermore,
where “emergent abilities” occur. Specially, the language we can perform instruction tuning on LLMs with task
ability would relatively stable within a time interval, but descriptions expressed in natural language, for improving
qualitative change only occurs when evolving into another the generalizability of LLMs on unseen tasks. These eliciting
ability level (e.g., from speaking simple words to speaking techniques mainly correspond to the emergent abilities of
simple sentences). Such a learning process is essentially not LLMs, which may not show the same effect on small lan-
smooth and stable (i.e., language ability does not develop at guage models.
a constant rate over time), though a child actually grows • Alignment tuning. Since LLMs are trained to capture
the data characteristics of pre-training corpora (including
10. Grokking refers that “a pattern in the data, improving generaliza- both high-quality and low-quality data), they are likely to
tion performance from random chance level to perfect generalization”,
quoted from the original paper [73].
generate toxic, biased, or even harmful content for humans.
11. This explanation is only for ease of understanding, and there is It is necessary to align LLMs with human values, e.g., helpful,
not direct evidence to connect the two points. honest, and harmless. For this purpose, InstructGPT [66]

designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [66, 79]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, which shows a strong alignment capacity in producing high-quality, harmless responses, e.g., rejecting to answer insulting questions.

• Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on the tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [80, 81]. For example, LLMs can utilize the calculator for accurate computation [80] and employ search engines to retrieve unknown information [81]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)12, which are by analogy with the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.
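To make the tool-use pattern above concrete, here is a minimal sketch in which a (hypothetical) model defers arithmetic to a calculator: the model emits a made-up CALC[...] marker and the surrounding program evaluates it before the answer is returned. The call_llm helper and the marker convention are assumptions for this example, not part of any of the cited systems.

```python
# A minimal sketch of tool (calculator) use, assuming a hypothetical call_llm() helper
# and a made-up CALC[...] marker convention; neither comes from the cited systems.
import re

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; here we fake a model that defers arithmetic.
    return "The warehouse holds CALC[37 * 48] boxes in total."

def safe_calc(expr: str) -> str:
    # Evaluate a very restricted arithmetic expression instead of trusting the model.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(eval(expr, {"__builtins__": {}}, {}))

def answer_with_calculator(question: str) -> str:
    draft = call_llm(question)
    # Replace every CALC[...] marker with the exact result from the calculator tool.
    return re.sub(r"CALC\[(.+?)\]", lambda m: safe_calc(m.group(1)), draft)

print(answer_with_calculator("How many boxes fit in 37 racks of 48?"))
# -> "The warehouse holds 1776 boxes in total."
```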
In addition, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Currently, we limit our discussion to the major technical approaches and key findings for developing LLMs.

2.2 Technical Evolution of GPT-series Models

Due to the excellent capacity in communicating with humans, ChatGPT has ignited the excitement of the AI community since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in ChatGPT and GPT models, we add a special discussion about the technical evolution of the GPT-series models, to briefly summarize the progress of how they have been developed over the past years. Meanwhile, we drew a schematic diagram depicting the technological evolution of the GPT-series models in Figure 4. The basic principle underlying GPT models is to compress the world knowledge into the decoder-only Transformer model by language modeling, such that it can recover (or memorize) the semantics of world knowledge and serve as a general-purpose task solver. Two key points to the success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages13.

Early Explorations. According to one interview with Ilya Sutskever14 (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [121]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [122] and GPT-2 [26], which can be considered as the foundation of the subsequent, more powerful models, i.e., GPT-3 and GPT-4.

• GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1 [122], and coined the abbreviation term GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pre-training and supervised fine-tuning. GPT-1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word.

• GPT-2. Following a similar architecture of GPT-1, GPT-2 [26] increased the parameter scale to 1.5B, which was trained with a large webpage dataset WebText. As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in [123]), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information. In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text. Further, they introduced a more formal claim for this idea: “Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)” [26]15. A basic understanding of this claim is that each (NLP) task can be considered as the word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable of solving various tasks, if it was trained to have sufficient capacity in recovering the world text. These early discussions in GPT-2’s paper were echoed in the interview of Ilya Sutskever by Jensen Huang: “What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world...the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process...”16.
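To make the conditional form p(output|input, task) more tangible, the following sketch flattens a few (task, input, output) triples into plain text so that every task reduces to next-word prediction; the templates are illustrative assumptions rather than the exact formats used for GPT-2.

```python
# Illustrative only: one way to express p(output | input, task) as plain text so that
# every task reduces to next-word prediction; the templates are assumptions for this
# example rather than the exact formats used to train or probe GPT-2.
examples = [
    ("translate English to French", "sea otter", "loutre de mer"),
    ("answer the question", "Who wrote Hamlet?", "William Shakespeare"),
    ("summarize", "The storm closed the airport for two days.", "Airport shut by storm."),
]

for task, inp, out in examples:
    # The (task, input) pair becomes the conditioning context; the output is simply
    # the continuation the language model is asked to predict.
    print(f"{task}: {inp} => {out}")
```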
Capacity Leap. Although GPT-2 is intended to be an “unsupervised multitask learner”, it overall has an inferior performance compared with supervised fine-tuning state-of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, especially the dialog tasks [124, 125]. Based on GPT-2, GPT-3

12. https://openai.com/blog/chatgpt-plugins
13. Note that the discussion of this part can be somewhat subjective. The overall viewpoints and summaries are made based on the understanding of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI.
14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai
15. To better understand this sentence, we put some explanation words in parentheses.
16. https://lifearchitect.ai/ilya/

TABLE 1: Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the
capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs.
In this table, we only include LLMs with a public paper about the technical details. Here, “Release Time” indicates the
date when the corresponding paper was officially released. “Publicly Available” means that the model checkpoints can be
publicly accessible while “Closed Source” means the opposite. “Adaptation” indicates whether the model has been with
subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback.
“Evaluation” indicates whether the model has been evaluated with corresponding abilities in their original paper: ICL
denotes in-context learning and CoT denotes chain-of-thought. “*” denotes the largest publicly available version.

Model | Release Time | Size (B) | Base Model | Adaptation (IT / RLHF) | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | Evaluation (ICL / CoT)

Publicly Available:
T5 [82] Oct-2019 11 - - - 1T tokens Apr-2019 1024 TPU v3 - ✓ -
mT5 [83] Oct-2020 13 - - - 1T tokens - - - ✓ -
PanGu-α [84] Apr-2021 13* - - - 1.1TB - 2048 Ascend 910 - ✓ -
CPM-2 [85] Jun-2021 198 - - - 2.6TB - - - - -
T0 [28] Oct-2021 11 T5 ✓ - - - 512 TPU v3 27 h ✓ -
CodeGen [86] Mar-2022 16 - - - 577B tokens - - - ✓ -
GPT-NeoX-20B [87] Apr-2022 20 - - - 825GB - 96 40G A100 - ✓ -
Tk-Instruct [88] Apr-2022 11 T5 ✓ - - - 256 TPU v3 4h ✓ -
UL2 [89] May-2022 20 - - - 1T tokens Apr-2019 512 TPU v4 - ✓ ✓
OPT [90] May-2022 175 - - - 180B tokens - 992 80G A100 - ✓ -
NLLB [91] Jul-2022 54.5 - - - - - - - ✓ -
CodeGeeX [92] Sep-2022 13 - - - 850B tokens - 1536 Ascend 910 60 d ✓ -
GLM [93] Oct-2022 130 - - - 400B tokens - 768 40G A100 60 d ✓ -
Flan-T5 [69] Oct-2022 11 T5 ✓ - - - - - ✓ ✓
BLOOM [78] Nov-2022 176 - - - 366B tokens - 384 80G A100 105 d ✓ -
mT0 [94] Nov-2022 13 mT5 ✓ - - - - - ✓ -
Galactica [35] Nov-2022 120 - - - 106B tokens - - - ✓ ✓
BLOOMZ [94] Nov-2022 176 BLOOM ✓ - - - - - ✓ -
OPT-IML [95] Dec-2022 175 OPT ✓ - - - 128 40G A100 - ✓ ✓
LLaMA [57] Feb-2023 65 - - - 1.4T tokens - 2048 80G A100 21 d ✓ -
Pythia [96] Apr-2023 12 - - - 300B tokens - 256 40G A100 - ✓ -
CodeGen2 [97] May-2023 16 - - - 400B tokens - - - ✓ -
StarCoder [98] May-2023 15.5 - - - 1T tokens - 512 40G A100 - ✓ ✓
LLaMA2 [99] Jul-2023 70 - ✓ ✓ 2T tokens - 2000 80G A100 - ✓ -
Baichuan2 [100] Sep-2023 13 - ✓ ✓ 2.6T tokens - 1024 A800 - ✓ -
QWEN [101] Sep-2023 14 - ✓ ✓ 3T tokens - - - ✓ -
FLM [102] Sep-2023 101 - ✓ - 311B tokens - 192 A800 22 d ✓ -
Skywork [103] Oct-2023 13 - - - 3.2T tokens - 512 80G A800 - ✓ -

Closed Source:
GPT-3 [55] May-2020 175 - - - 300B tokens - - - ✓ -


GShard [104] Jun-2020 600 - - - 1T tokens - 2048 TPU v3 4d - -
Codex [105] Jul-2021 12 GPT-3 - - 100B tokens May-2020 - - ✓ -
ERNIE 3.0 [106] Jul-2021 10 - - - 375B tokens - 384 V100 - ✓ -
Jurassic-1 [107] Aug-2021 178 - - - 300B tokens - 800 GPU - ✓ -
HyperCLOVA [108] Sep-2021 82 - - - 300B tokens - 1024 A100 13.4 d ✓ -
FLAN [67] Sep-2021 137 LaMDA-PT ✓ - - - 128 TPU v3 60 h ✓ -
Yuan 1.0 [109] Oct-2021 245 - - - 180B tokens - 2128 GPU - ✓ -
Anthropic [110] Dec-2021 52 - - - 400B tokens - - - ✓ -
WebGPT [81] Dec-2021 175 GPT-3 - ✓ - - - - ✓ -
Gopher [64] Dec-2021 280 - - - 300B tokens - 4096 TPU v3 920 h ✓ -
ERNIE 3.0 Titan [111] Dec-2021 260 - - - - - - - ✓ -
GLaM [112] Dec-2021 1200 - - - 280B tokens - 1024 TPU v4 574 h ✓ -
LaMDA [68] Jan-2022 137 - - - 768B tokens - 1024 TPU v3 57.7 d - -
MT-NLG [113] Jan-2022 530 - - - 270B tokens - 4480 80G A100 - ✓ -
AlphaCode [114] Feb-2022 41 - - - 967B tokens Jul-2021 - - - -
InstructGPT [66] Mar-2022 175 GPT-3 ✓ ✓ - - - - ✓ -
Chinchilla [34] Mar-2022 70 - - - 1.4T tokens - - - ✓ -
PaLM [56] Apr-2022 540 - - - 780B tokens - 6144 TPU v4 - ✓ ✓
AlexaTM [115] Aug-2022 20 - - - 1.3T tokens - 128 A100 120 d ✓ ✓
Sparrow [116] Sep-2022 70 - - ✓ - - 64 TPU v3 - ✓ -
WeLM [117] Sep-2022 10 - - - 300B tokens - 128 A100 40G 24 d ✓ -
U-PaLM [118] Oct-2022 540 PaLM - - - - 512 TPU v4 5d ✓ ✓
Flan-PaLM [69] Oct-2022 540 PaLM ✓ - - - 512 TPU v4 37 h ✓ ✓
Flan-U-PaLM [69] Oct-2022 540 U-PaLM ✓ - - - - - ✓ ✓
GPT-4 [46] Mar-2023 - - ✓ ✓ - - - - ✓ ✓
PanGu-Σ [119] Mar-2023 1085 PanGu-α - - 329B tokens - 512 Ascend 910 100 d ✓ -
PaLM2 [120] May-2023 16 - ✓ - 100B tokens - - - ✓ ✓
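As a small illustration of how the table's statistics relate to the compute discussion in Section 2.1, the snippet below combines a few of the reported parameter and token counts with the rough rule C ≈ 6ND; the resulting FLOP counts are back-of-the-envelope estimates for illustration only, not figures reported by the model developers.

```python
# Back-of-the-envelope training compute via C ≈ 6 * N * D (Section 2.1); the entries are
# (model, parameters, pre-training tokens) taken from the table above, and the resulting
# FLOP counts are rough illustrations, not numbers reported by the developers.
models = [
    ("GPT-3",      175e9, 300e9),
    ("LLaMA",       65e9, 1.4e12),
    ("Chinchilla",  70e9, 1.4e12),
    ("PaLM",       540e9, 780e9),
]

for name, n_params, n_tokens in models:
    flops = 6 * n_params * n_tokens
    print(f"{name:<10} ~ {flops:.2e} training FLOPs")
```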

[Figure 3 timeline, 2019–2024: marks the release of LLMs such as T5, GPT-3, GShard, mT5, PanGu-α, Codex, Jurassic-1, FLAN, LaMDA, Gopher, GLaM, InstructGPT, Chinchilla, PaLM, OPT, BLOOM, Galactica, ChatGPT, LLaMA, GPT-4, Falcon, LLaMA2, Mistral, Qwen, Gemma, LLaMA3, Qwen2, DeepSeek-V2 and others; models with publicly available checkpoints are highlighted in yellow.]
Fig. 3: A timeline of existing large language models (having a size larger than 10B) in recent years. The timeline was
established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for a model. If
there was no corresponding paper, we set the date of a model as the earliest time of its public release or announcement.
We mark the LLMs with publicly available model checkpoints in yellow color. Due to the space limit of the figure, we only
include the LLMs with publicly reported evaluation results.

[Figure 4 diagram: GPT-1 (2018.06; decoder-only architecture, generative pre-training) → GPT-2 (2019.02; unsupervised multitask learner, scaling the model size) → GPT-3 (2020.05; in-context learning, exploring scaling limits) → +code → Codex (2021.07; code pre-training) → GPT-3.5 → GPT-4 (2023.03; strong reasoning ability) → GPT-4 Turbo (2023.09; longer context window) and GPT-4 Turbo with vision (2023.09; multimodal ability). A parallel branch leading to ChatGPT: code-davinci-002 (2022.03; capable code model) → +instruction → text-davinci-002 (2022.03; instruction following) → +RLHF → text-davinci-003 (2022.09; human alignment) → +chat → gpt-3.5-turbo (2023.03; excellent comprehensive ability).]
Fig. 4: A brief illustration for the technical evolution of GPT-series models. We plot this figure mainly based on the papers,
blog articles and official APIs from OpenAI. Here, solid lines denote that there exists an explicit evidence (e.g., the official
statement that a new model is developed based on a base model) on the evolution path between two models, while dashed
lines denote a relatively weaker evolution relation.

demonstrates a key capacity leap by scaling of the (nearly same) generative pre-training architecture.

• GPT-3. GPT-3 [55] was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3’s paper, it formally introduced the concept of in-context learning (ICL)17, which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can be also formatted as a text sequence, given the task description and demonstrations. GPT-3 demonstrates excellent performance not only in a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3’s paper does not explicitly discuss the emergent abilities of LLMs, we can observe a large performance leap that might transcend the basic scaling law [30], e.g., larger models have significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3’s paper [55]). Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity.

Capacity Enhancement. Due to the strong capacities, GPT-3 has been the base model to develop even more capable LLMs for OpenAI. Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., train-

17. GPT-2 essentially used ICL for unsupervised task learning, though it wasn’t called ICL at that time.

ing on code data and alignment with human preference, GPT-3.5 models by OpenAI (see the discussion about the
which are detailed as follows. OpenAI API in Section 3.1).
• Training on code data. A major limitation of the original
The Milestones of Language Models. Based on all the ex-
GPT-3 model (pre-trained on plain text) lies in the lack of
ploration efforts, two major milestones have been achieved
the reasoning ability on complex tasks, e.g., completing the
by OpenAI, namely ChatGPT [131] and GPT-4 [46], which
code and solving math problems. To enhance this ability,
have largely raised the capacity bar of existing AI systems.
Codex [105] was introduced by OpenAI in July 2021, which
was a GPT model fine-tuned on a large corpus of GitHub • ChatGPT. In November 2022, OpenAI released the
code. It demonstrated that Codex can solve very difficult conversation model ChatGPT, based on the GPT models
programming problems, and also lead to a significant per- (GPT-3.5 and GPT-4). As the official blog article intro-
formance improvement in solving math problems [126]. duced [131], ChatGPT was trained in a similar way as
Further, a contrastive approach [127] to training text and InstructGPT (called “a sibling model to InstructGPT” in the
code embedding was reported in January 2022, which was original post), while specially optimized for dialogue. They
shown to improve a series of related tasks (i.e., linear- reported a difference between the training of ChatGPT and
probe classification, text search and code search). Actually, InstructGPT in the data collection setup: human-generated
the GPT-3.5 models are developed based on a code-based conversations (playing both the roles of user and AI) are
GPT model (i.e., code-davinci-002), which indicates that combined with the InstructGPT dataset in a dialogue format
training on code data is a very useful practice to improve for training ChatGPT. ChatGPT exhibited superior capaci-
the model capacity of GPT models, especially the reasoning ties in communicating with humans: possessing a vast store
ability. Furthermore, there is also a speculation that train- of knowledge, skill at reasoning on mathematical problems,
ing on code data can greatly increase the chain-of-thought tracing the context accurately in multi-turn dialogues, and
prompting abilities of LLMs [47], while it is still worth aligning well with human values for safe use. Later on, the
further investigation with more thorough verification. plugin mechanism has been supported in ChatGPT, which
further extends the capacities of ChatGPT with existing tools
• Human alignment. The related research of human
or apps. So far, it seems to be the ever most powerful chatbot
alignment can be dated back to the year 2017 (or earlier)
in the AI history. The launch of ChatGPT has a significant
for OpenAI: a blog article entitled “learning from human
impact on the AI research in the future, which sheds light
preferences”18 was posted on the OpenAI blog describing
on the exploration of human-like AI systems.
a work that applied reinforcement learning (RL) to learn
from the preference comparisons annotated by humans [79] • GPT-4. As another remarkable progress, GPT-4 [46]
(similar to the reward training step in the aligning algorithm was released in March 2023, which extended the text input
of InstructGPT in Figure 12). Shortly after the release of this to multimodal signals. Overall, GPT-4 has stronger capac-
RL paper [79], the paper of the Proximal Policy Optimiza- ities in solving complex tasks than GPT-3.5, showing a
tion (PPO) [128] was published in July 2017, which now has large performance improvement on many evaluation tasks.
been the foundational RL algorithm for learning from hu- A recent study [41] investigated the capacities of GPT-
man preferences [66]. Later in January 2020, GPT-2 was fine- 4 by conducting qualitative tests with human-generated
tuned using the aforementioned RL algorithms [79, 128], problems, spanning a diverse range of difficult tasks, and
which leveraged human preferences to improve the capac- showed that GPT-4 can achieve more superior performance
ities of GPT-2 on NLP tasks. In the same year, another than prior GPT models. Furthermore, GPT-4 responds more
work [129] trained a summarization model for optimizing safely to malicious or provocative queries, due to a six-
human preferences in a similar way. Based on these prior month iterative alignment (with an additional safety re-
work, InstructGPT [66] was proposed in January 2022 to ward signal in the RLHF training). In the technical report,
improve the GPT-3 model for human alignment, which OpenAI has emphasized how to safely develop GPT-4 and
formally established a three-stage reinforcement learning from applied a number of intervention strategies to mitigate the
human feedback (RLHF) algorithm. Note that it seems that possible issues of LLMs, such as hallucinations, privacy
the wording of “instruction tuning” has seldom been used in and overreliance. For example, they introduced the mech-
OpenAI’s paper and documentation; instead, it is referred to as anism called red teaming [132] to reduce harmful or toxic
supervised fine-tuning on human demonstrations (i.e., the first content generation. As another important aspect, GPT-4
step of the RLHF algorithm [66]). In addition to improving has been developed on a well-established deep learning
the instruction following capacity, the RLHF algorithm is infrastructure with improved optimization methods. They
particularly useful to mitigate the issues of generating harmful introduced a new mechanism called predictable scaling that
or toxic content for LLMs, which is key to the safe deploy- can accurately predict the final performance with a small
ment of LLMs in practice. OpenAI describes their approach proportion of compute during model training.
to alignment research in a technical article [130], which • GPT-4V, GPT-4 turbo, and beyond. Based on the work
has summarized three promising directions: “training AI done for GPT-4 [46], OpenAI further released GPT-4V in
systems to use human feedback, to assist human evaluation September 2023, which focused on the safe deployment of
and to do alignment research”. the vision capabilities of GPT-4. In the GPT-4V’s system
card [133], it has extensively discussed the assessment and mitigation of risks related to visually augmented inputs. Specially, GPT-4V exhibited strong vision capacities in various application scenarios, showing the great potential as a powerful multimodal learning system.
These enhancement techniques lead to the improved GPT-3 models with stronger capacities, which are called GPT-3.5 models.
18. https://openai.com/research/learning-from-human-preferences
More recently, in
November 2023, OpenAI released an upgraded generation model, and its performance evaluation in downstream tasks.
of GPT-4 model at DevDay, named GPT-4 Turbo, with a For more details of LLMs, see Table 1.
series of technical improvements. GPT-4 Turbo is featured • LLaMA. The LLaMA series of models has gained im-
by the improved model capacity (more capable than GPT- mense popularity and widespread attention due to its open-
4), the extended knowledge source (up to April 2023), ness and effectiveness. From LLaMA [57], LLaMA-2 [99],
long context window (up to 128k tokens), optimized model LLaMA-3 [135] to LLaMA-3.1 [136], continuous updates
performance (cheaper price), and other useful functional- have been made and the development is still ongoing. With
ity updates (function call, reproducible outputs, etc.). At increased parameters (the largest version has 405B), more
the same time, Assistants API was launched to ease the pre-training tokens (15T tokens), and an extended context
rapid development of agent-like assistants. With this API, window (128K), LLaMA-3.1 has significantly enhanced its
developers can easily create goal-oriented assistants within capabilities, and it also integrates additional components
their applications, by leveraging specific instruction, extra that work in synergy with the model, including new se-
knowledge and tool use. Furthermore, multimodal capaci- curity and safety tools. In evaluation, LLaMa-3.1 (405B ver-
ties (see, hear, and speak) were also enhanced in this new sion) achieves competitive performance against prominent
release, supported by GPT-4 Turbo with vision, DALL·E 3, closed-source LLMs, such as GPT-4, GPT-4o, and Claude
Text-to-speech (TTS), and Listen to voice samples. These 3.5 Sonnet in various benchmarks (e.g., MMLU, GSM8k,
improvements have greatly extended the capacity scope and and HumanEval). The pre-training of LLaMA (65B version)
enhanced the task performance of GPT models. More impor- involves 2,048 A100-80G GPUs, whereas LLaMA-3.1 (405B
tantly, the application ecosystem will be greatly strength- version) involves more than 16,000 H100 GPUs.
ened with the technology upgrade in improved models, • Mistral. The Mistral series [137, 138] consist of Mis-
APIs, and functionalities. tral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and
Despite the huge progress, there are still limitations with Mixtral (8×7B and 8×22B), which have been widely known
these superior LLMs, e.g., generating hallucinations with for their strong performance on various mainstream bench-
factual errors or potentially risky response within some marks (e.g., MMLU and GSM8k). Mistral NeMo is featured
specific context [46]. More limitations or issues of LLMs will with a long context window of 128K at the parameter scale
be discussed in Section 7. It poses long-standing research of 12B. Although Mistral NeMo is trained with quantization
challenges to develop more capable, safer LLMs. From awareness, it enables FP8 inference without sacrificing per-
the perspective of engineering, OpenAI has adopted an formance. Mistral Large 2 is the largest and most powerful
iterative deployment strategy [134] to develop the models model of the Mistral series, which supports 11 natural
and products by following a five-stage development and languages and more than 80 programming languages. Mix-
deployment life-cycle, which aims to effectively reduce the tral is a kind of sparse Mixture-of-Experts (SMoE) model
potential risks of using the models. In the following, we that activates only part of the parameters during inference,
will dive into the technical details in order to have a specific making it more efficient compared to dense models of the
understanding of how they have been developed. same size.
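To make the sparse Mixture-of-Experts (SMoE) design mentioned above more concrete, the following is a toy sketch of top-k expert routing in NumPy. It only illustrates the general idea (the layer sizes, gating scheme, and expert MLPs are made-up assumptions), not Mixtral's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts, top_k = 16, 32, 8, 2

# Each "expert" is a small two-layer MLP; only a few are used per token.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]
w_gate = rng.standard_normal((d_model, num_experts)) * 0.02

def smoe_layer(x):
    # x: (d_model,) token representation. The router scores every expert,
    # but only the top-k experts are actually evaluated (sparse activation).
    logits = x @ w_gate
    topk = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                       # softmax over the selected experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, topk):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)  # ReLU MLP expert
    return out

token = rng.standard_normal(d_model)
print(smoe_layer(token).shape)  # (16,)
```

Because only top_k of the num_experts expert MLPs run for each token, the compute per token stays close to that of a much smaller dense model, which is the efficiency argument made above.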
• Gemma. Gemma [139, 140] is a series of lightweight,
strong, and open models, consisting of Gemma-1 (2B and
3 R ESOURCES OF LLM S 7B) and Gemma-2 (2B, 9B, and 27B). During the pre-training
It is by no means an easy job to develop or reproduce LLMs, stage, Gemma-2 2B, 9B, and 27B versions are trained on
considering the challenging technical issues and huge de- 2T, 8T, and 13T primarily English tokens, respectively. The
mands of computation resources. A feasible way is to learn largest version of Gemma-2 is trained on 6144 TPUv5p
experiences from existing LLMs and reuse publicly avail- chips. Gemma-2 has achieved excellent performance in mul-
able resources for incremental development or experimental tiple benchmarks (e.g., ARC-c, MMLU, and GSM8k).
study. In this section, we briefly summarize the publicly • Qwen. Qwen [141, 142] is an open-source large
available resources for developing LLMs, including model model series consisting of Qwen (ranging from 7B to 72B),
checkpoints (or APIs), corpora and libraries. Qwen1.5 (ranging from 0.5B to 110B), Qwen2 (ranging from
0.5B to 72B), and Qwen2.5 (ranging from 0.5B to 72B). Qwen2.5 is the newest LLM collection of Qwen, which is pre-trained on up to 18T tokens. Compared to Qwen2, Qwen2.5 demonstrates a significant increase in knowledge retention, as well as notable advancements in coding and mathematical abilities. Qwen2.5 has also shown large improvements in instruction following, long text generation (over 8K tokens), and structured data understanding and generation (e.g., JSON).

3.1 Publicly Available Model Checkpoints or APIs
Given the huge cost of model pre-training, well-trained model checkpoints are critical to the study and development of LLMs for the research community. Due to space limitation, we can only selectively discuss several representative LLMs. In addition, for inference, we can directly employ public APIs to perform our tasks, without running the model locally. Next, we introduce the publicly available model checkpoints and APIs.

• GLM. GLM [143] is a series of LLMs featuring compre-
hensive capabilities in both English and Chinese. GLM has
Publicly Available Model Checkpoints. To assist re- been upgraded to its fourth-generation model, GLM-4, with
searchers in selecting a suitable model based on the resource a parameter scale of up to 9B, possesses excellent conver-
budget and usage needs, we focus on discussing the model’s sational abilities. It has achieved excellent performance in
parameter size, data and computational resources required evaluations from multiple perspectives including semantics,
for training, the relevant technologies employed by the mathematics, reasoning, code, and knowledge. In addition
[Figure 5 (evolutionary graph of LLaMA variants): the legend distinguishes model inheritance, data inheritance, continual pre-training, instruction tuning (full-parameter and parameter-efficient, e.g., LoRA), and RLHF, with added Chinese, chat, synthetic, task, and Alpaca data. Models shown include Open-Chinese-LLaMA, Linly-Chinese-LLaMA, Chinese LLaMA, Chinese Alpaca, Panda, Ziya, BiLLa, Alpaca, Vicuna, Koala, Baize, Guanaco, BELLE, Yulan-Chat, Goat, Cornucopia, TaoLi, PKU-Beaver, QiZhenGPT, ChatMed, BenTsao, Lawyer LLaMA, LAWGPT, and multimodal models such as OpenFlamingo, LLaVA, MiniGPT-4, VisionLLM, InstructBLIP, Chatbridge, LLaMA Adapter, and PandaGPT, spanning math, finance, medicine, law, bilingualism, and education.]
Fig. 5: An evolutionary graph of the research work conducted on LLaMA. Due to the huge number of variants, we cannot include all of them in this figure, and much other excellent work is omitted. To support incremental updates, we share the source file of this figure and welcome readers to add the desired models by submitting pull requests on our GitHub page.

to the base model GLM-4-9B, it has open-sourced human LLaMA models in non-English languages, it often needs to
preference-aligned model GLM-4-9B-Chat, and long context extend the original vocabulary (trained mainly on English
conversational model GLM-4-9B-Chat-1M. corpus) or fine-tune it with instructions or data in the
• Baichuan. Baichuan is a series of open-source bilingual target language. Among these extended models, Stanford
LLMs and the latest version is Baichuan-2. Both Baichuan Alpaca [146] is the first open instruct-following model
and Baichuan-2 have two available parameter sizes (7B fine-tuned based on LLaMA (7B). It is trained by 52K
and 13B). Baichuan supports both Chinese and English, instruction-following demonstrations generated via self-
with pre-training data reaching 1.2 trillion tokens. Further- instruct [147] using text-davinci-003. The instruction
more, Baichuan-2 expands its pre-training data to 2.6 trillion data, named Alpaca-52K, and training code have been ex-
tokens. Baichuan-2 surpasses Baichuan in all evaluation tensively adopted in subsequent work, such as Alpaca-
benchmarks, demonstrating excellent multilingual capabil- LoRA [148] (a reproduction of Stanford Alpaca using
ities and showing potential for vertical applications in the LoRA [149]), Koala [150], and BELLE [151]. In addition, Vi-
domains such as law and healthcare (e.g., JEC-QA [144] and cuna [152] is another popular LLaMA variant, trained upon
MedQA [145]). user-shared conversations collected from ShareGPT [153].
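The low computational cost of LoRA-style variants such as Alpaca-LoRA comes from the low-rank reparameterization they use. The following NumPy sketch illustrates the idea only (the sizes, rank, and scaling are arbitrary assumptions; it is not the code of the peft library or of any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable, d_in -> rank
B = np.zeros((d_out, rank))                     # trainable, rank -> d_out (zero-init)

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, but W itself is never updated;
    # only the small matrices A and B receive gradients during fine-tuning.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
print(lora_forward(x).shape)                    # (64,)
print(A.size + B.size, "trainable vs.", W.size, "frozen parameters")
```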
Due to the excellent performance and availability of the
LLaMA Model Family. The collection of LLaMA mod- LLaMA model family, many multimodal models incorpo-
els [57] were introduced by Meta AI in February, 2023, rate them as the base language models, to achieve strong
consisting of four sizes (7B, 13B, 30B and 65B). Since language understanding and generation abilities. Compared
released, LLaMA has attracted extensive attention from with other variants, Vicuna is more preferred in multimodal
both research and industry communities. LLaMA mod- language models, which have led to the emergence of a va-
els have achieved very excellent performance on various riety of popular models, including LLaVA [154], MiniGPT-
open benchmarks, which have become the most popu- 4 [155], InstructBLIP [156], and PandaGPT [157]. The re-
lar open language models thus far. A large number of lease of LLaMA has greatly advanced the research progress
researchers have extended LLaMA models by either in- of LLMs. To summarize the research work conducted on
struction tuning or continual pre-training. In particular, LLaMA, we present a brief evolutionary graph in Figure 5.
instruction tuning LLaMA has become a major approach
to developing customized or specialized models, due to Public API of LLMs. Instead of directly using the model
the relatively low computational costs. To effectively adapt copies, APIs provide a more convenient way for common
users to use LLMs, without the need of running the model Web pages. Web pages are a primary data source for train-
locally. As a representative interface for using LLMs, the ing language models.
APIs for the GPT-series models [46, 55, 66, 105] have • CommonCrawl. CommonCrawl [168] is one of the
been widely used for both academia and industry19 . largest open-source web crawling databases, containing a
OpenAI has provided seven major interfaces to the models petabyte-scale data volume, which has been widely used
in GPT-3 series: ada, babbage, curie, davinci (the as training data for existing LLMs. As the whole dataset is
most powerful version in GPT-3 series), text-ada-001, very large, existing studies mainly extract subsets of web
text-babbage-001, and text-curie-001. Among pages from it within a specific period or specific needs
them, the first four interfaces can be further fine- (e.g., extracting mathematical texts). However, due to the
tuned on the host server of OpenAI. In particular, widespread existence of noisy and low-quality information
babbage, curie, and davinci correspond to the in web page data, it is necessary to perform data preprocess-
GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, ing before usage. One commonly used toolkit for cleaning
respectively [55]. In addition, there are also two APIs CommonCrawl is CC-Net [169], which is developed by
related to Codex [105], called code-cushman-001 (a Facebook and has been used in processing datasets like
powerful and multilingual version of the Codex (12B) [105]) RedPajama-Data [170].
and code-davinci-002. Further, GPT-3.5 series • C4. The Colossal Clean Crawled Corpus (C4) includes
include one base model code-davinci-002 and five variants21 , namely en (806G), en.noclean (6T), real-
three enhanced versions, namely text-davinci-002, newslike (36G), webtextlike (17G), and multilingual (38T).
text-davinci-003, and gpt-3.5-turbo. As more The en version has been utilized for pre-training T5 [82],
powerful alternatives, in this year, OpenAI has released LaMDA [68], Gopher [64], and UL2 [89]. The multilingual
the model interfaces for GPT-4 series, including gpt-4, C4, also called mC4, has been used in mT5 [83].
gpt-4-32k, gpt-4-1106-preview (i.e., GPT-4 Turbo) • RedPajama-Data. RedPajama-Data [170] is a publicly
and gpt-4-vision-preview (i.e., GPT-4 Turbo with available comprehensive web dataset, comprising 100 bil-
vision, a multimodal model). It is worth noting that OpenAI lion documents from Common Crawl. It has been cleaned,
has been maintaining and upgrading these model interfaces filtered, and deduplicated using the CCNet tool, resulting in
(gpt-3.5-turbo, gpt-4, gpt-4-32k), so the API name approximately 30T tokens, which is available for download
will actually point to the latest version. Currently, ChatGPT on Hugging Face. RedPajama-Data is a multilingual dataset
can be powered by either GPT-3.5 or GPT-4 models. Overall, that includes five languages: English, French, Spanish, Ger-
one select the suitable model interface based on the specific man, and Italian. Additionally, it offers over 40 quality
application scenarios and response requirements. The labels, making it feasible to filter or reweight the dataset
detailed usage can be found on their project websites20 . according to specific criteria. The dataset is continuously
updated and maintained, with all data processing scripts
open-sourced on GitHub for convenient use.

TABLE 2: Statistics of commonly-used data sources.

Corpora | Size | Source | Latest Update Time
BookCorpus [158] | 5GB | Books | Dec-2015
Gutenberg [159] | - | Books | Dec-2021
C4 [82] | 800GB | CommonCrawl | Apr-2019
CC-Stories-R [160] | 31GB | CommonCrawl | Sep-2019
CC-NEWS [27] | 78GB | CommonCrawl | Feb-2019
REALNEWs [161] | 120GB | CommonCrawl | Apr-2019
OpenWebText [162] | 38GB | Reddit links | Mar-2023
Pushshift.io [163] | 2TB | Reddit links | Mar-2023
Wikipedia [164] | 21GB | Wikipedia | Mar-2023
BigQuery [165] | - | Codes | Mar-2023
the Pile [166] | 800GB | Other | Dec-2020
ROOTS [167] | 1.6TB | Other | Jun-2022

• RefinedWeb. RefinedWeb [171] is a web dataset obtained through rigorous selection and deduplication based on data from Common Crawl, encompassing all Common Crawl web records from 2008 to June 2023, totaling around 5T tokens. The open-source portion consists of 600B tokens, with a data size of approximately 500GB. After decompression, it requires 2.8TB of local storage space and is available for download on Hugging Face. This dataset serves as the primary training dataset for the open-source large language model Falcon.
• WebText. WebText [26] is a well-known corpus com-
posed of highly upvoted links from Reddit, a social media
platform that enables users to submit links and text posts,
but it is not publicly available. As a surrogate, there is a
readily accessible open-source alternative called OpenWebText [162].

3.2 Commonly Used Corpora for Pre-training
In contrast to earlier PLMs, LLMs which consist of a signifi-
cantly larger number of parameters require a higher volume Books & Academic Data. Books and academic data contains
of training data that covers a broad range of content. For a wealth of world knowledge and linguistic information,
this need, there are increasingly more accessible training serving as a high-quality corpus for model learning.
datasets that have been released for research. In this section, • Book Data. BookCorpus [158] is a commonly used
we will briefly summarize several widely used corpora for dataset in previous small-scale models (e.g., GPT [122] and
training LLMs. Based on their content types, we categorize GPT-2 [26]), consisting of over 11,000 books covering a wide
these corpora into five groups: web pages, books, Wikipedia, range of topics and genres (e.g., novels and biographies).
code, and others. Another large-scale book corpus is Project Gutenberg [159],
consisting of over 70,000 literary books including novels,
19. https://platform.openai.com/docs/api-reference/introduction
20. https://platform.openai.com/docs/models/overview 21. https://www.tensorflow.org/datasets/catalog/c4
essays, poetry, drama, history, science, philosophy, and Scholar, GitHub code, books, social media from Reddit,
other types of works in the public domain. It is currently and Wikipedia data. Dolma consisting of 3T tokens of ap-
one of the largest open-source book collections, which is proximately 200TB of raw text and has been used to train
used in training of MT-NLG [113] and LLaMA [57]. As for OLMo [178].
Books1 [55] and Books2 [55] used in GPT-3 [55], they are In practice, it commonly requires a mixture of different
much larger than BookCorpus but have not been publicly data sources for pre-training LLMs (see Figure 6), instead
released so far. of a single corpus. Therefore, existing studies commonly
• Academic Data. In addition to book data, scientific mix several ready-made datasets (e.g., C4, OpenWebText,
publication data such as paper is also important for model and the Pile), and then perform further processing to obtain
pre-training. arXiv Dataset [172] is a corpus of 1.7 mil- the pre-training corpus. Furthermore, to train the LLMs that
lion academic papers, covering a wide range of papers in are adaptive to specific applications, it is also important
the fields of physics, mathematics, and computer science. to extract data from relevant sources (e.g., Wikipedia and
S2ORC [173] is a corpus that consists of 136M academic mix several ready-made datasets (e.g., C4, OpenWebText,
papers collected by Semantic Scholar. It also releases a pre-training data.
derivative dataset peS2o [174], which contains about 42B
tokens.

TABLE 3: A detailed list of available collections for instruction tuning.

Categories | Collections | Time | #Examples
Task | Nat. Inst. [179] | Apr-2021 | 193K
Task | FLAN [67] | Sep-2021 | 4.4M
Task | P3 [180] | Oct-2021 | 12.1M
Task | Super Nat. Inst. [88] | Apr-2022 | 5M
Task | MVPCorpus [181] | Jun-2022 | 41M
Task | xP3 [94] | Nov-2022 | 81M
Task | OIG [182] | Mar-2023 | 43M
Chat | HH-RLHF [183] | Apr-2022 | 160K
Chat | HC3 [184] | Jan-2023 | 87K
Chat | ShareGPT [153] | Mar-2023 | 90K
Chat | Dolly [185] | Apr-2023 | 15K
Chat | OpenAssistant [186] | Apr-2023 | 161K
Synthetic | Self-Instruct [147] | Dec-2022 | 82K
Synthetic | Alpaca [187] | Mar-2023 | 52K
Synthetic | Guanaco [188] | Mar-2023 | 535K
Synthetic | Baize [189] | Apr-2023 | 158K
Synthetic | BELLE [190] | Apr-2023 | 1.5M

Wikipedia. Wikipedia [164] is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository style of writing (with supporting references), covering a wide range of languages and fields. Typically, the English-only filtered versions of Wikipedia are widely used in most LLMs (e.g., GPT-3 [55], LaMDA [68], and LLaMA [57]). Wikipedia is available in multiple languages, so it can be used in multilingual settings.

Code. To collect code data, existing work mainly crawls open-source licensed codes from the Internet. Two major sources are public code repositories under open-source licenses (e.g., GitHub) and code-related question-answering platforms (e.g., StackOverflow). Google has publicly released the BigQuery dataset [165], which includes a substantial number of open-source licensed code snippets in
various programming languages, serving as a representa-
tive code dataset. CodeGen has utilized BIGQUERY [86], a
subset of the BigQuery dataset, for training the multilingual version of CodeGen (CodeGen-Multi). In addition, Hugging Face has collected and released a code dataset named The Stack [175], covering more than 30 programming languages. The Stack is continuously updated, and the v1.2 version has expanded to 358 programming languages. Based on this dataset, BigCode further processed it and released StarCoder [98], which is also the pre-training data of the model StarCoder.

TABLE 4: A list of available collections for alignment.

Dataset | Release Time | #Examples
Summarize from Feedback [129] | Sep-2020 | 193K
SHP [191] | Oct-2021 | 385K
WebGPT Comparisons [81] | Dec-2021 | 19K
Stack Exchange Preferences [192] | Dec-2021 | 10M
HH-RLHF [183] | Apr-2022 | 169K
Sandbox Alignment Data [193] | May-2023 | 169K
CValues [194] | Jul-2023 | 145K
PKU-SafeRLHF [195] | Oct-2023 | 330K
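Most collections in Table 4 share a pairwise format: each instance pairs a preferred (chosen) response with a less preferred (rejected) one. A generic sketch of how such pairs are commonly turned into a reward-model training signal (a schematic pairwise ranking loss, not the exact recipe of any of the listed datasets) is shown below:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the preferred response higher.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

example = {
    "prompt": "How can I stay focused while studying?",
    "chosen": "Break work into short sessions and remove distractions...",
    "rejected": "Just try harder.",
}
# In practice the two scores come from a reward model scoring (prompt, response) pairs.
print(round(pairwise_preference_loss(1.3, -0.2), 4))
```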
Mixed Data. In addition to the aforementioned specific
types of datasets, different types of data have been com-
bined to facilitate usage by researchers. The Pile [166] is a 3.3 Commonly Used Datasets for Fine-tuning
large-scale, diverse, and open-source text dataset consisting
of over 800GB of data from multiple sources, including After pre-training, it requires further fine-tuning LLMs to
books, websites, codes, scientific papers, and social media enhance the model capacity, which often involve two major
platforms. It is constructed from 22 diverse high-quality steps, namely instruction tuning (supervised fine-tuning)
subsets. The Pile dataset is widely used in models with and alignment tuning. In this section, we mainly focus on
different parameter scales, such as GPT-J (6B) [176], Code- discussing the related available datasets for the two kinds of
Gen (16B) [86], and Megatron-Turing NLG (530B) [113]. tuning approaches, and more algorithm details can be found
ROOTS [167] is composed of various smaller datasets (to- in Section 5.
tally 1.61 TB of text) and covers 59 different languages (con-
taining natural languages and programming languages), 3.3.1 Instruction Tuning Datasets
which have been used for training BLOOM [78]. Another After pre-training, instruction tuning (a.k.a., supervised fine-
mixture dataset is Dolma [177], which includes web text tuning) is an important method to enhance or unlock spe-
from Common Crawl, academic papers from Semantic cific abilities of LLMs (e.g., instruction following). In this
part, we introduce several widely used datasets for in- • Self-Instruct-52K [147] is an instruction dataset gener-
struction tuning, and categorize them into three main types ated through the self-instruct [147] method, consisting of
based on the construction method of formatted instruction 82,000 instances with 52,000 instructions. Concretely, the
instances, namely NLP task datasets, daily chat datasets and authors construct 175 seed instances, and then iteratively
synthetic datasets. We show their details in Table 3. prompt the LLM [55] to synthesize additional instructions
based on randomly selected 8 instructions as reference.
NLP Task Datasets. This kind of datasets are formatted Subsequently, the LLM is further instructed to generate in-
based on collected NLP task datasets (e.g., text classifica- stance inputs and their corresponding outputs based on the
tion and summarization) with corresponding natural lan- synthetic instructions, and finally obtain the Self-Instruct-
guage task descriptions. In this category, P3 [196] and 52K dataset.
FLAN [67, 197] are two widely used datasets for instruction • Alpaca [146] is also a synthetic dataset based on the self-
tuning. instruct [147] method. It utilizes the text-davinci-003
• P3 [196] is composed of 170 English NLP datasets and model on the 175 seed datasets from Self-Instruct-52K to
2,052 English prompt templates, where the input and output obtain 52,000 new instructions and corresponding inputs
of each data example have been formatted with specific and outputs. Moreover, 60% of the examples are pure in-
prompt templates for composing the training instance. structions without the input part in the final dataset.
• FLAN [67] consists of 62 widely used NLP benchmarks • Baize [189] is an English multi-turn conversation corpus
in its original version. Recently, FLAN-v2 [197] has also been pro- create Baize, a method called “self-chat” [189] is proposed,
posed, which expands FLAN by mixing additional instruc- create Baize, a method called “self-chat” [189] is purposed,
tion datasets, including Muffin [67], NIV2 [88], T0-SF [28], where ChatGPT takes on the roles of both the user and the
and CoT [198–200]. Muffin contains 62 tasks from the orig- AI assistant in turns, generating information in a conversa-
inal FLAN and additional 26 tasks, including conversation tional format.
and code synthesis tasks. T0-SF is extracted from T0 [28]
while ensuring no overlap with Muffin. NIV2 refers to the 3.3.2 Alignment Datasets
Natural-Instructions v2 dataset [88], and CoT [198–200] is
a combination of nine reasoning tasks with corresponding Apart from instruction tuning, it is important to construct
chain-of-thought prompts and outputs. high-quality datasets for aligning LLMs with human values
and preferences (e.g., helpfulness, honesty, and harmless-
Daily Chat Datasets. This kind of datasets are constructed ness). In this section, we introduce several widely used
based on real user conversations where queries are posed datasets for alignment tuning, including HH-RLHF [183],
by humans and responses are mainly generated by hu- SHP [191], PKU-SafeRLHF [195], Stack Exchange Prefer-
man labelers or LLMs (e.g., ChatGPT, GPT-4). The con- ences [192] and Sandbox Alignment Data [193]. We show
versation types include open-ended generation, question their details in Table 4.
answering, brainstorming, and chatting. In this category, • HH-RLHF [183] consists of around 169K instances, and
ShareGPT [153], OpenAssistant [186] and Dolly [185] are can be divided into two parts that focus on the helpfulness
three commonly used datasets for LLM fine-tuning. and harmlessness of LLMs, respectively. Each instance is
• ShareGPT [153] is collected from a data collection an open-ended conversation between a crowdworker and
platform where users can upload their conversations with a chat model, about seeking assistance, advice, or task
ChatGPT or GPT-4 through the ShareGPT API. Currently, completion. The chat model provides two responses to each
this dataset consists of approximately 90,000 conversations, user query, and the more helpful or harmful responses will
including real instructions or inquiries from human and be chosen as the annotations.
responses from ChatGPT. • SHP [191] focuses on the helpfulness of responses.
• OpenAssistant [186] is a multilingual corpus containing It comprises 385K collective human preferences over re-
66,497 real-world conversation trees between human and AI sponses to questions/instructions across 18 diverse subject
assistant. Each conversation tree consists of multiple nodes, areas, spanning topics from cooking to legal advice. Each
and each node represents the information generated by a instance is a Reddit post containing a question or instruction
role in the dialogue. It spans 35 languages and includes and a pair of top-level comments, one of which is deemed
461,292 manually annotated quality ratings of responses. as more preferable by Reddit users and the other one is
• Dolly [185] is an English dataset comprising 15,000 deemed as less helpful. Different from HH-RLHF [183], the
human-generated data instances (prompt-response pairs) data in SHP consists of naturally occurring and human-
from Databricks. This dataset covers seven domains out- written responses.
lined in the InstructGPT [66], including brainstorming, clas- • PKU-SafeRLHF [195] encompasses more than 330K
sification, closed-book quality assurance, generation, infor- instances of expert comparison data, concentrating on the
mation extraction, open-book quality assurance, and sum- helpfulness and harmlessness. Each instance in the dataset
marization. includes a question and two responses, accompanied by
safety labels for each response and two preference anno-
Synthetic Datasets. This kind of datasets are typically tations between the two responses according to helpfulness
constructed by instructing LLMs, based on pre-defined and harmlessness. The harmlessness of a response indicates
guidance rules or methods. In this category, Self-Instruct- its classification as risk-neutral across all 14 harm categories,
52K [147], Alpaca [146] and Baize [189] are three commonly while the helpfulness of a response is evaluated based on its
used synthetic datasets for LLMs. effectiveness in addressing the question.
• Stack Exchange Preferences [192] focuses on the help- incorporated several common LLMs (e.g., Flan-T5 [69] and
fulness of answers. It comprises about 10M questions and GLM [93]) into its ModelCenter, where developers can use
answers from Stack Overflow. Each instance consists of a these models directly.
question and more than two corresponding answers. Each • FastMoE [207] is a specialized training library for MoE
answer is annotated with a score calculated based on its (i.e., mixture-of-experts) models. It is developed based on
votes and a label denoting whether it is selected. PyTorch, prioritizing both efficiency and user-friendliness
• Sandbox Alignment Data [193] is an alignment dataset in its design. FastMoE simplifies the process of transferring
containing feedback from LLMs rather than human. It Transformer models to MoE models and supports both data
comes from a virtual interaction environment called SAND- parallelism and model parallelism during training.
BOX, where the model simulates social interactions with • vLLM [208] is a fast, memory efficient, and easy-
other models and revise responses according to the feedback to-use library for LLM inference and serving. To enable
from other models. The dataset contains 169K instances, and fast inference, it is specially optimized with high serving
each instance consists of a societal query, several responses, throughput, effective attention memory management using
and corresponding ratings from other models. PagedAttention [208], continuous batching, and optimized
CUDA kernels. Furthermore, vLLM also supports various decoding algorithms, tensor parallelism and streaming outputs. To ease the integration with other systems, vLLM is

3.4 Library Resource
In this part, we briefly introduce a series of available li- friendly to the use of HuggingFace models, and also provide
braries for developing LLMs. OpenAI-compatible API servers.
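As a minimal sketch of the offline inference workflow described for vLLM (assuming the vllm package is installed; the model name is only an example and may require access permissions):

```python
from vllm import LLM, SamplingParams

# Load a HuggingFace-format causal LM; vLLM handles batching and paged KV caching.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Summarize what PagedAttention does in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```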
• Transformers [201] is an open-source Python library • DeepSpeed-MII [209] is also a memory efficient
for building models using the Transformer architecture, Python library developed by DeepSpeed [74]. It aims to
which is developed and maintained by Hugging Face. It democratize LLMs inference by prioritizing high through-
has a simple and user-friendly API, making it easy to use put, low latency, and cost-effectiveness. DeepSpeed-MII
and customize various pre-trained models. It is a powerful achieves accelerated text generation inference by leveraging
library with a large and active community of users and four essential technologies: blocked KV caching, continuous
developers who regularly update and improve the models batching, dynamic SplitFuse, and high-performance CUDA
and algorithms. Kernels. It currently supports over 13,000 models across
• DeepSpeed [74] is a deep learning optimization library three popular model architectures, such as LLaMA [57],
(compatible with PyTorch) developed by Microsoft, which Mistral [137], and OPT [90].
has been used to train a number of LLMs, such as MT- • DeepSpeed-Chat [210] is a fast, cost-effective, and
NLG [113] and BLOOM [78]. It provides the support of easy-to-use system framework that enables the integration
various optimization techniques for distributed training, of the complete RLHF process during model training. It
such as memory optimization (ZeRO technique, gradient is featured by three major functionalities: (1) it simplifies
checkpointing), and pipeline parallelism. the training and inference process for ChatGPT-like models,
• Megatron-LM [75–77] is a deep learning library devel- enabling using a simple script to implement multiple train-
oped by NVIDIA for training large-scale language models. ing or inference steps; (2) it replicates the training mode
It also provides rich optimization techniques for distributed of InstructGPT [66] and provides a complete pipeline for
training, including model and data parallelism, mixed- three training steps (i.e., SFT, reward model fine-tuning, and
precision training, and FlashAttention. These optimization RLHF); (3) it integrates the training engine and inference en-
techniques can largely improve the training efficiency and gine of Deepspeed into a unified hybrid engine (Deepspeed
speed, enabling efficient distributed training across GPUs. HE) for RLHF training, which enables seamless switch be-
• JAX [202] is a Python library for high-performance tween training and inference modes, and leveraging various
machine learning algorithms developed by Google, allow- optimizations from DeepSpeed Inference.
ing users to easily perform computations on arrays with In addition to the above library resources, existing deep
hardware acceleration (e.g., GPU or TPU). It enables efficient learning frameworks (e.g., PyTorch [211], TensorFlow [212],
computation on various devices and also supports several MXNet [213], PaddlePaddle [214], MindSpore [215] and
featured functions, such as automatic differentiation and OneFlow [216]) have also provided the support for parallel
just-in-time compilation. algorithms, which are commonly used for training large-
• Colossal-AI [203] is a deep learning library developed scale models.
by HPC-AI Tech for training large-scale AI models. It is
implemented based on PyTorch and supports a rich collec-
tion of parallel training strategies. Furthermore, it can also 4 P RE - TRAINING
optimize heterogeneous memory management with meth- Pre-training establishes the basis of the abilities of LLMs. By
ods proposed by PatrickStar [204]. Recently, a ChatGPT-like pre-training on large-scale corpora, LLMs can acquire essen-
model called ColossalChat [205] has been publicly released tial language understanding and generation skills [55, 56]. In
with two versions (7B and 13B), which are developed using this process, the scale and quality of the pre-training corpus
Colossal-AI based on LLaMA [57]. are critical for LLMs to attain powerful capabilities. Fur-
• BMTrain [206] is an efficient library developed by thermore, to effectively pre-train LLMs, model architectures,
OpenBMB for training models with large-scale parameters acceleration methods, and optimization techniques need to
in a distributed manner, which emphasizes code simplicity, be well designed. In what follows, we first discuss the data
low resource, and high availability. BMTrain has already collection and processing in Section 4.1, then introduce the
[Figure 6 (pie charts): per-model ratios of pre-training data sources for T5 (11B), Falcon (40B), LLaMA (65B), GPT-3 (175B), MT-NLG (530B), Gopher (280B), Chinchilla (70B), Yi (34B), PaLM (540B), LaMDA (137B), Galactica (120B), GPT-NeoX (20B), CodeGen (16B), and StarCoder 2 (15B). The source categories are webpages (e.g., C4, OpenWebText, Wikipedia), conversation data (e.g., the Pile - StackExchange), books and news (e.g., BookCorpus, Gutenberg, CC-Stories-R, CC-NEWS, REALNEWs), scientific data (e.g., the Pile - ArXiv, the Pile - PubMed Abstracts), and code (e.g., BigQuery, the Pile - GitHub).]
Fig. 6: Ratios of various data sources in the pre-training data for existing LLMs.

commonly used model architectures in Section 4.2, and fi- such as webpages, books, and conversational text, which
nally present the training techniques to stably and efficiently provides rich text sources on a variety of topics. Next, we
optimize LLMs in Section 4.3. briefly summarize three important kinds of general data.
• Webpages. Owing to the proliferation of the Internet,
4.1 Data Collection and Preparation various types of data have been created, which enables
Compared with small-scale language models, LLMs have LLMs to gain diverse linguistic knowledge and enhance
a stronger demand for high-quality data for model pre- their generalization capabilities [26, 82]. For convenient
training, and their model capacities largely rely on the pre- use of these data resources, a large amount of data is
training corpus and how it has been preprocessed. In this crawled from the web in previous work, such as Com-
part, we discuss the collection and processing of pre-training monCrawl [168]. However, the crawled web data tends to
data, including data sources, preprocessing methods, and contain both high-quality text, such as Wikipedia and low-
important analysis of how pre-training data affects the quality text, like spam mail, thus it is important to filter and
performance of LLMs. process webpages for improving the data quality.
• Conversation text. Conversation data can enhance the
4.1.1 Data Source conversational competence of LLMs [90] and potentially im-
To develop a capable LLM, it is key to collect a large amount prove their performance on a range of question-answering
of natural language corpus from various data sources. Ex- tasks [56]. Researchers can utilize subsets of public conver-
isting LLMs mainly leverage a mixture of diverse public sation corpus (e.g., PushShift.io Reddit corpus) [163, 217] or
textual datasets as the pre-training corpus. Figure 6 shows collect conversation data from online social media. Since on-
the distribution of the sources of pre-training data for a line conversational data often involves discussions among
number of representative LLMs. multiple participants, an effective processing way is to
The source of pre-training corpus can be broadly cate- transform a conversation into a tree structure, where the
gorized into two types: general data and specialized data. utterance is linked to the one it responds to. In this way, the
General data, such as webpages, books, and conversational multi-party conversation tree can be divided into multiple
text, is utilized by most LLMs [55, 56, 90] due to its large, sub-conversations, which can be collected in the pre-training
diverse, and accessible nature, which can enhance the lan- corpus. Furthermore, a potential risk is that the excessive
guage modeling and generalization abilities of LLMs. In integration of dialogue data into LLMs may result in a side
light of the impressive generalization capabilities exhibited effect [90]: declarative instructions and direct interrogatives
by LLMs, there are also studies that extend their pre-training are erroneously perceived as the beginning of conversations,
corpus to more specialized datasets, such as multilingual thus leading to a decline in the efficacy of the instructions.
data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 86]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs. For a detailed introduction to the commonly used corpora, one can refer to Section 3.2.
• Books. Compared to other corpora, books provide an important source of formal long texts, which are potentially beneficial for LLMs to learn linguistic knowledge, model long-term dependency, and generate narrative and coherent texts. To obtain open-source book data, existing studies
General Text Data. As we can see in Figure 6, the vast usually adopt the Books3 and Bookcorpus2 datasets, which
majority of LLMs adopt general-purpose pre-training data, are available in the Pile dataset [166].
Specialized Text Data. Specialized datasets are useful to affect the capacity and performance of LLMs. To facilitate
improve the specific capabilities of LLMs on downstream the data processing, a recent study [228] proposes a useful
tasks. Next, we introduce three kinds of specialized data. data processing system for LLMs, named Data-Juicer, which
• Multilingual text. In addition to the text in the target provides over 50 processing operators and tools. In this
language, integrating a multilingual corpus can enhance part, we review the detailed data preprocessing strategies
the multilingual abilities of language understanding and to improve the quality of the collected data [64, 78, 112]. A
generation. For example, BLOOM [78] and PaLM [56] have typical pipeline of preprocessing the pre-training data for
curated multilingual data covering 46 and 122 languages, LLMs has been illustrated in Figure 7.
respectively, within their pre-training corpora. FLM [102]
mixes Chinese and English corpora in nearly equal propor- Filtering and Selection. To remove low-quality data from
tions. These models demonstrate impressive performance in the collected corpus, existing work generally adopts two ap-
multilingual tasks, such as translation, multilingual summa- proaches, namely classifier-based and heuristic-based. The
rization, and multilingual question answering, and achieve former approach trains a selection classifier based on high-
comparable or superior performance to the state-of-the- quality texts and leverages it to identify and filter out low-
art models that are fine-tuned on the corpus in the target quality data. Typically, these methods train a binary classi-
language(s). fier using positive instances that are: well-curated data (e.g.,
• Scientific text. The exploration of science by humans has Wikipedia pages) [55, 56, 112], high-quality synthesized
been witnessed by the increasing growth of scientific publi- data [135, 229–231], or a combination of both. They sample
cations. In order to enhance the understanding of scientific candidate data as negative instances and predict the score
knowledge for LLMs [35, 218], it is useful to incorporate a that measures the quality of each data example. However,
scientific corpus for model pre-training [35, 218]. By pre- several studies [64, 112] find that a classifier-based approach
training on a vast amount of scientific text, LLMs can may result in the unintentional removal of high-quality texts
achieve impressive performance in scientific and reasoning in dialectal, colloquial, and sociolectal languages, which
tasks [219]. To construct the scientific corpus, existing efforts potentially leads to bias in the pre-training corpus and
mainly collect arXiv papers, scientific textbooks, math web- diminishes the corpus diversity. As the second approach,
pages, and other related scientific resources. Due to the com- several studies, such as BLOOM [78] and Gopher [64],
plex nature of data in scientific fields, such as mathematical employ heuristic-based approaches to eliminate low-quality
symbols and protein sequences, specific tokenization and texts through a set of well-designed rules, which can be
preprocessing techniques are usually required to transform summarized as follows:
these different formats of data into a unified form that can • Language based filtering. If a LLM would be mainly used
be processed by language models. in the tasks of certain languages, the text in other lan-
• Code. Program synthesis has been widely studied in guages can be filtered.
the research community [105, 220–223], especially the use of • Metric based filtering. Evaluation metrics about the gener-
PLMs trained on code [176, 224]. However, it remains chal- ated texts, e.g., perplexity, can be employed to detect and
lenging for these PLMs (e.g., GPT-J [176]) to generate high- remove unnatural sentences.
quality and accurate programs. Recent studies [105, 223]
have found that training LLMs on a vast code corpus • Statistic based filtering. Statistical features of a corpus,
can lead to a substantial improvement in the quality of e.g., the punctuation distribution, symbol-to-word ratio,
the synthesized programs. The generated programs can and sentence length, can be utilized to measure the text
successfully pass expert-designed unit-test cases [105] or quality and filter the low-quality data.
solve competitive programming questions [114]. In gen- • Keyword based filtering. Based on specific keyword set, the
eral, two types of code corpora are commonly used for noisy or unuseful elements in the text, such as HTML
pre-training LLMs. The first source is from programming tags, hyperlinks, boilerplates, and offensive words, can
question answering communities like Stack Exchange [225]. be identified and removed.
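A toy sketch of how a few of the heuristic rules above could be combined into a single document filter is given below; the thresholds and keyword list are illustrative assumptions rather than the settings used by Gopher or BLOOM:

```python
import re

# Hypothetical thresholds; real pipelines tune these per corpus.
MAX_SYMBOL_TO_WORD_RATIO = 0.1
MIN_WORDS, MAX_WORDS = 50, 100_000
BLOCKLIST = {"lorem ipsum", "click here", "terms of service"}

def keep_document(text: str) -> bool:
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):               # statistic-based: length
        return False
    symbols = sum(text.count(c) for c in "#{}<>|")
    if symbols / max(len(words), 1) > MAX_SYMBOL_TO_WORD_RATIO:  # symbol-to-word ratio
        return False
    lowered = text.lower()
    if any(k in lowered for k in BLOCKLIST):                     # keyword-based filtering
        return False
    if re.search(r"<\s*(script|style)\b", lowered):              # residual HTML boilerplate
        return False
    return True

docs = ["example web page text " * 20, "click here to WIN!!! #### {}{}"]
cleaned = [d for d in docs if keep_document(d)]
print(len(cleaned))  # 1
```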
The second source is from public software repositories
In addition to the above methods, LLMs (especially rela-
such as GitHub [86, 105, 223], where code data (includ-
tively small models) can be also employed for data selection,
ing comments and docstrings) are collected for utilization.
either by computing perplexity [232] or directly prompting
Compared to natural language text, code is in the format
LLMs [233] for measuring the sample importance. However,
of a programming language, corresponding to long-range
using LLMs is unavoidably computationally intensive for
dependencies and accurate execution logic [226]. A recent
large-scale data selection.
study [47] also speculates that training on code might be a
source of complex reasoning abilities (e.g., chain-of-thought De-duplication. Existing work [234] has found that dupli-
ability [33]). Furthermore, it has been shown that formatting cate data in a corpus would reduce the diversity of language
reasoning tasks into code can help LLMs generate more models, which may cause the training process to become un-
accurate results [226]. stable and thus affect the model performance. Therefore, it is
necessary to de-duplicate the pre-training corpus. Specially,
4.1.2 Data Preprocessing de-duplication can be performed at different granularities,
After collecting a large amount of text data, it is essential including sentence-level, document-level, and dataset-level
to preprocess the data for constructing the pre-training de-duplication. First, low-quality sentences that contain re-
corpus, especially removing noisy, redundant, irrelevant, peated words and phrases should be removed, as they may
and potentially toxic data [56, 64, 227], which may largely introduce repetitive patterns in language modeling [235].
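De-duplication at the document level, discussed in this part, is often approximated by the surface overlap between two documents; the sketch below shows a simple n-gram Jaccard check for illustration only (production pipelines typically rely on MinHash/LSH to make this scale):

```python
def ngrams(text: str, n: int = 5) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(doc_a: str, doc_b: str, n: int = 5) -> float:
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)   # Jaccard similarity over n-grams

# Documents whose overlap exceeds a chosen threshold are treated as near-duplicates.
print(overlap_ratio("the quick brown fox jumps over the lazy dog today",
                    "the quick brown fox jumps over the lazy dog yesterday", n=5) > 0.5)
```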
[Figure 7 (pipeline): Raw Corpus → Filtering & Selection (language, metric, statistic, and keyword filtering) → De-duplication (sentence-level, document-level, set-level) → Privacy Reduction (detect and remove personally identifiable information) → Tokenization (reuse an existing tokenizer, SentencePiece, or byte-level BPE) → ready to pre-train.]

Fig. 7: An illustration of a typical data preprocessing pipeline for pre-training large language models.

At the document level, existing studies mostly rely on the It starts with a set of basic symbols (e.g., the alphabets
overlap ratio of surface features (e.g., words and n-grams and boundary characters), and iteratively combine frequent
overlap) between documents to detect and remove duplicate pairs of two consecutive tokens in the corpus as new to-
documents containing similar contents [57, 64, 78, 236]. kens (called merge). For each merge, the selection criterion
Furthermore, to avoid the dataset contamination problem, is based on the co-occurrence frequency of two contigu-
it is also crucial to prevent the overlap between the training ous tokens: the top frequent pair would be selected. The
and evaluation sets [56], by removing the possible duplicate merge process continues until it reaches the predefined
texts from the training set. It has been shown that the three size. Further, Byte-level BPE has been used to improve the
levels of de-duplication are useful to improve the training tokenization quality for multilingual corpus (e.g., the text
of LLMs [56, 237], which should be jointly used in practice. containing non-ASCII characters) by considering bytes as the
basic symbols for merge. Representative language models
Privacy Reduction. Thus, it is necessary to remove the
with this tokenization approach include GPT-2, BART, and
personally identifiable information (PII) from the pre-training
LLaMA.
corpus. One direct and effective approach is to employ
rule-based methods, such as keyword spotting, to detect • WordPiece tokenization. WordPiece was a Google inter-
and remove PII such as names, addresses, and phone num- nal subword tokenization algorithm. It was originally pro-
bers [167]. Furthermore, researchers also find that the vul- posed by Google in developing voice search systems [242].
nerability of LLMs under privacy attacks can be attributed Then, it was used in the neural machine translation system
to the presence of duplicate PII data in the pre-training cor- in 2016 [243], and was adopted as the word tokenizer for
pus [238]. Therefore, de-duplication can also reduce privacy BERT in 2018 [23]. WordPiece has a very similar idea with
risks to some extent. BPE by iteratively merging consecutive tokens, whereas
taking a slightly different selection criterion for the merge.
Tokenization. Tokenization is also a crucial step for data To conduct the merge, it first trains a language model and
preprocessing. It aims to segment raw text into sequences employs it to score all possible pairs. Then, at each merge, it
of individual tokens, which are subsequently used as the selects the pair that leads to the most increase in the likeli-
inputs of LLMs. In traditional NLP research (e.g., sequence hood of training data. Since Google has’t released the official
labeling with conditional random fields [239]), word-based implementation of the WordPiece algorithm, HuggingFace
tokenization is the predominant approach, which is more gives a more intuitive selection measure in its online NLP
aligned with human’s language cognition. However, word- course: a pair is scored by dividing the co-occurrence count
based tokenization can yield different segmentation results by the product of the occurrence counts of two tokens in the
for the same input in some languages (e.g., Chinese word pair based on training corpus.
segmentation), generate a huge word vocabulary containing
many low-frequency words, and also suffer from the “out- • Unigram tokenization. Unlike BPE and WordPiece, Un-
of-vocabulary” issue. Thus, several neural network models igram tokenization [244] starts with a sufficiently large
employ character as the minimum unit to derive the word set of possible substrings or subtokens for a corpus, and
representation (e.g., a CNN word encoder in ELMo [21]). iteratively removes the tokens in the current vocabulary
Recently, subword tokenizers have been widely used in Trans- until the expected vocabulary size is reached. As the se-
former based language models, typically including Byte- lection criterion, it calculates the yielded increase in the
Pair Encoding tokenization, WordPiece tokenization and likelihood of training corpus by assuming that some to-
Unigram tokenization. HuggingFace has maintained an ken was removed from current vocabulary. This step is
excellent online NLP course on tokenizer22 with running conducted based on a trained unigram language model.
examples, and we refer to the beginners to this course. Next, To estimate the unigram language model, it adopts an
we briefly describe the three representative tokenization expectation–maximization (EM) algorithm: at each iteration,
methods. we first find the currently optimal tokenization of words
• Byte-Pair Encoding (BPE) tokenization. BPE was origi- based on the old language model, and then re-estimate the
nally proposed as a general data compression algorithm in probabilities of unigrams to update the language model.
1994 [240], and then adapted to NLP for tokenization [241]. During this procedure, dynamic programming algorithms
(i.e., the Viterbi algorithm) are used to efficiently find the
22. https://huggingface.co/learn/nlp-course/chapter6 optimal decomposition way of a word given the language
[Figure 8: several data sources are combined with different mixture proportions and scheduled over successive training stages, illustrating data mixture and data curriculum.]

4.1.3 Data Scheduling
After data preprocessing, it is essential to design suitable strategies to schedule these multi-source data for pre-training a capable LLM. Generally, close attention should be paid to two key aspects of data scheduling: the proportion of each data source (data mixture), and the order in which each data source is scheduled for training (data curriculum). Next, we discuss the two aspects in detail. An illustration of data scheduling has been presented in Figure 8.
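Before going into details, the data mixture aspect can be illustrated with a toy weighted-sampling sketch (the source names and proportions below are made-up assumptions, not the mixture of any particular model):

```python
import random

# Hypothetical per-source mixture weights; a larger weight means more samples drawn.
mixture = {"webpages": 0.80, "code": 0.065, "books": 0.045,
           "wikipedia": 0.045, "arxiv": 0.045}
sources = {name: [f"{name}-doc-{i}" for i in range(1000)] for name in mixture}

def sample_batch(batch_size: int = 8):
    names = list(mixture)
    weights = [mixture[n] for n in names]
    batch = []
    for _ in range(batch_size):
        src = random.choices(names, weights=weights, k=1)[0]
        batch.append(random.choice(sources[src]))
    return batch

print(sample_batch())
```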

Fig. 8: An illustration of data scheduling for pre-training Data Mixture. Since each kind of data source is closely
LLMs. related to the development of certain capacities for LLMs
(referring to the discussions in Section 4.1), it is important
to set a suitable distribution to mix these data. The data
mixture is generally set in a global level (i.e., the distribution
model. Representative models that adopt this tokenization of the entire pre-training data), and can be also locally set
approach include T5 and mBART. to varied proportions at different training stages. During
pre-training, data samples from different sources would be
Although it is expedient to leverage an existing tokenizer selected according to the mixture proportions: more data
(e.g., OPT [90] and GPT-3 [55] utilize the tokenizer of GPT- will be sampled from a data source with a larger weight.
2 [26]), using a tokenizer specially designed for the pre- Typically, existing LLMs such as LLaMA [57] may employ
training corpus can be highly beneficial [78], especially for upsampling or downsampling on the full data of each
the corpus that consists of diverse domains, languages, and source to create specific data mixtures as pre-training data.
formats. Therefore, recent LLMs often train the customized As Figure 6 illustrates, existing LLMs use different data mix-
tokenizers specially for the pre-training corpus with the tures to construct the pre-training data. As a representative
SentencePiece library [245], which includes Byte-level BPE model, the pre-training data of LLaMA [57] mainly consists
and Unigram tokenization. A note is that normalization of webpages (over 80%), alongside 6.5% of code-heavy data
techniques in BPE, such as NFKC [246], may degrade the from GitHub and StackExchange, 4.5% from books, and
tokenization performance [34, 64, 78]. When extending 2.5% of scientific data sourced from arXiv, which has become
existing LLMs (i.e., continual pre-training or instruction an important reference for training general-purpose LLMs.
tuning), we should be also aware of the potential side effect Furthermore, special data mixtures can be used to facilitate
with customized tokenizers. For example, LLaMA trains different purposes. For example, Falcon [171] is trained on
the BPE tokenizer based on a pre-training corpus mainly pure webpages, and CodeGen [86] largely increases the
consisting of English texts, and the derived vocabulary amount of code data. In practice, data mixture is often de-
might be less capable in processing non-English data, e.g., termined empirically, and we summarize several common
taking longer inference latency to generate Chinese texts. strategies for finding an effective data mixture as follows:
• Increasing the diversity of data sources. Recent studies
Discussion on Effect of Data Quality. For pre-training, the have empirically shown that training on excessive data
quality of pre-training data is vital to the model capacities about a certain domain would degrade the generalization
of LLMs. Existing work has shown that pre-training on the capability of LLMs on other domains [35, 64]. In contrast,
low-quality corpus, such as noisy, toxic, and duplicate data, increasing the data source heterogeneity (e.g., including
would largely hurt the performance of models [64, 234, diverse data sources) is critical for improving the down-
236, 238]. Recent studies, such as T5 [82], GLaM [112], and stream performance of LLMs [227, 248, 249]. To further
Gopher [64], have investigated the influence of data quality examine the effect of different data sources, some studies
on the LLMs’ capacities. By comparing the performance of have conducted ablation experiments by removing each
models trained on the filtered and unfiltered corpus, they data source one by one, and pre-train LLMs with specially
have reached the similar conclusion that pre-training LLMs curated datasets [227]. It has been shown that dropping data
on cleaned data can improve the model performance. More sources with high heterogeneity (e.g., webpages) impacts
specifically, the duplication of data may result in “double LLM’s abilities more severely than dropping sources with
descent” (referring to the phenomenon of performance ini- low heterogeneity (e.g., academic corpus).
tially deteriorating and subsequently improving) [234, 247], • Optimizing data mixtures. In addition to manually set-
or even overwhelm the training process [234]. In addition, ting the data mixtures, several studies have proposed to
it has been shown that duplicate data degrades the ability optimize the data mixtures for improving the model pre-
of LLMs to copy from the context, which might further training [59, 250]. Given the target downstream tasks, one
affect the generalization capacity of LLMs using in-context can select pre-training data with either higher proximity
learning [234]. Therefore, as suggested in [56, 64, 78, 227], in the feature space [250] or those that provide positive
it is essential to utilize preprocessing methods like quality influences on downstream task performance [251]. Further,
filtering, toxic filtering and deduplication to carefully clean to reduce the reliance of target tasks, DoReMi [59] first trains
the pre-training corpus (as illustrated in Section 4.1.2), to a small reference model using given initial domain weights,
improve stability of the training process and avoid affecting and then trains another small proxy model, upweighting the
the model performance. domains on which the greatest discrepancies in likelihood
21
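As a rough illustration of the cleaning steps suggested above, the snippet below sketches document-level quality filtering and exact deduplication; the thresholds, heuristics, and normalization choices are illustrative assumptions, not the exact rules used by any of the cited systems.

```python
import hashlib
import re

def quality_ok(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Cheap heuristic quality filter (thresholds are assumptions)."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbol_ratio = sum(not c.isalnum() and not c.isspace() for c in doc) / max(len(doc), 1)
    return symbol_ratio <= max_symbol_ratio

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return re.sub(r"\s+", " ", doc.lower()).strip()

def clean_corpus(docs):
    """Keep documents that pass the quality filter and are not exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        if not quality_ok(doc):
            continue
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = ["A long enough example document ... " * 20, "short", "A long enough example document ... " * 20]
print(len(clean_corpus(corpus)))  # 1: the short and the duplicated documents are dropped
```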

4.1.3 Data Scheduling
After data preprocessing, it is essential to design suitable strategies to schedule these multi-source data for pre-training a capable LLM. Generally, two key aspects should be paid close attention to for data scheduling: the proportion of each data source (data mixture), and the order in which each data source is scheduled for training (data curriculum). Next, we discuss the two aspects in detail. An illustration of data scheduling is presented in Figure 8.

Fig. 8: An illustration of data scheduling for pre-training LLMs.

Data Mixture. Since each kind of data source is closely related to the development of certain capacities of LLMs (referring to the discussions in Section 4.1), it is important to set a suitable distribution to mix these data. The data mixture is generally set at a global level (i.e., the distribution of the entire pre-training data), and can also be set locally to varied proportions at different training stages. During pre-training, data samples from different sources are selected according to the mixture proportions: more data will be sampled from a data source with a larger weight. Typically, existing LLMs such as LLaMA [57] may employ upsampling or downsampling on the full data of each source to create specific data mixtures as pre-training data. As Figure 6 illustrates, existing LLMs use different data mixtures to construct the pre-training data. As a representative model, the pre-training data of LLaMA [57] mainly consists of webpages (over 80%), alongside 6.5% of code-heavy data from GitHub and StackExchange, 4.5% from books, and 2.5% of scientific data sourced from arXiv, which has become an important reference for training general-purpose LLMs. Furthermore, special data mixtures can be used to facilitate different purposes. For example, Falcon [171] is trained on pure webpages, and CodeGen [86] largely increases the amount of code data. In practice, the data mixture is often determined empirically, and we summarize several common strategies for finding an effective data mixture as follows (a simple sampling sketch is given after this list):

• Increasing the diversity of data sources. Recent studies have empirically shown that training on excessive data about a certain domain would degrade the generalization capability of LLMs on other domains [35, 64]. In contrast, increasing the heterogeneity of data sources (e.g., including diverse data sources) is critical for improving the downstream performance of LLMs [227, 248, 249]. To further examine the effect of different data sources, some studies have conducted ablation experiments by removing each data source one by one and pre-training LLMs on the specially curated datasets [227]. It has been shown that dropping data sources with high heterogeneity (e.g., webpages) impacts LLMs’ abilities more severely than dropping sources with low heterogeneity (e.g., academic corpora).

• Optimizing data mixtures. In addition to manually setting the data mixtures, several studies have proposed to optimize the data mixtures for improving model pre-training [59, 250]. Given the target downstream tasks, one can select pre-training data with either higher proximity in the feature space [250] or those that provide positive influences on downstream task performance [251]. Further, to reduce the reliance on target tasks, DoReMi [59] first trains a small reference model using given initial domain weights, and then trains another small proxy model, upweighting the domains on which the greatest discrepancies in likelihood between the two models are observed. Finally, the learned domain weights of the proxy model are applied to train a much larger LLM. More simply, one can train several small language models with different data mixtures, and select the data mixture that leads to the most desirable performance. However, this approach assumes that, when trained in a similar way, small models resemble large models in abilities or behaviors, which may not always hold in practice.

• Specializing the targeted abilities. The model capacities of LLMs heavily rely on data selection and mixture, and one can boost the proportions of specific data sources to enhance certain model abilities [64, 227]. For example, the mathematical reasoning and coding abilities can be specially enhanced by training with more mathematical texts and code data, respectively. Furthermore, experimental results on the LAMBADA dataset [252] show that increasing the proportion of books data can improve the model capacity in capturing long-term dependencies from text, and increasing the proportion of the C4 dataset [82] leads to performance improvement on the C4 validation dataset [64]. Generally, it is important to identify more implicit relations between data sources and model abilities. To enhance specific skills such as mathematics and coding in LLMs, or to develop specialized LLMs, a practical way is to employ a multi-stage training approach, e.g., general and skill-specific data can be scheduled at two consecutive stages. This approach of training LLMs on varying sources or proportions of data across multiple stages is also known as “data curriculum”, which will be introduced below.
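To make the mixture-based selection concrete, here is a minimal sketch of sampling pre-training examples in proportion to per-source weights; the sources and weights are only loosely inspired by the reported LLaMA proportions and are illustrative assumptions.

```python
import random

# Illustrative mixture weights (assumed), loosely inspired by reported LLaMA proportions.
mixture = {"web": 0.82, "code": 0.065, "books": 0.045, "arxiv": 0.025, "other": 0.045}

def sample_batch(datasets, weights, batch_size, rng=random):
    """Draw a batch whose source composition follows the mixture weights.

    `datasets` maps a source name to a list of examples; sources are drawn
    with replacement, so heavily weighted sources are effectively upsampled.
    """
    names = list(weights)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(datasets[source]))
    return batch

datasets = {name: [f"{name}-doc-{i}" for i in range(1000)] for name in mixture}
print(sample_batch(datasets, mixture, batch_size=8))
```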
Data Curriculum. After preparing the data mixture, it is important to schedule the order in which specific data is presented to LLMs for pre-training. It has been shown that, in some cases, to learn a certain skill, learning in a skill-set sequence (e.g., basic skills → target skill) outperforms direct learning from a corpus focused solely on the target skill [253, 254]. Following the idea of curriculum learning [255], data curriculum has been proposed and widely used in model pre-training [253, 254, 256, 257]. It aims to organize different parts of the pre-training data for LLMs in a specific order, e.g., starting with easy/general examples and progressively introducing more challenging/specialized ones. More generally, it can broadly refer to the adaptive adjustment of data proportions for different sources during pre-training. Existing work on data curriculum mainly focuses on continual pre-training, such as specialized coding LLMs (e.g., CodeLLaMA [254]) or long-context LLMs (e.g., LongLLaMA [257]). However, the literature still lacks a more detailed report on data curriculum for general-purpose LLMs (e.g., LLaMA). To determine a data curriculum, a practical approach is to monitor the development of key abilities of LLMs based on specially constructed evaluation benchmarks, and then adaptively adjust the data mixture during pre-training. Next, we take three common abilities as examples to introduce how the concept of data curriculum23 applies in continual pre-training.

23. We utilize the symbol “→” to represent the data order in data curriculum. For example, “2T webpage tokens → 500B code tokens” means that the LLM is firstly trained with 2T webpage tokens and subsequently with 500B code tokens.

• Coding. To improve the coding ability of LLMs, CodeLLaMA [254] is developed based on LLaMA 2 [99] (2T general tokens → 500B code-heavy tokens), aiming to improve the code generation ability and retain natural language understanding skills. CodeLLaMA also provides a version that is further specialized to a certain programming language, namely CodeLLaMA-Python (2T general tokens → 500B code-heavy tokens → 100B Python-heavy tokens).

• Mathematics. Llemma [258] is proposed to enhance the mathematical capacities of general-purpose LLMs. It is developed based on CodeLLaMA. Although CodeLLaMA [254] mainly focuses on the coding ability, experiments have shown that it performs better than its base model LLaMA 2 on mathematics benchmarks [258]. Based on CodeLLaMA, Llemma is continually trained on mixtures of scientific papers, web data containing mathematical text, and code (2T general tokens → 500B code-heavy tokens → 50∼200B math-heavy tokens). Note that the pre-training data of Llemma also contains 5% general domain data as a form of regularization.

• Long context. Long context modeling is an important ability for LLMs, and many studies have explored extending the context windows of LLMs via continual training [254, 257]. With modifications on the position embeddings (i.e., position interpolation) of RoPE-based LLMs [57, 99, 259], CodeLLaMA further extends the context window of LLaMA 2 (2.5T tokens with 4K context window → 20B tokens with 16K context window). LongLLaMA [257] also achieves a longer context window with the help of external memory and a unique training objective (1T tokens with 2K context window → 10B tokens with 8K context window).
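A staged curriculum of this kind can be captured as an ordered list of training stages, each with a token budget and a mixture. In the sketch below, the token budgets restate the CodeLLaMA-Python schedule quoted above, while the per-stage mixtures are purely illustrative assumptions.

```python
# Each stage is consumed in order: a minimal way to express "A → B → C" schedules.
curriculum = [
    # Mixtures below are made-up placeholders, not published proportions.
    {"name": "general",      "tokens": 2_000_000_000_000, "mixture": {"web": 0.9, "code": 0.1}},
    {"name": "code-heavy",   "tokens":   500_000_000_000, "mixture": {"web": 0.15, "code": 0.85}},
    {"name": "python-heavy", "tokens":   100_000_000_000, "mixture": {"web": 0.1, "python": 0.9}},
]

def stages(curriculum):
    """Yield (stage name, token budget, mixture) in curriculum order."""
    for stage in curriculum:
        yield stage["name"], stage["tokens"], stage["mixture"]

for name, budget, mixture in stages(curriculum):
    print(f"train on ~{budget:.1e} tokens with mixture {mixture} ({name})")
```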
4.1.4 Summary of Data Preparation
In this part, we summarize the general procedure and key points to prepare pre-training data for LLMs, which are detailed in the following three aspects.

• Data collection. It is suggested to include diverse data sources in the pre-training data. Although Falcon [171] shows that webpages alone can be employed to train powerful LLMs, a more typical approach is to also incorporate diverse high-quality text like code, books, scientific papers, etc. If an LLM is to be specialized for a certain skill, the proportion of the corresponding data source should be increased accordingly. For example, Gopher [64] and Chinchilla [34] are trained with approximately 40% of data from books. PaLM [44] and LaMDA [68] use approximately 50% conversational data.

• Data cleaning. After data collection, it is crucial to clean the raw corpus to enhance its quality as much as possible. First, deduplication is commonly used in existing work [99, 171, 248]. Second, low-quality text, toxic content, and data with privacy concerns should be removed at different granularities (e.g., document, passage, or sentence). In practice, both heuristic and classifier-based methods can be employed for quality and toxicity filtering (e.g., CCNet [260], fastText [261], and Data-Juicer [262]). Third, with the cleaned data, one can further unify or specify the format of the pre-training data, and perform the tokenization by training the tokenizer on the filtered and deduplicated corpus with libraries like SentencePiece [245].

• Data scheduling. With the preprocessed data, the next step is to determine the data mixture and the specific order of data for pre-training LLMs. To determine both settings, a practical way is to first train several small language models with multiple candidate plans and then select a good plan among them [59]. Overall, it is more difficult to find a suitable data curriculum. In practice, one can monitor the performance of intermediate model checkpoints on specific evaluation benchmarks, and dynamically tune the data mixture and distribution during pre-training. In this process, it is also useful to explore the potential relations between data sources and model abilities to guide the design of the data curriculum.
4.2 Architecture
In this section, we review the architecture design of LLMs, i.e., mainstream architecture, pre-training objective, and detailed configuration. Table 5 presents the model cards of several representative LLMs with public details.

4.2.1 Typical Architectures
Due to the excellent parallelizability and capacity, the Transformer architecture [22] has become the de facto backbone to develop various LLMs, making it possible to scale language models to hundreds or thousands of billions of parameters. In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder, as shown in Figure 9.

Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence and generate its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence. Encoder-decoder PLMs (e.g., T5 [82] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, there are only a small number of LLMs built on the encoder-decoder architecture, e.g., Flan-T5 [69]. We leave a detailed discussion about the architecture selection to Section 4.2.5.

Causal Decoder Architecture. The causal decoder architecture incorporates the unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself. The input and output tokens are processed in the same fashion through the decoder. As representative language models of this architecture, the GPT-series models [26, 55, 122] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing an amazing in-context learning capability of LLMs. Interestingly, GPT-1 [122] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the model capacity of this architecture. So far, causal decoders have been widely adopted as the architecture of various existing LLMs, such as OPT [90], BLOOM [78], and Gopher [64]. Note that both the causal decoder and the prefix decoder discussed next belong to decoder-only architectures. When mentioning the “decoder-only architecture”, it mainly refers to the causal decoder architecture in the existing literature, unless specified otherwise.

Prefix Decoder Architecture. The prefix decoder architecture (a.k.a., non-causal decoder [263]) revises the masking mechanism of causal decoders, to enable performing bidirectional attention over the prefix tokens [264] and unidirectional attention only on generated tokens. In this way, like the encoder-decoder architecture, prefix decoders can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, where the same parameters are shared during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train causal decoders and then convert them into prefix decoders to accelerate convergence [29], e.g., U-PaLM [118] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [93] and U-PaLM [118].
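The difference between the causal and prefix decoders lies entirely in the attention mask. The sketch below builds both masks for a short sequence (a hypothetical prefix length of 3 in a sequence of 5 tokens) so the contrast illustrated in Figure 9 is explicit; it is illustrative only.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Token i may attend to tokens 0..i (lower-triangular mask)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Bidirectional attention inside the prefix, causal attention afterwards."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True   # prefix tokens attend to each other fully
    return mask

print(causal_mask(5).astype(int))
print(prefix_mask(5, prefix_len=3).astype(int))
```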
Mixture-of-Experts. The above three types of architectures can be further extended via mixture-of-experts (MoE) scaling, in which a subset of the neural network weights is sparsely activated for each input, e.g., Switch Transformer [25] and GLaM [112]. The major merit is that MoE is a flexible way to scale up the model parameters while maintaining a constant computational cost [25]. It has been shown that substantial performance improvements can be observed by increasing either the number of experts or the total parameter size [265]. Despite the merits, training large MoE models may suffer from instability issues due to the complex, hard-switching nature of the routing operation. To enhance the training stability of MoE-based language models, techniques such as selectively using high-precision tensors in the routing module or initializing the model with a smaller range have been introduced [25]. More recently, there has been widespread speculation that GPT-4 has been developed based on the MoE architecture, but without official verification.

Emergent Architectures. The conventional Transformer architecture typically suffers from quadratic computational complexity with respect to sequence length, resulting in a high processing cost for dealing with long inputs. To improve efficiency, recent studies aim to devise new architectures for language modeling, mostly based on parameterized state space models (SSM) [266], which can be viewed as a combination of RNN and CNN. On the one hand, SSMs can generate outputs recursively like an RNN, meaning that they only need to refer to the single previous state during decoding. This makes the decoding process more efficient, as it eliminates the need to revisit all previous states as in conventional Transformers. On the other hand, these models have the capability to encode an entire sequence in parallel like a Transformer via convolution computation. Thus, they can benefit from the parallelism of GPUs with techniques such as Parallel Scan [267, 268], FFT [269, 270], and Chunkwise Recurrent [271]. Despite the high computational efficiency of SSMs, their performance still lags behind that of Transformers. Thus, several variants of SSM have been proposed, including Mamba [272], RetNet [271], RWKV [273], and Hyena [269].
TABLE 5: Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding,
#L denotes the number of layers, #H denotes the number of attention heads, dmodel denotes the size of hidden states, and
MCL denotes the maximum context length during training.

Model Category Size Normalization PE Activation Bias #L #H dmodel MCL


GPT3 [55] Causal decoder 175B Pre LayerNorm Learned GeLU ✓ 96 96 12288 2048
PanGu-α [84] Causal decoder 207B Pre LayerNorm Learned GeLU ✓ 64 128 16384 1024
OPT [90] Causal decoder 175B Pre LayerNorm Learned ReLU ✓ 96 96 12288 2048
PaLM [56] Causal decoder 540B Pre LayerNorm RoPE SwiGLU × 118 48 18432 2048
BLOOM [78] Causal decoder 176B Pre LayerNorm ALiBi GeLU ✓ 70 112 14336 2048
MT-NLG [113] Causal decoder 530B - - - - 105 128 20480 2048
Gopher [64] Causal decoder 280B Pre RMSNorm Relative - - 80 128 16384 2048
Chinchilla [34] Causal decoder 70B Pre RMSNorm Relative - - 80 64 8192 -
Galactica [35] Causal decoder 120B Pre LayerNorm Learned GeLU × 96 80 10240 2048
LaMDA [68] Causal decoder 137B - Relative GeGLU - 64 128 8192 -
Jurassic-1 [107] Causal decoder 178B Pre LayerNorm Learned GeLU ✓ 76 96 13824 2048
LLaMA [57] Causal decoder 65B Pre RMSNorm RoPE SwiGLU × 80 64 8192 2048
LLaMA 2 [99] Causal decoder 70B Pre RMSNorm RoPE SwiGLU × 80 64 8192 4096
Falcon [171] Causal decoder 40B Pre LayerNorm RoPE GeLU × 60 64 8192 2048
GLM-130B [93] Prefix decoder 130B Post DeepNorm RoPE GeGLU ✓ 70 96 12288 2048
T5 [82] Encoder-decoder 11B Pre RMSNorm Relative ReLU × 24 128 1024 512

(Figure 9 shows three panels, Causal Decoder, Prefix Decoder, and Encoder-Decoder, illustrated on the example sequence “A Survey of Large Language Models”.)
Fig. 9: A comparison of the attention patterns in three mainstream architectures. Here, the blue, green, yellow and grey
rounded rectangles indicate the attention between prefix tokens, attention between prefix and target tokens, attention
between target tokens, and masked attention respectively.

TABLE 6: Comparison of parallelism and complexity of different models. T represents the sequence length, H represents the dimension of the input representation, N represents the dimension after compression in SSMs, and M represents the number of layers in each Hyena module.

Model | Decoding Complexity | Training Complexity
Transformer | O(H(T + H)) | O(TH(T + H))
SSM | O(H(N² + H)) | O(TH(log T + N² + H))
Mamba | O(H(N² + H)) | O(TH(N² + H))
RWKV | O(H²) | O(TH²)
RetNet | O(H²) | O(TH²)
Hyena | O(MH(T + H)) | O(TMH(log T + H))

• Mamba. Mamba [272] aims to selectively filter out or remember information during state updates. It replaces the original fixed parameters of SSM layers with functions of the input, selectively filtering out information from the previous state and the current input depending on the current input. Compared with traditional SSMs, Mamba has demonstrated improved text modeling capacities.

• RWKV. RWKV [273] combines the advantages of Transformer and RNN. It employs time-mixing modules, i.e., RNNs with gating, and channel-mixing modules, which are special feedforward neural networks [273]. Within these modules, token shift, a linear combination of the current and previous token, is used instead of the token representation as the input.

• RetNet. RetNet [271] proposes multi-scale retention (MSR) to replace the attention module in Transformer. Similar to linear attention, in the MSR module, the input is first mapped into query, key, and value, and the product of key and value is employed to update the state. Then, the query is used to project the state into the output. Similar to traditional SSMs, RetNet keeps both the parallel and the recurrent computation capacities at the same time.

• Hyena. Hyena employs long convolution to replace the attention module. In the long convolution module, filters based on relative positions are used to aggregate information at different positions into intermediate representations, and gating functions are employed to further project the intermediate representations into the final output. However, due to the long convolution, Hyena cannot infer like an RNN and must explicitly access all previous states.
4.2.2 Detailed Configuration
Since the launch of the Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency. In this part, we discuss the corresponding configurations for four major parts of the Transformer, including normalization, position embeddings, activation functions, and attention and bias. To make this survey more self-contained, we present the detailed formulations for these configurations in Table 7.

Normalization Methods. Training instability is a challenging issue for pre-training LLMs. To alleviate this issue, normalization is a widely adopted strategy to stabilize the training of neural networks. In the vanilla Transformer [22], LayerNorm [275] is employed. Recently, several advanced normalization techniques have been proposed as alternatives to LayerNorm, e.g., RMSNorm and DeepNorm.

• LayerNorm. In early research, BatchNorm [284] was a commonly used normalization method. However, it is difficult to apply to sequence data of variable lengths and to small-batch data. Thus, LayerNorm [275] was introduced to conduct layerwise normalization. Specifically, the mean and variance over all activations per layer are calculated to re-center and re-scale the activations.

• RMSNorm. To improve the training speed of LayerNorm (LN), RMSNorm [276] was proposed, which re-scales the activations with only the root mean square (RMS) of the summed activations, instead of the mean and variance. Related research has demonstrated its superiority in training speed and performance on Transformers [285]. Representative models that adopt RMSNorm include Gopher [64] and Chinchilla [34].

• DeepNorm. DeepNorm was proposed by Microsoft [277] to stabilize the training of deep Transformers. With DeepNorm as residual connections, Transformers can be scaled up to 1,000 layers [277], showing the advantages of stability and good performance. It has been adopted by GLM-130B [93].
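For concreteness, a minimal RMSNorm layer following the formulation in Table 7 could look as follows; this is an illustrative re-implementation, not the exact code of any model mentioned above.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Re-scale activations by their root mean square, with a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gamma in Table 7

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x_i^2)); unlike LayerNorm, no mean subtraction.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)  # torch.Size([2, 5, 16])
```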
Normalization Position. In addition to the normalization method, the normalization position also plays a crucial role in LLMs. There are generally three choices for the normalization position, i.e., post-LN, pre-LN, and sandwich-LN.

• Post-LN. Post-LN is used in the vanilla Transformer [22], where it is placed between residual blocks. However, existing work has found that the training of Transformers with post-LN tends to be unstable due to the large gradients near the output layer [286]. Thus, post-LN is rarely employed in existing LLMs except in combination with other strategies (e.g., combining post-LN with pre-LN in GLM-130B [93]).

• Pre-LN. Different from post-LN, pre-LN [287] is applied before each sub-layer, and an additional LN is placed before the final prediction. Compared with post-LN, Transformers with pre-LN are more stable during training. However, they perform worse than the variants with post-LN [288]. Despite the decreased performance, most LLMs still adopt pre-LN due to its training stability. One exception is that pre-LN has been found unstable in GLM when training models with more than 100B parameters [93].

• Sandwich-LN. Based on pre-LN, Sandwich-LN [274] adds an extra LN before the residual connections to avoid value explosion issues in Transformer layer outputs. However, it has been found that Sandwich-LN sometimes fails to stabilize the training of LLMs and may lead to training collapse [93].

Activation Functions. To obtain good performance, activation functions also need to be properly set in the feed-forward networks. In existing LLMs, GeLU activations [289] are widely used. Specially, in the latest LLMs (e.g., PaLM and LaMDA), variants of the GLU activation [281, 290] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [285]. However, compared with GeLU, they require extra parameters (about 50%) in the feed-forward networks [291].

Position Embeddings. Since the self-attention modules in the Transformer are permutation equivariant, position embeddings (PE) are employed to inject absolute or relative position information for modeling sequences.

• Absolute position embedding. In the vanilla Transformer [22], absolute position embeddings are employed. At the bottom of the encoder and the decoder, the absolute positional embeddings are added to the input embeddings. There are two variants of absolute position embeddings proposed in the vanilla Transformer [22], i.e., sinusoidal and learned position embeddings, where the latter is commonly used in existing pre-trained language models.

• Relative position embedding. Unlike absolute position embeddings, relative positional embeddings are generated according to the offsets between keys and queries [292]. A popular variant of relative PE was introduced in Transformer-XL [293, 294], where the calculation of attention scores between keys and queries is modified to introduce learnable embeddings corresponding to relative positions. T5 [82] further simplified relative positional embeddings, which was subsequently adopted by Gopher [64]. Specifically, it adds learnable scalars to the attention scores, where the scalars are calculated based on the distances between the positions of the query and the key. Compared with absolute PE, Transformers with relative position embeddings can generalize to sequences longer than those seen during training, i.e., extrapolation [283].
TABLE 7: Detailed formulations for the network configurations. Here, Sublayer denotes a FFN or a self-attention module in a Transformer layer, d denotes the size of hidden states, p_i denotes the position embedding at position i, A_ij denotes the attention score between a query and a key, r_{i−j} denotes a learnable scalar based on the offset between the query and the key, and R_{Θ,t} denotes a rotary matrix with rotation degree t · Θ.

Configuration | Method | Equation
Normalization position | Post Norm [22] | Norm(x + Sublayer(x))
Normalization position | Pre Norm [26] | x + Sublayer(Norm(x))
Normalization position | Sandwich Norm [274] | x + Norm(Sublayer(Norm(x)))
Normalization method | LayerNorm [275] | ((x − µ)/σ) · γ + β,  µ = (1/d) Σ_{i=1}^{d} x_i,  σ = sqrt((1/d) Σ_{i=1}^{d} (x_i − µ)²)
Normalization method | RMSNorm [276] | (x / RMS(x)) · γ,  RMS(x) = sqrt((1/d) Σ_{i=1}^{d} x_i²)
Normalization method | DeepNorm [277] | LayerNorm(α · x + Sublayer(x))
Activation function | ReLU [278] | ReLU(x) = max(x, 0)
Activation function | GeLU [279] | GeLU(x) = 0.5x ⊗ [1 + erf(x/√2)],  erf(x) = (2/√π) ∫_0^x e^{−t²} dt
Activation function | Swish [280] | Swish(x) = x ⊗ sigmoid(x)
Activation function | SwiGLU [281] | SwiGLU(x1, x2) = Swish(x1) ⊗ x2
Activation function | GeGLU [281] | GeGLU(x1, x2) = GeLU(x1) ⊗ x2
Position embedding | Absolute [22] | x_i = x_i + p_i
Position embedding | Relative [82] | A_ij = W_q x_i x_j^⊤ W_k^⊤ + r_{i−j}
Position embedding | RoPE [282] | A_ij = W_q x_i R_{Θ,i−j} x_j^⊤ W_k^⊤ = (W_q x_i R_{Θ,i})(W_k x_j R_{Θ,j})^⊤
Position embedding | ALiBi [283] | A_ij = W_q x_i x_j^⊤ W_k^⊤ − m(i − j)
• Rotary position embedding. Rotary position embedding (RoPE) [282] sets specific rotary matrices based on the absolute position of each key or query. The scores between keys and queries can then be computed with relative position information (Table 7). RoPE combines each consecutive pair of elements in the query and key vectors as a dimension, so there are d/2 dimensions for an original d-length embedding. For each dimension i ∈ {1, . . . , d/2}, the pair of involved elements rotates by the angle t · θ_i, where t denotes the position index and θ_i is the basis of the dimension. Following sinusoidal position embeddings [22], RoPE defines the basis θ_i as an exponentiation of the base b (set to 10000 by default):

Θ = {θ_i = b^{−2(i−1)/d} | i ∈ {1, 2, . . . , d/2}}.   (4)

Furthermore, a recent study [295] defines the distance required to rotate one cycle (2π) for each dimension as the wavelength:

λ_i = 2πb^{2(i−1)/d} = 2π/θ_i.   (5)

Due to the excellent performance and the long-term decay property, RoPE is widely adopted in the latest LLMs, e.g., PaLM [56] and LLaMA [57]. Based on RoPE, xPos [296] further improves the translation invariance and length extrapolation of the Transformer. At each dimension of the rotation angle vector, xPos adds a special exponential decay that is smaller when the basis is larger. This can alleviate the unstable phenomenon during training as the distance increases.

• ALiBi. ALiBi [283] is proposed to improve the extrapolation of the Transformer. Similar to relative position embeddings, it biases attention scores with a penalty based on the distances between keys and queries. Different from relative positional embedding methods like T5 [82], the penalty scores in ALiBi are pre-defined without any trainable parameters. Empirical results in [283] have shown that ALiBi has better extrapolation performance on sequences longer than those seen during training than several popular position embedding methods such as sinusoidal PE [22], RoPE [282], and T5 bias [82]. In addition, it has been shown that ALiBi can also improve training stability in BLOOM [78].
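As a small illustration of Equation (4) and the pairwise rotation described above, the sketch below rotates a query/key vector according to its position; it is a plain re-derivation for illustration and ignores the interleaving conventions of specific implementations.

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of `x` (length d, d even) by angle position * theta_i."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)          # Equation (4), with i starting from 0
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]             # consecutive pairs (x_{2i}, x_{2i+1})
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
# Relative-position property: <rotate(q, m), rotate(k, n)> depends only on m - n.
print(np.dot(rope_rotate(q, 5), rope_rotate(q, 3)), np.dot(rope_rotate(q, 7), rope_rotate(q, 5)))
```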
Attention. The attention mechanism is a critical component of the Transformer. It allows the tokens across the sequence to interact with each other and computes the representations of the input and output sequences.

• Full attention. In the vanilla Transformer [22], the attention mechanism is conducted in a pairwise way, considering the relations between all token pairs in a sequence. It adopts scaled dot-product attention, in which the hidden states are mapped into queries, keys, and values. Additionally, the Transformer uses multi-head attention instead of single attention, projecting the queries, keys, and values with different projections in different heads. The concatenation of the outputs of all heads is taken as the final output.

• Sparse attention. A crucial challenge of full attention is its quadratic computational complexity, which becomes a burden when dealing with long sequences. Therefore, various efficient Transformer variants have been proposed to reduce the computational complexity of the attention mechanism [297, 298]. For instance, locally banded sparse attention (i.e., Factorized Attention [299]) has been adopted in GPT-3 [55]. Instead of the whole sequence, each query can only attend to a subset of tokens based on the positions.

• Multi-query/grouped-query attention. Multi-query attention refers to the attention variant where different heads share the same linear transformation matrices on the keys and values [300]. It achieves higher inference speed with only a minor sacrifice in model quality. Representative models with multi-query attention include PaLM [56] and StarCoder [98]. To make a trade-off between multi-query attention and multi-head attention, grouped-query attention (GQA) [301] has been explored. In GQA, heads are assigned into different groups, and the heads that belong to the same group share the same transformation matrices. Specially, GQA has been adopted and empirically tested in the recently released LLaMA 2 model [99].

• FlashAttention. Different from most existing approximate attention methods that trade off model quality to improve computing efficiency, FlashAttention [302] proposes to optimize the speed and memory consumption of attention modules on GPUs from an IO-aware perspective. There exist different levels of memory on modern GPUs, e.g., SRAM with fast IO and HBM with relatively slow IO. FlashAttention organizes the input into blocks and introduces necessary recomputation, both to make better use of the fast SRAM memory. Implemented as a fused kernel in CUDA, FlashAttention has been integrated into PyTorch [211], DeepSpeed [74], and Megatron-LM [75]. The updated version, FlashAttention-2 [303], further optimizes the work partitioning of GPU thread blocks and warps, leading to around 2× speedup compared to the original FlashAttention.
• PagedAttention. It has been observed that when LLMs are deployed on servers, GPU memory is largely occupied by cached attention key and value tensors (called the KV cache). The major reason is that the input lengths are often varied, leading to fragmentation and over-reservation issues. Inspired by the classic paging technique in operating systems, PagedAttention has been proposed to improve the memory efficiency and throughput of deployed LLMs [304]. In detail, PagedAttention partitions each sequence into subsequences, and the corresponding KV caches of these subsequences are allocated into non-contiguous physical blocks. The paging technique increases GPU utilization and enables efficient memory sharing in parallel sampling.

To put all these discussions together, we summarize the suggestions from the existing literature on detailed configuration. For stronger generalization and training stability, it is suggested to choose pre RMSNorm for layer normalization, and SwiGLU or GeGLU as the activation function. In addition, LN is suggested not to be used immediately after embedding layers, since doing so is likely to incur performance degradation. As for position embeddings, RoPE or ALiBi is a better choice since they perform better on long sequences.

4.2.3 Pre-training Tasks
Pre-training plays a key role in encoding general knowledge from a large-scale corpus into the massive model parameters. For training LLMs, there are two commonly used pre-training tasks, namely language modeling and denoising autoencoding.

Language Modeling. The language modeling task (LM) is the most commonly used objective to pre-train decoder-only LLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of tokens x = {x1, . . . , xn}, the LM task aims to autoregressively predict the target tokens x_i based on the preceding tokens x_{<i} in a sequence. A general training objective is to maximize the following likelihood:

L_LM(x) = Σ_{i=1}^{n} log P(x_i | x_{<i}).   (6)

Since most language tasks can be cast as a prediction problem based on the input, these decoder-only LLMs might be potentially advantageous in implicitly learning how to accomplish these tasks in a unified LM way. Some studies have also revealed that decoder-only LLMs can be naturally transferred to certain tasks by autoregressively predicting the next tokens [26, 55], without fine-tuning. An important variant of LM is the prefix language modeling task, which is designed for pre-training models with the prefix decoder architecture. The tokens within a randomly selected prefix are not used in computing the loss of prefix language modeling. With the same amount of tokens seen during pre-training, prefix language modeling performs slightly worse than language modeling, since fewer tokens in the sequence are involved in model pre-training [29].
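The objective in Equation (6) is the standard next-token cross-entropy. A minimal sketch of computing it from model logits is given below; the shapes and toy vocabulary are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative of Equation (6), averaged over predicted positions.

    logits: (batch, seq_len, vocab) scores produced by a decoder-only model.
    tokens: (batch, seq_len) token ids.
    """
    pred = logits[:, :-1, :]   # logits at position i are used to predict token i+1
    target = tokens[:, 1:]     # each token is predicted from its preceding context
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

batch, seq_len, vocab = 2, 6, 100
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (batch, seq_len))
print(lm_loss(logits, tokens))  # scalar loss to minimize
```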
Fig. 10: The probability distribution over the vocabulary, in descending order, for the next token of the context “I am sleepy. I start a pot of”. For ease of discussion, this example is given in word units instead of subword units. (Top candidates: coffee 0.661, water 0.119, tea 0.057, rice 0.017, chai 0.012, strong 0.008, black 0.008, hot 0.007, oat 0.006, beans 0.006, soup 0.005, . . . , happy 4.3e-6, Boh 4.3e-6, . . . )

Denoising Autoencoding. In addition to conventional LM, the denoising autoencoding task (DAE) has also been widely used to pre-train language models [24, 82]. The inputs x\x̃ for the DAE task are corrupted text with randomly replaced spans. The language models are then trained to recover the replaced tokens x̃. Formally, the training objective of DAE is denoted as follows:

L_DAE(x) = log P(x̃ | x\x̃).   (7)

However, the DAE task seems to be more complicated to implement than the LM task. As a result, it has not been widely used to pre-train large language models. Existing LLMs that take DAE as the pre-training objective include T5 [82] and GLM-130B [93]. These models are mainly trained to recover the replaced spans in an autoregressive way.

Mixture-of-Denoisers. Mixture-of-Denoisers (MoD) [89], also known as the UL2 loss, was introduced as a unified objective for pre-training language models. MoD regards both LM and DAE objectives as different types of denoising tasks, namely the S-denoiser (LM), the R-denoiser (DAE, short spans and low corruption), and the X-denoiser (DAE, long spans or high corruption). Among the three denoising tasks, the S-denoiser is similar to the conventional LM objective (Equation (6)), while the R-denoiser and X-denoiser are similar to DAE objectives (Equation (7)) but differ from each other in the lengths of spans and the ratio of corrupted text. For input sentences started with different special tokens (i.e., {[R], [S], [X]}), the model will be optimized using the corresponding denoisers. MoD has been applied in the latest PaLM 2 model [120].

4.2.4 Decoding Strategy
After the LLMs have been pre-trained, it is essential to employ a specific decoding strategy to generate the appropriate output from the LLMs.

Background. We start the discussion with the prevalent decoder-only architecture, and introduce the auto-regressive decoding mechanism. Since such LLMs are pre-trained based on the language modeling task (Equation (6)), a basic decoding method is greedy search, which predicts the most likely token at each step based on the previously generated tokens, formally modeled as:

x_i = arg max_x P(x | x_{<i}),   (8)

where x_i is the token with the highest probability at the i-th step of generation conditioned on the context x_{<i}.
For instance, in Figure 10, when predicting the next token of the sentence “I am sleepy. I start a pot of”, greedy search selects the token “coffee”, which has the highest probability at the current step. Greedy search can achieve satisfactory results in text generation tasks (e.g., machine translation and text summarization), in which the output is highly dependent on the input [305]. However, for open-ended generation tasks (e.g., story generation and dialog), greedy search sometimes tends to generate awkward and repetitive sentences [306].

As an alternative decoding strategy, sampling-based methods are proposed to randomly select the next token based on the probability distribution, so as to enhance the randomness and diversity during generation:

x_i ∼ P(x | x_{<i}).   (9)

For the example in Figure 10, sampling-based methods would sample the word “coffee” with a higher probability while also retaining the possibility of selecting the rest of the words, “water”, “tea”, “rice”, etc.

Not limited to the decoder-only architecture, these two decoding methods can be generally applied to encoder-decoder models and prefix decoder models in a similar way.

Improvement for Greedy Search. Selecting the token with the highest probability at each step may result in overlooking a sentence with a higher overall probability but a lower local estimation. Next, we introduce several improvement strategies to alleviate this issue.

• Beam search. Beam search [307] retains the sentences with the n (beam size) highest probabilities at each step during the decoding process, and finally selects the generated response with the top probability. Typically, the beam size is configured within the range of 3 to 6. However, opting for a larger beam size might result in a decline in performance [308].

• Length penalty. Since beam search favours shorter sentences, imposing a length penalty (a.k.a., length normalization) is a commonly used technique [309] to overcome this issue, which normalizes the sentence probability according to the sentence length (divided by an exponential power α of the length).

Besides, some researchers [310] propose to penalize the generation of previously generated tokens or n-grams to alleviate the issue of repetitive generation. In addition, diverse beam search [311] can be leveraged to produce a set of diverse outputs based on the same input.

Improvement for Random Sampling. Sampling-based methods sample the token over the whole vocabulary, which may select wrong or irrelevant tokens (e.g., “happy” and “Boh” in Figure 10) based on the context. To improve the generation quality, several strategies have been proposed for mitigating or preventing the selection of words with exceedingly low probabilities.

• Temperature sampling. To modulate the randomness of sampling, a practical method is to adjust the temperature coefficient of the softmax function when computing the probability of the j-th token over the vocabulary:

P(x_j | x_{<i}) = exp(l_j / t) / Σ_{j′} exp(l_{j′} / t),   (10)

where l_{j′} is the logit of each word and t is the temperature coefficient. Reducing the temperature t increases the chance of selecting words with high probabilities while decreasing the chances of selecting words with low probabilities. When t is set to 1, it becomes the default random sampling; when t approaches 0, it is equivalent to greedy search. In addition, when t goes to infinity, it degenerates to uniform sampling.

• Top-k sampling. Different from temperature sampling, top-k sampling directly truncates the tokens with lower probability and only samples from the tokens with the top k highest probabilities [312]. For example, in Figure 10, top-5 sampling will sample from the words “coffee”, “water”, “tea”, “rice”, and “chai” according to their re-scaled probabilities.

• Top-p sampling. Since top-k sampling does not consider the overall probability distribution, a constant value of k may not be suitable for different contexts. Therefore, top-p sampling (a.k.a., nucleus sampling) is proposed, which samples from the smallest set having a cumulative probability above (or equal to) p [306]. In practice, the smallest set can be constructed by gradually adding tokens from the vocabulary sorted in descending order of generative probability, until their cumulative value exceeds p.

Recently, researchers have also explored other sampling strategies for LLMs. For instance, η-sampling [313] further improves top-p sampling by introducing a dynamic threshold based on the probability distribution. Furthermore, contrastive search [314] and typical sampling [315] can be utilized to improve the generation coherence during decoding. Since it has been found that large models tend to assign higher probability to important tokens compared to small models, contrastive decoding [316] utilizes a larger LM (e.g., OPT-13B) and a smaller LM (e.g., OPT-125M) to measure their log-likelihood differences. Subsequently, tokens are sampled based on the delta values of the probability distribution, thereby amplifying the impact of important tokens. Based on this contrastive idea, DoLa [317] further extends this approach to contrasting the logits across different layers of a single LLM, as higher layers tend to assign more weight to important tokens.
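The three basic sampling controls above compose naturally: temperature rescales the logits, and top-k/top-p restrict the candidate set before renormalizing. A compact sketch, using the toy distribution of Figure 10 as illustrative input, is shown below.

```python
import math
import random

def sample_next(probs, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from `probs` (dict token -> probability)."""
    # Temperature: rescale log-probabilities as in Equation (10).
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    m = max(logits.values())
    weights = {tok: math.exp(l - m) for tok, l in logits.items()}
    total = sum(weights.values())
    dist = sorted(((tok, w / total) for tok, w in weights.items()),
                  key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        dist = dist[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in dist:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        dist = kept
    tokens, token_weights = zip(*dist)
    return rng.choices(tokens, weights=token_weights, k=1)[0]

toy = {"coffee": 0.661, "water": 0.119, "tea": 0.057, "rice": 0.017, "chai": 0.012,
       "happy": 4.3e-6, "Boh": 4.3e-6}
print(sample_next(toy, temperature=0.7, top_k=5, top_p=0.9))
```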
Practical Settings. In practice, existing libraries (e.g., Transformers [201]) and public APIs of LLMs (e.g., OpenAI) support various decoding strategies to serve different scenarios of text generation. Next, we present the decoding settings of several representative LLMs:

• T5 [82] utilizes greedy search as the default setting and applies beam search (beam size of 4) with a length penalty of 0.6 for translation and summarization tasks.

• GPT-3 [55] employs beam search with a beam size of 4 and a length penalty of 0.6 for all generation tasks.

• Alpaca [146] utilizes sampling-based strategies with top-k (k = 50), top-p (p = 0.9), and a temperature of 0.7 for open-ended generation.

• LLaMA [57] applies diverse decoding strategies tailored to specific tasks. For instance, it employs greedy search for question answering tasks, while utilizing a sampling strategy with temperature settings of 0.1 (pass@1) and 0.8 (pass@100) for code generation.

• The OpenAI API supports several basic decoding strategies, including greedy search (by setting temperature to 0), beam search (with the setting best_of), temperature sampling (with the setting temperature), and nucleus sampling (with the setting top_p). It also introduces the parameters presence_penalty and frequency_penalty to control the repetition degree of generation. According to OpenAI’s documentation, their APIs would produce different outputs even if the input and the hyper-parameters are the same. Setting temperature to 0 can yield more deterministic outputs, albeit with a slight chance of variability.
4.2.5 Summary and Discussion
The choice of architecture and pre-training tasks may incur different inductive biases for LLMs, which would lead to different model capacities. In this part, we discuss one open issue about the architecture choice for LLMs.

Why does Predicting the Next Word Work?
The essence of the decoder-only architecture is to accurately predict the next word for reconstructing the pre-training data. Till now, there has been no formal study that theoretically demonstrates its advantage over other architectures. An interesting explanation was given by Ilya Sutskever during the interview held by Jensen Huang (a). The original transcript from the interview is copied below (b):

“Say you read a detective novel. It’s like complicated plot, a storyline, different characters, lots of events, mysteries like clues, it’s unclear. Then, let’s say that at the last page of the book, the detective has gathered all the clues, gathered all the people and saying, "okay, I’m going to reveal the identity of whoever committed the crime and that person’s name is". Predict that word. ... Now, there are many different words. But predicting those words better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better.”

a. https://www.nvidia.com/en-us/on-demand/session/gtcspring23-S52092/
b. https://lifearchitect.ai/ilya/

Architecture Choice. In the earlier literature on pre-trained language models, there are lots of discussions on the effects of different architectures [29, 89]. However, most LLMs are developed based on the causal decoder architecture, and there still lacks a theoretical analysis of its advantage over the other alternatives. Next, we briefly summarize the existing discussions on this issue.

• By pre-training with the LM objective, it seems that the causal decoder architecture can achieve superior zero-shot and few-shot generalization capacity. Existing research has shown that without multi-task fine-tuning, the causal decoder has better zero-shot performance than other architectures [29]. The success of GPT-3 [55] has demonstrated that large causal decoder models can be good few-shot learners. In addition, instruction tuning and alignment tuning, discussed in Section 5, have been proven to further enhance the capability of large causal decoder models [66, 67, 69].

• Scaling laws have been widely observed in causal decoders. By scaling the model size, the dataset size, and the total computation, the performance of causal decoders can be substantially improved [30, 55]. Thus, it has become an important strategy to increase the model capacity of the causal decoder via scaling. However, more detailed investigation of encoder-decoder models is still lacking, and more efforts are needed to investigate the performance of encoder-decoder models at a large scale.

More research efforts on architectures and pre-training objectives are needed to analyze how the choices of architecture and pre-training tasks affect the capacity of LLMs, especially for encoder-decoder architectures. Despite the effectiveness of the decoder-only architecture, it is also suggested to make more diverse explorations of architecture design. Besides the major architecture, the detailed configuration of an LLM is also worth attention, as discussed in Section 4.2.2.

4.3 Model Training
In this part, we review the important settings, techniques, and tricks for training LLMs.

4.3.1 Optimization Setting
For parameter optimization of LLMs, we present the commonly used settings for batch training, learning rate, optimizer, and training stability.

Batch Training. For language model pre-training, existing work generally sets the batch size to a large number (e.g., 2,048 examples or 4M tokens) to improve the training stability and throughput. For LLMs such as GPT-3 and PaLM, a new strategy has been introduced that dynamically increases the batch size during training, ultimately reaching a million scale. Specifically, the batch size of GPT-3 is gradually increased from 32K to 3.2M tokens. Empirical results have demonstrated that the dynamic schedule of batch size can effectively stabilize the training process of LLMs [56].

Learning Rate. Existing LLMs usually adopt a similar learning rate schedule with warm-up and decay strategies during pre-training. Specifically, in the initial 0.1% to 0.5% of the training steps, a linear warm-up schedule is employed to gradually increase the learning rate to the maximum value, which ranges from approximately 5 × 10⁻⁵ to 1 × 10⁻⁴ (e.g., 6 × 10⁻⁵ for GPT-3). Then, a cosine decay strategy is adopted in the subsequent steps, gradually reducing the learning rate to approximately 10% of its maximum value, until the convergence of the training loss.
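A typical warm-up-then-cosine schedule of this kind can be written in a few lines; the peak value, warm-up fraction, and floor (10% of the peak) below follow the description above, while the total step count is an arbitrary assumption.

```python
import math

def lr_at_step(step, total_steps, peak_lr=6e-5, warmup_frac=0.002, floor_ratio=0.1):
    """Linear warm-up to `peak_lr`, then cosine decay to `floor_ratio * peak_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))

total = 100_000  # assumed number of training steps
for s in (0, 200, 50_000, 99_999):
    print(s, f"{lr_at_step(s, total):.2e}")
```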

TABLE 8: Detailed optimization settings of several existing LLMs.

Model | Batch Size (#tokens) | Learning Rate | Warmup | Decay Method | Optimizer | Precision Type | Weight Decay | Grad Clip | Dropout
GPT3 (175B) 32K→3.2M 6 × 10−5 yes cosine decay to 10% Adam FP16 0.1 1.0 -
PanGu-α (200B) - 2 × 10−5 - - Adam - 0.1 - -
OPT (175B) 2M 1.2 × 10−4 yes manual decay AdamW FP16 0.1 - 0.1
PaLM (540B) 1M→4M 1 × 10−2 no inverse square root Adafactor BF16 lr2 1.0 0.1
BLOOM (176B) 4M 6 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 0.0
MT-NLG (530B) 64 K→3.75M 5 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 -
Gopher (280B) 3M→6M 4 × 10−5 yes cosine decay to 10% Adam BF16 - 1.0 -
Chinchilla (70B) 1.5M→3M 1 × 10−4 yes cosine decay to 10% AdamW BF16 - - -
Galactica (120B) 2M 7 × 10−6 yes linear decay to 10% AdamW - 0.1 1.0 0.1
LaMDA (137B) 256K - - - - BF16 - - -
Jurassic-1 (178B) 32 K→3.2M 6 × 10−5 yes - - - - - -
LLaMA (65B) 4M 1.5 × 10−4 yes cosine decay to 10% AdamW - 0.1 1.0 -
LLaMA 2 (70B) 4M 1.5 × 10−4 yes cosine decay to 10% AdamW - 0.1 1.0 -
Falcon (40B) 2M 1.85 × 10−4 yes cosine decay to 10% AdamW BF16 0.1 - -
GLM (130B) 0.4M→8.25M 8 × 10−5 yes cosine decay to 10% AdamW FP16 0.1 1.0 0.1
T5 (11B) 64K 1 × 10−2 no inverse square root AdaFactor - - - 0.1
ERNIE 3.0 Titan (260B) - 1 × 10−4 - - Adam FP16 0.1 1.0 -
PanGu-Σ (1.085T) 0.5M 2 × 10−5 yes - Adam FP16 - - -

Optimizer. The Adam optimizer [318] and the AdamW optimizer [319] are widely utilized for training LLMs (e.g., GPT-3); they are based on adaptive estimates of lower-order moments for first-order gradient-based optimization. Commonly, the hyper-parameters are set as follows: β1 = 0.9, β2 = 0.95 and ϵ = 10⁻⁸. Meanwhile, the Adafactor optimizer [320] has also been utilized in training LLMs (e.g., PaLM and T5), which is a variant of the Adam optimizer specially designed for conserving GPU memory during training. The hyper-parameters of the Adafactor optimizer are set as: β1 = 0.9 and β2 = 1.0 − k⁻⁰·⁸, where k denotes the number of training steps.

Stabilizing the Training. During the pre-training of LLMs, training often suffers from instability issues, which may cause model collapse. To address this issue, weight decay and gradient clipping have been widely utilized, where existing studies [55, 78, 90, 93, 113] commonly set the threshold of gradient clipping to 1.0 and the weight decay rate to 0.1. However, with the scaling of LLMs, training loss spikes are also more likely to occur, leading to unstable training. To mitigate this problem, PaLM [56] and OPT [90] use a simple strategy that restarts the training process from an earlier checkpoint before the occurrence of the spike and skips over the data that may have caused the problem. Further, GLM [93] finds that abnormal gradients of the embedding layer usually lead to spikes, and proposes to shrink the embedding layer gradients to alleviate this.

4.3.2 Scalable Training Techniques
As the model and data sizes increase, it has become challenging to efficiently train LLMs with limited computational resources. In particular, two primary technical issues need to be resolved, i.e., increasing the training throughput and loading larger models into GPU memory. In this part, we review several widely used approaches in existing work to address the above two challenges, namely 3D parallelism [75, 321, 322] and mixed precision training [323], and also give general suggestions on how to utilize them for training.

3D Parallelism. 3D parallelism is actually a combination of three commonly used parallel training techniques, namely data parallelism, pipeline parallelism [321, 322], and tensor parallelism [75]24. We next introduce the three parallel training techniques.

24. Model parallelism is a broader term that includes tensor parallelism and pipeline parallelism in some work [75].

• Data parallelism. Data parallelism is one of the most fundamental approaches to improving training throughput. It replicates the model parameters and optimizer states across multiple GPUs and then distributes the whole training corpus onto these GPUs. In this way, each GPU only needs to process its assigned data, performing the forward and backward propagation to obtain the gradients. The computed gradients on different GPUs are further aggregated to obtain the gradients of the entire batch for updating the models on all GPUs. Since the gradient calculations are performed independently on different GPUs, the data parallelism mechanism is highly scalable: training throughput can be improved simply by increasing the number of GPUs. Furthermore, this technique is simple to implement, and most existing popular deep learning libraries, such as TensorFlow and PyTorch, have already implemented data parallelism.

• Pipeline parallelism. Pipeline parallelism aims to distribute the different layers of an LLM across multiple GPUs. In particular, in the case of a Transformer model, pipeline parallelism loads consecutive layers onto the same GPU, to reduce the cost of transmitting the computed hidden states or gradients between GPUs. However, a naive implementation of pipeline parallelism may result in a lower GPU utilization rate, as each GPU has to wait for the previous one to complete its computation, leading to the unnecessary cost of bubble overhead [321]. To reduce these bubbles in pipeline parallelism, GPipe [321] and PipeDream [322] propose the techniques of padding multiple batches of data and asynchronous gradient updates to improve pipeline efficiency.

• Tensor parallelism. Tensor parallelism is also a commonly used technique that aims to decompose the LLM for multi-GPU loading. Unlike pipeline parallelism, tensor parallelism focuses on decomposing the tensors (the parameter matrices) of LLMs.
split into two submatrices, A1 and A2, by column, which can be expressed as Y = [XA1, XA2]. By placing matrices A1 and A2 on different GPUs, the matrix multiplication operation is invoked on the two GPUs in parallel, and the final result can be obtained by combining the outputs from the two GPUs through cross-GPU communication. Currently, tensor parallelism has been supported in several open-source libraries, e.g., Megatron-LM [75], and can be extended to higher-dimensional tensors. Also, Colossal-AI has implemented tensor parallelism for higher-dimensional tensors [324–326] and proposed sequence parallelism [327] especially for sequence data, which can further decompose the attention operation of the Transformer model.
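To make the column-wise decomposition concrete, the following is a minimal, single-process sketch of the split Y = [XA1, XA2] described above; it is purely illustrative (in a real setup such as Megatron-LM, A1 and A2 would reside on different GPUs and the concatenation would be realized by an all-gather across devices):

```python
import torch

# Column-wise tensor parallelism in one process: Y = X @ A equals [X @ A1, X @ A2]
# when A is split by column into A1 and A2.
torch.manual_seed(0)
X = torch.randn(4, 8)            # input activations (batch x hidden)
A = torch.randn(8, 6)            # parameter matrix

A1, A2 = A.chunk(2, dim=1)       # column-wise split: each submatrix has shape (8, 3)
Y_parallel = torch.cat([X @ A1, X @ A2], dim=1)   # per-"GPU" partial results, then concatenation

assert torch.allclose(X @ A, Y_parallel, atol=1e-6)   # identical to the unsplit result
```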

Mixed Precision Training. In previous PLMs (e.g., BERT [23]), 32-bit floating-point numbers, also known as FP32, have been predominantly used for pre-training. In recent years, to pre-train extremely large language models, some studies [323] have started to utilize 16-bit floating-point numbers (FP16), which reduces memory usage and communication overhead. Additionally, as popular NVIDIA GPUs (e.g., A100) have twice the amount of FP16 computation units as FP32, the computational efficiency of FP16 can be further improved. However, existing work has found that FP16 may lead to a loss of computational accuracy [64, 78], which affects the final model performance. To alleviate this, an alternative called Brain Floating Point (BF16) has been used for training, which allocates more exponent bits and fewer significand bits than FP16. For pre-training, BF16 generally performs better than FP16 on representation accuracy [78].
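As a minimal sketch of how BF16 mixed precision can be enabled in practice, the following PyTorch snippet uses autocast with bfloat16; it is illustrative only (large-scale pre-training frameworks such as DeepSpeed or Megatron-LM wire this up together with 3D parallelism and ZeRO), and it assumes a CUDA GPU with BF16 support:

```python
import torch
from torch import nn

# BF16 mixed-precision training step (illustrative sketch).
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

optimizer.zero_grad()
# The forward pass runs in bfloat16; master weights and optimizer states stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)
loss.backward()   # unlike FP16, BF16 usually needs no loss scaling (it shares FP32's exponent range)
optimizer.step()
```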
Overall Training Suggestion. In practice, the above training techniques, especially 3D parallelism, are often jointly used to improve the training throughput and large model loading. For instance, researchers have incorporated 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism, enabling the training of BLOOM [78] on 384 A100 GPUs. Currently, open-source libraries like DeepSpeed [74], Colossal-AI [203], and Alpa [328] can well support the three parallel training methods. To reduce memory redundancy, ZeRO, FSDP, and activation recomputation techniques [77, 329] can also be employed for training LLMs, which have already been integrated into DeepSpeed, PyTorch, and Megatron-LM. In addition, mixed precision training techniques such as BF16 can also be leveraged to improve the training efficiency and reduce GPU memory usage, although they require the necessary hardware support (e.g., A100 GPUs). Because training large models is a time-intensive process, it is useful to forecast the model performance and detect abnormal issues at an early stage. For this purpose, GPT-4 [46] has recently introduced a new mechanism called predictable scaling built on a deep learning stack, enabling the performance prediction of large models with a much smaller model, which might be quite useful for developing LLMs. In practice, one can further leverage the supporting training techniques of mainstream deep learning frameworks. For instance, PyTorch supports the data parallel training algorithm FSDP [330] (i.e., fully sharded data parallel), which allows for partial offloading of training computations to CPUs if desired.

5 POST-TRAINING OF LLMS
After pre-training, LLMs can acquire the general abilities for solving various tasks. However, an increasing number of studies have shown that LLMs' abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning. The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences. Further, we will also discuss efficient tuning and quantization for model adaptation in resource-limited settings. In what follows, we will introduce the four parts in detail.

5.1 Instruction Tuning
In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [67], which is highly related to supervised fine-tuning [66] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 67, 69], even in a multilingual setting [94].
A recent survey [331] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. In addition, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [66] and GPT-4 [46].

5.1.1 Formatted Instance Construction
Generally, an instruction-formatted instance consists of a task description (called an instruction), an optional input, the corresponding output, and a small number of demonstrations (optional). As important public resources, existing studies have released a large number of labeled data formatted in natural language (see the list of available resources in Table 3) as introduced in Section 3.3.1. Next, we introduce four major methods for constructing formatted instances (see an illustration in Figure 11) and then discuss several key factors for instance construction.
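To make the structure of such instances concrete, the following is a minimal, hypothetical instance written as a Python dictionary; the field names are illustrative rather than a fixed standard, and the example texts are taken from Figure 11:

```python
# A hypothetical instruction-formatted instance (field names are illustrative).
instance = {
    "instruction": "Please answer this question.",            # task description
    "demonstrations": [                                       # optional few-shot examples
        {"input": "Q: What is the capital of France?", "output": "A: Paris."},
    ],
    "input": "Q: What is the capital of China?",              # optional input
    "output": "A: Beijing.",                                   # desired output
}

# For sequence-to-sequence training, the instance is usually flattened into a
# single prompt-response pair:
prompt = instance["instruction"] + "\n" + "\n".join(
    d["input"] + "\n" + d["output"] for d in instance["demonstrations"]
) + "\n" + instance["input"]
response = instance["output"]
```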
Formatting NLP Task Datasets. Before instruction tuning was proposed, several early studies [181, 332, 333] collected instances from a diverse range of traditional NLP tasks (e.g., text summarization, text classification, and translation) to create supervised multi-task training datasets. As a major source of instruction tuning instances, it is convenient to format these multi-task training datasets with natural language task descriptions. Specifically, recent work [28, 66, 67, 88] augments the labeled datasets with human-written task descriptions, which instructs LLMs to understand the tasks by explaining the task goal. For example, in Figure 11(a), a task description "Please answer this question" is added for each example in the question-answering task.
Fig. 11: An illustration of instance formatting and three different methods for constructing the instruction-formatted instances: (a) formatting task datasets, (b) formatting daily chat data, and (c) formatting synthetic data.
After instruction tuning, LLMs can generalize well to other unseen tasks by following their task descriptions [28, 67, 69]. In particular, it has been shown that instructions are the crucial factor in the task generalization ability of LLMs [67]: fine-tuning the model on labeled datasets with the task descriptions removed results in a dramatic drop in model performance. To better generate labeled instances for instruction tuning, a crowd-sourcing platform, PromptSource [180], has been proposed to effectively create, share, and verify the task descriptions for different datasets. To enrich the training instances, several studies [28, 181, 334] also try to invert the input-output pairs of existing instances with specially designed task descriptions for instruction tuning. For instance, given a question-answer pair, we can create a new instance by predicting the answer-conditioned question (e.g., "Please generate a question based on the answer:").
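The following short sketch illustrates this inversion trick on a single QA pair (the pair itself is the toy example from Figure 11; the dictionary layout is illustrative):

```python
# Turning one labeled QA pair into two instruction-formatted instances,
# the second by inverting the input-output direction (answer-conditioned question).
qa_pair = {"question": "What is the capital of Brazil?", "answer": "Brasilia"}

forward_instance = {
    "instruction": "Please answer this question:",
    "input": qa_pair["question"],
    "output": qa_pair["answer"],
}
inverted_instance = {
    "instruction": "Please generate a question based on the answer:",
    "input": qa_pair["answer"],
    "output": qa_pair["question"],
}
```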
Formatting Daily Chat Data. Despite that a large number of training instances have been formatted with instructions, they mainly come from public NLP datasets, either lacking instruction diversity or mismatching real human needs [66]. To overcome this issue, InstructGPT [66] proposes to take the queries that real users have submitted to the OpenAI API as the task descriptions. Additionally, to enrich the task diversity, human labelers are also asked to compose instructions for real-life tasks, including open-ended generation, open question answering, brainstorming, and chatting. Then, another group of labelers directly answers these instructions as the output. Finally, an instruction (i.e., the collected user query) and the expected output (i.e., the human-written answer) are paired as a training instance. Note that InstructGPT also employs these real-world tasks formatted in natural language for alignment tuning (discussed in Section 5.2). Further, GPT-4 [46] has designed potentially high-risk instructions and guided the model to reject these instructions through supervised fine-tuning for safety concerns. Considering the absence of high-quality public chat data, several studies have also collected users' chat requests as input data, and then utilized ChatGPT or GPT-4 to generate responses as output data. A notable example of such a dataset is the conversational data from ShareGPT [153]. Additionally, Dolly [185] and OpenAssistant [186] have further released their conversation data, which has been carefully labeled by human annotators to attain a high level of quality.

Formatting Synthetic Data. To reduce the burden of human annotation or manual collection, several semi-automated approaches [147] have been proposed for constructing instances by feeding existing instances into LLMs to synthesize diverse task descriptions and instances. As illustrated in Figure 11(c), the Self-Instruct method only needs 175 instances as the initial task pool. Then, a few instances are randomly selected from the pool as demonstrations to prompt a LLM to generate new instructions and corresponding input-output pairs. After quality and diversity filtering, the newly generated instances are added back into the task pool. Hence, the synthetic method is an effective and economical way to generate large-scale instruction data for LLMs. However, the instances generated by the Self-Instruct method might be simplistic or lack diversity. To improve the quality of synthetic instructions, WizardLM [335] introduces Evol-Instruct, which proposes in-depth and in-breadth evolving to enrich the complexity and diversity of the instances. Furthermore, Self-Align [336] establishes multiple human-aligned principles to filter the synthesized instances. It then employs these instances to train a LLM in order to yield more aligned instances. To enhance the quality of the instance output, researchers directly adopt human-written texts as the output and synthesize corresponding instructions using ICL examples [337].
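The bootstrapping idea behind such semi-automated pipelines can be sketched as follows. This is not the actual Self-Instruct implementation [147]; the `generate`, `parse`, and `keep` callables are placeholders standing in for an LLM API call, an output parser, and a quality/diversity filter, respectively:

```python
import random

def self_instruct(seed_tasks, generate, parse, keep, rounds=1000, demos_per_prompt=4):
    """Grow an instruction pool from human-written seed instances (illustrative sketch)."""
    task_pool = list(seed_tasks)          # e.g., 175 human-written seed instances
    for _ in range(rounds):
        demos = random.sample(task_pool, k=min(demos_per_prompt, len(task_pool)))
        prompt = "Come up with a new task and one input-output example.\n\n" + "\n\n".join(
            f"Task: {d['instruction']}\nInput: {d['input']}\nOutput: {d['output']}" for d in demos
        )
        candidate = parse(generate(prompt))     # new instruction plus input-output pair
        if keep(candidate, task_pool):          # drop low-quality or near-duplicate items
            task_pool.append(candidate)
    return task_pool
```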
Key Factors for Instruction Dataset Construction. The quality of instruction instances has an important impact on the performance of the model. Here, we discuss some essential factors for instance construction.
• Scaling the instructions. It has been widely shown that scaling the number of tasks can largely enhance the
generalization ability of LLMs [28, 67, 88]. With the increase of the task number, the model performance initially shows a continuous growth pattern, while the gain becomes negligible when it reaches a certain level [69, 88]. A plausible speculation is that a certain number of representative tasks can provide relatively sufficient knowledge, and adding more tasks may not bring additional gains [69]. Also, it is beneficial to enhance the diversity of the task descriptions in several aspects, such as length, structure, and creativity [28]. As for the number of instances per task, it has been found that a small number of instances can usually saturate the generalization performance of the model on a specific task [67, 69]. Specially, several recent works [338, 339] have explored the effect of fine-tuning with a small amount of high-quality instruction data (e.g., one or a few thousand instances), showing very promising results on the evaluation tasks. In contrast, another line of studies continues to explore the scaling effect of instruction data [340, 341]. For example, Orca [340] scales up the synthesized instances to 5 million with step-by-step explanations, and it achieves superior performance across a wide range of tasks.
• Formatting design. As an important factor, the design of the natural language format also highly impacts the generalization performance of LLMs [88]. Typically, we can add task descriptions and optional demonstrations to the input-output pairs of existing datasets, where the task description is the key part for LLMs to understand the task [88]. Further, using an appropriate number of exemplars as demonstrations can lead to substantial improvements [69], which also alleviates the model sensitivity to instruction engineering [67, 69]. However, incorporating other components (e.g., things to avoid, reasons, and suggestions) into instructions may have a negligible or even adverse effect on the performance of LLMs [88, 179]. Recently, to elicit the step-by-step reasoning ability of LLMs, some work [69] proposes to include chain-of-thought (CoT) examples for some reasoning datasets, such as arithmetic reasoning. It has been shown that fine-tuning LLMs with both CoT and non-CoT examples can lead to good performance across various reasoning tasks, including those that require multi-hop reasoning ability (e.g., commonsense question answering and arithmetic reasoning) as well as those without the need for such a reasoning way (e.g., sentiment analysis and extractive question answering) [69, 95].
• Instruction quality improvement. Data quality is very important for the performance of instruction tuning, and a surge of work has been proposed to further improve the quality of existing instruction datasets. Typically, these methods mostly rely on carefully designed prompts to guide LLMs to refine or rewrite the given instruction. WizardLM [335] aims to complexify and diversify the Alpaca dataset [187] by devising prompts to widen and deepen the required knowledge of given instructions. It also crafts a filter strategy to remove low-quality instructions. To further provide fine-grained knowledge guidance, recent work also incorporates knowledge taxonomies into the input prompt, e.g., knowledge key points [342] and the human-AI conversation topic taxonomy [343]. To guarantee the instruction quality, early methods mainly employ closed-source APIs or powerful open-source LLMs, which incurs a huge cost for large-scale instruction synthesis. Considering this issue, recent studies widely explore the potential of relatively small models for data synthesis. For instance, JiuZhang3.0 [344] fine-tunes a 7B language model to synthesize questions by distilling knowledge from GPT-4, and then utilizes it to synthesize massive high-quality instructions based on the pre-training corpus. Such an approach can achieve better performance on mathematical reasoning tasks than baseline methods, with only 20% of the data synthesis cost.
• Instruction selection. As a surge of instruction datasets are proposed, it is non-trivial to select the high-quality ones among them to construct the training dataset. Generally, existing work either leverages quality estimation metrics or employs LLMs as the judge model to rank all the instruction instances, and then selects those with relatively higher scores. Concretely, for metrics, perplexity and other heuristic measurements (e.g., length) [345] have been widely used in practice, e.g., we can consider removing high-perplexity or very short instructions, which might correspond to low-quality ones. To better estimate the effect of an instruction on LLM capability, more complex metrics (e.g., IFD [346]) have also been proposed, which are computed by combining multiple simple metrics. Additionally, diversity-aware sampling methods have been introduced to ensure the overall coverage of representative instruction data [347]. Besides, when downstream task data is available, cross-instance gradient similarity can be employed to measure the value of training instances for the target task. LESS [348] computes gradients for both downstream validation and training instruction data, to evaluate the contribution of instruction data based on extensions of the influence function [349].
To summarize, diversity and quality of instructions are important factors to consider when scaling the number of instances [338]. As the capacities of LLMs improve, data synthesis methods have become the mainstream approach for generating large amounts of instruction data. Following this trend, there are increasingly more automatically generated instruction datasets available, and selection and refining methods are key to effectively using these datasets. To help readers understand how different factors affect instruction tuning, we conduct an empirical study by experimenting with multiple specially constructed instruction datasets in Section 5.1.4.

5.1.2 Instruction Tuning Strategies
Unlike pre-training, instruction tuning is often more efficient since only a moderate number of instances are used for training. Since instruction tuning can be considered a supervised training process, its optimization differs from pre-training in several aspects [69], such as the training objective (i.e., sequence-to-sequence loss) and optimization configuration (e.g., smaller batch size and learning rate), which require special attention in practice. In addition to these optimization configurations, there are also four important aspects to consider for instruction tuning:

Balancing the Data Distribution. Since instruction tuning involves a mixture of different tasks, it is important to balance the proportion of different tasks during fine-tuning.
TABLE 9: Basic statistics of the required number of GPUs, tuning time, batch size (denoted as BS) per device (full tuning
and LoRA tuning), and inference rate (the number of generated tokens per second). Our experiments are conducted based
on two Linux servers having 8 A800-80G SXM4 GPUs with 6 NVSwitch and 8 3090-24G GPUs, respectively. The major
difference between A800 and A100 lies in the NVLink interconnect speed. Thus, our estimations about training and
inference efficiency would be slightly improved for A100, while the memory consumption would remain the same.
For full tuning experiments, we use data parallel training, ZeRO Stage 3, BF16, and gradient checkpointing. Additionally,
the LoRA tuning can be executed on one 80G GPU utilizing INT8 quantization with the rank setting set to 16. All the
experiments are conducted with the Alpaca-52K dataset by training LLaMA models for three epochs. The max sequence length
for both training settings is set to 512. The inference experiments are performed with the batch size set to 1.

Models | A800 Full Tuning: #GPU / BS / Time | A800 LoRA Tuning: #GPU / BS / Time | A800 Inference (16-bit): #GPU / #Token/s | 3090 Inference (16-bit): #GPU / #Token/s | 3090 Inference (8-bit): #GPU / #Token/s
LLaMA (7B) 2 8 3.0h 1 80 3.5h 1 36.6 1 24.3 1 7.5
LLaMA (13B) 4 8 3.1h 1 48 5.1h 1 26.8 2 9.9 1 4.5
LLaMA (30B) 8 4 6.1h 1 24 14.3h 1 17.7 4 3.8 2 2.6
LLaMA (65B) 16 2 11.2h 1 4 60.6h 2 8.8 8 2.0 4 1.5

A widely used method is the examples-proportional mixing strategy [82], i.e., combining all the datasets and sampling each instance equally from the mixed datasets. Furthermore, increasing the sampling ratio of high-quality collections (e.g., FLAN [67] and P3 [180]) can generally lead to performance improvement according to recent findings [69, 95]. Further, it is common to set a maximum cap to control the maximum number of examples that a dataset can contribute during instruction tuning [82], which is set to prevent larger datasets from overwhelming the entire distribution [82, 95]. In practice, the maximum cap is typically set to several thousand or tens of thousands of examples, depending on the dataset [67, 69]. Recently, it has been empirically found that existing instruction datasets (Table 3) mainly focus on enhancing LLMs' capabilities in certain aspects, and a single dataset alone cannot lead to a comprehensive enhancement in model capacity [350]. Therefore, it is often suggested to use a mixture of existing instruction datasets to achieve a balanced improvement in different capacities, including NLP task data (e.g., FLAN v2 [351]), chat data (e.g., ShareGPT [153]), and synthetic data (e.g., GPT4-Alpaca [352]).
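A minimal sketch of examples-proportional mixing with a maximum cap is given below; the dataset names, sizes, and the cap value are illustrative, not the exact settings used in the cited work:

```python
import random

# Each dataset is weighted in proportion to its size, but no dataset contributes
# more weight than `cap` examples, preventing the largest one from dominating.
def mixture_weights(dataset_sizes, cap=30_000):
    effective = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}

sizes = {"flan_v2": 1_500_000, "sharegpt": 63_000, "gpt4_alpaca": 52_000}
weights = mixture_weights(sizes)

# During training, each example is drawn from a dataset chosen by these weights.
names = list(weights)
chosen_dataset = random.choices(names, weights=[weights[n] for n in names], k=1)[0]
```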
Combining Instruction Tuning and Pre-Training. To make the tuning process more effective and stable, OPT-IML [95] incorporates pre-training data during instruction tuning, which can be regarded as regularization for model tuning. Further, instead of using a separate two-stage process (pre-training then instruction tuning), some studies attempt to train a model from scratch with a mixture of pre-training data (i.e., plain texts) and instruction tuning data (i.e., formatted datasets) using multi-task learning [82]. Specifically, GLM-130B [93] and Galactica [35] integrate instruction-formatted datasets as a small proportion of the pre-training corpora to pre-train LLMs, which potentially achieves the advantages of pre-training and instruction tuning at the same time.

Multi-stage Instruction Tuning. For instruction tuning, there are two kinds of important instruction data, namely task-formatted instructions and daily chat instructions. Generally, the former has a significantly larger volume than the latter. It is important to balance the training with the two kinds of instruction data. In addition to carefully mixing different instruction data, we can also adopt a multi-stage instruction tuning strategy [341], where LLMs are first fine-tuned with large-scale task-formatted instructions and subsequently fine-tuned on daily chat ones. To avoid the capacity forgetting issue, it is also useful to add an amount of task-formatted instructions at the second stage. Actually, such a multi-stage tuning strategy can also be applied to other settings for instruction tuning. For example, we can schedule different fine-tuning stages with progressively increased levels of difficulty and complexity, and gradually improve the capacities of LLMs to follow complex instructions.

Other Practical Tricks. In practice, there are also several useful strategies and tricks that are helpful to improve the fine-tuning performance of LLMs. We list several representative ones as follows:
• Efficient training for multi-turn chat data. Given a multi-turn chat example (the conversation between a user and a chatbot), a straightforward fine-tuning way is to split it into multiple context-response pairs for training: a LLM is fine-tuned to generate the response based on the corresponding context for all splits (i.e., at each utterance from the user). In such a fine-tuning way, it is apparent that there exist overlapping utterances in the split examples from a conversation. To save the training cost, Vicuna [152] has adopted an efficient way that feeds the whole conversation into the LLM, but relies on a loss mask that only computes the loss on the responses of the chatbot for training (see the sketch at the end of this part). It can significantly reduce the compute costs derived from the overlapped utterances.
• Establishing self-identification for LLM. To deploy LLMs for real-world applications, it is necessary to establish their identity and make LLMs aware of this identity information, such as name, developer, and affiliation. A practical way is to create identity-related instructions for fine-tuning the LLM. It is also feasible to prefix the input with a self-identification prompt, e.g., "The following is a conversation between a human and an AI assistant called CHATBOTNAME, developed by DEVELOPER.", where CHATBOTNAME and DEVELOPER refer to the name and developer of the chatbot, respectively.
In addition to the above practical strategies and tricks, existing work has also used other tricks, e.g., concatenating multiple examples into a single sequence to approach the max length [353].
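The following sketch illustrates the Vicuna-style loss mask for multi-turn chat data mentioned above: the whole conversation is fed once, and only the chatbot tokens contribute to the loss. The toy tokenizer and the sample conversation are placeholders for illustration only:

```python
IGNORE_INDEX = -100   # label value ignored by PyTorch's cross-entropy loss

def build_inputs_and_labels(turns, tokenize):
    """turns: list of (role, text); tokenize: str -> list[int]."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(f"[|{role}|]: {text}\n")
        input_ids.extend(ids)
        if role == "AI":
            labels.extend(ids)                           # learn to produce assistant responses
        else:
            labels.extend([IGNORE_INDEX] * len(ids))     # mask out user utterances
    return input_ids, labels

conversation = [("Human", "Can you recommend some ways to lose weight?"),
                ("AI", "Here are some ways to lose weight: ..."),
                ("Human", "How about exercise?"),
                ("AI", "Increase physical activity: ...")]

toy_tokenize = lambda s: list(s.encode("utf-8"))         # stand-in tokenizer for illustration
ids, labels = build_inputs_and_labels(conversation, toy_tokenize)
```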

5.1.3 The Effect of Instruction Tuning
In this part, we discuss the effect of instruction tuning on LLMs in three major aspects.

Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [69]. Recent studies have experimented with language models at multiple scales (ranging from 77M to 540B), showing that models of different scales can all benefit from instruction tuning [69, 334], yielding improved performance as the parameter scale increases [94]. Further, smaller models with instruction tuning can even perform better than larger models without fine-tuning [28, 69]. Besides the model scale, instruction tuning demonstrates consistent improvements across various model architectures, pre-training objectives, and model adaptation methods [69]. In practice, instruction tuning offers a general approach to enhancing the abilities of existing language models [69] (including small-sized PLMs). Also, it is much less costly than pre-training, since the amount of instruction data required by LLMs is significantly smaller than the pre-training data.

Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [69]. A large number of studies have confirmed the effectiveness of instruction tuning in achieving superior performance on both seen and unseen tasks [95, 334]. Also, instruction tuning has been shown to be useful in alleviating several weaknesses of LLMs (e.g., repetitive generation or complementing the input without accomplishing a certain task) [66, 69], leading to a superior capacity to solve real-world tasks for LLMs. Furthermore, LLMs trained with instruction tuning can generalize to related tasks across languages. For example, BLOOMZ-P3 [94] is fine-tuned based on BLOOM [78] using the English-only task collection P3 [180]. Interestingly, BLOOMZ-P3 can achieve a more than 50% improvement in multilingual sentence completion tasks compared to BLOOM, which shows that instruction tuning can help LLMs acquire general task skills from English-only datasets and transfer such skills into other languages [94]. In addition, it has been found that using English-only instructions can produce satisfactory results on multilingual tasks [94], which helps reduce the effort of instruction engineering for a specific language.

Domain Specialization. Existing LLMs have showcased superior capabilities in traditional NLP tasks (e.g., generation and reasoning) and daily questions. However, they may still lack the domain knowledge needed to accomplish specific tasks, such as medicine, law, and finance (see Section 8 for a detailed discussion of LLMs in different applications). Instruction tuning is an effective approach to adapting existing general LLMs into domain-specific experts. For instance, researchers propose to fine-tune Flan-PaLM [69] using medical datasets to create Med-PaLM [354], a medical knowledge assistant that achieves performance levels comparable to those of expert clinicians. Furthermore, a recent study [355] fine-tunes FLAN-T5 to support e-commerce recommender systems with natural language instructions, showing strong performance on a variety of recommendation tasks. There are also several open-sourced medical models instruction-tuned based on LLaMA [57], such as BenTsao [356]. Also, researchers explore instruction tuning on law [357], finance [358], and arithmetic computation [359].

5.1.4 Empirical Analysis for Instruction Tuning
Fine-tuning LLMs with different instruction sets tends to lead to model variants with varied performance on downstream tasks. In this section, we will explore the effect of different types of instructions in fine-tuning LLMs (i.e., LLaMA (7B) and LLaMA (13B); due to the limit of computational resources, we cannot conduct large-scale experiments on larger LLaMA variants right now, which is scheduled for a future version), as well as examine the usefulness of several instruction improvement strategies.

Instruction Datasets. According to the discussion in Section 5.1.1, we mainly consider three common kinds of instructions as follows:
• Task-specific instructions. For the first type of instructions, we adopt the most commonly used multi-task instruction dataset, FLAN-T5 [69], which contains 1,836 tasks and over 15M instructions by combining four data mixtures from prior work.
• Daily chat instructions. This type of instruction consists of conversations posed by users about daily life, which are more closely related to real-life scenarios. We adopt the ShareGPT instruction set, consisting of 63K real-user instructions. It has been used as the core instruction set for Vicuna.
• Synthetic instructions. In addition to reusing existing instructions, we can also automatically synthesize massive instructions using LLMs. We adopt the popular synthetic instruction dataset Self-Instruct-52K [147], consisting of 52K instructions paired with about 82K instance inputs and outputs. These generated instructions have a similar data distribution to the human-written seed tasks (e.g., grammar checking, brainstorming).
As the original FLAN-T5 dataset is very large (i.e., over 15M instructions), we randomly sample 80,000 instructions from it to conduct a fair comparison with the other instruction datasets (i.e., ShareGPT and Self-Instruct-52K) at a similar scale. In our experiments, we test each individual instruction set to explore its own effect and also examine their combinatorial effects on model performance.

Improvement Strategies. Although real-world instructions from human users are more suitable for fine-tuning LLMs, it is difficult to collect them at a large scale. As alternatives to human-generated instructions, most existing research mainly adopts synthetic instructions generated by LLMs. However, there are some potential problems with synthetic instructions, such as poor topic diversity and uneven instruction difficulty (either too simple or too difficult). Thus, it is necessary to improve the quality of the synthetic instructions. Next, we summarize four major improvement strategies widely used in existing work as follows:
TABLE 10: Results of instruction-tuning experiments (all in a single-turn conversation) based on the LLaMA (7B) and
LLaMA (13B) model under the chat and QA setting. We employ four instruction improvement strategies on the Self-
Instruct-52K dataset, i.e., enhancing the complexity (w/ complexity), increasing the diversity (w/ diversity), balancing the
difficulty (w/ difficulty), and scaling the instruction number (w/ scaling). ∗ Since we select the LLaMA (7B)/(13B) model
fine-tuned on Self-Instruct-52K as the baseline, we omit the win rate of the fine-tuned model with Self-Instruct-52K against
itself.

Models | Dataset Mixtures | Instruction Numbers | Lexical Diversity | Chat: AlpacaFarm | QA: MMLU | QA: BBH3k
LLaMA (7B) ① FLAN-T5 80,000 48.48 23.77 38.58 32.79
② ShareGPT 63,184 77.31 81.30 38.11 27.71
③ Self-Instruct-52K 82,439 25.92 /∗ 37.52 29.81
②+③ 145,623 48.22 71.36 41.26 28.36
①+②+③ 225,623 48.28 70.00 43.69 29.69
③ Self-Instruct-52K 82,439 25.92 /∗ 37.52 29.81
w/ complexity 70,000 70.43 76.96 39.73 33.25
w/ diversity 70,000 75.59 81.55 38.01 30.03
w/ difficulty 70,000 73.48 79.15 32.55 31.25
w/ scaling 220,000 57.78 51.13 33.81 26.63
LLaMA (13B) ① FLAN-T5 80,000 48.48 22.12 34.12 34.05
② ShareGPT 63,184 77.31 77.13 47.49 33.82
③ Self-Instruct-52K 82,439 25.92 /∗ 36.73 25.43
②+③ 145,623 48.22 72.85 41.16 29.49
①+②+③ 225,623 48.28 69.49 43.50 31.16
③ Self-Instruct-52K 82,439 25.92 /∗ 36.73 25.43
w/ complexity 70,000 70.43 77.94 46.89 35.75
w/ diversity 70,000 75.59 78.92 44.97 36.40
w/ difficulty 70,000 73.48 80.45 43.15 34.59
w/ scaling 220,000 57.78 58.12 38.07 27.28

• Enhancing the instruction complexity. As discussed in existing work [335], enhancing the complexity of instructions can improve the model capacity of LLMs for following complex instructions, e.g., including more task demands or requiring more reasoning steps. To validate this strategy, we follow WizardLM [335] by gradually increasing the complexity levels, e.g., adding constraints, increasing reasoning steps, and complicating the input. We leverage the publicly released WizardLM-70K instructions [335] as the complexity-enhanced instruction dataset, which has been generated via the above enhancement approach based on the Self-Instruct-52K dataset [335].
• Increasing the topic diversity. In addition to the complexity, improving the topic diversity of the instruction dataset can help elicit different abilities of LLMs on diverse tasks in the real world [336]. However, it is difficult to directly control the self-instruct process to generate diverse instructions. Following YuLan-Chat [341], we employ ChatGPT to rewrite the instructions from the Self-Instruct-52K dataset, adapting them into 293 topics via specific prompts. Finally, we obtain 70K instructions as the diversity-increased dataset.
• Scaling the instruction number. In addition to the above aspects, the number of instructions is also an important factor that may affect the model performance. Specially, using more instructions can extend the task knowledge and improve the instruction-following ability of LLMs [69]. To examine this strategy, we sample new instructions from the synthesized instruction set released by the MOSS project [360], as they are also synthesized using the same self-instruct method [147]. We mix them with the Self-Instruct-52K dataset to compose a larger one containing 220K instructions.
• Balancing the instruction difficulty. As the synthetic instructions tend to contain instances that are too easy or too hard, training on them is likely to result in training instability or even overfitting for LLMs. To explore the potential effects, we leverage the perplexity score of LLMs to estimate the difficulty of instructions and remove instructions that are too easy or too hard (see the sketch at the end of this list). To generate the same scale of instructions for a fair comparison, we adopt a LLaMA (7B) model to compute the perplexity for the 220K instructions from the large instruction dataset, and then keep 70K instructions with moderate perplexity scores as the difficulty-balanced dataset.
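The perplexity-based difficulty balancing described above can be sketched as follows; this is an illustrative outline rather than the exact procedure, and `model` and `tokenizer` stand for any causal LM (e.g., a LLaMA-7B checkpoint loaded with Hugging Face Transformers):

```python
import math
import torch

@torch.no_grad()
def perplexity(text, model, tokenizer):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss          # mean token-level cross-entropy
    return math.exp(loss.item())

def difficulty_balanced(instructions, model, tokenizer, keep=70_000):
    # Score every instruction, then keep the middle band of perplexities,
    # dropping the easiest (lowest) and hardest (highest) tails.
    scored = sorted(instructions, key=lambda t: perplexity(t, model, tokenizer))
    start = max((len(scored) - keep) // 2, 0)
    return scored[start:start + keep]
```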
Experimental Setup. To conduct the experiments on the effect of instruction data, we leverage these new instruction datasets to tune LLaMA, a popular LLM backbone that has been widely used for instruction tuning. We use the code from YuLan-Chat [341] for our experiments, and train LLaMA 7B and 13B on a server of 8 A800-80G GPUs. All the hyper-parameter settings remain the same as Stanford Alpaca. To better evaluate the instruction-following ability of the fine-tuned models, we consider two settings, namely the chat setting and the QA setting. The chat setting mainly utilizes user instructions and queries from daily chat, whereas the QA setting mainly employs question answering examples from existing NLP datasets. The evaluation of the chat setting is conducted based on the AlpacaFarm evaluation set [361]. Instead of using a full pairwise comparison, we select the LLaMA 7B and 13B models fine-tuned on Self-Instruct-52K as the reference baselines, and then compare them with the other fine-tuned LLaMA 7B and 13B models using different instructions, respectively. Since our focus is to examine the usefulness of different strategies for generating instructions, the model fine-tuned on Self-Instruct-52K
can serve as a good reference. Following AlpacaFarm [361], for each comparison, we employ ChatGPT to automatically annotate which response from the two compared models is better for the user query, and report the win rate (%) as the evaluation metric. For the QA setting, we select two benchmarks, MMLU [362] and BBH [363], and evaluate the accuracy based on their default settings by using heuristic rules to parse the answers from these LLMs.
For both instruction tuning and evaluation, we adopt the following prompt: "The following is a conversation between a human and an AI assistant. The AI assistant gives helpful, detailed, and polite answers to the user's questions.\n[|Human|]:{input}\n[|AI|]:". To reproduce our results, we release the code and data at the link: https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

Results and Analysis. The results using different instruction datasets based on the 7B and 13B LLaMA models are shown in Table 10. Next, we summarize and analyze our findings in detail.
• Task-formatted instructions are more proper for the QA setting, but may not be useful for the chat setting. By comparing the performance of instruction tuning using FLAN-T5 with that of ShareGPT and Self-Instruct-52K, we can observe that FLAN-T5 mostly achieves a better performance on QA benchmarks while underperforming ShareGPT in the chat setting. The reason is that FLAN-T5 is composed of a mixture of instructions and examples from existing NLP tasks, e.g., translation and reading comprehension. As a result, LLaMA fine-tuned with FLAN-T5 performs better on QA tasks, but poorly on user queries. In contrast, ShareGPT consists of real-world human-ChatGPT conversations, which is able to better elicit LLaMA to follow user instructions in daily life, while it may not be suitable for accomplishing QA tasks.
• A mixture of different kinds of instructions is helpful to improve the comprehensive abilities of LLMs. After mixing the three kinds of instructions for fine-tuning, we can see that the derived LLaMA variant (with FLAN-T5, ShareGPT and Self-Instruct-52K) performs well in both task settings. In MMLU, the performance of LLaMA (7B) can surpass the ones using individual instruction sets by a large margin, i.e., 43.69 vs. 38.58 (FLAN-T5). It shows that mixing multiple sources of instruction datasets is helpful to improve the performance of instruction-tuned LLMs, as it scales the instruction number and increases the diversity.
• Enhancing the complexity and diversity of instructions leads to an improved model performance. By increasing the complexity and diversity of the Self-Instruct-52K dataset respectively, the chat and QA performance of LLaMA can be consistently improved, e.g., from 37.52 to 39.73 in MMLU for LLaMA (7B). It demonstrates that both strategies are useful to improve the instruction-following ability of LLMs. Further, we can see that improving the complexity yields a larger performance improvement on QA tasks. The reason is that the QA tasks mostly consist of difficult questions for evaluating LLMs, which can be better solved by LLMs that have learned complex instructions at the fine-tuning stage.
• Simply increasing the number of instructions may not be that useful, and balancing the difficulty is not always helpful. As the results shown in Table 10 indicate, balancing the difficulty and increasing the number of fine-tuning instructions are not very helpful in our experiments. Especially for scaling the instruction number, it even hurts the performance, e.g., a decrease from 29.81 to 26.63 in BBH for LLaMA (7B). It shows that simply scaling the number of synthesized instructions without quality control may not be effective in improving performance. Furthermore, fine-tuning with instructions of moderate difficulty also performs well in the chat setting, while slightly decreasing the performance in the QA setting. A possible reason is that we filter out complex and hard instructions with large perplexity scores, hurting the model performance in answering complex questions.
• A larger model scale leads to better instruction following performance. By comparing the performance of the LLaMA (7B) and LLaMA (13B) models fine-tuned with the same set of instruction data, we can see that LLaMA (13B) mostly achieves a better performance. It indicates that scaling the model size is helpful for improving the instruction-following capability. Besides, we can see that the QA performance has been improved a lot, e.g., from 38.11 to 47.49 in MMLU. It is likely because the larger models generally have better knowledge utilization and reasoning capability [33, 55], which can accurately answer more complex questions.

Instruction Tuning Suggestions
To conduct instruction tuning on LLMs, one can prepare the computational resources according to the basic statistics about the required number of GPUs and tuning time in Table 9. After setting up the development environment, we recommend beginners to follow the code of the Alpaca repository [187] for instruction tuning. Subsequently, one should select the base model and construct the instruction datasets as we discuss in this section. When computational resources for training are constrained, users can utilize LoRA for parameter-efficient tuning (see Section 5.3). As for inference, users can further use quantization methods to deploy LLMs on fewer or smaller GPUs (see Section 5.3).

5.2 Alignment Tuning
This part first presents the background of alignment with its definition and criteria, then focuses on the collection of human feedback data for aligning LLMs, and finally discusses the key technique of reinforcement learning from human feedback (RLHF) for alignment tuning.

5.2.1 Background and Criteria for Alignment
Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 67, 90]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [66, 364]. For LLMs, the language modeling objective pre-trains the model parameters by word prediction while lacking the consideration of human values or preferences. To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [66, 365]. However, unlike the original
pre-training and adaptation tuning (e.g., instruction tuning), such an alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called the alignment tax in related literature [366].

Alignment Criteria. Recently, there is increasing attention on developing multifarious criteria to regulate the behaviors of LLMs. Here, we take three representative alignment criteria (i.e., helpful, honest, and harmless) as examples for discussion, which have been widely adopted in existing literature [66, 366]. In addition, there are other alignment criteria for LLMs from different perspectives, including behavior, intent, incentive, and inner aspects [364], which are essentially similar (or at least rely on similar alignment techniques) to the above three criteria. It is also feasible to modify the three criteria according to specific needs, e.g., substituting honesty with correctness [116]. Next, we give brief explanations of the three representative alignment criteria:
• Helpfulness. To be helpful, the LLM should demonstrate a clear attempt to assist users in solving their tasks or answering their questions in as concise and efficient a manner as possible. At a higher level, when further clarification is needed, the LLM should demonstrate the capability of eliciting additional relevant information through pertinent inquiries and exhibit suitable levels of sensitivity, perceptiveness, and prudence [366]. Realizing the alignment of helpful behavior is challenging for LLMs since it is difficult to precisely define and measure the intention of users [364].
• Honesty. At a basic level, a LLM aligned to be honest should present accurate content to users instead of fabricating information. Additionally, it is crucial for the LLM to convey appropriate degrees of uncertainty in its output, in order to avoid any form of deception or misrepresentation of information. This requires the model to know about its capabilities and levels of knowledge (e.g., "know unknowns"). According to the discussion in [366], honesty is a more objective criterion compared to helpfulness and harmlessness, hence honesty alignment could potentially be developed with less reliance on human efforts.
• Harmlessness. To be harmless, the language produced by the model should not be offensive or discriminatory. To the best of its abilities, the model should be capable of detecting covert endeavors aimed at soliciting requests for malicious purposes. Ideally, when the model is induced to conduct a dangerous action (e.g., committing a crime), the LLM should politely refuse. Nonetheless, what behaviors are deemed harmful and to what extent varies among individuals and societies [366], depending highly on who is using the LLM, the type of the posed question, and the context (e.g., time) in which the LLM is being used.
As we can see, these criteria are quite subjective, and are developed based on human cognition. Thus, it is difficult to directly formulate them as optimization objectives for LLMs. In existing work, there are many ways to fulfill these criteria when aligning LLMs. A promising technique is red teaming [367], which involves using manual or automated means to probe LLMs in an adversarial way to generate harmful outputs and then updating LLMs to prevent such outputs.

5.2.2 Collecting Human Feedback
During the pre-training stage, LLMs are trained using the language modeling objective on a large-scale corpus. However, this objective cannot take into account the subjective and qualitative evaluations of LLM outputs by humans (called human feedback in this survey). High-quality human feedback is extremely important for aligning LLMs with human preferences and values. In this part, we discuss how to select a team of human labelers for feedback data collection.

Human Labeler Selection. In existing work, the dominant method for generating human feedback data is human annotation [66, 116, 365]. This highlights the critical role of selecting appropriate human labelers. To provide high-quality feedback, human labelers are supposed to have a qualified level of education and excellent proficiency in English. For example, Sparrow [116] requires human labelers to be UK-based native English speakers who have obtained at least an undergraduate-level educational qualification. Even then, several studies [365] have found that there still exists a mismatch between the intentions of researchers and human labelers, which may lead to low-quality human feedback and cause LLMs to produce unexpected outputs. To address this issue, InstructGPT [66] further conducts a screening process to filter labelers by assessing the agreement between human labelers and researchers. Specifically, researchers first label a small amount of data and then measure the agreement between themselves and the human labelers. The labelers with the highest agreement are selected to proceed with the subsequent annotation work. In some other work [368], "super raters" are used to ensure the high quality of human feedback. Researchers evaluate the performance of human labelers and select a group of well-performing labelers (e.g., with high agreement) as super raters. The super raters are given priority to collaborate with the researchers in subsequent studies. When human labelers annotate the output of LLMs, it is helpful to specify detailed instructions and provide instant guidance for the labelers, which can further regulate their annotation.

Human Feedback Collection. In existing work, there are mainly three kinds of approaches to collecting feedback and preference data from human labelers.
• Ranking-based approach. In early work [365], human labelers often evaluate model-generated outputs in a coarse-grained manner (i.e., only selecting the best) without taking into account more fine-grained alignment criteria. Nonetheless, different labelers may hold diverse opinions on the selection of the best candidate output, and this method disregards the unselected samples, which may lead to inaccurate or incomplete human feedback. To address this issue, subsequent studies [116] introduce the Elo rating system to derive the preference ranking by comparing candidate outputs. The ranking of outputs serves as the training signal that guides the model to prefer certain outputs over others, thus inducing outputs that are more reliable and safer.
Fig. 12: The workflow of the RLHF algorithm, consisting of three stages: supervised fine-tuning (training with demonstration data), reward model training (training with feedback data), and RL fine-tuning (training with an RL algorithm such as PPO).

• Question-based approach. Further, human labelers can provide more detailed feedback by answering certain questions designed by researchers [81], covering the alignment criteria as well as additional constraints for LLMs. Specially, in WebGPT [81], to assist the model in filtering and utilizing relevant information from retrieved documents, human labelers are required to answer questions with multiple options about whether the retrieved documents are useful for answering the given input.
• Rule-based approach. Many studies also develop rule-based methods to provide more detailed human feedback. As a typical case, Sparrow [116] not only selects the response that labelers consider the best but also uses a series of rules to test whether model-generated responses meet the alignment criteria of being helpful, correct, and harmless. In this way, two kinds of human feedback data can be obtained: (1) the response preference feedback is obtained by comparing the quality of model-generated outputs in pairs, and (2) the rule violation feedback is obtained by collecting the assessment from human labelers (i.e., a score indicating to what extent the generated output has violated the rules). Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers (based on GPT-4 itself) as rule-based reward models, which can automatically determine whether the model-generated outputs violate a set of human-written rules.
In the following, we focus on a well-known technique, reinforcement learning from human feedback (RLHF), which has been widely used in recent powerful LLMs such as ChatGPT. As discussed below, the alignment criteria introduced in Section 5.2.1 can be fulfilled by learning from human feedback on the responses of LLMs to users' queries.

5.2.3 Reinforcement Learning from Human Feedback
To align LLMs with human values, reinforcement learning from human feedback (RLHF) [79, 365] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful to improve the alignment criteria (e.g., helpfulness, honesty, and harmlessness). RLHF employs reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization (PPO) [128]) to adapt LLMs to human feedback by learning a reward model. Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [66].

RLHF System. The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and a RL algorithm training the LM. Specifically, the pre-trained LM is typically a generative model that is initialized with existing pre-trained LM parameters. For example, OpenAI uses 175B GPT-3 for its first popular RLHF model, InstructGPT [66], and DeepMind uses the 280 billion parameter model Gopher [64] for its GopherCite model [368]. Further, the reward model (RM) provides (learned) guidance signals that reflect human preferences for the text generated by the LM, usually in the form of a scalar value. The reward model can take two forms: a fine-tuned LM or a LM trained de novo on human preference data. Existing work typically employs reward models with a parameter scale different from that of the aligned LM [66, 368]. For example, OpenAI uses 6B GPT-3 and DeepMind uses 7B Gopher as the reward models, respectively. Finally, to optimize the pre-trained LM using the signal from the reward model, a specific RL algorithm is designed for large-scale model tuning. Specifically, Proximal Policy Optimization (PPO) [128] is a widely used RL algorithm for alignment in existing work [66, 116, 368].

Key Steps for RLHF. Figure 12 illustrates the overall three-step process of RLHF [66], as introduced below.
• Supervised fine-tuning. To make the LM initially perform desired behaviors, it is usually necessary to collect a supervised dataset containing input prompts (instructions) and desired outputs for fine-tuning the LM. These prompts and outputs can be written by human labelers for some specific tasks while ensuring the diversity of tasks. For example, InstructGPT [66] asks human labelers to compose prompts (e.g., "List five ideas for how to regain enthusiasm for my career") and desired outputs for several generative tasks such as open QA, brainstorming, chatting, and rewriting. Note that the first step is optional in specific settings or scenarios.
• Reward model training. The second step is to train the RM using human feedback data. Specifically, we employ the LM to generate a certain number of output texts using sampled prompts (from either the supervised dataset or the human-generated prompts) as input. We then invite human labelers to annotate the preference for these pairs. The annotation process can be conducted in multiple forms, and a common approach is to annotate by ranking the generated candidate texts, which can reduce the inconsistency among annotators. Then, the RM is trained to predict the human-preferred output. In InstructGPT, labelers rank model-generated outputs from best to worst, and the RM (i.e., 6B GPT-3) is trained to predict the ranking. Note that, in recent work [369], the annotation of preference on response pairs has been conducted by an AI agent (usually an aligned LLM) instead of humans, which is called "reinforcement learning from AI feedback (RLAIF)". LLMs trained with typical RLHF algorithms tend to generate harmless responses with less helpfulness, which is called the evasion problem [369].
Fig. 13: An illustration of four different parameter-efficient fine-tuning methods: (a) adapter tuning, (b) prefix tuning, (c) prompt tuning, and (d) low-rank adaptation (LoRA). MHA and FFN denote the multi-head attention and feed-forward networks in the Transformer layer, respectively.
To guarantee both the harmlessness and helpfulness, RLAIF generates the AI feedback based on pre-set alignment principles in instructions [369, 370], which can also reduce the efforts of human annotation.
• RL fine-tuning. At this step, aligning (i.e., fine-tuning) the LM is formalized as an RL problem. In this setting, the pre-trained LM acts as the policy that takes a prompt as input and returns an output text; its action space is the vocabulary, the state is the currently generated token sequence, and the reward is provided by the RM. To avoid deviating significantly from the initial (before tuning) LM, a penalty term is commonly incorporated into the reward function. For example, InstructGPT optimizes the LM against the RM using the PPO algorithm. For each input prompt, InstructGPT calculates the KL divergence between the generated results from the current LM and the initial LM as the penalty. It is noted that the second and final steps can be iterated in multiple turns for better aligning LLMs. Due to the instability of the RL algorithm, recent work [371] replaces the RL tuning with another supervised fine-tuning by reusing the best ranked samples with higher rewards.

Practical Strategies for RLHF. Although RLHF is promising to effectively improve the alignment of LLMs with humans, it is practically challenging for researchers to successfully implement it. In this part, we focus on discussing several useful strategies and tricks for improving the effectiveness and efficiency of RLHF. Concretely, we discuss the effective training of reward models, and efficient and effective RL training, respectively.
• Effective reward model training. Although InstructGPT used a small reward model (6B GPT model), increasing work [99] has shown it is often more effective to use a large reward model (e.g., equal to or greater than the original model size), since large reward models generally perform better in judging the quality of the LLM generated outputs. In LLaMA 2 [99], pre-trained chat model checkpoints are used to initialize the reward model; the authors argue that such an approach can effectively reduce the information mismatch between the model to be aligned and the reward model by sharing the same pre-training knowledge. However, it is common to encounter the overfitting problem when training large-scale reward models. As a simple yet effective solution, existing work [372, 373] has introduced the LM loss on the preferred response of the input prompt from the human-annotated alignment dataset as a regularizer, which alleviates the overfitting of the reward model on the binary classification task. In addition, as there are multiple criteria for alignment (e.g., helpfulness and honesty), it is often difficult to train a single reward model that can satisfy all the alignment criteria. Therefore, it is useful to train multiple reward models that focus on different alignment criteria [99], and compute the final reward based on the produced ones from them via special combination strategies (e.g., mean pooling and weighted sum). Such a way enables more flexible rules or standards on multiple criteria, e.g., relaxing the requirement on helpfulness while posing stricter limits on harmfulness.
• Effective RL training. As the RL training process tends to be unstable and hyper-parameter sensitive, it is suggested that the language model should be well supervised fine-tuned before RL training, so as to reach a good model capacity. A commonly-used way is to fine-tune the LLM on its best outputs of the prompts (referred to as rejection sampling or best-of-N) from the alignment dataset until convergence before RL. Given a prompt, the LLM would first produce N outputs via the sampling algorithm, and then the best candidate from the model will be selected by the reward model for learning. After fine-tuning the LLM on the best samples until convergence, the RL process will be performed to further improve the performance. LLaMA 2 [99] has successively trained five versions of RLHF models, where the LLM has been progressively improved with the improvement of the reward models. In this way, the collected prompts and annotations of human preference data can better reflect the issues of the current model checkpoint, thus making special tuning to address these issues. In addition, LLaMA 2 also adds samples from prior iterations into the subsequent ones, to alleviate the possible capacity regression issue during iterative optimization.
• Efficient RL training. As the RL training requires iterating the inference process of both the LLM and the reward models, it would greatly increase the total memory and computation cost, especially for larger reward models and LLMs. As a practical trick, we can deploy the reward model on a separate server, and invoke the corresponding API to work with the LLM on its own server. In addition, as RLHF requires the LLM to generate multiple candidate outputs, instead of calling the sample decoding procedure multiple times, it is more efficient to utilize the beam search decoding algorithm26. It only needs to perform one-pass decoding for response generation, and meanwhile such a strategy can also enhance the diversity of the generated candidate responses.

26. https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/text_generation#transformers.GenerationMixin.group_beam_search
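To make the rejection-sampling (best-of-N) step described above more concrete, the following is a minimal sketch in Python. The generate and reward_model callables are hypothetical placeholders rather than APIs from the cited works; any sampling backend and any trained reward model could fill these roles.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample N candidate responses and keep the one the reward model scores highest.

    The selected (prompt, response) pair can then be reused for a further round of
    supervised fine-tuning before (or instead of) RL training, as described above.
    """
    candidates = generate(prompt, n)                         # N sampled responses
    scores = [reward_model(prompt, c) for c in candidates]   # scalar reward per response
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]
```

In practice, the selected samples would be aggregated over many prompts to form the fine-tuning set used before the RL stage.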
Process-Supervised RLHF. In the existing literature of RLHF [374], the supervision approach for RL training generally takes two major forms, either using outcome-supervision signals or process-supervision signals. The outcome-supervised RLHF employs a quantitative score to assess the quality of the whole text generated by LLMs. In contrast, process-supervised RLHF offers an evaluation of each individual component (e.g., sentence, word, or reasoning step) within the generated content, which leverages fine-grained supervision signals to guide the training, helping LLMs refine the undesired generation contents [374, 375]. In what follows, we discuss two key aspects of process-supervised RLHF.
• Obtaining Fine-grained Supervision Signals. Compared with outcome rewards, it is more difficult to obtain fine-grained supervision signals. OpenAI has released a fine-grained annotation dataset named PRM800k [375] consisting of 12K process-annotated mathematical problems (i.e., the MATH dataset [376]) and 75K solutions generated by LLMs for these problems, where each reasoning step of the mathematical problems is labeled as positive, negative or neutral in PRM800k. Considering the cost and efficiency of the human annotation process, several methods aim to automatically annotate the correctness of intermediate reasoning steps, e.g., using powerful LLMs to directly replace human annotators [377] or Monte Carlo tree search [378]. After obtaining fine-grained supervision signals, existing work typically leverages them to train process-supervised reward models (PRM) [375, 379], which can produce step-level rewards (e.g., sentence based or token based rewards) during the RLHF procedure. Furthermore, rather than leveraging a discriminative model to produce the rewards, RLMEC [380] utilizes a generative reward model trained on rewriting tasks with the minimum editing constraint, to provide token-level rewards. In addition, for the downstream tasks where fine-grained supervision signals are difficult to collect, outcome-supervision signals can also be utilized to perform process-supervised RLHF [381].
• Utilizing the PRMs. To effectively leverage process-supervision signals from PRMs, existing work mainly utilizes these fine-grained signals to evaluate individual parts within the LLM responses and then guides LLMs to adjust their generation behaviors to maximize the received reward of the response. Concretely, expert iteration [382, 383], an effective RL algorithm, has been utilized to improve the base policy via learning from an expert policy [374]. Typically, expert iteration contains two main stages: policy improvement and distillation [374]. In the policy improvement stage, the expert policy performs a systematic search procedure to produce samples under the guidance of PRMs. Subsequently, during the distillation stage, the samples generated by the expert policy in the first stage are utilized to improve the base policy through supervised fine-tuning. In addition to expert iteration, PRMs can also be utilized to re-rank the candidates of the final answers generated by LLMs [375] or to select better intermediate reasoning steps during step by step reasoning [379, 384].
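As an illustration of PRM-based re-ranking, the sketch below scores each candidate solution by aggregating its step-level PRM scores and keeps the highest-scoring one. The prm_score function and the aggregation choice (minimum over steps) are illustrative assumptions, not the exact procedures of the cited works.

```python
from typing import Callable, List

def rerank_with_prm(question: str,
                    candidates: List[List[str]],
                    prm_score: Callable[[str, List[str], int], float]) -> List[str]:
    """Re-rank candidate solutions (each a list of reasoning steps) with a PRM.

    Every step receives a score from the process-supervised reward model; here a
    solution is rated by its weakest step, and the best-rated solution is returned.
    """
    def solution_score(steps: List[str]) -> float:
        return min(prm_score(question, steps, i) for i in range(len(steps)))

    return max(candidates, key=solution_score)
```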
5.2.4 Alignment without RLHF
Although RLHF has achieved great success in aligning the behaviors of LLMs with human values and preferences, it also suffers from notable limitations. First, RLHF needs to train multiple LMs including the model being aligned, the reward model, and the reference model at the same time, which is tedious in algorithmic procedure and memory-consuming in practice. Besides, the commonly-used PPO algorithm in RLHF is rather complex and often sensitive to hyper-parameters. As an alternative, increasing studies explore directly optimizing LLMs to adhere to human preferences, using supervised fine-tuning without reinforcement learning [338].

Overview. The basic idea of non-RL alignment approaches is to directly fine-tune LLMs with supervised learning on a high-quality alignment dataset. It basically assumes that response feedback or golden rules to avert unsafe behaviors have been injected or included in the specially curated alignment dataset, so that LLMs can directly learn aligned behaviors from these demonstration data via suitable fine-tuning strategies. Thus, to implement this approach, two key issues are the construction of the alignment dataset and the design of the fine-tuning loss. For the first issue, the alignment dataset can be automatically constructed by an aligned LLM according to human-written safety principles [336] or by refining existing examples using editing operations [385]. In addition, we can also reuse existing reward models to select high-rated responses from existing human feedback data [371]. For the second issue, non-RL alignment approaches mainly fine-tune LLMs in a supervised learning way (the same as the original instruction tuning loss) on a high-quality alignment dataset, meanwhile auxiliary learning objectives can be used to enhance the alignment performance, e.g., ranking responses or contrasting instruction-response pairs.

Alignment Data Collection. The construction of alignment data is important to effectively align the behaviors of LLMs with human preferences. To collect high-quality alignment data, some work tries to reuse existing reward models to select high-rated responses, and other work explores leveraging powerful LLMs (e.g., ChatGPT) or building a simulated environment to generate synthetic alignment examples. Next, we will discuss these three lines of research.
• Reward model based approaches. The reward model in RLHF has been trained to measure the alignment degree of the responses of LLMs. It is straightforward to leverage existing reward models to select high-quality responses as alignment data for subsequent fine-tuning. Based on this idea, RAFT [371] adopts reward models trained on human preference data to rank the responses of LLMs and collect those with higher rewards for supervised fine-tuning. In addition, the reward model can also be used to score model responses and assign them to different quality groups. Quark [386] sorts the responses of LLMs into different quantiles based on the reward scores. Each quantile is attached with a special reward token to represent the reward level of the quantile. Conditioned on the highest-reward tokens, LLMs are subsequently prompted to generate high-quality responses. Given an initial answer and the corresponding human feedback, ILF [387] first adopts LLMs to generate refined answers, then utilizes the reward model to select the answer that best matches the feedback for further training.
As valuable resources for aligning LLMs, several reward models have been released, including DeBERTa-base/large/xxlarge from OpenAssistant27, Moss-7B from Fudan28, and Flan-T5-xl from Stanford29.
• LLM based generative approaches. Reward models help to select aligned data from model responses. However, training reward models itself necessitates substantial high-quality human-labeled data, which is typically expensive and in short supply. In addition, although existing reward models can be reused, they might not be able to accurately capture the nonalignment behaviors in another separately trained LLM. Therefore, some work explores leveraging powerful LLMs to automatically generate human-aligned data. As a representative work, constitutional AI [369] proposes that human supervision comes from a set of principles (i.e., natural language instructions) governing AI behaviors. Based on these principles, LLMs will critique their own harmful responses and revise them repeatedly into finally aligned responses. Similarly, Self-Align [336] first adopts self-instruct [147] to generate instructions focusing on covering diverse topics. Then, the model is also prompted with multiple human-written principles that describe the rules of expected model behaviors (also with several in-context exemplars), to generate helpful, ethical, and reliable responses as alignment data. To mitigate the limitation that the original SFT method can only learn from positive responses, FIGA [388] develops an improved supervised alignment approach, where both negative (the original output of low quality) and positive (the refined output by LLMs) responses are leveraged in a contrastive way, to enable LLMs to deeply understand what fine-grained revisions actually lead to a good response.
• LLM based interactive approaches. Most existing approaches train LLMs in isolation, where LLMs are not present in actual environments to improve themselves through external feedback signals. As a comparison, humans learn social norms and values from interactions with others in social environments [389]. To mimic such a learning approach, Stable Alignment [193] builds a simulated interaction environment consisting of a number of LLM agents, where AI agents keep interacting with each other, receiving feedback for improvement. Once a central agent receives an instruction, it produces a response and shares it with nearby agents. These critic agents generate feedback comprising ratings about the response and revision suggestions. Then the central agent would revise the original response following these suggestions. Such an alignment approach can also be extended to real-world environments with humans.

Supervised Alignment Tuning. After obtaining alignment data, it is also key to design suitable fine-tuning strategies for direct alignment. A straightforward approach is to optimize LLMs using the conventional sequence-to-sequence objective based on the alignment data. In addition to the conventional optimization objective, several studies further explore auxiliary losses that enhance the learning from the alignment data.
• Primary training objective. Since the alignment data typically consists of an input instruction and an output response, the primary training loss is still the traditional cross-entropy loss for sequence-to-sequence learning. Based on this loss, many studies propose a number of improvement variants for enhancing the supervised alignment tuning. For example, CoH [390] constructs the training data by prepending "A helpful answer:" and "An unhelpful answer:" to the annotated good and bad responses, respectively, and only computes losses for those response tokens with special masking. Quark [386] sorts model responses into different quantiles with varying alignment quality, and prepends a special reward token to each model response to represent the reward level of the response. These studies basically adopt the maximum likelihood objective, and employ instruction prefixes to guide the learning of human preference.
• Direct preference optimization. To better mimic the learning approach of RLHF in a supervised learning way, DPO [391] proposes to reparameterize the response rewards using the policy model (i.e., the language model being optimized), and then the original reward modeling objective can be reformulated only based on the policy model. In this way, DPO removes the explicit reward modeling step, and optimizing the new learning objective that only involves the policy model is equivalent to optimizing the rewards. Based on DPO, existing work has proposed several improvement strategies for enhancing the effectiveness or efficiency, e.g., decomposing the optimization of positive responses and negative responses into two independent components [392] or removing the probability of the reference model in the objective function [393]. Furthermore, FIGA [388] designs a token-level contrastive loss that aims to encourage desirable tokens, penalize undesirable ones, and disregard trivial tokens. Despite the effectiveness, recent work has also revealed that DPO may have inherent limitations in several aspects. First, based on the analysis of the magnitude and gradient directions, recent work reveals that DPO might have difficulty in well balancing the learning of positive instances and negative instances [394]. In addition, as the reference model provides the reward scores for itself in the DPO algorithm, a weak reference model would also influence the alignment performance [395], which can be enhanced by improved learning strategies [396] or a well-trained policy model [395].
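For reference, the DPO training objective from [391] can be written as follows, where y_w and y_l denote the preferred and dispreferred responses for a prompt x, pi_ref is the frozen reference model, sigma is the sigmoid function, and beta controls the strength of the implicit KL constraint (the notation here is ours):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Optimizing this single supervised objective takes the place of both the explicit reward model and the RL step in standard RLHF, which is consistent with the reformulation described above.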
• Auxiliary optimization objectives. Besides the primary cross-entropy loss, several studies propose auxiliary training losses to enhance the learning from the alignment data. First, since the responses of each instruction can be scored by the reward model, the ranking loss can be used to train the model to preserve the ranking order of these responses. For example, RRHF [397] samples responses from multiple sources, including model-generated responses, such as those derived from the model itself, ChatGPT, and GPT-4, as well as human-written responses, spanning both high-quality and low-quality instances. To align with the scores from reward models, it further optimizes the ranking loss by encouraging the model to have a higher conditional log probability for the response with a higher ranking.

27. https://huggingface.co/OpenAssistant
28. https://github.com/OpenLMLab/MOSS-RLHF
29. https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl
Moreover, SLiC-HF [398] proposes to assess the similarity between model outputs and human preference via the distance in the latent space, and introduces specific calibration and regularization losses to calibrate the candidate sequences based on human-preference data. Similarly, the difference between positive and negative responses from the reward model can be employed to construct the regularization loss [399], to enhance the discrimination between positive and negative responses by LLMs. Second, to enhance the relatedness between the response and the instruction, some work adopts contrastive learning to push up the probability of correct instruction-response pairs while pushing down incorrect instruction-response pairs. Specifically, for an output response, the proposed approach in [400] contrasts the target instruction with other irrelevant instructions. By doing so, it can enable the model to learn the right correlation between instructions and responses.

5.2.5 Remarks on SFT and RLHF
As discussed in Section 5.1, instruction tuning is the process of training pre-trained language models with formatted demonstration data (instructions paired with desired outputs). At early exploration, instruction data was mainly collected from NLP tasks [67], while it has now been extended to more diverse supervision data that pairs input and output texts (e.g., the utterances of open-ended dialogues). Training with such paired texts is also called supervised fine-tuning (SFT) in the context of LLMs [66]. In this part, we mainly use the abbreviation SFT for discussion instead of instruction tuning, due to its simplicity and popularity.
Since SFT and RLHF are two major adaptation tuning methods for LLMs, it is important to understand the connections and differences between them. Next, we make some discussions on this issue30.

Overall Comparison with RL Formulation. Following the discussion in Section 5.2.3 (the part related to RL training), the text generation problem can be formulated as a decision-making process based on RL. Taking a prompt as input, the task of an LLM is to generate a text completion that appropriately responds to the prompt. This task is completed step by step. At each step, an agent (i.e., the LLM) performs an action (i.e., generating a token) according to the policy (i.e., the generative probability distribution of the LLM) conditioned on the current state (the currently generated token sequence and other available context information). It is expected that a high-quality output text would be produced by the LLM, which can earn a large reward score based on the entire response. Overall, RLHF and SFT can be considered as two different training approaches to optimizing the above decision-making process for LLMs. Specially, RLHF first learns the reward model, and then employs it to improve the LLM with RL training (e.g., PPO). As a comparison, SFT adopts a teacher-forcing approach, which directly optimizes the likelihood of a demonstration output. Such a token-level training way essentially performs behavior cloning (a special algorithm of imitation learning [401]): it utilizes the expert's action (i.e., the target token at each step) as the supervision label and directly learns to imitate the demonstrations from experts without specifying a reward model as in typical RL algorithms. To learn the desired policies, SFT adopts a "local" optimization way (i.e., token-level loss) based on demonstration data, while RLHF takes a "global" optimization way (i.e., text-level loss) by involving human preference. More theoretical analysis about imitation learning and reinforcement learning can be found in the related RL literature [401, 402].
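To make the contrast between the token-level and text-level training signals explicit, the two objectives discussed above can be sketched as follows (our notation, not taken from the cited works; y* is a demonstration response, r(x, y) the reward model score, pi_init the model before RL tuning, and beta the KL penalty coefficient):

```latex
% Token-level (SFT / behavior cloning): maximize the likelihood of the demonstration
\mathcal{L}_{\mathrm{SFT}}(\theta) =
  -\sum_{t=1}^{|y^{*}|} \log \pi_\theta\!\left(y^{*}_{t} \mid x,\, y^{*}_{<t}\right)

% Text-level (RLHF): maximize the KL-penalized reward of self-generated responses
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}
  \Big[ \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{init}}(\cdot \mid x)\big) \Big]
```

The first objective imitates a fixed demonstration token by token, while the second optimizes whole self-generated responses against a learned reward with a penalty that keeps the policy close to the initial model.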
Pros and Cons of SFT. SFT has been shown to be an effective approach to boosting the performance of LLMs on various benchmarks [67, 69, 152, 187], which can largely enhance the task generalization ability and flexibly endow specific functions (e.g., establishing the chatbot's identity). More discussions about the usefulness of SFT can be found in Section 5.1.3. It has been widely recognized that SFT mainly unlocks the abilities of LLMs rather than injecting new abilities into them. Thus, it might become problematic when one tries to stimulate the non-endogenous abilities of LLMs via SFT. As a concrete scenario, it would potentially encourage hallucination behaviors when the demonstration data is beyond the knowledge or ability scope of LLMs, e.g., training an LLM to answer questions about facts unknown to it. An interesting viewpoint from John Schulman's talk on RLHF [403] is that distilling superior models to train less capable models (e.g., prompting GPT-4 to generate the response as fine-tuning data) might increase the possibility of generating hallucinated texts, thus likely affecting the factual accuracy of LLMs. Furthermore, as a behavior cloning method, SFT aims to imitate the behaviors (without exploration) of the experts who construct the demonstration data. However, there often exist variations among different annotators in the writing styles, quality, and preferences of demonstration data, which tends to affect the learning performance of SFT. Thus, high-quality instruction data (rather than its quantity) is the primary factor for effective training of LLMs during the SFT stage [99].

Pros and Cons of RLHF. RLHF was early explored in the literature of deep RL [79], then borrowed to improve the capacity of language models (e.g., summarization [129]), and subsequently adopted as the fundamental technique to develop InstructGPT [66]. Recently, increasing evidence [99, 369] has demonstrated the effectiveness of RLHF in mitigating harmful responses and enhancing the model capacity. Specially, LLaMA 2 has demonstrated that RLHF can improve both the helpfulness and harmlessness scores [99], and attributed this to a better human-LLM synergy for data annotation. They explain this reason in two major aspects as follows. First, since human annotators mainly provide preference annotations for RLHF, it can largely alleviate the discrepancies among annotators that arise in SFT. Secondly, preference annotation is much easier than writing the demonstration data, and annotators can even judge the quality of generations superior to those they could create, making it possible to explore a broader state space beyond what can be demonstrated by human annotators. Another key point is that RLHF essentially encourages LLMs to learn correct policies by contrasting the self-generated responses (discriminating between good and bad responses).

30. This part would be somewhat subjective, mainly based on the authors' opinions and experiences. Comments or corrections are welcome to enhance this part.
It no longer forces the model to imitate external demonstration data, and thus can mitigate the hallucination issues with SFT as discussed above31. Actually, RLHF has been demonstrated to be an important approach to reduce the hallucination behaviors in GPT-4 [46]. However, RLHF inherits the drawbacks of classic RL algorithms, e.g., sample inefficiency and training instability. When adapted to LLMs, RLHF further relies on a strong SFT model as the initial model checkpoint for efficiently achieving good performance. In addition, human annotators are involved in a complex iterative optimization process, in which a number of important details (e.g., the prompt selection, the schedule of reward model training and PPO training, and the settings of hyper-parameters) have an important impact on the whole model performance.
Overall, SFT is particularly useful to increase the model capacity of pre-trained model checkpoints right after pre-training, while RLHF is promising to further improve the model capacity of SFT models. However, RLHF has been difficult to implement and is far from well explored (according to public literature), and more improvements (e.g., efficient and reliable annotation [369] and simplified optimization [391]) are still needed for further research.

5.3 Parameter-Efficient Model Adaptation
In the above, we have discussed the approaches of instruction tuning and alignment tuning to adapt LLMs according to specific goals. Since LLMs consist of a huge amount of model parameters, it would be costly to perform full-parameter tuning. In this section, we will discuss how to conduct efficient tuning on LLMs. We first review several representative parameter-efficient fine-tuning methods for Transformer language models, and then summarize existing work on parameter-efficient fine-tuned LLMs.

5.3.1 Parameter-Efficient Fine-Tuning Methods
In the existing literature, parameter-efficient fine-tuning [149, 404, 405] has been an important topic that aims to reduce the number of trainable parameters while retaining good performance as much as possible. In what follows, we briefly review four parameter-efficient fine-tuning methods for Transformer language models, including adapter tuning, prefix tuning, prompt tuning and LoRA. An illustration of these four methods is shown in Figure 13.

Adapter Tuning. Adapter tuning incorporates small neural network modules (called adapters) into the Transformer models [406]. To implement the adapter module, a bottleneck architecture has been proposed in [406, 407], which first compresses the original feature vector into a smaller dimension (followed by a nonlinear transformation) and then recovers it to the original dimension. The adapter modules would be integrated into each Transformer layer, typically using a serial insertion after each of the two core parts (i.e., the attention layer and the feed-forward layer) of a Transformer layer. Alternatively, parallel adapters [408] can also be used in Transformer layers, which place two adapter modules in parallel with the attention layer and the feed-forward layer, respectively. During fine-tuning, the adapter modules would be optimized according to the specific task goals, while the parameters of the original language model are frozen in this process. In this way, we can effectively reduce the number of trainable parameters during fine-tuning.
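A minimal PyTorch-style sketch of the serial bottleneck adapter described above is given below; the hidden size, bottleneck dimension, and activation are illustrative choices rather than the exact configurations of the cited works.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added back via a residual connection."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, bottleneck_size)   # compress the feature vector
        self.activation = nn.GELU()                                # nonlinear transformation
        self.up_proj = nn.Linear(bottleneck_size, hidden_size)     # recover the original dimension

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only these small projections are trained; the backbone Transformer stays frozen.
        return hidden_states + self.up_proj(self.activation(self.down_proj(hidden_states)))
```

In serial adapter tuning such a module is inserted after the attention and feed-forward sub-layers, whereas parallel adapters compute the same transformation alongside those sub-layers and add it to their outputs.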
Prefix Tuning. Prefix tuning [404] prepends a sequence of prefixes, which are a set of trainable continuous vectors, to each Transformer layer in language models. These prefix vectors are task-specific, which can be considered as virtual token embeddings. To optimize the prefix vectors, a reparameterization trick [404] has been proposed by learning an MLP function that maps a smaller matrix to the parameter matrix of prefixes, instead of directly optimizing the prefixes. It has been shown that this trick is useful for stable training. After optimization, the mapping function would be discarded, and only the derived prefix vectors are kept to enhance task-specific performance. Since only the prefix parameters would be trained, it can lead to a parameter-efficient model optimization. Similar to prefix tuning, p-tuning v2 [409] incorporates layer-wise prompt vectors into the Transformer architecture specially for natural language understanding, which also utilizes multi-task learning for jointly optimizing shared prompts. It has been shown to be useful in improving the model performance of different parameter scales on natural language understanding tasks.

Prompt Tuning. Different from prefix tuning, prompt tuning [405, 410] mainly focuses on incorporating trainable prompt vectors at the input layer32. Based on the discrete prompting methods [412, 413], it augments the input text by including a group of soft prompt tokens (either in a free form [410] or a prefix form [405]), and then takes the prompt-augmented input to solve specific downstream tasks. In implementation, task-specific prompt embeddings are combined with the input text embeddings, which are subsequently fed into language models. P-tuning [410] has proposed a free form to combine the context, prompt and target tokens, which can be applied to the architectures of both natural language understanding and generation. They further learn the representations of soft prompt tokens by a bidirectional LSTM. Another representative approach [405] named prompt tuning directly prepends prefix prompts to the input. During training, only the prompt embeddings would be learned according to task-specific supervisions. Since this method only includes a small number of trainable parameters at the input layer, it has been found that the performance highly relies on the model capacity of the underlying language models [405].

Low-Rank Adaptation (LoRA). LoRA [149] imposes a low-rank constraint for approximating the update matrix at each dense layer, so as to reduce the trainable parameters for adapting to downstream tasks. Consider the case of optimizing a parameter matrix W.

31. In RLHF, it seems to be also important that reward models should be aware of the knowledge or ability of the LLM to be aligned. For example, LLaMA 2 adopts pre-trained chat model checkpoints to initialize reward models [99].
32. Here, prompt tuning denotes a category of related efficient tuning methods exemplified by the work [405, 410, 411], instead of a specific method as used in [405]. Indeed, the prefix based tuning methods [404, 409] can also be considered as prompting methods, which are called deep prompting tuning in [409]. In this survey, prompt tuning specially refers to the methods that only include the prompt tokens at the input layer, in the context of LLMs. We assign p-tuning v2 [409] to the category of prefix tuning, because it incorporates layer-wise prompts in language models.
The update process can be written in a general form as: W ← W + ∆W. The basic idea of LoRA is to freeze the original matrix W ∈ R^{m×n} while approximating the parameter update ∆W by low-rank decomposition matrices, i.e., ∆W = A · B^⊤, where A ∈ R^{m×k} and B ∈ R^{n×k} are the trainable parameters for task adaptation and k ≪ min(m, n) is the reduced rank. The major merit of LoRA is that it can largely save the memory and storage usage (e.g., VRAM). Further, one can keep only a single large model copy, while maintaining a number of task-specific low-rank decomposition matrices for adapting to different downstream tasks. Furthermore, several studies have also discussed how to set the rank in a more principled approach, e.g., importance score based allocation [414] and search-free optimal rank selection [415].
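To connect the formula above to an implementation, the following is a minimal sketch of a LoRA-augmented linear layer in PyTorch. The initialization scheme and the scaling factor are illustrative simplifications; libraries such as PEFT (discussed below) provide production-ready versions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer W plus a trainable low-rank update Delta W = A @ B^T."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the original matrix W
        out_dim, in_dim = base_linear.weight.shape    # W has shape (m, n)
        self.A = nn.Parameter(torch.zeros(out_dim, rank))            # A: m x k (zero-init)
        self.B = nn.Parameter(torch.randn(in_dim, rank) * 0.01)      # B: n x k
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B.T                   # low-rank update, m x n
        return self.base(x) + self.scaling * (x @ delta_w.T)
```

Because A is initialized to zeros, the layer starts out identical to the frozen base layer, and only the small A and B matrices are updated during fine-tuning.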
Besides the above methods, there is extensive research on efficient tuning of Transformer language models. However, a more comprehensive discussion of efficient tuning is beyond the scope of this article, and it can be found in the related papers on this topic [408, 416].

5.3.2 Parameter-Efficient Fine-Tuning on LLMs
With the rise of LLMs, efficient tuning has attracted increasing research attention for developing a more lightweight adaptation approach in downstream tasks. In particular, LoRA [149] has been widely applied to open-source LLMs (e.g., LLaMA and BLOOM) for parameter-efficient fine-tuning. Among these research attempts, LLaMA and its variants have gained much attention for parameter-efficient tuning. For example, Alpaca-LoRA [148] has been trained using LoRA as a lightweight tuned version of Alpaca [146] (a fine-tuned 7B LLaMA model with 52K human demonstrations of instruction following). There are extensive explorations of Alpaca-LoRA spanning different languages and model sizes, which can be found in the collection page33. A recent study, LLaMA-Adapter [417], inserts learnable prompt vectors into each Transformer layer, in which zero-initialized attention has been proposed to improve the training by mitigating the influence of under-fitted prompt vectors. The authors also extend this approach to a multi-modal setting, e.g., visual question answering.
Further, an empirical study [407] has been conducted to examine the effect of different tuning methods on language models. It compares four efficient tuning methods including serial adapter tuning [406], parallel adapter tuning [408, 418], and LoRA [149], on three open-source LLMs, namely GPT-J (6B), BLOOM (7.1B) and LLaMA (7B), for evaluation. Based on the experimental results on six math reasoning datasets, these efficient-tuning methods under-perform the reference baseline GPT-3.5 on difficult tasks, while achieving comparable performance on simple tasks. Overall, LoRA performs relatively well among these comparison methods, using significantly fewer trainable parameters.
As an important resource, the library PEFT [419] (standing for parameter-efficient fine-tuning) has been released on GitHub34. It includes several widely used efficient tuning methods, including LoRA [149]/AdaLoRA [414], prefix-tuning [404, 409], P-Tuning [410], and prompt-tuning [405]. Further, it supports a number of language models such as GPT-2 and LLaMA, and also covers several representative vision Transformer models (e.g., ViT and Swin Transformer).
As discussed in Section 5.3.1, there have been a large number of efficient tuning methods proposed in the existing literature. However, most of these approaches are tested on small-sized pre-trained language models, instead of LLMs. So far, there still lacks a thorough investigation into the effect of different efficient tuning methods on large-sized language models under different settings or tasks.

33. https://github.com/tloen/alpaca-lora
34. https://github.com/huggingface/peft

6 UTILIZATION
After pre-training or adaptation tuning, a major approach to using LLMs is to design suitable prompting strategies for solving various tasks. In the existing literature, task-specific prompts can be effectively learned through manual creation and automatic optimization. A representative prompting method is in-context learning [50, 55], which formulates the task description and/or demonstrations in the form of natural language text. In addition, chain-of-thought prompting [33] can be employed to enhance in-context learning by involving a series of intermediate reasoning steps in prompts. Furthermore, planning [432] is proposed for solving complex tasks, which first breaks them down into smaller sub-tasks and then generates a plan of action to solve these sub-tasks one by one. We summarize representative work for these prompting approaches in Table 11. Next, we will elaborate on the details of the four techniques.

6.1 Prompting
As discussed in previous work [36], prompting is the major approach to utilizing LLMs for solving various tasks. Since the quality of prompts will largely influence the performance of LLMs in specific tasks, there have been a series of studies proposed to generate suitable task prompts through manual creation or automatic optimization, which will be introduced in this section.

6.1.1 Prompt Creation
The process of manually creating a suitable prompt is also called prompt engineering [445, 446]. A well-designed prompt is very helpful to elicit the abilities of LLMs for accomplishing specific tasks. In this part, we will first introduce the key components of prompts and discuss several principles for prompt design. Then, we evaluate ChatGPT with different prompts to show the results on several representative tasks. We are aware that there have been several existing papers [446, 447] and websites [448–450] that present suggestions and guidelines to design good prompts. As a comparison, we mainly aim to discuss the key factors (ingredients and principles) that are useful for prompt creation, and provide experimental results and analysis on popular tasks as a reference for beginners.

Key Ingredients. Typically, there are four key ingredients that depict the functionality of a prompt for eliciting the abilities of LLMs to complete the tasks, including task description, input data, contextual information, and prompt style.
TABLE 11: Typical LLM utilization methods and their key points for ICL, CoT, and planning. Note that the key points only
highlight the most important technical contribution.

Approach: In-context Learning (ICL)
- KATE [420]: Demonstration selection (similar; k-NN)
- EPR [421]: Demonstration selection (dense retrieval; contrastive learning)
- SG-ICL [422]: Demonstration selection (LLM as the demonstration generator)
- APE [423]: Demonstration format (automatic generation & selection)
- Structured Prompting [424]: Demonstration format (grouped context encoding; rescaled attention)
- GlobalE & LocalE [425]: Demonstration order (entropy-based metric; probing set generation with LLM)

Approach: Chain-of-thought Prompting (CoT)
- Complex CoT [426]: Demonstration (complexity-based selection)
- Auto-CoT [427]: Demonstration (automatic generation)
- Selection-Inference [428]: Generation (alternate between selection and inference)
- Self-consistency [429]: Generation (diverse paths; self-ensemble)
- DIVERSE [430]: Generation (diverse paths); Verification (step-wise voting)
- Rationale-augmented ensembles [431]: Generation (rationale sampling)

Approach: Planning
- Least-to-most prompting [432]: Plan generation (text-based; problem decomposition)
- DECOMP [433]: Plan generation (text-based; problem decomposition)
- PS [434]: Plan generation (text-based)
- Faithful CoT [435]: Plan generation (code-based)
- PAL [436]: Plan generation (code-based; Python)
- HuggingGPT [437]: Plan generation (code-based; models from HuggingFace)
- AdaPlanner [438]: Plan refinement (skill memory)
- TIP [439]: Feedback acquisition (visual perception)
- RAP [440]: Feedback acquisition (LLM as the world model); Plan refinement (Monte Carlo Tree Search)
- ChatCoT [441]: Feedback acquisition (tool); Plan refinement (conversation between LLM and tools)
- ReAct [442]: Feedback acquisition (tool); Plan refinement (synergizing reasoning and acting)
- Reflexion [443]: Feedback acquisition (text-based self-reflection); Plan refinement (dynamic memory)
- Tree of Thoughts [444]: Feedback acquisition (vote comparison); Plan refinement (tree-based search)
To have an intuitive understanding of our discussion, we also present three prompt examples for question answering, meta-review generation, and text-to-SQL in Table 13.
• Task description. A task description is typically a specific instruction that LLMs are expected to follow. In general, one should clearly describe the task goal in natural language. For tasks with special input or output formats, detailed clarifications are often needed, and one can further utilize keywords to highlight the special settings for better guiding LLMs in task completion.
• Input data. In common cases, it is straightforward to describe input data (e.g., an instance to be responded to by LLMs) in natural language. For special input data, such as knowledge graphs and tables, it is necessary to apply an appropriate and convenient way to make them readable for LLMs. For structured data, linearization is commonly used to transform the original records (e.g., knowledge triples) into sequences [451] due to its simplicity. Further, programming language (e.g., executable code) has also been utilized to formulate the structured data, which can also support using external tools (e.g., a program executor) to produce precise results [452, 453].
• Contextual information. In addition to the task description and input data, contextual or background information is also essential for specific tasks. For example, retrieved documents are highly useful for open-domain question answering as supporting evidence. Both the quality of the retrieved documents and their relevance to the question have an impact on the generated answers [454]. Thus, such information needs to be included in the prompt with a proper pattern or expression format. Furthermore, in-context task exemplars are also helpful for eliciting LLMs to accomplish a complex task, as they can better depict the task goal, the special output formats, and the mapping relation between input and output.
• Prompt style. For different LLMs, it is important to design a suitable prompt style for eliciting their abilities to solve specific tasks. Overall, one should express the prompt as a clear question or detailed instruction that can be well understood and answered. In some cases, it is also useful to add a prefix or suffix to better guide LLMs. For example, using the prefix "Let us think step by step" can help elicit LLMs to perform step-by-step reasoning, and using the prefix "You are an expert on this task (or in this domain)" can boost the performance of LLMs in some specific tasks. Further, for chat-based LLMs (e.g., ChatGPT), instead of directly feeding a long or complex task prompt, it is suggested to decompose it into multiple prompts for the sub-tasks and then feed them into LLMs via a multi-turn conversation [441].

Design Principles. Based on the key ingredients of prompts, we summarize several critical design principles that can help create more effective prompts for solving various tasks.
• Expressing the task goal clearly. Task descriptions should not be ambiguous or unclear, which would likely lead to inaccurate or inappropriate responses. This highlights the need for clear and unambiguous directives when utilizing these models [66]. A clear and detailed description should contain various elements to explain a task, including the task objective, input/output data (e.g., "Given a long document, I want you to generate a concise summary."), and the response constraints (e.g., "the length of the summary cannot exceed 50."). By providing a well-clarified task description, LLMs can more effectively understand the target task and generate the desired output.
• Decomposing into easy, detailed sub-tasks. To solve complex tasks, it is important to decompose the difficult task into several easier, more detailed sub-tasks to help LLMs accomplish the goal step by step, which is closely related to the planning technique in Section 6.4.
For example, following the suggestion in [447], we can explicitly list the sub-tasks in the form of multiple numbered items (e.g., "Braid a coherent narrative by performing the following tasks: 1. ...; 2. ...; 3. ..."). By decomposing a target task into sub-tasks, LLMs can focus on solving easier sub-tasks and finally achieve more accurate results for complex tasks.
• Providing few-shot demonstrations. As discussed in Section 6.2, LLMs can benefit from in-context learning for solving complex tasks, where the prompts contain a small number of task examples of the desired input-output pairs, i.e., few-shot demonstrations. Few-shot demonstrations can help LLMs learn the semantic mapping between input and output without parameter tuning. In practice, it is suggested that one should generate a few high-quality demonstrations for the target task, which would highly benefit the final task performance.
• Utilizing model-friendly format. Since LLMs are pre-trained on specially constructed datasets, there are some prompt formats that can make LLMs better understand the instruction. For example, as the OpenAI documentation suggests, we can use ### or """ as a stop symbol to separate the instruction and context, which can be better understood by LLMs. As a general guideline, most existing LLMs perform a task better in English, thus it is useful to employ English instructions to solve difficult tasks based on machine translation.
• Adopting role-playing strategies. Since LLMs are pre-trained on extensive corpora containing diverse characters and dialogues, they possess an inherent ability for role-playing. This feature can be harnessed through specific prompts to enhance the corresponding capacity for some specific domains [455]. For instance, when solving a math problem, we can use a prompt prefix like "You are an expert in mathematics". This enables LLMs to solve the problem from an expert's perspective, thereby leveraging their pre-trained knowledge more effectively. By guiding LLMs with role-playing prompts, they can often generate more reasonable and accurate solutions.
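As a toy illustration of the four key ingredients and the principles above, the helper below assembles a prompt from a role prefix, task description, contextual information, few-shot demonstrations, and the input data; the structure and field names are our own and not prescribed by any particular work.

```python
def build_prompt(role: str,
                 task_description: str,
                 context: str,
                 demonstrations: list[tuple[str, str]],
                 input_data: str) -> str:
    """Assemble a prompt from the ingredients discussed in Section 6.1.1."""
    parts = [f"You are {role}.", task_description]
    if context:
        parts.append(f'Reference information:\n"""{context}"""')
    for question, answer in demonstrations:          # few-shot demonstrations
        parts.append(f"Question: {question}\nAnswer: {answer}")
    parts.append(f"Question: {input_data}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(
    role="an expert in mathematics",
    task_description="Answer the question step by step and give only the final number on the last line.",
    context="",
    demonstrations=[("What is 2 + 3?", "2 + 3 = 5. The answer is 5.")],
    input_data="What is 12 * 7?",
)
```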
Useful Tips. In addition to the design principles, we also present a collection of useful prompt tips based on existing work and our empirical experiences in Table 12. Note that these tips are suggested in a general manner; they are not necessarily the best prompts for the corresponding tasks. This part will be continuously updated with more guidelines or tips. We welcome readers to contribute to this collection of prompt tips. We present the detailed procedure to contribute to the prompt tips at the link: https://github.com/RUCAIBox/LLMSurvey/tree/main/Prompts.

Empirical Analysis. We further conduct empirical studies to present the impact of prompts on task performance. To conduct the experiments, we select a variety of tasks that span language generation, knowledge utilization, complex reasoning, structured data generation, and information retrieval. For each task, we manually write a prompt that follows the general guidelines introduced above. Note that the tested prompts may not be optimal for these tasks, since they mainly aim to help readers understand how to write an effective prompt for solving different tasks. Also, we add a simplified prompt as the comparison for most tasks. Following the experimental settings in Section 7.4, we examine the 3-shot performance of ChatGPT on complex reasoning tasks (Colored Objects and GSM8k), and the zero-shot performance on other tasks. We report the experimental results in Table 17, where we also include the supervised performance in existing papers as a reference.
• Carefully designed prompts can boost the zero-shot or few-shot performance of ChatGPT. By comparing the results of using different prompts on the same task, we can see that using the carefully designed prompts can achieve better performance than the simpler ones. In the carefully designed prompts, we provide a more clearly expressed task description (e.g., WMT and WikiFact), or use a model-friendly format (e.g., GSM8k and OBQA). For example, for the WikiFact task, the prompt with a more detailed task description leads to a performance increase from 29.25 to 31.21.
• More complex tasks can benefit more from careful prompt engineering on ChatGPT. In the WikiFact and Colored Objects tasks, the designed prompts have greatly improved the performance of ChatGPT, i.e., from 23.61 to 28.47 on WikiFact and from 53.20 to 66.75 on Colored Objects. It indicates the necessity of prompt engineering for LLMs to perform well on complex tasks, since these tasks typically have specific output formats or require background knowledge. Our example prompts provide more detailed task descriptions (e.g., output format and task goal), which can help ChatGPT better understand the complex task requirements for fulfilling them.
• For mathematical reasoning tasks, it is more effective to design specific prompts based on the format of programming language. For GSM8k, the designed prompt employs code-formatted few-shot demonstrations to convert this mathematical reasoning task into a code generation task, which can leverage the strong code synthesis ability of ChatGPT for solving mathematical problems. Further, with the help of an external program executor, we are able to obtain more precise results instead of using LLMs for arithmetic operations. As we can see, the performance is boosted from 78.47 to 79.30 on GSM8k, indicating the usefulness of programming language in mathematical reasoning tasks.
• In knowledge utilization and complex reasoning tasks, ChatGPT with proper prompts achieves comparable performance to or even outperforms the supervised baseline methods. In knowledge utilization and complex reasoning tasks, ChatGPT with proper zero-shot or few-shot prompts can achieve comparable performance to or even outperform the supervised methods, e.g., 31.21 (ChatGPT) vs. 34.20 (supervised baseline) on WikiFact. Despite that, ChatGPT still performs worse than supervised baseline models on some specific tasks (e.g., ARC and WikiFact), since these supervised models have been specially optimized with task-specific data.
• Through suitable prompt engineering, LLMs can handle some non-traditional NLP tasks. With the help of specific prompts, ChatGPT can also accomplish non-traditional NLP tasks, i.e., general recommendation and conversational recommendation. A key point is that these tasks can be well expressed or described in natural language. However, the performance of ChatGPT is still far from the reference performance on these tasks, as LLMs cannot directly fit these tasks, which require specific domain knowledge and task adaptation [355, 456].
TABLE 12: A collection of useful tips for designing prompts that are collected from online notes [446–449] and experiences from our authors, where we also show the related ingredients and principles (introduced in Section 6.1.1). We abbreviate principles as Prin. and list the IDs of the related principles for each prompt. (1): expressing the task goal clearly; (2): decomposing into easy, detailed sub-tasks; (3): providing few-shot demonstrations; (4): utilizing model-friendly format.

Task Description
- T1 (Prin. 1). Make your prompt as detailed as possible, e.g., "Summarize the article into a short paragraph within 50 words. The major storyline and conclusion should be included, and the unimportant details can be omitted."
- T2 (Prin. 1). It is helpful to let the LLM know that it is an expert with a prefixed prompt, e.g., "You are a sophisticated expert in the domain of computer science."
- T3 (Prin. 1). Tell the model more about what it should do, rather than what it should not do.
- T4 (Prin. 1). To prevent the LLM from generating an overly long output, you can just use the prompt: "Question: Short Answer: ". Besides, you can also use the following suffixes: "in one or a few words", "in one or two sentences".

Input Data
- I1 (Prin. 4). For questions requiring factual knowledge, it is useful to first retrieve relevant documents via the search engine, and then concatenate them into the prompt as reference.
- I2 (Prin. 4). To highlight some important parts in your prompt, please use special marks, e.g., quotation ("") and line break (\n). You can also use both of them for emphasizing.

Contextual Information
- C1 (Prin. 2). For complex tasks, you can clearly describe the required intermediate steps to accomplish it, e.g., "Please answer the question step by step as: Step 1 - Decompose the question into several sub-questions, · · ·"
- C2 (Prin. 1). If you want LLMs to provide the score for a text, it is necessary to provide a detailed description about the scoring standard with examples as reference.
- C3 (Prin. 2). When LLMs generate text according to some context (e.g., making recommendations according to purchase history), instructing them with the explanation about the generated result conditioned on the context is helpful to improve the quality of the generated text.
- C4 (Prin. 2). An approach similar to tree-of-thoughts but that can be done in one prompt: e.g., Imagine three different experts are answering this question. All experts will write down one step of their thinking, then share it with the group of experts. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is

Demonstration
- D1 (Prin. 3). Well-formatted in-context exemplars are very useful, especially for producing outputs with complex formats.
- D2 (Prin. 1, 3). For few-shot chain-of-thought prompting, you can also use the prompt "Let's think step-by-step", and the few-shot examples should be separated by "\n" instead of a full stop.
- D3 (Prin. 3, 4). You can also retrieve similar examples in context to supply useful task-specific knowledge for LLMs. To retrieve more relevant examples, it is useful to first obtain the answer of the question, and then concatenate it with the question for retrieval.
- D4 (Prin. 3). The diversity of the in-context exemplars within the prompt is also useful. If it is not easy to obtain diverse questions, you can also seek to keep the diversity of the solutions for the questions.
- D5 (Prin. 3). When using chat-based LLMs, you can decompose in-context exemplars into multi-turn messages, to better match the human-chatbot conversation format. Similarly, you can also decompose the reasoning process of an exemplar into a multi-turn conversation.
- D6 (Prin. 3). Complex and informative in-context exemplars can help LLMs answer complex questions.
- D7 (Prin. 2, 3). As a symbol sequence can typically be divided into multiple segments (e.g., i1, i2, i3 −→ i1, i2 and i2, i3), the preceding ones can be used as in-context exemplars to guide LLMs to predict the subsequent ones, meanwhile providing historical information.
- D8 (Prin. 3). Order matters for in-context exemplars and prompt components. For very long input data, the position of the question (first or last) may also affect the performance.
- D9 (Prin. 3). If you can not obtain the in-context exemplars from existing datasets, an alternative way is to use the zero-shot generated ones from the LLM itself.

Other Designs
- O1 (Prin. 2). Let the LLM check its outputs before drawing the conclusion, e.g., "Check whether the above solution is correct or not."
- O2 (Prin. 4). If the LLM can not solve the task well, you can seek help from external tools by prompting the LLM to manipulate them. In this way, the tools should be encapsulated into callable APIs with a detailed description about their functions, to better guide the LLM to utilize the tools.
- O3 (Prin. 1). The prompt should be self-contained, and better not include pronouns (e.g., it and they) in the context.
- O4 (Prin. 1). When using LLMs for comparing two or more examples, the order affects the performance a lot.
- O5 (Prin. 1). Before the prompt, assigning a role for the LLM is useful to help it better fulfill the following task instruction, e.g., "I want you to act as a lawyer".
- O6 (Prin. 4). OpenAI models can perform a task better in English than in other languages. Thus, it is useful to first translate the input into English and then feed it to LLMs.
- O7 (Prin. 1). For multi-choice questions, it is useful to constrain the output space of the LLM. You can use a more detailed explanation or just impose constraints on the logits.
- O8 (Prin. 1). For sorting-based tasks (e.g., recommendation), instead of directly outputting the complete text of each item after sorting, one can assign indicators (e.g., ABCD) to the unsorted items and instruct the LLMs to directly output the sorted indicators.
6.1.2 Prompt Optimization
Although manually creating task prompts is more intuitive, it is time consuming and, more importantly, models are highly sensitive to the crafted prompts: improper prompts will lead to low task performance (as shown in Table 17). Therefore, a large body of studies propose automatic optimization approaches for discrete prompts and continuous prompts to achieve the optimal performance [404, 413]. In this part, we will detail these studies from two perspectives, i.e., discrete prompts and continuous prompts.

Discrete Prompt Optimization. A discrete prompt is typically composed of a sequence of natural language tokens. Although the form is simple and flexible, optimizing prompts in discrete space is a challenging problem due to the combinatorially huge search space. To automatically search effective prompts for downstream tasks, existing studies propose a wide spectrum of discrete prompt optimization approaches, which are detailed as follows.
TABLE 13: Example instructions collected from [447, 457]. The blue text denotes the task description, the red text denotes
the contextual information, the green text denotes the demonstrations, and the gold text denotes the prompt style.
Example 1 (question answering with retrieved articles):

Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write “I could not find an answer.”

Articles: “““Joao Moutinho is a Portuguese footballer who last played as a central midfielder for Premier League club Wolverhampton Wanderers and the Portugal national team.”””
Question: Is the following sentence plausible? ’Joao Moutinho was out at third.’
Answer: Let’s think step by step. Joao Moutinho is a soccer player. Being out at third is part of baseball, not soccer. So the answer is No.
...
<Demonstrations>

Articles: <insert articles, each delimited by triple quotes>
Question: <insert question>
Answer:

Example 2 (meta-review generation):

Prepare a meta-review by answering the following questions from the reviewer comments (provided after the questions).
1. Based on the reviewer’s comments, what are the core contributions made by this manuscript?
2. What are the common strengths of this work, as mentioned by multiple reviewers?
3. What are the common weaknesses of this work, as highlighted by multiple reviewers?
4. What suggestions would you provide for improving this paper?
5. What are the missing references mentioned by the individual reviews?
The review texts are below: <insert three comments R1, R2, R3 from the reviewers>
Meta-review: <insert meta-review>
...
<Demonstrations>

Provide justification for your response in detail by explaining why you made the choices you actually made. A good output should be coherent, highlight major strengths/issues mentioned by multiple reviewers, be less than 400 words in length, and finally, the response should be in English only.

The review texts are below: <insert three comments R1, R2, R3 from the reviewers>
Meta-review:

Example 3 (text-to-SQL):

CREATE TABLE Highschooler (
    ID int primary key,
    name text,
    grade int
);
/*
3 example rows:
SELECT * FROM Highschooler LIMIT 3;
ID name grade
1234 Janie 8
5678 Mary 8
9012 Mike 9
*/
Using valid SQLite, answer the following questions for the tables provided above.
Question: What is Kyle’s id?
SQL: SELECT ID FROM Highschooler WHERE name="Kyle";
...
<Demonstrations>

Question: <insert question>
SQL:
replacing a prompt token with another candidate token from the vocabulary. However, such a search process can be extremely expensive since it needs to evaluate each candidate token for each position of the prompt, leading to a number of additional forward passes. Therefore, an improved gradient method [458] has been proposed by transforming discrete tokens into continuous embeddings and computing the gradient on the continuous space during optimization.

• RL-based approaches. Since discrete prompts are difficult to learn through gradient back-propagation, a number of studies propose to formulate discrete prompt optimization as a reinforcement learning (RL) problem and leverage RL algorithms for optimization [462–465]. For example, RLPrompt [462] trains a policy network to generate desired prompts with multiple reward functions. In this approach, several effective reward stabilization strategies are also proposed to enhance the RL training efficiency. Compared to previous work that requires sufficient data for training, TEMPERA [463] proposes to edit prompts at test time by utilizing a pre-trained RL agent to sequentially edit different parts of a manually-written initial prompt. Although these methods are simple and effective, they explore a manually defined edit space (e.g., add, swap and delete) and focus on modifying the original prompt, which limits the flexibility of prompt search. In contrast, PRewrite [465] employs RL to train a prompt rewriter for generating new
prompts instead of modification, which does not impose any restrictions on the prompt rewriting and offers improved flexibility in the action space.

• Edit-based approaches. For the above methods, gradient-based and RL-based tuning can be extremely computationally demanding for ever larger models, and may not be feasible for API-based model calls (e.g., ChatGPT). Therefore, another line of work aims to directly edit existing prompts based on the task performance. Specifically, GPS [466] borrows an idea from the genetic algorithm and proposes a genetic prompt search method that utilizes a language model (i.e., T5) to edit prompts by taking the cloze task form. In addition to model-based edit methods, human-defined operations can also be employed for prompt editing [467], including delete, swap, paraphrase, and addition. Based on these operations, they iteratively edit the prompts and greedily search for the best prompt guided by the model performance on a small pool of examples.

• LLM-based approaches. Due to the exceptional capacities of LLMs, an increasing number of studies directly leverage LLMs as prompt generators [468–475]. Specifically, APE [468] utilizes an LLM to generate initial prompts, then selects the best prompt with the highest accuracy, and finally improves the best candidate through an iterative Monte Carlo search method. However, this method does not effectively constrain the prompt search space, which might likely lead to unstable results. To achieve good performance and fast convergence, one line of work utilizes heuristic methods (e.g., evolutionary algorithms [473, 474] and adversarial learning [475]) for prompt optimization. Another line of work draws an analogy to gradient-based model optimizers for LLM-based prompt optimization. For example, APO [469] instructs the LLM to generate text feedback on how to refine an old prompt into new improved prompts and then executes textual gradient descent. However, their search in the prompt space might be inefficient without fully considering the whole refinement trace of previous prompts, thus potentially leading to sub-optimal results. Therefore, some recent studies [470, 471] incorporate the previous prompts with their scores to instruct LLMs to progressively generate better new prompts. To further design formalized guidelines for the design of prompt optimizers, GPO [472] conducts a systematic analogy between LLM-based prompt optimizers and gradient-based model optimizers. It further develops a more formal LLM-based prompt optimization framework, which extensively borrows the ideas of machine learning optimization. Specifically, it retrieves relevant prompts from the previous prompts and utilizes a generation-based refinement strategy to perform the update. In order to avoid large variation at each iteration, GPO further adopts a cosine-based decay strategy to control the edit distance. However, these approaches still struggle in exploring the vast space of effective prompts. Inspired by human-like trial-and-error, prompt optimization has further been formulated as a strategic planning problem [476] that uses Monte Carlo tree search to navigate the vast prompt space.

Continuous Prompt Optimization. Different from discrete prompts, continuous prompts consist of a set of continuous embeddings, which can be directly optimized through the gradient update based on the loss of downstream tasks. Note that continuous prompt optimization has been mainly studied in PLMs, but has drawn limited attention in the era of LLMs due to their massive parameter scale. We include the discussion of this part for content completeness. In prior work, most studies typically rely on supervised learning to train continuous prompts based on task data. Furthermore, in data-scarce scenarios, transfer learning methods can be employed to alleviate the lack of labeled data on target tasks. These two approaches are detailed below.

• Prompt learning with sufficient data. In this approach, most existing methods regard continuous prompts as trainable model parameters and then leverage supervised learning to optimize the continuous prompts by minimizing the cross-entropy loss based on sufficient downstream task data [404, 405, 409, 477]. As discussed in Section 5.3.1, prefix tuning [404] prepends a sequence of prefixes (i.e., a set of trainable continuous vectors) to each Transformer layer in language models, while prompt tuning [405] only incorporates trainable prompt vectors at the input layer. By fixing the large-scale parameters of LLMs and only tuning the continuous prompt vectors, this kind of approach can be extremely parameter-efficient (Section 5.3). However, these approaches are typically independent of the inputs, lacking sufficient consideration of input semantics. Therefore, the authors in [477] propose context tuning, where the continuous prompts are derived based on the input text and learned through the downstream task losses.

• Prompt transferring with scarce data. Supervised learning approaches demand sufficient training data to learn optimal continuous prompts, which may not work well in data-scarce domains and tasks. To address this problem, SPoT [478] proposes a prompt-based transfer learning approach, which first learns a single continuous prompt for several representative source tasks and then uses this prompt to initialize the prompt for a target task. However, this approach leverages the same prompt for solving all instances of the target task. For a single task, even a well-learned prompt may not be suitable for all the data instances from a large population. To address this issue, an improved method [479] designs an adaptive attention mechanism during the prompt transfer process to derive the target prompts, considering both task- and instance-level information. The prompt transfer paradigm can leverage the knowledge of data-sufficient source tasks encoded in source prompts for solving data-scarce target tasks.

6.2 In-Context Learning

As a special prompting form, in-context learning (ICL) is first proposed along with GPT-3 [55], which has become a typical approach to utilizing LLMs.

6.2.1 ICL Formulation

As stated in [55], ICL uses a formatted natural language prompt, consisting of the task description and/or a few task examples as demonstrations. Figure 14 presents an illustration of ICL. First, starting with a task description, a few examples are selected from the task dataset as demonstrations. Then, they are combined in a specific order to form natural language prompts with specially designed templates.
Finally, the test instance is appended to the demonstration as the input for LLMs to generate the output. Based on task demonstrations, LLMs can recognize and perform a new task without explicit gradient update.

Formally, let D_k = {f(x_1, y_1), ..., f(x_k, y_k)} represent a set of demonstrations with k examples, where f(x_k, y_k) is the prompt function that transforms the k-th task example into a natural language prompt. Given the task description I, the demonstration set D_k, and a new input query x_{k+1}, the prediction of the output ŷ_{k+1} generated from LLMs can be formulated as follows:

    LLM( I, f(x_1, y_1), ..., f(x_k, y_k), f(x_{k+1}, ___) ) → ŷ_{k+1},    (11)

where f(x_1, y_1), ..., f(x_k, y_k) serve as the demonstrations, x_{k+1} is the input, and the blank “___” marks the answer slot; the actual answer y_{k+1} is left as a blank to be predicted by the LLM. Since the performance of ICL heavily relies on demonstrations, it is important to properly design them in the prompts. According to the construction process in Equation (11), we focus on three major aspects of formatting demonstrations in the prompts, including how to select examples that make up demonstrations, format each example into the prompt with the function f(·), and arrange demonstrations in a reasonable order.

35. When ICL was introduced in the GPT-3 paper [55], it was originally defined to be a combination of the task description and demonstration examples, wherein either component is dispensable. Following this definition, when an LLM is required to solve an unseen task by using only task descriptions, it can also be considered to perform ICL for task solving, whereas the ICL ability can be enhanced by instruction tuning.

A comprehensive review of ICL has been presented in the survey paper [50], and we suggest the readers refer to it for a more general, detailed discussion on this topic. Compared with that survey, we specially focus on the discussion of applying ICL to LLMs in two major aspects, i.e., demonstration design and the underlying mechanism of ICL. Also, ICL has a close connection with instruction tuning (discussed in Section 5.1) in that both utilize natural language to format the task or instances. However, instruction tuning needs to fine-tune LLMs for adaptation, while ICL only prompts LLMs for utilization. Furthermore, instruction tuning can enhance the ICL ability of LLMs to perform target tasks, especially in the zero-shot setting (only using task descriptions) [69].

6.2.2 Demonstration Design

Several studies have shown that the effectiveness of ICL is highly affected by the design of demonstrations [425, 480, 481]. Following the discussion in Section 6.2.1, we will introduce the demonstration design of ICL from three major aspects, i.e., demonstration selection, format, and order.

Demonstration Selection. The performance of ICL tends to have a large variance with different demonstration examples [420], so it is important to select a subset of examples that can effectively leverage the ICL capability of LLMs. There are two main demonstration selection approaches, namely heuristic and LLM-based approaches:

• Heuristic approaches. Due to their simplicity and low costs, existing work widely adopts heuristic methods to select demonstrations. Several studies employ a k-NN based retriever to select examples that are semantically relevant to the query [420, 482]. However, they perform the selection individually for each example, rather than evaluating the example set as a whole. To resolve this issue, diversity-based selection strategies are proposed to choose the most representative set of examples for specific tasks [483, 484]. Furthermore, in [485], both relevance and diversity are taken into consideration when selecting demonstrations.

• LLM-based approaches. Another line of work selects demonstrations by making use of LLMs. For example, LLMs can be utilized to directly measure the informativeness of each example according to the performance gain after adding the example [486]. In addition, EPR [421] proposes a two-stage retrieval approach that first recalls similar examples with an unsupervised method (e.g., BM25) and then ranks them using a dense retriever (trained with positive and negative examples labeled by LLMs). As an alternative approach, the task of demonstration selection can be formulated as an RL problem, where LLMs serve as the reward function to provide feedback for training the policy model [487]. Since LLMs perform well for text annotation [488], some recent studies employ the LLM itself as the demonstration generator without human intervention [489].

To summarize, as discussed in [490], for the above two selection approaches, the selected demonstration examples in ICL should contain sufficient information about the task to solve as well as be relevant to the test query.

Demonstration Format. After selecting task examples, the next step is to integrate and format them into a natural language prompt for LLMs. A straightforward method is to instantiate a pre-defined template with the corresponding input-output pairs [36]. To construct more informative templates, recent studies consider adding task descriptions [69] or enhancing the reasoning capability of LLMs with chain-of-thought prompts [33]. For instance, in [179], the authors collect a large-scale dataset with task descriptions written by humans. After tuning with this dataset, the performance on seen tasks can be boosted, and LLMs can also generalize to unseen tasks to some extent. To reduce the annotation costs, a semi-automated approach has been proposed in [147] by employing a seed set consisting of human-written task descriptions to guide LLMs to generate task descriptions for new tasks. Since it is costly to manually annotate demonstration formats for different tasks, some work also studies how to automatically generate high-quality ones. As two representative methods, Auto-CoT [427] leverages LLMs with the zero-shot prompt “Let’s think step by step” for generating intermediate reasoning steps, while least-to-most prompting [432] first queries LLMs to perform problem decomposition and then utilizes LLMs to sequentially solve sub-problems based on the intermediate answers to previously solved ones.

Demonstration Order. LLMs are shown to sometimes suffer from the recency bias, i.e., they are prone to repeat answers that are near the end of demonstrations [481]. Thus, it is important to arrange demonstrations (i.e., task examples) in a reasonable order. Early work proposes several heuristic methods to quickly find a good order. For example, demonstrations can be directly organized according to their
Fig. 14: A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs with a natural language description, several demonstrations, and a test query, while CoT prompting involves a series of intermediate reasoning steps in prompts.
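To make the formulation in Equation (11) concrete, the sketch below assembles an ICL prompt from a task description I, k demonstrations rendered by a prompt function f(·), and a query whose answer slot is left blank for the LLM to complete. The function and variable names are illustrative assumptions, and the demonstrations reuse the mathematical reasoning examples shown in Figure 14.

# Sketch of Equation (11): build an ICL prompt; names are illustrative only.
def f(x: str, y: str = "") -> str:
    """Prompt function: turn one (input, output) example into text."""
    return f"Q: {x}\nA: {y}".rstrip()

def build_icl_prompt(task_description: str, demonstrations: list[tuple[str, str]], query: str) -> str:
    parts = [task_description]
    parts += [f(x, y) for x, y in demonstrations]   # f(x_1, y_1), ..., f(x_k, y_k)
    parts.append(f(query))                          # f(x_{k+1}, ___): answer left blank
    return "\n\n".join(parts)

demos = [
    ("If you have 12 candies and you give 4 candies to your friend, how many candies do you have left?", "The answer is 8."),
    ("If a rectangle has a length of 6 cm and a width of 3 cm, what is the perimeter of the rectangle?", "The answer is 18 cm."),
]
prompt = build_icl_prompt(
    "Answer the following mathematical reasoning questions:",
    demos,
    "Sam has 12 marbles. He gives 1/4 of them to his sister. How many marbles does Sam have left?",
)
print(prompt)  # the LLM would be asked to continue this text with the answer y_{k+1}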
similarity to the query in the embedding space [420]: the more similar, the closer to the end. In addition, global and local entropy metrics can be used to score different demonstration orders [425]. To integrate more task information, some recent studies propose to minimize the code length required to compress and transmit task labels, which is inspired by information theory [491]. However, these methods need additional labeled data as the validation set to evaluate the performance of specific demonstration orders. To eliminate this need, the authors in [425] propose to sample the validation data from the LLM itself.

6.2.3 Underlying Mechanism

After pre-training, LLMs can exhibit intriguing ICL capability without being updated. In what follows, we discuss two key questions about the ICL ability of LLMs, i.e., “how does pre-training affect the ICL ability” and “how do LLMs perform ICL during inference”.

How Pre-Training Affects ICL? ICL is first proposed in GPT-3 [55], and it has been shown that the ICL ability becomes more significant with a larger model size. Further, some studies reveal that small-scale PLMs can also demonstrate a strong ICL ability by continual pre-training [492] or fine-tuning [493] on specially designed training tasks, which typically involve additional task examples in the input during the training process. It suggests that the design of training tasks is an important influence factor on the ICL capability of LLMs. Besides training tasks, recent studies have also investigated the relationship between ICL and pre-training corpora [490, 494]. For example, ICL can be theoretically explained as the product of pre-training on documents that exhibit long-range coherence [490]. Further, another study [494] theoretically analyzes that, when scaling parameters and data, the ICL ability of LLMs based on next-word prediction can emerge from learning the compositional structure (e.g., how words and phrases are combined to form larger linguistic units like sentences) present in language data.

How LLMs Perform ICL? At the inference stage, researchers focus on analyzing how the ICL capability operates based on given demonstrations, since no explicit learning or updating is involved. According to the discussion in [495], there are two main ways for LLMs to utilize demonstrations: task recognition and task learning.

• Task recognition. In the first way, LLMs recognize the task from demonstrations and utilize the prior knowledge obtained from pre-training to solve new test tasks. A Probably Approximately Correct (PAC) framework [496] has been proposed to assess the learnability of ICL. It assumes that there exists a latent variable representing the task in the pre-training data, and LLMs have been shown to be capable of capturing this variable from demonstrations, enabling them to recognize the task in ICL. Also, the interpretation of ICL as task recognition is supported by several empirical studies [480, 497]. For example, it has been observed that replacing the inputs or labels of demonstrations with random ones sampled from the input or label space does not seriously hurt the performance of LLMs, indicating that LLMs mainly recognize the target task from demonstrations instead of learning from them [480, 495]. Similarly, LLMs can exhibit decent performance even if the prompt template is irrelevant or misleading [497].

• Task learning. In the second way, LLMs learn new tasks unseen in the pre-training stage only through demonstrations. Specially, task learning is analyzed mainly from the perspective of gradient descent and considered as implicit fine-tuning [65, 498]. Then, ICL can be explained as follows: by means of forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform gradient descent via the attention mechanism. Experiments also show that certain attention heads in LLMs are capable of performing task-agnostic atomic operations (e.g., copying and prefix matching), which are closely related to
the ICL ability [499]. Furthermore, some studies abstract ICL as an algorithm learning process [500]. For example, the authors in [500] find that LLMs essentially encode implicit models through their parameters during pre-training. With the examples provided in ICL, LLMs can implement learning algorithms such as gradient descent or directly compute the closed-form solution to update these models during forward computation. Under this explanation framework, it has been shown that LLMs can effectively learn simple linear functions and even some complex functions like decision trees with ICL [500].

As discussed in a recent study [495], LLMs exhibit the abilities of both task recognition and task learning in ICL, but the two abilities seem to be possessed with different model scales. As shown in the experiments [495], the ability of task recognition is easier to obtain, and even a small LM with only 350M parameters can exhibit this ability, while task learning can only emerge for LLMs with at least 66B parameters. Another study [501] also supports this finding with specially designed experiments. They set up the tasks with flipped and semantically unrelated labels in the experiment, which require task learning when performing ICL. The results suggest that small LMs tend to disregard the labels and mainly depend on their prior knowledge to accomplish the task, while LLMs have the ability to surpass their prior knowledge and acquire new knowledge from demonstrations, resulting in better outcomes. Furthermore, to improve the task learning ability, Meta-In-Context Learning [502] proposes to include multiple related tasks instead of just a single one in the prompt. In addition, Symbol Tuning [503] fine-tunes LLMs on demonstrations with semantically unrelated labels (e.g., foo/bar instead of positive/negative for sentiment analysis), forcing LLMs to learn the task from demonstrations instead of relying on prior knowledge.

6.3 Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting [33, 504] is an improved prompting strategy to boost the performance of LLMs on complex reasoning tasks, such as arithmetic reasoning [505], commonsense reasoning [506], and symbolic reasoning [33]. Instead of simply constructing the prompts with input-output pairs like ICL, CoT prompting further incorporates intermediate reasoning steps, which serve as the bridge between inputs and outputs. Figure 14 presents an illustration of CoT. In the following part, we will first elaborate on the basic CoT prompting approach and its improved strategies, then discuss when and why CoT prompting works.

6.3.1 Basic CoT Prompting Approach

CoT prompting is first proposed as an extension of ICL [33], which augments each demonstration ⟨input, output⟩ as ⟨input, CoT, output⟩. A CoT is a series of intermediate reasoning steps for connecting the input and output. With these augmented demonstrations, LLMs can follow them to generate CoTs and the answer for a new input. However, unlike ⟨input, output⟩ pairs in ICL, CoTs are difficult to obtain and usually require human annotation. Fortunately, it has been found that LLMs can be triggered to generate CoTs through simple instructions like “Let’s think step by step.” [507], making CoT prompting easy to use. There are also alternative magic prompts that can elicit the ability of CoT reasoning and further improve the performance of LLMs, such as “Take a deep breath and work on this problem step-by-step.” [470].

As illustrated in Figure 15, the generation process of CoT follows a chain structure in the basic CoT prompting approach, where LLMs generate CoTs step by step. Typically, CoT takes the format of natural language text. However, textual CoTs may not work well on complex tasks that require rigorous logic for reasoning. Considering this, some work uses code [508, 509] due to its structured and precise nature. Furthermore, the authors in [510] propose to dynamically select text or code as the format of CoTs to combine their advantages.

6.3.2 Improved CoT Prompting Strategies

Despite the performance improvement in complex reasoning tasks, CoT prompting still suffers from problems like incorrect reasoning and instability. In this part, we first introduce how to design better CoT prompts and enhanced CoT generation strategies, and then introduce the extension of the basic chain structure of CoT. Figure 15 illustrates the evolution of representative CoT prompting strategies.

Better Prompt Design. Since CoT prompting relies on prompts to elicit the reasoning capabilities of LLMs, the design of prompts is critical to its performance. As a direct approach, it is shown that using diverse CoTs (i.e., multiple reasoning paths for each problem) can effectively enhance the performance [430]. Another intuitive idea is that prompts with more complex reasoning paths are more likely to elicit the reasoning ability of LLMs [426], which can result in higher accuracy in generating correct answers. However, all these approaches rely on annotated CoT datasets, which limits their use in practice. To overcome this limitation, magic instructions such as “Let’s think step by step” can be used to automatically construct CoTs by prompting LLMs [427].

Enhanced CoT Generation. Since LLMs are prone to producing incorrect reasoning steps and exhibiting instability in the generation process, there are a number of studies [429, 511] to improve the generation of CoT. In this part, we will introduce two typical approaches to enhancing the generation of CoT: sampling- and verification-based methods.

• Sampling-based methods. LLMs are known to suffer from instability during inference, which can lead to unfaithfulness in the generated reasoning steps. To address this issue, some work proposes to sample multiple reasoning paths instead of using greedy decoding. As a representative solution, self-consistency [429] first generates several reasoning paths and then takes an ensemble over the corresponding answers, selecting the most consistent one through majority voting. However, such a method can still lead to wrong answers when most of the reasoning paths are misled. Considering this, the authors in [426] only vote on the k most complex reasoning paths based on their observation that reasoning paths with higher complexity (e.g., more reasoning steps) usually have better performance. Furthermore, MCR [512] proposes referring to the steps
Fig. 15: An illustration of the evolution of CoT prompting strategies. It begins with the basic CoT approach and progresses to enhanced CoT generation techniques, including sampling-based and verification-based methods. Finally, it extends to variations of the chain structure, such as trees and graphs. Here, “thought” refers to an intermediate reasoning step as stated in [33, 444].
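As a concrete illustration of the sampling-based enhancement shown in Figure 15, the sketch below implements the core loop of self-consistency [429]: sample several CoT reasoning paths with non-zero temperature, parse the final answer from each path, and return the majority answer. The sampling function passed in is a placeholder to be wired to an LLM of choice; the dummy sampler is only for demonstration.

# Sketch of self-consistency: majority vote over answers from sampled CoT paths.
from collections import Counter
from typing import Callable

def self_consistency(question: str, sample_cot_answer: Callable[[str], str], n_paths: int = 10) -> str:
    """sample_cot_answer should run one stochastic CoT generation and return its parsed final answer."""
    answers = [sample_cot_answer(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]   # most consistent answer wins

# Toy usage with a dummy sampler standing in for an LLM call:
import random
dummy = lambda q: random.choice(["9", "9", "9", "8"])   # simulated noisy reasoning paths
print(self_consistency("Sam has 12 marbles. He gives 1/4 of them to his sister. How many are left?", dummy, n_paths=15))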
from other reasoning paths when generating the next step, and performs reasoning across multiple reasoning paths to generate the final answer.

• Verification-based methods. The sequential nature of reasoning steps in CoTs can lead to the accumulation of errors in the generated CoTs when certain steps are incorrect. To mitigate this problem, recent studies propose to verify the correctness of generated reasoning steps with either trained verifiers or LLMs themselves. For example, DIVERSE [511] trains solution-level and step-level verifiers respectively to examine the reasoning steps at different granularities. Another approach [513] utilizes LLMs to verify the correctness of reasoning steps through step-by-step self-verification with a specially designed reasoning format. In addition, several studies propose backward reasoning for verification: it first deduces the necessary question conditions [514, 515] or variables [516] from the model’s predictions, and then compares them with the original ones.

Reasoning Structure Extension. Despite the generality, the chain reasoning structure of basic CoT prompting limits its effectiveness in solving complex tasks, which require exploration like foresight and backtracking during inference. Therefore, many studies have been devoted to extending the reasoning structure by designing more intricate thought processes, e.g., tree- and graph-structured reasoning.

• Tree-structured reasoning. This approach (exemplified by Tree of Thoughts (ToT) [444, 517]) formulates the reasoning process in a hierarchical tree structure, where intermediate thoughts are nodes. In this way, it enables LLMs to explore multiple reasoning paths in parallel and further supports the operation of lookahead and backtracking to facilitate more comprehensive decisions. In addition, TouT [518] takes the uncertainty of intermediate thoughts into account for thought evaluation based on Monte Carlo Dropout.

• Graph-structured reasoning. Although the tree structure facilitates parallel reasoning, it also imposes restrictions on the reasoning process. With more complex topological structures, graphs offer greater flexibility in reasoning, enabling the characterization of more intricate relationships and interactions. For instance, Graph of Thoughts (GoT) [519, 520] conceptualizes the reasoning process as an arbitrary graph, where vertices denote intermediate thoughts and edges denote the interdependence between these thoughts. Compared with ToT, it can further utilize thoughts from other reasoning paths when generating new thoughts. However, such an approach requires a large number of interactions with LLMs, making the thought exploration process highly inefficient. To reduce potentially meaningless thought exploration, XoT [521] further proposes to guide the search of thoughts with pre-trained policy and value networks.

6.3.3 Further Discussion on CoT Prompting

In this part, we present discussions regarding two fundamental questions related to CoT prompting, i.e., “when does CoT prompting work for LLMs” and “why can LLMs perform CoT reasoning”.

When CoT Prompting Works For LLMs? Since CoT reasoning is an emergent ability [31], it only has a positive effect on sufficiently large models (typically containing 10B or more parameters [33]) but not on small models. Moreover, since CoT prompting augments the standard prompting with intermediate reasoning steps, it is mainly effective for the tasks that require step-by-step reasoning [33], e.g., arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Whereas, for other tasks that do not rely on complex reasoning, CoT prompting might lead to worse performance than standard prompting [431], e.g., MNLI-m/mm, SST-2, and QQP from GLUE [279]. Interestingly, it seems that the performance gain brought by CoT prompting could be significant only when standard prompting yields poor results [33].
Why LLMs Can Perform CoT Reasoning? As the second question, we discuss the underlying mechanism of CoT prompting in the following two aspects.

• The source of CoT reasoning ability. Regarding the source of CoT reasoning capability, it is widely hypothesized that it can be attributed to training on code, since models trained on it show a strong reasoning ability [47, 522, 523]. Intuitively, code data is well organized with algorithmic logic and programming flow, which may be useful to improve the reasoning performance of LLMs. However, this hypothesis still lacks publicly reported evidence of ablation experiments (with and without training on code). In addition, instruction tuning seems not to be the key reason for obtaining the CoT reasoning ability, since it has been empirically shown that instruction tuning on non-CoT data does not improve the performance on held-out CoT reasoning benchmarks [69].

• The effect of CoT prompting components. The major distinction between CoT prompting and standard prompting is the incorporation of reasoning paths prior to the final answer. Thus, some researchers investigate the effects of different components in the reasoning paths. Specifically, a recent study identifies three key components in CoT prompting, namely symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of the tokens that are not symbols or patterns) [524]. It is shown that the latter two parts (i.e., patterns and text) are essential to the model performance, and removing either one would lead to a significant performance drop. However, the correctness of symbols and patterns does not seem critical. Further, there exists a symbiotic relationship between text and patterns: the text helps LLMs to generate useful patterns, and patterns aid LLMs to understand tasks and generate texts that help solve them [524].

In summary, CoT prompting provides a general and flexible approach to eliciting the reasoning ability of LLMs. There are also some preliminary attempts to extend this technique to solve multimodal [525] and multilingual tasks [526].

6.4 Planning

Prompting with ICL and CoT is a conceptually simple yet general approach to solving various tasks. However, this approach struggles with complex tasks like mathematical reasoning [527] and multi-hop question answering [528]. As an enhanced approach, prompt-based planning has been proposed to break down complex tasks into smaller sub-tasks and generate a plan of actions to accomplish the task.

6.4.1 The Overall Framework

In this part, we first formulate the general planning paradigm of LLMs for solving complex tasks, which is illustrated in Figure 16.

Fig. 16: An illustration of the formulation for prompt-based planning by LLMs for solving complex tasks.

In this paradigm, there are typically three components: task planner, plan executor, and environment. Specifically, the task planner, which is played by LLMs, aims to generate the whole plan to solve a target task. The plan can be presented in various forms, e.g., an action sequence in the form of natural language [432] or an executable program written in a programming language [436]. The LLM-based task planner can be enhanced with a memory mechanism for plan storage and retrieval, which is helpful for long-horizon tasks. Then, the plan executor is responsible for executing the actions in the plan. It can be implemented by models like LLMs for textual tasks [434] or by tools like code interpreters for coding tasks [443]. Furthermore, the environment refers to where the plan executor carries out the actions, which can be set differently according to specific tasks, e.g., the LLM itself [529] or an external virtual world like Minecraft [530]. It provides feedback about the execution result of the action to the task planner, either in the form of natural language [443] or from other multimodal signals [439].

For solving a complex task, the task planner first needs to clearly understand the task goal and generate a reasonable plan based on the reasoning of LLMs (see Section 6.4.2). Then, the plan executor acts according to the plan in the environment, and the environment will produce feedback for the task planner (see Section 6.4.3). The task planner can further incorporate the feedback obtained from the environment to refine its initial plan and iteratively perform the above process to get better results as the task solution (see Section 6.4.4).

36. Despite the similarity with RL, our formulation decouples the planning and execution phases, whereas in RL they are typically interleaved in the agent. This paradigm is defined in a general yet slightly loose way, and it mainly aims to help readers understand the key idea underlying the planning approaches of LLMs.

6.4.2 Plan Generation

Plan generation focuses on directly generating action sequences by prompting LLMs. Based on the format of the generated plans, existing work can be divided into two groups: text-based and code-based approaches.

Text-based Approaches. It is straightforward for LLMs to generate plans in the form of natural language. In this approach, LLMs are prompted to generate a sequence of actions for the plan executor to perform and solve the complex task. For example, Plan-and-Solve [434] adds explicit
instructions like “devise a plan” to directly prompt the LLM for planning in a zero-shot manner, while Self-planning [531] and DECOMP [433] add demonstrations in the prompt to guide the LLM to devise a plan through ICL. Following this way, some work further considers incorporating extra tools or models when planning. For example, ToolFormer [80] first annotates a pre-training corpus with potential API calls using LLMs, and then fine-tunes LLMs on it, so that LLMs can learn when and how to call APIs and incorporate the results returned by APIs during generation. HuggingGPT [437] introduces the models available in HuggingFace and regards LLMs as the controller to select suitable models based on their descriptions and aggregate their results as the final solution.

Code-based Approaches. Although text-based approaches sound intuitive, they cannot guarantee faithful execution of the plan, which may lead to failure even when the plan is sound. To address this issue, code-based approaches have been proposed to generate more verifiable plans in the form of executable code in programming languages, e.g., Python or PDDL. In this way, LLMs are first prompted to generate the program and then utilize a deterministic solver to execute it. For example, Faithful CoT [435] and PAL [436] decompose a reasoning task into two stages: at the first stage, the LLM generates a plan conditioned on the query; at the second stage, a deterministic solver executes the plan to derive the final answer. Furthermore, code-based approaches can be applied to embodied agents in a similar way. For example, PROGPROMPT [532] and LLM+P [533] first utilize LLMs to generate plans in the form of Python functions or PDDL files, and then leverage a virtual agent or classical planner to solve the problem according to the code-based plans.

6.4.3 Feedback Acquisition

After executing the generated plan, the environment would produce the feedback signal to the LLM-based task planner, which can be used to refine its initial plan for better results. In existing work, there are typically two sources of feedback from the environment, depending on their relationship with the LLM-based task planner: internal (i.e., the LLM itself) and external (e.g., tools or virtual worlds) feedback.

Internal Feedback. The LLM itself can be utilized as a feedback provider. One straightforward way is to directly evaluate the quality of the generated plans through prompting. For example, RAP [440] evaluates the likelihood that each candidate plan can lead to task success, while Tree of Thoughts [529] proposes to vote across plans by making comparisons between them. Further, LLMs can provide feedback based on the intermediate results from the plan executor. For example, Reflexion [443] utilizes LLMs to transform sparse result signals (e.g., success or failure) into concrete text-based feedback (e.g., “You should recommend comedies that the user mentions in the query instead of horror movies”) and stores this feedback in long-term memory for future planning.

External Feedback. In addition to LLMs, external objects can also provide feedback signals. For example, tools like code interpreters are widely used in programming tasks to provide real-time error messages [443], models like Stable Diffusion [534] can be used in multimodal tasks to provide visual perception [439], and virtual worlds like Minecraft can provide immersive experiences [530]. Besides, some work (e.g., Generative Agents [535]) explores multi-agent collaboration in simulated environments, where each agent receives feedback not only from interaction with the environment but also from communication with other agents.

6.4.4 Plan Refinement

With access to feedback from the environment, the task planner can accordingly refine its current plan and iteratively go through the “planning – execution – refinement” loop for better results. In this part, we summarize three major refinement approaches in existing work.

Reasoning. The feedback data from the environment may not be directly suitable to be utilized by LLMs for plan refinement, e.g., containing irrelevant information or taking a non-language form. To solve this, some work adds an explicit reasoning process to extract critical information from feedback [441, 442]. For example, React [442] prompts LLMs with demonstrations to generate reasoning traces over feedback. It has been widely used in autonomous agent projects, such as AutoGPT [536], which can automatically reason over the observed feedback to revise the initial plan for solving various user requests. However, these approaches typically fix the order of reasoning and planning. To support flexible switching between the two processes for better performance, ChatCoT [441] further unifies the tool-augmented reasoning process into a multi-turn conversation between the LLM-based task planner and the tool-based environment.

Backtracking. Early methods mainly consider planning forward actions while maintaining the existing plan, thus likely leading to local optimal plans based on a short-term evaluation. To solve this, Tree of Thoughts [529] allows backtracking with search algorithms like breadth-first and depth-first search to make global planning. It refines the plan step by step by backtracking to the last state in the initial plan and choosing the next unexplored action. Furthermore, some studies [439, 537] utilize feedback signals to revise the entire plan. For example, DEPS [537] selects a better plan according to feedback signals, while TIP [439] adds feedback signals to prompts for the LLM-based planner to revise each step in the initial plan.

Memorization. In order to handle long-horizon tasks, it has become a key approach to aid plan refinement with long-term memory, in addition to utilizing the short-term memory of LLMs through ICL. For example, Reflexion [443] stores the feedback from self-reflection into the memory, so previous feedback can be retrieved for plan refinement. Generative Agents [535] designs the memory stream mechanism as the core component of agents for action planning and reflection. Further, the skill library mechanism [438, 530] is proposed to store successful plans in the library, which can be reused and synthesized as complex plans for novel tasks. To implement the long-term memory mechanism, tools like vector databases (e.g., Milvus [538]) can be used to encode plans or feedback into high-dimensional vectors for efficient storage
and retrieval at a large scale. MemoryBank [539] further proposes a memory updating mechanism to allow memory forgetting and strengthening following the Ebbinghaus Forgetting Curve theory.

7 CAPACITY AND EVALUATION

To examine the effectiveness and superiority of LLMs, a surge of tasks and benchmarks have been proposed for conducting empirical ability evaluation and analysis. In this section, we first introduce three types of basic ability evaluation of LLMs for language generation and understanding, then present several advanced ability evaluations with more complicated settings or goals, and finally discuss existing benchmarks, evaluation approaches, and empirical analysis.

7.1 Basic Ability

In this part, we mainly focus on three basic types of ability evaluation for LLMs, i.e., language generation, knowledge utilization, and complex reasoning. It is noted that we do not intend to have complete coverage of all the related tasks, but instead only focus on the most widely discussed or studied tasks for LLMs. Next, we introduce these tasks in detail.

7.1.1 Language Generation

According to the task definition, existing tasks about language generation can be roughly categorized into language modeling, conditional text generation, and code synthesis tasks. Note that although code synthesis is not a typical NLP task, we include it for discussion because it can be directly solved by a number of LLMs (trained on code data) in a similar generation approach as natural language text.

Language Modeling. As the most fundamental ability of LLMs, language modeling aims to predict the next token based on the previous tokens [1], which mainly focuses on the capacity of basic language understanding and generation. For evaluating such an ability, typical language modeling datasets that existing work uses include Penn Treebank [540], WikiText-103 [541], and the Pile [166], where the metric of perplexity is commonly used for evaluating the model performance under the zero-shot setting. Empirical studies [55, 93] show that LLMs bring substantial performance gains over the previous state-of-the-art methods on these evaluation datasets. To better test the modeling capacity of long-range dependencies in text, the LAMBADA dataset [252] has been introduced, where LLMs are required to predict the last word of sentences based on a paragraph of context. Then, the accuracy and perplexity of the predicted last words are employed to evaluate LLMs. As shown in existing work, the performance on the language modeling tasks typically follows the scaling law [30], which means that scaling language models would improve the accuracy and reduce the perplexity.

Conditional Text Generation. As an important topic in language generation, conditional text generation [48] focuses on generating texts satisfying specific task demands based on the given conditions, typically including machine translation [626], text summarization [550], and question answering [559]. To measure the quality of the generated text, automatic metrics (e.g., Accuracy, BLEU [627] and ROUGE [628]) and human ratings have been typically used for evaluating the performance. Due to the powerful language generation capabilities, LLMs have achieved remarkable performance on existing datasets and benchmarks. For instance, GPT-4 exhibits comparable performance to commercial translation products, even for the translation of languages with significant linguistic distance [629]. On news summarization tasks (i.e., CNN/DM and XSUM), LLMs also demonstrate comparable performance with human freelance writers [630]. Despite the rapid progress on model capacity, there are increasing concerns about whether existing automatic metrics can faithfully assess the performance of LLMs in conditional text generation tasks [630–632]. As alternatives to automatic metrics, recent studies also propose to incorporate LLMs as generation evaluators to examine the quality of the generated content [152, 633, 634]. Moreover, researchers also explore more challenging language generation tasks for LLMs, such as structured data generation [451] and long text generation [46, 635, 636].

Code Synthesis. In addition to generating high-quality natural language text, existing LLMs also show strong abilities to generate formal language, especially computer programs (i.e., code) that satisfy specific conditions, called code synthesis [637]. Unlike natural language generation, as the generated code can be directly checked by execution with corresponding compilers or interpreters, existing work mostly evaluates the quality of the generated code from LLMs by calculating the pass rate against the test cases, i.e., pass@k (see footnote 37). Recently, several code benchmarks focusing on functional correctness have been proposed to assess the code synthesis abilities of LLMs, such as APPS [376], HumanEval [105], and MBPP [223]. Typically, they consist of diverse programming problems, with text specifications and test cases for correctness checking. To improve such an ability, it is key to fine-tune (or pre-train) LLMs on code data, which can effectively adapt LLMs to code synthesis tasks [86]. In addition, existing work has proposed new strategies to generate code, e.g., sampling multiple candidate solutions [223] and planning-guided decoding [638], which can be considered as imitations of the bug-fixing and code-planning processes of programmers. Impressively, LLMs have recently shown competitive performance with humans by achieving a ranking of the top 28% among users on the programming contest platform Codeforces [114]. Further, GitHub Copilot has been released to assist programming in coding IDEs (e.g., Visual Studio and JetBrains IDEs), which can support a variety of languages including Python, JavaScript, and Java. A viewpoint article entitled “The End of Programming” [639] in Communications of the ACM has discussed the impact of AI programming in the field of computer science, emphasizing an important shift towards the highly adaptive LLM as a new atomic unit of computation.

Major Issues. Although LLMs have achieved splendid performance in generating human-like text, they are susceptible to suffering from two major issues in language generation

37. Given k programs generated by the LLM, pass@k is computed as 1 when at least one program passes all test cases, and 0 otherwise.
TABLE 14: Representative basic and advanced abilities and corresponding representative datasets for evaluating.
Basic abilities:
• Language Generation
  – Language Modeling: Penn Treebank [540], WikiText-103 [541], the Pile [166], LAMBADA [252]
  – Conditional Text Generation: WMT’14,16,19,20,21,22 [542–547], Flores-101 [548], DiaBLa [549], CNN/DailyMail [550], XSum [551], WikiLingua [552], OpenDialKG [553]
  – Code Synthesis: APPS [376], HumanEval [105], MBPP [223], CodeContest [114], MTPB [86], DS-1000 [554], ODEX [555]
• Knowledge Utilization
  – Closed-Book QA: Natural Questions [556], ARC [557], TruthfulQA [558], Web Questions [559], TriviaQA [560], PIQA [561], LC-quad2.0 [562], GrailQA [563], KQApro [564], CWQ [565], MKQA [566], ScienceQA [567]
  – Open-Book QA: Natural Questions [556], OpenBookQA [568], ARC [557], TriviaQA [560], Web Questions [559], MS MARCO [569], QASC [570], SQuAD [571], WikiMovies [572]
  – Knowledge Completion: WikiFact [573], FB15k-237 [574], Freebase [575], WN18RR [576], WordNet [577], LAMA [578], YAGO3-10 [579], YAGO [580]
• Complex Reasoning
  – Knowledge Reasoning: CSQA [506], StrategyQA [199], HotpotQA [581], ARC [557], BoolQ [582], PIQA [561], SIQA [583], HellaSwag [584], WinoGrande [585], COPA [586], OpenBookQA [568], ScienceQA [567], proScript [587], ProPara [588], ExplaGraphs [589], ProofWriter [590], EntailmentBank [591], ProOntoQA [592]
  – Symbolic Reasoning: CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [593], Parity [593], Colored Object [70], Penguins in a Table [70], Repeat Copy [436], Object Counting [436]
  – Mathematical Reasoning: MATH [362], GSM8k [198], SVAMP [594], MultiArith [595], ASDiv [505], MathQA [596], AQUA-RAT [597], MAWPS [598], DROP [599], NaturalProofs [600], PISA [601], miniF2F [602], ProofNet [603]

Advanced abilities:
• Human Alignment
  – Honesty: TruthfulQA [558], HaluEval [604]
  – Helpfulness: HH-RLHF [183]
  – Harmlessness: HH-RLHF [183], Crows-Pairs [605], WinoGender [606], RealToxicityPrompts [607]
• Interaction with External Environment
  – Household: VirtualHome [608], BEHAVIOR [609], ALFRED [610], ALFWorld [611]
  – Website Environment: WebShop [612], Mind2Web [613]
  – Open World: MineRL [614], MineDojo [615]
• Tool Manipulation
  – Search Engine: HotpotQA [581], TriviaQA [560], Natural Questions [556]
  – Code Executor: GSM8k [198], TabMWP [616], Date Understanding [70]
  – Calculator: GSM8k [198], MATH [362], CARP [617]
  – Model Interface: GPT4Tools [618], Gorilla [619]
  – Data Interface: WebQSP [620], MetaQA [621], WTQ [622], WikiSQL [623], TabFact [624], Spider [625]
as discussed below.

• Unreliable generation evaluation. With the advancement of language generation ability of LLMs, existing studies find that the generated texts from LLMs have reached a comparable quality to the reference texts on a variety of text generation tasks. However, due to the intrinsic weakness of existing evaluation benchmarks, there exists pronounced inconsistency between human evaluation and automatic reference-based metrics [630–632, 640]. For example, in OpenDialKG [553], ChatGPT underperforms a fine-tuned GPT-2 on BLEU and ROUGE-L metrics, while earning more favor from human judgment [640]. Furthermore, existing work argues that even human evaluation may not be robust enough [630, 631, 641, 642]. In some cases, it is difficult to achieve a high level of consensus among human annotators [631], and there is also a large gap between the annotation quality of crowdworkers and experts [641, 642]. Thus, how to conduct reliable evaluation for language generation tasks in the era of LLMs has become a fundamental yet challenging research topic. Recently, increasing research work proposes to leverage LLMs to improve the evaluation quality of the generated texts. Specially, LLMs can be used to improve the evaluation quality of existing metrics. For example, Para-Ref [643] augments various automatic metrics by leveraging LLMs to paraphrase existing references into semantically equivalent references with diverse expressions. Further, LLMs are widely employed as the evaluators of text generation in a reference-free manner, including evaluating a single prediction [633, 634, 644] or comparing several candidates [152, 645–647]. Nevertheless, LLMs may expose bias (e.g., order bias or preference for LLM-generated texts over human-written texts) as language generation evaluators, demonstrating disparities when compared to human evaluation [634, 648, 649].
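As an illustration of reference-free LLM-based evaluation with a simple mitigation for the order bias mentioned above, the sketch below queries an LLM judge twice with the two candidates in swapped order and only declares a winner when both orderings agree. The llm_judge callable and the judging template are illustrative placeholders rather than an API of any surveyed work.

# Sketch of pairwise LLM-as-evaluator with order swapping; placeholders throughout.
from typing import Callable

JUDGE_TEMPLATE = (
    "You are evaluating two candidate summaries for the same source text.\n"
    "Source: {source}\n\nCandidate 1: {a}\n\nCandidate 2: {b}\n\n"
    "Which candidate is better? Answer with 'first' or 'second'."
)

def compare(source: str, cand_a: str, cand_b: str, llm_judge: Callable[[str], str]) -> str:
    votes_for_a = 0
    # Ask twice, swapping the order of the candidates, and aggregate the two votes.
    if llm_judge(JUDGE_TEMPLATE.format(source=source, a=cand_a, b=cand_b)) == "first":
        votes_for_a += 1
    if llm_judge(JUDGE_TEMPLATE.format(source=source, a=cand_b, b=cand_a)) == "second":
        votes_for_a += 1
    return "A" if votes_for_a == 2 else "B" if votes_for_a == 0 else "tie"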
Unreliable Generation Evaluation

LLMs have become capable of generating texts of a quality comparable to human-written texts, which however might be underestimated by automatic reference-based metrics. As an alternative evaluation approach, LLMs can serve as language generation evaluators to evaluate a single text, compare multiple candidates, and improve existing metrics. However, this evaluation approach still needs more inspection and examination in real-world tasks.

• Underperforming specialized generation. Although LLMs have learned general language patterns to generate coherent text, their proficiency in generation might be constrained when dealing with a specialized domain or task. For instance, a language model that has been trained on general web articles may face challenges when generating a medical report that involves a large amount of medical jargon and specialized methods. Intuitively, domain knowledge should be critical for model specialization. However, it is not easy to inject such specialized knowledge into LLMs. As discussed in recent analyses [47, 650], when LLMs are trained to exhibit some specific ability that allows them to excel in some areas, they might struggle in others. Such an issue is related to catastrophic forgetting [651, 652] in training neural networks, which refers to the conflict phenomenon of integrating new and old knowledge. Similar cases also occur in the human alignment of LLMs, where an “alignment tax” [66] (e.g., a potential loss in the in-context learning ability) has to be paid for aligning to human values and needs. Moreover, due to the limitations of the sequence modeling architecture, LLMs still face challenges in the understanding and generation of structured data. Consequently, they often fall behind task-specific models on complex structured data tasks, such as knowledge-base question answering and semantic parsing [451, 653]. Therefore, it is important to develop effective model specialization methods that can flexibly adapt LLMs to various task scenarios, while retaining their original abilities as much as possible.

Underperforming Specialized Generation

LLMs may fall short in mastering generation tasks that require domain-specific knowledge or generating structured data. It is non-trivial to inject specialized knowledge into LLMs while maintaining their original abilities.

7.1.2 Knowledge Utilization
Knowledge utilization is an important ability of intelligent systems to accomplish knowledge-intensive tasks (e.g., commonsense question answering and fact completion) based on supporting factual evidence. Concretely, it requires LLMs to properly utilize the rich factual knowledge from the pre-training corpus or retrieve external data when necessary. In particular, question answering (QA) and knowledge completion have been two commonly used tasks for evaluating this ability. According to the test tasks (question answering or knowledge completion) and evaluation settings (with or without external resources), we categorize existing knowledge utilization tasks into three types, namely closed-book QA, open-book QA38, and knowledge completion.

Closed-Book QA. Closed-book QA tasks [654] test the factual knowledge that LLMs acquire from the pre-training corpus, where LLMs should answer the question only based on the given context without using external resources. For evaluating this ability, there are several datasets that can be leveraged, including Natural Questions [556], Web Questions [559], and TriviaQA [560], where the accuracy metric is widely adopted. Empirical results have revealed that LLMs can perform well in this setting and even match the performance of state-of-the-art open-domain QA systems [56]. Also, the performance of LLMs on closed-book QA tasks shows a scaling-law pattern in terms of both model size and data size: scaling the parameters and training tokens can increase the capacity of LLMs and help them learn (or memorize) more knowledge from the pre-training data [56]. Further, under a similar parameter scale, LLMs with more pre-training data relevant to the evaluated tasks would achieve better performance [81]. Also, the closed-book QA setting provides a testbed for probing the accuracy of the factual knowledge encoded by LLMs. However, as shown in existing work [55], LLMs might perform less well on QA tasks relying on fine-grained knowledge, even when that knowledge exists in the pre-training data.

Open-Book QA. Unlike closed-book QA, in open-book QA tasks, LLMs can extract useful evidence from an external knowledge base or document collection, and then answer the question based on the extracted evidence [655–658]. Typical open-book QA datasets (e.g., Natural Questions [556], OpenBookQA [568], and SQuAD [571]) overlap with closed-book QA datasets, but they incorporate external data sources, e.g., Wikipedia. The metrics of accuracy and F1 score are widely used in open-book QA tasks for evaluation. To select relevant knowledge from external resources, LLMs are often paired with a text retriever (or even a search engine), which is trained independently or jointly with the LLMs [81, 655, 659]. Also, previous work [660–662] has indicated that retrievers can assist LLMs in verifying and rectifying the reasoning path. In evaluation, existing studies mainly focus on testing how LLMs utilize the extracted knowledge to answer the question, and show that the retrieved evidence can largely improve the accuracy of the generated answers, even enabling a smaller LLM to outperform 10× larger ones [655, 659]. Further, open-book QA tasks can also be employed to evaluate the recency of knowledge information. Pre-training or retrieving from outdated knowledge resources may cause LLMs to generate incorrect answers for time-sensitive questions [655].

38. In this part, open-book QA refers to the QA tasks that require the model to extract and utilize useful information from external knowledge resources, as the antithesis of closed-book QA (which only uses the information encoded from the pre-training corpus). Note that there is also a dataset named OpenBookQA [568], which follows the setting of open-book QA tasks by extracting and utilizing external science facts.
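The retrieve-then-read setup described above for open-book QA can be sketched as follows, assuming a generic retriever and LLM behind the placeholder functions retrieve and llm_generate; the exact-match normalization only approximates the common practice of lowercasing and stripping punctuation and articles.

    import string

    def retrieve(question: str, k: int = 3) -> list[str]:
        """Placeholder for a text retriever / search engine returning the top-k passages."""
        raise NotImplementedError

    def llm_generate(prompt: str) -> str:
        """Placeholder for an LLM completion call."""
        raise NotImplementedError

    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

    def open_book_qa(question: str) -> str:
        passages = retrieve(question)
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        prompt = f"Answer the question using the evidence.\n{context}\nQuestion: {question}\nAnswer:"
        return llm_generate(prompt).strip()

    def exact_match(prediction: str, gold_answers: list[str]) -> bool:
        """Standard exact-match scoring against any of the gold answers."""
        return normalize(prediction) in {normalize(g) for g in gold_answers}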

Fig. 17: Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). (a) Intrinsic hallucination: given the input “Bob’s wife is Amy. Bob’s daughter is Cindy. Who is Cindy to Amy?”, the LLM answers “Cindy is Amy’s daughter-in-law.”, a judgment about the relationship between Cindy and Amy that contradicts the input. (b) Extrinsic hallucination: asked to “Explain RLHF for LLMs.”, the LLM replies that RLHF stands for “Rights, Limitations, Harms, and Freedoms”, showing an incorrect understanding of the meaning of RLHF (reinforcement learning from human feedback), though it can correctly understand the meaning of LLMs (in this context).

Knowledge Completion. In knowledge completion tasks, LLMs might be (to some extent) considered as a knowledge base [578], which can be leveraged to complete or predict the missing parts of knowledge units (e.g., knowledge triples). Such tasks can probe and evaluate how much and what kind of knowledge LLMs have learned from the pre-training data. Existing knowledge completion tasks can be roughly divided into knowledge graph completion tasks (e.g., FB15k-237 [574] and WN18RR [576]) and fact completion tasks (e.g., WikiFact [573]), which aim to complete triples from a knowledge graph and incomplete sentences about specific facts, respectively. Empirical studies have revealed that it is difficult for existing LLMs to accomplish knowledge completion tasks related to specific relation types [522]. As shown in the evaluation results on WikiFact, LLMs perform well on several frequent relations that occur in the pre-training data (e.g., currency and author), while not well on rare ones (e.g., discoverer_or_inventor and place_of_birth). Interestingly, under the same evaluation settings (e.g., in-context learning), InstructGPT (i.e., text-davinci-002) outperforms GPT-3 on all subsets of WikiFact.

Major Issues. Although LLMs have achieved key progress in capturing and utilizing knowledge information, they suffer from two major issues as discussed below.

• Hallucination. In generating factual texts, a challenging issue is hallucination [640, 663], where the generated information either conflicts with the existing source (intrinsic hallucination) or cannot be verified by the available source (extrinsic hallucination); both cases are illustrated by the two examples in Figure 17. Hallucination widely occurs in existing LLMs, even the most superior ones such as GPT-4 [46]. Furthermore, existing work shows that LLMs encounter difficulties in recognizing hallucinated content in text [604], even the powerful ChatGPT. Additionally, beyond language tasks, a recent study has shown that large vision-language models (LVLMs) also face challenges with hallucination, i.e., generating objects that are not present in the accompanying images [664]. In essence, LLMs seem to “unconsciously” utilize knowledge in task solving, and still lack the ability to accurately control the use of internal or external knowledge. Hallucinations would mislead LLMs into generating undesired outputs and mostly degrade the performance, leading to potential risks when deploying LLMs in real-world applications. To alleviate this problem, alignment tuning strategies (as discussed in Section 5.2) have been widely utilized in existing work [66], which rely on tuning LLMs on high-quality data or using human feedback. Moreover, the integration of external tools for the provision of credible information sources can help alleviate the hallucination issue [81, 604, 661]. Another line of research work leverages uncertainty estimation of LLMs to identify hallucinations [665, 666]. For instance, considering that hallucinated facts are prone to exhibit inconsistency across different sampled outputs, SelfCheckGPT [666] detects hallucination by measuring the information inconsistency within sampled outputs. For the evaluation of the hallucination problem, a set of hallucination detection tasks have been proposed, e.g., TruthfulQA [558] for detecting human falsehoods mimicked by models. More recently, HaluEval [604] creates a large-scale collection of LLM-generated and human-annotated hallucinated samples to evaluate the ability of language models to recognize hallucination in both task-specific and general scenarios.

Hallucination

LLMs are prone to generating untruthful information that either conflicts with the existing source or cannot be verified by the available source. Even the most powerful LLMs such as ChatGPT face great challenges in mitigating the hallucinations in their generated texts. This issue can be partially alleviated by special approaches such as alignment tuning and tool utilization.
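The sampling-based consistency idea behind SelfCheckGPT [666] can be sketched as follows: statements that are not supported by the model's own independently sampled outputs are flagged as likely hallucinations. The lexical-overlap support check here is a deliberately crude stand-in for the stronger consistency measures used in practice (e.g., NLI- or QA-based scoring), and llm_sample is a placeholder.

    def llm_sample(prompt: str, n: int) -> list[str]:
        """Placeholder: draw n stochastic samples (temperature > 0) from an LLM."""
        raise NotImplementedError

    def lexical_support(sentence: str, sample: str) -> float:
        """Crude proxy for consistency: fraction of the sentence's content words found in the sample."""
        words = {w.lower().strip(".,") for w in sentence.split() if len(w) > 3}
        if not words:
            return 1.0
        hits = sum(1 for w in words if w in sample.lower())
        return hits / len(words)

    def hallucination_scores(prompt: str, response_sentences: list[str], n_samples: int = 5) -> list[float]:
        """Higher score = less supported by the model's own samples = more likely hallucinated."""
        samples = llm_sample(prompt, n_samples)
        scores = []
        for sent in response_sentences:
            support = sum(lexical_support(sent, s) for s in samples) / len(samples)
            scores.append(1.0 - support)
        return scores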

• Knowledge recency. As another major challenge, LLMs would encounter difficulties when solving tasks that require the latest knowledge beyond the training data. To tackle this issue, a straightforward approach is to regularly update LLMs with new data. However, it is very costly to fine-tune LLMs, and incremental training is also likely to cause the catastrophic forgetting issue. Therefore, it is necessary to develop efficient and effective approaches that can integrate new knowledge into existing LLMs, keeping them up-to-date. Existing studies have explored how to utilize an external knowledge source (e.g., a search engine) to complement LLMs, which can be either jointly optimized with LLMs [655] or used as a plug-and-play module [661]. For instance, ChatGPT utilizes a retrieval plugin to access up-to-date information sources [667]. By incorporating the extracted relevant information into the context [668–670], LLMs can acquire new factual knowledge and perform better on relevant tasks. However, such an approach seems to remain at a superficial level. In addition, existing studies also explore editing the parameters of language models to update their intrinsic knowledge [671–673]. Nevertheless, previous work [674] has shown that several parameter editing methods do not perform well on LLMs, though they can improve the performance of small language models. Therefore, it is still difficult to directly amend intrinsic knowledge or inject specific knowledge into LLMs, which remains an open research problem [674]. Recently, a useful framework, EasyEdit [675], has been released to facilitate research on knowledge editing for LLMs.

Knowledge Recency

The parametric knowledge of LLMs is hard to update in a timely manner. Augmenting LLMs with external knowledge sources is a practical approach to tackling this issue. However, how to effectively update knowledge within LLMs remains an open research problem.

7.1.3 Complex Reasoning
Complex reasoning refers to the ability of understanding and utilizing supporting evidence or logic to derive conclusions or make decisions [51, 52]. According to the type of logic and evidence involved in the reasoning process, we consider dividing existing evaluation tasks into three major categories, namely knowledge reasoning, symbolic reasoning, and mathematical reasoning.

Knowledge Reasoning. Knowledge reasoning tasks rely on logical relations and evidence about factual knowledge to answer the given question. Existing work mainly uses specific datasets to evaluate the reasoning capacity over the corresponding type of knowledge, e.g., CSQA [506]/StrategyQA [199] for commonsense knowledge reasoning and ScienceQA [567] for science knowledge reasoning. In addition to the accuracy of the predicted results, existing work [567] has also evaluated the quality of the generated reasoning process, via automatic metrics (e.g., BLEU) or human evaluation. Typically, these tasks require LLMs to perform step-by-step reasoning based on factual knowledge until reaching the answer to the given question. To elicit the step-by-step reasoning ability, the chain-of-thought (CoT) prompting strategy [33] has been proposed for enhancing the complex reasoning capacity of LLMs. As discussed in Section 6.3, CoT incorporates intermediate reasoning steps, which can be manually created [33] or automatically generated [676], into the prompts to guide LLMs to perform multi-step reasoning. Such a way largely improves the reasoning performance of LLMs, leading to new state-of-the-art results on several complex knowledge reasoning tasks [33, 56, 528]. Further, after reformulating knowledge reasoning tasks into code generation tasks, researchers have found that the performance of LLMs can be further improved [226], especially with LLMs pre-trained on code. However, due to the complexity of knowledge reasoning tasks, the performance of current LLMs still lags behind human results on tasks such as commonsense reasoning [33, 56, 677]. As a common type of mistake, LLMs might generate inaccurate intermediate steps, leading to a wrong final result. To address this issue, existing work has proposed special decoding or ensemble strategies to improve the accuracy of the whole reasoning chain [429, 430].

Symbolic Reasoning39. Symbolic reasoning tasks mainly focus on manipulating symbols in a formal rule setting to fulfill some specific goal [51], where the operations and rules may have never been seen by LLMs during pre-training. Existing work [33, 432, 507] commonly evaluates LLMs on the tasks of last letter concatenation and coin flip, where the evaluation examples require either the same number of reasoning steps as the in-context examples (called the in-domain test) or more steps (called the out-of-domain test). As an example of the out-of-domain test, LLMs may only see in-context examples with two words, but they are required to concatenate the last letters of three or more words. Typically, the accuracy of the generated symbols is adopted to evaluate the performance of LLMs on these tasks. Thus, LLMs need to understand the semantic relations among the symbolic operations and their composition in complex scenarios. However, under the out-of-domain setting, as LLMs have not seen such complex compositions of symbolic operations and rules (e.g., twice the number of operations in the context examples), it is hard for LLMs to capture their accurate meanings. To solve this issue, existing studies incorporate scratchpad [593, 678] and tutor [679] strategies to help LLMs better manipulate symbolic operations and generate longer and more complex reasoning processes. Another line of research work utilizes formal programming languages to represent the symbolic operations and rules, which requires LLMs to generate code and perform the reasoning process by executing it with external interpreters. Such a way can decompose the complex reasoning process into code synthesis and program execution for LLMs and interpreters, respectively, leading to a simplified reasoning process with yet more accurate results [436].

39. Following [33], we mainly discuss symbolic reasoning tasks specially designed for evaluating LLMs. We do not consider symbolic reasoning methods in traditional NLP tasks, such as deducing logical rules from the knowledge graphs in KBQA.
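As an illustration of how in-domain and out-of-domain test cases for the last letter concatenation task can be constructed, the following is a small sketch; the word list and the two-word in-domain setting are arbitrary choices for illustration.

    import random

    WORDS = ["apple", "banana", "cherry", "grape", "lemon", "mango", "olive", "peach"]

    def make_example(num_words: int) -> tuple[str, str]:
        words = random.sample(WORDS, num_words)
        question = f"Take the last letters of the words in \"{' '.join(words)}\" and concatenate them."
        answer = "".join(w[-1] for w in words)   # gold symbol sequence
        return question, answer

    # In-domain test: the same number of words (two) as the in-context exemplars;
    # out-of-domain test: more words than were seen in context (e.g., four).
    in_domain = [make_example(2) for _ in range(10)]
    out_of_domain = [make_example(4) for _ in range(10)]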
Mathematical Reasoning. Mathematical reasoning tasks need to comprehensively utilize mathematical knowledge, logic, and computation to solve problems or generate proof statements. Existing mathematical reasoning tasks can be mainly categorized into math problem solving and automated theorem proving. For math problem solving tasks, the SVAMP [594], GSM8k [198] and MATH [362] datasets are commonly used for evaluation, where LLMs need to generate accurate concrete numbers or equations to answer the mathematical problem. As these tasks also require multi-step reasoning, the CoT prompting strategy has been widely adopted for LLMs to improve the reasoning performance [33].
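The following is a minimal sketch of assembling a few-shot chain-of-thought prompt for GSM8k-style math problems and parsing the final number from the model output; the exemplars and the "The answer is" convention are illustrative, and llm_generate is a placeholder.

    import re

    def llm_generate(prompt: str) -> str:
        """Placeholder for an LLM completion call."""
        raise NotImplementedError

    EXEMPLARS = [
        ("Tom has 3 boxes with 4 pens each. How many pens does he have?",
         "Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
        ("A shirt costs 20 dollars and is discounted by 5 dollars. What is the price?",
         "The discount lowers the price to 20 - 5 = 15. The answer is 15."),
        ("There are 8 birds and 3 fly away. How many remain?",
         "8 - 3 = 5 birds remain. The answer is 5."),
    ]

    def solve_with_cot(question: str) -> float | None:
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
        prompt = f"{shots}\n\nQ: {question}\nA: Let's think step by step."
        output = llm_generate(prompt)
        # Extract the final numeric answer after the "The answer is" marker, if present.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", output.split("The answer is")[-1])
        return float(numbers[0]) if numbers else None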

As another practical strategy, continually pre-training LLMs on large-scale mathematical corpora can largely boost their performance on mathematical reasoning tasks [35, 218, 680]. Further, since math problems in different languages share the same mathematical logic, researchers have also proposed a multilingual math word problem benchmark [526] to evaluate the multilingual mathematical reasoning capacity of LLMs. As another challenging task, automated theorem proving (ATP) [600, 602, 681] requires the reasoning model to strictly follow the reasoning logic and mathematical skills. To evaluate the performance on this task, PISA [601] and miniF2F [602] are two typical ATP datasets with the proof success rate as the evaluation metric. As a typical approach, existing work on ATP utilizes LLMs to aid the search for proofs using an interactive theorem prover (ITP), such as Lean, Metamath, and Isabelle [682–684]. A major limitation of ATP research is the lack of related corpora in formal languages. To tackle it, several studies utilize LLMs to convert informal statements into formal proofs for augmenting new data [685], or to generate drafts and proof sketches to reduce the search space of the proofs [686].

Major Issues. In spite of the advancements, LLMs still have several limitations in solving complex reasoning tasks.

• Reasoning inconsistency. With improved reasoning strategies (e.g., CoT prompting), LLMs can solve some complex reasoning tasks by performing step-by-step reasoning based on the supporting logic and evidence. Despite the effectiveness, the reasoning inconsistency issue often occurs in the decomposed reasoning process. Concretely, LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process [33, 435], leading to inconsistency between the derived answer and the reasoning process. To alleviate this problem, existing work has proposed to guide the whole generation process of LLMs via external tools or models [430, 444, 638], to re-check the reasoning process and final answer for correcting potential errors [687–689], or to fine-tune LLMs with process-based feedback [690, 691]. For instance, Tree of Thoughts (ToT) [444] empowers LLMs to engage in the decision-making process by concurrently exploring and self-evaluating various reasoning paths. To refine the reasoning process, Self-Refine [687] elicits feedback from LLMs on self-generated solutions, enabling the iterative refinement of solutions based on the feedback. Moreover, several studies improve the consistency in the reasoning chain of LLMs through the integration of process-based supervision during training [690, 691]. As a promising solution, recent approaches reformulate complex reasoning tasks into code generation tasks, where the strict execution of the generated code ensures the consistency between the reasoning process and the outcome. Also, it has been revealed that there might exist inconsistency between tasks with similar inputs, where small changes in the task description may cause the model to produce different results [49, 594]. To mitigate this problem, self-consistency [429] adopts an ensemble of multiple reasoning paths to enhance the decoding process of LLMs.

Reasoning Inconsistency

LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process, leading to inconsistency between the derived answer and the reasoning process. The issue can be alleviated by fine-tuning LLMs with process-level feedback, using an ensemble of diverse reasoning paths, and refining the reasoning process with self-reflection or external feedback.
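A minimal sketch of the self-consistency strategy mentioned above: sample several reasoning paths and take a majority vote over the extracted final answers. Both llm_sample and the answer-extraction rule are placeholders.

    from collections import Counter

    def llm_sample(prompt: str, n: int) -> list[str]:
        """Placeholder: n stochastic chain-of-thought completions for the same prompt."""
        raise NotImplementedError

    def extract_answer(completion: str) -> str:
        """Placeholder rule: take whatever follows the last 'The answer is' marker."""
        return completion.rsplit("The answer is", 1)[-1].strip().rstrip(".")

    def self_consistent_answer(prompt: str, n_paths: int = 10) -> str:
        answers = [extract_answer(c) for c in llm_sample(prompt, n_paths)]
        return Counter(answers).most_common(1)[0][0]   # majority vote over reasoning paths

Using an odd (or simply larger) number of sampled paths reduces the chance of ties and tends to stabilize the voted answer.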
• Numerical computation. For complex reasoning tasks, LLMs still face difficulties in the involved numerical computation, especially for symbols that are seldom encountered during pre-training, such as arithmetic with large numbers [49, 679, 692]. To tackle this issue, a direct way is to tune LLMs on synthesized arithmetic problems [359, 693]. Also, a number of studies improve the numerical computation performance by tracing intermediate calculation steps in the training and inference stages [359, 678, 694], e.g., scratchpad tracing. In addition, existing work [80] has also incorporated external tools (e.g., a calculator), especially for handling arithmetic operations. More recently, ChatGPT has provided a plugin mechanism to use external tools [667]. In this way, LLMs need to learn how to properly manipulate the tools. For this purpose, researchers have augmented the examples using tools (even the LLM itself) for tuning the LLM [80, 695], or devised instructions and exemplars for in-context learning [436]. In addition to the aid of external tools, recent studies find that tokenizing digits into individual tokens (e.g., as in the LLaMA and Galactica tokenizers) is a useful approach to enhancing the inherent arithmetic ability of LLMs [359, 692]. One possible explanation is that subword tokenization techniques can result in inconsistent sequences when tokenizing numbers. For instance, with a subword tokenizer the integer 7481 may be tokenized as 7 481, while 74815 may be tokenized as 748 15 (the same numerical substrings with different splits) [359]. As a comparison, digit-based tokenization for numbers can avoid such inconsistency, thus likely improving the numerical computation ability of LLMs.
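The following small sketch contrasts the inconsistent subword splits described above with digit-level tokenization; the subword splits are simply the ones quoted from [359] rather than the output of an actual LLaMA or Galactica tokenizer.

    def digit_tokenize(number: str) -> list[str]:
        """Digit-level tokenization: every number maps to one token per digit, so overlapping
        numbers always share the tokens of their common digits."""
        return list(number)

    # Splits of the kind reported for subword tokenizers [359]: the same digit substring
    # can be segmented differently depending on the surrounding digits.
    subword_splits = {
        "7481":  ["7", "481"],
        "74815": ["748", "15"],
    }

    for number in ("7481", "74815"):
        print(number, "subword:", subword_splits[number], "digit-level:", digit_tokenize(number))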

Numerical Computation

LLMs face difficulties in numerical computation, especially for symbols that are seldom encountered during pre-training. In addition to using mathematical tools, tokenizing digits into individual tokens is also an effective design choice for improving the arithmetic ability of LLMs.

7.2 Advanced Ability
In addition to the above basic evaluation tasks, LLMs also exhibit some superior abilities that require special considerations for evaluation. In this part, we discuss several representative advanced abilities and the corresponding evaluation approaches, including human alignment, interaction with the external environment, and tool manipulation. Next, we discuss these advanced abilities in detail.

7.2.1 Human Alignment
It is desired that LLMs could well conform to human values and needs, i.e., human alignment, which is a key ability for the broad use of LLMs in real-world applications. To evaluate this ability, existing studies consider multiple criteria for human alignment, such as helpfulness, honesty, and safety [46, 183, 366]. For helpfulness and honesty, adversarial question answering tasks (e.g., TruthfulQA [558]) can be utilized to examine an LLM's ability to detect possible falsehoods in text [46, 81]. Furthermore, harmlessness can also be evaluated with several existing benchmarks, e.g., CrowS-Pairs [605] and Winogender [606]. Despite the automatic evaluation with the above datasets, human evaluation is still a more direct way to effectively test the human alignment ability of LLMs. OpenAI invites many experts in domains related to AI risks to evaluate and improve the behaviors of GPT-4 when encountering risky content [46]. In addition, for other aspects of human alignment (e.g., truthfulness), several studies propose to use specific instructions and devise annotation rules to guide the annotation process [81]. Empirical studies have revealed that these strategies can greatly improve the human alignment ability of LLMs [183]. For instance, after alignment tuning on data collected through interactions with experts, the incorrect behavior rate of GPT-4 can be largely reduced when it deals with sensitive or disallowed prompts. In addition, high-quality pre-training data can reduce the effort required for alignment [46]. For instance, Galactica is potentially more harmless due to the less biased content in the scientific corpus [35].

7.2.2 Interaction with External Environment
In addition to standard evaluation tasks, LLMs have the ability to receive feedback from the external environment and perform actions according to behavior instructions, e.g., generating action plans in natural language to manipulate agents [696, 697]. Such an ability is also emergent in LLMs, which can generate detailed and highly realistic action plans, while smaller models (e.g., GPT-2) tend to generate shorter or meaningless plans [696].
To test this ability, several embodied AI environments and benchmarks can be used for evaluation, described as follows. VirtualHome [608] builds a 3D simulator for household tasks such as cleaning and cooking, in which the agent can execute natural language actions generated by LLMs. ALFRED [610] includes more challenging tasks that require LLMs to accomplish compositional targets. BEHAVIOR [609] focuses on everyday chores in simulation environments and requires LLMs to generate complex solutions, e.g., changing the internal status of objects. Apart from restricted environments such as household tasks, a line of research work investigates the proficiency of LLM-based agents in exploring open-world environments, such as Minecraft and the Internet [698, 699]. Voyager [699] introduces an automatic curriculum module that enables LLMs to continuously acquire new skills based on feedback from the environment. GITM [698] focuses on solving various challenges in Minecraft based on LLMs, through task decomposition, planning, and invocation of interfaces. Based on the generated action plans or task completions, existing work either adopts regular metrics (e.g., executability and correctness of the generated action plans) [696] in the benchmark or directly conducts real-world experiments and measures the success rate [700], to evaluate such an ability. It has been shown that LLMs are capable of interacting with the external environment and generating accurate action plans [701]. Recently, several improvement methods have been proposed to enhance the interaction ability of LLMs, e.g., designing code-like prompts [532] and providing real-world grounding [700].
In addition, recent work also explores multi-agent collaboration based on LLMs in simulated environments [535, 702, 703]. These studies simulate human social behaviors by instantiating multiple LLM-based agents with observations, planning, and memories in a sandbox environment. In controlled evaluation, the abilities of generative agents to search, plan, and think are evaluated by humans in an interview-like manner. Further, they also conduct descriptive measurements on multiple agents within a simulated environment to examine emergent social behaviors.

7.2.3 Tool Manipulation
When solving complex problems, LLMs can turn to external tools if they determine it is necessary. By encapsulating available tools with API calls, existing work has involved a variety of external tools, e.g., a search engine [81], calculator [80], and compiler [436], to enhance the performance of LLMs on several specific tasks. Recently, OpenAI has supported the use of plugins in ChatGPT [667], which can equip LLMs with broader capacities beyond language modeling. For example, the web browser plugin enables ChatGPT to access fresh information. Further, incorporating third-party plugins is particularly key for creating a prosperous ecosystem of applications based on LLMs.
To examine the ability of tool manipulation, existing work mostly adopts complex reasoning tasks for evaluation, such as mathematical problem solving (e.g., GSM8k [198] and SVAMP [594]) or knowledge question answering (e.g., TruthfulQA [558]), where the successful utilization of tools is very important for compensating for the skills that LLMs lack (e.g., numerical calculation). In this way, the evaluated performance on these tasks can reflect the ability of LLMs in tool manipulation. To teach LLMs to utilize tools, existing studies add exemplars of tool use in context to elicit LLMs [436], or fine-tune LLMs on simulated data about tool utilization [80, 695]. It has been found that with the help of tools, LLMs become more capable of handling the issues that they are not good at, e.g., equation calculation and answering timely questions [80, 441]. However, as the number of available tools increases, the limited context length of LLMs may pose challenges in describing and demonstrating extensive tool APIs. To address this issue, existing work retrieves the usage of relevant tools, or encodes tool information as tokens within the embedding space [704–706].
In addition to existing tools developed by humans, LLMs possess the capability to make their own tools for specific tasks autonomously [707]. This enables the models to independently explore and manipulate these self-created tools, thereby expanding their potential for autonomous exploration in solving a wide range of real-world tasks.
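To illustrate how a tool can be encapsulated behind a simple call convention, the following is a minimal sketch of a calculator tool that the LLM may invoke from its response; the CALC[...] convention, the prompt, and llm_generate are illustrative placeholders rather than any specific plugin API.

    import ast
    import operator
    import re

    def llm_generate(prompt: str) -> str:
        """Placeholder for an LLM completion call."""
        raise NotImplementedError

    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculator(expression: str) -> float:
        """Tool implementation: safely evaluate a basic arithmetic expression."""
        def ev(node):
            if isinstance(node, ast.BinOp):
                return _OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
                return -ev(node.operand)
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expression, mode="eval").body)

    def answer_with_tool(question: str) -> str:
        prompt = (
            "You may call a calculator by writing CALC[expression]. "
            "Use it for any arithmetic, then give the final answer.\n"
            f"Question: {question}\nResponse:"
        )
        draft = llm_generate(prompt)
        # Replace every tool call in the draft with the tool's result before returning it.
        return re.sub(r"CALC\[(.+?)\]", lambda m: str(calculator(m.group(1))), draft)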

Summary. The above three abilities are of great value to the practical performance of LLMs: conforming to human values and preferences (human alignment), acting properly in real-world scenarios (interaction with the external environment), and expanding the ability scope (tool manipulation). In addition to the above three advanced abilities, LLMs might also show other abilities that are specially related to some tasks (e.g., data annotation [488]) or learning mechanisms (e.g., self-improvement [708]). It will be an open direction to discover, measure, and evaluate these newly emerging abilities, so as to better utilize and improve LLMs.

7.3 Benchmarks and Evaluation Approaches
In the above, we have discussed the basic and advanced abilities of LLMs. Next, we will introduce existing evaluation benchmarks and approaches [735, 736].

7.3.1 Comprehensive Evaluation Benchmarks
Recently, several comprehensive benchmarks [70, 362, 522] have been released for the evaluation of LLMs. In this part, we introduce several widely used benchmarks, i.e., MMLU, BIG-bench, HELM, and a series of human exam benchmarks.
• MMLU [362] is a versatile benchmark for large-scale evaluation of multi-task knowledge understanding, covering a wide range of knowledge domains from mathematics and computer science to humanities and social sciences. The difficulties of these tasks vary from basic to advanced. As shown in existing work, LLMs mostly outperform small models by a substantial margin on this benchmark [35, 56, 57, 69], which shows the scaling law in model size. More recently, GPT-4 achieves a remarkable record (86.4% in the 5-shot setting) on MMLU, which is significantly better than the previous state-of-the-art models [46].
• BIG-bench [70] is a collaborative benchmark intended to probe existing LLMs from various aspects. It comprises 204 tasks that encompass a broad range of topics, including linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and so on. By scaling the model size, LLMs can even outperform the average human performance under the few-shot setting on 65% of the tasks in BIG-bench [56]. Considering the high evaluation cost of the entire benchmark, a lightweight benchmark, BIG-bench-Lite, has been proposed, which contains 24 small yet diverse and challenging tasks from BIG-bench. Additionally, the BIG-bench hard (BBH) benchmark [363] has been proposed to concentrate on investigating the currently unsolvable tasks of LLMs by selecting the challenging tasks on which LLMs exhibit inferior performance compared to humans. Since BBH is more difficult, small models mostly achieve performance close to random. As a comparison, CoT prompting can elicit the abilities of LLMs to perform step-by-step reasoning for enhancing the performance, even exceeding the average human performance on BBH.
• HELM [522] is a comprehensive benchmark that currently implements a core set of 16 scenarios and 7 categories of metrics. It is built on top of many prior studies, conducting a holistic evaluation of language models. As shown in the experimental results of HELM, instruction tuning can consistently boost the performance of LLMs in terms of accuracy, robustness, and fairness. Further, for reasoning tasks, the LLMs that have been pre-trained on code corpora show superior performance.
• Human-level test benchmarks aim to evaluate the comprehensive ability of LLMs with questions designed for testing humans, such as AGIEval [710], MMCU [711], M3KE [712], C-Eval [713], and Xiezhi [714]. These benchmarks encompass a wide range of domains, difficulty levels, and languages to provide a comprehensive evaluation of LLMs' general capabilities. Models offering API services (e.g., GPT-4, ChatGPT, Claude) demonstrate superior performance compared to publicly available models on these evaluation benchmarks. As the best-performing model in the evaluations, GPT-4 surpasses average human performance on AGIEval [710]. However, it still lags behind the top human performance on these challenging benchmarks. Hence, there remains ample room for further enhancements in the overall abilities of LLMs, particularly for publicly accessible models.
The above benchmarks cover a variety of mainstream evaluation tasks and real-world human exam questions for the evaluation of LLMs. Also, there are several benchmarks that focus on evaluating specific abilities of LLMs, such as TyDiQA [737] for multilingual knowledge utilization and MGSM [526] for multilingual mathematical reasoning. To conduct an evaluation, one can select suitable benchmarks according to the specific goals. In addition, there are also several open-source evaluation frameworks for researchers to evaluate LLMs on existing benchmarks or extend new tasks for customized evaluation, such as Language Model Evaluation Harness [738] and OpenAI Evals [46]. Further, some researchers also construct continuously updated leaderboards by aggregating representative benchmarks to compare the performance of existing LLMs, such as the Open LLM Leaderboard [709]. The above benchmarks and leaderboards provide important references for demonstrating the basic and advanced abilities of LLMs. We will give a deeper discussion of the pros and cons of different evaluation approaches in Section 7.3.2.

7.3.2 Evaluation Approaches
After introducing existing benchmarks, in this part, we will review existing evaluation approaches for assessing the performance of LLMs. To organize our discussion, we categorize LLMs into three different types: base LLMs (pre-trained model checkpoints), fine-tuned LLMs (instruction or alignment fine-tuned model checkpoints), and specialized LLMs (adapted model checkpoints for some specific task or domain). Here, we keep both fine-tuned LLMs and specialized LLMs, to distinguish the different purposes of LLMs: general or specific task solvers. To evaluate the three types of LLMs, we can test the LLM's performance related to different abilities (e.g., basic or advanced abilities as discussed in Sections 7.1 and 7.2). In general, there are three main approaches to evaluating LLMs, namely the benchmark-based approach [362], the human-based approach [729], and the model-based approach [731]. Table 15 shows an illustration of the relationship among LLM type, evaluation approach, and tested abilities.

TABLE 15: A categorization of existing evaluation work. “General” denotes that the evaluation focuses on the overall performance
of multiple abilities. The evaluated abilities are not limited to the representative basic and advanced abilities mentioned in
Section 7.1 and 7.2.

Method Evaluation Model Types Abilities/Domain Data Source


Benchmark:
MMLU [362] Base/Fine-tuned/Specialized General Human exam/practice
BIG-bench [70] Base/Fine-tuned/Specialized General Human annotation
HELM [522] Base/Fine-tuned/Specialized General Benchmark collection
Open LLM Leaderboard [709] Base/Fine-tuned/Specialized General Benchmark collection
AGIEval [710] Base/Fine-tuned/Specialized General Human exam/practice
MMCU [711] Base/Fine-tuned/Specialized General Human exam/practice
M3KE [712] Base/Fine-tuned/Specialized General Human exam/practice
C-Eval [713] Base/Fine-tuned/Specialized General Human exam/practice
Xiezhi [714] Base/Fine-tuned/Specialized General Human exam/practice
OpenCompass [715] Base/Fine-tuned/Specialized General Benchmark collection
Chain-of-Thought Hub [716] Base/Fine-tuned General Benchmark collection
KoLA [717] Base/Fine-tuned Knowledge utilization Web
ARB [718] Fine-tuned Complex reasoning Human exam/practice
APIBench [719] Base/Fine-tuned Tool manipulation Web
APIBank [720] Fine-tuned Tool manipulation Synthesis
ToolAlpaca [721] Base/Fine-tuned Tool manipulation Synthesis
T-Bench [722] Fine-tuned Tool manipulation Synthesis
ToolBench [723] Fine-tuned Tool manipulation Synthesis
BOLAA [724] Base/Fine-tuned Environment interaction Benchmark collection
AgentBench [725] Base/Fine-tuned Environment interaction Human annotation/Synthesis
HaluEval [604] Base/Fine-tuned Human alignment Human annotation/Synthesis
PromptBench [726] Base/Fine-tuned Robustness Benchmark collection
HumanEval [105] Base/Fine-tuned/Specialized Code synthesis Human annotation
MultiMedQA [354] Specialized Healthcare Benchmark collection
FLUE [727] Specialized Finance Benchmark collection
LegalBench [728] Specialized Legal Human annotation
Human:
Chatbot Arena [729] Base/Fine-tuned/Specialized Human Alignment Human annotation
SciBench [730] Fine-tuned Complex reasoning Human exam/practice
Model:
AlpacaEval [731] Fine-tuned Instruction following Synthesis
MT-bench [729] Fine-tuned Human alignment Human annotation
TrustGPT [732] Base/Fine-tuned Human alignment Benchmark collection
LMExamQA [733] Base/Fine-tuned Knowledge utilization Synthesis
ChatEval [734] Base/Fine-tuned Knowledge utilization Benchmark collection

Next, we will discuss the evaluation approaches for different types of LLMs.

Evaluation of Base LLMs. Base LLMs refer to the model checkpoints obtained right after pre-training. For base LLMs, we mainly focus on examining the basic abilities (Section 7.1), such as complex reasoning and knowledge utilization. Since most of these basic abilities can be assessed with well-defined tasks, benchmark-based approaches have been widely used to evaluate base LLMs. Next, we will introduce common evaluation benchmarks and evaluation procedures for base LLMs.
• Common benchmarks. To evaluate base LLMs, typical benchmarks are designed in the form of closed-ended problems like multiple-choice questions. These commonly used benchmarks can be mainly divided into two categories: knowledge-oriented and reasoning-oriented benchmarks. Knowledge-oriented benchmarks (e.g., MMLU [362] and C-Eval [713]) aim to evaluate the capacity of world knowledge, while reasoning-oriented benchmarks (e.g., GSM8K [645], BBH [363], and MATH [362]) focus on evaluating the capability of solving complex reasoning tasks. Further, some recently proposed benchmarks (e.g., OpenCompass [715]) combine these two types for a comprehensive comparison.
• Benchmark-based evaluation procedure. To perform the benchmark evaluation, each problem will first be formatted into a prompt for LLMs to generate the result text. Then, the generated result text will be parsed with human-written rules to get the predicted answer. Finally, the performance of LLMs can be automatically calculated using standard metrics like accuracy by comparing the predicted answer with the ground-truth one. The evaluation can be conducted in either the few-shot or zero-shot setting, which might lead to different evaluation results or rankings. Since base LLMs have not been instruction fine-tuned (and thus have relatively weak task generalization ability), the few-shot setting is often more suitable for evaluation. For some complex reasoning tasks, CoT prompts also need to be used to fully exhibit the capacity during evaluation. Another note is that this evaluation approach can also be applied to assess the abilities of fine-tuned LLMs. Actually, several leaderboards (e.g., the Open LLM Leaderboard [709]) are built upon this approach, evaluating both base and fine-tuned LLMs.
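The benchmark-based procedure just described can be sketched for multiple-choice problems as follows: format each problem into a prompt, generate, parse the option letter with a human-written rule, and compute accuracy against the ground truth. The prompt template and parsing rule are illustrative, and llm_generate is a placeholder.

    import re

    def llm_generate(prompt: str) -> str:
        """Placeholder for an LLM completion call."""
        raise NotImplementedError

    def format_prompt(question: str, options: list[str]) -> str:
        letters = "ABCD"
        lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
        return f"Question: {question}\n" + "\n".join(lines) + "\nAnswer:"

    def parse_choice(generated: str) -> str | None:
        match = re.search(r"\b([ABCD])\b", generated)   # human-written parsing rule
        return match.group(1) if match else None

    def evaluate(dataset: list[dict]) -> float:
        """dataset items: {'question': str, 'options': [str, ...], 'answer': 'A'|'B'|'C'|'D'}"""
        correct = 0
        for item in dataset:
            prediction = parse_choice(llm_generate(format_prompt(item["question"], item["options"])))
            correct += int(prediction == item["answer"])
        return correct / len(dataset)

A few-shot variant simply prepends solved exemplars (in the same template) to each prompt, which is why the reported scores can shift with the chosen evaluation setting.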

Evaluation of Fine-tuned LLMs. Fine-tuned LLMs in this part refer to the model checkpoints obtained after instruction tuning or alignment tuning based on pre-trained model weights40. Typically, fine-tuned LLMs will be tested on various abilities (e.g., knowledge utilization and human alignment), and thus it is common that they are assessed with multiple evaluation approaches. In addition to benchmark-based evaluation, human-based and model-based approaches have also been widely used to evaluate the advanced abilities of fine-tuned LLMs. Next, we will introduce these two evaluation methods.

40. In some cases, they are also called chat models.

• Human-based evaluation. Unlike automatic evaluation for basic abilities, human evaluation typically considers more factors or abilities in real-world use, such as human alignment and tool manipulation. In this evaluation approach, test tasks are usually in the form of open-ended questions, and human evaluators are invited to make judgments on the quality of the answers generated by LLMs. Typically, there are two main types of scoring methods for human evaluators: pairwise comparison and single-answer grading. In pairwise comparison, given the same question, humans are assigned two answers from different models to determine which one is better, while in single-answer grading, they only need to score a single answer at a time. For example, HELM [522] employs humans to perform single-answer grading on summarization and disinformation tasks, while Chatbot Arena [729] constructs a crowdsourcing platform that allows users to engage in conversations with two anonymous chat LLMs and report pairwise comparison results.
• Model-based evaluation. Since human-based evaluation is both expensive and time-consuming, some work has proposed leveraging powerful closed-source LLMs such as ChatGPT and GPT-4 as surrogates for human evaluators [729, 731]. For example, AlpacaEval [731] collects a set of instructions and utilizes a capable LLM (e.g., GPT-4) as the judge to perform pairwise comparisons against the reference outputs. Furthermore, MT-bench [729] collects a set of multi-turn questions for evaluation and improves the reliability of LLM-based evaluators through methods like ICL and CoT. Compared with human evaluators, LLMs such as ChatGPT and GPT-4 can achieve high agreement with humans, in both small-scale handcrafted and large-scale crowdsourced evaluation tasks. Despite this, these closed-source LLMs are limited in access and carry the potential risk of data leakage. To address this, recent work [729] has explored fine-tuning open-source LLMs (e.g., Vicuna [152]) as model evaluators using scoring data from human evaluators, which has narrowed the gap with powerful closed-source LLMs (e.g., GPT-4).

Evaluation of Specialized LLMs. Specialized LLMs refer to the model checkpoints specially adapted to some domains or applications like healthcare [354] and finance [739]. As special task solvers, specialized LLMs will be tested not only on general abilities (e.g., basic abilities like complex reasoning and advanced abilities like human alignment), but also on specific abilities related to their designated domains or applications. For this purpose, one often needs to construct specific benchmarks tailored for the target domains or applications. Then, these domain-specific benchmarks can be combined with general benchmarks to conduct both comprehensive and targeted evaluation for specialized LLMs. For example, MultiMedQA [354] is a specific benchmark in healthcare, which includes medical examinations and healthcare questions. In this work [354], MultiMedQA has been combined with MMLU [362] to assess the performance of specialized LLMs for healthcare, such as Med-PaLM [354]. Similarly, FLUE [739] constructs a benchmark for finance, spanning from financial sentiment analysis to question answering. It has been used collaboratively with BBH [363] to evaluate financial LLMs like BloombergGPT [358].

Pros and Cons of Different Evaluation Approaches. In the above, we have discussed different evaluation approaches for assessing the abilities of LLMs. Next, we briefly analyze the pros and cons of each evaluation approach.
• Benchmark-based approach. This evaluation approach can leverage existing benchmarks for assessing the performance of LLMs. The tasks involved in these benchmarks often contain sufficient test samples to measure the core abilities (e.g., reasoning). The whole evaluation procedure can be (almost) automatic, and it is convenient to carry out test experiments for various base LLMs, which is especially useful for monitoring the performance of model checkpoints during pre-training. However, LLMs are often sensitive to the evaluation settings, including the question prompts, zero-shot or few-shot tests, and the answer parsing methods. Thus, one should take possible influencing factors into consideration when conducting evaluation experiments, and the evaluation results should be reported along with the adopted evaluation settings. Another issue is data contamination [56, 740], i.e., the test data itself or relevant content has been contained in the pre-training corpora. This phenomenon has become increasingly severe since more and more open data has been collected for developing LLMs.
• Human-based approach. Human evaluation offers several advantages when assessing the capabilities of LLMs to solve real-world tasks. One of the key benefits is its ability to directly reflect the actual abilities of LLMs. Based on feedback and experiences from real users, human evaluation provides a more direct measure of LLMs' performance in real-world scenarios. Further, it can support more flexible and diverse evaluation tasks based on human evaluators. For instance, users can submit various queries and test the abilities of LLMs according to their own task cognition. It allows for a deep understanding of the strengths and weaknesses of LLMs across different types of tasks and contexts. However, human evaluation also has inherent limitations that could potentially affect its accuracy and consistency. Factors such as personalized tastes and varying education levels among evaluators can introduce biases or even inconsistencies into the evaluation process. In some cases, users' judgments are likely to be subjective, which may not reflect the true capabilities of the LLMs. Moreover, conducting robust and reliable human evaluations often requires a large number of evaluators, which can be very expensive and time-consuming. In addition, human evaluation is often not reproducible, making it infeasible to extend existing evaluation results or track the progress of LLMs.
• Model-based approach. As a surrogate for human-based approaches, model-based approaches serve to diminish the reliance on human involvement and enable more efficient and scalable evaluation. In addition, LLMs can provide meaningful explanations for the assigned rating scores, thereby enhancing the interpretability of evaluations.
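A minimal sketch of model-based pairwise comparison with an LLM judge, querying both presentation orders and only accepting consistent verdicts, a simple safeguard related to the position bias discussed next; the judge prompt and llm_generate are illustrative placeholders.

    def llm_generate(prompt: str) -> str:
        """Placeholder for a strong judge LLM."""
        raise NotImplementedError

    def judge_once(question: str, answer_1: str, answer_2: str) -> str:
        prompt = (
            "You are an impartial judge. Compare the two answers to the question and "
            "reply with exactly '1', '2', or 'tie'.\n"
            f"Question: {question}\nAnswer 1: {answer_1}\nAnswer 2: {answer_2}\nVerdict:"
        )
        verdict = llm_generate(prompt).strip().lower()
        return verdict if verdict in {"1", "2", "tie"} else "tie"

    def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
        """Query the judge in both orders; only consistent verdicts count, otherwise call it a tie."""
        first = judge_once(question, answer_a, answer_b)    # A shown first
        second = judge_once(question, answer_b, answer_a)   # B shown first
        if first == "1" and second == "2":
            return "A"
        if first == "2" and second == "1":
            return "B"
        return "tie"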

Despite their scalability and explainability, model-based approaches have been found to suffer from several issues, including position, verbosity, and self-enhancement bias [729]. Specifically, position bias (i.e., sensitivity to the order in which the responses are presented) refers to the fact that LLMs tend to assign higher scores to the answers at specific positions over others, verbosity bias means that LLMs favor verbose answers even if they fall short in quality compared with shorter answers, and self-enhancement bias indicates that LLMs often overrate their own generations. In addition, since LLMs have limited capacities in solving complex reasoning problems, they cannot serve as qualified evaluators for some difficult tasks (e.g., mathematical reasoning). These limitations can be mitigated to some extent by specific prompt engineering and fine-tuning strategies [729].
To summarize, our categorization (Table 15) of existing work on LLM evaluation is mainly based on two major dimensions, namely evaluation methodology and model type, which are further extended with the tested abilities. There is some recent work [735, 736] that has also discussed the categorization or taxonomies of existing work on LLM evaluation.

7.4 Empirical Evaluation
The above evaluation benchmarks and approaches are mainly employed to evaluate the overall abilities of LLMs. In this part, we conduct a fine-grained evaluation of the abilities discussed in Section 7.1 and Section 7.2. For each kind of ability, we select representative tasks and datasets for conducting evaluation experiments to examine the corresponding performance of LLMs.

7.4.1 Experimental Settings
In this part, we introduce the experimental settings for our evaluation.

Evaluation Models. To conduct the evaluation, we consider representative LLMs ranging from open-source models to closed-source API-accessible models as follows:
• Open-source models. Existing open-source models can be categorized into base models and instruction-tuned models. Base models are only pre-trained on a large general-purpose corpus with the language modeling objective, but without further supervised fine-tuning. In our evaluation, we select four representative base models including LLaMA (7B) [57], LLaMA 2 (7B) [99], Pythia (7B and 12B) [96], and Falcon (7B) [749]41. Instruction-tuned models are those fine-tuned using instructions (i.e., task datasets, daily chat, or synthetic instructions). In our experiments, we select four representative instruction-tuned models including Vicuna (7B and 13B) [152], Alpaca (7B) [187], and ChatGLM (6B) [93]. In addition, we also include LLaMA 2-Chat (7B) [99] for comparison; it is a representative model that has been aligned with humans via instruction tuning and RLHF, based on LLaMA 2 (7B).

41. Experiments with larger models are still in the schedule due to the limited computational resources.

• Closed-source models. In addition to the open-source models, there are also closed-source models that can only be accessed via APIs, which have gained much attention from both developers and researchers. Here, we select five representative closed-source models including text-davinci-002/003 (abbreviated as Davinci002/003), ChatGPT, Claude, and Claude 2, where the first three models are developed by OpenAI and the other two are developed by Anthropic.

Tasks and Datasets. Next, we set up the evaluation tasks and datasets for the abilities discussed in Section 7.1 and Section 7.2. We mainly evaluate the zero-shot performance of LLMs on these datasets. For more complex tasks that are hard to solve in the zero-shot manner (e.g., mathematical reasoning and tool manipulation), we mainly report the 3-shot performance, considering the context length limit of open-source models.
• Language generation. As discussed before, for language generation, we consider evaluating three kinds of tasks, i.e., language modeling, conditional text generation, and code synthesis. Specifically, we select four commonly used datasets, namely LAMBADA [252] (language modeling), WMT'22 [547] (machine translation), XSum [551] (text summarization), and HumanEval [105] (code synthesis) for evaluation. In WMT'22, we construct a new evaluation set by selecting 1000 examples for each language pair from the original large-scale test set to examine the average performance of LLMs in machine translation. We evaluate the zero-shot performance of LLMs on these datasets, and compute the accuracy of predicting words for LAMBADA, BLEU-4 for WMT'22, ROUGE-L for XSum, and pass@10 for HumanEval.
• Knowledge utilization. To evaluate the ability of knowledge utilization, we select four question answering datasets (i.e., TriviaQA [560], Natural Questions [556], Web Questions [559], and ARC [557]), and a fact extraction dataset, WikiFact [573]. We also report the zero-shot performance of LLMs on these datasets, and compute accuracy for ARC and exact match for the other datasets.
• Complex reasoning. For complex reasoning, we evaluate the comparison models on OpenbookQA [568], HellaSwag [584], and SocialIQA [583] for knowledge reasoning; Colored Objects [70] and Penguins in the Table [70] for symbolic reasoning; and GSM8k [198] and MATH [362] for mathematical reasoning. We compute the accuracy for OpenbookQA, HellaSwag, and SocialIQA; the solve rate for Colored Objects and Penguins in the Table; and the accuracy for GSM8k and MATH. For knowledge reasoning tasks, we evaluate the zero-shot performance, since they are all QA tasks that can be solved in a zero-shot setting. For complex symbolic reasoning and mathematical reasoning tasks, we leverage 3-shot in-context exemplars to better elicit LLMs to accomplish them. Following existing work [33, 436], we also utilize the chain-of-thought prompting strategy to better solve the mathematical reasoning tasks.
• Human alignment. For human alignment, we select TruthfulQA [558] to measure whether an LLM is truthful in generating answers to questions, CrowS-Pairs [605] and WinoGender [606] to assess the stereotypes in LLMs, RealToxicityPrompts [607] to evaluate the extent to which LLMs generate toxic language, and HaluEval [604] to test the ability of LLMs to recognize hallucination. As the test set of RealToxicityPrompts is too large, we randomly sample 10,000 examples from it for evaluation.

TABLE 16: Evaluation on the eight abilities of LLMs with specially selected tasks. The shades of the orange and blue fonts denote the performance order of the results among closed-source and open-source models, respectively. This table will be continuously updated by incorporating the results of more models.

Models               Language Generation                         Knowledge Utilization
                     LBD↑   WMT↑   XSum↑   HumanEval↑            TriviaQA↑  NaturalQ↑  WebQ↑  ARC↑  WikiFact↑
ChatGPT 55.81 36.44 21.71 79.88 54.54 21.52 17.77 93.69 29.25
Claude 64.47 31.23 18.63 51.22 40.92 13.77 14.57 66.62 34.34
Claude 2 45.20 12.93 19.13 78.04 54.30 21.30 21.06 79.97 35.83
Davinci003 69.98 37.46 18.19 67.07 51.51 17.76 16.68 88.47 28.29
Davinci002 58.85 35.11 19.15 56.70 52.11 20.47 18.45 89.23 29.15
LLaMA 2-Chat (7B) 56.12 12.62 16.00 11.59 38.93 12.96 11.32 72.35 23.37
Vicuna (13B) 62.45 20.49 17.87 20.73 29.04 10.75 11.52 20.69 28.76
Vicuna (7B) 63.90 19.95 13.59 17.07 28.58 9.17 6.64 16.96 26.95
Alpaca (7B) 63.35 21.52 8.74 13.41 17.14 3.24 3.00 49.75 26.05
ChatGLM (6B) 33.34 16.58 13.48 13.42 13.42 4.40 9.20 55.39 16.01
LLaMA 2 (7B) 66.39 11.57 11.57 17.07 30.92 5.15 2.51 24.16 28.06
LLaMA (7B) 67.68 13.84 8.77 15.24 34.62 7.92 11.12 4.88 19.78
Falcon (7B) 66.89 4.05 10.00 10.37 28.74 10.78 8.46 4.08 23.91
Pythia (12B) 61.19 5.43 8.87 14.63 15.73 1.99 4.72 11.66 20.57
Pythia (7B) 56.96 3.68 8.23 9.15 10.16 1.77 3.74 11.03 15.75
Models               Knowledge Reasoning                    Symbolic Reasoning        Mathematical Reasoning   Interaction with Environment
                     OBQA↑  HellaSwag↑  SocialIQA↑          C-Objects↑  Penguins↑     GSM8k↑  MATH↑            ALFW↑  WebShop↑
ChatGPT 81.20 61.43 73.23 53.20 40.27 78.47 33.78 58.96 45.12/15.60
Claude 81.80 54.95 73.23 59.95 47.65 70.81 20.18 76.87 47.72/23.00
Claude 2 71.60 50.75 58.34 66.76 74.50 82.87 32.24 77.61 34.96/19.20
Davinci003 74.40 62.65 69.70 64.60 61.07 57.16 17.66 65.67 64.08/32.40
Davinci002 69.80 47.81 57.01 62.55 67.11 49.96 14.28 76.87 29.66/15.20
LLaMA 2-Chat (7B) 45.62 74.01 43.84 43.40 38.93 9.63 2.22 11.19 24.51/5.60
Vicuna (13B) 43.65 70.51 45.97 53.55 36.91 18.50 3.72 8.96 22.74/5.00
Vicuna (7B) 43.84 69.25 46.27 44.25 36.24 14.03 3.54 1.49 6.90/1.40
Alpaca (7B) 47.82 69.81 47.55 39.35 40.27 4.93 4.16 4.48 0.00/0.00
ChatGLM (6B) 30.42 29.27 33.18 14.05 14.09 3.41 1.10 0.00 0.00/0.00
LLaMA 2 (7B) 44.81 74.25 41.72 43.95 35.75 10.99 2.64 8.96 0.00/0.00
LLaMA (7B) 42.42 73.91 41.46 39.95 34.90 10.99 3.12 2.24 0.00/0.00
Falcon (7B) 39.46 74.58 42.53 29.80 24.16 1.67 0.94 7.46 0.00/0.00
Pythia (12B) 37.02 65.45 41.53 32.40 26.17 2.88 1.96 5.22 3.68/0.60
Pythia (7B) 34.88 61.82 41.01 29.05 27.52 1.82 1.46 7.46 10.75/1.80
Models | Human Alignment: TfQA↑ C-Pairs↓ WinoGender↑ RTP↓ HaluEval↑ | Tool Manipulation: HotpotQA↑ Gorilla-TH↑ Gorilla-TF↑ Gorilla-HF↑
ChatGPT 69.16 18.60 62.50/72.50/79.17 3.07 66.64 23.80 67.20 44.53 19.36
Claude 67.93 32.73 71.67/55.00/52.50 3.75 63.75 33.80 22.04 7.74 7.08
Claude 2 71.11 10.67 60.00/60.00/55.83 3.20 50.63 36.4 61.29 22.19 23.67
Davinci003 60.83 0.99 67.50/68.33/79.17 8.81 58.94 34.40 72.58 3.80 6.42
Davinci002 53.73 7.56 72.50/70.00/64.17 10.65 59.67 26.00 2.69 1.02 1.00
LLaMA 2-Chat (7B) 69.77 48.54 47.50/46.67/46.67 4.61 43.82 4.40 0.00 0.00 0.22
Vicuna (13B) 62.30 45.95 50.83/50.83/52.50 5.00 49.01 11.20 0.00 0.44 0.89
Vicuna (7B) 57.77 67.44 49.17/49.17/49.17 4.70 43.44 6.20 0.00 0.00 0.33
Alpaca (7B) 46.14 65.45 53.33/51.67/53.33 4.78 44.16 11.60 0.00 0.00 0.11
ChatGLM (6B) 63.53 50.53 47.50/47.50/46.67 2.89 41.82 4.00 0.00 0.00 0.00
LLaMA 2 (7B) 50.06 51.39 48.83/48.83/50.83 6.17 42.23 3.80 0.00 0.00 0.11
LLaMA (7B) 47.86 67.84 54.17/52.50/51.67 5.94 14.18 1.60 0.00 0.00 0.11
Falcon (7B) 53.24 68.04 50.00/50.83/50.00 6.71 37.41 1.00 0.00 0.00 0.00
Pythia (12B) 54.47 65.78 49.17/48.33/49.17 6.59 27.09 0.40 0.00 0.00 0.00
Pythia (7B) 50.92 64.79 51.67/49.17/50.00 13.02 25.84 0.20 0.00 0.00 0.00

• Interaction with environment. To test this ability, we select ALFWorld [611] and WebShop [612] for evaluation, which simulate real-world scenarios such as household and e-commerce environments. We follow the setting of ReAct [442] to evaluate the 1-shot and 2-shot performance of LLMs on WebShop and ALFWorld, respectively, and compute the success rate for ALFWorld and the average score/success rate for WebShop. Further, we also follow ReAct [442] to reduce the length of the input prompt and utilize the line break as the EOS token.

TABLE 17: Prompt examples and their performance of ChatGPT on representative tasks. For most tasks, we compare the
performance for simple and complex prompts. We also present the reported performance of supervised methods. “LG”,
“KU”, “CR”, “SDG”, “IR” are short for “language generation”, “knowledge utilization”, “complex reasoning”, “structured
data generation”, “information retrieval”. “-” means there is no reported supervised result previously on this dataset.

Tasks Datasets Instructions ChatGPT Supervised


I want you to act as a translator. Please translate the English 20.66
sentence into Czech.
Translation WMT 41.40 [741]
I want you to act as a translator. Translate the given English 21.12
sentence into Czech, and ensure that the translated sentence is
semantically consistent with the given sentence. \n Sentence:
{source sentence} \n Translation:
LG
Please generate a one-sentence summary for the given document. 21.71

Summarization XSum {document} Try your best to summarize the main content of the given 23.01 42.08 [742]
document. And generate a short summary in 1 sentence for it.\n
Summary:
Choose your answer to the question. {query} {options} 85.19
Closed-Book QA ARC 92.00 [743]
Choose a correct answer according to the given question, and output 85.86
the corresponding id, do not answer other content except the answer
id.
Choose your answer to the question: {question} {choices}. You must 81.20
KU only output A, B, C, or D without any extra explanation. The answer
is
Open-Book QA OBQA 87.20 [743]
Following is a question that requires multi-step reasoning, use 82.20
of additional common and commonsense knowledge, and rich text
comprehension. Choose your answer to the question: \n Question:
Frilled sharks and angler fish live far beneath the surface of the
ocean, which is why they are known as \n Choices: \n A. Deep sea
animals \n B. fish \n C. Long Sea Fish \n D. Far Sea Animals \n You
must only output A, B, C, or D without any extra explanation. The
answer is
Complete the sentence with one or a few words. 29.25
Fact Extraction WikiF 34.20 [522]
Complete the given sentence with one entity name in Wikipedia (MUST 31.21
be a noun) as short as possible, and ensure that the completed
sentence conforms to the facts.
Problem: {problem}\n Answer: 53.20
Symbolic Reasoning C-Objects —
You are an expert in reasoning problem. Here are some examples 66.75
about symbolic reasoning. You can use the knowledge in examples and
solve the last problem. You should follow the examples and generate
the final answer without external solution or words.
CR Problem: {problem}\n Solution: Let’s think step by step. 78.47

Math Word Problems GSM8k Let’s use python to solve math problems. Here are three examples 79.30 63.20 [744]
how to do it,\n Q: Olivia has $23. She bought five bagels for $3
each. How much money does she have left?\n‘‘‘def solution():\n
"""Olivia has $23. She bought five bagels for $3 each. How
much money does she have left?"""\n money_initial = 23\n
bagels = 5\n bagel_cost = 3\n money_spent = bagels *
bagel_cost\n money_left = money_initial - money_spent\n
result = money_left\n return result‘‘‘\n ...... \n How about
this question?\n Q:
Code Synthesis HumanEval I want you act as a code completer. Given a code snippet, your 79.88 48.20 [745]
objective is to complete the code and ensure that it can achieve
the described functionality.
SDG
Text-to-SQL Spider ### Complete sqlite SQL query only and with no explanation.\n 70.10 84.10 [746]
#\n### Sqlite SQL tables, with their properties: \n#\n{table}\n#
{foreign_key}\n#\n### {question}\n SELECT
Recommendation MovieLens I’ve watched the following movies in the past in order: \n 48.80 76.25 [747]
{user_his_text} \n\n Now there are {recall_budget} candidate movies
that I can watch next: \n {candidate_text_order} \n Please rank
these {recall_budget} movies by measuring the possibilities that I
would like to watch next most, according to my watching history.
Please think step by step. \n Note that my most recently watched
movie is {recent_item}. Please show me your ranking results with
IR order numbers. Split your output with line break. You MUST rank the
given candidate movies. You can not generate movies that are not in
the given candidate list.
Conversational ReDial Recommend 10 items that are consistent with user preference. The 17.20 25.60 [748]
Recommenda- recommendation list can contain items that the dialog mentioned
tion before. The format of the recommendation list is: no. title (year).
Don’t mention anything other than the title of items in your
recommendation list

• Tool manipulation. For tool manipulation, we consider two kinds of tools, including search engines and model interfaces. Therefore, we adopt two tool manipulation benchmarks, i.e., HotpotQA [581] and Gorilla [619]. HotpotQA requires LLMs to use a search engine to retrieve documents from the web, and Gorilla to invoke model APIs from three hubs, namely TorchHub, TensorHub and HuggingFace. We compute exact match for HotpotQA and accuracy for Gorilla. For HotpotQA, we follow ReAct [442] to report the 3-shot performance. For Gorilla, we follow the code released by its paper [619], and evaluate the zero-shot performance.

Implementation Details. For each task and dataset, we evaluate the compared LLMs using the same prompts and results parsing method provided by existing work (i.e., TruthfulQA, HotpotQA, Gorilla, HaluEval) or designed according to our empirical experience (i.e., TriviaQA, Natural Questions, Web Questions, ARC, WikiFact, GSM8k, MATH, C-Objects, Penguins, LAMBADA, WMT'22, XSum, HumanEval, CrowS-Pairs, WinoGender, RealToxicityPrompts). Specifically, all the experiments with closed-source models are based on invoking their official APIs, while for open-source models, we utilize their publicly available code and model parameters, and perform the inference on 8 A800-80G GPUs. For TriviaQA, OpenBookQA, HellaSwag, and SocialIQA, we experiment on the development set since the test set is not publicly released, while for the other datasets, we experiment on the test set. To reproduce our experiments, we also publicly release our experimental code and data at https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

7.4.2 Results Analysis and Findings
We report the experimental results in Table 16, and analyze the results in the following.

Analysis of Closed-Source Models. We summarize our analysis and findings of the five closed-source models (i.e., ChatGPT, Claude, Claude 2, Davinci003 and Davinci002) as follows:
• These five closed-source models achieve promising results as general-purpose task solvers, in which ChatGPT mostly performs the best. ChatGPT, Claude, Claude 2, Davinci003 and Davinci002 perform well on most tasks, including complex tasks (e.g., GSM8k), which shows their great potential to be general-purpose task solvers. Among them, ChatGPT exhibits a superior model capacity on the evaluation tasks, winning the most across all tasks. On some evaluation tasks, the performance gap between ChatGPT and the other closed-source models is very large, especially for complex tasks, e.g., 78.47 (ChatGPT) vs. 49.96 (Davinci002) on GSM8k, and 79.88 (ChatGPT) vs. 51.22 (Claude) on HumanEval.
• Claude 2, ChatGPT and Davinci003 perform better on interaction with environment and tool manipulation tasks. On these two evaluation tasks, Claude 2, ChatGPT and Davinci003 perform better than the other models by a large margin, e.g., 36.40 (Claude 2) vs. 26.00 (Davinci002) on HotpotQA, 44.53 (ChatGPT) vs. 7.74 (Claude) on Gorilla-TF, and 72.58 (Davinci003) vs. 22.04 (Claude) on Gorilla-TH. A possible reason is that these three models have been specially optimized towards these advanced abilities, e.g., supporting the use of external plugins.
• All the comparison models do not perform well on very difficult reasoning tasks. On MATH and HotpotQA, all models (including ChatGPT) perform poorly. These two tasks are very difficult to solve, requiring an accurate understanding of complex mathematical knowledge and performing multi-hop reasoning across documents, respectively. Further, these models also have a relatively weak performance on the machine translation task (WMT). A possible reason is that WMT also contains many evaluation examples in minor languages, which might not be well covered in the pre-training data of these LLMs.

Analysis of Open-Source Models. Next, we continue to show our analysis and findings about the eight open-source models (i.e., LLaMA 2-Chat, Vicuna, Alpaca, ChatGLM, LLaMA 2, LLaMA, Pythia and Falcon) as follows:
• Instruction-tuned models mostly perform better than the base models. Among all the compared open-source methods, the instruction-tuned models (i.e., LLaMA 2-Chat, Vicuna, Alpaca and ChatGLM) mostly perform better than the non-instruction-tuned models (i.e., LLaMA 2, LLaMA, Pythia and Falcon). It indicates that instruction tuning is generally capable of improving the few-shot or zero-shot ability of LLMs in solving various tasks. However, after instruction tuning, Vicuna (7B) and Alpaca (7B) suffer from performance degradations on LAMBADA, a language modeling task. The reason may be that the instruction data mainly focuses on enabling LLMs to follow human instructions, which is not always useful for the general language generation task.
• These small-sized open-source models do not perform well on mathematical reasoning, interaction with environment, and tool manipulation tasks. On the tasks of mathematical reasoning, interaction with environment and tool manipulation, all these evaluated open-source models perform poorly, including the instruction-tuned ones. A possible reason is that the instruction data for fine-tuning these models is not specifically designed for these tasks. In addition, these open-source models may have limited model capacities due to their small model sizes.
• The top-performing model varies on different human alignment tasks. For different human alignment tasks, we can see that these models achieve inconsistent performance rankings. For example, LLaMA 2-Chat (7B) performs the best among the compared open-source models on TruthfulQA, while Vicuna (13B) performs the best on CrowS-Pairs. A possible reason is that these tasks are designed with specific purposes for evaluating different aspects of human alignment, and these models exhibit varied performance on different tasks, even for the variants of the same model (e.g., Pythia (7B) and Pythia (12B)). More experiments and analysis on human alignment evaluation are needed to reveal more detailed findings.
• As a more recently released model, LLaMA 2 (7B) overall achieves a good performance, especially on complex reasoning tasks. For complex reasoning tasks, LLaMA 2 (7B) mostly performs better than the other base models, e.g., 43.95 (LLaMA 2 (7B)) vs. 29.80 (Falcon (7B)) on C-Objects. For other tasks (e.g., language generation and knowledge utilization), LLaMA 2 (7B) can also achieve comparable performance to the best-performing base models. It has used more data for pre-training (i.e., about 2 trillion tokens), which mainly contributes to the excellent performance. Furthermore, it also conducts a more robust data cleaning process.
• Scaling the open-source models can improve the performance consistently. By comparing the performance of Vicuna (7B) and Vicuna (13B), and Pythia (7B) and Pythia (12B), we can see that the models with larger scales mostly perform better than smaller ones on these evaluation tasks, indicating the effectiveness of scaling up the model size. Across different tasks, scaling the model is more beneficial for more complex tasks (e.g., symbolic and mathematical reasoning), where the larger models mostly outperform smaller ones by a large margin.
Readers should note that these findings about open-source language models are limited to the evaluated model sizes. We will continually update this part by including the results of larger versions of these models, and also call for the support of computational resources for more experiments.
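As a minimal illustration of the shared evaluation protocol described above (one prompt template and one answer-parsing rule applied uniformly to every model), the sketch below computes exact-match accuracy over a toy QA set; the `generate` callable stands in for either an API call to a closed-source model or local inference with an open-source checkpoint, and the dataset items are hypothetical.

```python
import re
from typing import Callable, Iterable

def normalize(text: str) -> str:
    """Lower-case and strip punctuation/extra spaces before comparing answers."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def exact_match(prediction: str, references: Iterable[str]) -> bool:
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)

def evaluate(generate: Callable[[str], str], dataset: list[dict]) -> float:
    """Apply one prompt template and one parsing rule uniformly to all models."""
    prompt_template = "Answer the question with a short phrase.\nQuestion: {q}\nAnswer:"
    correct = 0
    for example in dataset:
        output = generate(prompt_template.format(q=example["question"]))
        # Parsing rule: keep only the first line of the model output.
        answer = output.strip().split("\n")[0]
        correct += exact_match(answer, example["answers"])
    return correct / len(dataset)

# Hypothetical usage with a dummy model:
toy_data = [{"question": "Who wrote Hamlet?", "answers": ["William Shakespeare", "Shakespeare"]}]
print(evaluate(lambda prompt: "Shakespeare", toy_data))
```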

8 APPLICATIONS
In this section, we briefly review the recent progress on the applications of LLMs in two aspects, namely the impact on the research community and representative domains. Figure 18 shows a content organization of this section43.

43. Note that we do not aim to cover all the related research directions or domains, but instead demonstrate the use or impact of LLMs via these selected examples.

8.1 LLM for Research Community
As LLMs have revolutionized the way we develop AI algorithms, they pose a significant impact on the research community. In this part, we briefly review the advances led by LLMs for several representative research directions.

8.1.1 LLM for Classic NLP Tasks
As pre-trained language models (e.g., BERT) originated in the field of NLP, the technical advances of language models have an important impact on the research of NLP. In this part, we discuss the application of LLMs to five kinds of classic NLP tasks, including word-level, sentence-level, sequence tagging, relation extraction, and text generation tasks, which have been the foundation of many existing NLP systems and applications. Note that we do not intend to comprehensively cover all NLP tasks, but instead try to analyze the impact of LLMs on fundamental NLP research through these basic tasks. We also omit the discussion of several tasks (e.g., language modeling) that have been discussed earlier in this survey.

Word/Sentence-level Tasks. As long-standing NLP tasks, word-level (e.g., word clustering [750] and sense disambiguation [751]) and sentence-level tasks (sentence matching [752] and sentiment classification [753]) have been widely studied in the literature and applied in real-world platforms. To solve these tasks, the key is to accurately understand the semantic information about the words or sentences. As rich high-quality labeled data about these tasks has been accumulated so far, existing work [23, 39] finds that small language models can achieve very good performance by fine-tuning on it. Recent studies [55, 754] have also tested the performance of LLMs on these tasks, showing that LLMs can also perform well via in-context learning (with very few examples). Whereas, as small models can be specially optimized on these tasks to learn the specific task requirements and domain knowledge, full-data fine-tuned small models can mostly outperform LLMs using in-context learning on several classic tasks [755, 756], e.g., semantic matching and sentiment analysis.

Sequence Tagging. The sequence tagging tasks, e.g., named entity recognition (NER) [757] and part-of-speech (POS) tagging [758], are also fundamental tasks. Typically, such tasks require assigning each token in the input sequence a proper semantic category label, e.g., the classic B-I-O (Beginning, Inside and Outside) tagging scheme for NER tasks. In the era of deep learning, early efforts [759, 760] mainly integrate the learned sequence representations (e.g., using CNN, LSTM, and BERT) into the classic conditional random field (CRF) model, which performs the tagging task based on structural prediction. Recently, researchers have tested the performance of LLMs on sequence tagging tasks, but observed that LLMs still face challenges in solving them using in-context learning [755], especially for special categories with ambiguous or rare names, e.g., the "MISC" (miscellaneous entity) and "ORG" (organization) classes. A possible reason is that LLMs may misunderstand the meanings of these classes in the human-annotated dataset, making it difficult to accurately understand their semantics according to the instruction and the limited examples in the context.

Information Extraction. The information extraction task focuses on automatically extracting useful structured information from unstructured text data, such as relation extraction [761] and event extraction [762], which is also a crucial task relating to many NLP applications. Typically, previous studies formulate this task as a text classification task or a sequential labeling task. As information extraction often needs to accurately understand and process complex semantic relations (multiple relations within one sentence), in-context learning with LLMs typically underperforms state-of-the-art full-data fine-tuning methods [763, 764]. Whereas, it has been shown that enabling collaboration between LLMs and small models can further boost the performance on specific tasks [764, 765]. In addition, a recent study [766] also reveals that LLMs can achieve competitive zero-shot performance for information extraction with a two-stage workflow, making this approach attractive in future applications.

Text Generation. Text generation tasks, e.g., machine translation [626] and automatic summarization [550], are long-standing NLP tasks that have been widely studied, and there have been a number of deployed products and systems based on fine-tuned small models [309, 767]. Since the pre-training of LLMs is established on text prediction, they exhibit strong language generation abilities comparable to commercial products [629] and humans [630], with the help of proper prompts [768, 769]. Additionally, LLMs are flexible to effectively handle special requirements in real-world application scenarios, e.g., document-level translation [770], and also enable natural language interaction with users to further improve the generation quality [771]. Despite the above success, recent work also reveals that LLMs struggle to address well the generation tasks about low-resource languages and domains, e.g., Marathi-to-English translation [772], due to their unbalanced training data across different languages.
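To illustrate the in-context learning setup for sequence tagging discussed above, the following sketch builds a few-shot NER prompt and maps a model's predicted entity list back onto B-I-O tags; the demonstrations, label set, and `generate` callable are illustrative assumptions rather than the setup of any particular study.

```python
from typing import Callable

DEMONSTRATIONS = [
    ("Barack Obama visited Paris .", "PER: Barack Obama | LOC: Paris"),
    ("Apple opened a store in Berlin .", "ORG: Apple | LOC: Berlin"),
]

def build_prompt(sentence: str) -> str:
    """Few-shot prompt asking the model to list entities with their types."""
    lines = ["Extract named entities (PER, ORG, LOC) from each sentence."]
    for text, answer in DEMONSTRATIONS:
        lines.append(f"Sentence: {text}\nEntities: {answer}")
    lines.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(lines)

def to_bio(sentence: str, model_output: str) -> list[str]:
    """Map the predicted entity spans back onto B-I-O tags over whitespace tokens."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for chunk in model_output.split("|"):
        if ":" not in chunk:
            continue
        label, span = chunk.split(":", 1)
        span_tokens = span.strip().split()
        for i in range(len(tokens) - len(span_tokens) + 1):
            if tokens[i:i + len(span_tokens)] == span_tokens:
                tags[i] = f"B-{label.strip()}"
                for j in range(i + 1, i + len(span_tokens)):
                    tags[j] = f"I-{label.strip()}"
                break
    return tags

# Hypothetical usage with a dummy model:
generate: Callable[[str], str] = lambda prompt: "ORG: Google | LOC: London"
sentence = "Google hired researchers in London ."
print(to_bio(sentence, generate(build_prompt(sentence))))
```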

[Figure 18 presents a taxonomy of LLM applications, covering research directions (LLM for classic NLP tasks, LLM for IR, LLM for recommendation, multimodal LLMs, KG-enhanced LLM, LLM-based agent, and LLM for evaluation) and specific downstream domains (healthcare, education, law, finance, and scientific research).]
Fig. 18: The applications of LLMs in representative research directions and downstream domains.

Summary. Based on the above discussion, we summarize the suggestions and future directions about the use of LLMs in classic NLP tasks as follows:
• Suggestions: LLMs and small models have their own merits in different aspects: LLMs can provide unified solutions to various NLP tasks and achieve competitive performance (especially in the zero/few-shot setting), while small models are economical to develop and can be specially tuned according to target tasks, achieving good performance with sufficient high-quality labeled data [755, 756, 773, 774]. In applications, one can make suitable choices based on the actual needs, comprehensively considering flexibility, data availability, training compute, and efficiency.
• Future direction: Despite their excellent general capacities, LLMs still cannot effectively process NLP tasks in low-resource domains, e.g., minor language translation. To tackle such tasks, it is necessary to develop effective approaches for injecting necessary task information or domain-specific knowledge into LLMs, either through fine-tuning or prompting. In addition, it is still challenging for LLMs to handle complex semantic relations in classic NLP tasks (e.g., nested entity extraction), which is worth more exploration from the underlying working mechanism of LLMs. It is also promising to combine LLMs and fine-tuned small language models so that they complement each other in solving complex cases of classic NLP tasks [775]. Another promising direction is to conduct human-machine collaborative research (e.g., conversational translation [771]) on NLP tasks, since LLMs can effectively understand human instructions and make meaningful responses.

8.1.2 LLM for Information Retrieval
The goal of information retrieval (IR) systems is to assist users in discovering ideal information resources (typically documents) and mitigating the information overload issue. Typically, contemporary IR systems adopt a retrieve-then-rerank pipeline framework [54]. Within this framework, the retriever initially retrieves relevant information from a large-scale corpus, and the reranker subsequently performs a multi-stage ranking procedure to acquire the most relevant information [776]. Since the advent of LLMs has a significant impact on the way of information access, we discuss how it advances the development of IR from two main aspects, namely LLMs as IR models and LLM-enhanced IR models.

LLMs as IR Models. Existing IR models can be overall categorized into sparse models (relying on term-based lexical similarity) and dense models (relying on embedding-based semantic similarity) [54]. Specially, dense models are mainly implemented by fine-tuned PLMs (e.g., BERT). Compared to PLMs, LLMs have stronger model capacities in capturing text semantics, thus having the potential to improve existing dense IR models. However, due to the high overhead of LLMs, the majority of studies concentrate on employing LLMs as rerankers, aiming to refine the ranking of retrieved candidates. To achieve this, recent efforts often formulate special instructions that enable LLMs to perform reranking on a small set of provided candidate documents. Typically, such an approach does not necessitate model training, and achieves promising results compared with well-trained reranking methods [777, 778]. Specially, the LLM-based reranking approach can be implemented in different ways with zero-shot or few-shot instructions, including pointwise (estimating the relevance scores for query-document pairs) [779], pairwise (determining the relevance order of two documents) [778], or listwise ranking (sorting a subset of candidate documents) [780]. The essence of these methods lies in the special design of instructions for text reranking, such as the sliding window strategy for document lists [777, 781], setwise selection prompting [782], fine-grained relevance label incorporation [783], and pairwise comparison prompting [778]. In addition, recent efforts employ LLMs to generate intermediate texts (e.g., URLs) as retrieval results using few-shot demonstrations [784]. To further enhance the model performance, LLMs can be specially fine-tuned as backbones for reranking [785, 786] or retrieval (including dense retrieval [54] and model-based retrieval [787, 788]), similar to the fine-tuning process for traditional PLM-based IR models [785]. However, fine-tuning LLMs as IR models entails considerable expenses given the huge parameter scale of LLMs.
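The listwise reranking strategy mentioned above can be sketched as follows: candidate documents are numbered inside a single instruction, the LLM is asked to return a permutation, and the reply is parsed back into a ranking. The prompt wording and the `generate` callable are illustrative assumptions, not a specific published method.

```python
from typing import Callable

def rerank_listwise(query: str, docs: list[str], generate: Callable[[str], str]) -> list[int]:
    """Ask an LLM to order the numbered candidates by relevance to the query."""
    numbered = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\nPassages:\n{numbered}\n"
        "Output the passage numbers from most to least relevant, separated by ' > '."
    )
    answer = generate(prompt)
    order = [int(tok) - 1 for tok in answer.replace(">", " ").split() if tok.isdigit()]
    # Fall back to the original order for any candidate the model failed to mention.
    seen = set(order)
    order += [i for i in range(len(docs)) if i not in seen]
    return order

# Hypothetical usage with a dummy model:
docs = ["Tokyo is the capital of Japan.", "Pandas eat bamboo.", "Japan is an island country."]
print(rerank_listwise("capital of Japan", docs, lambda p: "1 > 3 > 2"))
```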

LLM-Enhanced IR Models. As another major research direction, LLMs can be employed to improve existing IR models (e.g., small models). A common challenge faced by existing IR models is the lack of relevance judgment annotation [789, 790]. To tackle this problem, LLMs can be instructed to annotate positive or negative documents for a given query [791], or to generate corresponding queries based on a set of documents in the corpus by referring to a few demonstrations [792, 793]. In addition to training data augmentation, LLMs have the potential to improve existing IR models by refining the search-oriented informativeness of both queries and documents. In IR systems, the input queries may be constrained by a user's cognitive and cultural competency, making it challenging to accurately express the real intent, and irrelevant content present in documents can also impact the relevance evaluation with respect to the query. As a solution, LLMs can be utilized to rewrite the query, enhancing the understanding of the query intent and incorporating additional knowledge into the query through well-designed instructions. The rewritten query can take the form of an improved version of the original query [794], a document in the corpus that is related to the query [795], or an expansion of the query that is concatenated with a pseudo-generated document [796]. In addition, documents can also be expanded with queries that are generated from the original documents using LLMs for context extension [797].

Remaining Issues. In this part, we further discuss several important issues in applying LLMs to improve IR systems. First, though LLMs are capable of serving as general-purpose task solvers, they are not directly well suited for existing IR systems: they require high overhead for inference [777, 785], have limitations in modeling long texts or document lists [781], and need special adaptation (e.g., instruction tuning) to perform the text ranking task [798]. Therefore, more systematic approaches to adapting LLMs for modern IR systems should be investigated, to leverage their benefits and meanwhile overcome these limitations. Secondly, the advent of LLMs sheds light on the development of new information seeking ways (e.g., New Bing). It is meaningful to explore how to reshape the architecture and paradigm of IR by integrating the LLMs' capacities and the merits of existing IR systems [799]. Thirdly, existing work mainly focuses on text retrieval tasks, lacking a comprehensive consideration of multimodal information sources. As will be discussed in Section 8.1.4, multimodal large language models [800] are also widely studied, making it feasible to develop more powerful multimedia retrieval systems.

8.1.3 LLM for Recommender Systems
Unlike IR systems that analyze user search queries to retrieve relevant documents, recommender systems (RS) aim to capture the underlying user preference and provide appropriate information resources to users [801–804]. Typically, existing studies train a recommendation model (either a classic or a deep learning model) by fitting it over the user's logged data (e.g., click data) [747, 805]. However, these models often suffer from a series of technical issues, e.g., cold-start recommendation, domain transfer, and poor explainability. Recently, LLMs have demonstrated the potential to alleviate these issues of recommendation models [355, 806, 807], due to their strong capacities of domain generalization and language generation. In this part, we briefly review the recent progress of LLMs in recommender systems from the following three aspects, namely LLMs as recommendation models, LLM-enhanced recommendation models, and LLMs as recommendation simulators.

LLMs as Recommendation Models. With specific methods or mechanisms, LLMs can be adapted to serve as recommendation models. Existing work along this line can be generally divided into two main categories. First, some methods prompt LLMs to complete the recommendation task in a zero-shot paradigm (i.e., without parameter tuning) [808, 809]. A series of prompt engineering methods, like recency-focused prompting and in-context learning, are introduced to improve recommendation performance as well as alleviate potential model biases [810, 811]. Second, another category of studies aims to specialize LLMs for personalized recommendation through instruction tuning [355, 812]. Specially, high-quality instruction data is key to adapting LLMs to the recommendation tasks, and it can be constructed based on user-item interactions with heuristic templates. To further improve the instruction diversity, InstructRec [355] employs the self-instruct technique to simulate large amounts of potential user instructions in various scenarios like product search and personalized recommendation. In addition to representing each item by its text description, there is also growing attention on extending the LLM's vocabulary with semantic identifiers in recommender systems [813, 814], to incorporate collaborative semantics into LLMs.

LLM-enhanced Recommendation Models. In addition to instructing LLMs to directly provide recommendations, researchers also propose leveraging the universal knowledge encoded in LLMs to improve traditional recommender systems. Existing approaches in this line can be divided into three main categories. The first category employs LLMs to infer users' potential intentions from their historical interaction data; traditional recommendation/search models then employ the inferred intentions to improve the retrieval of relevant items [815, 816]. Additionally, several studies explore the use of LLMs as feature encoders. They employ LLMs to encode the side information of items and users (e.g., item descriptions and user reviews), thus deriving more informative representations of users and items. These representations are then fed into traditional recommender systems as augmented input [817, 818]. As another alternative approach, several studies [819, 820] adopt a distillation-like way to transfer the LLM's capacities (e.g., semantic encoding) to improve traditional recommenders (i.e., small models). Specially, they align the hidden states of LLMs and traditional recommendation models via joint training. After training, since only the enhanced small model will be deployed online, it can avoid the huge overhead of LLMs in online service.
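A minimal sketch of the zero-shot prompting paradigm for recommendation described above (in the spirit of the ranking prompt shown in Table 17): the user's interaction history and a fixed candidate set are verbalized into one instruction, and the model's ranked list is parsed back into item indices. The item titles and the `generate` callable are hypothetical.

```python
from typing import Callable

def recommend(history: list[str], candidates: list[str],
              generate: Callable[[str], str]) -> list[int]:
    """Prompt an LLM to rank a closed candidate set given the user's interaction history."""
    prompt = (
        "I've watched the following movies in order: "
        + ", ".join(history)
        + ".\nRank these candidate movies by how likely I am to watch them next:\n"
        + "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
        + "\nOnly rank the given candidates. Output the numbers in order, one per line."
    )
    ranked = [int(line.strip().rstrip(".")) - 1
              for line in generate(prompt).splitlines()
              if line.strip().rstrip(".").isdigit()]
    # Discard any out-of-range indices the model might hallucinate.
    return [i for i in ranked if 0 <= i < len(candidates)]

# Hypothetical usage with a dummy model:
history = ["The Matrix", "Inception"]
candidates = ["Interstellar", "Toy Story", "Tenet"]
print(recommend(history, candidates, lambda p: "3\n1\n2"))
```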

LLM as Recommendation Simulator. Inspired by the recent success of autonomous AI agents [821], LLMs have also been utilized to develop recommendation simulators [822, 823] (exemplified by RecAgent [822]), showing great potential to simulate real user behaviors in recommender systems [822, 824, 825]. Specifically, to make the simulation personalized, an agent is equipped with a profiling module that encompasses relevant identity information. Then, a memory module is introduced to store the agent's past interaction experiences. During the simulation process, agents are further prompted to conduct self-reflection based on their past experiences, to capture their underlying user preferences. Most existing recommendation simulators are conducted in a user-oriented way, without explicitly modeling the items in the interaction process. To address this, AgentCF [824] models both users and items as agents, and further facilitates collaborative reflections to simulate user-item interactions, so as to capture the two-sided relations between users and items.

Remaining Issues. Despite these efforts, there are still several challenges to address when applying LLMs in recommender systems. First, existing studies have shown that LLM-based recommendation models in zero/few-shot settings tend to perform worse than traditional ID-based recommenders [809, 810]. This indicates that LLMs might lack an understanding of personalized user behaviors and domain-specific collaborative semantics. Although instruction tuning alleviates this issue to some extent [355, 812], it cannot fully close the semantic gap between LLMs and recommender systems, and it also suffers from high tuning costs. Furthermore, recommender systems prioritize minimizing inference latency to enhance users' experience in low-resourced environments (e.g., phones), which poses a challenge to LLMs' inference speed as well as memory overhead. Therefore, it is important to explore improvement techniques, such as efficient tuning and quantization methods, to deploy LLMs efficiently and effectively in real-world recommender systems. In addition, existing LLMs have limited capacities in long context modeling, making it difficult to process the huge amount of user-item interaction data. Improved context length extension and context information utilization approaches should be developed to improve the modeling capacities of LLMs over long interaction sequences.

8.1.4 Multimodal Large Language Model
In the existing literature [826, 827], multimodal models mainly refer to models that can process and integrate information of various modalities (e.g., text, image, and audio) from the input, and further produce corresponding output in certain modalities. In this part, we mainly focus on the multimodal extension of LLMs that enables the information modeling of non-textual modalities, especially the vision modality, called multimodal large language models (MLLMs) [800]44. To start our discussion, we specify the input to be text-image pairs and the output to be text responses. Similar discussions can be made for other modalities, e.g., language-audio models [828], which is beyond our scope here. In essence, MLLMs are developed by adapting the information from other modalities to the text modality, so as to leverage the excellent model capacities of LLMs that are learned based on world text. Typically, an MLLM comprises an image encoder for image encoding and an LLM for text generation, associated by a connection module that aligns vision and language representations. During generation, the image is first split into patches, and then transformed into patch embeddings by the image encoder and the connection module, to derive a visual representation that can be understood by the LLM. Subsequently, the patch embeddings and text embeddings are concatenated, and fed into the MLLM, allowing the language model to generate the response autoregressively. In the following, we will discuss the training, evaluation, and key points to develop capable MLLMs.

44. In existing work, large vision language models (LVLMs) [664] are also used to term such bimodal models that are developed based on LLMs. We use the naming of MLLMs in this part due to its wide use in the existing literature.

Training Process. The training process of an MLLM includes two major stages: vision-language alignment pre-training and visual instruction tuning.
• Vision-language alignment pre-training. To develop MLLMs, existing work mostly initializes the vision encoder and the LLM with pre-trained models [154, 155, 829]. These models retain excellent vision and language capacities, but span different semantic spaces. Thus, the goal of vision-language alignment pre-training (i.e., the first-stage training) is to align the vision encoder and the LLM through end-to-end training on large-scale image-text pairs [830, 831]. However, directly tuning these two models on image-text pairs may cause the degradation of the original representation capacities. To improve the alignment performance, it is crucial to design effective training strategies and select appropriate pre-training data [832, 833]. Existing work mainly employs the following strategies for cross-modality alignment: (1) if the number of image-text pairs is not sufficiently large (e.g., less than 1M), it is often suggested to only update the connection module [834]; (2) if the training data includes high-quality text corpora [835] or image-text pairs with fine-grained annotations [836], fine-tuning the LLM can be conducted to boost the performance; (3) if the number of image-text pairs is very large (e.g., about 1B), fine-tuning the vision encoder is also plausible [832, 833], but the benefit remains to be further verified.
• Visual instruction tuning. After vision-language pre-training, the second-stage training, i.e., visual instruction tuning, aims to improve the instruction-following and task-solving abilities of MLLMs. Generally, the input of visual instruction tuning consists of an image and a task description, and the task is to generate a corresponding text output.
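The generic MLLM architecture described in this subsection (an image encoder, a connection module, and an LLM, with patch embeddings prepended to the text embeddings) can be sketched in PyTorch as follows; the module shapes, the simple linear connector, and the dummy components are assumptions for illustration rather than any specific model's design.

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Minimal multimodal LM: vision features are projected into the LLM embedding space."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g., a ViT returning patch features
        self.connector = nn.Linear(vision_dim, llm_dim)  # the connection module
        self.llm = llm                                    # a decoder that consumes input embeddings

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # [batch, num_patches, vision_dim] -> [batch, num_patches, llm_dim]
        patch_feats = self.vision_encoder(pixel_values)
        visual_embeds = self.connector(patch_feats)
        # Prepend visual tokens to the text token embeddings and decode over the joint sequence.
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds)

class DummyPatchEncoder(nn.Module):
    """Stand-in for a patch encoder: returns 9 fake patch features of width 64."""
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        return torch.randn(pixel_values.shape[0], 9, 64)

# Hypothetical usage with dummy stand-ins for the pre-trained components:
model = ToyMLLM(DummyPatchEncoder(), nn.Linear(128, 1000), vision_dim=64, llm_dim=128)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 5, 128))
print(logits.shape)  # (2, 14, 1000): 9 visual tokens + 5 text tokens
```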

To boost the performance, high-quality visual instruction data is key to eliciting and enhancing the abilities of MLLMs. Therefore, most studies are dedicated to constructing various visual instruction datasets. As basic approaches, early studies construct visual instructions by distilling from GPT-4 [154] or reformulating vision-language task datasets [156]. To enhance the quality of instruction data, recent work further proposes improved strategies by increasing the instruction diversity [837], incorporating fine-grained information (e.g., coordinates of objects) into the instructions [836], or synthesizing complex visual reasoning instructions [838].

Evaluation of MLLM. After introducing the approaches to developing MLLMs, we further discuss how to effectively assess the multimodal capabilities of MLLMs from the following three aspects.
• Evaluation perspectives. The evaluation tasks for MLLMs can be categorized into two main types: perception and cognition tasks. Specifically, perception tasks aim to assess the model's abilities in understanding the basic semantics of the image content, while cognition tasks evaluate models with more complex tasks that require reasoning based on perception results. The perception ability is typically evaluated through classification tasks about attributes of the image (e.g., topic and style) and objects (e.g., existence and color) or OCR-related tasks, based on existing datasets or new datasets derived from existing images with annotations by humans or LLMs [839–842]. A notable perception issue is hallucination [843], where the model's responses contain content inconsistent with the image. Among existing studies about hallucination in MLLMs [837, 844, 845], object hallucination [846] has received much research attention. To conduct a stable, robust evaluation of object hallucination, POPE [847] proposes a polling-based object probing approach that converts object recognition into a series of binary questions, and the results indicate that current MLLMs often struggle with object hallucination. Cognition tasks, on the other hand, require MLLMs to perform reasoning based on image perception. A common reasoning task is visual question answering (VQA), where models answer questions about images that demand reasoning about spatial relationships [848], general knowledge [849], or scene text [850]. To fully explore the capabilities of MLLMs, HallusionBench [851] collects 200 sophisticated visual dependent or supplement questions, on which even the most advanced MLLMs like LLaVA-1.5 [834] and GPT-4V [133] fail to achieve good performance.
• Evaluation paradigms. The responses of MLLMs can be evaluated either in a closed-ended or an open-ended manner. Traditional multimodal tasks often rely on a closed-ended evaluation framework, where the assessment is based on the exact match between the model's response and the ground-truth answer. Examples include the VQA score [852] for visual question answering tasks and the CIDEr [853] score for captioning tasks. However, MLLMs generate responses in an open-ended way, which may contain the correct answer but not exactly match the ground truth perfectly. This discrepancy can lead to the underestimation of the model's performance under previous evaluation paradigms. To address this issue, recent approaches have incorporated humans or LLMs as evaluators [832]. For instance, MMBench [841] employs ChatGPT to align the model responses with the most relevant option in a set of multiple-choice questions. Similarly, LLaVA [854] utilizes GPT-4 for evaluating MLLMs' output, where GPT-4 takes the generated image captions and object bounding boxes as visual inputs for assessment. Such open-ended evaluation methods can improve assessment accuracy while incurring higher costs due to the involvement of humans or LLMs.
• Evaluation benchmarks. To facilitate a more thorough evaluation of MLLMs, various benchmarks have been developed. Part of them collect existing vision-language tasks for comprehensive evaluation. For instance, LVLM-eHub [855] aggregates 47 existing text-related visual tasks to assess six distinct capabilities of MLLMs, and Reform-Eval [856] takes this a step further by standardizing questions from existing benchmarks into a uniform format and discusses how the backbone models influence MLLMs' performance. In addition to incorporating existing tasks, several works also derive new questions annotated by humans or with the help of LLMs. MME [842] creates a dataset by pairing images from public sources with manually-collected text instructions for perception and cognition evaluations. MMBench [841] transforms these instructions into multiple-choice questions and introduces CircularEval to ensure evaluation consistency. SEED-Bench [857] further considers temporal understanding tasks and enlarges the evaluation scale to 19K multiple-choice questions with the assistance of LLMs. MM-Vet [858] presents more complex tasks to assess the integrated multimodal capabilities of MLLMs. It starts by defining six essential multimodal abilities and then creates intricate questions by combining multiple abilities. In summary, the above benchmarks collectively contribute to the comprehensive evaluation and improved development of MLLMs.
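The polling-based object probing idea behind POPE discussed above reduces hallucination evaluation to binary existence questions; a minimal sketch of the resulting accuracy computation is given below, where the question template and the `answer` callable are illustrative assumptions.

```python
from typing import Callable

def pope_accuracy(probes: list[dict], answer: Callable[[str, str], str]) -> float:
    """probes: [{'image': ..., 'object': 'dog', 'present': True}, ...];
    answer(image, question) returns the MLLM's free-form reply."""
    correct = 0
    for probe in probes:
        question = f"Is there a {probe['object']} in the image? Answer yes or no."
        reply = answer(probe["image"], question).strip().lower()
        predicted_yes = reply.startswith("yes")
        correct += predicted_yes == probe["present"]
    return correct / len(probes)

# Hypothetical usage with a dummy model that always answers "yes";
# such a model scores about 50% when positive and negative probes are balanced.
probes = [
    {"image": "img_001", "object": "dog", "present": True},
    {"image": "img_001", "object": "umbrella", "present": False},
]
print(pope_accuracy(probes, lambda image, question: "Yes, there is."))
```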

Key Points for Improving MLLMs. To develop capable MLLMs, we continue to discuss three key points to improve the model capacities, from the perspectives of instruction data, training strategy, and safety and alignment.
• Visual instruction data. Extensive work [834, 859] has empirically found that both the quantity and quality of visual instructions have an important impact on the model performance of MLLMs. One basic way to construct visual instructions is to leverage the exceptional capability of LLMs to synthesize instructions based on text descriptions of images [854]. To further enhance the quality of instructions, one can construct fine-grained visual instructions with the help of human annotation [836, 860] or synthesize more complex data through carefully-designed prompts [838]. Despite the effectiveness of the above LLM-based approaches, one primary question emerges as to whether an LLM (i.e., a text generation model without training on any images) possesses the ability to generate sufficiently good visual instructions solely based on verbalized visual information (e.g., captions and coordinates). Specially, existing work has also revealed that visual instructions generated by LLMs sometimes contain misinterpretations about the visual information, e.g., object hallucination [847]. Therefore, it is crucial to design effective verification methods to control the quality of instruction data generated by LLMs [838]. Furthermore, more investigation is still needed into what makes good visual instructions and how visual instructions elicit specific multimodal abilities in MLLMs.
• Model training. Different from LLMs, MLLMs are not trained from scratch, but instead developed based on pre-trained language and vision models. Existing work employs a typical two-stage approach for training MLLMs, i.e., vision-language alignment pre-training and visual instruction tuning. In essence, existing MLLMs aim to (1) preserve the inherent capabilities and parametric knowledge of LLMs as much as possible, and meanwhile (2) effectively adapt to multimodal tasks by leveraging the pre-trained LLMs and visual encoders. To achieve the above two goals, two typical training strategies are often employed for visual instruction tuning, either only optimizing the connection module [156] or fine-tuning both the connector module and the LLM component [854]. As we can see, the former can preserve the original capacities of LLMs but likely yields weaker adaptation performance, while the latter can fully adapt to multimodal tasks but suffers from the loss of the original capacities of LLMs. More efforts should be made to investigate how to effectively balance the two aspects, so as to achieve improved multimodal capacities. In addition, existing MLLMs are still overly dependent on the capacities of LLMs, which poses limits on many multimodal tasks (e.g., space positioning). It would be meaningful to explore improved training approaches for the language models, so that multimodal information can also be utilized in this process.
• Safety and alignment. Safety and alignment have been widely discussed for LLMs, aiming to regulate the behaviors of models by technical approaches [66]. This topic is also important for MLLMs. Even a highly advanced MLLM (e.g., GPT-4V [133]) can be susceptible to safety issues. For example, GPT-4V might occasionally exhibit factual inaccuracies and baseless inferences about images. In some cases, it may even generate harmful content targeting specific individuals or groups [133]. Furthermore, open-sourced MLLMs are also prone to generating hallucinated responses [847] and can be easily manipulated to produce harmful content [861]. To address the aforementioned issues, some studies collect specialized visual instructions to mitigate the problem of hallucination [837]. Another alternative approach is to train a revision model to rectify hallucinated responses generated by MLLMs in a post-hoc way [862]. Additionally, aligning MLLMs with RLHF can also assist MLLMs in generating responses with improved factuality [863]. Despite these efforts, existing alignment techniques for MLLMs mainly concentrate on several specific aspects (e.g., hallucination), lacking a comprehensive consideration of alignment criteria. More efforts should be made to promote the research of safety and alignment for MLLMs.

8.1.5 KG-Enhanced LLM
Despite their excellent capacities, LLMs often suffer from challenges on knowledge-intensive tasks, such as the potential to generate hallucinated content [604] and the lack of domain-specific knowledge [865]. As a promising solution, knowledge graphs (KGs), which store enormous knowledge in the triple format, i.e., ⟨ head entity, relation, tail entity ⟩, can be utilized to enhance the task performance of LLMs by providing precise and necessary knowledge. Generally, knowledge-enhanced approaches can be extended to other forms of structured data (e.g., tables and databases) [864], while we limit our discussion to the integration of KGs for improving LLMs, which is detailed in two aspects, namely retrieval-augmented LLM and synergy-augmented LLM.

Retrieval-Augmented LLM. Due to the huge amount of fact records in a KG, existing work typically adopts a retrieval model to first obtain a relatively small subgraph from the KG, and then leverages it to enhance LLMs by enriching the relevant knowledge. Before the advent of LLMs, the retrieved subgraphs were often supplemented into training data, injecting knowledge information into PLMs via parameter learning [866–868]. In contrast, to leverage the retrieved knowledge, LLMs mainly incorporate it as part of the prompt, without parameter update. To implement this approach, there are two main technical problems, i.e., how to retrieve relevant knowledge from KGs and how to make better use of the structured data by LLMs. For the first issue (i.e., retrieving relevant knowledge), a typical approach is to train a small language model (e.g., RoBERTa) to identify question-related fact triples [869]. To further improve the retrieval performance, several studies also propose an iterative reading-then-reasoning framework, enabling the LLM to interact with the KG multiple times and acquire the required knowledge in a more accurate way [451]. For the second issue (i.e., utilizing the retrieved knowledge), a straightforward approach is to serialize the retrieved subgraph and craft specific prompts to include it as the input of LLMs [468, 653]. However, due to the loss of structured information in knowledge serialization, LLMs cannot fully capture the structural semantics conveyed by the original KGs. To address this issue, several model-based approaches train a specialized language model (e.g., T5) to transform the subgraph into natural language text [870]. To guarantee the transformation accuracy, this relies on sufficient training pairs (often constructed in an unsupervised way) [871] and excellent model capability [872].
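A minimal sketch of the retrieval-augmented prompting described above: retrieved triples are serialized as text and prepended to the question. The tiny triple store, the keyword-overlap scoring used in place of a trained retriever, and the `generate` callable are all illustrative assumptions.

```python
from typing import Callable

TRIPLES = [
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ("Marie Curie", "field of work", "radioactivity"),
    ("Pierre Curie", "spouse", "Marie Curie"),
]

def retrieve(question: str, k: int = 2) -> list[tuple[str, str, str]]:
    """Toy retriever: rank triples by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        TRIPLES,
        key=lambda t: -len(q_words & set(" ".join(t).lower().split())),
    )
    return scored[:k]

def kg_augmented_answer(question: str, generate: Callable[[str], str]) -> str:
    # Serialize each triple as "(head, relation, tail)" and prepend it as context.
    context = "\n".join(f"({h}, {r}, {t})" for h, r, t in retrieve(question))
    prompt = (
        "Answer the question using the knowledge graph facts below.\n"
        f"Facts:\n{context}\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)

# Hypothetical usage with a dummy model:
print(kg_augmented_answer("What prize did Marie Curie receive?",
                          lambda p: "Nobel Prize in Physics"))
```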

Synergy-Augmented LLM. To solve complex tasks (e.g., multi-hop question answering [658]), LLMs often need to query a KG multiple times, following a systematic solution plan. We call such a multi-turn interaction approach to enhancing LLMs synergy-augmented LLM. To better synergize the LLM and the KG in a complementary manner, recent studies propose to decompose the complex task into multiple sub-goals and iteratively solve each one by leveraging the necessary knowledge from the KG [451, 873, 874]. In this process, the LLM can be regarded as an autonomous agent (detailed in Section 9.2), which automatically generates the plan and executes it through interaction with the KG environment [873]. Specially, the mainstream approaches typically start by enumerating the candidates using the available knowledge information at the current step, and then retrieve the most appropriate candidates for the next step according to the question [873, 874]. By iterating the above two steps, LLMs can gradually collect relevant evidence [873, 874], and finally approach the correct solution. Despite the effectiveness, enumerating the candidates over the KG would lead to a vast search space [875]. To address this, StructGPT [451] proposes a more efficient way to access knowledge information using specialized interfaces for KGs. Specifically, it carefully designs the specialized interfaces according to the common data operations on KGs (e.g., relation extraction and triple extraction), to ensure efficient and accurate data extraction. In this way, LLMs can be instructed to better manipulate and process the structural information of KGs, thus achieving improved task performance.

Future Directions. Besides the above approaches, there are several promising directions for KG-enhanced LLMs that remain underexplored. First, due to the variety of structured data, it is still difficult for LLMs to directly leverage various kinds of knowledge sources, e.g., domain-specific KGs. Therefore, it is essential to explore unified ways to manipulate and utilize different knowledge sources with LLMs. As a potential solution, it is promising to develop effective approaches to help LLMs comprehend and make use of the access interfaces provided by specific knowledge sources to acquire precise knowledge [451], while more efforts should be made to investigate how to adapt to the data variety in a cost-effective way. Second, with the evolution of real-world information, the knowledge stored in LLMs may become outdated or incorrect. It is necessary to explore how to synchronize the updated knowledge into LLMs in a cost-effective manner [876, 877]. Third, it is promising to investigate the use of factual information from KGs to align LLMs in generating more faithful content [878, 879], which can help reduce the hallucination of LLMs.
In addition to exploring KG-enhanced LLMs, it is also meaningful to leverage LLMs to improve the tasks on the KG side (i.e., LLM4KG) [865, 880]. A typical example is that LLMs can help supplement or construct KGs. We omit the discussion of this part, since it is beyond our scope.

8.1.6 LLM for Evaluation
While human evaluation can generally offer reliable quality assessment, it is also often hindered by high annotation costs, significant time requirements, and annotation inconsistencies [881]. In contrast, automatic evaluation can be employed as a scalable alternative to human evaluation. Traditional automatic evaluations have relied on reference-based metrics (e.g., BLEU and ROUGE). Recently, the emergence of LLMs as general task solvers has highlighted their potential as automatic evaluators [649, 729], making it promising to conduct LLM-based evaluation. In the following part, we will introduce the recent progress on LLMs for evaluation, including the evaluation formats, methods, meta-evaluation, and remaining issues.

Evaluation Formats. Depending on the type of evaluation outcome, the evaluation format can be categorized into score-based evaluation and language-based evaluation. Score-based evaluation employs measurable metrics to assign quality scores (e.g., ratings or rankings) to the evaluated texts. A prevalent way is to conduct pairwise comparison, where LLMs are used to determine the partial order relation of candidate texts following specific guidelines [352, 649, 729], which greatly simplifies the evaluation task. However, it may face an inefficiency issue when scaling up the number of candidates [729]. When high-quality reference texts are available during evaluation, LLMs can be instructed to score texts under the guidance provided by the references [718, 729, 730]. On the other hand, language-based evaluation focuses on generating critiques and suggestions, offering qualitative explanations beyond simple quantitative scoring [369, 377, 882, 883]. It is particularly useful for gathering language feedback signals for human alignment tuning [369, 882]. Furthermore, it can evolve into a multi-turn interaction framework, where LLM-based evaluators provide natural language feedback to existing solutions from task solvers [884]. This framework evaluates the ability of LLMs to leverage language feedback for refining self-generated solutions.

Evaluation Methods. A common method for LLM-based evaluation involves prompting LLMs with specific instructions. To further improve the quality of LLM-based evaluation, recent work proposes to prompt LLMs with varied contexts to generate diverse evaluation feedback. These contexts vary in aspects such as the candidate order [649, 729], evaluation perspectives [885, 886] (e.g., relevance, clarity, originality), and evaluation explanations [649]. The multiple pieces of generated evaluation feedback are then aggregated to produce a final evaluation result, which makes the evaluation process less prone to biases from individual feedback and allows for a more thorough evaluation by covering a wider range of evaluation aspects. To further improve the quality of single-model evaluation, recent studies also develop multi-agent collaboration frameworks [886–888] or fine-tune LLMs as specialized evaluators [369, 377, 882, 883, 889]. In the multi-model collaboration mode, different LLMs evaluate the candidates by engaging in discussions to align preferences and reach a consensus [887, 888]. This method helps reduce the potential biases in individual models through the consensus reached by multiple agents. Another approach to improving single-model evaluation is to specialize LLMs as scorers or critics through fine-tuning [369, 377, 882, 883, 889]. This process involves creating datasets annotated with preferences and feedback from humans or proficient LLMs. These datasets are then used to train evaluation-oriented models, enabling them to generate pairwise preferences or language feedback. The specialized LLM evaluators demonstrate competitive performance with fewer parameters [377, 883, 889].

Meta-Evaluation. To effectively assess the quality of LLM-based evaluators, meta-evaluation benchmarks have been introduced for gauging the agreement with human preferences and the fairness of the evaluations made by LLMs [649, 729, 886, 890, 891]. As a representative benchmark, MT-Bench [729] evaluates the agreement between LLMs and human judgments, demonstrating that GPT-4 aligns closely with human preferences in no-tie comparisons on 80 multi-turn questions. In addition, to address potential biases arising from subjective human evaluations, LLMBar [890] manually designs outputs that are objectively worse but superficially appealing, which could mislead evaluators. The evaluation results reveal that even the most advanced LLMs still fall short of human-level evaluation in this challenging setting.

evaluators. The evaluation results reveal that even the most consistent answers across disciplines, balancing both depth
advanced LLMs still fall short of human-level evaluation in and breadth. Another quantitative analysis [901] shows that
the challenging setting. students utilizing ChatGPT (either keeping or refining the
results from LLMs as their own answers) perform better
Remaining Issues. As discussed in Section 7.1.1, recent
than average students in some courses from the computer
studies demonstrate that LLM-based evaluators expose
security field. Recently, several perspective papers [903, 904]
multiple types of bias, such as order bias, self-preference
also explore various application scenarios of LLMs in class-
bias, and length bias [649, 729]. Although some biases can
room teaching, such as teacher-student collaboration, per-
be mitigated through methods like multi-path ensemble or
sonalized learning, and assessment automation. However,
multi-agent collaboration, they remain inherent to LLM-
the application of LLMs in education may lead to a series
based evaluators. Consequently, addressing these biases
of practical issues, e.g., plagiarism, potential bias in AI-
intrinsically within the models continues to be an a chal-
generated content, overreliance on LLMs, and inequitable
lenging issue. In addition, recent work has revealed that
access for non-English speaking individuals [905].
LLMs may be incapable of understanding the self-generated
content, exhibiting a weaker understanding capacity com- Law is a specialized domain that is built on professional
pared to their generation capabilities [892]. Even the most domain knowledge. Recently, a number of studies have ap-
advanced LLMs still struggle identifying their reasoning or plied LLMs to solve various legal tasks, e.g., legal document
factual errors without external feedback [893, 894]. Conse- analysis [906], legal judgment prediction [907], and legal
quently, current LLM-based evaluators might not be ade- document writing [908]. A recent study [909] has found
quate for evaluating top-tier LLMs or complex tasks. This that LLMs exhibit powerful abilities of legal interpretation
underscores the importance of improvement approaches and reasoning. Moreover, the latest GPT-4 model achieves
for LLM-based evaluators, especially for evaluating capable a top 10% score in a simulated bar exam compared with
LLMs and complex tasks demanding sophisticated reason- human test-takers [46]. To further improve the performance
ing, planning, and domain-specific knowledge. of LLMs in the law domain, specially designed legal prompt
engineering are employed to yield advanced performance
8.2 LLM for Specific Domains in long legal document comprehension and complex legal
In this part, we discuss the applications of LLMs on several reasoning [910, 911]. To summarize the progress, LLMs can
representative domains, including healthcare, education, act as helpful assistants to legal profession. Despite the
law, finance, and scientific research assistance. progress, the use of LLMs in law raises concerns about
legal challenges, including copyright issues [912], personal
Healthcare is a vital application field closely related to information leakage [913], or bias and discrimination [914].
human life. Ever since the advent of ChatGPT, a number of
studies have applied ChatGPT or other LLMs to the medical Finance is an important field where LLMs have promis-
domain. It has been shown that LLMs are capable of han- ing application prospects. LLMs have been employed on
dling a variety of healthcare tasks, e.g., biology information various finance related tasks, such as numerical claim
extraction [765], medical advice consultation [895], mental detection [915], financial sentiment analysis [916], finan-
health analysis [896], and report simplification [897]. As cial named entity recognition [917], and financial reason-
the major technical approach, researchers typically design ing [918]. Despite the competitive zero-shot performance
specific prompts or instructions to guide LLMs to perform a exhibited by general-purpose LLMs in the finance tasks,
wide range of medical tasks. To further harness the power they still underperform domain-specific PLMs containing
of LLMs in the healthcare domain, researchers propose to million-scale parameters [915]. To leverage the scaling effect
develop healthcare-related LLMs [354, 898, 899]. Specifically, of LLMs, researchers collect large-scale finance corpora for
the Med-PaLM models [354, 898] achieves expert-level per- continually pre-training LLMs (e.g., BloombergGPT [358],
formance on the United States Medical Licensing Exami- XuanYuan 2.0 [919], and FinGPT [920]). BloombergGPT
nation (USMLE), and earns greater approval from physi- has demonstrated remarkable performance across a diverse
cians in answering consumer’s medical questions. However, range of financial tasks while maintaining competitive per-
LLMs may fabricate medical misinformation [897, 900], formance in general-purpose tasks [358]. Nevertheless, it is
e.g., misinterpreting medical terms and suggesting advice imperative to consider the potential risks in the application
inconsistent with medical guidelines. In addition, it would of LLMs in finance, as the generation of inaccurate or
also raise privacy concerns to upload the health information harmful content by LLMs could have significant adverse
of patients [765] into a commercial server that support the implications for financial markets [358]. Therefore, it needs
LLM. more strict reviewing and monitoring on the use of LLMs in
the financial field.
Education is also an important application domain where
LLMs potentially exert significant influence. Existing work Scientific research is another promising field that LLMs
has found that LLMs can achieve student-level performance can empower the development progress. Prior research
on standardized tests [46] in a variety of subjects of math- demonstrates the effectiveness of LLMs in handling
ematics (e.g., physics, computer science) on both multiple- knowledge-intensive scientific tasks (e.g., PubMedQA [921],
choice and free-response problems. In addition, empirical BioASQ [922]), especially for LLMs that are trained on
studies have shown that LLMs can serve as writing or read- scientific-related corpora [35, 218, 923]. Given the excel-
ing assistant for education [901, 902]. A recent study [902] lent general abilities and broad scientific knowledge, LLMs
reveals that ChatGPT is capable of generating logically hold significant potential as helpful assistants across var-
78

ious stages of the scientific research pipeline [924]. First, 9.1.1 Scaling Position Embeddings
during the literature survey stage, LLMs can help conduct Transformer-based LLMs can learn effective position em-
a comprehensive overview of the progress in a specific beddings within the maximum training length. When
research field [925, 926]. Second, during the research idea adapting LLMs to language tasks beyond the maximum
generation stage, LLMs demonstrate the ability to generate training length, it is necessary to scale to larger position
intriguing scientific hypotheses [927]. Third, during the data indices. Specially, some position embedding methods have
analysis stage, LLMs can be employed to conduct automatic been shown to possess a certain degree of ability to gener-
approaches to analyzing the data characteristics, includ- alize to text beyond the training length, which is termed as
ing data exploration, visualization, and deriving analytical extrapolation capability, including T5 bias [82], ALiBi [283],
conclusions [928, 929]. Fourth, during the paper writing xPos [296] and even NoPE [939]. However, as one of the
stage, researchers can also benefit from the assistance of mainstream position embedding methods, RoPE exhibits
LLMs in scientific writing [930, 931], in which LLMs can limited extrapolation ability in empirical studies [259]. In
offer valuable support for scientific writing through diverse the following, we discuss several methods that adapt RoPE
means, such as summarizing the existing content and pol- to longer texts.
ishing the writing [932]. In addition, LLMs can aid in • Direct model fine-tuning. To adapt LLMs to a long
the automated paper review process, encompassing tasks context window, a straightforward approach is to directly
such as error detection, checklist verification, and candidate fine-tune the models on long texts with the target length.
ranking [933]. Despite these advances, there is much room The context extension can be scheduled with gradually
for improving the capacities of LLMs to serve as helpful, increased lengths in a multi-stage manner (e.g., 2K → 8K
trustworthy scientific assistants, to both increase the quality → 32K). To conduct effective extension, it often requires
of the generated scientific content and reduce the harmful specially prepared long text data for training (Section 9.1.3),
hallucinations. and data quality plays a critical role in improving LLM’s
Summary. In addition to the aforementioned work, the long context capacities [940]. However, such a direct fine-
applications of LLMs have been also discussed in several tuning approach tends to be inherently slow when adapting
other domains. For instance, in the psychologic domain, LLMs for long texts [259].
some recent work has studied the human-like characteristics • Position interpolation. This method downscales the po-
of LLMs, such as self-awareness, theory of mind (ToM), and sition indices within the original context window, to avoid
affective computing [934, 935]. In particular, an empirical out-of-distribution rotation angles during pre-training [259,
evaluation of ToM conducted on two classic false-belief 941]. Specifically, this approach multiplies all position in-
tasks speculates that LLMs may have ToM-like abilities dices by a scaling coefficient L/L′ (L < L′ ), where L and
since the model in the GPT-3.5 series achieves comparable L′ denote the original and target context window length,
performance with nine-year-old children in ToM task [934]. respectively. Experimental results [259] have shown that
In addition, another line of work has investigated applying this method can extend the context window effectively and
LLMs into the software development domain, e.g., code efficiently, compared to the above approach of direct model
suggestion [936], code summarization [937], and automated fine-tuning. However, it is worth noting that this technique
program repair [938]. To summarize, to assist humans by may have an adverse impact on the model’s performance
LLMs in real-world tasks has become a significant area of when handling normal texts within the original context
research. However, it also presents challenges. Ensuring the window [259, 942].
accuracy of LLM-generated content, addressing biases, and • Position truncation. To mitigate the challenges posed
maintaining user privacy and data security are crucial con- by out-of-distribution rotation angles, another practical ap-
siderations when applying LLMs to real-world scenarios. proach is to truncate longer relative positions to satisfy the
requirement of the maximum training length. ReRoPE and
9 A DVANCED TOPICS LeakyReRoPE [943] introduce a pre-defined window length
for truncation, which is smaller than the maximum training
In this section, we focus on discussing several advanced
length. Specifically, position indices within this pre-defined
topics that have attracted extensive attention in the research
window would be retained, while those indices beyond the
community, and these topics are related to challenging
window are either truncated to the pre-defined window
technical issues that largely limit LLM’s capacity. Next, we
length or interpolated to align with the maximum training
will introduce these issues and discuss how to address them
length. This strategy can preserve the attention mechanism
with feasible approaches.
with the neighbor tokens (within the window length), and
further enhance the extrapolation capacity. However, this
9.1 Long Context Modeling approach needs to compute the attention matrices twice,
In real-world application scenarios, there are increasing accommodating additional computational costs.
demands for long context modeling capacities of LLMs, • Base modification. Since LLMs are usually trained with
especially for text file processing (e.g., information parsing, a pre-set maximum training length, wavelengths in certain
extraction, and summarization). Many mainstream LLMs dimensions of RoPE may exceed the training length for
have provided support for long context window. To enhance longer text [295], on which language models may not be
the long context modeling abilities, there are generally two sufficiently trained, i.e., training data can’t cover a complete
widely used approaches, namely scaling position embed- rotation cycle. Thus, when processing long text, some ro-
dings and adapting context window. Next, we introduce the tation angles for certain dimensions would never be seen
two approaches in detail. in the training phase [351]. Formally, given a fixed rotation
79

angle t · θi , a smaller basis θi allows for a greater distance of attention patterns in a Transformer [951], e.g.,the top-
t, i.e., enabling the modeling of longer texts [254, 295, 940]. k attention scores can well approximate the original full
According to the formula θi = b−2(i−1)/d in Equation 4, attention. Therefore, a number of studies propose different
decreasing the basis can be achieved by increasing the methods to select the most relevant tokens from token-level
value of the base. In addition, decreasing the base can also or block-level memory units for generation. Token-level se-
help re-scale the wavelengths of all dimensions below the lection methods store the past keys in external memory and
training length, while it often needs continual pre-training utilize a k -NN search method to retrieve the k most relevant
to adapt the LLMs to long context windows [351]. A re- tokens for generation [257, 951, 952]. For a decoder model,
cent study [351] has empirically compared these two base it typically employs one certain layer to access these top-
modification methods, and shown that decreasing the base k external tokens, while still adopting the normal context
demonstrates better extrapolation performance, while in- window in the rest layers [257, 952]. Block-level selection
creasing the base performs better within the training length. methods [953, 954] first segment the long sequence into
• Basis truncation. Similar to the base modification, the blocks with the same length and represent each block into
truncation of the basis also concentrates on dealing with several key vectors for retrieval. Then, the most relevant
the singular dimensions with wavelengths exceeding the blocks to the query as well as the neighbor and initial
training length [944]. According to the definition λi = 2π/θi blocks will be selected for attention computations. Unlike
in Equation 5, the dimension with a large wavelength λi token-level selection methods, block-level selection methods
has a small basis θi accordingly. Based on this observation, typically retrieve different tokens with specific heads.
this approach first defines a basis range [a, c]. Given the
basis range, the value of basis is modified according to the 9.1.3 Long Text Data
following ways: (1) when θi ≥ c, the value is retained, To further enhance the long context modeling capacity,
(2) when θi ≤ a, the value is set to zero, and (3) when it typically requires continual pre-training with specially
a < θi < c, the value is truncated to a fixed small curated long text data. Next, we discuss how to prepare the
value. Via basis truncation, the out-of-distribution rotation long text data from the two aspects of quantity and quality.
angles can be avoided at larger position indices. However, • Quantity effect. Different from the pre-training phase
this approach does not perform very well at long context that requires vast amounts of data, a small amount of long-
tasks [944]. text data for continual pre-training is sufficient for context
window extension [259]. Several studies show that LLMs
9.1.2 Adapting Context Window have obtained the capability of utilizing distant information
Since Transformer-based LLMs have limited context win- via large-scale pre-training data, and thus it only needs
dows, they can not directly integrate or utilize the entire to adapt for extended context windows during continual
information of the long sequences exceeding the context pre-training [955]. Typically, it has shown that LLaMA-2-
window. To alleviate the limitation, several methods have 7B or LLaMA-2-13B can achieve a context window length
been proposed to adapt LLMs to long context, as discussed of over 100K tokens and effective context utilization [955]
below. with the training on several billion tokens. However, the
• Parallel context window. Inspired by fusion-in- ability to handle short text of LLMs may be affected to some
decoder [945], parallel context window methods [424, 946] extent [259].
adopt a divide-and-conquer strategy to process input text. • Quality effect. In addition to the quantity, the quality
Specially, it divides the input text into multiple segments, of long text data is essential to long context modeling for
each independently encoded with shared position embed- LLMs. For instance, LongWanjuan [956] categorize long
dings. At the generation stage, the attention masks are texts into holistic, aggregated, and chaotic long texts based
modified to make that subsequent tokens can access to on three metrics, i.e., coherence, cohesion, and complexity,
previous tokens in each segment. Nevertheless, this method and they show that removing chaotic data and keeping
cannot distinguish the order of different segments, resulting coherent and cohesive data are useful to enhance the long
in a limited model capacity on certain tasks. text modeling capacities of LLMs. Further, up-sampling
• Λ-shaped context window. Some prior work has revealed cohesive data can lead to further improvement. In addition,
that LLMs tend to allocate greater attention weights to when preparing long text data, data mixture should be
the starting and nearest tokens among all previous to- carefully adjusted for avoiding large distribution drift with
kens [947, 948], and it potentially results in the “lost in the the original pre-training data.
middle” phenomenon [949]. Based on this observation, LM- In addition to the studies based on vanilla Transformer,
Infinite [950] and StreamingLLM [948] propose to employ there are a surge of Transformer variants with efficient at-
a “Λ-shaped” attention mask, which selectively preserves tentions and other efficient architectures, aiming to alleviate
the initial tokens and the nearest tokens that each query can the high computational costs for modeling long texts. These
attend to and then discards any tokens beyond this scope. studies are discussed in Section 4.2.1 and Section 4.2.2. Fur-
Experiments demonstrate that this method can facilitate thermore, context compression and prompting techniques
extra-long text generation with a fixed memory [948]. How- (e.g., iterative reasoning [957]) have also been proven to
ever, it may struggle to model the long-range dependency be a viable strategy for handling long text tasks [957–960],
in the context window, since it cannot effectively utilize the without the need of model adaption.
information from the discarded tokens [948].
• Token selection. It has been shown that a relatively 9.2 LLM-empowered Agent
small subset of tokens can effectively capture the majority The research on agents in AI aims to develop entities that
80

can perceive the environment, make decisions, and take are assigned goals, they follow the above workflow to
actions to achieve specific goals [961]. However, traditional accomplish tasks through multi-turn interactions with the
agents are often limited to heuristic rules or specific environ- environment.
ments, which constrain their generalization to open-domain To summarize, in an LLM-based agent, the LLM serves
scenarios [962]. Given that LLMs possess excellent capacities as the core computation unit and is equipped with compo-
in solving complex tasks, they have rapidly emerged as nents including memory, planning, and execution. These com-
promising solutions for serving as the core computation ponents are integrated in a systematic way under the control
unit of agents [821]. In this part, we will first introduce of the LLM during interactions with the environment. For
the framework for LLM-based agents, then explore their more details, the readers might refer to the comprehensive
applications, and finally discuss the future directions. survey for LLM-based AI agents [821].

9.2.1 Overall Framework. 9.2.2 Applications


Next, we first detail the key components of an LLM-based Recently, LLM-based agents have shown great potential in
agent and then present the typical workflow. autonomously solving complex tasks, making it feasible to
rapidly develop capable applications for specific domains
Components. Typically, there are three main components or tasks. In this section, we will discuss the applications in
in an LLM-based agent: memory, planning45 , and execution. single-agent and multi-agent scenarios.
Specifically, the memory component aims to store the in-
formation perceived from the environment and can be Single-agent based Applications. Applications based on
utilized to support decision-making. In particular, LLM- a single-agent mode mainly aim to develop capable task
based agents usually maintain information in both short- solvers that can autonomously complete user requests. A
term memory and long-term memory with the operations large number of single-agent projects have been developed,
of reading and writing. Short-term memory usually refers which focus on general-purpose task solving. As a rep-
to the internal context window of LLMs (i.e., input), where resentative project, AutoGPT [536] empowers LLMs with
LLMs can read and write through actions like reason- long/short-term memory management and external tools
ing [963]. While long-term memory can be mapped to the like search engines. In order to autonomously address a
external storage like vector databases [539], where LLMs user request, AutoGPT understands the request with knowl-
can read through retrieval and write with reflection [688]. edge from its memory and actions like reasoning, decom-
Specially, profiles are usually implemented with long-term poses it into a detailed plan, executes the plan step-by-
memory, which is an important feature for an agent that step with the assistance of tools, and refines the rest plan
specifies its role and function [821]. The planning component based on feedback from the environment. Such an iterative
is responsible for generating the action plan based on the in- process continues until the user request is successfully re-
formation from the memory component. In data format, the solved. Other similar projects include GPT-Engineer [964]
plan usually takes the form of text-based instructions [434] and XAgent [965]. In addition, there is also some work that
or code-based programs [436]. To generate it, LLM-based aims to develop autonomous agents for specific domains,
agents will first propose several candidates and then select such as WebGPT [81] for the web-browsing environment,
a more suitable one among them [429]. The initial plan ProgPrompt [532] for the real-life environment, and Voy-
can be further refined with execution feedback from the ager [699] for the Minecraft environment.
environment [530]. The execution component is in charge Multi-agent based Applications. Different from single-
of carrying out the plan from the planning component, agent systems where agents work independently, multi-
which can be fulfilled by the internal LLM [434] or external agent systems work in collaboration to unleash collective
tools [963]. intelligence. Typically, multiple agents can be instantiated
Workflow. With the three components mentioned above, a from the same or different LLMs, each with their respective
typical workflow of an LLM-based agent is as follows. First, roles and functions. According to the coordinating strategies
it receives information from the environment and writes among these agents, multi-agent systems can be divided
it into short-term memory. Then, the agent processes the into two categories: cooperation-based and competition-
newly received information in the short-term memory. Such based. In the cooperation-based mode, to share informa-
a process can be enhanced with information retrieved from tion and seek collaborative actions among agents, various
long-term memory. Subsequently, the planning component communication protocols have been proposed, including
utilizes the processed information from short-term memory free-form dialogue [966], structured document [967], and
to generate the next plan. Finally, the execution component data embedding [968]. Based on the communication pro-
carries out the plan generated from the planning compo- tocol, agents can be effectively organized for downstream
nent, which can be further assisted with external tools. applications, such as software engineering [967], user be-
By repeating the aforementioned process, the LLM-based havior analysis [822, 824], and society simulation [535].
agent can autonomously adjust its behavior in response As a representative project, LangChain46 is a framework
to feedback from the environment and ultimately achieve for developing multi-agent based applications powered by
its goal. Once LLM-based agents receive user requests or LLMs. It enables users to deploy different roles of LLM-
based agents and utilize them to solve tasks via working in
45. Section 6.4 introduces planning as a utilization approach for collaboration. In addition, other similar frameworks, such
LLMs, while in this section, we describe its utilization as a functional
component in LLM-based agents. 46. https://www.langchain.com/
81

as AgentVerse [969] and AutoGen [970], can also be utilized Robustness and Trustworthiness. The deployment of LLM-
for developing multi-agent collaborative systems. In the based agent systems necessitates robustness and trustwor-
competition-based mode, debate serves as one of the pop- thiness [973]. The system should be resilient against adver-
ular communication protocols to foster divergent thinking sarial inputs from various modalities such as text, image,
and elicit valuable external feedback among agents. Such a or audio. Incorporating existing techniques like adversarial
way is beneficial for domains that demand precise decision- training, data augmentation, and sample detection to in-
making and accurate responses, such as mathematical rea- crease sensitivity to aggressive information in the input can
soning [971] and evaluation [734]. fortify the system’s security. Concurrently, it is challenging
to ensure the credibility of LLM-based agents given the se-
9.2.3 Discussion vere hallucination issues inherently rooted in LLMs. While
Despite the huge success, there still remain several technical existing methods such as constrained decoding during infer-
challenges that limit the development and application of ence and external knowledge integration can mitigate these
LLM-based agents. In this part, we discuss the remaining issues to some extent [974], further exploration of efficient
challenges from the perspective of computational burden, and effective alignment methods is necessary to develop
human alignment, complex capability extension, and ro- reliable agent systems.
bustness.
9.3 Analysis and Optimization for Model Training
Computational Costs. With the ever-increasing capabilities
of LLMs [821], their performance on agent applications In Section 4.3, we have introduced basic techniques for
demonstrate promising performance. However, it also in- training LLMs. As the scale of model parameters and data
troduces significant issues in terms of efficiency due to continues to expand, efficiently training larger models with
the high computational demands and intricate interaction limited computational resources has become a critical tech-
mechanisms involved. Furthermore, in multi-agent systems nical challenge in the development of LLMs. This challenge
with numerous LLM instances, as the number of agents in- primarily encompasses two technical issues: firstly, how
creases, this issue would be more severe, since the commu- to optimize memory usage when loading and processing
nication network within multi-agent systems also becomes models across GPU clusters, and secondly, how to maintain
increasingly complex. Therefore, more effective and efficient or improve training efficiency as models scale. Next, we
communication protocols and architectures are essential will conduct quantitative analyses and introduce advanced
to support the heightened coordination demands among training techniques addressing the two aforementioned is-
agents. sues.

Alignment with Human Sociality. LLM-based agents can 9.3.1 Estimation of Training Memory Consumption
be conceptualized as individual entities, with the emergence
In this part, we will first estimate the GPU memory con-
of sociability resulting from the interaction among these
sumption for training LLMs.
agents. Autonomous agents often assume specific roles such
as coders or researchers, making role-playing a vital capa- Model States Cost. Model states often occupy the majority
bility for agents to solve downstream tasks [972]. However, of memory during training, typically consisting of model
LLMs, typically trained on web corpora, face difficulties in parameters, gradients, and optimizer states. As introduced
accurately mimicking roles that are infrequently discussed in Section 4.3.2, mixed precision training has been widely
online or are emergent. They also lack self-awareness in utilized in LLM training. For a model containing P param-
conversational scenarios due to inadequate modeling of hu- eters, both the model parameters and their gradients are
man cognitive psychology. Thus, it is imperative to develop typically stored as 16-bit floating-point numbers, requiring
improved agent technique, including both training methods a total storage of 4P bytes (2P for the parameters and 2P for
and architectures, to better align LLMs with human prefer- the gradients). When using optimizers such as Adam [318]
ences and enhance their role-playing abilities. or AdamW [975], an additional set of 32-bit floating-point
numbers are needed to store the optimizer states, including
Capability Extension. LLM-based agents, similar to hu-
the copy of model parameters, gradient momenta, and
mans, require advanced capabilities (e.g., tool learning) to
gradient variances, which leads to a total storage of 12P
fulfill complex functions or tasks, which might be beyond
bytes (4P each for each of these components). Consequently,
their capacity scope. To address this issue, tool use has
the total memory required for storing the model states
become a widely-used approach to enhancing LLMs’ capac-
during training is 16P bytes. For instance, training LLaMA-
ities in various complex tasks. For example, when answer-
7B (P ≈ 6.7 × 109 ) requires around 100GB memory to store
ing informative user questions, they use search engines to
the model states alone.
retrieve information from the internet. However, the quality
and quantity of existing available tools impose limitations Activations Cost. Activations are the intermediate states
on their accessibility and comprehensiveness. And it would that require to be stored in the forward pass for gradient
become more difficult for LLM-based agents to use such computation during backpropagation. For example, for a
∂Y
limited tools when interacting with dynamic and changing binary operation Y = W X , calculating the gradient ∂W
environments. In addition, as the scale of tools expands, necessitates the input X , which should be preserved dur-
the compatibility and extensibility between the agents and ing the forward pass. In Table 18, we list the estimation
tools must be further improved to facilitate complex task of the activation memory consumption for different com-
resolution. ponents within the Transformer model. Take LLaMA-7B
82

(V = 32, 000, L = 32, H = 4, 096, H ′ = 11, 008, N = 32) as used to optimize memory usage during backpropagation.
an example, it would take 16GB memory to store activations Specifically, the activations need to be retained during the
per device under the setting B = 1, T = 2, 048. forward pass. However, storing all activation values for each
layer requires a significant amount of memory resources
TABLE 18: The activation memory consumption of each (detailed in Table 18). To reduce the memory cost, gradient
computation within the LLaMA model based on research checkpointing retains only a subset of the activations during
work [976]. We denote batch size by B , sequence length by the forward pass and recomputes these values during the
T , the vocabulary size by V , the number of head in the backward pass to save memory, albeit with additional com-
attention module by N , the dimension of each head by D, putational overhead. In implementation, gradient check-
the hidden size by H (H = N D), and the intermediate pointing typically involves storing the input of each Trans-
size inside FFN by H ′ . Equations ➀-➈ are layer-wise and former layer and recomputing the corresponding activation
need to be multiplied by the number of the layers L when values during backpropagation.
computing the total consumption.
ZeRO. Zero redundancy optimizer (ZeRO) [977] technique,
Equations Activation consumption proposed by the DeepSpeed library, focuses on alleviating
the issue of memory redundancy in data parallelism. As
➀ Q, K, V = XW Q,K,V store X with size 2BT H
➁ Q, K = RoPE(Q, K) store Q and K with size 4BT H mentioned in Section 4.3.2, data parallelism requires each
➂ O = Attn (Q, K, V ) store Q, K , and V with size GPU to store the same copy of the model states, resulting
6T H and results of softmax in a memory consumption of 16P bytes per GPU. A direct
with size 2BT 2 N side effect of data parallelism is that it memory redundancy
➃ X = OW O store O with size 2BT H issues, since not all of the above data is necessary to be
➄ X = Add&Norm(X) store X with size 2BT H retained on each GPU. To resolve it, the ZeRO technique
➅ G, U = X[W G , W U ] store X with size 2BT H aims to retain only a fraction of data on each GPU, while the
➆ D = Swish(G) · U store G and U with size 4BT H ′
rest data can be obtained from other GPUs when required.
➇ X = DW D store D with size 2BT H ′
Specifically, ZeRO provides three strategies, depending on
➈ X = Add&Norm(X) store X with size 2BT H
➉ CE(softmax(XW L )) store X with size 2BT H and re- how the three parts of the data are stored, namely optimizer
sults of softmax with size 4BT V state partitioning (ZeRO-1), gradient partitioning (ZeRO-
2), and parameter partitioning (ZeRO-3). Empirical results
indicate that the first two strategies do not increase the
Other Memory Cost. In addition to the main factors af- communication overhead, and the third solution increases
fecting GPU memory consumption discussed above, the about 50% communication overhead but saves memory
memory usage also includes the following aspects: proportional to the number of GPUs. PyTorch has imple-
• Deep learning frameworks. The PyTorch framework re- mented a similar technique as ZeRO, called fully sharded
quires approximately 1GB of GPU memory when loading data parallel (FSDP) [330].
its core functions. This is the essential overhead for the
framework to operate. Offload. In GPU-limited environments, DeepSpeed has pro-
• Distributed frameworks. When utilizing distributed posed the offload technique [978], which can significantly
training frameworks (e.g., DeepSpeed), its GPU memory reduce the GPU memory required for training by offloading
usage can fluctuate between 1GB and 4GB. The exact part of the model states and computational overhead to CPU
amount depends on the level of optimization and the hyper- memory. Specifically, gradients and optimizer states would
parameter settings. This portion of the memory is primarily be offloaded to CPU memory, with only the model param-
used to optimize memory management and communication eters kept on GPU. The computationally intensive forward
efficiency during the training process. and backward propagation still need to be performed on
• Intermediate results and memory fragmentation. Besides GPU to ensure efficiency, while parameter update, which
the activations, there also exist intermediate results that will requires relatively fewer computations, are executed on
affect the peak memory consumption. Take the computation CPU to reduce GPU memory overhead. Furthermore, In-
of the softmax function in Equation ➉ as an example, finity [979] allows training models that exceed the GPU
the implementation of the Transformers library requires an memory limits by utilizing high-speed disk storage (e.g.,
additional 8BT V bytes of memory, as it needs to store two NVMe).
additional copies of the 32-bit input (4BT V bytes each).
Moreover, during the training process, memory fragmenta- 9.3.3 Efficiency Optimization Methods
tion occurs due to the non-contiguous allocation and release In addition to memory-saving techniques, it is also crucial to
of memory, typically leading to an additional 0.5GB to 1GB maintain computational throughput as the model scales. In
of memory consumption. what follows, we will describe two representative efficiency
optimization methods.
9.3.2 Memory Optimization Methods
FlashAttention. FlashAttention [303, 980] is an optimization
Based on the aforementioned analysis, we will next intro-
method for the attention mechanism that significantly re-
duce several typical methods for optimizing the memory
duces the memory transfer during attention computation.
usage for training LLMs.
The core idea is to minimize the storage of intermediate
Gradient Checkpointing. Gradient checkpointing [329], results and directly obtain the final result. According

to the
also known as activation recomputation, is a technique attention computation equation softmax( QK √
D
)V , multiple
83

intermediate results, such as QK ⊺ and the attention score which is measured in FLOP/byte. For example, the half-
matrix, need to be explicitly retained, leading to numerous precision compute and bandwidth of the A100 GPU are 312
memory read-write operations. FlashAttention uses spe- TFLOP/s and 2039GB/s, respectively. Correspondingly, its
cially designed methods, such as matrix partition and opera- maximum arithmetic intensity is 142.51 FLOP/byte47 .
tor fusion, to keep intermediate results in the cache until the • Model efficiency metrics. Similarly, each operation (e.g.,
final result is obtained, thus reducing the amount of mem- matrix multiplication) of the model can be measured by
ory read and write operations. Additionally, FlashAttention two corresponding metrics: the computation amount and the
can effectively reduce the peak memory usage and activa- data transfer amount. The former refers to the total number
tion memory consumption (Section 9.3) during the LLM of floating-point operations, measured in FLOPs. The latter
training and inference. By using FlashAttention, LLaMA- refers to the total amount of GPU memory read and write
2 (7B) with a sequence length of 2,048 and a batch size of 8 operations, measured in bytes. Analogous to the arithmetic
requires only one-tenth of the computation time compared intensity of a GPU, the arithmetic intensity I of a model oper-
to the standard method. ation (e.g., matrix multiplication) can be defined as the ratio
of computation to data transfer, with units of FLOP/byte.
Sequence Parallelism. Compared with the 3D parallelism When the model’s arithmetic intensity I is less than the
introduced in Section 4.3, sequence parallelism can be GPU’s maximum arithmetic intensity Imax , it indicates that
considered a fourth parallelism dimension in pre-training, the maximum memory bandwidth of the GPU is lower than
particularly when handling long data sequences. The core the speed required. Consequently, the model’s efficiency
idea is to partition the sequence across multiple devices will primarily be limited by memory bandwidth, and the
for parallel computation. The primary challenge lies in operation is called memory-bound. Conversely, when I ex-
minimizing communication across the devices during atten- ceeds Imax , it suggests that the GPU’s maximum floating-
tion computation. DeepSpeed-Ulysses [981] partitions the point operation speed is lower than the speed required. In
sequence along the hidden dimension, allowing each device this case, the model’s efficiency will mainly be constrained
to receive a subset of the attention heads and compute by the GPU’s compute capability, and the operation is called
attention for different heads in parallel. In comparison, Ring compute-bound.
Attention [982] partitions the sequence along the length
dimension, where the query matrices on each device are in Bottleneck Analysis. Based on the above analysis, we can
turn computed with the key and value matrices on other obtain the arithmetic intensity for each operation during
devices. Furthermore, Ring Attention is also compatible both the prefill and decoding stages, as shown in Tables 19
with FlashAttention and can be considered as its distributed and 20, thereby better identifying the bottleneck operations
extension. in the inference process.
• Prefill stage. In the following analysis, we will still
take the LLaMA (7B) model in Table 18 as an example
9.4 Analysis and Optimization for Model Inference
(N = 32, D = 128, H = 4096) and assume a batch size of
In Section 4.2.4, we have introduced the basic decoding 8 and a sequence length of 1024 (i.e., B = 8, T = 1024).
strategies for using LLMs. As inference efficiency is criti- Substituting these values into Table 19, we can find that
cally important for the application of LLMs, we next will the arithmetic intensity for linear transformations (Equa-
quantitatively analyze the efficiency of the inference process tions ➀➃➅➇) is approximately 2730.67, for multi-head at-
and also present corresponding optimization methods. tention (Equation ➂) it is approximately 114.67, while the
intensity for other operations (Equations ➁➄➆➉) is around
9.4.1 Analysis of Inference Efficiency 1. When using an A100 (80G) GPU with Imax = 142.51,
Overall, the inference process of LLMs can be divided into the arithmetic intensities of the linear transformations and
two stages for overhead analysis: (1) the prefill stage, which multi-head attention operations are all above or close to the
computes the states and caches the key-value tensors for the maximum value. Given that these operations occupy the
input sequence; and (2) the decoding stage, which computes majority of the computations during the prefill stage, we
the states of the newly generated tokens, updates the key- can conclude that prefill stage is actually compute-bound.
value cache (KV cache, and continuously generate tokens • Decoding stage. Similarly, substituting these values into
in an auto-regressive way until the generation process is the arithmetic intensity formulas in Table 20 for the decod-
complete [984]. ing stage reveals that the arithmetic intensities of the lin-
ear transformations and multi-head attention are all below
Inference Efficiency Measurement. To quantitatively an-
8, which is much lower than the A100 GPU’s maximum
alyze the inference efficiency, we next will introduce two
intensity 142.51. This indicates that the decoding stage is
widely-used metrics for measuring inference efficiency.
constrained by the GPU’s data transfer speed (i.e., memory-
• GPU performance metrics. First, we introduce the com-
bound), a problem commonly referred to as the memory wall.
pute capability and memory bandwidth to evaluate the effi-
The analysis indicates that inefficiencies in LLM inference
ciency of a certain GPU. The compute capability of a GPU
primarily occur during the decoding stage.
refers to the number of floating-point operations (FLOP)
that it can perform per second, measured in FLOP/s. The 9.4.2 System-level Optimization
bandwidth of a GPU refers to the amount of memory read To mitigate the memory wall issue, an intuitive idea is
and write operations it can perform per second, measured in to reduce the data transfer operations as possible, thereby
byte/s. The ratio of compute to bandwidth is known as the
maximum arithmetic intensity of the GPU, denoted as Imax , 47. https://www.nvidia.com/en-us/data-center/a100/
84

TABLE 19: The computation, data transfer, and arithmetic intensity during the prefill stage. We use the asymptotic notation
O to denote the complexity of data transfer amount, where the constant factor of the complexity is related to the specific
implementation method. Table source: [983].

Equations Computation Data transfer Arithmetic intensity


 
➀ Q, K, V = XW Q,K,V 6BT H 2 O(BT H + H 2 ) O 1 +1 1
H BT
➁ Q, K = RoPE(Q, K) 6BT H O(BT H) O(1)
  1
1+ D
2 2
➂ O = Attn(Q, K, V ) 4BT N D + 4BT N O(BT 2 N + BT N D) O 1 1
 D+T 
O 2 2 1
➃ X = OW 2BT H O(BT H + H ) O 1 + 1
 H BT
1
➄ X = Add&Norm(X) 5BT H O(BT H + H) O 1
1+ BT 
➅ G, U = X[W G , W U ] 4BT HH ′ O(BT H + BT H ′ + HH ′ ) O 1 + 1′
1
1
+ BT
H H
➆ D = Swish(G) · U 2BT H ′ O(BT H ′ ) O(1)
 
D ′ ′ ′
➇ X = DW 2BT HH O(BT H + BT H + HH ) O 1 + 11 + 1
 H H ′ BT
➈ X = Add&Norm(X) 5BT H O(BT H + H) O 1+1 1
BT

TABLE 20: The computation, data transfer, and arithmetic intensity during the decoding stage. Table source: [983].

Equations Computation Data transfer Arithmetic intensity


 
➀ q, k, v = XW QKV 6BH 2 O(BH + H 2 ) O 1+ 1
1
H B
➁ q, k = RoPE(q, k) 6BH O(BH) O(1)
➂ K, V = Cache(k, v) - O(BT N D) or O(BN D) -   1
1+ D
➃ o = Attn(q, K, V ) 4BT N D + 4BT N O(BT N + BT N D + BN D) O 1 1
 1+ D +T
➄ X = oW O 2BH 2 O(BH + H 2 ) O 1
1
1
 H + B
➅ X = Add&Norm(X) 5BH O(BH + H) O 1+1 1
 B 
➆ g, u = X[W G , W U ] 4BHH ′ O(BH + BH ′ + HH ′ ) O 1 + 1′
1
1
+B
H H
➇ d = Swish(g) · u 2BH ′ O(BH ′ ) O(1)
 
D ′ ′ ′
➈ X = dW 2BHH O(BH + BH + HH ) O 1 + 11 + 1
 H H′ B
➉ X = Add&Norm(X) 5BH O(BH + H) O 1+1 1
B

enhancing the arithmetic intensity. In this part, we will intro- cate new GPU memory for each concatenation, copying the
duce several system-level optimization methods to achieve original KV cache and the new hidden states into the newly
the reduction in data transfer. allocated memory. This process leads to repeated memory
read-write operations and substantial memory fragmenta-
FlashAttention and Flash-Decoding. The FlashAttention tion. PagedAttention addresses this issue by introducing
method discussed in Section 9.3.3 can also be applied at a memory paging management method, preallocating sev-
the prefill stage, as it reduces data transfer operations and eral blocks of memory for future KV caches, which can
effectively increases arithmetic intensity. However, this op- largely reduce the memory allocation operations during
timization technique is not directly applicable during the concatenation. Additionally, PagedAttention optimizes the
decoding stage, where only the current query vector needs attention computation by increasing the parallelism. It uses
to be computed with the KV cache matrices. To further operator fusion to parallelize the computation of the query
optimize the decoding process, Flash-Decoding [985] has vector with multiple KV cache chunk, thereby enhancing
been proposed based on FlashAttention, particularly for the computational efficiency.
long sequences, which shares a similar idea with sequence
parallelism. Specifically, Flash-Decoding splits the KV cache Batch Management Optimization. Batch management op-
into smaller chunks, allowing the computation of the query timization aims to increase the batch size during the decod-
vector with these chunks in parallel, thereby improving the ing stage to enhance arithmetic intensity. A representative
decoding efficiency. method is continuous batching, proposed by vLLM [304].
Unlike traditional fixed-length batch processing, this tech-
PagedAttention. PagedAttention [304] focuses on optimiz- nique breaks down each request into a prefill iteration
ing KV cache and attention computation, significantly re- and several single-step decoding iterations, and continu-
ducing data transfer operations in these two aspects. In KV ous batching further employ heuristic algorithms to select
cache concatenation, traditional methods often need to allo- requests for prefill or single-step decoding iteration. This
85

fine-grained batching mechanism allows for handling more of this method still largely lags behind autoregressive meth-
requests simultaneously, which is has the same effect as in- ods. To improve the quality of the generated text, several
creasing the batch size. Furthermore, DeepSpeed-MII [986] studies attempt to combine both decoding methods, propos-
introduces Dynamic SplitFuse, which splits the prefill stage ing semi-autoregressive decoding methods [994] that gener-
into multiple iterations and allows simultaneous prefill and ate a group of tokens (e.g., 3 to 10 tokens) at each step and
decoding in one computation, resulting in larger batches use these tokens as input to generate the next group. How-
and higher inference throughput. ever, existing mainstream LLMs are pre-trained to predict
the next token, making direct non- or semi-autoregressive
9.4.3 Algorithm-level Optimization generation infeasible. To address this, Medusa [995] trains
In addition to system-level optimization methods, existing two additional prediction heads on the Vicuna model to
research work has proposed a series of improvements for predict the second and third tokens respectively, thereby
autoregressive inference algorithms aimed at enhancing in- achieving the generation of three tokens simultaneously.
ference efficiency. This part introduces four typical inference However, due to the decreased generation quality, these
optimization algorithms. methods have been rarely used directly in practice, but are
more often combined with other methods (e.g., speculative
Speculative Decoding. Intuitively, the generation steps in decoding) to accelerate the inference process of LLMs. For
language modeling have varied difficulty levels. For exam- instance, after Medusa generates three tokens in parallel, the
ple, predicting the next word of “The founder of Microsoft original Vicuna model would still be employed to verify the
is” may be more challenging than predicting the next word generation quality.
of “The founder of Microsoft is Bill”. Even a small model
may successfully predict the answer in this case. Based on Early Exit. It has been found that in multi-layer Transformer
this idea, speculative decoding [987, 988] has been proposed models, it may not be necessary to perform the computation
to accelerate the inference speed. Specifically, it employs a through all layers to reliably predict the next token [996].
relatively smaller yet more efficient model (such as an n- Based on this idea, several studies [996, 997] have proposed
gram statistical model or a small pre-trained model) to au- improved generation methods based on early exit. During
toregressively generate several tokens. Then, a larger model model decoding, when the conditions for early exit are
then verifies these tokens, determining whether each token satisfied, the model can directly use intermediate compu-
is the top-ranked prediction at the each generation step. The tation results from certain layers to generate tokens, thereby
small and large models iteratively repeat this process until improving the inference efficiency. To determine the exit
decoding is complete. Speculative decoding can lead to a condition, prediction confidence [997] or the entropy [996]
notable 2× to 3× speedup without compromising the gener- of the next token’s generation probability distribution can
ation quality. Researchers further suggest several variants to be used as reference measures. More recently, mixture-
improve the efficiency of this approach, such as a learning- of-depths [998] has proposed to dynamically adjust the
based method to combine several small models [989] and computation load of each layer. Similar to MoE networks,
a stage-wise acceleration which employs a more smaller the mixture-of-depths method calculates a score for each
model to accelerate the small model first [990]. layer’s input via a routing network. If the score exceeds a
preset threshold, the layer would be computed; otherwise,
Cascade Inference. Cascade inference optimizes the inference the layer would be skipped. Unlike traditional early exit
efficiency by addressing requests of varying difficulty with mechanisms that skip all subsequent layers, the mixture-
models of different scales. FrugalGPT [991] introduces a of-depths method selectively skips certain layers, which
series of models arranged by efficiency from high to low, can adaptively utilize the characteristics of different layers
sequentially processing a request through these models. A during generation.
specially trained binary classification model then evaluates
whether the generated result meets the task requirements. 9.5 Model Compression
If the result is deemed reliable, subsequent models would
Due to the huge number of model parameters, LLMs take
be bypassed, thus improving the inference speed. This
a significant memory footprint for inference, making it very
strategy can be applied to various open-source models and
costly to be deployed in real-world applications [999]. In this
commercial APIs, allowing for the flexible adjustment the
section, we focus on how to reduce the memory footprint
classification threshold to balance inference efficiency and
of LLMs via technical approaches. In particular, we will
generation quality according to specific needs. For reason-
primarily introduce the model quantization approach, and
ing tasks, researchers [992] further propose to utilize the
also briefly discuss other model compression methods, e.g.,
self-consistency [429] of generated answers to evaluate the
model pruning and distillation.
quality of the small model: the large model is employed for
generation only when the small model’s answers exhibit a 9.5.1 Quantization Methods
low consistency.
There are generally two major model quantization ap-
Non-autoregressive Decoding. Existing decoding methods proaches, namely quantization-aware training (QAT) (requir-
predominantly adopt the autoregressive mechanism, gen- ing additional full model retraining) and post-training quanti-
erating tokens one by one, which is a primary reason zation (PTQ) (requires no model retraining). Compared with
for lower inference efficiency. Therefore, non-autoregressive small-sized language models, two major differences need
decoding [993] has been proposed by generating all tokens to be considered when designing or selecting quantization
based on the input at once. However, the generation quality methods for LLMs. Firstly, LLMs consist of a huge number
of parameters, and thus PTQ methods are more preferred and can be pre-processed before model deployment. By
due to a much lower computational cost than QAT methods. identifying and preserving these salient weights, the error
Secondly, LLMs exhibit very different activation patterns associated with model quantization can be effectively re-
(i.e., large outlier features), and it becomes more difficult duced. In existing literature, various methods have been
to quantize LLMs, especially hidden activations. Next, we proposed to detect these salient weights. For instance, PB-
will briefly review several representative PTQ methods48 for LLM [1003] utilizes the magnitude of weights for finding
LLMs. critical weights, SpQR [1004] categorizes the outliers in
weights into small groups by investigating the structural
Background for Quantization. In this part, we present a patterns, APTQ [1005] employs the Hessian trace as a sen-
general introduction of quantization techniques for neu- sitivity metric, and OWQ [1006] selects the top sensitive
ral networks. In neural network compression, quantization columns based on both the Hessian matrix and weight
often refers to the mapping process from floating-point perturbations.
numbers to integers [1000], especially the 8-bit integer quan-
• Fine-grained quantization. For Transformer models,
tization (i.e., INT8 quantization). For neural network models,
weights and activations are usually represented in the
there are typically two kinds of data to be quantized, namely
form of tensors. A straightforward approach is to use
weights (model parameters) and activations (hidden activa-
coarse-grained quantization parameters for the whole ten-
tions), which are originally represented in floating-point
sor (i.e., per-tensor quantization) [1007]. However, it usu-
numbers. To illustrate the essential idea of model quan-
ally leads to inaccurate reconstruction results. Thus, fine-
tization, we introduce a simple yet popular quantization
grained methods are proposed to reduce the quantization
function: xq = R(x/S) − Z , which transforms a floating
error. ZeroQuant [1008] adopts a token-wise quantization
number x into a quantized value xq . In this function, S
approach with dynamic calibration for compressing acti-
and Z denote the scaling factor (involving two parameters
vations. Whereas for weights (easier to be quantized), it
α and β that determine the clipping range) and zero-point uses a group-wise quantization. In practice, a group size of
factor (determining symmetric or asymmetric quantization),
128 [1002, 1008] is commonly used for model quantization.
respectively, and R(·) denotes the rounding operation that
• Balancing the quantization difficulty. Considering that
maps a scaled floating value to an approximate integer.
weights are easier to be quantized than activations,
As the reverse process, dequantization recovers the original
SmoothQuant [1007] proposes to migrate the difficulty from
value from the quantized value accordingly: x̃ = S·(xq +Z).
activations to weights. Specially, they incorporate a scaling
The quantization error is calculated as the numerical differ-
transformation to balance the difficulty between weights
ence between the original value x and the recovered value
and activations in a linear layer: Y = (Xdiag(s)−1 ) ·
x̃. The range parameters α and β have a large impact on the
(diag(s)W). By introducing a mathematically equivalent
quantization performance, which often need to be calibrated
transformation, this formula controls the quantization diffi-
according to real data distributions, in either a static (offline)
culty through the scaling factor s. To set s, it incorporates
or dynamic way (runtime). For more details, we refer to the
a migration strength parameter α to balance the difficulties,
readers to the excellent survey [1000] about quantization
where each entry s_j = max(x_j)^α / max(w_j)^(1−α) is deter-
methods on neural networks.
mined by the migration strength.
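The following snippet illustrates the quantization and dequantization mappings defined in the background above, using a simple asymmetric min-max calibration in which the clipping range [α, β] is taken directly from the data; real systems typically calibrate this range more carefully.

import numpy as np

def quantize(x, num_bits=8):
    """Asymmetric min-max quantization following x_q = round(x / S) - Z."""
    alpha, beta = x.min(), x.max()           # clipping range taken from the data
    qmax = 2 ** num_bits - 1
    S = (beta - alpha) / qmax                # scaling factor
    Z = round(alpha / S)                     # zero-point factor
    xq = np.clip(np.round(x / S) - Z, 0, qmax).astype(np.uint8)
    return xq, S, Z

def dequantize(xq, S, Z):
    """Recover an approximation of the original values: x~ = S * (x_q + Z)."""
    return S * (xq.astype(np.float32) + Z)

w = np.random.randn(4, 4).astype(np.float32)
wq, S, Z = quantize(w)
print("max quantization error:", np.abs(w - dequantize(wq, S, Z)).max())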
Post-Training Quantization (PTQ). We first introduce the • Layerwise quantization. This approach finds optimal
PTQ methods for LLMs. quantized weights that minimize a layerwise reconstruction
• Mixed-precision decomposition. As found in [1001], ex- loss: arg min_Ŵ ∥WX − ŴX∥₂². To efficiently optimize
tremely large values would occur in hidden activations this objective, GPTQ [1009] improves the original opti-
(called the emergence of outliers) when the model size reaches mal brain quantization (OBQ) [1010] method by fixing the
6.7B parameters or above. These outliers significantly influ- quantization order of weights for all rows. Further, with
ence the data distribution ranges of the hidden activations, specially designed methods (i.e., lazy batch-updates and
making it challenging to conduct effective model quantiza- Cholesky reformulation), GPTQ is feasible to quantize very
tion. To reduce the quantization error, a straightforward large models (e.g., 175B OPT) in 3 or 4 bit precision. More
method is to separately process the outliers and the rest recently, AWQ [1002] further simplifies the optimization
weight values. Specifically, LLM.int8() [1001] has observed form by incorporating activation-aware scaling for weights,
that these outliers are mainly distributed in certain feature which resembles the idea of SmoothQuant [1007]: weights
dimensions at Transformer layers. Based on this finding, a corresponding to outlier activations are more important
vector-wise quantization approach is proposed to separate to be precisely quantized. It does not directly optimize
the outliers and the rest in matrix multiplication. the reconstruction loss, but instead performs simple hyper-
• Salient weights protection. For Transformer based lan- parameter search to achieve the minimal loss on calibration
guage models, there often exists a subset of weight values data.
that are more sensitive to quantization, which are also These strategies in the above methods can be jointly
referred to as salient weights [1002]. Unlike activation out- used to improve the quantization performance. In order to
liers, which occur dynamically during inference and may achieve high-efficiency implementation, quantization meth-
require complex runtime handling, weight outliers are static ods also rely on hardware- or system-level support (e.g., ef-
ficient GPU kernels or hardware-friendly group partition).
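To illustrate the difficulty-migration idea behind SmoothQuant described above, the sketch below computes per-channel scales s_j and folds them into the activations and weights of a linear layer; it only shows the equivalent transformation itself, not the subsequent low-bit quantization.

import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-channel migration scales s_j = max|x_j|^alpha / max|w_j|^(1-alpha)."""
    act_max = np.abs(X).max(axis=0)          # per input-channel activation range
    w_max = np.abs(W).max(axis=1)            # per input-channel weight range
    return act_max ** alpha / w_max ** (1 - alpha)

def migrate(X, W, alpha=0.5):
    """Apply the equivalent transformation Y = (X diag(s)^-1)(diag(s) W)."""
    s = smooth_scales(X, W, alpha)
    return X / s, W * s[:, None]             # X diag(s)^-1 and diag(s) W

X = np.random.randn(16, 8) * np.array([1, 1, 1, 1, 1, 1, 1, 50.0])  # one outlier channel
W = np.random.randn(8, 4)
Xs, Ws = migrate(X, W)
print(np.allclose(X @ W, Xs @ Ws))           # the linear output is unchanged
print(np.abs(X).max(), np.abs(Xs).max())     # activation outliers are damped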
48. Since we mainly focus on discussing quantization methods in the
context of LLMs, the line of quantization work on small-sized language Other Quantization Methods. In the above, we mainly fo-
models (e.g., BERT) has not been included in this survey. cus on PTQ methods, and next introduce two recent studies
that explore efficient fine-tuning methods or QAT methods and difficulty migration [1007], can be applied to alleviate
for quanitizing LLMs. the influence of outlier values. Since large outliers mainly
• Efficient fine-tuning enhanced quantization. For post- exist in the activations of LLMs, small language models
training quantization, direct low-bit quantization (e.g., INT4 are more resistant to activation quantization [1013, 1015].
quantization) often results in large performance degrada- In practice, high-quality INT8 activation quantization is still
tion. To overcome this challenge, QLoRA [1011] incorporates a difficult task, though several methods can attain satisfying
additional small tunable adapters (16-bit precision) into the results. Further, lower precision activation quantization has
quantized models, to achieve an efficient, high-precision still not been successfully explored, even for QAT meth-
model fine-tuning. It combines the merits of LoRA (See ods [1012].
Section 5.3.1) and quantization methods. The experiment • Efficient fine-tuning enhanced quantization is a good option
results show that 4-bit quantized models can achieve the to enhance the performance of quantized LLMs [149, 1011]. The
full 16-bit fine-tuning performance by QLoRA. benefits of efficient fine-tuning methods in quantization can
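A typical QLoRA-style setup can be assembled from the HuggingFace transformers, bitsandbytes, and peft libraries roughly as follows; the checkpoint name and LoRA hyper-parameters are illustrative, and exact argument names may differ slightly across library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # any causal LM checkpoint (example name)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the 16-bit LoRA adapters are updated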
• Quantization-aware training (QAT) for LLMs. A recent be twofold. Firstly, it can directly compensate for the per-
study [1012] explores the effect of QAT methods by applying formance degradation suffered from low-bit quantization.
a data-free distillation method to compress the weights, This can be achieved either by increasing the fitting capacity
activations as well as key-value cache. By conducting exten- via updating high precision adapters [1013, 1015, 1016],
sive experiments based on LLaMA, they show promising or by finding a proper low-rank initizalization for LoRA
results with 4-bit quantization on both weights and key- fine-tuning [1017]. Secondly, it is flexible to support task-
value cache, but not on 4-bit activation quantization, which specific or goal-specific fine-tuning of LLMs in a lightweight
still needs more exploration. way [1011], e.g., instruction tuning or chat-oriented tuning,
by only tuning the small adapters. Overall, it makes a
Empirical Analysis and Findings. Quantization has cur- good trade-off between the effectiveness and training cost,
rently become a common technique to reduce the memory which provides a promising approach to enhancing the
footprint and latency of LLMs in deployment. In particular, performance of quantized LLMs.
it is important to understand what level of precision (e.g.,
INT8 or INT4) can be applied to quantize different parts of Empirical Analysis on Quantization Experiments. To fur-
LLMs (e.g., weights or activations), while retaining a high ther help readers understand the impact of quantization on
accuracy. In this part, we first summarize the major findings LLMs, we also conduct a group of experiments to investi-
about the quantization of LLMs in existing literature, and gate the inference performance of quantized models here.
then present some empirical analysis with quantization Specifically, we focus on the fine-tuned LLaMA models (i.e.,
experiments. 7B and 13B) using popular SFT datasets, including FLAN-
• INT8 weight quantization can often yield very good results v2 [69], Alpaca-52K [187] and ShareGPT [153]. For evalua-
on LLMs, while the performance of lower precision weight quan- tion, we utilize the same tasks in Table 10, and follow the
tization depends on specific methods [1002, 1007, 1009, 1013]. In quantization settings in the study [1015] examining the per-
most cases, INT8 weight quantization can be effectively ap- formance of quantized language models at three precision
plied to reduce the memory footprint without performance levels: 4-bit, 8-bit and 16-bit. The results are summarized
degradation. While for INT4 (or INT3) weight quantiza- in Table 21. As can be observed from Table 21, the results
tion, existing methods rely on specific strategies to reduce obtained with 8-bit and 4-bit weight quantization are close
the performance degradation, e.g., layerwise method [1008, to the performance of 16-bit models while significantly
1009], activation-aware scaling [1002] and low-rank adapter reducing memory consumption. In practice, it is recom-
tuning [1011]. Interestingly, LLMs seem to be less sensitive mended to first examine the performance of 4-bit weight
to low-bit weight quantization than small-sized language quantization for LLMs if reducing memory usage is a critical
models [1013]. In practice, with the same memory cost, it consideration for deployment.
is suggested to use a larger language model with a lower 9.5.2 Other Model Compression Methods
quantization precision rather than a smaller language model
In addition to model quantization, we next introduce two
with a higher quantization precision. For example, a 4-bit
other model compression methods for LLMs, namely model
60B LLM is demonstrated to have better performance than
distillation and model pruning. Unlike model quantization,
an 8-bit 30B LLM [1014]. Moreover, focusing on emergent
model distillation and pruning aim to simplify the model
capabilities, the study [1015] finds that in-context learning,
architecture, thereby reducing the total number of parame-
step-by-step reasoning, and instruction following all seem
ters.
to be seldom affected with 4-bit weight quantization. This
result suggests that INT4 quantization exhibits a favorable Distillation for LLMs. In general, model distillation aims to
trade-off in terms of both total bits and performance of transfer the capabilities from a capable model (referred to
emergent abilities. as the teacher model) to a less capable model (referred to
• Activations are more difficult to be quantized than as the student model), thereby achieving the compression
weights [1001, 1007, 1013]. It has been found that large out- of the capable model. Based on whether the weights of
liers would occur for Transformer language models having teacher models are accessible, one can employ either the
a size of 6.7B or above [1001]. This issue has been one white-box approach or the black-box approach for LLM
of the most fundamental difficulties to quantize LLMs. To distillation. The white-box approach often employs the
overcome this issue, various methods, e.g., mixed-precision traditional knowledge distillation technique, which incor-
decomposition [1001], fine-grained quantization [766, 1001] porates additional loss functions (i.e., distillation loss) for
TABLE 21: Evaluation results for quantized LLaMA models (7B and 13B). We employ existing model checkpoints provided
by [350] for quantization experiments, which have been fine-tuned on FLAN-v2, Alpaca-52K, and ShareGPT, respectively.
Specifically, we report the performance with AlpacaFarm, MMLU, and BBH, as well as the memory usage of the loaded
model (Mem.). For quantization, we employ bitsandbytes to quantize the 16-bit models to 8/4 bits by specifying the
commands load_in_8bit and load_in_4bit when loading the weights. It is worth noting that we select text-davinci-
003 as the baseline model for the AlpacaFarm dataset.

                               16-bit                               8-bit                                4-bit
Models        SFT Dataset      AlpacaFarm  MMLU   BBH  Mem.(GiB)    AlpacaFarm  MMLU   BBH  Mem.(GiB)    AlpacaFarm  MMLU   BBH  Mem.(GiB)
LLaMA (7B) FLAN-v2 6.65 47.34 35.05 12.58 6.15 47.02 35.17 6.65 7.83 46.23 34.77 3.94
Alpaca-52K 32.55 40.87 33.66 12.58 33.60 39.98 34.38 6.65 29.57 39.24 32.80 3.94
ShareGPT 72.05 41.30 32.90 12.58 72.86 39.34 32.71 6.65 70.31 40.08 32.11 3.94
LLaMA (13B) FLAN-v2 8.14 51.67 41.46 24.40 7.64 51.02 41.25 12.53 7.52 50.48 40.68 7.34
Alpaca-52K 33.60 47.63 36.10 24.40 31.43 47.04 35.98 12.53 30.87 46.20 36.16 7.34
ShareGPT 75.59 47.58 38.00 24.40 73.79 47.71 38.31 12.53 71.99 45.77 36.97 7.34
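For reference, quantized models like those evaluated in Table 21 can be loaded in a few lines with bitsandbytes through the transformers interface, as sketched below; the checkpoint path is a placeholder, and newer library versions expose the same options through a quantization_config object instead of the load_in_8bit/load_in_4bit flags.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-7b-sft-checkpoint"   # placeholder for a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 16-bit baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
# 8-bit and 4-bit variants, quantized on the fly by bitsandbytes at load time
model_int8 = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
model_int4 = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

print(model_int8.get_memory_footprint() / 2**30, "GiB")  # roughly matches the Mem. column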

aligning the outputs or intermediate states of the student optimizers [1023]. It focuses on the quantization of both
model to those of the teacher model. Based on this ap- activations and weights for LLMs, including the support on
proach, MINILLM [1018] effectively distills the 13B LLaMA 8-bit and 4-bit (NF4,FP4) matrix multiplication for efficient
model down to a 7B model. The black-box approach [1019], inference, as well as an 8-bit optimizer for efficient training.
on the other hand, can only make use of the textual re- • GPTQ-for-LLaMA50 is developed specially for quantiz-
sponse of the teacher model for training the student model. ing LLaMA models. It enables 4-bit quantization of LLaMA
These studies mainly focus on utilizing the generated re- models of varied sizes based on the GPTQ algorithm [1009].
sponses for enhancing the key capabilities from the teacher Also, it provides a comparison with bitsandbytes in both
model [146, 384], such as in-context learning and chain-of- memory and performance (PPL) on the project website.
thought prompting. • AutoGPTQ51 is a quantization package developed
based on the GPTQ algorithm [1009], which supports INT4
Pruning for LLMs. The goal of model pruning is to min-
quantization for LLMs. It includes a number of quantized
imize the number of parameters in a model while pre-
models in the library, and supports LoRA by integrating
serving its performance as much as possible. In general,
with HuggingFace PEFT library.
model pruning methods can be categorized into two lines:
structured pruning and unstructured pruning. Structured
• llama.cpp52 makes it feasible to run quantized LLaMA
models on a MacBook device. It supports INT4, INT5 and
pruning aims to remove certain less important model com-
INT8 quantization, which is developed in efficient C/C++
ponents (e.g., neurons, channels, layers) that have minimal
implementation. It also supports a number of LLaMA based
impact on performance. On the other hand, unstructured
models, such as Alpaca and Vicuna.
pruning mainly focuses on removing individual weights or
connections within a neural network model without chang- Other Libraries. In addition, there are also libraries for
ing the model’s main structure. As for LLMs, unstructured supporting other model compression methods.
pruning can generally lead to higher compression rates. • Torch-Pruning 53 is a toolkit developed for general-
For instance, SparseGPT [1020] achieves 60% unstructured purpose structural pruning, including the pruning for vision
sparsity for OPT-175B using unstructured pruning (i.e., models, diffusion models and large language models. It em-
60% of the elements in the weights are masked), and the ploys dependency graph for automatic structural pruning
pruned LLM still retains a relatively low perplexity. With and supports several high-level pruners (e.g., MetaPruner
suitable strategies, structured pruning for LLMs can also and BNScalePruner).
achieve a promising model compression rate. For instance, • LLM-Pruner54 is designed specifically for the pruning
LLM-pruner [1021] selectively removes 20% of the non- of LLMs. It enables efficient gradient-based structural prun-
essential parameters from LLaMA (7B) based on gradient ing for LLMs with minimal training samples and training
information, while maintaining 93.6% performance of the time. Currently, it supports a number of LLMs, such as
original model. Furthermore, Sheared LLaMA [1022] in- Baichuan, BLOOM, and LLaMA3.
troduces two techniques: targeted structured pruning and
dynamic batch loading, which effectively prunes LLaMA-
2 (7B) to a parameter size of 2.7B, while preserving 87.8% of 9.6 Retrieval-Augmented Generation
the original model’s performance. When dealing with real-time information or specialized
9.5.3 Open-source Libraries domain knowledge, LLMs may struggle to generate ac-
In this part, we briefly introduce the available open-source curate outputs solely based on their internal knowledge.
libraries for memory-efficient deployment. To address this issue, retrieval-augmented generation (RAG)
technique [1024, 1025] has been proposed by incorporating
Quantization Libraries. Next, we introduce three popular
quantization libraries for LLMs, including: 50. https://github.com/qwopqwop200/GPTQ-for-LLaMa
• Bitsandbytes49 is developed based on the methods 51. https://github.com/PanQiWei/AutoGPTQ
introduced in the papers of LLM.int8() [1001] and 8-bit 52. https://github.com/ggerganov/llama.cpp
53. https://github.com/VainF/Torch-Pruning
49. https://github.com/TimDettmers/bitsandbytes 54. https://github.com/horseee/LLM-Pruner
external knowledge source for improving the model re- determine whether the current task requires retrieval or the
sponse. This technique aims to retrieve relevant information use of retrieved content [662].
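Putting the three steps of the basic workflow together, a bare-bones RAG pipeline might look as follows; the hashed bag-of-words embedding is only a stand-in for a real dense retriever, and the constructed prompt would then be passed to an LLM for response generation.

import numpy as np

def embed(texts):
    """Stand-in embedding function; a real system would use a dense retriever
    (e.g., a sentence-embedding model) instead of hashed bag-of-words."""
    vecs = np.zeros((len(texts), 512))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % 512] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def retrieve(query, docs, doc_vecs, k=2):
    scores = doc_vecs @ embed([query])[0]          # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query, contexts):
    ctx = "\n".join(f"[{i+1}] {c}" for i, c in enumerate(contexts))
    return ("Please refer to the information contained in the following "
            f"documents to complete the task.\n{ctx}\n\nQuestion: {query}\nAnswer:")

docs = ["LLaMA-2 was released by Meta in July 2023.",
        "GPT-4 Turbo supports a 128K-token context window.",
        "The capital of France is Paris."]
doc_vecs = embed(docs)
contexts = retrieve("When was LLaMA-2 released?", docs, doc_vecs)
print(build_prompt("When was LLaMA-2 released?", contexts))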
from external sources (e.g., the internet or domain-specific
knowledge bases) using an information retrieval system, Improvement Strategies. In practice, factors such as the
thereby providing LLMs with timely or domain-relevant quality of retrieved documents, prompt design, and the
context to reduce the factual errors in generated content. generation method of LLMs might impact the final per-
In the format, RAG can also be considered as a specific formance of RAG. Next, we discuss how to enhance the
prompting strategy that integrates auxiliary information RAG performance by summarizing existing improvement
from external sources into the original prompt. Next, we will strategies.
introduce the basic workflow of the retrieval-augmented • Retrieval method improvement. The incorporation of
generation technique and related optimization strategies. retrieval supplements the LLM with relevant contextual
information, and the retrieval performance directly affects
Basic Workflow. Typically, the standard RAG procedure the quality of the final generated response [454]. To design
consists of three steps, including context retrieval, prompt effective retrieval strategy, an important factor to consider
construction, and response generation. is the text granularity. Intuitively, a coarser granularity (e.g.,
• Context Retrieval. The retrieval step primarily focuses document-level) may result in efficient retrieval but tend to
on finding relevant context information from existing infor- incorporate substantial irrelevant information, while a finer
mation sources that are helpful for addressing the current granularity (e.g., sentence-level) increases the proportion of
information need. To achieve efficient retrieval, it is often relevant content in the retrieval results but can lead to higher
necessary to build a search index over the collection of can- retrieval latency. To balance relevance and latency, existing
didate documents and then use appropriate methodologies research work proposes using “propositions” as the retrieval
for text retrieval. There are two commonly used retrieval ap- unit [1031], corresponding to semantically complete and
proaches: lexical-based retrieval [1026] using sparse vector relatively independent text fragments, which can effectively
representations and semantic retrieval methods using dense reduce the recall of irrelevant information. In particular, they
vector representations [54]. The former tokenizes the docu- mainly use GPT-4 to synthesize instruction data for the ex-
ments and building an inverted index based on a vocabu- traction of proposition text, training a smaller model specifi-
lary, followed by retrieving relevant documents using lexical cally to construct proposition text data [1031]. Furthermore,
matching. The latter maps documents to low-dimensional to improve retrieval performance, methods such as query
dense vectors and then constructs an efficient index of doc- expansion and query rewriting can be utilized to optimize
ument vectors using approximate nearest neighbor search query formulation. Query expansion focuses on adding
algorithms, ranking candidate documents based on the sim- supplementary information to the original query, such as
ilarity of embeddings. Both methods can often perform well incorporating related entity information or providing de-
for large-scale document collection, which are widely used tailed explanations of key information in the query [796],
in existing RAG systems. which helps strengthen relevance matching. However, tra-
• Prompt Construction. After the retrieval stage returns ditional query expansion methods may disrupt the original
the relevant documents, these documents need to be incor- semantics for complex queries. To address this issue, we can
porated into the input prompt of the LLM along with the employ LLMs to decompose complex queries into several
task description. The prompt should guide the model to uti- sub-queries, which are subsequently expanded individually,
lize the retrieved information to complete the corresponding allowing for multi-path recall of related information [1032].
task. For example, a prompt could be, “Please refer to the As another query enhancement technique, query rewriting
information contained in the following documents to complete the focuses on modifying the query content to highlight key
task”. Since the retrieved documents are typically lengthy, information and eliminate potential ambiguities, facilitating
simply concatenating them into the prompt might lead the retrieval of related documents [1033]. LLMs can be ap-
to a poor utilization of the provided context due to the plied directly to query rewriting, transforming the original
biased attention (e.g., lost in the middle [949]). To address query into a more suitable form through well-designed
this issue, existing approaches often introduce reranking prompts [1034]. To reduce inference overhead, the query
models to select the most relevant documents from the optimization capabilities of LLMs can also be transferred
retrieval results [1027]. Alternatively, information extraction to smaller models through knowledge distillation [1035].
or text compression techniques can be used to retain only the • Retrieval results refinement. In addition to the initial
highly relevant information from the documents, thereby retrieval methods, the refinement of retrieval results also
reducing the input context length [1028, 1029]. plays an important role in RAG systems, since the retrieved
• Response Generation. In this step, the constructed documents may be not best suited for RAG systems, e.g.,
prompt is input into the LLM, enabling it to utilize the re- LLMs might have difficulty in utilizing long contexts or
trieved content to better accomplish the corresponding task. be affected by irrelevant information in the retrieved docu-
However, the retrieved documents may contain irrelevant ments. As a solution, the documents returned during the re-
information or even contradictory information to the true trieval stage can be reranked according to their relevance to
answer, which might affect the output generated by the the input [1036], filtering out low-quality or irrelevant doc-
LLM. To address this, the LLM can be further prompted uments or placing less relevant documents in non-optimal
to self-check the quality of the generated output and decide positions within the prompt. Furthermore, both generation
whether to re-perform the retrieval based on the new out- and reranking tasks [1027] can be jointly optimized to facil-
puts [1030], or it can perform a confidence assessment to itate better utilization of context documents. Additionally, LLMs
can be directly used for document re-ranking by designing reconstruct the remaining content of the original document
specific prompts or using context examples to accomplish based on the retrieval results [1043].
this task [777]. In addition to document filtering or rerank-
ing, information extraction or automatic summarization
9.7 Hallucination
techniques can be employed to refine the retrieved content
by extracting more concise and query-relevant content from Hallucination, which refers to the phenomenon that LLMs
the retrieved documents. Furthermore, existing research has generate content inconsistent with factual information, has
proposed token-level compression strategies [1037], which become a significant issue that greatly affects the task
select important tokens and remove unimportant parts from performance of LLMs [1044]. In this section, we focus on
the candidate documents. discussing the topic of LLM hallucination, first introducing
• Iterative retrieval enhancement. In some complex appli- the definition and source of hallucination and then summa-
cation scenarios, a single retrieval procedure may not suffice rizing the detection and mitigation methods.
for RAG systems. To address this issue, we can further use
iterative retrieval augmentation and adaptive retrieval aug- 9.7.1 Definition of Hallucination
mentation. Iterative retrieval augmentation aims to itera- Early research typically defines hallucinations based on
tively refine the initial query based on the model’s generated the relationship between a model’s output and the given
results to achieve a comprehensive coverage of the required input [1045]. In this manner, hallucinations are categorized
information. As it involves accumulating multiple rounds into intrinsic hallucinations where the model’s output does
of retrieval information, the performance of RAG systems not match the input text and extrinsic hallucinations where
may be affected by redundant or conflicting information. To the model’s output cannot be verified against the input.
address this issue, stop mechanism has been introduced for However, in real-world scenarios, user inputs often do not
retrieval iteration, using the LLM to evaluate the confidence contain reference documents, and thus existing work mainly
of the current generation results to determine whether to focuses on open-domain factual hallucinations, where the
continue the iteration process [662]. Additionally, for more model-generated content does not align with or cannot be
complex scenarios, iterative retrieval can be combined with verified by existing world knowledge [1044, 1046]. Accord-
the LLM’s own CoT reasoning capability. For example, ing to a recent study [1044], factual hallucinations can be
intermediate results from the chain of thought can be used further categorized into the following types:
as the query input for the next round of retrieval, and after • Entity-error hallucination. This type of hallucination
completing the retrieval process, the returned results can refers to LLMs generating text containing incorrect entities,
be integrated into the chain of thought. Building on the such as names of people, dates, locations, or objects that
iterative retrieval augmentation method, adaptive retrieval contradict world knowledge.
augmentation further enhances the LLM’s autonomous use • Relation-error hallucination. This type of hallucination
of the retrieval mechanism [1038], thereby improving the involves LLMs generating incorrect relationships between
overall framework’s efficacy in using the retrieval systems. entities, such as inaccurate quantitative or chronological
In practical implementation, for the above two types of aug- connections.
mentation methods, LLM first need to determine when to • Incompleteness hallucination. LLMs may produce incom-
use the retriever and then utilize pre-set prompts to initiate plete outputs, especially when generating lengthy or list-
query generation and retrieval result processing [1039]. based responses. This hallucination arises when LLMs are
• RAG-enhanced training. In addition to the improvement asked about aggregated facts and they fail to reserve the
strategies mentioned above, specialized training tasks can factual completeness.
be designed to further enhance the LLM’s ability to utilize • Outdatedness hallucination. This type of hallucination
the retrieved content, including both instruction tuning and occurs when LLMs generate information that was accurate
pre-training tasks. By constructing instruction data focused at a past time but is no longer correct at present. This issue
on retrieval context utilization [1040], instruction tuning typically arises due to that most LLMs were trained on time-
can improve the LLM’s ability to utilize relevant retrieval limited corpora.
information. When curating the instruction data, it is essen- • Overclaim hallucination. This type of hallucination refers
tial to consider two important issues: positional bias and to cases where the statement expressed in the generated text
irrelevant information within the input context. Specifically, of LLMs is beyond the scale of factual knowledge.
relevant documents can be placed at different positions • Unverifiability hallucination. This hallucination refers
within the prompt, which can enhance the model’s attention to cases where the information produced by LLMs cannot
to relevant content in various positions and prevent the be verified against existing information sources, making it
model from neglecting certain positions [949]. Additionally, difficult to assess its accuracy.
irrelevant information can be added to the instructions data,
so as to improving the model’s ability to resist interference 9.7.2 Source of Hallucination
from such information [1041]. In addition, special training In this part, we will discuss the potential factors that might
tasks can be introduced during the pre-training stage to lead to hallucination for LLMs.
further enhance the LLM’s retrieval and generation capa-
bilities [657, 1042]. Existing work mainly constructs unsu- Training Data. The quality of training data significantly
pervised pre-training data aimed at retrieval augmentation. impacts the model’s output and is a primary source of
A common data construction method uses portions of the hallucinations. Further, the distribution of training data also
original document as queries and then trains the model to plays a key role in shaping the behaviors of LLMs. We next
introduce the effect of training data on hallucinations from data may contain hallucinated content, which might lead
these two aspects. to more hallucinations for the trained model. Addition-
• Data quality. In practice, the pre-training dataset is ally, during the human alignment process, existing training
typically constructed by collecting diverse data from various methods may also cause hallucination issues. Some research
sources. While increasing pre-training data can lead to im- work has revealed that LLMs may cater to human responses
proved model performance, low-quality data can severely for earning higher rewards, likely resulting in answers that
damage the generation performance of large models. On do not align with factual knowledge [1048].
the one hand, pre-training data may contain erroneous
Response Generation. Given the input prompt, LLMs
information, and the goal of training large models is to
employ decoding strategies (e.g., top-k sampling in Sec-
imitate and memorize the training data as much as possible. If inac-
tion 4.2.4) for generating the response. In this process, the
curate information frequently appears in the training data,
prompt formulation and the decoding strategies potentially
the model may memorize and directly copy this content
affect the generation behaviors of LLMs.
during generation, leading to the phenomenon known as
“imitative falsehoods” [558]. On the other hand, pre-training • Prompt design. Prompting has become the primary
data may contain biased content and the subjective views way for using LLMs to solve downstream tasks. However,
of its creators. Such biased content can severely affect the inappropriate prompt design can cause the model to over-
model’s learning of world knowledge, possibly leading to look or misunderstand important information, leading to
inappropriate representations. incorrect or irrelevant content [1044]. Recent studies have
shown that the readability, format, and concreteness of user
• Data distribution. The distribution of pre-training data
instructions would impact the model’s output [1049]. For
also significantly affects the model’s behavior. Firstly, re-
instance, the use of complex words or long phrases in the
garding the recency factor, LLMs are typically trained on
prompt reduces the readability, which makes LLMs more
data from a limited period. As world knowledge continu-
difficult to understand the real intentions of user instruction,
ously evolves, the model’s stored knowledge can become
thereby increasing the chance of hallucination. Additionally,
outdated, thereby likely leading to fabrications or outdated
non-standard expressions or abstract concepts can also ex-
information when addressing questions beyond its knowl-
acerbate hallucinations.
edge scope. In terms of data composition, pre-training
data may lack domain-specific knowledge, which would • Decoding strategy. To improve the diversity of the
affect the model performance on tasks requiring specialized generated content, multiple random sampling strategies are
knowledge, such as medical or legal issues, and it will introduced (e.g., beam search, top-p sampling). However,
also result in significant hallucinations. Additionally, recent increasing diversity also brings a higher likelihood of gen-
studies show that when addressing questions involving erating hallucinated content. For example, increasing the
long-tail knowledge that appears infrequently in the train- temperature t (Equation 10) will result in a more uniform to-
ing corpus, models are more likely to generate inaccurate ken probability distribution, which potentially leads to more
content [1044]. hallucinations, since low-frequency yet irrelevant words
would be assigned a higher probability for generation in
Training Methods. The training process of LLMs typically this setting.
includes two major stages: pre-training and post-training.
Inappropriate training methods across the two stages are 9.7.3 Hallucination Detection
also likely to result in the hallucination behaviors of LLMs.
• Pre-training. Currently, the pre-training stage primar- To effectively detect the hallucinated content, existing work
ily employs the next token prediction method for model mainly adopts three approaches, namely model-based,
training. Recent studies [949] indicate that under the au- uncertainty-based and tool-based methods.
toregressive training method, the model’s attention distri-
Model-based Methods. Due to the powerful language ca-
bution tends to decay as the sequence length increases. This
pabilities and rich world knowledge, existing work exten-
would prevent LLMs from effectively modeling long-range
sively adopts powerful LLMs to detect hallucinations from
dependencies, potentially resulting in inference errors or
the model-generated text. In this approach, hallucination
hallucinations. Additionally, the teacher-forcing strategy is
detection can be considered as a normal text task that re-
commonly used during the training of large models. In this
quires prompt formulation. To facilitate the research in this
approach, the correct tokens from the previous steps are
line, HaluEval [604] introduces a comprehensive dataset of
used to predict the next token instead of the model output.
model-generated and human-annotated hallucinated sam-
However, during model inference, the model can only use
ples to evaluate how well LLMs can identify such instances,
its own generated content for subsequent predictions. This
and they empirically show specific prompting strategies
discrepancy between the training and generation phases
such as CoT can effectively improve the model’s accuracy
leads to “exposure bias” [1047], which may in turn cause
in detecting hallucinations. Furthermore, research work pro-
hallucination issues.
poses to decompose the hallucination detection into two
• Post-training. During the instruction-tuning process, subtasks: first, extract factual statements, and then assess
existing works typically employ knowledge distillation to whether each statement is hallucinated or not [1044, 1050].
improve the model’s instruction-following ability. This in-
volves using high-performance models (such as GPT-4) to Uncertainty-based Methods. Recent studies suggest that
generate large-scale instruction data and then fine-tuning the occurrence of hallucinations in LLMs may be related
weaker models with this data. However, these synthesized to the uncertainty of their outputs [1051]. Based on such
assumptions, a series of works propose detecting hallucina- further expands the knowledge source to local databases,
tions by assessing the uncertainty of model-generated con- devising an agent framework to retrieve, consolidate, and
tent. Some research work focuses on the internal features of generate feedback to the LLM for the final answer. Other
LLMs, such as token probability and logits. For key concepts research explores placing the retrieval process at different
in the generated text, a lower token probability indicates a positions relative to the generation process. Verify-and-
higher uncertainty, which represents an increased likelihood Edit [1060] proposes to perform the retrieval procedure
of hallucination [1052]. Other research efforts evaluate the after the generation process, allowing the original answer to
uncertainty by examining the consistency of the models’ be edited based on the retrieved documents. Furthermore,
responses. For instance, SelfCheckGPT [1051] lets LLMs an- to help LLMs better handle complex tasks, IRCoT [1061]
swer the same questions multiple times to judge whether the interleaves the knowledge retrieval process with CoT gen-
generated answers are consistent or not. Another alternative eration, where the retrieved documents guide the LLM in
way requires LLMs to reconstruct the input questions based generating additional reasoning steps and CoT sentences
on the responses and then check the consistency between assist in retrieving more relevant and diverse documents.
the generated and original questions [1053].
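A simplified, exact-match version of this consistency check is sketched below; SelfCheckGPT itself compares sampled passages at the sentence level with learned similarity measures, so the majority-vote criterion and the threshold here are assumptions for illustration only.

import random
from collections import Counter

def sample_answers(llm, question, n=6):
    """Draw n stochastic answers from the model (the `llm` callable is assumed
    to sample with a non-zero temperature)."""
    return [llm(question) for _ in range(n)]

def consistency_score(answers):
    """Fraction of samples agreeing with the majority answer; a low score
    signals high uncertainty and thus a higher risk of hallucination."""
    top, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def flag_hallucination(llm, question, threshold=0.5):
    return consistency_score(sample_answers(llm, question)) < threshold

# toy check with a "model" that answers inconsistently
random.seed(0)
toy_llm = lambda q: random.choice(["1947", "1950", "1947", "unknown"])
print(flag_hallucination(toy_llm, "When was X founded?"))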
Improved Decoding Strategy. In addition to the above
Tool-based Methods. LLMs can detect hallucinations by methods, hallucinations can also be mitigated by using im-
calling external tools to verify the model-generated content. proved decoding strategies. Typically, the internal states or
Typically, the model’s output contains various segments of knowledge of LLMs themselves can be exploited to reduce
factual knowledge, which can be broken down into fine- the hallucinations. DoLa [317] proposes that the lower layers
grained factual statements. FActScore [1054] refers to knowl- of LLMs tend to assign higher probabilities to syntactically
edge sources like search engines to verify these statements. plausible words, while higher layers encode more factual
FacTool [1055] further proposes to use a series of external knowledge. Therefore, DoLa devises a contrastive decod-
verification tools such as calculators and code interpreters to ing strategy by subtracting the lower logits from the last
check different types of text. In addition, HaluAgent [1056] layer’s logits and using the results for next-token predic-
proposes an agent framework to employ smaller open- tion. ITI [1062] finds that specific attention heads show
source models for hallucination detection. With the assis- high linear probing accuracy and regards their activation as
tance of tools like search engines and calculators, HaluAgent truth-correlated directions. During inference, certain heads’
enables 7B-size models to achieve comparable performance activations would be shifted along these pivot directions.
as GPT-4 in hallucination detection. Some other work introduces external knowledge sources
to aid the decoding process. CAD [1063] provides LLMs
9.7.4 Hallucination Mitigation with extra context about the query, and then contrasts the
In practice, it is essential to effectively mitigate the halluci- output probabilities by those without using context, thereby
nation behaviors of LLMs, to provide accurate and helpful adjusting the influence of the model’s prior knowledge.
responses. In this part, we will introduce several widely- KCTS [878] applies an auxiliary knowledge classifier on top
used approaches for alleviating the hallucination, including of the LLM to detect hallucinations, and uses its knowledge
human alignment, retrieval-augmented generation and im- faithfulness score to reweight the token distribution.
proved decoding strategy.
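The layer-contrastive idea behind DoLa, described in the improved decoding strategy paragraph above, can be sketched as follows, assuming logits from the last layer and from one selected lower (premature) layer are available; the actual method selects the premature layer dynamically and integrates this scoring into the decoding loop.

import torch
import torch.nn.functional as F

def dola_next_token(final_logits, premature_logits, alpha=0.1):
    """Contrast the mature (last-layer) distribution with a premature-layer
    distribution: keep tokens that are plausible under the last layer and
    score them by the log-ratio of the two distributions."""
    log_p_final = F.log_softmax(final_logits, dim=-1)
    log_p_early = F.log_softmax(premature_logits, dim=-1)
    # plausibility constraint: only consider sufficiently likely tokens
    keep = log_p_final >= (log_p_final.max() + torch.log(torch.tensor(alpha)))
    contrast = torch.where(keep, log_p_final - log_p_early,
                           torch.full_like(log_p_final, float("-inf")))
    return int(torch.argmax(contrast))

final = torch.randn(32000)   # logits from the last layer
early = torch.randn(32000)   # logits from a selected lower layer (via the LM head)
print(dola_next_token(final, early))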
Human Alignment. Hallucination mitigation is closely re- 10 CONCLUSION AND FUTURE DIRECTIONS
lated to the honest criterion in “3H” standards for human In this survey, we have reviewed the recent progress of large
alignment, and various alignment methods like RLHF can language models (LLMs), and introduced the key concepts,
be adopted to mitigate the model hallucination. HaluEval findings, and techniques for understanding and utilizing
2.0 [1044] proposes to first collect hallucinated and non- LLMs. We focus on the large-sized models (i.e., having a size
hallucinated responses to train a reward model, and then larger than 10B) while excluding the contents of early pre-
fine-tune the LLM with the reward model’s feedback us- trained language models (e.g., BERT and GPT-2) that have
ing RL algorithms. However, recent research shows that been well covered in the existing literature. In particular,
human preference data may lead LLMs to exhibit syco- our survey has discussed four important aspects of LLMs,
phantic behavior [1057], where models prioritize catering i.e., pre-training, adaptation, utilization, and evaluation. For
to human demands over maintaining truthfulness. Some each aspect, we highlight the techniques or findings that are
work proposes to refine the annotation process of preference key to the success of LLMs. Furthermore, we also summa-
data, such as by aggregating multiple human preferences rize the available resources for developing LLMs and dis-
to improve feedback quality [1057] or fine-tuning LLMs on cuss important implementation guidelines for reproducing
prompts where the truthfulness of a claim is independent of LLMs. This survey tries to cover the most recent literature
the user’s opinion [1058]. about LLMs and provides a good reference resource on this
topic for both researchers and engineers.
Retrieval-Augmented Generation. Providing LLMs with
Next, we summarize the discussions of this survey, and
highly reliable external knowledge as context can help re-
introduce the challenges and future directions for LLMs, in
duce hallucinations. RARR [1059] first generates multiple
the following aspects.
questions about the generated text, then retrieves web pages
from Google Search as evidence, and finally, an editing Basics and Principles. Instead of training on specific task
model is employed if any disagreement is detected between goals, LLMs learn from unsupervised pre-training on large-
the evidence and the generated text. LLM-Augmenter [661] scale text data. This is quite different from previous multi-
task learning approaches, which aim to extend the training cluster. In practice, it is very challenging to pre-train capable
tasks as much as possible to achieve sufficient generalization. Thus, LLMs, due to the huge compute consumption and the
it is essential to reveal the basic principles or elements that sensitivity to data quality and training tricks [78, 93]. Thus,
establish the foundation of the abilities of LLMs. Although it becomes particularly important to develop systemic, eco-
the basic idea of language models is intuitive, it is still chal- nomical pre-training approaches for optimizing LLMs, e.g.,
lenging to formally explain why LLMs trained by simple predictable scaling [46] and proxy model training [59]. More
language modeling objectives (e.g., next token prediction) training recipes or principles should be investigated and
can become capable of solving various real-world tasks. shared to reduce the potential risk of degradation or failure
To investigate this problem, a promising approach is to in large-scale model optimization. Although increasingly
study the capacity learning (or selection) mechanism based more model checkpoints and cleaned datasets have been
on unsupervised pre-training, since the model capacity of released, there still lacks reproducible work on pre-training
LLMs strongly depends on pre-training data. In addition, data preparation (e.g., detailed cleaning strategies) and data
scaling plays an important role in improving the capacity scheduling (e.g., data mixture and curriculum). Since it is
of LLMs [31, 55, 64], and it is very useful to conduct more very costly to pre-train a LLM from scratch, it is important
theoretical analysis about how the behaviors of large models to design suitable mechanisms for continually pre-training
relate to those of small models, e.g., what behaviors of large or fine-tuning the LLM based on publicly available model
models can be inferred from small models and what can’t be checkpoints (e.g., LLaMA [57] and Flan-T5 [69]). For this
predicted indeed. Another research direction is to explore purpose, a number of technical issues have to be resolved,
more deep analysis on model generalization for LLMs, e.g., catastrophic forgetting and task specialization. Further-
since increasing concerns have been raised about whether more, it is also useful to develop effective tuning strategies
LLMs can generalize beyond the knowledge encoded by that effectively inject or edit specific knowledge [674], e.g.,
pre-training data. Furthermore, data contamination has be- correcting the outdated facts.
come a severe issue for fairly assessing the performance of
LLMs [740], and thus setting appropriate evaluation proto- Model Utilization. Based on the natural language inter-
col will be the basis to investigate and analyze the model face, prompting has become the prominent approach for
capacity of LLMs. using LLMs to solving various tasks. By combining task
descriptions and demonstration examples into prompts, in-
Model Architecture. Due to the scalability and effective- context learning (ICL) endows LLMs with the ability to
ness, Transformer has become the de facto architecture perform well on new tasks, even outperforming full-data
for building LLMs. Various strategies have been proposed fine-tuned models in some cases. To enhance the ability of
to improve the performance of this architecture, such as complex reasoning, advanced prompting techniques have
neural network configuration and scalable parallel training been proposed, exemplified by the chain-of-thought (CoT)
(see discussions in Section 4.2.2). However, Transformer strategy, which includes the intermediate reasoning steps
still suffers from high training costs and slow inference into prompts. Furthermore, planning is a promising ap-
rates. More efforts [270, 271] are still in need to develop proach for solving complex tasks, which iteratively invokes
improved model architectures for large-scale pre-training. LLMs by leveraging tool use capacities. Despite these ef-
Specially, system-level or hardware-level optimization (e.g., forts, several basic problems related to prompting are still
FlashAttention [303]) is worth more exploration to improve under-explored: why a good prompt can elicit the correct
the efficiency of Transformer architectures. In addition, as an answer but a bad prompt cannot, how to reveal the working
important basic capacity, existing LLMs typically maintain principles of advanced prompting methods (e.g., ICL and
a long context window. For example, the most recent GPT-4 CoT) and further improve these existing approaches, and
Turbo enables a long context of 128K tokens, and Claude how to efficiently find the effective prompts for LLMs on
2.1 also supports the input up to 200K tokens. Although specific tasks. Furthermore, from a practical perspective, it
many efforts have been made to enhance the long context has become a fundamental challenge to reduce the inference
modeling ability of LLMs [283, 943], the resulting mod- cost of LLMs, especially in large-scale deployment. Another
els still can’t well process the information in the context popular research direction is retrieval-augmented gener-
window [949]. To address this issue, specific architecture ation, where retrieved contexts from supporting sources
adaptations or algorithms might be needed to enhance the are included into prompts for task solving. It has been
modeling and utilization of long context information. An- shown that retrieval augmentation can extend the knowl-
other worrying concern is that existing work mostly focuses edge boundary and improve the question answering ca-
on training LLMs with decoder-only Transformers. Despite pacity [454], but may suffer from the effectiveness of long
the effectiveness, it severely limits the more wide, diverse context utilization by LLMs [949].
explorations on alternative model architectures.
Safety and Alignment. Despite the capacities, LLMs are
Model Training. For pre-training, it is essential to establish faced with great safety challenges in practical use. As a
a data-centric infrastructure and training procedure for LLM fundamental issue of probabilistic modeling nature, LLMs
optimization, which can effectively support a systematic exhibit a tendency to generate hallucinations [640], refer-
process of data collection, data cleaning, data mixture, and ring to texts that seem plausible but may be factually
data curriculum. Furthermore, it also calls for more flexible incorrect [46]. What is worse, LLMs might be elicited by
mechanisms of hardware support or resource schedule, so intentional instructions to produce harmful, biased, or toxic
as to better organize and utilize the resources in a computing texts for malicious systems, leading to the potential risks
of misuse [55, 66]. To have a detailed discussion of the first draft was finished on March 13, 2023, in which our
safety issues of LLMs (e.g., privacy, overreliance, disinfor- team members tried their best to include the related stud-
mation, and influence operations), the readers can refer to ies about LLMs in a relatively objective, comprehensive
the GPT-3/4 technical reports [46, 55]. As the major tech- way. Then, we have extensively revised the writing and
nical approach to averting these issues, alignment methods contents in several passes. Due to the space limit, we can
(e.g., RLHF) [66, 116] have been widely used by leveraging only include a fraction of existing LLMs in Figure 3 and
human feedback for developing well-aligned LLMs. How- Table 1, by setting the selection criterion. However, we set
ever, RLHF heavily relies on high-quality human feedback a more relaxed criterion for model selection on our GitHub
data from professional labelers, which is costly and time- page (https://github.com/RUCAIBox/LLMSurvey), which
consuming to recruit qualified human annotators. There- will be regularly maintained. We release the initial version
fore, it is necessary to improve the RLHF framework for on March 31, 2023, the major revision on June 29, 2023,
reducing the efforts of human labelers and seek a more and second version on September 10, 2023, and this latest
efficient annotation approach with guaranteed data quality, version (major revision) on November 23, 2023.
e.g., LLMs can be employed to assist the labeling work.
Furthermore, it is also suggested to develop simplified Seeking for Advice. Despite all our efforts, this survey
optimization algorithms for alignment [388, 391], to reduce is still far from perfect: we are likely to miss important
the training difficulty and instability of RLHF. As another references or topics, and might also have non-rigorous
practical approach, red teaming [132, 367] has been adopted expressions or discussions. We will continuously update
for improving the model safety of LLMs, which utilizes this survey, and improve the quality as much as we can.
the collected adversarial prompts to refine the LLMs (i.e., For us, survey writing is also a learning process for LLMs
avoiding the attacks from red teaming). In addition, privacy by ourselves. For readers with constructive suggestions to
concerns are also important to consider when fine-tuning improve this survey, you are welcome to leave comments on
LLMs with domain-specific data, and thus federated or suggestions in a future version, and acknowledge the
learning [1064] can be useful in privacy-restricted scenarios. We will make revisions following the received comments
or suggestions in a future version, and acknowledge the
Application and Ecosystem. As LLMs have shown strong capacities in solving various tasks, they can be applied in a broad range of real-world applications (i.e., following task-specific natural language instructions). As a remarkable advance, ChatGPT has potentially changed the way humans access information, and it has also been integrated into the release of New Bing. In the near future, it can be foreseen that LLMs will have a significant impact on information-seeking techniques, including both search engines and recommender systems. Furthermore, LLMs make it possible to develop more intelligent systems (e.g., autonomous AI agents) to tackle various complex tasks in real-world scenarios. Specifically, the Assistants API has been launched by OpenAI (featuring instructions, knowledge, and tool use), enabling rapid development of agent-like assistants within applications. This wave of technical innovation is leading to an ecosystem of LLM-empowered applications (e.g., OpenAI's GPT Store) that is closely connected with human life. Lastly, the rise of LLMs sheds light on the exploration of artificial general intelligence (AGI). It is promising to develop smarter AI systems than ever before. However, in this development process, AI safety should remain one of the primary concerns, i.e., making AI bring good to humanity rather than harm [40].
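To illustrate the agent-like assistants mentioned above, the snippet below sketches how an assistant with tool use might be created and queried through the beta Assistants interface of the official openai Python SDK (v1.x). It is a minimal sketch under stated assumptions: the model name, the tool choice, and the simple polling loop are illustrative, and the exact method names can differ across SDK versions.

```python
# Minimal sketch of an agent-like assistant via OpenAI's beta Assistants interface.
# Assumes the `openai` Python SDK (v1.x) and an OPENAI_API_KEY in the environment;
# the model name and tool selection below are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="paper-helper",
    instructions="You answer questions about large language models concisely.",
    tools=[{"type": "code_interpreter"}],  # tool use, as featured by the API
    model="gpt-4o-mini",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="Summarize the main stages of LLM development in three sentences.",
)

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):  # simple polling loop
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id):
    print(message.role, message.content)
```

A production assistant would typically also handle tool-call outputs and failure statuses rather than only polling for completion.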
CODA

It is not an easy job to write this long survey and update its content with timely work. First of all, we would like to sincerely thank the support from the readers and our team members. We work very hard on this survey, and hope that it can present a comprehensive, timely reference for LLMs.

Survey Writing. This survey was planned during a discussion meeting held by our research team, and we aimed to summarize the recent advances of large language models as a highly readable report for our team members. The first draft was finished on March 13, 2023, in which our team members tried their best to include the related studies about LLMs in a relatively objective, comprehensive way. Then, we extensively revised the writing and contents in several passes. Due to the space limit, we can only include a fraction of existing LLMs in Figure 3 and Table 1, by setting a selection criterion. However, we set a more relaxed criterion for model selection on our GitHub page (https://github.com/RUCAIBox/LLMSurvey), which will be regularly maintained. We released the initial version on March 31, 2023, the major revision on June 29, 2023, the second version on September 10, 2023, and this latest version (major revision) on November 23, 2023.

Seeking for Advice. Despite all our efforts, this survey is still far from perfect: we are likely to miss important references or topics, and might also have non-rigorous expressions or discussions. We will continuously update this survey and improve its quality as much as we can. For us, writing this survey is also a learning process about LLMs. For readers with constructive suggestions to improve this survey, you are welcome to leave comments on the GitHub page of our survey or directly email our authors. We will make revisions following the received comments or suggestions in a future version, and acknowledge the readers who have contributed constructive suggestions in our survey.

Update log. In this part, we regularly maintain an update log for the submissions of this survey to arXiv:
• First release on March 31, 2023: the initial version.
• Update on April 9, 2023: add the affiliation information, revise Figure 3 and Table 1 and clarify the corresponding selection criterion for LLMs, improve the writing, and correct some minor errors.
• Update on April 11, 2023: correct the errors for library resources.
• Update on April 12, 2023: revise Figure 3 and Table 1, and clarify the release date of LLMs.
• Update on April 16, 2023: add a new Section 2.2 about the technical evolution of GPT-series models.
• Update on April 24, 2023: add the discussion about scaling laws and add some explanations about the model sizes for emergent abilities (Section 2.1); add an illustrative figure for the attention patterns for different architectures in Figure 9, and add the detailed formulas in Table 7.
• Update on April 25, 2023: revise some copy errors in figures and tables.
• Update on April 27, 2023: add efficient tuning in Section 5.3.
• Update on April 28, 2023: revise Section 5.3.
• Update on May 7, 2023: revise Table 1, Table 2, and some minor points.
• Update on June 29, 2023 (major revision):
– Section 1: add Figure 1 for the trends of published LLM papers on arXiv;
– Section 2: add Figure 4 for GPT's evolution and the corresponding discussion;
– Section 3: add Figure 5 for the LLaMA family and the corresponding discussion;
– Section 5: add latest discussion about the synthetic data formatting of instruction tuning in Section 5.1.1, the empirical analysis for instruction tuning in Section 5.1.4, parameter-efficient model adaptation in Section 5.3 and memory-efficient adaptation in Section 5.3;
– Section 6: add latest discussion about the underlying mechanism of ICL in Section 6.2.3, and planning for complex task solving in Section 6.4;
– Section 7: update Table 14 for representative datasets for evaluating advanced abilities of LLMs, and the empirical ability evaluation in Section 7.4;
– Section 6.1.1: add prompt design;
– Section 8: add the discussions on applications of LLMs in finance and scientific research domains;
• Update on September 10, 2023 (major revision):
– Claim the copyrights of the figures and tables in this paper.
– Add latest LLMs, techniques and their descriptions in Section 3, Section 4, Section 5, Section 6 and Section 7;
– Section 4: add latest discussion about the decoding strategy in Section 4.2.4;
– Section 5: add latest discussion about the practical tricks for instruction tuning in Section 5.1.2, the empirical analysis on LLaMA (13B) for instruction tuning in Section 5.1.4, practical strategies for RLHF in Section 5.2.3, alignment without RLHF in Section 5.2.4, and remarks on SFT and RLHF in Section 5.2.5;
– Section 6: update the content about planning for complex task solving in Section 6.4;
– Section 7: add discussions about evaluation approaches in Section 7.3.2 and Table 15 for the category of existing evaluation work, and update the empirical ability evaluation in Section 7.4 and the results in Table 16;
– Section 6.1.1: add new prompt examples in Table 12;
• Update on November 23, 2023 (major revision):
– Section 1: add Figure 2 for the evolution process of four generations of language models;
– Section 2: add more discussion about scaling laws and how emergent abilities relate to scaling laws;
– Section 3: add latest LLMs in Figure 3 and Table 1, latest APIs in Section 3.1, commonly used datasets for instruction tuning and alignment tuning in Section 3.3, and several libraries in Section 3.4;
– Section 4: add latest discussion about data scheduling, including data mixtures and data curriculum, in Section 4.1.3; add a summary of data preparation in Section 4.1.4; add discussion about modeling long context in Section 9.1; add discussion about decoding efficiency issues and latest decoding strategies in Section 4.2.4;
– Section 5: add latest discussion about instance construction and tuning strategies in Section 5.1; add latest discussion about process-supervised RLHF in Section 5.2.3, and the empirical study on quantized LLaMA models (7B and 13B) in Section 9.5.1;
– Section 6: add latest discussion about prompt optimization in Section 6.1.2, and update the content about chain-of-thought prompting in Section 6.3;
– Section 8: add latest discussion about LLM for research directions in Section 8.1;
– Section 10: revise the content in several aspects.
• Update on September 25, 2024 (major revision):
– Section 3: reorganize the content of “publicly available model checkpoints” into multiple series; add the latest LLMs in Figure 3.
– Section 4: add LLM-based data filtering and selection methods in Section 4.1.2; update Section 4.2.1, “Emergent Architectures”, to include more discussions about SSM-based architectures; add Table 6 to compare the parallelism and complexity of different architectures.
– Section 5: add latest discussion about instruction quality improvement and instruction selection in Section 5.1.1; add latest discussion about practical strategies for RLHF and process-supervised RLHF in Section 5.2.3; update the content about supervised alignment tuning in Section 5.2.4.
– Section 6: add latest papers about discrete prompt optimization in Section 6.1.2.
– Section 9: add latest discussion about advanced topics, including long context modeling, LLM-based agents, analysis and optimization for training and inference, model inference, model compression, retrieval-augmented generation, and hallucination.
• Update on October 12, 2024: correct the errors in Section 8.1.5.

Clarifications on Experiments. In this version, we have included a number of experiments on instruction tuning (Table 10), overall ability evaluation (Table 16), and prompt engineering (Table 17). Due to the limit of computational resources, our experiments are not complete and are limited to small-sized models or a few comparisons. Despite that, we feel that it might be meaningful to share the partial results with the public. We will try to include the missing results of larger models or more comparisons in future versions. We also call for support of computing power for conducting more comprehensive experiments.

Chinese Book. We have also released a Chinese book based on this survey article at https://llmbook-zh.github.io. This book is in the publication process.

ACKNOWLEDGMENTS

The authors would like to thank Yankai Lin and Yutao Zhu for proofreading this paper. Since the first release of this paper, we have received a number of valuable comments from the readers. We sincerely thank the readers who have written to us with constructive suggestions and comments: Tyler Suard, Damai Dai, Liang Ding, Stella Biderman, Kevin Gray, Jay Alammar, Yubo Feng, Mark Holmstrom, Xingdong Liu, Il-Seok Oh, Yiting Liu, Shaojun Wang, Gaoyan Ou, Todd Morrill, Hao Liu, Zhenyu Zhang, and Xinlin Zhuang.

Since the v11 version (June 29, 2023), we have been adding a large number of experiments and prompt practices.
These new contents were completed by a number of volunteers in our team. Here, we add a special part to thank all the students who have worked very hard on this part (also including the ones on our author list).

Contribution on Experiments. We would like to sincerely thank the following people for their hard work on the experiments shown in Table 16.
• Xiaoxue Cheng: implemented the experiments for evaluation on language generation and HaluEval tasks.
• Yuhao Wang: implemented the experiments for evaluation on interaction-with-environment tasks.
• Bowen Zheng: implemented the experiments for evaluation on tool manipulation tasks.

Contribution on Tips. We thank the following people for contributing the correspondingly numbered tips for designing prompts in Table 12.
• Xiaolei Wang: T3, O3
• Beichen Zhang: D2, D5
• Zhipeng Chen: D3, D4
• Junjie Zhang: D6
• Bowen Zheng: D7
• Zican Dong: D8
• Xinyu Tang: C2
• Yifan Du: T4
• Tianyi Tang: O6, O7, D9
• Yupeng Hou: O8, C3
• Salvatore Raieli: C4

REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011.
[3] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[4] M. D. Hauser, N. Chomsky, and W. T. Fitch, “The faculty of language: what is it, who has it, and how did it evolve?” Science, vol. 298, no. 5598, pp. 1569–1579, 2002.
[5] A. M. Turing, “Computing machinery and intelligence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[6] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[7] J. Gao and C. Lin, “Introduction to the special issue on statistical language modeling,” ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.
[8] R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[9] A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Seventh International Conference on Spoken Language Processing, 2002.
[10] X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.
[11] C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008.
[12] S. M. Thede and M. P. Harper, “A second-order hidden Markov model for part-of-speech tagging,” in 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182.
[13] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[14] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867.
[15] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987.
[16] W. A. Gale and G. Sampson, “Good-Turing frequency estimation without tears,” J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[17] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048.
[18] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, NAACL-HLT 2018, New Orleans, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and
Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), D. Amodei, “Scaling laws for neural language mod-
M. A. Walker, H. Ji, and A. Stent, Eds. Association els,” CoRR, vol. abs/2001.08361, 2020.
for Computational Linguistics, 2018, pp. 2227–2237. [31] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph,
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou,
L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals,
“Attention is all you need,” in Advances in Neural In- P. Liang, J. Dean, and W. Fedus, “Emergent
formation Processing Systems 30: Annual Conference on abilities of large language models,” CoRR, vol.
Neural Information Processing Systems 2017, December abs/2206.07682, 2022.
4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. [32] M. Shanahan, “Talking about large language mod-
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova, els,” CoRR, vol. abs/2212.03551, 2022.
“BERT: pre-training of deep bidirectional transform- [33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi,
ers for language understanding,” in Proceedings of Q. Le, and D. Zhou, “Chain of thought prompting
the 2019 Conference of the North American Chapter of elicits reasoning in large language models,” CoRR,
the Association for Computational Linguistics: Human vol. abs/2201.11903, 2022.
Language Technologies, NAACL-HLT 2019, Minneapolis, [34] J. Hoffmann, S. Borgeaud, A. Mensch,
MN, USA, June 2-7, 2019, Volume 1 (Long and Short E. Buchatskaya, T. Cai, E. Rutherford,
Papers), J. Burstein, C. Doran, and T. Solorio, Eds. D. de Las Casas, L. A. Hendricks, J. Welbl,
Association for Computational Linguistics, 2019, pp. A. Clark, T. Hennigan, E. Noland, K. Millican,
4171–4186. G. van den Driessche, B. Damoc, A. Guy, S. Osindero,
[24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo- K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and
hamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, L. Sifre, “Training compute-optimal large language
“BART: denoising sequence-to-sequence pre-training models,” vol. abs/2203.15556, 2022.
for natural language generation, translation, and [35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom,
comprehension,” in Proceedings of the 58th Annual A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and
Meeting of the Association for Computational Linguistics, R. Stojnic, “Galactica: A large language model for
ACL 2020, Online, July 5-10, 2020, 2020, pp. 7871– science,” CoRR, vol. abs/2211.09085, 2022.
7880. [36] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and
[25] W. Fedus, B. Zoph, and N. Shazeer, “Switch trans- G. Neubig, “Pre-train, prompt, and predict: A sys-
formers: Scaling to trillion parameter models with tematic survey of prompting methods in natural
simple and efficient sparsity,” J. Mach. Learn. Res, pp. language processing,” ACM Comput. Surv., pp. 195:1–
1–40, 2021. 195:35, 2023.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, [37] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang,
I. Sutskever et al., “Language models are unsuper- K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu,
vised multitask learners,” OpenAI blog, p. 9, 2019. Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun,
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, “A comprehensive survey on pretrained foundation
O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, models: A history from BERT to chatgpt,” CoRR, vol.
“Roberta: A robustly optimized BERT pretraining abs/2302.09419, 2023.
approach,” CoRR, vol. abs/1907.11692, 2019. [38] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo,
[28] V. Sanh, A. Webson, C. Raffel, S. H. Bach, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han,
L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu,
A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao,
Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. and J. Zhu, “Pre-trained models: Past, present and
Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, future,” AI Open, vol. 2, pp. 225–250, 2021.
M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw- [39] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang,
den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San- “Pre-trained models for natural language processing:
tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi- A survey,” CoRR, vol. abs/2003.08271, 2020.
derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask [40] S. Altman, “Planning for agi and beyond,” OpenAI
prompted training enables zero-shot task generaliza- Blog, February 2023.
tion,” in The Tenth International Conference on Learning [41] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke,
Representations, ICLR 2022, Virtual Event, April 25-29, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li,
2022. OpenReview.net, 2022. S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and
[29] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Y. Zhang, “Sparks of artificial general intelligence:
Chung, I. Beltagy, J. Launay, and C. Raffel, “What Early experiments with gpt-4,” vol. abs/2303.12712,
language model architecture and pretraining objec- 2023.
tive works best for zero-shot generalization?” in [42] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal,
International Conference on Machine Learning, ICML S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra,
2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary,
Proceedings of Machine Learning Research, vol. 162, S. Som, X. Song, and F. Wei, “Language is not all you
2022, pp. 22 964–22 984. need: Aligning perception with language models,”
[30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, CoRR, vol. abs/2302.14045, 2023.
[43] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Is-
L. Sun, “A comprehensive survey of ai-generated ard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghe-
content (aigc): A history of generative ai from gan mawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,
to chatgpt,” arXiv preprint arXiv:2303.04226, 2023. K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan,
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdh- H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Do-
ery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu han, S. Agrawal, M. Omernick, A. M. Dai, T. S.
et al., “Palm-e: An embodied multimodal language Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,
model,” arXiv preprint arXiv:2303.03378, 2023. O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta,
[45] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-
N. Duan, “Visual chatgpt: Talking, drawing and edit- Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel,
ing with visual foundation models,” arXiv preprint “Palm: Scaling language modeling with pathways,”
arXiv:2303.04671, 2023. CoRR, vol. abs/2204.02311, 2022.
[46] OpenAI, “Gpt-4 technical report,” OpenAI, 2023. [57] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
[47] Y. Fu, H. Peng, and T. Khot, “How does gpt obtain M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham-
its ability? tracing emergent abilities of language bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and
models to their sources,” Yao Fu’s Notion, Dec 2022. G. Lample, “Llama: Open and efficient foundation
[48] J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained language models,” CoRR, 2023.
language model for text generation: A survey,” in [58] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse,
Proceedings of the Thirtieth International Joint Confer- J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray
ence on Artificial Intelligence, IJCAI 2021, Virtual Event et al., “Scaling laws for autoregressive generative
/ Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. modeling,” arXiv preprint arXiv:2010.14701, 2020.
ijcai.org, 2021, pp. 4492–4499. [59] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu,
[49] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi:
survey of deep learning for mathematical reason- Optimizing data mixtures speeds up language model
ing,” CoRR, vol. abs/2212.10535, 2022. pretraining,” arXiv preprint arXiv:2305.10429, 2023.
[50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, [60] P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu,
X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in- M. Hobbhahn, and A. Ho, “Will we run out of
context learning,” CoRR, vol. abs/2301.00234, 2023. data? an analysis of the limits of scaling datasets in
[51] J. Huang and K. C. Chang, “Towards reasoning machine learning,” CoRR, vol. abs/2211.04325, 2022.
in large language models: A survey,” CoRR, vol. [Online]. Available: https://doi.org/10.48550/arXiv.
abs/2212.10403, 2022. 2211.04325
[52] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, [61] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao,
C. Tan, F. Huang, and H. Chen, “Reasoning with A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel,
language model prompting: A survey,” CoRR, vol. “Scaling data-constrained language models,” arXiv
abs/2212.09597, 2022. preprint arXiv:2305.16264, 2023.
[53] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, [62] I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu,
“Chatgpt: potential, prospects, and limitations,” in A. Mueller, N. Kim, S. Bowman, and E. Perez, “The
Frontiers of Information Technology & Electronic Engi- inverse scaling prize,” 2022. [Online]. Available:
neering, 2023, pp. 1–6. https://github.com/inverse-scaling/prize
[54] W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense [63] B. A. Huberman and T. Hogg, “Phase transitions in
text retrieval based on pretrained language models: artificial intelligence systems,” Artificial Intelligence,
A survey,” ACM Transactions on Information Systems, vol. 33, no. 2, pp. 155–171, 1987.
vol. 42, no. 4, pp. 1–60, 2024. [64] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff-
[55] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, mann, H. F. Song, J. Aslanides, S. Henderson,
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, R. Ring, S. Young, E. Rutherford, T. Hennigan,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, J. Menick, A. Cassirer, R. Powell, G. van den
G. Krueger, T. Henighan, R. Child, A. Ramesh, Driessche, L. A. Hendricks, M. Rauh, P. Huang,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Ue-
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, sato, J. Mellor, I. Higgins, A. Creswell, N. McAleese,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya,
and D. Amodei, “Language models are few-shot D. Budden, E. Sutherland, K. Simonyan, M. Paganini,
learners,” in Advances in Neural Information Processing L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Ne-
Systems 33: Annual Conference on Neural Information matzadeh, E. Gribovskaya, D. Donato, A. Lazaridou,
Processing Systems 2020, NeurIPS 2020, December 6-12, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grig-
2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, orev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen,
M. Balcan, and H. Lin, Eds., 2020. Z. Gong, D. Toyama, C. de Masson d’Autume,
[56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark,
G. Mishra, A. Roberts, P. Barham, H. W. Chung, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J.
C. Sutton, S. Gehrmann, P. Schuh, K. Shi, Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel,
S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell,
Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett,
D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling [72] S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao,
language models: Methods, analysis & insights from Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun,
training gopher,” CoRR, vol. abs/2112.11446, 2021. “Unlock predictable scaling from emergent abilities,”
[65] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, 2023.
“Why can GPT learn in-context? language models se- [73] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and
cretly perform gradient descent as meta-optimizers,” V. Misra, “Grokking: Generalization beyond overfit-
CoRR, vol. abs/2212.10559, 2022. ting on small algorithmic datasets,” arXiv preprint
[66] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wain- arXiv:2201.02177, 2022.
wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, [74] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He,
A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, “Deepspeed: System optimizations enable training
M. Simens, A. Askell, P. Welinder, P. F. Christiano, deep learning models with over 100 billion param-
J. Leike, and R. Lowe, “Training language models eters,” in KDD, 2020, pp. 3505–3506.
to follow instructions with human feedback,” CoRR, [75] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley,
vol. abs/2203.02155, 2022. J. Casper, and B. Catanzaro, “Megatron-lm: Train-
[67] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, ing multi-billion parameter language models using
B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned model parallelism,” CoRR, vol. abs/1909.08053, 2019.
language models are zero-shot learners,” in The Tenth [76] D. Narayanan, M. Shoeybi, J. Casper, P. LeGres-
International Conference on Learning Representations, ley, M. Patwary, V. Korthikanti, D. Vainbrand,
ICLR 2022, Virtual Event, April 25-29, 2022. Open- P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phan-
Review.net, 2022. ishayee, and M. Zaharia, “Efficient large-scale lan-
[68] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, guage model training on GPU clusters using
A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, megatron-lm,” in International Conference for High Per-
Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, formance Computing, Networking, Storage and Analysis,
M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, SC 2021, St. Louis, Missouri, USA, November 14-19,
J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, 2021. ACM, 2021, p. 58.
Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pick- [77] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. An-
ett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, dersch, M. Shoeybi, and B. Catanzaro, “Reducing ac-
R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, tivation recomputation in large transformer models,”
V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, CoRR, vol. abs/2205.05198, 2022.
A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Ra- [78] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic,
jakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fen- D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon,
ton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera- M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Web-
Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le, son, P. S. Ammanamanchi, T. Wang, B. Sagot,
“Lamda: Language models for dialog applications,” N. Muennighoff, A. V. del Moral, O. Ruwase, R. Baw-
CoRR, vol. abs/2201.08239, 2022. den, S. Bekman, A. McMillan-Major, I. Beltagy,
[69] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,
W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell,
A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji,
A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou,
Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. C. Emezue, C. Klamm, C. Leong, D. van Strien,
Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, D. I. Adelani, and et al., “BLOOM: A 176b-parameter
and J. Wei, “Scaling instruction-finetuned language open-access multilingual language model,” CoRR,
models,” CoRR, vol. abs/2210.11416, 2022. vol. abs/2211.05100, 2022.
[70] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, [79] P. F. Christiano, J. Leike, T. B. Brown, M. Martic,
A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, S. Legg, and D. Amodei, “Deep reinforcement learn-
A. Garriga-Alonso, A. Kluska, A. Lewkowycz, ing from human preferences,” in Advances in Neural
A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Information Processing Systems 30: Annual Conference
Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, on Neural Information Processing Systems 2017, Decem-
A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, ber 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von
A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller, Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N.
A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang, Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299–
A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, 4307.
A. Venkatesh, A. Gholamidavoodi, A. Tabassum, [80] T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu,
A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sab- M. Lomeli, L. Zettlemoyer, N. Cancedda, and
harwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, T. Scialom, “Toolformer: Language models can teach
and et al., “Beyond the imitation game: Quantifying themselves to use tools,” CoRR, vol. abs/2302.04761,
and extrapolating the capabilities of language mod- 2023.
els,” CoRR, vol. abs/2206.04615, 2022. [81] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang,
[71] R. Schaeffer, B. Miranda, and S. Koyejo, “Are emer- C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saun-
gent abilities of large language models a mirage?” ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger,
arXiv preprint arXiv:2304.15004, 2023. K. Button, M. Knight, B. Chess, and J. Schulman,
“Webgpt: Browser-assisted question-answering with S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin,


human feedback,” CoRR, vol. abs/2112.09332, 2021. T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig,
[82] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring “OPT: open pre-trained transformer language mod-
the limits of transfer learning with a unified text- els,” CoRR, vol. abs/2205.01068, 2022.
to-text transformer,” J. Mach. Learn. Res., pp. 140:1– [91] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad,
140:67, 2020. K. Heafield, K. Heffernan, E. Kalbassi, J. Lam,
[83] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al- D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen-
Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A zek, A. Youngblood, B. Akula, L. Barrault, G. M.
massively multilingual pre-trained text-to-text trans- Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R.
former,” in Proceedings of the 2021 Conference of the Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,
North American Chapter of the Association for Com- N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,
putational Linguistics: Human Language Technologies, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko,
NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp. C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “No
483–498. language left behind: Scaling human-centered ma-
[84] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, chine translation,” CoRR, vol. abs/2207.04672, 2022.
X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li, [92] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue,
Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo, Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex:
Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi, A pre-trained model for code generation with multi-
F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang, lingual evaluations on humaneval-x,” arXiv preprint
Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan, arXiv:2303.17568, 2023.
Y. Wang, X. Jin, Q. Liu, and Y. Tian, “Pangu-α: Large- [93] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding,
scale autoregressive pretrained chinese language Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma,
models with auto-parallel computation,” CoRR, vol. Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and
abs/2104.12369, 2021. J. Tang, “GLM-130B: an open bilingual pre-trained
[85] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, model,” vol. abs/2210.02414, 2022.
Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai, [94] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts,
G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X.
X. Zhu, and M. Sun, “CPM-2: large-scale cost- Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji,
effective pre-trained language models,” CoRR, vol. K. Almubarak, S. Albanie, Z. Alyafeai, A. Web-
abs/2106.10715, 2021. son, E. Raff, and C. Raffel, “Crosslingual general-
[86] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, ization through multitask finetuning,” CoRR, vol.
Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An abs/2211.01786, 2022.
open large language model for code with mtulti-turn [95] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig,
program synthesis,” arXiv preprint arXiv:2203.13474, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura,
2022. X. Li, B. O’Horo, G. Pereyra, J. Wang, C. Dewan,
[87] S. Black, S. Biderman, E. Hallahan, Q. Anthony, A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov,
L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, “OPT-IML: scaling language model instruction meta
J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, learning through the lens of generalization,” CoRR,
L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt- vol. abs/2212.12017, 2022.
neox-20b: An open-source autoregressive language [96] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley,
model,” CoRR, vol. abs/2204.06745, 2022. K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit,
[88] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, U. S. Prashanth, E. Raff et al., “Pythia: A suite for
A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, analyzing large language models across training and
A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, scaling,” arXiv preprint arXiv:2304.01373, 2023.
H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuz- [97] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and
nia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, Y. Zhou, “Codegen2: Lessons for training llms on
M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, programming and natural languages,” CoRR, vol.
P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. abs/2305.02309, 2023.
Sampat, S. Mishra, S. R. A, S. Patro, T. Dixit, and [98] R. Li, L. B. Allal, Y. Zi, N. Muennighoff,
X. Shen, “Super-naturalinstructions: Generalization D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li,
via declarative instructions on 1600+ NLP tasks,” in J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo,
Proceedings of the 2022 Conference on Empirical Methods T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier,
in Natural Language Processing, EMNLP 2022, Abu J. Monteiro, O. Shliazhko, N. Gontier, N. Meade,
Dhabi, United Arab Emirates, December 7-11, 2022, A. Zebaze, M. Yee, L. K. Umapathi, J. Zhu, B. Lipkin,
2022, pp. 5085–5109. M. Oblokulov, Z. Wang, R. M. V, J. Stillerman,
[89] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcı́a, J. Wei, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey,
X. Wang, H. W. Chung, D. Bahri, T. Schuster, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu,
H. Zheng, D. Zhou, N. Houlsby, and D. Metzler, S. Singh, S. Luccioni, P. Villegas, M. Kunakov,
“Ul2: Unifying language learning paradigms,” 2022. F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding,
[90] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao,
M. Mishra, A. Gu, J. Robinson, C. J. Anderson, S. Kang, N. Ryu, K. M. Yoo, M. Chang, S. Suh,


B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, S. In, J. Park, K. Kim, H. Kim, J. Jeong, Y. G. Yeo,
D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, D. Ham, D. Park, M. Y. Lee, J. Kang, I. Kang, J. Ha,
T. Wolf, A. Guha, L. von Werra, and H. de Vries, W. Park, and N. Sung, “What changes can large-
“Starcoder: may the source be with you!” CoRR, scale language models bring? intensive study on hy-
vol. abs/2305.06161, 2023. [Online]. Available: perclova: Billions-scale korean generative pretrained
https://doi.org/10.48550/arXiv.2305.06161 transformers,” in Proceedings of the 2021 Conference
[99] H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma- on Empirical Methods in Natural Language Processing,
hairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, EMNLP 2021, Virtual Event / Punta Cana, Dominican
S. Bhosale et al., “Llama 2: Open foundation and fine- Republic, 7-11 November, 2021. Association for Com-
tuned chat models,” arXiv preprint arXiv:2307.09288, putational Linguistics, 2021.
2023. [109] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu,
[100] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, F. Li, H. Zhu, J. Luo, L. Xu et al., “Yuan 1.0: Large-
D. Pan, D. Wang, D. Yan, F. Yang et al., “Baichuan scale pre-trained language model in zero-shot and
2: Open large-scale language models,” arXiv preprint few-shot learning,” arXiv preprint arXiv:2110.04725,
arXiv:2309.10305, 2023. 2021.
[101] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, [110] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli,
W. Ge, Y. Han, F. Huang et al., “Qwen technical T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das-
report,” arXiv preprint arXiv:2309.16609, 2023. Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez,
[102] X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B.
P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka-
llm and how to train it with $100 k budget,” arXiv plan, “A general language assistant as a laboratory
preprint arXiv:2309.03852, 2023. for alignment,” CoRR, vol. abs/2112.00861, 2021.
[103] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, [111] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong,
B. Li, C. Cheng, W. Lü, R. Hu et al., “Skywork: S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen,
A more open bilingual foundation model,” arXiv Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao,
preprint arXiv:2310.19341, 2023. S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu,
[104] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, W. Zeng, G. Li, W. Gao, and H. Wang, “ERNIE
Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, 3.0 titan: Exploring larger-scale knowledge enhanced
“Gshard: Scaling giant models with conditional com- pre-training for language understanding and gener-
putation and automatic sharding,” in 9th International ation,” CoRR, vol. abs/2112.12731, 2021.
Conference on Learning Representations, ICLR 2021, Vir- [112] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin,
tual Event, Austria, May 3-7, 2021, 2021. Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat,
[105] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang,
de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S.
N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V.
M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, Le, Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient
S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, scaling of language models with mixture-of-experts,”
M. Bavarian, C. Winter, P. Tillet, F. P. Such, in International Conference on Machine Learning, ICML
D. Cummings, M. Plappert, F. Chantzis, E. Barnes, 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022,
A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, pp. 5547–5569.
N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, [113] S. Smith, M. Patwary, B. Norick, P. LeGresley,
W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye,
V. Misra, E. Morikawa, A. Radford, M. Knight, G. Zerveas, V. Korthikanti, E. Zheng, R. Child,
M. Brundage, M. Murati, K. Mayer, P. Welinder, R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi,
B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, Y. He, M. Houston, S. Tiwary, and B. Catanzaro,
and W. Zaremba, “Evaluating large language models “Using deepspeed and megatron to train megatron-
trained on code,” CoRR, vol. abs/2107.03374, 2021. turing NLG 530b, A large-scale generative language
[106] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, model,” CoRR, vol. abs/2201.11990, 2022.
J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, [114] Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrit-
W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, twieser, R. Leblond, T. Eccles, J. Keeling, F. Gi-
X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, meno, A. D. Lago, T. Hubert, P. Choy, C. de Mas-
“ERNIE 3.0: Large-scale knowledge enhanced pre- son d’Autume, I. Babuschkin, X. Chen, P. Huang,
training for language understanding and genera- J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J.
tion,” CoRR, vol. abs/2107.02137, 2021. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas,
[107] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, K. Kavukcuoglu, and O. Vinyals, “Competition-level
“Jurassic-1: Technical details and evaluation,” White code generation with alphacode,” Science, 2022.
Paper. AI21 Labs, vol. 1, 2021. [115] S. Soltan, S. Ananthakrishnan, J. FitzGerald,
[108] B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls,
S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Srid-
S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim, har, F. Triefenbach, A. Verma, G. Tür, and P. Natara-
jan, “Alexatm 20b: Few-shot learning using a solves and generates mathematics problems by pro-
large-scale multilingual seq2seq model,” CoRR, vol. gram synthesis: Calculus, differential equations, lin-
abs/2208.01448, 2022. ear algebra, and more,” CoRR, vol. abs/2112.15594,
[116] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, 2021.
V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chad- [127] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M.
wick, P. Thacker, L. Campbell-Gillingham, J. Ue- Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim,
sato, P. Huang, R. Comanescu, F. Yang, A. See, C. Hallacy, J. Heidecke, P. Shyam, B. Power, T. E.
S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S. Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P.
Elias, R. Green, S. Mokrá, N. Fernando, B. Wu, Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov,
R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mel- J. Jang, P. Welinder, and L. Weng, “Text and code
lor, D. Hassabis, K. Kavukcuoglu, L. A. Hendricks, embeddings by contrastive pre-training,” CoRR, vol.
and G. Irving, “Improving alignment of dialogue abs/2201.10005, 2022.
agents via targeted human judgements,” CoRR, vol. [128] J. Schulman, F. Wolski, P. Dhariwal, A. Radford,
abs/2209.14375, 2022. and O. Klimov, “Proximal policy optimization algo-
[117] H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and rithms,” arXiv preprint arXiv:1707.06347, 2017.
J. Zhou, “Welm: A well-read pre-trained language [129] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler,
model for chinese,” CoRR, vol. abs/2209.10372, 2022. R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F.
[118] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, Christiano, “Learning to summarize from human
S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowd- feedback,” CoRR, vol. abs/2009.01325, 2020.
hery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby, [130] OpenAI, “Our approach to alignment research,” Ope-
Q. V. Le, and M. Dehghani, “Transcending scal- nAI Blog, August 2022.
ing laws with 0.1% extra compute,” CoRR, vol. [131] ——, “Introducing chatgpt,” OpenAI Blog, November
abs/2210.11399, 2022. 2022.
[119] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, [132] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai,
W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov, S. Kadavath, B. Mann, E. Perez, N. Schiefer,
A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Con-
Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion pa- erly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk,
rameter language model with sparse heterogeneous S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernan-
computing,” CoRR, vol. abs/2303.10845, 2023. dez, T. Hume, J. Jacobson, S. Johnston, S. Kravec,
[120] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep- C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei,
ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Ka-
Z. Chen et al., “Palm 2 technical report,” arXiv plan, and J. Clark, “Red teaming language models
preprint arXiv:2305.10403, 2023. to reduce harms: Methods, scaling behaviors, and
[121] A. Radford, R. Józefowicz, and I. Sutskever, “Learn- lessons learned,” CoRR, vol. abs/2209.07858, 2022.
ing to generate reviews and discovering sentiment,” [133] OpenAI, “Gpt-4v(ision) system card,” OpenAI, 2023.
CoRR, vol. abs/1704.01444, 2017. [134] ——, “Lessons learned on language model safety
[122] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever and misuse,” OpenAI blog, 2022.
et al., “Improving language understanding by gener- [135] Meta, “Introducing meta llama 3: The most capable
ative pre-training,” 2018. openly available llm to date,” https://ai.meta.com/
[123] B. McCann, N. S. Keskar, C. Xiong, and R. Socher, blog/meta-llama-3/, 2024.
“The natural language decathlon: Multitask learning [136] “Introducing Llama 3.1: Our most capable models to
as question answering,” CoRR, vol. abs/1806.08730, date ,” https://ai.meta.com/blog/meta-llama-3-1/,
2018. 2023.
[124] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, [137] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam-
X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT ford, D. S. Chaplot, D. de las Casas, F. Bressand,
: Large-scale generative pre-training for conversa- G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-
tional response generation,” in Proceedings of the 58th A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
Annual Meeting of the Association for Computational T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023.
Linguistics: System Demonstrations, ACL 2020, Online, [138] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
July 5-10, 2020, A. Celikyilmaz and T. Wen, Eds. B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas,
Association for Computational Linguistics, 2020, pp. E. B. Hanna, F. Bressand, G. Lengyel, G. Bour,
270–278. G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux,
[125] D. Ham, J. Lee, Y. Jang, and K. Kim, “End-to-end P. Stock, S. Subramanian, S. Yang, S. Antoniak,
neural pipeline for goal-oriented dialogue systems T. L. Scao, T. Gervet, T. Lavril, T. Wang,
using GPT-2,” in Proceedings of the 58th Annual Meet- T. Lacroix, and W. E. Sayed, “Mixtral of experts,”
ing of the Association for Computational Linguistics, CoRR, vol. abs/2401.04088, 2024. [Online]. Available:
ACL 2020, Online, July 5-10, 2020. Association for https://doi.org/10.48550/arXiv.2401.04088
Computational Linguistics, 2020, pp. 583–592. [139] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju,
[126] I. Drori, S. Tran, R. Wang, N. Cheng, K. Liu, L. Tang, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love,
E. Ke, N. Singh, T. L. Patti, J. Lynch, A. Shporer, P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts,
N. Verma, E. Wu, and G. Strang, “A neural network A. Barua, A. Botev, A. Castro-Ros, A. Slone,
A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, M. Sun, “JEC-QA: A legal-domain question answer-


B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, ing dataset,” in The Thirty-Fourth AAAI Conference
C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, on Artificial Intelligence, AAAI 2020, The Thirty-Second
E. Ni, E. Noland, G. Yan, G. Tucker, G. Muraru, Innovative Applications of Artificial Intelligence Confer-
G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Gr- ence, IAAI 2020, The Tenth AAAI Symposium on Edu-
ishchenko, J. Austin, J. Keeling, J. Labanowski, cational Advances in Artificial Intelligence, EAAI 2020,
J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, New York, NY, USA, February 7-12, 2020. AAAI Press,
J. Chiu, and et al., “Gemma: Open models based 2020, pp. 9701–9708.
on gemini research and technology,” CoRR, vol. [145] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang,
abs/2403.08295, 2024. and P. Szolovits, “What disease does this patient
[140] M. Rivière, S. Pathak, P. G. Sessa, C. Hardin, S. Bhu- have? a large-scale open domain question answer-
patiraju, L. Hussenot, T. Mesnard, B. Shahriari, ing dataset from medical exams,” Applied Sciences,
A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Cas- vol. 11, no. 14, p. 6421, 2021.
bon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsit- [146] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
sulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Mom- C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford
chev, M. Hoffman, S. Thakoor, J. Grill, B. Neyshabur, alpaca: An instruction-following llama model,”
O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ah- https://github.com/tatsu-lab/stanford alpaca,
mad, A. Hutchison, A. Abdagic, A. Carl, A. Shen, 2023.
A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bas- [147] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith,
tian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, D. Khashabi, and H. Hajishirzi, “Self-instruct: Align-
C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopal- ing language model with self generated instruc-
nikov, D. Weinberger, D. Vijaykumar, D. Rogozin- tions,” CoRR, vol. abs/2212.10560, 2022.
ska, D. Herbison, E. Bandy, E. Wang, E. Noland, [148] Alpaca-LoRA, “Instruct-tune llama on consumer
E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin, hardware,” https://github.com/tloen/alpaca-lora,
G. Wei, G. Cameron, G. Martins, H. Hashemi, 2023.
H. Klimczak-Plucinska, H. Batra, H. Dhand, I. Nar- [149] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
dini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank
J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fer- adaptation of large language models,” in The Tenth
nandez, J. van Amersfoort, J. Gordon, J. Lipschultz, International Conference on Learning Representations,
J. Newlan, J. Ji, K. Mohamed, K. Badola, K. Black, ICLR 2022, Virtual Event, April 25-29, 2022. Open-
K. Millican, K. McDonell, K. Nguyen, K. Sodhia, Review.net, 2022.
K. Greene, L. L. Sjösund, L. Usui, L. Sifre, L. Heuer- [150] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel,
mann, L. Lago, and L. McNealus, “Gemma 2: Im- S. Levine, and D. Song, “Koala: A dialogue model for
proving open language models at a practical size,” academic research,” Blog post, April 2023.
CoRR, vol. abs/2408.00118, 2024. [151] Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, B. Ma,
[141] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, and X. Li, “Belle: Be everyone’s large language
C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, model engine,” https://github.com/LianjiaTech/
J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, BELLE, 2023.
J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, [152] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, Gonzalez, I. Stoica, and E. P. Xing, “Vicuna:
S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, An open-source chatbot impressing gpt-4 with
X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao, 90%* chatgpt quality,” 2023. [Online]. Available:
Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, https://vicuna.lmsys.org
and Z. Fan, “Qwen2 technical report,” arXiv preprint [153] D. Eccleston, “Sharegpt,” https://sharegpt.com/,
arXiv:2407.10671, 2024. 2023.
[142] Q. Team, “Qwen2.5: A party of foundation [154] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
models,” September 2024. [Online]. Available: tuning,” CoRR, vol. abs/2304.08485, 2023.
https://qwenlm.github.io/blog/qwen2.5/ [155] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny,
[143] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, “Minigpt-4: Enhancing vision-language understand-
D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, ing with advanced large language models,” CoRR,
J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, vol. abs/2304.10592, 2023.
J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, [156] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang,
P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: To-
S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, wards general-purpose vision-language models with
X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, X. Song, instruction tuning,” CoRR, vol. abs/2305.06500, 2023.
X. Zhang, Y. An, Y. Xu, Y. Niu, Y. Yang, Y. Li, Y. Bai, [157] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai,
Y. Dong, Z. Qi, Z. Wang, Z. Yang, Z. Du, Z. Hou, “Pandagpt: One model to instruction-follow them
and Z. Wang, “Chatglm: A family of large language all,” 2023.
models from glm-130b to glm-4 all tools,” 2024. [158] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov,
[144] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and R. Urtasun, A. Torralba, and S. Fidler, “Aligning
books and movies: Towards story-like visual expla- A. A. Alemi, “On the use of arxiv as a dataset,” arXiv
nations by watching movies and reading books,” in preprint arXiv:1905.00075, 2019.
2015 IEEE International Conference on Computer Vision, [173] K. Lo, L. L. Wang, M. Neumann, R. Kinney, and
ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE D. Weld, “S2ORC: The semantic scholar open re-
Computer Society, 2015, pp. 19–27. search corpus,” in ACL, 2020.
[159] “Project gutenberg.” [Online]. Available: https: [174] L. Soldaini and K. Lo, “peS2o (Pretraining Efficiently
//www.gutenberg.org/ on S2ORC) Dataset,” ODC-By, https://github.com/
[160] T. H. Trinh and Q. V. Le, “A simple method for com- allenai/pes2o, 2023.
monsense reasoning,” CoRR, vol. abs/1806.02847, [175] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M.
2018. Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf
[161] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, et al., “The stack: 3 tb of permissively licensed source
A. Farhadi, F. Roesner, and Y. Choi, “Defending code,” arXiv preprint arXiv:2211.15533, 2022.
against neural fake news,” in Advances in Neu- [176] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion
ral Information Processing Systems 32: Annual Confer- Parameter Autoregressive Language Model,” https:
ence on Neural Information Processing Systems 2019, //github.com/kingoflolz/mesh-transformer-jax,
NeurIPS 2019, December 8-14, 2019, Vancouver, BC, 2021.
Canada, H. M. Wallach, H. Larochelle, A. Beygelz- [177] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk,
imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Du-
2019, pp. 9051–9062. mas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar,
[162] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, son, N. Muennighoff, A. Naik, C. Nam, M. E. Peters,
“Openwebtext corpus,” http://Skylion007.github. son, N. Muennighoff, A. Naik, C. Nam, M. E. Peters,
io/OpenWebTextCorpus, 2019. A. Ravichander, K. Richardson, Z. Shen, E. Strubell,
[163] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer,
and J. Blackburn, “The pushshift reddit dataset,” in N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld,
Proceedings of the Fourteenth International AAAI Con- J. Dodge, and K. Lo, “Dolma: an open corpus of
ference on Web and Social Media, ICWSM 2020, Held three trillion tokens for language model pretraining
Virtually, Original Venue: Atlanta, Georgia, USA, June research,” arXiv preprint arXiv:2402.00159, 2024.
8-11, 2020. AAAI Press, 2020, pp. 830–839. [178] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kin-
[164] “Wikipedia.” [Online]. Available: https://en. ney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson,
wikipedia.org/wiki/Main_Page Y. Wang et al., “Olmo: Accelerating the science of
[165] “Bigquery dataset.” [Online]. Available: https: language models,” arXiv preprint arXiv:2402.00838,
//cloud.google.com/bigquery?hl=zh-cn 2024.
[166] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, [179] S. Mishra, D. Khashabi, C. Baral, and H. Ha-
C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, jishirzi, “Cross-task generalization via natural lan-
S. Presser, and C. Leahy, “The pile: An 800gb dataset guage crowdsourcing instructions,” in Proceedings of
of diverse text for language modeling,” CoRR, vol. the 60th Annual Meeting of the Association for Com-
abs/2101.00027, 2021. putational Linguistics (Volume 1: Long Papers), ACL
[167] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V. 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan,
del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G. P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–
Ponferrada, H. Nguyen et al., “The bigscience roots 3487.
corpus: A 1.6 tb composite multilingual dataset,” in [180] S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel,
Thirty-sixth Conference on Neural Information Process- N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry,
ing Systems Datasets and Benchmarks Track, 2022. Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David,
[168] “Common crawl.” [Online]. Available: https:// C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S.
commoncrawl.org/ AlShaibani, S. Sharma, U. Thakker, K. Almubarak,
[169] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud- X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush,
hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet: “Promptsource: An integrated development environ-
Extracting high quality monolingual datasets from ment and repository for natural language prompts,”
web crawl data,” in Proceedings of The 12th Language in ACL (demo). Association for Computational Lin-
Resources and Evaluation Conference, 2020, pp. 4003– guistics, 2022, pp. 93–104.
4012. [181] T. Tang, J. Li, W. X. Zhao, and J. Wen, “MVP: multi-
[170] T. Computer, “Redpajama: an open dataset for train- task supervised pre-training for natural language
ing large language models,” https://github.com/ generation,” CoRR, vol. abs/2206.12131, 2022.
togethercomputer/RedPajama-Data, 2023. [182] H. Nguyen, S. Suri, K. Tsui, Shahules786, T. team,
[171] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, and C. Schuhmann, “The oig dataset,” https://laion.
A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, ai/blog/oig-dataset/, 2023.
and J. Launay, “The RefinedWeb dataset for Falcon [183] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen,
LLM: outperforming curated corpora with web data, N. DasSarma, D. Drain, S. Fort, D. Ganguli,
and web data only,” arXiv preprint arXiv:2306.01116, T. Henighan, N. Joseph, S. Kadavath, J. Kernion,
2023. T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-
[172] C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and Dodds, D. Hernandez, T. Hume, S. Johnston,
S. Kravec, L. Lovitt, N. Nanda, C. Olsson, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler,
D. Amodei, T. B. Brown, J. Clark, S. McCandlish, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S.
C. Olah, B. Mann, and J. Kaplan, “Training a helpful Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.
and harmless assistant with reinforcement learning Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang,
from human feedback,” CoRR, vol. abs/2204.05862, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw-
2022. [Online]. Available: https://doi.org/10.48550/ den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San-
arXiv.2204.05862 tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi-
[184] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask
J. Yue, and Y. Wu, “How close is chatgpt to human prompted training enables zero-shot task generaliza-
experts? comparison corpus, evaluation, and detec- tion,” in The Tenth International Conference on Learning
tion,” arXiv preprint arXiv:2301.07597, 2023. Representations, ICLR 2022, Virtual Event, April 25-29,
[185] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, 2022. OpenReview.net, 2022.
S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and [197] S. Longpre, L. Hou, T. Vu, A. Webson, H. W.
R. Xin. (2023) Free dolly: Introducing the world’s first Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei
truly open instruction-tuned llm. et al., “The flan collection: Designing data and meth-
[186] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.- ods for effective instruction tuning,” arXiv preprint
R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stan- arXiv:2301.13688, 2023.
ley, R. Nagyfi et al., “Openassistant conversations– [198] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton,
democratizing large language model alignment,” R. Nakano, C. Hesse, and J. Schulman, “Training
arXiv preprint arXiv:2304.07327, 2023. verifiers to solve math word problems,” CoRR, vol.
[187] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, abs/2110.14168, 2021.
C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford [199] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth,
alpaca: An instruction-following llama model,” and J. Berant, “Did aristotle use a laptop? A ques-
https://github.com/tatsu-lab/stanford_alpaca, tion answering benchmark with implicit reasoning
2023. strategies,” Trans. Assoc. Comput. Linguistics, vol. 9,
[188] J. Cheung, “Guanaco - generative universal assistant pp. 346–361, 2021.
for natural-language adaptive context-aware om- [200] O. Camburu, B. Shillingford, P. Minervini,
nilingual outputs,” https://guanaco-model.github. T. Lukasiewicz, and P. Blunsom, “Make up your
io/, 2023. mind! adversarial generation of inconsistent natural
[189] C. Xu, D. Guo, N. Duan, and J. McAuley, language explanations,” in Proceedings of the 58th
“Baize: An open-source chat model with parameter- Annual Meeting of the Association for Computational
efficient tuning on self-chat data,” arXiv preprint Linguistics, ACL 2020, Online, July 5-10, 2020,
arXiv:2304.01196, 2023. D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault,
[190] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma, Eds. Association for Computational Linguistics,
and X. Li, “Towards better instruction following 2020, pp. 4157–4165.
language models for chinese: Investigating the im- [201] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
pact of training data and evaluation,” arXiv preprint langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
arXiv:2304.07854, 2023. towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
[191] K. Ethayarajh, Y. Choi, and S. Swayamdipta, “Under- Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
standing dataset difficulty with V-usable informa- M. Drame, Q. Lhoest, and A. M. Rush, “Transform-
tion,” in Proceedings of the 39th International Conference ers: State-of-the-art natural language processing,” in
on Machine Learning, 2022, pp. 5988–6008. Proceedings of the 2020 Conference on Empirical Methods
[192] N. Lambert, L. Tunstall, N. Rajani, in Natural Language Processing: System Demonstrations,
and T. Thrush. (2023) Huggingface h4 EMNLP 2020 - Demos, Online, November 16-20, 2020.
stack exchange preference dataset. [On- Association for Computational Linguistics, 2020, pp.
line]. Available: https://huggingface.co/datasets/ 38–45.
HuggingFaceH4/stack-exchange-preferences [202] J. Bradbury, R. Frostig, P. Hawkins, M. J.
[193] R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M. Johnson, C. Leary, D. Maclaurin, G. Necula,
Dai, D. Yang, and S. Vosoughi, “Training socially A. Paszke, J. VanderPlas, S. Wanderman-Milne,
aligned language models in simulated human soci- and Q. Zhang, “JAX: composable transformations
ety,” CoRR, vol. abs/2305.16960, 2023. of Python+NumPy programs,” 2018. [Online].
[194] G. Xu, J. Liu, M. Yan, H. Xu, J. Si, Z. Zhou, P. Yi, Available: http://github.com/google/jax
X. Gao, J. Sang, R. Zhang, J. Zhang, C. Peng, [203] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang,
F. Huang, and J. Zhou, “Cvalues: Measuring the F. Cui, and Y. You, “Colossal-ai: A unified deep learn-
values of chinese large language models from safety ing system for large-scale parallel training,” CoRR,
to responsibility,” 2023. vol. abs/2110.14883, 2021.
[195] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and [204] J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrick-
Y. Yang, “Safe rlhf: Safe reinforcement learning from star: Parallel training of pre-trained models via
human feedback,” arXiv preprint arXiv:2310.12773, a chunk-based memory management,” CoRR, vol.
2023. abs/2108.05818, 2021.
[196] V. Sanh, A. Webson, C. Raffel, S. H. Bach, [205] Y. You, “Colossalchat: An open-source solution
for cloning chatgpt with a complete F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “One-
rlhf pipeline,” 2023. [Online]. Available: flow: Redesign the distributed deep learning frame-
https://medium.com/@yangyou_berkeley/ work from scratch,” CoRR, vol. abs/2110.15032, 2021.
colossalchat-an-open-source-solution-for-cloning- [217] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson,
chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and
[206] “BMTrain: Efficient training for big models.” [Online]. J. Weston, “Recipes for building an open-domain
Available: https://github.com/OpenBMB/BMTrain chatbot,” in Proceedings of the 16th Conference of the
[207] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, European Chapter of the Association for Computational
“Fastmoe: A fast mixture-of-expert training system,” Linguistics: Main Volume, EACL 2021, Online, April 19
CoRR, vol. abs/2103.13262, 2021. - 23, 2021, 2021, pp. 300–325.
[208] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, [218] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer,
C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil,
“Efficient memory management for large language I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur,
model serving with pagedattention,” in Proceedings G. Gur-Ari, and V. Misra, “Solving quantitative rea-
of the ACM SIGOPS 29th Symposium on Operating soning problems with language models,” CoRR, vol.
Systems Principles, 2023. abs/2206.14858, 2022.
[209] (2023) Deepspeed-mii. [Online]. Available: https: [219] T. Saier, J. Krause, and M. Färber, “unarxive 2022:
//github.com/microsoft/DeepSpeed-MII All arxiv publications pre-processed for nlp, includ-
[210] Z. Yao, R. Y. Aminabadi, O. Ruwase, S. Rajbhan- ing structured full-text and citation network,” arXiv
dari, X. Wu, A. A. Awan, J. Rasley, M. Zhang, preprint arXiv:2303.14957, 2023.
C. Li, C. Holmes, Z. Zhou, M. Wyatt, M. Smith, [220] H. A. Simon, “Experiments with a heuristic com-
L. Kurilenko, H. Qin, M. Tanaka, S. Che, S. L. Song, piler,” J. ACM, vol. 10, no. 4, pp. 493–506, 1963.
and Y. He, “DeepSpeed-Chat: Easy, Fast and Afford- [221] Z. Manna and R. J. Waldinger, “Toward automatic
able RLHF Training of ChatGPT-like Models at All program synthesis,” Commun. ACM, vol. 14, no. 3,
Scales,” arXiv preprint arXiv:2308.01320, 2023. pp. 151–165, 1971.
[211] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Brad- [222] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou,
L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, “Codebert: A pre-trained model for programming
Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, and natural languages,” in Findings of EMNLP, 2020.
B. Steiner, L. Fang, J. Bai, and S. Chintala, “Py- [223] J. Austin, A. Odena, M. I. Nye, M. Bosma,
torch: An imperative style, high-performance deep H. Michalewski, D. Dohan, E. Jiang, C. J. Cai,
learning library,” in Advances in Neural Information M. Terry, Q. V. Le, and C. Sutton, “Program syn-
Processing Systems 32: Annual Conference on Neural thesis with large language models,” CoRR, vol.
Information Processing Systems 2019, NeurIPS 2019, abs/2108.07732, 2021.
December 8-14, 2019, Vancouver, BC, Canada, H. M. [224] S. Black, L. Gao, P. Wang, C. Leahy, and S. Bider-
Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché- man, “GPT-Neo: Large Scale Autoregressive Lan-
Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 8024– guage Modeling with Mesh-Tensorflow,” 2021.
8035. [225] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn,
[212] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, “A systematic evaluation of large language models
J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Is- of code,” in MAPS@PLDI, 2022.
ard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, [226] A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neu-
D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, big, “Language models of code are few-shot com-
P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensor- monsense learners,” in Proceedings of the 2022 Confer-
flow: A system for large-scale machine learning,” in ence on Empirical Methods in Natural Language Process-
12th USENIX Symposium on Operating Systems Design ing, EMNLP 2022, Abu Dhabi, United Arab Emirates,
and Implementation, OSDI 2016, Savannah, GA, USA, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and
November 2-4, 2016, K. Keeton and T. Roscoe, Eds. Y. Zhang, Eds. Association for Computational Lin-
USENIX Association, 2016, pp. 265–283. guistics, 2022, pp. 1384–1403.
[213] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, [227] S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts,
T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno
A flexible and efficient machine learning library et al., “A pretrainer’s guide to training data: Measur-
for heterogeneous distributed systems,” CoRR, vol. ing the effects of data age, domain coverage, quality,
abs/1512.01274, 2015. & toxicity,” arXiv preprint arXiv:2305.13169, 2023.
[214] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: [228] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,
An open-source deep learning platform from indus- D. Gao, Y. Xie, Z. Liu, J. Gao, Y. Li, B. Ding, and
trial practice,” Frontiers of Data and Computing, vol. 1, A. Bakhtiari, H. Behl et al., “Phi-3 technical report:
no. 1, p. 105, 2019. system for large language models,” 2023.
[215] Huawei Technologies Co., Ltd., “Huawei mindspore A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
ai development framework,” in Artificial Intelligence A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
Technology. Springer, 2022, pp. 137–162. A. Bakhtiari, H. Behl et al., “Phi-3 technical report:
[216] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, A highly capable language model locally on your
phone,” arXiv preprint arXiv:2404.14219, 2024. [242] M. Schuster and K. Nakajima, “Japanese and korean
[230] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell, voice search,” in 2012 IEEE international conference on
C. Raffel, L. Von Werra, T. Wolf et al., “The fineweb acoustics, speech and signal processing (ICASSP). IEEE,
datasets: Decanting the web for the finest text data at 2012, pp. 5149–5152.
scale,” arXiv preprint arXiv:2406.17557, 2024. [243] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
[231] P. Maini, S. Seto, H. Bai, D. Grangier, Y. Zhang, and W. Macherey, M. Krikun, Y. Cao, Q. Gao,
N. Jaitly, “Rephrasing the web: A recipe for compute K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu,
and data-efficient language modeling,” in ICLR 2024 L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa,
Workshop on Navigating and Addressing Data Problems K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
for Foundation Models, 2024. J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor-
[232] M. Marion, A. Üstün, L. Pozzobon, A. Wang, rado, M. Hughes, and J. Dean, “Google’s neural
M. Fadaee, and S. Hooker, “When less is more: Inves- machine translation system: Bridging the gap be-
tigating data pruning for pretraining llms at scale,” tween human and machine translation,” CoRR, vol.
arXiv preprint arXiv:2309.04564, 2023. abs/1609.08144, 2016.
[233] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong, [244] T. Kudo, “Subword regularization: Improving neural
E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng, network translation models with multiple subword
“How to train data-efficient llms,” arXiv preprint candidates,” in Proceedings of the 56th Annual Meeting
arXiv:2402.09668, 2024. of the Association for Computational Linguistics, ACL
[234] D. Hernandez, T. B. Brown, T. Conerly, N. Das- 2018, Melbourne, Australia, July 15-20, 2018, Volume 1:
Sarma, D. Drain, S. E. Showk, N. Elhage, Z. Hatfield- Long Papers, I. Gurevych and Y. Miyao, Eds. Associ-
Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann, ation for Computational Linguistics, 2018, pp. 66–75.
C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Ka- [245] T. Kudo and J. Richardson, “Sentencepiece: A simple
plan, and S. McCandlish, “Scaling laws and inter- and language independent subword tokenizer and
pretability of learning from repeated data,” CoRR, detokenizer for neural text processing,” in Proceed-
vol. abs/2205.10487, 2022. ings of the 2018 Conference on Empirical Methods in
[235] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, Natural Language Processing, EMNLP 2018: System
“The curious case of neural text degeneration,” in 8th Demonstrations, Brussels, Belgium, October 31 - Novem-
International Conference on Learning Representations, ber 4, 2018, E. Blanco and W. Lu, Eds. Association
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. for Computational Linguistics, 2018.
OpenReview.net, 2020. [246] M. Davis and M. Dürst, “Unicode normalization
[236] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, forms,” 2001.
C. Callison-Burch, and N. Carlini, “Deduplicating [247] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak,
training data makes language models better,” in Pro- and I. Sutskever, “Deep double descent: Where big-
ceedings of the 60th Annual Meeting of the Association ger models and more data hurt,” in 8th International
for Computational Linguistics (Volume 1: Long Papers), Conference on Learning Representations, ICLR 2020,
ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe-
8424–8445. view.net, 2020.
[237] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, [248] K. Tirumala, D. Simig, A. Aghajanyan, and A. S. Mor-
and C. Zhang, “Quantifying memorization across cos, “D4: Improving llm pretraining via document
neural language models,” CoRR, 2022. de-duplication and diversification,” arXiv preprint
[238] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicat- arXiv:2308.12284, 2023.
ing training data mitigates privacy risks in language [249] Z. Shen, T. Tao, L. Ma, W. Neiswanger, J. Hes-
models,” in International Conference on Machine Learn- tness, N. Vassilieva, D. Soboleva, and E. Xing,
ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland, “Slimpajama-dc: Understanding data combinations
USA. PMLR, 2022, pp. 10 697–10 707. for llm training,” arXiv preprint arXiv:2309.10818,
[239] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, 2023.
“Conditional random fields: Probabilistic models [250] S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data
for segmenting and labeling sequence data,” in selection for language models via importance resam-
Proceedings of the Eighteenth International Conference pling,” arXiv preprint arXiv:2302.03169, 2023.
on Machine Learning (ICML 2001), Williams College, [251] X. Wang, W. Zhou, Q. Zhang, J. Zhou, S. Gao,
Williamstown, MA, USA, June 28 - July 1, 2001, C. E. J. Wang, M. Zhang, X. Gao, Y. Chen, and T. Gui,
Brodley and A. P. Danyluk, Eds. Morgan Kaufmann, “Farewell to aimless large-scale pretraining: Influ-
2001, pp. 282–289. ential subset selection for language model,” arXiv
[240] P. Gage, “A new algorithm for data compression,” C preprint arXiv:2305.12816, 2023.
Users Journal, vol. 12, no. 2, pp. 23–38, 1994. [252] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N.
[241] R. Sennrich, B. Haddow, and A. Birch, “Neural ma- Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda,
chine translation of rare words with subword units,” and R. Fernández, “The LAMBADA dataset: Word
in Proceedings of the 54th Annual Meeting of the Associa- prediction requiring a broad discourse context,” in
tion for Computational Linguistics, ACL 2016, August 7- ACL (1). The Association for Computer Linguistics,
12, 2016, Berlin, Germany, Volume 1: Long Papers. The 2016.
Association for Computer Linguistics, 2016. [253] M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang,
F. Sala, and C. Ré, “Skill-it! a data-driven skills sirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre,
framework for understanding and training language S. Osindero, O. Vinyals, M. Ranzato, J. W. Rae,
models,” arXiv preprint arXiv:2307.14430, 2023. E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Uni-
[254] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, fied scaling laws for routed language models,” in
I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, International Conference on Machine Learning, ICML
J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022,
M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong, pp. 4057–4086.
A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Mar- [266] A. Gu, K. Goel, and C. Ré, “Efficiently modeling
tin, N. Usunier, T. Scialom, and G. Synnaeve, “Code long sequences with structured state spaces,”
llama: Open foundation models for code,” CoRR, vol. in The Tenth International Conference on Learning
abs/2308.12950, 2023. Representations, ICLR 2022, Virtual Event, April 25-29,
[255] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, 2022. OpenReview.net, 2022. [Online]. Available:
“Curriculum learning,” in ICML, 2009, pp. 41–48. https://openreview.net/forum?id=uYLFoz1vlAC
[256] C. Xu, C. Rosset, L. Del Corro, S. Mahajan, [267] J. T. Smith, A. Warrington, and S. Linderman, “Sim-
J. McAuley, J. Neville, A. H. Awadallah, and N. Rao, plified state space layers for sequence modeling,” in
“Contrastive post-training large language models ICLR, 2023.
on data curriculum,” arXiv preprint arXiv:2310.02263, [268] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gul-
2023. cehre, R. Pascanu, and S. De, “Resurrecting recurrent
[257] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, neural networks for long sequences,” in ICML, 2023.
H. Michalewski, and P. Milos, “Focused transformer: [269] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao,
Contrastive training for context scaling,” CoRR, vol. S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena
abs/2307.03170, 2023. hierarchy: Towards larger convolutional language
[258] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, models,” in ICML, 2023.
S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and [270] B. Peng, E. Alcaide, Q. Anthony, A. Albalak,
S. Welleck, “Llemma: An open language model for S. Arcadinho, H. Cao, X. Cheng, M. Chung,
mathematics,” arXiv preprint arXiv:2310.10631, 2023. M. Grella, K. K. G. V., X. He, H. Hou, P. Kazienko,
[259] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extend- J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I.
ing context window of large language models via Mantri, F. Mom, A. Saito, X. Tang, B. Wang,
positional interpolation,” CoRR, vol. abs/2306.15595, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang,
2023. Q. Zhao, P. Zhou, J. Zhu, and R. Zhu, “RWKV:
[260] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud- reinventing rnns for the transformer era,” CoRR,
hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet: vol. abs/2305.13048, 2023. [Online]. Available:
Extracting high quality monolingual datasets from https://doi.org/10.48550/arXiv.2305.13048
web crawl data,” in Proceedings of the Twelfth Language [271] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue,
Resources and Evaluation Conference, 2020, pp. 4003– J. Wang, and F. Wei, “Retentive network: A successor
4012. to transformer for large language models,” arXiv
[261] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, preprint arXiv:2307.08621, 2023.
“Bag of tricks for efficient text classification,” in [272] A. Gu and T. Dao, “Mamba: Linear-time sequence
EACL, 2017, pp. 427–431. modeling with selective state spaces,” CoRR, vol.
[262] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge, abs/2312.00752, 2023.
D. Gao, Y. Xie, Z. Liu, J. Gao et al., “Data-juicer: A [273] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Ar-
one-stop data processing system for large language cadinho, H. Cao, X. Cheng, M. Chung, M. Grella,
models,” arXiv preprint arXiv:2309.02033, 2023. K. K. GV et al., “Rwkv: Reinventing rnns for the
[263] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia, transformer era,” arXiv preprint arXiv:2305.13048,
J. Shen, and O. Firat, “Examining scaling and transfer 2023.
of language model architectures for machine transla- [274] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou,
tion,” in International Conference on Machine Learning, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang,
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, “Cogview: Mastering text-to-image generation via
2022, pp. 26 176–26 192. transformers,” in Advances in Neural Information Pro-
[264] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, cessing Systems 34: Annual Conference on Neural Infor-
J. Gao, M. Zhou, and H. Hon, “Unified language mation Processing Systems 2021, NeurIPS 2021, Decem-
model pre-training for natural language understand- ber 6-14, 2021, virtual, 2021, pp. 19 822–19 835.
ing and generation,” in Advances in Neural Information [275] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-
Processing Systems 32: Annual Conference on Neural ization,” vol. abs/1607.06450, 2016.
Information Processing Systems 2019, NeurIPS 2019, [276] B. Zhang and R. Sennrich, “Root mean square layer
December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. normalization,” in Advances in Neural Information
13 042–13 054. Processing Systems 32: Annual Conference on Neural
[265] A. Clark, D. de Las Casas, A. Guy, A. Mensch, Information Processing Systems 2019, NeurIPS 2019,
M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht- December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
man, T. Cai, S. Borgeaud, G. van den Driessche, 12 360–12 371.
E. Rutherford, T. Hennigan, M. J. Johnson, A. Cas- [277] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang,
and F. Wei, “Deepnet: Scaling transformers to 1,000 [289] D. Hendrycks and K. Gimpel, “Gaussian error linear
layers,” vol. abs/2203.00555, 2022. units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[278] V. Nair and G. E. Hinton, “Rectified linear units im- [290] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier,
prove restricted boltzmann machines,” in Proceedings “Language modeling with gated convolutional net-
of the 27th international conference on machine learning works,” in Proceedings of the 34th International Confer-
(ICML-10), 2010, pp. 807–814. ence on Machine Learning, ICML 2017, Sydney, NSW,
[279] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, Australia, 6-11 August 2017, 2017, pp. 933–941.
and S. R. Bowman, “GLUE: A multi-task benchmark [291] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S.
and analysis platform for natural language under- Bari, S. Biderman, H. Elsahar, N. Muennighoff,
standing,” in Proceedings of the Workshop: Analyz- J. Phang, O. Press, C. Raffel, V. Sanh, S. Shen,
ing and Interpreting Neural Networks for NLP, Black- L. Sutawika, J. Tae, Z. X. Yong, J. Launay, and I. Belt-
boxNLP@EMNLP 2018, Brussels, Belgium, November 1, agy, “What language model to train if you have one
2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds. million GPU hours?” in Findings of the Association for
Association for Computational Linguistics, 2018, pp. Computational Linguistics: EMNLP 2022, Abu Dhabi,
353–355. United Arab Emirates, December 7-11, 2022, 2022, pp.
[280] P. Ramachandran, B. Zoph, and Q. V. Le, 765–782.
“Searching for activation functions,” arXiv preprint [292] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-
arXiv:1710.05941, 2017. attention with relative position representations,”
[281] N. Shazeer, “GLU variants improve transformer,” in Proceedings of the 2018 Conference of the North
vol. abs/2002.05202, 2020. American Chapter of the Association for Computational
[282] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: Linguistics: Human Language Technologies, NAACL-
Enhanced transformer with rotary position embed- HLT, New Orleans, Louisiana, USA, June 1-6, 2018,
ding,” vol. abs/2104.09864, 2021. Volume 2 (Short Papers), M. A. Walker, H. Ji,
[283] O. Press, N. A. Smith, and M. Lewis, “Train short, and A. Stent, Eds. Association for Computational
test long: Attention with linear biases enables input Linguistics, 2018, pp. 464–468. [Online]. Available:
length extrapolation,” in The Tenth International Con- https://doi.org/10.18653/v1/n18-2074
ference on Learning Representations, ICLR 2022, Virtual [293] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell,
Event, April 25-29, 2022, 2022. Q. V. Le, and R. Salakhutdinov, “Transformer-xl:
[284] S. Ioffe and C. Szegedy, “Batch normalization: Attentive language models beyond a fixed-length
Accelerating deep network training by reducing context,” in Proceedings of the 57th Conference of
internal covariate shift,” in Proceedings of the the Association for Computational Linguistics, ACL
32nd International Conference on Machine Learning, 2019, Florence, Italy, July 28- August 2, 2019, Volume
ICML 2015, Lille, France, 6-11 July 2015, ser. 1: Long Papers, A. Korhonen, D. R. Traum, and
JMLR Workshop and Conference Proceedings, L. Màrquez, Eds. Association for Computational
F. R. Bach and D. M. Blei, Eds., vol. 37. Linguistics, 2019, pp. 2978–2988. [Online]. Available:
JMLR.org, 2015, pp. 448–456. [Online]. Available: https://doi.org/10.18653/v1/p19-1285
http://proceedings.mlr.press/v37/ioffe15.html [294] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhut-
[285] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry, dinov, and Q. V. Le, “Xlnet: Generalized autoregres-
M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, sive pretraining for language understanding,” Ad-
Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts, vances in neural information processing systems, vol. 32,
and C. Raffel, “Do transformer modifications transfer 2019.
across implementations and applications?” in Pro- [295] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn:
ceedings of the 2021 Conference on Empirical Methods Efficient context window extension of large language
in Natural Language Processing, EMNLP 2021, Virtual models,” CoRR, vol. abs/2309.00071, 2023.
Event / Punta Cana, Dominican Republic, 7-11 Novem- [296] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang,
ber, 2021, 2021, pp. 5758–5773. A. Benhaim, V. Chaudhary, X. Song, and F. Wei,
[286] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, “A length-extrapolatable transformer,” CoRR, vol.
H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer abs/2212.10554, 2022. [Online]. Available: https:
normalization in the transformer architecture,” in //doi.org/10.48550/arXiv.2212.10554
ICML, 2020. [297] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A.
[287] A. Baevski and M. Auli, “Adaptive input repre- Smith, and L. Kong, “Random feature attention,”
sentations for neural language modeling,” in 7th in 9th International Conference on Learning Representa-
International Conference on Learning Representations, tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. [298] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie,
OpenReview.net, 2019. C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang,
[288] L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Under- L. Yang, and A. Ahmed, “Big bird: Transformers for
standing the difficulty of training transformers,” in longer sequences,” in Advances in Neural Information
Proceedings of the 2020 Conference on Empirical Methods Processing Systems 33: Annual Conference on Neural
in Natural Language Processing, EMNLP 2020, Online, Information Processing Systems 2020, NeurIPS 2020,
November 16-20, 2020. Association for Computa- December 6-12, 2020, virtual, 2020.
tional Linguistics, 2020, pp. 5747–5763. [299] R. Child, S. Gray, A. Radford, and I. Sutskever, “Gen-
erating long sequences with sparse transformers,” guistics, 2023.
CoRR, vol. abs/1904.10509, 2019. [316] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eis-
[300] N. Shazeer, “Fast transformer decoding: One write- ner, T. Hashimoto, L. Zettlemoyer, and M. Lewis,
head is all you need,” CoRR, vol. abs/1911.02150, “Contrastive decoding: Open-ended text generation
2019. [Online]. Available: http://arxiv.org/abs/1911. as optimization,” in ACL (1). Association for Com-
02150 putational Linguistics, 2023, pp. 12 286–12 312.
[301] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, [317] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and
F. Lebrón, and S. Sanghai, “Gqa: Training gener- P. He, “Dola: Decoding by contrasting layers im-
alized multi-query transformer models from multi- proves factuality in large language models,” CoRR,
head checkpoints,” arXiv preprint arXiv:2305.13245, vol. abs/2309.03883, 2023.
2023. [318] D. P. Kingma and J. Ba, “Adam: A method for
[302] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re, stochastic optimization,” in 3rd International Confer-
“Flashattention: Fast and memory-efficient exact at- ence on Learning Representations, ICLR 2015, San Diego,
tention with IO-awareness,” in NeurIPS, 2022. CA, USA, May 7-9, 2015, Conference Track Proceedings,
[303] T. Dao, “Flashattention-2: Faster attention with better Y. Bengio and Y. LeCun, Eds., 2015.
parallelism and work partitioning,” arXiv preprint [319] I. Loshchilov and F. Hutter, “Fixing weight decay
arXiv:2307.08691, 2023. regularization in adam,” CoRR, vol. abs/1711.05101,
[304] “vllm: Easy, fast, and cheap llm serving with 2017.
pagedattention.” [Online]. Available: https://vllm. [320] N. Shazeer and M. Stern, “Adafactor: Adaptive
ai/ learning rates with sublinear memory cost,” in Pro-
[305] K. Murray and D. Chiang, “Correcting length bias in ceedings of the 35th International Conference on Machine
neural machine translation,” in WMT. Association Learning, ICML 2018, Stockholmsmässan, Stockholm,
for Computational Linguistics, 2018, pp. 212–223. Sweden, July 10-15, 2018, ser. Proceedings of Machine
[306] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, Learning Research, J. G. Dy and A. Krause, Eds.,
“The curious case of neural text degeneration,” in vol. 80. PMLR, 2018, pp. 4603–4611.
ICLR, 2020. [321] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen,
[307] Carnegie-Mellon University, Dept. of Computer Science, Speech Under- M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and
standing Systems. Summary of Results of the Five-Year Z. Chen, “Gpipe: Efficient training of giant neural
Research Effort at Carnegie-Mellon University, 1977. networks using pipeline parallelism,” in Advances
[308] P. Koehn and R. Knowles, “Six challenges for neural in Neural Information Processing Systems 32: Annual
machine translation,” in NMT@ACL. Association Conference on Neural Information Processing Systems
for Computational Linguistics, 2017, pp. 28–39. 2019, NeurIPS 2019, December 8-14, 2019, Vancouver,
[309] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelz-
W. Macherey, M. Krikun, Y. Cao, Q. Gao, imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,
K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, 2019, pp. 103–112.
L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, [322] A. Harlap, D. Narayanan, A. Phanishayee, V. Se-
K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, shadri, N. R. Devanur, G. R. Ganger, and P. B. Gib-
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor- bons, “Pipedream: Fast and efficient pipeline parallel
rado, M. Hughes, and J. Dean, “Google’s neural DNN training,” CoRR, vol. abs/1806.03377, 2018.
machine translation system: Bridging the gap be- [323] P. Micikevicius, S. Narang, J. Alben, G. F. Di-
tween human and machine translation,” CoRR, vol. amos, E. Elsen, D. Garcı́a, B. Ginsburg, M. Houston,
abs/1609.08144, 2016. O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed pre-
[310] R. Paulus, C. Xiong, and R. Socher, “A deep re- cision training,” CoRR, vol. abs/1710.03740, 2017.
inforced model for abstractive summarization,” in [324] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient
ICLR (Poster). OpenReview.net, 2018. 2d method for training super-large deep learning
[311] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, models,” CoRR, vol. abs/2104.05343, 2021.
Q. Sun, S. Lee, D. J. Crandall, and D. Batra, “Diverse [325] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract:
beam search: Decoding diverse solutions from neural Parallelize the tensor parallelism efficiently,” in Pro-
sequence models,” CoRR, vol. abs/1610.02424, 2016. ceedings of the 51st International Conference on Parallel
[312] A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical Processing, ICPP 2022, Bordeaux, France, 29 August
neural story generation,” in ACL (1). Association for 2022 - 1 September 2022. ACM, 2022.
Computational Linguistics, 2018, pp. 889–898. [326] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing
[313] J. Hewitt, C. D. Manning, and P. Liang, “Trunca- parallelism in distributed training for huge neural
tion sampling as language model desmoothing,” in networks,” CoRR, vol. abs/2105.14450, 2021.
EMNLP (Findings). Association for Computational [327] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Se-
Linguistics, 2022, pp. 3414–3427. quence parallelism: Long sequence training from
[314] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and system perspective,” arXiv e-prints, pp. arXiv–2105,
N. Collier, “A contrastive framework for neural text 2021.
generation,” in NeurIPS, 2022. [328] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen,
[315] C. Meister, T. Pimentel, G. Wiher, and R. Cotterell, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing
“Locally typical sampling,” Trans. Assoc. Comput. Lin- et al., “Alpa: Automating inter-and {Intra-Operator}
parallelism for distributed deep learning,” in OSDI, Proceedings of the 2023 Conference on Empirical Methods
2022, pp. 559–578. in Natural Language Processing, EMNLP 2023, Sin-
[329] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training gapore, December 6-10, 2023, H. Bouamor, J. Pino,
deep nets with sublinear memory cost,” CoRR, vol. and K. Bali, Eds. Association for Computational
abs/1604.06174, 2016. Linguistics, 2023, pp. 3029–3051.
[330] FairScale authors, “Fairscale: A general purpose [344] K. Zhou, B. Zhang, J. Wang, Z. Chen, W. X. Zhao,
modular pytorch library for high performance J. Sha, Z. Sheng, S. Wang, and J. Wen, “Jiuzhang3.0:
and large scale training,” https://github.com/ Efficiently improving mathematical reasoning by
facebookresearch/fairscale, 2021. training small data synthesis models,” CoRR, vol.
[331] R. Lou, K. Zhang, and W. Yin, “Is prompt all you abs/2405.14365, 2024.
need? no. A comprehensive and broader view of in- [345] Y. Cao, Y. Kang, and L. Sun, “Instruction mining:
struction learning,” CoRR, vol. abs/2303.10475, 2023. High-quality instruction data selection for large lan-
[332] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep guage models,” CoRR, vol. abs/2307.06290, 2023.
neural networks for natural language understand- [346] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen,
ing,” in ACL (1). Association for Computational N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From
Linguistics, 2019, pp. 4487–4496. quantity to quality: Boosting LLM performance with
[333] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, self-guided data selection for instruction tuning,”
L. Zettlemoyer, and S. Gupta, “Muppet: Massive CoRR, vol. abs/2308.12032, 2023. [Online]. Available:
multi-task representations with pre-finetuning,” in https://doi.org/10.48550/arXiv.2308.12032
EMNLP (1). Association for Computational Linguis- [347] O. Sener and S. Savarese, “Active learning
tics, 2021, pp. 5799–5811. for convolutional neural networks: A core-set
[334] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, approach,” in 6th International Conference on Learning
Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and Representations, ICLR 2018, Vancouver, BC, Canada,
A. Roberts, “The flan collection: Designing data and April 30 - May 3, 2018, Conference Track Proceedings.
methods for effective instruction tuning,” CoRR, vol. OpenReview.net, 2018. [Online]. Available: https:
abs/2301.13688, 2023. //openreview.net/forum?id=H1aIuk-RW
[335] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, [348] M. Xia, S. Malladi, S. Gururangan, S. Arora,
C. Tao, and D. Jiang, “Wizardlm: Empowering large and D. Chen, “LESS: selecting influential
language models to follow complex instructions,” data for targeted instruction tuning,” CoRR,
CoRR, vol. abs/2304.12244, 2023. [Online]. Available: vol. abs/2402.04333, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2304.12244 https://doi.org/10.48550/arXiv.2402.04333
[336] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, [349] P. W. Koh and P. Liang, “Understanding black-box
Y. Yang, and C. Gan, “Principle-driven self-alignment predictions via influence functions,” in International
of language models from scratch with minimal hu- conference on machine learning. PMLR, 2017, pp. 1885–
man supervision,” arXiv preprint arXiv:2305.03047, 1894.
2023. [350] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R.
[337] X. Li, P. Yu, C. Zhou, T. Schick, L. Zettle- Chandu, D. Wadden, K. MacMillan, N. A. Smith,
moyer, O. Levy, J. Weston, and M. Lewis, “Self- I. Beltagy, and H. Hajishirzi, “How far can camels
alignment with instruction backtranslation,” CoRR, go? exploring the state of instruction tuning on open
vol. abs/2308.06259, 2023. resources,” CoRR, vol. abs/2306.04751, 2023.
[338] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, [351] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin,
A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for “Scaling laws of rope-based extrapolation,” CoRR,
alignment,” arXiv preprint arXiv:2305.11206, 2023. vol. abs/2310.05209, 2023.
[339] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Ya- [352] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruc-
dav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and tion tuning with GPT-4,” CoRR, vol. abs/2304.03277,
H. Jin, “Alpagasus: Training A better alpaca with 2023.
fewer data,” CoRR, vol. abs/2307.08701, 2023. [353] M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgib-
[340] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, bon, “Efficient sequence packing without cross-
H. Palangi, and A. H. Awadallah, “Orca: Progressive contamination: Accelerating large language mod-
learning from complex explanation traces of GPT-4,” els without impacting performance,” arXiv preprint
CoRR, vol. abs/2306.02707, 2023. arXiv:2107.02027, 2021.
[341] YuLan-Chat-Team, “Yulan-chat: An open-source [354] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei,
bilingual chatbot,” https://github.com/RUC-GSAI/ H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
YuLan-Chat, 2023. S. Pfohl et al., “Large language models encode clinical
[342] Y. Huang, X. Liu, Y. Gong, Z. Gou, Y. Shen, N. Duan, knowledge,” arXiv preprint arXiv:2212.13138, 2022.
and W. Chen, “Key-point-driven data synthesis with [355] J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and
its enhancement on mathematical reasoning,” CoRR, J. Wen, “Recommendation as instruction following:
vol. abs/2403.02333, 2024. A large language model empowered recommenda-
[343] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, tion approach,” CoRR, vol. abs/2305.07001, 2023.
and B. Zhou, “Enhancing chat language models by [356] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and
scaling high-quality instructional conversations,” in T. Liu, “Huatuo: Tuning llama model with chinese
medical knowledge,” arXiv preprint arXiv:2304.06975, abs/2203.11147, 2022.
2023. [369] Y. Bai, S. Kadavath, S. Kundu, A. Askell,
[357] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, J. Kernion, A. Jones, A. Chen, A. Goldie,
Z. Wu, and Y. Feng, “Lawyer llama technical report,” A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson,
arXiv preprint arXiv:2305.15062, 2023. C. Olah, D. Hernandez, D. Drain, D. Ganguli,
[358] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, D. Li, E. Tran-Johnson, E. Perez, J. Kerr,
S. Gehrmann, P. Kambadur, D. Rosenberg, and J. Mueller, J. Ladish, J. Landau, K. Ndousse,
G. Mann, “Bloomberggpt: A large language model K. Lukosiute, L. Lovitt, M. Sellitto, N. Elhage,
for finance,” arXiv preprint arXiv:2303.17564, 2023. N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby,
[359] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama out- R. Larson, S. Ringer, S. Johnston, S. Kravec,
performs gpt-4 on arithmetic tasks,” arXiv preprint S. E. Showk, S. Fort, T. Lanham, T. Telleen-
arXiv:2305.14201, 2023. Lawton, T. Conerly, T. Henighan, T. Hume, S. R.
[360] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei,
X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan,
Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, “Constitutional AI: harmlessness from AI feedback,”
Z. Yin, X. Huang, and X. Qiu, “Moss: Training con- CoRR, vol. abs/2212.08073, 2022. [Online]. Available:
versational language models from synthetic data,” https://doi.org/10.48550/arXiv.2212.08073
2023. [370] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,
[361] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:
J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto, scaling reinforcement learning from human feedback
“Alpacafarm: A simulation framework for methods with AI feedback,” CoRR, vol. abs/2309.00267, 2023.
that learn from human feedback,” CoRR, vol. [371] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao,
abs/2305.14387, 2023. [Online]. Available: https: J. Zhang, K. Shum, and T. Zhang, “RAFT:
//doi.org/10.48550/arXiv.2305.14387 reward ranked finetuning for generative foundation
[362] D. Hendrycks, C. Burns, S. Basart, A. Zou, model alignment,” CoRR, vol. abs/2304.06767, 2023.
M. Mazeika, D. Song, and J. Steinhardt, “Measur- [Online]. Available: https://doi.org/10.48550/arXiv.
ing massive multitask language understanding,” in 2304.06767
ICLR. OpenReview.net, 2021. [372] A. Askell, Y. Bai, A. Chen, D. Drain, D. Gan-
[363] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, guli, T. Henighan, A. Jones, N. Joseph, B. Mann,
Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. N. DasSarma et al., “A general language assis-
Chi, D. Zhou, and J. Wei, “Challenging big-bench tant as a laboratory for alignment,” arXiv preprint
tasks and whether chain-of-thought can solve them,” arXiv:2112.00861, 2021.
CoRR, vol. abs/2210.09261, 2022. [373] R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu,
[364] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, S. Jin, Q. Liu, L. Xiong, L. Chen et al., “Secrets of rlhf
V. Mikulik, and G. Irving, “Alignment of language in large language models part i: Ppo,” arXiv preprint
agents,” CoRR, vol. abs/2103.14659, 2021. arXiv:2307.04964, 2023.
[365] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, [374] J. Uesato, N. Kushman, R. Kumar, H. F. Song,
A. Radford, D. Amodei, P. F. Christiano, and G. Irv- N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and
ing, “Fine-tuning language models from human pref- I. Higgins, “Solving math word problems with
erences,” CoRR, vol. abs/1909.08593, 2019. process- and outcome-based feedback,” CoRR, vol.
[366] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, abs/2211.14275, 2022.
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- [375] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- abs/2305.20050, 2023.
plan, “A general language assistant as a laboratory [376] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
for alignment,” CoRR, vol. abs/2112.00861, 2021. A. Arora, E. Guo, C. Burns, S. Puranik, H. He,
[367] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, D. Song, and J. Steinhardt, “Measuring coding chal-
J. Aslanides, A. Glaese, N. McAleese, and G. Irving, lenge competence with APPS,” in NeurIPS Datasets
“Red teaming language models with language mod- and Benchmarks, 2021.
els,” in Proceedings of the 2022 Conference on Empir- [377] T. Wang, P. Yu, X. E. Tan, S. O’Brien, R. Pa-
ical Methods in Natural Language Processing, EMNLP sunuru, J. Dwivedi-Yu, O. Golovneva, L. Zettle-
2022, Abu Dhabi, United Arab Emirates, December 7-11, moyer, M. Fazel-Zarandi, and A. Celikyilmaz, “Shep-
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. herd: A critic for language model generation,” CoRR,
Association for Computational Linguistics, 2022, pp. vol. abs/2308.04592, 2023.
3419–3448. [378] G. Chen, M. Liao, C. Li, and K. Fan, “Alphamath
[368] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, almost zero: process supervision without process,”
H. F. Song, M. Chadwick, M. Glaese, S. Young, CoRR, vol. abs/2405.03553, 2024.
L. Campbell-Gillingham, G. Irving, and [379] Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and
N. McAleese, “Teaching language models to H. Yang, “Let’s reward step by step: Step-level re-
support answers with verified quotes,” CoRR, vol. ward model as the navigators for reasoning,” CoRR,
vol. abs/2310.10080, 2023. 2305.18290
[380] Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang, [392] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky,
D. Zhang, and J. Wen, “Improving large language and D. Kiela, “KTO: model alignment as prospect
models via fine-grained reinforcement learning theoretic optimization,” CoRR, vol. abs/2402.01306,
with minimum editing constraint,” CoRR, vol. 2024.
abs/2401.06081, 2024. [Online]. Available: https: [393] Y. Meng, M. Xia, and D. Chen, “Simpo: Simple pref-
//doi.org/10.48550/arXiv.2401.06081 erence optimization with a reference-free reward,”
[381] Z. Xi, W. Chen, B. Hong, S. Jin, R. Zheng, W. He, CoRR, vol. abs/2405.14734, 2024.
Y. Ding, S. Liu, X. Guo, J. Wang, H. Guo, W. Shen, [394] D. Feng, B. Qin, C. Huang, Z. Zhang, and W. Lei,
X. Fan, Y. Zhou, S. Dou, X. Wang, X. Zhang, “Towards analyzing and understanding the limita-
P. Sun, T. Gui, Q. Zhang, and X. Huang, “Train- tions of DPO: A theoretical perspective,” CoRR, vol.
ing large language models for reasoning through abs/2404.04626, 2024.
reverse curriculum reinforcement learning,” CoRR, [395] A. Gorbatovski, B. Shaposhnikov, A. Malakhov,
vol. abs/2402.05808, 2024. N. Surnachev, Y. Aksenov, I. Maksimov, N. Balagan-
[382] D. Silver, J. Schrittwieser, K. Simonyan, sky, and D. Gavrilov, “Learn your reference model
I. Antonoglou, A. Huang, A. Guez, T. Hubert, for real good alignment,” CoRR, vol. abs/2404.09656,
L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, 2024.
F. Hui, L. Sifre, G. van den Driessche, T. Graepel, [396] D. Kim, Y. Kim, W. Song, H. Kim, Y. Kim, S. Kim,
and D. Hassabis, “Mastering the game of go without and C. Park, “sdpo: Don’t use your data all at once,”
human knowledge,” Nat., pp. 354–359, 2017. CoRR, vol. abs/2403.19270, 2024.
[383] T. Anthony, Z. Tian, and D. Barber, “Thinking fast [397] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and
and slow with deep learning and tree search,” in F. Huang, “RRHF: rank responses to align language
Advances in Neural Information Processing Systems 30: models with human feedback without tears,”
Annual Conference on Neural Information Processing CoRR, vol. abs/2304.05302, 2023. [Online]. Available:
Systems 2017, December 4-9, 2017, Long Beach, CA, https://doi.org/10.48550/arXiv.2304.05302
USA, 2017, pp. 5360–5370. [398] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh,
[384] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, and P. J. Liu, “Slic-hf: Sequence likelihood calibration
X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizard- with human feedback,” CoRR, vol. abs/2305.10425,
math: Empowering mathematical reasoning for large 2023.
language models via reinforced evol-instruct,” CoRR, [399] A. Fisch, J. Eisenstein, V. Zayats, A. Agarwal,
vol. abs/2308.09583, 2023. A. Beirami, C. Nagpal, P. Shaw, and J. Berant, “Ro-
[385] R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and bust preference optimization through reward model
S. Vosoughi, “Second thoughts are best: Learning distillation,” CoRR, vol. abs/2405.19316, 2024.
to re-align with human values from text edits,” in [400] T. Zhang, F. Liu, J. Wong, P. Abbeel, and
NeurIPS, 2022. J. E. Gonzalez, “The wisdom of hindsight makes
[386] X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West, language models better instruction followers,”
P. Ammanabrolu, and Y. Choi, “QUARK: control- CoRR, vol. abs/2302.05206, 2023. [Online]. Available:
lable text generation with reinforced unlearning,” in https://doi.org/10.48550/arXiv.2302.05206
NeurIPS, 2022. [401] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne,
[387] J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan, “Imitation learning: A survey of learning methods,”
A. Chen, K. Cho, and E. Perez, “Training language ACM Comput. Surv., vol. 50, no. 2, apr 2017. [Online].
models with language feedback at scale,” CoRR, vol. Available: https://doi.org/10.1145/3054912
abs/2303.16755, 2023. [402] S. Levine, “Should I imitate or reinforce,”
[388] G. Guo, R. Zhao, T. Tang, W. X. Zhao, and 2022. [Online]. Available: https://www.youtube.
J.-R. Wen, “Beyond imitation: Leveraging fine- com/watch?v=sVPm7zOrBxM
grained quality signals for alignment,” arXiv preprint [403] J. Schulman, “Reinforcement learning from
arXiv:2311.04072, 2023. human feedback: Progress and challenges,” 2023.
[389] R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein, watch?v=hhiLw5Q_UFg
“Socially situated artificial intelligence enables watch?v=hhiLw5Q UFg
learning from human interaction,” Proceedings of the [404] X. L. Li and P. Liang, “Prefix-tuning: Optimizing
National Academy of Sciences of the United States of continuous prompts for generation,” in Proceedings of
America, vol. 119, 2022. [Online]. Available: https: the 59th Annual Meeting of the Association for Compu-
//api.semanticscholar.org/CorpusID:252381954 tational Linguistics and the 11th International Joint Con-
[390] H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hind- ference on Natural Language Processing, ACL/IJCNLP
sight aligns language models with feedback,” CoRR, 2021, (Volume 1: Long Papers), Virtual Event, August
vol. abs/2302.02676, 2023. 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds.
[391] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, Association for Computational Linguistics, 2021, pp.
C. D. Manning, and C. Finn, “Direct preference 4582–4597.
optimization: Your language model is secretly a [405] B. Lester, R. Al-Rfou, and N. Constant, “The power
reward model,” CoRR, vol. abs/2305.18290, 2023. of scale for parameter-efficient prompt tuning,” in
[Online]. Available: https://doi.org/10.48550/arXiv. Proceedings of the 2021 Conference on Empirical Methods
114

in Natural Language Processing, EMNLP 2021, Virtual tuning of language models with zero-init attention,”
Event / Punta Cana, Dominican Republic, 7-11 Novem- CoRR, vol. abs/2303.16199, 2023.
ber, 2021, M. Moens, X. Huang, L. Specia, and S. W. [418] J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, “MAD-
Yih, Eds. Association for Computational Linguistics, X: an adapter-based framework for multi-task cross-
2021, pp. 3045–3059. lingual transfer,” in Proceedings of the 2020 Conference
[406] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, on Empirical Methods in Natural Language Processing,
Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and EMNLP 2020, Online, November 16-20, 2020, B. Web-
S. Gelly, “Parameter-efficient transfer learning for ber, T. Cohn, Y. He, and Y. Liu, Eds. Association for
NLP,” in Proceedings of the 36th International Confer- Computational Linguistics, 2020, pp. 7654–7673.
ence on Machine Learning, ICML 2019, 9-15 June 2019, [419] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada,
Long Beach, California, USA, 2019, pp. 2790–2799. and S. Paul, “Peft: State-of-the-art parameter-
[407] Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee, efficient fine-tuning methods,” https://github.com/
L. Bing, and S. Poria, “Llm-adapters: An adapter huggingface/peft, 2022.
family for parameter-efficient fine-tuning of large [420] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and
language models,” CoRR, vol. abs/2304.01933, 2023. W. Chen, “What makes good in-context examples for
[408] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and gpt-3?” in Proceedings of Deep Learning Inside Out: The
G. Neubig, “Towards a unified view of parameter- 3rd Workshop on Knowledge Extraction and Integration
efficient transfer learning,” in The Tenth International for Deep Learning Architectures, DeeLIO@ACL 2022,
Conference on Learning Representations, ICLR 2022, Vir- Dublin, Ireland and Online, May 27, 2022, 2022, pp.
tual Event, April 25-29, 2022. OpenReview.net, 2022. 100–114.
[409] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P- [421] O. Rubin, J. Herzig, and J. Berant, “Learning to
tuning v2: Prompt tuning can be comparable to fine- retrieve prompts for in-context learning,” in Pro-
tuning universally across scales and tasks,” CoRR, ceedings of the 2022 Conference of the North American
vol. abs/2110.07602, 2021. Chapter of the Association for Computational Linguistics:
[410] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, Human Language Technologies, NAACL 2022, Seattle,
and J. Tang, “GPT understands, too,” CoRR, vol. WA, United States, July 10-15, 2022, 2022, pp. 2655–
abs/2103.10385, 2021. 2671.
[411] Y. Gu, X. Han, Z. Liu, and M. Huang, “Ppt: Pre- [422] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and
trained prompt tuning for few-shot learning,” in Pro- S. Lee, “Self-generated in-context learning: Leverag-
ceedings of the 60th Annual Meeting of the Association ing auto-regressive language models as a demonstra-
for Computational Linguistics (Volume 1: Long Papers), tion generator,” CoRR, vol. abs/2206.08082, 2022.
2022, pp. 8410–8423. [423] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
[412] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can H. Chan, and J. Ba, “Large language models are
we know what language models know?” Transactions human-level prompt engineers,” in Proc. of ICLR,
of the Association for Computational Linguistics, vol. 8, 2023.
pp. 423–438, 2020. [424] Y. Hao, Y. Sun, L. Dong, Z. Han, Y. Gu, and F. Wei,
[413] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, “Structured prompting: Scaling in-context learning
and S. Singh, “Autoprompt: Eliciting knowledge to 1, 000 examples,” CoRR, 2022.
from language models with automatically gener- [425] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stene-
ated prompts,” in Proceedings of the 2020 Conference torp, “Fantastically ordered prompts and where to
on Empirical Methods in Natural Language Processing find them: Overcoming few-shot prompt order sen-
(EMNLP), 2020, pp. 4222–4235. sitivity,” in Proceedings of the 60th Annual Meeting of
[414] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, the Association for Computational Linguistics (Volume
W. Chen, and T. Zhao, “Adaptive budget allocation 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-
for parameter-efficient fine-tuning,” CoRR, vol. 27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
abs/2303.10512, 2023. [Online]. Available: https: Eds., 2022, pp. 8086–8098.
//doi.org/10.48550/arXiv.2303.10512 [426] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot,
[415] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and “Complexity-based prompting for multi-step reason-
A. Ghodsi, “Dylora: Parameter efficient tuning ing,” CoRR, vol. abs/2210.00720, 2022.
of pre-trained models using dynamic search-free [427] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Auto-
low-rank adaptation,” CoRR, vol. abs/2210.07558, matic chain of thought prompting in large language
2022. [Online]. Available: https://doi.org/10.48550/ models,” CoRR, vol. abs/2210.03493, 2022.
arXiv.2210.07558 [428] A. Creswell, M. Shanahan, and I. Higgins,
[416] N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su, “Selection-inference: Exploiting large language mod-
S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao, els for interpretable logical reasoning,” CoRR, vol.
X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang, abs/2205.09712, 2022.
J. Li, and M. Sun, “Parameter-efficient fine-tuning [429] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi,
of large-scale pre-trained language models,” Nature and D. Zhou, “Self-consistency improves chain of
Machine Intelligence, vol. 5, pp. 1–16, 03 2023. thought reasoning in language models,” CoRR, vol.
[417] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, abs/2203.11171, 2022.
P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine- [430] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou,
115

and W. Chen, “On the advance of making language [445] V. Liu and L. B. Chilton, “Design guidelines for
models better reasoners,” CoRR, vol. abs/2206.02336, prompt engineering text-to-image generative mod-
2022. els,” in Proceedings of the 2022 CHI Conference on
[431] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, Human Factors in Computing Systems, 2022, pp. 1–23.
and D. Zhou, “Rationale-augmented ensembles in [446] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea,
language models,” CoRR, 2022. H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C.
[432] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, Schmidt, “A prompt pattern catalog to enhance
X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and prompt engineering with chatgpt,” arXiv preprint
E. H. Chi, “Least-to-most prompting enables com- arXiv:2302.11382, 2023.
plex reasoning in large language models,” CoRR, vol. [447] S. K. K. Santu and D. Feng, “Teler: A general
abs/2205.10625, 2022. taxonomy of LLM prompts for benchmarking
[433] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, complex tasks,” CoRR, vol. abs/2305.11430, 2023.
K. Richardson, P. Clark, and A. Sabhar- [Online]. Available: https://doi.org/10.48550/arXiv.
wal, “Decomposed prompting: A modular 2305.11430
approach for solving complex tasks,” CoRR, [448] OpenAI, “Gpt best practices,” OpenAI, 2023.
vol. abs/2210.02406, 2022. [Online]. Available: [Online]. Available: https://platform.openai.com/
https://doi.org/10.48550/arXiv.2210.02406 docs/guides/gpt-best-practices
[434] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K. [449] Contributors, “Ai short,” 2023. [Online]. Available:
Lee, and E. Lim, “Plan-and-solve prompting: https://www.aishort.top/
Improving zero-shot chain-of-thought reasoning by [450] ——, “Awesome chatgpt prompts,” Github,
large language models,” CoRR, vol. abs/2305.04091, 2023. [Online]. Available: https://github.com/f/
2023. [Online]. Available: https://doi.org/10.48550/ awesome-chatgpt-prompts/
arXiv.2305.04091 [451] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and
[435] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, J. Wen, “Structgpt: A general framework for large
E. Wong, M. Apidianaki, and C. Callison-Burch, language model to reason over structured data,”
“Faithful chain-of-thought reasoning,” CoRR, vol. CoRR, vol. abs/2305.09645, 2023.
abs/2301.13379, 2023. [452] L. Beurer-Kellner, M. Fischer, and M. Vechev,
[436] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, “Prompting is programming: A query language for
J. Callan, and G. Neubig, “PAL: program-aided lan- large language models,” Proceedings of the ACM on
guage models,” CoRR, vol. abs/2211.10435, 2022. Programming Languages, vol. 7, no. PLDI, pp. 1946–
[437] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and 1969, 2023.
Y. Zhuang, “Hugginggpt: Solving ai tasks with chat- [453] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang,
gpt and its friends in huggingface,” arXiv preprint Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-
arXiv:2303.17580, 2023. and-play compositional reasoning with large lan-
[438] H. Sun, Y. Zhuang, L. Kong, B. Dai, and guage models,” arXiv preprint arXiv:2304.09842, 2023.
C. Zhang, “Adaplanner: Adaptive planning from [454] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian,
feedback with language models,” arXiv preprint H. Wu, J.-R. Wen, and H. Wang, “Investigating
arXiv:2305.16653, 2023. the factual knowledge boundary of large language
[439] Y. Lu, P. Lu, Z. Chen, W. Zhu, X. E. Wang, and W. Y. models with retrieval augmentation,” arXiv preprint
Wang, “Multimodal procedural planning via dual arXiv:2307.11019, 2023.
text-image prompting,” CoRR, vol. abs/2305.01795, [455] X. Amatriain, “Prompt design and engineering:
2023. Introduction and advanced methods,” CoRR, vol.
[440] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, abs/2401.14423, 2024.
and Z. Hu, “Reasoning with language model is plan- [456] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
ning with world model,” CoRR, vol. abs/2305.14992, and W. X. Zhao, “Large language models are zero-
2023. shot rankers for recommender systems,” CoRR, vol.
[441] Z. Chen, K. Zhou, B. Zhang, Z. Gong, W. X. abs/2305.08845, 2023.
Zhao, and J. Wen, “Chatcot: Tool-augmented chain- [457] S. Chang and E. Fosler-Lussier, “How to prompt
of-thought reasoning on chat-based large language llms for text-to-sql: A study in zero-shot, single-
models,” CoRR, vol. abs/2305.14323, 2023. domain, and cross-domain settings,” CoRR, vol.
[442] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, abs/2305.11853, 2023. [Online]. Available: https:
K. Narasimhan, and Y. Cao, “React: Synergizing rea- //doi.org/10.48550/arXiv.2305.11853
soning and acting in language models,” CoRR, vol. [458] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum,
abs/2210.03629, 2022. J. Geiping, and T. Goldstein, “Hard prompts
[443] N. Shinn, F. Cassano, B. Labash, A. Gopinath, made easy: Gradient-based discrete optimization
K. Narasimhan, and S. Yao, “Reflexion: Language for prompt tuning and discovery,” CoRR, vol.
agents with verbal reinforcement learning,” 2023. abs/2302.03668, 2023. [Online]. Available: https:
[444] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, //doi.org/10.48550/arXiv.2302.03668
Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib- [459] T. Gao, A. Fisch, and D. Chen, “Making pre-trained
erate problem solving with large language models,” language models better few-shot learners,” in Pro-
CoRR, vol. abs/2305.10601, 2023. ceedings of the 59th Annual Meeting of the Association
116

for Computational Linguistics and the 11th International and M. Zeng, “Automatic prompt optimization
Joint Conference on Natural Language Processing, ACL/I- with ”gradient descent” and beam search,” CoRR,
JCNLP 2021, (Volume 1: Long Papers), Virtual Event, vol. abs/2305.03495, 2023. [Online]. Available:
August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Nav- https://doi.org/10.48550/arXiv.2305.03495
igli, Eds. Association for Computational Linguistics, [470] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le,
2021, pp. 3816–3830. D. Zhou, and X. Chen, “Large language models
[460] L. Chen, J. Chen, T. Goldstein, H. Huang, as optimizers,” CoRR, vol. abs/2309.03409, 2023.
and T. Zhou, “Instructzero: Efficient instruction [Online]. Available: https://doi.org/10.48550/arXiv.
optimization for black-box large language models,” 2309.03409
CoRR, vol. abs/2306.03082, 2023. [Online]. Available: [471] Q. Ye, M. Axmed, R. Pryzant, and F. Khani,
https://doi.org/10.48550/arXiv.2306.03082 “Prompt engineering a prompt engineer,” CoRR, vol.
[461] X. Lin, Z. Wu, Z. Dai, W. Hu, Y. Shu, S. Ng, P. Jaillet, abs/2311.05661, 2023.
and B. K. H. Low, “Use your INSTINCT: instruc- [472] X. Tang, X. Wang, W. X. Zhao, S. Lu, Y. Li, and
tion optimization using neural bandits coupled with J. Wen, “Unleashing the potential of large language
transformers,” CoRR, vol. abs/2310.02905, 2023. models as prompt optimizers: An analogical analysis
[462] M. Deng, J. Wang, C. Hsieh, Y. Wang, H. Guo, T. Shu, with gradient-based model optimizers,” CoRR, vol.
M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimiz- abs/2402.17564, 2024.
ing discrete text prompts with reinforcement learn- [473] H. Yang and K. Li, “Instoptima: Evolutionary
ing,” in Proceedings of the 2022 Conference on Empir- multi-objective instruction optimization via large
ical Methods in Natural Language Processing, EMNLP language model-based instruction operators,” in
2022, Abu Dhabi, United Arab Emirates, December 7-11, EMNLP (Findings). Association for Computational
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Linguistics, 2023, pp. 13 593–13 602.
Association for Computational Linguistics, 2022, pp. [474] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan,
3369–3391. G. Liu, J. Bian, and Y. Yang, “Connecting large
[463] T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and language models with evolutionary algorithms
J. E. Gonzalez, “TEMPERA: test-time prompt editing yields powerful prompt optimizers,” CoRR, vol.
via reinforcement learning,” in The Eleventh Inter- abs/2309.08532, 2023.
national Conference on Learning Representations, ICLR [475] X. L. Do, Y. Zhao, H. Brown, Y. Xie, J. X. Zhao, N. F.
2023, Kigali, Rwanda, May 1-5, 2023. OpenRe- Chen, K. Kawaguchi, M. Q. Xie, and J. He, “Prompt
view.net, 2023. optimization via adversarial in-context learning,”
[464] Y. Jafari, D. Mekala, R. Yu, and T. Berg-Kirkpatrick, CoRR, vol. abs/2312.02614, 2023.
“Morl-prompt: An empirical analysis of multi- [476] X. Wang, C. Li, Z. Wang, F. Bai, H. Luo,
objective reinforcement learning for discrete prompt J. Zhang, N. Jojic, E. P. Xing, and Z. Hu,
optimization,” CoRR, vol. abs/2402.11711, 2024. “Promptagent: Strategic planning with language
[465] W. Kong, S. A. Hombaiah, M. Zhang, Q. Mei, and models enables expert-level prompt optimization,”
M. Bendersky, “Prewrite: Prompt rewriting with re- CoRR, vol. abs/2310.16427, 2023. [Online]. Available:
inforcement learning,” CoRR, vol. abs/2401.08189, https://doi.org/10.48550/arXiv.2310.16427
2024. [477] T. Tang, J. Li, W. X. Zhao, and J. Wen, “Context-
[466] H. Xu, Y. Chen, Y. Du, N. Shao, Y. Wang, H. Li, tuning: Learning contextualized prompts for natu-
and Z. Yang, “GPS: genetic prompt search for effi- ral language generation,” in Proceedings of the 29th
cient few-shot learning,” in Proceedings of the 2022 International Conference on Computational Linguistics,
Conference on Empirical Methods in Natural Language COLING 2022, Gyeongju, Republic of Korea, October 12-
Processing, EMNLP 2022, Abu Dhabi, United Arab Emi- 17, 2022, N. Calzolari, C. Huang, H. Kim, J. Puste-
rates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, jovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Do-
and Y. Zhang, Eds. Association for Computational natelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim,
Linguistics, 2022, pp. 8162–8171. Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and
[467] A. Prasad, P. Hase, X. Zhou, and M. Bansal, S. Na, Eds. International Committee on Computa-
“Grips: Gradient-free, edit-based instruction search tional Linguistics, 2022, pp. 6340–6354.
for prompting large language models,” in Proceedings [478] T. Vu, B. Lester, N. Constant, R. Al-Rfou’, and D. Cer,
of the 17th Conference of the European Chapter of the “Spot: Better frozen model adaptation through soft
Association for Computational Linguistics, EACL 2023, prompt transfer,” in Proceedings of the 60th Annual
Dubrovnik, Croatia, May 2-6, 2023, A. Vlachos and Meeting of the Association for Computational Linguistics
I. Augenstein, Eds. Association for Computational (Volume 1: Long Papers), ACL 2022, Dublin, Ireland,
Linguistics, 2023, pp. 3827–3846. May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavi-
[468] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, cencio, Eds. Association for Computational Linguis-
H. Chan, and J. Ba, “Large language models are tics, 2022, pp. 5039–5059.
human-level prompt engineers,” in The Eleventh [479] J. Li, T. Tang, J. Nie, J. Wen, and X. Zhao, “Learn-
International Conference on Learning Representations, ing to transfer prompts for text generation,” in Pro-
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open- ceedings of the 2022 Conference of the North American
Review.net, 2023. Chapter of the Association for Computational Linguistics:
[469] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, Human Language Technologies, NAACL 2022, Seattle,
117

WA, United States, July 10-15, 2022, M. Carpuat, [492] Y. Gu, L. Dong, F. Wei, and M. Huang, “Pre-training
M. de Marneffe, and I. V. M. Ruı́z, Eds. Association to learn in context,” CoRR, vol. abs/2305.09137, 2023.
for Computational Linguistics, 2022, pp. 3506–3518. [493] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi,
[480] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, “Metaicl: Learning to learn in context,” in Proceed-
H. Hajishirzi, and L. Zettlemoyer, “Rethinking the ings of the 2022 Conference of the North American
role of demonstrations: What makes in-context learn- Chapter of the Association for Computational Linguistics:
ing work?” in Proceedings of the 2022 Conference Human Language Technologies, NAACL 2022, Seattle,
on Empirical Methods in Natural Language Processing, WA, United States, July 10-15, 2022, M. Carpuat,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De- M. de Marneffe, and I. V. M. Ruı́z, Eds., 2022, pp.
cember 7-11, 2022. Association for Computational 2791–2809.
Linguistics, 2022, pp. 11 048–11 064. [494] M. Hahn and N. Goyal, “A theory of emergent
[481] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, in-context learning as implicit structure induction,”
“Calibrate before use: Improving few-shot perfor- CoRR, vol. abs/2303.07971, 2023.
mance of language models,” in Proceedings of the 38th [495] J. Pan, T. Gao, H. Chen, and D. Chen, “What in-
International Conference on Machine Learning, ICML context learning ”learns” in-context: Disentangling
2021, 18-24 July 2021, Virtual Event, M. Meila and task recognition and task learning,” CoRR, vol.
T. Zhang, Eds., 2021, pp. 12 697–12 706. abs/2305.09731, 2023.
[482] Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate [496] N. Wies, Y. Levine, and A. Shashua, “The learnability
empathetic dialogues? A novel in-context example of in-context learning,” CoRR, vol. abs/2303.07895,
selection method and automatic evaluation metric 2023.
for empathetic dialogue generation,” in Proceedings [497] A. Webson and E. Pavlick, “Do prompt-based models
of the 29th International Conference on Computational really understand the meaning of their prompts?” in
Linguistics, COLING 2022, Gyeongju, Republic of Korea, Proceedings of the 2022 Conference of the North American
October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, Chapter of the Association for Computational Linguistics:
J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, Human Language Technologies, NAACL 2022, Seattle,
L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, WA, United States, July 10-15, 2022, 2022, pp. 2300–
S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, 2344.
and S. Na, Eds. International Committee on Com- [498] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacra-
putational Linguistics, 2022, pp. 669–683. mento, A. Mordvintsev, A. Zhmoginov, and M. Vla-
[483] I. Levy, B. Bogin, and J. Berant, “Diverse demonstra- dymyrov, “Transformers learn in-context by gradient
tions improve in-context compositional generaliza- descent,” CoRR, vol. abs/2212.07677, 2022.
tion,” CoRR, vol. abs/2212.06800, 2022. [499] C. Olsson, N. Elhage, N. Nanda, N. Joseph,
[484] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, N. DasSarma, T. Henighan, B. Mann, A. Askell,
R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Gan-
and T. Yu, “Selective annotation makes language guli, Z. Hatfield-Dodds, D. Hernandez, S. John-
models better few-shot learners,” CoRR, 2022. ston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse,
[485] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCan-
and R. Pasunuru, “Complementary explanations for dlish, and C. Olah, “In-context learning and induc-
effective in-context learning,” CoRR, 2022. tion heads,” CoRR, vol. abs/2209.11895, 2022.
[486] X. Li and X. Qiu, “Finding supporting examples for [500] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and
in-context learning,” CoRR, 2023. D. Zhou, “What learning algorithm is in-context
[487] Y. Zhang, S. Feng, and C. Tan, “Active example learning? investigations with linear models,” CoRR,
selection for in-context learning,” in Proceedings of vol. abs/2211.15661, 2022.
the 2022 Conference on Empirical Methods in Natural [501] J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu,
Language Processing, EMNLP 2022, Abu Dhabi, United X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger
Arab Emirates, December 7-11, 2022, 2022, pp. 9134– language models do in-context learning differently,”
9148. arXiv preprint arXiv:2303.03846, 2023.
[488] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt out- [502] J. Coda-Forno, M. Binz, Z. Akata, M. M. Botvinick,
performs crowd-workers for text-annotation tasks,” J. X. Wang, and E. Schulz, “Meta-in-context
2023. learning in large language models,” CoRR, vol.
[489] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and abs/2305.12907, 2023.
S. Lee, “Self-generated in-context learning: Leverag- [503] J. W. Wei, L. Hou, A. K. Lampinen, X. Chen,
ing auto-regressive language models as a demonstra- D. Huang, Y. Tay, X. Chen, Y. Lu, D. Zhou, T. Ma, and
tion generator,” CoRR, vol. abs/2206.08082, 2022. Q. V. Le, “Symbol tuning improves in-context learn-
[490] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma, ing in language models,” CoRR, vol. abs/2305.08298,
“An explanation of in-context learning as implicit 2023.
bayesian inference,” in The Tenth International Con- [504] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang,
ference on Learning Representations, ICLR 2022, Virtual W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of
Event, April 25-29, 2022, 2022. chain of thought reasoning: Advances, frontiers and
[491] Z. Wu, Y. Wang, J. Ye, and L. Kong, “Self-adaptive in- future,” CoRR, vol. abs/2309.15402, 2023.
context learning,” CoRR, vol. abs/2212.10375, 2022. [505] S. Miao, C. Liang, and K. Su, “A diverse corpus
118

for evaluating and developing english math word of thoughts: Solving elaborate problems with large
problem solvers,” in Proceedings of the 58th Annual language models,” CoRR, vol. abs/2308.09687, 2023.
Meeting of the Association for Computational Linguistics, [520] B. Lei, P. Lin, C. Liao, and C. Ding, “Boosting log-
ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, ical reasoning in large language models through a
N. Schluter, and J. R. Tetreault, Eds. Association for new framework: The graph of thought,” CoRR, vol.
Computational Linguistics, 2020, pp. 975–984. abs/2308.08614, 2023.
[506] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Com- [521] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang,
monsenseqa: A question answering challenge tar- S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Ev-
geting commonsense knowledge,” in Proceedings of erything of thoughts: Defying the law of pen-
the 2019 Conference of the North American Chapter of rose triangle for thought generation,” arXiv preprint
the Association for Computational Linguistics: Human arXiv:2311.04254, 2023.
Language Technologies, NAACL-HLT 2019, Minneapolis, [522] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu,
MN, USA, June 2-7, 2019, Volume 1 (Long and Short M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Ku-
Papers), J. Burstein, C. Doran, and T. Solorio, Eds. mar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cos-
Association for Computational Linguistics, 2019, pp. grove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A.
4149–4158. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong,
[507] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwa- H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr,
sawa, “Large language models are zero-shot reason- L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim,
ers,” CoRR, vol. abs/2205.11916, 2022. N. Guha, N. S. Chatterji, O. Khattab, P. Henderson,
[508] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Pro- Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli,
gram of thoughts prompting: Disentangling com- T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary,
putation from reasoning for numerical reasoning W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda,
tasks,” CoRR, vol. abs/2211.12588, 2022. “Holistic evaluation of language models,” CoRR, vol.
[509] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, abs/2211.09110, 2022.
J. Callan, and G. Neubig, “PAL: program-aided lan- [523] Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, and
guage models,” in International Conference on Ma- H. Chen, “When do program-of-thoughts work for
chine Learning, ICML 2023, 23-29 July 2023, Honolulu, reasoning?” CoRR, vol. abs/2308.15452, 2023.
Hawaii, USA, A. Krause, E. Brunskill, K. Cho, B. En- [524] A. Madaan and A. Yazdanbakhsh, “Text and pat-
gelhardt, S. Sabato, and J. Scarlett, Eds., 2023. terns: For effective chain of thought, it takes two to
[510] X. Zhao, Y. Xie, K. Kawaguchi, J. He, and Q. Xie, “Au- tango,” CoRR, vol. abs/2209.07686, 2022.
tomatic model selection with large language models [525] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and
for reasoning,” CoRR, vol. abs/2305.14333, 2023. A. Smola, “Multimodal chain-of-thought reasoning
[511] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, in language models,” CoRR, vol. abs/2302.00923,
and W. Chen, “Making large language models better 2023.
reasoners with step-aware verifier,” 2023. [526] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Sri-
[512] O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, vats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
and J. Berant, “Answering questions by meta- D. Zhou, D. Das, and J. Wei, “Language models are
reasoning over multiple chains of thought,” CoRR, multilingual chain-of-thought reasoners,” CoRR, vol.
vol. abs/2304.13007, 2023. abs/2210.03057, 2022.
[513] Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memi- [527] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
sevic, and H. Su, “Deductive verification of chain-of- tions of language models in arithmetic and symbolic
thought reasoning,” CoRR, vol. abs/2306.03872, 2023. induction,” CoRR, vol. abs/2208.05051, 2022.
[514] T. Xue, Z. Wang, Z. Wang, C. Han, P. Yu, and H. Ji, [528] N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He,
“RCOT: detecting and rectifying factual inconsis- “ChatGPT is a Knowledgeable but Inexperienced
tency in reasoning by reversing chain-of-thought,” Solver: An Investigation of Commonsense Problem
CoRR, vol. abs/2305.11499, 2023. in Large Language Models,” CoRR, 2023.
[515] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and [529] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,
J. Zhao, “Large language models are better reasoners Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib-
with self-verification,” CoRR, abs/2212.09561, 2023. erate problem solving with large language models,”
[516] W. Jiang, H. Shi, L. Yu, Z. Liu, Y. Zhang, Z. Li, CoRR, vol. abs/2305.10601, 2023.
and J. T. Kwok, “Forward-backward reasoning in [530] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
large language models for mathematical verifica- Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
tion,” 2023. open-ended embodied agent with large language
[517] J. Long, “Large language model guided tree-of- models,” arXiv preprint arXiv:2305.16291, 2023.
thought,” CoRR, vol. abs/2305.08291, 2023. [531] X. Jiang, Y. Dong, L. Wang, Q. Shang, and
[518] S. Mo and M. Xin, “Tree of uncertain thoughts G. Li, “Self-planning code generation with large
reasoning for large language models,” CoRR, vol. language model,” CoRR, vol. abs/2303.06689, 2023.
abs/2309.07694, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
[519] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, 2303.06689
L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, [532] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu,
H. Niewiadomski, P. Nyczyk, and T. Hoefler, “Graph J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog-
119

prompt: Generating situated robot task plans using ume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatter-
large language models,” CoRR, vol. abs/2209.11302, jee, C. Federmann, M. Fishel, Y. Graham, B. Haddow,
2022. M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins,
[533] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post,
and P. Stone, “LLM+P: empowering large language M. Turchi, and K. Verspoor, Eds. Association for
models with optimal planning proficiency,” CoRR, Computational Linguistics, 2019, pp. 1–61.
vol. abs/2304.11477, 2023. [Online]. Available: [545] L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-
https://doi.org/10.48550/arXiv.2304.11477 jussà, C. Federmann, Y. Graham, R. Grundkiewicz,
[534] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn,
B. Ommer, “High-resolution image synthesis with C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Na-
latent diffusion models,” in IEEE/CVF Conference on gata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri,
Computer Vision and Pattern Recognition, CVPR 2022, “Findings of the 2020 conference on machine trans-
New Orleans, LA, USA, June 18-24, 2022, 2022, pp. lation (WMT20),” in Proceedings of the Fifth Con-
10 674–10 685. ference on Machine Translation, WMT@EMNLP 2020,
[535] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, Online, November 19-20, 2020, L. Barrault, O. Bojar,
P. Liang, and M. S. Bernstein, “Generative agents: F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Fed-
Interactive simulacra of human behavior,” CoRR, vol. ermann, M. Fishel, A. Fraser, Y. Graham, P. Guzman,
abs/2304.03442, 2023. B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn,
[536] 2023. [Online]. Available: https://github.com/ A. Martins, M. Morishita, C. Monz, M. Nagata,
Significant-Gravitas/Auto-GPT T. Nakazawa, and M. Negri, Eds. Association for
[537] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, Computational Linguistics, 2020, pp. 1–55.
“Describe, explain, plan and select: Interactive plan- [546] F. Akhbardeh, A. Arkhangorodsky, M. Biesialska,
ning with large language models enables open-world O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-
multi-task agents,” CoRR, vol. abs/2302.01560, 2023. jussà, C. España-Bonet, A. Fan, C. Federmann,
[538] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang, M. Freitag, Y. Graham, R. Grundkiewicz, B. Had-
X. Guo, C. Li, X. Xu et al., “Milvus: A purpose- dow, L. Harter, K. Heafield, C. Homan, M. Huck,
built vector data management system,” in Proceedings K. Amponsah-Kaakyire, J. Kasai, D. Khashabi,
of the 2021 International Conference on Management of K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz,
Data, 2021, pp. 2614–2627. M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa,
[539] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin,
“Memorybank: Enhancing large language models and M. Zampieri, “Findings of the 2021 confer-
with long-term memory,” CoRR, vol. abs/2305.10250, ence on machine translation (WMT21),” in Proceed-
2023. ings of the Sixth Conference on Machine Translation,
[540] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, WMT@EMNLP 2021, Online Event, November 10-11,
“Building a large annotated corpus of english: The 2021, L. Barrault, O. Bojar, F. Bougares, R. Chat-
penn treebank,” Comput. Linguistics, vol. 19, no. 2, terjee, M. R. Costa-jussà, C. Federmann, M. Fishel,
pp. 313–330, 1993. A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz,
[541] S. Merity, C. Xiong, J. Bradbury, and R. Socher, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes,
“Pointer sentinel mixture models,” in ICLR (Poster). P. Koehn, T. Kocmi, A. Martins, M. Morishita, and
OpenReview.net, 2017. C. Monz, Eds. Association for Computational Lin-
[542] O. Bojar, C. Buck, C. Federmann, B. Haddow, guistics, 2021, pp. 1–88.
P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, [547] T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Fe-
H. Saint-Amand, R. Soricut, L. Specia, and A. Tam- dermann, M. Fishel, T. Gowda, Y. Graham, R. Grund-
chyna, “Findings of the 2014 workshop on statistical kiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz,
machine translation,” in WMT@ACL. The Associa- M. Morishita, M. Nagata, T. Nakazawa, M. Novák,
tion for Computer Linguistics, 2014, pp. 12–58. M. Popel, and M. Popovic, “Findings of the 2022
[543] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, conference on machine translation (WMT22),” in Pro-
B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, ceedings of the Seventh Conference on Machine Trans-
V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L. lation, WMT 2022, Abu Dhabi, United Arab Emi-
Neves, M. Popel, M. Post, R. Rubino, C. Scarton, rates (Hybrid), December 7-8, 2022, P. Koehn, L. Bar-
L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, rault, O. Bojar, F. Bougares, R. Chatterjee, M. R.
“Findings of the 2016 conference on machine trans- Costa-jussà, C. Federmann, M. Fishel, A. Fraser,
lation,” in WMT. The Association for Computer M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman,
Linguistics, 2016, pp. 131–198. B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi,
[544] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Feder- A. Martins, M. Morishita, C. Monz, M. Nagata,
mann, M. Fishel, Y. Graham, B. Haddow, M. Huck, T. Nakazawa, M. Negri, A. Névéol, M. Neves,
P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Popel, M. Turchi, and M. Zampieri, Eds. Associ-
M. Post, and M. Zampieri, “Findings of the 2019 ation for Computational Linguistics, 2022, pp. 1–45.
conference on machine translation (WMT19),” in Pro- [548] N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek,
ceedings of the Fourth Conference on Machine Transla- D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and
tion, WMT 2019, Florence, Italy, August 1-2, 2019 - Vol- A. Fan, “The flores-101 evaluation benchmark for
120

low-resource and multilingual machine translation,” [560] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer,
Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538, “Triviaqa: A large scale distantly supervised chal-
2022. lenge dataset for reading comprehension,” in Pro-
[549] R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset, ceedings of the 55th Annual Meeting of the Association
“Diabla: a corpus of bilingual spontaneous writ- for Computational Linguistics, ACL 2017, Vancouver,
ten dialogues for machine translation,” Lang. Resour. Canada, July 30 - August 4, Volume 1: Long Papers, 2017,
Evaluation, vol. 55, no. 3, pp. 635–660, 2021. pp. 1601–1611.
[550] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, [561] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi,
and B. Xiang, “Abstractive text summarization using “PIQA: reasoning about physical commonsense in
sequence-to-sequence rnns and beyond,” in Proceed- natural language,” in The Thirty-Fourth AAAI Con-
ings of the 20th SIGNLL Conference on Computational ference on Artificial Intelligence, AAAI 2020, The Thirty-
Natural Language Learning, CoNLL 2016, Berlin, Ger- Second Innovative Applications of Artificial Intelligence
many, August 11-12, 2016, Y. Goldberg and S. Riezler, Conference, IAAI 2020, The Tenth AAAI Symposium
Eds. ACL, 2016, pp. 280–290. on Educational Advances in Artificial Intelligence, EAAI
[551] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give 2020, New York, NY, USA, February 7-12, 2020, 2020,
me the details, just the summary! topic-aware con- pp. 7432–7439.
volutional neural networks for extreme summariza- [562] M. Dubey, D. Banerjee, A. Abdelkawi, and
tion,” in EMNLP. Association for Computational J. Lehmann, “Lc-quad 2.0: A large dataset for com-
Linguistics, 2018, pp. 1797–1807. plex question answering over wikidata and dbpe-
[552] F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown, dia,” in The Semantic Web - ISWC 2019 - 18th In-
“Wikilingua: A new benchmark dataset for cross- ternational Semantic Web Conference, Auckland, New
lingual abstractive summarization,” in Findings of Zealand, October 26-30, 2019, Proceedings, Part II, 2019,
the Association for Computational Linguistics: EMNLP pp. 69–78.
2020, 2020, pp. 4034–4048. [563] Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang,
[553] S. Moon, P. Shah, A. Kumar, and R. Subba, “Open- X. Yan, and Y. Su, “Beyond I.I.D.: three levels of
dialkg: Explainable conversational reasoning with generalization for question answering on knowledge
attention-based walks over knowledge graphs,” in bases,” in WWW ’21: The Web Conference 2021, Virtual
ACL (1). Association for Computational Linguistics, Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp.
2019, pp. 845–854. 3477–3488.
[554] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettle- [564] S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou,
moyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu, J. Li, B. He, and H. Zhang, “KQA pro: A dataset
“DS-1000: A natural and reliable benchmark for data with explicit compositional programs for complex
science code generation,” CoRR, vol. abs/2211.11501, question answering over knowledge base,” in Pro-
2022. ceedings of the 60th Annual Meeting of the Association
[555] Z. Wang, S. Zhou, D. Fried, and G. Neubig, for Computational Linguistics (Volume 1: Long Papers),
“Execution-based evaluation for open-domain code ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.
generation,” CoRR, vol. abs/2212.10481, 2022. 6101–6119.
[556] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, [565] X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form gen-
A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, eration via multi-task learning for complex question
J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kel- answering over knowledge bases,” in Proceedings
cey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and of the 29th International Conference on Computational
S. Petrov, “Natural questions: a benchmark for ques- Linguistics, COLING 2022, Gyeongju, Republic of Korea,
tion answering research,” Trans. Assoc. Comput. Lin- October 12-17, 2022, 2022, pp. 1687–1696.
guistics, pp. 452–466, 2019. [566] S. Longpre, Y. Lu, and J. Daiber, “MKQA: A lin-
[557] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, guistically diverse benchmark for multilingual open
C. Schoenick, and O. Tafjord, “Think you have solved domain question answering,” Trans. Assoc. Comput.
question answering? try arc, the AI2 reasoning chal- Linguistics, vol. 9, pp. 1389–1406, 2021.
lenge,” CoRR, vol. abs/1803.05457, 2018. [567] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhat-
[558] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measur- tacharyya, “Scienceqa: a novel resource for question
ing how models mimic human falsehoods,” in Pro- answering on scholarly articles,” Int. J. Digit. Libr.,
ceedings of the 60th Annual Meeting of the Association vol. 23, no. 3, pp. 289–301, 2022.
for Computational Linguistics (Volume 1: Long Papers), [568] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal,
ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. “Can a suit of armor conduct electricity? A new
3214–3252. dataset for open book question answering,” in Pro-
[559] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic ceedings of the 2018 Conference on Empirical Methods in
parsing on freebase from question-answer pairs,” in Natural Language Processing, Brussels, Belgium, October
Proceedings of the 2013 Conference on Empirical Methods 31 - November 4, 2018, 2018, pp. 2381–2391.
in Natural Language Processing, EMNLP 2013, 18-21 [569] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary,
October 2013, Grand Hyatt Seattle, Seattle, Washington, R. Majumder, and L. Deng, “MS MARCO: A human
USA, A meeting of SIGDAT, a Special Interest Group of generated machine reading comprehension dataset,”
the ACL, 2013, pp. 1533–1544. in Proceedings of the Workshop on Cognitive Computa-
121

tion: Integrating neural and symbolic approaches 2016 “YAGO3: A knowledge base from multilingual
co-located with the 30th Annual Conference on Neural wikipedias,” in Seventh Biennial Conference on Innova-
Information Processing Systems (NIPS 2016), Barcelona, tive Data Systems Research, CIDR 2015, Asilomar, CA,
Spain, December 9, 2016, 2016. USA, January 4-7, 2015, Online Proceedings, 2015.
[570] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sab- [580] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago:
harwal, “QASC: A dataset for question answering a core of semantic knowledge,” in Proceedings of
via sentence composition,” in The Thirty-Fourth AAAI the 16th International Conference on World Wide Web,
Conference on Artificial Intelligence, AAAI 2020, The WWW 2007, Banff, Alberta, Canada, May 8-12, 2007,
Thirty-Second Innovative Applications of Artificial Intel- 2007, pp. 697–706.
ligence Conference, IAAI 2020, The Tenth AAAI Sympo- [581] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen,
sium on Educational Advances in Artificial Intelligence, R. Salakhutdinov, and C. D. Manning, “Hotpotqa:
EAAI 2020, New York, NY, USA, February 7-12, 2020, A dataset for diverse, explainable multi-hop ques-
2020, pp. 8082–8090. tion answering,” in Proceedings of the 2018 Conference
[571] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, on Empirical Methods in Natural Language Processing,
“Squad: 100, 000+ questions for machine compre- Brussels, Belgium, October 31 - November 4, 2018. As-
hension of text,” in Proceedings of the 2016 Conference sociation for Computational Linguistics, 2018, pp.
on Empirical Methods in Natural Language Processing, 2369–2380.
EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, [582] C. Clark, K. Lee, M. Chang, T. Kwiatkowski,
2016, pp. 2383–2392. M. Collins, and K. Toutanova, “Boolq: Exploring the
[572] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, surprising difficulty of natural yes/no questions,” in
and J. Weston, “Key-value memory networks for Proceedings of the 2019 Conference of the North American
directly reading documents,” in Proceedings of the Chapter of the Association for Computational Linguis-
2016 Conference on Empirical Methods in Natural Lan- tics: Human Language Technologies, NAACL-HLT 2019,
guage Processing, EMNLP 2016, Austin, Texas, USA, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
November 1-4, 2016, 2016, pp. 1400–1409. and Short Papers), J. Burstein, C. Doran, and T. Solorio,
[573] B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “As- Eds. Association for Computational Linguistics,
sessing the factual accuracy of generated text,” in 2019, pp. 2924–2936.
Proceedings of the 25th ACM SIGKDD International [583] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi,
Conference on Knowledge Discovery & Data Mining, “Socialiqa: Commonsense reasoning about social in-
KDD 2019, Anchorage, AK, USA, August 4-8, 2019, teractions,” CoRR, vol. abs/1904.09728, 2019.
2019, pp. 166–175. [584] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and
[574] K. Toutanova and D. Chen, “Observed versus latent Y. Choi, “Hellaswag: Can a machine really finish
features for knowledge base and text inference,” in your sentence?” in Proceedings of the 57th Conference of
Proceedings of the 3rd Workshop on Continuous Vector the Association for Computational Linguistics, ACL 2019,
Space Models and their Compositionality, CVSC 2015, Florence, Italy, July 28- August 2, 2019, Volume 1: Long
Beijing, China, July 26-31, 2015, 2015, pp. 57–66. Papers, A. Korhonen, D. R. Traum, and L. Màrquez,
[575] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, Eds. Association for Computational Linguistics,
and J. Taylor, “Freebase: a collaboratively created 2019, pp. 4791–4800.
graph database for structuring human knowledge,” [585] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and
in Proceedings of the ACM SIGMOD International Con- Y. Choi, “Winogrande: An adversarial winograd
ference on Management of Data, SIGMOD 2008, Vancou- schema challenge at scale,” in AAAI. AAAI Press,
ver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250. 2020, pp. 8732–8740.
[576] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, [586] M. Roemmele, C. A. Bejan, and A. S. Gordon,
“Convolutional 2d knowledge graph embeddings,” “Choice of plausible alternatives: An evaluation of
in Proceedings of the Thirty-Second AAAI Conference commonsense causal reasoning,” in Logical Formaliza-
on Artificial Intelligence, (AAAI-18), the 30th innovative tions of Commonsense Reasoning, Papers from the 2011
Applications of Artificial Intelligence (IAAI-18), and the AAAI Spring Symposium, Technical Report SS-11-06,
8th AAAI Symposium on Educational Advances in Ar- Stanford, California, USA, March 21-23, 2011. AAAI,
tificial Intelligence (EAAI-18), New Orleans, Louisiana, 2011.
USA, February 2-7, 2018, 2018, pp. 1811–1818. [587] K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon,
[577] G. A. Miller, “Wordnet: A lexical database for en- P. Clark, and Y. Choi, “proscript: Partially ordered
glish,” Commun. ACM, pp. 39–41, 1995. scripts generation,” in Findings of the Association for
[578] F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis, Computational Linguistics: EMNLP 2021, Virtual Event
A. Bakhtin, Y. Wu, and A. H. Miller, “Language mod- / Punta Cana, Dominican Republic, 16-20 November,
els as knowledge bases?” in Proceedings of the 2019 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih,
Conference on Empirical Methods in Natural Language Eds. Association for Computational Linguistics,
Processing and the 9th International Joint Conference 2021, pp. 2138–2149.
on Natural Language Processing, EMNLP-IJCNLP 2019, [588] B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark,
Hong Kong, China, November 3-7, 2019, 2019, pp. 2463– “Tracking state changes in procedural text: a chal-
2473. lenge dataset and models for process paragraph com-
[579] F. Mahdisoltani, J. Biega, and F. M. Suchanek, prehension,” in Proceedings of the 2018 Conference of
122

the North American Chapter of the Association for Com- gram induction by rationale generation: Learning to
putational Linguistics: Human Language Technologies, solve and explain algebraic word problems,” in Pro-
NAACL-HLT 2018, New Orleans, Louisiana, USA, June ceedings of the 55th Annual Meeting of the Association
1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, for Computational Linguistics, ACL 2017, Vancouver,
and A. Stent, Eds. Association for Computational Canada, July 30 - August 4, Volume 1: Long Papers,
Linguistics, 2018, pp. 1595–1604. R. Barzilay and M. Kan, Eds. Association for Com-
[589] S. Saha, P. Yadav, L. Bauer, and M. Bansal, “Expla- putational Linguistics, 2017, pp. 158–167.
graphs: An explanation graph generation task for [598] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman,
structured commonsense reasoning,” in Proceedings and H. Hajishirzi, “Mawps: A math word problem
of the 2021 Conference on Empirical Methods in Natural repository,” in Proceedings of the 2016 conference of
Language Processing, EMNLP 2021, Virtual Event / the north american chapter of the association for compu-
Punta Cana, Dominican Republic, 7-11 November, 2021, tational linguistics: human language technologies, 2016,
M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. pp. 1152–1157.
Association for Computational Linguistics, 2021, pp. [599] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
7716–7740. and M. Gardner, “DROP: A reading comprehension
[590] O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: benchmark requiring discrete reasoning over para-
Generating implications, proofs, and abductive state- graphs,” in Proceedings of the 2019 Conference of the
ments over natural language,” in Findings of the North American Chapter of the Association for Com-
Association for Computational Linguistics: ACL/IJCNLP putational Linguistics: Human Language Technologies,
2021, Online Event, August 1-6, 2021, ser. Findings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7,
ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol. 2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–
ACL/IJCNLP 2021. Association for Computational 2378.
Linguistics, 2021, pp. 3621–3634. [600] S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi,
[591] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pi- and K. Cho, “Naturalproofs: Mathematical theorem
patanangkura, and P. Clark, “Explaining answers proving in natural language,” in Proceedings of the
with entailment trees,” in Proceedings of the 2021 Neural Information Processing Systems Track on Datasets
Conference on Empirical Methods in Natural Language and Benchmarks 1, NeurIPS Datasets and Benchmarks
Processing, EMNLP 2021, Virtual Event / Punta Cana, 2021, December 2021, virtual, J. Vanschoren and S. Ye-
Dominican Republic, 7-11 November, 2021, M. Moens, ung, Eds., 2021.
X. Huang, L. Specia, and S. W. Yih, Eds. Association [601] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, “Lisa: Lan-
for Computational Linguistics, 2021, pp. 7358–7370. guage models of isabelle proofs,” in 6th Conference
[592] A. Saparov and H. He, “Language models are greedy on Artificial Intelligence and Theorem Proving, 2021, pp.
reasoners: A systematic formal analysis of chain-of- 378–392.
thought,” CoRR, vol. abs/2210.01240, 2022. [602] K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-
[593] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, system benchmark for formal olympiad-level mathe-
V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, matics,” in The Tenth International Conference on Learn-
E. Dyer, and B. Neyshabur, “Exploring length gen- ing Representations, ICLR 2022, Virtual Event, April 25-
eralization in large language models,” CoRR, vol. 29, 2022. OpenReview.net, 2022.
abs/2207.04901, 2022. [603] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W.
[594] A. Patel, S. Bhattamishra, and N. Goyal, “Are NLP Ayers, D. Radev, and J. Avigad, “Proofnet: Autofor-
models really able to solve simple math word prob- malizing and formally proving undergraduate-level
lems?” in NAACL-HLT. Association for Computa- mathematics,” CoRR, vol. abs/2302.12433, 2023.
tional Linguistics, 2021, pp. 2080–2094. [604] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J. Wen,
[595] S. Roy and D. Roth, “Solving general arithmetic “Halueval: A large-scale hallucination evaluation
word problems,” in Proceedings of the 2015 Conference benchmark for large language models,” CoRR, vol.
on Empirical Methods in Natural Language Processing, abs/2305.11747, 2023.
EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, [605] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman,
L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and “Crows-pairs: A challenge dataset for measuring
Y. Marton, Eds. The Association for Computational social biases in masked language models,” in Pro-
Linguistics, 2015, pp. 1743–1752. ceedings of the 2020 Conference on Empirical Methods
[596] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, in Natural Language Processing, EMNLP 2020, Online,
Y. Choi, and H. Hajishirzi, “Mathqa: Towards inter- November 16-20, 2020, 2020, pp. 1953–1967.
pretable math word problem solving with operation- [606] R. Rudinger, J. Naradowsky, B. Leonard, and B. V.
based formalisms,” in Proceedings of the 2019 Confer- Durme, “Gender bias in coreference resolution,” in
ence of the North American Chapter of the Association for Proceedings of the 2018 Conference of the North American
Computational Linguistics: Human Language Technolo- Chapter of the Association for Computational Linguistics:
gies, NAACL-HLT 2019, Minneapolis, MN, USA, June Human Language Technologies, NAACL-HLT, New Or-
knowledge-grounded pre-training for data-to-text vol. abs/2310.01377, 2023.
generation,” in Proceedings of the 2020 Conference [884] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng,
on Empirical Methods in Natural Language Processing, and H. Ji, “MINT: evaluating llms in multi-turn in-
EMNLP 2020, Online, November 16-20, 2020. Associ- teraction with tools and language feedback,” CoRR,
ation for Computational Linguistics, 2020, pp. 8635– vol. abs/2309.10691, 2023.
134

[885] S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. We- expert-level medical question answering with large
ston, and X. Li, “Branch-solve-merge improves large language models,” CoRR, vol. abs/2305.09617, 2023.
language model evaluation and generation,” CoRR, [899] S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, and
vol. abs/2310.15123, 2023. H. Zan, “Zhongjing: Enhancing the chinese medical
[886] X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu, capabilities of large language model through expert
and Y. Li, “Wider and deeper LLM networks are feedback and real-world multi-turn dialogue,” CoRR,
fairer LLM evaluators,” CoRR, vol. abs/2308.01862, vol. abs/2308.03549, 2023.
2023. [900] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts,
[887] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, G. K. Savova, R. H. Mak, and D. S. Bitterman, “The
and Z. Liu, “Chateval: Towards better llm-based utility of chatgpt for cancer treatment information,”
evaluators through multi-agent debate,” CoRR, vol. medRxiv, 2023.
abs/2308.07201, 2023. [901] K. Malinka, M. Peresı́ni, A. Firc, O. Hujnak, and
[888] R. Li, T. Patel, and X. Du, “PRD: peer rank and dis- F. Janus, “On the educational impact of chatgpt:
cussion improve large language model based evalu- Is artificial intelligence ready to obtain a university
ations,” CoRR, vol. abs/2307.02762, 2023. degree?” CoRR, vol. abs/2303.11146, 2023.
[889] L. Zhu, X. Wang, and X. Wang, “Judgelm: Fine-tuned [902] T. Susnjak, “Chatgpt: The end of online exam in-
large language models are scalable judges,” CoRR, tegrity?” CoRR, vol. abs/2212.09292, 2022.
vol. abs/2310.17631, 2023. [903] K. Tan, T. Pang, and C. Fan, “Towards applying
[890] Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal, powerful large ai models in classroom teaching: Op-
and D. Chen, “Evaluating large language mod- portunities, challenges and prospects,” 2023.
els at evaluating instruction following,” CoRR, vol. [904] F. Kamalov and I. Gurrib, “A new era of artificial
abs/2310.07641, 2023. intelligence in education: A multifaceted revolution,”
[891] R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim, CoRR, vol. abs/2305.18303, 2023.
and D. Kang, “Benchmarking cognitive biases in [905] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert,
large language models as evaluators,” CoRR, vol. D. Dementieva, F. Fischer, U. Gasser, G. Groh,
abs/2309.17012, 2023. S. Günnemann, E. Hüllermeier et al., “Chatgpt for
[892] P. West, X. Lu, N. Dziri, F. Brahman, L. Li, good? on opportunities and challenges of large lan-
J. D. Hwang, L. Jiang, J. Fisher, A. Ravichander, guage models for education,” Learning and Individual
K. Chandu, B. Newman, P. W. Koh, A. Ettinger, Differences, vol. 103, p. 102274, 2023.
and Y. Choi, “The generative AI paradox: ”what [906] A. Blair-Stanek, N. Holzenberger, and B. V. Durme,
it can create, it may not understand”,” CoRR, vol. “Can GPT-3 perform statutory reasoning?” CoRR,
abs/2311.00059, 2023. vol. abs/2302.06100, 2023.
[893] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. [907] D. Trautmann, A. Petrova, and F. Schilder, “Legal
Yu, X. Song, and D. Zhou, “Large language mod- prompt engineering for multilingual legal judgement
els cannot self-correct reasoning yet,” CoRR, vol. prediction,” CoRR, vol. abs/2212.02199, 2022.
abs/2310.01798, 2023. [908] J. H. Choi, K. E. Hickman, A. Monahan, and
[894] K. Stechly, M. Marquez, and S. Kambhampati, “GPT- D. Schwarcz, “Chatgpt goes to law school,” Available
4 doesn’t know it’s wrong: An analysis of itera- at SSRN, 2023.
tive prompting for reasoning problems,” CoRR, vol. [909] J. J. Nay, “Law informs code: A legal informatics
abs/2310.12397, 2023. approach to aligning artificial intelligence with hu-
[895] O. Nov, N. Singh, and D. M. Mann, “Putting chat- mans,” CoRR, vol. abs/2209.13020, 2022.
gpt’s medical advice to the (turing) test,” CoRR, vol. [910] F. Yu, L. Quartey, and F. Schilder, “Legal prompting:
abs/2301.10035, 2023. Teaching a language model to think like a lawyer,”
[896] K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Anani- CoRR, vol. abs/2212.01326, 2022.
adou, “On the evaluations of chatgpt and emotion- [911] D. Trautmann, A. Petrova, and F. Schilder, “Legal
enhanced prompting for mental health analysis,” prompt engineering for multilingual legal judgement
CoRR, vol. abs/2304.03347, 2023. prediction,” CoRR, vol. abs/2212.02199, 2022.
[897] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, [912] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli,
A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. “Understanding the capabilities, limitations, and so-
Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes cietal impact of large language models,” CoRR, vol.
medicine easy to swallow: An exploratory case abs/2102.02503, 2021.
study on simplified radiology reports,” CoRR, vol. [913] Z. Sun, “A short survey of viewing large language
abs/2212.14882, 2022. models in legal aspect,” CoRR, vol. abs/2303.09136,
[898] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wul- 2023.
czyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, [914] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-
D. Neal, M. Schaekermann, A. Wang, M. Amin, muslim bias in large language models,” in AIES
S. Lachgar, P. A. Mansfield, S. Prakash, B. Green, ’21: AAAI/ACM Conference on AI, Ethics, and Society,
E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu, Virtual Event, USA, May 19-21, 2021, M. Fourcade,
R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral, B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM,
D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi, 2021, pp. 298–306.
A. Karthikesalingam, and V. Natarajan, “Towards [915] A. Shah and S. Chava, “Zero is not hero yet: Bench-
135

marking zero-shot performance of llms for financial is a remarkable tool – for experts,” CoRR, vol.
tasks,” CoRR, vol. abs/2305.16633, 2023. abs/2306.03102, 2023.
[916] D. Araci, “Finbert: Financial sentiment analysis [932] O. O. Buruk, “Academic writing with GPT-3.5: reflec-
with pre-trained language models,” CoRR, vol. tions on practices, efficacy and transparency,” CoRR,
abs/1908.10063, 2019. vol. abs/2304.11079, 2023.
[917] J. C. S. Alvarado, K. Verspoor, and T. Baldwin, [933] R. Liu and N. B. Shah, “Reviewergpt? an exploratory
“Domain adaption of named entity recognition to study on using large language models for paper
support credit risk assessment,” in Proceedings of reviewing,” CoRR, vol. abs/2306.00622, 2023.
the Australasian Language Technology Association Work- [934] M. Kosinski, “Theory of mind may have sponta-
shop, ALTA 2015, Parramatta, Australia, December 8 - 9, neously emerged in large language models,” CoRR,
2015, B. Hachey and K. Webster, Eds. ACL, 2015, vol. abs/2302.02083, 2023.
pp. 84–90. [935] M. M. Amin, E. Cambria, and B. W. Schuller, “Will
[918] G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond affective computing emerge from foundation models
classification: Financial reasoning in state-of-the-art and general ai? A first evaluation on chatgpt,” CoRR,
language models,” CoRR, vol. abs/2305.01505, 2023. vol. abs/2303.03186, 2023.
[919] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A [936] G. Sridhara, R. H. G., and S. Mazumdar, “Chatgpt: A
large chinese financial chat model with hundreds of study on its utility for ubiquitous software engineer-
billions parameters,” arXiv preprint arXiv:2305.12002, ing tasks,” CoRR, vol. abs/2305.16837, 2023.
2023. [937] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li,
[920] H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open- G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian,
source financial large language models,” CoRR, vol. Y. Liu, and Z. Chen, “Automatic code summariza-
abs/2306.06031, 2023. tion via chatgpt: How far are we?” CoRR, vol.
[921] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, abs/2305.12865, 2023.
“Pubmedqa: A dataset for biomedical research ques- [938] C. S. Xia and L. Zhang, “Conversational automated
tion answering,” in Proceedings of the 2019 Conference program repair,” CoRR, vol. abs/2301.13246, 2023.
on Empirical Methods in Natural Language Processing [939] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das,
and the 9th International Joint Conference on Natu- and S. Reddy, “The impact of positional encoding
ral Language Processing, EMNLP-IJCNLP 2019, Hong on length generalization in transformers,” CoRR, vol.
Kong, China, November 3-7, 2019, 2019, pp. 2567–2577. abs/2305.19466, 2023.
[922] A. Krithara, A. Nentidis, K. Bougiatiotis, and [940] W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava,
G. Paliouras, “Bioasq-qa: A manually curated corpus R. Hou, L. Martin, R. Rungta, K. A. Sankararaman,
for biomedical question answering,” 2022. B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang,
[923] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis,
and H. Chen, “Oceangpt: A large language model S. Wang, and H. Ma, “Effective long-context scaling
for ocean science tasks,” CoRR, vol. abs/2310.02031, of foundation models,” CoRR, vol. abs/2309.16039,
2023. 2023.
[924] C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K. [941] kaiokendev, “Things I’m learning while training su-
Dam, M. Zhang, J. U. Kim, S. T. Kim, J. Choi, G. Park, perhot.” 2023.
S. Bae, L. Lee, P. Hui, I. S. Kweon, and C. S. Hong, [942] Z. Dong, T. Tang, J. Li, W. X. Zhao, and J. Wen,
“One small step for generative ai, one giant leap for “BAMBOO: A comprehensive benchmark for evalu-
AGI: A complete survey on chatgpt in AIGC era,” ating long text modeling capacities of large language
CoRR, vol. abs/2304.06488, 2023. models,” CoRR, vol. abs/2309.13345, 2023.
[925] M. Haman and M. Skolnik, “Using chatgpt to con- [943] J. Su. (2023) Transformer upgrade path: 12, infinite
duct a literature review.” Accountability in research, extrapolation of rerope?
2023. [944] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sun-
[926] Ö. Aydın and E. Karaarslan, “Openai chatgpt gen- dararajan, and S. Naidu, “Giraffe: Adventures in
erated literature review: Digital twin in healthcare,” expanding context lengths in llms,” CoRR, vol.
SSRN Electronic Journal, 2022. abs/2308.10882, 2023.
[927] Y. J. Park, D. Kaplan, Z. Ren, C. Hsu, C. Li, H. Xu, [945] G. Izacard and E. Grave, “Leveraging passage re-
S. Li, and J. Li, “Can chatgpt be used to generate trieval with generative models for open domain
scientific hypotheses?” CoRR, vol. abs/2304.12208, question answering,” in Proceedings of the 16th Con-
2023. ference of the European Chapter of the Association for
[928] M. M. Hassan, R. A. Knipper, and S. K. K. Santu, Computational Linguistics: Main Volume, EACL 2021,
“Chatgpt as your personal data scientist,” CoRR, vol. Online, April 19 - 23, 2021. Association for Compu-
abs/2305.13657, 2023. tational Linguistics, 2021, pp. 874–880.
[929] L. Cheng, X. Li, and L. Bing, “Is GPT-4 a good data [946] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar,
analyst?” CoRR, vol. abs/2305.15038, 2023. O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown,
[930] S. I. M. Hussam Alkaissi, “Artificial hallucinations in and Y. Shoham, “Parallel context windows for large
chatgpt: Implications in scientific writing,” PubMed, language models,” in Proceedings of the 61st Annual
2023. Meeting of the Association for Computational Linguistics
[931] A. Azaria, R. Azoulay, and S. Reches, “Chatgpt (Volume 1: Long Papers), ACL 2023, Toronto, Canada,
136

July 9-14, 2023. Association for Computational [962] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J.
Linguistics, 2023, pp. 6383–6402. Gershman, “Building machines that learn and think
[947] I. Beltagy, M. E. Peters, and A. Cohan, “Long- like people,” CoRR, vol. abs/1604.00289, 2016.
former: The long-document transformer,” CoRR, vol. [963] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran,
abs/2004.05150, 2020. K. Narasimhan, and Y. Cao, “React: Synergizing rea-
[948] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, soning and acting in language models,” CoRR, vol.
“Efficient streaming language models with attention abs/2210.03629, 2022.
sinks,” CoRR, vol. abs/2309.17453, 2023. [964] 2023. [Online]. Available: https://github.com/
[949] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilac- AntonOsika/gpt-engineer
qua, F. Petroni, and P. Liang, “Lost in the middle: [965] X. Team, “Xagent: An autonomous agent for complex
How language models use long contexts,” Transac- task solving,” 2023.
tions of the Association for Computational Linguistics, [966] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin,
vol. 12, pp. 157–173, 2024. and B. Ghanem, “CAMEL: communicative agents for
[950] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and ”mind” exploration of large scale language model
S. Wang, “Lm-infinite: Simple on-the-fly length gen- society,” CoRR, vol. abs/2303.17760, 2023.
eralization for large language models,” CoRR, vol. [967] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang,
abs/2308.16137, 2023. C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou,
[951] A. Bertsch, U. Alon, G. Neubig, and M. R. Gorm- C. Ran, L. Xiao, and C. Wu, “Metagpt: Meta pro-
ley, “Unlimiformer: Long-range transformers with gramming for multi-agent collaborative framework,”
unlimited length input,” CoRR, vol. abs/2305.01625, CoRR, vol. abs/2308.00352, 2023.
2023. [968] C. Pham, B. Liu, Y. Yang, Z. Chen, T. Liu, J. Yuan,
[952] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, B. A. Plummer, Z. Wang, and H. Yang, “Let models
“Memorizing transformers,” in The Tenth Interna- speak ciphers: Multiagent debate through embed-
tional Conference on Learning Representations, ICLR dings,” CoRR, vol. abs/2310.06272, 2023.
2022, Virtual Event, April 25-29, 2022. OpenRe- [969] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian,
view.net, 2022. C.-M. Chan, Y. Qin, Y. Lu, R. Xie et al., “Agent-
[953] Y. Lu, X. Zhou, W. He, J. Zhao, T. Ji, T. Gui, Q. Zhang, verse: Facilitating multi-agent collaboration and ex-
and X. Huang, “Longheads: Multi-head attention ploring emergent behaviors in agents,” arXiv preprint
is secretly a long context processor,” CoRR, vol. arXiv:2308.10848, 2023.
abs/2402.10685, 2024. [970] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu,
[954] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah,
Z. Liu, S. Han, and M. Sun, “Infllm: Unveiling the in- R. W. White, D. Burger, and C. Wang, “Autogen:
trinsic capacity of llms for understanding extremely Enabling next-gen llm applications via multi-agent
long sequences with training-free memory,” CoRR, conversation framework,” 2023.
vol. abs/2402.04617, 2024. [971] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and
[955] Y. Fu, R. Panda, X. Niu, X. Yue, H. Hajishirzi, Y. Kim, I. Mordatch, “Improving factuality and reasoning in
and H. Peng, “Data engineering for scaling language language models through multiagent debate,” CoRR,
models to 128k context,” CoRR, vol. abs/2402.10171, vol. abs/2305.14325, 2023.
2024. [972] Y. Shao, L. Li, J. Dai, and X. Qiu, “Character-llm:
[956] K. Lv, X. Liu, Q. Guo, H. Yan, C. He, X. Qiu, A trainable agent for role-playing,” in Proceedings of
and D. Lin, “Longwanjuan: Towards systematic the 2023 Conference on Empirical Methods in Natural
measurement for long text quality,” CoRR, vol. Language Processing, EMNLP 2023, Singapore, Decem-
abs/2402.13583, 2024. ber 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds.
[957] H. Chen, R. Pasunuru, J. Weston, and A. Celiky- Association for Computational Linguistics, 2023, pp.
ilmaz, “Walking down the memory maze: Beyond 13 153–13 187.
context limit through interactive reading,” CoRR, vol. [973] W. Hua, X. Yang, Z. Li, W. Cheng, and Y. Zhang,
abs/2310.05029, 2023. “Trustagent: Towards safe and trustworthy llm-
[958] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou, based agents through agent constitution,” CoRR, vol.
R. Cotterell, and M. Sachan, “Recurrentgpt: Interac- abs/2402.01586, 2024.
tive generation of (arbitrarily) long text,” CoRR, vol. [974] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng,
abs/2305.13304, 2023. H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and
[959] C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and T. Liu, “A survey on hallucination in large language
J. E. Gonzalez, “Memgpt: Towards llms as operating models: Principles, taxonomy, challenges, and open
systems,” CoRR, vol. abs/2310.08560, 2023. questions,” CoRR, vol. abs/2311.05232, 2023.
[960] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, [975] I. Loshchilov and F. Hutter, “Decoupled weight de-
S. Subramanian, E. Bakhturina, M. Shoeybi, and cay regularization,” in ICLR (Poster). OpenRe-
B. Catanzaro, “Retrieval meets long context large view.net, 2019.
language models,” CoRR, vol. abs/2310.03025, 2023. [976] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee,
[961] S. Russell and P. Norvig, Artificial Intelligence: M. Andersch, M. Shoeybi, and B. Catanzaro, “Re-
A Modern Approach (4th Edition). Pearson, 2020. ducing activation recomputation in large transformer
[Online]. Available: http://aima.cs.berkeley.edu/ models,” in MLSys. mlsys.org, 2023.
137

[977] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, [990] B. Spector and C. Ré, “Accelerating LLM infer-
“Zero: memory optimizations toward training tril- ence with staged speculative decoding,” CoRR, vol.
lion parameter models,” in Proceedings of the Interna- abs/2308.04623, 2023.
tional Conference for High Performance Computing, Net- [991] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to
working, Storage and Analysis, SC 2020, Virtual Event / use large language models while reducing cost and
Atlanta, Georgia, USA, November 9-19, 2020, C. Cuic- improving performance,” CoRR, vol. abs/2305.05176,
chi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM, 2023.
2020, p. 20. [992] M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large
[978] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, language model cascades with mixture of thoughts
S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload: representations for cost-efficient reasoning,” CoRR,
Democratizing billion-scale model training,” in 2021 vol. abs/2310.03094, 2023.
USENIX Annual Technical Conference, USENIX ATC [993] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher,
2021, July 14-16, 2021, I. Calciu and G. Kuenning, Eds. “Non-autoregressive neural machine translation,” in
USENIX Association, 2021, pp. 551–564. ICLR (Poster). OpenReview.net, 2018.
[979] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and [994] C. Wang, J. Zhang, and H. Chen, “Semi-
Y. He, “Zero-infinity: breaking the GPU memory wall autoregressive neural machine translation,” in
for extreme scale deep learning,” in SC. ACM, 2021, EMNLP. Association for Computational Linguistics,
p. 59. 2018, pp. 479–488.
[980] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, [995] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and
“Flashattention: Fast and memory-efficient exact at- T. Dao, “Medusa: Simple LLM inference acceleration
tention with io-awareness,” in NeurIPS, 2022. framework with multiple decoding heads,” CoRR,
[981] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. vol. abs/2401.10774, 2024.
Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses: [996] S. Teerapittayanon, B. McDanel, and H. T. Kung,
System optimizations for enabling training of ex- “Branchynet: Fast inference via early exiting from
treme long sequence transformer models,” CoRR, deep neural networks,” in ICPR. IEEE, 2016, pp.
vol. abs/2309.14509, 2023. 2464–2469.
[982] H. Liu, M. Zaharia, and P. Abbeel, “Ring attention [997] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten,
with blockwise transformers for near-infinite con- and K. Q. Weinberger, “Multi-scale dense networks
text,” CoRR, vol. abs/2310.01889, 2023. for resource efficient image classification,” in ICLR.
[983] Y. Chen, T. Tang, E. Xiang, L. Li, W. X. Zhao, OpenReview.net, 2018.
J. Wang, Y. Chai, and J. Wen, “Towards coarse-to-fine [998] D. Raposo, S. Ritter, B. A. Richards, T. P. Lilli-
evaluation of inference efficiency for large language crap, P. C. Humphreys, and A. Santoro, “Mixture-
models,” CoRR, vol. abs/2404.11502, 2024. of-depths: Dynamically allocating compute in
[984] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, transformer-based language models,” CoRR, vol.
B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, abs/2404.02258, 2024.
“Flexgen: High-throughput generative inference of [999] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng,
large language models with a single GPU,” in ICML, J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang,
ser. Proceedings of Machine Learning Research, vol. M. Chowdhury, and M. Zhang, “Efficient large
202. PMLR, 2023, pp. 31 094–31 116. language models: A survey,” 2024. [Online].
[985] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash- Available: https://arxiv.org/abs/2312.03863
decoding for long-context inference,” 2023. [Online]. [1000] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W.
Available: https://crfm.stanford.edu/2023/10/12/ Mahoney, and K. Keutzer, “A survey of quantization
flashdecoding.html methods for efficient neural network inference,”
[986] C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, CoRR, vol. abs/2103.13630, 2021. [Online]. Available:
J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin, https://arxiv.org/abs/2103.13630
A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed- [1001] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettle-
fastgen: High-throughput text generation for llms moyer, “Llm.int8(): 8-bit matrix multiplication for
via MII and deepspeed-inference,” CoRR, vol. transformers at scale,” CoRR, vol. abs/2208.07339,
abs/2401.08671, 2024. 2022.
[987] Y. Leviathan, M. Kalman, and Y. Matias, “Fast infer- [1002] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han,
ence from transformers via speculative decoding,” in “Awq: Activation-aware weight quantization for llm
International Conference on Machine Learning, 2023. compression and acceleration,” 2023.
[988] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, [1003] Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “PB-
and J. Jumper, “Accelerating large language model LLM: partially binarized large language models,”
decoding with speculative sampling,” CoRR, vol. CoRR, vol. abs/2310.00034, 2023. [Online]. Available:
abs/2302.01318, 2023. https://doi.org/10.48550/arXiv.2310.00034
[989] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, [1004] T. Dettmers, R. Svirschevski, V. Egiazarian,
R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov,
and Z. Jia, “Specinfer: Accelerating generative LLM T. Hoefler, and D. Alistarh, “Spqr: A sparse-
serving with speculative inference and token tree quantized representation for near-lossless LLM
verification,” CoRR, vol. abs/2305.09781, 2023. weight compression,” CoRR, vol. abs/2306.03078,
138

2023. aware quantization for large language models,”


[1005] Z. Guan, H. Huang, Y. Su, H. Huang, N. Wong, and CoRR, vol. abs/2310.08659, 2023. [Online]. Available:
H. Yu, “APTQ: attention-aware post-training mixed- https://doi.org/10.48550/arXiv.2310.08659
precision quantization for large language models,” [1018] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge
CoRR, vol. abs/2402.14866, 2024. [Online]. Available: distillation of large language models,” CoRR,
https://doi.org/10.48550/arXiv.2402.14866 vol. abs/2306.08543, 2023. [Online]. Available:
[1006] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “OWQ: https://doi.org/10.48550/arXiv.2306.08543
outlier-aware weight quantization for efficient fine- [1019] C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii,
tuning and inference of large language models,” in A. Ratner, R. Krishna, C. Lee, and T. Pfister,
Thirty-Eighth AAAI Conference on Artificial Intelligence, “Distilling step-by-step! outperforming larger
AAAI 2024, Thirty-Sixth Conference on Innovative language models with less training data and
Applications of Artificial Intelligence, IAAI 2024, smaller model sizes,” in Findings of the Association for
Fourteenth Symposium on Educational Advances in Computational Linguistics: ACL 2023, Toronto, Canada,
Artificial Intelligence, EAAI 2014, February 20- July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and
27, 2024, Vancouver, Canada, M. J. Wooldridge, N. Okazaki, Eds. Association for Computational
J. G. Dy, and S. Natarajan, Eds. AAAI Press, Linguistics, 2023, pp. 8003–8017. [Online]. Available:
2024, pp. 13 355–13 364. [Online]. Available: https: https://doi.org/10.18653/v1/2023.findings-acl.507
//doi.org/10.1609/aaai.v38i12.29237 [1020] E. Frantar and D. Alistarh, “Sparsegpt: Massive lan-
[1007] G. Xiao, J. Lin, M. Seznec, J. Demouth, and guage models can be accurately pruned in one-
S. Han, “Smoothquant: Accurate and efficient post- shot,” in International Conference on Machine Learning.
training quantization for large language models,” PMLR, 2023, pp. 10 323–10 337.
CoRR, vol. abs/2211.10438, 2022. [Online]. Available: [1021] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the
https://doi.org/10.48550/arXiv.2211.10438 structural pruning of large language models,” Ad-
[1008] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, vances in neural information processing systems, vol. 36,
and Y. He, “Zeroquant: Efficient and affordable post- pp. 21 702–21 720, 2023.
training quantization for large-scale transformers,” [1022] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared
in NeurIPS, 2022. llama: Accelerating language model pre-training via
[1009] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alis- structured pruning,” arXiv preprint arXiv:2310.06694,
tarh, “Gptq: Accurate post-training quantization for 2023.
generative pre-trained transformers,” arXiv preprint [1023] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettle-
arXiv:2210.17323, 2022. moyer, “8-bit optimizers via block-wise quantiza-
[1010] E. Frantar and D. Alistarh, “Optimal brain compres- tion,” 9th International Conference on Learning Repre-
sion: A framework for accurate post-training quanti- sentations, ICLR, 2022.
zation and pruning,” in NeurIPS, 2022. [1024] Y. Ding, W. Fan, L. Ning, S. Wang, H. Li, D. Yin, T.-S.
[1011] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettle- Chua, and Q. Li, “A survey on rag meets llms: To-
moyer, “Qlora: Efficient finetuning of quantized wards retrieval-augmented large language models,”
llms,” arXiv preprint arXiv:2305.14314, 2023. arXiv preprint arXiv:2405.06211, 2024.
[1012] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, [1025] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai,
Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chan- J. Sun, and H. Wang, “Retrieval-augmented gener-
dra, “Llm-qat: Data-free quantization aware training ation for large language models: A survey,” arXiv
for large language models,” 2023. preprint arXiv:2312.10997, 2023.
[1013] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant- [1026] S. Robertson and H. Zaragoza, The probabilistic rele-
v2: Exploring post-training quantization in llms from vance framework: BM25 and beyond, 2009.
comprehensive study to low rank compensation,” [1027] Y. Wang, R. Ren, J. Li, W. X. Zhao, J. Liu, and J.-R.
2023. Wen, “Rear: A relevance-aware retrieval-augmented
[1014] T. Dettmers and L. Zettlemoyer, “The case for 4-bit framework for open-domain question answering,”
precision: k-bit inference scaling laws,” CoRR, vol. arXiv preprint arXiv:2402.17497, 2024.
abs/2212.09720, 2022. [1028] D. Rau, S. Wang, H. Déjean, and S. Clinchant, “Con-
[1015] L. Peiyu, L. Zikang, G. Ze-Feng, G. Dawei, Z. W. Xin, text embeddings for efficient answer generation in
L. Yaliang, D. Bolin, and W. Ji-Rong, “Do emergent rag,” arXiv preprint arXiv:2407.09252, 2024.
abilities exist in quantized large language models: [1029] F. Xu, W. Shi, and E. Choi, “Recomp: Improving
An empirical study,” arXiv preprint arXiv:2307.08072, retrieval-augmented lms with context compression
2023. and selective augmentation,” in The Twelfth Interna-
[1016] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, tional Conference on Learning Representations, 2024.
H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora: [1030] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan,
Quantization-aware low-rank adaptation of large and W. Chen, “Enhancing retrieval-augmented large
language models,” CoRR, vol. abs/2309.14717, 2023. language models with iterative retrieval-generation
[Online]. Available: https://doi.org/10.48550/arXiv. synergy,” in Findings of the Association for Computa-
2309.14717 tional Linguistics: EMNLP 2023, 2023, pp. 9248–9274.
[1017] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, [1031] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao,
W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning- D. Yu, and H. Zhang, “Dense x retrieval: What re-
139

trieval granularity should we use?” arXiv preprint hallucination in natural language generation,” ACM
arXiv:2312.06648, 2023. Comput. Surv., 2023.
[1032] X. Huang, S. Cheng, Y. Shu, Y. Bao, and Y. Qu, [1046] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,
“Question decomposition tree for answering com- E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,
plex questions over knowledge bases,” in Proceedings F. Shi, and S. Shi, “Siren’s song in the AI ocean: A
of the AAAI Conference on Artificial Intelligence, vol. 37, survey on hallucination in large language models,”
no. 11, 2023, pp. 12 924–12 932. arXiv preprint arXiv:2309.01219, 2023.
[1033] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and [1047] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer,
Y. Chang, “Learning to rewrite queries,” in Pro- “Scheduled sampling for sequence prediction with
ceedings of the 25th ACM International on Conference recurrent neural networks,” in NIPS, 2015, pp. 1171–
on Information and Knowledge Management, 2016, pp. 1179.
1443–1452. [1048] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,
[1034] J. Liu and B. Mozafari, “Query rewriting via large A. Askell, S. R. Bowman, N. Cheng, E. Dur-
language models,” arXiv preprint arXiv:2403.09060, mus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec,
2024. T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch,
[1035] F. Ye, M. Fang, S. Li, and E. Yilmaz, “Enhancing N. Schiefer, D. Yan, M. Zhang, and E. Perez, “To-
conversational search: Large language model-aided wards understanding sycophancy in language mod-
informative query rewriting,” in Findings of the As- els,” CoRR, vol. abs/2310.13548, 2023.
sociation for Computational Linguistics: EMNLP 2023, [1049] V. Rawte, P. Priya, S. M. T. I. Tonmoy, S. M. M.
2023, pp. 5985–6006. Zaman, A. P. Sheth, and A. Das, “Exploring the re-
[1036] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. lationship between LLM hallucinations and prompt
Park, “Adaptive-rag: Learning to adapt retrieval- linguistic nuances: Readability, formality, and con-
augmented large language models through question creteness,” CoRR, vol. abs/2309.11064, 2023.
complexity,” arXiv preprint arXiv:2403.14403, 2024. [1050] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li,
[1037] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, A. Celikyilmaz, and J. Weston, “Chain-of-verification
“Llmlingua: Compressing prompts for accelerated reduces hallucination in large language models,”
inference of large language models,” in Proceedings CoRR, vol. abs/2309.11495, 2023.
of the 2023 Conference on Empirical Methods in Natural [1051] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck-
Language Processing, 2023, pp. 13 358–13 376. gpt: Zero-resource black-box hallucination detection
[1038] T. Xu, S. Wu, S. Diao, X. Liu, X. Wang, Y. Chen, for generative large language models,” in EMNLP.
and J. Gao, “Sayself: Teaching llms to express con- Association for Computational Linguistics, 2023, pp.
fidence with self-reflective rationales,” arXiv preprint 9004–9017.
arXiv:2405.20974, 2024. [1052] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu,
[1039] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Ha- “A stitch in time saves nine: Detecting and mitigating
jishirzi, “Self-rag: Learning to retrieve, generate, hallucinations of llms by validating low-confidence
and critique through self-reflection,” arXiv preprint generation,” CoRR, vol. abs/2307.03987, 2023.
arXiv:2310.11511, 2023. [1053] Y. Yehuda, I. Malkiel, O. Barkan, J. Weill, R. Ronen,
[1040] H. Luo, Y.-S. Chuang, Y. Gong, T. Zhang, Y. Kim, and N. Koenigstein, “In search of truth: An interro-
X. Wu, D. Fox, H. Meng, and J. Glass, “Sail: Search- gation approach to hallucination detection,” CoRR,
augmented instruction learning,” arXiv preprint vol. abs/2403.02889, 2024.
arXiv:2305.15225, 2023. [1054] S. Min, K. Krishna, X. Lyu, M. Lewis, W. tau Yih, P. W.
[1041] X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi,
R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis “Factscore: Fine-grained atomic evaluation of factual
et al., “Ra-dit: Retrieval-augmented dual instruction precision in long form text generation,” 2023.
tuning,” arXiv preprint arXiv:2310.01352, 2023. [1055] I. Chern, S. Chern, S. Chen, W. Yuan, K. Feng,
[1042] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, C. Zhou, J. He, G. Neubig, and P. Liu, “Factool:
“Retrieval augmented language model pre-training,” Factuality detection in generative AI - A tool aug-
in International conference on machine learning. PMLR, mented framework for multi-task and multi-domain
2020, pp. 3929–3938. scenarios,” CoRR, vol. abs/2307.13528, 2023.
[1043] K. Lee, M.-W. Chang, and K. Toutanova, “Latent re- [1056] X. Cheng, J. Li, W. X. Zhao, H. Zhang, F. Zhang,
trieval for weakly supervised open domain question D. Zhang, K. Gai, and J.-R. Wen, “Small agent can
answering,” in Proceedings of the 57th Annual Meeting also rock! empowering small language models as
of the Association for Computational Linguistics, 2019, hallucination detector,” CoRR, vol. abs/2406.11277,
pp. 6086–6096. 2024.
[1044] J. Li, J. Chen, R. Ren, X. Cheng, W. X. Zhao, J.-Y. [1057] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,
Nie, and J.-R. Wen, “The dawn after the dark: An A. Askell, S. R. Bowman, E. Durmus, Z. Hatfield-
empirical study on factuality hallucination in large Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. Mc-
language models,” arXiv preprint arXiv:2401.03205, Candlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan,
2024. M. Zhang, and E. Perez, “Towards understanding
[1045] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, sycophancy in language models,” in ICLR. Open-
Y. J. Bang, A. Madotto, and P. Fung, “Survey of Review.net, 2024.
140

[1058] J. W. Wei, D. Huang, Y. Lu, D. Zhou, and Q. V. Le,


“Simple synthetic data reduces sycophancy in large
language models,” CoRR, vol. abs/2308.03958, 2023.
[1059] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty,
Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and
K. Guu, “RARR: researching and revising what lan-
guage models say, using language models,” in ACL
(1). Association for Computational Linguistics, 2023,
pp. 16 477–16 508.
[1060] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-
and-edit: A knowledge-enhanced chain-of-thought
framework,” in ACL (1). Association for Compu-
tational Linguistics, 2023, pp. 5823–5840.
[1061] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sab-
harwal, “Interleaving retrieval with chain-of-thought
reasoning for knowledge-intensive multi-step ques-
tions,” CoRR, vol. abs/2212.10509, 2022.
[1062] K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Watten-
berg, “Inference-time intervention: Eliciting truthful
answers from a language model,” in NeurIPS, 2023.
[1063] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer,
and S. W. Yih, “Trusting your evidence: Halluci-
nate less with context-aware decoding,” CoRR, vol.
abs/2305.14739, 2023.
[1064] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan,
Y. Xie, Y. Li, B. Ding, and J. Zhou, “Federatedscope-
llm: A comprehensive package for fine-tuning large
language models in federated learning,” 2023.
