A Comprehensive Overview of
Large Language Models
Humza Naveed1 , Asad Ullah Khan1,∗ , Shi Qiu2,∗ , Muhammad Saqib3,4,∗ ,
Saeed Anwar5,6 , Muhammad Usman5,6 , Naveed Akhtar7 , Nick Barnes8 , Ajmal Mian9
1 University of Engineering and Technology (UET), Lahore, Pakistan
2 The Chinese University of Hong Kong (CUHK), HKSAR, China
3 University of Technology Sydney (UTS), Sydney, Australia
4 Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
5 King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
6 SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia
7 The University of Melbourne (UoM), Melbourne, Australia
8 Australian National University (ANU), Canberra, Australia
9 The University of Western Australia (UWA), Perth, Australia
Abstract—
Large Language Models (LLMs) have recently demonstrated
remarkable capabilities in natural language processing tasks and
beyond. This success of LLMs has led to a large influx of research
contributions in this direction. These works encompass diverse
topics such as architectural innovations, better training strategies,
context length improvements, fine-tuning, multi-modal LLMs,
robotics, datasets, benchmarking, efficiency, and more. With the
rapid development of techniques and regular breakthroughs in
LLM research, it has become considerably challenging to perceive
the bigger picture of the advances in this direction. Considering
the rapidly emerging plethora of literature on LLMs, it is
imperative that the research community is able to benefit from a
concise yet comprehensive overview of the recent developments
in this field. This article provides an overview of the existing
literature on a broad range of LLM-related concepts. Our self-
contained comprehensive overview of LLMs discusses relevant
background concepts along with covering the advanced topics
at the frontier of research in LLMs. This review article is
intended to not only provide a systematic survey but also a quick
comprehensive reference for the researchers and practitioners
to draw insights from extensive informative summaries of the
existing works to advance the LLM research.
Index Terms—Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking
Fig. 1: The trend of papers released over years containing keywords “Large Language Model”, “Large Language Model + Fine-Tuning”, and “Large Language Model + Alignment”.
Fig. 2: Chronological display of LLM releases: light blue rectangles represent ‘pre-trained’ models, while dark rectangles
correspond to ‘instruction-tuned’ models. Models on the upper half signify open-source availability, whereas those on the
bottom half are closed-source. The chart illustrates the increasing trend towards instruction-tuned models and open-source
models, highlighting the evolving landscape and trends in natural language processing research.
The historical progress in natural language processing (NLP) evolved from statistical to neural language modeling and then from pre-trained language models (PLMs) to LLMs. While conventional language modeling (LM) trains task-specific models in supervised settings, PLMs are trained in a self-supervised setting on a large corpus of text [7], [8], [9] with the aim to learn generic representations shareable among various NLP tasks. After fine-tuning for downstream tasks, PLMs surpass the performance gains of traditional language modeling (LM). The larger PLMs bring more performance gains, which has led to the transitioning of PLMs to LLMs by significantly increasing model parameters (tens to hundreds of billions) [10] and training datasets (many GBs and TBs) [10], [11]. Following this development, numerous LLMs have been proposed in the literature [10], [11], [12], [6], [13], [14], [15]. An increasing trend in the number of released LLMs and names of a few significant LLMs proposed over the years are shown in Fig 1 and Fig 2, respectively.

The early work on LLMs, such as T5 [10] and mT5 [11], employed transfer learning until GPT-3 [6] showed that LLMs are zero-shot transferable to downstream tasks without fine-tuning. LLMs accurately respond to task queries when prompted with task descriptions and examples. However, pre-trained LLMs fail to follow user intent and perform worse in zero-shot settings than in few-shot. Fine-tuning them with task instructions data [16], [17], [18], [19] and aligning with human preferences [20], [21] enhances generalization to unseen tasks, improving zero-shot performance significantly and reducing misaligned behavior.

In addition to better generalization and domain adaptation, LLMs appear to have emergent abilities, such as reasoning, planning, decision-making, in-context learning, answering in zero-shot settings, etc. These abilities are known to be acquired by them due to their gigantic scale even when the pre-trained LLMs are not trained specifically to possess these attributes [22], [23], [24]. Such abilities have led to the wide adoption of LLMs in diverse settings, including multi-modal, robotics, tool manipulation, question answering, autonomous agents, etc. Various improvements have also been suggested in these areas either by task-specific training [25], [26], [27], [28], [29], [30], [31] or better prompting [32].

The LLMs' ability to solve diverse tasks with human-level performance comes at a cost of slow training and inference, extensive hardware requirements, and higher running costs. Such requirements have limited their adoption and opened up opportunities to devise better architectures [15], [33], [34], [35] and training strategies [36], [37], [21], [38], [39], [40], [41]. Parameter efficient tuning [38], [41], [40], pruning [42], [43], quantization [44], [45], knowledge distillation, and context length interpolation [46], [47], [48], [49] among others
Fig. 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4.
Inference 5. Evaluation 6. Applications 7. Challenges
are some of the methods widely studied for efficient LLM utilization.
Due to the success of LLMs on a wide variety of tasks, the research literature has recently experienced a large influx of LLM-related contributions. Researchers have organized the LLMs literature in surveys [50], [51], [52], [53], and topic-specific surveys in [54], [55], [56], [57], [58]. In contrast to these surveys, our contribution focuses on providing a comprehensive yet concise overview of the general direction of LLM research. This article summarizes architectural and training details of pre-trained LLMs and delves deeper into the details of concepts like fine-tuning, multi-modal LLMs, robotics, augmented LLMs, datasets, evaluation, and others to provide a self-contained comprehensive overview. Our key contributions are summarized as follows.
• We present a survey on the developments in LLM research, providing a concise comprehensive overview of the direction.
• We present extensive summaries of pre-trained models that include fine-grained details of architecture and training.
• We summarize major findings of the popular contributions and provide a detailed discussion on the key design and development aspects of LLMs to help practitioners effectively leverage this technology.
• In this self-contained article, we cover a range of concepts to present the general direction of LLMs comprehensively, including background, pre-training, fine-tuning, multi-modal LLMs, augmented LLMs, LLMs-powered agents, datasets, evaluation, etc.
We loosely follow the existing terminologies to ensure a standardized outlook of this research direction. For instance, following [50], our survey discusses pre-trained LLMs with 10B parameters or more. We refer the readers interested in smaller pre-trained models to [51], [52], [53].
The organization of this paper is as follows. Section II discusses the background of LLMs. Section III focuses on LLMs overview, architectures, training pipelines and strategies, fine-tuning, and utilization in different aspects. Section IV highlights the configuration and parameters that play a crucial role in the functioning of these models. Summary and discussions are presented in section III-H. The LLM training and evaluation, datasets, and benchmarks are discussed in section V, followed by challenges and future directions and conclusion in sections VII and VIII, respectively.

II. BACKGROUND
We provide the relevant background to understand the fundamentals related to LLMs in this section. We briefly discuss necessary components in LLMs and refer the readers interested in details to the original works.

A. Tokenization
Tokenization [59] is an essential pre-processing step in LLM training that parses the text into non-decomposing units called tokens. Tokens can be characters, subwords [60], symbols [61], or words, depending on the tokenization process. Some of the commonly used tokenization schemes in LLMs include wordpiece [62], byte pair encoding (BPE) [61], and unigramLM [60]. Readers are encouraged to refer to [63] for a detailed survey.

B. Encoding Positions
The transformer processes input sequences in parallel and independently of each other. Moreover, the attention module in the transformer does not capture positional information. As a result, positional encodings were introduced in the transformer [64], where a positional embedding vector is added to the token embedding. Variants of positional embedding include absolute, relative, or learned positional encodings. Within relative encoding, Alibi and RoPE are two widely used positional embeddings in LLMs.
Alibi [65]: It subtracts a scalar bias from the attention score that increases with the distance between token positions. This favors using recent tokens for attention.
RoPE [66]: It rotates query and key representations at an angle proportional to the token's absolute position in the input sequence, resulting in a relative positional encoding scheme which decays with the distance between the tokens.

C. Attention in LLMs
The attention assigns weights to input tokens based on importance so that the model emphasizes relevant over irrelevant tokens. Attention in transformers [64] calculates query, key, and value mappings for input sequences, where the attention score is obtained by multiplying the query and key, later used to weight values. We discuss different attention strategies used in LLMs below.
Self-Attention [64]: Calculates attention using queries, keys, and values from the same block (encoder or decoder).
Cross Attention: It is used in encoder-decoder architectures, where encoder outputs are the queries, and key-value pairs come from the decoder.
Sparse Attention [67]: Self-attention has O(n²) time complexity which becomes infeasible for large sequences. To speed up the computation, sparse attention [67] iteratively calculates the attention in sliding windows for speed gains.
Flash Attention [68]: Memory access is the major bottleneck in calculating attention using GPUs. To speed up, flash attention employs input tiling to minimize the memory reads and writes between the GPU high bandwidth memory (HBM) and the on-chip SRAM.

D. Activation Functions
The activation functions serve a crucial role in the curve-fitting abilities of the neural networks [69]. We discuss activation functions used in LLMs in this section.
ReLU [70]: Rectified linear unit (ReLU) is defined as
ReLU(x) = max(0, x).  (1)
GeLU [71]: Gaussian Error Linear Unit (GeLU) is the combination of ReLU, dropout [72] and zoneout [73].
GLU variants [74]: Gated Linear Unit [75] is a neural network layer that is an element-wise product (⊗) of a linear transformation and a sigmoid transformed (σ) linear projection of the input, given as
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c),  (2)
where x is the input of the layer and W, b, V, and c are learned parameters. Other GLU variants [74] used in LLMs are:
ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).

E. Layer Normalization
Layer normalization leads to faster convergence and is an integrated component of transformers [64]. Besides LayerNorm [76] and RMSNorm [77], LLMs use pre-layer normalization [78], applying it before multi-head attention (MHA). Pre-norm is shown to provide training stability in LLMs. Another normalization variant, DeepNorm [79], fixes the larger gradient issue in pre-norm.

F. Distributed LLM Training
This section describes distributed LLM training approaches briefly. More details are available in [13], [37], [80], [81].
Data Parallelism: Data parallelism replicates the model on multiple devices where data in a batch gets divided across
Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs
to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning.
and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer to alignment-tuning as aligning with human preferences, while occasionally the literature uses the term alignment for different purposes.
1. Pre-Training: In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures with different building blocks and loss functions in sections II-E, II-D, II-J.
2. Fine-Tuning: There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches.
Transfer Learning: The pre-trained LLMs perform well for various tasks [6], [15]. But to improve the performance for a downstream task, pre-trained models are fine-tuned with the task-specific data [10], [11], known as transfer learning.
Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction-formatted data, i.e., an instruction and an input-output pair. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improves zero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16], [50], [92].
Alignment-tuning: LLMs are prone to generating false, biased, and harmful text. To make them helpful, honest, and harmless, models are aligned using human feedback. Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20], [21], [93]. It ensures LLMs operate according to human intentions and values. A model is defined to be an “aligned” model if the model fulfills the three criteria of helpful, honest, and harmless, or “HHH” [94].
Researchers employ reinforcement learning with human feedback (RLHF) [95] for model alignment. In RLHF, a model fine-tuned on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF.
Reward modeling: trains a model to rank generated responses according to human preferences using a classification objective. To train the classifier, humans annotate LLM-generated responses based on the HHH criteria.
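To make the reward-modeling objective concrete, the sketch below shows a pairwise ranking loss of the kind commonly used to train RLHF reward models on human-annotated preference pairs. It is a minimal, illustrative PyTorch snippet; the function and variable names are assumptions rather than the exact recipe of any specific paper.

import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen, reward_rejected):
    # Both tensors hold scalar rewards for a batch of response pairs,
    # where "chosen" is the human-preferred response.
    # Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the reward
    # model to score preferred responses higher than rejected ones.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random scores standing in for a reward model's outputs.
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)
loss = reward_ranking_loss(r_chosen, r_rejected)
loss.backward()
print(float(loss))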
Fig. 8: The image, sourced from the article [103], shows an example of the PanGu-α architecture.
1.5 CPM-2 [12]: Cost-efficient Pre-trained language Models (CPM-2) pre-trains bilingual (English and Chinese) 11B and 198B mixture-of-experts (MoE) models on the WuDaoCorpus [104] dataset. The tokenization process removes “_” white space tokens in the sentencepiece tokenizer. The models are trained with knowledge inheritance, starting with only the Chinese language in the first stage and then adding English and Chinese data. This trained model gets duplicated multiple times to initialize the 198B MoE model. Moreover, to use the model for downstream tasks, CPM-2 experimented with both complete fine-tuning and prompt fine-tuning as in [40], where only prompt-related parameters are updated by inserting prompts at various positions, front, middle, and back. CPM-2 also proposes INFMOE, a memory-efficient framework with a strategy to dynamically offload parameters to the CPU for inference at a 100B scale. It overlaps data movement with inference computation for lower inference time.
1.6 ERNIE 3.0 [105]: ERNIE 3.0 takes inspiration from multi-task learning to build a modular architecture using Transformer-XL [106] as the backbone. The universal representation module is shared by all the tasks, which serves as the basic block for task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primarily focused on the Chinese language, claims to train on the largest Chinese text corpora for LLM training, and achieved state-of-the-art in 54 Chinese NLP tasks.
1.7 Jurassic-1 [107]: A pair of auto-regressive language models, including a 7B-parameter J1-Large model and a 178B-parameter J1-Jumbo model. The training vocabulary of Jurassic-1 comprises word pieces, complete words, and multi-word expressions without any word boundaries, where possible out-of-vocabulary instances are interpreted as Unicode bytes. Compared to the GPT-3 counterparts, the Jurassic-1 models apply a more balanced depth-to-width self-attention architecture [108] and an improved tokenizer for a faster prediction based on broader resources, achieving a comparable performance in zero-shot learning tasks and a superior performance in few-shot learning tasks given the ability to feed more examples as a prompt.
1.8 HyperCLOVA [109]: A Korean language model with GPT-3 architecture.
1.9 Yuan 1.0 [110]: Trained on a Chinese corpus with 5TB of high-quality text collected from the Internet. A Massive Data Filtering System (MDFS) built on Spark is developed to process the raw data via coarse and fine filtering techniques. To speed up the training of Yuan 1.0, with the aim of saving energy expenses and carbon emissions, various factors that improve the performance of distributed training are incorporated in the architecture and training: increasing the hidden size improves pipeline and tensor parallelism performance, larger micro-batches improve pipeline parallelism performance, and a higher global batch size improves data parallelism performance. In practice, the Yuan 1.0 model performs well on text classification, Winograd Schema, natural language inference, and reading comprehension tasks.
1.10 Gopher [111]: The Gopher family of models ranges from 44M to 280B parameters in size to study the effect of scale on the LLMs performance. The 280B model beats GPT-3 [6], Jurassic-1 [107], MT-NLG [112], and others on 81% of the evaluated tasks.
1.11 ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0 by training a larger model with 26x the number of parameters of the latter. This bigger model outperformed other state-of-the-art models in 68 NLP tasks. LLMs produce text with incorrect facts. In order to have control of the generated text with factual consistency, ERNIE 3.0 Titan adds another task, Credible and Controllable Generations, to its multi-task learning setup. It introduces additional self-supervised adversarial and controllable language modeling losses to the pre-training step, which enables ERNIE 3.0 Titan to beat other LLMs in their manually selected Factual QA task set evaluations.
1.12 GPT-NeoX-20B [113]: An auto-regressive model that largely follows GPT-3 with a few deviations in architecture design, trained on the Pile dataset without any data deduplication. GPT-NeoX has parallel attention and feed-forward layers in a transformer block, given in Eq. 4, that increases throughput by 15%. It uses rotary positional embedding [66], applying it to only 25% of the embedding vector dimension as in [114]. This reduces the computation without performance degradation. Opposite to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B uses only dense layers. The hyperparameter tuning at this scale is difficult; therefore, the model chooses hyperparameters from the method [6] and interpolates values between the 13B and 175B models for the 20B model. The model training is distributed among GPUs using both tensor and pipeline parallelism.
x + Attn(LN1(x)) + FF(LN2(x))  (4)
1.13 OPT [14]: It is a clone of GPT-3, developed with the intention to open-source a model that replicates GPT-3 performance. Training of OPT employs dynamic loss scaling [115] and restarts from an earlier checkpoint with a lower learning rate whenever loss divergence is observed. Overall, the performance of the OPT-175B model is comparable to the GPT-3 175B model.
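As an illustration of Eq. 4, the sketch below contrasts the parallel attention/feed-forward block used in GPT-NeoX-20B with the usual cascaded formulation. It is a minimal PyTorch sketch using standard modules; the layer sizes and the use of nn.MultiheadAttention are illustrative assumptions, not the exact GPT-NeoX implementation.

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    # Computes y = x + Attn(LN1(x)) + FF(LN2(x)) as in Eq. 4,
    # instead of the cascaded y = x' + FF(LN2(x')) with x' = x + Attn(LN1(x)).
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        # Attention and feed-forward branches are computed from x in parallel.
        return x + a + self.ff(self.ln2(x))

x = torch.randn(2, 16, 256)        # (batch, sequence, d_model)
print(ParallelBlock()(x).shape)    # torch.Size([2, 16, 256])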
activations recomputed in the backward pass, as in [127].
LLaMA-2 [21]: This work is more focused towards fine-tuning a safer and better LLaMA-2-Chat model for dialogue generation. The pre-trained model has 40% more training data with a larger context length and grouped-query attention.
1.24 PanGu-Σ [128]: An autoregressive model with parameters copied from PanGu-α and extended to a trillion scale with Random Routed Experts (RRE); the architectural diagram is shown in Figure 10. RRE is similar to the MoE architecture, with distinctions at the second level, where tokens are randomly routed to experts in a domain instead of using a learnable gating method. The model has bottom layers densely activated and shared across all domains, whereas top layers are sparsely activated according to the domain. This training style allows extracting task-specific models and reduces catastrophic forgetting effects in case of continual learning.
2. Coding:
2.1 CodeGen [129]: CodeGen has a similar architecture to PaLM [15], i.e., parallel attention, MLP layers, and RoPE embeddings. The model is trained on both natural language and programming language data sequentially (trained on the first dataset, then the second, and so on) on the following datasets: 1) PILE, 2) BIGQUERY, and 3) BIGPYTHON. CodeGen proposed a multi-step approach to synthesizing code. The purpose is to simplify the generation of long sequences, where the previous prompt and generated code are given as input with the next prompt to generate the next code sequence. CodeGen open-sources a Multi-Turn Programming Benchmark (MTPB) to evaluate multi-step program synthesis.
2.2 Codex [130]: This LLM is trained on a subset of public Python Github repositories to generate code from docstrings. Computer programming is an iterative process where the programs are often debugged and updated before fulfilling the requirements. Similarly to this, Codex generates 100 versions of a program by repetitive sampling for a given description, which produces a working solution for 77.5% of the problems passing unit tests. Its powerful version powers Github Copilot2.
2.3 AlphaCode [131]: A set of large language models, ranging from 300M to 41B parameters, designed for competition-level code generation tasks. It uses multi-query attention [132] to reduce memory and cache costs. Since competitive programming problems highly require deep reasoning and an understanding of complex natural language algorithms, the AlphaCode models are pre-trained on filtered GitHub code in popular languages and then fine-tuned on a new competitive programming dataset named CodeContests. The CodeContests dataset mainly contains problems, solutions, and test cases collected from the Codeforces platform3. The pre-training employs standard language modeling objectives, while GOLD [133] with tempering [134] serves as the training objective for the fine-tuning on CodeContests data. To evaluate the performance of AlphaCode, simulated programming competitions are hosted on the Codeforces platform: overall, AlphaCode ranks at the top 54.3% among over 5000 competitors, where its Codeforces rating is within the top 28% of recently participated users.
2.4 CodeT5+ [34]: CodeT5+ is based on CodeT5 [135], with a shallow encoder and deep decoder, trained in multiple stages, initially on unimodal data (code) and later on bimodal data (text-code pairs). Each training stage has different training objectives and activates different model blocks, encoder, decoder, or both, according to the task. The unimodal pre-training includes span denoising and CLM objectives, whereas bimodal pre-training objectives contain contrastive learning, matching, and CLM for text-code pairs. CodeT5+ adds special tokens with the text to enable task modes, for example, [CLS] for contrastive loss, [Match] for text-code matching, etc.
2.5 StarCoder [136]: A decoder-only model with the SantaCoder architecture, employing Flash attention to scale up the context length to 8k. StarCoder trains an encoder to filter names, emails, and other personal data from the training data. Its fine-tuned variant outperforms PaLM, LLaMA, and LAMDA on HumanEval and MBPP benchmarks.
3. Scientific Knowledge:
3.1 Galactica [137]: Galactica is trained on a large curated corpus of human scientific knowledge with 48 million papers, textbooks, lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more, using the metaseq library, which is built on PyTorch and fairscale [138]. The model wraps reasoning datasets with a <work> token to provide step-by-step reasoning context to the model, which has been shown to improve the performance on reasoning tasks.
4. Dialog:
4.1 LaMDA [139]: A decoder-only model pre-trained on public dialog data, public dialog utterances, and public web documents, where more than 90% of the pre-training data is in English. LaMDA is trained with the objective of producing responses that exhibit high levels of quality, safety, and groundedness. To achieve this, discriminative and generative fine-tuning techniques are incorporated to enhance the model’s safety and quality aspects. As a result, the LaMDA models can be utilized as a general language model performing various tasks.
5. Finance:
5.1 BloombergGPT [140]: A non-causal decoder model trained using both financial (“FINPILE” from the Bloomberg archive) and general-purpose datasets. The model’s architecture is similar to the BLOOM [13] and OPT [14]. It allocates 50B parameters to different blocks of the model using the approach in [108]. For effective training, BloombergGPT packs documents together with <|endoftext|> to use the maximum sequence length, uses a warmup batch size starting from 1024 to 2048, and manually reduces the learning rate multiple times during the training.
5.2 Xuan Yuan 2.0 [141]: A Chinese financial chat model with BLOOM’s [13] architecture, trained on a combination of general purpose, financial, general purpose instructions, and financial institutions datasets. Xuan Yuan 2.0 combined the pre-training and fine-tuning stages to avoid catastrophic forgetting.
2 https://github.com/features/copilot
3 https://codeforces.com/
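The document-packing strategy mentioned for BloombergGPT can be sketched as follows: tokenized documents are concatenated with an end-of-text separator and sliced into fixed-length training sequences so that no context window is left partially empty. This is an illustrative sketch only; the tokenizer, separator id, and sequence length are placeholders, not BloombergGPT's actual configuration.

from typing import List

def pack_documents(token_docs: List[List[int]], eot_id: int, seq_len: int) -> List[List[int]]:
    # Concatenate all documents, separating them with the end-of-text token,
    # then cut the resulting stream into full-length training sequences.
    stream: List[int] = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eot_id)
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

# Toy usage: three short "documents", separator id 0, sequences of length 8.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
print(pack_documents(docs, eot_id=0, seq_len=8))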
TABLE I: Noteworthy findings and insights from pre-trained Large Language Models.

T5
• An encoder-decoder with shared parameters performs comparably to the variant where parameters are not shared
• Fine-tuning model layers (adapter layers) works better than the conventional way of training on only classification layers

GPT-3
• Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta-learners

mT5
• Large multi-lingual models perform equivalently to single language models on downstream tasks. However, smaller multi-lingual models perform worse

Gopher
• Relative encodings enable models to be evaluated for longer sequences than those on which they were trained.

ERNIE 3.0 Titan
• This LLM builds on top of ERNIE 3.0 and adds a self-supervised adversarial loss to distinguish whether a text is generated or the original one.
• This ability to distinguish between real and generated text improves the LLM’s performance as compared to ERNIE 3.0.

GPT-NeoX-20B
• Parallel attention + FF layers speed up training by 15% with the same performance as with cascaded layers
• Initializing feed-forward output layers before residuals with the scheme in [142] avoids activations from growing with increasing depth and width
• Training on the Pile outperforms GPT-3 on five-shot

OPT
• Restart training from an earlier checkpoint with a lower learning rate if loss diverges
• The model is prone to generating repetitive text and getting stuck in a loop

BLOOM
• None

Galactica
• Galactica’s performance has continued to improve across validation set, in-domain, and out-of-domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing research on LLMs.
• A working memory token approach can achieve strong performance over existing methods on mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%).

GLaM
• The feed-forward component of each Transformer layer can be replaced with a mixture-of-experts (MoE) module consisting of a set of independent feed-forward networks (i.e., the ‘experts’). By sparsely activating these experts, the model capacity can be maintained while much computation is saved.
• By leveraging sparsity, we can make significant strides toward developing high-quality NLP models while simultaneously reducing energy consumption. Consequently, MoE emerges as a robust candidate for future scaling endeavors.
• The model trained on filtered data shows consistently better performances on both NLG and NLU tasks, where the effect of filtering is more significant on the former tasks.
• Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for the downstream tasks.
• The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in the MoE layer. Given a fixed budget of computation, more experts contribute to better predictions.

LaMDA
• The model can be fine-tuned to learn to call different external information resources and tools.

AlphaCode
• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed with a shallower encoder and a deeper decoder.
• To achieve better performances, it is necessary to employ strategies such as massively scaling up sampling, followed by the filtering and clustering of samples into a compact set.
• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-scale sampling is crucial.
• Simplifying problem descriptions can effectively improve the model’s performance.

GLM-130B
• Pre-training data with a small proportion of multi-task instruction data improves the overall model performance

CodeGen
• Multi-step prompting for code synthesis leads to a better user intent understanding and code generation

LLaMA
• LLaMA is open-source and can be fine-tuned or continually pre-trained to develop new models or instruction-based tools.
• A few optimizations are proposed to improve the training efficiency of LLaMA, such as efficient implementation of multi-head self-attention and a reduced amount of activations during back-propagation.
• Training exclusively on public data can also achieve state-of-the-art performance.
• A constant performance improvement is gained when scaling the model.
• Smaller models can also realize good performances using more training data and time.

PanGu-Σ
• Sparse models provide the benefits of large models at a lower computation cost
• Randomly Routed Experts reduce catastrophic forgetting effects, which in turn is essential for continual learning
• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment, which is cost-efficient while maintaining a performance similar to the original

BloombergGPT
• Pre-training with general-purpose and task-specific data improves task performance without hurting other model capabilities

XuanYuan 2.0
• Combining pre-training and fine-tuning stages in a single training run avoids catastrophic forgetting

CodeT5+
• Causal LM is crucial for a model’s generation capability in encoder-decoder architectures
• Multiple training objectives like span corruption, Causal LM, matching, etc., complement each other for better performance

StarCoder
• The HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

LLaMA-2
• A model trained on unfiltered data is more toxic but may perform better on downstream tasks after fine-tuning
• A model trained on unfiltered data requires fewer samples for safety alignment

PaLM-2
• Data quality is important to train better models
• Model and data size should be scaled with 1:1 proportions
• Smaller models trained for larger iterations outperform larger models
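To illustrate the GLaM finding above that the dense feed-forward component of a transformer layer can be replaced with a sparsely activated mixture-of-experts module, the sketch below routes each token to its top-2 experts. This is a minimal, illustrative PyTorch layer; the expert count, routing rule, and sizes are assumptions, and it omits the load-balancing losses used in practice.

import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)          # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(MoEFeedForward()(tokens).shape)                  # torch.Size([16, 512])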
B. Fine-Tuned LLMs
Pre-trained LLMs have excellent generalization abilities to
unseen tasks. However, because they are generally trained with
the objective of next token prediction, LLMs have limited
capacity to follow user intent and are prone to generate
unethical, toxic or inaccurate responses [20]. For their effective
utilization, LLMs are fine-tuned to follow instructions [16],
[17], [92] and generate safe responses [20], which also results
in increasing zero-shot, few-shot, and cross-task generaliza-
tion [92], [16], [18], with minimal compute increment, e.g.,
0.2% of the total pre-training for PaLM 540B [16].
We review various fine-tuned LLMs and strategies for effective
fine-tuning in this section.
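The instruction-formatted data described above (an instruction plus an input-output pair) is typically rendered into a single training prompt with a fixed template. The snippet below shows one such Alpaca-style template as an illustrative example; the sample, template wording, and field names are assumptions rather than the exact format used by any particular model.

# Hypothetical instruction-tuning sample and a simple prompt template.
sample = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "The movie was a delightful surprise.",
    "output": "positive",
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

prompt = PROMPT_TEMPLATE.format(**sample)
target = sample["output"]   # the loss is usually computed on the response tokens only
print(prompt + target)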
Fig. 10: This example illustrates the PanGu-Σ architecture, as depicted in the image sourced from [128].
1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices are proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training
TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models.

T0
• Multi-task prompting enables zero-shot generalization and outperforms baselines
• Even a single prompt per dataset task is enough to improve performance

WebGPT
• The answer quality of LLMs can be further improved with human feedback.
• To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents.
• Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning.
• Generating answers with references can make labelers easily judge the factual accuracy of answers.

Tk-INSTRUCT
• Instruction tuning leads to a stronger generalization to unseen tasks
• More tasks improve generalization, whereas only increasing task instances does not help
• Supervised trained models are better than generalized models
• Models pre-trained with instructions and examples perform well for different types of inputs

mT0 and BLOOMZ
• Instruction tuning enables zero-shot generalization to tasks never seen before
• Multi-lingual training leads to even better zero-shot generalization for both English and non-English
• Training on machine-translated prompts improves performance for held-out tasks with non-English prompts
• English-only fine-tuning on a multilingual pre-trained language model is enough to generalize to other pre-trained language tasks

OPT-IML
• Task size sampling to create a batch with most of the task examples is important for better performance
• Only example proportional sampling is not enough; training datasets/benchmarks should also be proportional for better generalization/performance
• Fully held-out and partially supervised tasks performance improves by scaling tasks or categories, whereas fully supervised tasks have no effect
• Including small amounts, i.e., 5% of pretraining data during fine-tuning is effective
• Only 1% reasoning data improves the performance, adding more deteriorates performance
• Adding dialogue data makes the performance worse

Flan
• Finetuning with CoT improves performance on held-out tasks
• Fine-tuning along with CoT data improves reasoning abilities
• CoT tuning improves zero-shot reasoning
• Performance improves with more tasks
• Instruction fine-tuning improves usability, which otherwise is challenging for pre-trained models
• Improving the model’s performance with instruction tuning is compute-efficient
• Multitask prompting enables zero-shot generalization abilities in LLM

Sparrow
• The judgments of labelers and the alignments with defined rules can help the model generate better responses.
• Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters.
• The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing.

WizardCoder
• Fine-tuning with instruction-tuning data re-written into a more complex set improves the performance significantly

LLaMA-2-Chat
• The model learns to write safe responses with fine-tuning on safe demonstrations, while an additional RLHF step further improves model safety and makes it less prone to jailbreak attacks

LIMA
• A small amount of high-quality data is enough for fine-tuned model generalization
written by humans, smaller in size, and less diverse. To overcome this, self-instruct [19] proposed an approach to prompt available LLMs to generate instruction-tuning datasets. Self-instruct outperformed models trained on the manually created dataset SUPER-NATURALINSTRUCTIONS (a dataset with 1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks, 1 instruction, and 1 sample per task and iteratively generates new instructions (52k) and instances (82k input-output pairs) using GPT-3 [6]. Contrary to this, Dynosaur [144] uses the meta-data of datasets on Huggingface to prompt LLMs to generate multiple task instruction-tuning datasets.
LLaMA Tuned: Various models in the literature instruction-tune LLaMA [145] with GPT-3 [6] or GPT-4 [146] generated datasets. Among these, Alpaca [147], Vicuna [148], and LLaMA-GPT-4 [149] are a few general-purpose fine-tuned models, where Alpaca is trained on 52k samples from text-davinci-003, Vicuna on 70k samples from ShareGPT.com, and LLaMA-GPT-4 by re-creating Alpaca instructions from GPT-4. Goat [150] fine-tunes LLaMA for arithmetic tasks (1 million samples) by generating data from ChatGPT and outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its success to LLaMA’s consistent tokenization of numbers. HuaTuo [151] is a medical knowledge model, fine-tuned with a generated QA dataset of 8k instructions.
Complex Instructions: Evol-Instruct [152], [153] prompts LLMs to convert given instructions into a more complex set. The instructions are iteratively evolved by re-writing instructions in complex wording and creating new instructions. With this style of automated instruction generation, WizardLM [152] (LLaMA fine-tuned on 250k instructions) outperforms Vicuna and Alpaca, and WizardCoder [153] (fine-tuned StarCoder) beats Claude-Plus, Bard, and others.
3. Aligning with Human Preferences: Incorporating human preferences into LLMs presents a significant advantage in mitigating undesirable behaviors and ensuring accurate outputs. The initial work on alignment, such as InstructGPT [20], aligns GPT-3 using a 3-step approach: instruction-tuning, reward modeling, and fine-tuning with reinforcement learning (RL). The GPT-3 supervised fine-tuned on demonstrations is queried to generate responses, which human labelers rank according to human values, and a reward model is trained on the ranked data. Lastly, the GPT-3 is trained with proximal policy optimization (PPO) using rewards on the generated data from the reward model. LLaMA 2-Chat [21] improves alignment by dividing reward modeling into helpfulness and safety rewards and using rejection sampling in addition to PPO. The initial four versions of LLaMA 2-Chat are fine-tuned with rejection sampling and then with PPO on top of rejection sampling.
Aligning with Supported Evidence: This style of alignment allows the model to generate responses with proofs and facts, reduces hallucination, and assists humans more effectively, which increases trust in the model’s output. Similar to the RLHF training style, a reward model is trained to rank generated responses containing web citations in answers to questions, which is later used to train the model, as in GopherCite [154], WebGPT [155], and Sparrow [156]. The ranking model in Sparrow [156] is divided into two branches, preference reward and rule reward, where human annotators adversarially probe the model to break a rule. These two rewards together rank a response to train with RL.
Aligning Directly with SFT: The PPO in the RLHF pipeline is complex, memory-intensive, and unstable, requiring multiple models: reward, value, policy, and reference models. Avoiding this sophisticated alignment pipeline is possible by incorporating minimal changes in the supervised fine-tuning (SFT) pipeline, as in [157], [158], [159], with better or comparable performance to PPO. Direct preference optimization (DPO) [157] trains a model directly on the human-preferred responses to maximize the likelihood of preferred against unpreferred responses, with a per-sample importance weight. Reward ranked fine-tuning (RAFT) [158] fine-tunes the model on responses ranked by the reward model. Preference ranking optimization (PRO) [160] and RRHF [159] penalize the model to rank responses with human preferences and supervised loss. On the other hand, chain-of-hindsight (CoH) [161] provides feedback to the model in language rather than reward, to learn good versus bad responses.
Aligning with Synthetic Feedback: Aligning LLMs with human feedback is slow and costly. The literature suggests a semi-automated process to align LLMs by prompting LLMs to generate helpful, honest, and ethical responses to the queries, and fine-tuning using the newly created dataset. Constitutional AI [162] replaces human feedback in RLHF with AI, calling it RL from AI feedback (RLAIF). AlpacaFarm [163] designs prompts to imitate human feedback using LLMs APIs. Opposite to constitutional AI, AlpacaFarm injects noise in feedback to replicate human mistakes. Self-Align [93] prompts the LLM with ICL examples, instructing the LLM about what the response should contain to be considered useful and ethical. The same LLM is later fine-tuned with the new dataset.
Aligning with Prompts: LLMs can be steered with prompts to generate desirable responses without training [164], [165]. The self-correction prompting in [165] concatenates instructions and CoT with questions, guiding the model to answer its instruction-following strategy to ensure moral safety before the actual answer. This strategy is shown to reduce the harm in generated responses significantly.
Red-Teaming/Jailbreaking/Adversarial Attacks: LLMs exhibit harmful behaviors, hallucinations, leaking personal information, and other shortcomings through adversarial probing. The models are susceptible to generating harmful responses even though they are aligned for safety [166], [167]. Red-teaming is a common approach to address illicit outputs, where the LLMs are prompted to generate harmful outputs [167], [168]. The dataset collected through red-teaming is used to fine-tune models for safety. While red-teaming largely relies on human annotators, another work [169] red-teams LLMs to find prompts that lead to harmful outputs of other LLMs.
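The DPO objective mentioned above can be written directly in terms of policy and reference-model log-probabilities of the preferred and unpreferred responses. The sketch below is a minimal, illustrative implementation; the tensor names and the beta value are assumptions, and the per-sample importance weighting is omitted.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Log-probability margins of the trainable policy and the frozen reference model.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Maximize the likelihood of preferring the chosen response,
    # implicitly using the reference model as a KL anchor.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random sequence log-probabilities for a batch of response pairs.
lp = lambda: torch.randn(4)
print(float(dpo_loss(lp(), lp(), lp(), lp())))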
4. Continue Pre-Training: Although fine-tuning boosts a model’s performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [170], [141]. This is also effective in adapting LLMs for cases where fine-tuning data is small and the original capacity is to be maintained. Prompt-based continued pre-training (PCP) [171] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks.
5. Sample Efficiency: While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16], [92], [18] and requires proportional computing resources. To study the effects on performance with less data, existing literature [172], [173] finds that models trained on less data can outperform models trained with more data. In [172], 25% of the total downstream data is found enough for state-of-the-art performance. Selecting a coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [173], as compared to the complete data tuning. Less is more for alignment (LIMA) [174] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4.

C. Increasing Context Window
LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [175], [49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero-shot length extrapolation. However, ALiBi has less expressive power [66] and inferior performance on multiple benchmarks [46], and many LLMs use RoPE positional embedding that is unable to perform zero-shot extrapolation. A larger context length has benefits such as a better understanding of longer documents, more samples in in-context learning, execution of bigger reasoning processes, etc. Expanding context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques discussed below.
Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window is more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without performance loss compared to the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation.
Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger context window LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (windowing token averaging). The model replaces attention in T5 [10] with TGlobal attention, pre-trains the model on 4098 sequence length, fine-tunes on larger window sizes, as large as 16k, and improves task performance with longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [176] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed by the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [177] replaces standard attention with dilated attention, expanding the sequence length to 1 billion tokens. LongLoRA [178] proposes shift-short attention, used during fine-tuning to reduce dense attention costs, while the model during inference can use dense attention and achieve similar performance as full attention fine-tuning.
Extrapolation without Training: LM-Infinite [175] and parallel context windows (PCW) [179] show length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested Λ-shaped attention applied within the original context window limits. Likewise, PCW chunks larger inputs into the pre-trained context lengths and applies the same positional encodings to each chunk.

D. Augmented LLMs
LLMs are capable of learning from the examples concatenated with the input, known as context augmentation, in-context learning (ICL), or few-shot prompting. They show excellent generalization to unseen tasks with few-shot prompting, enabling LLMs to answer queries beyond the capacity acquired during training [6], [55]. These emergent abilities allow for adapting the model without fine-tuning - a costly process. Aside from this, hallucination, producing inaccurate, unsafe, or factually incorrect responses, is common for LLMs, which is avoided by augmenting contextual data. While the user can provide in-context samples in the query [54], [32], here we specifically refer to the methods that access external storage programmatically, calling them augmented LLMs.
The literature suggests various external memory designs to augment LLMs: long-term [180], [181], [182], [183], short-term [184], symbolic [185], and non-symbolic [186], [187]. The memory can be maintained in different formats such as documents, vectors, or databases. A few systems maintain intermediate memory representations to retain information across multiple iterations [183], [181], while others extract important information from the datasets and save it in memory for recall [188]. The memory read and write operations are performed either with or without LLMs cooperation [181], [189], [183], [190], acting as a feedback signal in [184]. We discuss different types of augmented LLMs below.
1. Retrieval Augmented LLMs: LLMs may have limited memory and outdated information, leading to inaccurate responses. Retrieving relevant information from external up-to-date storage enables the LLMs to accurately answer with references and utilize more information. With retrieval augmentation, smaller models have been shown to perform at par with larger models. For instance, the 11B model can become competitive to 540B PaLM in [25] and 7.5B to 280B Gopher in [182]. Retrieval augmented language modeling (RALM) has two major components, shown in Figure 12, namely: 1) retriever and 2) language model. In RALM, the
instruction-following, however, utilizing them for physically tuning alleviates this problem by fine-tuning only 0.001%-
grounded tasks requires adaptation, as they lack real-world 3% additional parameters [240]. It concatenates trainable
knowledge. This could lead to generating illogical responses prompt parameters with the model embeddings [236], [40],
for a particular physical situation [229], [26]. SayCan [229] [240]. Task-specific fixed discrete prompts are concatenated
make LLMs aware of the available low-level task operations. with input embeddings in [40]. As discrete prompts bring
LLM (Say) builds a high-level plan to complete the task and instability, prompts are encoded through a learnable mapping
a learned affordance function (Can) explores the possibility in P-Tuning [236], naming continuous prompts, which are
of executing the plan in the real world. SayCan uses RL to appended with the discrete prompts. Only the prompt encoder
train the language-conditioned affordance function. PaLM-E is trainable in the model. In an extension of P-Tuning, contin-
enables the LLM to solve grounded tasks by training multi- uous prompts are concatenated with each layer of the network
modal LLM feeding inputs directly from the sensors. in [240]. Progressive prompts [241] avoid catastrophic forget-
Manipulation: In the area of manipulation [225], [230], LLMs ting and transfer previously learned knowledge by sequentially
enhance a robot’s dexterity and adaptability, excelling in tasks adding trainable prompt embeddings to the previously frozen
like object recognition, grasping, and collaboration. They task embeddings.
analyze visual and spatial information to determine the most Prefix Tuning: A set of trainable task-specific prefix vec-
effective approach to interact with objects. tors are appended to the frozen transformer layers in prefix
Navigation: LLMs enhance a robot’s ability to navigate com- tuning [41]. The prefix vectors are virtual tokens attended
plex environments with precision and adaptability [231], [232], by the context tokens on the right. In addition, adaptive
[233], [234]. They generate feasible paths and trajectories for prefix tuning [242] applies a gating mechanism to control the
robots, accounting for intricate environmental details [235]. information from the prefix and actual tokens.
This ability is valuable in scenarios requiring precise and Bias Tuning: Fine-tuning only bias terms in small to medium
dynamically adaptable navigation in environments like ware- training data has been found effective in BitFit [243]. This
houses, transport, healthcare facilities, and residences. method achieves full fine-tuning performance for tasks with
less training data and comparable performance with larger
training data.
F. Efficient LLMs
Deploying LLMs in production is expensive. Reducing their running costs while preserving performance is an appealing area of research. This section summarizes the approaches suggested to enhance LLM efficiency.
1. Parameter Efficient Fine-Tuning: Fine-tuning LLMs with tens or hundreds of billions of parameters, such as GPT-3 (175B), BLOOM (176B), MT-NLG (530B), etc., is computationally intensive and time-consuming. To avoid complete model fine-tuning, numerous parameter-efficient fine-tuning (PEFT) techniques [40], [236], [41], [38], [39] try to achieve acceptable model fine-tuning performance at reduced costs. Compared to full fine-tuning [237], PEFT performs better in low-resource setups, achieves comparable performance in medium-resource scenarios, and performs worse than full fine-tuning under high-resource availability. An overview of different PEFT approaches is shown in Figure 14.
Adapter Tuning: Adds a few trainable parameters within the transformer block. The adapter layer is a sequence of feature downscaling, non-linearity, and upscaling [101]. Variants of adapter tuning inject adapter layers sequentially [101] and in parallel [38], whereas the mixture-of-adapters approach (AdaMix) [238] employs multiple adapter modules in a single layer. AdaMix routes input instances randomly to one of the multiple downscale and upscale modules. The mixture of adapters is averaged out for inference to avoid additional latency. Low-Rank Adaptation (LoRA) [239] learns low-rank decomposition matrices while keeping the original weights frozen. The learned weights are fused with the original weights for inference, avoiding latency.
Prompt Tuning: Prompting is an effective way to adapt a pre-trained LLM for a downstream task. However, manual prompts bring uncertainty into the model's prediction, where a change in a single word drops the performance [236]. Prompt tuning [40], [41] instead prepends trainable prompt token embeddings to the input and learns only these parameters for the downstream task, keeping the model weights frozen.
2. Quantization: LLMs require extensive computing and memory for inference. Deploying the 175B-parameter GPT-3 model needs at least 5x80GB A100 GPUs and 350GB of memory to store it in FP16 format [44]. Such demanding requirements for deploying LLMs make it harder for smaller organizations to utilize them. Model compression is an effective solution but comes at the cost of degraded performance, especially at scales greater than 6B. These models exhibit very large magnitude outliers that do not exist in smaller models [244], making quantization challenging and requiring specialized methods for LLMs [44], [245].
Post-Training Quantization: Minimal or no training is required in this type of quantization, without significantly compromising model performance. LLM.int8() [244] uses full-precision matrix multiplication for weights associated with outlier features and 8-bit multiplication for the remaining weights. The lower-precision multiplication outputs are converted to FP16 and concatenated with the others. The quantized models have homogenous word embeddings, which may degrade their performance. To fix this, token-level knowledge distillation is employed in [45] along with independent quantization scaling factors for each module due to varying weight distributions. Feature distributions are asymmetric and appear in different channels; outlier suppression [246] shifts and scales per-channel activation distributions for effective quantization. SmoothQuant [44] quantizes activations and weights to INT8 format by smoothing activations and migrating the quantization difficulty toward the weights. It multiplies the inverse of the smoothing factor with the weights, which introduces a few outliers in the weights but is easier to quantize than unsmoothed activations.
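To make the post-training quantization recipe above concrete, the following is a minimal PyTorch-style sketch of SmoothQuant-like difficulty migration followed by symmetric per-channel INT8 weight quantization. The function name and the smoothing exponent alpha are illustrative assumptions for this survey, not the reference implementations of [44] or [244].

import torch

def smooth_and_quantize_int8(W: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Illustrative SmoothQuant-style smoothing followed by symmetric
    per-output-channel INT8 weight quantization.
    W: [out_features, in_features] FP16/FP32 weight.
    act_absmax: [in_features] per-channel max |activation| from calibration data."""
    # Migrate quantization difficulty from activations to weights:
    # s_j = act_absmax_j^alpha / max|W[:, j]|^(1-alpha); activations are divided by s,
    # weight columns are multiplied by s, so (x / s) @ (diag(s) W^T) is unchanged.
    w_absmax = W.abs().amax(dim=0).clamp(min=1e-5)
    s = act_absmax.clamp(min=1e-5).pow(alpha) / w_absmax.pow(1 - alpha)
    W_smoothed = W * s  # outliers move into the weights, which tolerate them better

    # Symmetric per-output-channel INT8 quantization of the smoothed weights.
    scale = W_smoothed.abs().amax(dim=1, keepdim=True) / 127.0
    W_int8 = torch.clamp((W_smoothed / scale).round(), -128, 127).to(torch.int8)
    return W_int8, scale, s  # dequantize as W_int8.float() * scale; divide inputs by s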
Fig. 14: Illustration of parameter-efficient fine-tuning paradigms, where x is the input and h is the hidden state, figure courtesy of [38]. The parallel adapter and LoRA fall in the adapter tuning category.
OPTQ [245] uses the optimal brain compression (OBC) [247] algorithm to quantize the model layer-by-layer and updates weights to compensate for the quantization error. To improve speed and performance, OPTQ updates weights in arbitrary order, employs lazy updates, and uses better Cholesky kernels. Outlier-aware weight quantization (OWQ) [248] uses the OPTQ algorithm for quantization but assigns higher precision to the vulnerable weights that cause outliers and lower precision to the others.
Quantization-Aware Training: To compensate for the performance degradation, a quantized model is fine-tuned in quantization-aware training (QAT) [249], [250], [251]. AlphaTuning quantizes the model using binary coding quantization (BCQ) [252] and fine-tunes only the quantization scaling factors. This approach improves performance over parameter-efficient fine-tuning of a pre-trained model followed by quantization and fine-tuning. Similarly, parameter-efficient and quantization-aware adaptation (PEQA) [253] reduces the precision of fully-connected layers and fine-tunes only the quantization scaling parameters. LLM-QAT [251] generates training data from the pre-trained network and trains a quantized student model with knowledge distillation. QLoRA [250] fine-tunes a 4-bit quantized pre-trained LLM with LoRA [239] using the 4-bit NormalFloat data type, which shows better performance over 4-bit integer and float.
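Since several of the methods above (LoRA, QLoRA, PEQA) train only a small set of added or rescaled parameters on top of frozen base weights, a minimal LoRA-style module is sketched below. The class name, rank r, and scaling are illustrative assumptions of the general low-rank adaptation recipe, not the reference code of [239].

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                      # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # up-projection, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus low-rank update; B @ A can be fused into the base
        # weight after training, so inference incurs no extra latency.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())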
3. Pruning: Pruning is an alternative approach to quantization for compressing model size, thereby reducing LLM deployment costs significantly. Compared to task-agnostic pruning, task-specific pruning is easily achievable with good performance, where a model is fine-tuned on the downstream task and pruned for faster inference. It is possible to prune LLMs for individual tasks, but the cost of pruning and deploying task-specific models is high. To overcome this, many structured and unstructured pruning methods for LLMs have been proposed to maintain reasonable performance while shrinking the model size [254], [42], [255].
Unstructured Pruning: This kind of pruning removes less important weights without maintaining any structure. Existing LLM pruning methods take advantage of a unique characteristic of LLMs, uncommon in smaller models, where a small subset of hidden states is activated with large magnitude [244]. Pruning by weights and activations (Wanda) [254] prunes weights in every row based on importance, calculated by multiplying the weight magnitude with the norm of the corresponding input activation. The pruned model does not require fine-tuning, thereby saving computational costs. Outlier-weighed layerwise sparsity (OWL) [256] extends Wanda with non-uniform layer pruning. It shows that the number of outliers varies across layers; therefore, the model should have a variable pruning ratio per layer for better performance. Contrastive pruning (CAP) [43] iteratively prunes the model by training the sparse model using a contrastive loss between the pre-trained model, the fine-tuned model, and snapshots of previous sparse models to learn task-specific and task-agnostic knowledge.
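A minimal sketch of the Wanda-style importance score described above (|weight| times input-activation norm, pruned row-wise) is given below; the function name and calibration interface are illustrative assumptions, not the official implementation of [254].

import torch

def wanda_prune_(W: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5):
    """In-place unstructured pruning in the spirit of Wanda [254].
    W: [out_features, in_features] linear weight.
    act_norm: [in_features] L2 norm of each input feature, collected on calibration data."""
    score = W.abs() * act_norm             # importance of each weight
    k = int(W.shape[1] * sparsity)         # number of weights to drop per output row
    # Indices of the k lowest-importance weights in every row.
    _, drop_idx = torch.topk(score, k, dim=1, largest=False)
    W.scatter_(1, drop_idx, 0.0)           # zero them out; no fine-tuning required
    return W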
Structured Pruning: Here, the parameters are removed in groups, rows, columns, or matrices, which speeds up inference because of effective hardware tensor core utilization [254]. LLM-Pruner [42] employs a 3-stage structured pruning strategy: identifying the groups of hidden states that cause each other to activate during the forward pass, keeping the important groups and removing the less important ones, and fine-tuning the pruned model with LoRA. Sparsity-induced mask learning (SIMPLE) [257] prunes the network using learnable masks. Similarly, another method prunes LLMs by learning masks and removing unimportant rank-1 components of the factorized weight matrix [255].

G. Multimodal LLMs
Inspired by the success of LLMs in natural language processing applications, an increasing number of research works are now facilitating LLMs to perceive different modalities of information such as image [258], [259], [260], video [261], [262], [263], and audio [264], [263], [265]. Multimodal LLMs (MLLMs) present substantial benefits compared to standard LLMs that process only text. By incorporating information from various modalities, MLLMs can achieve a deeper understanding of context, leading to more intelligent responses infused with a variety of expressions. Importantly, MLLMs align closely with human perceptual experiences, leveraging the synergistic nature of our multisensory inputs to form a comprehensive understanding of the world [265], [26]. Coupled with a user-friendly interface, MLLMs can offer intuitive, flexible, and adaptable interactions, allowing users to engage with intelligent assistants through a spectrum of input methods. According to the way the models are constructed, current MLLMs can generally be divided into three streams: pre-training, fine-tuning, and prompting.
In this section, we discuss these main streams in more detail, as well as the important application of MLLMs in visual reasoning.
Pre-training: This stream of MLLMs intends to support different modalities using unified end-to-end models. For instance, Flamingo [258] applies gated cross-attention to fuse the vision and language modalities, which are collected from a pre-trained and frozen visual encoder and LLM, respectively. Moreover, BLIP-2 [259] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between the vision and language modalities: in the first stage, vision-language representation learning is bootstrapped from a frozen visual encoder; and in the second stage, a frozen LLM bootstraps vision-to-language generative learning for zero-shot image-to-text generation. Similarly, MiniGPT-4 [266] also deploys a pre-trained and frozen ViT [267], Q-Former, and Vicuna LLM [148], while only a linear projection layer needs to be trained for vision and language modality alignment.
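The MiniGPT-4-style alignment described above reduces to training a single projection from frozen visual features into the LLM's embedding space. A minimal sketch under assumed dimensions is shown below; the module and dimension names are illustrative, not the code of [266].

import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps frozen visual-encoder tokens (e.g., ViT/Q-Former outputs) into the
    LLM embedding space; only this layer is trained in a MiniGPT-4-style setup."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens: torch.Tensor, text_embeds: torch.Tensor):
        # vision_tokens: [batch, num_visual_tokens, vision_dim] from a frozen encoder
        # text_embeds:   [batch, seq_len, llm_dim] from the frozen LLM embedding table
        visual_embeds = self.proj(vision_tokens)
        # Prepend projected visual tokens to the text embeddings before the frozen LLM.
        return torch.cat([visual_embeds, text_embeds], dim=1)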
Fine-tuning: Derived from instruction tuning [16] for NLP tasks [20], [16], [92], researchers are now fine-tuning pre-trained LLMs using multimodal instructions. Following this method, LLMs can be easily and effectively extended into multimodal chatbots [266], [260], [29] and multimodal task solvers [268], [30], [269]. The key issue for this stream of MLLMs is to collect multimodal instruction-following data for fine-tuning [58]. To address this issue, the solutions of benchmark adaptation [268], [270], [271], self-instruction [19], [31], [272], and hybrid composition [273], [269] are employed, respectively. To mitigate the gap between the original language modality and the additional modalities, a learnable interface is introduced to connect the different modalities from frozen pre-trained models. In particular, the learnable interface is expected to work in a parameter-efficient tuning manner: e.g., LLaMA-Adapter [274] applies an efficient transformer-based adapter module for training, and LaVIN [273] dynamically learns the multimodal feature weights using a mixture-of-modality adapter. Different from the learnable interface, expert models can directly convert multimodal inputs into language: e.g., VideoChat-Text [261] incorporates Whisper [275], a speech recognition expert model, to generate the captions of given videos for the understanding of the following LLMs.
Prompting: Different from the fine-tuning technique that directly updates the model parameters given task-specific datasets, the prompting technique provides certain context, examples, or instructions to the model, fulfilling specialized tasks without changing the model parameters. Since prompting can significantly reduce the need for large-scale multimodal data, this technique is widely used to construct MLLMs. In particular, to solve multimodal Chain of Thought (CoT) problems [98], LLMs are prompted to generate both the reasoning process and the answer given multimodal inputs [276]. On this front, different learning paradigms are exploited in practice: for example, Multimodal-CoT [276] involves two stages of rationale generation and answer inference, where the input of the second stage is a combination of the original input and the output of the first stage; and CoT-PT [277] applies both prompt tuning and a specific visual bias to generate a chain of reasoning implicitly. In addition to CoT problems, LLMs can also be prompted with multimodal descriptions and tools, effectively dividing complex tasks into sub-tasks [278], [279].
Visual Reasoning Application: Recent visual reasoning systems [280], [281], [205], [282] tend to apply LLMs for better visual information analysis and visual-language integration. Different from previous works [283], [284] that rely on limited VQA datasets and small-scale neural networks, current LLM-aided methods offer the benefits of stronger generalization ability, emergent ability, and interactivity [58]. To realize visual reasoning with the help of LLMs, prompting and fine-tuning techniques can also be utilized: for example, PointCLIP V2 [281] applies LLMs to generate 3D-specific prompts, which are encoded as textual features and then combined with visual features for 3D recognition; and GPT4Tools [31] employs LoRA [239] to fine-tune LLMs following tool-related instructions. Serving as a controller [282], decision maker [285], or semantics refiner [280], [286], LLMs significantly facilitate the progress of visual reasoning research.

H. Summary and Discussion
1. Architecture: Due to the gigantic scale of LLMs, minor changes in architecture and training strategies have a big impact on performance and stability. Here, we summarize the key architectural modules used in various LLMs that lead to better performance, reduced training time and memory, and better training stability.
Layer Normalization: is found to have a significant effect on the performance and training stability of LLMs. Pre-norm, that is, normalizing inputs rather than outputs, is more common among LLMs and stabilizes training [6], [125], [103]. BLOOM [13] and AlexaTM [120] utilize an additional layer normalization before the embedding layer to stabilize the training of large-scale models, although this can negatively impact the model's zero-shot generalization ability [13]. However, another study [33] finds that pre-norm degrades fine-tuned model performance compared to post-norm, and that there are no stability benefits of pre-norm beyond the 100B scale. Therefore, GLM-130B [33] used deep-norm, a variant of post-norm, for better downstream task performance after fine-tuning.
Positional Encoding: affects the performance and training stability of LLMs like the other building blocks of a model. BLOOM [13] finds ALiBi outperforming learned and rotary positional encodings. Contrary to this, GLM-130B [33] identifies rotary positional encoding as better than ALiBi. So, there is no conclusion in the literature about positional encodings yet.
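As an illustration of the positional-encoding families compared above, the ALiBi variant simply adds a head-specific linear distance penalty to the attention logits. The sketch below is a simplified assumption (head counts restricted to powers of two, single slope schedule), not the official ALiBi code.

import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Additive causal attention bias of shape [num_heads, seq_len, seq_len]:
    each head h penalizes attention to distant keys with its own slope m_h."""
    # Geometric slope schedule, assuming num_heads is a power of two
    # (e.g., 8 heads give slopes 1/2, 1/4, ..., 1/256).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    relative = pos[None, :] - pos[:, None]          # j - i, which is <= 0 for past positions
    bias = slopes[:, None, None] * relative[None, :, :]
    # Mask out future positions; the remaining bias is added to q @ k^T / sqrt(d).
    return bias.masked_fill(relative[None, :, :] > 0, float("-inf"))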
Parallel Attention: where the attention and feed-forward layers are parallel to each other rather than sequential in a transformer block, has been shown to reduce training time by 15%. There is no evidence of a performance drop due to this change in the literature, and it is used by PaLM [15], GPT-NeoX [113], and CodeGen [129].
Multi-Query Attention: has shared key and value attention heads in a transformer block, while the query attention heads are projected as usual. This reduces memory usage and speeds up sampling in autoregressive decoding. No performance degradation has been observed with this change, and it makes training more efficient by allowing larger batch sizes. Multi-query attention is used in [15], [131].
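A minimal multi-query attention sketch is given below to illustrate the shared key/value head described above; shapes and naming are assumptions for illustration, not any particular model's implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Multi-query attention: many query heads, a single shared key/value head,
    which shrinks the KV cache and speeds up autoregressive decoding."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.h, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)            # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)   # one shared K and V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.h, self.d_head).transpose(1, 2)      # [b, h, t, d]
        k, v = self.kv_proj(x).split(self.d_head, dim=-1)                        # [b, t, d] each
        scores = q @ k.unsqueeze(1).transpose(-2, -1) / math.sqrt(self.d_head)   # [b, h, t, t]
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        out = attn @ v.unsqueeze(1)                                              # [b, h, t, d]
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))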
Mixture of Experts: allows easily scaling models to trillions of parameters [128], [116]. Only a few experts are activated during the computation, making MoE models compute-efficient. The performance of MoE models is better than that of dense models for the same amount of data, and they require less computation during fine-tuning to achieve performance similar to dense models, as discussed in [116]. MoE architectures are less prone to catastrophic forgetting and are therefore more suited for continual learning [128]. Extracting smaller sub-models for downstream tasks is possible without losing any performance, making the MoE architecture hardware-friendly [128].
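The sparse expert activation described above can be illustrated with a simple top-k router. The sketch below is a schematic assumption (no load balancing, capacity limits, or expert parallelism) rather than the routing used by any specific MoE model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Mixture-of-experts feed-forward layer: each token is processed by only
    k experts, so capacity grows with the number of experts at roughly constant FLOPs."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: [num_tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)             # [num_tokens, k]
        out = torch.zeros_like(x)
        for slot in range(self.k):                           # dense loop for clarity only
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out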
Sparse vs Dense Activated: GPT-3 [6] uses sparse transformers [67], whereas GLaM [116] and PanGu-Σ [128] use the MoE [117] architecture to lower computational costs and increase the model size and capacity. According to the literature, sparse modules do not degrade the model's performance [67]. However, more experiments are required to verify this statement.
2. Training Strategies: Training models at a huge scale requires tricks to reduce training costs, avoid loss divergence, and achieve better performance. We summarize and discuss some of these key tricks used in different LLMs.
Mixed Precision: is a popular method for LLMs to reduce memory usage and improve training efficiency. In mixed precision, forward and backward passes are performed in FP16 format, whereas optimizer states and master weights are kept in FP32 format [115]. A drawback associated with this format change is training instability due to the smaller value range, resulting in loss spikes [33]. An alternative to FP16 is BF16, which has a comparatively larger range and performs some precision-sensitive operations, like gradient accumulation and softmax, in FP32 [13]. BF16 has better performance and training stability but uses more memory and is supported on specific hardware, for example, A100 GPUs. Therefore, its adoption in LLMs is limited.
Training Instability: is a common issue in LLMs where loss divergence or spiking is observed multiple times during training. This happens even in the presence of gradient clipping [15]. To mitigate this problem, many approaches suggest restarting training from an earlier checkpoint [15], [33], [116], skipping 200-500 earlier data batches at the point of divergence [15], and re-shuffling batches [116]. Embedding layer gradient shrink proves to further stabilize training, as the embedding layer's gradient norm is significantly larger than that of the other layers [33]. Another suggestion to improve training stability for larger models is not to use biases in dense and norm layers, as in [15].
Weight Initialization: plays a significant role in model convergence and training stability. GPT-NeoX [113] initializes feed-forward layers before residuals with a standard deviation of 2/(L√d) as in [142] and other layers with the small-initialization scheme [287]. This avoids activations growing exponentially with increasing depth. MT-NLG [112] found that higher variance for weight initialization leads to unstable training, hence validating the small-initialization scheme [287]. Various models perform random weight initialization, which can cause bad initialization; Galactica [137] suggests a longer warmup to negate the effect.
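The scaled initialization above can be expressed directly; the snippet below is a schematic assumption of how such a scheme might be applied to layers feeding into residual connections, not GPT-NeoX's exact code.

import math
import torch.nn as nn

def init_residual_output(linear: nn.Linear, num_layers: int, d_model: int):
    """Scale the std of layers feeding into residual streams as 2 / (L * sqrt(d)),
    so activations do not grow with depth; other layers would use a small-init std."""
    nn.init.normal_(linear.weight, mean=0.0, std=2.0 / (num_layers * math.sqrt(d_model)))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)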
Learning Rate: is important for stable training. It is suggested to use a lower value [13], [15], [122] with warmup and decay (cosine or linear). Usually, the learning rate is within the range 1e-4 to 8e-4. Moreover, MT-NLG (530B) [112] and GPT-NeoX (20B) [113] suggest interpolating learning rates based on the model size using the GPT-3 [6] models ranging between 13B and 175B. This avoids tuning the learning rate hyperparameter.
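The warmup-plus-cosine-decay schedule mentioned above can be written in a few lines; the function below is a generic sketch, and the peak/minimum learning rates and step counts are placeholder assumptions, not values tied to a particular model.

import math

def lr_at_step(step: int, peak_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 2000, total_steps: int = 300_000) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))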
Training Parallelism: 3D parallelism, a combination of data, pipeline, and tensor parallelism, is the most utilized training parallelism approach in LLMs [33], [15], [14], [13], [112], [110], [107]. In addition to 3D parallelism, BLOOM [13] uses the ZeRO optimizer [37] to shard optimizer states. PanGu-α [103] and PanGu-Σ [128] go beyond 3D parallelism and apply 5D parallelism, which additionally contains optimizer parallelism and rematerialization.
Mode Switching: adds task-related tokens at the beginning of the text during training. These tokens refer to natural language understanding and natural language generation tasks and are shown to improve downstream task performance in [123], [122], [120]. During fine-tuning and inference, tokens are appended based on the downstream task.
Controllable Text Generation: Generating credible and controlled text from a pre-trained model is challenging. GPT-3 [6] and other LLMs use in-context learning to control the generated text. While in-context learning helps in controlling the generated text, ERNIE 3.0 Titan [35] suggests using an adversarial loss to rank its generated text for credibility, and soft prompts such as genre, topic, keywords, sentiment, and length for better control over the generated text.
3. Supervised Models vs Generalized Models: Although generalized models are capable of performing diverse tasks with good performance, they have not yet outperformed models trained in supervised settings. The supervised trained models are still state-of-the-art in various NLP tasks by a large margin, as shown in [6], [15], [18].
4. Zero-Shot vs Few-Shot: LLMs perform well in zero-shot and few-shot settings. However, the performance difference between zero-shot and few-shot is large for pre-trained models [6], [15], naming LLMs as meta-learners [6]. LLM zero-shot evaluations underperform unsupervised methods in neural machine translation [6]. The literature shows that pre-training is not enough for good zero-shot performance [15], [16]. To improve zero-shot performance, the literature suggests using instruction fine-tuning, which improves zero-shot performance significantly and outperforms baselines. Instruction fine-tuning has also been shown to improve zero-shot generalization to unseen tasks. Another model, Flan-PaLM [16], unlocks zero-shot reasoning with CoT training.
5. Encoder vs Decoder vs Encoder-Decoder: Traditionally, these architectures perform well for different tasks, for example, encoder-only for NLU tasks, decoder-only for NLG, and encoder-decoder for sequence-to-sequence modeling. Encoder-only models are popular for smaller models such as BERT [7], RoBERTa [288], etc., whereas LLMs are either decoder-only [6], [113], [13] or encoder-decoder [10], [11], [120]. While decoder-only models are good at NLG tasks, various LLMs, such as PaLM [15], OPT [14], GPT-3 [6], BLOOM [13], and LLaMA [145], are decoder-only models with significant performance gains on both NLU and NLG tasks. In contradiction to this, T5 [10] and UL2 [123] identify encoder-decoder models as outperforming decoder-only models. In another study, PaLM [15] finds that increasing the size of decoder-only models can reduce the performance gap between decoder-only and encoder-decoder architectures.
Although decoder-only architectures have become a trend for LLMs, many recently proposed approaches [123], [120] use mode-switching tokens in the text with encoder-decoder architectures to enable task-specific modes. Similarly, CodeT5+ [34] uses an encoder-decoder architecture with multiple training objectives for different tasks, activating the encoder, decoder, or both according to the task. These variations in architecture and training objectives allow a model to perform well in different settings. Because of this dynamic configuration, the future of LLMs may well lie with encoder-decoder architectures.

IV. Model Configurations
We provide different statistics of pre-trained and instruction-tuned models in this section. This includes information such as publication venue, license type, model creators, steps trained, parallelism, etc., in Table III and Table IV. Architecture details of the pre-trained LLMs are available in Table V. Providing these details for instruction-tuned models is unnecessary because they fine-tune pre-trained models on instruction datasets; hence, their architectural details are the same as those of their baselines. Moreover, optimization settings for various LLMs are available in Table VI and Table VII. We do not include details on precision, warmup, and weight decay in Table VII, as these details are less important for instruction-tuned models and are often not provided by the papers.

V. Datasets and Evaluation
Generating training and evaluation datasets is expensive because of the large-scale data demand of LLMs. Hence, datasets for training and benchmarking these models are topics of key importance. A summary of the datasets commonly used by LLMs is provided next.

A. Training Datasets
The performance of LLMs largely depends on the training data's quality, size, and diversity. Preparing training datasets of high quality at a large scale is laborious. Researchers have suggested various pre-training and fine-tuning datasets to enhance LLMs' capabilities. We summarize these efforts in Table VIII. While numerous training datasets are available in the literature, we cover the most widely used ones in our summary.

B. Evaluation Datasets and Tasks
The evaluation of LLMs is important for gauging their proficiency and limitations. This process measures the model's ability to comprehend, generate, and interact with human language across a spectrum of tasks. Evaluating a language model (LM) is divided into two broader categories: 1) natural language understanding (NLU) and 2) natural language generation (NLG). It is emphasized that tasks in NLU and NLG are softly categorized and are often used interchangeably in the literature.
Natural Language Understanding: This task measures the language understanding capacity of LMs. It encompasses multiple tasks, including sentiment analysis, text classification, natural language inference (NLI), question answering (QA), commonsense reasoning (CR), mathematical reasoning (MR), reading comprehension (RC), etc.
Natural Language Generation: This task assesses the language generation capabilities of LLMs by understanding the provided input context. It includes tasks such as summarization, sentence completion, machine translation (MT), dialogue generation, etc.
Numerous datasets are proposed for each task, evaluating LLMs against different characteristics. To provide an overview of the evaluation datasets, we briefly discuss a few famous datasets within each category and offer a comprehensive list of datasets in Table IX. Moreover, we show a detailed overview of the training datasets and evaluation tasks and benchmarks used by various pre-trained LLMs in Table X and fine-tuned LLMs in Table XI. We also compare the top-performing LLMs in various NLP tasks in Table XII.
1. Multi-task:
1.1 MMLU [296]: A benchmark that measures the knowledge acquired by models during pretraining and evaluates models in zero-shot and few-shot settings across 57 subjects, testing both world knowledge and problem-solving ability.
1.2 SuperGLUE [2]: A more challenging and diverse successor to the GLUE [298] benchmark, SuperGLUE includes a variety of language understanding tasks, such as question answering, natural language inference, and coreference resolution. It is designed to provide a rigorous test of language understanding and requires significant progress in areas like sample-efficient, transfer, multitask, and unsupervised or self-supervised learning.
1.3 BIG-bench [297]: BIG-bench (Beyond the Imitation Game Benchmark) is a large-scale benchmark designed to test the abilities of LLMs across a wide range of tasks, including reasoning, creativity, ethics, and understanding of specific domains.
1.4 GLUE [298]: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. It includes a variety of tasks that test a wide range of linguistic phenomena, making it a comprehensive tool for evaluating language understanding in AI.
2. Language Understanding:
2.1 WinoGrande [343]: A large-scale dataset inspired by the original Winograd Schema Challenge [346], WinoGrande tests models on their ability to resolve pronoun ambiguity and encourages the development of models that understand the broad context in natural language text.
TABLE III: Summary of pre-trained LLMs (>10B). Only the LLMs discussed individually in the previous sections are
summarized. “Data/Tokens” is the model’s pre-training data, which is either the number of tokens or data size. “Data Cleaning”
indicates whether the data cleaning is performed or not. This includes heuristics (Heur), deduplication (Dedup), quality filtering
(QF), and privacy filtering (PF), “Cost” is the calculated training cost obtained by multiplying the GPUs/TPUs hourly rate
with the number of GPUs and the training time. The actual cost may vary due to many reasons such as using in-house GPUs
or getting a discounted rate, re-training, number of employees working on the problem, etc. “Training Parallelism” indicates
distributed training using data parallelism (D), tensor parallelism (T), pipeline parallelism (P), model parallelism (M), optimizer
parallelism (OP), and rematerialization (R), where for “Library” column, “DS” is a short form for Deep Speed. In column
“Commercial Use”, we assumed a model is for non-commercial purposes if its license is unavailable.
Publication License Model No. of Commercial Steps Data/ Data No. of Processing Training Calculated Training
Models
Venue Type Creators Purpose Params Use Trained Tokens Cleaning Processing Units Unit Type Time Train. Cost Parallelism Library
T5 [10] JMLR'20 Apache-2.0 Google General 11B ✓ 1M 1T Heur+Dedup 1024 TPU v3 - - D+M Mesh TensorFlow
GPT-3 [6] NeurIPS'20 - OpenAI General 175B × - 300B Dedup+QF - V100 - - M -
mT5 [11] NAACL'21 Apache-2.0 Google General 13B ✓ 1M 1T - - - - - - -
PanGu-α [103] arXiv'21 Apache-2.0 Huawei General 200B ✓ 260k 1.1TB Heur+Dedup 2048 Ascend 910 - - D+OP+P+O+R MindSpore
CPM-2 [12] AI Open'21 MIT Tsinghua General 198B ✓ 1M 2.6TB Dedup - - - - D+M JAXFormer
Codex [130] arXiv'21 - OpenAI Coding 12B × - 100B Heur - - - - - -
ERNIE 3.0 [105] arXiv'21 - Baidu General 10B × 120k∗ 375B Heur+Dedup 384 V100 - - M∗ PaddlePaddle
Jurassic-1 [107] White-Paper'21 Apache-2.0 AI21 General 178B ✓ - 300B - 800 GPU - - D+M+P Megatron+DS
HyperCLOVA [109] EMNLP'21 - Naver General 82B × - 300B Clf+Dedup+PF 1024 A100 321h 1.32 Mil M Megatron
Yuan 1.0 [110] arXiv'21 Apache-2.0 - General 245B ✓ 26k∗ 180B Heur+Clf+Dedup 2128 GPU - - D+T+P -
Gopher [111] arXiv'21 - Google General 280B × - 300B QF+Dedup 4096 TPU v3 920h 13.19 Mil D+M JAX+Haiku
ERNIE 3.0 Titan [35] arXiv'21 - Baidu General 260B × - 300B Heur+Dedup - Ascend 910 - - D+M+P+D* PaddlePaddle
GPT-NeoX-20B [113] BigScience'22 Apache-2.0 EleutherAI General 20B ✓ 150k 825GB None 96 40G A100 - - M Megatron+DS+PyTorch
OPT [14] arXiv'22 MIT Meta General 175B ✓ 150k 180B Dedup 992 80G A100 - - D+T Megatron
BLOOM [13] arXiv'22 RAIL-1.0 BigScience General 176B ✓ - 366B Dedup+PR 384 80G A100 2520h 3.87 Mil D+T+P Megatron+DS
Galactica [137] arXiv'22 Apache-2.0 Meta Science 120B × 225k 106B Dedup 128 80GB A100 - - - Metaseq
GLaM [116] ICML'22 - Google General 1.2T × 600k∗ 600B Clf 1024 TPU v4 - - M GSPMD
LaMDA [139] arXiv'22 - Google Dialog 137B × 3M 2.81T Filtered 1024 TPU v3 1384h 4.96 Mil D+M Lingvo
MT-NLG [112] arXiv'22 Apache-v2.0 MS.+Nvidia General 530B × - 270B - 4480 80G A100 - - D+T+P Megatron+DS
AlphaCode [131] Science'22 Apache-v2.0 Google Coding 41B ✓ 205k 967B Heur+Dedup - TPU v4 - - M JAX+Haiku
Chinchilla [119] arXiv'22 - Google General 70B × - 1.4T QF+Dedup - TPUv4 - - - JAX+Haiku
PaLM [15] arXiv'22 - Google General 540B × 255k 780B Heur 6144 TPU v4 - - D+M JAX+T5X
AlexaTM [120] arXiv'22 Apache v2.0 Amazon General 20B × 500k 1.1T Filtered 128 A100 2880h 1.47 Mil M DS
U-PaLM [122] arXiv'22 - Google General 540B × 20k - - 512 TPU v4 120h 0.25 Mil - -
UL2 [123] ICLR'23 Apache-2.0 Google General 20B ✓ 2M 1T - 512 TPU v4 - - M JAX+T5X
GLM [33] ICLR'23 Apache-2.0 Multiple General 130B × - 400B - 768 40G A100 1440h 3.37 Mil M -
CodeGen [129] ICLR'23 Apache-2.0 Salesforce Coding 16B ✓ 650k 577B Heur+Dedup - TPU v4 - - D+M JAXFormer
LLaMA [125] arXiv'23 - Meta General 65B × 350k 1.4T Clf+Heur+Dedup 2048 80G A100 504h 4.12 Mil D+M xFormers
PanGuΣ [128] arXiv'23 - Huawei General 1.085T × - 329B - 512 Ascend 910 2400h - D+OP+P+O+R MindSpore
BloombergGPT [140] arXiv'23 - Bloomberg Finance 50B × 139k 569B Dedup 512 40G A100 1272h 1.97 Mil M PyTorch
Xuan Yuan 2.0 [141] arXiv'23 RAIL-1.0 Du Xiaoman Finance 176B ✓ - 366B Filtered 80GB A100 - - P DS
CodeT5+ [34] arXiv'23 BSD-3 Salesforce Coding 16B ✓ 110k 51.5B Dedup 16 40G A100 - - - DS
StarCoder [136] arXiv'23 OpenRAIL-M BigCode Coding 15.5B ✓ 250k 1T Dedup+QF+PF 512 80G A100 624h 1.28 Mil D+T+P Megatron-LM
LLaMA-2 [21] arXiv'23 LLaMA-2.0 Meta General 70B ✓ 500k 2T Minimal Filtering - 80G A100 1.7Mh - - -
PaLM-2 [121] arXiv'23 - Google General - × - - Dedup+PF+QF - - - - - -
TABLE IV: Summary of instruction tuned LLMs (>10B). All abbreviations are the same as Table III. Entries in “Data/Tokens”
starting with “S-” represents the number of training samples.
Publication License Model No. of Commercial Pre-trained Steps Data/ No. of Processing Train. Calculated Train.
Models
Venue Type Creators Purpose Params Use Models Trained Tokens Processing Units Unit Type Time Train. Cost Parallelism Library
WebGPT [155] arXiv'21 - OpenAI General 175B × GPT-3 - - - - - - - -
T0 [17] ICLR'22 Apache-2.0 BigScience General 11B ✓ T5 - 250B 512 TPU v3 270h 0.48 Mil - -
Tk-Instruct [18] EMNLP'22 MIT AI2+ General 11B ✓ T5 1000 - 256 TPU v3 4h 0.0036 Mil - Google T5
OPT-IML [92] arXiv'22 - Meta General 175B × OPT 8k 2B 128 40G A100 - - D+T Megatron
Flan-U-PaLM [16] ICLR'22 Apache-2.0 Google General 540B ✓ U-PaLM 30k - 512 TPU v4 - - - JAX+T5X
mT0 [143] ACL'23 Apache-2.0 HuggingFace+ General 13B ✓ mT5 - - - - - - - -
Sparrow [156] arXiv'22 - Google Dialog 70B × Chinchilla - - 64 TPU v3 - - M -
WizardCoder [153] arXiv'23 Apache-2.0 HK Bapt. Coding 15B × StarCoder 200 S-78k - - - - - -
Alpaca [147] Github'23 Apache-2.0 Stanford General 13B ✓ LLaMA 3-Epoch S-52k 8 80G A100 3h 600 FSDP PyTorch
Vicuna [148] Github'23 Apache-2.0 LMSYS General 13B ✓ LLaMA 3-Epoch S-125k - - - - FSDP PyTorch
LIMA [174] arXiv'23 - Meta+ General 65B - LLaMA 15-Epoch S-1000 - - - - - -
Koala [289] Github'23 Apache-2.0 UC-Berkley General 13B × LLaMA 2-Epoch S-472k 8 A100 6h 100 - JAX/FLAX
2.2 CoQA [305]: A conversational question-answering dataset, CoQA challenges models with questions that rely on conversation history and require free-form text answers. Its diverse content from seven domains makes it a rigorous test for models' ability to handle a wide range of topics and conversational contexts.
2.3 WiC [306]: This dataset assesses a model's ability to discern word meanings based on context, aiding in tasks related to Word Sense Disambiguation.
2.4 Wikitext103 [307]: With over 100 million tokens from Wikipedia's top articles, this dataset is a rich resource for tasks that require understanding long-term dependencies, such as language modeling and translation.
2.5 PG19 [308]: This is a digital library of diverse books from Project Gutenberg. It's specifically designed to facilitate research in unsupervised learning and language modeling, with a special focus on long-form content.
2.6 C4 [10]: A clean, English-language dataset, C4 offers billions of tokens from web-crawled data. It's a comprehensive resource for training advanced Transformer models.
2.7 LCQMC [309]: The Large-scale Chinese Question Matching Corpus (LCQMC) is a dataset for evaluating the performance of models in semantic matching tasks. It contains pairs of questions in Chinese and their matching status, making it a valuable resource for research in Chinese language understanding.
TABLE V: Architecture details of LLMs. Here, “PE” is the positional embedding, “nL” is the number of layers, “nH” is the
number of attention heads, “HS” is the size of hidden states.
Training
Models Type Attention Vocab Tokenizer Norm PE Activation Bias nL nH HS
Objective
T5 (11B) Enc-Dec Span Corruption Standard 32k SentencePiece Pre-RMS Relative ReLU × 24 128 1024
GPT3 (175B) Causal-Dec Next Token Dense+Sparse - - Layer Learned GeLU ✓ 96 96 12288
mT5 (13B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - - - -
PanGu-α (200B) Causal-Dec Next Token Standard 40k BPE Layer - - - 64 128 16384
CPM-2 (198B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - 24 64 -
Codex (12B) Causal-Dec Next Token Standard - BPE+ Pre-Layer Learned GeLU - 96 96 12288
ERNIE 3.0 (10B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 64 4096
Jurassic-1 (178B) Causal-Dec Next Token Standard 256k SentencePiece∗ Pre-Layer Learned GeLU ✓ 76 96 13824
HyperCLOVA (82B) Causal-Dec Next Token Dense+Sparse - BPE* Pre-Layer Learned GeLU - 64 80 10240
Yuan 1.0 (245B) Causal-Dec Next Token Standard - - - - - - 76 - 16384
Gopher (280B) Causal-Dec Next Token Standard 32k SentencePiece Pre-RMS Relative GeLU ✓ 80 128 16384
ERNIE 3.0 Titan (260B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 192 12288
GPT-NeoX-20B Causal-Dec Next Token Parallel 50k BPE Layer Rotary GeLU ✓ 44 64 -
OPT (175B) Causal-Dec Next Token Standard - BPE - - ReLU ✓ 96 96 -
BLOOM (176B) Causal-Dec Next Token Standard 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
Galactica (120B) Causal-Dec Next Token Standard 50k BPE+custom Layer Learned GeLU × 96 80 10240
GLaM (1.2T) MoE-Dec Next Token Standard 256k SentencePiece Layer Relative GeLU ✓ 64 128 32768
LaMDA (137B) Causal-Dec Next Token Standard 32k BPE Layer Relative GeGLU - 64 128 8192
MT-NLG (530B) Causal-Dec Next Token Standard 50k BPE Pre-Layer Learned GeLU ✓ 105 128 20480
AlphaCode (41B) Enc-Dec Next Token Multi-query 8k SentencePiece - - - - 64 128 6144
Chinchilla (70B) Causal-Dec Next Token Standard 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 80 64 8192
PaLM (540B) Causal-Dec Next Token Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
AlexaTM (20B) Enc-Dec Denoising Standard 150k SentencePiece Pre-Layer Learned GeLU ✓ 78 32 4096
Sparrow (70B) Causal-Dec Pref.&Rule RM - 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 16∗ 64 8192
U-PaLM (540B) Non-Causal-Dec MoD Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
UL2 (20B) Enc-Dec MoD Standard 32k SentencePiece - - - - 64 16 4096
GLM (130B) Non-Causal-Dec AR Blank Infilling Standard 130k SentencePiece Deep RoPE GeGLU ✓ 70 96 12288
CodeGen (16B) Causal-Dec Next Token Parallel - BPE Layer RoPE - - 34 24 -
LLaMA (65B) Causal-Dec Next Token Standard 32k BPE Pre-RMS RoPE SwiGLU - 80 64 8192
PanGu-Σ (1085B) Causal-Dec Next Token Standard - BPE Fused Layer - FastGeLU - 40 40 5120
BloombergGPT (50B) Causal-Dec Next Token Standard 131k Unigram Layer ALiBi GeLU ✓ 70 40 7680
Xuan Yuan 2.0 (176B) Causal-Dec Next Token Self 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
CodeT5+ (16B) Enc-Dec SC+NT+Cont.+Match Standard - Code-Specific - - - - - - -
StarCoder (15.5B) Causal-Dec FIM Multi-query 49k BPE - Learned - - 40 48 6144
LLaMA (70B) Causal-Dec Next Token Grouped-query 32k BPE Pre-RMS RoPE SwiGLU - - - -
PaLM-2 - MoD Parallel - - - - - - - - -
TABLE VI: Summary of optimization settings used for pre-trained LLMs. The values for weight decay, gradient clipping, and
dropout are 0.1, 1.0, and 0.1, respectively, for most of the LLMs.
Sequence LR Optimizers Precision Weight Grad
Models Batch Size Length LR Warmup Decay AdaFactor Adam AdamW FP16 BF16 Mixed Decay Clip Dropout
T5 (11B) 211 512 0.01 × inverse square root ✓ - - - - - ✓
GPT3 (175B) 32K - 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
mT5 (13B) 1024 1024 0.01 - inverse square root ✓ - - - - - ✓
PanGu-α (200B) - 1024 2e-5 - - - - - - ✓ - - - -
CPM-2 (198B) 1024 1024 0.001 - - ✓ - - - - - ✓
Codex (12B) - - 6e-5 ✓ cosine ✓ ✓ ✓ - -
ERNIE 3.0 (12B) 6144 512 1e-4 ✓ linear ✓ - - - ✓ - -
Jurassic-1 (178B) 3.2M 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
HyperCLOVA (82B) 1024 - 6e-5 - cosine ✓ - - - ✓ - -
Yuan 1.0 (245B) <10M 2048 1.6e-4 ✓ cosine decay to 10% ✓ - - - ✓ - -
Gopher (280B) 3M 2048 4e-5 ✓ cosine decay to 10% ✓ ✓ - ✓ -
ERNIE 3.0 Titan (260B) - 512 1e-4 ✓ linear ✓ ✓ ✓ ✓ -
GPT-NeoX-20B 1538 2048 0.97e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
OPT (175B) 2M 2048 1.2e-4 - linear ✓ ✓ ✓ ✓ ✓
BLOOM (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Galactica (120B) 2M 2048 7e-6 ✓ linear decay to 10% ✓ - - - ✓ ✓ ✓
GLaM (1.2T) 1M 1024 0.01 - inverse square root ✓ FP32 + ✓ - ✓ ×
LaMDA (137B) 256K - - - - - - - - - - - - -
MT-NLG (530B) 1920 2048 5e-5 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
AlphaCode (41B) 2048 1536+768 1e-4 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
Chinchilla (70B) 1.5M 2048 1e-4 ✓ cosine decay to 10% ✓ ✓ - - -
PaLM (540B) 2048 2048 0.01 - inverse square root ✓ - - - ✓ ✓ ×
AlexaTM (20B) 2M 1024 1e-4 - linear decay to 5% ✓ ✓ ✓ - ✓
U-PaLM (540B) 32 2048 1e-4 - cosine ✓ - - - - - -
UL2 (20B) 1024 1024 - - inverse square root - - - - - - × - -
GLM (130B) 4224 2048 8e-5 ✓ cosine ✓ ✓ ✓ ✓ ✓
CodeGen (16B) 2M 2048 5e-5 ✓ cosine ✓ - - - ✓ ✓ -
LLaMA (65B) 4M Tokens 2048 1.5e-4 ✓ cosine decay to 10% ✓ - - - ✓ ✓ -
PanGu-Σ (1.085T) 512 1024 2e-5 ✓ - ✓ ✓ - - -
BloombergGPT (50B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Xuan Yuan 2.0 (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
CodeT5+ (16B) 2048 1024 2e-4 - linear ✓ ✓ ✓ - -
StarCoder (15.5B) 512 8k 3e-4 ✓ cosine ✓ ✓ ✓ - -
LLaMA-2 (70B) 4M Tokens 4k 1.5e-4 ✓ cosine ✓ ✓ ✓ ✓ -
TABLE VII: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are
the same as the pre-trained models, while no model uses weight decay for instruction tuning.
Sequence Optimizers Grad
Models Batch Size Length LR Warmup LR_Decay AdaFactor Adam AdamW Clip Dropout
WebGPT (175B) BC:512, RM:32 - 6e-5 - - ✓ - -
T0 (11B) 1024 1280 1e-3 - - ✓ - ✓
Tk-Instruct (11B) 1024 - 1e-5 - constant - - - - -
OPT-IML (175B) 128 2048 5e-5 × linear ✓ ✓ ✓
Flan-U-PaLM (540B) 32 - 1e-3 - constant ✓ - ✓
Sparrow (70B) RM: 8+16, RL:16 - 2e-6 ✓ cosine decay to 10% ✓ ✓ ×
WizardCoder (15B) 512 2048 2e-5 ✓ cosine - - - - -
Alpaca (13B) 128 512 1e-5 ✓ cosine - - ✓ ✓ ×
Vicuna (13B) 128 -2048 2e-5 ✓ cosine ✓ - ×
LIMA (65B) 32 2048 1e-5 × linear ✓ - ✓
TABLE VIII: Details of various well-known pre-training and fine-tuning datasets. Here, alignment means aligning with human
preferences.
Dataset Type Size/Samples Tasks Source Creation Comments
C4 [10] Pretrain 806GB - Common Crawl Automated A clean English-language dataset with billions of tokens
mC4 [11] Pretrain 38.49TB - Common Crawl Automated A multilingual extension of the C4 dataset, mC4
identifies over 100 languages using cld3 from 71
monthly web scrapes of Common Crawl.
Common Crawl, PubMed Central,
PILE [290] Pretrain 825GB - OpenWebText2, ArXiv, GitHub, Automated A massive dataset comprised of 22 constituent sub-
Books3, and others datasets
ROOTs [291] Pretrain 1.61TB - 498 Hugging Face datasets Automated 46 natural and 13 programming languages
MassiveWeb, Books, News,
MassiveText [111] Pretrain 10.5TB - Automated 99% of the data is in English
Wikipedia, Github, C4
Wikipedia [292] Pretrain - - Wikipedia Automated Dump of wikipedia
CommonCrawl, C4, Wikipedia,
RedPajama [293] Pretrain 5TB - Automated Open-source replica of LLaMA dataset
Github, Books, StackExchange
PushShift.io Reddit Pretrain 21.1GB - Reddit Automated Submissions and comments on Reddit from 2005
to 2019
BigPython [129] Pretrain 5.5TB Coding GitHub Automated -
Pool of Prompt (P3) [17] Instructions 12M 62 PromptSource Manual A Subset of PromptSource, created from 177
datasets including summarization, QA, classifica-
tion, etc.
xP3 [143] Instructions 81M 71 P3+Multilingual datasets Manual Extending P3 to total 46 languages
Super-NaturalInstructions (SNI) [18] Instructions 12.4M 1616 Multiple datasets Manual Extending P3 with additional multi-lingual
datasets, total 46 languages
Flan [16] Instructions 15M 1836 Muffin+T0-SF+NIV2 Manual Total 60 languages
OPT-IML [92] Instructions 18.1M 1667 - Manual -
Self-Instruct [19] Instructions 82k 175 - Automated Generated 52k instructions with 82k samples from
175 seed tasks using GPT-3
Alpaca [147] Instructions 52k - - Automated Employed self-instruct method to generate data
from text-davinci-003
Vicuna [148] Instructions 125k - ShareGPT Automated Conversations shared by users on ShareGPT using
public APIs
LLaMA-GPT-4 [149] Instructions 52k - Alpaca Automated Recreated Alpaca dataset with GPT-4 in English
and Chinese
Unnatural Instructions [294] Instructions 68k - 15-Seeds (SNI) Automated -
LIMA [174] Instructions 1k - Multiple datasets Manual Carefully created samples to test performance with
fine-tuning on less data
Anthropic-HH-RLHF [295] Alignment 142k - - Manual
Anthropic-HH-RLHF-2 [167] Alignment 39k - - Manual
3. Story Cloze and Sentence Completion:
3.1 StoryCloze [323]: It introduces a new "StoryCloze Test", a commonsense reasoning framework for evaluating story understanding, generation, and script learning. It considers a model's ability to understand and generate coherent and sensible stories.
3.2 LAMBADA [324]: This dataset evaluates contextual text understanding through a word prediction task. Models must predict the last word of a passage, which is easy for humans when given the whole passage, but not when given only the last sentence.
4. Physical Knowledge and World Understanding:
4.1 PIQA [329]: A dataset that probes the physical knowledge of models, aiming to understand how well they are learning about the real world.
4.2 TriviaQA [330]: A dataset that tests models on reading comprehension and open-domain question answering (QA) tasks, with a focus on Information Retrieval (IR)-style QA.
4.3 ARC [331]: A larger version of the ARC-Challenge, this dataset contains both easy and challenging grade-school level, multiple-choice science questions. It's a comprehensive test of a model's ability to understand and answer complex questions.
4.4 ARC-Easy [331]: A subset of the ARC dataset, ARC-Easy contains questions that are answered correctly by either a retrieval-based algorithm or a word co-occurrence algorithm. It's a great starting point for models beginning to explore advanced question-answering.
4.5 ARC-Challenge [331]: A rigorous question-answering dataset, ARC-Challenge includes complex, grade-school level questions that demand reasoning beyond simple retrieval, testing the true comprehension capabilities of models.
5. Contextual Language Understanding:
5.1 RACE [336]: RACE is a reading comprehension dataset collected from English examinations in China, which benchmarks AI models for understanding and answering questions on long and complex passages, simulating the challenge of a real-world examination.
5.2 RACE-Middle [336]: Another subset of the RACE [336] dataset, RACE-Middle contains middle school-level English exam questions. It offers a slightly less challenging but academically oriented evaluation of a model's comprehension skills.
5.3 RACE-High [336]: A subset of the RACE [336] dataset, RACE-High consists of high school-level English exam questions. It is designed to evaluate the comprehension ability of models in a more academic and challenging context.
5.4 QuAC [337]: This dataset simulates an information-seeking dialog between students and teachers using hidden Wikipedia text. It introduces unique challenges not found in machine comprehension datasets, making it a valuable resource for advancing dialog systems.
6. Commonsense Reasoning:
6.1 HellaSwag [344]: A dataset that challenges models to pick the best ending to a context. It uses Adversarial Filtering to create a 'Goldilocks' zone of complexity, where the generated text is absurd to humans but often misclassified by models.
6.2 COPA [390]: This dataset evaluates a model's progress in open-domain commonsense causal reasoning. Each question comprises a premise and two alternatives, and the model must select the more plausible alternative, testing a model's ability to understand and reason about cause and effect.
6.3 WSC [346]: The Winograd Schema Challenge (WSC) is a reading comprehension task in which a system must resolve references in a text, often requiring world knowledge and reasoning about the text.
6.4 CSQA [347]: CommonsenseQA is a question-answering dataset that requires commonsense knowledge, testing the ability of AI models to understand and answer questions that demand commonsense reasoning.
7. Reading Comprehension:
7.1 BoolQ [352]: A dataset derived from Google search queries, BoolQ challenges models to answer binary (yes/no) questions. The questions are naturally occurring and are paired with a paragraph from a Wikipedia article containing the answer. It's a test of reading comprehension and reasoning.
7.2 SQuADv2 [353]: The Stanford Question Answering Dataset (SQuAD) [351] is a collection of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuADv2 combines the original SQuAD1.1 dataset with over 50,000 unanswerable questions. The aim is to evaluate a model's ability to understand and answer questions based on a given context and to determine when a question is unanswerable.
7.3 DROP [354]: DROP, or Discrete Reasoning Over the content of Paragraphs, is designed to test a model's ability to understand a wide variety of reading phenomena. It requires models to perform discrete operations such as addition, counting, or sorting over the content of paragraphs.
TABLE X: An illustration of training datasets and evaluation tasks employed by pre-trained LLMs. Here, “QA” is question-
answering, “Clf” is classification, “NLI” is natural language inference, “MT” is machine translation, “RC” is reading
comprehension, “CR” is commonsense reasoning, “MR” is mathematical reasoning, “Mem.” is memorization.
Benchmark
Truthful/
BIG- Super Cloze/ Bias/
Models Training Dataset MMLU QA Clf NLI MT RC CR MR Coding
bench GLUE Completion Toxicity/
Mem.
T5 C4 [10] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
GPT-3 Common Crawl, WebText, Books Corpora, ✓ ✓ ✓ ✓ ✓ ✓
Wikipedia
mT5 mC4 [11] ✓ ✓ ✓
PanGu-α 1.1TB Chinese Text Corpus ✓ ✓ ✓ ✓ ✓
CPM-2 WuDaoCorpus [104] ✓ ✓
Codex 54 million public repositories from Github ✓
ERNIE-3.0 Chinese text corpora, Baidu Search, Web ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
text, QA-long, QA-short, Poetry and Cou-
plet Domain-specific data from medical,
law, and financial area Baidu knowledge
graph with more than 50 million facts
Jurassic-1 Wikipedia, OWT, Books, C4, Pile [290], ✓ ✓ ✓ ✓
arXiv, GitHub
HyperCLOVA Korean blogs, Community sites, News, KiN ✓
Korean Wikipedia, Wikipedia (English and
Japanese), Modu-Corpus: Messenger, News,
Spoken and written language corpus, Web
corpus
Yuan 1.0 Common Crawl, SogouT, Sogou News, ✓ ✓ ✓ ✓
Baidu Baike, Wikipedia, Books
Gopher subsets of MassiveWeb Books, C4, News, ✓ ✓ ✓ ✓ ✓ ✓ ✓
GitHub and Wikipedia samples from Mas-
siveText
ERNIE-3.0 TITAN Same as ERNIE 3.0 and ERNIE 3.0 ad- ✓ ✓ ✓ ✓ ✓
versarial dataset, ERNIE 3.0 controllable
dataset
GPT-NeoX-20B Pile [290] ✓ ✓ ✓ ✓ ✓ ✓
OPT RoBERTa [288], Pile [290], PushShift.io ✓ ✓ ✓ ✓
Reddit [412]
BLOOM ROOTs [13] ✓ ✓ ✓ ✓ ✓ ✓
Galactica arXiv, PMC, Semantic Scholar, Wikipedia, ✓ ✓ ✓ ✓ ✓
StackExchange, LibreText, Open Text-
books, RefSeq Genome, OEIS, LIPID
MAPS, NASAExoplanet, Common Crawl,
ScientificCC, AcademicCC, GitHub reposi-
tories Khan Problems, GSM8K, OneSmall-
Step
GLaM Filtered Webpages, Social media conversa- ✓ ✓ ✓ ✓ ✓
tions Wikipedia, Forums, Books, News
LaMDA Infiniset : Public documents, Dialogs, Utter- ✓
ances
MT-NLG Two snapshots of Common Crawl and ✓ ✓ ✓ ✓ ✓
Books3, OpenWebText2, Stack Exchange,
PubMed Abstracts, Wikipedia, PG-19 [242],
BookCorpus2, NIH ExPorter, Pile, CC-
Stories, RealNews
AlphaCode Selected GitHub repositories, CodeCon- ✓
tests: Codeforces, Description2Code, Co-
deNet
Chinchilla MassiveWeb, MassiveText Books, C4, ✓ ✓ ✓ ✓ ✓ ✓
News, GitHub, Wikipedia
PaLM webpages, books, Wikipedia, news, articles, ✓ ✓ ✓ ✓ ✓ ✓
source code, social media conversations
AlexaTM Wikipedia, mC4 ✓ ✓ ✓ ✓ ✓
U-PaLM Same as PaLM ✓ ✓ ✓ ✓ ✓ ✓ ✓
UL2 - ✓ ✓ ✓ ✓ ✓ ✓
GLM-130B - ✓ ✓ ✓
CodeGen Pile, BigQuery, BigPython ✓
LLaMA CommonCrawl, C4, Github, Wikipedia, ✓ ✓ ✓ ✓ ✓ ✓ ✓
Books, arXiv, StackExchange
PanGu-Σ WuDaoCorpora, CLUE, Pile, C4, Python ✓ ✓ ✓ ✓ ✓ ✓
code
BloombergGPT inPile, Pile, C4, Wikipedia ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeT5+ CodeSearchNet, Github Code ✓ ✓
StarCoder The Stack v1.2 ✓ ✓ ✓ ✓
LLaMA-2 ✓ ✓ ✓ ✓ ✓ ✓ ✓
PaLM-2 Web documents, Code, Books, Maths, Con- ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
versation
TABLE XI: An illustration of training datasets and evaluation benchmarks used in fine-tuned LLMs. "SNI" is short for Super-NaturalInstructions.
Truthful/
BIG-
Models Training Dataset MMLU BBH RAFT FLAN SNI PromptSource TyDiQA HumanEval MBPP Bias/
bench
Toxicity
T0 Pool of Prompts ✓
WebGPT ELI5 [413], ELI5 fact-check [155], Triv- ✓
iaQA [330], ARC-Challenge [331], ARC-
Easy [331], Hand-written data, Demon-
strations of humans, Comparisons between
model-generated answers
Tk-INSTRUCT SNI [18] ✓
mT0 xP3 [143]
OPT-IML PromptSource [17], FLAN [16], SNI [414], ✓ ✓ ✓ ✓ ✓ ✓
UnifiedSKG [415], CrossFit [416],
ExMix [417], T5 [10], Reasoning
Flan Muffin, T0-SF, NIv2, CoT ✓ ✓ ✓
WizardCoder Code Alpaca ✓ ✓
TABLE XII: Performance comparison of top performing LLMs across various NLU and NLG tasks. Here, “N-Shots” indicate
the number of example prompts provided to the model during the evaluation, representing its capability in few-shot or zero-shot
learning settings, “f” represents the fine-tuned version, and “B” represents the benchmark.
Top-1 Top-2 Top-3
Task Dataset/Benchmark
Model (Size) Score (N-shots) Model (Size) Score (N-shots) Model (Size) Score (N-shots)
BIG-bench (B) Chinchilla (70B) 65.1 (5-shot) Gopher (280B) 53.97 (5-shot) PaLM (540B) 53.7 (5-shot)
Multi-Task
MMLU (B) GPT-4 (-) 86.4 (5-shot) Gemini (Ultra) 83.7 (5-shot) Flan-PaLM-2(f ) (Large) 81.2 (5-shot)
Language Understanding SuperGLUE (B) ERNIE 3.0 (12B) 90.6 (-) PaLM(f ) (540B) 90.4 (-) T5 (11B) 88.9 (-)
Story Comprehension and HellaSwag GPT-4 (-) 95.3 (10-shot) Gemini (Ultra) 87.8 (10-shot) PaLM-2 (Large) 86.8 (one shot)
Generation StoryCloze GPT3 (175B) 87.7 (few shot) PaLM-2 (Large) 87.4 (one shot) OPT (175B) 79.82 (-)
Physical Knowledge and PIQA PaLM-2 (Large) 85.0 (one shot) LLaMa (65B) 82.8 (zero shot) MT-NLG (530B) 81.99 (zero shot)
World Understanding TriviaQA PaLM-2 (Large) 86.1 (one shot) LLaMA-2 (70B) 85.0 (one shot) PaLM (540B) 81.4 (one shot)
Contextual Language
LAMBADA PaLM (540B) 89.7 (few shot) MT-NLG (530B) 87.15 (few shot) PaLM-2 (Large) 86.9 (one shot)
Understanding
WinoGrande GPT-4 (-) 87.5 (5-shot) PaLM-2 (Large) 83.0 (one shot) PaLM (540B) 81.1 (zero shot)
Commonsense Reasoning
SIQA LLaMA (65B) 52.3 (zero shot) Chinchilla (70B) 51.3 (zero shot) Gopher (280B) 50.6 (zero shot)
Reading Comprehension BoolQ PaLM(f ) (540B) 92.2 (-) T5 (11B) 91.2 (-) PaLM-2 (Large) 90.9 (one shot)
Truthfulness Truthful-QA LLaMA (65B) 57 (-)
MATH Gemini (Ultra) 53.2 (4-shot) PaLM-2 (Large) 34.3 (4-shot) LLaMa-2 (65B) 13.5 (4-shot)
Mathematical Reasoning
GSM8K GPT-4 (-) 92.0 (5-shot) PaLM-2 (Large) 80.7 (8-shot) U-PaLM (540B) 58.5 (-)
Problem Solving and
HumanEval Gemini(f ) (Ultra) 74.4 (zero shot) GPT-4 (-) 67.0 (zero shot) Code Llama (34B) 48.8 (zero shot)
Logical Reasoning
VI. Applications
General Purpose: LLMs are being widely considered as general-purpose tools for a wide variety of tasks [420]. This is due to their inherent ability to understand, generate, and manipulate human-like text in a contextually relevant manner. This allows them to perform tasks ranging from simple language translation and question-answering to more complex tasks like summarization, text generation, and even programming help [421]. The utility of LLMs is further enhanced by their ability to adapt to the specific style and tone of the text they are processing, making the outputs more user-friendly and context-aware. In everyday applications, LLMs can be used as personal assistants, helping users draft emails or schedule appointments [422]; they can also be deployed in customer service to handle common questions; or applied to generate content for digital platforms like websites, by creating human-like text based on given prompts [423]. Moreover, LLMs play a crucial role in data analysis, where they can filter large volumes of text data, summarize key points, and find patterns that would take humans much longer to identify [424]. Despite their wide-ranging applications, it is essential to remember that LLMs, similar to any AI system, are only as good as the data they have been trained on.
Medicine: The application of LLMs in the field of medicine is reshaping healthcare delivery and research. For example, LLMs are increasingly used in clinical decision support systems to provide physicians with evidence-based treatment recommendations [425], [426], [427]. By analyzing patient data and medical literature, they can help identify potential diagnoses, suggest appropriate tests, and recommend optimal treatment strategies. Moreover, LLMs can also enhance patient interactions with healthcare systems; e.g., they can be used in chatbot applications [428], [429], [430] to answer patient queries about symptoms or medications, schedule appointments, and even provide essential health advice. For medical research, LLMs are used to extract and filter information from a considerable amount of medical literature, identify relevant studies, summarize findings, and even predict future research trends [431], [432], [433]. For medical education, LLMs can help create training materials, generate exam questions, provide detailed explanations of complex medical topics, and offer personalized feedback to students [434], [435], [436], [437]. They can also simulate patient interactions, enabling students to practice and improve their clinical skills. At a broader level, LLMs can assist in public health initiatives by analyzing media data to detect disease outbreaks, monitor public sentiment towards health policies, and disseminate health information in a clear and understandable manner [438]. When employing LLMs to support public health initiatives, related issues such as data privacy, the necessity for explainability, and the potential risk of propagating biases must be addressed [439], [440].
Education: The integration of LLMs into the educational sector offers opportunities to enhance learning experiences, teacher support, and educational content development. For students, by analyzing their learning styles, performance, and preferences, LLMs can provide customized study materials and practice questions to develop personalized learning experiences [441]. For teachers, LLMs can help create lesson plans, grade assignments, and generate diverse and inclusive educational content, significantly saving time for teaching and student interaction [442], [443]. In language learning, LLMs serve as advanced conversational partners capable of simulating conversations in multiple languages, correcting grammar, enhancing vocabulary, and aiding pronunciation for fluency in practice [444]. Furthermore, LLMs improve accessibility in education by providing support for students with disabilities. They can generate real-time transcriptions for the hearing impaired, offer reading assistance for the visually impaired, and simplify complex texts for those with learning disabilities [440]. As LLMs continue to evolve, their applications in education can benefit more students and teachers from different perspectives in practice.
apparent to human researchers [447]. Moreover, for scientific writing, LLMs can help researchers draft documents, suggest improvements, and ensure adherence to specific formatting guidelines [448], [449]. This not only saves time but also improves the clarity of scientific communication, enabling interdisciplinary teams to work together more effectively.

Maths: In addition to providing mathematical research and education support, LLMs can assist in solving mathematical problems by giving step-by-step explanations and guiding users through complex proofs and calculations. They can help identify errors in reasoning or computation and suggest corrections, serving as an invaluable tool for both learning and verification purposes [450], [451]. LLMs can be employed to check the validity of mathematical proofs, offering a preliminary filter before human review. While they are not a substitute for the meticulous work of mathematicians, they can help simplify the process of proof verification [452], [453]. Moreover, LLMs enhance accessibility to mathematics by translating complex concepts and findings into understandable language for non-specialists [454], helping bridge the gap between theoretical mathematics and applied contexts such as physics, engineering, and economics.

Law: LLMs can assist with the thematic analysis of legal documents, including generating initial coding for datasets, identifying themes, and classifying data according to these themes. This collaborative effort between legal experts and LLMs has proved effective in analyzing legal texts such as court opinions on theft, improving both the efficiency and quality of the research [455]. Additionally, LLMs have been evaluated for their ability to generate explanations of legal terms, focusing on improving factual accuracy and relevance by incorporating sentences from case law. By feeding relevant case law into the LLM, the augmented models can generate higher-quality explanations with less factually incorrect information [456]. Moreover, LLMs can be trained with specialized domain knowledge to perform legal reasoning tasks [457] and answer legal questions [458].

Finance: LLMs like BloombergGPT [140], trained on extensive proprietary financial datasets, exhibit superior performance on financial tasks. This indicates the value of domain-specific training in creating LLMs that can more accurately understand and process industry-specific language and concepts. The introduction of FinGPT [459] as an open-source model offers transparent and accessible resources to develop novel applications such as robo-advising, algorithmic trading, and low-code solutions, ultimately expanding the capabilities of financial services. Both BloombergGPT and FinGPT show the adaptability of LLMs to the financial domain, with the former showing the power of custom datasets and the latter emphasizing a data-centric approach and low-rank adaptation (LoRA) techniques for customization. Moreover, LLMs demonstrate an ability to break down complex financial tasks into actionable plans, enabling end-to-end solutions that were previously unfeasible with a single model [460].
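To make the low-rank adaptation point above concrete, the following is a minimal sketch of a LoRA-style layer, in which a frozen pre-trained weight matrix is augmented with a small trainable low-rank update. The class name, dimensions, and initialization here are illustrative assumptions, not the configuration used by FinGPT [459].

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Minimal LoRA-style layer (illustrative): y = x W^T + (alpha / r) * x A^T B^T.
    # The pre-trained weight W is frozen; only the low-rank factors A and B are trained.
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)       # stands in for pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                        # frozen pre-trained path
        update = (x @ self.lora_A.T) @ self.lora_B.T    # trainable low-rank path
        return base + self.scaling * update

layer = LoRALinear(768, 768, r=8)
y = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # torch.Size([4, 768]); 8*768 + 768*8 = 12288 trainable parameters

Because only the two low-rank factors receive gradients, a domain-specific variant of a large model can be trained and stored at a small fraction of the cost of full fine-tuning, which is what makes this style of customization attractive for finance-specific models.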
Robotics: In robotics research, LLMs have promising applications, such as enhancing human-robot interaction [28], [461], [462], [463], task planning [226], motion planning [235], navigation [235], [464], object manipulation [225], personalized robots [465], etc. LLMs enable robots to understand the environment effectively and generate plans to complete tasks collaboratively [229], [26]. They can facilitate continuous learning by allowing robots to access and integrate information from a wide range of sources, helping robots acquire new skills, adapt to changes, and refine their paths [213], [222], [223].
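As a rough illustration of how an LLM can be prompted to produce such task plans, the sketch below converts a natural-language goal and a fixed library of robot skills into a prompt and parses the model's reply into an ordered list of skill calls. The skill names and the query_llm callable are hypothetical placeholders, not the interfaces of the systems cited above.

# Illustrative sketch of LLM-based task planning over a fixed skill library.
SKILLS = ["go_to(location)", "pick(object)", "place(object, location)", "open(container)"]

def build_planning_prompt(goal: str) -> str:
    skill_list = "\n".join(f"- {s}" for s in SKILLS)
    return (
        "You control a mobile manipulator. Only the following skills are available:\n"
        f"{skill_list}\n"
        f"Goal: {goal}\n"
        "Return a numbered plan with exactly one skill call per line."
    )

def plan(goal: str, query_llm) -> list:
    """query_llm is any callable that maps a prompt string to the model's text reply."""
    reply = query_llm(build_planning_prompt(goal))
    # Keep only lines that look like skill calls; a real system would also validate
    # the arguments against the perceived scene before executing anything.
    return [line.strip() for line in reply.splitlines() if "(" in line]

# Example with a stubbed model:
steps = plan("put the mug in the sink",
             query_llm=lambda p: "1. go_to(kitchen)\n2. pick(mug)\n3. place(mug, sink)")
print(steps)

Constraining the plan to a closed skill set is what lets a robot-side executor map each generated line to an executable primitive.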
VII. CHALLENGES AND FUTURE DIRECTIONS

LLMs such as GPT-4 and its predecessors have significantly advanced natural language processing. Nevertheless, they also bring along a set of challenges. Computational cost, adversarial robustness, and interpretability are among the technical challenges intrinsic to these models. Furthermore, as these models are scaled up to handle more complex tasks or to operate in more complex or dynamic environments, new challenges in scalability, privacy, and real-time processing emerge. On the frontier of foundational research, integrating multi-modality and improving the effectiveness of transfer learning are being keenly explored. Additionally, the continual learning aspect of these models, which aims to have them adapt to new information over time, presents a fresh set of challenges. These challenges not only underscore the technical intricacies involved but also highlight the broader impact and future trajectory of LLMs in real-world applications. The following paragraphs discuss these challenges, shedding light on ongoing and potential efforts to address them.

Computational Cost: Training LLMs requires extensive computational resources, which increases production costs and raises environmental concerns due to the substantial energy consumed during large-scale training. Performance improves as computational resources increase, but the rate of improvement gradually decreases when both the model and dataset size remain fixed, following a power law of diminishing returns [466].
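This diminishing-returns behavior is commonly summarized with empirical scaling laws. As one hedged illustration (using the parametric form popularized by compute-optimal scaling studies such as [119], not a formula given in [466]), the pre-training loss can be modeled as

L(N, D) \approx E + A N^{-\alpha} + B D^{-\beta},

where N is the number of model parameters, D the number of training tokens, E an irreducible loss term, and A, B, \alpha, \beta > 0 are fitted constants. Because the reducible terms decay only polynomially, each additional order of magnitude of parameters or data yields progressively smaller loss reductions.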
Bias and Fairness: LLMs can inherit and amplify societal biases present in their training data. These biases can manifest in the model's outputs, leading to potential ethical and fairness issues [467].

Overfitting: Although LLMs possess substantial learning capabilities, they are susceptible to overfitting noisy and peculiar patterns within their extensive training data, which may cause them to generate illogical responses [468]. The debate about memorization vs. generalization in LLMs comes down to finding the right balance. Memorization allows the model to remember specific details from its training data, ensuring it can provide accurate answers to precise questions. Generalization, in contrast, enables the model to make inferences and produce responses for inputs it has not seen before, which is essential for handling varied real-world tasks. Striking the right balance is the challenge: too much memorization leads to overfitting, making the model inflexible and likely to struggle with new inputs [469].

Economic and Research Inequality: The high cost of training and deploying LLMs may concentrate their development within well-funded organizations, potentially worsening economic and research inequalities in AI [470].
Reasoning and Planning: Some reasoning and planning tasks, even ones as seemingly simple as common-sense planning, which humans find easy, remain well beyond the current capabilities of LLMs when evaluated with systematic assessment frameworks. This is not entirely unexpected, considering that LLMs primarily generate text completions based on likelihood and offer no solid guarantees in terms of reasoning abilities [471].

Hallucinations: LLMs exhibit "hallucinations", generating responses that, while sounding plausible, are incorrect or do not align with the provided information [472]. Hallucinations can be categorized into three types:
• Input-conflicting hallucination, wherein LLMs produce content that diverges from the input given by users.
• Context-conflicting hallucination, where LLMs generate content that contradicts information they have generated earlier.
• Fact-conflicting hallucination, where LLMs generate content that does not align with established world knowledge.

Prompt Engineering: Prompts serve as inputs to LLMs, and their syntax and semantics play a crucial role in determining the model's output. Prompt variations, sometimes counter-intuitive to humans, can result in significant changes in model output; this sensitivity is addressed through prompt engineering, which involves designing natural language queries that guide LLM responses effectively [473], [32].
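As a small, hypothetical illustration of the practice, the sketch below treats the prompt as a parameterized template so that instruction wording, answer format, and few-shot examples can be varied and compared systematically; the template text and helper names are assumptions for illustration, not a prescribed method from [473] or [32].

# Hypothetical sketch: the prompt as a parameterized template whose variants can be compared.
TEMPLATE = (
    "{instruction}\n\n"
    "{examples}"
    "Review: {review}\n"
    "Sentiment:"
)

def build_prompt(review, instruction, few_shot=()):
    examples = "".join(f"Review: {r}\nSentiment: {label}\n\n" for r, label in few_shot)
    return TEMPLATE.format(instruction=instruction, examples=examples, review=review)

v1 = build_prompt("The plot dragged on forever.",
                  instruction="Classify the sentiment of the review as Positive or Negative.")
v2 = build_prompt("The plot dragged on forever.",
                  instruction="You are a strict movie critic. Answer with exactly one word: Positive or Negative.",
                  few_shot=[("A delightful, moving film.", "Positive")])
print(v1)
print(v2)

Variant v2 differs from v1 only in instruction wording, output constraint, and one in-context example, yet changes of exactly this kind are what can shift model behavior noticeably.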
Limited Knowledge: Information acquired during pretraining is limited and may become obsolete over time, and re-training the model on updated data is costly. To generate factually accurate responses, retrieval-augmented pipelines are commonly used [187]. However, pre-trained models are not originally trained for retrieval-augmented generation (RAG) [6], [21]; hence, adapting the training pipeline is necessary [182], [25].
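To make the retrieval-augmentation idea concrete, here is a minimal sketch of a RAG-style inference loop under stated assumptions: the embed function is a stand-in for a trained text encoder, the document list plays the role of an external index, and generate is any callable wrapping an LLM; none of these correspond to the specific systems of [182], [25], or [187].

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real pipeline would call a trained encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list, k: int = 2) -> list:
    # Rank documents by cosine similarity to the query embedding and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: float(embed(d) @ q), reverse=True)[:k]

def rag_answer(query: str, docs: list, generate) -> str:
    """generate is any callable mapping a prompt string to the model's text output."""
    context = "\n".join(retrieve(query, docs))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)

# Usage with a stubbed generator:
# rag_answer("Who maintains the dataset?", corpus, generate=lambda p: "<model output>")

Because the supporting passages are fetched at query time, the model can ground its answer in up-to-date text without re-training, which is what makes retrieval augmentation attractive when pre-training knowledge goes stale.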
Safety and Controllability: Using LLMs comes with the risk of generating harmful, misleading, or inappropriate content, whether by accident or when given specific prompts. Ensuring these models are safely utilized is a significant concern [474].

Multi-Modality: Multi-modal learning, where LLMs are trained on diverse data such as text, images, and videos, aims to create models with a richer understanding but faces challenges in data alignment, fusion strategies, and higher computational demands.

Catastrophic Forgetting: LLMs are often pre-trained on large datasets and then fine-tuned on domain-specific data, which reduces training resources but raises issues such as domain adaptation and catastrophic forgetting, i.e., the loss of previously acquired knowledge when learning new tasks.

Adversarial Robustness: LLMs have shown great capabilities in various tasks but are vulnerable to adversarial attacks, where slight, deliberate input alterations can mislead them. For models like BERT, adversarial fine-tuning can enhance robustness, although it sometimes compromises generalization [475]. As LLMs are integrated into more complex systems, examining their security properties becomes crucial, given the emerging field of adversarial attacks on LLMs within trustworthy ML [476]. This vulnerability is especially notable in safety-critical domains, necessitating robust adversarial evaluation tools to ensure LLM reliability [477].

Interpretability and Explainability: The "black-box" nature of LLMs poses challenges in understanding their decision-making, which is crucial for broader acceptance and trust, especially in sensitive domains. Despite their advanced capabilities, the lack of insight into their operation limits their effectiveness and trustworthiness [478], [479]. Efforts are being made to make LLMs more explainable in order to promote user trust and ensure responsible AI usage. Understanding the logic behind LLM responses is essential for fostering trust and ensuring that they align with human values and legal standards.

Privacy Concerns: Privacy concerns around LLMs have escalated with their growth in complexity and size, particularly regarding data sharing and potential misuse. There is a risk of malicious content creation, filter bypass, and data privacy breaches, especially in e-commerce, where protecting customer privacy is crucial. If models are trained on private data, additional concerns arise when such models are made publicly available: LLMs tend to memorize phrases from their training sets, which an adversary could exploit to extract sensitive data, posing a threat to personal privacy [480], [481].

Real-Time Processing: Real-time processing is pivotal for many LLM applications, especially with the rising popularity of mobile AI applications and the accompanying concerns regarding information security and privacy. However, LLMs have many layers and billions of parameters, which impede real-time processing due to high computational demands and limited weight storage on hardware platforms, particularly in edge-computing environments [482]. While efforts such as MobileBERT aim to reduce memory requirements, such models still face substantial execution overhead due to their large number of layers, leading to high inference latency.

Long-Term Dependencies: LLMs have shown considerable progress in understanding and generating text, yet they often struggle with preserving context and handling long-term dependencies, particularly in complex, multi-turn conversations or long documents. This limitation can lead to incoherent or irrelevant responses.

Hardware Acceleration: The growth of LLMs presents significant hardware challenges due to the increasing computational and memory demands of training and deploying these models. GPUs have played a crucial role in meeting the hardware requirements for training LLMs, with the networking industry also evolving to optimize hardware for training workloads. However, the size of LLMs has been outpacing hardware progress, making model inference increasingly costly. Model quantization is a promising approach to bridge the widening gap between LLM size and hardware capacity [483]. Although specialized hardware accelerators such as GPUs or TPUs can significantly reduce the computational cost and make real-time applications more feasible, they may not fully resolve all limitations, necessitating further advances in hardware technology.
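To illustrate the quantization idea in its simplest form, the sketch below applies symmetric per-tensor int8 quantization to a stand-in weight matrix and reports the storage saving and rounding error; it is a toy example of the general principle, not the method of [483] or of any particular deployment toolkit.

import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q, q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"int8: {q.nbytes / 1e6:.1f} MB vs fp32: {w.nbytes / 1e6:.1f} MB, mean abs error {err:.5f}")

Production schemes typically quantize per channel or per group and calibrate activations as well, but the storage arithmetic above (roughly 4x smaller than fp32) is the gap-bridging effect referred to in the paragraph.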
Regulatory and Ethical Frameworks: The rapid
generated instructions,” arXiv preprint arXiv:2212.10560, 2022. 2, 14, [41] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts
17, 20, 25 for generation,” arXiv preprint arXiv:2101.00190, 2021. 2, 18
[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, [42] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning
C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language of large language models,” arXiv preprint arXiv:2305.11627, 2023. 2,
models to follow instructions with human feedback,” Advances in 19
Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, [43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, and F. Huang,
2022. 2, 6, 12, 14, 20 “From dense to sparse: Contrastive pruning for better pre-trained
[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, language model compression,” in Proceedings of the AAAI Conference
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11 547–11 555. 2,
2: Open foundation and fine-tuned chat models,” arXiv preprint 19
arXiv:2307.09288, 2023. 2, 6, 9, 10, 14, 23, 31 [44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han,
[22] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, “Smoothquant: Accurate and efficient post-training quantization for
D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent large language models,” in ICML, ser. Proceedings of Machine Learn-
abilities of large language models,” arXiv preprint arXiv:2206.07682, ing Research, vol. 202. PMLR, 2023, pp. 38 087–38 099. 2, 18
2022. 2 [45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and
[23] T. Webb, K. J. Holyoak, and H. Lu, “Emergent analogical reasoning N. Wong, “Compression of generative pre-trained language models via
in large language models,” Nature Human Behaviour, vol. 7, no. 9, pp. quantization,” arXiv preprint arXiv:2203.10705, 2022. 2, 18
1526–1541, 2023. 2 [46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and
[24] D. A. Boiko, R. MacKnight, and G. Gomes, “Emergent autonomous S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,”
scientific research capabilities of large language models,” arXiv arXiv preprint arXiv:2308.10882, 2023. 2, 15
preprint arXiv:2304.05332, 2023. 2 [47] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn: Efficient
[25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, context window extension of large language models,” arXiv preprint
J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Few-shot arXiv:2309.00071, 2023. 2, 15
learning with retrieval augmented language models,” arXiv preprint [48] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung,
arXiv:2208.03299, 2022. 2, 15, 16, 31 and Y. Yang, “Longt5: Efficient text-to-text transformer for long
[26] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, sequences,” arXiv preprint arXiv:2112.07916, 2021. 2, 15
A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied [49] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extending context window
multimodal language model,” arXiv preprint arXiv:2303.03378, 2023. of large language models via positional interpolation,” arXiv preprint
2, 17, 18, 19, 30 arXiv:2306.15595, 2023. 2, 15
[27] A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented language [50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min,
models,” arXiv preprint arXiv:2205.12255, 2022. 2, 16, 17 B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language
[28] B. Zhang and H. Soh, “Large language models as zero-shot human models,” arXiv preprint arXiv:2303.18223, 2023. 3, 4, 6, 7
models for human-robot interaction,” arXiv preprint arXiv:2303.03548, [51] U. Naseem, I. Razzak, S. K. Khan, and M. Prasad, “A comprehensive
2023. 2, 30 survey on word representation models: From classical to state-of-the-
art word representation language models,” Transactions on Asian and
[29] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi,
Low-Resource Language Information Processing, vol. 20, no. 5, pp.
Y. Shi et al., “mplug-owl: Modularization empowers large language
1–35, 2021. 3, 4
models with multimodality,” arXiv preprint arXiv:2304.14178, 2023.
[52] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz,
2, 20
E. Agirre, I. Heinz, and D. Roth, “Recent advances in natural language
[30] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo,
processing via large pre-trained language models: A survey,” arXiv
T. Lu, J. Zhou, Y. Qiao et al., “Visionllm: Large language model is
preprint arXiv:2111.01243, 2021. 3, 4
also an open-ended decoder for vision-centric tasks,” arXiv preprint
[53] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan,
arXiv:2305.11175, 2023. 2, 20
L. He et al., “A comprehensive survey on pretrained foundation models:
[31] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: A history from bert to chatgpt,” arXiv preprint arXiv:2302.09419, 2023.
Teaching large language model to use tools via self-instruction,” arXiv 3, 4
preprint arXiv:2305.18752, 2023. 2, 17, 20 [54] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,
[32] E. Saravia, “Prompt Engineering Guide,” https://github.com/dair- J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint
ai/Prompt-Engineering-Guide, 12 2022. 2, 7, 15, 31 arXiv:2301.00234, 2022. 3, 7, 15
[33] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, [55] J. Huang and K. C.-C. Chang, “Towards reasoning in large language
W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained models: A survey,” arXiv preprint arXiv:2212.10403, 2022. 3, 7, 15
model,” arXiv preprint arXiv:2210.02414, 2022. 2, 9, 20, 21, 23 [56] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang,
[34] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, X. Jiang, and Q. Liu, “Aligning large language models with human: A
“Codet5+: Open code large language models for code understanding survey,” arXiv preprint arXiv:2307.12966, 2023. 3
and generation,” arXiv preprint arXiv:2305.07922, 2023. 2, 10, 22, 23 [57] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A survey on model com-
[35] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, pression for large language models,” arXiv preprint arXiv:2308.07633,
J. Shang, Y. Zhao, C. Pang et al., “Ernie 3.0 titan: Exploring larger- 2023. 3
scale knowledge enhanced pre-training for language understanding and [58] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on
generation,” arXiv preprint arXiv:2112.12731, 2021. 2, 8, 21, 23 multimodal large language models,” arXiv preprint arXiv:2306.13549,
[36] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: Sys- 2023. 3, 20
tem optimizations enable training deep learning models with over [59] J. J. Webster and C. Kit, “Tokenization as the initial phase in nlp,”
100 billion parameters,” in Proceedings of the 26th ACM SIGKDD in COLING 1992 volume 4: The 14th international conference on
International Conference on Knowledge Discovery & Data Mining, computational linguistics, 1992. 4
2020, pp. 3505–3506. 2, 5 [60] T. Kudo, “Subword regularization: Improving neural network transla-
[37] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory tion models with multiple subword candidates,” in Proceedings of the
optimizations toward training trillion parameter models,” in SC20: In- 56th Annual Meeting of the Association for Computational Linguistics
ternational Conference for High Performance Computing, Networking, (Volume 1: Long Papers), 2018, pp. 66–75. 4
Storage and Analysis. IEEE, 2020, pp. 1–16. 2, 4, 5, 21 [61] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation
[38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards of rare words with subword units,” in Proceedings of the 54th Annual
a unified view of parameter-efficient transfer learning,” arXiv preprint Meeting of the Association for Computational Linguistics (Volume 1:
arXiv:2110.04366, 2021. 2, 18, 19 Long Papers), 2016, pp. 1715–1725. 4
[39] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and [62] M. Schuster and K. Nakajima, “Japanese and korean voice search,” in
S. Poria, “Llm-adapters: An adapter family for parameter-efficient fine- 2012 IEEE international conference on acoustics, speech and signal
tuning of large language models,” arXiv preprint arXiv:2304.01933, processing (ICASSP). IEEE, 2012, pp. 5149–5152. 4
2023. 2, 18 [63] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé,
[40] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for A. Raja, C. Si, W. Y. Lee, B. Sagot et al., “Between words and char-
parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691, acters: A brief history of open-vocabulary modeling and tokenization
2021. 2, 8, 18 in nlp,” arXiv preprint arXiv:2112.10508, 2021. 4
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. [88] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for
Advances in neural information processing systems, vol. 30, 2017. 4, large-scale machine learning.” in Osdi, vol. 16, no. 2016. Savannah,
7 GA, USA, 2016, pp. 265–283. 5
[65] O. Press, N. Smith, and M. Lewis, “Train short, test long: Attention [89] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu,
with linear biases enables input length extrapolation,” in International C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine
Conference on Learning Representations, 2022. [Online]. Available: learning library for heterogeneous distributed systems,” arXiv preprint
https://openreview.net/forum?id=R8sQPpGCv0 4, 15 arXiv:1512.01274, 2015. 5
[66] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “Roformer: [90] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy,
Enhanced transformer with rotary position embedding,” arXiv preprint J. Launay, and C. Raffel, “What language model architecture and
arXiv:2104.09864, 2021. 4, 8, 15 pretraining objective works best for zero-shot generalization?” in
[67] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long International Conference on Machine Learning. PMLR, 2022, pp.
sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 22 964–22 984. 5
2019. 4, 7, 21 [91] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao,
[68] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast M. Zhou, and H.-W. Hon, “Unified language model pre-training for
and memory-efficient exact attention with io-awareness,” Advances in natural language understanding and generation,” Advances in neural
Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, information processing systems, vol. 32, 2019. 5
2022. 4 [92] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shus-
[69] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward ter, T. Wang, Q. Liu, P. S. Koura et al., “Opt-iml: Scaling language
networks are universal approximators,” Neural networks, vol. 2, no. 5, model instruction meta learning through the lens of generalization,”
pp. 359–366, 1989. 4 arXiv preprint arXiv:2212.12017, 2022. 6, 7, 12, 13, 15, 20, 23, 25
[70] V. Nair and G. E. Hinton, “Rectified linear units improve restricted [93] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang,
boltzmann machines,” in Proceedings of the 27th international confer- and C. Gan, “Principle-driven self-alignment of language mod-
ence on machine learning (ICML-10), 2010, pp. 807–814. 4 els from scratch with minimal human supervision,” arXiv preprint
[71] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv:2305.03047, 2023. 6, 14
arXiv preprint arXiv:1606.08415, 2016. 4 [94] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan,
[72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut- A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general
dinov, “Dropout: a simple way to prevent neural networks from language assistant as a laboratory for alignment,” arXiv preprint
overfitting,” The journal of machine learning research, vol. 15, no. 1, arXiv:2112.00861, 2021. 6
pp. 1929–1958, 2014. 4 [95] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford,
[73] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models
A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regulariz- from human preferences,” arXiv preprint arXiv:1909.08593, 2019. 6
ing rnns by randomly preserving hidden activations,” arXiv preprint
[96] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, and M. Seo,
arXiv:1606.01305, 2016. 4
“The cot collection: Improving zero-shot and few-shot learning of
[74] N. Shazeer, “Glu variants improve transformer,” arXiv preprint language models via chain-of-thought fine-tuning,” arXiv preprint
arXiv:2002.05202, 2020. 4 arXiv:2305.14045, 2023. 7, 13
[75] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling
[97] Q. Liu, F. Zhou, Z. Jiang, L. Dou, and M. Lin, “From zero to hero:
with gated convolutional networks,” in International conference on
Examining the power of symbolic tasks in instruction tuning,” arXiv
machine learning. PMLR, 2017, pp. 933–941. 4
preprint arXiv:2304.07995, 2023. 7, 13
[76] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv
[98] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V.
preprint arXiv:1607.06450, 2016. 4
Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in
[77] B. Zhang and R. Sennrich, “Root mean square layer normalization,”
large language models,” Advances in Neural Information Processing
Advances in Neural Information Processing Systems, vol. 32, 2019. 4
Systems, vol. 35, pp. 24 824–24 837, 2022. 7, 17, 20
[78] A. Baevski and M. Auli, “Adaptive input representations for neural
[99] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd-
language modeling,” arXiv preprint arXiv:1809.10853, 2018. 4
hery, and D. Zhou, “Self-consistency improves chain of thought rea-
[79] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei,
soning in language models,” arXiv preprint arXiv:2203.11171, 2022.
“Deepnet: Scaling transformers to 1,000 layers,” arXiv preprint
7, 17
arXiv:2203.00555, 2022. 4
[80] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- [100] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and
zaro, “Megatron-lm: Training multi-billion parameter language models K. Narasimhan, “Tree of thoughts: Deliberate problem solving with
using model parallelism,” arXiv preprint arXiv:1909.08053, 2019. 4, large language models,” arXiv preprint arXiv:2305.10601, 2023. 7, 17
5 [101] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe,
[81] “"bmtrain: Efficient training for big models.".” [Online]. Available: A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
https://github.com/OpenBMB/BMTrain 4, 5 learning for nlp,” in International Conference on Machine Learning.
[82] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, PMLR, 2019, pp. 2790–2799. 7, 18
P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Transformers: [102] S. McCandlish, J. Kaplan, D. Amodei, and O. D. Team, “An empirical
State-of-the-art natural language processing,” in Proceedings of the model of large-batch training,” arXiv preprint arXiv:1812.06162, 2018.
2020 conference on empirical methods in natural language processing: 7
system demonstrations, 2020, pp. 38–45. 5 [103] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang,
[83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, K. Wang, X. Zhang et al., “Pangu-α : Large-scale autoregressive
D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman- pretrained chinese language models with auto-parallel computation,”
Milne et al., “Jax: composable transformations of python+ numpy arXiv preprint arXiv:2104.12369, 2021. 7, 8, 20, 21, 23
programs,” 2018. 5 [104] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang,
[84] S. Li, J. Fang, Z. Bian, H. Liu, Y. Liu, H. Huang, B. Wang, and and J. Tang, “Wudaocorpora: A super large-scale chinese corpora for
Y. You, “Colossal-ai: A unified deep learning system for large-scale pre-training language models,” AI Open, vol. 2, pp. 65–68, 2021. 8,
parallel training,” arXiv preprint arXiv:2110.14883, 2021. 5 27
[85] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “Fastmoe: A fast [105] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen,
mixture-of-expert training system,” arXiv preprint arXiv:2103.13262, Y. Zhao, Y. Lu et al., “Ernie 3.0: Large-scale knowledge enhanced
2021. 5 pre-training for language understanding and generation,” arXiv preprint
[86] L. Huawei Technologies Co., “Huawei mindspore ai development arXiv:2107.02137, 2021. 8, 23
framework,” in Artificial Intelligence Technology. Springer, 2022, pp. [106] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov,
137–162. 5 “Transformer-xl: Attentive language models beyond a fixed-length
[87] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, context,” arXiv preprint arXiv:1901.02860, 2019. 8
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An [107] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical
imperative style, high-performance deep learning library,” Advances details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021. 8, 21,
in neural information processing systems, vol. 32, 2019. 5 23
[108] Y. Levine, N. Wies, O. Sharir, H. Bata, and A. Shashua, “Limits to [128] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang,PW. Wang, P. Li,
depth efficiencies of self-attention,” Advances in Neural Information X. Zhang, A. Podolskiy, G. Arshinov et al., “Pangu- : Towards trillion
Processing Systems, vol. 33, pp. 22 640–22 651, 2020. 8, 10 parameter language model with sparse heterogeneous computing,”
[109] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, arXiv preprint arXiv:2303.10845, 2023. 10, 12, 21, 23
S. Kim, S. Kim, D. Seo et al., “What changes can large-scale language [129] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou,
models bring? intensive study on hyperclova: Billions-scale korean S. Savarese, and C. Xiong, “Codegen: An open large language
generative pretrained transformers,” arXiv preprint arXiv:2109.04650, model for code with multi-turn program synthesis,” arXiv preprint
2021. 8, 23 arXiv:2203.13474, 2022. 10, 20, 23, 25
[110] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, [130] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
J. Luo, L. Xu et al., “Yuan 1.0: Large-scale pre-trained language model H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
in zero-shot and few-shot learning,” arXiv preprint arXiv:2110.04725, language models trained on code,” arXiv preprint arXiv:2107.03374,
2021. 8, 21, 23 2021. 10, 23, 26, 28
[111] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, [131] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,
J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-
models: Methods, analysis & insights from training gopher,” arXiv level code generation with alphacode,” Science, vol. 378, no. 6624, pp.
preprint arXiv:2112.11446, 2021. 8, 9, 23, 25 1092–1097, 2022. 10, 21, 23, 26
[112] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, [132] N. Shazeer, “Fast transformer decoding: One write-head is all you
J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Us- need,” arXiv preprint arXiv:1911.02150, 2019. 10
ing deepspeed and megatron to train megatron-turing nlg 530b, a large- [133] R. Y. Pang and H. He, “Text generation by learning from demonstra-
scale generative language model,” arXiv preprint arXiv:2201.11990, tions,” arXiv preprint arXiv:2009.07839, 2020. 10
2022. 8, 9, 21, 23
[134] R. Dabre and A. Fujita, “Softmax tempering for training neural
[113] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Gold-
machine translation models,” arXiv preprint arXiv:2009.09372, 2020.
ing, H. He, C. Leahy, K. McDonell, J. Phang et al., “Gpt-neox-
10
20b: An open-source autoregressive language model,” arXiv preprint
arXiv:2204.06745, 2022. 8, 20, 21, 23 [135] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware
[114] W. Ben and K. Aran, “Gpt-j-6b: A 6 billion parameter autoregressive unified pre-trained encoder-decoder models for code understanding and
language model,” 2021. 8 generation,” arXiv preprint arXiv:2109.00859, 2021. 10
[115] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, [136] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source
precision training,” arXiv preprint arXiv:1710.03740, 2017. 8, 21 be with you!” arXiv preprint arXiv:2305.06161, 2023. 10, 23
[116] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, [137] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Sar-
Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of avia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large
language models with mixture-of-experts,” in International Conference language model for science,” arXiv preprint arXiv:2211.09085, 2022.
on Machine Learning. PMLR, 2022, pp. 5547–5569. 9, 21, 23 10, 21, 23, 26
[117] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, [138] FairScale authors, “Fairscale: A general purpose modular pytorch
and J. Dean, “Outrageously large neural networks: The sparsely-gated library for high performance and large scale training,” https://github.
mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017. 9, com/facebookresearch/fairscale, 2021. 10
21 [139] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T.
[118] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models
to trillion parameter models with simple and efficient sparsity,” The for dialog applications,” arXiv preprint arXiv:2201.08239, 2022. 10,
Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, 23
2022. 9 [140] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann,
[119] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large
E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
“Training compute-optimal large language models,” arXiv preprint 10, 23, 30
arXiv:2203.15556, 2022. 9, 23, 26 [141] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A large chinese
[120] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, financial chat model with hundreds of billions parameters,” arXiv
H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al., preprint arXiv:2305.12002, 2023. 10, 15, 23
“Alexatm 20b: Few-shot learning using a large-scale multilingual [142] W. Ben, “Mesh-transformer-jax: Model-parallel implementation of
seq2seq model,” arXiv preprint arXiv:2208.01448, 2022. 9, 20, 21, transformer language model with jax,” 2021. 11, 21
22, 23 [143] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman,
[121] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf
S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical et al., “Crosslingual generalization through multitask finetuning,” arXiv
report,” arXiv preprint arXiv:2305.10403, 2023. 9, 23 preprint arXiv:2211.01786, 2022. 13, 23, 25, 28
[122] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Gar- [144] D. Yin, X. Liu, F. Yin, M. Zhong, H. Bansal, J. Han, and K.-W. Chang,
cia, H. S. Zheng, J. Rao, A. Chowdhery et al., “Transcending scaling “Dynosaur: A dynamic growth paradigm for instruction-tuning data
laws with 0.1% extra compute,” arXiv preprint arXiv:2210.11399, curation,” arXiv preprint arXiv:2305.14327, 2023. 14
2022. 9, 21, 23
[145] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu,
[123] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W.
C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient visual
Chung, D. Bahri, T. Schuster, S. Zheng et al., “Ul2: Unifying language
instruction model,” arXiv preprint arXiv:2304.15010, 2023. 14, 22
learning paradigms,” in The Eleventh International Conference on
Learning Representations, 2022. 9, 21, 22, 23 [146] “Openai. gpt-4 technical report,” 2023. 14, 32
[124] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, [147] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang,
“Glm: General language model pretraining with autoregressive blank and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama
infilling,” in Proceedings of the 60th Annual Meeting of the Association model,” https://github.com/tatsu-lab/stanford_alpaca, 2023. 14, 23, 25
for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320– [148] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang,
335. 9 L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and
[125] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: with 90%* chatgpt quality,” March 2023. [Online]. Available:
Open and efficient foundation language models,” arXiv preprint https://lmsys.org/blog/2023-03-30-vicuna/ 14, 20, 23, 25
arXiv:2302.13971, 2023. 9, 20, 23 [149] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with
[126] M. N. Rabe and C. Staats, “Self-attention does not need o(n2 ) memory,” gpt-4,” arXiv preprint arXiv:2304.03277, 2023. 14, 25
arXiv preprint arXiv:2112.05682, 2021. 9 [150] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama outperforms gpt-4
[127] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, on arithmetic tasks,” arXiv preprint arXiv:2305.14201, 2023. 14
M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation [151] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and T. Liu, “Huatuo:
in large transformer models,” Proceedings of Machine Learning and Tuning llama model with chinese medical knowledge,” arXiv preprint
Systems, vol. 5, 2023. 10 arXiv:2304.06975, 2023. 14
[152] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and [175] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and S. Wang, “Lm-infinite:
D. Jiang, “Wizardlm: Empowering large language models to follow Simple on-the-fly length generalization for large language models,”
complex instructions,” arXiv preprint arXiv:2304.12244, 2023. 14 arXiv preprint arXiv:2308.16137, 2023. 15
[153] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, [176] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y. Zemlyan-
and D. Jiang, “Wizardcoder: Empowering code large language models skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y. Tay et al., “Colt5: Faster
with evol-instruct,” arXiv preprint arXiv:2306.08568, 2023. 14, 23 long-range transformers with conditional computation,” arXiv preprint
[154] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick, arXiv:2303.09752, 2023. 15
M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving et al., [177] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, and
“Teaching language models to support answers with verified quotes,” F. Wei, “Longnet: Scaling transformers to 1,000,000,000 tokens,” arXiv
arXiv preprint arXiv:2203.11147, 2022. 14 preprint arXiv:2307.02486, 2023. 15
[155] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, [178] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, “Longlora:
C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser- Efficient fine-tuning of long-context large language models,” arXiv
assisted question-answering with human feedback,” arXiv preprint preprint arXiv:2309.12307, 2023. 15
arXiv:2112.09332, 2021. 14, 16, 17, 23, 28 [179] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend,
[156] A. Glaese, N. McAleese, M. Tr˛ebacz, J. Aslanides, V. Firoiu, T. Ewalds, E. Karpas, A. Shashua, K. Leyton-Brown, and Y. Shoham, “Parallel
M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving context windows for large language models,” in Proceedings of the
alignment of dialogue agents via targeted human judgements,” arXiv 61st Annual Meeting of the Association for Computational Linguistics
preprint arXiv:2209.14375, 2022. 14, 17, 23 (Volume 1: Long Papers), 2023, pp. 6383–6402. 15
[157] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and [180] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei,
C. Finn, “Direct preference optimization: Your language model is “Augmenting language models with long-term memory,” arXiv preprint
secretly a reward model,” arXiv preprint arXiv:2305.18290, 2023. 14 arXiv:2306.07174, 2023. 15
[158] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, and [181] X. Xu, Z. Gou, W. Wu, Z.-Y. Niu, H. Wu, H. Wang, and S. Wang,
T. Zhang, “Raft: Reward ranked finetuning for generative foundation “Long time no see! open-domain conversation with long-term persona
model alignment,” arXiv preprint arXiv:2304.06767, 2023. 14 memory,” arXiv preprint arXiv:2203.05797, 2022. 15
[159] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: [182] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli-
Rank responses to align language models with human feedback without can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark et al.,
tears,” arXiv preprint arXiv:2304.05302, 2023. 14 “Improving language models by retrieving from trillions of tokens,”
[160] F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang, in International conference on machine learning. PMLR, 2022, pp.
“Preference ranking optimization for human alignment,” arXiv preprint 2206–2240. 15, 16, 31
arXiv:2306.17492, 2023. 14 [183] W. Zhong, L. Guo, Q. Gao, and Y. Wang, “Memorybank: Enhanc-
[161] H. Liu, C. Sferrazza, and P. Abbeel, “Languages are rewards: Hindsight ing large language models with long-term memory,” arXiv preprint
finetuning using human feedback,” arXiv preprint arXiv:2302.02676, arXiv:2305.10250, 2023. 15
2023. 14 [184] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and
[162] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, S. Yao, “Reflexion: Language agents with verbal reinforcement learn-
A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon et al., “Constitutional ing,” arXiv preprint arXiv:2303.11366, vol. 14, 2023. 15, 17
ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, [185] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, and H. Zhao, “Chatdb:
2022. 14 Augmenting llms with databases as their symbolic memory,” arXiv
[163] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, preprint arXiv:2306.03901, 2023. 15
P. Liang, and T. B. Hashimoto, “Alpacafarm: A simulation frame- [186] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang,
work for methods that learn from human feedback,” arXiv preprint J. Callan, and G. Neubig, “Active retrieval augmented generation,”
arXiv:2305.14387, 2023. 14 arXiv preprint arXiv:2305.06983, 2023. 15, 16
[164] C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, [187] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-
and L. Wang, “Prompting gpt-3 to be reliable,” arXiv preprint Brown, and Y. Shoham, “In-context retrieval-augmented language
arXiv:2210.09150, 2022. 14 models,” arXiv preprint arXiv:2302.00083, 2023. 15, 16, 31
[165] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukošiūtė, A. Chen, [188] X. Li and X. Qiu, “Mot: Pre-thinking and recalling enable
A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez et al., “The capacity chatgpt to self-improve with memory-of-thoughts,” arXiv preprint
for moral self-correction in large language models,” arXiv preprint arXiv:2305.05181, 2023. 15
arXiv:2302.07459, 2023. 14 [189] D. Schuurmans, “Memory augmented large language models are com-
[166] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm putationally universal,” arXiv preprint arXiv:2301.04589, 2023. 15
safety training fail?” arXiv preprint arXiv:2307.02483, 2023. 14 [190] A. Modarressi, A. Imani, M. Fayyaz, and H. Schütze, “Ret-llm:
[167] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, Towards a general read-write memory for large language models,”
B. Mann, E. Perez, N. Schiefer, K. Ndousse et al., “Red teaming arXiv preprint arXiv:2305.14322, 2023. 15
language models to reduce harms: Methods, scaling behaviors, and [191] S. Robertson, H. Zaragoza et al., “The probabilistic relevance frame-
lessons learned,” arXiv preprint arXiv:2209.07858, 2022. 14, 25 work: Bm25 and beyond,” Foundations and Trends® in Information
[168] S. Casper, J. Lin, J. Kwon, G. Culp, and D. Hadfield-Menell, “Explore, Retrieval, vol. 3, no. 4, pp. 333–389, 2009. 16
establish, exploit: Red teaming language models from scratch,” arXiv [192] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou,
preprint arXiv:2306.09442, 2023. 14 “Rationale-augmented ensembles in language models,” arXiv preprint
[169] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, arXiv:2207.00747, 2022. 16
N. McAleese, and G. Irving, “Red teaming language models with [193] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou,
language models,” arXiv preprint arXiv:2202.03286, 2022. 14 and W. Chen, “Repocoder: Repository-level code completion through
[170] T. Scialom, T. Chakrabarty, and S. Muresan, “Fine-tuned language iterative retrieval and generation,” arXiv preprint arXiv:2303.12570,
models are continual learners,” in Proceedings of the 2022 Conference 2023. 16
on Empirical Methods in Natural Language Processing, 2022, pp. [194] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y. Dong,
6107–6122. 15 O. Kuchaiev, B. Li, C. Xiao et al., “Shall we pretrain autoregressive
[171] Z. Shi and A. Lipani, “Don’t stop pretraining? make prompt-based language models with retrieval? a comprehensive study,” arXiv preprint
fine-tuning powerful learner,” arXiv preprint arXiv:2305.01711, 2023. arXiv:2304.06762, 2023. 16
15 [195] L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context
[172] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, examples for large language models,” arXiv preprint arXiv:2307.07164,
S. Mashetty, and C. Baral, “Instruction tuned models are quick learn- 2023. 16
ers,” arXiv preprint arXiv:2306.05539, 2023. 15 [196] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen,
[173] H. Chen, Y. Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y. Yanggong, “What makes good in-context examples for gpt-3?” arXiv preprint
and J. Zhao, “Maybe only 0.5% data is needed: A preliminary arXiv:2101.06804, 2021. 16
exploration of low training data instruction tuning,” arXiv preprint [197] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for
arXiv:2305.09246, 2023. 15 in-context learning,” arXiv preprint arXiv:2112.08633, 2021. 16
[174] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, [198] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettle-
P. Yu, L. Yu et al., “Lima: Less is more for alignment,” arXiv preprint moyer, and W.-t. Yih, “Replug: Retrieval-augmented black-box lan-
arXiv:2305.11206, 2023. 15, 23, 25 guage models,” arXiv preprint arXiv:2301.12652, 2023. 16
[199] O. Rubin and J. Berant, “Long-range language modeling with self- [223] W. Yao, S. Heinecke, J. C. Niebles, Z. Liu, Y. Feng, L. Xue, R. Murthy,
retrieval,” arXiv preprint arXiv:2306.13421, 2023. 16 Z. Chen, J. Zhang, D. Arpit et al., “Retroformer: Retrospective large
[200] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval language agents with policy gradient optimization,” arXiv preprint
augmented language model pre-training,” in International conference arXiv:2308.02151, 2023. 17, 30
on machine learning. PMLR, 2020, pp. 3929–3938. 16 [224] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng,
[201] S. Hofstätter, J. Chen, K. Raman, and H. Zamani, “Fid-light: Efficient J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson,
and effective retrieval-augmented text generation,” in Proceedings N. Brown, L. Luu, S. Levine, K. Hausman, and brian ichter, “Inner
of the 46th International ACM SIGIR Conference on Research and monologue: Embodied reasoning through planning with language
Development in Information Retrieval, 2023, pp. 1437–1447. 16 models,” in 6th Annual Conference on Robot Learning, 2022.
[202] M. Komeili, K. Shuster, and J. Weston, “Internet-augmented dialogue [Online]. Available: https://openreview.net/forum?id=3R3Pz5i0tye 17
generation,” arXiv preprint arXiv:2107.07566, 2021. 16 [225] C. Jin, W. Tan, J. Yang, B. Liu, R. Song, L. Wang, and J. Fu,
[203] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev, “Alphablock: Embodied finetuning for vision-language reasoning in
“Internet-augmented language models through few-shot prompting for robot manipulation,” arXiv preprint arXiv:2305.18898, 2023. 17, 18,
open-domain question answering,” arXiv preprint arXiv:2203.05115, 30
2022. 16 [226] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay,
[204] D. Gao, L. Ji, L. Zhou, K. Q. Lin, J. Chen, Z. Fan, and M. Z. Shou, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated
“Assistgpt: A general multi-modal assistant that can plan, execute, robot task plans using large language models,” in 2023 IEEE Interna-
inspect, and learn,” arXiv preprint arXiv:2306.08640, 2023. 16, 17 tional Conference on Robotics and Automation (ICRA). IEEE, 2023,
[205] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. pp. 11 523–11 530. 17, 30
Zhu, and J. Gao, “Chameleon: Plug-and-play compositional reasoning [227] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-
with large language models,” arXiv preprint arXiv:2304.09842, 2023. T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik et al., “Language to
16, 17, 20 rewards for robotic skill synthesis,” arXiv preprint arXiv:2306.08647,
[206] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and 2023. 17
M. T. Ribeiro, “Art: Automatic multi-step reasoning and tool-use for [228] X. Tang, A. Zou, Z. Zhang, Y. Zhao, X. Zhang, A. Cohan, and
large language models,” arXiv preprint arXiv:2303.09014, 2023. 16 M. Gerstein, “Medagents: Large language models as collaborators for
[207] C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, zero-shot medical reasoning,” arXiv preprint arXiv:2311.10537, 2023.
R. Krishna, and T. Pfister, “Tool documentation enables zero-shot tool- 17
usage with large language models,” arXiv preprint arXiv:2308.00675, [229] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho,
2023. 16 J. Ibarz, A. Irpan, E. Jang, R. Julian et al., “Do as i can, not as i say:
[208] Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, and S. Li, “Rest- Grounding language in robotic affordances,” in Conference on Robot
gpt: Connecting large language models with real-world applications via Learning. PMLR, 2023, pp. 287–318. 18, 30
restful apis,” arXiv preprint arXiv:2306.06624, 2023. 16 [230] H. Ha, P. Florence, and S. Song, “Scaling up and distilling
[209] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augmenting frozen down: Language-guided robot skill acquisition,” arXiv preprint
language models with massive tools via tool embeddings,” arXiv arXiv:2307.14535, 2023. 18
preprint arXiv:2305.11554, 2023. 16 [231] A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez,
[210] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: “Saynav: Grounding large language models for dynamic planning to
Large language model connected with massive apis,” arXiv preprint navigation in new environments,” arXiv preprint arXiv:2309.04077,
arXiv:2305.15334, 2023. 16 2023. 18