A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I

Download as pdf or txt
Download as pdf or txt
You are on page 1of 46

A Comprehensive Overview of Large Language Models

Humza Naveeda , Asad Ullah Khana,∗, Shi Qiub,∗, Muhammad Saqibc,d,∗, Saeed Anware,f , Muhammad Usmane,f , Naveed Akhtarg,i ,
Nick Barnesh , Ajmal Miani
a University of Engineering and Technology (UET), Lahore, Pakistan
b The Chinese University of Hong Kong (CUHK), HKSAR, China
c University of Technology Sydney (UTS), Sydney, Australia
d Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
e King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
f SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia
g The University of Melbourne (UoM), Melbourne, Australia
h Australian National University (ANU), Canberra, Australia
i The University of Western Australia (UWA), Perth, Australia
arXiv:2307.06435v9 [cs.CL] 9 Apr 2024

Abstract
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and
beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse
topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs,
robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in
LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering
the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise
yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature
on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background
concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only
provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from
extensive informative summaries of the existing works to advance the LLM research.
Keywords:
Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking

1. Introduction

Language plays a fundamental role in facilitating commu-


nication and self-expression for humans, and their interaction
with machines. The need for generalized models stems from
the growing demand for machines to handle complex language
tasks, including translation, summarization, information re-
trieval, conversational interactions, etc. Recently, significant
breakthroughs have been witnessed in language models, pri-
marily attributed to transformers [1], increased computational
capabilities, and the availability of large-scale training data.
These developments have brought about a revolutionary trans-
formation by enabling the creation of LLMs that can approxi-
mate human-level performance on various tasks [2, 3]. Large

∗ Equalcontribution
Email addresses: humza_naveed@yahoo.com (Humza Naveed),
aukhanee@gmail.com (Asad Ullah Khan), shiqiu@cse.cuhk.edu.hk (Shi
Qiu), muhammad.saqib@data61.csiro.au (Muhammad Saqib),
saeed.anwar@kfupm.edu.sa (Saeed Anwar), Figure 1: The trend of papers released over years containing keywords "Large
muhammad.usman@kfupm.edu.sa (Muhammad Usman), Language Model", "Large Language Model + Fine-Tuning", and "Large Lan-
naveed.akhtar1@unimelb.edu.au (Naveed Akhtar), guage Model + Alignment".
nick.barnes@anu.edu.au (Nick Barnes), ajmal.mian@uwa.edu.au
(Ajmal Mian)

Preprint submitted to Elsevier April 11, 2024


Alpaca (Mar) LLaMA (Feb)
CodeGen (Mar)
HuaTuo (Apr) Xuan Yuan 2.0 (May)
GPT-NeoX-20B (Apr)
Vicuna MPT (Jun)
UL2 (May)
TK-Instruct (May) Koala (May) CodeT5+
GLM (Oct)
mT0 (Dec) Wizard-LM StarCoder
PanGu-α (Apr) OPT LLaMA 2 (Jul)
OPT-IML Wizard-Coder (Jun)
T5 (Oct) mT5 (Oct) T0 (Oct) CPM-2 (Jun) Galactica (Nov) Code Llama (Aug)
Goat

2019 2020 2021 2022 2023 2024

Codex (Jul) MT-NLG (Jan) PanGu-Σ (Mar)


GPT-3 (May) WebGPT (Dec) ERNIE 3.0 Sparrow (Sep) AlphaCode (Feb) Bard (Oct) BloombergGPT
Jurassic-1 (Aug) FLAN-U-PaLM (Oct) Chinchilla (Mar) GPT-4
HyperCLOVA (Sep) ChatGPT (Nov) PaLM (Apr) Claude
Yuan 1.0 (Oct) AlexaTM (Aug) PaLM2 (May)
Gopher (Dec) U-PALM (Oct) Gemini (Dec)
ERNIE 3.0 Titan BLOOM (Nov)
GLaM
LaMDA

Figure 2: Chronological display of LLM releases: blue cards represent ‘pre-trained’ models, while orange cards correspond to ‘instruction-tuned’ models. Models
on the upper half signify open-source availability, whereas those on the bottom half are closed-source. The chart illustrates the increasing trend towards instruction-
tuned models and open-source models, highlighting the evolving landscape and trends in natural language processing research.

Language Models (LLMs) have emerged as cutting-edge arti- adopted in diverse settings including, multi-modal, robotics,
ficial intelligence systems that can process and generate text tool manipulation, question answering, autonomous agents, etc.
with coherent communication [4], and generalize to multiple Various improvements have also been suggested in these areas
tasks [5, 6]. either by task-specific training [25, 26, 27, 28, 29, 30, 31] or
The historical progress in natural language processing (NLP) better prompting [32].
evolved from statistical to neural language modeling and then The LLMs abilities to solve diverse tasks with human-level
from pre-trained language models (PLMs) to LLMs. While performance come at a cost of slow training and inference,
conventional language modeling (LM) trains task-specific mod- extensive hardware requirements, and higher running costs.
els in supervised settings, PLMs are trained in a self-supervised Such requirements have limited their adoption and opened up
setting on a large corpus of text [7, 8, 9] with the aim of learning opportunities to devise better architectures [15, 33, 34, 35]
a generic representation that is shareable among various NLP and training strategies [36, 37, 21, 38, 39, 40, 41]. Param-
tasks. After fine-tuning for downstream tasks, PLMs surpass eter efficient tuning [38, 41, 40], pruning [42, 43], quantiza-
the performance gains of traditional language modeling (LM). tion [44, 45], knowledge distillation, and context length inter-
The larger PLMs bring more performance gains, which has led polation [46, 47, 48, 49] among others are some of the methods
to the transitioning of PLMs to LLMs by significantly increas- widely studied for efficient LLM utilization.
ing model parameters (tens to hundreds of billions) [10] and Due to the success of LLMs on a wide variety of tasks, the
training dataset (many GBs and TBs) [10, 11]. Following this research literature has recently experienced a large influx of
development, numerous LLMs have been proposed in the lit- LLM-related contributions. Researchers have organized the
erature [10, 11, 12, 6, 13, 14, 15]. An increasing trend in the LLMs literature in surveys [50, 51, 52, 53], and topic-specific
number of released LLMs and names of a few significant LLMs surveys in [54, 55, 56, 57, 58]. In contrast to these surveys, our
proposed over the years are shown in Fig 1 and Fig 2, respec- contribution focuses on providing a comprehensive yet concise
tively. overview of the general direction of LLM research. This arti-
The early work on LLMs, such as T5 [10] and mT5 [11] em- cle summarizes architectural and training details of pre-trained
ployed transfer learning until GPT-3 [6] showed LLMs are LLMs and delves deeper into the details of concepts like fine-
zero-shot transferable to downstream tasks without fine-tuning. tuning, multi-modal LLMs, augmented LLMs, datasets, eval-
LLMs accurately respond to task queries when prompted with uation, applications, challenges, and others to provide a self-
task descriptions and examples. However, pre-trained LLMs contained comprehensive overview. Our key contributions are
fail to follow user intent and perform worse in zero-shot set- summarized as follows.
tings than in few-shot. Fine-tuning them with task instruc-
tions data [16, 17, 18, 19] and aligning with human prefer- • We present a survey on the developments in LLM research
ences [20, 21] enhances generalization to unseen tasks, im- providing a concise comprehensive overview of the direc-
proving zero-shot performance significantly and reducing mis- tion.
aligned behavior. • We present extensive summaries of pre-trained models that
In addition to better generalization and domain adaptation, include fine-grained details of architecture and training de-
LLMs appear to have emergent abilities, such as reasoning, tails.
planning, decision-making, in-context learning, answering in • We summarize major findings of the popular contributions
zero-shot settings, etc. These abilities are known to be ac- and provide a detailed discussion on the key design and
quired by them due to their gigantic scale even when the pre- development aspects of LLMs to help practitioners effec-
trained LLMs are not trained specifically to possess these at- tively leverage this technology.
tributes [22, 23, 24]. Such abilities have led LLMs to be widely • In this self-contained article, we cover a range of con-
cepts to present the general direction of LLMs compre-
2
Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications
7. Challenges

hensively, including background, pre-training, fine-tuning, architectures, training pipelines and strategies, fine-tuning, and
multi-modal LLMs, augmented LLMs, LLMs-powered utilization in different domains. Section 4 highlights the config-
agents, datasets, evaluation, etc. uration and parameters that play a crucial role in the function-
ing of these models. Summary and discussions are presented
We loosely follow the existing terminology to ensure a stan- in section 3.8. The LLM training and evaluation, datasets, and
dardized outlook of this research direction. For instance, fol- benchmarks are discussed in section 5, followed by challenges
lowing [50], our survey discusses pre-trained LLMs with 10B and future directions, and conclusion in sections 7 and 8, re-
parameters or more. We refer the readers interested in smaller spectively.
pre-trained models to [51, 52, 53].
The organization of this paper is as follows. Section 2 discusses
the background of LLMs. Section 3 focuses on LLMs overview,
3
2. Background 2.4. Activation Functions
We provide the relevant background to understand the fun- The activation functions serve a crucial role in the curve-
damentals related to LLMs in this section. We briefly discuss fitting abilities of neural networks [69]. We discuss activation
necessary components in LLMs and refer the readers interested functions used in LLMs in this section.
in details to the original works. ReLU [70]: The Rectified linear unit (ReLU) is defined as:

2.1. Tokenization ReLU(x) = max(0, x) (1)


Tokenization [59] is an essential pre-processing step in
LLM training that parses the text into non-decomposing units GeLU [71]: The Gaussian Error Linear Unit (GeLU) is the
called tokens. Tokens can be characters, subwords [60], sym- combination of ReLU, dropout [72] and zoneout [73].
bols [61], or words, depending on the tokenization process. GLU variants [74]: The Gated Linear Unit [75] is a neural
Some of the commonly used tokenization schemes in LLMs network layer that is an element-wise product (⊗) of a linear
include wordpiece [62], byte pair encoding (BPE) [61], and un- transformation and a sigmoid transformed (σ) linear projection
igramLM [60]. Readers are encouraged to refer to [63] for a of the input given as:
detailed survey.
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c), (2)
2.2. Encoding Positions
The transformer processes input sequences in parallel and where X is the input of layer and l, W, b, V and c are learned
independently of each other. Moreover, the attention mod- parameters. Other GLU variants [74] used in LLMs are:
ule in the transformer does not capture positional information.
As a result, positional encodings were introduced in trans- ReGLU(x, W, V, b, c) = max(0, xW + b)⊗,
former [64], where a positional embedding vector is added to GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
the token embedding. Variants of positional embedding include S wiGLU(x, W, V, b, c, β) = S wishβ(xW + b) ⊗ (xV + c).
absolute, relative, or learned positional encodings. Within rel-
ative encoding, Alibi and RoPE are two widely used positional
2.5. Layer Normalization
embeddings in LLMs.
Alibi [65]: It subtracts a scalar bias from the attention score Layer normalization leads to faster convergence and is an in-
that increases with the distance between token positions. This tegrated component of transformers [64]. In addition to Layer-
favors using recent tokens for attention. Norm [76] and RMSNorm [77], LLMs use pre-layer normal-
RoPE [66]: It rotates query and key representations at an an- ization [78], applying it before multi-head attention (MHA).
gle proportional to the token absolute position in the input Pre-norm is shown to provide training stability in LLMs. An-
sequence, resulting in a relative positional encoding scheme other normalization variant, DeepNorm [79] fixes the issue with
which decays with the distance between the tokens. larger gradients in pre-norm.

2.3. Attention in LLMs


2.6. Distributed LLM Training
Attention assigns weights to input tokens based on impor-
tance so that the model gives more emphasis to relevant tokens. This section describes distributed LLM training approaches
Attention in transformers [64] calculates query, key, and value briefly. More details are available in [13, 37, 80, 81].
mappings for input sequences, where the attention score is Data Parallelism: Data parallelism replicates the model on
obtained by multiplying the query and key, and later used to multiple devices where data in a batch gets divided across de-
weight values. We discuss different attention strategies used in vices. At the end of each training iteration weights are synchro-
LLMs below. nized across all devices.
Self-Attention [64]: Calculates attention using queries, keys, Tensor Parallelism: Tensor parallelism shards a tensor compu-
and values from the same block (encoder or decoder). tation across devices. It is also known as horizontal parallelism
Cross Attention: It is used in encoder-decoder architectures, or intra-layer model parallelism.
where encoder outputs are the queries, and key-value pairs Pipeline Parallelism: Pipeline parallelism shards model layers
come from the decoder. across different devices. This is also known as vertical paral-
Sparse Attention [67]: Self-attention has O(n2 ) time complex- lelism.
ity which becomes infeasible for large sequences. To speed Model Parallelism: A combination of tensor and pipeline par-
up the computation, sparse attention [67] iteratively calculates allelism is known as model parallelism.
attention in sliding windows for speed gains. 3D Parallelism: A combination of data, tensor, and model par-
Flash Attention [68]: Memory access is the major bottleneck allelism is known as 3D parallelism.
in calculating attention using GPUs. To speed up, flash Optimizer Parallelism: Optimizer parallelism also known as
attention employs input tiling to minimize the memory reads zero redundancy optimizer [37] implements optimizer state
and writes between the GPU high bandwidth memory (HBM) partitioning, gradient partitioning, and parameter partitioning
and the on-chip SRAM. across devices to reduce memory consumption while keeping
the communication costs as low as possible.
4
2.7. Libraries
Some commonly used libraries for LLMs training are:
Transformers [82]: The library provides access to various pre-
trained transformer models with APIs to train, fine-tune, infer,
and develop custom models.
DeepSpeed [36]: A library for scalable distributed training and
inference of deep learning models.
Megatron-LM [80]: It provides GPU-optimized techniques for
large-scale training of LLMs. Figure 4: An example of attention patterns in language models, image is taken
JAX [83]: A Python library for high-performance numerical from [93].
computing and scaleable machine learning. It can differenti-
ate native Python and NumPy functions and execute them on
GPUs.
Colossal-AI [84]: A collection of components to write dis-
tributed deep learning models.
BMTrain [81]: A library to write efficient stand-alone LLMs
training code.
FastMoE [85]: Provides API to build mixture-of-experts
(MoE) model in PyTorch. Figure 5: An example of language model training objectives, image from [93].
MindSpore [86]: A deep learning training and inference frame-
work extendable to mobile, edge, and cloud computing.
the attention and the connection of transformer blocks. An il-
PyTorch [87]: A framework developed by Facebook AI Re-
lustration of attention patterns of these architectures is shown
search lab (FAIR) to build deep learning models. The main
in Figure 4.
features of PyTorch include a dynamic computation graph and
Encoder Decoder: This architecture processes inputs through
a pythonic coding style.
the encoder and passes the intermediate representation to the
Tensorflow [88]: A deep learning framework written by
decoder to generate the output. Here, the encoder sees the
Google. The key features of TensorFlow are graph-based com-
complete sequence utilizing self-attention whereas the decoder
putation, eager execution, scalability, etc.
processes the sequence one after the other with implementing
MXNet [89]: Apache MXNet is a deep learning framework
cross-attention.
with support to write programs in multiple languages, includ-
Causal Decoder: A type of architecture that does not have an
ing, Python, C++, Scala, R, etc. It also provides support for
encoder and processes and generates output using a decoder,
dynamic and static computation graphs.
where the predicted token depends only on the previous time
2.8. Data PreProcessing steps.
Prefix Decoder: It is also known as a non-causal decoder,
This section briefly summarizes data preprocessing tech-
where the attention calculation is not strictly dependent on the
niques used in LLMs training.
past information and the attention is bidirectional. An example
Quality Filtering: For better results, training data quality is
of a non-causal attention mask is shown in Figure 4.
essential. Some approaches to filtering data are: 1) classifier-
Mixture-of-Experts: It is a variant of transformer architecture
based and 2) heuristics-based. Classifier-based approaches
with parallel independent experts and a router to route tokens
train a classifier on high-quality data and predict the quality of
to experts. These experts are feed-forward layers after the at-
text for filtering, whereas heuristics-based employ some rules
tention block [90]. Mixture-of-Experts (MoE) is an efficient
for filtering like language, metrics, statistics, and keywords.
sparse architecture that offers comparable performance to dense
Data Deduplication: Duplicated data can affect model per-
models and allows increasing the model size without increas-
formance and increase data memorization; therefore, to train
ing the computational cost by activating only a few experts at a
LLMs, data deduplication is one of the preprocessing steps.
time [91, 92].
This can be performed at multiple levels, like sentences,
documents, and datasets.
Privacy Reduction: Most of the training data for LLMs is 2.10. Pre-Training Objectives
collected through web sources. This data contains private This section describes LLMs pre-training objectives. For
information; therefore, many LLMs employ heuristics-based more details see the paper [93].
methods to filter information such as names, addresses, and Full Language Modeling: An autoregressive language model-
phone numbers to avoid learning personal information. ing objective where the model is asked to predict future tokens
given the previous tokens, an example is shown in Figure 5.
Prefix Language Modeling: A non-causal training objective,
2.9. Architectures where a prefix is chosen randomly and only remaining target
Here we discuss the variants of the transformer architectures tokens are used to calculate the loss. An example is shown in
used in LLMs. The difference arises due to the application of Figure 5.
5
Figure 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at
different training stages like pre-training, instruction-tuning, or alignment tuning. “RL” stands for reinforcement learning, “RM” represents reward-modeling, and
“RLHF” represents reinforcement learning with human feedback.

Masked Language Modeling: In this training objective, tokens 2.12. LLMs Adaptation Stages
or spans (a sequence of tokens) are masked randomly and the
This section discusses the fundamentals of LLMs adaptation
model is asked to predict masked tokens given the past and
stages, from pre-training to fine-tuning for downstream tasks
future context. An example is shown in Figure 5.
and utilization. An example of different training stages and in-
Unified Language Modeling: Unified language modeling [94]
ference in LLMs is shown in Figure 6. In this paper, we refer
is a combination of causal, non-causal, and masked language
to alignment-tuning as aligning with human preferences, while
training objectives. Here in masked language modeling, the
occasionally the literature uses the term alignment for different
attention is not bidirectional but unidirectional, attending either
purposes.
left-to-right or right-to-left context.
2.12.1. Pre-Training
In the very first stage, the model is trained in a self-
2.11. LLMs Scaling Laws supervised manner on a large corpus to predict the next to-
kens given the input. The design choices of LLMs vary from
Scaling laws study the optimal combination of model param- encoder-decoder to decoder-only architectures with different
eters, dataset size, and computational resources that predict the building blocks and loss functions in sections 2.5, 2.4, 2.10.
improvement in the model performance. It has been shown
that the loss scales according to the power-law with model size, 2.12.2. Fine-Tuning
dataset size, and compute resources [95]. This study suggests There are different styles to fine-tune an LLM. This section
larger models are more important than big data for better perfor- briefly discusses fine-tuning approaches.
mance. Another variant of scaling law [96] suggests the model Transfer Learning: The pre-trained LLMs perform well for
size and the number of training tokens should be scaled equally. various tasks [6, 15]. However, to improve the performance for
6
a downstream task, pre-trained models are fine-tuned with the whereas to improve LLMs further on reasoning tasks many
task-specific data [10, 11], known as transfer learning. methods [16, 97] train them on reasoning datasets. We discuss
Instruction-tuning: To enable a model to respond to user various prompting techniques for reasoning below.
queries effectively, the pre-trained model is fine-tuned on in- Chain-of-Thought (CoT): A special case of prompting where
struction formatted data i.e., instruction and an input-output demonstrations contain reasoning information aggregated with
pair. Instructions generally comprise multi-task data in plain inputs and outputs so that the model generates outcomes with
natural language, guiding the model to respond according to the step-by-step reasoning. More details on CoT prompts are avail-
prompt and the input. This type of fine-tuning improves zero- able in [55, 103, 101].
shot generalization and downstream task performance. Details Self-Consistency: Improves CoT performance by generat-
on formatting instruction data and its various styles are avail- ing multiple responses and selecting the most frequent an-
able in [16, 50, 97]. swer [104].
Alignment-tuning: LLMs are prone to generating false, biased, Tree-of-Thought (ToT): Explores multiple reasoning paths
and harmful text. To make them helpful, honest, and harmless, with possibilities to look ahead and backtrack for problem-
models are aligned using human feedback. Alignment involves solving [105].
asking LLMs to generate unexpected responses and then updat- Single-Turn Instructions: In this prompting setup, LLMs are
ing their parameters to avoid such responses [20, 21, 98]. queried only once with all the relevant information in the
It ensures LLMs operate according to human intentions and prompt. LLMs generate responses by understanding the con-
values. A model is defined to be an “aligned” model if the text either in a zero-shot or few-shot setting.
model fulfills three criteria of helpful, honest, and harmless or Multi-Turn Instructions: Solving a complex task requires mul-
“HHH” [99]. tiple interactions with LLMs, where feedback and responses
Researchers employ reinforcement learning with human feed- from the other tools are given as input to the LLM for the next
back (RLHF) [100] for model alignment. In RLHF, a fine-tuned rounds. This style of using LLMs in the loop is common in
model on demonstrations is further trained with reward model- autonomous agents.
ing (RM) and reinforcement learning (RL), shown in Figure 6.
Below we briefly discuss RM and RL pipelines in RLHF.
3. Large Language Models
Reward modeling: trains a model to rank generated responses
according to human preferences using a classification objec- This section reviews LLMs, briefly describing their architec-
tive. To train the classifier humans annotate LLMs generated tures, training objectives, pipelines, datasets, and fine-tuning
responses based on the HHH criteria. details.
Reinforcement learning: in combination with the reward model
is used for alignment in the next stage. The previously trained
3.1. Pre-Trained LLMs
reward model ranks LLM-generated responses into preferred
vs. non-preferred, which is used to align the model with proxi- Here, we provide summaries of various well-known pre-
mal policy optimization (PPO). This process repeats iteratively trained LLMs with significant discoveries, changing the course
until convergence. of research and development in NLP. These LLMs have consid-
erably improved the performance in NLU and NLG domains,
2.12.3. Prompting/Utilization and are widely fine-tuned for downstream tasks. Moreover, We
Prompting is a method to query trained LLMs for generating also identify key findings and insights of pre-trained LLMs in
responses, as illustrated in Figure 6. LLMs can be prompted in Table 1 and 2 that improve their performance.
various prompt setups, where they can be adapted to the instruc-
tions without fine-tuning and in other cases with fine-tuning on 3.1.1. General Purpose
data containing different prompt styles [16, 101, 102]. A good T5 [10]: An encoder-decoder model employing a unified text-
guide on prompt engineering is available at [32]. Below, we to-text training for all NLP problems is shown in Figure 7. T5
will discuss various widely used prompt setups. places layer normalization outside the residual path in a conven-
Zero-Shot Prompting: LLMs are zero-shot learners and ca- tional transformer model [64]. It uses masked language mod-
pable of answering queries never seen before. This style of eling as a pre-training objective where spans (consecutive to-
prompting requires LLMs to answer user questions without see- kens) are replaced with a single mask instead of separate masks
ing any examples in the prompt. for each token. This type of masking speeds up the training as
In-context Learning: Also known as few-shot learning, here, it produces shorter sequences. After pre-training, the model is
multiple input-output demonstration pairs are shown to the fine-tuned using adapter layers [106] for downstream tasks.
model to generate the desired response. This adaptation style GPT-3 [6]: The GPT-3 architecture is the same as the GPT-
is also called few-shot learning. A discussion on formatting in- 2 [5] but with dense and sparse attention in transformer layers
context learning (ICL) templates is available in [54, 50, 18, 16]. similar to the Sparse Transformer [67]. It shows that large mod-
Reasoning in LLMs: LLMs are zero-shot reasoners and can els can train on larger batch sizes with a lower learning rate to
be provoked to generate answers to logical problems, task decide the batch size during training, GPT-3 uses the gradient
planning, critical thinking, etc. with reasoning. Generating noise scale as in [107]. Overall, GPT-3 increases model param-
reasons is possible only by using different prompting styles, eters to 175B showing that the performance of large language
7
plete fine-tuning and prompt fine-tuning as in [40] where only
prompt-related parameters are updated by inserting prompts at
various positions, front, middle, and back. CPM-2 also pro-
poses the INFMOE, a memory-efficient framework with a strat-
egy to dynamically offload parameters to the CPU for inference
at a 100B scale. It overlaps data movement with inference com-
putation for lower inference time.
ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi-
Figure 7: Unified text-to-text training example, source image from [10]. task learning to build a modular architecture using Transformer-
XL [111] as the backbone. The universal representation mod-
ule is shared by all the tasks, which serve as the basic block
for task-specific representation modules, which are all trained
jointly for natural language understanding, natural language
generation, and knowledge extraction. This LLM is primar-
ily focused on the Chinese language. It claims to train on the
largest Chinese text corpora for LLM training, and achieved
state-of-the-art in 54 Chinese NLP tasks.
Jurassic-1 [112]: A pair of auto-regressive language mod-
els, including a 7B-parameter J1-Large model and a 178B-
parameter J1-Jumbo model. The training vocabulary of
Jurassic-1 comprise word pieces, complete words, and multi-
word expressions without any word boundaries, where possible
out-of-vocabulary instances are interpreted as Unicode bytes.
Compared to the GPT-3 counterparts, the Jurassic-1 models
Figure 8: The image is the article of [108], showing an example of PanGu-α apply a more balanced depth-to-width self-attention architec-
architecture.
ture [113] and an improved tokenizer for a faster prediction
based on broader resources, achieving a comparable perfor-
models improves with the scale and is competitive with the fine- mance in zero-shot learning tasks and a superior performance in
tuned models. few-shot learning tasks given the ability to feed more examples
mT5 [11]: A multilingual T5 model [10] trained on the mC4 as a prompt.
dataset with 101 languages. The dataset is extracted from the HyperCLOVA [114]: A Korean language model with GPT-3
public common crawl scrape. The model uses a larger vocab- architecture.
ulary size of 250,000 to cover multiple languages. To avoid Yuan 1.0 [115]: Trained on a Chinese corpus with 5TB of
over-fitting or under-fitting for a language, mT5 employs a data high-quality text collected from the Internet. A Massive Data
sampling procedure to select samples from all languages. The Filtering System (MDFS) built on Spark is developed to pro-
paper suggests using a small amount of pre-training datasets, cess the raw data via coarse and fine filtering techniques. To
including all languages when fine-tuning for a task using En- speed up the training of Yuan 1.0 to save energy expenses and
glish language data. This allows the model to generate correct carbon emissions, various factors that improve the performance
non-English outputs. of distributed training are incorporated in architecture and train-
PanGu-α [108]: An autoregressive model that has a query ing: like increasing the hidden state size improves pipeline and
layer at the end of standard transformer layers, example shown tensor parallelism performance, larger micro batches improve
in Figure 8, to predict the next token. Its structure is similar to pipeline parallelism performance, and larger global batch size
the transformer layer but with an additional embedding for the improve data parallelism performance. In practice, the Yuan 1.0
next position in the attention mechanism, given in Eq. 3. model performs well on text classification, Winograd Schema,
natural language inference, and reading comprehension tasks.
a = pn Whq Whk T HLT (3) Gopher [116]: The Gopher family of models ranges from
44M to 280B parameters in size to study the effect of scale
CPM-2 [12]: Cost-efficient Pre-trained language Models on the LLMs performance. The 280B model beats GPT-3 [6],
(CPM-2) pre-trains bilingual (English and Chinese) 11B and Jurrasic-1 [112], MT-NLG [117], and others on 81% of the
198B mixture-of-experts (MoE) models on the WuDaoCor- evaluated tasks.
pus [109] dataset. The tokenization process removes “_” white ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0
space tokens in the sentencepiece tokenizer. The models are by training a larger model with 26x the number of parameters
trained with knowledge inheritance, starting with only the Chi- of the latter. This bigger model outperformed other state-of-the-
nese language in the first stage and then adding English and art models in 68 NLP tasks. LLMs produce text with incorrect
Chinese data. This trained model gets duplicated multiple times facts. In order to have control of the generated text with fac-
to initialize the 198B MoE model. Moreover, to use the model tual consistency, ERNIE 3.0 Titan adds another task, Credible
for downstream tasks, CPM-2 experimented with both com- and Controllable Generations, to its multi-task learning setup.
8
It introduces additional self-supervised adversarial and control-
lable language modeling losses to the pre-training step, which
enables ERNIE 3.0 Titan to beat other LLMs in their manually
selected Factual QA task set evaluations.
GPT-NeoX-20B [118]: An auto-regressive model that largely
follows GPT-3 with a few deviations in architecture design,
trained on the Pile dataset without any data deduplication. GPT-
NeoX has parallel attention and feed-forward layers in a trans-
former block, given in Eq. 4, that increases throughput by 15%.
It uses rotary positional embedding [66], applying it to only
25% of embedding vector dimension as in [119]. This reduces
the computation without performance degradation. As opposed Figure 9: The BLOOM architecture example sourced from [13].
to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B
uses only dense layers. The hyperparameter tuning at this scale
is difficult; therefore, the model chooses hyperparameters from relationship that model size should be doubled for every dou-
the method [6] and interpolates values between 13B and 175B bling of training tokens. Over 400 language models ranging
models for the 20B model. The model training is distributed from 70 million to over 16 billion parameters on 5 to 500 bil-
among GPUs using both tensor and pipeline parallelism. lion tokens are trained to get the estimates for compute-optimal
training under a given budget. The authors train a 70B model
x + Attn(LN1 (x)) + FF(LN2 (x)) (4) with the same compute budget as Gopher (280B) but with 4
times more data. It outperforms Gopher [116], GPT-3 [6], and
OPT [14]: It is a clone of GPT-3, developed to open-source others on various downstream tasks, after fine-tuning.
a model that replicates GPT-3 performance. Training of OPT AlexaTM [122]: An encoder-decoder model, where encoder
employs dynamic loss scaling [120] and restarts from an earlier weights and decoder embeddings are initialized with a pre-
checkpoint with a lower learning rate whenever loss divergence trained encoder to speed up training. The encoder stays frozen
is observed. Overall, the performance of OPT-175B models is for the initial 100k steps and is later unfrozen for end-to-end
comparable to the GPT3-175B model. training. The model is trained on a combination of denoising
BLOOM [13]: A causal decoder model trained on the ROOTS and causal language modeling (CLM) objectives, concatenat-
corpus to open-source an LLM. The architecture of BLOOM is ing a [CLM] token at the beginning for mode switching. Dur-
shown in Figure 9, with differences like ALiBi positional em- ing training, the CLM task is applied for 20% of the time, which
bedding, an additional normalization layer after the embedding improves the in-context learning performance.
layer as suggested by the bitsandbytes1 library. These changes PaLM [15]: A causal decoder with parallel attention and
stabilize training with improved downstream performance. feed-forward layers similar to Eq. 4, speeding up training by
GLaM [91]: Generalist Language Model (GLaM) represents a a factor of 15. Additional changes to the conventional trans-
family of language models using a sparsely activated decoder- former model include SwiGLU activation, RoPE embeddings,
only mixture-of-experts (MoE) structure [121, 90]. To gain multi-query attention that saves computation cost during decod-
more model capacity while reducing computation, the experts ing, and shared input-output embeddings. During training, loss
are sparsely activated where only the best two experts are used spiking was observed, and to fix it, model training was restarted
to process each input token. The largest GLaM model, GLaM from a 100-step earlier checkpoint by skipping 200-500 batches
(64B/64E), is about 7× larger than GPT-3 [6], while only part of around the spike. Moreover, the model was found to memo-
the parameters are activated per input token. The largest GLaM rize around 2.4% of the training data at the 540B model scale,
(64B/64E) model achieves better overall results as compared whereas this number was lower for smaller models.
to GPT-3 while consuming only one-third of GPT-3’s training PaLM-2 [123]: A smaller multi-lingual variant of PaLM,
energy. trained for larger iterations on a better quality dataset. PaLM-
MT-NLG [117]: A 530B causal decoder based on the GPT- 2 shows significant improvements over PaLM, while reducing
2 architecture that has roughly 3× GPT-3 model parameters. training and inference costs due to its smaller size. To lessen
MT-NLG is trained on filtered high-quality data collected from toxicity and memorization, it appends special tokens with a
various public datasets and blends various types of datasets in a fraction of pre-training data, which shows a reduction in gener-
single batch, which beats GPT-3 on several evaluations. ating harmful responses.
Chinchilla [96]: A causal decoder trained on the same dataset U-PaLM [124]: This method trains PaLM for 0.1% addi-
as the Gopher [116] but with a little different data sampling tional compute with the UL2 (also named as UL2Restore) ob-
distribution (sampled from MassiveText). The model architec- jective [125], using the same dataset it outperforms the baseline
ture is similar to the one used for Gopher, with the exception of significantly on various NLP tasks, including zero-shot, few-
AdamW optimizer instead of Adam. Chinchilla identifies the shot, commonsense reasoning, CoT, etc. Training with UL2R
involves converting a causal decoder PaLM to a non-causal de-
coder PaLM and employing 50% sequential denoising, 25%
1 https://github.com/TimDettmers/bitsandbytes regular denoising, and 25% extreme denoising loss functions.
9
UL2 [125]: An encoder-decoder architecture trained using a Codex [131]: This LLM is trained on a subset of public Python
mixture of denoisers (MoD) objective. Denoisers include 1) Github repositories to generate code from docstrings. Com-
R-Denoiser: a regular span masking, 2) S-Denoiser: which cor- puter programming is an iterative process where the programs
rupts consecutive tokens of a large sequence and 3) X-Denoiser: are often debugged and updated before fulfilling the require-
which corrupts a large number of tokens randomly. During pre- ments. Similarly to this, Codex generates 100 versions of a
training, UL2 includes a denoiser token from R, S , X to rep- program by repetitive sampling for a given description, which
resent a denoising setup. It helps improve fine-tuning perfor- produces a working solution for 77.5% of the problems passing
mance for downstream tasks that bind the task to one of the up- unit tests. Its powerful version powers Github Copilot2 .
stream training modes. This MoD style of training outperforms AlphaCode [132]: A set of large language models, ranging
the T5 model on many benchmarks. from 300M to 41B parameters, designed for competition-level
GLM-130B [33]: GLM-130B is a bilingual (English and Chi- code generation tasks. It uses the multi-query attention [133] to
nese) model trained using an auto-regressive mask infilling pre- reduce memory and cache costs. Since competitive program-
training objective similar to the GLM [126]. This training style ming problems highly require deep reasoning and an under-
makes the model bidirectional as compared to GPT-3, which is standing of complex natural language algorithms, the Alpha-
unidirectional. As opposed to GLM, the training of GLM-130B Code models are pre-trained on filtered GitHub code in popular
includes a small amount of multi-task instruction pre-training languages and then fine-tuned on a new competitive program-
data (5% of the total data) along with self-supervised mask in- ming dataset named CodeContests. The CodeContests dataset
filling. To stabilize the training, it applies embedding layer gra- mainly contains problems, solutions, and test cases collected
dient shrink. from the Codeforces platform3 . The pre-training employs stan-
LLaMA [127, 21]: A set of decoder-only language models dard language modeling objectives, while GOLD [134] with
varying from 7B to 70B parameters. LLaMA models series is tempering [135] serves as the training objective for the fine-
the most famous among the community for parameter efficiency tuning on CodeContests data. To evaluate the performance of
and instruction tuning. AlphaCode, simulated programming competitions are hosted
LLaMA-1 [127]: Implements efficient causal attention [128] on the Codeforces platform: overall, AlphaCode ranks at the
by not storing and computing masked attention weights and top 54.3% among over 5000 competitors, where its Codeforces
key/query scores. Another optimization is reducing the number rating is within the top 28% of recently participated users.
of activations recomputed in the backward pass, as in [129]. CodeT5+ [34]: CodeT5+ is based on CodeT5 [136], with
LLaMA-2 [21]: This work is more focused on fine-tuning a shallow encoder and deep decoder, trained in multiple stages
safer and better LLaMA-2-Chat model for dialogue generation. initially unimodal data (code) and later bimodal data (text-code
The pre-trained model has 40% more training data with a larger pairs). Each training stage has different training objectives and
context length and grouped-query attention. activates different model blocks encoder, decoder, or both ac-
PanGu-Σ [92]: An autoregressive model with parameters cording to the task. The unimodal pre-training includes span
copied from PanGu-α and extended to a trillion scale with Ran- denoising and CLM objectives, whereas bimodal pre-training
dom Routed Experts (RRE), the architectural diagram is shown objectives contain contrastive learning, matching, and CLM for
in Figure 10. RRE is similar to the MoE architecture, with text-code pairs. CodeT5+ adds special tokens with the text to
distinctions at the second level, where tokens are randomly enable task modes, for example, [CLS ] for contrastive loss,
routed to experts in a domain instead of using a learnable gat- [Match] for text-code matching, etc.
ing method. The model has bottom layers densely activated StarCoder [137]: A decoder-only model with the SantaCoder
and shared across all domains, whereas top layers are sparsely architecture, employing Flash attention to scale up the context
activated according to the domain. This training style allows length to 8k. The StarCoder trains an encoder to filter names,
extracting task-specific models and reduces catastrophic forget- emails, and other personal data from the training data. Its fine-
ting effects in the case of continual learning. tuned variant outperforms PaLM, LLaMA, and LAMDA on
HumanEval and MBPP benchmarks.

3.1.2. Coding 3.1.3. Scientific Knowledge


CodeGen [130]: CodeGen has similar architecture to Galactica [138]: A large curated corpus of human scientific
PaLM [15], i.e., parallel attention, MLP layers, and RoPE em- knowledge with 48 million papers, textbooks, lecture notes,
beddings. The model is trained on both natural language and millions of compounds and proteins, scientific websites, en-
programming language data sequentially (trained on the first cyclopedias, and more are trained using the metaseq library3,
dataset, then the second and so on) on the following datasets which is built on PyTorch and fairscale [139]. The model wraps
1) PILE, 2) BIGQUERY and 3) BIGPYTHON. CodeGen pro- reasoning datasets with the < work > token to provide step-by-
posed a multi-step approach to synthesizing code. The purpose step reasoning context to the model, which has been shown to
is to simplify the generation of long sequences where the previ- improve the performance on reasoning tasks.
ous prompt and generated code are given as input with the next
prompt to generate the next code sequence. CodeGen open-
source a Multi-Turn Programming Benchmark (MTPB) to eval- 2 https://github.com/features/copilot

uate multi-step program synthesis. 3 https://codeforces.com/

10
Figure 11: An example image shows an instance of the Flan training paradigm,
taken from [16].

P
Figure 10: This example illustrates the PanGu- architecture, as depicted in with minimal compute increment, e.g., 0.2% of the total pre-
the image sourced from [92]. training for PaLM 540B [16].
We review various fine-tuned LLMs and strategies for effective
fine-tuning in this section.
3.1.4. Dialog
LaMDA [140]: A decoder-only model pre-trained on pub- 3.2.1. Instruction-Tuning with Manually Created Datasets
lic dialog data, public dialog utterances, and public web doc- Numerous hand-crafted instruction-tuning datasets with
uments, where more than 90% of the pre-training data is in different design choices are proposed in the literature to
English. LaMDA is trained with the objective of producing re- instruction-tune LLMs. The performance of fine-tuned LLMs
sponses that exhibit high levels of quality, safety, and grounded- depends on multiple factors, such as dataset, instruction diver-
ness. To achieve this, discriminative and generative fine-tuning sity, prompting templates, model size, and training objectives.
techniques are incorporated to enhance the model’s safety and Keeping this in view, diverse fine-tuned models have emerged
quality aspects. As a result, the LaMDA models can be utilized in the literature using manually created datasets.
as a general language model performing various tasks. The models T0 [17] and mT0 (multi-lingual) [144] employ
templates to convert existing datasets into prompt datasets.
3.1.5. Finance They have shown improvements in generalization to zero-shot
BloombergGPT [141]: A non-causal decoder model trained and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model
using both financial ("FINPILE" from the Bloomberg archive) with in-context instructions to study generalization on unseen
and general-purpose datasets. The model’s architecture is sim- tasks when given in-context instructions during test time. The
ilar to the BLOOM [13] and OPT [14]. It allocates 50B param- model outperformed Instruct-GPT, despite being smaller in
eters to different blocks of the model using the approach [113]. size, i.e., 11B parameters as compared to 175B of GPT-3.
For effective training, BloombergGPT packs documents to- Increasing Tasks and Prompt Setups: Zero-shot and few-shot
gether with < |endo f text| > to use the maximum sequence performance improves significantly by expanding task collec-
length, uses warmup batch size starting from 1024 to 2048, and tion and prompt styles. OPT-IML [97] and Flan [16] curated
manually reduces the learning rate multiple times during the larger 2k and 1.8k task datasets, respectively. While increasing
training. task size alone is not enough, OPT-IML and Flan add more
Xuan Yuan 2.0 [142]: A Chinese financial chat model with prompting setups in their datasets, zero-shot, few-shot, and
BLOOM’s [13] architecture trained on a combination of general CoT. In continuation, CoT Collection [101] fine-tunes Flan-T5
purpose, financial, general purpose instructions, and financial further on 1.88M CoT samples. Another method [102] uses
institutions datasets. Xuan Yuan 2.0 combined the pre-training symbolic tasks with tasks in T0, Flan, etc.
and fine-tuning stages to avoid catastrophic forgetting.

3.2.2. Instruction-Tuning with LLMs Generated Datasets


3.2. Fine-Tuned LLMs
Generating an instruction-tuning dataset requires carefully
Pre-trained LLMs have excellent generalization abilities to writing instructions and input-output pairs, which are often
unseen tasks. However, because they are generally trained with written by humans, smaller in size, and less diverse. To
the objective of next token prediction, LLMs have limited ca- overcome this, self-instruct [19] proposed an approach to
pacity to follow user intent and are prone to generate unethical, prompt available LLMs to generate instruction-tuning datasets.
toxic or inaccurate responses [20]. For their effective utiliza- Self-instruct outperformed models trained on manually created
tion, LLMs are fine-tuned to follow instructions [16, 17, 97] and dataset SUPER-NATURALINSTRUCTIONS (a dataset with
generate safe responses [20], which also results in increasing 1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks,
zero-shot, few-shot, and cross-task generalization [97, 16, 18], 1 instruction, and 1 sample per task and iteratively generates
11
Table 1: Noteworthy findings and insights of pre-trained Large Language Models.

Models Findings & Insights

• Encoder and decoder with shared parameters perform equivalently when parameters are not shared
T5 • Fine-tuning model layers (adapter layers) work better than the conventional way of training on only
classification layers

• Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta-
GPT-3 learners

• Large multi-lingual models perform equivalently to single language models on downstream tasks.
mT5 However, smaller multi-lingual models perform worse

PanGu-α • LLMs have good few shot capabilities

• Prompt fine-tuning requires updating very few parameters while achieving performance compara-
ble to full model fine-tuning
• Prompt fine-tuning takes more time to converge as compared to full model fine-tuning
CPM-2 • Inserting prompt tokens in-between sentences can allow the model to understand relations between
sentences and long sequences
• In an analysis, CPM-2 finds that prompts work as a provider (additional context) and aggregator
(aggregate information with the input text) for the model

• A modular LLM architecture with a universal representation module and task-specific representa-
tion module helps in the finetuning phase
ERNIE 3.0 • Optimizing the parameters of a task-specific representation network during the fine-tuning phase is
an efficient way to take advantage of the powerful pre-trained model

• The performance of LLM is highly related to the network size


• To improve runtime performance, more operations can be performed in parallel (width) rather than
sequential (depth)
Jurassic-1 • To efficiently represent and fit more text in the same context length, the model uses a larger vo-
cabulary to train a SentencePiece tokenizer without restricting it to word boundaries. This further
benefits in few-shot learning tasks

• By employing prompt-based tuning, the performances of models can be improved, often surpassing
HyperCLOVA
those of state-of-the-art models when the backward gradients of inputs are accessible

• The model architecture that excels in pre-training and fine-tuning cases may exhibit contrasting
Yuan 1.0
behavior in zero-shot and few-shot learning

Gopher • Relative encodings enable the model to evaluate for longer sequences than training.

• Additional self-supervised adversarial loss to distinguish between real and generated text improves
ERNIE 3.0 Titan the model performance as compared to ERNIE 3.0

• Parallel attention + FF layers speed-up training 15% with the same performance as with cascaded
layers
GPT-NeoX-20B • Initializing feed-forward output layers before residuals with scheme in [143] avoids activations
from growing with increasing depth and width
• Training on Pile outperforms GPT-3 on five-shot
Table Continued on Next Page

12
Models Findings & Insights

• Restart training from an earlier checkpoint with a lower learning rate if loss diverges
OPT • Model is prone to generate repetitive text and stuck in a loop

• Galactica’s performance has continued to improve across validation set, in-domain, and out-of-
domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing
research on LLMs
Galactica • A working memory token approach can achieve strong performance over existing methods on
mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream
tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%)

• The model capacity can be maintained at reduced computation by replacing the feed-forward layer
in each transformer layer with a mixture-of-experts (MoE)
• The model trained on filtered data shows consistently better performances on both NLG and NLU
tasks, where the effect of filtering is more significant on the former tasks
GLaM • Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for
the downstream tasks
• The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in
the MoE layer. Given a fixed budget of computation, more experts contribute to a better perfor-
mance

LaMDA • The model can be fine-tuned to learn to call different external information resources and tools

• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed
with a shallower encoder and a deeper decoder
• To achieve better performances, it is necessary to employ strategies such as massively scaling
AlphaCode upsampling, followed by the filtering and clustering of samples into a compact set
• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-
scale sampling is crucial
• Simplifying problem descriptions can effectively improve the model’s performance

• The model size and the number of training tokens should be scaled proportionately: for each dou-
Chinchilla bling of the model size, the number of training tokens should be doubled as well

• English-centric models produce better translations when translating to English as compared to non-
English
• Generalized models can have equivalent performance for language translation to specialized small
PaLM models
• Larger models have a higher percentage of training data memorization
• Performance has not yet saturated even at 540B scale, which means larger models are likely to
perform better

• Encoder-decoder architecture is more suitable to train LLMs given bidirectional attention to the
context than decoder-only
AlexaTM • Causal Language Modeling (CLM) task can be added to benefit the model with efficient in-context
learning
• Placing layer norm at the beginning of each transformer layer improves the training stability
Table Continued on Next Page

13
Models Findings & Insights

• Training with a mixture of denoisers outperforms PaLM when trained further for a few more FLOPs
U-PaLM • Training with a mixture of denoisers improves the infilling ability and open-ended text generation
diversity

• Mode switching training enables better performance on downstream tasks


UL2 • CoT prompting outperforms standard prompting for UL2

• Pre-training data with a small proportion of multi-task instruction data improves the overall model
GLM-130B performance

• Multi-step prompting for code synthesis leads to a better user intent understanding and code gen-
CodeGen eration

• A constant performance improvement is observed when scaling the model


LLaMA • Smaller models can achieve good performances with more training data and computing time

• Sparse models provide the benefits of large models at a lower computation cost
• Randomly Routed Experts reduces catastrophic forgetting effects which in turn is essential for
PanGu-Σ continual learning
• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment which is
cost-efficient while maintaining a performance similar to the original

• Pre-training with general-purpose and task-specific data improves task performance without hurt-
BloombergGPT ing other model capabilities

XuanYuan 2.0 • Combining pre-training and fine-tuning stages in single training avoids catastrophic forgetting

• Causal LM is crucial for a model’s generation capability in encoder-decoder architectures


CodeT5+ • Multiple training objectives like span corruption, Causal LM, matching, etc complement each other
for better performance

StarCoder • HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

• Model trained on unfiltered data is more toxic but may perform better on downstream tasks after
LLaMA-2 fine-tuning
• Model trained on unfiltered data requires fewer samples for safety alignment

• Data quality is important to train better models


PaLM-2 • Model and data size should be scaled with 1:1 proportions
• Smaller models trained for larger iterations outperform larger models

14
Table 2: Key insights and findings from the study of instruction-tuned Large Language Models.

Models Findings & Insights

• Multi-task prompting enables zero-shot generalization and outperforms baselines


T0 • Even a single prompt per dataset task is enough to improve performance

• To aid the model in effectively filtering and utilizing relevant information, human labelers play a
crucial role in answering questions regarding the usefulness of the retrieved documents
WebGPT • Interacting a fine-tuned language model with a text-based web-browsing environment can improve
end-to-end retrieval and synthesis via imitation learning and reinforcement learning
• Generating answers with references can make labelers easily judge the factual accuracy of answers

• Instruction tuning leads to a stronger generalization of unseen tasks


• More tasks improve generalization whereas only increasing task instances does not help
Tk-INSTRUCT • Supervised trained models are better than generalized models
• Models pre-trained with instructions and examples perform well for different types of inputs

• Instruction tuning enables zero-shot generalization to tasks never seen before


• Multi-lingual training leads to even better zero-shot generalization for both English and non-
English
mT0 and BLOOMZ • Training on machine-translated prompts improves performance for held-out tasks with non-English
prompts
• English only fine-tuning on multilingual pre-trained language model is enough to generalize to
other pre-trained language tasks

• Creating a batch with multiple task examples is important for better performance
• Only example proportional sampling is not enough, training datasets should also be proportional
for better generalization/performance
• Fully held-out and partially supervised tasks performance improves by scaling tasks or categories
OPT-IML whereas fully supervised tasks have no effect
• Including small amounts i.e. 5% of pretraining data during fine-tuning is effective
• Only 1% reasoning data improves the performance, adding more deteriorates performance
• Adding dialogue data makes the performance worse

• Labelers’ judgment and well-defined alignment rules help the model generate better responses
• Good dialogue goals can be broken down into detailed natural language rules for the agent and the
Sparrow raters
• The combination of reinforcement learning (RL) with reranking yields optimal performance in
terms of preference win rates and resilience against adversarial probing

• Finetuning with CoT improves performance on held-out tasks


• Fine-tuning along with CoT data improves reasoning abilities
• CoT tuning improves zero-shot reasoning
Flan • Performance improves with more tasks
• Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models
• Improving the model’s performance with instruction tuning is compute-efficient
• Multitask prompting enables zero-shot generalization abilities in LLM

WizardCoder • Fine-tuning with re-written instruction-tuning data into a complex set improves performance

• Model learns to write safe responses with fine-tuning on safe demonstrations, while additional
LLaMA-2-Chat RLHF step further improves model safety and make it less prone to jailbreak attacks

LIMA • Less high quality data is enough for fine-tuned model generalization

15
new instructions (52k) and instances (82k input-output pairs) Aligning Directly with SFT: The PPO in the RLHF pipeline
using GPT-3 [6]. Contrary to this, Dynosaur [145] uses the is complex, memory-intensive, and unstable, requiring mul-
meta-data of datasets on Huggingface to prompt LLMs to tiple models, reward, value, policy, and reference models.
generate multiple task instruction-tuning datasets. Avoiding this sophisticated alignment pipeline is possible by
LLaMA Tuned: Various models in the literature instruction- incorporating minimal changes in the supervised fine-tuning
tune LLaMA [146] with GPT-3 [6] or GPT-4 [147] generated (SFT) pipeline as in [158, 159, 160], with better or compa-
datasets. Among these, Alpaca [148], Vicuna [149], and rable performance to PPO. Direct preference optimization
LLaMA-GPT-4 [150] are a few general-purpose fine-tuned (DPO) [158] trains a model directly on the human-preferred
models, where Alpaca is trained on 52k samples from text- responses to maximize the likelihood of preferred against
davinci-003, Vicuna on 70k samples from ShareGPT.com, unpreferred responses, with per-sample importance weight.
and LLaMA-GPT-4 by re-creating Alpaca instructions from Reward ranked fine-tuning RAFT [159] fine-tunes the model
GPT-4. Goat [151] fine-tunes LLaMA for arithmetic tasks on ranked responses by the reward model. Preference ranking
(1 million samples) by generating data from ChatGPT and optimization (PRO) [161] and RRHF [160] penalize the model
outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its to rank responses with human preferences and supervised loss.
success to the LLaMA’s consistent tokenization of numbers. On the other hand, chain-of-hindsight (CoH) [162] provides
HuaTuo [152] is a medical knowledge model, fine-tuned with feedback to the model in language rather than reward, to learn
a generated QA dataset of 8k instructions. good versus bad responses.
Complex Instructions: Evol-Instruct [153, 154] prompts Aligning with Synthetic Feedback: Aligning LLMs with
LLMs to convert given instructions into a more complex set. human feedback is slow and costly. The literature suggests a
The instructions are iteratively evolved with re-writing instruc- semi-automated process to align LLMs by prompting LLMs to
tions in complex wording and creating new instructions. With generate helpful, honest, and ethical responses to the queries,
this style of automated instruction generation, WizardLM [153] and fine-tuning using the newly created dataset. Constitutional
(fine-tuned LLaMA on 250k instructions), outperforms Vicuna AI [163] replaces human feedback in RLHF with AI, calling
and Alpaca, and WizardCoder [154] (fine-tuned StarCoder) it RL from AI feedback (RLAIF). AlpacaFarm [164] designs
beats Claude-Plus, Bard, and others. prompts to imitate human feedback using LLMs APIs. Oppo-
site to constitutional AI, AlpacaFarm injects noise in feedback
to replicate human mistakes. Self-Align [98] prompts the
LLM with ICL examples, instructing the LLM about what the
3.2.3. Aligning with Human Preferences response should contain to be considered useful and ethical.
Incorporating human preferences into LLMs presents a The same LLM is later fine-tuned with the new dataset.
significant advantage in mitigating undesirable behaviors and Aligning with Prompts: LLMs can be steered with prompts to
ensuring accurate outputs. The initial work on alignment, such generate desirable responses without training [165, 166]. The
as InstructGPT [20] aligns GPT-3 using a 3-step approach, self-correction prompting in [166] concatenates instructions
instruction-tuning, reward modeling, and fine-tuning with and CoT with questions, guiding the model to answer its
reinforcement learning (RL). The supervised fine-tuned GPT-3 instruction following a strategy to ensure moral safety before
on demonstrations is queried to generate responses, which the actual answer. This strategy is shown to reduce the harm in
human labelers rank according to human values, and a reward generated responses significantly.
model is trained on the ranked data. Lastly, the GPT-3 is trained Red-Teaming/Jailbreaking/Adversarial Attacks: LLMs
with proximal policy optimization (PPO) using rewards on the exhibit harmful behaviors, hallucinations, leaking personal in-
generated data from the reward model. LLaMA 2-Chat [21] formation, and other shortcomings through adversarial probing.
improves alignment by dividing reward modeling into help- The models are susceptible to generating harmful responses
fulness and safety rewards and using rejection sampling in even though they are aligned for safety [167, 168]. Red-
addition to PPO. The initial four versions of LLaMA 2-Chat teaming is a common approach to address illicit outputs, where
are fine-tuned with rejection sampling and then with PPO on the LLMs are prompted to generate harmful outputs [168, 169].
top of rejection sampling. The dataset collected through red-teaming is used to fine-tune
Aligning with Supported Evidence: This style of alignment models for safety. While red-teaming largely relies on human
allows the model to generate responses with proofs and facts, annotators, another work [170] red-team LLMs to find prompts
reduces hallucination, and assists humans more effectively, that lead to harmful outputs for other LLMs.
which increases trust in the model’s output. Similar to
the RLHF training style, a reward model is trained to rank
generated responses containing web citations in answers 3.2.4. Continue Pre-Training
to questions, which is later used to train the model, as in Although fine-tuning boosts a model’s performance, it leads
GopherCite [155], WebGPT [156], and Sparrow [157]. The to catastrophic forgetting of previously learned information.
ranking model in Sparrow [157] is divided into two branches, Concatenating fine-tuning data with a few randomly selected
preference reward and rule reward, where human annotators pre-training samples in every iteration avoids network forget-
adversarial probe the model to break a rule. These two rewards ting [171, 142]. This is also effective in adapting LLMs for
together rank a response to train with RL. cases where fine-tuning data is small and the original capac-
16
ity is to be maintained. Prompt-based continued pre-training lightweight and the other with heavyweight attention and feed-
(PCP) [172] trains the model with text and instructions related forward layers. All tokens are processed from the lightweight
to tasks and then finally instruction-tunes the model for down- branch, and only important tokens are routed to the heavy-
stream tasks. weight branch. LongNet [178] replaces standard attention with
dilated attention, expanding sequence length to 1 billion tokens.
3.2.5. Sample Efficiency LongLoRA [179] proposes shift-short attention, used during
While fine-tuning data is generally many-fold smaller than fine-tuning to reduce dense attention costs. However, the model
the pre-training data, it still has to be large enough for accept- during inference uses dense attention and achieves similar per-
able performance [16, 97, 18] and requires proportional com- formance as full attention fine-tuning.
puting resources. Studying the effects on performance with less Extrapolation without Training: LM-Infinite [176] and par-
data, existing literature [173, 174] finds that models trained allel context windows (PCW) [180] show length extrapolation
on less data can outperform models trained with more data. is possible using pre-trained LLMs. LM-Infinite suggested Λ-
In [173], 25% of the total downstream data is found enough shaped attention applied within the original context window
for state-of-the-art performance. Selecting coreset-based 0.5% limits. Likewise, PCW chunks larger inputs into the pre-trained
of the total instruction-tuning data improves the model perfor- context lengths and applies the same positional encodings to
mance by 2% in [174], as compared to the complete data tun- each chunk.
ing. Less is more for alignment (LIMA) [175] uses only 1000
carefully created demonstrations to fine-tune the model and has 3.4. Augmented LLMs
achieved comparable performance to GPT-4. LLMs are capable of learning from the examples concate-
nated with the input, known as context augmentation, in-
3.3. Increasing Context Window context learning (ICL), or few-shot prompting. They show ex-
LLMs are trained with limited context windows due to ex- cellent generalization to unseen tasks with few-shot prompt-
pensive attention and high memory requirements. A model ing, enabling LLMs to answer queries beyond the capacity ac-
trained on limited sequence lengths fails to generalize to unseen quired during training [6, 55]. These emergent abilities allow
lengths at inference time [176, 49]. Alternatively, LLMs with for adapting the model without fine-tuning—a costly process.
ALiBi [65] positional encodings can perform zero-shot length Aside from this, hallucination, producing inaccurate, unsafe,
extrapolation. However, ALiBi has less expressive power [66] or factually incorrect responses, is common for LLMs, which is
and inferior performance on multiple benchmarks [46], and avoided by augmenting contextual data. While the user can pro-
many LLMs use RoPE positional embedding that is unable to vide in-context samples in the query [54, 32], here we specifi-
perform zero-shot extrapolation. A larger context length has cally refer to the methods that access external storage program-
benefits such as a better understanding of longer documents, matically, calling them augmented LLMs.
more samples in in-context learning, execution of bigger rea- The literature suggests various external memory designs to aug-
soning processes, etc. Expanding context length during fine- ment LLMs, long-term [181, 182, 183, 184], short-term [185],
tuning is slow, inefficient, and computationally expensive [49]. symbolic [186], and non-symbolic [187, 188]. The memory
Therefore, researchers employ various context window extrap- can be maintained in different formats such as documents, vec-
olation techniques discussed below. tors, or databases. A few systems maintain intermediate mem-
Position Interpolation: Rather than extrapolating, [49] shows ory representations to retain information across multiple iter-
that interpolating position encodings within the pre-trained con- ations [184, 182], while others extract important information
text window are more effective. The work demonstrates that from the datasets and save it in memory for recall [189]. The
only 1000 steps of fine-tuning are enough to achieve better re- memory read and write operations are performed either with
sults on larger windows without reducing performance com- or without LLMs cooperation [182, 190, 184, 191], acting as
pared to the original context size. Giraffe [46] uses power scal- a feedback signal in [185]. We discuss different types of aug-
ing in RoPE, and YaRN [47] proposed NTK-aware interpola- mented LLMs below.
tion.
Efficient Attention Mechanism: Dense global attention is 3.4.1. Retrieval Augmented LLMs
one of the major constraints in training larger context win- LLMs may have limited memory and outdated information,
dow LLMs. Using efficient attention variants, such as lo- leading to inaccurate responses. Retrieving relevant informa-
cal, sparse, and dilated attention, reduces the computation cost tion from external up-to-date storage enables the LLMs to
significantly. LongT5 [48] proposes transient global atten- accurately answer with references and utilize more informa-
tion (TGlobal), applying attention to local and global tokens tion. With retrieval augmentation, smaller models have been
(windowed token averaging). The model replaces attention shown to perform at par with larger models. For instance, the
in T5 [10] with TGlobal attention, pre-trains the model on 11B model can become competitive to 540B PaLM in [25] and
4098 sequence length, fine-tunes on larger window sizes, as 7.5B to 280B Gopher in [183]. Retrieval augmented language
large as 16k, and improves task performance on longer inputs. modeling (RALM) has two major components, shown in
This shows the extrapolation ability of TGlobal attention with Figure 12, namely: 1) retriever and 2) language model. In
only fine-tuning. COLT5 [177] uses two branches, one with RALM, the retriever plays a crucial role in driving LLM
17
ation. Retrieved samples are ranked to build ground-truth
data to train retrievers with contrastive learning in [196, 198].
RoBERTa is trained for downstream tasks in [197] for ICL
samples retrieval. REPLUG [199] trains the retriever with
supervised signals from the frozen LLM-generated outputs.
Training Retriever and LLM: Further benefits are achieved by
training both the retriever and the model in [25, 200, 201]. In
this case, the error propagates back to the retriever, updating
both the language model and the retriever. While masked
language modeling (MLM) is a common pre-training objec-
tive [25, 201], retrieval pre-trained transformer (RPT) [200]
used document chunk prediction as a pre-training objective for
long text modeling.
Figure 12: A flow diagram of Retrieval Augmented LLMs. The retriever ex-
tracts a similar context to the input and forwards it to the LLM either in simple
Encoded Context Augmentation: Concatenating retrieved
language or encoded through Fusion-in-Decoder (FiD). Depending on the task, documents with the query becomes infeasible as the sequence
retrieval and generation may repeat multiple times. length and sample size grow. Encoding the context and fusing
it with the decoder (Fusion-in-Decoder) using cross-attention
makes it possible to augment more samples without increasing
response, where incorrect information can steer LLMs to false computation costs significantly [202, 183, 200, 25].
behavior. This leads to the development of various methods to Web Augmented: Locally stored memory, but external to
retrieve accurate information and fuse with the query for better LLM, has limited information. However, a large amount of
performance. information is available on the internet, which gets updated
Zero-Shot Retrieval Augmentation: This kind of augmen- regularly. Rather than storing information locally, various
tation keeps the original LLM architecture and weights methods retrieve query-related context through a web search
unchanged and uses BM25 [192], nearest neighbors, or frozen and forward it to LLMs [203, 204, 156].
pre-trained models like Bert [7] as a retriever. The retrieved
information is provided as input to the model for response
generation, shown to improve performance over LLMs without 3.4.2. Tool Augmented LLMs
retrieval [188, 193]. In some scenarios, multiple retrieval While RAG relies on the retriever to provide context to the
iterations are required to complete the task. The output LLM to answer queries, tool augmented LLMs capitalize on the
generated in the first iteration is forwarded to the retriever reasoning abilities of LLMs to iteratively plan by dividing tasks
to fetch similar documents. Forward-looking active retrieval into sub-tasks, selecting necessary tools, and taking actions to
(FLARE) [187] initially generates the response and corrects complete the task [205, 206, 207, 27]. A generic pipeline of
the output by retrieving relevant documents if the response tool-augmented LLMs is shown in Figure 13, where different
contains low-confidence tokens. Similarly, RepoCoder [194] modules in Figure 13 are selected in a loop until the task com-
fetches code snippets recursively for code completion. pletion.
Training with Retrieval Augmentation: To reduce failures in Zero-Shot Tool Augmentation: LLMs in-context learning
retrieval augmentation generation (RAG), researchers train or and reasoning abilities enable them to interact with tools with-
fine-tune retrievers and LLMs with a retrieval augmentation out training. Automatic reasoning and tool-use (ART) [207]
pipeline. We discuss the literature below based on their focus builds a task library with demonstrations of reasoning steps and
on the respective training processes of the pipeline. calling external tools. It retrieves similar task examples and
Training LLM: Retrieval-enhanced transformer (RETRO) [183] provides the context to the LLM for inference. Aside from
shows pre-training smaller LLMs with RAG pipeline outper- this, [208] shows tool documentation is enough to teach LLMs
forms larger LLMs, such as GPT-3 trained without RAG. to use tools without demonstrations. RestGPT [209] integrates
RETRO uses a 2-trillion token subset of MassiveText as LLMs with RESTful APIs by decomposing tasks into planning
a database. The retrieval pipeline divides the input query and API selection steps. The API selector understands the API
into subsets and retrieves relevant chunks from the database documentation to select a suitable API for the task and plan the
for each subset, encoded together with input intermediate execution. ToolkenGPT [210] uses tools as tokens by concate-
representations for generating tokens. It uses cross-chunked nating tool embeddings with other token embeddings. During
attention to attend to previous chunks auto-regressively. A inference, the LLM generates the tool tokens representing the
study on RETRO [195] shows models pre-trained without RAG tool call, stops text generation, and restarts using the tool exe-
but fine-tuned using RAG lack the performance gains obtained cution output.
by pre-training with RAG. Training with Tool Augmentation: LLMs are trained to inter-
Training Retriever: Quality of responses generated by LLMs act with diverse tools, enhancing planning abilities to overcome
is highly dependent on the in-context examples. There- the limitations of zero-shot tool augmentation [211, 27, 212,
fore, [196, 197, 198, 199] train retrievers to retrieve accurate 213]. Gorilla [211] instruction-tunes LLaMA with information
few-shot samples while keeping the LLM frozen for gener- retrieval from API documentation. It uses the self-instruct [19]
18
as the brain of agents. LLMs have been incorporated in web
agents [156, 157], coding agents [219], tool agents [27, 213],
embodied agents [26], and conversational agents [185], requir-
ing minimal to no fine-tuning". Below we summarize the re-
search in LLMs-based autonomous agents. For a more detailed
discussion, please refer to [220, 221].
LLMs Steering Autonomous Agents: LLMs are the cognitive
controllers of the autonomous agents. They generate plans, rea-
son about tasks, incorporate memory to complete tasks, and
adapt the outline depending on the feedback from the environ-
ment. Depending on the acquired capabilities of LLMs, many
methods fine-tune, propose a better prompting approach, or uti-
lize different modules to enhance agents’ performance. Mod-
ules and strategies employed in autonomous agents are briefly
discussed below.
Planning and Reasoning: Completing a complex task requires
human-like logical thinking, planning necessary steps, and
reasoning current and future directions. Prompting methods
like chain-of-thoughts [103], tree-of-thoughts [105], and self-
consistency [104] are central to agents, eliciting LLMs to rea-
Figure 13: A basic flow diagram of tool augmented LLMs. Given an input and
a set of available tools, the model generates a plan to complete the task. The son its actions and choose among different paths for task com-
tool augmented LLMs utilize different modules iteratively, such as retriever, pletion. When LLMs are prompted with a task description and
tool execution, read-write to memory, feedback, etc., depending on the task. a sequence of actions, they can accurately generate plan ac-
tions without any fine-tuning [222]. Reasoning via planning
(RAP) [223] incorporates a re-purposed LLM as a world model
data generation pipeline with GPT-4 by providing in-context to reason about future outcomes and explore alternative paths
examples retrieved from API documentation. Tool augmented for task completion. Retroformer [224] uses a retrospective
language model (TALM) [27] fine-tunes T5 [10] for tool use LLM to improve main LLM planning and reasoning capabil-
with a self-play approach, where it iteratively completes tool ities by providing helpful task cues.
manipulation tasks and includes them back in the training set. Feedback: LLMs in open-loop systems generate plans and as-
ToolLLM [213] collects 16k APIs from RapidAPI. It samples sume that the agent will complete them successfully. However,
APIs from the list to generate an instruction-tuning dataset us- the actual scenario is different with failures and variable re-
ing ChatGPT in single-tool and multi-tool scenarios. For high- sponses from the environment. To correctly complete tasks,
quality datasets, ToolLLM suggested a depth-first search-based many methods use LLMs in a closed-loop where the action re-
decision tree (DFSDT) method to generate ground-truths with sponse is provided as feedback to the LLMs to re-assess and
diverse reasoning and planning. update the plan as required [225, 226, 227, 185]. Another di-
Multimodal Tool Augmentation: The compositional reasoning rection of research exploits LLMs as reward functions to train
capacity of LLMs allows them to manipulate tools in multi- reinforcement learning (RL) policies instead of humans [228].
modal settings [205, 206, 214]. Following the pipeline shown Memory: LLMs can learn from the context provided in the
in Figure 13, the LLM outlines a plan, generally executing in a prompt. In addition to internal memory, various systems em-
sequence: Plan → Tool selection → Execute → Inspect → ploy external memory to save the response history. Reflex-
Generate, to respond to the user query. Here, the database of ion [185] maintains an episodic memory to use previous re-
tools is rich in modalities, including text, images, etc. Many of sponses as feedback to improve future decision-making. Retro-
the multimodal tool augmentation systems employ multimodal former [224] improves its responses by employing short-term
LLMs [31, 215, 214, 206], while others utilize single modality and long-term memory, where short-term memory contains re-
LLMs and generate a plan on using different modality tools to cent responses and long-term memory keeps summarized failed
solve multimodal queries [216]. attempts to add in the prompt as reflection.
Multi-Agents Systems: LLMs can play user-defined roles and
3.5. LLMs-Powered Agents behave like a specific domain expert. In multi-agent systems,
AI agents are autonomous entities, capable of planning, each LLM is assigned a unique role, simulating human behav-
decision-making, and performing actions to achieve complex ior and collaborating with other agents to complete a complex
goals. In the early days, AI agents were rule-based, de- task [219, 229].
signed for narrow tasks, and had limited capabilities, such LLMs in Physical Environment: LLMs are good at
as Clippy [217] and Deep Blue [218]. In contrast to this, instruction-following, however, utilizing them for physically
LLMs abilities to respond to dynamic scenarios have made it grounded tasks requires adaptation, as they lack real-world
possible to incorporate them in diverse applications, includ- knowledge. This could lead to generating illogical responses
ing LLMs-powered agents [214, 206], where LLMs behave for a particular physical situation [230, 26]. SayCan [230]
19
make LLMs aware of the available low-level task operations. parameters with the model embeddings [237, 40, 241]. Task-
LLM (Say) builds a high-level plan to complete the task and specific fixed discrete prompts are concatenated with input em-
a learned affordance function (Can) explores the possibility of beddings in [40]. As discrete prompts bring instability, prompts
executing the plan in the real world. SayCan uses RL to train are encoded through a learnable mapping in P-Tuning [237],
the language-conditioned affordance function. PaLM-E enables naming continuous prompts, which are appended with the dis-
the LLM to solve grounded tasks by training multi-modal LLM crete prompts. Only the prompt encoder is trainable in the
feeding inputs directly from the sensors. model. In an extension of P-Tuning, continuous prompts are
Manipulation: In the area of manipulation [226, 231], LLMs concatenated with each layer of the network in [241]. Progres-
enhance a robot’s dexterity and adaptability, excelling in tasks sive prompts [242] avoid catastrophic forgetting and transfer
like object recognition, grasping, and collaboration. They ana- previously learned knowledge by sequentially adding trainable
lyze visual and spatial information to determine the most effec- prompt embeddings to the previously frozen task embeddings.
tive approach to interact with objects. Prefix Tuning: A set of trainable task-specific prefix vectors
Navigation: LLMs enhance a robot’s ability to navigate com- are appended to the frozen transformer layers in prefix tun-
plex environments with precision and adaptability [232, 233, ing [41]. The prefix vectors are virtual tokens attended by the
234, 235]. They generate feasible paths and trajectories for context tokens on the right. In addition, adaptive prefix tun-
robots, accounting for intricate environmental details [236]. ing [243] applies a gating mechanism to control the information
This ability is valuable in scenarios requiring precise and from the prefix and actual tokens.
dynamically adaptable navigation in environments like ware- Bias Tuning: Fine-tuning only bias terms in small to medium
houses, transport, healthcare facilities, and residences. training data has been found effective in BitFit [244]. This
method achieves full fine-tuning performance for tasks with less
3.6. Efficient LLMs training data and comparable performance with more training
data.
Deploying LLMs in production is expensive. Reducing their
running costs while preserving performance is an appealing 3.6.2. Quantization
area of research. This section summarizes the approaches sug- LLMs require extensive computing and memory for infer-
gested to enhance LLMs’ efficiency. ence. Deploying a 175B parameter GPT-3 model needs at
least 5x80GB A100 GPUs and 350GB of memory to store in
3.6.1. Parameter Efficient Fine-Tuning FP16 format [44]. Such demanding requirements for deploying
Fine-tuning LLMs with tens or hundreds of billions of pa- LLMs make it harder for smaller organizations to utilize them.
rameters, such as GPT-3 (175B), BLOOM (176B), MT-NLG Model compression is an effective solution but comes at the cost
(540B), etc., is computationally intensive and time-consuming. of degraded performance, especially at large scales greater than
To avoid complete model fine-tuning, numerous parameter- 6B. These models exhibit very large magnitude outliers that do
efficient fine-tuning (PEFT) techniques [40, 237, 41, 38, 39] try not exist in smaller models [245], making it challenging and re-
to achieve acceptable model fine-tuning performance at reduced quiring specialized methods for quantizing LLMs [44, 246].
costs. As compared to full fine-tuning [238], PEFT performs Post-Training Quantization: Minimal or no training is re-
better in low-resource setups, achieves comparable perfor- quired in this type of quantization, without significantly com-
mance on medium-resource scenarios, and performs worse than promising the model performance. LLM-8-bit [245] uses full-
full fine-tuning under high-resource availability. An overview precision matrix multiplication for weights associated with out-
of different PEFT approaches is shown in Figure 14. lier features and 8-bit for remaining features. The lower pre-
Adapter Tuning: Adds a few trainable parameters within the cision multiplication outputs are converted to FP-16 and con-
transformer block. The adapter layer is a sequence of feature catenated with others. The quantized models have homogenous
downscaling, non-linearity, and upscaling [106]. Variants of word embeddings, which may degrade their performance. To
adapter tuning inject adapter layers sequentially [106] and in fix this, token-level knowledge distillation is employed in [45]
parallel [38], whereas the mixture of adapter (AdaMix) [239] along with independent quantization scaling factors for each
employs multiple adapter modules in a single layer. AdaMix module due to varying weight distribution. Feature distribu-
routes input instances randomly to one of the multiple down- tions are asymmetric and appear in different channels; outlier
scale and upscale modules. The mixture of adapters is averaged suppression [247] shifts and scales per-channel activation dis-
out for inference to avoid additional latency. Low-Rank Adap- tributions for effective quantization. SmoothQuant [44] quan-
tation (LoRA) [240] learns low-rank decomposed matrices to tizes activations and weights to INT8 format by smoothing
freeze original weights. The learned weights are fused with the activations and migrating the quantization difficulty toward
original weights for inference, avoiding latency. weights. It multiplies the inverse of the smoothing factor with
Prompt Tuning: Prompting is an effective way to adapt a weights, which introduces a few outliers in the weights but is
pre-trained LLM for the downstream task. However, manual easier to quantify than unsmoothed activations. OPTQ [246]
prompts bring uncertainty in the model’s prediction, where a uses the optimal brain compression (OBC) [248] algorithm to
change in a single word drops the performance [237]. Prompt quantize the model layer-by-layer and update weights to com-
tuning alleviates this problem by fine-tuning only 0.001%-3% pensate for quantization error. To improve speed and per-
additional parameters [241]. It concatenates trainable prompt formance, OPTQ updates weights in arbitrary order, employs
20
Figure 14: Illustration of parameter-efficient fine-tuning paradigms, where x is input and h is hidden state, figure courtesy [38]. Parallel adapter and LoRA fall in
the adapter tuning category.

lazy updates, and uses better Cholesky kernels. Outlier-aware model does not require fine-tuning, thereby saving computa-
weight quantization (OWQ) [249] uses the OPTQ algorithm for tional costs. Outlier weighed layerwise sparsity (OWL) [257]
quantization but assigns higher precision to vulnerable weights, extends Wanda with non-uniform layer pruning. It shows that
causing outliers and lower precision for others. the number of outliers varies for different layers; therefore, the
Quantization-Aware Training: To compensate for perfor- model should have variable pruning ratios for better perfor-
mance degradation, a quantized model is fine-tuned in mance for every layer. Contrastive pruning (CAP) [43] itera-
quantization-aware training (QAT) [250, 251, 252]. Al- tively prunes the model by training the sparse model using con-
pha Tuning quantizes the model using binary coding quan- trastive loss between pre-trained, fine-tuned, and snapshots of
tization (BCQ) [253] and fine-tunes only quantization scal- previous sparse models to learn task-specific and task-agnostic
ing factors. This approach improves performance over knowledge.
parameter-efficient fine-tuning of the pre-trained model. Sim- Structured Pruning: Here, the parameters are removed in
ilarly, parameter-efficient and quantization-aware adaptation groups, rows, columns, or matrices, which speeds up the
(PEQA) [254] reduces the precision of fully-connected layers inference because of effective hardware tensor core utiliza-
and fine-tunes only quantization scaling parameters. LLM- tion [255]. LLM-Pruner [42] employs a 3-stage structured
QAT [252] generates training data from the pre-trained network pruning strategy, identifying the groups of hidden states caus-
and trains a quantized student model with knowledge distilla- ing each other to activate during the forward-pass, keeping im-
tion. QLoRA [251] fine-tunes 4-bit quantized pre-trained LLM portant groups and removing less important ones, and fine-
with LoRA [240] using a 4-bit normal float, which shows better tuning the pruned model with LoRA. Sparsity-induced mask
performance over a 4-bit integer and float. learning (SIMPLE) [258] prunes the network using learnable
masks. Similarly, another method prunes LLMs by learning
3.6.3. Pruning masks and removing unimportant rank-1 components of the
factorized weight matrix [256].
Pruning is an alternative approach to quantization to com-
press model size, thereby reducing LLMs deployment costs
significantly. Compared to task-agnostic pruning, task-specific 3.7. Multimodal LLMs
pruning is easily achievable with good performance, where a Inspired by the success of LLMs in natural language process-
model is fine-tuned on the downstream task and pruned for ing applications, an increasing number of research works are
faster inference. It is possible to prune LLMs for individual now facilitating LLMs to perceive different modalities of infor-
tasks, but the cost of pruning and deploying task-specific mod- mation like image [259, 260, 261], video [262, 263, 264], au-
els is high. To overcome this, many structured and unstructured dio [265, 264, 266], etc. Multimodal LLMs (MLLMs) present
pruning methods for LLMs have been proposed to maintain rea- substantial benefits compared to standard LLMs that process
sonable performance across all tasks while shrinking the model only text. By incorporating information from various modal-
size [255, 42, 256]. ities, MLLMs can achieve a deeper understanding of context,
Unstructured Pruning: This kind of pruning removes less im- leading to more intelligent responses infused with a variety of
portant weights without maintaining any structure. Existing expressions. Importantly, MLLMs align closely with human
LLM pruning methods take advantage of the unique charac- perceptual experiences, leveraging the synergistic nature of our
teristics of LLMs, uncommon for smaller models, where a multisensory inputs to form a comprehensive understanding of
small subset of hidden states are activated with large magni- the world [266, 26]. Coupled with a user-friendly interface,
tude [245]. Pruning by weights and activations (Wanda) [255] MLLMs can offer intuitive, flexible, and adaptable interactions,
prunes weights in every row based on importance, calculated allowing users to engage with intelligent assistants through a
by multiplying the weights with the norm of input. The pruned spectrum of input methods. According to the ways of construct-
21
ing models, current MLLMs can be generally divided into three cific visual bias to generate a chain of reasoning implicitly. In
streams: pre-training, fine-tuning, and prompting. In this sec- addition to CoT problems, LLMs can also be prompted with
tion, we will discuss more details of these main streams, as well multimodal descriptions and tools, effectively dividing complex
as the important application of MLLMs in visual reasoning. tasks into sub-tasks [279, 280].
Pre-training: This stream of MLLMs intends to support differ- Visual Reasoning Application: Recent visual reasoning sys-
ent modalities using unified end-to-end models. For instance, tems [281, 282, 206, 283] tend to apply LLMs for better visual
Flamingo [259] applies gated cross-attention to fuse vision and information analysis and visual-language integration. Differ-
language modalities, which are collected from pre-trained and ent from previous works [284, 285] that rely on limited VQA
frozen visual encoder and LLM, respectively. Moreover, BLIP- datasets and small-scale neural networks, current LLM-aided
2 [260] proposes a two-stage strategy to pre-train a Querying methods offer benefits of stronger generalization ability, emer-
Transformer (Q-Former) for the alignment between vision and gent ability, and interactivity [58]. To realize visual reasoning
language modalities: in the first stage, vision-language repre- with the help of LLMs, prompting and fine-tuning techniques
sentation learning is bootstrapped from a frozen visual encoder; can also be utilized: for example, PointClip V2 [282] applies
and in the second stage, a frozen LLM bootstraps vision-to- LLMs to generate 3D-specific prompts, which are encoded as
language generative learning for zero-shot image-to-text gen- textual features and then combined with visual features for
eration. Similarly, MiniGPT-4 [267] deploys pre-trained and 3D recognition; and GPT4Tools [31] employs LoRA [240] to
frozen ViT [268], Q-Former and Vicuna LLM [149], only train- fine-tune LLMs following tool-related instructions. Serving
ing the linear projection layer for vision and language modali- as a controller [283], decision maker [286], or semantics re-
ties alignment. finer [281, 287], LLMs significantly facilitates the progress of
Fine-tuning: Derived from instruction tuning [16] for NLP visual reasoning research.
tasks [20, 16, 97], researchers are fine-tune pre-trained LLMs
using multimodal instructions. Following this method, LLMs 3.8. Summary and Discussion
can be easily and effectively extended as multimodal chat-
bots [267, 261, 29] and multimodal task solvers [269, 30, 270]. 3.8.1. Architecture
The key issue of this stream of MLLMs is to collect multi- Due to the gigantic scale of LLMs, minor changes in archi-
modal instruction-following data for fine-tuning [58]. To ad- tecture and training strategies have a big impact on performance
dress this issue, the solutions of benchmark adaptation [269, and stability. Here, we summarize key architectural modules
271, 272], self-instruction [19, 31, 273], and hybrid composi- used in various LLMs, leading to better performance, reduced
tion [274, 270] are employed, respectively. To mitigate the gap training time and memory, and better training stability.
between the original language modality and additional modal- Layer Normalization: The performance and training stability
ities, the learnable interface is introduced to connect differ- of LLMs are affected significantly by layer normalization. Pre-
ent modalities from frozen pre-trained models. Particularly, norm, that is normalizing inputs rather than outputs, is more
the learnable interface is expected to work in a parameter- common among LLMs stabilizing the training [6, 127, 108].
efficient tuning manner: e.g., LLaMA-Adapter [275] applies BLOOM [13] and AlexaTM [122] utilize an additional layer
an efficient transformer-based adapter module for training, normalization before embedding layer to stabilize the training
and LaVIN [274] dynamically learns the multimodal feature of large-scale models, while the model’s zero-shot generaliza-
weights using a mixture-of-modality adapter. Different from tion ability can be negatively impacted [13]. However, another
the learnable interface, the expert models can directly convert study [33] finds that pre-norm degrades fine-tuned model per-
multimodalities into language: e.g., VideoChat-Text [262] in- formance as compared to post-norm, and there are no stability
corporates Whisper [276], a speech recognition expert model, benefits of pre-norm beyond the 100B scale. Therefore, GLM-
to generate the captions of given videos for the understanding 130B [33] used deep-norm which is a variant of post-norm for
of following LLMs. better downstream task performance after fine-tuning.
Prompting: Different from the fine-tuning technique that Positional Encoding: Like other building blocks of the model,
directly updates the model parameters given task-specific positional encoding also affects the performance and training
datasets, the prompting technique provides certain context, ex- stability of LLMs. BLOOM [13] finds ALiBi outperforms
amples, or instructions to the model, fulfilling specialized tasks learned and rotary positional encodings. Contrary to this,
without changing the model parameters. Since prompting can GLM-130B [33] identifies rotary positional encoding as being
significantly reduce the need for large-scale multimodal data, better than ALiBi. So, there is no conclusion in the literature
this technique is widely used to construct MLLMs. Particularly, about positional encodings yet.
to solve multimodal Chain of Thought (CoT) problems [103], Parallel Attention: In this type of attention, feed-forward and
LLMs are prompted to generate both the reasoning process and attention layers are parallel to each other rather than sequen-
the answer given multimodal inputs [277]. On this front, differ- tial in a transformer block. It has been shown to reduce train-
ent learning paradigms are exploited in practice: for example, ing time by 15%. There is no evidence of performance drop
Multimodal-CoT [277] involves two stages of rationale genera- due to this change in the literature and it is used by the models
tion and answer inference, where the input of the second stage PaLM [15], GPT-NeoX [118], and CodeGen [130].
is a combination of the original input and the output of the first Multi-Query Attention It has shared key and value attention
stage; and CoT-PT [278] applies both prompt tuning and spe- heads in a transformer block while query attention heads are
22
projected as usual. This reduces memory usage and speeds up MT-NLG [117] found higher variance for weight initialization
sampling in autoregressive decoding. No performance degrada- leads to unstable training, hence validating small initialization
tion has been observed with this change and it makes the train- scheme [288]. Various models perform random weight initial-
ing efficient allowing larger batch sizes. Multi-query attention ization which can cause bad initialization, Galactica [138] sug-
is used in [15, 132]. gests a longer warmup to negate the effect.
Mixture of Experts: This type of architecture enables eas- Learning Rate: A suitable learning rate is important for sta-
ily scaling models to trillions of parameters [92, 91]. Only a ble training. It is suggested to use a lower value [13, 15, 124]
few experts are activated during the computation making them with warmup and decay (cosine or linear). Usually, the learn-
compute-efficient. The performance of MoE models is better ing rate is within the range 1e−4 to 8e−4 . Moreover, MT-NLG
than dense models for the same amount of data and requires less (530B) [117] and GPT-NeoX (20B) [118] suggest interpolat-
computation during fine-tuning to achieve performance similar ing learning rates based on the model size using the GPT-3 [6]
to dense models as discussed in [91]. MoE architectures are models ranging between 13B and 175B. This avoids tuning the
less prone to catastrophic forgetting, therefore are more suited learning rate hyperparameter.
for continual learning [92]. Extracting smaller sub-models for Training Parallelism: 3D parallelism, a combination of data,
downstream tasks is possible without losing any performance, pipeline, and tensor parallelism, is the most utilized training
making MoE architecture hardware-friendly [92]. parallelism approach in LLMs [33, 15, 14, 13, 117, 115, 112].
Sparse vs Dense Activated: GPT-3 [6] uses sparse transform- In addition to 3D parallelism, BLOOM [13] uses a zero op-
P
ers [67] whereas GLaM [91] and PanGu- [92] use MoE [121] timizer [37] to shard optimizer states. PanGu-α [108] and
architectures to lower computational costs and increase the PanGu-Σ [92] go beyond 3D parallelism and apply 5D paral-
model size and capacity. According to the literature, sparse lelism which additionally contains optimizer parallelism and
modules do not degrade the model’s performance [67]. How- rematerialization.
ever, more experiments are required to verify this statement. Mode Switching: It adds task-related tokens at the beginning
of the text during training. These tokens refer to the natural
3.8.2. Training Strategies language understanding and natural language generation tasks
Training models at a huge scale require tricks to reduce train- which are shown to improve downstream task performance
ing costs, avoid loss divergence, and achieve better perfor- in [125, 124, 122]. During fine-tuning and inference, tokens
mance. We summarize and discuss some of these key tricks are appended based on the downstream tasks.
used in different LLMs. Controllable Text Generation: Generating credible and con-
Mixed Precision: It is a famous method for LLMs to reduce trolled text from a pre-trained model is challenging. GPT-3 [6]
memory usage and improve training efficiency. In mixed pre- and other LLMs use in-context learning to control generated
cision, forward and backward passes are performed in FP16 text. While in-context learning helps in controlling the gener-
format whereas optimizer states and master weights are kept ated text, ERNIE 3.0 Titan [35] suggests using adversarial loss
in FP32 format [120]. A drawback associated with this for- to rank its generated text for credibility and soft prompts such as
mat change is training instability due to a smaller value range genre, topic, keywords, sentiment, and length for better control
resulting in loss spikes [33]. An alternative to FP16 is BF16 on generated text.
which has a comparatively larger range and performs precision-
sensitive operations like gradient accumulation and softmax in
3.8.3. Supervised Models vs Generalized Models
FP32 [13]. BF16 has better performance and training stability
but uses more memory and is supported on specific hardware, Although generalized models are capable of performing di-
for example, A100 GPUs. Therefore, its adoption in LLMs is verse tasks with good performance they have not yet outper-
limited. formed models trained in supervised settings. The supervised
Training Instability: Loss divergence or spiking is a common trained models are still state-of-the-art in various NLP tasks by
issue in LLMs that occurs multiple times during training. This a large margin as shown in [6, 15, 18].
happens in the presence of gradient clipping [15]. To mitigate
this problem, many approaches suggest restarting training from 3.8.4. Zero-Shot vs Few-Shot
an earlier checkpoint [15, 33, 91], skipping 200-500 earlier LLMs perform well in zero-shot and few-shot settings. But
data batches at the point of divergence in [15] and re-shuffling the performance difference between zero-shot and few-shot is
batches in [91]. The embedding layer gradient shrink proves to large for pre-trained models [6, 15], naming LLMs as meta-
further stabilize the training as its gradient norm is significantly learners [6]. LLMs zero-shot evaluations underperform unsu-
larger than the other layers [33]. Another suggestion to improve pervised methods in neural machine translation [6]. The liter-
training stability for larger models is not to use biases in dense ature shows pre-training is not enough for good zero-shot per-
and norm layers as in [15]. formance [15, 16]. To improve the zero-shot performance the
Weight Initialization: It plays a significant role in model con- literature suggests using instruction fine-tuning that improves
vergence and training stability. GPT-NeoX [118] initializes the zero-shot performance significantly and outperforms base-
feed-forward layers before residuals with L 2√d as in [143] and lines. Instruction fine-tuning has also been shown to improve
other layers with the small initialization scheme [288]. This zero-shot generalization to unseen tasks. Another model, Flan-
avoids activations growing exponentially with increasing depth. PaLM [16], unlocks zero-shot reasoning with CoT training.
23
3.8.5. Encoder vs Decoder vs Encoder-Decoder suggested various pre-training and fine-tuning datasets to en-
Traditionally, these architectures perform well for different hance LLMs capabilities. We summarize these efforts in Ta-
tasks, for example, encoder-only for NLU tasks, decoder-only ble 8. While numerous training datasets are available in the
for NLG, and encoder-decoder for sequence2sequence model- literature, we cover the most widely used ones in our summary.
ing. Encoder-only models are famous for smaller models such
as Bert [7], RoBERTa [289], etc., whereas LLMs are either 5.2. Evaluation Datasets and Tasks
decoder-only [6, 118, 13] or encoder-decoder [10, 11, 122].
The evaluation of LLMs is important in gauging their profi-
While decoder-only models are good at NLG tasks, various
ciency and limitations. This process measures the model’s abil-
LLMs, PaLM [15], OPT [14], GPT-3 [6], BLOOM [13],
ity to comprehend, generate, and interact with human language
LLaMA [146], are decoder-only models with significant per-
across a spectrum of tasks. Evaluating a language model (LM)
formance gains on both NLU and NLG tasks. In contradic-
is divided into two broader categories: 1) natural language un-
tion to this, T5 [10] and UL2 [125] identify encoder-decoder
derstanding (NLU) and 2) natural language generation (NLG).
models out-performing decoder-only models. In another study,
It is emphasized that tasks in NLU and NLG are softly catego-
PaLM [15] finds increasing the size of decoder-only models
rized and are often used interchangeably in the literature.
can reduce the performance gap between decoder-only and
Natural Language Understanding: This task measures the lan-
encoder-decoder architectures.
guage understanding capacity of LMs. It encompasses multiple
Although decoder-only architectures have become a trend for
tasks, including sentiment analysis, text classification, natural
LLMs, many recently proposed approaches [125, 122] use
language inference (NLI), question answering (QA), common-
mode-switching tokens in text with encoder-decoder architec-
sense reasoning (CR), mathematical reasoning (MR), reading
tures to enable task-specific modes. Similarly, CodeT5+ [34]
comprehension (RC), etc.
uses an encoder-decoder architecture with multiple training ob-
Natural Language Generation: This task assesses the language
jectives for different tasks, activating the encoder, decoder, or
generation capabilities of LLMs by understanding the provided
both according to the tasks. These variations in architecture
input context. It includes tasks such as summarization, sen-
and training objectives allow a model to perform well in differ-
tence completion, machine translation (MT), dialogue genera-
ent settings. Because of this dynamic configuration, the future
tion, etc.
of LLMs can be attributed to encoder-decoder architectures.
Numerous datasets are proposed for each task, evaluating
LLMs against different characteristics. To provide an overview
4. Model Configurations of evaluation datasets, we briefly discuss a few famous datasets
within each category and offer a comprehensive list of datasets
We provide different statistics of pre-trained and instruction- in Table 9. Moreover, we show a detailed overview of the train-
tuned models in this section. This includes information such as ing datasets and evaluation tasks and benchmarks used by vari-
publication venue, license type, model creators, steps trained, ous pre-trained LLMs in Table 10 and fine-tuned LLMs in Ta-
parallelism, etc in Table 3 and Table 4. Architecture details ble 11. We also compare the top-performing LLMs in various
of pre-trained LLMs are available in Table 5. Providing these NLP tasks in Table 12.
details for instruction-tuned models is unnecessary because it
fine-tunes pre-trained models for instruction datasets. Hence,
5.2.1. Multi-task
architectural details are the same as the baselines. Moreover,
MMLU [297]: A benchmark that measures the knowledge
optimization settings for various LLMs are available in Table 6
acquired by models during pretraining and evaluates models in
and Table 7. We do not include details on precision, warmup,
zero-shot and few-shot settings across 57 subjects, testing both
and weight decay in Table 7. These details are not as important
world knowledge and problem-solving ability.
as others to mention for instruction-tuned models, and are not
SuperGLUE [2]: A more challenging and diverse successor
provided by the papers.
to the GLUE [299] benchmark, SuperGLUE includes a variety
of language understanding tasks, such as question answering,
natural language inference, and co-reference resolution. It is
5. Datasets and Evaluation designed to provide a rigorous test of language understanding
and requires significant progress in areas like sample-efficient,
Generating training and evaluation datasets is expensive be- transfer, multi-task, and unsupervised or self-supervised learn-
cause of the large-scale data demand of LLMs. Hence, datasets ing.
for training and benchmarking these models are topics of key BIG-bench [298]: The BIG-bench (Behavior of Intelligent
importance. A summary of datasets commonly used by LLMs Generative Models Benchmark) is a large-scale benchmark de-
is provided next. signed to test the abilities of LLMs across a wide range of
tasks, including reasoning, creativity, ethics, and understanding
5.1. Training Datasets of specific domains.
The performance of LLMs largely depends on the training GLUE [299]: The General Language Understanding Evalua-
data’s quality, size, and diversity. Preparing training datasets tion (GLUE) benchmark is a collection of resources for train-
of high quality at a large scale is laborious. Researchers have ing, evaluating, and analyzing natural language understanding
24
Table 3: Summary of pre-trained LLMs (>10B). Only the LLMs discussed individually in the previous sections are summarized. “Data/Tokens” is the model’s
pre-training data, which is either the number of tokens or data size. “Data Cleaning” indicates whether data cleaning is performed or not. This includes heuristics
(Heur), deduplication (Dedup), quality filtering (QF), and privacy filtering (PF), “Cost” is the calculated training cost obtained by multiplying the GPUs/TPUs
hourly rate with the number of GPUs and the training time. The actual cost may vary due to many reasons such as using in-house GPUs or getting a discounted rate,
re-training, number of employees working on the problem, etc. “Training Parallelism” indicates distributed training using data parallelism (D), tensor parallelism
(T), pipeline parallelism (P), model parallelism (M), optimizer parallelism (OP), and rematerialization (R), where for “Library” column, “DS” is a short form for
Deep Speed. In column “Commercial Use”, we assumed a model is for non-commercial purposes if its license is unavailable.

Publication License Model No. of Commercial Steps Data/ Data No. of Processing Training Calculated Training
Models
Venue Type Creators Purpose Params Use Trained Tokens Cleaning Processing Units Unit Type Time Train. Cost Parallelism Library
T5 [10] JMLR'20 Apache-2.0 Google General 11B ✓ 1M 1T Heur+Dedup 1024 TPU v3 - - D+M Mesh TensorFlow
GPT-3 [6] NeurIPS'20 - OpenAI General 175B × - 300B Dedup+QF - V100 - - M -
mT5 [11] NAACL'21 Apache-2.0 Google General 13B ✓ 1M 1T - - - - - - -
PanGu-α [108] arXiv'21 Apache-2.0 Huawei General 200B ✓ 260k 1.1TB Heur+Dedup 2048 Ascend 910 - - D+OP+P+O+R MindSpore
CPM-2 [12] AI Open'21 MIT Tsinghua General 198B ✓ 1M 2.6TB Dedup - - - - D+M JAXFormer
Codex [131] arXiv'21 - OpenAI Coding 12B × - 100B Heur - - - - - -
ERNIE 3.0 [110] arXiv'21 - Baidu General 10B × 120k∗ 375B Heur+Dedup 384 V100 - - M∗ PaddlePaddle
Jurassic-1 [112] White-Paper'21 Apache-2.0 AI21 General 178B ✓ - 300B - 800 GPU - - D+M+P Megatron+DS
HyperCLOVA [114] EMNLP'21 - Naver General 82B × - 300B Clf+Dedup+PF 1024 A100 321h 1.32 Mil M Megatron
Yuan 1.0 [115] arXiv'21 Apache-2.0 - General 245B ✓ 26k∗ 180B Heur+Clf+Dedup 2128 GPU - - D+T+P -
Gopher [116] arXiv'21 - Google General 280B × - 300B QF+Dedup 4096 TPU v3 920h 13.19 Mil D+M JAX+Haiku
ERNIE 3.0 Titan [35] arXiv'21 - Baidu General 260B × - 300B Heur+Dedup - Ascend 910 - - D+M+P+D* PaddlePaddle
GPT-NeoX-20B [118] BigScience'22 Apache-2.0 EleutherAI General 20B ✓ 150k 825GB None 96 40G A100 - - M Megatron+DS+PyTorch
OPT [14] arXiv'22 MIT Meta General 175B ✓ 150k 180B Dedup 992 80G A100 - - D+T Megatron
BLOOM [13] arXiv'22 RAIL-1.0 BigScience General 176B ✓ - 366B Dedup+PR 384 80G A100 2520h 3.87 Mil D+T+P Megatron+DS
Galactica [138] arXiv'22 Apache-2.0 Meta Science 120B × 225k 106B Dedup 128 80GB A100 - - - Metaseq
GLaM [91] ICML'22 - Google General 1.2T × 600k∗ 600B Clf 1024 TPU v4 - - M GSPMD
LaMDA [140] arXiv'22 - Google Dialog 137B × 3M 2.81T Filtered 1024 TPU v3 1384h 4.96 Mil D+M Lingvo
MT-NLG [117] arXiv'22 Apache-v2.0 MS.+Nvidia General 530B × - 270B - 4480 80G A100 - - D+T+P Megatron+DS
AlphaCode [132] Science'22 Apache-v2.0 Google Coding 41B ✓ 205k 967B Heur+Dedup - TPU v4 - - M JAX+Haiku
Chinchilla [96] arXiv'22 - Google General 70B × - 1.4T QF+Dedup - TPUv4 - - - JAX+Haiku
PaLM [15] arXiv'22 - Google General 540B × 255k 780B Heur 6144 TPU v4 - - D+M JAX+T5X
AlexaTM [122] arXiv'22 Apache v2.0 Amazon General 20B × 500k 1.1T Filtered 128 A100 2880h 1.47 Mil M DS
U-PaLM [124] arXiv'22 - Google General 540B × 20k - - 512 TPU v4 120h 0.25 Mil - -
UL2 [125] ICLR'23 Apache-2.0 Google General 20B ✓ 2M 1T - 512 TPU v4 - - M JAX+T5X
GLM [33] ICLR'23 Apache-2.0 Multiple General 130B × - 400B - 768 40G A100 1440h 3.37 Mil M -
CodeGen [130] ICLR'23 Apache-2.0 Salesforce Coding 16B ✓ 650k 577B Heur+Dedup - TPU v4 - - D+M JAXFormer
LLaMA [127] arXiv'23 - Meta General 65B × 350k 1.4T Clf+Heur+Dedup 2048 80G A100 504h 4.12 Mil D+M xFormers
PanGuΣ [92] arXiv'23 - Huawei General 1.085T × - 329B - 512 Ascend 910 2400h - D+OP+P+O+R MindSpore
BloombergGPT [141] arXiv23 - Bloomberg Finance 50B × 139k 569B Dedup 512 40G A100 1272h 1.97 Mil M PyTorch
Xuan Yuan 2.0 [142] arXiv23 RAIL-1.0 Du Xiaoman Finance 176B ✓ - 366B Filtered 80GB A100 - - P DS
CodeT5+ [34] arXiv'23 BSD-3 Salesforce Coding 16B ✓ 110k 51.5B Dedup 16 40G A100 - - - DS
StarCoder [137] arXiv'23 OpenRAIL-M BigCode Coding 15.5B ✓ 250k 1T Dedup+QF+PF 512 80G A100 624h 1.28 Mil D+T+P Megatron-LM
LLaMA-2 [21] arXiv'23 LLaMA-2.0 Meta General 70B ✓ 500k 2T Minimal Filtering - 80G A100 1.7Mh - - -
PaLM-2 [123] arXiv'23 - Google General - × - - Ddedup+PF+QF - - - - - -

Table 4: Summary of instruction tuned LLMs (>10B). All abbreviations are the same as Table 3. Entries in “Data/Tokens” starting with “S-” represents the number
of training samples.

Publication License Model No. of Commercial Pre-trained Steps Data/ No. of Processing Train. Calculated Train.
Models
Venue Type Creators Purpose Params Use Models Trained Tokens Processing Units Unit Type Time Train. Cost Parallelism Library
WebGPT [156] arXiv'21 - OpenAI General 175B × GPT-3 - - - - - - - -
T0 [17] ICLR'22 Apache-2.0 BigScience General 11B ✓ T5 - 250B 512 TPU v3 270h 0.48 Mil - -
Tk-Instruct [18] EMNLP'22 MIT AI2+ General 11B ✓ T5 1000 - 256 TPU v3 4h 0.0036 Mil - Google T5
OPT-IML [97] arXiv'22 - Meta General 175B × OPT 8k 2B 128 40G A100 - - D+T Megatron
Flan-U-PaLM [16] ICLR'22 Apache-2.0 Google General 540B ✓ U-PaLM 30k - 512 TPU v4 - - - JAX+T5X
mT0 [144] ACL'23 Apache-2.0 HuggingFace+ General 13B ✓ mT5 - - - - - - - -
Sparrow [157] arXiv'22 - Google Dialog 70B × Chinchilla - - 64 TPU v3 - - M -
WizardCoder [154] arXiv'23 Apache-2.0 HK Bapt. Coding 15B × StarCoder 200 S-78k - - - - - -
Alpaca [148] Github'23 Apache-2.0 Stanford General 13B ✓ LLaMA 3-Epoch S-52k 8 80G A100 3h 600 FSDP PyTorch
Vicuna [149] Github'23 Apache-2.0 LMSYS General 13B ✓ LLaMA 3-Epoch S-125k - - - - FSDP PyTorch
LIMA [175] arXiv'23 - Meta+ General 65B - LLaMA 15-Epoch S-1000 - - - - - -
Koala [290] Github'23 Apache-2.0 UC-Berkley General 13B × LLaMA 2-Epoch S-472k 8 A100 6h 100 - JAX/FLAX

systems. It includes a variety of tasks that test a wide range of content from seven domains makes it a rigorous test for mod-
linguistic phenomena, making it a comprehensive tool for eval- els’ ability to handle a wide range of topics and conversational
uating language understanding in AI. contexts.
WiC [307]: This dataset assesses a model’s ability to dis-
5.2.2. Language Understanding cern word meanings based on context, aiding in tasks related
WinoGrande [344]: A large-scale dataset inspired by the orig- to Word Sense Disambiguation.
inal Winograd [347] Schema Challenge tests models on their Wikitext103 [308]: With over 100 million tokens from
ability to resolve pronoun ambiguity and encourages the devel- Wikipedia’s top articles, this dataset is a rich resource for tasks
opment of models that understand the broad context in natural that require understanding long-term dependencies, such as lan-
language text. guage modeling and translation.
CoQA [306]: A conversational question-answering dataset, PG19 [309]: This is a digital library of diverse books from
CoQA challenges models with questions that rely on conver- Project Gutenberg. It is specifically designed to facilitate re-
sation history and require free-form text answers. Its diverse search in unsupervised learning and language modeling, with a
25
Table 5: Architecture details of LLMs. Here, “PE” is the positional embedding, “nL” is the number of layers, “nH” is the number of attention heads, “HS” is the
size of hidden states.

Training
Models Type Attention Vocab Tokenizer Norm PE Activation Bias nL nH HS
Objective
T5 (11B) Enc-Dec Span Corruption Standard 32k SentencePiece Pre-RMS Relative ReLU × 24 128 1024
GPT3 (175B) Causal-Dec Next Token Dense+Sparse - - Layer Learned GeLU ✓ 96 96 12288
mT5 (13B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - - - -
PanGu-α (200B) Causal-Dec Next Token Standard 40k BPE Layer - - - 64 128 16384
CPM-2 (198B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - 24 64 -
Codex (12B) Causal-Dec Next Token Standard - BPE+ Pre-Layer Learned GeLU - 96 96 12288
ERNIE 3.0 (10B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 64 4096
Jurassic-1 (178B) Causal-Dec Next Token Standard 256k SentencePiece∗ Pre-Layer Learned GeLU ✓ 76 96 13824
HyperCLOVA (82B) Causal-Dec Next Token Dense+Sparse - BPE* Pre-Layer Learned GeLU - 64 80 10240
Yuan 1.0 (245B) Causal-Dec Next Token Standard - - - - - - 76 - 16384
Gopher (280B) Causal-Dec Next Token Standard 32k SentencePiece Pre-RMS Relative GeLU ✓ 80 128 16384
ERNIE 3.0 Titan (260B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 192 12288
GPT-NeoX-20B Causal-Dec Next Token Parallel 50k BPE Layer Rotary GeLU ✓ 44 64 -
OPT (175B) Causal-Dec Next Token Standard - BPE - - ReLU ✓ 96 96 -
BLOOM (176B) Causal-Dec Next Token Standard 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
Galactica (120B) Causal-Dec Next Token Standard 50k BPE+custom Layer Learned GeLU × 96 80 10240
GLaM (1.2T) MoE-Dec Next Token Standard 256k SentencePiece Layer Relative GeLU ✓ 64 128 32768
LaMDA (137B) Causal-Dec Next Token Standard 32k BPE Layer Relative GeGLU - 64 128 8192
MT-NLG (530B) Causal-Dec Next Token Standard 50k BPE Pre-Layer Learned GeLU ✓ 105 128 20480
AlphaCode (41B) Enc-Dec Next Token Multi-query 8k SentencePiece - - - - 64 128 6144
Chinchilla (70B) Causal-Dec Next Token Standard 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 80 64 8192
PaLM (540B) Causal-Dec Next Token Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
AlexaTM (20B) Enc-Dec Denoising Standard 150k SentencePiece Pre-Layer Learned GeLU ✓ 78 32 4096
Sparrow (70B) Causal-Dec Pref.&Rule RM - 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 16∗ 64 8192
U-PaLM (540B) Non-Causal-Dec MoD Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
UL2 (20B) Enc-Dec MoD Standard 32k SentencePiece - - - - 64 16 4096
GLM (130B) Non-Causal-Dec AR Blank Infilling Standard 130k SentencePiece Deep RoPE GeGLU ✓ 70 96 12288
CodeGen (16B) Causal-Dec Next Token Parallel - BPE Layer RoPE - - 34 24 -
LLaMA (65B) Causal-Dec Next Token Standard 32k BPE Pre-RMS RoPE SwiGLU - 80 64 8192
PanGu-Σ (1085B) Causal-Dec Next Token Standard - BPE Fused Layer - FastGeLU - 40 40 5120
BloombergGPT (50B) Causal-Dec Next Token Standard 131k Unigram Layer ALiBi GeLU ✓ 70 40 7680
Xuan Yuan 2.0 (176B) Causal-Dec Next Token Self 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
CodeT5+ (16B) Enc-Dec SC+NT+Cont.+Match Standard - Code-Specific - - - - - - -
StarCoder (15.5B) Causal-Dec FIM Multi-query 49k BPE - Learned - - 40 48 6144
LLaMA (70B) Causal-Dec Next Token Grouped-query 32k BPE Pre-RMS RoPE SwiGLUE - - - -
PaLM-2 - MoD Parallel - - - - - - - - -

special focus on long-form content. ARC [332]: A larger version of the ARC-Challenge, this
C4 [10]: A clean, multilingual dataset, C4 offers billions of to- dataset contains both easy and challenging grade-school level,
kens from web-crawled data. It is a comprehensive resource for multiple-choice science questions. It is a comprehensive test of
training advanced Transformer models on various languages. a model’s ability to understand and answer complex questions.
LCQMC [310]: The Large-scale Chinese Question Matching ARC-Easy [332]: A subset of the ARC dataset, ARC-
Corpus (LCQMC) is a dataset for evaluating the performance Easy, contains questions that are answered correctly by either
of models in semantic matching tasks. It contains pairs of ques- a retrieval-based algorithm or a word co-occurrence algorithm.
tions in Chinese and their matching status, making it a valuable It is a great starting point for models beginning to explore ad-
resource for research in Chinese language understanding. vanced question-answering.
ARC-Challenge [332]: A rigorous question-answering
5.2.3. Story Cloze and Sentence Completion dataset, ARC-Challenge includes complex, grade-school level
StoryCloze [324]: It introduces a new “StoryCloze Test”, a questions that demand reasoning beyond simple retrieval, test-
commonsense reasoning framework for evaluating story under- ing the true comprehension capabilities of models.
standing, generation, and script learning. It considers a model’s
ability to understand and generate coherent and sensible stories. 5.2.5. Contextual Language Understanding
LAMBADA [325]: This dataset evaluates contextual text un- RACE [337]: The RACE dataset is a reading comprehension
derstanding through a word prediction task. Models must pre- dataset collected from English examinations in China, which
dict the last word of a passage, which is easy for humans when benchmarks AI models for understanding and answering ques-
given the whole passage, but not when given only the last sen- tions on long and complex passages, simulating the challenge
tence. of a real-world examination.
RACE-Middle [337]: Another subset of the RACE [337]
5.2.4. Physical Knowledge and World Understanding dataset, RACE-Middle, contains middle school-level English
PIQA [330]: A dataset that probes the physical knowledge of exam questions. It offers a slightly less challenging but academ-
models, aiming to understand how well they are learning about ically oriented evaluation of a model’s comprehension skills.
the real world. RACE-High [337]: A subset of the RACE [337] dataset,
TriviaQA [331]: A dataset that tests models on reading com- RACE-High consists of high school-level English exam ques-
prehension and open domain question answering (QA) tasks, tions. It is designed to evaluate the comprehension ability of
with a focus on Information Retrieval (IR)-style QA. models in a more academic and challenging context.
26
Table 6: Summary of optimization settings used for pre-trained LLMs. The values for weight decay, gradient clipping, and dropout are 0.1, 1.0, and 0.1, respectively,
for most of the LLMs.

Sequence LR Optimizers Precision Weight Grad


Models Batch Size Length LR Warmup Decay AdaFactorAdam AdamWFP16 BF16 Mixed Decay Clip Dropout
T5 (11B) 211 512 0.01 × inverse square root ✓ - - - - - ✓
GPT3 (175B) 32K - 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
mT5 (13B) 1024 1024 0.01 - inverse square root ✓ - - - - - ✓
PanGu-α (200B) - 1024 2e-5 - - - - - - ✓ - - - -
CPM-2 (198B) 1024 1024 0.001 - - ✓ - - - - - ✓
Codex (12B) - - 6e-5 ✓ cosine ✓ ✓ ✓ - -
ERNIE 3.0 (12B) 6144 512 1e-4 ✓ linear ✓ - - - ✓ - -
Jurassic-1 (178B) 3.2M 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
HyperCLOVA (82B) 1024 - 6e-5 - cosine ✓ - - - ✓ - -
Yuan 1.0 (245B) <10M 2048 1.6e-4 ✓ cosine decay to 10% ✓ - - - ✓ - -
Gopher (280B) 3M 2048 4e-5 ✓ cosine decay to 10% ✓ ✓ - ✓ -
ERNIE 3.0 Titan (260B) - 512 1e-4 ✓ linear ✓ ✓ ✓ ✓ -
GPT-NeoX-20B 1538 2048 0.97e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
OPT (175B) 2M 2048 1.2e-4 - linear ✓ ✓ ✓ ✓ ✓
BLOOM (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Galactica (120B) 2M 2048 7e-6 ✓ linear decay to 10% ✓ - - - ✓ ✓ ✓
GLaM (1.2T) 1M 1024 0.01 - inverse square root ✓ FP32 + ✓ - ✓ ×
LaMDA (137B) 256K - - - - - - - - - - - - -
MT-NLG (530B) 1920 2048 5e-5 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
AlphaCode (41B) 2048 1536+768 1e-4 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
Chinchilla (70B) 1.5M 2048 1e-4 ✓ cosine decay to 10% ✓ ✓ - - -
PaLM (540B) 2048 2048 0.01 - inverse square root ✓ - - - ✓ ✓ ×
AlexaTM (20B) 2M 1024 1e-4 - linear decay to 5% ✓ ✓ ✓ - ✓
U-PaLM (540B) 32 2048 1e-4 - cosine ✓ - - - - - -
UL2 (20B) 1024 1024 - - inverse square root - - - - - - × - -
GLM (130B) 4224 2048 8e-5 ✓ cosine ✓ ✓ ✓ ✓ ✓
CodeGen (16B) 2M 2048 5e-5 ✓ cosine ✓ - - - ✓ ✓ -
LLaMA (65B) 4M Tokens 2048 1.5e-4 ✓ cosine decay to 10% ✓ - - - ✓ ✓ -
PanGu-Σ (1.085T) 512 1024 2e-5 ✓ - ✓ ✓ - - -
BloombergGPT (50B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Xuan Yuan 2.0 (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
CodeT5+ (16B) 2048 1024 2e-4 - linear ✓ ✓ ✓ - -
StarCoder (15.5B) 512 8k 3e-4 ✓ cosine ✓ ✓ ✓ - -
LLaMA-2 (70B) 4M Tokens 4k 1.5e-4 ✓ cosine ✓ ✓ ✓ ✓ -

Table 7: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are the same as the pre-trained models, while
no model uses weight decay for instruction tuning.

Sequence Optimizers Grad


Models Batch Size Length LR Warmup LR_Decay AdaFactor Adam AdamW Clip Dropout
WebGPT (175B) BC:512, RM:32 - 6e-5 - - ✓ - -
T0 (11B) 1024 1280 1e-3 - - ✓ - ✓
Tk-Instruct (11B) 1024 - 1e-5 - constant - - - - -
OPT-IML (175B) 128 2048 5e-5 × linear ✓ ✓ ✓
Flan-U-PaLM (540B) 32 - 1e-3 - constant ✓ - ✓
Sparrow (70B) RM: 8+16, RL:16 - 2e-6 ✓ cosine decay to 10% ✓ ✓ ×
WizardCoder (15B) 512 2048 2e-5 ✓ cosine - - - - -
Alpaca (13B) 128 512 1e-5 ✓ cosine - - ✓ ✓ ×
Vicuna (13B) 128 -2048 2e-5 ✓ cosine ✓ - ×
LIMA (65B) 32 2048 1e-5 × linear ✓ - ✓

QuAC [338]: This dataset simulates an information-seeking open-domain commonsense causal reasoning. Each question
dialog between students and teachers using hidden Wikipedia comprises a premise and two alternatives, and the model must
text. It introduces unique challenges not found in machine com- select the more plausible alternative, testing a model’s ability to
prehension datasets, making it a valuable resource for advanc- understand and reason about cause and effect.
ing dialog systems.
WSC [347]: The Winograd Schema Challenge (WSC) is a
reading comprehension task in which a system must resolve
5.2.6. Commonsense Reasoning
references in a text, often requiring world knowledge and rea-
HellaSwag [345]: A dataset that challenges models to pick the soning about the text.
best ending to a context uses Adversarial Filtering to create a
‘Goldilocks’ zone of complexity, where generated text is absurd CSQA [348]: The CommonsenseQA is a question-answering
to humans but often misclassified by models. dataset that requires commonsense knowledge to evaluate the
COPA [391]: This dataset evaluates a model’s progress in ability of AI models to understand and answer questions.
27
Table 8: Details of various well-known pre-training and fine-tuning datasets. Here, alignment means aligning with human preferences.

Dataset Type Size/Samples Tasks Source Creation Comments


C4 [10] Pretrain 806GB - Common Crawl Automated A clean, multilingual dataset with billions
of tokens
mC4 [11] Pretrain 38.49TB - Common Crawl Automated A multilingual extension of the C4
dataset, mC4 identifies over 100 lan-
guages using cld3 from 71 monthly web
scrapes of Common Crawl.
Common Crawl, PubMed Central,
PILE [291] Pretrain 825GB - OpenWebText2, ArXiv, GitHub, Automated A massive dataset comprised of 22 con-
Books3, and others stituent sub-datasets
ROOTs [292] Pretrain 1.61TB - 498 Hugging Face datasets Automated 46 natural and 13 programming lan-
guages
MassiveWeb, Books, News,
MassiveText [116] Pretrain 10.5TB - Automated 99% of the data is in English
Wikipedia, Github, C4
Wikipedia [293] Pretrain - - Wikipedia Automated Dump of wikipedia
CommonCrawl, C4, Wikipedia,
RedPajama [294] Pretrain 5TB - Automated Open-source replica of LLaMA dataset
Github, Books, StackExchange
PushShift.io Reddit Pretrain 21.1GB - Reddit Automated Submissions and comments on Reddit
from 2005 to 2019
BigPython [130] Pretrain 5.5TB Coding GitHub Automated -
Pool of Prompt (P3) [17] Instructions 12M 62 PromptSource Manual A Subset of PromptSource, created from
177 datasets including summarization,
QA, classification, etc.
xP3 [144] Instructions 81M 71 P3+Multilingual datasets Manual Extending P3 to total 46 languages
Super-NaturalInstructions (SNI) [18] Instructions 12.4M 1616 Multiple datasets Manual Extending P3 with additional multi-
lingual datasets, total 46 languages
Flan [16] Instructions 15M 1836 Muffin+T0-SF+NIV2 Manual Total 60 languages
OPT-IML [97] Instructions 18.1M 1667 - Manual -
Self-Instruct [19] Instructions 82k 175 - Automated Generated 52k instructions with 82k sam-
ples from 175 seed tasks using GPT-3
Alpaca [148] Instructions 52k - - Automated Employed self-instruct method to gener-
ate data from text-davinci-003
Vicuna [149] Instructions 125k - ShareGPT Automated Conversations shared by users on
ShareGPT using public APIs
LLaMA-GPT-4 [150] Instructions 52k - Alpaca Automated Recreated Alpaca dataset with GPT-4 in
English and Chinese
Unnatural Instructions [295] Instructions 68k - 15-Seeds (SNI) Automated -
LIMA [175] Instructions 1k - Multiple datasets Manual Carefully created samples to test perfor-
mance with fine-tuning on less data
Anthropic-HH-RLHF [296] Alignment 142k - - Manual
Anthropic-HH-RLHF-2 [168] Alignment 39k - - Manual

5.2.7. Reading Comprehension entailment, predicting whether a given sentence logically fol-
BoolQ [353]: A dataset derived from Google search queries, lows from another and evaluating a model’s understanding of
BoolQ challenges models to answer binary (yes/no) questions. logical relationships in a text.
The questions are naturally occurring and are paired with a WebQA [357]: A dataset for open-domain question answering,
paragraph from a Wikipedia article containing the answer. It WebQA offers a large collection of web-based question-answer
is a test of reading comprehension and reasoning. pairs. It is designed to assess the ability of AI models to under-
SQUADv2 [354]: The Stanford Question Answering Dataset stand and answer questions based on web content.
(SQuAD) [352] is a collection of questions posed by crowd CMRC2018 [359]: This dataset is a test of Chinese language
workers on a set of Wikipedia articles, where the answer to ev- models’ ability to reason comprehensively and is designed with
ery question is a segment of text from the corresponding reading a challenging span-extraction format that pushes the boundaries
passage. SQuADv2 combines the original SQuAD1.1 dataset of machine performance.
with over 50,000 unanswerable questions. The aim is to evalu-
ate a model’s ability to understand and answer questions based 5.2.8. Mathematical Reasoning
on a given context and to determine when a question is unan- MATH [372]: This dataset is a platform for evaluating the
swerable. mathematical problem-solving abilities of AI models. It con-
DROP [355]: DROP, or Discrete Reasoning Over the con- tains a diverse set of math problems, ranging from arithmetic
tent of Paragraphs, is designed to test a model’s ability to un- to calculus, and is designed to test the model’s ability to under-
derstand a wide variety of reading phenomena. It encourages stand and solve complex mathematical problems.
comprehensive and reliable evaluation of reading comprehen- Math23k [373]: This one challenges a model’s ability to un-
sion capabilities. derstand and solve mathematical word problems. It contains
RTE [356]: The Recognizing Textual Entailment (RTE) 23,000 Chinese arithmetic word problems that require models
datasets come from a series of annual competitions on textual to perform reasoning and computation based on the problem
28
Table 9: Categorized evaluation datasets used in evaluating LLMs.

Type Datasets/Benchmarks
Multi-Task MMLU [297], SuperGLUE [2], BIG-bench [298], GLUE [299], BBH [298], CUGE [300], Zero-
CLUE [301], FewCLUE [302], Blended Skill Talk [303], HELM [304], KLUE-STS [305]
Language Understanding CoQA [306], WiC [307], Wikitext103 [308], PG19 [309], LCQMC [310], QQP [311], WinoGender [312],
CB [313], FinRE [314], SanWen [315], AFQMC [301], BQ Corpus [316], CNSS [317], CKBQA 13 [318],
CLUENER [301], Weibo [319], AQuA [320], OntoNotes [321], HeadQA [322], Twitter Dataset [323]
Story Cloze and
StoryCloze [324], LAMBADA [325], LCSTS [326], AdGen [327], E2E [328], CHID [329], CHID-
Sentence Completion
FC [302]
Physical Knowledge and
PIQA [330], TriviaQA [331], ARC [332], ARC-Easy [332], ARC-Challenge [332], PROST [333], Open-
World Understanding
BookQA [334], WebNLG [335], DogWhistle Insider & Outsider [336]
Contextual Language RACE [337], RACE-Middle [337], RACE-High [337], QuAC [338], StrategyQA [339], Quiz Bowl [340],
Understanding cMedQA [341],cMedQA2 [342], MATINF-QA [343]
Commonsense Reasoning WinoGrande [344], HellaSwag [345], COPA [346], WSC [347], CSQA [348], SIQA [349], C3 [350],
CLUEWSC2020 [301], CLUEWSC [301], CLUEWSC-FC [302], ReCoRD [351]
Reading Comprehension SQuAD [352], BoolQ [353], SQUADv2 [354], DROP [355], RTE [356], WebQA [357], CMRC2017 [358],
CMRC2018 [359], CMRC2019 [360], COTE-BD [361], COTE-DP [361], COTE-MFW [361], Mul-
tiRC [362], Natural Questions [363], CNSE [317], DRCD [364], DuReader [365], Dureaderrobust [366],
DuReader-QG [365], SciQ [367], Sogou-log [368], Dureaderrobust -QG [366], QA4MRE [369], KorQuAD
1.0 [370], CAIL2018-Task1 & Task2 [371]
Mathematical Reasoning MATH [372], Math23k [373], GSM8K [374], MathQA [375], MGSM [376], MultiArith [377], AS-
Div [378], MAWPS [379], SVAMP [380]
Problem Solving HumanEval [131], DS-1000 [381], MBPP [382], APPS [372], CodeContests [132]
Natural Language Inference
ANLI [383], MNLI-m [384], MNLI-mm [384],QNLI [352], WNLI [347], OCNLI [301], CMNLI [301],
& Logical Reasoning
ANLI R1 [383], ANLI R2 [383], ANLI R3 [383], HANS [385], OCNLI-FC [302], LogiQA [386], Strate-
gyQA [339]
Cross-Lingual Understanding MLQA [387], XNLI [388], PAWS-X [389], XSum [390], XCOPA [391], XWinograd [392], TyDiQA-
GoldP [393], MLSum [394]
Truthfulness and Fact Checking TruthfulQA [395], MultiFC [396], Fact Checking on Fever [397]
Biases and Ethics in AI ETHOS [398], StereoSet [399], BBQ [400], Winobias [401], CrowS-Pairs [402]
Toxicity RealToxicityPrompts [403], CivilComments toxicity classification [404]
Language Translation WMT [405], WMT20 [406], WMT20-enzh [406], EPRSTMT [302], CCPM [407]
Scientific Knowledge AminoProbe [138], BioLAMA [138], Chemical Reactions [138], Galaxy Clusters [138], Mineral
Groups [138]
Dialogue Wizard of Wikipedia [408], Empathetic Dialogues [409], DPC-generated [96] dialogues, ConvAI2 [410],
KdConv [411]
Topic Classification TNEWS-FC [302], YNAT [305], KLUE-TC [305], CSL [301], CSL-FC [302], IFLYTEK [412]

description. 5.2.10. Cross-Lingual Understanding


GSM8K [374]: A dataset of diverse grade school math word XNLI [388]: A cross-lingual benchmark, XNLI extends the
problems, testing a model’s ability to perform multi-step math- MultiNLI [419] corpus to 15 languages, including low-resource
ematical reasoning. ones like Urdu. It tests models on cross-lingual sentence under-
standing, with 112,500 annotated pairs across three categories:
5.2.9. Problem Solving and Logical Reasoning entailment, contradiction, and neutral.
ANLI [383]: A large-scale dataset designed to test the robust- PAWS-X [389]: PAWS-X, or Cross-lingual Paraphrase Adver-
ness of machine learning models in Natural Language Inference saries from Word Scrambling, is a multilingual version of the
(NLI) is created through an iterative, adversarial process where PAWS [420] dataset for paraphrase identification. It includes
humans try to generate examples that models cannot correctly examples in seven languages and is designed to evaluate the
classify. performance of cross-lingual paraphrase identification models.
HumanEval [131]: A dataset for evaluating the problem-
solving ability of AI models, which includes a diverse set of
tasks that require various cognitive abilities, making it a com- 5.2.11. Truthfulness
prehensive tool for assessing general intelligence in AI. Truthful-QA [395]: A unique benchmark that measures a
StrategyQA [339]: A question-answering dataset that re- language model’s truthfulness when generating answers. The
quires reasoning over multiple pieces of evidence to evaluate dataset includes questions across various categories like health,
the strategic reasoning ability of AI models, pushing the bound- law, and politics, some designed to test the model against com-
aries of what machines can understand and answer. mon human misconceptions.
29
Table 10: An illustration of training datasets and evaluation tasks employed by pre-trained LLMs. Here, “QA” is question-answering, “Clf” is classification, “NLI”
is natural language inference, “MT” is machine translation, “RC” is reading comprehension, “CR” is commonsense reasoning, “MR” is mathematical reasoning,
“Mem.” is memorization.

Benchmark
Truthful/
BIG- Super Cloze/ Bias/
Models Training Dataset MMLU QA Clf NLI MT RC CR MR Coding
bench GLUE Completion Toxicity/
Mem.
T5 C4 [10] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
GPT-3 Common Crawl, WebText, Books Cor- ✓ ✓ ✓ ✓ ✓ ✓
pora, Wikipedia
mT5 mC4 [11] ✓ ✓ ✓
PanGu-α 1.1TB Chinese Text Corpus ✓ ✓ ✓ ✓ ✓
CPM-2 WuDaoCorpus [109] ✓ ✓
Codex 54 million public repositories from Github ✓
ERNIE-3.0 Chinese text corpora, Baidu Search, Web ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
text, QA-long, QA-short, Poetry and Cou-
plet Domain-specific data from medical,
law, and financial area Baidu knowledge
graph with more than 50 million facts
Jurassic-1 Wikipedia, OWT, Books, C4, Pile [291], ✓ ✓ ✓ ✓
arXiv, GitHub
HyperCLOVA Korean blogs, Community sites, News, ✓
KiN Korean Wikipedia, Wikipedia (En-
glish and Japanese), Modu-Corpus: Mes-
senger, News, Spoken and written lan-
guage corpus, Web corpus
Yuan 1.0 Common Crawl, SogouT, Sogou News, ✓ ✓ ✓ ✓
Baidu Baike, Wikipedia, Books
Gopher subsets of MassiveWeb Books, C4, News, ✓ ✓ ✓ ✓ ✓ ✓ ✓
GitHub and Wikipedia samples from Mas-
siveText
ERNIE-3.0 TITAN Same as ERNIE 3.0 and ERNIE 3.0 ad- ✓ ✓ ✓ ✓ ✓
versarial dataset, ERNIE 3.0 controllable
dataset
GPT-NeoX-20B Pile [291] ✓ ✓ ✓ ✓ ✓ ✓
OPT RoBERTa [289], Pile [291], PushShift.io ✓ ✓ ✓ ✓
Reddit [413]
BLOOM ROOTs [13] ✓ ✓ ✓ ✓ ✓ ✓
Galactica arXiv, PMC, Semantic Scholar, Wikipedia, ✓ ✓ ✓ ✓ ✓
StackExchange, LibreText, Open Text-
books, RefSeq Genome, OEIS, LIPID
MAPS, NASAExoplanet, Common Crawl,
ScientificCC, AcademicCC, GitHub repos-
itories Khan Problems, GSM8K, OneS-
mallStep
GLaM Filtered Webpages, Social media conversa- ✓ ✓ ✓ ✓ ✓
tions Wikipedia, Forums, Books, News
LaMDA Infiniset : Public documents, Dialogs, Ut- ✓
terances
MT-NLG Two snapshots of Common Crawl and ✓ ✓ ✓ ✓ ✓
Books3, OpenWebText2, Stack Exchange,
PubMed Abstracts, Wikipedia, PG-19
[242], BookCorpus2, NIH ExPorter, Pile,
CC-Stories, RealNews
AlphaCode Selected GitHub repositories, CodeCon- ✓
tests: Codeforces, Description2Code, Co-
deNet
Chinchilla MassiveWeb, MassiveText Books, C4, ✓ ✓ ✓ ✓ ✓ ✓
News, GitHub, Wikipedia
PaLM webpages, books, Wikipedia, news, arti- ✓ ✓ ✓ ✓ ✓ ✓
cles, source code, social media conversa-
tions
AlexaTM Wikipedia, mC4 ✓ ✓ ✓ ✓ ✓
U-PaLM Same as PaLM ✓ ✓ ✓ ✓ ✓ ✓ ✓
UL2 - ✓ ✓ ✓ ✓ ✓ ✓
GLM-130B - ✓ ✓ ✓
CodeGen Pile, BigQuery, BigPython ✓
LLaMA CommonCrawl, C4, Github, Wikipedia, ✓ ✓ ✓ ✓ ✓ ✓ ✓
Books, arXiv, StackExchange
PanGu-Σ WuDaoCorpora, CLUE, Pile, C4, Python ✓ ✓ ✓ ✓ ✓ ✓
code
BloombergGPT inPile, Pile, C4, Wikipedia ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeT5+ CodeSearchNet, Github Code ✓ ✓
StarCoder The Stack v1.2 ✓ ✓ ✓ ✓
LLaMA-2 ✓ ✓ ✓ ✓ ✓ ✓ ✓
PaLM-2 Web documents, Code, Books, Maths, ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Conversation

30
Table 11: An illustration of training datasets and evaluation benchmarks used in fine-tuned LLMs. “SNI” is a short of Super-NaturalInsturctions.

Truthful/
BIG-
Models Training Dataset MMLU BBH RAFT FLAN SNI PromptSource TyDiQA HumanEval MBPP Bias/
bench
Toxicity
T0 Pool of Prompts ✓
WebGPT ELI5 [414], ELI5 fact- ✓
check [156], TriviaQA [331],
ARC-Challenge [332], ARC-
Easy [332], Hand-written data,
Demonstrations of humans, Com-
parisons between model-generated
answers
Tk-INSTRUCT SNI [18] ✓
mT0 xP3 [144]
OPT-IML PromptSource [17], FLAN [16], ✓ ✓ ✓ ✓ ✓ ✓
SNI [415], UnifiedSKG [416],
CrossFit [417], ExMix [418],
T5 [10], Reasoning
Flan Muffin, T0-SF, NIv2, CoT ✓ ✓ ✓
WizardCoder Code Alpaca ✓ ✓

5.2.12. Biases and Ethics in AI vice to handle common questions; or applied to generate con-
ETHOS [398]: ETHOS is a hate speech detection dataset tent for digital platforms like websites, by creating human-like
built from YouTube and Reddit comments. It is a tool in the text based on given prompts [424]. Moreover, LLMs play a cru-
fight against online hate speech, offering binary and multi-label cial role in data analysis, where they can filter large volumes of
variants for robust content moderation. text data, summarize key points, and find patterns that would
StereoSet [399]: StereoSet is a comprehensive dataset de- take humans much longer to identify [425]. Despite their wide-
signed to measure and evaluate the presence of stereotypical ranging applications, it is essential to remember that LLMs,
biases in language models. It focuses on four key domains: similar to any AI system, are only as good as the data they have
gender, profession, race, and religion. Contrasting stereotypi- been trained on.
cal bias against language modeling ability provides a valuable Medicine: The application of LLMs in the field of medicine is
tool for understanding and mitigating biases in large language reshaping healthcare delivery and research. For example, LLMs
models. are increasingly used in clinical decision support systems to
provide physicians with evidence-based treatment recommen-
dations [426, 427, 428]. By analyzing patient data and medical
6. Applications literature, they can help identify potential diagnoses, suggest
appropriate tests, and recommend optimal treatment strategies.
Applying Large Language Models (LLMs) to a variety of Moreover, LLMs can also enhance patient interactions with
downstream tasks has become a popular trend in both AI- healthcare systems; e.g., they can be used in chatbot applica-
related research communities and industries, with many emerg- tions [429, 430, 431] to answer patient queries about symptoms
ing uses being discovered and explored daily. LLMs, which are or medications, schedule appointments, and even provide es-
capable of understanding and generating human-like text, have sential health advice. For medical research, LLMs are used to
found meaningful applications across a variety of fields. This extract and filter information from a considerable amount of
section provides an overview of LLM applications in medicine, medical literature, identify relevant studies, summarize find-
education, science, mathematics, law, finance, robotics, and ings, and even predict future research trends [432, 433, 434].
coding. While each of these domains pose different challenges, For medical education, LLMs can help create training mate-
LLMs open up opportunities to make significant contributions rials, generate exam questions, provide detailed explanations
to these domains through their generalizability. of complex medical topics, and offer personalized feedback to
General Purpose: LLMs are being widely considered as students [435, 436, 437, 438]. They can also simulate patient
general-purpose tools for a wide variety of tasks [421]. This interactions, enabling students to practice and improve their
is due to their inherent ability to understand, generate, and clinical skills. At a broader level, LLMs can assist in public
manipulate human-like text in a contextually relevant man- health initiatives by analyzing media data to detect disease out-
ner. This allows them to perform tasks ranging from simple breaks, monitor public sentiment towards health policies, and
language translation and question-answering to more complex disseminate health information in a clear and understandable
tasks like summarization, text generation, and even program- manner [439]. LLMs can be employed to support public health
ming help [422]. The utility of LLMs is further enhanced by initiatives, addressing related issues such as data privacy, the
their ability to adapt to the specific style and tone of the text necessity for explainability, and the potential risk of propagat-
they are processing, making the outputs more user-friendly and ing biases [440, 441].
context-aware. In everyday applications, LLMs can be used as Education: The integration of LLMs into the educational sec-
personal assistants, helping users draft emails or schedule ap- tor offers opportunities to enhance learning experiences, teacher
pointments [423]; they can also be deployed in customer ser-
31
Table 12: Performance comparison of top performing LLMs across various NLU and NLG tasks. Here, “N-Shots” indicate the number of example prompts provided
to the model during the evaluation, representing its capability in few-shot or zero-shot learning settings, “f” represents the fine-tuned version, and “B” represents the
benchmark.

Top-1 Top-2 Top-3


Task Dataset/Benchmark
Model (Size) Score (N-shots) Model (Size) Score (N-shots) Model (Size) Score (N-shots)
BIG-bench (B) Chinchilla (70B) 65.1 (5-shot) Gopher (280B) 53.97 (5-shot) PaLM (540B) 53.7 (5-shot)
Multi-Task
MMLU (B) GPT-4 (-) 86.4 (5-shot) Gemini (Ultra) 83.7 (5-shot) Flan-PaLM-2( f ) (Large) 81.2 (5-shot)
Language Understanding SuperGLUE (B) ERNIE 3.0 (12B) 90.6 (-) PaLM( f ) (540B) 90.4 (-) T5 (11B) 88.9 (-)
Story Comprehension and HellaSwag GPT-4 (-) 95.3 (10-shot) Gemini (Ultra) 87.8 (10-shot) PaLM-2 (Large) 86.8 (one shot)
Generation StoryCloze GPT3 (175B) 87.7 (few shot) PaLM-2 (Large) 87.4 (one shot) OPT (175B) 79.82 (-)
Physical Knowledge and PIQA PaLM-2 (Large) 85.0 (one shot) LLaMa (65B) 82.8 (zero shot) MT-NLG (530B) 81.99 (zero shot)
World Understanding TriviaQA PaLM-2 (Large) 86.1 (one shot) LLaMA-2 (70B) 85.0 (one shot) PaLM (540B) 81.4 (one shot)
Contextual Language
LAMBADA PaLM (540B) 89.7 (few shot) MT-NLG (530B) 87.15 (few shot) PaLM-2 (Large) 86.9 (one shot)
Understanding
WinoGrande GPT-4 (-) 87.5 (5-shot) PaLM-2 (Large) 83.0 (one shot) PaLM (540B) 81.1 (zero shot)
Commonsense Reasoning
SIQA LLaMA (65B) 52.3 (zero shot) Chinchilla (70B) 51.3 (zero shot) Gopher (280B) 50.6 (zero shot)
Reading Comprehension BoolQ PaLM( f ) (540B) 92.2 (-) T5 (11B) 91.2 (-) PaLM-2 (Large) 90.9 (one shot)
Truthfulness Truthful-QA LLaMA (65B) 57 (-)
MATH Gemini (Ultra) 53.2 (4-shot) PaLM-2 (Large) 34.3 (4-shot) LLaMa-2 (65B) 13.5 (4-shot)
Mathematical Reasoning
GSM8K GPT-4 (-) 92.0 (5-shot) PaLM-2 (Large) 80.7 (8-shot) U-PaLM (540B) 58.5 (-)
Problem Solving and
HumanEval Gemini( f ) (Ultra) 74.4 (zero shot) GPT-4 (-) 67.0 (zero shot) Code Llama (34B) 48.8 (zero shot)
Logical Reasoning

support, and educational content development. For students, by tify errors in reasoning or computation and suggest corrections,
analyzing their learning styles, performance, and preferences, serving as an invaluable tool for both learning and verification
LLMs can provide customized study materials and practice purposes [451, 452]. LLMs can be employed to check the valid-
questions to develop personalized learning experiences [442]. ity of mathematical proofs, offering a preliminary filter before
For teachers, LLMs can help to create lesson plans and grade human review. While they are not a substitute for the meticu-
assignments and generate diverse and inclusive educational lous work of mathematicians, they can help simplify the process
content, significantly saving more time for teaching and student of proof verification [453, 454]. Moreover, LLMs enhance ac-
interaction [443, 444]. In language learning, LLMs serve as cessibility to mathematics by translating complex concepts and
advanced conversational partners capable of simulating conver- findings into understandable language for non-specialists [455],
sations in multiple languages, correcting grammar, enhancing where the gap between theoretical mathematics and applied
vocabulary, and aiding pronunciation for the needs of fluency contexts such as physics, engineering, and economics can be
in practice [445]. Furthermore, LLMs improve accessibility bridged.
in education by providing support for students with disabili- Law: LLMs can assist with the thematic analysis of legal doc-
ties. They can generate real-time transcriptions for the hear- uments, including generating initial coding for datasets, iden-
ing impaired, offer reading assistance for the visually impaired, tifying themes, and classifying data according to these themes.
and simplify complex texts for those with learning disabili- This collaborative effort between legal experts and LLMs has
ties [441]. As LLMs continue to evolve, their applications in proved to be effective in analyzing legal texts such as court
education can benefit more students and teachers from different opinions on theft, improving both the efficiency and quality of
perspectives in practice. the research [456]. Additionally, LLMs have been evaluated for
Science: Similar to medical applications, LLMs can expedite their ability to generate explanations of legal terms, focusing
the research process by quickly analyzing and summarizing sci- on improving factual accuracy and relevance by incorporating
entific literature. By briefing comprehensible and accessible re- sentences from case law. By feeding relevant case law into the
search summaries, LLMs can assist researchers in staying up- LLM, the augmented models can generate higher-quality expla-
to-date with the latest findings, even in fields outside their area nations with less factually incorrect information [457]. More-
of expertise [446, 447]. In addition, LLMs can aid scientists over, LLMs can be trained with specialized domain knowledge
in formulating new hypotheses and research questions since to perform legal reasoning tasks [458] and answer legal ques-
their ability to process large-scale datasets allows them to un- tions [459].
veil insights that might not be immediately apparent to human Finance: LLMs like BloombergGPT [141], trained on exten-
researchers [448]. Moreover, for scientific writing, LLMs can sive proprietary financial datasets, exhibit superior performance
help researchers draft documents, suggest improvements, and on financial tasks. This indicates the value of domain-specific
ensure adherence to specific formatting guidelines [449, 450]. training in creating LLMs that can more accurately understand
This not only saves time but also improves the clarity of scien- and process industry-specific language and concepts. The intro-
tific communication, enabling interdisciplinary teams to work duction of FinGPT [460] as an open-source model offers trans-
together more effectively. parent and accessible resources to develop novel applications
Maths: In addition to providing mathematical research and such as robo-advising, algorithmic trading, and low-code so-
education support, LLMs can assist in solving mathematical lutions, ultimately expanding the capabilities of financial ser-
problems by giving step-by-step explanations and guiding users vices. Both BloombergGPT and FinGPT show the adaptabil-
through complex proofs and calculations. They can help iden- ity of LLMs to the financial domain, with the former showing
32
the power of custom datasets and the latter emphasizing a data- provide accurate answers to precise questions. However, gen-
centric approach and low-rank adaptation techniques for cus- eralization enables the model to make inferences and produce
tomization. Moreover, LLMs demonstrate an ability to break responses for inputs it has not seen before, which is essential
down complex financial tasks into actionable plans, enabling for handling various real-world tasks. Striking the right bal-
end-to-end solutions that were previously unfeasible with a sin- ance is the challenge: too much memorization can lead to over-
gle model [461]. fitting, making the model inflexible and struggling with new
Robotics: In robotics research, LLMs have promising appli- inputs [470].
cations, such as enhancing human-robot interaction [28, 462, Economic and Research Inequality: The high cost of train-
463, 464], task planning [227], motion planning [236], nav- ing and deploying LLMs may make their development concen-
igation [236, 465], object manipulation [226], personalized trated within well-funded organizations, potentially worsening
robots [466], etc. LLMs enable robots to understand the en- economic and research inequalities in AI [471].
vironment effectively and generate plans to complete tasks col- Reasoning and Planning: Some reasoning and planning tasks,
laboratively [230, 26]. They can facilitate continuous learning even as seemingly simple as common-sense planning, which
by allowing robots to access and integrate information from a humans find easy, remain well beyond the current capabilities
wide range of sources, helping robots acquire new skills, adapt of LLMs evaluated using an assessment framework. This is not
to changes, and refine their paths [214, 223, 224]. entirely unexpected, considering that LLMs primarily generate
text completions based on likelihood and offer no solid guaran-
tees in terms of reasoning abilities [472].
7. Challenges and Future Directions Hallucinations: LLMs exhibit “hallucinations", where they
generate responses that, while sounding plausible, are incor-
LLMs such as GPT-4 and its predecessors have significantly rect or do not align with the provided information [473]. The
advanced natural language processing. Nevertheless, they also hallucination can be categorized into three categories.
bring along a set of challenges. The computational cost, ad-
versarial robustness, and interpretability are among the tech- • Input-conflicting hallucination, wherein LLMs produce
nical challenges that are intrinsic to these models. Further- content that diverges from the input given by users.
more, as these models are scaled up to handle more complex
• Context-conflicting hallucination, where LLMs generate
tasks or to operate in more complex or dynamic environments,
content that contradicts information they have generated
new challenges in scalability, privacy, and real-time processing
earlier.
emerge. On the frontier of foundational research, integrating
multi-modality and the effectiveness of transfer learning are be- • Fact-conflicting hallucination involves LLM’s generation
ing keenly explored. Additionally, the continuous learning as- of content that does not align with established world
pect of these models, which aims to have models that can adapt knowledge.
to new information over time, presents a fresh set of challenges.
These challenges not only underscore the technical intricacies Prompt Engineering: Prompts serve as inputs to LLMs, and
involved but also highlight the broader impact and the future their syntax and semantics play a crucial role in determining
trajectory of LLMs in real-world applications. The following the model’s output. The prompt variations, sometimes counter-
sections delve into these challenges, shedding light on the on- intuitive to humans, can result in significant changes in model
going and potential efforts to address them. output and are addressed through prompt engineering, which
Computational Cost: Training LLMs requires extensive com- involves designing natural language queries to guide LLMs
putational resources, which increases production costs and responses effectively [474, 32].
raises environmental concerns due to substantial energy con- Limited Knowledge: Information acquired during pretraining
sumption during large-scale training. Improved performance is limited and may become obsolete after some time. Re-
occurs as computational resources increase, but the rate of training the model using updated data is costly. To generate
improvement gradually decreases when both the model and factually accurate responses people use a retrieval augmen-
dataset size remain fixed, following the power law of dimin- tation pipeline [188]. However, pre-trained models are not
ishing returns [467]. trained with retrieval augmentation generation (RAG) [6, 21],
Bias and Fairness: LLMs can inherit and amplify societal bi- hence, adapting the training pipeline is necessary [183, 25].
ases in their training data. These biases can manifest in the Safety and Controllability: Using LLMs comes with the risk
model’s outputs, leading to potential ethical and fairness is- of generating harmful, misleading, or inappropriate content,
sues [468]. whether by accident or when given specific prompts. Ensuring
Overfitting: Although LLMs possess substantial learning ca- these models are safely utilized is a significant concern [475].
pabilities, they are susceptible to overfitting noisy and peculiar Multi-Modality: Multi-modal learning, where LLMs are
patterns within their extensive training data. Consequently, this trained on diverse data like text, images, and videos, aims to
may cause them to generate illogical responses [469]. The de- create models with richer understanding but faces challenges
bate about Memorization vs. Generalization in LLMs is about in data alignment, fusion strategies, and higher computational
finding the right balance. Memorization allows the model to demands.
remember specific details from its training data, ensuring it can Catastrophic Forgetting: LLMs are often pre-trained on large
33
datasets and then fine-tuned on domain-specific data, reducing these models. GPUs have played a crucial role in meeting the
training resources but facing issues like domain adaptation and hardware requirements for training LLMs, with the networking
catastrophic forgetting, which hinders the retention of original industry also evolving to optimize hardware for training
knowledge when learning new tasks. workloads. However, the growing size of LLMs, which has
Adversarial Robustness: Large Language Models (LLMs) been outpacing hardware progress, makes model inference in-
have shown great capabilities in various tasks but are vul- creasingly costly. Model quantization is a promising approach
nerable to adversarial attacks, where slight, deliberate input to bridge the widening gap between LLM size and hardware
alterations can mislead them. Especially with models like capacity [484]. Although specialized hardware acceleration
BERT, adversarial fine-tuning can enhance robustness, al- like GPUs or TPUs can significantly reduce the computational
though it sometimes compromises generalization [476]. As cost, making real-time applications more feasible, they may not
LLMs integrate more into complex systems, examining their fully resolve all limitations, necessitating further advancements
security properties becomes crucial, given the emerging field in hardware technology.
of adversarial attacks on LLMs within trustworthy ML [477]. Regulatory and Ethical Frameworks: The rapid advancements
This vulnerability is notable in safety-critical domains, ne- in artificial intelligence have given rise to sophisticated Large
cessitating robust adversarial evaluation tools to ensure LLM Language Models (LLMs) like OpenAI’s GPT-4 [147] and
reliability [478]. Google’s Bard. These developments underscore the imperative
Interpretability and Explainability: The "black-box" nature for regulatory oversight to manage the ethical and social
of LLMs poses challenges in understanding their decision- challenges accompanying LLMs’ widespread use [485]. For
making, which is crucial for broader acceptance and trust, instance, LLMs can generate content that can be used posi-
especially in sensitive domains. Despite their advanced tively or negatively, emphasizing the need for proactive ethical
capabilities, the lack of insight into their operation limits their frameworks and policy measures to guide their responsible
effectiveness and trustworthiness [479, 480]. Efforts are being use and assign accountability for their outputs [486]. Auditing
made to make LLMs more explainable to promote user trust is identified as a promising governance mechanism to ensure
and to ensure responsible AI usage. Understanding the logic that AI systems, including LLMs, are designed and deployed
behind LLMs’ responses is essential for fostering trust and ethically, legally, and technically robust [487].
ensuring they align with human values and legal standards.
Privacy Concerns: Privacy concerns in Large Language
Models (LLMs) have escalated with their growth in complexity 8. Conclusion
and size, particularly around data sharing and potential misuse.
There is a risk of malicious content creation, filter bypass, This article has reviewed the developments on LLMs com-
and data privacy issues, especially in e-commerce, where prehensively. It contributes to summarizing significant find-
protecting customer privacy is crucial. If models are trained ings of LLMs in the existing literature and provides a de-
on private data, additional concerns arise if such models are tailed analysis of the design aspects, including architectures,
made publicly available. LLMs tend to memorize phrases from datasets, and training pipelines. We identified crucial archi-
their training sets, which an adversary could exploit to extract tectural components and training strategies employed by dif-
sensitive data, posing a threat to personal privacy [481, 482]. ferent LLMs. These aspects are presented as summaries and
Real-Time Processing: Real-time processing in Large Lan- discussions throughout the article. Moreover, we have dis-
guage Models (LLMs) is pivotal for various applications, cussed the performance differences of LLMs in zero-shot and
especially with the rising popularity of mobile AI applications few-shot settings, explored the impact of fine-tuning, and com-
and concerns regarding information security and privacy. pared supervised and generalized models and encoder vs. de-
However, LLMs often have hundreds of layers and millions coder vs. encoder-decoder architectures. A comprehensive re-
of parameters, which impede real-time processing due to the view of multi-modal LLMs, retrieval augmented LLMs, LLMs-
high computational demands and limited weight storage on powered agents, efficient LLMs, datasets, evaluation, applica-
hardware platforms, particularly in edge computing environ- tions, and challenges is also provided. This article is anticipated
ments [483]. While certain efforts like MobileBERT aim to serve as a valuable resource for researchers, offering insights
to reduce memory requirements, they still face substantial into the recent advancements in LLMs and providing funda-
execution overhead due to the large number of model layers, mental concepts and details to develop better LLMs.
leading to high inference latency.
Long-Term Dependencies: Large Language Models (LLMs) References
have shown considerable progress in understanding and
generating text, yet they often struggle with preserving context [1] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers:“the end of his-
tory” for natural language processing?, in: Machine Learning and
and handling long-term dependencies, particularly in complex, Knowledge Discovery in Databases. Research Track: European Con-
multi-turn conversations or long documents. This limitation ference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021,
can lead to incoherent or irrelevant responses. Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1
[2] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill,
Hardware Acceleration: The growth of LLMs presents signif-
O. Levy, S. Bowman, Superglue: A stickier benchmark for general-
icant hardware challenges due to the increasing computational purpose language understanding systems, Advances in neural informa-
and memory demands associated with training and deploying tion processing systems 32 (2019). 1, 24, 29

34
[3] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288
Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al., Towards a human- (2023). 2, 7, 10, 16, 25, 33
like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1 [22] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yo-
[4] B. A. y Arcas, Do large language models understand us?, Daedalus gatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of
151 (2) (2022) 183–197. 2 large language models, arXiv preprint arXiv:2206.07682 (2022). 2
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., [23] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large
Language models are unsupervised multitask learners, OpenAI blog language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2
1 (8) (2019) 9. 2, 7 [24] D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous sci-
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, entific research capabilities of large language models, arXiv preprint
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models arXiv:2304.05332 (2023). 2
are few-shot learners, Advances in neural information processing sys- [25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick,
tems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 17, 22, 23, 24, 25, 33 J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training retrieval augmented language models, arXiv preprint arXiv:2208.03299
of deep bidirectional transformers for language understanding, arXiv (2022). 2, 17, 18, 33
preprint arXiv:1810.04805 (2018). 2, 18, 24 [26] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter,
[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied
L. Zettlemoyer, Deep contextualized word representations, in: NAACL- multimodal language model, arXiv preprint arXiv:2303.03378 (2023).
HLT, Association for Computational Linguistics, 2018, pp. 2227–2237. 2, 19, 21, 33
2 [27] A. Parisi, Y. Zhao, N. Fiedel, Talm: Tool augmented language models,
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, arXiv preprint arXiv:2205.12255 (2022). 2, 18, 19
V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre- [28] B. Zhang, H. Soh, Large language models as zero-shot human models
training for natural language generation, translation, and comprehen- for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2,
sion, arXiv preprint arXiv:1910.13461 (2019). 2 33
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, [29] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi,
Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with Y. Shi, et al., mplug-owl: Modularization empowers large language
a unified text-to-text transformer, The Journal of Machine Learning Re- models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2,
search 21 (1) (2020) 5485–5551. 2, 7, 8, 17, 19, 24, 25, 26, 28, 30, 22
31 [30] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo,
[11] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, T. Lu, J. Zhou, Y. Qiao, et al., Visionllm: Large language model
A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to- is also an open-ended decoder for vision-centric tasks, arXiv preprint
text transformer, arXiv preprint arXiv:2010.11934 (2020). 2, 7, 8, 24, arXiv:2305.11175 (2023). 2, 22
25, 28, 30 [31] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, Y. Shan, Gpt4tools:
[12] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi, Teaching large language model to use tools via self-instruction, arXiv
J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-effective pre-trained lan- preprint arXiv:2305.18752 (2023). 2, 19, 22
guage models, AI Open 2 (2021) 216–224. 2, 8, 25 [32] E. Saravia, Prompt Engineering Guide, https://github.com/dair-
[13] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, ai/Prompt-Engineering-Guide (12 2022). 2, 7, 17, 33
R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b- [33] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu,
parameter open-access multilingual language model, arXiv preprint W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained
arXiv:2211.05100 (2022). 2, 4, 9, 11, 22, 23, 24, 25, 30 model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 22, 23, 25
[14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, [34] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5+:
M. Diab, X. Li, X. V. Lin, et al., Opt: Open pre-trained transformer Open code large language models for code understanding and genera-
language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 23, tion, arXiv preprint arXiv:2305.07922 (2023). 2, 10, 24, 25
24, 25 [35] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang,
[15] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, Y. Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowl-
P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scal- edge enhanced pre-training for language understanding and generation,
ing language modeling with pathways, arXiv preprint arXiv:2204.02311 arXiv preprint arXiv:2112.12731 (2021). 2, 8, 23, 25
(2022). 2, 6, 9, 10, 22, 23, 24, 25 [36] J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, Deepspeed: System op-
[16] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, timizations enable training deep learning models with over 100 billion
X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned parameters, in: Proceedings of the 26th ACM SIGKDD International
language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 17, Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–
22, 23, 25, 28, 31 3506. 2, 5
[17] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, [37] S. Rajbhandari, J. Rasley, O. Ruwase, Y. He, Zero: Memory optimiza-
A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask tions toward training trillion parameter models, in: SC20: International
prompted training enables zero-shot task generalization, arXiv preprint Conference for High Performance Computing, Networking, Storage and
arXiv:2110.08207 (2021). 2, 11, 25, 28, 31 Analysis, IEEE, 2020, pp. 1–16. 2, 4, 23
[18] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, [38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards
A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al., a unified view of parameter-efficient transfer learning, arXiv preprint
Super-naturalinstructions: Generalization via declarative instructions on arXiv:2110.04366 (2021). 2, 20, 21
1600+ nlp tasks, in: Proceedings of the 2022 Conference on Empirical [39] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Po-
Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7, ria, Llm-adapters: An adapter family for parameter-efficient fine-tuning
11, 17, 23, 25, 28, 31 of large language models, arXiv preprint arXiv:2304.01933 (2023). 2,
[19] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Ha- 20
jishirzi, Self-instruct: Aligning language model with self generated in- [40] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-
structions, arXiv preprint arXiv:2212.10560 (2022). 2, 11, 18, 22, 28 efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8,
[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, 20
C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language mod- [41] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for
els to follow instructions with human feedback, Advances in Neural In- generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20
formation Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16, [42] X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of
22 large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 21
[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, [43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open From dense to sparse: Contrastive pruning for better pre-trained lan-

35
guage model compression, in: Proceedings of the AAAI Conference on linear biases enables input length extrapolation, in: International Con-
Artificial Intelligence, Vol. 36, 2022, pp. 11547–11555. 2, 21 ference on Learning Representations, 2022.
[44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, Smoothquant: URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17
Accurate and efficient post-training quantization for large language [66] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, Roformer: En-
models, in: ICML, Vol. 202 of Proceedings of Machine Learning Re- hanced transformer with rotary position embedding, arXiv preprint
search, PMLR, 2023, pp. 38087–38099. 2, 20 arXiv:2104.09864 (2021). 4, 9, 17
[45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong, [67] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences
Compression of generative pre-trained language models via quantiza- with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7,
tion, arXiv preprint arXiv:2203.10705 (2022). 2, 20 23
[46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu, [68] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and
Giraffe: Adventures in expanding context lengths in llms, arXiv preprint memory-efficient exact attention with io-awareness, Advances in Neural
arXiv:2308.10882 (2023). 2, 17 Information Processing Systems 35 (2022) 16344–16359. 4
[47] B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: Efficient con- [69] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks
text window extension of large language models, arXiv preprint are universal approximators, Neural networks 2 (5) (1989) 359–366. 4
arXiv:2309.00071 (2023). 2, 17 [70] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltz-
[48] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, mann machines, in: Proceedings of the 27th international conference on
Longt5: Efficient text-to-text transformer for long sequences, arXiv machine learning (ICML-10), 2010, pp. 807–814. 4
preprint arXiv:2112.07916 (2021). 2, 17 [71] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv
[49] S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window preprint arXiv:1606.08415 (2016). 4
of large language models via positional interpolation, arXiv preprint [72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
arXiv:2306.15595 (2023). 2, 17 Dropout: a simple way to prevent neural networks from overfitting, The
[50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, journal of machine learning research 15 (1) (2014) 1929–1958. 4
J. Zhang, Z. Dong, et al., A survey of large language models, arXiv [73] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R.
preprint arXiv:2303.18223 (2023). 2, 3, 7 Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal, Zoneout: Regular-
[51] U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive sur- izing rnns by randomly preserving hidden activations, arXiv preprint
vey on word representation models: From classical to state-of-the-art arXiv:1606.01305 (2016). 4
word representation language models, Transactions on Asian and Low- [74] N. Shazeer, Glu variants improve transformer, arXiv preprint
Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3 arXiv:2002.05202 (2020). 4
[52] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, [75] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with
E. Agirre, I. Heinz, D. Roth, Recent advances in natural language pro- gated convolutional networks, in: International conference on machine
cessing via large pre-trained language models: A survey, arXiv preprint learning, PMLR, 2017, pp. 933–941. 4
arXiv:2111.01243 (2021). 2, 3 [76] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint
[53] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, arXiv:1607.06450 (2016). 4
L. He, et al., A comprehensive survey on pretrained foundation models: [77] B. Zhang, R. Sennrich, Root mean square layer normalization, Advances
A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023). in Neural Information Processing Systems 32 (2019). 4
2, 3 [78] A. Baevski, M. Auli, Adaptive input representations for neural language
[54] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, modeling, arXiv preprint arXiv:1809.10853 (2018). 4
J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint [79] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling
arXiv:2301.00234 (2022). 2, 7, 17 transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4
[55] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: [80] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro,
A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 17 Megatron-lm: Training multi-billion parameter language models using
[56] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5
Q. Liu, Aligning large language models with human: A survey, arXiv [81] "bmtrain: Efficient training for big models.".
preprint arXiv:2307.12966 (2023). 2 URL https://github.com/OpenBMB/BMTrain 4, 5
[57] X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression [82] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cis-
for large language models, arXiv preprint arXiv:2308.07633 (2023). 2 tac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-
[58] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multi- art natural language processing, in: Proceedings of the 2020 conference
modal large language models, arXiv preprint arXiv:2306.13549 (2023). on empirical methods in natural language processing: system demon-
2, 22 strations, 2020, pp. 38–45. 5
[59] J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COL- [83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclau-
ING 1992 volume 4: The 14th international conference on computa- rin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al.,
tional linguistics, 1992. 4 Jax: composable transformations of python+ numpy programs (2018).
[60] T. Kudo, Subword regularization: Improving neural network translation 5
models with multiple subword candidates, in: Proceedings of the 56th [84] S. Li, J. Fang, Z. Bian, H. Liu, Y. Liu, H. Huang, B. Wang, Y. You,
Annual Meeting of the Association for Computational Linguistics (Vol- Colossal-ai: A unified deep learning system for large-scale parallel train-
ume 1: Long Papers), 2018, pp. 66–75. 4 ing, arXiv preprint arXiv:2110.14883 (2021). 5
[61] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare [85] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, J. Tang, Fastmoe: A
words with subword units, in: Proceedings of the 54th Annual Meet- fast mixture-of-expert training system, arXiv preprint arXiv:2103.13262
ing of the Association for Computational Linguistics (Volume 1: Long (2021). 5
Papers), 2016, pp. 1715–1725. 4 [86] L. Huawei Technologies Co., Huawei mindspore ai development frame-
[62] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 work, in: Artificial Intelligence Technology, Springer, 2022, pp. 137–
IEEE international conference on acoustics, speech and signal process- 162. 5
ing (ICASSP), IEEE, 2012, pp. 5149–5152. 4 [87] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
[63] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imper-
A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and char- ative style, high-performance deep learning library, Advances in neural
acters: A brief history of open-vocabulary modeling and tokenization in information processing systems 32 (2019). 5
nlp, arXiv preprint arXiv:2112.10508 (2021). 4 [88] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large-
Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural scale machine learning., in: Osdi, Vol. 16, Savannah, GA, USA, 2016,
information processing systems 30 (2017). 4, 7 pp. 265–283. 5
[65] O. Press, N. Smith, M. Lewis, Train short, test long: Attention with [89] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao,

36
B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine [109] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang,
learning library for heterogeneous distributed systems, arXiv preprint J. Tang, Wudaocorpora: A super large-scale chinese corpora for pre-
arXiv:1512.01274 (2015). 5 training language models, AI Open 2 (2021) 65–68. 8, 30
[90] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to tril- [110] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen,
lion parameter models with simple and efficient sparsity, The Journal of Y. Zhao, Y. Lu, et al., Ernie 3.0: Large-scale knowledge enhanced
Machine Learning Research 23 (1) (2022) 5232–5270. 5, 9 pre-training for language understanding and generation, arXiv preprint
[91] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, arXiv:2107.02137 (2021). 8, 25
Y. Zhou, A. W. Yu, O. Firat, et al., Glam: Efficient scaling of language [111] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov,
models with mixture-of-experts, in: International Conference on Ma- Transformer-xl: Attentive language models beyond a fixed-length con-
chine Learning, PMLR, 2022, pp. 5547–5569. 5, 9, 23, 25 text, arXiv preprint arXiv:1901.02860 (2019). 8
[92] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, [112] O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical details
P
X. Zhang, A. Podolskiy, G. Arshinov, et al., Pangu- : Towards trillion and evaluation, White Paper. AI21 Labs 1 (2021). 8, 23, 25
parameter language model with sparse heterogeneous computing, arXiv [113] Y. Levine, N. Wies, O. Sharir, H. Bata, A. Shashua, Limits to depth ef-
preprint arXiv:2303.10845 (2023). 5, 10, 11, 23, 25 ficiencies of self-attention, Advances in Neural Information Processing
[93] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy, Systems 33 (2020) 22640–22651. 8, 11
J. Launay, C. Raffel, What language model architecture and pretrain- [114] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park,
ing objective works best for zero-shot generalization?, in: International S. Kim, S. Kim, D. Seo, et al., What changes can large-scale language
Conference on Machine Learning, PMLR, 2022, pp. 22964–22984. 5 models bring? intensive study on hyperclova: Billions-scale korean
[94] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, generative pretrained transformers, arXiv preprint arXiv:2109.04650
H.-W. Hon, Unified language model pre-training for natural language (2021). 8, 25
understanding and generation, Advances in neural information process- [115] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo,
ing systems 32 (2019). 6 L. Xu, et al., Yuan 1.0: Large-scale pre-trained language model in zero-
[95] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, shot and few-shot learning, arXiv preprint arXiv:2110.04725 (2021). 8,
S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language 23, 25
models, arXiv preprint arXiv:2001.08361 (2020). 6 [116] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song,
[96] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling lan-
E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, guage models: Methods, analysis & insights from training gopher, arXiv
et al., Training compute-optimal large language models, arXiv preprint preprint arXiv:2112.11446 (2021). 8, 9, 25, 28
arXiv:2203.15556 (2022). 6, 9, 25, 29 [117] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari,
[97] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al.,
T. Wang, Q. Liu, P. S. Koura, et al., Opt-iml: Scaling language model in- Using deepspeed and megatron to train megatron-turing nlg 530b, a
struction meta learning through the lens of generalization, arXiv preprint large-scale generative language model, arXiv preprint arXiv:2201.11990
arXiv:2212.12017 (2022). 7, 11, 17, 22, 25, 28 (2022). 8, 9, 23, 25
[98] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, C. Gan, [118] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding,
Principle-driven self-alignment of language models from scratch with H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-
minimal human supervision, arXiv preprint arXiv:2305.03047 (2023). source autoregressive language model, arXiv preprint arXiv:2204.06745
7, 16 (2022). 9, 22, 23, 24, 25
[99] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, [119] W. Ben, K. Aran, Gpt-j-6b: A 6 billion parameter autoregressive lan-
N. Joseph, B. Mann, N. DasSarma, et al., A general language assistant guage model (2021). 9
as a laboratory for alignment, arXiv preprint arXiv:2112.00861 (2021). [120] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia,
7 B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed pre-
[100] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, cision training, arXiv preprint arXiv:1710.03740 (2017). 9, 23
P. Christiano, G. Irving, Fine-tuning language models from human pref- [121] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hin-
erences, arXiv preprint arXiv:1909.08593 (2019). 7 ton, J. Dean, Outrageously large neural networks: The sparsely-gated
[101] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, M. Seo, The cot collec- mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017). 9, 23
tion: Improving zero-shot and few-shot learning of language models via [122] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza,
chain-of-thought fine-tuning, arXiv preprint arXiv:2305.14045 (2023). H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, et al., Alex-
7, 11 atm 20b: Few-shot learning using a large-scale multilingual seq2seq
[102] Q. Liu, F. Zhou, Z. Jiang, L. Dou, M. Lin, From zero to hero: Exam- model, arXiv preprint arXiv:2208.01448 (2022). 9, 22, 23, 24, 25
ining the power of symbolic tasks in instruction tuning, arXiv preprint [123] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos,
arXiv:2304.07995 (2023). 7, 11 S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., Palm 2 technical report,
[103] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, arXiv preprint arXiv:2305.10403 (2023). 9, 25
D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large [124] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia,
language models, Advances in Neural Information Processing Systems H. S. Zheng, J. Rao, A. Chowdhery, et al., Transcending scaling laws
35 (2022) 24824–24837. 7, 19, 22 with 0.1% extra compute, arXiv preprint arXiv:2210.11399 (2022). 9,
[104] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- 23, 25
hery, D. Zhou, Self-consistency improves chain of thought reasoning in [125] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W.
language models, arXiv preprint arXiv:2203.11171 (2022). 7, 19 Chung, D. Bahri, T. Schuster, S. Zheng, et al., Ul2: Unifying lan-
[105] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, guage learning paradigms, in: The Eleventh International Conference
Tree of thoughts: Deliberate problem solving with large language mod- on Learning Representations, 2022. 9, 10, 23, 24, 25
els, arXiv preprint arXiv:2305.10601 (2023). 7, 19 [126] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, J. Tang, Glm: Gen-
[106] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, eral language model pretraining with autoregressive blank infilling, in:
A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learn- Proceedings of the 60th Annual Meeting of the Association for Compu-
ing for nlp, in: International Conference on Machine Learning, PMLR, tational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335. 10
2019, pp. 2790–2799. 7, 20 [127] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
[107] S. McCandlish, J. Kaplan, D. Amodei, O. D. Team, An empirical model T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al.,
of large-batch training, arXiv preprint arXiv:1812.06162 (2018). 7 Llama: Open and efficient foundation language models, arXiv preprint
[108] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, arXiv:2302.13971 (2023). 10, 22, 25
K. Wang, X. Zhang, et al., Pangu-α : Large-scale autoregressive pre- [128] M. N. Rabe, C. Staats, Self-attention does not need o(n2 ) memory, arXiv
trained chinese language models with auto-parallel computation, arXiv preprint arXiv:2112.05682 (2021). 10
preprint arXiv:2104.12369 (2021). 8, 22, 23, 25 [129] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch,

37
M. Shoeybi, B. Catanzaro, Reducing activation recomputation in large arXiv:2304.06975 (2023). 16
transformer models, Proceedings of Machine Learning and Systems 5 [153] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, D. Jiang,
(2023). 10 Wizardlm: Empowering large language models to follow complex in-
[130] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, structions, arXiv preprint arXiv:2304.12244 (2023). 16
C. Xiong, Codegen: An open large language model for code with multi- [154] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin,
turn program synthesis, arXiv preprint arXiv:2203.13474 (2022). 10, D. Jiang, Wizardcoder: Empowering code large language models with
22, 25, 28 evol-instruct, arXiv preprint arXiv:2306.08568 (2023). 16, 25
[131] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Ed- [155] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick,
wards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large lan- M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al., Teach-
guage models trained on code, arXiv preprint arXiv:2107.03374 (2021). ing language models to support answers with verified quotes, arXiv
10, 25, 29 preprint arXiv:2203.11147 (2022). 16
[132] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, [156] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim,
T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al., Webgpt: Browser-
code generation with alphacode, Science 378 (6624) (2022) 1092–1097. assisted question-answering with human feedback, arXiv preprint
10, 23, 25, 29 arXiv:2112.09332 (2021). 16, 18, 19, 25, 31
[133] N. Shazeer, Fast transformer decoding: One write-head is all you need, [157] A. Glaese, N. McAleese, M. Trebacz,
˛ J. Aslanides, V. Firoiu, T. Ewalds,
arXiv preprint arXiv:1911.02150 (2019). 10 M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al., Improving
[134] R. Y. Pang, H. He, Text generation by learning from demonstrations, alignment of dialogue agents via targeted human judgements, arXiv
arXiv preprint arXiv:2009.07839 (2020). 10 preprint arXiv:2209.14375 (2022). 16, 19, 25
[135] R. Dabre, A. Fujita, Softmax tempering for training neural machine [158] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn,
translation models, arXiv preprint arXiv:2009.09372 (2020). 10 Direct preference optimization: Your language model is secretly a re-
[136] Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified ward model, arXiv preprint arXiv:2305.18290 (2023). 16
pre-trained encoder-decoder models for code understanding and genera- [159] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum,
tion, arXiv preprint arXiv:2109.00859 (2021). 10 T. Zhang, Raft: Reward ranked finetuning for generative foundation
[137] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, model alignment, arXiv preprint arXiv:2304.06767 (2023). 16
M. Marone, C. Akiki, J. Li, J. Chim, et al., Starcoder: may the source be [160] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, F. Huang, Rrhf: Rank
with you!, arXiv preprint arXiv:2305.06161 (2023). 10, 25 responses to align language models with human feedback without tears,
[138] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, arXiv preprint arXiv:2304.05302 (2023). 16
A. Poulton, V. Kerkez, R. Stojnic, Galactica: A large language model for [161] F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, H. Wang, Preference rank-
science, arXiv preprint arXiv:2211.09085 (2022). 10, 23, 25, 29 ing optimization for human alignment, arXiv preprint arXiv:2306.17492
[139] FairScale authors, Fairscale: A general purpose modular pytorch library (2023). 16
for high performance and large scale training, https://github.com/ [162] H. Liu, C. Sferrazza, P. Abbeel, Languages are rewards: Hindsight fine-
facebookresearch/fairscale (2021). 10 tuning using human feedback, arXiv preprint arXiv:2302.02676 (2023).
[140] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. 16
Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., Lamda: Language models [163] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen,
for dialog applications, arXiv preprint arXiv:2201.08239 (2022). 11, 25 A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional ai: Harm-
[141] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, lessness from ai feedback, arXiv preprint arXiv:2212.08073 (2022). 16
P. Kambadur, D. Rosenberg, G. Mann, Bloomberggpt: A large language [164] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin,
model for finance, arXiv preprint arXiv:2303.17564 (2023). 11, 25, 32 P. Liang, T. B. Hashimoto, Alpacafarm: A simulation frame-
[142] X. Zhang, Q. Yang, D. Xu, Xuanyuan 2.0: A large chinese finan- work for methods that learn from human feedback, arXiv preprint
cial chat model with hundreds of billions parameters, arXiv preprint arXiv:2305.14387 (2023). 16
arXiv:2305.12002 (2023). 11, 16, 25 [165] C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, L. Wang,
[143] W. Ben, Mesh-transformer-jax: Model-parallel implementation of trans- Prompting gpt-3 to be reliable, arXiv preprint arXiv:2210.09150 (2022).
former language model with jax (2021). 12, 23 16
[144] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, [166] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukošiūtė, A. Chen,
T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al., A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez, et al., The capac-
Crosslingual generalization through multitask finetuning, arXiv preprint ity for moral self-correction in large language models, arXiv preprint
arXiv:2211.01786 (2022). 11, 25, 28, 31 arXiv:2302.07459 (2023). 16
[145] D. Yin, X. Liu, F. Yin, M. Zhong, H. Bansal, J. Han, K.-W. Chang, [167] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: How does llm safety
Dynosaur: A dynamic growth paradigm for instruction-tuning data cu- training fail?, arXiv preprint arXiv:2307.02483 (2023). 16
ration, arXiv preprint arXiv:2305.14327 (2023). 16 [168] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath,
[146] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., Red teaming lan-
C. He, X. Yue, et al., Llama-adapter v2: Parameter-efficient visual in- guage models to reduce harms: Methods, scaling behaviors, and lessons
struction model, arXiv preprint arXiv:2304.15010 (2023). 16, 24 learned, arXiv preprint arXiv:2209.07858 (2022). 16, 28
[147] Openai. gpt-4 technical report (2023). 16, 34 [169] S. Casper, J. Lin, J. Kwon, G. Culp, D. Hadfield-Menell, Explore, estab-
[148] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, lish, exploit: Red teaming language models from scratch, arXiv preprint
T. B. Hashimoto, Stanford alpaca: An instruction-following llama arXiv:2306.09442 (2023). 16
model, https://github.com/tatsu-lab/stanford_alpaca [170] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese,
(2023). 16, 25, 28 N. McAleese, G. Irving, Red teaming language models with language
[149] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, models, arXiv preprint arXiv:2202.03286 (2022). 16
S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An [171] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are
open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March continual learners, in: Proceedings of the 2022 Conference on Empirical
2023). Methods in Natural Language Processing, 2022, pp. 6107–6122. 16
URL https://lmsys.org/blog/2023-03-30-vicuna/ 16, 22, 25, [172] Z. Shi, A. Lipani, Don’t stop pretraining? make prompt-based fine-
28 tuning powerful learner, arXiv preprint arXiv:2305.01711 (2023). 17
[150] B. Peng, C. Li, P. He, M. Galley, J. Gao, Instruction tuning with gpt-4, [173] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty,
arXiv preprint arXiv:2304.03277 (2023). 16, 28 C. Baral, Instruction tuned models are quick learners, arXiv preprint
[151] T. Liu, B. K. H. Low, Goat: Fine-tuned llama outperforms gpt-4 on arXiv:2306.05539 (2023). 17
arithmetic tasks, arXiv preprint arXiv:2305.14201 (2023). 16 [174] H. Chen, Y. Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y. Yanggong,
[152] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo: J. Zhao, Maybe only 0.5% data is needed: A preliminary exploration
Tuning llama model with chinese medical knowledge, arXiv preprint of low training data instruction tuning, arXiv preprint arXiv:2305.09246

38
(2023). 17 [197] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, W. Chen, What makes
[175] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, good in-context examples for gpt-3?, arXiv preprint arXiv:2101.06804
P. Yu, L. Yu, et al., Lima: Less is more for alignment, arXiv preprint (2021). 18
arXiv:2305.11206 (2023). 17, 25, 28 [198] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in-
[176] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, S. Wang, Lm-infinite: Sim- context learning, arXiv preprint arXiv:2112.08633 (2021). 18
ple on-the-fly length generalization for large language models, arXiv [199] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettle-
preprint arXiv:2308.16137 (2023). 17 moyer, W.-t. Yih, Replug: Retrieval-augmented black-box language
[177] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y. Zemlyan- models, arXiv preprint arXiv:2301.12652 (2023). 18
skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y. Tay, et al., Colt5: Faster [200] O. Rubin, J. Berant, Long-range language modeling with self-retrieval,
long-range transformers with conditional computation, arXiv preprint arXiv preprint arXiv:2306.13421 (2023). 18
arXiv:2303.09752 (2023). 17 [201] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented
[178] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, F. Wei, language model pre-training, in: International conference on machine
Longnet: Scaling transformers to 1,000,000,000 tokens, arXiv preprint learning, PMLR, 2020, pp. 3929–3938. 18
arXiv:2307.02486 (2023). 17 [202] S. Hofstätter, J. Chen, K. Raman, H. Zamani, Fid-light: Efficient and ef-
[179] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, Longlora: Effi- fective retrieval-augmented text generation, in: Proceedings of the 46th
cient fine-tuning of long-context large language models, arXiv preprint International ACM SIGIR Conference on Research and Development in
arXiv:2309.12307 (2023). 17 Information Retrieval, 2023, pp. 1437–1447. 18
[180] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend, [203] M. Komeili, K. Shuster, J. Weston, Internet-augmented dialogue gener-
E. Karpas, A. Shashua, K. Leyton-Brown, Y. Shoham, Parallel context ation, arXiv preprint arXiv:2107.07566 (2021). 18
windows for large language models, in: Proceedings of the 61st Annual [204] A. Lazaridou, E. Gribovskaya, W. Stokowiec, N. Grigorev, Internet-
Meeting of the Association for Computational Linguistics (Volume 1: augmented language models through few-shot prompting for open-
Long Papers), 2023, pp. 6383–6402. 17 domain question answering, arXiv preprint arXiv:2203.05115 (2022).
[181] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, F. Wei, 18
Augmenting language models with long-term memory, arXiv preprint [205] D. Gao, L. Ji, L. Zhou, K. Q. Lin, J. Chen, Z. Fan, M. Z. Shou, Assist-
arXiv:2306.07174 (2023). 17 gpt: A general multi-modal assistant that can plan, execute, inspect, and
[182] X. Xu, Z. Gou, W. Wu, Z.-Y. Niu, H. Wu, H. Wang, S. Wang, Long learn, arXiv preprint arXiv:2306.08640 (2023). 18, 19
time no see! open-domain conversation with long-term persona memory, [206] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu,
arXiv preprint arXiv:2203.05797 (2022). 17 J. Gao, Chameleon: Plug-and-play compositional reasoning with large
[183] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli- language models, arXiv preprint arXiv:2304.09842 (2023). 18, 19, 22
can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., [207] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, M. T.
Improving language models by retrieving from trillions of tokens, in: Ribeiro, Art: Automatic multi-step reasoning and tool-use for large lan-
International conference on machine learning, PMLR, 2022, pp. 2206– guage models, arXiv preprint arXiv:2303.09014 (2023). 18
2240. 17, 18, 33 [208] C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, R. Kr-
[184] W. Zhong, L. Guo, Q. Gao, Y. Wang, Memorybank: Enhanc- ishna, T. Pfister, Tool documentation enables zero-shot tool-usage with
ing large language models with long-term memory, arXiv preprint large language models, arXiv preprint arXiv:2308.00675 (2023). 18
arXiv:2305.10250 (2023). 17 [209] Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, S. Li, Restgpt:
[185] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, S. Yao, Connecting large language models with real-world applications via rest-
Reflexion: Language agents with verbal reinforcement learning, arXiv ful apis, arXiv preprint arXiv:2306.06624 (2023). 18
preprint arXiv:2303.11366 14 (2023). 17, 19 [210] S. Hao, T. Liu, Z. Wang, Z. Hu, Toolkengpt: Augmenting frozen lan-
[186] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, H. Zhao, Chatdb: Augment- guage models with massive tools via tool embeddings, arXiv preprint
ing llms with databases as their symbolic memory, arXiv preprint arXiv:2305.11554 (2023). 18
arXiv:2306.03901 (2023). 17 [211] S. G. Patil, T. Zhang, X. Wang, J. E. Gonzalez, Gorilla: Large language
[187] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, model connected with massive apis, arXiv preprint arXiv:2305.15334
J. Callan, G. Neubig, Active retrieval augmented generation, arXiv (2023). 18
preprint arXiv:2305.06983 (2023). 17, 18 [212] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, J. Zhang, On the tool manipu-
[188] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton- lation capability of open-source large language models, arXiv preprint
Brown, Y. Shoham, In-context retrieval-augmented language models, arXiv:2305.16504 (2023). 18
arXiv preprint arXiv:2302.00083 (2023). 17, 18, 33 [213] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang,
[189] X. Li, X. Qiu, Mot: Pre-thinking and recalling enable chatgpt to self- B. Qian, et al., Toolllm: Facilitating large language models to master
improve with memory-of-thoughts, arXiv preprint arXiv:2305.05181 16000+ real-world apis, arXiv preprint arXiv:2307.16789 (2023). 18,
(2023). 17 19
[190] D. Schuurmans, Memory augmented large language models are compu- [214] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, Y. Zhuang, Hugginggpt: Solv-
tationally universal, arXiv preprint arXiv:2301.04589 (2023). 17 ing ai tasks with chatgpt and its friends in huggingface, arXiv preprint
[191] A. Modarressi, A. Imani, M. Fayyaz, H. Schütze, Ret-llm: Towards a arXiv:2303.17580 (2023). 19, 33
general read-write memory for large language models, arXiv preprint [215] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, S. Lu, L. Ji,
arXiv:2305.14322 (2023). 17 S. Mao, et al., Taskmatrix. ai: Completing tasks by connecting foun-
[192] S. Robertson, H. Zaragoza, et al., The probabilistic relevance frame- dation models with millions of apis, arXiv preprint arXiv:2303.16434
work: Bm25 and beyond, Foundations and Trends® in Information Re- (2023). 19
trieval 3 (4) (2009) 333–389. 18 [216] D. Surís, S. Menon, C. Vondrick, Vipergpt: Visual inference via python
[193] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, D. Zhou, execution for reasoning, arXiv preprint arXiv:2303.08128 (2023). 19
Rationale-augmented ensembles in language models, arXiv preprint [217] A. Maedche, S. Morana, S. Schacht, D. Werth, J. Krumeich, Advanced
arXiv:2207.00747 (2022). 18 user assistance systems, Business & Information Systems Engineering
[194] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou, W. Chen, 58 (2016) 367–370. 19
Repocoder: Repository-level code completion through iterative retrieval [218] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence
and generation, arXiv preprint arXiv:2303.12570 (2023). 18 134 (1-2) (2002) 57–83. 19
[195] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y. Dong, [219] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang,
O. Kuchaiev, B. Li, C. Xiao, et al., Shall we pretrain autoregressive S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for
language models with retrieval? a comprehensive study, arXiv preprint multi-agent collaborative framework, arXiv preprint arXiv:2308.00352
arXiv:2304.06762 (2023). 18 (2023). 19
[196] L. Wang, N. Yang, F. Wei, Learning to retrieve in-context examples for [220] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang,
large language models, arXiv preprint arXiv:2307.07164 (2023). 18 S. Jin, E. Zhou, et al., The rise and potential of large language model

39
based agents: A survey, arXiv preprint arXiv:2309.07864 (2023). 19 ceedings of the 60th Annual Meeting of the Association for Computa-
[221] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, tional Linguistics (Volume 2: Short Papers), 2022, pp. 61–68. 20
X. Chen, Y. Lin, et al., A survey on large language model based au- [242] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, A. Almahairi,
tonomous agents, arXiv preprint arXiv:2308.11432 (2023). 19 Progressive prompts: Continual learning for language models, arXiv
[222] W. Huang, P. Abbeel, D. Pathak, I. Mordatch, Language models as zero- preprint arXiv:2301.12314 (2023). 20
shot planners: Extracting actionable knowledge for embodied agents, [243] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, S. Huang, To-
in: International Conference on Machine Learning, PMLR, 2022, pp. wards adaptive prefix tuning for parameter-efficient language model
9118–9147. 19 fine-tuning, arXiv preprint arXiv:2305.15212 (2023). 20
[223] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, Z. Hu, Reason- [244] E. B. Zaken, S. Ravfogel, Y. Goldberg, Bitfit: Simple parameter-
ing with language model is planning with world model, arXiv preprint efficient fine-tuning for transformer-based masked language-models,
arXiv:2305.14992 (2023). 19, 33 arXiv preprint arXiv:2106.10199 (2021). 20
[224] W. Yao, S. Heinecke, J. C. Niebles, Z. Liu, Y. Feng, L. Xue, R. Murthy, [245] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Llm. int8 ():
Z. Chen, J. Zhang, D. Arpit, et al., Retroformer: Retrospective 8-bit matrix multiplication for transformers at scale, arXiv preprint
large language agents with policy gradient optimization, arXiv preprint arXiv:2208.07339 (2022). 20, 21
arXiv:2308.02151 (2023). 19, 33 [246] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, Gptq: Accurate
[225] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, post-training quantization for generative pre-trained transformers, arXiv
J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, preprint arXiv:2210.17323 (2022). 20
N. Brown, L. Luu, S. Levine, K. Hausman, brian ichter, Inner mono- [247] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, X. Liu, Outlier sup-
logue: Embodied reasoning through planning with language models, in: pression+: Accurate quantization of large language models by equiva-
6th Annual Conference on Robot Learning, 2022. lent and optimal shifting and scaling, arXiv preprint arXiv:2304.09145
URL https://openreview.net/forum?id=3R3Pz5i0tye 19 (2023). 20
[226] C. Jin, W. Tan, J. Yang, B. Liu, R. Song, L. Wang, J. Fu, Alphablock: [248] E. Frantar, D. Alistarh, Optimal brain compression: A framework for
Embodied finetuning for vision-language reasoning in robot manipula- accurate post-training quantization and pruning, Advances in Neural In-
tion, arXiv preprint arXiv:2305.18898 (2023). 19, 20, 33 formation Processing Systems 35 (2022) 4475–4488. 20
[227] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, [249] C. Lee, J. Jin, T. Kim, H. Kim, E. Park, Owq: Lessons learned from ac-
J. Thomason, A. Garg, Progprompt: Generating situated robot task plans tivation outliers for weight quantization in large language models, arXiv
using large language models, in: 2023 IEEE International Conference on preprint arXiv:2306.02272 (2023). 21
Robotics and Automation (ICRA), IEEE, 2023, pp. 11523–11530. 19, [250] S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.-
33 W. Ha, N. Sung, D. Lee, Alphatuning: Quantization-aware parameter-
[228] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. efficient adaptation of large-scale pre-trained language models, arXiv
Chiang, T. Erez, L. Hasenclever, J. Humplik, et al., Language to rewards preprint arXiv:2210.03858 (2022). 21
for robotic skill synthesis, arXiv preprint arXiv:2306.08647 (2023). 19 [251] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient
[229] X. Tang, A. Zou, Z. Zhang, Y. Zhao, X. Zhang, A. Cohan, M. Gerstein, finetuning of quantized llms, arXiv preprint arXiv:2305.14314 (2023).
Medagents: Large language models as collaborators for zero-shot med- 21
ical reasoning, arXiv preprint arXiv:2311.10537 (2023). 19 [252] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Kr-
[230] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, ishnamoorthi, V. Chandra, Llm-qat: Data-free quantization aware train-
J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., Do as i can, not as i say: ing for large language models, arXiv preprint arXiv:2305.17888 (2023).
Grounding language in robotic affordances, in: Conference on Robot 21
Learning, PMLR, 2023, pp. 287–318. 19, 33 [253] Y. Guo, A. Yao, H. Zhao, Y. Chen, Network sketching: Exploiting bi-
[231] H. Ha, P. Florence, S. Song, Scaling up and distilling down: Language- nary structure in deep cnns, in: Proceedings of the IEEE Conference on
guided robot skill acquisition, arXiv preprint arXiv:2307.14535 (2023). Computer Vision and Pattern Recognition, 2017, pp. 5955–5963. 21
20 [254] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, D. Lee,
[232] A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, A. Velasquez, Say- Memory-efficient fine-tuning of compressed large language models via
nav: Grounding large language models for dynamic planning to navi- sub-4-bit integer quantization, arXiv preprint arXiv:2305.14152 (2023).
gation in new environments, arXiv preprint arXiv:2309.04077 (2023). 21
20 [255] M. Sun, Z. Liu, A. Bair, J. Z. Kolter, A simple and effective pruning
[233] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y. Su, approach for large language models, arXiv preprint arXiv:2306.11695
Llm-planner: Few-shot grounded planning for embodied agents with (2023). 21
large language models, arXiv preprint arXiv:2212.04088 (2022). 20 [256] Z. Wang, J. Wohlwend, T. Lei, Structured pruning of large language
[234] V. S. Dorbala, J. F. Mullen Jr, D. Manocha, Can an embodied agent find models, arXiv preprint arXiv:1910.04732 (2019). 21
your" cat-shaped mug"? llm-based zero-shot object navigation, arXiv [257] L. Yin, Y. Wu, Z. Zhang, C.-Y. Hsieh, Y. Wang, Y. Jia, M. Pechenizkiy,
preprint arXiv:2303.03480 (2023). 20 Y. Liang, Z. Wang, S. Liu, Outlier weighed layerwise sparsity (owl): A
[235] C. Huang, O. Mees, A. Zeng, W. Burgard, Visual language maps for missing secret sauce for pruning llms to high sparsity, arXiv preprint
robot navigation, in: 2023 IEEE International Conference on Robotics arXiv:2310.05175 (2023). 21
and Automation (ICRA), IEEE, 2023, pp. 10608–10615. 20 [258] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, N. Wong,
[236] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Task and motion planning Structured pruning for efficient generative pre-trained language models,
with large language models for object rearrangement, arXiv preprint in: Findings of the Association for Computational Linguistics: ACL
arXiv:2303.06247 (2023). 20, 33 2023, 2023, pp. 10880–10895. 21
[237] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, Gpt under- [259] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc,
stands, too, arXiv preprint arXiv:2103.10385 (2021). 20 A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: a visual lan-
[238] G. Chen, F. Liu, Z. Meng, S. Liang, Revisiting parameter-efficient tun- guage model for few-shot learning, Advances in Neural Information Pro-
ing: Are we really there yet?, arXiv preprint arXiv:2202.07962 (2022). cessing Systems 35 (2022) 23716–23736. 21, 22
20 [260] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image
[239] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, J. Gao, pre-training with frozen image encoders and large language models,
Adamix: Mixture-of-adapter for parameter-efficient tuning of large lan- arXiv preprint arXiv:2301.12597 (2023). 21, 22
guage models, arXiv preprint arXiv:2205.12410 1 (2) (2022) 4. 20 [261] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint
[240] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, arXiv:2304.08485 (2023). 21, 22
W. Chen, Lora: Low-rank adaptation of large language models, arXiv [262] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang,
preprint arXiv:2106.09685 (2021). 20, 21, 22 Y. Qiao, Videochat: Chat-centric video understanding, arXiv preprint
[241] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, J. Tang, P-tuning: Prompt arXiv:2305.06355 (2023). 21, 22
tuning can be comparable to fine-tuning across scales and tasks, in: Pro- [263] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, Video-chatgpt: Towards de-

40
tailed video understanding via large vision and language models, arXiv [285] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention net-
preprint arXiv:2306.05424 (2023). 21 works for visual question answering, in: Proceedings of the IEEE/CVF
[264] H. Zhang, X. Li, L. Bing, Video-llama: An instruction-tuned conference on computer vision and pattern recognition, 2019, pp. 6281–
audio-visual language model for video understanding, arXiv preprint 6290. 22
arXiv:2306.02858 (2023). 21 [286] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi, K.-
[265] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, W. Chang, S.-F. Chang, Idealgpt: Iteratively decomposing vision
Y. Zou, W. Wang, Wavcaps: A chatgpt-assisted weakly-labelled au- and language reasoning via large language models, arXiv preprint
dio captioning dataset for audio-language multimodal research, arXiv arXiv:2305.14985 (2023). 22
preprint arXiv:2303.17395 (2023). 21 [287] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, H. Li,
[266] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, Z. Tu, Macaw- Prompt, generate, then cache: Cascade of foundation models makes
llm: Multi-modal language modeling with image, audio, video, and text strong few-shot learners, in: Proceedings of the IEEE/CVF Conference
integration, arXiv preprint arXiv:2306.09093 (2023). 21 on Computer Vision and Pattern Recognition, 2023, pp. 15211–15222.
[267] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing 22
vision-language understanding with advanced large language models, [288] T. Q. Nguyen, J. Salazar, Transformers without tears: Improving the
arXiv preprint arXiv:2304.10592 (2023). 22 normalization of self-attention, CoRR abs/1910.05895 (2019). 23
[268] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [289] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pre-
An image is worth 16x16 words: Transformers for image recognition at training approach, arXiv preprint arXiv:1907.11692 (2019). 24, 30
scale, arXiv preprint arXiv:2010.11929 (2020). 22 [290] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine,
[269] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, D. Song, Koala: A dialogue model for academic research, Blog post
S. Hoi, Instructblip: Towards general-purpose vision-language models (April 2023).
with instruction tuning, arXiv preprint arXiv:2305.06500 (2023). 22 URL https://bair.berkeley.edu/blog/2023/04/03/koala/
[270] Z. Xu, Y. Shen, L. Huang, Multiinstruct: Improving multi-modal zero- 25
shot learning via instruction tuning, arXiv preprint arXiv:2212.10773 [291] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster,
(2022). 22 J. Phang, H. He, A. Thite, N. Nabeshima, et al., The pile: An
[271] Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, J. Liu, 800gb dataset of diverse text for language modeling, arXiv preprint
Chatbridge: Bridging modalities with large language model as a lan- arXiv:2101.00027 (2020). 28, 30
guage catalyst, arXiv preprint arXiv:2305.16103 (2023). 22 [292] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral,
[272] L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu, T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen,
X. Sun, et al., M3 it: A large-scale dataset towards multi-modal multi- et al., The bigscience roots corpus: A 1.6 tb composite multilingual
lingual instruction tuning, arXiv preprint arXiv:2306.04387 (2023). 22 dataset, Advances in Neural Information Processing Systems 35 (2022)
[273] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, 31809–31826. 28
H. Xu, L. K. T. Zhang, Detgpt: Detect what you need via reasoning, [293] Wikipedia.
arXiv preprint arXiv:2305.14167 (2023). 22 URL https://en.wikipedia.org/wiki/Main_Page 28
[274] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, R. Ji, Cheap and quick: [294] Together Computer, Redpajama: An open source recipe to reproduce
Efficient vision-language instruction tuning for large language models, llama training dataset (Apr. 2023).
arXiv preprint arXiv:2305.15023 (2023). 22 URL https://github.com/togethercomputer/
[275] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao, RedPajama-Data 28
Llama-adapter: Efficient fine-tuning of language models with zero-init [295] O. Honovich, T. Scialom, O. Levy, T. Schick, Unnatural instructions:
attention, arXiv preprint arXiv:2303.16199 (2023). 22 Tuning language models with (almost) no human labor, arXiv preprint
[276] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, arXiv:2212.09689 (2022). 28
Robust speech recognition via large-scale weak supervision, in: Inter- [296] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma,
national Conference on Machine Learning, PMLR, 2023, pp. 28492– D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., Training a helpful and
28518. 22 harmless assistant with reinforcement learning from human feedback,
[277] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, A. Smola, Multi- arXiv preprint arXiv:2204.05862 (2022). 28
modal chain-of-thought reasoning in language models, arXiv preprint [297] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song,
arXiv:2302.00923 (2023). 22 J. Steinhardt, Measuring massive multitask language understanding,
[278] J. Ge, H. Luo, S. Qian, Y. Gan, J. Fu, S. Zhan, Chain of thought prompt arXiv preprint arXiv:2009.03300 (2020). 24, 29
tuning in vision language models, arXiv preprint arXiv:2304.07919 [298] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch,
(2023). 22 A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond
[279] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talk- the imitation game: Quantifying and extrapolating the capabilities of
ing, drawing and editing with visual foundation models, arXiv preprint language models, arXiv preprint arXiv:2206.04615 (2022). 24, 29
arXiv:2303.04671 (2023). 22 [299] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Glue:
[280] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, A multi-task benchmark and analysis platform for natural language un-
M. Zeng, L. Wang, Mm-react: Prompting chatgpt for multimodal rea- derstanding, arXiv preprint arXiv:1804.07461 (2018). 24, 29
soning and action, arXiv preprint arXiv:2303.11381 (2023). 22 [300] Y. Yao, Q. Dong, J. Guan, B. Cao, Z. Zhang, C. Xiao, X. Wang, F. Qi,
[281] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao, J. Bao, J. Nie, et al., Cuge: A chinese language understanding and gen-
S. Zhao, Y. Shan, et al., Caption anything: Interactive image descrip- eration evaluation benchmark, arXiv preprint arXiv:2112.13610 (2021).
tion with diverse multimodal controls, arXiv preprint arXiv:2305.02677 29
(2023). 22 [301] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu,
[282] X. Zhu, R. Zhang, B. He, Z. Zeng, S. Zhang, P. Gao, Pointclip v2: C. Yu, et al., Clue: A chinese language understanding evaluation bench-
Adapting clip for powerful 3d open-world learning, arXiv preprint mark, arXiv preprint arXiv:2004.05986 (2020). 29
arXiv:2211.11682 (2022). 22 [302] L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan,
[283] T. Gupta, A. Kembhavi, Visual programming: Compositional visual rea- X. Tian, L. Qin, et al., Fewclue: A chinese few-shot learning evaluation
soning without training, in: Proceedings of the IEEE/CVF Conference benchmark, arXiv preprint arXiv:2107.07498 (2021). 29
on Computer Vision and Pattern Recognition, 2023, pp. 14953–14962. [303] E. M. Smith, M. Williamson, K. Shuster, J. Weston, Y.-L. Boureau, Can
22 you put it all together: Evaluating conversational agents’ ability to blend
[284] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, H. Li, Dynamic skills, arXiv preprint arXiv:2004.08449 (2020). 29
fusion with intra-and inter-modality attention flow for visual question [304] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga,
answering, in: Proceedings of the IEEE/CVF conference on computer Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al., Holistic evaluation of
vision and pattern recognition, 2019, pp. 6639–6648. 22 language models, arXiv preprint arXiv:2211.09110 (2022). 29

41
[305] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, arXiv:1908.06605 (2019). 29
Y. Song, T. Oh, et al., Klue: Korean language understanding evaluation, [328] J. Novikova, O. Dušek, V. Rieser, The e2e dataset: New challenges for
arXiv preprint arXiv:2105.09680 (2021). 29 end-to-end generation, arXiv preprint arXiv:1706.09254 (2017). 29
[306] S. Reddy, D. Chen, C. D. Manning, Coqa: A conversational question [329] C. Zheng, M. Huang, A. Sun, Chid: A large-scale chinese idiom dataset
answering challenge, Transactions of the Association for Computational for cloze test, arXiv preprint arXiv:1906.01265 (2019). 29
Linguistics 7 (2019) 249–266. 25, 29 [330] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., Piqa: Reasoning about phys-
[307] M. T. Pilehvar, J. Camacho-Collados, Wic: 10,000 example ical commonsense in natural language, in: Proceedings of the AAAI
pairs for evaluating context-sensitive representations, arXiv preprint conference on artificial intelligence, Vol. 34, 2020, pp. 7432–7439. 26,
arXiv:1808.09121 6 (2018). 25, 29 29
[308] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture [331] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, Triviaqa: A large scale
models, arXiv preprint arXiv:1609.07843 (2016). 25, 29 distantly supervised challenge dataset for reading comprehension, arXiv
[309] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compres- preprint arXiv:1705.03551 (2017). 26, 29, 31
sive transformers for long-range sequence modelling, arXiv preprint [332] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,
arXiv:1911.05507 (2019). 25, 29 O. Tafjord, Think you have solved question answering? try arc, the ai2
[310] X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, B. Tang, Lcqmc: A reasoning challenge, arXiv preprint arXiv:1803.05457 (2018). 26, 29,
large-scale chinese question matching corpus, in: Proceedings of the 31
27th international conference on computational linguistics, 2018, pp. [333] S. Aroca-Ouellette, C. Paik, A. Roncone, K. Kann, Prost: Phys-
1952–1962. 26, 29 ical reasoning of objects through space and time, arXiv preprint
[311] S. Iyer, N. Dandekar, K. Csernai, First quora dataset re- arXiv:2106.03634 (2021). 29
lease: Question pairs, https://quoradata.quora.com/ [334] T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor con-
First-Quora-Dataset-Release-Question-Pairs. 29 duct electricity? a new dataset for open book question answering, arXiv
[312] R. Rudinger, J. Naradowsky, B. Leonard, B. Van Durme, Gender bias in preprint arXiv:1809.02789 (2018). 29
coreference resolution, arXiv preprint arXiv:1804.09301 (2018). 29 [335] T. C. Ferreira, C. Gardent, N. Ilinykh, C. Van Der Lee, S. Mille,
[313] M.-C. De Marneffe, M. Simons, J. Tonhauser, The commitmentbank: In- D. Moussallem, A. Shimorina, The 2020 bilingual, bi-directional
vestigating projection in naturally occurring discourse, in: proceedings webnlg+ shared task overview and evaluation results (webnlg+ 2020),
of Sinn und Bedeutung, Vol. 23, 2019, pp. 107–124. 29 in: Proceedings of the 3rd International Workshop on Natural Language
[314] Z. Li, N. Ding, Z. Liu, H. Zheng, Y. Shen, Chinese relation extraction Generation from the Semantic Web (WebNLG+), 2020. 29
with multi-grained information and external linguistic knowledge, in: [336] C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, F. Wei, Blow the dog whistle:
Proceedings of the 57th Annual Meeting of the Association for Compu- A chinese dataset for cant understanding with common sense and world
tational Linguistics, 2019, pp. 4377–4386. 29 knowledge, arXiv preprint arXiv:2104.02704 (2021). 29
[315] J. Xu, J. Wen, X. Sun, Q. Su, A discourse-level named entity recognition [337] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale
and relation extraction dataset for chinese literature text, arXiv preprint reading comprehension dataset from examinations, arXiv preprint
arXiv:1711.07010 (2017). 29 arXiv:1704.04683 (2017). 26, 29
[316] J. Chen, Q. Chen, X. Liu, H. Yang, D. Lu, B. Tang, The bq corpus: A [338] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang,
large-scale domain-specific chinese corpus for sentence semantic equiv- L. Zettlemoyer, Quac: Question answering in context, arXiv preprint
alence identification, in: Proceedings of the 2018 conference on empiri- arXiv:1808.07036 (2018). 27, 29
cal methods in natural language processing, 2018, pp. 4946–4951. 29 [339] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did aristo-
[317] B. Liu, D. Niu, H. Wei, J. Lin, Y. He, K. Lai, Y. Xu, Matching arti- tle use a laptop? a question answering benchmark with implicit reason-
cle pairs with graphical decomposition and convolutions, arXiv preprint ing strategies, Transactions of the Association for Computational Lin-
arXiv:1802.07459 (2018). 29 guistics 9 (2021) 346–361. 29
[318] P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, Dataset and neu- [340] J. Boyd-Graber, B. Satinoff, H. He, H. Daumé III, Besting the quiz mas-
ral recurrent sequence labeling model for open-domain factoid question ter: Crowdsourcing incremental classification games, in: Proceedings of
answering, arXiv preprint arXiv:1607.06275 (2016). 29 the 2012 joint conference on empirical methods in natural language pro-
[319] N. Peng, M. Dredze, Named entity recognition for chinese social media cessing and computational natural language learning, 2012, pp. 1290–
with jointly trained embeddings, in: Proceedings of the 2015 conference 1301. 29
on empirical methods in natural language processing, 2015, pp. 548– [341] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical
554. 29 question answer matching using end-to-end character-level multi-scale
[320] W. Ling, D. Yogatama, C. Dyer, P. Blunsom, Program induction by ratio- cnns, Applied Sciences 7 (8) (2017) 767. 29
nale generation: Learning to solve and explain algebraic word problems, [342] S. Zhang, X. Zhang, H. Wang, L. Guo, S. Liu, Multi-scale attentive in-
arXiv preprint arXiv:1705.04146 (2017). 29 teraction networks for chinese medical question answer selection, IEEE
[321] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Mar- Access 6 (2018) 74061–74071. 29
cus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, et al., Ontonotes re- [343] C. Xu, J. Pei, H. Wu, Y. Liu, C. Li, Matinf: A jointly labeled large-scale
lease 4.0, LDC2011T03, Philadelphia, Penn.: Linguistic Data Consor- dataset for classification, question answering and summarization, arXiv
tium (2011). 29 preprint arXiv:2004.12302 (2020). 29
[322] D. Vilares, C. Gómez-Rodríguez, Head-qa: A healthcare dataset for [344] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An
complex reasoning, arXiv preprint arXiv:1906.04701 (2019). 29 adversarial winograd schema challenge at scale, Communications of the
[323] S. L. Blodgett, L. Green, B. O’Connor, Demographic dialectal variation ACM 64 (9) (2021) 99–106. 25, 29
in social media: A case study of african-american english, arXiv preprint [345] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a
arXiv:1608.08868 (2016). 29 machine really finish your sentence?, arXiv preprint arXiv:1905.07830
[324] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Van- (2019). 27, 29
derwende, P. Kohli, J. Allen, A corpus and evaluation framework [346] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alter-
for deeper understanding of commonsense stories, arXiv preprint natives: An evaluation of commonsense causal reasoning., in: AAAI
arXiv:1604.01696 (2016). 26, 29 spring symposium: logical formalizations of commonsense reasoning,
[325] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, 2011, pp. 90–95. 29
S. Pezzelle, M. Baroni, G. Boleda, R. Fernández, The lambada dataset: [347] H. Levesque, E. Davis, L. Morgenstern, The winograd schema chal-
Word prediction requiring a broad discourse context, arXiv preprint lenge, in: Thirteenth international conference on the principles of knowl-
arXiv:1606.06031 (2016). 26, 29 edge representation and reasoning, 2012. 25, 27, 29
[326] B. Hu, Q. Chen, F. Zhu, Lcsts: A large scale chinese short text summa- [348] A. Talmor, J. Herzig, N. Lourie, J. Berant, Commonsenseqa: A question
rization dataset, arXiv preprint arXiv:1506.05865 (2015). 29 answering challenge targeting commonsense knowledge, arXiv preprint
[327] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text gener- arXiv:1811.00937 (2018). 27, 29
ation with planning-based hierarchical variational model, arXiv preprint [349] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa:

42
Commonsense reasoning about social interactions, arXiv preprint ceedings 4, Springer, 2013, pp. 303–320. 29
arXiv:1904.09728 (2019). 29 [370] S. Lim, M. Kim, J. Lee, Korquad1. 0: Korean qa dataset for machine
[350] K. Sun, D. Yu, D. Yu, C. Cardie, Investigating prior knowledge for chal- reading comprehension, arXiv preprint arXiv:1909.07005 (2019). 29
lenging chinese machine reading comprehension, Transactions of the [371] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han,
Association for Computational Linguistics 8 (2020) 141–155. 29 Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judg-
[351] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, B. Van Durme, Record: Bridg- ment prediction, arXiv preprint arXiv:1807.02478 (2018). 29
ing the gap between human and machine commonsense reading compre- [372] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo,
hension, arXiv preprint arXiv:1810.12885 (2018). 29 C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge
[352] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions competence with apps, arXiv preprint arXiv:2105.09938 (2021). 28, 29
for machine comprehension of text, arXiv preprint arXiv:1606.05250 [373] Y. Wang, X. Liu, S. Shi, Deep neural solver for math word problems,
(2016). 28, 29 in: Proceedings of the 2017 conference on empirical methods in natural
[353] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, language processing, 2017, pp. 845–854. 28, 29
K. Toutanova, Boolq: Exploring the surprising difficulty of natural [374] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,
yes/no questions, arXiv preprint arXiv:1905.10044 (2019). 28, 29 M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers
[354] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswer- to solve math word problems, arXiv preprint arXiv:2110.14168 (2021).
able questions for squad, arXiv preprint arXiv:1806.03822 (2018). 28, 29
29 [375] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan,
[355] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, Drop: E. Jiang, C. J. Cai, M. Terry, Q. V. Le, C. Sutton, Program synthesis with
A reading comprehension benchmark requiring discrete reasoning over large language models, CoRR abs/2108.07732 (2021). 29
paragraphs, arXiv preprint arXiv:1903.00161 (2019). 28, 29 [376] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W.
[356] I. Dagan, O. Glickman, B. Magnini, The pascal recognising textual en- Chung, Y. Tay, S. Ruder, D. Zhou, et al., Language models are mul-
tailment challenge, in: Machine learning challenges workshop, Springer, tilingual chain-of-thought reasoners, arXiv preprint arXiv:2210.03057
2005, pp. 177–190. 28, 29 (2022). 29
[357] Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, Y. Bisk, Webqa: Mul- [377] S. Roy, D. Roth, Solving general arithmetic word problems, arXiv
tihop and multimodal qa, in: Proceedings of the IEEE/CVF Conference preprint arXiv:1608.01413 (2016). 29
on Computer Vision and Pattern Recognition, 2022, pp. 16495–16504. [378] S.-Y. Miao, C.-C. Liang, K.-Y. Su, A diverse corpus for evaluating
28, 29 and developing english math word problem solvers, arXiv preprint
[358] Y. Cui, T. Liu, Z. Chen, W. Ma, S. Wang, G. Hu, Dataset for the first arXiv:2106.15772 (2021). 29
evaluation on chinese machine reading comprehension, arXiv preprint [379] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, H. Hajishirzi,
arXiv:1709.08299 (2017). 29 Mawps: A math word problem repository, in: Proceedings of the 2016
[359] Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, G. Hu, conference of the north american chapter of the association for computa-
A span-extraction dataset for chinese machine reading comprehension, tional linguistics: human language technologies, 2016, pp. 1152–1157.
arXiv preprint arXiv:1810.07366 (2018). 28, 29 29
[360] Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, G. Hu, [380] A. Patel, S. Bhattamishra, N. Goyal, Are nlp models really able to solve
A sentence cloze dataset for chinese machine reading comprehension, simple math word problems?, arXiv preprint arXiv:2103.07191 (2021).
arXiv preprint arXiv:2004.03116 (2020). 29 29
[361] Y. Li, T. Liu, D. Li, Q. Li, J. Shi, Y. Wang, Character-based bilstm-crf [381] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih,
incorporating pos and dictionaries for chinese opinion target extraction, D. Fried, S. Wang, T. Yu, Ds-1000: A natural and reliable benchmark for
in: Asian Conference on Machine Learning, PMLR, 2018, pp. 518–533. data science code generation, in: International Conference on Machine
29 Learning, PMLR, 2023, pp. 18319–18345. 29
[362] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, D. Roth, Look- [382] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan,
ing beyond the surface: A challenge set for reading comprehension E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large
over multiple sentences, in: Proceedings of the 2018 Conference of the language models, arXiv preprint arXiv:2108.07732 (2021). 29
North American Chapter of the Association for Computational Linguis- [383] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adver-
tics: Human Language Technologies, Volume 1 (Long Papers), 2018, sarial nli: A new benchmark for natural language understanding, arXiv
pp. 252–262. 29 preprint arXiv:1910.14599 (2019). 29
[363] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Al- [384] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge
berti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., Natural ques- corpus for sentence understanding through inference, arXiv preprint
tions: a benchmark for question answering research, Transactions of the arXiv:1704.05426 (2017). 29
Association for Computational Linguistics 7 (2019) 453–466. 29 [385] R. T. McCoy, E. Pavlick, T. Linzen, Right for the wrong reasons: Diag-
[364] C. C. Shao, T. Liu, Y. Lai, Y. Tseng, S. Tsai, Drcd: A chinese ma- nosing syntactic heuristics in natural language inference, arXiv preprint
chine reading comprehension dataset, arXiv preprint arXiv:1806.00920 arXiv:1902.01007 (2019). 29
(2018). 29 [386] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, Y. Zhang, Logiqa: A chal-
[365] W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, lenge dataset for machine reading comprehension with logical reason-
Q. She, et al., Dureader: a chinese machine reading comprehension ing, arXiv preprint arXiv:2007.08124 (2020). 29
dataset from real-world applications, arXiv preprint arXiv:1711.05073 [387] P. Lewis, B. Oğuz, R. Rinott, S. Riedel, H. Schwenk, Mlqa: Eval-
(2017). 29 uating cross-lingual extractive question answering, arXiv preprint
[366] H. Tang, J. Liu, H. Li, Y. Hong, H. Wu, H. Wang, Dureaderrobust: A arXiv:1910.07475 (2019). 29
chinese dataset towards evaluating the robustness of machine reading [388] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman,
comprehension models, arXiv preprint arXiv:2004.11142 (2020). 29 H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence rep-
[367] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science resentations, arXiv preprint arXiv:1809.05053 (2018). 29
questions, arXiv preprint arXiv:1707.06209 (2017). 29 [389] Y. Yang, Y. Zhang, C. Tar, J. Baldridge, Paws-x: A cross-
[368] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc lingual adversarial dataset for paraphrase identification, arXiv preprint
ranking with kernel pooling, in: Proceedings of the 40th International arXiv:1908.11828 (2019). 29
ACM SIGIR conference on research and development in information [390] S. Narayan, S. B. Cohen, M. Lapata, Don’t give me the details, just the
retrieval, 2017, pp. 55–64. 29 summary!, Topic-Aware Convolutional Neural Networks for Extreme
[369] A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, R. Morante, Summarization. ArXiv, abs (1808). 29
Qa4mre 2011-2013: Overview of question answering for machine read- [391] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen,
ing evaluation, in: Information Access Evaluation. Multilinguality, Mul- Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv
timodality, and Visualization: 4th International Conference of the CLEF preprint arXiv:2005.00333 (2020). 27, 29
Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Pro- [392] A. Tikhonov, M. Ryabinin, It’s all in the heads: Using attention heads

43
as a baseline for cross-lingual transfer in commonsense reasoning, arXiv [414] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, M. Auli, Eli5: Long
preprint arXiv:2106.12066 (2021). 29 form question answering, arXiv preprint arXiv:1907.09190 (2019). 31
[393] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Niko- [415] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei,
laev, J. Palomaki, Tydi qa: A benchmark for information-seeking ques- A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al.,
tion answering in typologically diverse languages, Transactions of the Benchmarking generalization via in-context instructions on 1,600+ lan-
Association for Computational Linguistics 8 (2020) 454–470. 29 guage tasks, arXiv preprint arXiv:2204.07705 (2022). 31
[394] T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano, [416] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu,
Mlsum: The multilingual summarization corpus, arXiv preprint M. Zhong, P. Yin, S. I. Wang, et al., Unifiedskg: Unifying and multi-
arXiv:2004.14900 (2020). 29 tasking structured knowledge grounding with text-to-text language mod-
[395] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic els, arXiv preprint arXiv:2201.05966 (2022). 31
human falsehoods, arXiv preprint arXiv:2109.07958 (2021). 29 [417] Q. Ye, B. Y. Lin, X. Ren, Crossfit: A few-shot learning challenge
[396] I. Augenstein, C. Lioma, D. Wang, L. C. Lima, C. Hansen, for cross-task generalization in nlp, arXiv preprint arXiv:2104.08835
C. Hansen, J. G. Simonsen, Multifc: A real-world multi-domain (2021). 31
dataset for evidence-based fact checking of claims, arXiv preprint [418] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta,
arXiv:1909.03242 (2019). 29 H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, et al., Ext5: Towards extreme
[397] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a multi-task scaling for transfer learning, arXiv preprint arXiv:2111.10952
large-scale dataset for fact extraction and verification, arXiv preprint (2021). 31
arXiv:1803.05355 (2018). 29 [419] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge cor-
[398] I. Mollas, Z. Chrysopoulou, S. Karlos, G. Tsoumakas, Ethos: an online pus for sentence understanding through inference, in: Proceedings of
hate speech detection dataset, arXiv preprint arXiv:2006.08328 (2020). the 2018 Conference of the North American Chapter of the Associ-
29, 31 ation for Computational Linguistics: Human Language Technologies,
[399] M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical Volume 1 (Long Papers), Association for Computational Linguistics,
bias in pretrained language models, arXiv preprint arXiv:2004.09456 New Orleans, Louisiana, 2018, pp. 1112–1122. doi:10.18653/v1/
(2020). 29, 31 N18-1101.
[400] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thomp- URL https://aclanthology.org/N18-1101 29
son, P. M. Htut, S. R. Bowman, Bbq: A hand-built bias benchmark for [420] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase adversaries from word
question answering, arXiv preprint arXiv:2110.08193 (2021). 29 scrambling, in: Proceedings of the 2019 Conference of the North Amer-
[401] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias ican Chapter of the Association for Computational Linguistics: Human
in coreference resolution: Evaluation and debiasing methods, arXiv Language Technologies, Volume 1 (Long and Short Papers), Associa-
preprint arXiv:1804.06876 (2018). 29 tion for Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
[402] N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A chal- 1298–1308. doi:10.18653/v1/N19-1131.
lenge dataset for measuring social biases in masked language models, URL https://aclanthology.org/N19-1131 29
arXiv preprint arXiv:2010.00133 (2020). 29 [421] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chat-
[403] S. Gehman, S. Gururangan, M. Sap, Y. Choi, N. A. Smith, Realtoxic- GPT a general-purpose natural language processing task solver?, in: The
ityprompts: Evaluating neural toxic degeneration in language models, 2023 Conference on Empirical Methods in Natural Language Process-
arXiv preprint arXiv:2009.11462 (2020). 29 ing, 2023.
[404] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced URL https://openreview.net/forum?id=u03xn1COsO 31
metrics for measuring unintended bias with real data for text classifica- [422] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh,
tion, in: Companion proceedings of the 2019 world wide web confer- N. Akhtar, J. Wu, S. Mirjalili, et al., Large language models: a com-
ence, 2019, pp. 491–500. 29 prehensive survey of its applications, challenges, limitations, and future
[405] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, prospects, TechRxiv (2023). 31
M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al., Find- [423] X. L. Dong, S. Moon, Y. E. Xu, K. Malik, Z. Yu, Towards next-
ings of the 2016 conference on machine translation, in: Proceedings of generation intelligent assistants leveraging llm techniques, in: Proceed-
the First Conference on Machine Translation: Volume 2, Shared Task ings of the 29th ACM SIGKDD Conference on Knowledge Discovery
Papers, 2016, pp. 131–198. 29 and Data Mining, 2023, pp. 5792–5793. 31
[406] B. Loïc, B. Magdalena, B. Ondřej, F. Christian, G. Yvette, G. Ro- [424] K. Pandya, M. Holia, Automating customer service using langchain:
man, H. Barry, H. Matthias, J. Eric, K. Tom, et al., Findings of the Building custom open-source gpt chatbot for organizations, arXiv
2020 conference on machine translation (wmt20), in: Proceedings of preprint arXiv:2310.05421 (2023). 31
the Fifth Conference on Machine Translation, Association for Compu- [425] J. Li, B. Hui, G. Qu, B. Li, J. Yang, B. Li, B. Wang, B. Qin, R. Cao,
tational Linguistics„ 2020, pp. 1–55. 29 R. Geng, et al., Can llm already serve as a database interface? a
[407] W. Li, F. Qi, M. Sun, X. Yi, J. Zhang, Ccpm: A chinese classical poetry big bench for large-scale database grounded text-to-sqls, arXiv preprint
matching dataset, arXiv preprint arXiv:2106.01979 (2021). 29 arXiv:2305.03111 (2023). 31
[408] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, J. Weston, Wizard of [426] A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, M. D. Succi, Evaluating
wikipedia: Knowledge-powered conversational agents, arXiv preprint chatgpt as an adjunct for radiologic decision-making, medRxiv (2023)
arXiv:1811.01241 (2018). 29 2023–02. 31
[409] H. Rashkin, E. M. Smith, M. Li, Y.-L. Boureau, Towards empathetic [427] M. Benary, X. D. Wang, M. Schmidt, D. Soll, G. Hilfenhaus, M. Nas-
open-domain conversation models: A new benchmark and dataset, arXiv sir, C. Sigler, M. Knödler, U. Keller, D. Beule, et al., Leveraging large
preprint arXiv:1811.00207 (2018). 29 language models for decision support in personalized oncology, JAMA
[410] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, Network Open 6 (11) (2023) e2343689–e2343689. 31
D. Kiela, A. Szlam, I. Serban, R. Lowe, et al., The second conversa- [428] C. M. Chiesa-Estomba, J. R. Lechien, L. A. Vaira, A. Brunet, G. Cam-
tional intelligence challenge (convai2), in: The NeurIPS’18 Competi- maroto, M. Mayo-Yanez, A. Sanchez-Barrueco, C. Saga-Gutierrez, Ex-
tion: From Machine Learning to Intelligent Conversations, Springer, ploring the potential of chat-gpt as a supportive tool for sialendoscopy
2020, pp. 187–208. 29 clinical decision making and patient information support, European
[411] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, Kdconv: A chinese Archives of Oto-Rhino-Laryngology (2023) 1–6. 31
multi-domain dialogue dataset towards multi-turn knowledge-driven [429] S. Montagna, S. Ferretti, L. C. Klopfenstein, A. Florio, M. F. Pengo,
conversation, arXiv preprint arXiv:2004.04100 (2020). 29 Data decentralisation of llm-based chatbot systems in chronic disease
[412] L. CO, Iflytek: a multiple categories chinese text classifier. competition self-management, in: Proceedings of the 2023 ACM Conference on In-
official website (2019). 29 formation Technology for Social Good, 2023, pp. 205–212. 31
[413] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The [430] D. Bill, T. Eriksson, Fine-tuning a llm using reinforcement learning from
pushshift reddit dataset, in: Proceedings of the international AAAI con- human feedback for a therapy chatbot application (2023). 31
ference on web and social media, Vol. 14, 2020, pp. 830–839. 30 [431] M. Abbasian, I. Azimi, A. M. Rahmani, R. Jain, Conversational health

44
agents: A personalized llm-powered agent framework, arXiv preprint [453] K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil,
arXiv:2310.02374 (2023). 31 R. Prenger, A. Anandkumar, Leandojo: Theorem proving with retrieval-
[432] K. V. Lemley, Does chatgpt help us understand the medical literature?, augmented language models, arXiv preprint arXiv:2306.15626 (2023).
Journal of the American Society of Nephrology (2023) 10–1681. 31 32
[433] S. Pal, M. Bhattacharya, S.-S. Lee, C. Chakraborty, A domain-specific [454] K. M. Collins, A. Q. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt,
next-generation large language model (llm) or chatgpt is required for T. Lukasiewicz, Y. Wu, J. B. Tenenbaum, W. Hart, et al., Evaluating
biomedical engineering and research, Annals of Biomedical Engineering language models for mathematics through interactions, arXiv preprint
(2023) 1–4. 31 arXiv:2306.01694 (2023). 32
[434] Y. Du, S. Zhao, Y. Chen, R. Bai, J. Liu, H. Wu, H. Wang, B. Qin, The [455] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He,
calla dataset: Probing llms’ interactive knowledge acquisition from chi- Z. Liu, et al., Summary of chatgpt-related research and perspective
nese medical literature, arXiv preprint arXiv:2309.04198 (2023). 31 towards the future of large language models, Meta-Radiology (2023)
[435] A. Abd-Alrazaq, R. AlSaad, D. Alhuwail, A. Ahmed, P. M. Healy, 100017. 32
S. Latifi, S. Aziz, R. Damseh, S. A. Alrazak, J. Sheikh, et al., Large [456] J. Drápal, H. Westermann, J. Savelka, Using large language models
language models in medical education: Opportunities, challenges, and to support thematic analysis in empirical legal studies, arXiv preprint
future directions, JMIR Medical Education 9 (1) (2023) e48291. 31 arXiv:2310.18729 (2023). 32
[436] A. B. Mbakwe, I. Lourentzou, L. A. Celi, O. J. Mechanic, A. Dagan, [457] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, H. Xu, Explain-
Chatgpt passing usmle shines a spotlight on the flaws of medical educa- ing legal concepts with augmented large language models (gpt-4), arXiv
tion (2023). 31 preprint arXiv:2306.09525 (2023). 32
[437] S. Ahn, The impending impacts of large language models on medical [458] N. Guha, J. Nyarko, D. E. Ho, C. Ré, A. Chilton, A. Narayana,
education, Korean Journal of Medical Education 35 (1) (2023) 103. 31 A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, et al., Legal-
[438] E. Waisberg, J. Ong, M. Masalkhi, A. G. Lee, Large language model bench: A collaboratively built benchmark for measuring legal reasoning
(llm)-driven chatbots for neuro-ophthalmic medical education, Eye in large language models, arXiv preprint arXiv:2308.11462 (2023). 32
(2023) 1–3. 31 [459] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal
[439] G. Deiana, M. Dettori, A. Arghittu, A. Azara, G. Gabutti, P. Castiglia, large language model with integrated external knowledge bases, arXiv
Artificial intelligence and public health: Evaluating chatgpt responses to preprint arXiv:2306.16092 (2023). 32
vaccination myths and misconceptions, Vaccines 11 (7) (2023) 1217. 31 [460] H. Yang, X.-Y. Liu, C. D. Wang, Fingpt: Open-source financial large
[440] L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E. language models, arXiv preprint arXiv:2306.06031 (2023). 32
Tozzi, C. Rizzo, Chatgpt and the rise of large language models: the new [461] Y. Li, S. Wang, H. Ding, H. Chen, Large language models in finance: A
ai-driven infodemic threat in public health, Frontiers in Public Health 11 survey, in: Proceedings of the Fourth ACM International Conference on
(2023) 1166120. 31 AI in Finance, 2023, pp. 374–382. 33
[441] N. L. Rane, A. Tawde, S. P. Choudhary, J. Rane, Contribution and per- [462] A. Lykov, D. Tsetserukou, Llm-brain: Ai-driven fast generation of
formance of chatgpt and other large language models (llm) for scientific robot behaviour tree based on large language model, arXiv preprint
and research advancements: a double-edged sword, International Re- arXiv:2305.19352 (2023). 33
search Journal of Modernization in Engineering Technology and Science [463] E. Billing, J. Rosén, M. Lamb, Language models for human-robot inter-
5 (10) (2023) 875–899. 31, 32 action, in: ACM/IEEE International Conference on Human-Robot Inter-
[442] W. Dai, J. Lin, H. Jin, T. Li, Y.-S. Tsai, D. Gašević, G. Chen, Can large action, March 13–16, 2023, Stockholm, Sweden, ACM Digital Library,
language models provide feedback to students? a case study on chatgpt, 2023, pp. 905–906. 33
in: 2023 IEEE International Conference on Advanced Learning Tech- [464] Y. Ye, H. You, J. Du, Improved trust in human-robot collaboration with
nologies (ICALT), IEEE, 2023, pp. 323–325. 32 chatgpt, IEEE Access (2023). 33
[443] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, [465] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Leveraging commonsense
F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., knowledge from large language models for task and motion planning,
Chatgpt for good? on opportunities and challenges of large language in: RSS 2023 Workshop on Learning for Task and Motion Planning,
models for education, Learning and individual differences 103 (2023) 2023. 33
102274. 32 [466] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg,
[444] N. Rane, Enhancing the quality of teaching and learning through chat- S. Rusinkiewicz, T. Funkhouser, Tidybot: Personalized robot assistance
gpt and similar large language models: Challenges, future prospects, with large language models, arXiv preprint arXiv:2305.05658 (2023).
and ethical considerations in education, Future Prospects, and Ethical 33
Considerations in Education (September 15, 2023) (2023). 32 [467] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations
[445] J. C. Young, M. Shishido, Investigating openai’s chatgpt potentials in for deep learning in nlp, arXiv preprint arXiv:1906.02243 (2019). 33
generating chatbot’s dialogue for english as a foreign language learning, [468] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dan-
International Journal of Advanced Computer Science and Applications gers of stochastic parrots: Can language models be too big?, in: Pro-
14 (6) (2023). 32 ceedings of the 2021 ACM conference on fairness, accountability, and
[446] J. Irons, C. Mason, P. Cooper, S. Sidra, A. Reeson, C. Paris, Exploring transparency, 2021, pp. 610–623. 33
the impacts of chatgpt on future scientific work, SocArXiv (2023). 32 [469] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding
[447] P. G. Schmidt, A. J. Meir, Using generative ai for literature searches and deep learning (still) requires rethinking generalization, Communications
scholarly writing: Is the integrity of the scientific discourse in jeopardy?, of the ACM 64 (3) (2021) 107–115. 33
arXiv preprint arXiv:2311.06981 (2023). 32 [470] M. Tänzer, S. Ruder, M. Rei, Memorisation versus generalisation in pre-
[448] Y. Zheng, H. Y. Koh, J. Ju, A. T. Nguyen, L. T. May, G. I. Webb, S. Pan, trained language models, arXiv preprint arXiv:2105.00828 (2021). 33
Large language models for scientific synthesis, inference and explana- [471] S. M. West, M. Whittaker, K. Crawford, Discriminating systems, AI
tion, arXiv preprint arXiv:2310.07984 (2023). 32 Now (2019) 1–33. 33
[449] B. Aczel, E.-J. Wagenmakers, Transparency guidance for chatgpt usage [472] K. Valmeekam, A. Olmo, S. Sreedharan, S. Kambhampati, Large lan-
in scientific writing, PsyArXiv (2023). 32 guage models still can’t plan (a benchmark for llms on planning and
[450] S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in sci- reasoning about change), arXiv preprint arXiv:2206.10498 (2022). 33
entific writing: a friend or a foe?, Reproductive BioMedicine Online [473] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao,
(2023). 32 Y. Zhang, Y. Chen, et al., Siren’s song in the ai ocean: A survey on hal-
[451] S. Imani, L. Du, H. Shrivastava, Mathprompter: Mathematical reasoning lucination in large language models, arXiv preprint arXiv:2309.01219
using large language models, arXiv preprint arXiv:2303.05398 (2023). (2023). 33
32 [474] A. Webson, E. Pavlick, Do prompt-based models really understand the
[452] Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, C. Zhou, Scaling relationship meaning of their prompts?, arXiv preprint arXiv:2109.01247 (2021). 33
on learning mathematical reasoning with large language models, arXiv [475] O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second
preprint arXiv:2308.01825 (2023). 32 thought, let’s not think step by step! bias and toxicity in zero-shot rea-

45
soning, arXiv preprint arXiv:2212.08061 (2022). 33
[476] X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, J. Gao, Adversar-
ial training for large neural language models, ArXiv (April 2020).
URL https://www.microsoft.com/en-us/research/
publication/adversarial-training-for-large-neural-language-models/
34
[477] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, N. Abu-
Ghazaleh, Survey of vulnerabilities in large language models revealed
by adversarial attacks (2023). arXiv:2310.10844. 34
[478] X. Xu, K. Kong, N. Liu, L. Cui, D. Wang, J. Zhang, M. Kankanhalli, An
llm can fool itself: A prompt-based adversarial attack (2023). arXiv:
2310.13345. 34
[479] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin,
M. Du, Explainability for large language models: A survey (2023).
arXiv:2309.01029. 34
[480] S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, L. H. Gilpin, Can large
language models explain themselves? a study of llm-generated self-
explanations (2023). arXiv:2310.11207. 34
[481] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, F. Tramèr, What does it
mean for a language model to preserve privacy?, in: Proceedings of the
2022 ACM Conference on Fairness, Accountability, and Transparency,
2022, pp. 2280–2292. 34
[482] R. Plant, V. Giuffrida, D. Gkatzia, You are what you write: Pre-
serving privacy in the era of large language models, arXiv preprint
arXiv:2204.09391 (2022). 34
[483] W. Niu, Z. Kong, G. Yuan, W. Jiang, J. Guan, C. Ding, P. Zhao, S. Liu,
B. Ren, Y. Wang, Real-time execution of large-scale language models
on mobile (2020). arXiv:2009.06823. 34
[484] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo,
Y. Zhu, Olive: Accelerating large language models via hardware-
friendly outlier-victim pair quantization, in: Proceedings of the 50th
Annual International Symposium on Computer Architecture, 2023, pp.
1–15. 34
[485] B. Meskó, E. J. Topol, The imperative for regulatory oversight of large
language models (or generative ai) in healthcare, npj Digital Medicine
6 (1) (2023) 120. 34
[486] J. Zhang, X. Ji, Z. Zhao, X. Hei, K.-K. R. Choo, Ethical considerations
and policy implications for large language models: Guiding responsible
development and deployment, arXiv preprint arXiv:2308.02678 (2023).
34
[487] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language
models: a three-layered approach, AI and Ethics (2023) 1–31. 34

46

You might also like