Review
A Review of Current Trends, Techniques, and
Challenges in Large Language Models (LLMs)
Rajvardhan Patil 1, * and Venkat Gudivada 2
1 School of Computing, Grand Valley State University; [email protected]
2 Computer Science Department, East Carolina University; [email protected]
* Correspondence: [email protected]; Tel.:(+1)616-331-4375
Abstract: Natural Language Processing (NLP) has transformed significantly in the last decade, especially in the field of language modeling. Large Language Models (LLMs) have achieved state-of-the-art (SOTA) performance on Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks by learning language representations in self-supervised ways. This paper provides a
comprehensive survey to capture the progression of advances in Language Models. In this paper, we
examine the different aspects of Language Models, which started with a few million parameters but
have reached the size of a trillion in a very short time. We also look at how these LLMs transitioned
from task-specific to task-independent to task-and-language-independent architectures. This paper
extensively discusses different pre-training objectives, benchmarks, and transfer learning methods
used in LLMs. It also examines different fine-tuning and In-Context learning techniques used in
downstream tasks. It also explores how LLMs can perform well across many domains and datasets if
sufficiently trained on a large and diverse dataset. Next, it discusses how, over time, the availability
of cheap computational power and large datasets has improved LLMs' capabilities and raised new
challenges. As part of our study, we also inspect LLMs from the lens of scalability to see how their
performance is affected by the model’s depth, width, and data size. Lastly, we provide an empirical
comparison of existing trends and techniques and a comprehensive analysis of where the field of
LLMs currently stands.
Keywords: language models; PLMs; large language model; LLMs; natural language processing; NLP;
literature review; survey; review
1. Introduction
1.1. Background
Most feature-engineering methods before GPT relied on manually curated labeled data, which
were time-consuming and expensive. Additionally, not all applications had annotated or labeled
datasets. To address these issues, statistical methods such as one-hot encoding [1], bag-of-words,
N-grams [2], Term Frequency [3], and Inverse document frequency ([4], [5]) were proposed. In
these approaches, word or phrase level statistics were computed and used as features in supervised
models. However, such discrete-space representations lacked contextual information and suffered from the curse of dimensionality, making them computationally inefficient. Although techniques such as dimensionality reduction [6] and Independent Component Analysis [7] were applied,
these techniques failed to capture a deeper understanding of concepts such as polysemy or identifying
analogies, synonyms, antonyms, etc.
An alternative approach, using unlabeled data in a self-supervised manner to extract and leverage linguistic information, emerged as more effective and valuable. For making predictions, language
models started incorporating contexts of increasingly larger scope. The self-supervised approach
started with individual words, followed by surrounding words, sentences, and paragraphs [10]. Word
embeddings like Word2Vec ([11], [12]), GloVe [13], and FastText [14] were generated from the unlabeled corpora using a self-supervised approach. They improved performance across a variety of NLP tasks.
As shown in Figure 1, the phases in these transformer-based LLMs can broadly be classified into Pretraining, Transfer Learning, and/or In-Context Learning. In the sections to come, we explore in detail the different attention mechanism masks, architectures, and objectives used during pretraining, transfer and in-context learning techniques, scalability factors, and the challenges of LLMs.
The outline of this survey paper is as follows: In Section 2, we look at the Language Model definition and the Attention Layer mechanism in detail. In Section 3, we describe the types of architectures and attention masks used in transformers. Section 4 elaborates on the pretraining objectives and different
learning strategies used by the LLMs. Section 5 discusses transfer learning strategies, followed by
In-Context learning in Section 6. Section 7 describes different scale factors, such as model width, depth, datasets, and architecture, and how they affect the performance of LLMs. Section 8 enumerates the challenges encountered by LLMs, followed by future directions and development trends in Section 9. Section 10
concludes the paper.
As stated in Bloom [55], language modeling refers to the task of modeling the probability of a sequence of tokens in a text, where a token can be a unit of text such as a word, subword, character, or byte. Normally, in the pretraining phase of Language Models, the next-word prediction objective is used, which is conditioned on the previous tokens as context. So, for a given input or source sequence S = (s_1, s_2, ..., s_n), the model predicts the joint probability of the output or target sequence T = (t_1, t_2, ..., t_n), as shown in equation 1.
P(T) = \prod_{i=1}^{n} p(t_i \mid s_1, s_2, \ldots, s_{i-1})    (1)
This approach is referred to as autoregressive language modeling and can be seen as iteratively
predicting the probability of the next token, as shown in equation 2.
p(x) = p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, x_2, \ldots, x_{t-1})    (2)
Here, to deal with different downstream tasks (question answering, translation, summarization, etc.), each task is cast or converted into a text-to-text framework. In this way, the language model can be applied to handle different downstream tasks. The pretrained model with parameters θ is then adapted during fine-tuning on a dataset D to minimize the loss over the target tokens conditioned on the source tokens and previously seen target tokens. Equation 3 highlights this loss function L.
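A standard form of this loss, consistent with the description above (the negative log-likelihood of each target token given the source and the preceding target tokens, summed over the fine-tuning dataset), is:

L(D) = -\sum_{(S,T) \in D} \sum_{i=1}^{|T|} \log p_\theta(t_i \mid S, t_1, \ldots, t_{i-1})    (3)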
LLMs follow a similar mechanism of pretraining and fine-tuning as Language Models, except that the parameter size of LLMs is in the billions or trillions.
Z = \mathrm{attention}(Q, K, V) = W_A V = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V    (4)
To help speed up the pretraining process, the teacher-forcing technique is used, which leads to faster convergence and higher accuracy. In teacher forcing, instead of the model's output from the previous
timestep, the ground-truth (correct answer) is fed as an input at each time step. To enable this teacher
forcing, the pre-attention decoder takes the target tokens and shifts them one place to the right.
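To make the shift-right operation concrete, the following minimal Python sketch (with hypothetical token IDs) shows how decoder inputs can be derived from the target sequence for teacher forcing:

# Minimal teacher-forcing sketch: the decoder input is the target sequence
# shifted one position to the right, with a start-of-sequence token prepended.
# The token IDs below are hypothetical.
BOS_ID = 0

def shift_right(target_ids):
    """Return decoder inputs for teacher forcing."""
    return [BOS_ID] + target_ids[:-1]

target = [17, 42, 8, 99]             # ground-truth target tokens
decoder_input = shift_right(target)
print(decoder_input)                  # [0, 17, 42, 8]
# At step t the decoder is conditioned on the ground-truth tokens t_1..t_{t-1}
# instead of its own previous predictions.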
Z = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(z_1, z_2, \ldots, z_n)\, W^O, \quad \text{where} \quad z_i = \mathrm{attention}(QW_i^Q, KW_i^K, VW_i^V)    (5)
As shown in Figure 2, the outputs of these heads are further concatenated to produce a single output. This multi-head attention mechanism emulates the effect of recurrence over the sequence, but with attention. Each head uses different linear transformations to represent words, and therefore different heads can learn different relationships between words. The multi-head attention mechanism executes the scaled dot-product attention in parallel. A multi-headed model is therefore able to jointly attend to information from different representations at different positions over the projected versions of the queries, keys, and values. As shown in equation 5, these output values are then concatenated and weighted, where each head z_i is the attention function of the Query, Key, and Value with trainable parameters W_i^Q, W_i^K, and W_i^V.
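As a concrete illustration of equations 4 and 5, the following NumPy sketch implements scaled dot-product attention and a small multi-head wrapper; the dimensions and random projection weights are illustrative and not taken from any particular model.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (equation 4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_q, seq_k)
    weights = softmax(scores, axis=-1)       # attention weights W_A
    return weights @ V

def multi_head_attention(X, n_heads, d_model, rng):
    """Multi-head attention (equation 5): each head has its own projections
    W_i^Q, W_i^K, W_i^V; head outputs are concatenated and mixed by W^O."""
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))             # 6 tokens, d_model = 16
print(multi_head_attention(X, n_heads=4, d_model=16, rng=rng).shape)   # (6, 16)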
Unlike the traditional encoder-decoder RNN model, the self-attention mechanism does not encode the entire input sequence into a single fixed vector. The input sentence is therefore not squashed into a single fixed-length vector, and the decoder has the flexibility to attend to more than one hidden state of the encoder. Additionally, in the attention mechanism, only a subset of the encoded vectors of
the input sequence are chosen adaptively during the decoding. The attention mechanism gives more
weight or attention to the part of the input sequence that is relevant to the target. As a result, it
allows capturing dependencies from the information spread throughout the sequence irrespective
of the distance between the tokens. Furthermore, as the decoder is empowered with the attention
mechanism, the encoder is relieved from the burden of encoding the input into a fixed-size vector.
Paper [22] shows how this joint learning of alignment and translation improves performance over the
basic encoder-decoder approach, especially over longer sentences.
3. Transformer
After its inception, the Transformer soon became the de-facto standard for Natural Language
tasks. Below, we discuss several variants of the original transformer-based model that were proposed
to deal with NLU and NLG tasks.
As shown in Figure 4, in the encoder-decoder architecture, fully visible masking is used in the
encoder, and causal mask is used in the decoder. In a decoder-only model, the input and target are
concatenated, and then a causal mask is used throughout. A decoder-only model with a prefix allows
fully visible masking over part of the input tokens (the prefix), followed by causal masking on the rest
of the sequence. In general, autoencoding models learn bidirectional contextualized representation
suited for NLU tasks, whereas autoregressive models learn to generate the next token and hence are
suited for NLG tasks. Table 1 details architectural information of prominent LLM models, such as their
parameter size, hardware used, number of Encoder (E) and Decoder (D) layers, attention heads etc.
In Causal-Mask, the attention mechanism can attend only to the previous tokens and is prohibited from attending to input tokens from the future. That is, while producing the i-th entry, causal masks prevent the attention mechanism from attending to all the entries occurring after the i-th entry so that
the model cannot see into the future. The prefix-causal mask is a combination of these two approaches,
allowing the attention mechanism to use a fully visible mask on a portion of the input sequence (called
the prefix) and a causal mask on the rest of the sequence.
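These three masking patterns can be illustrated with a short NumPy sketch, where a 1 marks a position a query token is allowed to attend to (the prefix length chosen here is arbitrary):

import numpy as np

def fully_visible_mask(n):
    """Encoder-style mask: every token attends to every token."""
    return np.ones((n, n), dtype=int)

def causal_mask(n):
    """Decoder-style mask: token i attends only to tokens j <= i."""
    return np.tril(np.ones((n, n), dtype=int))

def prefix_causal_mask(n, prefix_len):
    """Fully visible over the prefix, causal over the rest of the sequence."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = 1         # every token may attend to the prefix
    return mask

n = 5
print(fully_visible_mask(n))
print(causal_mask(n))
print(prefix_causal_mask(n, prefix_len=2))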
pretrained approach leads to faster and better generalization than training the model from scratch.
Below, we explore several objectives that have been successfully used during the pretraining process.
4.1. Objectives
The [gMASK] token was used for long blanks of random length at the end of sentences, with prefix contexts provided. When [gMASK] is used, GLM-130B behaves similarly to a PrefixLM.
In MTL, as the same model performs many different tasks, the language model gets conditioned on
the input and the task to be performed. Such task conditioning can be implemented at the architecture level. However, a recent technique from GPT-2 [28] suggests a simplified mechanism where tasks, inputs,
and outputs can all be specified as a sequence of symbols. That is, to be architecture-independent, the
input can be transformed to incorporate task-aware information as a context (added as task-prefix) to
the input sequence. Also, as stated in T5, every text processing problem can be mapped to a "text-to-text" format, where the input and output are both text. For instance, to translate an English sentence "I am good" to French, the prefix "translate English to French: I am good. Target: " will be used, where the model will then be asked to generate the remainder "je vais bien" of the sequence in an
autoregressive manner. So, similar to a translation expressed as a sequence of (translate to French, English sentence, French sentence), a reading comprehension example can likewise be written as a sequence of (answer the question, document, question, answer). Using this framework, the same encoding
and decoding procedure is used across various tasks, without requiring any change to the model
architecture. Therefore, the same model can be effectively applied for transfer and inference purposes
on many different downstream tasks, allowing it to generalize and perform well on new and related
domains.
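The following sketch illustrates this text-to-text casting with two hypothetical tasks; the prefixes and examples are illustrative, and the exact templates used by a given model such as T5 may differ.

# Casting different tasks into a single text-to-text format (illustrative).
def to_text_to_text(task, **fields):
    if task == "translation":
        source = f"translate English to French: {fields['sentence']}"
        target = fields["translation"]
    elif task == "reading_comprehension":
        source = (f"answer the question: question: {fields['question']} "
                  f"context: {fields['document']}")
        target = fields["answer"]
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target

print(to_text_to_text("translation",
                      sentence="I am good.",
                      translation="je vais bien"))
print(to_text_to_text("reading_comprehension",
                      question="Who wrote Hamlet?",
                      document="Hamlet is a tragedy written by William Shakespeare.",
                      answer="William Shakespeare"))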
As hypothesized in T0 [43], because of the implicit multitask learning, LLMs can attain reasonable
zero-shot generalization on diverse tasks. For instance, during pretraining, some tasks would appear
in explicit form with the task instructions, input and output pairs. For example, there are websites
containing FAQs and their answers, which act as supervised training data for the closed-book QA task.
Such multitask supervision might play a crucial role in zero-shot generalization during pretraining.
To test the hypothesis, T0 attempts to induce zero-shot generalization by explicit multitask learning,
where it uses T5[26] model and fine-tunes it in a supervised manner on a dataset with a wide variety
of tasks in natural language prompted format. Due to this approach, T0 was able to better generalize
on held-out tasks without requiring data at massive scale, and became more robust to the prompt
wording. WeLM [53] also reinforced generalization across tasks through explicit multitask learning,
where the trained model was then tested on a set of held-out tasks.
for each incoming input and improves model capacity without incurring additional computation
costs. In MoE, although a huge number of weights are used during training, only relevant experts are
needed to compute a small subset of the computational graph at inference time. Additionally, in static
networks, as the entire model gets activated for every example, training cost is increased (roughly
quadratically) with the increase in model size and training examples [73]. In contrast, ST-MoE [75] demonstrated how a 269B-parameter sparse model has a computational cost comparable to that of an encoder-decoder transformer model with only 32B parameters, and still achieves SOTA performance across a variety of NLP tasks. However, when the MoE model size is scaled by increasing the number of sparsely gated experts, the parameter count can grow significantly, requiring more storage memory (on the order of hundreds of GBs).
In MoE, a trainable gating network determines which combination of sparse experts needs to
be selected to process the given input. [73] introduced MoE and demonstrated how conditional
computation using sparsely-gated experts improved model capacity by 1000 times, with a minor loss
in computational efficiency. This is helpful, especially for language modeling and machine translation
tasks, where the model capacity is essential to assimilate or absorb large amounts of information from
the corpora. Using MoE, [73] did better on language modeling and machine translation tasks than
prior studies.
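A minimal sketch of this sparsely-gated routing idea is shown below; a softmax gating network scores the experts for each token and only the top-k experts are actually evaluated. The dimensions, the noise-free gate, and the per-token (rather than batched) routing are simplifications.

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, expert_weights, gate_weights, k=2):
    """Sparsely-gated MoE for a single token vector x: only the top-k
    experts selected by the gate are computed."""
    gate_probs = softmax(x @ gate_weights)                   # (n_experts,)
    top_k = np.argsort(gate_probs)[-k:]                      # indices of selected experts
    weights = gate_probs[top_k] / gate_probs[top_k].sum()    # renormalize over top-k
    out = np.zeros(expert_weights[0].shape[1])
    for w, idx in zip(weights, top_k):
        out += w * (x @ expert_weights[idx])                 # run only the selected experts
    return out

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
x = rng.standard_normal(d_model)
print(moe_layer(x, experts, gate, k=2).shape)                # (16,)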
Similarly, with MoE, GShard [74] was able to efficiently perform training and inference using
conditional computation, where only a sub-network gets activated on a per-input basis. Additionally,
the translation quality of GShard increased with model size, but due to MoE, the wall-time of
training increased only sub-linearly. Having been pretrained on multilingual data, GShard achieved better translation quality than prior art when translating text from 100 languages to English. Additionally, an annotation technique was used by GShard to annotate the tensors either for
distribution or replication across a cluster of devices.
MoE-based models incur additional storage space. This might create difficulty in the model training and inference phases if GPU capacity is exceeded. To address this issue, CPM-2 [34] proposed the INFMOE framework. This framework uses a dynamically scheduled offloading strategy and enables
MoE model inference on a single GPU. The parameters of experts from MoE layers are offloaded to
CPU memory to enable the inference of the model on a single GPU.
As demonstrated in [77], for model training and inference, MoEs yield competitive zero- and few-shot performance (though not full fine-tuning performance) at a fraction of the computation. MoEs can match dense model performance with four times less compute. Furthermore, the performance gap
between MoE and dense models varies greatly across domains and tasks, indicating that MoE and
dense models might generalize differently. GLaM [47] also used sparsely activated MoE architecture
to achieve competitive few-shot task results compared to SOTA-dense models while being more
computationally efficient. Although GLaM (1.2T parameters) is seven times larger than GPT-3 in
parameters, it activates a subnetwork of 96.6B (8% of 1.2T) parameters, consumes only one-third of
the energy used to train GPT-3, requires only half of the computation flops for inference and achieves
better overall zero, one and few-shot performances across 29 NLP tasks.
Sparse expert models have resulted in a pretraining speedup of 4-7 times while keeping the computational cost (FLOPs per token) constant. Although sparse expert models have many parameters, they reduce the carbon footprint by an order of magnitude. For example, one such model achieves the same level of one-shot performance as GPT-3 while using only 1/3 of the training energy cost. Although MoE requires
additional storage space for parameters, the sparse language model is one of the promising alternatives
to save energy costs.
The experts in the MoE layers are shared across many devices since the sheer size makes
it infeasible to replicate them across all devices. Also, MoE sparse models do suffer from
training instabilities worse than those encountered in traditional static densely activated models.
Switch-Transformer [76] addressed some of the issues observed in MoE models, such as complexity,
communication costs, and training instability. Switch-Transformer simplified the MoE routing algorithm and proposed an architecture that mitigates these instabilities in a computationally efficient manner with reduced communication costs.
Additionally, ERNIE 3.0 used prompt-tuning during fine-tuning to better exploit knowledge from the
pre-trained model.
Extreme Denoising
This objective considers extreme span lengths and corruption rates of up to 50%. Therefore, given a small or moderate part of the input, the model is supposed to recover or predict a large chunk of the sequence. The pretraining objective is considered extreme denoising if it has a long span (for example, greater than or equal to 12 tokens) or a large corruption rate (for example, greater than or equal to 30%). So, it covers scenarios with long spans and low corruption, long spans and high corruption, and short spans and high corruption, where the model generates long targets based on relatively limited information from memory.
Sequential Denoising
This objective strictly follows the sequence order, i.e., prefix language modeling. The target tokens cannot attend to future context tokens, but the prefix context uses a bidirectional architecture.
Regular Denoising
This denoising approach has short spans, in the range of 2 to 5 tokens, and a low corruption rate that masks up to 15% of the sequence. Because of the short span length, such objectives are not well suited for generating text but are preferred for knowledge acquisition and understanding tasks.
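The three denoisers can be summarized as a small configuration sketch; the span lengths, corruption rates, and span-placement strategy below are indicative of the ranges described above rather than the exact UL2 hyperparameters.

# Illustrative UL2-style mixture-of-denoisers configuration and corruption.
DENOISERS = {
    "regular":    {"mean_span": 3,  "corruption_rate": 0.15},   # short spans, low corruption
    "extreme":    {"mean_span": 12, "corruption_rate": 0.50},   # long spans / high corruption
    "sequential": {"mean_span": None, "corruption_rate": None}, # prefix LM: corrupt the suffix
}

def corrupt(tokens, denoiser):
    """Return (corrupted_input, targets); spans are placed at evenly spaced
    positions here instead of being randomly sampled."""
    cfg = DENOISERS[denoiser]
    if denoiser == "sequential":
        split = len(tokens) // 2                        # keep a prefix, predict the rest
        return tokens[:split] + ["<X>"], [tokens[split:]]
    span = cfg["mean_span"]
    n_spans = max(1, round(len(tokens) * cfg["corruption_rate"] / span))
    stride = max(span, len(tokens) // n_spans)
    mask_starts = {j * stride for j in range(n_spans)}
    corrupted, targets, i = [], [], 0
    while i < len(tokens):
        if i in mask_starts:
            corrupted.append(f"<extra_id_{len(targets)}>")
            targets.append(tokens[i:i + span])
            i += span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, targets

toks = ("large language models are pretrained on massive text corpora using self "
        "supervised objectives before being adapted to a wide range of downstream "
        "tasks with fine tuning").split()
for name in DENOISERS:
    print(name, corrupt(toks, name))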
With the MoD approach, UL2 outperformed GPT-3 on the SuperGLUE benchmark in the zero-shot
setting, and in the one-shot setting, it tripled the performance of T5-XXL on the summarization task.
In the zero-shot setting, UL2-20B also outperformed T0 and T5 on the Massive Multitask Language
Understanding (MMLU) benchmark and performed well with chain-of-thought prompting and reasoning steps. When combined with FLAN instruction tuning, UL2-20B achieved
a competitive score to FLAN-PaLM 62B on MMLU and Big-Bench benchmarks. After using the
MoD objective, U-PaLM [60] achieved the same performance as PaLM-540B but with only half of its
computational budget.
benchmark. Despite not being trained on general corpora, Galactica did better than BLOOM and
OPT-175B on the Big-bench benchmark. It also achieved state-of-the-art results on PubMedQA and
MedMCQA benchmarks.
objective forms such as labeling (parts of speech tagging) or classification, whereas pretraining is
usually formalized as a next-token prediction task. One of the reasons behind the prompt-tuning
approach was to bridge this gap between pretraining and fine-tuning objectives and help in better adaptation of knowledge from pretrained models to downstream tasks. In prompt tuning, prompts are used to interact with LLMs, where a prompt is a user-provided input to which the model responds. Prompting prepends extra information for the model to condition on during the generation of the output. This extra information typically includes questions, instructions, and a few examples provided as tokens alongside the task input.
approach, in continuous prompts, since there are trainable embedding tensors, the prompt encoder can be optimized in a differentiable way. P-tuning helped augment the pre-trained model's NLU ability
by automatically searching for better prompts in the continuous space. As demonstrated in [31], the
P-tuning method improves GPTs and BERTs in both few-shot and fully-supervised settings.
Additionally, as the parameters of only prompt tokens are stored, which are less than 0.01% of the
total model parameters, the prompt tuning approach saves a significant amount of storage space. For
example, CPM-2 [34] used only 100 prompt tokens, where only 409.6K trainable parameters were to
be updated compared to the 11B parameters of fine-tuning. As demonstrated in CPM-2, except for
the Sogou-Log task, CPM-2 with prompt-tuning achieved comparable performance to the fine-tuning
approach. The total size required for gradient tensors and optimizer state tensors also significantly
decreases since, in prompt tuning, the number of parameters needed to be optimized is also much
smaller. As a result, prompt tuning can save up to 50% of GPU memory compared to fine-tuning. However, prompt tuning takes many more steps to converge and hence more time.
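A minimal sketch of prompt tuning's trainable soft-prompt mechanism is shown below, assuming a frozen embedding matrix and 100 trainable prompt vectors; the hidden size of 4096 is implied by the CPM-2 numbers quoted above (100 x 4096 = 409.6K trainable parameters), while the vocabulary size here is an arbitrary toy value.

import numpy as np

d_model, vocab_size, n_prompt = 4096, 1000, 100
rng = np.random.default_rng(0)

frozen_embeddings = rng.standard_normal((vocab_size, d_model))   # frozen model weights
prompt_embeddings = rng.standard_normal((n_prompt, d_model))     # the only trainable tensor

def build_input(token_ids):
    """Prepend the trainable soft prompt to the frozen token embeddings.
    During prompt tuning, gradients are taken only w.r.t. prompt_embeddings."""
    token_emb = frozen_embeddings[token_ids]
    return np.concatenate([prompt_embeddings, token_emb], axis=0)

inputs = build_input(np.array([17, 42, 8]))
print(inputs.shape)                                     # (103, 4096)
print("trainable parameters:", prompt_embeddings.size)  # 409600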
[36] demonstrated how p-tuning with only 4K examples provided results comparable to RoBERTa, which was fine-tuned on 150K examples. P-tuning was able to significantly enhance the robustness
of HyperCLOVA as well as the accuracy. Bloom [55] used Multitask prompted fine-tuning where it was
fine-tuned on a training mixture composed of a large set of different tasks specified through natural
language prompts. T0 and Bloom demonstrated how language models fine-tuned on a multitask
mixture of prompted datasets have strong zero-shot task generalization abilities. MemPrompt [72] is a
memory-enhanced GPT-3 that allows users to interact and improve the model without retraining. It
pairs GPT-3 with a growing memory of recorded cases where the model misunderstood the user’s
intents, along with user feedback for clarification. Such a memory allows the system to produce
enhanced prompts for any new query based on the user feedback for error correction in similar cases
in the past.
PTR [95] proposed prompt tuning with rules for many-class text classification, which applies logic rules to construct (task-specific) prompts with several sub-prompts. This enables PTR to encode prior
knowledge about tasks and classes into prompt tuning. This introduction of sub-prompts can further
alleviate the difficulty of designing templates and sets of label words. AutoPrompt [93] creates a
prompt by combining the original task inputs with a collection of trigger tokens according to a template.
The same set of trigger tokens is used for all inputs and is learned using a variant of the gradient-based
search. AutoPrompt searches for a sequence of discrete trigger words and concatenates it with each
input to elicit sentiment or factual knowledge from a masked LM. AutoPrompt elicited more accurate
factual knowledge from MLMs than the manually created prompts on the LAMA benchmark. These
results demonstrate that automatically generated prompts are a viable parameter-free alternative to
existing probing methods since prompting does not introduce large amounts of additional parameters.
In contrast with AutoPrompt, the Prefix-Tuning method optimizes continuous prefixes, which are
more expressive, and focuses on language generation tasks.
However, prompt engineering also has limitations: only a small number of examples can be used, which limits the level of control. Also, as the examples are part of the prompt, they consume the token budget.
learning performance on FLORES-101 machine translation benchmark between many language pairs.
When BloomZ [58] was fine-tuned with xP3, a multilingual task dataset of 46 languages, the model
achieved better zero-shot task generalization (than P3-trained baseline) on English and non-English
tasks. Furthermore, when xP3mt, a machine-translated multilingual dataset of xP3, was used to
fine-tune BloomZ on non-English prompts, the performance of held-out tasks with non-English
human-written prompts significantly improved. In other words, models could generalize zero-shot to tasks in languages they had never intentionally seen. So, the models learn higher-level capabilities
that are both task- and language-agnostic.
Typically, a cross-lingual dataset is used to make the model language-agnostic, and to make
it task-agnostic, a multitask dataset is required. Also, for multilingual large models, zero-shot
performance tends to be significantly lower than fine-tuned performance. So, to improve the
multilingual model's zero-shot task generalization, BloomZ [58] focused on crosslingual and multitask
fine-tuning. This enabled the model to be usable for low-resource language tasks without further
fine-tuning.
1. In the first step, supervised fine-tuning is used, where the dataset consisting of prompts along
with their desired output behavior is given as input.
2. Another dataset of comparisons between model outputs is collected, where for a given input,
labelers identify which output they would prefer using labels. This comparison data is then used
to train a Reward Model to predict human-preferred output (which model output the labelers
prefer).
3. The policy generates an output for which the reward model produces a reward. This reward is then used to update the policy, maximizing the expected reward, via the Proximal Policy Optimization (PPO) algorithm (a minimal sketch of this loop follows the list).
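A highly simplified, runnable sketch of this three-step pipeline is shown below; the toy policy, reward model, and PPO update are hypothetical stand-ins for real neural networks and a real PPO implementation.

import random

class ToyPolicy:
    def __init__(self):
        self.preferences = {}                    # prompt -> currently preferred response
    def generate(self, prompt):
        return self.preferences.get(prompt, random.choice(["A", "B"]))
    def update(self, prompt, response):
        self.preferences[prompt] = response

class ToyRewardModel:
    def __init__(self):
        self.preferred = {}
    def fit(self, comparisons):                  # (prompt, preferred, rejected) triples
        for prompt, good, _bad in comparisons:
            self.preferred[prompt] = good
    def score(self, prompt, response):
        return 1.0 if response == self.preferred.get(prompt) else 0.0

def ppo_update(policy, prompt, response, reward):
    # Stand-in for a PPO step: reinforce responses that received positive reward.
    if reward > 0:
        policy.update(prompt, response)

policy = ToyPolicy()
policy.update("greet the user", "A")             # step 1: supervised fine-tuning
rm = ToyRewardModel()
rm.fit([("greet the user", "A", "B")])           # step 2: train the reward model
for _ in range(3):                               # step 3: generate, score, update with PPO
    out = policy.generate("greet the user")
    ppo_update(policy, "greet the user", out, rm.score("greet the user", out))
print(policy.generate("greet the user"))         # settles on the preferred response "A"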
Using the RLHF approach, InstructGPT demonstrated improvement in toxicity and truthfulness over
GPT-3 and generalized well to held-out instructions. [69] applied reinforcement learning (RL) to
complex tasks defined only by human judgment, where only humans can tell whether a result is
good or bad. In [69], the pre-trained model is fine-tuned using reinforcement learning rather than
supervised learning, where it demonstrated its results on summarization and continuation tasks
by applying reward learning to language generation. [70] recursively used the RL approach to
produce novel summaries and achieve SOTA results for book-length summarization on the BookSum
dataset. Similarly, using the reinforcement learning technique, [71] trained a model to predict the
human-preferred summary and used it as a reward function to fine-tune the summarization policy.
It could outperform larger models fine-tuned using a supervised approach and human reference
summaries and generalize well to new datasets.
improved with the increased size of models and number of fine-tuning tasks. Additionally, when nine
CoT datasets were added to the instruction tuning dataset mixture, the model could perform better on
evaluation reasoning tasks. This contradicts other work where instruction-finetuning instead degraded
CoT task performance. So, [61] demonstrated how CoT data improves performance on reasoning tasks when jointly fine-tuned with an instruction dataset. After instruction tuning model classes such as T5,
PaLM, and U-PaLM, [61] observed a significant boost in performance for different types of prompting
setups (zero, few, CoT), and benchmarks as compared to the original models (without instruction
fine-tuning).
In Self-Instruct [51], the bootstrap technique is used to improve the model’s instruction
following capabilities. Here, the existing collection of instructions is leveraged to generate new
and more broad-coverage instructions. Using a language model, Self-instruct generates instructions
along with input-output samples, filters invalid, low-quality or repeated instructions, and uses
the remaining valid ones to fine-tune the original model. Along with the instructions, the
framework also creates input-output instances, which can be used to supervise the fine-tuning
of instructions. When Self-Instruct was applied to GPT-3, it achieved a 33% performance gain on SUPER-NATURALINSTRUCTIONS over the original model, which was on par with InstructGPT's performance.
single turn. CodeGEN [65] proposed an open benchmark called Multi-Turn Programming Benchmark
(MTPB), comprising 115 diverse problem sets that are factorized into multi-turn prompts. MTPB is
used to measure the models’ capacity for multi-turn program synthesis. To solve a problem in this
benchmark, a model needs to synthesize a program in multiple steps with a user who specifies the intent turn by turn in natural language.
CodeGeeX [66] is a multilingual model trained on 23 programming languages. It proposed a
HumanEval-X benchmark for evaluating multilingual models by hand-writing the solutions in C++,
Java, JavaScript, and Go. CodeGeeX was able to outperform multilingual code models of similar scale
for both the tasks of code generation and translation on HumanEval-X.
6. In-Context Learning
Fine-tuning is task-agnostic, but it uses a supervised approach during transfer learning and hence requires access to a large labeled dataset for every downstream task. Furthermore, having such a task-specific dataset leads to fine-tuning the model on a very narrow distribution, which might potentially yield poor generalization on out-of-distribution datasets. It might also be overly specific to
the distribution, exploiting spurious correlations and features of the training data. The need for such
labeled datasets limits the applicability of language models.
To overcome these limitations, In-Context Learning (ICL) was proposed in GPT-3 [29], where
the language model uses in-context information for inference. The main benefits of ICL are the
minimal need for task-specific data and the fact that it does not go through any parameter updates or
architectural modifications. In ICL, a prompt feeds the model with input-label pair examples, avoiding
the need for large labeled datasets. Unlike fine-tuning, ICL has no gradient updates, so the weights
of the model parameters are not updated. In ICL, the abilities that are developed by LLMs during
pretraining are applied to adapt to or recognize the task at inference time, enabling the model to easily
switch between many tasks.
As experimented in GPT-3, the larger model with 175B parameters outperformed the smaller
models by efficiently using in-context information. Based on the experiments conducted in GPT-3, ICL
has shown initial promise and improved out-of-domain generalization. However, the results are far
inferior to those of the fine-tuning technique. ICL helps analyze whether the model rapidly adapts to
the tasks that are unlikely to be directly contained in the training set. In ICL, the model gets conditioned
on task instruction and a couple of task demonstrations as a context and is expected to complete the
target instance of the task. As Transformer-based models are conditioned by a bounded-length context
(e.g., 2048 tokens in GPT-3), ICL cannot fully exploit data longer than the context window. Based on
the number of demonstrations provided for inference in the context window, ICL can be categorized into few-shot, one-shot, and zero-shot settings. We describe each of them below.
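The three settings differ only in how many demonstrations are placed in the context window, as the following illustrative prompts show (the task, wording, and examples are hypothetical):

task = "Classify the sentiment of the review as positive or negative."
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
query = "The acting felt wooden and the pacing dragged."

def build_prompt(n_shots):
    demos = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples[:n_shots])
    parts = [task] + ([demos] if demos else []) + [f"Review: {query}\nSentiment:"]
    return "\n".join(parts)

print(build_prompt(0))   # zero-shot: instruction and query only
print(build_prompt(1))   # one-shot: a single demonstration
print(build_prompt(2))   # few-shot: multiple demonstrations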
However, it was demonstrated in some papers, such as [81], that the examples used in the few-shot setting, the order in which the examples appeared, and the format of the prompt directly affected the accuracy. [81] demonstrated how this instability in few-shot learning stems from the
language model’s bias toward predicting specific answers. For example, the model can be biased
towards answers placed towards the end of the prompt, or those appearing frequently, or to those
familiar in the pre-trained dataset. To address this instability, [81] first estimated the model’s bias
towards each answer. It then used calibration parameters that caused the prediction for the input to be
uniform across answers. This calibration procedure improved GPT-3's and GPT-2's average accuracy by up to 30.0% on a diverse set of tasks and also reduced the variance across different prompt choices.
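A minimal sketch of this calibration idea, assuming the model exposes label probabilities (the probability values below are made up for illustration):

import numpy as np

def calibrate(label_probs, content_free_probs):
    """Contextual-calibration sketch: divide out the bias the model exhibits on a
    content-free input (e.g., "N/A") and renormalize, so that the content-free
    input would be scored uniformly across the answer options."""
    calibrated = np.asarray(label_probs) / np.asarray(content_free_probs)
    return calibrated / calibrated.sum()

# Hypothetical probabilities for the labels ["positive", "negative"]:
p_content_free = [0.7, 0.3]    # bias measured with a content-free prompt
p_test = [0.6, 0.4]            # raw prediction for a real test input

print(calibrate(p_test, p_content_free))   # ~[0.39, 0.61]: the prediction flips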
Instead of randomly sampling few-shot examples, [99] investigated effective strategies for selecting in-context learning examples judiciously, which would help in better leveraging the model's capabilities in a few-shot setting. It proposed "KATE", a non-parametric selection approach, which retrieves in-context examples that are semantically similar to the test sample. This strategy helped give more relevant and informative inputs to the model, such as GPT-3, and unleashed the knowledge the model needed to solve the problem. GPT-3's performance using KATE improved by a significant margin compared to random sampling on several NLU and NLG tasks. In
[97], the study compared how the model generalizes in few-shot fine-tuning and in-context learning settings. During the comparison, the model size, the number of examples, and the parameters used in the experiment were controlled. The results demonstrated that the fine-tuned model generalized similarly to the ICL model on out-of-domain data and that performance improved as models became larger.
model to access relevant knowledge (acquired during pretraining), which helps improve the reasoning
ability of the model.
Experiments have shown how CoT-based prompting improves reasoning-oriented tasks, such as
symbolic, commonsense, and arithmetic-based tasks. For example, when PaLM-540B was prompted
using 8 CoT examples, it surpassed fine-tuned GPT-3 to achieve SOTA performance on the GSM8K benchmark of math word problems. Similarly, Minerva [52] used the PaLM model and further fine-tuned it on technical and mathematical datasets. When Minerva was prompted with CoT examples that included step-by-step solutions, it generated a chain-of-thought answer and demarcated a final answer. Of two hundred undergraduate-level problems used for evaluation, drawn from mathematics, science, and engineering domains requiring quantitative reasoning, Minerva answered nearly a third correctly. PaLM [54] analyzed the effect of CoT prompting with model scaling and
demonstrated how CoT-based few-shot matched or outperformed state-of-the-art fine-tuned models
on various reasoning tasks.
In zero-shot chain-of-thought, with no examples, CoT reasoning can be explicitly activated by using trigger phrases such as "let's think step-by-step" or "let's think about this logically" to prompt the model to generate explanations. OPT-IML [57] used 15 reasoning datasets and studied the effects of different proportions of reasoning data on different held-out task clusters. The default decoding approach used in CoT is greedy decoding, where the most likely reasoning path is selected to solve the problem. [102] proposed a self-consistency decoding alternative, where instead of taking the greedy path, it explores different ways of solving a complex reasoning problem that lead to the unique correct answer. [102] demonstrated how adopting the self-consistency approach in CoT prompts improved performance on benchmarks of commonsense and arithmetic reasoning tasks
across four large language models with varying scales. However, this alternative does incur more
computational cost.
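A minimal sketch of self-consistency decoding is shown below; the sampling function here is a hypothetical stand-in for sampling multiple chain-of-thought completions from an LLM and extracting their final answers.

from collections import Counter
import itertools

def self_consistency(sample_fn, prompt, n_samples=5):
    """Sample several reasoning paths and return the majority final answer,
    instead of trusting a single greedy decoding."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in: pretend the model returned these final answers
# extracted from five sampled chains of thought.
fake_answers = itertools.cycle(["18", "18", "20", "18", "21"])
print(self_consistency(lambda _prompt: next(fake_answers),
                       "Q: ... math word problem ...", n_samples=5))   # "18"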
As addressed in Galactica [63], some limitations are associated with the CoT process. The CoT
process needs some few-shot examples to understand the step-by-step reasoning process, which takes
up the context space. Also, as internet data is used for pretraining, such data may have only some of
the necessary intermediate steps. Since some trivial, easy, and well-practiced steps are internally computed and memorized by humans, they may write down only the essential details or steps, as writing everything out would lead to long and tedious answers. As only the principal steps are recorded, this leads to missing data where internally computed steps are not written. As a result, more effort is required to review the datasets and explicitly inject the missing steps. Table 3 lists the fine-tuning methods used in the prominent LLM models along with additional details, such as Pretraining (PT) and Fine-Tuning (FT) batch sizes
and epochs.
7. Scalability
In recent years, transformer-based language models’ capacity has increased rapidly, from a few
million parameters to a trillion parameters. Each increase has improved the model’s language learning
abilities and downstream task performance. Recent research has demonstrated that the loss decreases as the model size increases, following a smooth trend of improvement with scale, and that scaling up LLMs improves their abilities across various tasks, significantly improving task-agnostic, few-shot performance. Scaling up has also been shown to produce better performance than more carefully engineered methods. If LLMs are sufficiently pre-trained on a large corpus, this can lead to significant performance improvements on diverse tasks. Over time, it has become evident through experiments that the performance of LLMs can steadily be improved by scaling the model size and training data and by training the model longer (increasing the training steps).
As stated in [83], new behaviors that arise due to scaling language models have been increasingly
referred to as emergent abilities, which are the abilities that are not present in smaller models but are
present (resurface/are discovered) in larger models. In other words, quantitative changes in a system
result in qualitative changes in behavior. Large language models with over 100 billion parameters have
presented attractive scaling laws where emergent zero-shot and few-shot capabilities suddenly arise [83]. As stated in [44], many of the most exciting capabilities of LLMs only emerge above a certain
number of parameters, and they have many properties that cannot be studied in smaller models.
For instance, GPT-3 with 175B parameters performed better with fewer shots (32 labeled examples)
than the fully supervised BERT-Large model on various benchmarks. Additionally, with the increase
in size, the GPT model has been effective even in zero- and few-shot settings, sometimes matching the fine-tuning performance. The experiments in [29] demonstrated that with the increase in model size,
model performance improved steadily for zero-shot and rapidly for few-shot settings. As their size increases, models tend to become more proficient and efficient at in-context learning. As highlighted in [29], the gap between zero-, one-, and few-shot performance often grows with model capacity, suggesting that larger
models are more proficient meta-learners. As demonstrated in [33], the perplexity decreases with the
increase in model capacity, training data, and computational resources. As experimented in [28], when
a large language model is trained on a sufficiently large and diverse dataset, it can perform well across
many domains and datasets.
There are various ways to scale, including using a bigger model, training the model for more
steps, and training the model on more data, which we explore below. We also look at how the scaling
up of the model and data size has affected the performance of models.
model size and the dataset for a given compute budget, it disregards the inference budget, which is
crucial since the preferred model is the one that is fastest at inference and not at training. For example,
Falcon-40B requires 70GB of GPU memory to make inferences, whereas Falcon-7B needs only 15GB,
making inference and fine-tuning accessible even on consumer hardware.
Additionally, as per [46], although it may be cheaper to train a large model to reach a certain
level of accuracy, a smaller model trained longer will be cheaper at inference. For instance, although
[79] recommended training a 10B model on 200B tokens, [46] demonstrated that the performance of
a 7B model continues to improve even after 1T tokens. Furthermore, unlike Chinchilla, PaLM, or
GPT-3, LLaMA demonstrated how it can train models and achieve SOTA performance using publicly
available datasets without relying on proprietary and inaccessible datasets. WeLM [53], a Chinese
LM, demonstrated how, by carefully cleaning, balancing, and scaling up the training data size, WeLM
could significantly outperform existing models with similar or larger sizes. For instance, on zero-shot
evaluations, it matched the performance of Ernie 3.0 Titan which is 25x larger.
memory than just storing the parameters since gradients and optimizer states are also essential for
updating the parameters.
As the largest GPUs available today have around 80GB of memory, additional space is required to
store the optimizer’s state and intermediate calculations used during backpropagation. As a result,
training must be distributed across hundreds of nodes, each with multiple GPUs, which might result
in a communication bottleneck. In order to use the nodes efficiently, different parallelization strategies,
such as data, model, and pipeline, are used to acquire large end-to-end throughput (keeping high
resource utilization on a cluster of processors).
Figure 6 demonstrates different types of parallelism techniques. Each parallelism dimension trades
computation (or communication) overheads for memory (or throughput) benefits. To acquire maximum
end-to-end throughput, a balanced composition point should be found along these dimensions. The
problem becomes more challenging when considering the heterogeneous bandwidths in a cluster of
devices. Below we discuss each of these approaches.
Figure 6. 3D Parallelism
7.5. Miscellaneous
steps. Increasing the size of T5 yields consistent improvement but comes at a significant computational
cost from Base to 11B. In contrast, with the help of the ‘textual knowledge retriever’ that REALM
uses, it outperformed the largest T5-11B model while being 30 times smaller. It is also important to
note that T5 accesses additional reading comprehension data from SQuAD during its pre-training
(100,000+ examples). Access to such data could also benefit REALM, but it was not used in REALM's experiments. Primer [50] proposed a new architecture that has a smaller training cost compared to
other transformer variants. Its improvements were attributed mainly to squaring ReLU activations
and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. As a
result, the Primer needs much less compute time to reach a target one-shot performance. For instance,
Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer.
7.5.2. Checkpoints
Parameter checkpoints are created while pretraining the model to reduce memory requirements
and speed up pre-training. These checkpoints allow researchers to quickly bootstrap and attain strong
performance on many tasks without needing to perform the expensive pretraining themselves. For
example, the checkpoints released by [26] were used to achieve SOTA results on many benchmarks.
7.5.3. Ensembling
It is common practice to train and evaluate models using an ensemble of models, as it helps to use
additional computation. [26] demonstrated how an ensemble of models provides substantially better
results than a single model, which provides an orthogonal means of leveraging additional computation.
It was observed in [26] that ensembling models that were fine-tuned from the same base pre-trained
model performed worse than pre-training and fine-tuning all models completely separately, though
fine-tune-only ensembling still substantially outperformed a single model.
8. LLM Challenges
Language models can generate biased outputs or misinformation and can be used maliciously. Large
Language Models carry potential risks such as outputting offensive language, propagating social
biases, and leaking private information. Large language models reproduce and might amplify existing
biases in the training data, generating toxic or offensive content. During training, as the language model absorbs the biases and toxicity expressed in the text, it becomes prone to replicating them. This occurs because of the significant presence of unmoderated social media discussions in the pre-training dataset (a presence that, on the other hand, also provides inductive bias that aids toxicity classification tasks). This growing field aims to
build benchmarks to concretely evaluate these harms, particularly around stereotypes, social bias,
and toxicity. Making progress on these issues is challenging since well-intended intervention might
have side effects on LM behavior. Below, we discuss the prominent benchmarks that are proposed to
overcome these LLM challenges.
8.1. Biases
Human biases and undesired social stereotypes exist in large pre-trained language models. One
such bias is the gender and occupation bias. The Winogender benchmark [108] is a coreference
resolution task and measures gender bias in English occupation nouns such as "nurse" and "engineer."
Winogender tests whether a model can correctly determine if a pronoun refers to different occupation
words. An unbiased model would correctly predict which word the pronoun refers to regardless of
pronoun gender.
WinoGender is made of Winograd schema, and biases are evaluated by determining if the gender
of the pronoun impacts a model coreference resolution performance. More precisely, each sentence has
three mentions: an "occupation", a "participant", and a "pronoun," where the pronoun is coreferencing
either the occupation or the participant. The model is prompted to determine the coreference relation, and it is measured whether it does so correctly according to the context of the sentence. The goal is to reveal whether the
model has captured societal biases associated with occupations. For example, doctors are more likely
to associate with the pronoun ‘he’, while nurses are more likely to associate with the pronoun ‘she’.
GLaM was able to close the performance gap between stereotypical and anti-stereotypical examples
on the WinoGender benchmark, suggesting that large, sparsely activated models may rely less on
superficial statistical correlations.
Gopher [78] studies gender and occupation bias via two different evaluations. First, it measures
the probability of gendered words following different occupation contexts. Second, it evaluates
the Winogender coreference resolution dataset, where similar coreference accuracy across different
pronouns indicates less gender bias. To measure "Gender Word Probability" or how probable different
gender words are in different occupation contexts, Gopher inputs an occupation prompt like "The
occupation was a" into the model and computes a gender bias metric by comparing the probabilities of
the prompt being followed by either male or female gendered terms. It was observed in PaLM that
accuracy on the Winogender coreference task improves with model scale, and PaLM 540B sets a new
state-of-the-art result in 1-shot and few-shot settings. Secondly, co-occurrence analysis performed on
race/religion/gender prompt continuation demonstrates the model’s potential to affirm stereotypes
falsely.
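A minimal sketch of the gender-word probability measurement described above is shown below; the probability function is a hypothetical stand-in for querying a language model for next-token probabilities.

import math

MALE_WORDS = ["he", "him", "his", "man"]
FEMALE_WORDS = ["she", "her", "hers", "woman"]

def gender_bias(next_token_prob, occupation):
    """Compare the probability mass assigned to male vs. female gendered
    continuations of an occupation prompt, as a simple bias indicator."""
    prompt = f"The {occupation} was a"
    p_male = sum(next_token_prob(prompt, w) for w in MALE_WORDS)
    p_female = sum(next_token_prob(prompt, w) for w in FEMALE_WORDS)
    return math.log(p_male / p_female)   # > 0: skewed male, < 0: skewed female

# Hypothetical stand-in for a model's next-token probabilities.
fake_probs = {("The nurse was a", "she"): 0.20, ("The nurse was a", "he"): 0.05}
prob = lambda prompt, word: fake_probs.get((prompt, word), 0.01)
print(gender_bias(prob, "nurse"))        # negative: the continuation skews female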
There have been attempts such as the ETHOS dataset, which helps measure LLMs' ability to identify whether certain English statements are racist, sexist, or neither. Furthermore, CrowS-Pairs [109] is a crowdsourced benchmark aiming to measure intrasentence-level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences, a stereotype and an anti-stereotype, regarding a particular group, to measure the model's preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model. Additionally, StereoSet [110] is another dataset used to measure stereotypical bias
across four categories: profession, gender, religion, and race. In addition to intrasentence measurement
(similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model's
ability to incorporate additional context. To account for a potential trade-off between bias detection
and language modeling capability, StereoSet includes two metrics: Language Modeling Score (LMS)
and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test
score (ICAT).
failures, usually in the form of apologizing or recognizing its mistake. In contrast, Safety Bench Unit
Tests measure how unsafe a model’s response is, stratified across four levels of topic sensitivity: Safe,
Realistic, Unsafe, and Adversarial.
8.3. Hallucination
LLMs are said to hallucinate when they generate information that is fake or incorrect. The
hallucination can either be intrinsic or extrinsic. In intrinsic hallucination, the model generates
information that contradicts the content of the source text. In extrinsic hallucination, in contrast, the generated content can neither be contradicted nor supported by the source text. There are various reasons
why a model can hallucinate or generate fake information during inference. For instance, if the
model misunderstands the information or facts given in the source text, it can lead the model to
hallucinate. So to be truthful, the model should have reasoning ability to correctly understand the
information from the source text. The other reason why LLMs can generate false information is
when the provided contextual information conflicts with the parametric knowledge acquired during
pretraining. Additionally, it is observed that models have parametric knowledge bias, where the model
gives more importance to the knowledge acquired during pretraining over the provided contextual
information.
Also, teacher-forcing is used during pretraining, where the decoder is conditioned on the
ground-truth prefix sequences to predict the next token. However, such a teacher-forcing technique
is missing during the inference, and such discrepancy can also make a model hallucinate. Several
techniques have been proposed to detect hallucinations in LLMs, such as:
1. sampling not one but multiple outputs and checking the information consistency between them to determine which statements are factual and which are hallucinated;
2. validating the correctness of the model output by relying on an external knowledge source;
3. checking whether the generated Named Entities or <subject, relation, object> tuples appear in the ground-truth knowledge source (a minimal sketch of this check is shown after the list).
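As a minimal sketch of the third technique, the following uses a naive capitalization-based entity extractor (a stand-in for a proper NER model) and flags generated entities that never appear in the ground-truth source:

import re

def naive_entities(text):
    """Toy stand-in for an NER model: treat capitalized words as candidate
    named entities."""
    return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))

def unsupported_entities(generated, source):
    """Entities in the generated text that never appear in the source are
    flagged as potential (extrinsic) hallucinations."""
    return naive_entities(generated) - naive_entities(source)

source = "Marie Curie won the Nobel Prize in Physics in 1903."
generated = "Marie Curie won the Nobel Prize in Physics in 1903 at Oxford."
print(unsupported_entities(generated, source))   # {'Oxford'} -> flag for review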
Benchmarks such as TruthfulQA [103] have been developed to measure the truthfulness of language
models. This benchmark can evaluate the risk of a model generating misinformation or false claims that mimic popular misconceptions and false beliefs. It was observed in [103] that, generally, the largest models were the least truthful; scaling up the model size increased performance but was less promising for improving the model's truthfulness.
1. Computational cost for pre-training: a super large model requires several weeks of pre-training
with thousands of GPUs.
2. Storage cost for fine-tuned models: a large language model usually takes hundreds of gigabytes
(GBs) to store, and as many model copies as the number of downstream tasks need to be stored.
3. Equipment cost for inference: it is expected to use multiple GPUs to infer a large language model.
So, as the model size increases, they become hard to use with limited computational resources and
unaffordable for most researchers and institutions.
Furthermore, the pre-training phase of large language models consumes massive amounts of energy, responsible for carbon dioxide emissions. The formulas used in LLaMA to estimate the Watt-hours (Wh) and carbon emissions are listed in equations 6 and 7, where 0.385 in equation 7 is the US national average carbon intensity factor (0.385 kgCO2eq/KWh) and PUE represents Power Usage Effectiveness.
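Consistent with this description, the estimates in [46] take the following form (energy is computed from GPU-hours and per-GPU power draw, then converted to tonnes of CO2 equivalent with the 0.385 factor):

\text{Wh} = \text{GPU-hours} \times \text{GPU power consumption} \times \text{PUE}    (6)

\text{tCO}_2\text{eq} = \text{MWh} \times 0.385    (7)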
As stated in LLaMA [46], carbon emission also depends on the data center’s location used during
pre-training of the network. For instance, BLOOM uses a grid that emits 0.057kgCO2eq/KWh, leading
to 27tCO2eq, and OPT uses a grid that emits 0.231kgCO2eq/KWh, leading to 82tCO2eq. As stated in
[104], a couple of factors are involved in computing the Electricity required to run an NLP model,
such as algorithm, program, number of processors running the program, speed and power of those
processors, a data center’s efficiency in delivering power and cooling the processors, and the energy
supply mix (renewable, gas, coal). Cloud data centers can also be 1.4-2X more energy efficient than typical data centers. A more detailed and granular formula, stated in equation 8, was presented in [104] to capture the carbon footprint of an NLP model.
To decrease the footprint of training, an ML researcher should pick the DNN model, the processor,
and the datacenter carefully. The above equations 6 and 7 can be restated in terms of energy
consumption and CO2 emission as equations 9 and 10 below.
To address the cost and carbon footprint problems, there is a need to improve the energy efficiency
of algorithms, data centers, software, and hardware involved in implementing NLP models. Emphasis
should be given to reducing carbon footprint by building more efficient LLMs. For example, OPT [45]
is comparable to GPT-3, and requires only 1/7th of the carbon footprint to develop.
[104] also recommends three suggestions that could eventually help reduce the CO2e footprint.
As highlighted in [104], large but sparsely activated DNNs can consume < 1/10th the energy of large,
dense DNNs without sacrificing accuracy despite using as many or even more parameters.
large-scale data mining and monolingual data pipelines to consolidate data found across the web. The
latter techniques are often plagued by noise and biases, making it difficult to validate the quality of the
created datasets. Finally, they also require high-quality evaluation datasets or benchmarks to test the
models. NLLB [67] has strived to understand the low-resource translation problem from the perspective of native speakers and studies how to automatically create training data to move low-resource languages towards high-resource status. It proposed the Flores-200 many-to-many benchmark
dataset that doubled the language coverage of a previous effort known as Flores-101. Flores-200 is
created using professional human translators who translate the FLORES source dataset into the target
language, where a separate group of reviewers perform quality assessments and provide translation
feedback to the translators.
9.2. Fairness
Bias and fairness, if not adequately addressed, have serious societal implications in the form of biased language generation and its impact on certain segments of society. Bias can creep into LLMs from several sources, discussed below. The first source, dataset bias, stems from the datasets used to train the LLMs. If the datasets contain biases related to race, gender, religion, or socioeconomic status, the models inherit and amplify them.
Underrepresentation or misrepresentation of certain groups in the training data can lead to
representation bias and biased language generation. The LLM developers should have checks and
balances to ensure that all perspectives are adequately represented in the datasets. Otherwise,
the model will produce inaccurate or skewed output for underrepresented groups. If the training
data contains stereotypes, models amplify stereotyping and perpetuate prejudices. Fairness across
demographics is a complex challenge but essential for advancing LLMs.
Contextual bias stems from the context in which the language models are used. This has severe, negative implications in applications such as recommender systems, employee hiring and promotions, clustering, and sentiment analysis. The model evaluation metrics and benchmarks used
in traditional machine learning are inadequate to capture bias in LLMs. Comprehensive evaluation
methods are needed to consider various aspects of bias in LLMs. A multi-faceted approach is required
to address bias and fairness issues in LLMs. Approaches to data curation, model development,
evaluation strategies, and ethical issues need to be reexamined for their suitability for the LLMs.
Mitigating biases in the datasets using debiasing approaches such as modifying loss functions, altering
training data distributions, and adversarial training requires LLM-contextualized research.
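As a concrete illustration of one data-side intervention, the following is a minimal, hypothetical sketch of counterfactual data augmentation, a simple way to alter the training data distribution by pairing each example with a gender-swapped copy; production pipelines use far richer term lists and handle morphology, names, and context, which this toy example does not.

# Toy example: augment a corpus with gender-swapped counterfactuals so that both
# variants appear equally often in the training distribution.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man"}

def counterfactual(sentence):
    # Extremely simplified: lowercase, word-level swaps only.
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.lower().split())

corpus = ["He is a brilliant engineer .", "She stayed home with the kids ."]
augmented = corpus + [counterfactual(s) for s in corpus]
print(augmented)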
The availability of benchmarks and evaluation metrics contextualized to adversarial attacks helps in comparing the effectiveness of different models and techniques. The techniques mentioned above originally came from the traditional machine learning domain, and research is needed to adapt them to the LLM context. Moreover, research is needed to develop new approaches for handling adversarial attacks, given the unique characteristics of LLMs.
Techniques such as pruning and sparsity induction, which identify and eliminate redundant or less significant parameters, contribute to creating leaner models. Transfer learning and few-shot learning methods reduce the need for extensive training of LLMs on new tasks or domains; advances in this area can significantly reduce energy requirements through better model generalization with less training. Energy consumption can also be optimized by employing energy-aware training and inference strategies, including adaptive precision tuning, dynamic pruning, and model scaling.
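To make the sparsity-induction idea concrete, the following is a minimal, illustrative sketch of global magnitude pruning, one common way to remove the least significant parameters; it is a hypothetical example, not the method of any particular system discussed above.

import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude entries so roughly `sparsity` of them are removed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]      # k-th smallest absolute value
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024))
W_sparse, mask = magnitude_prune(W, sparsity=0.9)
print(f"non-zero weights kept: {mask.mean():.1%}")  # roughly 10%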
Quantization of model weights and compression schemes reduce the computational overhead of LLMs; knowledge distillation likewise helps decrease a model's memory and computational requirements. Research is needed in lifecycle assessment and environmental impact to inform researchers and to provide guidelines and best practices for developing and using LLMs. Such research would document the environmental impact of LLMs by quantifying their carbon footprint and suggesting ways to reduce it. Data center efficiency is
pivotal in developing LLMs and deploying downstream applications. Supporting data center efficiency
initiatives, including renewable energy sources, is critical. Lastly, collaboration between academia,
industry, and policymakers is needed to share best practices, application frameworks, and tools for
energy-aware LLMs.
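To make the quantization point above concrete, the following is a minimal sketch of symmetric per-tensor int8 post-training quantization of a weight matrix (an illustrative example, not the scheme used by any particular model discussed here); storing int8 codes plus a single scale cuts memory roughly 4x relative to float32.

import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 codes plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, s = quantize_int8(W)
max_err = np.abs(W - dequantize(q, s)).max()
print(f"{W.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, max abs error {max_err:.2e}")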
LLMs can be used to support indigenous communities by providing tools that assist in preserving their
languages and traditions. These activities help to preserve linguistic heritage that might otherwise be
lost.
LLMs can translate between high-resource and low-resource languages, making the information
more accessible and fostering communication across linguistic barriers. Also, LLMs can be used
to support language revitalization efforts by providing language learning resources and generating
teaching materials. Furthermore, LLMs will aid in developing language-learning applications for
low-resource and endangered languages. LLMs will provide language researchers with advanced
tools and resources for linguistic analysis, corpus creation, and comparative studies on a scale
that was infeasible before. They will also foster collaborative language preservation by facilitating collective work and communication across language barriers. Finally, LLMs will facilitate
technology democratization by developing inclusive technologies to communicate with users in their
native languages and cultural contexts.
of the model or specific examples. Task-agnostic representations help LLMs learn more generalized features that transfer across different tasks; they also aid continual learning, as models can adapt to new tasks without drastic retraining.
Regularization methods encourage model parameters to remain stable and selectively update
them for new information, which aids in continual learning. For example, elastic weight consolidation
(EWC) and synaptic intelligence help models retain learned information. As noted earlier,
meta-learning and few-shot learning approaches enable models to adapt quickly to new tasks or
domains with minimal data. Fine-tuning the models on new data while leveraging pre-trained
representations helps in adaptation. Another approach to adaptation is through ensemble models,
which combine learning paradigms such as episodic memory systems and continual learning
techniques.
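As a brief illustration, the following is a minimal sketch (written in PyTorch) of the elastic weight consolidation penalty mentioned above, with a placeholder Fisher estimate; it shows the shape of the regularizer rather than a complete continual-learning setup.

import torch

def ewc_penalty(model, ref_params, fisher, lam=1.0):
    """EWC regularizer: (lam/2) * sum_i F_i * (theta_i - theta_i_ref)^2."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage sketch: snapshot parameters (and a Fisher estimate) after task A,
# then add the penalty to the loss while fine-tuning on task B.
model = torch.nn.Linear(16, 4)
ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder estimate

x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y) + ewc_penalty(model, ref_params, fisher, lam=10.0)
loss.backward()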
10. Conclusions
This paper comprehensively studied different types of architecture, masking techniques, and
phases that go into building Language Models. It explained in detail how Language Models have transitioned from task-and-language-specific to task-and-language-agnostic designs. It also looked at LLMs through the lens of scalability and compared them based on parameters such as network depth, width, hardware, objectives, datasets, and corpus size used during pre-training. It elucidated different in-context learning, pre-training, and transfer learning strategies, along with their advantages, disadvantages, and the applications or scenarios in which they perform best. It also comprehensively analyzed different ways to scale models and incorporate parallelism to make them compute-efficient. Finally, it shed light on future directions and challenges encountered in LLMs. Overall, the article empirically
compared existing trends and techniques and comprehensively analyzed where the field of LLMs
currently stands.
References
1. Harris, Z. S. Distributional structure, Word, 10(2-3), pp. 146-162, 1954.
2. Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., Mercer, R. L., and Roossin,
P. S. A statistical approach to machine translation,Computational linguistics, 16(2), pp. 79-85, 1990.
3. Salton, G., and Lesk, M. E. Computer evaluation of indexing and text processing,Journal of the ACM (JACM),
15(1), pp. 8-36, 1968.
4. Jones, K. S. A statistical interpretation of term specificity and its application in retrieval,Journal of documentation,
1972.
5. Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing,Communications of the ACM,
18(11), pp. 613-620, 1975.
6. Tang, B., Shepherd, M., Milios, E., and Heywood, M. I. Comparing and combining dimension reduction techniques
for efficient text clustering,In Proceeding of SIAM international workshop on feature selection for data mining,
pp. 17-26, 2005.
7. Hyvärinen, A., and Oja, E. Independent component analysis: algorithms and applications,Neural networks, 13(4-5),
pp. 411-430, 2000.
8. Vilnis, L., and McCallum, A. Word representations via gaussian embedding,arXiv preprint arXiv:1412.6623, 2014.
9. Athiwaratkun, B.,and Wilson, A. G. Multimodal word distributions,arXiv preprint arXiv:1704.08424, 2017.
10. Le, Q., and Mikolov, T. Distributed representations of sentences and documents,In International conference on
machine learning, pp. 1188-1196, PMLR, 2014.
11. Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space,arXiv
preprint arXiv:1301.3781, 2013.
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases
and their compositionality,Advances in neural information processing systems, 26, 2013.
13. Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation,In Proceedings of
the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014.
14. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword
information,Transactions of the association for computational linguistics, 5, 135-146, 2017.
15. Melamud, O., Goldberger, J., & Dagan, I. context2vec: Learning generic context embedding with bidirectional
lstm,In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp. 51-61,
2016.
16. McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors,Advances
in neural information processing systems, 30, 2017.
17. Peters, M. E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., and Zettlemoyer L. Deep Contextualized
Word Representations,In Proceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Vol 1, pages 2227–2237, 2018.
18. Howard, J., & Ruder, S. Universal language model fine-tuning for text classification,arXiv preprint
arXiv:1801.06146, 2018.
19. Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., & Hon, H. W. Unified language model
pre-training for natural language understanding and generation,Advances in Neural Information Processing
Systems, 32, 2019.
20. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. Learning
phrase representations using RNN encoder-decoder for statistical machine translation,arXiv preprint arXiv:1406.1078,
2014.
21. Sutskever, I., Vinyals, O., & Le, Q. V. Sequence to sequence learning with neural networks,Advances in neural
information processing systems, 27, 2014.
22. Bahdanau, D., Cho, K., & Bengio, Y. Neural machine translation by jointly learning to align and translate,arXiv
preprint arXiv:1409.0473, 2014.
23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I.
Attention is all you need,Advances in neural information processing systems, 30, 2017.
24. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. Improving language understanding by generative pre-training, URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
25. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for
language understanding,arXiv preprint arXiv:1810.04805, 2018.
26. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. Exploring the
limits of transfer learning with a unified text-to-text transformer,The Journal of Machine Learning Research, 21(1),
pp. 5485-5551, 2020.
27. Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. Retrieval augmented language model pre-training,In
International conference on machine learning, pp. 3929-3938, 2020, November, PMLR.
28. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. Language models are unsupervised multitask
learners,OpenAI blog, 1(8), 9, 2019.
29. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A. & Agarwal, S. Language models are few-shot learners,Advances in neural information processing
systems, 33, pp.1877-1901, 2020.
30. Lieber, O., Sharir, O., Lenz, B., & Shoham, Y. Jurassic-1: Technical details and evaluation,White Paper. AI21 Labs,
1, 2021.
31. Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z. & Tang, J., GPT understands, too,AI Open, 2023.
32. Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A. and Raffel, C., mT5: A
massively multilingual pre-trained text-to-text transformer,arxiv preprint arXiv:2010.11934, 2020.
33. Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., ... & Tian, Y. Pangu-α: Large-scale autoregressive
pretrained Chinese language models with auto-parallel computation, arXiv preprint arXiv:2104.12369, 2021.
34. Zhang, Z., Gu, Y., Han, X., Chen, S., Xiao, C., Sun, Z., Yao, Y., Qi, F., Guan, J., Ke, P. and Cai, Y., Cpm-2:
Large-scale cost-effective pre-trained language models,AI Open, 2, pp.216-224, 2021.
35. Wu, S., Zhao, X., Yu, T., Zhang, R., Shen, C., Liu, H., Li, F., Zhu, H., Luo, J., Xu, L. and Zhang, X., Yuan 1.0:
Large-scale pre-trained language model in zero-shot and few-shot learning,arxiv preprint arXiv:2110.04725, 2021.
36. Kim, B., Kim, H., Lee, S.W., Lee, G., Kwak, D., Jeon, D.H., Park, S., Kim, S., Kim, S., Seo, D. and Lee, H.,
What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative
pretrained transformers,arxiv preprint arXiv:2109.04650, 2021.
37. Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M. and Le, Q.V., Finetuned language
models are zero-shot learners,arxiv preprint arXiv:2109.01652, 2021.
38. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray,
A. and Schulman, J., Training language models to follow instructions with human feedback,Advances in Neural
Information Processing Systems, 35, pp.27730-27744, 2022.
39. Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.T., Jin, A., Bos, T., Baker, L.,
Du, Y. and Li, Y., Lamda: Language models for dialog applications,arxiv preprint arXiv:2201.08239, 2022.
40. Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z. and Tang, J., Glm: General language model pretraining with
autoregressive blank infilling,arXiv preprint arXiv:2103.10360, 2021.
41. Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X. and Tam, W.L.,
Glm-130b: An open bilingual pre-trained model,arxiv preprint arXiv:2210.02414, 2022.
42. Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., Ashok, A., Dhanasekaran,
A.S., Naik, A., Stap, D. and Pathak, E., Super-naturalinstructions: Generalization via declarative instructions on
1600+ nlp tasks,arxiv preprint arXiv:2204.07705, 2022.
43. Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja,
A. and Dey, M., Multitask prompted training enables zero-shot task generalization,arxiv preprint arXiv:2110.08207,
2021.
44. Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang,
J. and Pieler, M., Gpt-neox-20b: An open-source autoregressive language model,arxiv preprint arXiv:2204.06745,
2022.
45. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V. and
Mihaylov, T., Opt: Open pre-trained transformer language models, URL https://arxiv.org/abs/2205.01068, 2022.
46. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N.,
Hambro, E., Azhar, F. and Rodriguez, A., Llama: Open and efficient foundation language models,arxiv preprint
arXiv:2302.13971, 2023.
47. Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O. and
Zoph, B., Glam: Efficient scaling of language models with mixture-of-experts,In International Conference on
Machine Learning, pp. 5547-5569, PMLR, June, 2022.
48. Soltan, S., Ananthakrishnan, S., FitzGerald, J., Gupta, R., Hamza, W., Khan, H., Peris, C., Rawls, S.,
Rosenbaum, A., Rumshisky, A. and Prakash, C.S., Alexatm 20b: Few-shot learning using a large-scale multilingual
seq2seq model,arxiv preprint arXiv:2208.01448, 2022.
49. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N.,
Brockman, G. and Ray, A., Evaluating large language models trained on code,arxiv preprint arXiv:2107.03374,
2021.
50. So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N., & Le, Q. V. Primer: Searching for efficient transformers for
language modeling,arxiv preprint arXiv:2109.08668, 2021.
51. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D. and Hajishirzi, H., Self-instruct: Aligning
language model with self generated instructions,arXiv preprint arXiv:2212.10560, 2022.
52. Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C.,
Schlag, I., Gutman-Solo, T. and Wu, Y., Solving quantitative reasoning problems with language models,Advances
in Neural Information Processing Systems, 35, pp.3843-3857, 2022.
53. Su, H., Zhou, X., Yu, H., Chen, Y., Zhu, Z., Yu, Y., & Zhou, J. Welm: A well-read pre-trained language model for
chinese,arxiv preprint arXiv:2209.10372, 2022.
54. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton,
C., Gehrmann, S. and Schuh, P., Palm: Scaling language modeling with pathways,arxiv preprint arXiv:2204.02311,
2022.
55. Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gallé, M.
and Tow, J., Bloom: A 176b-parameter open-access multilingual language model,arxiv preprint arXiv:2211.05100,
2022.
56. Tay, Y., Dehghani, M., Tran, V.Q., Garcia, X., Wei, J., Wang, X., Chung, H.W., Bahri, D., Schuster, T., Zheng,
S. and Zhou, D., Ul2: Unifying language learning paradigms,In The Eleventh International Conference on
Learning Representations September, 2022.
57. Iyer, S., Lin, X.V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P.S. and
Li, X., Opt-iml: Scaling language model instruction meta learning through the lens of generalization,arxiv preprint
arXiv:2212.12017, 2022.
58. Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T.L., Bari, M.S., Shen, S., Yong,
Z.X., Schoelkopf, H. and Tang, X., Crosslingual generalization through multitask finetuning,arxiv preprint
arXiv:2211.01786, 2022.
59. Lin, X.V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J. and
Pasunuru, R., Few-shot Learning with Multilingual Generative Language Models,In Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, (pp. 9019-9052), 2022.
60. Tay, Y., Wei, J., Chung, H.W., Tran, V.Q., So, D.R., Shakeri, S., Garcia, X., Zheng, H.S., Rao, J., Chowdhery, A.
and Zhou, D., Transcending scaling laws with 0.1% extra compute,arxiv preprint arXiv:2210.11399, 2022.
61. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S.
and Webson, A., Scaling instruction-finetuned language models,arxiv preprint arXiv:2210.11416, 2022.
62. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D., Goodman, S., Grycner,
A., Mustafa, B., Beyer, L. and Kolesnikov, A., Pali: A jointly-scaled multilingual language-image model, URL https://arxiv.org/abs/2209.06794, 2022.
63. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V. and
Stojnic, R., Galactica: A large language model for science,arxiv preprint arXiv:2211.09085, 2022.
64. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal
Lago, A. and Hubert, T., Competition-level code generation with alphacode,Science, 378(6624), pp.1092-1097, 2022.
65. Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S. and Xiong, C., Codegen: An open
large language model for code with multi-turn program synthesis,arxiv preprint arXiv:2203.13474, 2022.
66. Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Wang, Z., Shen, L., Wang, A., Li, Y. and Su, T.,
Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x,arxiv preprint
arXiv:2303.17568, 2023.
67. Costa-jussà, M.R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht,
D., Maillard, J. and Sun, A., No language left behind: Scaling human-centered machine translation,arxiv preprint
arXiv:2207.04672, 2022.
68. Biderman, S., Schoelkopf, H., Anthony, Q.G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M.A., Purohit, S.,
Prashanth, U.S., Raff, E. and Skowron, A., Pythia: A suite for analyzing large language models across training and
scaling,In International Conference on Machine Learning, pp. 2397-2430, July, 2023, PMLR.
69. Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano, P. and Irving, G.,
Fine-tuning language models from human preferences,arXiv preprint arXiv:1909.08593, 2019.
70. Wu, J., Ouyang, L., Ziegler, D.M., Stiennon, N., Lowe, R., Leike, J. and Christiano, P., Recursively summarizing
books with human feedback,arXiv preprint arXiv:2109.10862, 2021.
71. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano,
P.F., Learning to summarize with human feedback,Advances in Neural Information Processing Systems, 33,
pp.3008-3021, 2020.
72. Madaan, A., Tandon, N., Clark, P. and Yang, Y., Memory-assisted prompt editing to improve gpt-3 after
deployment,arXiv preprint arXiv:2201.06009, 2022.
73. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J., Outrageously large neural
networks: The sparsely-gated mixture-of-experts layer,arXiv preprint arXiv:1701.06538, 2017.
74. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N. and Chen, Z., GShard:
Scaling giant models with conditional computation and automatic sharding,arxiv preprint arXiv:2006.16668,2020.
75. Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N. and Fedus, W., St-moe: Designing stable
and transferable sparse expert models,arXiv preprint arXiv:2202.08906, 2022.
76. Fedus, W., Zoph, B. and Shazeer, N., Switch transformers: Scaling to trillion parameter models with simple and
efficient sparsity,The Journal of Machine Learning Research, 23(1), pp.5232-5270, 2022.
77. Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X.V., Du, J., Iyer, S., Pasunuru,
R. and Anantharaman, G., Efficient large scale language modeling with mixtures of experts,arXiv preprint
arXiv:2112.10684, 2021.
78. Rae, J.W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R.,
Young, S. and Rutherford, E., Scaling language models: Methods, analysis & insights from training gopher,arxiv
preprint arXiv:2112.11446, 2021.
79. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks,
L.A., Welbl, J., Clark, A. and Hennigan, T., Training compute-optimal large language models,arXiv preprint
arXiv:2203.15556, 2022.
80. Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and
Amodei, D., Scaling laws for neural language models,arXiv preprint arXiv:2001.08361, 2020.
81. Zhao, Z., Wallace, E., Feng, S., Klein, D. and Singh, S., Calibrate before use: Improving few-shot performance of
language models,In International Conference on Machine Learning, pp. 12697-12706. PMLR, 2021.
82. Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H.W., Narang, S., Yogatama, D., Vaswani,
A. and Metzler, D., Scale efficiently: Insights from pre-training and fine-tuning transformers,arXiv preprint
arXiv:2109.10686, 2021.
83. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler,
D. and Chi, E.H., Emergent abilities of large language models,arXiv preprint arXiv:2206.07682, 2022.
84. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M. and Liu, Q., ERNIE: Enhanced language representation with
informative entities,arXiv preprint arXiv:1905.07129, 2019.
85. Peters, M.E., Neumann, M., Logan IV, R.L., Schwartz, R., Joshi, V., Singh, S. and Smith, N.A., Knowledge
enhanced contextual word representations,arXiv preprint arXiv:1909.04164, 2019.
86. Xiong, W., Du, J., Wang, W.Y. and Stoyanov, V., Pretrained encyclopedia: Weakly supervised knowledge-pretrained
language model,arXiv preprint arXiv:1912.09637, 2019.
87. Zhou, W., Lee, D.H., Selvam, R.K., Lee, S., Lin, B.Y. and Ren, X., Pre-training text-to-text transformers for
concept-centric common sense,arXiv preprint arXiv:2011.07956, 2020.
88. Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J. and Tang, J., KEPLER: A unified model for knowledge
embedding and pre-trained language representation,Transactions of the Association for Computational Linguistics,
9, pp.176-194, 2021.
89. Sun, T., Shao, Y., Qiu, X., Guo, Q., Hu, Y., Huang, X. and Zhang, Z., Colake: Contextualized language and
knowledge embedding,arXiv preprint arXiv:2010.00309, 2020.
90. Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Cao, G., Jiang, D. and Zhou, M., K-adapter: Infusing
knowledge into pre-trained models with adapters,arXiv preprint arXiv:2002.01808, 2020.
91. Sun, Y., Wang, S., Feng, S., Ding, S., Pang, C., Shang, J., Liu, J., Chen, X., Zhao, Y., Lu, Y. and Liu, W.,
Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation,arxiv preprint
arXiv:2107.02137, 2021.
92. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and
Gelly, S., Parameter-efficient transfer learning for NLP,In International Conference on Machine Learning, pp.
2790-2799, PMLR, 2019.
93. Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E. and Singh, S., Autoprompt: Eliciting knowledge from language
models with automatically generated prompts,arXiv preprint arXiv:2010.15980, 2020.
94. Li, X.L. and Liang, P., Prefix-tuning: Optimizing continuous prompts for generation,arXiv preprint
arXiv:2101.00190, 2021.
95. Han, X., Zhao, W., Ding, N., Liu, Z. and Sun, M., Ptr: Prompt tuning with rules for text classification,AI Open, 3,
pp.182-192, 2022.
96. Lester, B., Al-Rfou, R. and Constant, N., The power of scale for parameter-efficient prompt tuning,arXiv preprint
arXiv:2104.08691, 2021.
97. Mosbach, M., Pimentel, T., Ravfogel, S., Klakow, D. and Elazar, Y., Few-shot Fine-tuning vs. In-context Learning:
A Fair Comparison and Evaluation,arXiv preprint arXiv:2305.16938, 2023.
98. Wang, T., Roberts, A., Hesslow, D., Le Scao, T., Chung, H.W., Beltagy, I., Launay, J. and Raffel, C., What
language model architecture and pretraining objective works best for zero-shot generalization?,In International
Conference on Machine Learning, pp. 22964-22984. PMLR, 2022.
99. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L. and Chen, W., What Makes Good In-Context Examples for
GPT-3?,arXiv preprint arXiv:2101.06804, 2021.
100. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M.,
Zettlemoyer, L. and Stoyanov, V., Unsupervised cross-lingual representation learning at scale,arXiv preprint
arXiv:1911.02116, 2019.
101. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., Chain-of-thought
prompting elicits reasoning in large language models,Advances in Neural Information Processing Systems, 35,
pp.24824-24837, 2022.
102. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A. and Zhou, D., Self-consistency
improves chain of thought reasoning in language models,arXiv preprint arXiv:2203.11171, 2022.
103. Lin, S., Hilton, J. and Evans, O., Truthfulqa: Measuring how models mimic human falsehoods,arXiv preprint
arXiv:2109.07958, 2021.
104. Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M. and Dean, J.,
Carbon emissions and large neural network training,arXiv preprint arXiv:2104.10350, 2021.
105. Gehman, S., Gururangan, S., Sap, M., Choi, Y. and Smith, N.A., Realtoxicityprompts: Evaluating neural toxic
degeneration in language models,arXiv preprint arXiv:2009.11462, 2020.
106. Ung, M., Xu, J. and Boureau, Y.L., Saferdialogues: Taking feedback gracefully after conversational safety
failures,arXiv preprint arXiv:2110.07518, 2021.
107. Dinan, E., Abercrombie, G., Bergman, A.S., Spruit, S., Hovy, D., Boureau, Y.L. and Rieser, V., Anticipating
safety issues in e2e conversational ai: Framework and tooling,arXiv preprint arXiv:2107.03451, 2021.
108. Rudinger, R., Naradowsky, J., Leonard, B. and Van Durme, B., Gender bias in coreference resolution,arXiv
preprint arXiv:1804.09301, 2018.
109. Nangia, N., Vania, C., Bhalerao, R. and Bowman, S.R., CrowS-pairs: A challenge dataset for measuring social
biases in masked language models,arXiv preprint arXiv:2010.00133, 2020.
110. Nadeem, M., Bethke, A. and Reddy, S., StereoSet: Measuring stereotypical bias in pretrained language models,arXiv
preprint arXiv:2004.09456, 2020.
111. Levine, Y., Wies, N., Sharir, O., Bata, H. and Shashua, A., Limits to depth efficiencies of self-attention,Advances
in Neural Information Processing Systems, 33, pp.22640-22651, 2020.
112. Lester, B., Al-Rfou, R. and Constant, N., The power of scale for parameter-efficient prompt tuning,arXiv preprint
arXiv:2104.08691, 2021.