0% found this document useful (0 votes)

127 views67 pages

An Overview of Large Language Models For Statisticians

This document provides an overview of Large Language Models (LLMs) and their significance in artificial intelligence, highlighting their capabilities in tasks like text generation and reasoning. It emphasizes the need for statisticians to engage with LLMs to address challenges related to uncertainty quantification, interpretability, and fairness, thereby fostering collaboration between AI and statistics. The paper aims to explore how statistical methods can enhance the trustworthiness and transparency of LLMs while also considering their applications in statistical analysis.

Uploaded by

lizc900507

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

127 views67 pages

An Overview of Large Language Models For Statisticians

Uploaded by

lizc900507

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 67

An Overview of Large Language Models for Statisticians

Wenlong Ji1,∗ Weizhe Yuan2,3,∗ Emily Getzen4

2 5,6
Kyunghyun Cho Michael I. Jordan Song Mei5 Jason Weston2,3
Weijie J. Su4,† Jing Xu3,† Linjun Zhang7,†
1 2 3
Stanford University New York University Meta FAIR 4 University of Pennsylvania
5
UC Berkeley 6 INRIA 7
Rutgers University
arXiv:2502.17814v1 [stat.ML] 25 Feb 2025

Abstract
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence
(AI), exhibiting remarkable capabilities across diverse tasks such as text generation, reasoning,
and decision-making. While their success has primarily been driven by advances in computational
power and deep learning architectures, emerging problems—in areas such as uncertainty quantifica-
tion, decision-making, causal inference, and distribution shift—require a deeper engagement with
the field of statistics. This paper explores potential areas where statisticians can make important
contributions to the development of LLMs, particularly those that aim to engender trustworthiness
and transparency for human users. Thus, we focus on issues such as uncertainty quantification,
interpretability, fairness, privacy, watermarking and model adaptation. We also consider possible
roles for LLMs in statistical analysis. By bridging AI and statistics, we aim to foster a deeper
collaboration that advances both the theoretical foundations and practical applications of LLMs,
ultimately shaping their role in addressing complex societal challenges.

Contents
1 Introduction 2

2 Background and Fundamentals of LLMs 4

2.1 Historical Development of LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Architectures of Pre-trained Language Models . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 LLM Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Training Pipelines of LLMs 8

3.1 LLM Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Pre-training Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Pre-training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Scaling Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 LLM Prompting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 LLM Supervised Fine-Tuning (SFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.1 Instruction Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2 Parameter-Efficient Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 System 2 Prompting and Chain-of-Thought . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 LLM Reinforcement Learning & Preference Optimization . . . . . . . . . . . . . . . . . 15
3.5.1 Approaches to Aligning LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 LLM Self-Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6.1 Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
∗ Co-first authors
† Co-corresponding authors

1
3.6.2 Self-Rewarding Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Designing Trustworthy LLMs by Statistical Methods 18

4.1 Uncertainty Quantification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 LLM Watermarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Privacy and Copyright . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Algorithmic Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.6 LLM Alignment: A Statistical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 LLM-Empowered Statistical Analysis 29

5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.1 Extract Structured Data from Unstructured text . . . . . . . . . . . . . . . . . . 29
5.1.2 Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 LLMs for Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 LLM-assisted Medical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6 Discussion 33
6.1 Statisticians’ Small Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Towards Understanding LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.3 Human-AI Collaborative Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A Historical Development of LLMs 62

A.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.2 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

B Key Resources: Accelerating Progress in LLM Research 63

B.1 Core Frameworks for Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . 63
B.2 Model Libraries and Pre-trained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
B.3 Datasets and Data Preparation Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
B.4 Training and Fine-Tuning Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
B.5 Inference and Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
B.6 Visualization and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.7 Cloud and Hardware Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

1 Introduction
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI),
exhibiting remarkable capabilities across diverse tasks from text generation to dialog to complex rea-
soning. As these models increase in scale and sophistication, an important question arises: in the
era of LLMs, how can statisticians play a role in guiding the design and deployment of large-scale AI
models?

Previous tokens Next token Probability

Input Output probability 0.2

uncertainty 0.4
The statisticians like LLMs 0.05
…… ……

Figure 1: An illustration of next-token prediction by LLMs.

At their core, language models are models that assign probabilities to upcoming words, or sequences
of words. Specifically, a language model f aims to learn a mapping that takes a sequence of tokens
x[T ] = x1 , x2 , · · · , xT as input, and outputs a probability vector f (x[T ] ) over the token vocabulary
for the next-token xT +1 . While simple in formulation, when applied to vast corpora of text, this

2
probabilistic approach allows LLMs to capture intricate patterns, including those that were classically
the province of syntax, semantics, and logic. For example, given an instruction such as “Write an
engaging short story” or “Let’s think step by step,” LLMs can dynamically adapt their responses
to follow the given directive, generating outputs that align with the user’s intent. This instruction-
following capability is a result of extensive training on diverse datasets combined with fine-tuning
techniques that align model outputs with user preferences. While much of this work takes inspiration
from statistical principles, as well as principles from optimization theory, there is a large gap between
problem formulations and the design of statistical procedures that target the specific architectures and
data-collection protocols used in LLMs.

Example of a LLM response to a user instruction on Creative Writing

User: Write an engaging short story about a group of statisticians going hiking. Provide a
detailed background for the statisticians and describe the fun and challenges they face during
the journey. Incorporate rich details to bring the story to life.
LLM:
Title: The Law of Large Trails
It all started as a joke in the statistics department. Every Wednesday, after the weekly seminar,
a group of faculty and graduate students would gather in the common lounge, sharing coffee,
complaints, and the occasional proof sketch scribbled on napkins. One particularly dull winter
afternoon, Professor Strawderman sighed and said, “You know, we spend all our time analyzing
uncertainty in data, but when was the last time we faced real uncertainty? Like, say... in
nature?”...

Example of a LLM response to a probability question with Chain-of-Thought Prompting

User: You are an expert in solving advanced probability theory problems. Carefully solve the
following step by step, ensuring that every calculation and assumption is explained: What is
the probability of getting at least one 6 when rolling two fair six-sided dice? Let’s think step
by step.
LLM:
• For each die, there are 6 possible outcomes. The total number of possible outcomes when
rolling two dice is: 6 × 6 = 36. We want the probability of getting at least one 6. This
includes all cases where one or both dice show a 6. It’s easier to calculate the complement:
the probability of not getting a 6 on either die.
• Probability of not getting a 6 on a single die: For one die, there are 5 outcomes where
we do not get a 6 (i.e., 1, 2, 3, 4, or 5). The probability of not getting a 6 on one die is:
P(not 6 on one die) = 65 .
• Probability of not getting a 6 on both dice: Since the dice rolls are independent, the
probability of not getting a 6 on both dice is: P(not 6 on both die) = 56 × 65 = 25
36 .

• Probability of getting at least one 6 is the complement of the probability of not getting
25
any 6s: P(at least one 6) = 1 − P(not 6 on both die) = 1 − 36 = 11
36 .

This requires more than simply applying existing statistcal tools—the rise of LLMs demands ad-
vances in statistical methodology and theory to ensure trustworthiness, reliably quantify uncertainty,
and enhance interpretability—all at the massive scale of LLM deployments. For example, there are
many open problems in understanding how the architectural design of LLMs yields, or fails to yield,
calibrated probabilistic outputs or understanding how the outputs of LLMs interact with downstream
decision-making to ensure algorithmic fairness and reliability.
One can also view statistics itself as an application domain for LLMs. Indeed, LLMs provide signif-
icant opportunities to augment traditional statistical workflows. Tasks like data collection, cleaning,
and data analysis can leverage LLMs’ contextual understanding and scalability to improve accuracy
and efficiency. In fields such as medical research and biostatistics, LLMs can synthesize representative
datasets, extract valuable insights from unstructured clinical notes, and support predictive modeling

3
in high-stakes applications. Of course, such applications again require closing the gap between the ad
hoc nature of current LLM design and the statistical properties of its outputs.
These dual perspectives, statistics for LLMs and LLMs for statistics, highlight the urgent need for
statisticians to engage deeply with the LLM revolution. This engagement will be challenging. The
rapid pace of innovation in LLMs raises pressing questions: How can statistical frameworks adapt
to the unprecedented scale and complexity of these models? What role should statistical principles
play in ensuring the trustworthy and reliable deployment of LLMs in diverse applications? How can
statisticians incorporate LLMs into their workflows without compromising the rigor, transparency, and
interpretability that define their discipline? Addressing these questions is critical to bridging the gap
between statistical rigor and the transformative potential of LLMs.
This survey aims to catalyze the involvement of statisticians in the evolving landscape of LLMs by
providing a structured overview of the field. It begins by introducing the historical development and
foundational principles of LLMs, shedding light on their architectures and training pipelines. It also
explores how statistical methods can enhance the trustworthiness and alignment of LLMs, focusing on
topics such as conformal prediction, synthetic text detection, and algorithmic fairness. The discussion
then shifts to how LLMs can empower statistical analysis in areas like data synthesis, cleaning, and
medical research.
By bridging the fields of statistics and LLMs, this survey underscores the mutual benefits of collab-
oration. Statisticians can enrich the development of LLMs with rigor and transparency, while LLMs
offer powerful tools to push the boundaries of statistical practice. This interplay promises to rede-
fine both fields, presenting statisticians with a unique opportunity to shape the trajectory of AI in
promising directions.
The paper is organized as follows. In Section 2, we provide background on the fundamentals of
LLMs. Section 3 delves into the training pipelines of LLMs, including pre-training, prompting, fine-
tuning, and alignment techniques. Section 4 focuses on designing trustworthy LLMs leveraging statis-
tical insights, covering topics like interpretability, uncertainty quantification, watermarking, privacy,
and algorithmic fairness. Section 5 explores the potential of LLMs to empower statistical analysis,
with applications to medical research. Finally, Section 6 concludes the paper with a discussion on
statisticians’ role in the LLM landscape, highlighting the importance of collaboration and offering sug-
gestions for future research directions. Additional resources are provided in the appendicies, including
frameworks for model development, libraries, datasets, training tools, and visualization utilities that
can accelerate progress in LLM research.

2 Background and Fundamentals of LLMs

The development of LLMs has been a landmark event in the field of natural language processing,
representing a quantum leap in the ability of machines to understand human language. This section
provides a comprehensive overview of the historical development of LLMs, beginning with foundational
concepts in representation learning which illustrate how models encode language into numerical forms.
We then transition into advances in language modeling, involves predicting the next word in a sequence
of texts, allowing for coherent and fluent text generation. Finally, we explore various architectures that
have been pivotal in shaping the current landscape of LLMs.

2.1 Historical Development of LLMs

2.1.1 Representation Learning
The first step in an LLM is to transform natural language into a format that computers can understand—
specifically vectors, matrices, and tensors. The symbolic lexical units (i.e., words) that comprise sen-
tences require effective numerical representation strategies, known as word embeddings. Grounded on
the distributional hypothesis, which posits that linguistic items that occur in similar contexts have
similar meanings [Har54], such methods have gradually become more sophisticated over time.
Early attempts at capturing word meanings involved bag-of-words representations, which represent
documents as sparse vectors as in TF-IDF [Spa72], with a dimension for each element of the dictionary.
Many unsupervised [Hof01; LD02] and supervised methods [Joa98; Bai+09; WBU11] involved learning
(dense) word embeddings, but struggled with complex semantic and syntactic nuances of words in chal-
lenging NLP tasks. Neural methods first grew to prominence with the advent of Word2vec [Mik+13]

4
and GloVe [PSM14], that directly learn low-dimensional word representations from local word context,
with each dimension encodes a latent feature of the word. The specific objective functions underlying
the architecture of these models were explored in foundational works such as those by [Ben+03; CW08;
Col+11; TRB10], setting the stage for subsequent innovations.
These approaches treat words as the primary unit, leading to challenges like a large vocabulary
size and out-of-vocabulary words. Tokenization, such as Byte Pair Encoding (BPE) [Gag94], mitigates
these issues by merging frequent byte or character pairs. Thus, The evolution of representation learning,
language modeling, and tokenization techniques has become a unified and interdependent process.

2.1.2 Language Modeling

Early stages of language models (LMs), such as n-gram models [Jel98; GL04; Ros00], were mostly
statistical regression models, relying on the Markov assumption to predict the next word from the
most recent context. Challenges for these models included the need to tackle data sparsity [CG96] and
the exponentially growing number of transition probabilities as n increases.
Neural language models (NLMs) [Ben+03; Mik+10; SDG06] tackle data sparsity by mapping
words to low-dimensional vectors and predicting subsequent words using neural networks. One of
the earliest and most influential neural language models is based on a Recurrent Neural Network
(RNN) architecture, which was first introduced by [Elm90; Jor86] and later popularized by [Mik+10].
RNNs are particularly well-suited for modeling sequential data, such as text, but can sometimes
struggled with vanishing gradients and capturing long-term dependencies in the input sequence. To
address these limitations, variants of RNNs were introduced, including Long Short-Term Memory
(LSTM) networks [HS97] and Gated Recurrent Units (GRUs) [Chu+14]. A major advance came with
the introduction of the attention mechanism [BCB15], allowing the model to focus on specific parts
of the input sequence which can be a large number of positions away when generating each output
token. More detail on the attention mechanism can be found in §2.2. This innovation quickly led
to a number of developments including stacking network layers of attention [SWF+15] and use of
position embeddings [CW08; SWF+15] which showed strong performance without the use of position
recurrence, culminating in the introduction of the Transformer architecture [Vas+17].
The Transformer, introduced by [Vas+17], revolutionized NLP by enabling deeper and more efficient
language model training. Unlike LSTMs, Transformers capture global dependencies between input and
output regardless of distance while crucially allowing for significantly more parallelization, enhancing
their scalability. This led to the ability to train Transformer-based pre-trained models on large amounts
of data, as exhibited by GPT [Rad+18] BERT [Dev+19], XLNET [Yan+19], RoBERTa [Liu+19] and
T5 [Raf+20]. The results were striking. These models learn general representations of language
through pre-training on large text corpora with language modeling objectives, allowing for effective
fine-tuning on specific NLP tasks.
Most LLMs today are built on the Transformer architecture, which has led to steady improvement
in performance on downstream tasks by scaling both the number of model parameters and the vol-
ume of training data [Hof+22]. There are numerous popular families of LLMs available, such as the
LLama [Tou+23b; Tou+23a; Dub+24a], Mistral [Jia+23; Jia+24b], GPT [Bro+20; Ope+24a], Claude
families1 , and DeepSeek [Bi+24; Liu+24a; Liu+24b], each offering models of varying sizes. As illus-
trated in Table 1, LLMs can be categorized based on their hosting requirements—from small models
that can run on a laptop, to medium-sized models that require a server cluster, and large proprietary
models accessible via API. Whether one requires a lightweight model for personal use or a powerful
model for enterprise-level tasks, there are currently LLM solutions available.
Statisticians interested in research on actual LLMs can begin by leveraging resources and tools that
lower the barrier to entry while addressing the computational challenges associated with these models.
Please refer to Appendix B for more details.

2.2 Architectures of Pre-trained Language Models

As we have discussed, the evolution of neural language model architectures progressed from sequence-
based models such as convolutional [Kim14] and recurrent [HS97] models to the more advanced Trans-
former models [Vas+17]. Further success came from Pre-trained Language Models (PTMs), which
1 https://www.anthropic.com/news/claude-3-family

5
Softmax Feed Forward

Linear
h′1 h′2 ⋯ h′N

h′i = W1 σ(W2hi)
Feed
Forward
h1 h2 ⋯ hN
Norm

L×
Masked Masked Multi-Head Attention
Multi-Head
Attention h′1 h′2 ⋯ h′N
Norm
H′ = ∑m=1 Om⊤Vm H Softmax(mask(H ⊤Km⊤Qm H ))
M

Positional
Embedding h1 h2 ⋯ hN
Input
Embedding

Figure 3: The architecture of a decoder-only Transformer. The model consists of L stacked Transformer
decoder layers. Each layer contains a feed-forward network with weight matrices W1 and W2 , as well
as a multi-head attention module comprising the output matrix (Om ), value matrix (Vm ), key matrix
(Km ), and query matrix (Qm ).

integrate self-supervised learning and the Transformer model. Self-supervised learning allows the
model to learn from the data without explicit labels or supervision, while the Transformer model uses
self-attention mechanisms to capture long-range dependencies and contextual relationships in the input
data. Below we will introduce different components of the Transformer architecture.

Attention Mechanism The attention mechanism was first introduced to tackle the limitations of
the RNN encoder-decoder model, which struggled with long sentences due to its fixed-length vector rep-
resentation of the source information [BCB15]. This mechanism enhances the model by jointly learning
to align and translate. It identifies key positions in the source sentence for each target word prediction
adaptively, considering both these specific context vectors and all previously generated target words.
This approach effectively handles longer sentences, improving the model’s overall performance. The
Transformer model [Vas+17], built entirely on attention mechanisms, further leverages this approach
for superior results. In more detail, an attention function maps a query ⃗q and a set of key-value pairs
{(⃗k1 , ⃗v1 )}, · · · , {(⃗kn , ⃗vn )} to an output ⃗o. The output is a weighted sum of the values as calculated by
⃗o = α1⃗v1 + · · · + αn⃗vn , with the weight αi for each value ⃗vi determined by a compatibility function
that matches the query ⃗q with its corresponding key ⃗ki .

Transformer Overview The basic Transformer model operates on a sequence-to-sequence basis, in-
corporating both an encoder and a decoder, each constructed from a series of identical blocks [Vas+17].
The encoder maps an input sequence ⃗x = (x1 , · · · , xn ) to a sequence of continuous representations
⃗z = (z1 , · · · , zn ). Given ⃗z, the decoder then generates an output sequence ⃗y = (y1 , · · · , ym ). The pri-
mary components of each encoder block are a multi-head attention (MHA) module and a position-wise
feed-forward network (FFN). To facilitate the construction of a more complex model, a residual con-
nection [He+16] is utilized around each block, succeeded by a Layer Normalization [BKH16] module.
Decoder blocks, in contrast to encoder blocks, incorporate additional cross-attention modules between
the multi-head self-attention modules and the position-wise FFNs. Moreover, the self-attention mod-
ules within the decoder are modified to inhibit each position from attending to positions that follow

6
it.
Since the Transformer model does not inherently encode sequential order, positional encodings
are introduced to provide the model with information about the relative or absolute positions of
tokens in the input sequence. These positional encodings are added to the input embeddings at the
bottom of both the encoder and decoder stacks. The original Transformer employs sinusoidal positional
encodings, where each position is represented by a combination of sine and cosine functions of varying
frequencies. This choice allows the model to extrapolate to sequences longer than those seen during
training. Formally, the positional encoding for a position pos and dimension d is defined as:
pos pos
P Epos,2i = sin , P Epos,2i+1 = cos ,
100002i/dmodel 100002i/dmodel
where dmodel represents the dimension of the input embeddings. For more detail regarding each
individual module, including the incorporation of encodings, please refer to the original Transformer
paper [Vas+17]. In addition to sinusoidal encodings, several advanced positional encoding methods
have been proposed to enhance model performance and flexibility. Rotary Positional Embeddings
(RoPE) [Su+24] is one such method, designed to incorporate relative position information directly into
the attention mechanism. RoPE applies a rotation matrix to the embeddings, enabling the model to
better capture the relationship between tokens at different positions. Building on the idea of expanding
positional encoding capabilities, the recently proposed Contextual Position Encoding (CoPE) [Gol+24]
addresses a different limitation. Traditional methods, including RoPE, rely on token counts to derive
positions, which restricts their ability to generalize to higher levels of abstraction, such as attending to
the i-th sentence. CoPE overcomes this challenge by conditioning positions on context, incrementing
positions only for specific tokens determined by the model. This flexibility enables CoPE to handle
complex tasks like selective copy, counting, and Flip-Flop, while also improving perplexity in language
modeling and coding tasks.

Multi-head Attention A crucial component within the Transformer’s encoder and decoder layers
is the multi-head attention module which computes representations of the input and output without
using recurrence or convolution calculations. The attention function used in [Vas+17] is a scaled dot-
product function. The input to this function consists of queries and keys of dimension dk , and values
of dimension dv . These queries, keys, and values are obtained by applying a linear transformation to
the representations output from the previous encoder/decoder layer. The output calculation of the
attention function is shown in Equation 1.
QK ⊤
Attention(Q, K, V ) = Softmax √ V (1)
dk
Empirically, it has been found beneficial to linearly project the queries, keys, and values h times
with different, learned linear projections to dk , dk , and dv dimensions, respectively. The attention
function is then performed in parallel on each of these projected queries, keys, and values, yielding
dv -dimensional output values. These are concatenated and once again projected, resulting in the final
values. The output calculation of the multi-head attention is shown in Equation 2, where multiple
projections parameterized by matrices WiQ ,WiK , WiV , W O are applied. Multi-head attention allows
for the capture of more nuanced syntactic and semantic information since the model can attend to
information from different subspaces at different positions.

MultiHead(Q, K, V ) = Concat(head1 , · · · , headh )W O (2)

where headi = Attention(QWiQ , KWiK , V WiV ) (3)

Variants The Transformer architecture is versatile and highly extensible, allowing for the seamless
integration of new modules or the implementation of novel attention mechanisms. The application
of the Transformer architecture can take three distinct forms, depending upon the utilization of the
encoder and decoder components.
• Encoder-Decoder Transformers The full Transformer architecture, which incorporates both
the encoder and decoder components, is typically employed for sequence-to-sequence modeling
tasks, such as machine translation and text summarization. Pre-trained models that utilize this

7
full Transformer architecture, such as T5 [Raf+20] and BART [Lew+20], are commonly applied
in these contexts.
• Encoder-only Transformers This involves the exclusive use of the encoder component within
the Transformer architecture. The encoder typically functions as a feature extractor, with its
outputs serving as a representation of the input sequence. This method is commonly employed for
tasks such as text classification or sequence labeling, which can be viewed as token classification
problems. Typical pre-trained encoder models include BERT [Dev+19] and RoBERTa [Liu+19].
• Decoder-only Transformers This involves solely utilizing the decoder component within the
Transformer architecture, with the cross-attention module between the encoder and decoder be-
ing omitted, as shown in Figure 3. It is typically employed for sequence generation tasks, such
as language modeling. Recently, the unification of task formats into language modeling and the
scalability of decoder-only Transformer training have led to the rise of instruction tuning. This
method, which involves fine-tuning language models on a collection of tasks described via instruc-
tions, significantly enhances zero-shot performance on unseen tasks [Wei+22b]. Consequently,
most current large language models are based on the decoder-only Transformer ar-
chitecture. Notable examples include the GPT series [Rad+19; Bro+20; Ope+24a], Llama
series [Tou+23b; Tou+23a], and Mistral series [Jia+23], among others.

2.3 LLM Evaluation

As the number of LLMs continues to grow, the challenge of effectively comparing their performance
becomes increasingly important. Traditional benchmarks, such as ARC [Cla+18], MMLU [Hen+21a],
and MATH [Hen+21b], while valuable for evaluating specific capabilities like multi-task understanding,
mathematical reasoning, and commonsense knowledge, rely on fixed datasets with predefined gold
answers. This rigid structure limits their ability to assess models’ adaptability and creativity in
generating diverse, contextually appropriate responses. Moreover, as models become more advanced,
they tend to saturate these benchmarks, reducing their discriminative power and making it harder to
distinguish between state-of-the-art systems.
Chatbot Arena [Chi+24] addresses these limitations by providing a benchmark platform for eval-
uating LLMs through anonymous, randomized battles in a crowdsourced manner. Utilizing the Elo
rating system [MVM09], which is widely used in competitive games such as chess, Chatbot Arena allows
users to compare models by engaging with them side-by-side and voting on which one performs bet-
ter. This approach not only facilitates scalable and incremental evaluation but also provides a unique
ranking order for the models. By inviting the community to contribute new models and participate
in the evaluation process, Chatbot Arena helps maintain an up-to-date leaderboard, offering valuable
insights into the relative strengths of various LLMs. This platform exemplifies a practical solution for
benchmarking LLMs in open-ended tasks, where traditional programmatic evaluation methods with
fixed gold answers fall short.

3 Training Pipelines of LLMs

3.1 LLM Pre-training
Between 2017 and 2019, a paradigm shift occurred in the learning of NLP models. The traditional
fully supervised learning paradigm began to be replaced by a two-step process: pre-training and fine-
tuning [Pet+18; Rad+18; Dev+19]. In this new paradigm, a model with a fixed architecture is first
pre-trained as a language model using unlabeled web data through self-supervised learning. This
pre-trained language model is then adapted to various downstream tasks by introducing additional pa-
rameters and fine-tuning these parameters using task-specific objective functions. The pre-training and
fine-tuning paradigm offers several advantages. Firstly, pre-training on a large text corpus enables the
model to learn universal language representations, which can be beneficial for many downstream tasks.
Secondly, pre-training provides a superior model initialization, which often results in better generaliza-
tion performance and faster convergence on the target task. Lastly, pre-training can serve as a form of
regularization, helping to prevent overfitting, particularly when dealing with small datasets [Erh+10].

8
3.1.1 Pre-training Objective
The choice of pre-training tasks plays a pivotal role in learning universal language representations.
These tasks should ideally be challenging and have substantial training data. In this section, we
provide a brief overview of one of the most widely used pre-training tasks.

Standard Language Modeling. Standard language modeling objectives focus on training the
model to learn the probability P (x) of texts from a training corpus [Rad+19]. Typically, text predic-
tion occurs in an autoregressive manner, predicting the tokens in the sequence one at a time, often from
left-to-right, although other orders are possible as well. Formally, a language model is parameterized
by a set of parameters θ and learns a parameterized mapping from the context x<t to the next-token
xt . The goal of the model is to predict the next-token in the sequence given the context, and this is
achieved by minimizing the conditional probability Pθ (xt |x<t ). The loss function of standard language
PT
modeling objectives over a sequence of tokens x = x1 , x2 , · · · , xT is LSLM = − t=1 log Pθ (xt |x<t ).
Due to their simplicity, efficiency, scalability, and proven performance on a wide range of tasks, lan-
guage modeling objectives have become the preferred choice for pre-training LLMs.
In addition to the standard language modeling objective, other primary training objectives such as
corrupted text reconstruction [Raf+20] and full text reconstruction [Lew+20] are also widely utilized.
Beyond these primary objectives, auxiliary objectives have been developed to enhance the model’s
performance on specific downstream tasks. Common auxiliary objectives include next sentence pre-
diction [Dev+19], sentence order prediction [Lan+20], discourse relation prediction [Sun+20], and
token-passage prediction [Liu+20b]. These auxiliary objectives are strategically employed to provide
additional training signals that help refine the model’s understanding and generation capabilities,
thereby improving its applicability and effectiveness across a variety of NLP tasks.

3.1.2 Pre-training Data

The choice of pre-training datasets also plays a crucial role in the development and capabilities of LLM
models. These datasets, often composed of vast amounts of text data from various sources, serve as
the foundation on which LLMs learn the intricacies of language. By training on such massive corpora,
LLMs acquire general language understanding and the ability to generate coherent text.

Data Source Pre-training data can be broadly categorized into two types: general pre-training
corpora and domain-specific pre-training corpora. General pre-training corpora are broad datasets,
covering categories such as webpages, language texts, books, academic materials, code, parallel corpora,
social media, and encyclopedia [Liu+24f]. Webpages are a major source, offering extensive multilin-
gual content that often requires significant cleaning, as seen in derivatives like RefinedWeb [Pen+24]
from Common Crawl2 . Language texts are sourced from large corpora like the American and British
National Corpora,34 often focusing on specific languages or domains like finance. Books provide high-
quality, lengthy texts from sources such as Project Gutenberg5 improve models’ understanding of
complex language. Academic Materials, such as those in arXiv,6 contribute specialized scholarly con-
tent. Code data from repositories like The Stack [Koc+23] and Github7 is essential for programming
tasks. Parallel corpora data, involving bilingual text pairs from resources like ParaCrawl [Bañ+20] is
crucial for translation tasks. Social Media data from platforms such as StackExchange8 and Reddit9
helps models learn conversational dynamics, while Encyclopedia data, particularly from Wikipedia,10
strengthens models’ general knowledge. Interestingly, there are some unexpected phenomena relating
to how certain corpora can enhance the abilities of LLMs. For example, code data not only is essen-
tial for programming tasks, but also significantly enhances non-code performance when included in
pre-training. Specifically, [Ary+24] found that initializing models with code pre-trained data led to a
2 https://commoncrawl.org/
3 https://anc.org/
4 http://www.natcorp.ox.ac.uk/
5 https://www.gutenberg.org/
6 https://arxiv.org/
7 https://github.com/
8 https://stackexchange.com/
9 https://www.reddit.com/
10 https://www.wikipedia.org/

9
relative increase of 8.2% in natural language reasoning and a 12× boost in code performance. More-
over, they also found that using high-quality, synthetically generated code and code-adjacent datasets
like GitHub commits during pre-training or cooldown phases could result in substantial improvements
across various tasks including reasoning and world knowledge tasks.
Domain-specific pre-training corpora fine-tune models for specialized fields. For instance, financial
datasets like BBT-FinCorpus [Lu+23] and medical corpora like Medical-pt [Xu23] provide targeted
content that enhances model performance in finance, law, and other areas. These datasets ensure that
models are better equipped for specific applications.

Data Selection and Mixture Data selection and mixture in LLM training is essential for opti-
mizing model performance. This process is generally divided into three levels: token-level, sample-
level, and group-level selection. Token-level selection focuses on filtering individual tokens within the
data, offering precise control but requiring significant resources [Lin+24b]. Sample-level selection in-
volves choosing specific training examples, commonly used in fine-tuning. While heuristic methods
are typical [Rae+21; Sha+24; Sol+24], advanced approaches use optimization algorithms [Che+24b;
Min+22b], model perplexity [Mue+23], or even LLMs to guide selection [Wet+24; Sac+24], aiming to
enhance the quality of the training data. Group-level selection organizes data into pools, which are
then optimally mixed through mixture techniques. Early methods relied on manual mixtures [Gao+20;
Bro+20], but more recent approaches use learned mixtures, either through fixed weights determined by
proxy models (“offline selection”) [Rae+21; Xie+23] or dynamically adjusted weights during training
(“online selection”) [Che+23a].

3.1.3 Scaling Laws

With the introduction of increasingly large language models [Bro+20; Rae+21; Smi+22], understand-
ing computational efficiency has become crucial. The compute and energy costs for training these
models are substantial [Rae+21; Tho+22], escalating with model size. In practical scenarios, the
training compute budget is often predetermined, depending on the availability of accelerators and the
duration of their use. Given that training large models is usually feasible only once, accurately estimat-
ing optimal hyperparameters for a given compute budget is essential. Kaplan et al. [Kap+20] provided
initial insights into computational efficiency for language models, discovering a power law relationship
between the number of parameters in an autoregressive LM and its performance. They suggested that
with a tenfold increase in computational budget, model size should increase by 5.5X, while the number
of training tokens should only increase by 1.8X. This led to a trend of training larger models to achieve
performance gains. Scaling laws have been shown to apply across various data modalities, including
language, images [Che+20], and videos [WTU20], as well as multimodal modeling [Tsa+19] and even
mathematical problem solving [Sax+19].
Beyond model size and training tokens, [Tay+23] derived scaling laws for different inductive biases
and model architectures, revealing significant variations in scaling coefficients among models. They
found that among ten architectures, the vanilla Transformer exhibited the best scaling behavior, despite
not having the highest absolute performance at each compute region.
In 2022, the Chinchilla scaling law [Hof+22] shifted the focus from model size to the number of
training tokens for computational efficiency. This law suggests that training slightly smaller models
on larger datasets is often more efficient than the previous approach [Kap+20], which favored larger
models on smaller datasets. Compared to Kaplan’s study, key differences introduced by the Chinchilla
study include: (1) a different learning rate schedule for all models, unlike [Kap+20], which did not
account for the impact of these hyperparameters on the loss; and (2) the inclusion of larger scale
models. The Chinchilla model, with 70B parameters trained on 1.4T tokens (approximately 20 tokens
per parameter), outperformed its much larger counterpart, Gopher [Rae+21].
The “Chinchilla efficient” model size and training dataset size, along with the achievable test loss,
can be determined as follows:

0.45
Nopt (C) = 0.6 C

Dopt (C) = 0.3 C 0.55 (4)
−0.154

Lopt (C) = 1070 C + 1.7,


where Nopt represents the optimal number of model parameters, Dopt denotes the optimal number of
training tokens, and Lopt indicates the optimal final pre-training loss achievable under a fixed FLOPs

10
compute budget C. Despite the trend towards larger models, the potential of training smaller models
with larger datasets remains relatively underexplored. Instead of focusing solely on training compute-
optimal language models, [Tou+23b] emphasized the importance of the inference budget. Inference-
optimal language models aim for optimal performance within specific inference constraints, achieved
by training small models with more tokens than recommended by traditional scaling laws[Hof+22].
[Tou+23b] demonstrated that smaller models trained with more data could match or even surpass the
performance of larger models. Additionally, [Tyl23] argued that existing scaling laws [Hof+22] might
not accurately predict scenarios where smaller models are trained for extended periods. Consequently,
efforts have been made to push the limits of scaling laws by training language models to achieve optimal
performance across various inference budgets by using more tokens than typically prescribed by the
Chinchilla scaling law. Notable examples include the Llama series [Tou+23b], TinyLlama [Zha+24c],
Llama2 series [Tou+23a], and Llama3 [Gra+24], which trained an 8B model on 15T tokens, far ex-
ceeding the optimal number of training tokens predicted by the Chinchilla law (approximately 200B
tokens). They found that model performance continued to improve log-linearly even after training the
8B and 70B models on up to 15T tokens.
After pre-training on massive corpora using self-supervised objectives, the parameters of the LLM
are well suited as an initialization for various downstream tasks. The pre-training provides the model
with a broad understanding of language, which can then be fine-tuned for specific applications. Beyond
pre-training strategies, understanding scaling laws plays a crucial role in optimizing model training
and deployment. For example, [JJS24] highlight how multi-objective considerations (e.g., safety and
accuracy) can influence computational efficiency, demonstrating that new entrants to competitive
markets can achieve safety alignment with fewer data points due to differing reputational dynamics.
Their findings that scaling rates slow with increasing dataset size suggest new strategies for allocating
compute resources effectively, particularly in multi-objective settings. These insights, combined with
advances in pre-training and fine-tuning, underscore the potential to refine model training workflows
and push the boundaries of computational efficiency across diverse applications.

3.2 LLM Prompting

For the largest language models, the paradigm has shifted for typical users from traditional supervised
learning to prompt-based learning, often called prompt engineering [Liu+23]. In supervised learning,
we use labeled data consisting of input-output pairs D = (xi , yi )i=1···N to tune the model parameter
θ so that we can predict the output for an input x that is not in the training data using Pθ (y|x).
However, in prompt engineering, the model parameter θ is fixed, and instead, one tunes a template t
which is combined with x to form a new input x′ that achieves good performance when using Pθ (y|x′ ).
In other words, prompt engineering learning involves optimizing a template t that is used to generate
a new input x′ from the original input x, rather than optimizing the model parameters θ directly. This
approach allows us to adapt the model to new inputs without requiring retraining of the model. We
describe two main prompt-based learning techniques as follows.

Vanilla Prompt Engineering Vanilla prompt engineering involves the development of effective
input prompts for LLMs to generate better outputs [Liu+23]. Traditionally, prompts were manually
crafted based on intuitive templates, a process that requires considerable expertise and may not always
yield optimal results. To overcome this limitation, automated methods have been introduced, cate-
gorizing prompts into discrete and continuous types. Discrete prompts, also known as hard prompts,
involve natural language phrases and can be discovered through methods like prompt mining from
large corpora [Jia+20], paraphrasing existing seed prompts [YNL21], gradient-based search over to-
kens [Wal+19], and using LLMs to generate prompts based on inputs [GFC21]. Continuous prompts, or
soft prompts, operate within the embedding space of the model and do not require human-interpretable
language. Prefix-tuning is an example of this approach [LL21], where continuous task-specific vectors
are prepended to inputs, allowing the LM to perform the task more effectively without altering its pa-
rameters. In addition, some methods, such as P-tuning [Liu+24e] and PTR [Han+22], enhance hard
prompt templates by incorporating some tunable embeddings, rather than relying solely on purely
learnable prompt templates. These approaches blend the structure of hard prompts with the flexibil-
ity of trainable soft tokens, improving prompt performance and adaptability.

11
LLM Size Inference Requirements Training Requirements Examples
Small (≤2B) Hostable on a personal lap- Requires a server cluster Gemma-2-2B,
top (≤5GB GPU memory) (≤1 NVIDIA A100 40GB) Llama-3.2-1B
Medium (2∼100B) Requires a server cluster (≤8 Requires a server cluster Llama-3-70B,
NVIDIA A100 80GB) (≥1 NVIDIA A100 80GB) Mixtral-8x7B
Large (≥100B) Requires a server cluster (≥8 Requires a server cluster Llama-3.1-405B
NVIDIA A100 80GB) (≥32 NVIDIA A100 80GB)
Proprietary (Size Accessed via API Accessed via API GPT-4, Claude-
Unknown) 3, Gemini-2

Table 1: Recent popular LLMs categorized based on their hosting requirements. Estimates for the
inference and training costs are based on floating-point 16 precision. New research is trying to compress
further to make larger models available locally [Par+24b; Xia+24a; Hua+24a].

In-Context Learning In-context learning is a paradigm that allows language models to perform
tasks by using a few examples as demonstrations within the prompt, without the need for further fine-
tuning [Bro+20]. Formally, given a set of demonstrations with inputs {x1 , x2 , . . . , xn } and correspond-
ing outputs {y1 , y2 , . . . , yn }, a prompt template t is used to format each input into {x′1 , x′2 , . . . , x′n }. For
a new input x, formatted as x′ using the same template, the language model M predicts the output by
estimating the probability Pθ (y | x′1 , y1 , . . . , x′n , yn , x′ ). By prepending these input-output pairs to the
current input, the model learns to perform the task through the context provided by these examples,
effectively enabling it to generalize to new tasks based on in-context information. Studies have shown
that increasing the number of examples leads to significant performance improvements across both gen-
erative and discriminative tasks [Aga+24]. Despite its apparent simplicity and effectiveness, in-context
learning presents several challenges that significantly impact model performance. The selection and
ordering of examples are crucial. Research has shown that the specific examples chosen in a few-shot
scenario can lead to vastly different outcomes, ranging from near state-of-the-art accuracy to almost
random guessing [Lu+22]. To mitigate this variability, advanced techniques such as using sentence
embeddings to select examples that are semantically close to the input [GFC21; Liu+22] or employing
entropy-based methods to optimize the sequence of examples have been proposed [Lu+22]. Moreover,
formal understanding of why in-context learning works has been a subject of ongoing research. Recent
studies suggest that in-context learning functions as a form of Bayesian inference, where the model
uses the provided examples to recover latent concepts [Xie+22]. Researchers have empirically ob-
served that replacing gold labels with random labels only marginally impacts performance [Min+22a].
Additionally, in-distribution inputs within the demonstrations significantly contribute to performance
gains [Min+22a]. This suggests that in-context learning likely helps elicit knowledge that the model
has already acquired during pre-training, rather than teaching the model new information through
just a few examples.

3.3 LLM Supervised Fine-Tuning (SFT)

Even with extensive pre-training, LLMs may not excel at specific tasks without further adjustment. To
achieve improvements of this kind, a process known as Supervised Fine-Tuning (SFT), which involves
fine-tuning with labeled data, is often necessary. For instance, the BERT paper [Dev+19] demonstrated
the effectiveness of fine-tuning the model on 11 distinct tasks. Although more recent LLMs can perform
tasks through in-context learning [Bro+20] or zero-shot prompting [Liu+23] without prior fine-tuning,
they still stand to gain from fine-tuning tailored to specific tasks or datasets. A notable example is
OpenAI’s GPT-3.5 Turbo, which, despite its smaller size compared to GPT-4, could achieve superior
performance when fine-tuned with task-specific data.11
SFT does not need to be based on a single task. Indeed, to further improve LLMs’ performance and
address data scarcity, researchers are also increasingly adopting Multi-Task Learning (MTL) [Car98]
for NLP tasks. This approach trains models on multiple related tasks simultaneously, broadening the
training dataset and reducing overfitting risks [CZY24]. MTL not only captures generalized and task-
11 https://platform.openai.com/docs/guides/fine-tuning

12
specific knowledge but also facilitates the development of compact, efficient models through implicit
knowledge sharing [DH17; SCN18].

3.3.1 Instruction Tuning

The transformative idea that any NLP task can be converted into a text-to-text format has signifi-
cantly aligned with advances in generative language models [Raf+20]. This task paradigm shift allows
for the fine-tuning of language models across a broad spectrum of NLP tasks using an unified data
format [Mis+22], thereby ensuring a uniform training objective. A pivotal development in this area
is “instruction tuning” where a language model is fine-tuned on a collection of tasks described via in-
structions [Wei+22b]. This method has been shown to improve the zero-shot performance of language
models on unseen tasks. Models such as FLAN [Wei+22b] and T0 [San+22] exemplify this approach.
These models are trained to process a variety of NLP tasks through instructional prompts, setting new
performance benchmarks and demonstrating an impressive ability to generalize to tasks they were not
explicitly trained on. This evolution towards instruction-based task execution highlights the critical
role of multi-task learning in increasing the robustness and adaptability of language models [CZY24].
The significance of instruction tuning is further highlighted in the InstructGPT paper [Ouy+22],
which utilizes a novel dataset comprising prompts crafted by labelers and those submitted to early
InstructGPT models via the OpenAI API. This dataset includes a wide spectrum of tasks such as
brainstorming, rewriting, open-ended question answering, and more, reflecting the diverse and user-
centric nature of modern NLP applications. Labelers demonstrate the desired responses to prompts,
providing training data for SFT using GPT-3. The InstructGPT findings emphasize the necessity
of moving beyond traditional NLP tasks to include user-centric tasks like brainstorming, which are
not adequately captured by traditional NLP datasets. This expansion not only broadens the scope of
tasks that models are trained on but also enhances their ability to perform effectively on real-world,
user-driven tasks, marking a significant shift towards more adaptive and user-focused language models.

3.3.2 Parameter-Efficient Fine-Tuning

Given the sheer size of LLMs, a common approach to fine-tuning is to modify a small fraction of the
model’s parameters while leaving most of them unmodified. This approach, called “Parameter-Efficient
Fine-Tuning” (PEFT), focuses on selectively tuning a limited number of parameters to achieve the
desired performance gains without completely modifying the entire model. The PEFT strategies can
be broadly classified into three types [Han+24b]: (1) Additive fine-tuning, which injects new trainable
modules or parameters into the original model architecture. (2) Selective fine-tuning, which trains only
a subset of model parameters during fine-tuning. (3) Reparameterized fine-tuning, which constructs a
low-dimensional reparameterization of the original model parameters for training.

Additive Fine-Tuning Additive fine-tuning strategies, such as adapters [Hou+19; He+22b] and soft
prompts [LL21; Liu+24e], introduce only a minimal number of trainable parameters that are strategi-
cally positioned within the model architecture. Adapters are small layers inserted within Transformer
blocks, consisting of a down-projection matrix, a nonlinear activation function, and an up-projection
matrix. These layers act as computational bottlenecks, refining the model’s output while leveraging the
existing pre-trained parameters. On the other hand, soft prompts involve appending adjustable vectors
at the beginning of the input sequence, enhancing the model’s ability to utilize the rich information
within the continuous embedding space. This method adjusts the initial conditions of the model’s
input processing, allowing for fine-tuned performance improvements without extensive retraining of
the core model components. Both approaches maintain the original model architecture unmodified
while providing targeted enhancements for specific tasks.

Selective Fine-Tuning Unlike additive PEFT, selective PEFT fine-tunes only a specific subset of
the existing parameters within a model. This is achieved by applying a binary mask to the model’s
parameters, where each element of the mask is either 0 or 1, indicating whether the corresponding
parameter should be updated during fine-tuning. Only the selected parameters are adjusted based on
the gradients of the loss function, using a predefined learning rate. This method allows for targeted im-
provements on downstream tasks by optimizing a limited number of model parameters, thereby main-
taining the overall efficiency and scalability of the model. Techniques such as Diff pruning [GRK21],

13
PaFi [LMM23], and FishMask [SNR21] exemplify this approach, focusing on refining the model’s
performance through selective parameter updates.

Reparameterized Fine-Tuning Reparameterization fine-tuning involves transforming a model’s

architecture by adjusting its parameters, often through a low-rank parameterization to maintain effi-
ciency during training and restoring the original weight configurations for inference. A notable tech-
nique in this domain is LoRA (Low Rank Adaptation) [Xu+24b], which introduces small, trainable
matrices operating alongside the pre-trained weights to inject task-specific updates without burdening
the inference process. This method modifies the output by adding an incremental update, effectively
capturing task-specific nuances while maintaining the model’s original efficiency. Extensions of LoRA,
such as DyLoRA [Val+23], dynamically adjust the rank of these matrices within a training budget
to optimize performance without a fixed rank constraint, enhancing training efficiency. Another vari-
ant, AdaLoRA [Zha+23], uses singular value decomposition to refine the update matrix, pruning less
significant components based on their impact, thus optimizing the parameter count. These reparame-
terization strategies ensure that PEFT not only preserves but also enhances model functionality with
minimal computational overhead, making them suitable for large-scale models.
These reparameterization techniques closely align with methodologies in high-dimensional statistics,
particularly low-rank matrix estimation and recovery [ZHT06; CP11; CP10; CZ13; KX15]. Low-
rank techniques in statistics aim to estimate high-dimensional matrices by uncovering the underlying
low-dimensional structure in the data. Similarly, reparameterized fine-tuning leverages the low-rank
approximations in the model to adapt pre-trained models efficiently to new tasks, capturing task-
specific information with minimal computational overhead. This connection highlights the theoretical
foundation and practical utility of low-rank approaches, as both frameworks exploit the inherently
low-dimensional structure in data and models. By doing so, they provide a principled framework
for achieving a balance between complexity and accuracy, making them particularly well-suited for
large-scale machine learning models.

3.4 System 2 Prompting and Chain-of-Thought

“System 2 prompting” refers to prompts that elicit a deliberate reasoning-like process in AI models
that takes the form of the generation of intermediate steps before arriving at a final response [WS23;
Yu+24]. This contrasts with “System 1 prompting,” where a model directly produces a response
without intermediate steps. Inspired by human cognitive processes, System 2 prompting is de-
signed to handle complex reasoning tasks that System 1 might struggle with, by using techniques
like Chain-of-Thought [Wei+22a], Tree-of-Thoughts [Yao+23], Graph-of-Thoughts [Bes+24], Branch-
Solve-Merge [Sah+24], System 2 Attention [WS23], Rephrase and Respond [Den+24] and others.
These techniques aim to improve performance in areas like multi-step reasoning [Ran+24], mathemat-
ical problem solving [Wei+22a], and commonsense reasoning [Zha+24h]. While System 2 methods can
lead to more accurate and interpretable outcomes, they typically involve higher computational costs
and latency [Ope+24b].

Inference-Time Scaling Law In addition to the established scaling laws for training LLMs, recent
shifts in focus have highlighted the significance of inference-time scaling laws, particularly following the
introduction of OpenAI’s o1 model [Ope+24b], which is designed to extend various computational steps
before generating responses. This can be achieved, for example, by (1) generating multiple candidate
responses and selecting the best using methods such as automatic verifiers [Bro+24], reward mod-
els [Nak+22], or self-consistency [Wan+23d; Che+23b], or (2) enhancing the reasoning process within
a single trial by introducing more intermediate thinking steps like reflection and revision [Ope+24b;
Qin+24; Hua+24b]. For instance, [Bro+24] demonstrated that across multiple tasks and models,
the coverage—defined as the fraction of problems solved by any attempt—significantly scales with
the number of samples across four orders of magnitude. Complementing this, [Sne+24] showed that
optimizing inference-time computation through a combination of (1) searching against dense, process-
based verifier reward models, and (2) adaptively updating the model’s response distribution based
on the test-time prompt can yield greater performance improvements than merely scaling model pa-
rameters. Furthermore, [Dee+25; Tea+25] observed that by directly optimizing for outcome-based
rewards, the system can self-evolve and scale its inference time without external intervention, high-

14
lighting the dynamic nature of this process. Theoretically, advances have also been made, as shown
by [Liu+24g], which revealed that Transformers equipped with Chain-of-Thought (CoT) [Wei+22a]
processing—allowing the Transformer to auto-regressively generate a sequence of intermediate tokens
before answering questions—can be adept at handling complex problems that inherently require serial
computations. These insights collectively suggest that strategic enhancements in inference-time com-
putation could unlock new capabilities in LLMs, paving the way for more sophisticated and nuanced
machine reasoning.

3.5 LLM Reinforcement Learning & Preference Optimization

Current strong LLMs can be prompted to perform a variety of tasks. However, these models sometimes
exhibit unintended behaviors, such as making up facts or producing biased or toxic content [Wei+21;
Bom+22; Ken+21]. This issue has been framed as a lack of “alignment” of the models, where alignment
in LLMs is defined in terms of desiderata such as “ensuring that models are helpful, honest, and
harmless” [Ask+21].

3.5.1 Approaches to Aligning LLMs

To tackle the issue of aligning LLMs, researchers have developed two distinct categories of approaches:
(1) Reward-based methods, which involve training a reward model using preference data and sub-
sequently optimizing the model’s behaviors to maximize the reward received; and (2) Reward-free
methods, which eliminate the reward model altogether and instead directly utilize human preferences
to train the LLM.

Reward-based methods Many leading proprietary LLMs, including GPT-4 [Ope+24a] and Claude
3,12 employ reward-based methods for alignment, specifically Reinforcement Learning from Human
Feedback (RLHF) [Sti+20]. The RLHF pipeline usually includes three phases: (1) supervised fine-
tuning (SFT); (2) reward modeling and (3) RL optimization.
1. SFT Phase: RLHF typically starts by fine-tuning a pre-trained LM with supervised learning,
using high-quality data across a large and diverse set of instruction following tasks. This pro-
cess aims to establish a well-prepared initial model, denoted as π SFT , which serves as a good
foundation for subsequent training stages.
2. Reward Modeling Phase: During this phase, the SFT model is prompted with input x to
produce pairs of answers (y1 , y2 ) ∼ π SFT (y|x). These pairs are then evaluated by human labelers
who indicate their preference between the two, represented as yw ≻ yl |x, where yw and yl denotes
the preferred and dispreferred response among (y1 , y2 ) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. A common
approach to modeling these preferences is the Bradley-Terry (BT) model [BT52], which posits
that the human preference distribution p∗ can be expressed as:

exp(r∗ (x, y1 ))
p∗ (y1 ≻ y2 |x) = . (5)
exp(r∗ (x, y1 )) + exp(r∗ (x, y2 ))
(i) (i) ∗
Given a static dataset of comparisons D = {x(i) , yw , yl }N
i=1 sampled from p , we can parametrize
a reward model rϕ (x, y) and estimate its parameters via maximum likelihood. This setup is
treated as a binary classification problem, where the negative log-likelihood loss is defined as:

LR (rϕ , D) = −E(x,yw ,yl )∼D [log σ(rϕ (x, yw ) − rϕ (x, yl ))], (6)

and where σ is the logistic function. In the context of LMs, the network rϕ (x, y) is often initialized
from the SFT model π SFT (y|x) with an additional linear layer on top of the final Transformer layer
that produces a single scalar prediction for the reward value [Zie+20]. To ensure a reward function
with lower variance, it is common practice to normalize the rewards, such that Ex,y∼D [rϕ (x, y)] =
0 for all x. Additionally, having a separate reward model offers the advantage of utilizing it for
rejection sampling during inference time. This process involves generating multiple responses
12 https://www.anthropic.com/news/claude-3-family

15
to a user query and then employing the reward model to identify and select the most suitable
response, thereby enhancing the overall performance of the model [Sti+20].
3. RL optimization Phase: During the RL phase, the learned reward function rϕ (x, y) is used to
provide feedback to the language model. Following prior work [Jaq+17; Jaq+20], the optimiza-
tion is formulated as

max Ex∼D,y∼πθ (y|x) [rϕ (x, y)] − βDKL [πθ (y|x)||πref (y|x)], (7)
πθ

where β is a hyperparameter controlling the deviation from the reference policy πref , which is
initially set as the SFT model π SFT . In practice, the language model policy πθ is also initialized
to π SFT . This constraint is crucial as it helps maintain the model within the bounds of the dis-
tribution for which the reward model is accurate, ensuring diversity in generation and preventing
the model from collapsing to a few high-reward responses. Furthermore, the training process
requires careful implementation of early stopping to prevent reward distributions from collaps-
ing to the same distribution across all prompts, regardless of their diversity [Son+23]. Due
to the discrete nature of language generation, the reward function is typically constructed as
r(x, y) = rϕ (x, y) − β(log πθ (y|x) − log πref (y|x)), and maximized using PPO [Sch+17]. Addition-
ally, some research has explored alternative RL algorithms, such as REINFORCE [SB18], as po-
tential replacements for PPO in RLHF. This exploration aims to reduce the computational costs
and alleviate the challenges of sensitive hyperparameter tuning associated with PPO [Ahm+24].

Reward-free methods RLHF is a complex process that often requires significant memory resources
and extensive hyperparameter tuning. As a result, several recent studies have explored alternatives
to RLHF, with Direct Preference Optimization (DPO) emerging as a notable method. Described in
detail in [Raf+23], DPO is an offline preference optimization algorithm that eliminates the need to
train a separate reward model, thereby simplifying the process and enhancing training stability. It
utilizes a novel reward model parameterization that facilitates the extraction of the optimal policy in
a closed form. The DPO loss function is defined as:
h πθ (yw |x) πθ (yl |x) i
LDPO (πθ ; πref ) = −E(x,yw ,yl )∼D log σ β log − β log . (8)
πref (yw |x) πref (yl |x)

The DPO pipeline proceeds as follows: (1) Sample completions y1 , y2 ∼ πref (·|x) for every prompt x, la-
(i) (i)
belinig them with human preferences to construct the offline preference dataset D = {x(i) , yw , yl }Ni=1
and (2) optimize the language model πθ to minimize LDPO for the given πref and D and chosen β.
Typically, πref is initialized to π SFT whenever available. Despite its advantages, DPO exhibits several
limitations, including susceptibility to overfitting [Aza+24], a tendency to decrease the likelihood of
preferred responses [Pal+24], and inefficiencies in memory and processing speed due to the simultane-
ous maintenance of reference and current policies for loss computation [MXC24]. Additionally, DPO
can lead to an issue known as length explosion in responses [Par+24a]. To mitigate these challenges,
numerous studies have proposed different DPO variants [Aza+24; Eth+24; Pal+24; Pan+24; Xu+24a;
MXC24; Par+24a]. In particular, Iterative DPO [Xu+23] and variants yield improvements over DPO,
indicating the importance of training on model responses that are on-policy.

3.6 LLM Self-Alignment

Aligning LLMs using human feedback is often bottlenecked by the size and quality of human-annotated
data. As models reach or surpass human-level intelligence in particular domains, it is expected that
future models may require feedback that goes beyond what humans can provide in order to provide
an adequate training signal. Leveraging the LLM itself to provide such feedback, in particular to
create high-quality data for instruction-finetuning, called “synthetic data,” has become a promising
and scalable solution.

3.6.1 Synthetic Data Generation

Current alignment methods in synthetic data generation often involve several key components: (a)
Instructions, (b) Responses, and (c) Feedback on those responses.

16
Instructions have typically been provided by human users and consist of questions or tasks that
the user wants the model to address during a dialog interaction. Although there are publicly avail-
able instruction datasets sourced from human users, such as ShareGPT,13 OpenAssistant [Köp+24],
WildChat [Zha+24f], and LMSYS-Chat-1M [Zhe+24], large-scale instruction datasets remain scarce
due to privacy and other concerns. Consequently, several studies have focused on generating synthetic
instructions. For instance, Self-Instruct [Wan+23e] employs an iterative bootstrapping algorithm, be-
ginning with a limited set of manually crafted tasks to guide the generation process. This approach uses
few-shot prompting to enable the model to create novel prompts, which are then added to the prompt
pool for further bootstrapping. Similarly, MathScale [Tan+24b] extracts topics and concepts from
seed math questions to construct a concept graph. Concepts are randomly sampled from this graph,
and few-shot prompting is used to prompt GPT-3.5 to generate new questions. CodecLM [Wan+24c]
utilizes LLMs to summarize use cases and required skills from existing seed instructions, subsequently
decoding novel instructions based on different combinations of use cases and skills.
Utilizing Responses generated by models in the training loop has been found to have significant
potential in enhancing training outcomes, as in the standard RLHF training loop (cf. Sec. 3.5). For
example, self-generated responses can easily be leveraged through pairwise preference learning, where
positive and negative generations are both used to train the model [Pan+24]. Other examples of
self-training include STaR [Zel+22], which collects successful model-generated rationales that solve
problems effectively, using them as training data for iterative supervised fine-tuning.
To effectively utilize responses for supervised fine-tuning (SFT) or preference learning, it is crucial
to assess their quality through feedback. If a response is suboptimal, it may still be improved through
various methods. Feedback on responses can be binary or scalar, indicating the quality of a response, or
it can be preference-based, showing relative quality among multiple responses. For scalar or preference
feedback, while some tasks with fixed short answers are straightforward to evaluate, tasks requiring
long, free-form generation pose challenges. Standard approaches are to use a reward model, or by LLM
prompting, referred to as LLM-as-a-Judge. Recent studies have shown that LLM-as-a-Judge prompting
can yield feedback that aligns well with human judgments [Li+23; Zhe+23]. Feedback can also be
provided in natural language, critiquing the response. For instance, Constitutional AI [Bai+22] uses
LLMs to generate critiques and revisions based on constitutional principles to enhance model-generated
responses. Other specialized critic generators, such as Shepherd [Wan+23c], PandaLM [Wan+24b],
Auto-J [Li+24b], and LLMCRIT [YLG24], have been developed.

3.6.2 Self-Rewarding Language Models

Current alignment methods, such as RLHF, heavily depend on human annotation for some of the
aforementioned components, particularly instructions and feedback. However, there is a scarcity of
high-quality public instruction sources, especially those involving complex and challenging tasks, such
as advanced mathematics and reasoning. As AI systems become increasingly sophisticated, the reliance
on human feedback becomes more problematic. Superintelligent AI could generate outputs and exhibit
behaviors that surpass human comprehension, such as producing vast amounts of novel code that may
be difficult to interpret or potentially dangerous to execute. The question is how to maintain trust
and control over these powerful models. Training such AI systems that remain helpful and harmless
even as they surpass human-level performance suggests that we will need techniques and data that do
not rely primarily on human inputs.
One promising approach to address the challenge of aligning advanced AI is iterative preference
optimization [Xu+23], which leverages a reward model rather than relying on human annotators to
score LLM outputs. This method has proven effective when a well-trained reward model is avail-
able [TGH23; MXC24; Guo+24a]. However, a particularly intriguing idea is to forgo the assumption
of access to an external reward model. Instead, both the LLM and the reward model are the same,
allowing the model to improve itself autonomously. Demonstrating that iterative preference optimiza-
tion can also be effective with synthetic instructions would further validate its potential to provide an
entire autonomous training pipeline.
Self-rewarding language models [Yua+24] embody this concept. These models begin with a small
set of human-authored data that provides basic instruction-following and evaluation capabilities. The
model then iteratively improves itself by generating responses to synthetic instructions and scoring
13 https://huggingface.co/datasets/RyokoAI/ShareGPT52K

17
these responses using an LLM-as-a-Judge mechanism [Zhe+23]. The best and worst responses for
each instruction are selected for further training through DPO, allowing the model to enhance its
instruction following and evaluation capabilities without any human intervention. The authors show
that it is possible to achieve iterative self-improvement using the model’s instruction following and
evaluation capabilities on pure synthetic generated instructions. Building upon this, meta-rewarding
language models [Wu+24a] take the concept further by not only curating training pairs for instruction
following but also for evaluation (LLM-as-a-Judge performance). In each iteration, the model compares
multiple judgments on a single response and selects the best judgment pairs for DPO training in
addition to response pairs. This refinement allows the model to improve both its instruction-following
and evaluation skills more effectively than the self-rewarding approach. These methods represent a
significant step toward aligning highly advanced AI, reducing the need for direct human supervision.

3.6.3 Challenges
Although synthetic data can greatly enhance the scalability of model training, recent studies have
highlighted challenges associated with using data generated by large language models (LLMs).

Reward Misspecification In iterative preference optimization, reward models are crucial but in-
herently imperfect. These models, whether generative or classifier-based, serve as proxies for human
judgment, but their flaws can lead to significant issues [GSH23]. This phenomenon is often described
by Goodhart’s law, which states that when a measure becomes a target, it ceases to be a good
measure. In this context, over-optimization can amplify the reward model’s biases, distorting true
performance rather than enhancing it. Several types of biases can arise in this process. Position bias
occurs when an LLM exhibits a propensity to favor certain positions over others, potentially skewing
results [Zhe+23]. Verbosity bias is another concern, where an LLM judge favors longer, more verbose
responses, even if they lack clarity, quality, or accuracy compared to shorter alternatives [Sin+24;
Dub+24b]. Additionally, self-enhancement bias, a term adopted from social cognition literature, de-
scribes the tendency of LLM judges to favor answers generated by themselves, further complicating
the evaluation process [Zhe+23]. To address these challenges, researchers have introduced benchmarks
like RewardBench [Lam+24] and JudgeBench [Tan+24a]. Both of these benchmarks systematically
evaluate the abilities of LLM judges to correctly identify the better and worse responses within a pair.
They cover various categories, including chat, safety, and reasoning, among others. These benchmarks
help identify the strengths and weaknesses of different reward models, guiding the development of more
reliable and effective reward models for preference optimization.

Distribution Shift and Lack of Diversity Recent findings indicate that while LLMs can generate
text nearly indistinguishable from human-authored content, the overall distribution of these LLM-
generated texts differs notably from human text. Specifically, LLM-generated texts tend to have trun-
cated “tails,” i.e., they produce fewer extreme or rare outputs, and exhibit reduced variability [Shu+23;
Doh+24]. Training on LLM-generated data could potentially lead to a phenomenon where a narrow
range of behaviors results in decreased performance, as seen in tasks like language modeling [Shu+23;
Doh+24] and iterative preference optimization for mathematical reasoning [WLL25]. Studies show
maintaining a mix of clean, human-authored data alongside LLM-generated content during training
helps preserve diversity and prevents the model from deteriorating in performance [Shu+23; Doh+24;
Ger+24]. A few works go beyond the mixing scenario and study how to curate or filter synthetic data
to avoid such performance deterioration [Fen+24b; Zha+24b].

4 Designing Trustworthy LLMs by Statistical Methods

As LLMs increasingly permeate various aspects of society, ensuring their trustworthiness has become
a critical challenge. Trustworthiness encompasses a range of dimensions, including interpretability,
accountability, and algorithmic fairness. Statistical methods offer a rigorous and systematic approach
to address these challenges, providing tools to analyze, enhance, and monitor the behavior of LLMs.
This section explores how statistical techniques can contribute to the design of trustworthy LLMs
across several key areas. First, we discuss mechanistic interpretability, which aims to uncover how
LLMs make predictions and generate outputs. Next, we examine uncertainty quantification, a vital

18
component for understanding and communicating the reliability of model predictions. The section
then delves into methods for LLM watermarking, which ensures the provenance and authenticity
of generated content, and privacy-preserving mechanisms to safeguard user data. We also address
algorithmic fairness, focusing on statistical strategies to mitigate biases in LLM outputs. Finally, we
provide a statistical perspective on LLM self-alignment, a process for aligning model behavior with
human values and goals. Together, these topics underscore the essential role of statistics in fostering
trust and accountability in the era of large-scale AI models.

4.1 Uncertainty Quantification

While LLMs produce human-like responses with impressive accuracy across various tasks, they are also
prone to hallucination [Ji+23; RSD23], raising concerns about their reliability. Quantifying uncertainty
is crucial for addressing these limitations, as it allows models to provide not just answers but also
confidence in their outputs, enabling users to make more informed decisions.

Uncertainty Metrics While uncertainty estimation and calibration are well-established for tra-
ditional machine learning models [Abd+21; Gaw+23], the emergence of LLMs has introduced new
challenges and demands. Unlike fixed-dimensional outputs typical of traditional models, LLM re-
sponses are often complex, requiring uncertainty metrics that can operate on sentence-level outputs.
This complexity necessitates innovative approaches to quantify uncertainty effectively. Classical met-
rics, such as entropy, can be directly calculated on the probability distribution of next-token prediction
and averaged over all tokens [MG21]. To incorporate the special structure of language model, existing
approaches further considered semantic features such as semantic similarity [Fom+20; LLS22], se-
mantic equivalence [KGF23] and token importance [Dua+24b; Bak+24], as well as internal signals in
language models like logits and hidden states [Kad+22; Che+24a; Liu+24c] into the metric design. In
general, they can be easily computed in a white-box setting where the underlying representation and
prediction distribution are available, while for black-box models, some can be calculated via repeated
sampling of the response [Kad+22; LTS24; CM24]. Overall, these approaches aim to develop robust
uncertainty metrics capable of appropriately assessing the confidence of LLM-generated responses in
a meaningful and scalable manner.

Conformal Prediction in LLMs Conformal Prediction (CP) [VGS05; AB23] has emerged as a
versatile framework for distribution-free statistical inference. CP constructs confidence sets for predic-
tions based on the empirical distribution of residuals, ensuring validity without assumptions about the
underlying model or data. Its flexibility and computational efficiency have made it an appealing tool
for LLMs despite challenges like large output spaces and non-exchangeable token sequences. To address
these issues, works such as [Kum+23; Ren+23] have restricted the output space, applying CP to tasks
like multiple-choice question answering and robot actions, while [RGG23] calibrated nucleus sampling
to improve token-level predictions. Other methods, such as [UZM24], adapt CP to non-exchangeable
settings by leveraging latent representations for nearest-neighbor searches, resulting in more precise
prediction sets. Beyond improving accuracy, CP has been extended to control quantities like toxicity
and hallucination risks [Zol+24; Yad+24; MH24; CGC24], enabling safer and more aligned LLM ap-
plications. CP methods have also been leveraged for evaluating LLM performance. [Ye+24a] applied
CP to benchmark LLMs on five natural language processing tasks, measuring average confidence set
sizes for multiple-choice questions to quantify uncertainty. In machine translation, [Gio23] and [ZM24]
used CP to assess translation quality, providing calibrated confidence estimates for both human and
machine evaluations. Additionally, [Sch+21; Sch+22b] proposed confident early exiting methods for
Transformers, where intermediate layers assess uncertainty to speed up inference while maintaining
consistency with the full model.

Hallucination Detection Recently, there has been a growing trend toward adopting uncertainty
estimation methods to address hallucination detection in LLMs. The core idea is that the logits and
hidden states of LLMs encapsulate information about the model’s confidence in its generated output,
which can be leveraged to identify hallucinations. For example, [AM23; Slo+23; Che+24a] use activa-
tions from hidden layers as input to train a classifier specifically designed to detect hallucinations, while

19
[Ver+23] develop epistemic neural networks that aim to reduce hallucination occurrences by model-
ing epistemic uncertainty. Despite these advances, the lack of a consistent and rigorous definition of
hallucination across the literature poses challenges to standardizing this task. However, uncertainty
estimation remains a well-defined problem, and insights from uncertainty quantification can be di-
rectly applied to improve hallucination detection in LLMs, offering a robust statistical foundation for
this critical challenge. One example to frame the hallucination detection within a hypothesis testing
framework. Specifically, given a question q and an LLM-generated answer M (q), the problem can be
formulated as follows:
H0 (q, M (q)) : M (q) is not correct for q vs. H1 (q, M (q)) : M (q) is correct for q.
The object is to construct a test function T̂ (q, M (q)) ∈ {0, 1} that, with probability at least 1 − δ
over its construction, ensures that at least 100(1 − α)% of incorrect (q, M (q))
pairs
in future question-
answering instances are detected: PT̂ Pq,M (q) T̂ (q, M (q)) = 1 H0 (q, M (q)) > α ≤ δ. To achieve this,
[Nie+24] propose a hypothesis test that provides finite-sample, distribution-free type I error control,
leveraging a set of i.i.d. samples consisting of (q, M (q)) along with a correct answer for q.

Future Research Although existing uncertainty quantification methods have shown promise in var-
ious aspects of language models, many challenges remain. First, many uncertainty metrics rely on the
entropy of the next-token prediction distribution in the white-box setting or the consistency between
multiple responses in the black-box setting. However, uncertainty in the generation distribution may
not appropriately capture the factual accuracy of language model outputs. For example, when a
language model exhibits high confidence in incorrect facts, its confidence estimates can become mis-
calibrated, making it difficult to detect errors through uncertainty metrics alone. Furthermore, recent
research has demonstrated that these metrics can be fragile and easy to manipulate under adversarial
attacks [Zen+24]. Future work should focus on developing more robust uncertainty metrics that better
reflect the reliability of responses and improve the alignment of existing metrics with factual correct-
ness. Second, as previously mentioned, the broad applicability of conformal predictions in the context
of language models is limited by the challenges of non-exchangeability and the large discrete space of
natural language data. Addressing these limitations by designing computationally efficient conformal
prediction methods that are better aligned with the structure of language data is an exciting area for
future research. Lastly, beyond current applications, we want to highlight that uncertainty quantifica-
tion has the potential to benefit a broader range of applications, such as generalizing to multi-round
interactions, guiding data collection, and enhancing model interpretability. Exploring these directions
can be helpful in advancing uncertainty-aware language models and improving their trustworthiness
in real-world applications.

4.2 LLM Watermarking

The capability of LLMs to generate human-like text has raised significant concerns regarding poten-
tial misuse. For instance, StackOverflow, a leading question-and-answer platform for programmers,
implemented a temporary ban on AI-generated responses to prevent the dissemination of misleading
information [Vin22]. Similar concerns about misinformation proliferation have emerged across various
domains, including fake news [Zel+19], academic integrity [MML23], and data authenticity [Shu+23].
Consequently, robust techniques for attributing text to LLMs have become essential for enabling indi-
viduals and platforms to detect and enforce policies regarding LLM usage.
A direct approach to detecting LLM usage in text is to train detectors based on linguistic features
that distinguish human-written text from LLM-generated text. Some studies trained these detectors
in an end-to-end manner on collected data [GPT23; Zer23; BL24; KKO24], whereas others exploited
structural properties of LLMs for detection [Ipp+20; GSR19; Mit+23] or relied on inherent stylistic
distinctions without training [Yan+24; Tul+24]. However, these ad-hoc methods have shown degraded
performance as LLMs become increasingly capable of generating human-like text [Web+23]. It remains
true that contemporary LLM-generated text still exhibits distinguishable features compared to human-
written text [PCJ25]; furthermore, these methods often exhibit vulnerability to adversarial attacks and
can show bias against non-native English writers [Kri+24; Sad+23; Lia+23], but this is an area that
is evolving.
A more proactive and controllable approach involves subtly modifying the text generation processes
of LLMs to enable provable detection, known as watermarking, by embedding subtle yet detectable

20
statistical signals into the generated text [Kir+23a]. This technique leverages the probabilistic nature
of LLMs, allowing the model provider to utilize pseudorandomness in the generation process.

Statistical Formulation The statistical framework for watermarking in text generation can be
formulated as follows [Li+25]: let the next-token prediction distribution at the t-th token be Pt , which
represents a multinomial distribution. The watermarked LLM first generates a pseudorandom variable
ζt using a hash function
ζt = A(w1:(t−1) , Key),
where w1:(t−1) denotes the first t − 1 tokens, and Key represents a private key. It then samples the next
token wt = S(Pt , ζt ) through a decoder S. The decoder can be deterministic or incorporate randomness
that is independent of ζt . This design enables the computation of the pseudorandom number from
the observed text and the private key in both the generation and detection phases (transmitted via a
trusted protocol). Formally, the tuple (A, S, Key) constitutes a watermark.
Detectability of the watermark signal is enabled by the dependence between the token wt and
pseudorandom variable ζt . [Li+25] proposed constructing a pivot statistic Yt = Y (wt , ζt ) for t =
1, . . . , n that follows a fixed distribution for human-written text, which formulates the problem of
distinguishing between human-written text (null) and LLM-generated text (alternative) as hypothesis
testing:
H0 (human) : Yt ∼ µ0 i.i.d. for all t vs. H1 (LLM) : Yt ∼ µ1,Pt for all t. (9)
One can then either reject or accept the null hypothesis via
( Pn
1 if h(Yt ) ≥ γn,α ,
Th (Y1:n ) = Pt=1
n (10)
0 if t=1 h(Yt ) < γn,α ,

where γn,α is a threshold determined by the sample size n and the significance level α.
To elucidate the rationale behind the detection rule (9), observe that under H0 , human text compo-
sition follows complex cognitive processes that are statistically independent of pseudorandom variables
ζ1:n . By construction,
Pn Yt = Y (wt , ζt ) follows a known distribution µ0 provided that wt is independent
of ζt . Hence, t=1 h(Yt ) follows a known null distribution so that an appropriate threshold can be
calculated to control the type I error. Under H1 , Yt follows µ1,Pt , which depends on the unknown
and varying token distribution Pt . An effective choice of the score function h would increase this sum
under the alternative hypothesis. Indeed, [Li+25] established a general framework to determine the
optimal score function through a minimax optimization problem.

Unbiased Watermarks The Gumbel-max watermark [Aar23], implemented internally at OpenAI,

exemplifies watermarks that can be analyzed within this statistical framework. It relies on the Gumbel-
max trick: let W be the token vocabulary, and let ζ = (Uw )w∈W consist of |W| i.i.d. copies of
U [0, 1]. Then, the Gumbel-max trick states that14 arg maxw∈W (log Uw )/Pw follows the multinomial
distribution P . [Aar23] proposed using the decoder

log Uw
S(P, ζ) = arg max
w∈W Pw
for selecting the next token. Several detection rules of the form (10) have been proposed [Kud+24;
Fer+23; Li+25]. Notably, [Li+25] developed an optimal sum-based test under certain conditions,
achieving the fastest exponential rate of decrease in type II errors. More recently, [Li+24c] introduced
a robust detection rule based on a truncated family of goodness-of-fit tests.
Because the Gumbel-max trick samples exactly from the underlying multinomial distribution, the
resulting watermark is unbiased. Formally,
a watermark is unbiased if, for any token w and token dis-
tribution P , one has Pζ S(P, ζ) = w = Pw . Unbiased watermarking schemes are generally preferred,
as they preserve the LLM’s token distribution, thereby maintaining text quality. Another unbiased ex-
ample is the inverse transform watermark [Kud+24], which corresponds to inverse transform sampling
of the multinomial distribution. Its optimal detection rule was also derived in [Li+25; CLZ25].
14 In an abuse of notation, in this section P denotes the coordinate of the probability vector P corresponding to token
w
w, while generally we write Pt to refer to the entire vector at time t.

21
Biased Watermarks Beyond the unbiased regime, the green–red list watermark [Kir+23a] is a
popular instance of a biased watermark, due to its simplicity and intuitive design. In this watermark,
the token vocabulary is randomly partitioned into a “red list” and a “green list,” where the probabilities
of green tokens are increased. One parameter in this watermarking scheme controls the magnitude of
the distortion in the next-token distribution, while another determines the size of the green list. During
detection, if the proportion of green tokens in a text exceeds a specified threshold, the text is classified
as LLM-generated. Several studies have refined this approach [Huo+24; Wou24; Cai+24] by proposing
methods to optimally select these parameters, thereby balancing watermark detectability with text
quality. Meanwhile, [Xie+24b; Hu+24; Wu+23b] introduced unbiased variants of the green–red list
watermark by applying techniques such as maximal coupling. Furthermore, an optimal detection
rule is established in [CLZ25], addressing both the minimization of type I and type II errors and the
minimization of type II error for a fixed type I error rate.

Other Watermarking Schemes A variety of other watermarking schemes have been proposed,
and the list is growing. Many have been evaluated only empirically, and rigorous statistical analysis
remains to be developed in these cases. Some of the more statistically inspired approaches include that
of [Dat+24], who introduced a production-ready watermarking system at Google DeepMind based on
tournament sampling for multinomial distributions. [CGZ24; ZWL23] employed a secret, hash-based
mechanism to subtly modulate token-selection probabilities, ensuring that the watermark remains
invisible without the key but verifiable with it. [Fai+23] presented a watermark that is highly detectable
and embeds a publicly verifiable cryptographic signature into the LLM output using rejection sampling.
[He+24b] developed a watermarking scheme for tabular data following the principles of the green-red
list watermark. [Xia+24b] injected watermarking signals into both the frequency and pixel domains
of images after generation and employs a classifier to detect the watermark’s presence. [ZLW24]
investigated decoders that transform logit vectors into token probabilities and proposed a provably
robust watermark. [He+24a] characterized optimal watermarking schemes by minimizing a lower
bound on the worst-case type II error while ensuring the worst-case type I error and watermarked-text
distortion remain below specified constants. [GF24] introduced an approach that first selects a sound
watermark detector producing p-values, then generates multiple candidate texts from an LLM prompt,
and finally outputs the text with the lowest p-value. Although this strategy enhances robustness, it
increases computational overhead due to multiple text generations.

Future Research Looking ahead, several challenges persist in developing comprehensive statisti-
cal foundations for watermarking in complex usage scenarios. In practice, watermarked text can be
compromised through paraphrasing or content alteration that removes or obscures watermark sig-
nals [Kir+23b; TZL24; Zha+24e]. Notably, text often comprises a mixture of human-written and
LLM-generated content, necessitating further research on unknown and complex source compositions
[Li+24c]. One direct solution involves developing algorithms for localizing watermarked segments
[Zha+24g; LLZ24]. Another unresolved challenge concerns pseudorandomness collision, where repeated
pseudorandom numbers occur in the text sequence [Kud+24]. [Wu+24b] demonstrated that collision
introduces bias into distortion-free watermarks and proved the impossibility of perfect distortion-free
watermarks under collisions. Certain watermarking schemes are also vulnerable to “watermark steal-
ing,” where attackers can reverse-engineer and remove or spoof the watermark [JSV24]. A crucial
challenge in implementing watermarks for LLMs lies in ensuring robustness against adaptive prompt-
ing and supporting multi-user tracing [CHS24b; CHS24a]. Moreover, achieving low computational
overhead in watermark detection remains critical [Hua+23a]. Additionally, watermarking methods
can be utilized to protect copyrighted training data by detecting data misappropriation [CLZ25].
From a theoretical perspective, a statistical framework of watermarks necessitates assuming simple
yet informative structures of next-token prediction distributions [Li+25]. For empirical evaluation of
watermarks, current metrics may be insufficient for assessing how biased watermarks affect generated
content [SZ23]. Finally, when watermarking schemes are implemented in conjunction with acceleration
techniques such as speculative sampling, [HH24] identified an inherent trade-off between watermark
strength and sampling efficiency.

22
4.3 Privacy and Copyright
Over the past few years, language models have grown in model size and sample size at an unprecedented
speed, making the preservation of user privacy increasingly challenging. Prior work has shown that
LLMs are able to memorize a large portion of training data [Car+21; Car+23], and adversaries can
easily extract gigabytes of training data from LLMs. For example, [Nas+23] showed that ChatGPT
reveals a person’s personal contact information when prompted with “repeat this word forever: ‘poem
poem poem poem’,” and the authors have recovered ten thousand examples from ChatGPT’s training
dataset at a query cost of 200 US dollars. This poses significant risks to user privacy as personal
information and confidential documents could be disclosed through interaction with the model.

Data Sanitization The simplest way to avoid releasing private information is to remove this infor-
mation from the training data. This approach is often referred to as the “data sanitization procedure”
[OZ03; Ami07]. The data sanitization procedure can be formulated as a binary classification problem
and can be efficiently implemented by modern machine learning methods [Der+17; Lis+21; Vak+22].
Recently, data sanitization has been widely used as a pre-processing step for removing personally iden-
tifiable information (PII) or protected health information (PHI) in many companies, such as Microsoft
and PayPal [Wil23; BMW20]. However, sanitization relies on a formal definition of private information,
and for language data, this definition may depend on the context and have no well-defined boundary.
Therefore, data sanitization works best for well-formatted private data, such as social security numbers
and medical records, and has limited power for general privacy-preserving purposes [Bro+22].

Differential Privacy To preserve data privacy from the model side, a standard approach is to exploit
the framework of differential privacy (DP) [Dwo06]. DP ensures that adversaries cannot distinguish
whether a specific data point is included in the training set of the model. The standard method to
provide DP guarantees in deep learning is to replace the standard optimizers with DP optimizers (e.g.,
DPSGD [Aba+16; Bu+20]), an approach that has been extensively used in LLM training [Hoo+21;
Ani+22], fine-tuning [Li+21; Yu+21; Hua+23c], and prompt learning [Dua+24a]. However, as DP
optimizers require clipping the gradient and injecting noise into the training procedure, it can hurt
the model performance and require more computational resources for hyperparameter tuning [Li+21],
making many of the existing methods impractical at the scale of current LLMs. This is an area where
significant new research is needed.

LLM Unlearning After LLMs are trained, it would be desirable to eliminate the influence of specific
data from the model while preserving the model’s utilities on other tasks. This task is often referred
to as “machine unlearning” [CY15]. While exact unlearning requires re-training the language model
without the target data from scratch, it is possible to approximately achieve machine unlearning
efficiently. [NRS21; Gin+19; Guo+20; Sek+21; Geo+24] introduced theoretical metrics for machine
unlearning based on the notion of differential privacy and proposed unlearning methods based on
Newton update removal mechanisms. However, these algorithms require computing the Hessian of loss
functions, which is intractable for LLMs.
Recent research has explored computationally efficient unlearning methods for LLMs. Gradient
ascent is a commonly used technique that reverts the learning process by minimizing the next-token
prediction likelihood on target data [Jan+23; YXL24]. However, reverting the optimization process
through gradient ascent can be unstable as it leads the model parameter to diverge. To mitigate the
issue, [Zha+24d; Fan+24] designed an alternative loss function named “negative preference optimiza-
tion” (NPO). The NPO loss generalizes the gradient ascent objective via adaptive weighting of the
unlearning samples and ensures that the loss function remains bounded, thereby achieving a slower
divergent rate and enhanced stability compared to gradient ascent. Another variant of gradient ascent
is to relabel the target data by randomly assigned labels and train the model to fit on the random
labels [YXL24]. Beyond the gradient ascent-based method, there are several different approaches, such
as localizing and fine-tuning the crucial model units (e.g., layers, weights, neurons) for the unlearning
task [Men+22; PHB24; Yu+23; Wu+23a] and using influence functions [Jia+24a]. However, most of
the current unlearning methods require specifying a target task or content to be unlearned, and there
is still a lack of standardized corpora for LLM unlearning [Liu+25].

23
Copyright Beyond user privacy concerns, the ability of LLMs to memorize and reproduce training
data raises critical issues regarding copyright protection. Copyrighted material embedded in train-
ing datasets can appear in model outputs, either inadvertently or deliberately, potentially violating
intellectual property rights [Sam23]. While this issue can be partially addressed through data saniti-
zation, differential privacy, and prompting techniques that mitigate the risk of disclosing copyrighted
material [Elk+23; VKB23; CS24], urgent research efforts are needed to ensure that training processes
and model outputs remain unconnected to specific instances of copyrighted content. Such research is
crucial for establishing a pathway toward building models that comply with copyright regulations and
support responsible AI deployment. Taking a different angle, [Wan+24a] has proposed an economic
framework leveraging cooperative game theory principles to enable model developers to compensate
copyright owners for using their data in training.

Data Misappropriation A related issue is data misappropriation, which refers to the unauthorized
use, access, or exploitation of data for unintended or unpermitted purposes, often violating legal or
ethical regulations. This concern has been at the center of several high-profile debates. For example, the
lawsuit between The New York Times and OpenAI [Tim23] highlights tensions surrounding the use of
copyrighted data in training LLMs. Additionally, OpenAI’s Terms of Service explicitly prohibit the use
of ChatGPT’s outputs to develop competing models, underscoring the need for mechanisms to detect
whether a newly trained LLM has incorporated ChatGPT-generated content—a process often referred
to as model distillation. Detecting such data misappropriation is challenging, as the probabilistic nature
of LLMs generates content that may resemble, but does not directly copy, the original data [Sag23;
Ges24]. This difficulty has spurred significant research into methods for identifying and tracing LLMs-
generated data [Sad+23; Mit+23; Ren+24]. A statistical hypothesis testing framework is established
in [CLZ25], and optimal detection of data misappropriation is established.
These challenges and advances highlight the urgent need for robust frameworks that ensure ethical
AI development, protect intellectual property, and maintain trust in AI systems. The interplay between
statistical methods and practical detection mechanisms will continue to play a critical role in navigating
the complex issues of unlearning, copyright, and data misappropriation in LLMs.

Future Research While significant progress has been made in privacy protection, copyright en-
forcement, and unlearning, several key challenges remain. One major challenge is the evaluation of
privacy-preserving techniques. Current methods, such as differential privacy and data sanitization,
lack standardized evaluation metrics, making it difficult to quantify the trade-offs between privacy
protection and model performance. Developing robust benchmarks to assess these trade-offs is essen-
tial for advancing practical privacy solutions in LLMs. Another challenge lies in context-dependent
privacy risks. The definition of private information is highly context-dependent, making it difficult to
apply a one-size-fits-all privacy solution. While sanitization techniques work well for structured data,
sensitive information in free-form text often requires more nuanced handling. Future research should
explore adaptive privacy mechanisms that dynamically assess context before enforcing safeguards, en-
suring more reliable protection across diverse applications. Furthermore, continual unlearning remains
an open problem. Most existing unlearning methods assume a static dataset, but in practice, LLMs
are often deployed in environments where data continuously evolves. As new information is incorpo-
rated, there may be legal or ethical requirements to forget specific data, requiring efficient and scalable
unlearning techniques. Developing frameworks that support continual or real-time unlearning will
be critical for maintaining compliance with evolving data privacy regulations while preserving model
utility. Addressing these challenges will require collaboration between statisticians, machine learning
researchers, legal experts, and policymakers. By integrating rigorous statistical methodologies with
emerging AI advances, the community can work toward building more transparent, accountable, and
privacy-preserving language models.

4.4 Interpretability
Interpretable machine learning is a broad concept that captures “the extraction of the relevant knowl-
edge from a machine learning model concerning relationships either contained in data or learned by the
model” [Mur+19]. As LLMs have been deployed in more and more real-world applications, their inter-
pretability has received an increasing amount of attention as people wish to ensure their alignment with

24
human values and understand their potential risks and failures. Simple machine learning models, such
as linear regression or decision trees, are often considered interpretable since the dependency of model
output on the model structure and training data is easy to characterize. Language models, however,
contain billions of parameters and numerous layers, such that the precise dependency of output on data
and model structure can be too complicated for humans to comprehend. Therefore, recent efforts have
been focusing on mechanistic interpretability, which aims to explain the LLMs on an algorithmic
level through reverse engineering the detailed computation performed by the LLMs. As proposed in
[Ola+20], the current mechanistic interpretability research consists of three areas: features, circuits,
and their universality.

Features Unlike classical tabular data, where each of the coordinates represents a concrete variable,
textual input is highly structured and it is unclear how the LLMs extract meaningful features from
the data. Recent work has found that the features are learned and encoded by groups of neurons in
LLMs; for example, [Gur+23] showed that some neurons in LLMs are activated for names of sports.
Similarly, neurons that encode various features have been discovered, including sentiment neurons
[RJS17], knowledge neurons [Dai+22], and skill neurons [Wan+22]. Moreover, it was found that
the LLMs can encode multiple features in a single neuron [Elh+22a; Elh+22b; Gur+23]; i.e., the
neuron can be activated by different concepts. This leads to the hypothesis of superposition, which
implies a model can represent a greater quantity of features compared to the number of neurons.
To extract superposition features, [SBM22; Bri+23; Hub+24] train sparse autoencoders to map the
neuron activations in LLM to a higher-dimensional representation with sparsity. By jointly minimizing
the reconstruction loss with the L1 penalty on the high-dimensional representation, the researchers
have successfully extracted features that are more interpretable than the original neuron activations.

Circuits Instead of identifying individual features, a global approach to mechanistic interpretability

is to identify the “circuits” in LLMs [Ola+20; Elh+21]. This approach is motivated by the circuit
hypothesis that views LLMs as a computation graph that implements their capability through the
composition of several subnetworks within the model. Formally speaking, a language model is repre-
sented as a directed acyclic graph G = (V, E), where V denotes the set of nodes (the MLP layers or
attention layers) and E denotes the set of edges (the connection between those layers). For a specific
task, a circuit is defined as a sub-graph in LLMs that satisfies the following three criteria [Wan+23b;
Shi+24]:
• Faithfulness: The circuit can perform the task as well as the whole model.

• Completeness: The circuit contains all the nodes used to perform the task.
• Minimality The circuit doesn’t contain nodes irrelevant to the task.
Using this approach, researchers at Anthropic have identified the “induction head” that is crucial for
LLMs to execute in-context learning [Ols+22]. An induction head consists of two attention heads that
work together to copy a previous pattern, for example, if [A][B] appears in the sequence earlier, the
induction head will make the Transformer more likely to predict [B] when the model receives input
[A] again. Therefore, the induction head can capture the information in the prompt and provide the
primary source of in-context learning ability. Similarly, existing research has identified the correspond-
ing circuits in LLMs for various tasks, including indirect object identification [Wan+23b], doc string
completion [HJ23], and acronym prediction [GMT24]. Furthermore, to make the circuit identification
process automatic for general tasks, [Con+23] proposed an algorithm that greedily removes the edge
on the computational graph in a topological order.
The evaluation of a circuit often involves an ablation study that knocks out the circuit, imputes
the computation by either zero or the mean value, and then compares the performance of the re-
maining model to the original model. To make the comparison more precise, [LJ24] proposed an
optimal ablation procedure to optimize the performance of the remaining model on various imputation
methods. Furthermore, to formalize the circuit evaluation process statistically, [Shi+24] proposed a
nonparametric hypothesis testing procedure to test the three criteria for valid circuits above.

25
Universality Although mechanistic interpretability has provided a systematic approach to investi-
gating how an LLM implements a certain task, a critical question is that the features and circuits are
specific to a given LLM model (usually a toy model or smaller model such as GPT-2), and it is un-
clear whether these findings are universal across all models. Investigation into universality has yielded
a mixed result: [Ols+22; Wan+23b] has discovered similar circuits in multiple LMs, while [CCN23]
also found evidence that LMs trained from different initialization may implement different circuits.
Therefore, understanding the degrees of universality of mechanistic interpretability remains a crucial
open problem. We refer interested readers to [Rai+24] for a more practical review of mechanistic
interpretability in LLMs.

Physics of LLMs Except for the general mechanistic interpretability approach described above,
another seminal series of works aimed at interpreting the LLM through the “physics” perspective and
discovering the universal law of all LLMs [All24]. Toward this goal, it divides the intelligence of LLM
into multiple different dimensions such as structure, reasoning, and knowledge, and then studies each
individual dimension through idealized control experiments. In each experiment, the authors manage
the data and tweak hyperparameters, such as data quantity, type, difficulty, and format, to determine
the impact of each factor on LLM performance and suggest further improvement. Using this approach,
extensive experiments and observations have been made on learning hierarchical language structures
[AL23a], graduate-level mathematical reasoning [Ye+24b; Ye+24c], knowledge extraction [AL23b],
manipulation [AL23c], and capacity [AL24].

Geometric Laws Recent empirical studies have revealed several geometric laws that emerge in deep
learning models after training. This line of research was pioneered by [PHD20], which documented a
pervasive phenomenon called “neural collapse” in multilayer neural networks. Neural collapse refers
to a geometric law in which the last-layer features and weights exhibit symmetric structures that favor
large margins, an inductive bias with significant implications for interpretability due to its precise
geometry [Fan+21; Ji+22; Thr+22]. This law has since been extended to intermediate layers and
LLMs [HS23; WP24; HS24]. In the context of LLMs, [HS24] offers a new perspective on feature
formation by showing that pre-trained LLMs enhance the predictability of contextualized features
evenly across all layers, from the first to the last.

Future Research. While recent advances have improved LLM interpretability, several critical chal-
lenges remain. A major challenge is developing systematic methods for editing internal representations
to induce desired behaviors, such as reducing hallucinations or improving factuality, without retrain-
ing. Instead of relying solely on post-analysis approaches, future work should explore ways to train
AI models with interpretability as a built-in objective, potentially enabling more transparent and
predictable model behavior from the outset. Additionally, enhancing safety through interpretability
is essential for mitigating vulnerabilities like adversarial manipulation or harmful content generation.
Collaborative efforts between machine learning researchers and statisticians will be key to building
more interpretable and reliable LLMs that support safer and more accountable AI systems.

4.5 Algorithmic Fairness

Modern LLMs are trained with massive quantities of text collected from the real world, and biases
towards certain groups or individuals are inherited from the training data. As a consequence, the usage
of LLMs may lead to discrimination against certain gender [KDS23], race [An+24], religion [AFZ21],
and many other sensitive characteristics [NCR23] in downstream applications. For example, when
asked to write reference letters for Kelly (a common female name) and Joseph (a common male name),
it was observed [Wan+23a] that ChatGPT tends to describe Kelly as a warm and likable person and
Joseph as a leader and role model, indicating that LLMs tends to follow the gender stereotypes that
males are associated with leadership.

Statistical Formulation Before applying any mitigation techniques, it is essential to establish a

statistical formulation for algorithmic fairness in LLMs. Fairness assessments typically consider paired
groups, denoted as A and B, such as sentences including male and female words, or young and old
groups. The primary objective is to evaluate disparities between these groups.

26
Fairness in LLMs can be formulated using two main approaches: outcome-based metrics and
probability-based metrics. Outcome-based metrics treat the LLM as an opaque system, assessing
fairness based on the scores (or labels) directly associated with the model’s outcome response. Follow-
ing [Liu+20a], a dialogue model D can be represented as a function D : C 7→ R that maps a context
C to a response R. Given a measurement function M that assigns a scalar score s to each response
R, the model D is considered fair for groups A and B with respect to M if:

EC∼TA M (D(C)) = EC∼TB M (D(C)),

where TA and TB denote the distribution of the context C related to groups A and B respectively.
In contrast, probability-based metrics analyze the behavior of the language model by examining
the probabilities it assigns to outputs. For example, the probability of generating profession-related
words like “engineer” or “doctor” should be similar for male names like “Joseph” and female names
like “Kelly”. Likewise, prompts related to African Americans should yield a comparable rate of toxic
adjectives as those related to European Americans.
To formalize fairness, let U represent a specific set of sensitive words (e.g., high-paying professions,
toxic adjectives), U denote the collection of all such sets, and G = {(Ai , B i ) | i = 1, 2, . . . I} define the
paired groups of interest. A fairness criterion, following [ZRZ24], can be expressed as:

|P(o(x) ∈ U | x ∈ A) − P(o(x) ∈ U | x ∈ B)| ≤ α, ∀U ∈ U, (A, B) ∈ G, (11)

where the probability is taken over o(x) ∼ f (x), with the next-token output o(x) drawn from the
language model f (x).
To address these fairness issues, many bias mitigation techniques have been studied in different
stages of the LLM workflow, including data pre-processing, model training (in-processing), and post-
processing [Gal+23].

Pre-Processing In the pre-processing stage, the general methodology is to break the imbalance
between different groups in the training data. Under this category, typical approaches include data
augmentation via alternating group-imbalanced words [Lu+20], subset selection [GMA22], instance
reweighting [HBC22]. Collectively, these pre-processing strategies align the data with fairness objec-
tives, enabling LLMs to learn representations that are less prone to perpetuating societal biases.

In-Processing In the model training stage, bias can be mitigated by leveraging some fairness metrics
into the training objective. A common approach is to add a fairness metric into the loss function as
a regularizer. In particular, a family of distance-based metrics is used to shorten the distance of
different groups in the embedding space, such as ℓ2 distance between sentence embeddings [Liu+20a],
cosine similarity [Hua+20] and Jensen-Shannon divergence [Yan+23; Woo+23]. Besides distance-
based metrics, another line of work attempts to design metrics that can disentangle the relationship
between embeddings and the group identity. Towards this goal, [BB19; KB21; Par+23] proposes to
encourage the orthogonality via minimizing the projection length of neutral word (e.g., leadership)
embeddings onto gender embeddings, and [CPC21; WCH23] encourage independence between learned
word embeddings and gender via minimizing the mutual information.
In addition to adding fairness metrics as regularization terms, one may also change the training
objective to mitigate bias. Along this line, [Xia+24d] identifies an inherent algorithmic bias emerging
from RLHF in the post-training phase of LLMs, which significantly amplifies majority preferences
while diminishing minority preferences. To resolve this fairness concern, [Xia+24d] introduces prefer-
ence matching RLHF, which incorporates an additional regularization term in the reward objective.
[Che+21; He+22a; Oh+22] applies contrastive learning with counterfactual data augmentation to max-
imize the similarity between the original sentences and their counterfactual analogs. [ZLM18; Jin+21;
HBC21] use an adversarial training framework, where they train an attacker to predict the protected
attribute from the embeddings of the encoder, and an encoder to prevent the attacker from identi-
fying the protected attribute. [Ouy+22; Bai+22] utilize reinforcement learning with feedback from
human or machine learning models to encourage the model generate fair texts. [Nak+24] establishes
a theoretical framework of leveraging LLMs to artificially increase the sample size of underrepresented
classes in imbalanced classification and spurious correlation problems.

27
Post-Processing After the model is trained, one may enforce fairness by modifying the generating
distribution. A direct approach is to constrain the distribution to generate texts with only unbiased
words and phrases, for example, [Geh+20; Rol+21] forbid the use of toxic words according to a list,
[SSB22] search for outputs in the distribution with different genders to the highest likelihood output,
[Shu+22; Sch+22a] use another ML model to identify the harmful words and replace them with neutral
words. An alternative approach is to edit the next-token probability. For example [Geh+20; Kim+23;
LKW23] reweight the token probability in the generatve distribution according to a fairness score
learned from an evaluation model, and [Liu+21; Hal+23] separately train smaller expert and anti-
expert models specialized in generating non-toxic and toxic texts, and then in the inference stage,
tokens with higher probability in the expert model are up-weighted and tokens with lower probability
in the anti-expert model are down-weighted. There are also recent studies that apply multi-group
fairness notions in LLMs. [ZRZ24] introduces a framework for post-processing machine learning models
so that their predictions satisfy multi-group fairness guarantees, and applies this framework to achieve
the notion (11).

Future Research While significant progress has been made in mitigating algorithmic biases in
LLMs, several key challenges and open problems remain. One fundamental issue is the trade-off
between fairness and utility. Bias mitigation techniques often come at the cost of reduced model per-
formance, particularly in nuanced real-world applications. Future research should explore principled
ways to balance fairness constraints with overall model accuracy, potentially by developing adaptive
or task-specific fairness constraints that optimize both fairness and utility. Another critical chal-
lenge is context-dependent fairness. Existing fairness metrics and debiasing strategies typically rely
on predefined sensitive attributes such as gender or race. However, fairness concerns can be highly
context-dependent, varying across applications, languages, and cultural perspectives. A promising
direction is to develop dynamic, context-aware fairness measures that adapt to different scenarios, en-
suring that bias mitigation strategies remain relevant across diverse settings. Furthermore, fairness in
multi-modal and interactive AI systems remains an understudied area. Many real-world applications
involve interactions between text, images, and structured data, and biases may manifest differently
across modalities. Research should focus on extending fairness-aware learning techniques to multi-
modal LLMs and conversational AI systems to mitigate biases in more complex settings. Additionally,
the long-term impact of fairness interventions requires further investigation. Most current evaluations
assess bias mitigation based on short-term performance metrics, but biases may re-emerge as models
are fine-tuned, updated, or interact with users over time. Developing robust longitudinal studies to
assess the durability of fairness interventions will be crucial for ensuring sustainable bias mitigation.
Finally, scalability and computational efficiency of fairness interventions remain practical concerns.
Many existing approaches, particularly in-processing techniques that modify training objectives, intro-
duce significant computational overhead, making them challenging to implement for large-scale LLMs.
Future work should explore lightweight debiasing techniques that maintain fairness guarantees while
remaining computationally feasible.

4.6 LLM Alignment: A Statistical Perspective

Alignment with RLHF and Preference Optimization Alignment is a crucial step in the lan-
guage model training pipeline, as it aims to ensure that the model generations align with human
preferences. Existing alignment procedures are mostly implemented via reinforcement learning from
human feedback (RLHF), which attempts to learn a reward function based on the pairwise human
preference data and learn the corresponding optimal policy. Denote S the space of states, A the space
of actions, and r(s, a) the reward function of taking action a ∈ A at state s ∈ S, the RL process seeks
PH
a policy π(s) that maximizes the reward Es [ h=1 r(sh , π(sh )] from step 1 to H. In the context of lan-
guage model alignment, the state s ∈ S and action a ∈ A correspond to the input prompts and model
responses, and the reward function represents the human satisfaction of the model response given the
prompt. Under this framework, [ZJJ23] used the Bradley-Terry-Luce model to generate offline human
preference data and showed that a variant of MLE can achieve near-optimal sample complexity under
a linear reward function, a result that was later generalized to general reward functions [Zha+24a],
partially observed reward [Kau+24], and multi-party reward [Zho+24a] settings. Compared to the
standard RL, RLHF uses preference data that contains less information compared to the reward,

28
therefore it is often considered a harder task. However, [WLJ23] showed that for a wide range of
preference models, algorithms from traditional reward-based RL can solve the RLHF task with low
extra complexity.

Self-alignment On the theoretical front, [Ger+24; DD24] analyzed the phenomenon of model col-
lapse on a range of statistical models, including linear regression, generalized linear models, and non-
parametric regression, and found that the estimation error grows linearly in the setting where we
replace the original data with the synthetic data in each iteration, and converges to a certain constant
when one keeps augmenting the original data with synthetic data. Recent work [DFK24] shows that
while increasing model size can mitigate model degradation when training with synthetic data, mixing
synthetic data with real human data cannot fully eliminate the model collapse issue.

5 LLM-Empowered Statistical Analysis

5.1 Data Collection
Recent developments have yielded dramatic improvements on a variety of specific NLP tasks, including
summarization, reasoning and extraction tasks. Such abilities enable using powerful LLMs for zero-shot
or few-shot text mining, and extracting structured data from unstructured texts.

5.1.1 Extract Structured Data from Unstructured text

One prominent application of LLMs is sentiment analysis (SA), a long-standing text mining task which
aims to extract human sentiments from contextual data. These extracted sentiments can then serve as
inputs for downstream statistical models, such as predictive models for Twitter engagement, election
outcomes [Alv+23], and other social phenomena.
Beyond sentiment analysis, LLMs have shown remarkable effectiveness in extracting structured
data from complex, unstructured sources such as electronic health records and other medical datasets.
For example, recent work [Tan+23] demonstrated that fine-tuning on a large volume of synthetic
Biomedical Named Entity Recognition (NER) and Biomedical Relation Extraction (RE) data gener-
ated by ChatGPT significantly improved performance. Specifically, for biomedical NER tasks (e.g.,
NCBI Disease, BC5CDR Disease, BC5CDR Chemical), the F1-score improved from 23.37% to 63.99%.
Similarly, for RE tasks (e.g., GAD, EU-ADR), the F1-score increased from 75.86% to 83.59%. These
results illustrate the potential of LLMs to transform data extraction workflows and improve statistical
analysis pipelines in high-stakes domains like healthcare.

5.1.2 Synthetic Data Generation

The rise of the research fields of super-human alignment as well as self-rewarding are due to the
expensive costs of human-written instructions and annotations that are limited in quantity, diversity,
creativity, and more efforts have shifted to synthetic data generation using powerful language models.
Recent work [Wan+23e] has shown the ability of language models to self-improve by prompting models
to generate natural instructions at human level that are later used for boostrapping. With development
of synthetic text generation tasks, adaptation to generating realistic statistical data has become more
popular in recent years.

Tabular Data The generation of realistic synthetic tabular data has received considerable atten-
tion in recent years. Tabular data collection is often expensive and fraught with challenges, including
class imbalances, long-tailed label distributions [Cao+19], privacy concerns that restrict data sharing
[Gas+16], and data impurities such as noise or missing values [LT20]. Synthetic data generation pro-
vides a practical solution to these challenges, enabling the development of robust statistical models
while addressing privacy and class imbalance issues [Cho+17; Bor+22]. For instance, [Bor+23] in-
troduced a synthetic tabular data generation pipeline using an auto-regressive generative LLM. They
showed that discriminative models trained on synthetic tabular data outperformed competitors trained
on real data. Additionally, [Nak+24] proposed leveraging synthetic tabular data to address imbalanced
classification and spurious correlation challenges, demonstrating the potential of LLMs to improve sta-
tistical workflows even in complex and resource-constrained settings.

29
Financial Data Rational expectations remains the dominant model of beliefs in much of macroe-
conomics and finance and in recent years the use of surveys to tie beliefs to observable data has
emerged as a prominent approach. [Byb23] proposed an alternative method of generating beliefs using
LLMs and evaluate how well generated expectations of the stock market match the expectations of
the American Association of Individual Investors (AAII) survey and Duke CFO Survey.

5.2 Data Cleaning

LLM as coder for tabular data cleaning LLMs can automatically generate code for cleaning,
preprocessing, and transforming raw data, saving data scientists considerable time and effort. For
example [Tu+24] described how ChatGPT with Code Interpreter can assist statisticians with checking
for missing/null values, removing duplicate rows if necessary, checking for inconsistent/invalid values,
converting categorical columns to numerical representations.

LLM for automatic feature engineering Feature engineering refers to the process of building
suitable features from raw data. Given input data {(X, Y )}, feature engineering seeks a transformation
ϕ(x) to maximize the prediction accuracy using {(ϕ(X), Y ))}. Although modern machine learning
models are capable of learning a wide variety of complex relationships, feature engineering is still
crucial when the data is limited and the prediction model is simple, and it often relies on domain
knowledge. With the use of LLMs, it is possible to automatically construct features using the domain
knowledge embedded in it. For example, [HMH24] prompted LLM with contextual information about
the datasets, a sample of data, and several few-shot examples of useful feature engineering, and then
asked LLM to create a new feature. [Han+24a] prompted LLM to generate explicit decision rules to
solve a classification task and then convert those rules into binary features for fitting the prediction
model.

LLM as judge for scalable text data filtering It is standard practice to clean up text before
feeding it into any kind of machine learning algorithm. Whether one is doing pre-training or fine-
tuning, cleaning data before training helps ensure accuracy and improving text quality. To ensure
Llama 3 is trained on data of high quality, a series of data-filtering pipelines are applied to pre-training
dataset including using heuristic filters, NSFW filters, semantic deduplication approaches, and text
classifiers to predict data quality. Besides standard heuristics such as filtering emojis, html tags, xml
parsing, lower casing and other text standardization, more recently, LLMs have been employed to
assign scores or rankings to text data for filtering high-quality text data for training. One prominent
approach is to further fine-tune LLMs as text classifiers. For example, Llama 2 is employed to generate
the training data for the text-quality classifiers that are powering Llama 3. Other approaches such
as BartScore, directly apply BART, an encoder-decoder pre-trained language model, to texts for
evaluating its fluency, factuality, informativeness in an unsupervised fashion. Recent reward scoring
methods such as DPO reward also use LLM-predicted logits on texts to judge which one is better
among a pair of texts. More recently, with emergent abilities of LLMs such as Chain-of-Thoughts(CoT)
[Wei+22a] and In-Context Learning (ICL) [Bro+20], have been employed as a scalable and explainable
way of approximating human judgments which are otherwise expensive to obtain. [Zhe+23] reveal that
strong LLM judges such as GPT-4 can match the quality of human-level preference, achieving over 80%
agreement with human annotations on two challenging benchmarks consisted of open-ended questions,
without finetuning on domain-specific tasks. [Yua+24] can employ LLM-as-judge prompting to score
its own training data for filtering and such self-reward pipeline during iterative training can yield a
model that outperforms many existing state-of-the-art (SOTA) models including Claude 2, Gemini
Pro and GPT-4 0613 on prominent NLP leaderboards.

5.3 LLMs for Data Analysis

There is evidence that LLMs can perform fundamental mathematical reasoning on graduate-level
mathematical tasks. However little prior work has evaluated the reasoning capabilities of LLMs per-
forming statistical analysis, especially on real data. [Liu+24d] reveals that even the strongest models
can struggle in data analysis. For example GPT-4 can only achieve 58% accuracy on basic statistical
reasoning tasks such as “What is a 95% confidence interval?”

30
Tool Usage For Quantitative Analysis In practice people have shown that ChatGPT can be
fed directly with tabular data to perform direct table analysis such as aggregation, averaging etc, and
generate SQL statements capable of filtering, sorting, aggregation and summation logics, and execute
SQL queries on given database it reads from the prompt.

Statistical Analysis for Text Data Text data has long been a rich source for statistical analysis,
with methods like topic modeling providing interpretable summaries of large corpora. Traditional
models such as Latent Dirichlet Allocation (LDA) analyze word co-occurrence patterns to identify
latent themes within a text corpus, offering a probabilistic representation of topics [BNJ03; Ke+23;
KW24; WZT23]. These approaches have been widely used to uncover hidden structures in text data,
serving as a foundation for many downstream applications. With the advent of LLMs, more advanced
methods have emerged for extracting structured data from unstructured text in a scalable manner,
enabling further statistical analysis. A promising avenue for such investigations has been the use of
“word embeddings”—a family of techniques that conceive of meaning as emerging from the distribu-
tion of words that surround a term in text. By integrating text embeddings into statistical models,
the model can leverage contextual information and enhance the regression analysis with a richer repre-
sentation of the input text. In a nutshell, the method takes contextural embeddings which have been
pre-trained on large corpora such as BERT, Llama or GPT, then this requires only a simple linear
transformation of the averaged embeddings to conduct further statistical modeling such as regression
[RSS23]. Such use of word embeddings for prediction has been heavily employed in classification tasks
[SK21] to improve model prediction accuracy. Another useful feature of word embeddings is to capture
the semantic relationship between words and texts.

Statistical Inference with LLM Annotation In many statistical analysis, such as social sur-
veys, human annotation is slow and expensive to obtain. Meanwhile, LLMs can simulate human
annotations efficiently with very low cost, but these annotations can be potentially biased and fail to
provide valid statistical analysis. Motivated by this dilemma, [Ang+23; ADZ23] proposed Prediction-
Powered Inference (PPI), a general framework that allows researchers to utilize predictions from any
black-box machine learning model and perform valid statistical inference, such as computing p-values
and confidence intervals. Assume that the researcher collects human annotation on a small dataset
(Xi , Yi )ni=1 and observes a large unannotated dataset (Xi )n+N
i=n+1 from the same distribution. In addi-
tion, a pre-trained model f is available to provide predictions Ŷi = f (Xi ). The target is to estimate
the parameter θ∗ = arg minθ∈Θ E[ℓθ (X, Y )], such as sample mean or regression coefficient. The PPI
estimator [Ang+23; ADZ23; JLZ25] can be written as:
n n n+N
1X 1X 1 X
θ̂PPI = arg min ℓθ (Wi , Yi ) − ℓθ (Wi , f (Xi )) + ℓθ (Wi , f (Xi )). (12)
θ n i=1 n i=1 N i=n+1

Compared to using human annotation only or naively using machine learning predictions as gold
standard labels, the PPI estimator can effectively leverage the information from predictions while
preserving statistical validity (e.g., coverage). It has been successfully applied to assist various com-
putational social science studies with annotations generated by LLMs, such as detecting hate speech
[Ega+24] and studying the bias in media outlets [Gli+24]. In addition, [JMS24] underscores the po-
tential of scaling laws to guide the effective integration of LLMs-generated surrogate data, addressing
a critical bottleneck in scenarios where collecting high-quality data is expensive or impractical. The
discovery that surrogate data can significantly reduce test error, even when unrelated to the original
dataset, raises questions about the limits of this approach and its implications for generalization.

5.4 LLM-assisted Medical Research

Today, we have access to more health data than ever, with the potential to revolutionize modern
medicine through LLMs. There are many diseases that are difficult to diagnose and cause considerable
amounts of damage to the body before discovery [Ram+17]. Many people in the world do not have
access to quality medical information or healthcare [BO17]. LLMs can process and analyze large
volumes of data quickly, providing healthcare professionals and patients with valuable insights and
potential diagnoses. Patterns that may be unobservable or difficult to differentiate by humans may

31
get picked up by an LLM, and they can provide consistent information that is up to date with the
latest research and data [Nav+23]. This has the potential to improve diagnostic support and improve
accessibility of medical question-answering. Synthetic medical data generated by LLMs can increase
data accessibility and advance medical AI research. It is important to keep in mind, however, that
there are considerable risks and considerations when it comes to the use of LLMs in medicine—bias,
hallucinations, ethical violations, lack of accountability, and lack transparency, to name a few. Here,
we review some important and recent advances in large language models for medicine. For a more
comprehensive review, see Zhou et al., 2024 [Zho+24b].

Early Transformer-Based Clincial Language Models There are a variety of recent language
models that are pre-trained on vast amounts of medical data and fine-tuned for specific tasks within
medicine and healthcare. The most well-known early example of pre-training and fine-tuning a
Transformer-based model on clinical data is Clinical-BERT [Als+19]. Using the same encoder ar-
chitecture as BERT, the authors train the model on clinical notes from ICU admissions. As a result,
ClinicalBERT is able to uncover semantic relationships between medical concepts. ClinicalBERT is
widely used in a variety of clinical AI workflows to produce embeddings for medical concepts in EHRs
data [YW21; SLG20; BDP20; Kan+23].

LLMs for Clinical Text Generation With the significant advances in generative AI models,
medical and clinical large language models were developed for synthetic medical data generation, and
fine-tuned for a variety of natural language processing tasks. It is difficult to access and share large-scale
clinical text and clinical LLMs due to concerns in privacy and maintaining HIPAA compliance. Kweon
et al., 2023 developed Asclepius, the first generative shareable clinical large language model [Kwe+23].
There are versions based on the Llama2-7B and Llama2-13B architectures, and versions based on
the Llama3-8B and Mistral-7B architectures. The models generate synthetic clinical notes based on
anonymized case reports from PubMed Central. Peng et al., 2023 developed GatortronGPT, which
uses 277 billion words of text from 126 clinical departments and 2 million patients at the University
of Florida Health in conjunction with 195 billion words of diverse English text [Pen+23]. The authors
then generate 20 billion words of synthetic text to train synthetic NLP models that accomplish tasks
such as clinical concept extraction, event relations, semantic similarity, natural language inference, and
question answering.

LLMs for Medical Question-Answering There are also a variety of large language models
specifically constructed for medical question-answering, which can democratize medical knowledge,
increase access to underserved populations, and potentially reduce physician burden by improving effi-
ciency. [Sin+23] developed Med-PALM2—trained using the PaLM2 architecture and targeting medical
domain-specific fine-tuning [Sin+23]. The model incorporates instruction fine-tuning applied to mul-
tiple question-answer datasets. The authors train a “unified” model optimized for performance across
all datasets using dataset mixture ratios. The model also incorporates ensemble refinement, where the
model produces multiple generations via temperature sampling, and is then conditioned on the gen-
erations from the previous step to produce a refined explanation and answer. This step is performed
multiple times, and the final answer is then determined by a plurality vote. Han et al., 2023 developed
MedAlpaca based on the Llama architecture and trained on various question-answer pairs from medical
flash cards, StackExchange, and Wikidoc [Han+23]. Med-PALM2 is a closed source model, whereas
MedAlpaca is open source. Xie et al., 2024 developed the Me-LLaMa family of open-source foundation
large language models for medical application, which combines biomedical papers, clinical notes, and
general domain data [Xie+24a]. Me-LLaMa outperformed ChatGPT on many medical datasets.

LLMs for the Identification of Social Determinants of Health There have been recent efforts
in the use of LLMs to identify social determinants of health (SDoH). SDoH are the conditions in
which people are born, grow, live, work, and age that significantly influence health outcomes. Some
examples can include economic stability, education, social and community context, neighborhood, and
food accessibility. Identifying SDoH could provide a more holistic view of a patient’s health, and help
stratify patients for more targeted prevention and resources. Guevara et al., 2024 developed models
to extract SDoH by applying existing LLMs (Flan T-5, GPT 3.5, GPT 4) and fine-tuning with LLM-
generated synthetic SDoH data [Gue+24]. For fine-tuning, LLMs were prompted to generate new

32
sentences for SDoH categories using annotation guidelines as reference. Sample synthetic sentences
were taken as reference to generate more synthetic sentences. The models identified 93.8 percent of
patients with adverse SDoH, compared to 2 percent ICD-10, thus showing great promise for the use
of LLMs in identifying SDoH.

Multimodal LLMs in Medicine Recently, in addition to textual input, large language models have
been adapted to handle other types of data such as images, charts, screenshots, pdf, video, audio, etc.
For example, for the incorporation of medical images in multimodal medical LLMs, vision Transformers
(ViT) are often integrated to handle the image input. ViT produces vector representations of the image,
which are concatenated with text representations to form a single sequence. LMMs such as Flamingo-
CXR [Moo+23] and Med-PaLM [Li+24a] are comparable with radiologists in controlled settings for
generating radiology reports [Hua+23b]. Google’s Gemini models are a new generation of multimodal
models with novel capabilities. Med-Gemini in particular incorporates multimodal capabilities and
the ability to integrate the use of web search, resulting in state of the art performance on complex
diagnostic challenges, image challenges, video question answering and text summarization [Ani+23].

Challenges in LLM adoption in Healthcare While LLMs have been rapidly developed for re-
search in healthcare and medicine, several daunting challenges arise that limit their use in many
high-stakes applications. These challenges and issues include algorithmic biases, poor calibration, hal-
lucinations, misinformation, and response arbitrariness [San+23; Sim+24; Gao+24]. Due to the unique
nature of medical data and the critical need for safety, these models require specialized prompting,
fine-tuning, and alignment to ensure they are truly useful and reliable in real-world clinical settings.
For example, LLMs demonstrate suboptimal clinical reasoning under uncertainty [Omi+23]. LLMs
also have been known to exhibit significant bias [San+23; Cas+23; Aza+24; Cha+24], which can lead
to fairness issues, particularly harming underrepresented minority individuals when clinicians rely on
LLMs for decision support [PFB24; Ayo+24]. Researchers have cautioned against adopting LLMs in
healthcare until these challenges are mitigated [Szo24; FDW24; Tes+24].

6 Discussion
6.1 Statisticians’ Small Language Models
Statisticians, often operating with fewer computational resources compared to their computer science
counterparts, bring a distinctive strength to the development of language models: the ability to de-
sign resource-efficient, theoretically grounded models that leverage statistical principles. While the
computer science community has focused on scaling LLMs through sheer data size and computational
power, statisticians excel in developing “small language models” (SLMs) that achieve competitive
performance in particular domains by emphasizing efficiency and structure over brute force. Statis-
ticians’ advantage also lies in their expertise in embedding domain knowledge into model design.
Techniques such as sparsity assumptions, hierarchical modeling, and structured regularization allow
SLMs to achieve more sample efficiency. For instance, sparsity-inducing priors can focus the model’s
capacity on relevant features, reducing the need for excessive data while improving interpretability.
Similarly, hierarchical Bayesian approaches can incorporate multi-level dependencies in text data, of-
fering a principled way to share information across contexts and reducing the effective sample size
needed for training. These methods can outperform large models in scenarios with limited data, noisy
observations, or constrained resources, such as personalized medicine, legal text analysis, or financial
modeling.
SLMs also align with statisticians’ focus on interpretability. By prioritizing simpler architectures
and explicit assumptions, SLMs are inherently more transparent, allowing researchers and practition-
ers to trace the model’s predictions back to specific features or interactions. This interpretability is
especially critical in high-stakes domains where trust and accountability are paramount. In contrast,
LLMs, while powerful, often function as black boxes, limiting their applicability in settings where
explanation is necessary. Moreover, statisticians are well-positioned to cooperate with computer sci-
entists in tackling the computational challenges associated with modern AI. Techniques that bridge
statistics and computer science, like low-rank approximation, variational inference, and penalized op-
timization, can be employed to reduce the computational footprint of training and inference. Applied

33
to SLMs, these approaches make the models not only more accessible but also environmentally sus-
tainable, addressing concerns about the carbon footprint of large-scale AI systems. By focusing on
principled model design, statisticians can create models that require less computational power while
maintaining robustness and adaptability.
Statisticians’ SLMs are not intended to replace LLMs but to complement them. While LLMs excel
at broad, open-ended tasks, SLMs can specialize in domain-specific applications, delivering competitive
performance with far fewer resources. By leaning into their strengths in theory-driven modeling, sample
efficiency, and interpretability, statisticians can carve out a unique and impactful role in the rapidly
evolving AI landscape. This approach not only democratizes access to advanced language modeling
capabilities but also ensures that AI development is aligned with the values of efficiency transparency,
and sustainability.
Statisticians also have an important role to play in the design of “wrappers” that surround black-
box LLMs—performing roles such as calibration, uncertainty quantification, and debiasing on the
LLM outputs. The Prediction-Powered Inference (PPI) estimator referred to earlier is an instance
of such a wrapper. Causal inference methods can profit from black-box LLMs if properly deployed
around the LLMs. Experimental design methods from statistics remain relevant in the LLM era.
Thus, statisticians can both provide alternatives to LLMs and augmentations of LLMs in addition to
contributing to theoretical analysis of LLMs.

6.2 Towards Understanding LLMs

The remarkable success of LLMs has inspired growing efforts to understand their inner workings, of-
ten using tools from statistical learning theory. Initial research has made strides by examining LLMs
through the lenses of approximation, optimization, generalization, and prompting mechanisms. For
instance, studies have demonstrated the universal approximation properties of Transformers across var-
ious architectures [Yun+20a; Yun+20b; KS24], their computational expressiveness as Turing-complete
systems [PMB19; WCM22], and their superior optimization dynamics compared to MLPs in specific
tasks [Wan+24d]. On generalization, researchers have refined bounds on sequence length [Zha+22;
TT24] and extended them to non-i.i.d. settings such as time-series data [Lim+24]. Prompting tech-
niques, such as in-context learning and chain-of-thought prompting, have further advanced our un-
derstanding of LLM adaptability and reasoning. Theoretical work interprets ICL as implicit Bayesian
inference arising from pre-training distributions like hidden Markov models [Xie+22] and explores its
capacity to approximate diverse algorithms, including gradient descent, Newton’s method, and rein-
forcement learning [Aky+22; Von+23; Fu+23; Bai+24; LBM24; ZFB24; Guo+24c]. CoT prompting
enhances reasoning by decomposing tasks into sequential steps, improving both accuracy and inter-
pretability [Wei+22a; Yao+23]. For example, CoT has been shown to increase the approximation
power of Transformers in arithmetic and equation-solving tasks [Fen+24a] and promote sparse at-
tention patterns that simplify learning [Wen+24]. Furthermore, many pre-trained LLMs exhibit the
attention-sink phenomenon [Xia+24c], where certain “sink tokens” receive disproportionately high
attention weights. [Guo+24b] analyzed this phenomenon using simplified models and attributed it
primarily to the softmax operation in attention heads. Finally, the pre-training of LLMs has found
to follow an empirical scaling law, in which the test error improves polynomially with the model size
and sample size. [Lin+24a] theoretically derived the scaling law in the context of infinite-dimensional
linear regression, aligning with observed empirical scaling laws.
Despite these advances, our understanding of LLMs remains incomplete. Traditional statistical
frameworks, while valuable, often fall short of explaining emergent phenomena like zero-shot general-
ization, in-context learning, and CoT reasoning, which arise from intricate interactions between model
architecture, training data, and optimization dynamics. For instance, while ICL can be partially ex-
plained through algorithm approximation and Bayesian inference, its ability to adapt flexibly across
diverse tasks without parameter updates remains a challenging puzzle. Similarly, CoT’s capacity to
enhance reasoning through structured task decomposition highlights gaps in our theoretical under-
standing of how attention mechanisms drive complex problem-solving. These behaviors underscore
the need for new frameworks that integrate insights from statistical learning theory, optimization, and
cognitive science to fully capture the nuanced capabilities of LLMs. Addressing these gaps presents a
significant opportunity for statisticians to contribute by developing innovative theories and method-
ologies, ultimately bridging the divide between empirical successes and foundational understanding of
these transformative technologies.

34
6.3 Human-AI Collaborative Data Science
As LLMs continue to advance, an important future direction lies in fostering effective human-AI collab-
oration, where AI systems complement rather than replace human expertise. While LLMs demonstrate
remarkable capabilities in reasoning, text generation, and decision support, they lack deep understand-
ing, contextual awareness, and accountability—factors that are crucial in high-stakes applications such
as medicine, law, and scientific research. Thus, rather than viewing AI as a replacement for human
intelligence, a more promising approach is to develop frameworks that integrate human expertise with
AI-driven insights to achieve superior outcomes.
One promising avenue for human-AI collaboration is interactive decision-making, where LLMs pro-
vide recommendations, assist with exploratory data analysis, or generate potential solutions, while
human users retain final control and oversight. For example, in statistical modeling, LLMs can auto-
mate tedious tasks such as data preprocessing, feature engineering, and model selection, while human
analysts focus on domain-specific reasoning, hypothesis testing, and interpreting results. Similarly,
in medical applications, AI can assist doctors by summarizing patient records or suggesting poten-
tial diagnoses, but the final decision remains with human experts, ensuring accountability and ethical
considerations.
However, interactive decision-making introduces new statistical challenges, particularly due to the
non-i.i.d. nature of data. In traditional statistical and machine learning settings, models are typically
trained on independently and identically distributed (i.i.d.) data. However, in human-AI collabora-
tion, the data distribution adapts dynamically based on prior AI suggestions and human feedback.
This feedback loop can introduce selection bias, concept drift, and strategic adaptation, where human
decision-making patterns shift based on AI recommendations. For example, if a recommendation sys-
tem in hiring disproportionately suggests candidates from a particular demographic, decision-makers
may adjust their selection patterns accordingly, reinforcing feedback loops and exacerbating bias. Stan-
dard statistical tools that assume fixed distributions become inadequate in such settings, necessitating
the development of adaptive inference methods, causal modeling techniques, and robust statistical
frameworks that can account for evolving data distributions. See [Per+20; Gar+24; GC24] for more
discussions of these issues and further pointers. Another key research direction is designing adaptive
AI systems that can dynamically adjust to user expertise and preferences. Current LLMs operate
largely as static models, generating responses based solely on input prompts. However, effective col-
laboration requires AI systems that learn from user feedback, refine their responses over time, and
personalize their assistance based on the expertise level of the user. This could involve techniques
such as reinforcement learning from human feedback (RLHF), uncertainty-aware AI models that defer
decisions to humans when confidence is low, or interactive AI systems that engage users in dialogue
to refine understanding and reduce ambiguity.
Addressing these challenges requires a multidisciplinary effort, bringing together statisticians, com-
puter scientists, social scientists, legal scholars and policymakers to design fair, transparent, and ac-
countable AI systems. By integrating rigorous statistical frameworks with advances in LLMs, the
community can work to ensure that human welfare remains a central consideration in the development
of next-generation AI models.

References
[Aar23] Scott Aaronson. “Watermarking of large language models”. In: Workshop on Large Lan-
guage Models and Transformers, Simons Institute, UC Berkeley. 2023.
[AB23] Anastasios N Angelopoulos and Stephen Bates. “Conformal prediction: A gentle intro-
duction”. In: Foundations and Trends® in Machine Learning 16.4 (2023), pp. 494–591.
[Aba+16] Martin Abadi et al. “Deep learning with differential privacy”. In: Proceedings of the 2016
ACM SIGSAC conference on computer and communications security. 2016, pp. 308–318.
[Abd+21] Moloud Abdar et al. “A review of uncertainty quantification in deep learning: Techniques,
applications and challenges”. In: Information fusion 76 (2021), pp. 243–297.
[ADZ23] Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. “PPI++: Efficient prediction-
powered inference”. In: arXiv preprint arXiv:2311.01453 (2023).

35
[AFZ21] Abubakar Abid, Maheen Farooqi, and James Zou. “Persistent anti-muslim bias in large
language models”. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics,
and Society. 2021, pp. 298–306.
[Aga+24] Rishabh Agarwal et al. “Many-shot in-context learning”. In: Advances in Neural Infor-
mation Processing Systems. Vol. 37. Curran Associates, Inc., 2024, pp. 76930–76966.
[Ahm+24] Arash Ahmadian et al. “Back to basics: Revisiting REINFORCE-style optimization for
learning from human feedback in LLMs”. In: Proceedings of the 62nd Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers). 2024, pp. 12248–
12267.
[Aky+22] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. “What
learning algorithm is in-context learning? investigations with linear models”. In: arXiv
preprint arXiv:2211.15661 (2022).
[AL23a] Zeyuan Allen-Zhu and Yuanzhi Li. “Physics of language models: Part 1, context-free
grammar”. In: arXiv preprint arXiv:2305.13673 (2023).
[AL23b] Zeyuan Allen-Zhu and Yuanzhi Li. “Physics of language models: Part 3.1, knowledge
storage and extraction”. In: arXiv preprint arXiv:2309.14316 (2023).
[AL23c] Zeyuan Allen-Zhu and Yuanzhi Li. “Physics of language models: Part 3.2, knowledge
manipulation”. In: arXiv preprint arXiv:2309.14402 (2023).
[AL24] Zeyuan Allen-Zhu and Yuanzhi Li. “Physics of language models: Part 3.3, knowledge
capacity scaling laws”. In: arXiv preprint arXiv:2404.05405 (2024).
[All24] Zeyuan Allen-Zhu. ICML 2024 tutorial: Physics of language models. Project page: https:
//physics.allen-zhu.com/. July 2024.
[Als+19] Emily Alsentzer et al. “Publicly available clinical BERT embeddings”. In: Proceedings of
the 2nd Clinical Natural Language Processing Workshop (2019), pp. 72–78.
[Alv+23] Quratulain Alvi et al. “On the frontiers of Twitter data and sentiment analysis in election
prediction: a review”. In: PeerJ Computer Science 9 (2023), e1517.
[AM23] Amos Azaria and Tom Mitchell. “The internal state of an LLM knows when it’s lying”.
In: arXiv preprint arXiv:2304.13734 (2023).
[Ami07] Ali Amiri. “Dare to share: Protecting sensitive knowledge with data sanitization”. In:
Decision Support Systems 43.1 (2007), pp. 181–191.
[An+24] Jiafu An, Difang Huang, Chen Lin, and Mingzhu Tai. “Measuring gender and racial
biases in large language models”. In: arXiv preprint arXiv:2403.15281 (2024).
[Ang+23] Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Ti-
jana Zrnic. “Prediction-powered inference”. In: Science 382.6671 (2023), pp. 669–674.
[Ani+22] Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. “Large-
scale differentially private BERT”. In: Findings of the Association for Computational
Linguistics: EMNLP. 2022, pp. 6481–6491.
[Ani+23] Rohan Anil et al. “Gemini: A family of highly capable multimodal models”. In: arXiv
preprint arXiv:2312.11805 (Dec. 2023).
[Ary+24] Viraat Aryabumi et al. “To code, or not to code? Exploring impact of code in pre-
training”. In: arXiv preprint arXiv:2408.10914 (2024).
[Ask+21] Amanda Askell et al. “A general language assistant as a laboratory for alignment”. In:
arXiv preprint arXiv:2112.00861 (2021).
[Ayo+24] Noel F. Ayoub et al. “Inherent bias in large language models: A random sampling anal-
ysis”. In: Mayo Clinic Proceedings: Digital Health 2.2 (2024), pp. 186–191.
[Aza+24] Mohammad Gheshlaghi Azar et al. “A general theoretical paradigm to understand learn-
ing from human preferences”. In: International Conference on Artificial Intelligence and
Statistics. PMLR. 2024, pp. 4447–4455.
[Bai+09] Bing Bai et al. “Supervised semantic indexing”. In: Proceedings of the 18th ACM confer-
ence on Information and knowledge management. 2009, pp. 187–196.

36
[Bai+22] Yuntao Bai et al. “Constitutional AI: Harmlessness from AI feedback”. In: arXiv preprint
arXiv:2212.08073 (2022).
[Bai+24] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. “Transformers as statis-
ticians: Provable in-context learning with in-context algorithm selection”. In: Advances
in neural information processing systems 36 (2024).
[Bak+24] Yavuz Faruk Bakman et al. “MARS: Meaning-aware response scoring for uncertainty
estimation in generative LLMs”. In: Proceedings of the 62nd Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Long Papers), ACL. 2024, pp. 7752–
7767.
[Bañ+20] Marta Bañón et al. “ParaCrawl: Web-scale acquisition of parallel corpora”. In: Proceed-
ings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020,
pp. 4555–4567.
[BB19] Shikha Bordia and Samuel R. Bowman. “Identifying and reducing gender bias in word-
level language models”. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies, NAACL-HLT 2019, Student Research Workshop. 2019, pp. 7–15.
[BCB15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation
by jointly learning to align and translate”. In: 3rd International Conference on Learning
Representations, ICLR. 2015.
[BDP20] Brent Biseda, Gaurav Desai, and A. Philip. “Prediction of ICD codes with clinical BERT
embeddings and text augmentation with label balancing using MIMIC-III”. In: arXiv
preprint arXiv:2008.08980 (2020).
[Ben+03] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. “A neural prob-
abilistic language model”. In: J. Mach. Learn. Res. 3.null (2003), pp. 1137–1155. issn:
1532-4435.
[Bes+24] Maciej Besta et al. “Graph of thoughts: Solving elaborate problems with large language
models”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 16.
2024, pp. 17682–17690.
[Bi+24] Xiao Bi et al. “DeepSeek LLM: Scaling open-source language models with longtermism”.
In: arXiv preprint arXiv:2401.02954 (2024).
[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization”. In:
arXiv preprint arXiv:1607.06450 (2016).
[BL24] Amrita Bhattacharjee and Huan Liu. “Fighting fire with fire: Can ChatGPT detect AI-
generated text?” In: SIGKDD Explor. Newsl. 25.2 (Mar. 2024), pp. 14–21. issn: 1931-
0145.
[BMW20] Andreas Balzer, David Mowatt, and Muiris Woulfe. Obfuscating information related to
personally identifiable information (PII). US Patent 10,839,104. Nov. 2020.
[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In:
Journal of machine Learning research 3.Jan (2003), pp. 993–1022.
[BO17] World Bank and World Health Organization. “World Bank and WHO: Half the world
lacks access to essential health services, 100 million still pushed into extreme poverty
because of health expenses”. In: (2017). Accessed: 2024-09-18. url: https : / / www .
worldbank . org / en / news / press - release / 2017 / 12 / 13 / world - bank - who - half -
world- lacks- access- essential- health- services- 100- million- pushed- into-
extreme-poverty-because-of-health-expenses.
[Bom+22] Rishi Bommasani et al. “On the opportunities and risks of foundation models”. In: arXiv
preprint arXiv:2108.07258 (2022).
[Bor+22] Vadim Borisov et al. “Deep neural networks and tabular data: A survey”. In: IEEE
transactions on neural networks and learning systems (2022).
[Bor+23] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci.
“Language models are realistic tabular data generators”. In: The Eleventh International
Conference on Learning Representations, ICLR. 2023.

37
[Bri+23] Trenton Bricken et al. “Towards monosemanticity: Decomposing language models with
dictionary learning”. In: Transformer Circuits Thread 2 (2023).
[Bro+20] Tom Brown et al. “Language models are few-shot learners”. In: Advances in Neural
Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[Bro+22] Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian
Tramèr. “What does it mean for a language model to preserve privacy?” In: Proceed-
ings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022,
pp. 2280–2292.
[Bro+24] Bradley Brown et al. “Large language monkeys: scaling inference compute with repeated
sampling”. In: arXiv preprint arXiv:2407.21787 (2024).
[BT52] Ralph Allan Bradley and Milton E. Terry. “Rank analysis of incomplete block designs:
I. The method of paired comparisons”. In: Biometrika 39.3/4 (1952), pp. 324–345. issn:
00063444, 14643510.
[Bu+20] Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. “Deep learning with Gaussian differ-
ential privacy”. In: Harvard Data Science Review 2020.23 (2020).
[Byb23] J Leland Bybee. “The ghost in the machine: Generating beliefs with large language
models”. In: arXiv preprint arXiv:2305.02823 (2023).
[Cai+24] Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, and Xiaocheng Li. “Towards
better statistical understanding of watermarking LLMs”. In: arXiv preprint arXiv:2403.13027
(2024).
[Cao+19] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. “Learning im-
balanced datasets with label-distribution-aware margin loss”. In: Advances in neural in-
formation processing systems 32 (2019).
[Car+21] Nicholas Carlini et al. “Extracting training data from large language models”. In: 30th
USENIX Security Symposium (USENIX Security 21). 2021, pp. 2633–2650.
[Car+23] Nicholas Carlini et al. “Quantifying memorization across neural language models”. In:
The Eleventh International Conference on Learning Representations, ICLR. 2023.
[Car98] Rich Caruana. “Multitask learning”. In: Learning to learn. Boston, MA: Springer US,
1998, pp. 95–133. isbn: 978-1-4615-5529-2.
[Cas+23] Stephen Casper et al. “Open problems and fundamental limitations of reinforcement
learning from human feedback”. In: arXiv preprint arXiv:2307.15217 (2023).
[CCN23] Bilal Chughtai, Lawrence Chan, and Neel Nanda. “A toy model of universality: Reverse
engineering how networks learn group operations”. In: International Conference on Ma-
chine Learning. PMLR. 2023, pp. 6243–6267.
[CG96] Stanley F. Chen and Joshua Goodman. “An empirical study of smoothing techniques
for language modeling”. In: 34th Annual Meeting of the Association for Computational
Linguistics. 1996, pp. 310–318.
[CGC24] John J. Cherian, Isaac Gibbs, and Emmanuel J. Candès. “Large language model valid-
ity via enhanced conformal prediction methods”. In: Advances in Neural Information
Processing Systems 38, NeurIPS 2024. 2024.
[CGZ24] Miranda Christ, Sam Gunn, and Or Zamir. “Undetectable watermarks for language mod-
els”. In: The Thirty Seventh Annual Conference on Learning Theory. Vol. 247. Proceed-
ings of Machine Learning Research. PMLR, 2024, pp. 1125–1139.
[Cha+24] Souradip Chakraborty et al. “MaxMin-RLHF: Towards equitable alignment of large lan-
guage models with diverse human preferences”. In: arXiv preprint arXiv:2402.08925
(2024).
[Che+20] Mark Chen et al. “Generative pretraining from pixels”. In: Proceedings of the 37th In-
ternational Conference on Machine Learning. Vol. 119. Proceedings of Machine Learning
Research. PMLR, July 2020, pp. 1691–1703.

38
[Che+21] Pengyu Cheng, Weituo Hao, Siyang Yuan, Shijing Si, and Lawrence Carin. “FairFil:
contrastive neural debiasing method for pretrained text encoders”. In: 9th International
Conference on Learning Representations, ICLR. 2021.
[Che+23a] Mayee F. Chen et al. “Skill-it! A data-driven skills framework for understanding and
training language models”. In: Advances in Neural Information Processing Systems 36,
NeurIPS 2023. 2023.
[Che+23b] Xinyun Chen et al. “Universal self-consistency for large language model generation”. In:
arXiv preprint arXiv:2311.17311 (2023).
[Che+24a] Chao Chen et al. “INSIDE: LLMs’ internal states retain the power of hallucination De-
tection”. In: The Twelfth International Conference on Learning Representations, ICLR.
2024.
[Che+24b] Xuxi Chen et al. “Take the bull by the horns: Hard sample-reweighted continual training
improves LLM generalization”. In: arXiv preprint arXiv:2402.14270 (2024).
[Chi+24] Wei-Lin Chiang et al. “Chatbot Arena: An open platform for evaluating LLMs by human
preference”. In: Forty-first International Conference on Machine Learning, ICML. 2024.
[Cho+17] Edward Choi et al. “Generating multi-label discrete patient records using generative
adversarial networks”. In: Machine learning for healthcare conference. PMLR. 2017,
pp. 286–305.
[CHS24a] Aloni Cohen, Alexander Hoover, and Gabe Schoenbach. “Enhancing watermarked lan-
guage models to identify users”. In: arXiv preprint arXiv:2405.11109 (2024).
[CHS24b] Aloni Cohen, Alexander Hoover, and Gabe Schoenbach. “Watermarking language models
for many adaptive users”. In: 2025 IEEE Symposium on Security and Privacy (SP). IEEE
Computer Society. 2024, pp. 84–84.
[Chu+14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. “Empirical
evaluation of gated recurrent neural networks on sequence modeling”. In: arXiv preprint
arXiv:1412.3555 (2014).
[Cla+18] Peter Clark et al. “Think you have solved question answering? Try ARC, the AI2 rea-
soning challenge”. In: arXiv preprint arXiv:1803.05457 (2018).
[CLZ25] Yinpeng Cai, Lexin Li, and Linjun Zhang. “A statistical hypothesis testing framework for
data misappropriation detection in large language models”. In: arXiv preprint arXiv:2501.02441
(2025).
[CM24] Jiuhai Chen and Jonas Mueller. “Quantifying uncertainty in answers from any language
model and enhancing their trustworthiness”. In: Proceedings of the 62nd Annual Meet-
ing of the Association for Computational Linguistics (Volume 1: Long Papers). 2024,
pp. 5186–5200.
[Col+11] Ronan Collobert et al. “Natural language processing (almost) from scratch”. In: J. Mach.
Learn. Res. 12.null (2011), pp. 2493–2537. issn: 1532-4435.
[Con+23] Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià
Garriga-Alonso. “Towards automated circuit discovery for mechanistic interpretability”.
In: Advances in Neural Information Processing Systems 36 (2023), pp. 16318–16352.
[CP10] Emmanuel J Candes and Yaniv Plan. “Matrix completion with noise”. In: Proceedings of
the IEEE 98.6 (2010), pp. 925–936.
[CP11] Emmanuel J Candes and Yaniv Plan. “Tight oracle inequalities for low-rank matrix
recovery from a minimal number of noisy random measurements”. In: IEEE Transactions
on Information Theory 57.4 (2011), pp. 2342–2359.
[CPC21] Pierre Colombo, Pablo Piantanida, and Chloé Clavel. “A novel estimator of mutual in-
formation for learning to disentangle textual representations”. In: Annual Meeting of the
Association for Computational Linguistics. Aug. 2021, pp. 6539–6550.
[CS24] Hiroaki Chiba-Okabe and Weijie J Su. “Tackling GenAI copyright issues: Originality
estimation and genericization”. In: Scientific Reports (2024).

39
[CW08] Ronan Collobert and Jason Weston. “A unified architecture for natural language pro-
cessing: deep neural networks with multitask learning”. In: Proceedings of the 25th In-
ternational Conference on Machine Learning. ICML ’08. Association for Computing Ma-
chinery, 2008, pp. 160–167. isbn: 9781605582054.
[CY15] Yinzhi Cao and Junfeng Yang. “Towards making systems forget with machine unlearn-
ing”. In: 2015 IEEE symposium on security and privacy. IEEE. 2015, pp. 463–480.
[CZ13] T Tony Cai and Anru Zhang. “Sparse representation of a polytope and recovery of sparse
signals and low-rank matrices”. In: IEEE transactions on information theory 60.1 (2013),
pp. 122–132.
[CZY24] Shijie Chen, Yu Zhang, and Qiang Yang. “Multi-task learning in natural language pro-
cessing: An overview”. In: ACM Comput. Surv. (May 2024). Just Accepted. issn: 0360-
0300.
[Dai+22] Damai Dai et al. “Knowledge neurons in pretrained transformers”. In: Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). 2022, pp. 8493–8502.
[Dat+24] Sumanth Dathathri et al. “Scalable watermarking for identifying large language model
outputs”. In: Nature 634.8035 (2024), pp. 818–823.
[DD24] Apratim Dey and David Donoho. “Universality of the π 2 /6 pathway in avoiding model
collapse”. In: arXiv preprint arXiv:2410.22812 (2024).
[Dee+25] DeepSeek-AI et al. “DeepSeek-R1: Incentivizing reasoning capability in LLMs via rein-
forcement learning”. In: arXiv preprint arXiv:2501.12948 (2025).
[Den+24] Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. “Rephrase and respond:
Let large language models ask better questions for themselves”. In: arXiv preprint arXiv:2311.04205
(2024).
[Der+17] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. “De-identification
of patient notes with recurrent neural networks”. In: Journal of the American Medical
Informatics Association 24.3 (2017), pp. 596–606.
[Dev+19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-
training of deep bidirectional transformers for language understanding”. In: Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
2019, pp. 4171–4186.
[DFK24] Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. “Strong model collapse”. In: arXiv
preprint arXiv:2410.04840 (2024).
[DH17] Tobias Domhan and Felix Hieber. “Using target-side monolingual data for neural ma-
chine translation through multi-task learning”. In: Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing. Sept. 2017, pp. 1500–1505.
[Doh+24] Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. “A tale
of tails: Model collapse as a change of scaling laws”. In: arXiv preprint arXiv:2402.07043
(2024).
[Dua+24a] Haonan Duan, Adam Dziedzic, Nicolas Papernot, and Franziska Boenisch. “Flocks of
stochastic parrots: Differentially private prompt learning for large language models”. In:
Advances in Neural Information Processing Systems 36 (2024).
[Dua+24b] Jinhao Duan et al. “Shifting attention to relevance: Towards the predictive uncertainty
quantification of free-form large language models”. In: Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024,
pp. 5050–5063.
[Dub+24a] Abhimanyu Dubey et al. “The llama 3 herd of models”. In: arXiv preprint arXiv:2407.21783
(2024).
[Dub+24b] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. “Length-
controlled AlpacaEval: A simple way to debias automatic evaluators”. In: arXiv preprint
arXiv:2404.04475 (2024).

40
[Dwo06] Cynthia Dwork. “Differential privacy”. In: International colloquium on automata, lan-
guages, and programming. Springer. 2006, pp. 1–12.
[Ega+24] Naoki Egami, Musashi Hinck, Brandon Stewart, and Hanying Wei. “Using imperfect
surrogates for downstream inference: Design-based supervised learning for social science
applications of large language models”. In: Advances in Neural Information Processing
Systems 36 (2024).
[Elh+21] Nelson Elhage et al. “A mathematical framework for transformer circuits”. In: Trans-
former Circuits Thread 1.1 (2021), p. 12.
[Elh+22a] Nelson Elhage et al. “Softmax linear units”. In: Transformer Circuits Thread (2022).
[Elh+22b] Nelson Elhage et al. “Toy models of superposition”. In: arXiv preprint arXiv:2209.10652
(2022).
[Elk+23] Niva Elkin-Koren, Uri Hacohen, Roi Livni, and Shay Moran. “Can copyright be reduced
to privacy?” In: arXiv preprint arXiv:2305.14822 (2023).
[Elm90] Jeffrey L. Elman. “Finding structure in time”. In: Cognitive Science 14.2 (1990), pp. 179–
211. issn: 0364-0213.
[Erh+10] Dumitru Erhan et al. “Why does unsupervised pre-training help deep learning?” In: J.
Mach. Learn. Res. 11 (Mar. 2010), pp. 625–660. issn: 1532-4435.
[Eth+24] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela.
“Model alignment as prospect theoretic optimization”. In: Proceedings of the 41st Inter-
national Conference on Machine Learning. ICML’24. JMLR.org, 2024.
[Fai+23] Jaiden Fairoze et al. “Publicly detectable watermarking for language models”. In: arXiv
preprint arXiv:2310.18491 (2023).
[Fan+21] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. “Exploring deep neural networks
via layer-peeled model: Minority collapse in imbalanced training”. In: Proceedings of the
National Academy of Sciences 118.43 (2021), e2103091118.
[Fan+24] Chongyu Fan et al. “Simplicity prevails: Rethinking negative preference optimization for
LLM unlearning”. In: Neurips Safe Generative AI Workshop. 2024.
[FDW24] Ari B. Friedman, M. Kit Delgado, and Gary E. Weissman. “Artificial intelligence for
emergency care triage—much promise, but still much to learn”. In: JAMA Network Open
7.5 (May 2024), e248857.
[Fen+24a] Guhao Feng et al. “Towards revealing the mystery behind chain of thought: a theoretical
perspective”. In: Advances in Neural Information Processing Systems 36 (2024).
[Fen+24b] Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, and Julia Kempe. “Be-
yond model collapse: Scaling up with synthesized data requires reinforcement”. In: arXiv
preprint arXiv:2406.07515 (2024).
[Fer+23] Pierre Fernandez, Antoine Chaffin, Karim Tit, Vivien Chappelier, and Teddy Furon.
“Three bricks to consolidate watermarks for large language models”. In: 2023 IEEE
International Workshop on Information Forensics and Security (WIFS). IEEE. 2023,
pp. 1–6.
[Fom+20] Marina Fomicheva et al. “Unsupervised quality estimation for neural machine trans-
lation”. In: Transactions of the Association for Computational Linguistics 8 (2020),
pp. 539–555.
[Fu+23] Deqing Fu, Tian-Qi Chen, Robin Jia, and Vatsal Sharan. “Transformers learn higher-
order optimization methods for in-context learning: A study with linear models”. In:
arXiv preprint arXiv:2310.17086 (2023).
[Gag94] Philip Gage. “A new algorithm for data compression”. In: C Users J. 12.2 (Feb. 1994),
pp. 23–38. issn: 0898-9788.
[Gal+23] Isabel O Gallegos et al. “Bias and fairness in large language models: A survey”. In: arXiv
preprint arXiv:2309.00770 (2023).
[Gao+20] Leo Gao et al. “The Pile: An 800GB dataset of diverse text for language modeling”. In:
arXiv preprint arXiv:2101.00027 (2020).

41
[Gao+24] Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. “SPUQ: Perturbation-
based uncertainty quantification for large language models”. In: Proceedings of the 18th
Conference of the European Chapter of the Association for Computational Linguistics,
EACL. 2024, pp. 2336–2346.
[Gar+24] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. “Oracle efficient
online multicalibration and omniprediction”. In: Proceedings of the 2024 Annual ACM-
SIAM Symposium on Discrete Algorithms (SODA). SIAM. 2024, pp. 2725–2792.
[Gas+16] Adrià Gascón et al. “Privacy-preserving distributed linear regression on high-dimensional
data”. In: Cryptology ePrint Archive (2016).
[Gaw+23] Jakob Gawlikowski et al. “A survey of uncertainty in deep neural networks”. In: Artificial
Intelligence Review 56.Suppl 1 (2023), pp. 1513–1589.
[GC24] Isaac Gibbs and Emmanuel J Candès. “Conformal inference for online prediction with
arbitrary distribution shifts”. In: Journal of Machine Learning Research 25.162 (2024),
pp. 1–36.
[Geh+20] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith.
“RealToxicityPrompts: Evaluating neural toxic gegeneration in language models”. In:
Findings of the Association for Computational Linguistics: EMNLP. Vol. EMNLP 2020.
Findings of ACL. 2020, pp. 3356–3369.
[Geo+24] Kristian Georgiev et al. “Attribute-to-delete: Machine unlearning via datamodel match-
ing”. In: arXiv preprint arXiv:2410.23232 (2024).
[Ger+24] Matthias Gerstgrasser et al. “Is model collapse inevitable? Breaking the curse of recursion
by accumulating real and synthetic data”. In: arXiv preprint arXiv:2404.01413 (2024).
[Ges24] Lee Gesmer. “Copyright and the challenge of large language models”. In: (Oct. 2024).
America. url: https : / / www . masslawblog . com / copyright / copyright - and - the -
challenge-of-large-language-models-part-2-2/.
[GF24] Eva Giboulot and Teddy Furon. “WaterMax: breaking the LLM watermark detectability-
robustness-quality trade-off”. In: Neural Information Processing Systems. 2024.
[GFC21] Tianyu Gao, Adam Fisch, and Danqi Chen. “Making pre-trained language models bet-
ter few-shot learners”. In: Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing, ACL/IJCNLP. 2021, pp. 3816–3830.
[Gin+19] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. “Making AI forget
you: Data deletion in machine learning”. In: Advances in neural information processing
systems 32 (2019).
[Gio23] Patrizio Giovannotti. “Evaluating machine translation quality with conformal predic-
tive distributions”. In: Conformal and Probabilistic Prediction with Applications. PMLR.
2023, pp. 413–429.
[GL04] Jianfeng Gao and Chin-Yew Lin. “Introduction to the special issue on statistical language
modeling”. In: ACM Transactions on Asian Language Information Processing (TALIP)
3.2 (2004), pp. 87–93.
[Gli+24] Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J Candès, and Dan Jurafsky.
“Can unconfident LLM annotations be used for confident conclusions?” In: arXiv preprint
arXiv:2408.15204 (2024).
[GMA22] Aparna Garimella, Rada Mihalcea, and Akhash Amarnath. “Demographic-aware lan-
guage model fine-tuning as a bias mitigation technique”. In: Proceedings of the 2nd Con-
ference of the Asia-Pacific Chapter of the Association for Computational Linguistics
and the 12th International Joint Conference on Natural Language Processing (Volume 2:
Short Papers). 2022, pp. 311–319.
[GMT24] Jorge Garcı́a-Carrasco, Alejandro Maté, and Juan Carlos Trujillo. “How does GPT-2
predict acronyms? Extracting and understanding a circuit via mechanistic interpretabil-
ity”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2024,
pp. 3322–3330.

42
[Gol+24] Olga Golovneva, Tianlu Wang, Jason Weston, and Sainbayar Sukhbaatar. “Contextual
position encoding: Learning to count what’s important”. In: arXiv preprint arXiv:2405.18719
(2024).
[GPT23] GPTZero. GPTZero: More than an AI detector preserve what’s human. 2023. url:
https://gptzero.me/.
[Gra+24] Aaron Grattafiori et al. “The Llama 3 herd of models”. In: arXiv preprint arXiv:2407.21783
(2024).
[GRK21] Demi Guo, Alexander Rush, and Yoon Kim. “Parameter-efficient transfer learning with
diff pruning”. In: Proceedings of the 59th Annual Meeting of the Association for Com-
putational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers). Aug. 2021, pp. 4884–4896.
[GSH23] Leo Gao, John Schulman, and Jacob Hilton. “Scaling laws for reward model overopti-
mization”. In: Proceedings of the 40th International Conference on Machine Learning.
ICML’23. JMLR.org, 2023.
[GSR19] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. “GLTR: Statistical
detection and visualization of generated text”. In: Proceedings of the 57th Conference of
the Association for Computational Linguistics, ACL. 2019, pp. 111–116.
[Gue+24] Marco Guevara et al. “Large language models to identify social determinants of health
in electronic health records”. In: npj Digital Medicine 7 (Jan. 2024), p. 6.
[Guo+20] Chuan Guo, Tom Goldstein, Awni Y. Hannun, and Laurens van der Maaten. “Certified
data removal from machine learning models”. In: Proceedings of the 37th International
Conference on Machine Learning, ICML. Vol. 119. Proceedings of Machine Learning
Research. PMLR, 2020, pp. 3832–3842.
[Guo+24a] Shangmin Guo et al. “Direct language model alignment from online AI feedback”. In:
arXiv preprint arXiv:2402.04792 (2024).
[Guo+24b] Tianyu Guo et al. “Active-dormant attention heads: Mechanistically demystifying extreme-
token phenomena in LLMs”. In: NeurIPS 2024 Workshop on Mathematics of Modern
Machine Learning. 2024.
[Guo+24c] Tianyu Guo et al. “How do transformers learn in-context beyond simple functions? A
case study on learning with representations”. In: The Twelfth International Conference
on Learning Representations, ICLR. 2024.
[Gur+23] Wes Gurnee et al. “Finding neurons in a haystack: Case studies with sparse probing”.
In: Trans. Mach. Learn. Res. 2023 (2023).
[Hal+23] Skyler Hallinan, Alisa Liu, Yejin Choi, and Maarten Sap. “Detoxifying text with MaRCo:
Controllable revision with experts and anti-experts”. In: Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). July
2023, pp. 228–242.
[Han+22] Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. “PTR: Prompt tuning
with rules for text classification”. In: AI Open 3 (2022), pp. 182–192.
[Han+23] Tianyu Han et al. “MedAlpaca – An open-source collection of medical conversational AI
models and training data”. In: arXiv preprint arXiv:2304.08247 (Apr. 2023).
[Han+24a] Sungwon Han, Jinsung Yoon, Sercan Ö. Arik, and Tomas Pfister. “Large language models
can automatically engineer features for few-shot tabular learning”. In: Forty-first Inter-
national Conference on Machine Learning, ICML. 2024.
[Han+24b] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. “Parameter-efficient
fine-tuning for large models: A comprehensive survey”. In: arXiv preprint arXiv:2403.14608
(2024).
[Har54] Zellig Harris. “Distributional structure”. In: Word 10.2-3 (1954), pp. 146–162.
[HBC21] Xudong Han, Timothy Baldwin, and Trevor Cohn. “Diverse adversaries for mitigating
bias in training”. In: Proceedings of the 16th Conference of the European Chapter of the
Association for Computational Linguistics: Main Volume. Apr. 2021, pp. 2760–2765.

43
[HBC22] Xudong Han, Timothy Baldwin, and Trevor Cohn. “Balancing out bias: Achieving fair-
ness through balanced training”. In: Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing. Dec. 2022, pp. 11335–11350.
[He+16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for
image recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR). 2016, pp. 770–778.
[He+22a] Jacqueline He, Mengzhou Xia, Christiane Fellbaum, and Danqi Chen. “MABEL: Attenu-
ating gender bias using textual entailment data”. In: Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing. Dec. 2022, pp. 9681–9702.
[He+22b] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neu-
big. “Towards a unified view of parameter-efficient transfer learning”. In: International
Conference on Learning Representations. 2022.
[He+24a] Haiyun He, Yepeng Liu, Ziqiao Wang, Yongyi Mao, and Yuheng Bu. “Universally op-
timal watermarking schemes for LLMs: from theory to practice”. In: arXiv preprint
arXiv:2410.02890 (2024).
[He+24b] Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, and Guang Cheng. “Watermarking
generative tabular data”. In: arXiv preprint arXiv:2405.14018 (2024).
[Hen+21a] Dan Hendrycks et al. “Measuring massive multitask language understanding”. In: 9th
International Conference on Learning Representations, ICLR. 2021.
[Hen+21b] Dan Hendrycks et al. “Measuring mathematical problem solving with the MATH dataset”.
In: Proceedings of the Neural Information Processing Systems Track on Datasets and
Benchmarks 1, NeurIPS Datasets and Benchmarks. 2021.
[HH24] Zhengmian Hu and Heng Huang. “Inevitable trade-off between watermark strength and
speculative sampling efficiency for language models”. In: Advances in Neural Information
Processing Systems. Vol. 37. Curran Associates, Inc., 2024, pp. 55370–55402.
[HJ23] Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-
only transformer. 2023. url: https : / / www . %20alignmentforum . %20org / posts /
u6KXXmKFbXfWzoAXn/acircuit-for-python-docstrings-in-a-4-layer-attention-
only.
[HMH24] Noah Hollmann, Samuel Müller, and Frank Hutter. “Large language models for auto-
mated data science: Introducing caafe for context-aware automated feature engineering”.
In: Advances in Neural Information Processing Systems 36 (2024).
[Hof+22] Jordan Hoffmann et al. “Training compute-optimal large language models”. In: Proceed-
ings of the 36th International Conference on Neural Information Processing Systems.
NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022. isbn: 9781713871088.
[Hof01] Thomas Hofmann. “Unsupervised learning by probabilistic latent semantic analysis”. In:
Mach. Learn. 42.1–2 (2001), pp. 177–196. issn: 0885-6125.
[Hoo+21] Shlomo Hoory et al. “Learning and evaluating a differentially private pre-trained language
model”. In: Findings of the Association for Computational Linguistics: EMNLP. 2021,
pp. 1178–1189.
[Hou+19] Neil Houlsby et al. “Parameter-efficient transfer learning for NLP”. In: Proceedings of
the 36th International Conference on Machine Learning. Vol. 97. Proceedings of Machine
Learning Research. PMLR, June 2019, pp. 2790–2799.
[HS23] Hangfeng He and Weijie J Su. “A law of data separation in deep learning”. In: Proceedings
of the National Academy of Sciences 120.36 (2023), e2221704120.
[HS24] Hangfeng He and Weijie J Su. “A law of next-token prediction in large language models”.
In: arXiv preprint arXiv:2408.13442 (2024).
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural Com-
put. 9.8 (1997), pp. 1735–1780. issn: 0899-7667.
[Hu+24] Zhengmian Hu et al. “Unbiased watermark for large language models”. In: International
Conference on Learning Representations. 2024.

44
[Hua+20] Po-Sen Huang et al. “Reducing sentiment bias in language models via counterfactual
evaluation”. In: Findings of the Association for Computational Linguistics: EMNLP.
Vol. EMNLP 2020. Findings of ACL. 2020, pp. 65–83.
[Hua+23a] Baihe Huang et al. “Towards optimal statistical watermarking”. In: arXiv preprint arXiv:2312.07930
(2023).
[Hua+23b] Jonathan Huang et al. “Generative artificial intelligence for chest radiograph interpre-
tation in the emergency department”. In: JAMA network open 6.10 (2023), e2336100–
e2336100.
[Hua+23c] Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, and Danqi Chen. “Privacy
implications of retrieval-based language models”. In: Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing. 2023, pp. 14887–14902.
[Hua+24a] Wei Huang et al. “BiLLM: Pushing the limit of post-training quantization for LLMs”.
In: Forty-first International Conference on Machine Learning, ICML. 2024.
[Hua+24b] Zhen Huang et al. “O1 replication journey – Part 2: surpassing o1-preview through simple
distillation, big progress or bitter lesson?” In: arXiv preprint arXiv:2411.16489 (2024).
[Hub+24] Robert Huben, Hoagy Cunningham, Logan Riggs, Aidan Ewart, and Lee Sharkey. “Sparse
autoencoders find highly interpretable features in language models”. In: The Twelfth In-
ternational Conference on Learning Representations, ICLR. 2024.
[Huo+24] Mingjia Huo et al. “Token-specific watermarking with enhanced detectability and se-
mantic coherence for large language models”. In: International Conference on Machine
Learning. 2024.
[Ipp+20] Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. “Automatic
detection of generated text is easiest when humans are fooled”. In: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, ACL. 2020, pp. 1808–
1822.
[Jan+23] Joel Jang et al. “Knowledge unlearning for mitigating privacy risks in language mod-
els”. In: Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL. 2023, pp. 14389–14408.
[Jaq+17] Natasha Jaques et al. “Sequence tutor: conservative fine-tuning of sequence generation
models with KL-control”. In: Proceedings of the 34th International Conference on Ma-
chine Learning - Volume 70. ICML’17. JMLR.org, 2017, pp. 1645–1654.
[Jaq+20] Natasha Jaques et al. “Human-centric dialog training via offline reinforcement learn-
ing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing, EMNLP. 2020, pp. 3985–4003.
[Jel98] Frederick Jelinek. Statistical methods for speech recognition. Cambridge, MA, USA: MIT
Press, 1998. isbn: 0262100665.
[Ji+22] Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. “An unconstrained
layer-peeled perspective on neural collapse”. In: The Tenth International Conference on
Learning Representations, ICLR. 2022.
[Ji+23] Ziwei Ji et al. “Survey of hallucination in natural language generation”. In: ACM Com-
puting Surveys 55.12 (2023), pp. 1–38.
[Jia+20] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. “How can we know what
language models know”. In: Trans. Assoc. Comput. Linguistics 8 (2020), pp. 423–438.
[Jia+23] Albert Q. Jiang et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023).
[Jia+24a] Jinghan Jia et al. “SOUL: Unlocking the power of second-order optimization for LLM
unlearning”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural
Language Processing, EMNLP. 2024, pp. 4276–4292.
[Jia+24b] Albert Q. Jiang et al. “Mixtral of experts”. In: arXiv preprint arXiv:2401.04088 (2024).
[Jin+21] Xisen Jin et al. “On transferability of bias mitigation effects in language model fine-
tuning”. In: Proceedings of the 2021 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language Technologies, NAACL-HLT.
2021, pp. 3770–3783.

45
[JJS24] Meena Jagadeesan, Michael I Jordan, and Jacob Steinhardt. “Safety vs. performance:
How multi-objective learning reduces barriers to market entry”. In: arXiv preprint arXiv:2409.03734
(2024).
[JLZ25] Wenlong Ji, Lihua Lei, and Tijana Zrnic. “Predictions as surrogates: Revisiting surrogate
outcomes in the age of AI”. In: arXiv preprint arXiv:2501.09731 (2025).
[JMS24] Ayush Jain, Andrea Montanari, and Eren Sasoglu. “Scaling laws for learning with real
and surrogate data”. In: Advances in Neural Information Processing Systems 38, NeurIPS
2024. 2024.
[Joa98] Thorsten Joachims. “Text categorization with support vector machines: Learning with
many relevant features”. In: European conference on machine learning. Springer. 1998,
pp. 137–142.
[Jor86] Michael I. Jordan. “Attractor dynamics and parallelism in a connectionist sequential ma-
chine”. In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
1986.
[JSV24] Nikola Jovanović, Robin Staab, and Martin Vechev. “Watermark stealing in large lan-
guage models”. In: International Conference on Machine Learning. 2024.
[Kad+22] Saurav Kadavath et al. “Language models (mostly) know what they know”. In: arXiv
preprint arXiv:2207.05221 (2022).
[Kan+23] Beichen Kang et al. “Automatic ICD coding based on segmented ClinicalBERT with
hierarchical tree structure learning”. In: Database Systems for Advanced Applications:
28th International Conference, DASFAA. Springer, 2023, pp. 250–265.
[Kap+20] Jared Kaplan et al. “Scaling laws for neural language models.” In: CoRR abs/2001.08361
(2020).
[Kau+24] Chinmaya Kausik, Mirco Mutti, Aldo Pacchiano, and Ambuj Tewari. “A theoretical
framework for partially observed reward-states in RLHF”. In: ICML 2024 Workshop on
Models of Human Feedback for AI Alignment. 2024.
[KB21] Masahiro Kaneko and Danushka Bollegala. “Debiasing pre-trained contextualised em-
beddings”. In: arXiv preprint arXiv:2101.09523 (2021).
[KDS23] Hadas Kotek, Rikker Dockum, and David Sun. “Gender bias and stereotypes in large
language models”. In: Proceedings of The ACM Collective Intelligence Conference. 2023,
pp. 12–24.
[Ke+23] Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin, and Wanshan Li. “Recent advances in text
analysis”. In: Annual Review of Statistics and Its Application 11 (2023).
[Ken+21] Zachary Kenton et al. “Alignment of language agents”. In: arXiv preprint arXiv:2103.14659
(2021).
[KGF23] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. “Semantic uncertainty: Linguistic
invariances for uncertainty estimation in natural language generation”. In: The Eleventh
International Conference on Learning Representations, ICLR. 2023.
[Kim+23] Minbeom Kim et al. “Critic-guided decoding for controlled text generation”. In: Findings
of the Association for Computational Linguistics: ACL. 2023, pp. 4598–4612.
[Kim14] Yoon Kim. “Convolutional neural networks for sentence classification”. In: Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
2014, pp. 1746–1751.
[Kir+23a] John Kirchenbauer et al. “A watermark for large language models”. In: International
Conference on Machine Learning. PMLR. 2023, pp. 17061–17084.
[Kir+23b] John Kirchenbauer et al. “On the reliability of watermarks for large language models”.
In: The Twelfth International Conference on Learning Representations. 2023.
[KKO24] Ryuto Koike, Masahiro Kaneko, and Naoaki Okazaki. “Outfox: LLM-generated essay
detection through in-context learning with adversarially generated examples”. In: Pro-
ceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 19. 2024, pp. 21258–
21266.

46
[Koc+23] Denis Kocetkov et al. “The Stack: 3 TB of permissively licensed source code”. In: Trans.
Mach. Learn. Res. 2023 (2023).
[Köp+24] Andreas Köpf et al. “OpenAssistant conversations - democratizing large language model
alignment”. In: Proceedings of the 37th International Conference on Neural Information
Processing Systems. NIPS ’23. Curran Associates Inc., 2024.
[Kri+24] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer.
“Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense”.
In: Advances in Neural Information Processing Systems. Vol. 36. 2024.
[KS24] Tokio Kajitsuka and Issei Sato. “Are transformers with one layer self-attention using low-
rank weight matrices universal approximators?” In: The Twelfth International Conference
on Learning Representations, ICLR. 2024.
[Kud+24] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. “Robust
distortion-free watermarks for language models”. In: Transactions on Machine Learning
Research (2024). issn: 2835-8856.
[Kum+23] Bhawesh Kumar et al. “Conformal prediction with large language models for multi-choice
question answering”. In: arXiv preprint arXiv:2305.18404 (2023).
[KW24] Zheng Tracy Ke and Minzhe Wang. “Using SVD for topic modeling”. In: Journal of the
American Statistical Association 119.545 (2024), pp. 434–449.
[Kwe+23] Sunjun Kweon, Junu Kim, Jiyoun Kim, Sujeong Im, et al. “Publicly shareable clinical
large language model built on synthetic clinical notes”. In: arXiv preprint (Sept. 2023).
[KX15] Vladimir Koltchinskii and Dong Xia. “Optimal estimation of low rank density matrices.”
In: J. Mach. Learn. Res. 16.53 (2015), pp. 1757–1792.
[Lam+24] Nathan Lambert et al. “RewardBench: Evaluating reward models for language modeling”.
In: arXiv preprint arXiv:2403.13787 (2024).
[Lan+20] Zhenzhong Lan et al. “ALBERT: A lite BERT for self-supervised learning of language
representations.” In: ICLR. 2020.
[LBM24] Licong Lin, Yu Bai, and Song Mei. “Transformers as decision makers: provable in-context
reinforcement learning via supervised pretraining”. In: The Twelfth International Con-
ference on Learning Representations, ICLR. 2024.
[LD02] Tom Landauer and Scott Dooley. “Latent semantic analysis: theory, method and applica-
tion”. In: Proceedings of the Conference on Computer Support for Collaborative Learning:
Foundations for a CSCL Community. CSCL ’02. International Society of the Learning
Sciences, 2002, pp. 742–743.
[Lew+20] Mike Lewis et al. “BART: Denoising sequence-to-sequence pre-training for natural lan-
guage generation, translation, and comprehension”. In: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. 2020, pp. 7871–7880.
[Li+21] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. “Large language
models can be strong differentially private learners”. In: International Conference on
Learning Representations. 2021.
[Li+23] Xuechen Li et al. AlpacaEval: An automatic evaluator of instruction-following models.
2023. url: https://github.com/tatsu-lab/alpaca_eval.
[Li+24a] Chunyuan Li et al. “Llava-med: Training a large language-and-vision assistant for biomedicine
in one day”. In: Advances in Neural Information Processing Systems 36 (2024).
[Li+24b] Junlong Li et al. “Generative judge for evaluating alignment”. In: The Twelfth Interna-
tional Conference on Learning Representations, ICLR. 2024.
[Li+24c] Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. “Robust detection of wa-
termarks in large language models under human edits”. In: arXiv preprint arXiv:2411.13868
(2024).
[Li+25] Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. “A statistical framework
of watermarks for large language models: Pivot, detection efficiency and optimal rules”.
In: The Annals of Statistics 53.1 (2025), pp. 322–351.

47
[Lia+23] Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. “GPT detectors
are biased against non-native English writers”. In: ICLR 2023 Workshop on Trustworthy
and Reliable Large-Scale Machine Learning Models. 2023.
[Lim+24] Yannick Limmer, Anastasis Kratsios, Xuwei Yang, Raeid Saqur, and Blanka Horvath.
“Reality only happens once: Single-path generalization bounds for transformers”. In:
arXiv preprint arXiv:2405.16563 (2024).
[Lin+24a] Licong Lin, Jingfeng Wu, Sham Kakade, Peter Bartlett, and Jason D Lee. “Scaling laws
in linear regression: compute, parameters, and data”. In: Advances in Neural Information
Processing Systems. Vol. 37. Curran Associates, Inc., 2024, pp. 60556–60606.
[Lin+24b] Zhenghao Lin et al. “Rho-1: Not all tokens are what you need”. In: arXiv preprint
arXiv:2404.07965 (2024).
[Lis+21] Pierre Lison, Ildikó Pilán, David Sánchez, Montserrat Batet, and Lilja Øvrelid. “Anonymi-
sation models for text data: State of the art, challenges and future directions”. In: Pro-
ceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers). 2021, pp. 4188–4203.
[Liu+19] Yinhan Liu et al. “RoBERTa: A robustly optimized BERT pretraining approach”. In:
arXiv preprint arXiv:1907.11692 (2019).
[Liu+20a] Haochen Liu et al. “Does gender matter? Towards fairness in dialogue systems”. In: Pro-
ceedings of the 28th International Conference on Computational Linguistics, COLING.
International Committee on Computational Linguistics, 2020, pp. 4403–4416.
[Liu+20b] Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. “FinBERT: A pre-
trained financial language representation model for financial text mining”. In: Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20.
Special Track on AI in FinTech. International Joint Conferences on Artificial Intelligence
Organization, July 2020, pp. 4513–4519.
[Liu+21] Alisa Liu et al. “DExperts: Decoding-time controlled text generation with experts and
anti-experts”. In: Proceedings of the 59th Annual Meeting of the Association for Com-
putational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers). Aug. 2021, pp. 6691–6706.
[Liu+22] Jiachang Liu et al. “What makes good in-context examples for GPT-3?” In: Proceedings
of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration
for Deep Learning Architectures, DeeLIO@ACL 2022. 2022, pp. 100–114.
[Liu+23] Pengfei Liu et al. “Pre-train, prompt, and predict: A systematic survey of prompting
methods in natural language processing”. In: ACM Comput. Surv. 55.9 (2023), 195:1–
195:35.
[Liu+24a] Aixin Liu et al. “DeepSeek-v2: A strong, economical, and efficient mixture-of-experts
language model”. In: arXiv preprint arXiv:2405.04434 (2024).
[Liu+24b] Aixin Liu et al. “DeepSeek-v3 technical report”. In: arXiv preprint arXiv:2412.19437
(2024).
[Liu+24c] Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen. “Uncertainty estimation and quan-
tification for LLMs: A simple supervised approach”. In: arXiv preprint arXiv:2404.15993
(2024).
[Liu+24d] Xiao Liu et al. “Are LLMs capable of data-based statistical and causal reasoning? Bench-
marking advanced quantitative reasoning with data”. In: Findings of the Association for
Computational Linguistics: ACL. Aug. 2024, pp. 9215–9235.
[Liu+24e] Xiao Liu et al. “GPT understands, too”. In: AI Open 5 (2024), pp. 208–215.
[Liu+24f] Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. “Datasets for large
language models: A comprehensive survey”. In: arXiv preprint arXiv:2402.18041 (2024).
[Liu+24g] Zhiyuan Liu, Hong Liu, Denny Zhou, and Tengyu Ma. “Chain of thought empowers trans-
formers to solve inherently serial problems”. In: The Twelfth International Conference
on Learning Representations, ICLR. 2024.

48
[Liu+25] Sijia Liu et al. “Rethinking machine unlearning for large language models”. In: Nature
Machine Intelligence (Feb. 2025).
[LJ24] Maximilian Li and Lucas Janson. “Optimal ablation for interpretability”. In: Advances
in Neural Information Processing Systems 38, NeurIPS 2024. 2024.
[LKW23] Xin Liu, Muhammad Khalifa, and Lu Wang. “BOLT: Fast energy-based controlled text
generation with tunable biases”. In: Proceedings of the 61st Annual Meeting of the As-
sociation for Computational Linguistics (Volume 2: Short Papers). July 2023, pp. 186–
200.
[LL21] Xiang Lisa Li and Percy Liang. “Prefix-tuning: optimizing continuous prompts for gener-
ation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers). Aug. 2021, pp. 4582–4597.
[LLS22] Zi Lin, Jeremiah Zhe Liu, and Jingbo Shang. “Towards collaborative neural-symbolic
graph semantic parsing via uncertainty”. In: Findings of the Association for Computa-
tional Linguistics: ACL 2022. 2022, pp. 4160–4173.
[LLZ24] Xingchi Li, Guanxun Li, and Xianyang Zhang. “Segmenting watermarked texts from
language models”. In: arXiv preprint arXiv:2410.20670 (2024).
[LMM23] Baohao Liao, Yan Meng, and Christof Monz. “Parameter-efficient fine-tuning without
introducing new latency”. In: Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). July 2023, pp. 4242–4260.
[LT20] Wei-Chao Lin and Chih-Fong Tsai. “Missing value imputation: a review and analysis of
the literature (2006–2017)”. In: Artificial Intelligence Review 53 (2020), pp. 1487–1509.
[LTS24] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. “Generating with confidence: Uncertainty
quantification for black-box large language models”. In: arXiv preprint arXiv:2305.19187
(2024).
[Lu+20] Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. “Gen-
der bias in neural natural language processing”. In: Logic, language, and security: essays
dedicated to Andre Scedrov on the occasion of his 65th birthday (2020), pp. 189–202.
[Lu+22] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. “Fan-
tastically ordered prompts and where to find them: Overcoming few-shot prompt order
sensitivity”. In: Proceedings of the 60th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), ACL 2022. 2022, pp. 8086–8098.
[Lu+23] Dakuan Lu et al. “BBT-Fin: Comprehensive construction of Chinese financial domain
pre-trained language model, corpus and benchmark”. In: arXiv preprint arXiv:2302.09432
(2023).
[Men+22] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. “Locating and editing
factual associations in GPT”. In: Advances in Neural Information Processing Systems 35
(2022), pp. 17359–17372.
[MG21] Andrey Malinin and Mark J. F. Gales. “Uncertainty estimation in autoregressive struc-
tured prediction”. In: 9th International Conference on Learning Representations, ICLR.
2021.
[MH24] Christopher Mohri and Tatsunori Hashimoto. “Language models with conformal fac-
tuality guarantees”. In: Proceedings of the 41st International Conference on Machine
Learning. ICML’24. JMLR.org, 2024.
[Mik+10] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur.
“Recurrent neural network based language model.” In: INTERSPEECH. ISCA, 2010,
pp. 1045–1048.
[Mik+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
representations of words and phrases and their compositionality”. In: vol. 26. 2013.
[Min+22a] Sewon Min et al. “Rethinking the role of demonstrations: What makes in-context learning
work?” In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. 2022, pp. 11048–11064.

49
[Min+22b] Sören Mindermann et al. “Prioritized training on points that are learnable, worth learn-
ing, and not yet learnt”. In: International Conference on Machine Learning, ICML.
Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022, pp. 15630–15649.
[Mis+22] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. “Cross-task
generalization via natural language crowdsourcing instructions”. In: Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). May 2022, pp. 3470–3487.
[Mit+23] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea
Finn. “Detectgpt: Zero-shot machine-generated text detection using probability curva-
ture”. In: International Conference on Machine Learning. PMLR. 2023, pp. 24950–24962.
[MML23] Silvia Milano, Joshua A McGrane, and Sabina Leonelli. “Large language models challenge
the future of higher education”. In: Nature Machine Intelligence 5.4 (2023), pp. 333–334.
[Moo+23] Michael Moor et al. “Med-flamingo: a multimodal medical few-shot learner”. In: Machine
Learning for Health (ML4H). PMLR. 2023, pp. 353–367.
[Mue+23] Niklas Muennighoff et al. “Scaling data-constrained language models”. In: Advances in
Neural Information Processing Systems 36, NeurIPS 2023. 2023.
[Mur+19] W James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. “Defi-
nitions, methods, and applications in interpretable machine learning”. In: Proceedings of
the National Academy of Sciences 116.44 (2019), pp. 22071–22080.
[MVM09] Frederic P. Miller, Agnes F. Vandome, and John McBrewster. Elo rating system. Alpha
Press, 2009. isbn: 6130078846.
[MXC24] Yu Meng, Mengzhou Xia, and Danqi Chen. “SimPO: Simple preference optimization with
a reference-free reward”. In: Advances in Neural Information Processing Systems. Vol. 37.
Curran Associates, Inc., 2024, pp. 124198–124235.
[Nak+22] Reiichiro Nakano et al. “WebGPT: Browser-assisted question-answering with human feed-
back”. In: arXiv preprint arXiv:2112.09332 (2022).
[Nak+24] Ryumei Nakada, Yichen Xu, Lexin Li, and Linjun Zhang. “Synthetic oversampling: The-
ory and a practical approach using LLMs to address data imbalance”. In: arXiv preprint
arXiv:2406.03628 (2024).
[Nas+23] Milad Nasr et al. “Scalable extraction of training data from (production) language mod-
els”. In: arXiv preprint arXiv:2311.17035 (2023).
[Nav+23] Humza Naveed et al. “A comprehensive overview of large language models”. In: arXiv
preprint arXiv:2307.06435 (2023).
[NCR23] Roberto Navigli, Simone Conia, and Björn Ross. “Biases in large language models: ori-
gins, inventory, and discussion”. In: ACM Journal of Data and Information Quality 15.2
(2023), pp. 1–21.
[Nie+24] Fan Nie et al. “FactTest: Factuality Testing in Large Language Models with Statistical
Guarantees”. In: arXiv preprint arXiv:2411.02603 (2024).
[NRS21] Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. “Descent-to-delete: Gradient-based
methods for machine unlearning”. In: Algorithmic Learning Theory. PMLR. 2021, pp. 931–
962.
[Oh+22] Changdae Oh et al. “Learning fair representation via distributional contrastive disentan-
glement”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining. 2022, pp. 1295–1305.
[Ola+20] Chris Olah et al. “Zoom in: An introduction to circuits”. In: Distill 5.3 (2020), e00024–
001.
[Ols+22] Catherine Olsson et al. “In-context learning and induction heads”. In: arXiv preprint
arXiv:2209.11895 (2022).
[Omi+23] Jesutofunmi A Omiye, Jenna C Lester, Simon Spichak, Veronica Rotemberg, and Roxana
Daneshjou. “Large language models propagate race-based medicine”. In: NPJ Digital
Medicine 6.1 (2023), p. 195.

50
[Ope+24a] OpenAI et al. “GPT-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2024).
[Ope+24b] OpenAI et al. “OpenAI o1 system card”. In: arXiv preprint arXiv:2412.16720 (2024).
[Ouy+22] Long Ouyang et al. “Training language models to follow instructions with human feed-
back”. In: Advances in neural information processing systems 35 (2022), pp. 27730–27744.
[OZ03] Stanley RM Oliveira and Osmar R Zaiane. “Protecting sensitive knowledge by data sani-
tization”. In: Third IEEE International conference on data mining. IEEE. 2003, pp. 613–
616.
[Pal+24] Arka Pal et al. “Smaug: Fixing failure modes of preference optimisation with DPO-
positive”. In: arXiv preprint arXiv:2402.13228 (2024).
[Pan+24] Richard Yuanzhe Pang et al. “Iterative reasoning preference optimization”. In: Advances
in Neural Information Processing Systems. Vol. 37. Curran Associates, Inc., 2024, pp. 116617–
116637.
[Par+23] SunYoung Park, Kyuri Choi, Haeun Yu, and Youngjoong Ko. “Never too late to learn:
Regularizing gender bias in coreference resolution”. In: Proceedings of the Sixteenth ACM
International Conference on Web Search and Data Mining. 2023, pp. 15–23.
[Par+24a] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. “Disentangling length
from quality in direct preference optimization”. In: Findings of the Association for Com-
putational Linguistics: ACL. 2024, pp. 4998–5017.
[Par+24b] Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W. Lee. “Any-
precision LLM: Low-cost deployment of multiple, different-sized LLMs”. In: Forty-first
International Conference on Machine Learning, ICML. 2024.
[PCJ25] Mose Park, Yunjin Choi, and Jong-June Jeon. “Does a large language model really speak
in human-like language?” In: arXiv preprint arXiv:2501.01273 (2025).
[Pen+23] Cheng Peng et al. “A study of generative large language model for medical research and
healthcare”. In: npj Digital Medicine 6 (Nov. 2023), p. 210.
[Pen+24] Guilherme Penedo et al. “The RefinedWeb dataset for falcon LLM: outperforming curated
corpora with web data only”. In: Proceedings of the 37th International Conference on
Neural Information Processing Systems. NIPS ’23. Curran Associates Inc., 2024.
[Per+20] Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. “Perfor-
mative prediction”. In: Proceedings of the 37th International Conference on Machine
Learning. 2020, pp. 7599–7609.
[Pet+18] Matthew E. Peters et al. “Deep contextualized word representations”. In: Proceedings of
the 2018 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long Papers). June 2018,
pp. 2227–2237.
[PFB24] Raphael Poulain, Hamed Fayyaz, and Rahmatollah Beheshti. “Bias patterns in the appli-
cation of LLMs for clinical decision support: A comprehensive study”. In: arXiv preprint
arXiv:2404.15149 (2024).
[PHB24] Vaidehi Patil, Peter Hase, and Mohit Bansal. “Can sensitive information be deleted from
LLMs? Objectives for defending against extraction attacks”. In: The Twelfth Interna-
tional Conference on Learning Representations. 2024.
[PHD20] Vardan Papyan, XY Han, and David L Donoho. “Prevalence of neural collapse during
the terminal phase of deep learning training”. In: Proceedings of the National Academy
of Sciences 117.40 (2020), pp. 24652–24663.
[PMB19] Jorge Pérez, Javier Marinkovic, and Pablo Barceló. “On the Turing completeness of mod-
ern neural network architectures”. In: 7th International Conference on Learning Repre-
sentations, ICLR. 2019.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. “GloVe: Global vectors
for word representation”. In: Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Oct. 2014, pp. 1532–1543.

51
[Qin+24] Yiwei Qin et al. “O1 replication journey: A strategic progress report – Part 1”. In: arXiv
preprint arXiv:2410.18982 (2024).
[Rad+18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving lan-
guage understanding by generative pre-training. 2018. url: https://cdn.openai.com/
research-covers/language-unsupervised/language_understanding_paper.pdf.
[Rad+19] Alec Radford et al. “Language models are unsupervised multitask learners”. In: OpenAI
(2019).
[Rae+21] Jack W. Rae et al. “Scaling language models: Methods, analysis & insights from training
Gopher”. In: CoRR abs/2112.11446 (2021).
[Raf+20] Colin Raffel et al. “Exploring the limits of transfer learning with a unified text-to-text
transformer”. In: Journal of Machine Learning Research 21.140 (2020), pp. 1–67.
[Raf+23] Rafael Rafailov et al. “Direct preference optimization: Your language model is secretly a
reward model”. In: Advances in Neural Information Processing Systems. Vol. 36. Curran
Associates, Inc., 2023, pp. 53728–53741.
[Rai+24] Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. “A practical review
of mechanistic interpretability for transformer-based language models”. In: arXiv preprint
arXiv:2407.02646 (2024).
[Ram+17] Rachel B Ramoni et al. “The undiagnosed diseases network: Accelerating discovery about
health and disease”. In: American Journal of Human Genetics 100.2 (2017), pp. 185–192.
[Ran+24] Leonardo Ranaldi, Giulia Pucci, Federico Ranaldi, Elena Sofia Ruzzetti, and Fabio Mas-
simo Zanzotto. “A tree-of-thoughts to broaden multi-step reasoning across languages”. In:
Findings of the Association for Computational Linguistics: NAACL. June 2024, pp. 1229–
1241.
[Ren+23] Allen Z. Ren et al. “Robots that ask for help: Uncertainty alignment for large language
model planners”. In: Conference on Robot Learning, CoRL. Vol. 229. Proceedings of
Machine Learning Research. PMLR, 2023, pp. 661–682.
[Ren+24] Jie Ren et al. “Copyright protection in generative AI: A technical perspective”. In: arXiv
preprint arXiv:2402.02333 (2024).
[RGG23] Shauli Ravfogel, Yoav Goldberg, and Jacob Goldberger. “Conformal Nucleus Sampling”.
In: Findings of the Association for Computational Linguistics: ACL. July 2023, pp. 27–
34.
[RJS17] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. “Learning to generate reviews and
discovering sentiment”. In: arXiv preprint arXiv:1704.01444 (2017).
[Rol+21] Stephen Roller et al. “Recipes for building an open-domain chatbot”. In: Proceedings
of the 16th Conference of the European Chapter of the Association for Computational
Linguistics: Main Volume. Apr. 2021, pp. 300–325.
[Ros00] R. Rosenfeld. “Two decades of statistical language modeling: where do we go from here?”
In: Proceedings of the IEEE 88.8 (2000), pp. 1270–1278.
[RSD23] Vipula Rawte, Amit Sheth, and Amitava Das. “A survey of hallucination in large foun-
dation models”. In: arXiv preprint arXiv:2309.05922 (2023).
[RSS23] Pedro L Rodriguez, Arthur Spirling, and Brandon M Stewart. “Embedding regression:
Models for context-specific description and inference”. In: American Political Science
Review 117.4 (2023), pp. 1255–1274.
[Sac+24] Noveen Sachdeva et al. “How to train data-efficient LLMs”. In: arXiv preprint arXiv:2402.09668
(2024).
[Sad+23] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and
Soheil Feizi. “Can AI-generated text be reliably detected?” In: arXiv preprint arXiv:2303.11156
(2023).
[Sag23] Matthew Sag. Copyright safety for generative AI. 2023. url: https://houstonlawreview.
org/article/92126.

52
[Sah+24] Swarnadeep Saha et al. “Branch-Solve-Merge improves large language model evaluation
and generation”. In: Proceedings of the 2024 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (Volume
1: Long Papers). 2024, pp. 8352–8370.
[Sam23] Pamela Samuelson. “Generative AI meets copyright”. In: Science 381.6654 (2023), pp. 158–
161.
[San+22] Victor Sanh et al. “Multitask prompted training enables zero-shot task generalization”.
In: The Tenth International Conference on Learning Representations, ICLR. 2022.
[San+23] Shibani Santurkar et al. “Whose opinions do language models reflect?” In: International
Conference on Machine Learning. PMLR. 2023, pp. 29971–30004.
[Sax+19] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. “Analysing mathe-
matical reasoning abilities of neural models”. In: 7th International Conference on Learn-
ing Representations, ICLR. 2019.
[SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. A
Bradford Book, 2018. isbn: 0262039249.
[SBM22] Lee Sharkey, Dan Braun, and Beren Millidge. “Taking features out of superposition with
sparse autoencoders”. In: AI Alignment Forum. Vol. 6. 2022, pp. 12–13.
[Sch+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Prox-
imal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).
[Sch+21] Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. “Consistent accelerated
inference via confident adaptive transformers”. In: Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing. Nov. 2021, pp. 4962–4979.
[Sch+22a] Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A Rothkopf, and Kris-
tian Kersting. “Large pre-trained language models contain human-like biases of what is
right and wrong to do”. In: Nature Machine Intelligence 4.3 (2022), pp. 258–268.
[Sch+22b] Tal Schuster et al. “Confident adaptive language modeling”. In: Advances in Neural
Information Processing Systems 35 (2022), pp. 17456–17472.
[SCN18] Karan Singla, Dogan Can, and Shrikanth Narayanan. “A multi-task approach to learning
multilingual representations”. In: Proceedings of the 56th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 2: Short Papers). July 2018, pp. 214–220.
[SDG06] Holger Schwenk, Daniel Dechelotte, and Jean-Luc Gauvain. “Continuous space language
models for statistical machine translation”. In: Proceedings of the COLING/ACL 2006
Main Conference Poster Sessions. 2006, pp. 723–730.
[Sek+21] Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. “Re-
member what you want to forget: Algorithms for machine unlearning”. In: Advances in
Neural Information Processing Systems 34 (2021), pp. 18075–18086.
[Sha+24] Vasu Sharma et al. “Text quality-based pruning for efficient training of language models”.
In: arXiv preprint arXiv:2405.01582 (2024).
[Shi+24] Claudia Shi et al. “Hypothesis testing the circuit hypothesis in LLMs”. In: Advances in
Neural Information Processing Systems 38, NeurIPS 2024. 2024.
[Shu+22] Kurt Shuster et al. “Blenderbot 3: a deployed conversational agent that continually learns
to responsibly engage”. In: arXiv preprint arXiv:2208.03188 (2022).
[Shu+23] Ilia Shumailov et al. “The curse of recursion: Training on generated data makes models
forget”. In: arXiv preprint arXiv:2305.17493 (2023).
[Sim+24] Adi Simhi, Jonathan Herzig, Idan Szpektor, and Yonatan Belinkov. “Constructing bench-
marks and interventions for combating hallucinations in LLMs”. In: arXiv preprint arXiv:2404.09971
(2024).
[Sin+23] Karan Singhal et al. “Towards expert-level medical question answering with large lan-
guage models”. In: arXiv preprint arXiv:2305.09617 (2023).
[Sin+24] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. “A long way to go:
Investigating length correlations in RLHF”. In: arXiv preprint arXiv:2310.03716 (2024).

53
[SK21] S Selva Birunda and R Kanniga Devi. “A review on word embedding techniques for
text classification”. In: Innovative Data Communication Technologies and Application:
Proceedings of ICIDCA 2020 (2021), pp. 267–281.
[SLG20] Budhaditya Saha, Sanal Lisboa, and Shameek Ghosh. “Understanding patient com-
plaint characteristics using contextual clinical BERT embeddings”. In: arXiv preprint
arXiv:2002.05902 (2020).
[Slo+23] Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. “The
curious case of hallucinatory (un) answerability: Finding truths in the hidden states
of over-confident large language models”. In: Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing. 2023, pp. 3607–3625.
[Smi+22] Shaden Smith et al. “Using DeepSpeed and Megatron to train Megatron-Turing NLG
530B, A large-scale generative language model”. In: arXiv preprint arXiv:2201.11990
(2022).
[Sne+24] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. “Scaling LLM test-time com-
pute optimally can be more effective than scaling model parameters”. In: arXiv preprint
arXiv:2408.03314 (2024).
[SNR21] Yi-Lin Sung, Varun Nair, and Colin A Raffel. “Training neural networks with fixed
sparse masks”. In: Advances in Neural Information Processing Systems. Vol. 34. Curran
Associates, Inc., 2021, pp. 24193–24205.
[Sol+24] Luca Soldaini et al. “Dolma: An open corpus of three trillion tokens for language model
pretraining research”. In: Proceedings of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 2024, pp. 15725–15788.
[Son+23] Ziang Song, Tianle Cai, Jason D Lee, and Weijie J Su. “Reward collapse in aligning large
language models”. In: arXiv preprint arXiv:2305.17608 (2023).
[Spa72] Karen Sparck Jones. “A statistical interpretation of term specificity and its application
in retrieval”. In: Journal of documentation 28.1 (1972), pp. 11–21.
[SSB22] Danielle Saunders, Rosie Sallis, and Bill Byrne. “First the worst: Finding better gender
translations during beam search”. In: Findings of the Association for Computational
Linguistics: ACL. May 2022, pp. 3814–3823.
[Sti+20] Nisan Stiennon et al. “Learning to summarize from human feedback”. In: Proceedings of
the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
Curran Associates Inc., 2020. isbn: 9781713829546.
[Su+24] Jianlin Su et al. “RoFormer: Enhanced transformer with rotary position embedding”. In:
Neurocomputing 568 (2024), p. 127063.
[Sun+20] Yu Sun et al. “ERNIE 2.0: A continual pre-training framework for language understand-
ing”. In: Proceedings of the AAAI Conference on Artificial Intelligence 34.05 (Apr. 2020),
pp. 8968–8975.
[SWF+15] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. “End-to-end memory networks”.
In: Advances in neural information processing systems 28 (2015).
[SZ23] Karanpartap Singh and James Zou. “New evaluation metrics capture quality degradation
due to LLM watermarking”. In: Transactions on Machine Learning Research (2023).
[Szo24] Peter Szolovits. “Large language models seem miraculous, but science abhors miracles”.
In: NEJM AI (May 2024).
[Tan+23] Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. “Does synthetic data gener-
ation of LLMs help clinical text mining?” In: arXiv preprint arXiv:2303.04360 (2023).
[Tan+24a] Sijun Tan et al. “JudgeBench: A benchmark for evaluating LLM-based judges”. In: arXiv
preprint arXiv:2410.12784 (2024).
[Tan+24b] Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. “MathScale: Scaling
instruction tuning for mathematical reasoning”. In: Forty-first International Conference
on Machine Learning, ICML. 2024.

54
[Tay+23] Yi Tay et al. “Scaling laws vs model architectures: How does inductive bias influence
scaling?” In: Findings of the Association for Computational Linguistics: EMNLP. 2023,
pp. 12342–12364.
[Tea+25] Kimi Team et al. “Kimi k1.5: Scaling reinforcement learning with LLMs”. In: arXiv
preprint arXiv:2405.02225 (2025).
[Tes+24] Idit Tessler et al. “Unlocking the potential of large language models in healthcare: navi-
gating the opportunities and challenges”. In: Future Medicine AI 2.1 (Apr. 2024).
[TGH23] Hoang Tran, Chris Glaze, and Braden Hancock. Iterative DPO alignment. Tech. rep.
Snorkel AI, 2023.
[Tho+22] Romal Thoppilan et al. “LaMDA: Language models for dialog applications”. In: arXiv
preprint arXiv:2201.08239 (2022).
[Thr+22] Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia.
“Imbalance trouble: Revisiting neural-collapse geometry”. In: Advances in Neural Infor-
mation Processing Systems 35 (2022), pp. 27225–27238.
[Tim23] The New York Times. The times sues OpenAI and Microsoft over AI use of copyrighted
work. America. Dec. 2023. url: https://www.nytimes.com/2023/12/27/business/
media/new-york-times-open-ai-microsoft-lawsuit.html.
[Tou+23a] Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models”. In: arXiv
preprint arXiv:2307.09288 (2023).
[Tou+23b] Hugo Touvron et al. “LLaMA: Open and efficient foundation language models”. In: arXiv
preprint arXiv:2302.13971 (2023).
[TRB10] Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. “Word representations: A simple
and general method for semi-supervised learning”. In: Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics. 2010, pp. 384–394.
[Tsa+19] Yao-Hung Hubert Tsai et al. “Multimodal transformer for unaligned multimodal language
sequences”. In: Proceedings of the 57th Conference of the Association for Computational
Linguistics, ACL. 2019, pp. 6558–6569.
[TT24] Jacob Trauger and Ambuj Tewari. “Sequence length independent norm-based general-
ization bounds for transformers”. In: International Conference on Artificial Intelligence
and Statistics. PMLR. 2024, pp. 1405–1413.
[Tu+24] Xinming Tu, James Zou, Weijie Su, and Linjun Zhang. “What should data science edu-
cation do with large language models?” In: Harvard Data Science Review 6.1 (2024).
[Tul+24] Eduard Tulchinskii et al. “Intrinsic dimension estimation for robust detection of AI-
generated texts”. In: Advances in Neural Information Processing Systems. Vol. 36. 2024.
[Tyl23] Thaddée Tyl. Chinchilla’s death. Accessed: 2022-05-24. 2023. url: https://espadrine.
github.io/blog/posts/chinchilla-s-death.html.
[TZL24] Brian Tufts, Xuandong Zhao, and Lei Li. “An examination of AI-generated text detectors
across multiple domains and models”. In: Neurips Safe Generative AI Workshop 2024.
2024.
[UZM24] Dennis Ulmer, Chrysoula Zerva, and Andre Martins. “Non-exchangeable conformal lan-
guage generation with nearest neighbors”. In: Findings of the Association for Computa-
tional Linguistics: EACL. Mar. 2024, pp. 1909–1929.
[Vak+22] Thomas Vakili, Anastasios Lamproudis, Aron Henriksson, and Hercules Dalianis. “Down-
stream task performance of bert models pre-trained using automatically de-identified
clinical data”. In: Proceedings of the Thirteenth Language Resources and Evaluation Con-
ference. 2022, pp. 4245–4252.
[Val+23] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. “DyLoRA:
Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank
adaptation”. In: Proceedings of the 17th Conference of the European Chapter of the As-
sociation for Computational Linguistics. May 2023, pp. 3274–3287.

55
[Vas+17] Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information
Processing Systems. Vol. 30. Curran Associates, Inc., 2017.
[Ver+23] Shreyas Verma, Kien Tran, Yusuf Ali, and Guangyu Min. “Reducing LLM hallucinations
using epistemic neural networks”. In: arXiv preprint arXiv:2312.15576 (2023).
[VGS05] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a
random world. Vol. 29. Springer, 2005.
[Vin22] James Vincent. “AI-generated answers temporarily banned on coding Q&A site Stack
Overflow”. In: The Verge 5 (2022).
[VKB23] Nikhil Vyas, Sham M Kakade, and Boaz Barak. “On provable copyright protection for
generative models”. In: International Conference on Machine Learning. PMLR. 2023,
pp. 35277–35299.
[Von+23] Johannes Von Oswald et al. “Transformers learn in-context by gradient descent”. In:
International Conference on Machine Learning. PMLR. 2023, pp. 35151–35174.
[Wal+19] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. “Universal
adversarial triggers for attacking and analyzing NLP”. In: Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-IJCNLP. 2019, pp. 2153–
2162.
[Wan+22] Xiaozhi Wang et al. “Finding skill neurons in pre-trained transformer-based language
models”. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Lan-
guage Processing. 2022, pp. 11132–11152.
[Wan+23a] Yixin Wan et al. “Kelly is a warm person, Joseph is a role model: Gender Biases in
LLM-Generated Reference Letters”. In: Findings of the Association for Computational
Linguistics: EMNLP. Dec. 2023, pp. 3730–3748.
[Wan+23b] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Stein-
hardt. “Interpretability in the wild: a circuit for indirect object identification in GPT-2
Small”. In: The Eleventh International Conference on Learning Representations, ICLR.
2023.
[Wan+23c] Tianlu Wang et al. “Shepherd: A critic for language model generation”. In: arXiv preprint
arXiv:2308.04592 (2023).
[Wan+23d] Xuezhi Wang et al. “Self-consistency improves chain of thought reasoning in language
models”. In: The Eleventh International Conference on Learning Representations, ICLR.
2023.
[Wan+23e] Yizhong Wang et al. “Self-instruct: Aligning language models with self-generated instruc-
tions”. In: Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 2023, pp. 13484–13508.
[Wan+24a] Jiachen T Wang, Zhun Deng, Hiroaki Chiba-Okabe, Boaz Barak, and Weijie J Su.
“An economic solution to copyright challenges of generative AI”. In: arXiv preprint
arXiv:2404.13964 (2024).
[Wan+24b] Yidong Wang et al. “PandaLM: An automatic evaluation benchmark for LLM instruction
tuning optimization”. In: International Conference on Learning Representations (ICLR).
2024.
[Wan+24c] Zifeng Wang et al. “CodecLM: Aligning Language Models with Tailored Synthetic Data”.
In: 2024.
[Wan+24d] Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D. Lee. “Transformers provably learn
sparse token selection while fully-connected nets cannot”. In: Proceedings of the 41st
International Conference on Machine Learning. ICML’24. JMLR.org, 2024.
[WBU11] Jason Weston, Samy Bengio, and Nicolas Usunier. “Wsabie: Scaling up to large vocabu-
lary image annotation”. In: Twenty-Second International Joint Conference on Artificial
Intelligence. Citeseer. 2011.

56
[WCH23] Rui Wang, Pengyu Cheng, and Ricardo Henao. “Toward fairness in text generation via
mutual information minimization based on importance sampling”. In: International Con-
ference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 4473–4485.
[WCM22] Colin Wei, Yining Chen, and Tengyu Ma. “Statistically meaningful approximation: a case
study on approximating Turing machines with transformers”. In: Advances in Neural
Information Processing Systems 35 (2022), pp. 12071–12083.
[Web+23] Debora Weber-Wulff et al. “Testing of detection tools for AI-generated text”. In: Inter-
national Journal for Educational Integrity 19.1 (2023), p. 26.
[Wei+21] Laura Weidinger et al. “Ethical and social risks of harm from language models”. In: arXiv
preprint arXiv:2112.04359 (2021).
[Wei+22a] Jason Wei et al. “Chain-of-thought prompting elicits reasoning in large language models”.
In: Advances in neural information processing systems 35 (2022), pp. 24824–24837.
[Wei+22b] Jason Wei et al. “Finetuned language models are zero-shot learners”. In: International
Conference on Learning Representations. 2022.
[Wen+24] Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, and Jingzhao Zhang. “From sparse depen-
dence to sparse attention: Unveiling how chain-of-thought enhances transformer sample
efficiency”. In: arXiv preprint arXiv:2410.05459 (2024).
[Wet+24] Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. “QuRating: Selecting
high-quality data for training language models”. In: Forty-first International Conference
on Machine Learning, ICML. 2024.
[Wil23] David Williams. Systems and methods for automatically scrubbing sensitive data. US
Patent 11,645,458. May 2023.
[WLJ23] Yuanhao Wang, Qinghua Liu, and Chi Jin. “Is RLHF more difficult than standard RL?
a theoretical perspective”. In: Advances in Neural Information Processing Systems 36
(2023), pp. 76006–76032.
[WLL25] Ting Wu, Xuefeng Li, and Pengfei Liu. “Progress or regress? Self-improvement reversal in
post-training”. In: The Thirteenth International Conference on Learning Representations.
2025.
[Woo+23] Tae-Jin Woo, Woo-Jeoung Nam, Yeong-Joon Ju, and Seong-Whan Lee. “Compensatory
debiasing for gender imbalances in language models”. In: ICASSP 2023-2023 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023,
pp. 1–5.
[Wou24] Bram Wouters. “Optimizing watermarks for large language models”. In: International
Conference on Machine Learning. 2024.
[WP24] Robert Wu and Vardan Papyan. “Linguistic collapse: Neural collapse in (large) language
models”. In: Advances in Neural Information Processing Systems 38, NeurIPS 2024. 2024.
[WS23] Jason Weston and Sainbayar Sukhbaatar. “System 2 attention (is something you might
need too)”. In: arXiv preprint arXiv:2311.11829 (2023).
[WTU20] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. “Scaling autoregressive video
models”. In: 8th International Conference on Learning Representations, ICLR. 2020.
[Wu+23a] Xinwei Wu et al. “DEPN: Detecting and editing privacy neurons in pretrained language
models”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Lan-
guage Processing. Dec. 2023, pp. 2875–2886.
[Wu+23b] Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. “DiPmark: A stealthy, ef-
ficient and resilient watermark for large language models”. In: arXiv preprint arXiv:2310.07710
(2023).
[Wu+24a] Tianhao Wu et al. “Meta-rewarding language models: Self-improving alignment with
LLM-as-a-meta-judge”. In: arXiv preprint arXiv:2407.19594 (2024).
[Wu+24b] Yihan Wu et al. “Distortion-free watermarks are not truly distortion-free under water-
mark key collisions”. In: arXiv preprint arXiv:2406.02603 (2024).

57
[WZT23] Ruijia Wu, Linjun Zhang, and T Tony Cai. “Sparse topic modeling: Computational effi-
ciency, near-optimal algorithms, and statistical inference”. In: Journal of the American
Statistical Association 118.543 (2023), pp. 1849–1861.
[Xia+24a] Haojun Xia et al. “FP6-LLM: Efficiently serving large language models through FP6-
centric algorithm-system co-design”. In: arXiv preprint arXiv:2401.14112 (2024).
[Xia+24b] Xun Xian et al. “RAW: A robust and agile plug-and-play watermark framework for AI-
generated images with provable guarantees”. In: The Thirty-eighth Annual Conference
on Neural Information Processing Systems. 2024.
[Xia+24c] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. “Efficient
streaming language models with attention sinks”. In: The Twelfth International Confer-
ence on Learning Representations, ICLR. 2024.
[Xia+24d] Jiancong Xiao et al. “On the algorithmic bias of aligning large language models with
RLHF: Preference collapse and matching regularization”. In: arXiv preprint arXiv:2405.16455
(2024).
[Xie+22] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. “An explanation
of in-context learning as implicit Bayesian inference”. In: The Tenth International Con-
ference on Learning Representations, ICLR. 2022.
[Xie+23] Sang Michael Xie et al. “DoReMi: Optimizing data mixtures speeds up language model
pretraining”. In: Advances in Neural Information Processing Systems 36, NeurIPS 2023.
2023.
[Xie+24a] Qianqian Xie et al. “Me LLaMA: Foundation large language models for medical applica-
tions”. In: arXiv preprint arXiv:2402.12749 (Feb. 2024).
[Xie+24b] Yangxinyu Xie, Xiang Li, Tanwi Mallick, Weijie J Su, and Ruixun Zhang. “Debias-
ing watermarks for large language models via maximal coupling”. In: arXiv preprint
arXiv:2411.11203 (2024).
[Xu+23] Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. “Some things are more
cringe than others: Preference optimization with the pairwise cringe loss”. In: arXiv
preprint arXiv:2312.16682 (2023).
[Xu+24a] Haoran Xu et al. “Contrastive preference optimization: Pushing the boundaries of LLM
performance in machine translation”. In: Forty-first International Conference on Machine
Learning, ICML. 2024.
[Xu+24b] Yuhui Xu et al. “QA-LoRA: Quantization-aware low-rank adaptation of large language
models”. In: The Twelfth International Conference on Learning Representations, ICLR.
2024.
[Xu23] Ming Xu. MedicalGPT: Training medical GPT model. 2023. url: https://github.com/
shibing624/MedicalGPT.
[Yad+24] Yasin Abbasi Yadkori et al. “Mitigating LLM hallucinations via conformal abstention”.
In: arXiv preprint arXiv:2405.01563 (2024).
[Yan+19] Zhilin Yang et al. “XLNet: generalized autoregressive pretraining for language under-
standing”. In: Proceedings of the 33rd International Conference on Neural Information
Processing Systems. Curran Associates Inc., 2019.
[Yan+23] Ke Yang, Charles Yu, Yi R Fung, Manling Li, and Heng Ji. “Adept: A debiasing prompt
framework”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37.
2023, pp. 10780–10788.
[Yan+24] Xianjun Yang et al. “DNA-GPT: Divergent n-gram analysis for training-free detection
of GPT-generated text”. In: The Twelfth International Conference on Learning Repre-
sentations, ICLR. 2024.
[Yao+23] Shunyu Yao et al. “Tree of thoughts: Deliberate problem solving with large language
models”. In: Advances in Neural Information Processing Systems 36, NeurIPS 2023.
2023.

58
[Ye+24a] Fanghua Ye et al. “Benchmarking LLMs via uncertainty quantification”. In: arXiv preprint
arXiv:2401.12794 (2024).
[Ye+24b] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. “Physics of language mod-
els: Part 2.1, grade-school math and the hidden reasoning process”. In: arXiv preprint
arXiv:2407.20311 (2024).
[Ye+24c] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. “Physics of language models:
Part 2.2, How to learn from mistakes on grade-school math problems”. In: arXiv preprint
arXiv:2408.16293 (2024).
[YLG24] Weizhe Yuan, Pengfei Liu, and Matthias Gallé. “LLMCrit: Teaching large language mod-
els to use criteria”. In: Findings of the Association for Computational Linguistics: ACL.
2024, pp. 7929–7960.
[YNL21] Weizhe Yuan, Graham Neubig, and Pengfei Liu. “BARTScore: Evaluating generated text
as text generation”. In: Advances in Neural Information Processing Systems 34, NeurIPS
2021. 2021, pp. 27263–27277.
[Yu+21] Da Yu et al. “Differentially private fine-tuning of language models”. In: International
Conference on Learning Representations. 2021.
[Yu+23] Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, and Heng Ji. “Unlearning bias in
language models by partitioning gradients”. In: Findings of the Association for Compu-
tational Linguistics: ACL. 2023, pp. 6032–6048.
[Yu+24] Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. “Distilling system 2 into system 1”.
In: arXiv preprint arXiv:2407.06023 (2024).
[Yua+24] Weizhe Yuan et al. “Self-rewarding language models”. In: Forty-first International Con-
ference on Machine Learning, ICML. 2024.
[Yun+20a] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv
Kumar. “Are Transformers universal approximators of sequence-to-sequence functions?”
In: 8th International Conference on Learning Representations, ICLR. 2020.
[Yun+20b] Chulhee Yun et al. “O(n) connections are expressive enough: Universal approximability of
sparse transformers”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 13783–13794.
[YW21] Bo Yang and Lijun Wu. “How to leverage the multimodal EHR data for better medical
prediction?” In: Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing. 2021, pp. 329–345.
[YXL24] Yuanshun Yao, Xiaojun Xu, and Yang Liu. “Large language model unlearning”. In: Ad-
vances in Neural Information Processing Systems 38, NeurIPS 2024. 2024.
[Zel+19] Rowan Zellers et al. “Defending against neural fake news”. In: Advances in Neural In-
formation Processing Systems. Vol. 32. 2019.
[Zel+22] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. “STaR: Bootstrapping
reasoning with reasoning”. In: Advances in Neural Information Processing Systems 35,
NeurIPS 2022. 2022.
[Zen+24] Qingcheng Zeng et al. “Uncertainty is fragile: Manipulating uncertainty in large language
models”. In: arXiv preprint arXiv:2407.11282 (2024).
[Zer23] ZeroGPT. ZeroGPT: Trusted GPT-4, ChatGPT and AI detector tool by ZeroGPT. 2023.
url: https://www.zerogpt.com/.
[ZFB24] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. “Trained transformers learn linear
models in-context”. In: Journal of Machine Learning Research 25.49 (2024), pp. 1–55.
[Zha+22] Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, and Zhaoran Wang. “An analysis of
attention via the lens of exchangeability and latent variable models”. In: arXiv preprint
arXiv:2212.14852 (2022).
[Zha+23] Qingru Zhang et al. “AdaLoRA: Adaptive budget allocation for parameter-efficient fine-
Tuning”. In: arXiv preprint arXiv:2303.10512 (2023).

59
[Zha+24a] Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. “Provable
offline preference-based reinforcement learning”. In: The Twelfth International Confer-
ence on Learning Representations. 2024.
[Zha+24b] Jinghui Zhang, Dandan Qiao, Mochen Yang, and Qiang Wei. “Regurgitative training: The
value of real data in training large language models”. In: arXiv preprint arXiv:2407.12835
(2024).
[Zha+24c] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. “TinyLlama: An open-
source small language model”. In: arXiv preprint arXiv:2401.02385 (2024).
[Zha+24d] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. “Negative preference optimization:
From catastrophic collapse to effective unlearning”. In: arXiv preprint arXiv:2404.05868
(2024).
[Zha+24e] Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar.
“REMARK-LLM: A robust and efficient watermarking framework for generative large
language models”. In: 33rd USENIX Security Symposium (USENIX Security 24). 2024,
pp. 1813–1830.
[Zha+24f] Wenting Zhao et al. “WildChat: 1M ChatGPT interaction logs in the wild”. In: The
Twelfth International Conference on Learning Representations, ICLR. 2024.
[Zha+24g] Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, and Lei Li. “Efficiently identifying wa-
termarked segments in mixed-source texts”. In: arXiv preprint arXiv:2410.03600 (2024).
[Zha+24h] Xufeng Zhao et al. “Enhancing zero-shot chain-of-thought reasoning in large language
models through logic”. In: Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation (LREC-COLING). ELRA
and ICCL, 2024, pp. 6144–6166.
[Zhe+23] Lianmin Zheng et al. “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena”.
In: Advances in Neural Information Processing Systems 36, NeurIPS 2023. 2023.
[Zhe+24] Lianmin Zheng et al. “LMSYS-Chat-1M: A large-scale real-world LLM conversation
dataset”. In: The Twelfth International Conference on Learning Representations, ICLR.
2024.
[Zho+24a] Huiying Zhong, Zhun Deng, Weijie J Su, Zhiwei Steven Wu, and Linjun Zhang. “Provable
multi-party reinforcement learning with diverse human feedback”. In: arXiv preprint
arXiv:2403.05006 (2024).
[Zho+24b] Hongjian Zhou et al. “A survey of large language models in medicine: Progress, applica-
tion, and challenge”. In: arXiv preprint arXiv:2311.05112 (2024).
[ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. “Sparse principal component analysis”.
In: Journal of Computational and Graphical Statistics 15.2 (2006), pp. 265–286.
[Zie+20] Daniel M. Ziegler et al. “Fine-tuning language models from human preferences”. In: arXiv
preprint arXiv:1909.08593 (2020).
[ZJJ23] Banghua Zhu, Michael Jordan, and Jiantao Jiao. “Principled reinforcement learning with
human feedback from pairwise or k-wise comparisons”. In: International Conference on
Machine Learning. PMLR. 2023, pp. 43037–43067.
[ZLM18] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. “Mitigating unwanted biases
with adversarial learning”. In: Proceedings of the 2018 AAAI/ACM Conference on AI,
Ethics, and Society. 2018, pp. 335–340.
[ZLW24] Xuandong Zhao, Lei Li, and Yu-Xiang Wang. “Permute-and-Flip: An optimally robust
and watermarkable decoder for LLMs”. In: arXiv preprint arXiv:2402.05864 (2024).
[ZM24] Chrysoula Zerva and André F. T. Martins. “Conformalizing machine translation eval-
uation”. In: Transactions of the Association for Computational Linguistics 12 (2024),
pp. 1460–1478.
[Zol+24] Thomas P. Zollo et al. “Prompt risk control: A Rigorous framework for responsible deploy-
ment of large language models”. In: The Twelfth International Conference on Learning
Representations, ICLR. 2024.

60
[ZRZ24] Lujing Zhang, Aaron Roth, and Linjun Zhang. “Fair risk control: A generalized framework
for calibrating multi-group fairness risks”. In: arXiv preprint arXiv:2405.02225 (2024).
[ZWL23] Xuandong Zhao, Yu-Xiang Wang, and Lei Li. “Protecting language generation models
via invisible watermarking”. In: International Conference on Machine Learning. Vol. 202.
2023, pp. 42187–42199.

61
A Historical Development of LLMs
The capabilities of LLMs arise from two critical components: representation learning and language
modeling. Representation learning involves learning a numerical representation of input text, in a way
that captures their underlying relationships and patterns. Language modeling involves predicting the
next word in a sequence of texts, allowing for coherent and fluent text generation.

A.1 Representation Learning

The first step in an LLM is to transform natural language into a format that computers can understand—
specifically vectors, matrices, and tensors. The symbolic lexical units (i.e., words) that comprise sen-
tences require effective numerical representation strategies, known as word embeddings. Grounded on
the distributional hypothesis, which posits that linguistic items that occur in similar contexts have
similar meanings [Har54], such methods have gradually become more sophisticated over time.
Early attempts at capturing word meanings involved bag-of-words representations, which represent
documents as sparse vectors, with a dimension for each element of the dictionary, for example as
in TF-IDF [Spa72]. Matrix-decomposition methods were subsequently introduced, such as Latent
Semantic Analysis (LSA) [Hof01; LD02], to represent words as (dense) vectors. Many such methods
implicitly involved weights that were defined or learned in an unsupervised fashion. Supervised learning
methods such as Support Vector Machines were later used to train bag-of-word representations that
aimed directly at solving particular tasks [Joa98], leading to performance improvements. Supervised
training of dense word embeddings by gradient-based or neural methods similarly gave improvements
[Bai+09; WBU11], with the caveat that relatively large datasets were required. While useful, most
of these methods fell short of solving challenging tasks in NLP, as they lacked the capacity to convey
complex semantic and syntactic nuances of words in context.
Neural methods first grew to prominence with the advent of Word2vec [Mik+13] and similar ap-
proaches such as GloVe [PSM14] which directly learn low-dimensional word representations from local
word context. The specific objective functions underlying the architecture of these models were ex-
plored in foundational works such as those by [Ben+03; CW08; Col+11; TRB10], setting the stage
for subsequent innovations. Word2vec, for instance, encompasses two specific models: Continuous
Bag-Of-Words (CBOW) and Skip-gram. The CBOW model predicts a target word based on its sur-
rounding context, whereas the Skip-gram model predicts the surrounding context given a target word.
Both models utilize a straightforward yet efficient neural network architecture that effectively captures
semantic and syntactic word relationships [Mik+13]. These approaches generate dense word embed-
dings, which are compact vectors where each dimension encodes a latent feature of the word. After
this unsupervised training, these vectors were then often used as input or initialization for a supervised
training stage for the task of interest.
These approaches to representation learning focused on words as the primary unit, which gave rise
to several challenges, including a large vocabulary size and the issue of out-of-vocabulary words. To
address these challenges, the concept of tokenization was introduced. Tokenization plays a crucial
role in the effectiveness of these models. It involves breaking down text into smaller units, or tokens,
which affects the granularity of the learned representations. A popular technique used in this context
is Byte Pair Encoding (BPE) [Gag94], which reduces vocabulary size by merging the most frequent
pairs of bytes or characters. This method not only helps manage vocabulary size but also addresses
challenges related to out-of-vocabulary words in new texts. Thus, the evolution of representation
learning, language modeling, and tokenization techniques has become a unified and interdependent
process.

A.2 Language Modeling

In the early stages of language models (LMs), LMs were mostly statistical regression models, relying
on the Markov assumption to predict the next word from the most recent context. The dominant
models were n-gram models [Jel98; GL04; Ros00], which determines a word’s probability based on the
preceding n−1 words. Challenges for these models included the need to tackle data sparsity—the issue
of assigning zero probabilities to unseen words or n-grams [CG96]—and to cope with the exponential
number of transition probabilities that need to be estimated as n increases. As n was typically thus

62
Framework Key Features Use Case Tutorial Link

PyTorch Dynamic graphs, commu- Fine-Tuning, prototyping https://pytorch.org/

nity support tutorials/

TensorFlow Scalability, production Large-scale training https://www.tensorflow.

readiness org/tutorials

JAX High-performance com- Experimental optimization https://jax.readthedocs.

putation, flexibility io/en/latest/tutorials.
html

Table 2: Core Frameworks for Model Development.

Library Key Features Use Case Tutorial Link

Hugging Face Pre-trained models, tok- Fine-Tuning, experi- https://huggingface.co/

Transformers enizers, multi-task support mentation docs/transformers/index

LangChain Tool integration, output Application develop- https://python.langchain.

chaining ment com/docs/introduction/

Table 3: Model Libraries and Pre-trained Models.

forced to be small (e.g., 1-5 words) the idea of using such models to complete a long context input
seemed out of scope.

B Key Resources: Accelerating Progress in LLM Research

The development of LLMs require a combination of computational resources, robust frameworks, and
specialized tools. Over the years, a variety of open-source tools and libraries have been developed,
significantly lowering the barrier to entry for researchers and practitioners. This section provides
a detailed overview of the most popular and widely adopted resources for LLM research, focusing
on frameworks, model libraries, data preparation tools, and utilities for training, deployment, and
analysis.

B.1 Core Frameworks for Model Development

PyTorch PyTorch has become the dominant deep learning framework for LLM research due to its
flexibility, dynamic computation graph, and extensive community support. Its ease of debugging and
seamless integration with other libraries make it ideal for developing and fine-tuning LLMs.

TensorFlow and JAX TensorFlow remains a competitive choice for large-scale training, partic-
ularly with its distributed computing capabilities. JAX, with its functional programming paradigm
and support for automatic differentiation, is gaining traction for research prototypes and cutting-edge
optimization techniques.
Please see Table 2 for the comparison of those frameworks.

B.2 Model Libraries and Pre-trained Models

Hugging Face Transformers Hugging Face’s Transformers library is a cornerstone of LLM re-
search, providing pre-trained models, tokenizers, and utilities for fine-tuning across multiple tasks. Its
intuitive API and extensive documentation make it accessible to researchers at all levels.

LangChain LangChain simplifies the process of integrating LLMs into applications by enabling
seamless chaining of language model outputs and external tools, such as APIs or databases.
Please see Table 3 for more descriptions.

63
Tool Key Features Use Case Tutorial Link

Hugging Curated datasets, effi- Loading and preprocessing https://huggingface.co/

Face cient streaming datasets docs/datasets/index
Datasets

clean-text Text normalization, Data preparation https://github.com/

cleaning noisy datasets jfilter/clean-text

SentencePiece Subword tokenization Text tokenization https://github.com/

google/sentencepiece

Hugging Subword tokenization Text tokenization https://huggingface.co/

Face Tok- docs/tokenizers/index
enizers

Table 4: Datasets and Data Preparation Tools.

B.3 Datasets and Data Preparation Tools

Hugging Face Datasets Hugging Face Datasets provides a vast library of curated datasets and
tools for loading, filtering, and processing data. It supports efficient streaming and preprocessing
pipelines, enabling researchers to work with massive datasets without excessive memory usage. The
library also integrates seamlessly with Hugging Face’s tokenizers and models, making it an essential
tool for LLM research.

Data Cleaning Tools Preprocessing is crucial for preparing datasets for LLM training. Tools like
clean-text simplify the cleaning of noisy datasets by removing special characters, normalizing text,
and correcting encoding issues. Additionally, tokenization libraries such as SentencePiece and Hug-
ging Face’s tokenizers library are indispensable for converting text into model-ready input formats.
Please see Table 4 for summarization.

B.4 Training and Fine-Tuning Utilities

DeepSpeed and Megatron-LM DeepSpeed and Megatron-LM are indispensable for scaling the
training of LLMs. DeepSpeed’s ZeRO (Zero Redundancy Optimizer) optimization reduces mem-
ory overhead, enabling the efficient training of massive models on limited hardware. Additionally,
Megatron-LM excels in model parallelism, splitting large models across multiple GPUs for distributed
training.

Hugging Face Fine-Tuning Frameworks Hugging Face provides an intuitive and widely used
framework for fine-tuning pre-trained models on downstream tasks. Using the Trainer API, researchers
can efficiently fine-tune models with minimal code. Hugging Face also supports custom training loops
for more advanced fine-tuning setups.

Hugging Face PEFT Frameworks Hugging Face’s PEFT frameworks build on techniques like
LoRA to offer additional methods for adapting large models to specific tasks while keeping the majority
of the model frozen. This accelerates fine-tuning and reduces resource requirements.

TRL & OpenRLHF TRL (Transformers Reinforcement Learning) and OpenRLHF are powerful
libraries designed to simplify the implementation of RLHF and Direct Preference Optimization (DPO).
Please see Table 5 for summarization.

B.5 Inference and Deployment

vLLM and TensorRT-LLM vLLM and TensorRT-LLM are optimized tools for low-latency in-
ference of large language models. vLLM focuses on efficient memory usage, enabling faster batch
inference for serving applications. TensorRT-LLM, developed by NVIDIA, supports high-throughput

64
Tool Key Features Use Case Tutorial Link

DeepSpeed ZeRO optimization, scal- Large-scale training https://www.deepspeed.

ability ai/tutorials/

Megatron- Model parallelism Large-scale training https://github.com/

LM NVIDIA/Megatron-LM

Hugging Easy fine-tuning setup, Fine-Tuning pre-trained https://huggingface.

Face Trainer Trainer API models co/docs/transformers/
training

Hugging Efficient fine-tuning tech- Resource-constrained set- https://huggingface.co/

Face PEFT niques tings docs/peft/index

OpenRLHF RLHF, DPO Post training, AI align- https://github.com/

ment OpenRLHF/OpenRLHF

TRL RLHF, DPO Post training, AI align- https://huggingface.co/

ment docs/trl/index

Table 5: Training and Fine-Tuning Utilities.

Tool Key Features Use Case Tutorial Link

vLLM Memory optimization, Production inference https://github.com/

fast batch inference vllm-project/vllm

TensorRT- High-throughput GPU Optimized GPU model https://github.com/

LLM inference serving NVIDIA/TensorRT-LLM/
tree/release/0.5.0

Hugging Simplified API deploy- Rapid integration https://huggingface.co/

Face End- ment inference-endpoints/
points dedicated

Triton Scalable serving with Scalable LLM deployment https://developer.

Inference multi-framework support nvidia.com/
Server triton-inference-server

Ray Serve Distributed, parallel in- Scalable LLM deployment https://docs.ray.io/en/

ference latest/serve/index.html

Table 6: LLM Inference and Deployment.

model inference on GPUs, taking advantage of advanced hardware accelerations. Both tools are ideal
for production environments requiring high-performance LLM deployment.

Triton Inference Server NVIDIA’s Triton Inference Server simplifies LLM deployment by support-
ing multiple frameworks (e.g., PyTorch, TensorFlow, ONNX). It enables scalable, production-grade
model serving with GPU and CPU backends.

Hugging Face Inference Endpoints Hugging Face provides an easy-to-use platform for deploying
LLMs as APIs, making it simple for developers to integrate models into their applications without
managing infrastructure.

Ray Serve Ray Serve is a distributed model serving library that supports scaling and parallel
inference for large LLMs. It integrates seamlessly with distributed computing frameworks.
Please see Table 6 for more information.

65
Tool Key Features Use Case Tutorial Link

Weights & Experiment tracking, hy- Experimentation, visual- https://docs.wandb.ai

Biases perparameter tuning ization

TensorBoard Training progress moni- Model performance analy- https://www.tensorflow.

toring sis org/tensorboard

MLflow Experiment lifecycle Comparing experiments, https://mlflow.org/docs/

management model versioning latest/index.html

Table 7: Visualization and Analysis.

B.6 Visualization and Analysis

Weights & Biases Weights & Biases (W&B) is a powerful tool for experiment tracking, hyper-
parameter optimization, and real-time visualization of training metrics. It integrates seamlessly with
most machine learning frameworks, providing interactive dashboards that display loss curves, accu-
racy trends, and other performance indicators. W&B is particularly useful for collaborative research,
allowing teams to share results and insights easily.

TensorBoard TensorBoard is a widely adopted visualization toolkit for monitoring model training
and evaluation. Its key features include plotting scalars (e.g., loss and accuracy), displaying images,
and analyzing graph structures. TensorBoard’s ease of integration with TensorFlow and PyTorch
makes it a staple in the machine learning community.

MLflow MLflow is an open-source platform for managing the lifecycle of machine learning exper-
iments. It supports logging of metrics, artifacts, and parameters, as well as model versioning and
deployment tracking. MLflow’s UI allows researchers to compare experiments and optimize workflows.
Please see Table 7 for summarization.

B.7 Cloud and Hardware Resources

Cloud computing platforms have become integral to large-scale LLM research and deployment. They
provide scalable, on-demand access to powerful computational resources like GPUs, TPUs, and spe-
cialized accelerators, enabling researchers to run experiments without heavy upfront investment in
infrastructure.

AWS (Amazon Web Services) Offers EC2 instances optimized for deep learning, such as p4d
instances equipped with NVIDIA A100 GPUs. AWS also provides SageMaker, a managed service for
building, training, and deploying machine learning models.

Google Cloud Platform (GCP) Features TPU (Tensor Processing Unit) instances and Vertex AI,
a platform for scalable training and inference. GCP is especially advantageous for TensorFlow users.

Microsoft Azure Provides Azure Machine Learning, a suite of tools and services for machine learn-
ing workflows, alongside GPU-powered VMs optimized for deep learning tasks.
Please see Table 8 for summarization.

66
Platform Key Features Use Case Tutorial Link

AWS EC2 instances with Training, deployment https://aws.amazon.com/

GPUs, SageMaker ai/machine-learning/

GCP TPUs, Vertex AI Scalable training, experi- https://cloud.google.com/

mentation products/ai

Microsoft Azure ML, GPU-powered Experimentation, deploy- https://learn.microsoft.

Azure VMs ment com/en-us/azure/
machine-learning

Table 8: Cloud and Hardware Resources.

Unit 2 Notes
No ratings yet
Unit 2 Notes
117 pages
Et Tu Code - Demystifying LLM, AI Mathematics, and Hardware Infra (2024)
No ratings yet
Et Tu Code - Demystifying LLM, AI Mathematics, and Hardware Infra (2024)
541 pages
Foundations of Large Language Models 1738142777
No ratings yet
Foundations of Large Language Models 1738142777
101 pages
Machine Learning Lab Viva
No ratings yet
Machine Learning Lab Viva
3 pages
Day 17 Introduction To LLMs
No ratings yet
Day 17 Introduction To LLMs
7 pages
AIML Important Questions
No ratings yet
AIML Important Questions
5 pages
Skin Cancer Prediction Using Deep Learning Technique
No ratings yet
Skin Cancer Prediction Using Deep Learning Technique
57 pages
Ai Algorithm Failure
No ratings yet
Ai Algorithm Failure
36 pages
Guide To Fine-Tuning LLMs From Basics
No ratings yet
Guide To Fine-Tuning LLMs From Basics
114 pages
Proj Report Final
No ratings yet
Proj Report Final
115 pages
Artificial Intelligence and Business Intelligence
No ratings yet
Artificial Intelligence and Business Intelligence
26 pages
Benchmarking Large Language Models With A Unified Performance Ranking Metric
No ratings yet
Benchmarking Large Language Models With A Unified Performance Ranking Metric
13 pages
A Comprehensive Overview of Large Language Models: Preprint 1
No ratings yet
A Comprehensive Overview of Large Language Models: Preprint 1
46 pages
Planet, Code - PYTHON For LARGE LANGUAGE MODELS - A Beginners Handbook For Leveraging Llms Into Modern Development Workflows and Applications (2025)
No ratings yet
Planet, Code - PYTHON For LARGE LANGUAGE MODELS - A Beginners Handbook For Leveraging Llms Into Modern Development Workflows and Applications (2025)
254 pages
Saleh Et Al-2024-Scientific Reports
No ratings yet
Saleh Et Al-2024-Scientific Reports
11 pages
2023 Paclic-1 83
No ratings yet
2023 Paclic-1 83
10 pages
Potential Complex Optimisation Problems in Science and Industry
No ratings yet
Potential Complex Optimisation Problems in Science and Industry
51 pages
Foundations of LLM
No ratings yet
Foundations of LLM
231 pages
Convolutional Neural Network-Based Real-Time ROV Detection Using Forward-Looking Sonar Image
No ratings yet
Convolutional Neural Network-Based Real-Time ROV Detection Using Forward-Looking Sonar Image
5 pages
Alotaibi 2020
No ratings yet
Alotaibi 2020
12 pages
AI vs. Machine Learning vs. Deep Learning vs. Neural Networks I IBM
No ratings yet
AI vs. Machine Learning vs. Deep Learning vs. Neural Networks I IBM
7 pages
Wirelessllm: Empowering Large Language Models Towards Wireless Intelligence
No ratings yet
Wirelessllm: Empowering Large Language Models Towards Wireless Intelligence
12 pages
A Comprehensive Survey On Integrating Large Language Models With Knowledge-Based Methods
No ratings yet
A Comprehensive Survey On Integrating Large Language Models With Knowledge-Based Methods
68 pages
A Survey On Large Language Models With Some Insights
No ratings yet
A Survey On Large Language Models With Some Insights
174 pages
Aiml Unit-1
No ratings yet
Aiml Unit-1
27 pages
A Bibliometric Review of Large Language Models Research From 2017 To 2023
No ratings yet
A Bibliometric Review of Large Language Models Research From 2017 To 2023
36 pages
Ad3461 Machine Learning Laboratory - 1
No ratings yet
Ad3461 Machine Learning Laboratory - 1
1 page
Ai Driven Document Processing A Novel Framework For 22nei1ew7b04
No ratings yet
Ai Driven Document Processing A Novel Framework For 22nei1ew7b04
10 pages
Exploring The Frontiers of LLMs in Psychological Applications
No ratings yet
Exploring The Frontiers of LLMs in Psychological Applications
34 pages
Wa0009.
No ratings yet
Wa0009.
2 pages
A Comprehensive Overview of Large Language Models: A B, C, D, E, F, G F, G H, J I J
No ratings yet
A Comprehensive Overview of Large Language Models: A B, C, D, E, F, G F, G H, J I J
47 pages
Forecasting Directional Movements of Stock Prices For Intraday Trading Using LSTM and Random Forests
No ratings yet
Forecasting Directional Movements of Stock Prices For Intraday Trading Using LSTM and Random Forests
8 pages
BDAunit 5
No ratings yet
BDAunit 5
26 pages
Notes 4 Large Language Model
No ratings yet
Notes 4 Large Language Model
4 pages
BECE352E Module 3
No ratings yet
BECE352E Module 3
64 pages
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
No ratings yet
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
46 pages
A Review On Large Language Models Archit
No ratings yet
A Review On Large Language Models Archit
32 pages
Machine Learning Enabled Wireless Communication Network System
No ratings yet
Machine Learning Enabled Wireless Communication Network System
5 pages
Survey On Efficient Inference For LLMs 1721657409
No ratings yet
Survey On Efficient Inference For LLMs 1721657409
36 pages
Language Models: A Guide For The Perplexed
No ratings yet
Language Models: A Guide For The Perplexed
35 pages
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
No ratings yet
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
36 pages
Bueno Teoria 2307.06435
No ratings yet
Bueno Teoria 2307.06435
37 pages
Large Language Models Johns Hopkins University
No ratings yet
Large Language Models Johns Hopkins University
54 pages
Techniques, Tricks & Frameworks
No ratings yet
Techniques, Tricks & Frameworks
143 pages
A Survey On Efficient Inference For Large Language Models
No ratings yet
A Survey On Efficient Inference For Large Language Models
35 pages
An Analysis of Large Language Models: Their Impact and Potential Applications
No ratings yet
An Analysis of Large Language Models: Their Impact and Potential Applications
24 pages
Comparing LLMs Using A Unified Performance Ranking System
No ratings yet
Comparing LLMs Using A Unified Performance Ranking System
13 pages
LMM Model
No ratings yet
LMM Model
41 pages
Statistical Learning Slides
No ratings yet
Statistical Learning Slides
60 pages
UpGrad Campus - Data Science & Analytics Brochure
100% (1)
UpGrad Campus - Data Science & Analytics Brochure
10 pages
LLM Research Paper
No ratings yet
LLM Research Paper
2 pages
IJRPR29621
No ratings yet
IJRPR29621
7 pages
NNs PDF
No ratings yet
NNs PDF
16 pages
Scodeen Global Python DS - ML - Django Syllabus Version 13
No ratings yet
Scodeen Global Python DS - ML - Django Syllabus Version 13
22 pages
LLM Model
No ratings yet
LLM Model
43 pages
Pranay Report-1
No ratings yet
Pranay Report-1
36 pages
Final Project Report - Kelompok 4
No ratings yet
Final Project Report - Kelompok 4
6 pages
Week4 LLMs EN
No ratings yet
Week4 LLMs EN
48 pages
LLM Advancements Applications Challenges 20000 Words
No ratings yet
LLM Advancements Applications Challenges 20000 Words
3 pages
LLM Survey
No ratings yet
LLM Survey
31 pages
A Comprehensive Overview of Large Language Models - 2307.06435v9
No ratings yet
A Comprehensive Overview of Large Language Models - 2307.06435v9
46 pages
LLM Mastery Pathways
No ratings yet
LLM Mastery Pathways
8 pages
Revolutionizing Education With Ai Exploring The Transformative Potential of Chatgpt 13152
No ratings yet
Revolutionizing Education With Ai Exploring The Transformative Potential of Chatgpt 13152
13 pages
Training Large Language Models
No ratings yet
Training Large Language Models
7 pages
Fine Tuning Techniques For Large Language Models LLMs
No ratings yet
Fine Tuning Techniques For Large Language Models LLMs
15 pages
SCOA Unit I MCQ
100% (1)
SCOA Unit I MCQ
14 pages
LLM Seminar PDF
No ratings yet
LLM Seminar PDF
10 pages
2024 NTU - Resaro - LLM - Security - Paper
No ratings yet
2024 NTU - Resaro - LLM - Security - Paper
19 pages
Toc 9780138199302
No ratings yet
Toc 9780138199302
8 pages
Large Language Models A Comprehensive Survey of It
No ratings yet
Large Language Models A Comprehensive Survey of It
30 pages
DZ-getting-started-large Language Models LLMs-2024
No ratings yet
DZ-getting-started-large Language Models LLMs-2024
7 pages
Aryan A. What Is LLMOps. Large Language Models in Production 2024
100% (1)
Aryan A. What Is LLMOps. Large Language Models in Production 2024
67 pages
Kickstart Your Journey With LLM - A Comprehensive Guide
No ratings yet
Kickstart Your Journey With LLM - A Comprehensive Guide
2 pages
Attention Is All You Need.
No ratings yet
Attention Is All You Need.
5 pages
Understanding Large Language Models (LLMS) - A Mode
No ratings yet
Understanding Large Language Models (LLMS) - A Mode
3 pages
Python BAKMR010399001
No ratings yet
Python BAKMR010399001
3 pages
2 Notes
No ratings yet
2 Notes
3 pages
1st Note
No ratings yet
1st Note
3 pages
Application of Large Language
No ratings yet
Application of Large Language
75 pages
Data Seminar
No ratings yet
Data Seminar
10 pages
A Review On Large Language Models Architectures Ap
No ratings yet
A Review On Large Language Models Architectures Ap
31 pages
Large Language Models and Their Use Cases
No ratings yet
Large Language Models and Their Use Cases
3 pages
Exploring The Evolution of Large Language Models: Architectures, Applications, and Future Directions
No ratings yet
Exploring The Evolution of Large Language Models: Architectures, Applications, and Future Directions
11 pages
Large Language Models
No ratings yet
Large Language Models
3 pages
Technical Seminar Report On AI
No ratings yet
Technical Seminar Report On AI
30 pages
7 - Classification
No ratings yet
7 - Classification
71 pages
Large Language Model (LLM) 1
100% (1)
Large Language Model (LLM) 1
17 pages
LLMs in Production-MLC - GRC
No ratings yet
LLMs in Production-MLC - GRC
39 pages
Stochastic Modeling: Analysis and Simulation
From Everand
Stochastic Modeling: Analysis and Simulation
Barry L. Nelson
No ratings yet
Time-dependent Behaviour and Design of Composite Steel-concrete Structures
From Everand
Time-dependent Behaviour and Design of Composite Steel-concrete Structures
Massimiliano Bocciarelli
No ratings yet