Language Models: A Guide For The Perplexed
1 Introduction
In late November 2022, OpenAI released a web-based chatbot, ChatGPT. Within a few months, ChatGPT
was reported to be the fastest-growing application in history, gaining over 100 million users. Reports in the
popular press touted ChatGPT’s ability to engage in conversation, answer questions, play games, write code,
translate and summarize text, produce highly fluent content from a prompt, and much more. New releases
and competing products have followed, and there has been extensive discussion about these new tools: How
will they change the nature of work? How should educators respond to the increased potential for cheating in
academic settings? How can we reduce or detect misinformation in the output? What exactly does it take
(in terms of engineering, computation, and data) to build such a system? What principles should inform
decisions about the construction, deployment, and use of these tools?
Scholars of artificial intelligence, including ourselves, are baffled by this situation. Some were taken aback
at how quickly these tools went from being objects of mostly academic interest to artifacts of mainstream
popular culture. Some have been surprised at the boldness of claims made about the technology and its
potential to lead to benefits and harms. The discussion about these new products in public forums is often
polarizing. The fluency of these systems’ output when prompted conversationally can be startling; their
interactions with people are so realistic that some have proclaimed the arrival of human-like intelligence in
machines, adding a strong emotional note to conversations that, not so long ago, would have mostly addressed
engineering practices or statistics.
Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between
the discourse among those who study language models—the core technology underlying ChatGPT and
similar products—and those who are intrigued and want to learn more about them. In short, we believe the
perspective of researchers and educators can add some clarity to the public’s understanding of the technologies
beyond what’s currently available, which tends to be either extremely technical or promotional material
generated about products by their purveyors.
Our approach teases apart the concept of a language model from the products built on such models, from the behaviors
attributed to or desired from those products, and from claims about similarity to human cognition. As a
starting point, we:
1. Offer a scientific viewpoint that focuses on questions amenable to study through experimentation,
2. Situate language models as they are today in the context of the research that led to their development,
and
3. Describe the boundaries of what is known about the models at this writing.
Popular writing offers numerous, often thought-provoking metaphors for LMs, including bureaucracies or
markets (Henry Farrell and Cosma Shalizi), demons (Leon Derczynski), and a “blurry JPEG” of the web (Ted
Chiang). Rather than offering a new metaphor, we aim to empower readers to make sense of the discourse
and contribute their own. Our position is that demystifying these new technologies is a first step toward
harnessing and democratizing their benefits and guiding policy to protect from their harms.
LMs and their capabilities are only a part of the larger research program known as artificial intelligence (AI).
(They are often grouped together with technologies that can produce other kinds of content, such as images,
under the umbrella of “generative AI.”) We believe they’re a strong starting point because they underlie the
ChatGPT product, which has had unprecedented reach, and also because of the immense potential of natural
language for communicating complex tasks to machines. The emergence of LMs in popular discourse, and
the way they have captured the imagination of so many new users, reinforces our belief that the language
perspective is as good a place to start as any in understanding where this technology is heading.
The guide proceeds in five parts. We first introduce concepts and tools from the scientific/engineering field of
natural language processing (NLP), most importantly the notion of a “task” and its relationship to data
(section 2). We next define language modeling using these concepts (section 3). In short, language modeling
automates the prediction of the next word in a sequence, an idea that has been around for decades. We then
discuss the developments that led to the current so-called “large” language models (LLMs), which appear to
do much more than merely predict the next word in a sequence (section 4). We next elaborate on the current
capabilities and behaviors of LMs, linking their predictions to the data used to build them (section 5). Finally,
we take a cautious look at where these technologies might be headed in the future (section 6). To overcome
what could be a terminology barrier to understanding admittedly challenging concepts, we also include a
Glossary of NLP and LM words/concepts (including “perplexity,” wryly used in the title of this Guide).
Language. For the most part, NLP researchers focus on human languages, and specifically on written forms of those languages. Natural languages are most often contrasted with programming languages like Python and C++, which are artifacts designed deliberately
with a goal in mind.
2 There are other uses of the “NLP” acronym with very different meanings. Ambiguous terms and expressions are common in natural language.
If you have studied the foundations of computer science, you may have been exposed to these ideas before, but we don’t believe they
are universally or consistently taught in classes on those topics. Having a basic understanding of them will
help you to think like an NLP expert.
Figure 1: Some tasks, like alphabetical name sorting, may seem very simple but often raise detailed questions
that must be addressed for a full specification.
These may seem like tedious questions, but the more thoroughly we anticipate the eventual use of the system
we’re building, the better we can ensure it will behave as desired across all possible cases.
Each of these high-level applications immediately raises a huge number of questions, likely many more than
for simpler applications like the name sorter, because of the open-ended nature of natural language input (and
output). Some answers to those questions could lead an expert very quickly to the conclusion that the desired
system just isn’t possible yet or would be very expensive to build with the best available methods. Researchers
make progress on these challenging problems by trying to define tasks, or versions of the application that
abstract away some details while making some simplifying assumptions.
For example, consider the translation of text from one language to another. Here are some fairly conventional
assumptions made in many translation research projects:
• The input text will be in one of a small set of languages; it will be formatted according to newspaper-like
writing conventions. The same holds for the output text.
• Text will be translated one sentence or relatively short segment of text at a time.
• The whole segment will be available during translation (that is, translation isn’t happening in “real
time” as the input text is produced, as might be required when subtitling a live broadcast).
It’s not hard to find research on automatic translation that makes different assumptions from those above. A
new system that works well and relies on fewer assumptions is typically celebrated as a sign that the research
community is moving on to harder problems. For example, it’s only in the past few years that we have made
the leap from systems that each support a single input-to-output language pair to systems that support many input and output languages. We highlight that there are always some narrowing assumptions, hopefully
temporary, that make a problem more precise and therefore more solvable.
We believe that many discussions about AI systems become more understandable when we recognize the
assumptions beneath a given system. There is a constant tension between tasks that are more general/abstract,
on which progress is more impactful and exciting to researchers, and tasks that are more specific/concrete.
Solving a concrete, well-defined task may be extremely useful to someone, but certain details of how that
task is defined might keep progress on that task from being useful to someone else. To increase the chances
that work on a concrete task will generalize to many others, it’s vital to have a real-world user community
engaged in the definition of that task.
2.1.2 We need data and an evaluation method for research progress on a task
The term “task” is generally used among researchers to refer to a specification of certain components of an
NLP system, most notably data and evaluation:
• Data: there is a set of realistic demonstrations of possible inputs paired with their desirable outputs.
• Evaluation: there is a method for measuring, in a quantitative and reproducible way, how well any
system’s output matches the desired output.
Considerable research activity focuses on building datasets and evaluation methods for NLP research, and the
two depend heavily on each other. Consider again the translation example. Examples of translation between
languages are easy to find for some use cases. A classic example is parliamentary language translated from
English to French, or vice versa. The proceedings of the Canadian Parliament are made available to the public
in both English and French, so human translators are constantly at work producing such demonstrations;
paired bilingual texts are often called “parallel text” in the research community. The European Parliament
does the same for multiple languages. Finding such data isn’t as easy for some languages or pairs of languages,
and as a result, there has been considerably more progress on automated translation for European languages
than for others.
What about evaluation of translation? One way to evaluate how well a system translates text is to take
a demonstration, feed the input part to a system, and then show a human judge the desired output and
the system output. We can ask the judge how faithful the system output is to the desired output. If the
judge speaks both languages, we can show them the input instead of the desired output (or in addition to it)
and ask the same question. We can also ask human judges to look only at the system output and judge the
fluency of the text. As you can imagine, there are many possible variations, and the outcomes might depend
on exactly what questions we ask, how we word those questions, which judges we recruit, how much they
know about translation systems already, how well they know the language(s), and whether and how much we
pay them.
In 2002, to speed up translation evaluation in research work, researchers introduced a fully automated way to
evaluate translation quality called “Bleu” scores (Papineni et al. 2002), and there have been many proposed
alternatives since then, with much discussion over how well these cheaper automatic methods correlate with
human judgments. One challenge for automatic evaluation of translation is that natural languages offer many
ways to say the same thing. In general, reliably rating the quality of a translation could require recognizing
all of the alternatives because the system could (in principle) choose any of them.
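To make the idea behind overlap-based automatic scoring concrete, here is a toy sketch in Python. It is not the actual BLEU definition (which combines several n-gram sizes, clips repeated matches, and adds a brevity penalty); it simply measures what fraction of a system output’s word bigrams also appear in a single reference translation.

from collections import Counter

def bigram_overlap(system_output: str, reference: str) -> float:
    # Toy overlap score: fraction of the system's bigrams also found in the reference.
    sys_tokens = system_output.lower().split()
    ref_tokens = reference.lower().split()
    sys_bigrams = Counter(zip(sys_tokens, sys_tokens[1:]))
    ref_bigrams = Counter(zip(ref_tokens, ref_tokens[1:]))
    if not sys_bigrams:
        return 0.0
    matched = sum(min(n, ref_bigrams[bg]) for bg, n in sys_bigrams.items())
    return matched / sum(sys_bigrams.values())

print(bigram_overlap("the cat sat on the mat", "the cat is on the mat"))  # 0.6

Even this toy version illustrates the difficulty noted above: a perfectly good translation phrased differently from the single reference would receive a low score, which is why automatic metrics are continually compared against human judgments.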
We used translation as a running example precisely because these questions are so contentious and potentially
costly for this task. We’ll next consider a fairly concrete task that’s much simpler: categorizing the overall
tone of a movie review (positive vs. negative), instantiating a more general problem known as sentiment
analysis. Here, researchers have collected demonstrations from movie review websites that pair reviews with
numerical ratings (e.g., one to five stars). If a system takes a review as input and predicts the rating, we can
easily check whether the output exactly matches the actual rating given by the author, or we could calculate
the difference between the system and correct ratings. Here, the collection of data is relatively easy, and the
definition of system quality is fairly uncontroversial: the fewer errors a system makes (or the smaller the
difference between the number of author stars and system-predicted stars), the higher the system’s quality.
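As a concrete illustration with made-up numbers, both evaluation choices described above (exact match and the average difference in stars) take only a few lines of Python:

def exact_match_accuracy(predicted, actual):
    # Fraction of reviews where the predicted star rating equals the author's rating.
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    # Average number of stars by which the prediction misses.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical ratings (1 to 5 stars) for five reviews.
actual = [5, 1, 3, 4, 2]
predicted = [5, 2, 3, 3, 2]
print(exact_match_accuracy(predicted, actual))  # 0.6
print(mean_absolute_error(predicted, actual))   # 0.4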
Note, however, that a system that does well on the movie review sentiment task may not do so well on reviews
of restaurants, electronics products, or novels. This is because the language people use to say what they like
or don’t like about a movie won’t carry the same meaning in a different context. (If a reviewer says that a
movie “runs for a long time,” that isn’t as obviously positive as the same remark about a battery-operated
toothbrush, for example.) In general, knowing the scope of the task and how a system was evaluated is
crucial to understanding what we can expect of a system in terms of its generalizability, or how well its
performance quality holds up as it’s used on inputs less and less like those it was originally evaluated on. It’s
also essential when we compare systems; if the evaluations use different demonstrations or measure quality
differently, a comparison won’t make sense.
For most of its history, NLP has focused on research rather than development of deployable systems. Recent
interest in user-facing systems highlights a longstanding tension in taskification and the dataset and evaluation
requirements. On one hand, researchers prefer to study more abstract tasks so that their findings will be
more generally applicable across many potential systems. The scientific community will be more excited,
for example, about improvements we can expect will hold across translation systems for many language
pairs (rather than one) or across sentiment analysis systems for many kinds of reviews (rather than just
movies). On the other hand, there is near-term value in making a system that people want to use because it
solves a specific problem well, which requires being more concrete about the intended users, their data, and
meaningful evaluation.
There is yet another step between researching even fairly concrete tasks and building usable systems. These
are evaluated very differently. Evaluations in research tend to focus on specific, narrowly defined capabilities,
as exemplified in a sample of data. It’s an often unstated assumption in research papers that improved
task performance will generalize to similar tasks, perhaps with some degradation. The research community
tends to share such assumptions, with the exception of research specifically on generalization and robustness
across domains of data. Meanwhile, deployable systems tend to receive more rigorous testing with intended
users, at least to the extent that they are built by organizations with an interest in pleasing those users. In
deployment, “task performance” is only part of what’s expected (systems must also be reasonably fast, have
intuitive user interfaces, pose little risk to users, and more).
People interested in NLP systems should be mindful of the gaps between (1) high-level,
aspirational capabilities, (2) their "taskified" versions that permit measurable research progress,
and (3) user-facing products. As research advances, and due to the tension discussed above,
the "tasks" and their datasets and evaluation measures are always in flux.
2.2 A closer look at data: where it comes from and how it’s used
For the two task examples discussed above (translation and sentiment analysis tasks), we noted that
demonstrations (inputs with outputs) would be relatively easy to find for some instances of the tasks.
However, data might not always be so easy to come by. The availability of data is a significant issue for two
reasons:
• For most NLP applications, and most tasks that aim to approximate those applications, there is no
“easy” source of data. (Sentiment analysis for movie reviews is so widely studied, we believe, because the
data is unusually easy to find, not because there is especially high demand for automatic number-of-stars
prediction.)
• The best known techniques for building systems require access to substantial amounts of extra data to
build the system, not just to evaluate the quality of its output.
Figure 2: When data is split into training and test sets, it’s critical there is no overlap between the two.
Consider a student who somehow gets a copy of the final exam for one of their classes a few weeks before the
exam. Regardless of how much the student is to blame for accessing the test, regardless of whether they even
knew the exam they saw was the actual final exam, regardless of how honorably they behaved during those
weeks and during the test, if they get a high score, the instructor cannot conclude that the student learned
the material. The same holds true for an NLP system. For the test data to be useful as an indicator of the
quality of the system’s output, it is necessary that the test data be “new” to the system. We consider this
the cardinal rule of experimentation in NLP: The test data cannot be used for any purpose prior to
the final test. Occasionally, someone will discover a case where this rule was violated, and (regardless of
the intent or awareness of those who broke the rule) the conclusions of any research dependent on that case
must be treated as unreliable.
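A minimal sketch of how one might enforce this rule when preparing data, assuming the examples are simply a list of texts (real projects must also watch for near-duplicates and other subtler forms of leakage):

import random

def split_without_overlap(examples, test_fraction=0.1, seed=0):
    # Drop exact duplicates, shuffle, and hold out a test set the system never sees.
    unique = list(dict.fromkeys(examples))
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    test, train = unique[:n_test], unique[n_test:]
    assert not set(train) & set(test), "cardinal rule violated: train/test overlap"
    return train, test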
To get a sense of an NLP system’s actual quality, it is crucial that the system not be evaluated
on data it has seen during training.
Collecting data in NLP can refer either to creating new text via expert annotation or crowdsourcing, or to gathering existing text into a more readily accessible form for model developers, such as via web crawling or scraping.
A great deal of tutorial content is already available about machine learning methods, with new contributions following
fast on the heels of every new research advance. Here, we introduce a few key ideas needed to navigate the
current landscape.
The first concept is a parameter. A parameter is like a single knob attached to a system: Turning the knob
affects the behavior of the system, including how well it performs on the desired task. To make this concrete,
let’s consider an extremely simple system for filtering spam emails. Due to budgetary constraints, this system
will have only one parameter. The system works as follows: it scans an incoming email and increments a
counter every time it encounters an “off-color” word (e.g., an instance of one of the seven words the comedian
George Carlin claimed he wasn’t allowed to say on television). If the count is too high, the email is sent to
the spam folder; otherwise, it goes to the inbox. How high is too high? We need a threshold, and we need to
set it appropriately. Too high, and nothing will get filtered; too low, and too many messages may go to spam.
The threshold is an example of a parameter.
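To make the knob metaphor concrete, here is a minimal sketch of that one-parameter system in Python; the word list and the threshold value are placeholders for illustration, not a workable spam filter.

OFF_COLOR_WORDS = {"darn", "heck"}  # stand-ins for the actual off-color word list

def route_email(text: str, threshold: int = 3) -> str:
    # Count off-color words; the single parameter `threshold` is the knob.
    count = sum(word in OFF_COLOR_WORDS for word in text.lower().split())
    return "spam" if count > threshold else "inbox"

Everything in this sketch except the particular value of `threshold` corresponds to the first decision described next (what parameters exist and how they work); picking that value is the second.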
This example neatly divides the system-building problem into two separate parts:
1. Deciding what parameters the system will have and how they will work. In our spam
example, the system and the role of the off-color word threshold parameter are easy to explain. The
term architecture (or model architecture, to avoid confusion with hardware architecture) typically
refers to the decision about what parameters a model will have. For example, picture a generic-looking
black box with lots of knobs on it; the box has a slot on one side for inputs and a slot on the other side
for outputs. The “architecture” of that model refers to the number of knobs, how they’re arranged on
the box, and how their settings affect what occurs inside the box when it turns an input into an output.
2. Setting parameter values. This corresponds to determining what value each individual knob on
the box is turned to. While we likely have an intuition about how to set the parameter in the spam
example, the value that works the best is probably best determined via experimentation.
We now walk through how ML works in more detail and introduce some components you’ll likely hear about
if you follow NLP developments.
2.3.2 Choosing values for all the parameters: Minimizing a loss function
In order to work well, a neural network needs to have its parameters set to useful values (i.e., values that
will work well together to mathematically transform each input into an output close to the input’s correct
answer). But how do we choose parameters’ values when we have so many we need to decide? In this section,
we describe the general strategy that we use in NLP.
Imagine yourself in the following (admittedly not recommended) scenario. At night, and with no GPS
or source of light on you, you are dropped in a random location somewhere over the Cascade Range in
Washington State with the instructions to find the deepest valley you can (without just waiting for morning).
You move your feet to estimate the steepest downward direction. You take a small, careful step in that
direction and repeat until you seem to be in a flat place where there’s no direction that seems to take you
farther downward.
4We are referring to the concept from calculus. If a function is “differentiable” with respect to some numbers it uses, then
calculus gives us the ability to calculate which small changes to those variables would result in the biggest change to the function.
Machine learning (and, by extension, NLP) views the setting of parameter values as a problem of numerical
optimization, which has been widely studied for many years by mathematicians, statisticians, engineers,
and computer scientists. One of the tools of machine learning is an automated procedure that frames the
parameter value-setting problem like that terrifying hike. Recall that we said that neural networks need to
be differentiable with respect to their parameters—that is, they need to be set up to allow calculus to tell us
which tiny change to each parameter will result in the steepest change of something calculated using the
neural network’s output. In our nighttime hike scenario, at each step, we make a tiny adjustment to our
north-south and east-west coordinates (i.e., position on the map). To adjust the parameters of our neural
network, we will consider our current set of parameters our “coordinates” and likewise repeatedly make tiny
adjustments to our current coordinates. But what does it mean to move “down” in this context? Ideally,
moving “down” should correspond to our neural network producing outputs that better match our data. How
can we define a function—our “landscape”— such that this is true?
A loss function is designed for precisely this purpose: to be lower when a neural network performs better.
In short, a loss function evaluates how well a model’s output resembles a set of target values (our training
data), with a higher “loss” signifying a higher error between the two. The more dissimilar the correct output
is from the model’s produced output, the higher the loss value should be; if they match, it should return zero.
This means a loss function should ideally be closely aligned to our evaluation method.5
By performing the following procedure, we are able to train a neural-network-based model:
1. We use a loss function to define our landscape for our model’s nighttime hike based on our training
inputs and outputs,
2. we make a small adjustment to each of our coordinates (model parameters) to move “down” that
landscape towards closer matches between our model’s outputs and the correct ones, and
3. we repeat step 2 until we can’t make our model’s outputs any more similar to the correct ones.
This method is known as (stochastic) gradient descent (SGD), since the direction that calculus gives us
for each parameter is known as the “gradient.”
Leaving aside some important details (for example, how to efficiently calculate the gradients using calculus,
working out precisely when to stop, exactly how much to change the parameter values in step 3, and some
tricks that make the algorithm more stable), this method has proven effective for choosing parameter values
in modern model architectures and in their predecessors.
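As a toy illustration of the whole loop, here is (full-batch) gradient descent fitting a single parameter to made-up data under a squared-error loss; real systems have billions of parameters, estimate the gradient from small batches of data (the “stochastic” part), and rely on automatic differentiation rather than a hand-derived gradient.

# Made-up training data: inputs paired with correct outputs (y is roughly 3 * x).
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]

w = 0.0             # our single parameter, the "coordinate" we will adjust
learning_rate = 0.01

for step in range(500):
    # Loss landscape: average squared difference between model output (w * x) and target y.
    # Calculus gives its slope with respect to w, i.e., which direction is "downhill".
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient   # take a small step downhill
print(round(w, 2))  # about 3.0: the slope that best fits the made-up data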
5 One way to think of a loss function is as a teacher grading an exam: the student (the model whose parameters we want to set) is given an exam question (an input to the model) and produces an answer. The teacher mechanically compares the question’s correct answer to the student’s answer, and then reports how many points have been deducted for mistakes. When the student gets the answer perfectly right, the loss will be zero; no points are deducted. We discuss some
additional mathematical details of loss functions in the appendix.
3.1 Language modeling as next word prediction
The language modeling task is remarkably simple in its definition, in the data it requires, and in its evaluation.
Essentially, its goal is to predict the next word in a sequence (the output) given the sequence of preceding
words (the input, often called the “context” or “preceding context”). For example, if we ask you to come up
with an idea of which word might come next in a sentence in progress—say, “This document is about Natural
Language ____”—you’re mentally performing the language modeling task. The real-world application that
should come to mind is some variation on an auto-complete function, which at this writing is available in
many text messaging, email, and word processing applications.
Language modeling was for several decades a core component in systems for speech recognition and text
translation. Recently, it has been deployed for broad-purpose conversational chat, as in the various GPT
products from OpenAI, where a sequence of “next” words is predicted as a sequential response to a natural
language prompt from a user.
Figure 3: Next word prediction samples a word from the language model’s guess of what comes next at each
time step.
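In code, that loop might look like the hypothetical sketch below; `next_word_distribution` is a stand-in for a real trained model (it ignores the context and always returns the same fixed probabilities), but the predict-then-sample loop is the same.

import random

def next_word_distribution(context):
    # Stand-in for a real LM: assign a probability to each vocabulary word.
    # A trained model would compute these from the context; here they are fixed.
    return {"Language": 0.4, "Processing": 0.3, "models": 0.2, ".": 0.1}

def generate(context, steps=5):
    for _ in range(steps):
        distribution = next_word_distribution(context)
        words, probabilities = zip(*distribution.items())
        next_word = random.choices(words, weights=probabilities)[0]  # sample one word
        context = context + [next_word]
    return " ".join(context)

print(generate(["This", "document", "is", "about", "Natural"]))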
What would make it possible to achieve high accuracy at predicting the next word across many contexts? At
a fundamental level, natural language is predictable because it is highly structured. People unconsciously
follow many rules when they use language (e.g., English speakers mostly utter verbs that agree with their
subjects sometime after those subjects, and they place adjectives before the nouns whose meaning they
modify). Also, much of our communication is about predictable, everyday things (consider how frequently
you engage in small talk).
As an NLP task, language modeling easily checks the two critical boxes we discussed in section 2: data and
evaluation. LMs need only text; every word in a large collection of text naturally comes with the preceding
context of words. When we say “only text,” we mean specifically that we don’t need any kind of label to go
with pieces of text (like the star ratings used in sentiment analysis tasks, or the human-written translations
used in translation tasks). The text itself is composed of inputs and outputs. Because people produce text
and share it in publicly visible forums all the time, the amount of text available (at least in principle, ignoring
matters of usage rights) is extremely large. The problem of fresh, previously unseen test data is also neatly
solved because new text is created every day, reflecting new events and conversations in the world that are
reliably different from those that came before. There is also a relatively non-controversial evaluation of LMs
that requires no human expertise or labor, a more technical topic that we return to in section 3.4.
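To see why plain text is enough, consider the small sketch below: every position in a sentence yields one demonstration for free, with the preceding words as the input and the word at that position as the correct output.

text = "language models predict the next word"
words = text.split()

# Each position provides one (context, next word) training pair at no extra cost.
examples = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in examples:
    print(context, "->", target)
# ['language'] -> models
# ['language', 'models'] -> predict
# ...and so on, through the final word.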
An LM will have needed to implicitly learn those kinds of fluency-related rules to perform the language modeling
task well. This is why LMs have historically been incorporated as a component in larger NLP systems, such
as machine translation systems; by taking their predictions (at least partially) into account, the larger system
is more likely to produce more fluent output.
In more recent years, our understanding of the value of LMs has evolved substantially. In addition to
promoting fluency, a sufficiently powerful language model can implicitly learn a variety of world knowledge.
Consider continuations to the following partial sentences: “The Declaration of Independence was signed by
the Second Continental Congress in the year ____,” or “When the boy received a birthday gift from his
friends, he felt ____.” While there are any number of fluent continuations to those sentences—say, “1501” or
“that the American Civil War broke out” for the first, or “angry” or “like going to sleep” for the second—you
likely thought of “1776” as the continuation for the first sentence and a word like “happy” or “excited” for
the second. Why? It is likely because you were engaging your knowledge of facts about history as well as
your common sense about how human beings react in certain situations. This implies that to produce those
continuations, an LM would need at least a rudimentary version of this information.
To do a good job of guessing the continuations of text, past a certain point, an LM must have
absorbed some additional kinds of information to progress beyond simple forms of fluency.
NLP researchers got an early glimpse of this argument in Peters et al. (2018). That paper reported that, across varied tasks ranging from answering a question about a given paragraph to determining which earlier entity a particular pronoun refers to, systems built by first training an LM far outperformed analogous versions that weren’t informed by an LM (as measured by task-specific definitions of quality). This finding led to widespread researcher acceptance
of the power of “pretraining” a model to perform language modeling and then “finetuning” it (using its
pretrained parameters as a starting point) to perform a non-language-modeling task of interest, which also
generally improved end-task performance.
It shouldn’t be too surprising that LMs can perform well at filling in the blanks or answering questions when
the correct answers are in the training data. For a new task, it seems that the more similar its inputs and
outputs are to examples in the pretraining data, the better the LM will perform on that task.
It’s a useful exercise to consider how to select the set of language names to use as labels for language identification, e.g., which dialects of a language are separate from each other and should receive different labels?
LMs are often built on text from more than one natural language as well as programming language code. The
dominant approach to defining where every word in the data starts and ends is to apply a fully automated
solution to create a vocabulary (set of words the language model will recognize as such) that is extremely
robust (i.e., it will always be able to break a text into words according to its definition of words). The
approach (Sennrich, Haddow, and Birch 2016) can be summed up quite simply:
• Any single character is a word in the vocabulary. This means that the LM can handle any entirely new
sequence of characters by default by treating it as a sequence of single-character words.
• The most frequently occurring two-word sequence is added to the vocabulary as a new word. This rule
is applied repeatedly until a target vocabulary size is reached.
This data-driven approach to building a language modeling vocabulary is effective and ensures that common
words in the LM’s training data are added to its vocabulary. Other, rarer words will be represented as a
sequence of word pieces in the model’s vocabulary (similarly to how you might sound out an unfamiliar word
and break it down into pieces you’ve seen before). However, note that a lot depends on the data through
the calculation of what two-word sequence is most frequent in that data at each step. Unsurprisingly, if the
dataset used to build the vocabulary includes little or no text from a language (or a sub-language), words in
that language will get “chopped up” into longer sequences of short vocabulary words (some a single character),
which has been shown to affect how well the LM performs on text in that language.
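Here is a minimal sketch of that vocabulary-building loop on a toy corpus; real implementations (such as the byte-pair-encoding method cited above) add many practical refinements, but the core procedure is just repeated merging of the most frequent adjacent pair.

from collections import Counter

def build_vocabulary(corpus: str, target_size: int):
    # Every word starts out as a sequence of single-character vocabulary items.
    sequences = [list(word) for word in corpus.split()]
    vocab = {ch for seq in sequences for ch in seq}
    while len(vocab) < target_size:
        # Count how often each adjacent pair of vocabulary items occurs.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        vocab.add(a + b)  # the most frequent pair becomes a new vocabulary item
        # Re-tokenize the corpus so the merged item is used from now on.
        new_sequences = []
        for seq in sequences:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_sequences.append(merged)
        sequences = new_sequences
    return vocab

print(sorted(build_vocabulary("low lower lowest slow slowest", target_size=12)))

Running the same procedure on a corpus containing little text from some language would leave that language’s words represented only as short fragments, which is the effect described above.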
In short, perplexity measures how well an LM predicts a collection of test data (roughly, how much probability it assigns to the actual continuations of the text).9
Like any evaluation method, perplexity depends heavily on the test data. In general, the more similar the
training and test data, the lower we should expect the text data perplexity to be. And if we accidentally
break the cardinal rule and test on data that was included in the training data, we should expect extremely
low perplexity (possibly approaching 1, which is the lowest possible value of perplexity, if the model were
powerful enough to memorize long sequences it has seen in training).
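For concreteness, perplexity is the exponentiated average negative log probability that the model assigned to each word of the test text; the sketch below uses made-up per-word probabilities rather than a real model.

import math

def perplexity(word_probabilities):
    # Exponentiate the average negative log probability of the observed words.
    average_neg_log = -sum(math.log(p) for p in word_probabilities) / len(word_probabilities)
    return math.exp(average_neg_log)

# Hypothetical probabilities a model assigned to each successive word of some test text.
print(round(perplexity([0.2, 0.1, 0.5, 0.25]), 2))  # 4.47
print(perplexity([1.0, 1.0, 1.0]))                  # 1.0, the minimum possible value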
Finally, it’s worth considering when perplexity seems “too” low. The idea that there is some limit to
this predictability, that there is always some uncertainty about what the next word will be, is an old one
(Shannon 1951), motivating much reflection on (1) how much uncertainty there actually is, and (2) what
very low perplexity on language modeling implies. Some have even suggested that strong language modeling
performance is indicative of artificially intelligent behavior. (We return to this question in section 5.)
The fact that next word prediction is the training objective used for these models helps to explain this. The closest an LM comes to
encoding a “fact” is through its parameters’ encoding of which kinds of words tend to follow from a partially written sequence.
Sometimes, the context an LM is prompted with is sufficient to surface facts from its training data. (Imagine our example from
earlier: “The Declaration of Independence was signed by the Second Continental Congress in the year ____.” If an LM fills in
the year “1776” after being given the rest of the sentence as context, that fact has been successfully surfaced.) Other times,
however, it’s not, and we just get a fluent-sounding next word prediction that’s not actually true, or a “hallucination.”
You’re likely to notice that while the short-term continuations to the sentence are reasonable,
the text quickly devolves into moderately fluent incoherence, nothing like text produced by state-of-the-art
web-based products.
Having established the foundations—the language modeling task and the basic strategy for building a language
model—we’ll now consider the factors that have recently transformed the mostly academic language models
of the last decade into the so-called large language models (LLMs) of today.
And indeed, as it turns out, the second factor we now mention
falls into the category of a change in algorithm: a change in model architecture.
It’s important to recognize that larger datasets and more powerful hardware were the drivers
for the scaling up of language models to architectures with hundreds of billions of parameters
(at this writing), and that the parameter count is only part of the explanation for the
impressive behaviors of these models.
Much effort has gone into engineering prompts for better task performance and into finetuning LMs to follow instructions describing
widely varied tasks. Such instruction finetuning has become a widely used second stage of training for commercial LM products.
Note that it requires a dataset of instructions paired with the desired response an LM should give to each.
Both the transformer architecture and the stochastic gradient descent method used to set its parameters
can seem mystifying, at least at first. Below, we reflect on each and note important differences that make an architecture like the transformer the more inscrutable of the two.
Stochastic gradient descent, the algorithm used to train transformers and other neural networks, has been
extensively studied and is very well understood for some kinds of problems. Picture a smooth bowl and
imagine a marble placed anywhere in it. That marble will roll and eventually settle at the lowest point. If
the dish were sitting on a piece of graph paper (a two-dimensional plane), the coordinates of that lowest
point are the values of our two parameters that minimize the loss function. Stochastic gradient descent is,
roughly speaking, doing the work of gravity. The simple curve of the dish, with no bumps or cutouts or
chips, corresponds to the property of convexity. Some machine learning problems correspond to a convex loss
function, and theoretical results characterize the best parameter values, how close SGD gets to
them, and how fast. What remains surprising is that SGD works well in practice even when the loss function
is not convex (like the Cascades, discussed in section 2.3.2). But the mathematics underlying this algorithm
are relatively mature.
The transformer architecture, only a few years old at this writing, remains mysterious. Some researchers have
sought to prove theorems about its limitations (i.e., input-output mappings it cannot represent under some
conditions), and more have run experiments to try to characterize what it learns from data in practice. More
research is clearly needed, both to improve our understanding of what we can expect from this architecture
and to help define new architectures that work better or for which parameter setting is less computationally
expensive.
4.3.3 Cost and complexity affect who can develop these models now
Yet another effect of the move to LLMs has been that a much smaller set of organizations can afford to
produce such models. Since large, well-funded tech companies are (almost exclusively) well positioned to train
LLMs due to their access to both data and specialized hardware, these companies are the sources for almost
all current LLMs. This poses a barrier to entry for many researchers at other institutions. Given the wide
array of different communities that could benefit from using these models, the many different purposes they
might envision for these models, and the vast diversity of language varieties that they represent, determining
ways to broaden participation in LLM development is an important emerging challenge.
Furthermore, when models were smaller, the idea of “running out” of web text on the public internet seemed
ludicrous; now, that’s a looming concern for LLM developers. As massive datasets play an increasingly large
role in model training, some large companies’ access to their own massive proprietary data associated with
platforms they maintain may give them an advantage in their development of models of text.
Decades ago, Norbert Wiener warned that if we use a machine to achieve our purposes and cannot effectively interfere with its operation, “then we had better be quite sure that the purpose put into the machine is the purpose which we really desire”
(Wiener 1960). This idea comes through today in research on using machine learning to alter LM behaviors
directly.
In practice, commercial models are further trained on tasks designed to encourage instruction following
(section 4.3.1) and generating text that humans respond to favorably.15 It is complicated to determine which
behaviors to encourage. In her 2023 keynote at the FAccT research conference, the social scientist Alondra
Nelson made the point that “civilizations, for eons, for millennia. . . choose your long time scale—have been
struggling, fighting, arguing, debating over human values and principles” (Nelson 2023). In other words, not
only is it a difficult problem to determine how to shape models’ outputs to reflect a given set of values, it’s
also extremely complicated to determine which values that set should include in the first place. Therefore, we tend
to view these last adjustments of an LLM’s behavior as a kind of customization rather than as an intrinsic
encoding of “human values” into the system. As with training models, only a few companies are currently
equipped to customize them at this writing.
This is typically done with a technique called reinforcement learning from human feedback (RLHF); as the name implies, this method uses machine learning to turn discrete representations of human preferences, like “sampled output A is
preferable to sampled output B,” into a signal for how to adjust a model’s parameters accordingly.
• The suite of tasks driving research evaluations needs thorough and ongoing reconsideration and updating
to focus on communities of actual users.
• Observations of how real users interact with an LLM, along with feedback on the quality of the LLM’s
behavior, will be important for continuing to improve LLM quality.
• Because there is diversity in the communities of users, customization of models will become increasingly
important, making thorough evaluation increasingly multi-faceted and challenging.
• Reports of “progress” cannot be taken at face value; there are many different aspects to model quality.
A single performance number (like perplexity on a test set or average performance on a suite of hundreds
or thousands of tasks’ specific evaluations) will not meaningfully convey the strengths and weaknesses
of a system with such wide-ranging possible behaviors.
We believe that these challenges will inspire new collaborations between researchers and users to define
evaluations (and, by extension, models) that work as our needs and the realities of model building evolve.
Many researchers have one specific concern about hidden training datasets: Suppose a model is prompted
with a question that seems especially difficult to answer, and it answers accurately and clearly, like an expert.
We should be impressed only if we are confident that the question and answer weren’t in the training data. If
we can’t inspect the training data, we can’t be sure whether the model is really being tested fairly or if it
memorized the answer key before the test, like our student in section 2.2.
5.2 Do I always have to check and verify model output, or can I simply “trust”
the result?
At first glance, it might seem that a prompt that produces believable model output means there’s nothing
left for you to do. However, you should never take model output at face value. Always check for the following
important issues.
How a model learns to perform a task is heavily influenced by the particular data used to train it. (This is related to our previous
discussion in section 2.1 about the tradeoff between abstract, aspirational notions of a task and concrete,
workable ones.) In practice, models for “hate speech detection” are actually trained to perform “hate speech
detection as exemplified in the HateXplain dataset” or “hate speech detection as exemplified in the IberEval
2018 dataset.” These datasets reflect their builders’ focus on particular type(s) of language—for example,
Spanish-language news articles or American teenagers’ social media posts—but no dataset perfectly represents
the type(s) of language it’s meant to represent. There are simply too many possible utterances! Therefore,
despite ongoing work trying to improve models’ abilities to generalize from the data observed during training,
it remains possible that a model will learn a version of the task that’s informed by quirks of its training data.
Because there are so many possible “quirks,” it’s a safe bet that a model will have learned some of them.
And in fact, we’ve observed this time and again in NLP systems.
To be more specific, let’s look at some past work that’s found bias traceable to the training data within hate
speech detection systems. Sap et al. (2019) found that in two separate hate speech detection datasets, tweets
written in African American Vernacular English (AAVE) were disproportionately more likely to be labeled as
toxic by the humans employed to annotate toxicity than those written in white-aligned English. Not only that,
but models trained on those datasets were then more likely to mistakenly label innocuous AAVE language as
toxic than they were to mistakenly flag innocuous tweets in white-aligned English. This gives us an idea
of how dataset bias can propagate to models in text classification systems, but what about in cases where
models generate text? If models aren’t associating text with any human-assigned toxicity labels, how can
they demonstrate bias?
As it turns out, evidence of bias is still visible even in cases where the model isn’t generating a single
predefined category for a piece of text. A famous early example of work showing this for Google Translate
based its study on a variety of occupations for which the US Bureau of Labor Statistics publishes gender
ratios (Prates, Avelar, and Lamb 2019). The authors evaluated machine translation systems that translated
to English from various languages that don’t use gendered singular pronouns, constructing sentences such
as “[neutral pronoun] is an engineer” and translating them into English. They found that these systems
demonstrated a preference for translating to “he” that often far exceeded the actual degree by which men
outnumbered women in an occupation in the US. This bias likely reflects an imbalance in the number of
training sentences associating men and women with these different professions, indicating another way in
which a skew in the training data for a task can influence a model.
Imbalances like this are examples of those “quirks” we mentioned earlier, and they can be puzzling. Some
quirks, like data containing far more mentions of male politicians than female politicians, seem to follow
from the prevalence of those two categories in the real world. Other quirks initially seem to defy common
sense: though black sheep are not prevalent in the world, “black sheep” get mentioned more often in English
text than other-colored sheep, perhaps because they’re more surprising and worthy of mention (or perhaps
because a common idiom, “the black sheep of the family,” uses the phrase).
In the same way that biases can arise in machine translation systems, LMs can exhibit bias in generating
text. While current LMs are trained on a large portion of the internet, text on the internet can still exhibit
biases that might be spurious and purely accidental, or that might be associated with all kinds of underlying
factors: cultural, social, racial, age, gender, political, etc. Very quickly, the risks associated with deploying
real-world systems become apparent if these biases are not checked. Machine learning systems have already
been deployed by private and government organizations to automate high-stakes decisions, like hiring and
determining eligibility for parole, which have been shown to discriminate based on such factors (Raghavan et
al. 2020; Nishi 2019).
So how exactly can researchers prevent models from exhibiting these biases and having these effects? It’s
not a solved problem yet, and some NLP researchers would argue that these technologies simply shouldn’t
be used for these types of systems, at least until there is a reliable solution. For LMs deployed for general
use, research is ongoing into ways to make models less likely to exhibit certain known forms of bias (e.g., see
section 4.3.4). Progress on such research depends on iterative improvements to data and evaluations that let
researchers quantitatively and reproducibly measure the various forms of bias we want to remove.
Remember: datasets and evaluations never perfectly capture the ideal task!
5.3 Are language models intelligent?
The emergence of language model products has fueled many conversations, including some that question
whether these models might represent a form of “intelligence.” In particular, some have questioned whether
we have already begun to develop “artificial general intelligence” (AGI). This idea implies something much
bigger than an ability to do tasks with language. What do these discussions imply for potential users of these
models?
We believe that these discussions are largely separate from practical concerns. Until now in this document,
we’ve mostly chosen to use the term “natural language processing” instead of “artificial intelligence.” In part,
we have made this choice to scope discussion around technologies for language specifically. However, as
language model products are increasingly used in tandem with models of other kinds of data (e.g., images,
programming language code, and more), and given access to external software systems (e.g., web search),
it’s becoming clear that language models are being used for more than just producing fluent text. In fact,
much of the discussion about these systems tends to refer to them as examples of AI (or to refer to individual
systems as “AIs”).
A difficulty with the term “AI” is its lack of a clear definition. Most uncontroversially, it functions as a
descriptor of several different communities researching or developing systems that, in an unspecified sense,
behave “intelligently.” Exactly what we consider intelligent behavior for a system shifts over time as society
becomes familiar with techniques. Early computers did arithmetic calculations faster than humans, but were
they “intelligent?” And the applications on “smart” phones (at their best) don’t seem as “intelligent” to
people who grew up with those capabilities as they did to their first users.
But there’s a deeper problem with the term, which is the notion of “intelligence” itself. Are the capabilities
of humans that we consider “intelligent” relevant to the capabilities of existing or hypothetical “AI” systems?
The variation in human abilities and behaviors, often used to explain our notions of human intelligence, may
be quite different from the variation we see in machine intelligence. In her 2023 keynote at ACL (one of the
main NLP research conferences), the psychologist Alison Gopnik noted that in cognitive science, it’s widely
understood that “there’s no such thing as general intelligence, natural or artificial,” but rather many different
capabilities that cannot all be maximally attained by a single agent (Gopnik 2023).
In that same keynote, Gopnik also mentioned that, in her framing, “cultural technologies” like language
models, writing, or libraries can be impactful for a society, but it’s people’s learned use of them that makes
them impactful, not inherent “intelligence” of the technology itself. This distinction, we believe, echoes a
longstanding debate in yet another computing research community, human-computer interaction. There,
the debate is framed around the development of “intelligence augmentation” tools, which humans directly
manipulate and deeply understand, still taking complete responsibility for their own actions, vs. agents, to
which humans delegate tasks (Shneiderman and Maes 1997).
Notwithstanding debates among scholars, some companies like OpenAI and Anthropic state that developing
AGI is their ultimate goal. We recommend first that you recognize that “AGI” is not a well-defined scientific
concept; for example, there is no agreed-upon test for whether a system has attained AGI. The term should
therefore be understood as a marketing device, similar to saying that a detergent makes clothes smell “fresh”
or that a car is “luxurious.” Second, we recommend that you assess more concrete claims about models’
specific capabilities using the tools that NLP researchers have developed for this purpose. You should expect
no product to “do anything you ask,” and the clear demonstration that it has one capability should never
be taken as evidence that it has different or broader capabilities. Third, we emphasize that AGI is not the
explicit or implicit goal of all researchers or developers of AI systems. In fact, some are far more excited
about tools that augment human abilities than about autonomous agents with abilities that can be compared
to those of humans.
We close with an observation. Until the recent advent of tools marketed as “AI,” our experience with
intelligence has been primarily with other humans, whose intelligence is a bundle of a wide range of
capabilities we take for granted. Language models have, at the very least, linguistic fluency: the text they
generate tends to follow naturally from their prompts, perhaps indistinguishably well from humans. But LMs
don’t have the whole package of intelligence that we associate with humans. In language models, fluency,
for example, seems to have been separated from the rest of the intelligence bundle we find in each other.
We should expect this phenomenon to be quite shocking because we haven’t seen it before! And indeed,
many of the heated debates around LMs and current AI systems more generally center on this “unbundled”
intelligence. Are the systems intelligent? Are they more intelligent than humans? Are they intelligent in the
same ways as humans? If the behaviors are in some ways indistinguishable from human behaviors, does it
matter that they were obtained or are carried out differently than for humans?
We suspect that these questions will keep philosophers busy for some time to come. For most of us who work
directly with the models or use them in our daily lives, there are far more pressing questions to ask. What
do I want the language model to do? What do I not want it to do? How successful is it at doing what I
want, and how readily can I discover when it fails or trespasses into proscribed behaviors? We hope that our
discussion helps you devise your own answers to these questions.
Remember: analogies to human capabilities never perfectly capture the capabilities of language
models, and it’s important to explicitly test a model for any specific capability that your use
case requires!
6.1 Why is it difficult to make projections about the future of NLP technologies?
For perspective, let’s consider two past shifts in the field of NLP that happened over the last ten years.
The first, in the early 2010s, was a shift from statistical methods—where each parameter fulfilled a specific,
understandable (to experts) role in a probabilistic model—to neural networks, where blocks of parameters
without a corresponding interpretation were learned via gradient descent. The second shift, around 2018–19,
was the general adoption of the transformer architecture we described in section 4.2, which mostly replaced
past neural network architectures popular within NLP, and the rise of language model pretraining (as discussed
in section 3.2).
Most in the field didn’t anticipate either of those changes, and both faced skepticism. In the 2000s, neural
networks were still largely an idea on the margins of NLP that hadn’t yet demonstrated practical use;
further, prior to the introduction of the transformer, another, very different structure of neural network18 was
ubiquitous in NLP research, with relatively little discussion about replacing it. Indeed, for longtime observers
of NLP, one of the few seeming certainties is of a significant shift in the field every few years—whether in
the form of problems studied, resources used, or strategies for developing models. The form this shift takes
does not necessarily follow from the dominant themes of the field over the preceding years, making it more
“revolutionary” than “evolutionary.” And, as more researchers are entering NLP and more diverse groups
collaborate to consider which methods or which applications to focus on next, predicting the direction of
these changes becomes even more daunting.
A similar difficulty applies when thinking about long-term real-world impacts of NLP technologies. Even
setting aside that we don’t know how NLP technology will develop, determining how a particular technology
18 It was called the LSTM, “long short-term memory” network.
will be used poses a difficult societal question. Furthermore, NLP systems are being far more widely deployed
in commercial applications; this means that model developers are getting far more feedback about them from
a wider range of users, but we don’t yet know the effects that deployment and popular attention will have on
the field.
Remembering how these models work at a fundamental level—using preceding context to predict the next
text, word by word, based on what worked best to mimic demonstrations observed during training—and
imagining the kinds of use cases that textual mimicry is best-suited towards will help us all stay grounded
and make sense of new developments.
Regulation pegged to the technology as it exists at one point in time is likely to lose relevance quickly, as we are already seeing in some ways with the
Executive Order on AI. Any regulation that isn’t focused on broader concepts like harm reduction and safe
use cases runs the risk of becoming quickly outdated, given the current (and likely future) pace of technology
development.
At a lower level closer to the implementation and training of AI systems, the legal focus so far has overwhelmingly been on copyrights associated with models’ training data. A 2018 amendment to Japan’s 1970
Copyright Act gives generative AI models broad leeway to train on copyrighted materials provided that the
training’s “purpose is not to enjoy the ideas or sentiments expressed in the work for oneself or to have others
enjoy the work.” However, more recent court cases focused on generative image models, such as Getty Images
suing Stability AI Inc. or a group of artists suing Stability AI, Midjourney, and DeviantArt, are pushing
back on that view and have yet to reach a resolution.
Even these early forays into the intersection of AI systems with copyright protection differ in their leanings,
which shows how difficult it can be to legislate comprehensively on AI issues. (Indeed, there are already
further proposed amendments to Japan’s Copyright Act that consider restricting the application of the 2018
amendment.) To date, we haven’t seen many court cases focused on generative models of text. Perhaps the
closest is a case about computer code, Doe 1 v. GitHub, Inc., which focuses on the fact that many public
code repositories on the GitHub website, from which training data has been drawn, come with licenses that
were stripped from the data during training. Given that such court
cases focus on training data, one unanswered question is how such legal cases will affect companies’ openness
about their models’ training data in the future. As we discussed, the more opaque the training data, the less
hope we have of understanding a model.
7 Final remarks
Current language models are downright perplexing. By keeping in mind the trends in the research communities
that produced them, though, we gain a sense of why these models behave as they do. Keeping in mind the
primary task that these models have been trained to accomplish, i.e., next word prediction, also helps us
to understand how they work. Many open questions about these models remain, but we hope that we’ve
provided some helpful guidance on how to use and assess them. Though determining how these technologies
will continue to develop is difficult, there are helpful actions that each of us can take to push that development
in a positive direction. By broadening the number and type of people involved in decisions about model
development and engaging in broader conversations about the role of LMs and AI in society, we can all help
to shape AI systems into a positive force.
Acknowledgments
The authors appreciate feedback from Sandy Kaplan, Lauren Bricker, Nicole DeCario, and Cass Hodges at
various stages of this project, which was supported in part by NSF grant 2113530. All opinions and errors
are the authors’ alone.
Glossary
Algorithm: A procedure that operates on a set of inputs in a predefined, precisely specified way to produce
a set of outputs. Algorithms can be translated into computer programs. This document references several
different algorithms: (1) stochastic gradient descent, which takes as input a (neural network) model
architecture, a dataset, and other settings and produces as output a model; (2) a model itself, which takes
as input specified text and produces an output for the task the model was trained to perform (for example, a
probability distribution over different kinds of attitudes being expressed for a sentiment classification
model, or a probability distribution over which word comes next for a language model); (3) an algorithm
for constructing a language model’s vocabulary (section 3.3).
Alignment (of a model to human preferences): This term can refer either to the degree to which a model
reflects human preferences, or to the process of adjusting a model to better reflect human preferences. See
section 4.3.4.
Architecture (of a model): The template for arranging a model’s parameters and specifying how those
parameters are jointly used (with an input) to produce the model’s output. Note that specifying the model
architecture does not involve specifying the values of individual parameters, which are defined later. (If
you consider a model to be a “black box” with knobs on its side that is given an input and produces an
output, the model’s “architecture” refers to the arrangement of knobs on/inside the box without including
the particular values to which each knob is set.)
Artificial intelligence (AI): (1) Broadly describes several fields or research communities that focus on
improving machines’ ability to process complicated sources of information about the world (like images or
text) into predictions, analyses, or other human-useful outputs. (2) Also refers in popular usage (but not this
guide) to an individual system (perhaps a model) built using techniques developed in those fields (such as
Deep Blue or ChatGPT).
Bleu scores: A fully automated way introduced by Papineni et al. (2002) to evaluate the quality of a
proposed translation of text into a target language. At a high level, the Bleu score for a proposed translation
of text (with respect to a set of approved reference translations for that same text) is calculated by looking
at which fraction of small chunks (e.g., one-word chunks, two-word chunks, etc.) of the proposed translation
appear in at least one of the reference translations.
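To make the chunk-overlap idea concrete, here is a rough Python sketch using a single, invented reference translation. It is not the full Bleu formula from Papineni et al. (2002), which also clips repeated chunks, combines several chunk sizes with a geometric mean, allows multiple references, and applies a brevity penalty.

def chunks(words, n):
    # all contiguous n-word chunks of a list of words
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def chunk_overlap(proposed, reference, n):
    # fraction of the proposed translation's n-word chunks that appear in the reference
    proposed_chunks = chunks(proposed.split(), n)
    reference_chunks = chunks(reference.split(), n)
    matches = sum(1 for c in proposed_chunks if c in reference_chunks)
    return matches / len(proposed_chunks)

proposed = "the cat sat on mat"
reference = "the cat sat on the mat"
print(chunk_overlap(proposed, reference, 1))   # one-word chunks: 5/5 = 1.0
print(chunk_overlap(proposed, reference, 2))   # two-word chunks: 3/4 = 0.75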
Computer vision (CV): A subfield of computer science research that advances the automated processing
and production of information from visual signals (images).
Content safeguards: A term commonly used within NLP to refer to the strategies that are used to try to
keep language models from generating outputs that are offensive, harmful, dangerous, etc. We give some
examples of these strategies in section 4.3.5.
Convergence: A concept in machine learning describing the point at which the loss between a model’s output
and the expected output from data falls below some threshold. Model convergence during training usually means
that the model is no longer improving, as occurs at the end of SGD.
Data: The pairs of sample inputs and their desired outputs associated with a task, used to train or evaluate
a model. For NLP, this is typically a massive collection of either text that originates in digital form (e.g.,
text scraped from a post published to an internet forum) or text converted into a digital format (e.g., text
extracted from a scanned handwritten document). It may also include additional information describing the
text, like sentiment labels for a sentiment analysis dataset.
Data-driven: A description of a process indicating that it determines actions based on analysis of massive
data stores (in contrast to having a person or multiple people make all of these decisions). For example, a
person deciding on the vocabulary for a language model they’re about to build could either (1) manually
define a list of all words or parts of words that the model’s vocabulary would include (not data-driven) or (2)
collect text data and run a data-driven algorithm (see section 3.3) to automatically produce a vocabulary
based on that dataset for the eventual model. Machine learning algorithms are, in general, data-driven.
Deep learning: A term that describes machine learning methods focused on training (neural network)
models with many layers.
Depth (of a model): Refers to the number of layers a neural network architecture contains.
Domain (of data): A specific and intuitive (though not formally defined) grouping of specific data. For
example, an NLP researcher might refer to “the Wikipedia domain” of text data, or “the business email
domain” of text data. The term offers an expedient way for researchers or practitioners to refer to data that
generally has some unifying characteristics or characteristics different from some other data.
Extrinsic evaluation (of a model): An evaluation that measures whether (and how much) using that model as
part of a larger system helps that system, or that considers factors related to the model’s eventual use in
practice.
Finetuning (of a model for a specific task): Continued training of a model on a new dataset of choice that
occurs after original parameter values were trained on other tasks/datasets. Use of the term “finetuning”
indicates that the model about to be finetuned has already been trained on some task/dataset.
Function: Broadly, a mapping of inputs to outputs. In other words, a function takes as input any input
matching a particular description (like “number” or “text”) and will give a (consistent) answer for what that
input’s corresponding output should be. However, everywhere we use the word “function” in this document
(except in the context of “autocomplete functions”), we are referring more specifically to functions that take
in a set of numbers and produce single-number outputs.
Generative AI: A subset of artificial intelligence focused on models that learn to simulate (and can
therefore automatically produce/generate) complex forms of data, such as text or images.
Gradient (of a function): A calculus concept. Given a particular point in an n-dimensional landscape,
the gradient of a function indicates the direction (and magnitude) of that function’s steepest ascent from
that point. By considering the current parameters of a neural network model as the point in that
n-dimensional landscape, and taking the gradient of a loss function with respect to those parameters, it is
possible to determine a very small change to each parameter that increases the loss function as much as
locally possible. This also indicates that the opposite small change can decrease the loss function as much as
locally possible, the goal when running SGD.
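As a toy illustration, the short Python sketch below uses a made-up two-parameter function (not a real model’s loss) whose gradient can be written by hand; taking one small step opposite the gradient decreases the function’s value, which is exactly the move made during SGD.

def loss(a, b):
    return (a - 3) ** 2 + (b + 1) ** 2           # smallest (zero) at a = 3, b = -1

def gradient(a, b):
    # partial derivatives of the loss with respect to each parameter
    return 2 * (a - 3), 2 * (b + 1)

a, b = 0.0, 0.0                                  # the current "point in the landscape"
step_size = 0.1
grad_a, grad_b = gradient(a, b)                  # direction of steepest ascent: (-6, 2)
a, b = a - step_size * grad_a, b - step_size * grad_b   # move the opposite way
print(loss(0.0, 0.0), loss(a, b))                # the loss drops from 10.0 to 6.4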
Hallucination (by a language model): A term commonly used to describe nonfactual or false statements
in outputs produced by a language model.
Hardware: The (physical) machines on which algorithms are run. For contemporary NLP, these are typically
GPUs (graphics processing units), which were initially designed to render computer graphics quickly but were
later applied to the kinds of matrix-based operations often performed by neural networks.
Intrinsic evaluation (of a model): An evaluation that assesses the model on a specific test set “in a
vacuum,” that is, without considering how plugging that model into a larger system would help that larger
system.
Label: Some tasks have outputs that are a relatively small set of fixed categories (unlike language modeling,
where the output is a token from some usually enormous vocabulary). In cases where outputs are decided
from that kind of small set, NLP researchers typically refer to the correct output for a particular input as
that input’s “label”. For example, the set of labels for an email spam-identification task would be “spam” or
“not spam,” and a sentiment analysis task might define its set of possible labels to be “positive,” “negative,”
or “neutral.”
Language model: A model that takes text as input and produces a probability distribution over which
word in its vocabulary might come next. See section 3.
Layer (of a neural-network-based model): A submodule with learnable parameters of a neural network
that takes as input a numerical representation of data and outputs a numerical representation of data. Modern
neural networks tend to be deep, meaning that they “stack” many layers so that the output from one layer
is fed to another, whose output is then fed to another, and so on.
Loss function: A mathematical function that takes in a model’s proposed output given a particular input
and compares it to (at least) one reference output for what the output is supposed to be. Based on how
similar the reference output is to the model’s proposed output, the loss function will return a single number,
called a “loss.” The higher the loss, the less similar the model’s proposed output is to the reference output.
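As a concrete (and simplified) example, one common choice for language models is the cross-entropy loss: the negative log of the probability the model assigned to the reference word. The tiny vocabulary below is invented for illustration; the point is only that putting less probability on the reference output yields a higher loss.

import math

proposed_output = {"apple": 0.1, "banana": 0.6, "orange": 0.3}   # the model's probability distribution
reference_output = "banana"                                      # the word that actually came next

loss = -math.log(proposed_output[reference_output])
print(round(loss, 3))   # 0.511; probability 1.0 on "banana" would give a loss of 0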
Machine learning (ML): An area of computer science focused on algorithms that learn how to (approximately) solve a problem from data, i.e., to use data to create other algorithms (models) that are
deployable on new, previously unseen data.
Mappings (of input to output): A pairing of each (unique) possible input to a (not necessarily unique)
output, with the mapping “translating” any input it is given to its paired output.
Model: An algorithm for performing a particular task. (NLP researchers typically refer to such an
algorithm as a model only if its corresponding task is sufficiently complicated to lack any provably correct,
computationally feasible way for a machine to perform it. Hence, we apply machine learning to build a
model to approximate the task.) Though a model that performs a particular task does not necessarily have
to take the form of a neural network (e.g., it could instead take the form of a list of human-written rules),
in practice, current NLP models almost all take the form of neural networks.
Natural language processing (NLP): A subfield of computer science that advances the study and
implementation of automated processing and generation of information from text and, perhaps, other
language data like speech or video of signed languages.
Neural network: A category of model architecture widely used in machine learning that is subdifferentiable and contains many parameters, making it well-suited to being trained using some variant of
stochastic gradient descent. Neural networks use a series of calculations performed in sequence by densely
connected layers (loosely inspired by the human brain) to produce their output.
(Numerical) optimization: Can refer to (1) a family of strategies for choosing the best values for a
predetermined set of parameters, given a particular quantity to minimize/maximize which is calculated
based on those parameters (and often some data as well) or to (2) the field of research that studies these
strategies. In this document we refer exclusively to the first definition.
Overfitting: When a model learns patterns that are overly specific to its training data and that do not
generalize well to new data outside of that training set. This problem is typically characterized by the model’s
very strong task performance on the training data itself but far worse performance when given previously
unseen data.
Parallel text: A term used within NLP to refer to pairs of text (usually pairs of sentences) in two languages
that are translations of each other. Parallel text is widely used for the development of NLP models that
perform the task (commonly called “machine translation”) of translating text from a specific source language
(e.g., Urdu) into a specific target language (e.g., Thai). Some pairs of languages have much more (digital)
parallel text available, and the difference in the quality of machine translation systems across different
language pairs reflects that disparity.
Parameter (in a neural network model): A single value (model coefficient) that is part of the mathematical
function that neural networks define to perform their operations. If we consider a model as being a black
box that performs some task, a parameter is a single one of that black box’s knobs. “Parameter” can refer
either to the knob itself or the value the knob is set to, depending on context.
Perplexity: A number from 1 to infinity that represents how “surprised” a language model generally is to
see the actual continuations of fragments of text. The lower the perplexity, the better the language model can
predict the actual continuations of those text fragments in the evaluation data. Perplexity is an important
intrinsic evaluation for language models.
Probability distribution: A collection of numbers (not necessarily unique) that are all at least 0 and add
up to 1 (for example, 0.2, 0.2, 0.1, and 0.5), each paired with some possible event; the events are mutually
exclusive. For one such event, its number is interpreted as the chance that the event will occur. For example,
if a language model with a tiny vocabulary consisting of only [apple, banana, orange] takes as input
the sentence-in-progress “banana banana banana banana” and produces a probability distribution over its
vocabulary of 0.1 for “apple,” 0.6 for “banana,” and 0.3 for “orange,” this means that the model is predicting
that the next word to appear after the given sentence-in-progress has a 60% chance of being “banana.”
Prompt (to a language model): The text provided by a user to the language model, which the model
then uses as its context—i.e., as its initial basis for its next word prediction that it performs over and over
again to produce its output, word by word.
Sentiment analysis: A task in NLP that aims to determine whether the overall sentiment of a piece of
text skews positive, negative, or in some versions of the task, neutral. For example, suppose that a sentiment
analysis model was given the input “Wow, that movie was amazing!” The correct output for the model given
that input would be “positive” (or five stars, or 10/10, or something similar if the labels were in the form of
stars or integer scores from 0 to 10 instead).
Stochastic gradient descent (SGD): A process by which parameters of a model are adjusted to minimize
some specific function (e.g., a loss function). SGD requires repeatedly running varying batches of data
through the model, whose output can then be used to get a value from our (loss) function. For each batch,
we then use the gradient of that function to adjust the parameters of our model to take a tiny descending
step along that gradient. This process is repeated until the loss function’s gradient flattens out and stops
indicating a lower direction.
Task: A job we want a model to do. Tasks are usually described abstractly—for example, sentiment analysis,
question answering, or machine translation—in a way that is not tied to any one source of data. However, in
practice, if a model is trained to perform a particular task, the version of that task that the model learns to
perform will be heavily influenced by the specific training data used. See section 5.2.2.
Test set (or test data): A set of data unseen by a model during its training, used to evaluate how well
the model works.
Token: The base unit of language into which an NLP model splits any text input. For contemporary
language models, a token can be either a word or a piece of a word. A text input passed to such a model
is split into words (when a word is part of the model’s vocabulary) and word pieces (when a full word
doesn’t exist in the model’s vocabulary, in which case its component pieces are added to the sequence of
tokens instead).
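To make the splitting concrete, here is a small Python sketch with an invented toy vocabulary and a simple greedy longest-match rule; real vocabularies are produced by the data-driven algorithms of section 3.3, and real tokenizers are more sophisticated than this.

def tokenize_word(word, vocab):
    # split one word into tokens by repeatedly taking the longest matching vocabulary entry
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])   # no entry matches; fall back to a single character
            i += 1
    return tokens

toy_vocab = {"cat", "play", "ing", "s"}
print(tokenize_word("cat", toy_vocab))       # ['cat'] -- the whole word is in the vocabulary
print(tokenize_word("playing", toy_vocab))   # ['play', 'ing'] -- the full word is not, so it is split into pieces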
Training set (or training data): A set of data used to train a model (in other words, to decide that
model’s parameter values). For a model that takes the form of a neural network, the training set comprises
the batches of data used while running stochastic gradient descent.
Transformer: A kind of neural network architecture introduced in 2017 that allows large models built
using it to train faster than earlier model architectures would have allowed, and on more data (assuming
access to certain relatively high-memory hardware). Transformers do this by using techniques (e.g.,
self-attention) that are beyond the scope of this work. See section 4.2.
References
Church, Kenneth W., and Robert L. Mercer. 1993. “Introduction to the Special Issue on Computational
Linguistics Using Large Corpora.” Computational Linguistics 19 (1): 1–24. https://aclanthology.org/J93-1001.
Dodge, Jesse, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret
Mitchell, and Matt Gardner. 2021. “Documenting Large Webtext Corpora: A Case Study on the Colossal
Clean Crawled Corpus.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, 1286–1305. Online; Punta Cana, Dominican Republic: Association for Computational
Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.98.
Gopnik, Alison. 2023. “Large Language Models as Cultural Technologies.” Presented at the 61st Annual
Meeting of the Association for Computational Linguistics.
Gururangan, Suchin, Dallas Card, Sarah Dreier, Emily Gade, Leroy Wang, Zeyu Wang, Luke Zettlemoyer,
and Noah A. Smith. 2022. “Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing, 2562–80. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
https://aclanthology.org/2022.emnlp-main.165.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, et al. 2022. “An Empirical Analysis of Compute-Optimal Large Language Model
Training.” In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A.
Agarwal, D. Belgrave, K. Cho, and A. Oh, 35:30016–30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf.
Nelson, Alondra. 2023. “Thick Alignment.” Presented at the 2023 ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT). https://youtu.be/Sq_XwqVTqvQ?t=957.
Nishi, Andrea. 2019. “Privatizing Sentencing: A Delegation Framework for Recidivism Risk Assessment.”
Columbia Law Review 119 (6): 1671–1710. https://columbialawreview.org/content/privatizing-sentencing-a-delegation-framework-for-recidivism-risk-assessment/.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic
Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, 311–18. Philadelphia, Pennsylvania, USA: Association for Computational
Linguistics. https://doi.org/10.3115/1073083.1073135.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. “Deep Contextualized Word Representations.” In Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers), 2227–37. New Orleans, Louisiana: Association for Computational
Linguistics. https://doi.org/10.18653/v1/N18-1202.
Prates, Marcelo O. R., Pedro H. Avelar, and Luís C. Lamb. 2019. “Assessing Gender Bias in Machine
Translation: A Case Study with Google Translate.” Neural Computing and Applications 32 (10): 6363–81.
https://doi.org/10.1007/s00521-019-04144-6.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. 2020. “Mitigating Bias in Algorithmic
Hiring: Evaluating Claims and Practices.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 469–81. FAT* ’20. New York, NY, USA: Association for Computing Machinery.
https://doi.org/10.1145/3351095.3372828.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial
Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, 1668–78. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1163.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words
with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational
Linguistics. https://doi.org/10.18653/v1/P16-1162.
Shannon, C. E. 1951. “Prediction and Entropy of Printed English.” The Bell System Technical Journal 30
(1): 50–64. https://doi.org/10.1002/j.1538-7305.1951.tb01366.x.
Shneiderman, Ben, and Pattie Maes. 1997. “Direct Manipulation Vs. Interface Agents.” Interactions 4 (6):
42–61. https://doi.org/10.1145/267505.267514.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing
Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wiener, Norbert. 1960. “Some Moral and Technical Consequences of Automation.” Science 131 (3410):
1355–58. http://www.jstor.org/stable/1705998.
Appendix
Loss functions and gradient descent, a bit more formally
The first important property for a loss function is that it takes into account all the potential good and bad
things about outputs when deducting points. The more dissimilar our model’s output given a particular input
is from that input’s correct output, the higher the loss function should be. The second important property
is that we must be able to deduce, fully automatically and in parallel for all parameters, what adjustments
would make the loss function decrease. You may recall from a course on calculus that questions like “How
does a small change to an input to a function affect the function’s output?” are related to the concept of
differentiation. In sum, we need the loss function to be differentiable with respect to the parameters. (This
may be a bit confusing because in calculus, we think about differentiating a function with respect to its
inputs. In a mathematical sense, the input is only part of the input to the mathematical function encoded by
a neural network; the parameters are also part of its input.) If the loss function has this property, then we
can use differentiation to automatically calculate a small change for each parameter that should decrease the
loss on a given example.
These two properties—faithfulness to the desired evaluation and differentiability with respect to parameters—conflict because most evaluation scores aren’t differentiable. Bleu scores for translation and error rates
for sentiment analysis are stepwise functions (“piecewise constant” in mathematical terms): changing the
parameters a tiny bit usually won’t affect these evaluation scores; when it does, it could be a dramatic change.
Human judgments also are not differentiable with respect to parameters.
Once we have a differentiable loss function, and with a few additional assumptions, we quickly arrive at the
stochastic gradient descent (SGD) algorithm for setting system parameters. To describe its steps a bit
more formally than we did in section 2.3.2:
1. Initialize the parameters randomly.
2. Take a random sample of the training data (typically 100 to 1000 demonstrations); run each input
through the system and calculate the loss and its first derivative with respect to every parameter.
(When first derivatives are stacked into a vector, it’s called the gradient.) Keep a running total of the
sum of loss values and a running total of the sum of gradients.
3. For each parameter, change its value by a small amount proportional to the corresponding value in the
gradient vector, in the direction that decreases the loss (i.e., opposite the gradient). (If the corresponding
gradient value is zero, don’t change that parameter.)
4. If the loss has not yet converged, go back to step 2.
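The Python sketch below follows these steps for a deliberately tiny “model” with a single parameter w (predicting y = w * x) and a squared-error loss, so that the gradient can be written by hand; the data, batch size, and step size are invented for illustration.

import random

data = [(x, 2.0 * x) for x in range(1, 11)]   # demonstrations generated by the "true" rule y = 2x

w = random.uniform(-1.0, 1.0)                 # step 1: initialize the parameter randomly
step_size = 0.01
previous_loss = float("inf")

while True:
    batch = random.sample(data, 5)            # step 2: a random sample of the training data
    loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)   # derivative of the loss with respect to w
    w = w - step_size * grad                  # step 3: adjust the parameter against the gradient
    if abs(previous_loss - loss) < 1e-8:      # step 4: stop once the loss has (roughly) converged
        break
    previous_loss = loss

print(w)                                      # ends up very close to 2.0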
Word error rate, more formally
Suppose the test data consists of N words in sequence: w1 , w2 , ... , wN . To compute a language model’s word
error rate on that data:
1. Set m = 0. (This quantity is a running tally of mistakes.)
2. For every word wi in the test data (i is its position):
1. Feed wi ’s preceding context, which after the first few words will be the sequence w1 , w2 , ... , wi−1 ,
into the language model as input.
2. Let the language model predict the next word; call its prediction wpred .
3. If wpred is anything other than wi , the language model made an incorrect prediction, so add 1 to
m.
3. The error rate is m/N .
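Here is a minimal Python sketch of that procedure; the “language model” is a hypothetical stub that always predicts the same word, standing in for a real model.

test_data = "the cat sat on the mat".split()   # N = 6 words

def toy_predict(context):
    return "the"                                # a real model would actually use the context

m = 0
for i, w_i in enumerate(test_data):
    context = test_data[:i]                     # the preceding words w1, ..., w(i-1)
    w_pred = toy_predict(context)
    if w_pred != w_i:                           # an incorrect prediction adds one mistake
        m += 1

print(m / len(test_data))                       # m/N = 4/6, since only the two "the"s are predicted correctly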
Perplexity, more formally
The word error rate only checks whether the correct next word was the language model’s single best guess; it
ignores how much probability the model assigned to that word. A natural refinement is to score the model using
the probabilities it assigns to the correct next words in the test data. The intrinsic evaluation most commonly
used for language models is similar in spirit but slightly different: we take the geometric average of the
inverses of these probability scores,
a value known as (test data) perplexity. The reasons are partly practical (tiny numbers can lead to a problem
in numerical calculations, called underflow), partly theoretical, and partly historical. For completeness, here’s
the procedure:
1. Set m = 0. (This quantity is no longer a running tally of mistakes.)
2. For every word wi in the test data (i is its position):
1. Feed wi ’s preceding context, which after the first few words will be the sequence w1 , w2 , ... , wi−1 ,
into the language model as input.
2. Let p be the probability that the language model assigns to wi (the correct next word).
3. Add − log(p) to m.
3. The perplexity is exp(m/N ).
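Here is a minimal Python sketch of the perplexity calculation, again with an invented stub in place of a real language model; the stub ignores its context and simply treats “the” as common (probability 1/3) and every other vocabulary word as equally likely (1/6 each), which respects the no zeros rule.

import math

# toy 5-word vocabulary: the, cat, sat, on, mat (probabilities sum to 1/3 + 4 * 1/6 = 1)
def toy_probability(context, word):
    return 1 / 3 if word == "the" else 1 / 6    # a real model would actually use the context

test_data = "the cat sat on the mat".split()    # N = 6 words

m = 0.0
for i, w_i in enumerate(test_data):
    p = toy_probability(test_data[:i], w_i)     # probability assigned to the correct next word
    m += -math.log(p)

print(math.exp(m / len(test_data)))             # perplexity of about 4.8; uniform guessing over the 5-word vocabulary would give exactly 5.0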
Though it’s probably not very intuitive from the preceding procedure, perplexity does have some nice intuitive
properties:
• If our model perfectly predicted every word in the test data with probability 1, we would get a perplexity
of 1.21 This can’t happen because (1) there is some fundamental amount of uncertainty in fresh, unseen
text data, and (2) some probability mass is reserved for every wrong word, too (no zeros rule). If
perplexity comes very close to 1, one should carefully check that the cardinal rule (that test data must
not be used for anything other than the final test, e.g., not for training) has not been violated.
• If our model ever assigned a probability of 0 to some word in the test data, perplexity would go to
infinity.22 This won’t happen because of the no zeros rule.
• Lower perplexity is better.
• The perplexity can be interpreted as an average “branch factor”; in a typical next word prediction
instance, how many vocabulary words are “effectively” being considered?
21 To see this, note that − log(1) = 0, so m stays 0 throughout step 2. Note that exp(0/N ) = exp(0) = 1.
22 To see this, note that as the probability of a word approaches 0, − log of that probability grows toward infinity.