large_language_model
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and
understanding. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally
intensive self-supervised and semi-supervised training process.[1] LLMs are artificial neural networks, the largest and most capable of
which are built with a transformer-based architecture. Some recent implementations are based on other architectures, such as recurrent
neural network variants and Mamba (a state space model).[2][3][4]
LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or
word.[5] Up to 2020, fine-tuning was the only way a model could be adapted to accomplish specific tasks. Larger
models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[6] They are thought to acquire knowledge
about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the
corpora.[7]
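The repeated next-token prediction described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the names generate, toy_next_token and TOY_TABLE are hypothetical, and the lookup table stands in for a trained model.

```python
def generate(next_token, prompt_tokens, max_new_tokens, stop_token=None):
    """Repeatedly append the predicted next token to the running sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)      # the predictor sees everything so far
        if token == stop_token:
            break
        tokens.append(token)
    return tokens


# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")


generated = generate(toy_next_token, ["the"], max_new_tokens=10, stop_token="<eos>")
```

A real LLM would replace toy_next_token with a network that scores every token in its vocabulary and then picks or samples one.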
Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5 and GPT-4, used in ChatGPT and Microsoft Copilot),
Google's PaLM and Gemini (used in Bard), Meta's LLaMA family of open-source models, and Anthropic's Claude models.
History
At the 2017 NeurIPS conference, Google researchers introduced the transformer
architecture in their landmark paper "Attention Is All You Need". This paper's goal
was to improve upon 2014 Seq2seq technology,[8] and was based mainly on the
attention mechanism developed by Bahdanau et al. in 2014.[9] The following year
in 2018, BERT was introduced and quickly became "ubiquitous".[10] Though the
original transformer has both encoder and decoder blocks, BERT is an encoder-
only model.
Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that
caught widespread attention because OpenAI at first deemed it too powerful to
release publicly, out of fear of malicious use.[11] GPT-3 in 2020 went a step further
and as of 2024 is available only via API with no offering of downloading the
model to execute locally. But it was the 2022 consumer-facing browser-based
ChatGPT that captured the imaginations of the general population and caused
some media hype and online buzz.[12] The 2023 GPT-4 was praised for its
increased accuracy and as a "holy grail" for its multimodal capabilities.[13]
OpenAI did not reveal high-level architecture and the number of parameters of
GPT-4.
[Figure: An illustration of main components of the transformer model from the original paper, where layers were normalized after (instead of before) multiheaded attention.]
In the meantime, competing language models have for the most part been playing
catch-up to the GPT series, at least in terms of number of parameters.[14] Notable exceptions in terms of number of parameters
included Google's 2019 T5-11B and 2022 PaLM-E. In terms of Elo ratings, on January 26, 2024, Google's Bard (Gemini Pro)
surpassed the regular GPT-4, but not the limited-availability GPT-4-Turbo.[15]
Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have
restrictions on the field of use. Mistral AI's models Mistral 7B and Mixtral 8x7b have the more permissive Apache License. As of
January 2024, Mixtral 8x7b is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more
powerful than GPT-3.5 but not as powerful as GPT-4.[16]
Dataset preprocessing
Probabilistic tokenization
Using a modification of byte-pair encoding, in the first step, all unique characters (including blanks and punctuation marks) are treated
as an initial set of n-grams (i.e. an initial set of uni-grams). Successively, the most frequent pair of adjacent characters is merged into a
bi-gram and all instances of the pair are replaced by it. The pairs of adjacent (previously merged) n-grams that most
frequently occur together are then merged again into even longer n-grams, repeatedly, until a vocabulary of prescribed size is obtained
(in the case of GPT-3, the size is 50257).[17] The token vocabulary consists of integers, spanning from zero up to (but not including) the size of the token
vocabulary. New words can always be interpreted as combinations of the tokens and the initial-set uni-grams.[18]
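The merge procedure described above can be sketched as follows. This is a minimal character-level illustration, not the tokenizer used by any particular model: the function name train_bpe and the toy input string are assumptions, and production tokenizers operate on bytes and add pre-tokenization rules that are omitted here.

```python
from collections import Counter


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size or nothing repeats."""
    tokens = list(text)                 # initial uni-grams: single characters
    vocab = sorted(set(tokens))         # a token's id is its position in this list
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break                       # no adjacent pair occurs more than once
        merged = left + right
        merges.append((left, right))
        if merged not in vocab:
            vocab.append(merged)
        # Replace every occurrence of the pair with the merged n-gram.
        rewritten, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                rewritten.append(merged)
                i += 2
            else:
                rewritten.append(tokens[i])
                i += 1
        tokens = rewritten
    token_to_id = {token: index for index, token in enumerate(vocab)}
    return merges, vocab, [token_to_id[t] for t in tokens]


merges, vocab, token_ids = train_bpe("aaabdaaabac", vocab_size=6)
```

Joining vocab[i] for every id in token_ids reproduces the input text, which is why the encoding is lossless.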
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average
English word. An average word in another language encoded by such an English-optimized tokenizer is, however, split into a
suboptimal number of tokens.
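For example, the text: tokenizer: texts -> series of numerical "tokens"
is split into the n-grams: token | izer | : | texts | -> | series | of | numerical | " | t | ok | ens | "
which are encoded as the numbers ("tokens"): 30001 7509 25 13399 4613 2168 286 29052 366 83 482 641 1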
Probabilistic tokenization also compresses the datasets, which is the reason for using the byte pair encoding algorithm as a tokenizer.
Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length
of the longest one. How many tokens are, on average, needed per word depends on the language of the dataset.[19][20]
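Padding a jagged batch can be sketched as below. The function name pad_batch, the padding id 0 and the accompanying mask are assumptions made for illustration; real tokenizers reserve their own padding token, and the mask is a common convention rather than something stated above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded, mask = pad_batch([[30001, 7509, 25], [13399]])
```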
Dataset cleaning
In the context of training LLMs, datasets are typically cleaned by removing toxic passages from the dataset, discarding low-quality
data, and de-duplication.[21] Cleaned datasets can increase training efficiency and lead to improved downstream performance.[22][23]
With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such
content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower
quality (degrading performance of models trained on it).[24]
Reinforcement learning from human feedback (RLHF)
Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further
fine-tune a model based on a dataset of human preferences.[25]
Instruction tuning
Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive responses, starting from
human-generated corrections of a few cases. For example, given the instruction "Write an essay about the main themes represented in
Hamlet," an initial naive completion might be "If you submit the essay after March 17, your grade will be reduced by 10% for each
day of delay," based on the frequency of this textual sequence in the corpus.[26]
Mixture of experts
The largest LLMs may be too expensive to train and use directly. For such models, mixture of experts (MoE) can be applied, a line of
research pursued by Google researchers since 2017 to train models reaching up to 1 trillion parameters.[27][28][29]
Prompt engineering, attention mechanism, and context window
Most results previously achievable only by (costly) fine-tuning can be achieved through prompt engineering, although limited to the
scope of a single conversation (more precisely, limited to the scope of a context window).[30]
In order to find out which tokens are relevant to each other within the scope of the context window, the attention mechanism
calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own
notion of "relevance" for calculating its own soft weights. For example, the small (i.e. 117M-parameter) GPT-2 model has twelve
attention heads and a context window of only 1k tokens.[32] In its medium version it has 345M parameters and contains 24 layers,
each with 12 attention heads. For the training with gradient descent a batch size of 512 was utilized.[18]
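How one head turns embeddings into soft weights can be sketched as follows. The sketch assumes the scaled dot-product form from the original transformer paper and, for brevity, uses the embeddings directly as queries and keys; in a real model each head first applies its own learned projections, which are omitted here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Row i holds the soft weights that token i assigns to every token j."""
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights.append(softmax(scores))
    return weights


# Two toy 2-dimensional token embeddings, used as both queries and keys.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(embeddings, embeddings)
```

Each row of weights sums to 1, and a token whose key is more aligned with the query receives the larger weight.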
The largest models can have a context window of up to 200k tokens (for example, Claude 2.1).[33] Other models with large context
windows include GPT-4 Turbo, with a context window of up to 128k tokens.[34] Note that this maximum refers to the number of
input tokens and that the maximum number of output tokens differs from the input and is often smaller. For example, the GPT-4
Turbo model has a maximum output of 4096 tokens. Also, as of January 2024, GPT-4 Turbo is, for all tiers of service, "currently
under preview with restrictive rate limits that make them suitable for testing and evaluations, but not for production usage".[35]
The length of a conversation that the model can take into account when generating
its next answer is also limited by the size of the context window. If a conversation,
for example with ChatGPT, is longer than the context window,
only the parts inside the context window are taken into account when generating
the next answer, or the model needs to apply some algorithm to summarize the
more distant parts of the conversation.
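The first strategy, keeping only what fits inside the context window, can be sketched as below. The function name fit_to_context is hypothetical, and counting whitespace-separated words is a crude stand-in for a real tokenizer; the summarization alternative is not shown.

```python
def fit_to_context(messages, context_window, count_tokens=None):
    """Keep the most recent messages whose total token count fits the window."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())   # stand-in for a tokenizer
    kept, used = [], 0
    for message in reversed(messages):                  # newest first
        cost = count_tokens(message)
        if used + cost > context_window:
            break                                       # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))


visible = fit_to_context(["a b c", "d e", "f"], context_window=3)
```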
A model may be pre-trained either to predict how the segment continues, or what
is missing in the segment, given a segment from its training dataset.[36] It can be
either autoregressive (predicting how the segment continues) or masked (filling in the parts missing from the segment).
Models may be trained on auxiliary tasks which test their understanding of the
data distribution, such as Next Sentence Prediction (NSP), in which pairs of
sentences are presented and the model must predict whether they appear
consecutively in the training corpus.[37] During training, regularization loss is
also used to stabilize training. However, regularization loss is usually not used
during testing and evaluation.
[Figure: When each head calculates, according to its own criteria, how much other tokens are relevant for the "it_" token, note that the second attention head, represented by the second column, is focusing most on the first two rows, i.e. the tokens "The" and "animal", while the third column is focusing most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.[31]]
Training cost
Advances in software and hardware have reduced the cost substantially since 2020, such that in 2023 training a 12-billion-
parameter LLM cost 72,300 A100-GPU-hours of compute, while in 2020 the cost of training a 1.5-billion-parameter LLM
(which was two orders of magnitude smaller than the state of the art in 2020) was between $80 thousand and $1.6 million.[38][39][40]
Since 2020, large sums have been invested in increasingly large models. For example, training of the GPT-2 (i.e. a 1.5-billion-parameter
model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameter model) in 2022 cost $8 million.[41]
For Transformer-based LLMs, training cost is much higher than inference cost. It costs 6 FLOPs per parameter to train on one token,
whereas it costs 1 to 2 FLOPs per parameter to infer on one token.[42]
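The rule of thumb above translates directly into arithmetic. In this sketch the function names and the model size (1 billion parameters, 20 billion training tokens) are hypothetical values chosen only to make the numbers concrete.

```python
def training_flops(num_params, num_tokens, flops_per_param_token=6):
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return flops_per_param_token * num_params * num_tokens


def inference_flops(num_params, num_tokens, flops_per_param_token=2):
    """Approximate inference compute: 1 to 2 FLOPs per parameter per token."""
    return flops_per_param_token * num_params * num_tokens


# Hypothetical model: 1 billion parameters, 20 billion training tokens.
train_cost = training_flops(10**9, 20 * 10**9)
infer_cost_per_token = inference_flops(10**9, 1)
```

Here train_cost is 1.2 × 10^20 FLOPs, while generating one token costs about 2 × 10^9 FLOPs.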
Tool use
There are certain tasks that, in principle, cannot be solved by any LLM, at least not without the use of external tools or additional
software. An example of such a task is responding to the user's input '354 * 139 = ', provided that the LLM has not already
encountered a continuation of this calculation in its training corpus. In such cases, the LLM needs to resort to running program code
that calculates the result, which can then be included in its response. Another example is 'What is the time now? It is ', where a
separate program interpreter would need to execute code to get the system time on the computer, so that the LLM could include it in its
reply.[43][44] This basic strategy can be made more sophisticated with multiple attempts of generated programs, and other sampling strategies.[45]
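The arithmetic example can be sketched with a small calculator tool. This is an illustration under stated assumptions, not how any particular system is built: the names calculate and complete_with_calculator are hypothetical, and the routing is done by a regular expression here, whereas in practice the LLM itself would emit the request to run the tool.

```python
import ast
import operator
import re

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression):
    """Evaluate +, -, *, / on numeric literals without calling eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)


def complete_with_calculator(prompt):
    """If the prompt looks like '<arithmetic> = ', append the computed result."""
    match = re.fullmatch(r"\s*(.+?)\s*=\s*", prompt)
    if match is None:
        return None                      # not a calculation; leave it to the LLM
    try:
        return prompt + str(calculate(match.group(1)))
    except (ValueError, SyntaxError):
        return None


completed = complete_with_calculator("354 * 139 = ")
```

Parsing the expression with ast instead of calling eval keeps the tool from executing arbitrary code that might appear in a prompt.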
Generally, in order to get an LLM to use tools, one must finetune it for tool-use. If the number of tools is finite, then finetuning may
be done just once. If the number of tools can grow arbitrarily, as with online API services, then the LLM can be finetuned to be able
to read API documentation and call APIs correctly.[46][47]
A simpler form of tool use is Retrieval Augmented Generation: augment an LLM with document retrieval, sometimes using a vector
database. Given a query, a document retriever is called to retrieve the most relevant documents (relevance is usually measured by first
encoding the query and the documents into vectors, then finding the documents with vectors closest in Euclidean norm to the query vector). The LLM then
generates an output based on both the query and the retrieved documents.[48]
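The retrieval step can be sketched as a nearest-neighbour search. The vectors below are made up for illustration and the function names are hypothetical; a real system would obtain the vectors from an encoder model and would usually query a vector database instead of sorting a list.

```python
import math


def nearest_documents(query_vector, document_vectors, k=1):
    """Return indices of the k documents closest to the query in Euclidean distance."""
    def distance(index):
        return math.dist(query_vector, document_vectors[index])
    return sorted(range(len(document_vectors)), key=distance)[:k]


def build_prompt(query, documents, indices):
    """Concatenate the retrieved documents with the query for the LLM."""
    context = "\n".join(documents[i] for i in indices)
    return f"{context}\n\nQuestion: {query}"


documents = ["Doc about cats.", "Doc about tokenizers.", "Doc about hockey."]
document_vectors = [[1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]   # made-up embeddings
top = nearest_documents([0.0, 0.0], document_vectors, k=2)
prompt = build_prompt("What is a tokenizer?", documents, top)
```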
Agency
An LLM is a language model, which is not an agent as it has no goal, but it can be used as a component of an intelligent agent.[49]
Researchers have described several methods for such integrations.
The ReAct ("Reason + Act") method constructs an agent out of an LLM, using the LLM as a planner. The LLM is prompted to
"think out loud". Specifically, the language model is prompted with a textual description of the environment, a goal, a list of possible
actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is
then executed in the environment.[50] The linguistic description of the environment given to the LLM planner can even be the LaTeX
code of a paper describing the environment.[51]
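The ReAct prompting loop can be sketched as below. Everything here is a hypothetical stand-in: react_episode, toy_llm and toy_step are illustrative names, the prompt layout is an assumption, and a real agent would call an actual LLM and environment in their place.

```python
def react_episode(llm, step, description, goal, actions, max_steps=5):
    """Alternate LLM thoughts and actions with environment observations."""
    history = []                          # (thought, action, observation) so far
    for _ in range(max_steps):
        lines = [
            f"Environment: {description}",
            f"Goal: {goal}",
            f"Possible actions: {', '.join(actions)}",
        ]
        for thought, action, observation in history:
            lines += [f"Thought: {thought}", f"Action: {action}",
                      f"Observation: {observation}"]
        thought, action = llm("\n".join(lines))    # "think out loud", then act
        observation, done = step(action)           # execute in the environment
        history.append((thought, action, observation))
        if done:
            break
    return history


# Hypothetical stand-ins for a real LLM and a real environment.
def toy_llm(prompt):
    if "Observation: door is open" in prompt:
        return "The door is open, so walk out.", "go through door"
    return "The door is closed, so open it first.", "open door"


def toy_step(action):
    if action == "open door":
        return "door is open", False
    return "you are outside", True


history = react_episode(
    toy_llm, toy_step,
    description="a room with a closed door",
    goal="leave the room",
    actions=["open door", "go through door"],
)
```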
In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions,
then it is prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and environmental feedback
it receives.[52]
The Reflexion method[53] constructs an agent that learns over multiple episodes. At the end of each episode, the LLM is given the
record of the episode, and prompted to think up "lessons learned", which would help it perform better at a subsequent episode. These
"lessons learned" are given to the agent in the subsequent episodes.
Monte Carlo tree search can use an LLM as a rollout heuristic. When a programmatic world model is not available, an LLM can also be
prompted with a description of the environment to act as a world model.[54]
For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward
signal to guide a normal (non-LLM) reinforcement learning agent.[55] Alternatively, it can propose increasingly difficult tasks for
curriculum learning.[56] Instead of outputting individual actions, an LLM planner can also construct "skills", or functions for complex
action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.[56]
LLM-powered agents can keep a long-term memory of their previous contexts, and the memory can be retrieved in the same way as
Retrieval Augmented Generation. Multiple such agents can interact socially.[57]
Compression
Typically, LLMs are trained with full- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2
bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200
gigabytes to load, which places them outside the range of most consumer electronics.
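The memory arithmetic above can be checked with a short helper. The function name model_size_gb is hypothetical, and one gigabyte is taken to be 10^9 bytes.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def model_size_gb(num_params, dtype="float16"):
    """Storage needed for the parameters alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


one_billion = model_size_gb(1_000_000_000)        # 2.0 GB
hundred_billion = model_size_gb(100_000_000_000)  # 200.0 GB
```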
Post-training quantization[58] aims to decrease the space requirement by lowering precision of the parameters of a trained model,
while preserving most of its performance.[59][60] The simplest form of quantization simply truncates all numbers to a given number of
bits. It can be improved by using a different quantization codebook per layer. Further improvement can be done by applying different
precisions to different parameters, with higher precision for particularly important parameters ("outlier weights").[61]
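A per-layer scheme can be sketched as follows. This is one simple choice, assumed here for illustration: each layer gets its own scale and values are rounded to the nearest of a fixed number of levels. It is not the method of any cited paper, and the outlier handling mentioned above is not shown. The function names are hypothetical.

```python
def quantize_layer(weights, bits=8):
    """Map one layer's weights to signed integer codes sharing a single scale."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / levels if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_layer(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [code * scale for code in codes]


layer = [0.5, -1.0, 0.25]
codes, scale = quantize_layer(layer, bits=8)
restored = dequantize_layer(codes, scale)
```

Because the scale is computed per layer, a layer with small weights keeps more relative precision than it would under one global scale.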
While quantized models are typically frozen, and only pre-quantized models are finetuned, quantized models can still be finetuned.[62]
Multimodality
Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as video, image, audio,
text, proprioception, etc.[63] There have been many AI models trained specifically to ingest one modality and output another modality,
such as AlexNet for image to label,[64] visual question answering for image-text to text,[65] and speech recognition for speech to text.
A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can
construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder $E$. Make a small
multilayered perceptron $f$, so that for any image $y$, the post-processed vector $f(E(y))$ has the same dimensions as an encoded token.
That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then finetuned on an
image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be
frozen to improve stability.[66]
Flamingo demonstrated the effectiveness of the tokenization method, finetuning a pretrained language model paired with a pretrained image
encoder to perform better on visual question answering than models trained from scratch.[67] Google's PaLM model was finetuned into
the multimodal model PaLM-E using the tokenization method, and applied to robotic control.[68] LLaMA models have also been turned
multimodal using the tokenization method, to allow image inputs,[69] and video inputs.[70]
GPT-4 can use both text and image as inputs[71] (although the vision component wasn't released to the public until GPT-4V[72]);
Google DeepMind's Gemini is also multimodal.[73]
Properties
Scaling laws
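An LLM is characterized by the following four quantities:
cost of (pre-)training ($C$),
size of the artificial neural network itself, such as number of parameters $N$ (i.e. amount of neurons in its layers, amount of weights between them and biases),
size of its (pre-)training dataset (i.e. number of tokens in corpus, $D$),
performance after (pre-)training.
They are related by simple statistical laws, called "scaling laws". One particular scaling law ("Chinchilla scaling") for LLMs
autoregressively trained for one epoch, with a log-log learning rate schedule, states that:[74]
$$C = C_0 N D, \qquad L = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + L_0$$
where $C$ is the cost of training the model in FLOPs, $N$ is the number of parameters in the model, $D$ is the number of tokens in the
training set, and $L$ is the average negative log-likelihood loss per token achieved by the trained LLM on the test dataset; $A$, $B$,
$\alpha$, $\beta$ and $L_0$ are empirically fitted constants. The constant $C_0 = 6$, meaning that it costs 6 FLOPs per parameter to train on one token. Note that training cost is much higher
than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.[42]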
Emergent abilities
When one subtracts out from the y-axis the best performance that can be achieved even with
infinite scaling of the x-axis quantity, large models' performance, measured on various tasks,
seems to be a linear extrapolation of other (smaller-sized and medium-sized) models'
performance on a log-log plot. However, sometimes the line's slope transitions from one slope
to another at point(s) referred to as break(s)[75] in downstream scaling laws, appearing as a
series of linear segments connected by arcs; it seems that larger models acquire "emergent
abilities" at these point(s).[30][76] These abilities are discovered rather than programmed-in or
designed, in some cases only after the LLM has been publicly deployed.[5]
[Figure: At point(s) referred to as breaks,[75] the lines change their slopes, appearing on a log-log plot as a series of linear segments connected by arcs.]
The most intriguing among emergent abilities is in-context learning from example demonstrations.[77] In-context learning is involved in tasks, such as:
reported arithmetics, decoding the International Phonetic Alphabet, unscrambling a word's letters, disambiguating a word in context,[30][78][79] converting spatial words, cardinal directions (for example, replying "northeast" upon [0, 0, 1; 0, 0, 0; 0, 0, 0]), color terms represented in text.[80]
chain-of-thought prompting: Model outputs are improved by chain-of-thought prompting only when model size exceeds 62B. Smaller models perform better when prompted to answer immediately, without chain of thought.[81]
identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), and generating a similar English equivalent of Kiswahili proverbs.[82]
Schaeffer et al. argue that the emergent abilities are not unpredictably acquired, but predictably acquired according to a smooth
scaling law. The authors considered a toy statistical model of an LLM solving multiple-choice questions, and showed that this
statistical model, modified to account for other types of tasks, applies to these tasks as well.[83]
Let $x$ be the parameter count, and $y$ be the performance of the model.
Interpretation
Large language models by themselves are "black boxes", and it is not clear how they can perform linguistic tasks. There are several
methods for understanding how LLMs work.
Mechanistic interpretability aims to reverse-engineer LLMs by discovering symbolic algorithms that approximate the inference
performed by LLMs. One example is Othello-GPT, where a small Transformer is trained to predict legal Othello moves. It is found
that there is a linear representation of Othello board, and modifying the representation changes the predicted legal Othello moves in
the correct way.[84][85] In another example, a small Transformer is trained on Karel programs. Similar to the Othello-GPT example,
there is a linear representation of Karel program semantics, and modifying the representation changes output in the correct way. The
model also generates correct programs that are on average shorter than those in the training set.[86]
In another example, the authors trained small transformers on modular arithmetic addition. The resulting models were
reverse-engineered, and it turned out they used the discrete Fourier transform.[87]
NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language
in some nontrivial sense".[88] Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning,
imply an ability to "understand" certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks
that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early
(yet still incomplete) version of an artificial general intelligence system": "Can one reasonably say that a system that passes exams for
software engineering candidates is not really intelligent?"[89][90] Some researchers characterize LLMs as "alien intelligence".[91][92]
For example, Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien "Shoggoths", and believes that
RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays
on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought
processes and clearly non-human understanding."[93][94]
In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply remixing and
recombining existing writing",[92] or point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency,
and explainability.[88] For example, GPT-4 has natural deficits in planning and in real-time learning.[90] Generative LLMs have been
observed to confidently assert claims of fact which do not seem to be justified by their training data, a phenomenon which has been
termed "hallucination".[95] Specifically, hallucinations in the context of LLMs correspond to the generation of text or responses that
seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.[96]
Neuroscientist Terrence Sejnowski has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our
old ideas based on natural intelligence are inadequate".[88]
The matter of LLMs exhibiting intelligence or understanding has two main aspects: the first is how to model thought and language in
a computer system, and the second is how to enable the computer system to generate human-like language.[88] These aspects of
language as a model of cognition have been developed in the field of cognitive linguistics. American linguist George Lakoff presented
Neural Theory of Language (NTL)[97] as a computational basis for using language as a model of learning tasks and understanding.
The NTL Model (https://www.icsi.berkeley.edu/icsi/projects/ai/ntl) outlines how specific neural structures of the human brain shape
the nature of thought and language and, in turn, which computational properties of such neural systems can be applied to
model thought and language in a computer system. After a framework for modeling language in computer systems was established,
the focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar. In his 2014 book
titled The Language Myth: Why Language Is Not An Instinct, British cognitive linguist and digital communication technologist
Vyvyan Evans mapped out the role of probabilistic context-free grammar (PCFG) in enabling NLP to model cognitive patterns and
generate human-like language.[98][99]
Evaluation
Perplexity
The most commonly used measure of a language model's performance is its perplexity on a given text corpus. Perplexity is a measure
of how well a model is able to predict the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the
perplexity. Mathematically, perplexity is defined as the exponential of the average negative log likelihood per token:
$$\log(\text{Perplexity}) = -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(\text{token}_i \mid \text{context for token}_i)$$
Here $N$ is the number of tokens in the text corpus, and "context for token $i$" depends on the specific type of LLM used. If the LLM is
autoregressive, then "context for token $i$" is the segment of text appearing before token $i$. If the LLM is masked, then "context for
token $i$" is the segment of text surrounding token $i$.
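The definition can be turned into a few lines of code. The sketch assumes the per-token probabilities have already been obtained from some model; the function name perplexity and the example probabilities are illustrative only.

```python
import math


def perplexity(token_probabilities):
    """Exponential of the average negative log likelihood per token."""
    count = len(token_probabilities)
    average_nll = -sum(math.log(p) for p in token_probabilities) / count
    return math.exp(average_nll)


# A model that gives every token probability 1/4 is as "perplexed" as a
# uniform guess among 4 options.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```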
Because language models may overfit to their training data, models are usually evaluated by their perplexity on a test set of unseen
data.[37] This presents particular challenges for the evaluation of large language models. As they are trained on increasingly large
corpora of text largely scraped from the web, it becomes increasingly likely that models' training data inadvertently includes portions
of any given test set.[6]
In information theory, the concept of entropy is intricately linked to perplexity, a relationship notably established by Claude
Shannon.[100] This relationship is mathematically expressed as $\text{Entropy} = \log_2(\text{Perplexity})$.
Entropy, in this context, is commonly quantified in terms of bits per word (BPW) or bits per character (BPC), which hinges on
whether the language model utilizes word-based or character-based tokenization.
Notably, in the case of larger language models that predominantly employ sub-word tokenization, bits per token (BPT) emerges as a
seemingly more appropriate measure. However, due to the variance in tokenization methods across different Large Language Models
(LLMs), BPT does not serve as a reliable metric for comparative analysis among diverse models. To convert BPT into BPW, one can
multiply it by the average number of tokens per word.
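The two conversions above are short enough to write out. The function names and the tokens-per-word figure of 1.5 are hypothetical values used only for illustration.

```python
import math


def bits_per_token(perplexity):
    """Entropy in bits, using Entropy = log2(Perplexity)."""
    return math.log2(perplexity)


def bits_per_word(bpt, average_tokens_per_word):
    """Convert BPT to BPW by multiplying by the average tokens per word."""
    return bpt * average_tokens_per_word


bpt = bits_per_token(8.0)        # a per-token perplexity of 8 is 3 bits per token
bpw = bits_per_word(bpt, 1.5)    # hypothetical tokenizer: 1.5 tokens per word
```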
In the evaluation and comparison of language models, cross-entropy is generally the preferred metric over entropy. The underlying
principle is that a lower BPW is indicative of a model's enhanced capability for compression. This, in turn, reflects the model's
proficiency in making accurate predictions.
A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of language models on more
specific downstream tasks. Tests may be designed to evaluate a variety of capabilities, including general knowledge, commonsense
reasoning, and mathematical problem-solving.
One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions and correct answers, for
example, ("Have the San Jose Sharks won the Stanley Cup?", "No").[101] A question answering task is considered "open book" if the
model's prompt includes text from which the expected answer can be derived (for example, the previous question could be adjoined
with some text which includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh
Penguins in 2016."[101]). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained during
training.[102] Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and
SQuAD.[102]
Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a
prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".[6]
Some composite benchmarks have also been developed which combine a diversity of different evaluation datasets and tasks.
Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM.[103][102]
It was previously standard to report results on a held-out portion of an evaluation dataset after doing supervised fine-tuning on the
remainder. It is now more common to evaluate a pre-trained model directly through prompting techniques, though researchers vary in
the details of how they formulate prompts for particular tasks, particularly with respect to how many examples of solved tasks are
adjoined to the prompt (i.e. the value of n in n-shot prompting).
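Building an n-shot prompt can be sketched as below. The "Q:/A:" layout and the function name n_shot_prompt are assumptions; as noted above, researchers vary in how they format prompts.

```python
def n_shot_prompt(solved_examples, question, n):
    """Prepend n solved examples to the question; n = 0 gives a zero-shot prompt."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in solved_examples[:n]]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [("2 + 2?", "4"), ("5 - 3?", "2")]
zero_shot = n_shot_prompt(examples, "3 + 3?", n=0)
one_shot = n_shot_prompt(examples, "3 + 3?", n=1)
```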
Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered from short lifespans, with
state-of-the-art models quickly "saturating" existing benchmarks, exceeding the performance of human annotators, leading to efforts to
replace or augment the benchmark with more challenging tasks.[104] In addition, there are cases of "shortcut learning" wherein AIs
sometimes "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording in order to guess the
correct responses, without necessarily understanding the actual question being asked.[88]
Some datasets have been constructed adversarially, focusing on particular problems on which extant language models seem to have
unusually poor performance compared to humans. One example is the TruthfulQA dataset, a question answering dataset consisting of
817 questions which language models are susceptible to answering incorrectly by mimicking falsehoods to which they were
repeatedly exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?"
because of its exposure to the English idiom you can't teach an old dog new tricks, even though this is not literally true.[105]
Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of problems in which one of
multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language
model and filtering with a set of classifiers. The resulting problems are trivial for humans, but at the time the datasets were created,
state-of-the-art language models had poor accuracy on them. For example:
We see a fitness center sign. We then see a man talking to the camera and sitting and laying on a exercise ball. The man...
a) demonstrates how to increase efficient exercise work by running up and down balls.
b) moves all his arms and legs and builds up a lot of muscle.
c) then plays the ball and we see a graphics and hedge trimming demonstration.
d) performs sit ups while on the ball and talking.[106]
BERT selects b) as the most likely completion, though the correct answer is d).[106]
Wider impact
In 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-written text from text
created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It
is a rather safe bet that they will change many industries over time."[107] Goldman Sachs suggested in 2023 that generative language
AI could increase global GDP by 7% in the next ten years, and could expose to automation 300 million jobs globally.[108][109]
Copyright
Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim from training data,
contrary to the typical behavior of traditional artificial neural networks. Evaluations of controlled LLM output (focused on
GPT-2-series models) put the amount memorized from training data at over 1% for exact duplicates[110] or up to about 7%.[111]
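A rough way to illustrate such a measurement is to check what fraction of generated outputs contain a span that also appears verbatim in the training text. The Python sketch below is a simplified assumption, not the method of the cited evaluations: it splits text on whitespace, treats any shared run of `n` tokens as a verbatim match, and uses hypothetical function names and toy data.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous runs of n tokens in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorized_fraction(outputs: List[str], training_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of outputs sharing at least one n-token span with the training text.

    Assumes whitespace tokenization, which is a simplification.
    """
    train_spans: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_spans |= ngrams(doc.split(), n)
    if not outputs:
        return 0.0
    hits = sum(1 for out in outputs if ngrams(out.split(), n) & train_spans)
    return hits / len(outputs)


# Toy data, invented for illustration.
toy_training = ["the quick brown fox jumps over the lazy dog"]
toy_outputs = [
    "he said the quick brown fox jumps today",
    "a completely different sentence with no overlap here",
]
toy_fraction = memorized_fraction(toy_outputs, toy_training, n=5)
```

On the toy data, one of the two outputs repeats a five-token span from the training text, so the computed fraction is 0.5. Real evaluations depend heavily on the tokenizer, the span length, and how outputs are sampled.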
Security
Some commenters expressed concern over accidental or deliberate creation of misinformation, or other forms of misuse.[112] For
example, the availability of large language models could reduce the skill level required to commit bioterrorism; biosecurity researcher
Kevin Esvelt has suggested that LLM creators should exclude from their training data papers on creating or enhancing
pathogens.[113]
A study by researchers at Google and several universities, including Cornell University and University of California, Berkeley,
showed that language models such as ChatGPT pose potential security risks. The researchers examined whether users could extract
from ChatGPT the data it was trained on, and found that they could. For example, when asked to repeat the word "poem" forever,
ChatGPT 3.5 turbo says "poem" hundreds of times and then diverges, departing from its standard dialogue style and emitting
seemingly nonsensical text that reproduces training data verbatim. The researchers observed more than 10,000 examples of the model
exposing its training data in this way, and said that it was hard to tell whether the model was actually safe.[114]
The potential presence of "sleeper agents" within LLMs is another emerging security concern. These are hidden functionalities
built into the model that remain dormant until triggered by a specific event or condition. Upon activation, the LLM deviates from its
expected behavior and takes insecure actions.[115]
Algorithmic bias
While LLMs have shown remarkable capabilities in generating human-like text, they are susceptible to inheriting and amplifying
biases present in their training data. This can manifest in skewed representations or unfair treatment of different demographics, such as
those based on race, gender, language, and cultural groups.[116] Since English data is overrepresented in current large language
models' training data, the models may also downplay non-English views.[117]
Stereotyping
AI models can reinforce a wide range of stereotypes, including those based on gender, ethnicity, age, nationality, religion, or
occupation. This can lead to outputs that unfairly generalize or caricature groups of people, sometimes in harmful or derogatory
ways.[118]
Gender bias refers to the tendency of these models to produce outputs that are unfairly prejudiced towards one gender over
another. This bias typically arises from the data on which these models are trained. Large language models often assign roles and
characteristics based on traditional gender norms.[116] For example, a model might associate nurses or secretaries predominantly with
women and engineers or CEOs with men.[119]
Political bias
Political bias refers to the tendency of algorithms to systematically favor certain political viewpoints, ideologies, or outcomes over
others. Language models may also exhibit political biases. Since the training data includes a wide range of political opinions and
coverage, the models might generate responses that lean towards particular political ideologies or viewpoints, depending on the
prevalence of those views in the data.[120]
List
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP.
| Name | Release date[a] | Developer | Number of parameters[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| BERT | October 2018 | Google | 340 million[122] | 3.3 billion words[122] | 9[123] | Apache 2.0[124] | An early and influential language model,[7] but encoder-only and thus not built to be prompted or generative[125] |
| XLNet | June 2019 | Google | ~340 million[126] | 33 billion words | | Apache 2.0[127] | An alternative to BERT; designed as encoder-only[128][129] |
| GPT-2 | February 2019 | OpenAI | 1.5 billion[130] | 40GB[131] (~10 billion tokens)[132] | | MIT[133] | General-purpose model based on transformer architecture |
| GPT-3 | May 2020 | OpenAI | 175 billion[38] | 300 billion tokens[132] | 3640[134] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[135] |
| GPT-Neo | March 2021 | EleutherAI | 2.7 billion[136] | 825 GiB[137] | | MIT[138] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[138] |
| GPT-J | June 2021 | EleutherAI | 6 billion[139] | 825 GiB[137] | 200[140] | Apache 2.0 | GPT-3-style language model |
| Megatron-Turing NLG | October 2021[141] | Microsoft and Nvidia | 530 billion[142] | 338.6 billion tokens[142] | | Restricted web access | Standard architecture but trained on a supercomputing cluster. |
| Ernie 3.0 Titan | December 2021 | Baidu | 260 billion[143] | 4 Tb | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude[144] | December 2021 | Anthropic | 52 billion[145] | 400 billion tokens[145] | | beta | Fine-tuned for desirable behavior in conversations.[146] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1.2 trillion[29] | 1.6 trillion tokens[29] | 5600[29] | Proprietary | Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| GPT-NeoX | February 2022 | EleutherAI | 20 billion[152] | 825 GiB[137] | 740[140] | Apache 2.0 | Based on the Megatron architecture |
| Chinchilla | March 2022 | DeepMind | 70 billion[153] | 1.4 trillion tokens[153][148] | 6805[149] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| Neuro-sama | December 2022 | Independent | Unknown | Unknown | | Privately-owned | A language model designed for live-streaming on Twitch. |
| LLaMA (Large Language Model Meta AI) | February 2023 | Meta | 65 billion[166] | 1.4 trillion[166] | 6300[167] | Non-commercial research[e] | Trained on a large 20-language corpus to aim for better performance with fewer parameters.[166] Researchers from Stanford University trained a fine-tuned model based on LLaMA weights, called Alpaca.[168] |
| GPT-4 | March 2023 | OpenAI | Exact number unknown[f] | Unknown | Unknown | Proprietary | Available for ChatGPT Plus users and used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13 billion[170] | | 270[140] | Apache 2.0 | Trained with Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40 billion[171] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[172] plus some "curated corpora".[173] | 2800[167] | Apache 2.0[174] | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 billion | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[175] | | Proprietary | LLM trained on financial data from proprietary sources, that "outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks" |
| PanGu-Σ | March 2023 | Huawei | 1.085 trillion | 329 billion tokens[176] | | Proprietary | |
| OpenAssistant[177] | March 2023 | LAION | 17 billion | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data |
| Jurassic-2[178] | March 2023 | AI21 Labs | Exact size unknown | Unknown | | Proprietary | Multilingual[179] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot.[183] |
| Falcon 180B | September 2023 | Technology Innovation Institute | 180 billion[184] | 3.5 trillion tokens[184] | | Falcon 180B TII license | |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[186] |
| Grok-1 | November 2023 | x.AI | Unknown | Unknown | Unknown | Proprietary | Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[187] |
| Gemini | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in Bard chatbot.[188] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7B total, 12.9B parameters per token[189] | Unknown | Unknown | Apache 2.0 | Mixture of experts model, outperforms GPT-3.5 and Llama 2 70B on many benchmarks. All weights were released via torrent.[190] |
| Phi-2 | December 2023 | Microsoft | 2.7B | 1.4T tokens | Unknown | MIT | So-called small language model, that "matches or outperforms models up to 25x larger", trained on "textbook-quality" data based on the paper "Textbooks Are All You Need". Model training took "14 days on 96 A100 GPUs".[191] |
| Eagle 7B | January 2024 | RWKV | 7.52B | 1.1T tokens | Unknown | Apache 2.0 | An "attention-free" linear transformer based on RWKV-v5 architecture.[192] |
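As a worked example of the unit used in the training cost column of the list above, the following Python sketch converts petaFLOP-days to FLOP using the definition stated for the list (1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP). The function and constant names are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
FLOP_PER_PETAFLOP = 10 ** 15     # 1 petaFLOP = 1e15 floating-point operations


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert a training cost in petaFLOP-days to total FLOP."""
    return petaflop_days * FLOP_PER_PETAFLOP * SECONDS_PER_DAY


one_day = petaflop_days_to_flop(1)       # 8.64e19 FLOP, matching the definition
gpt3_flop = petaflop_days_to_flop(3640)  # cost listed for GPT-3 in the table
```

Applied to the 3640 petaFLOP-days listed for GPT-3, the conversion gives roughly 3.14E23 FLOP.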
See also
Foundation models
Notes
a. This is the date that documentation describing the model's architecture was first released.
b. In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases,
the size of the largest model is listed here.
c. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can
be easily replicated.
d. The smaller models including 66B are publicly available, while the 175B model is available on request.
e. Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were
leaked and became widely available.
f. As stated in the technical report: "Given both the competitive landscape and the safety implications of large-scale
models like GPT-4, this report contains no further details about the architecture (including model size), hardware,
training compute, dataset construction, training method ..."[169]
References
1. "Better Language Models and Their Implications" (https://openai.com/blog/better-language-models/). OpenAI. 2019-02-14. Archived (https://web.archive.org/web/20201219132206/https://openai.com/blog/better-language-models/) from the original on 2020-12-19. Retrieved 2019-08-25.
2. Peng, Bo; et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 (https://arxiv.org/abs/2305.13048) [cs.CL (https://arxiv.org/archive/cs.CL)].
3. Merritt, Rick (2022-03-25). "What Is a Transformer Model?" (https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/). NVIDIA Blog. Retrieved 2023-07-25.
4. Gu, Albert; Dao, Tri (2023-12-01), Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv:2312.00752 (https://arxiv.org/abs/2312.00752)
5. Bowman, Samuel R. (2023). "Eight Things to Know about Large Language Models". arXiv:2304.00612 (https://arxiv.org/abs/2304.00612) [cs.CL (https://arxiv.org/archive/cs.CL)].
6. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (Dec 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models are Few-Shot Learners" (https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf) (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 33: 1877–1901.
7. Manning, Christopher D. (2022). "Human Language Understanding & Reasoning" (https://www.amacad.org/publication/human-language-understanding-reasoning). Daedalus. 151 (2): 127–138. doi:10.1162/daed_a_01905 (https://doi.org/10.1162%2Fdaed_a_01905). S2CID 248377870 (https://api.semanticscholar.org/CorpusID:248377870).
8. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 30.
9. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 (https://arxiv.org/abs/1409.0473) [cs.CL (https://arxiv.org/archive/cs.CL)].
10. Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What We Know About How BERT Works" (https://aclanthology.org/2020.tacl-1.54). Transactions of the Association for Computational Linguistics. 8: 842–866. arXiv:2002.12327 (https://arxiv.org/abs/2002.12327). doi:10.1162/tacl_a_00349 (https://doi.org/10.1162%2Ftacl_a_00349). S2CID 211532403 (https://api.semanticscholar.org/CorpusID:211532403).
11. Hern, Alex (14 February 2019). "New AI fake text generator may be too dangerous to release, say creators" (https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction). The Guardian. Retrieved 20 January 2024.
12. "ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months" (https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months). Euronews. November 30, 2023. Retrieved January 20, 2024.
13. Heaven, Will (March 14, 2023). "GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why" (https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/). MIT Technology Review. Retrieved January 20, 2024.
14. "Parameters in notable artificial intelligence systems" (https://ourworldindata.org/grapher/artificial-intelligence-parameter-count?time=2017-09-05..latest). ourworldindata.org. November 30, 2023. Retrieved January 20, 2024.
15. "Google's Gemini Pro Beats GPT-4" (https://analyticsindiamag.com/googles-gemini-pro-beats-gpt-4/). analyticsindiamag.com. January 27, 2024. Retrieved January 29, 2024.
16. "LMSYS Chatbot Arena Leaderboard" (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). huggingface.co. Retrieved January 20, 2024.
17. "OpenAI API" (https://web.archive.org/web/20230423211308/https://platform.openai.com/tokenizer). platform.openai.com. Archived from the original (https://platform.openai.com/) on April 23, 2023. Retrieved 2023-04-30.
18. Paaß, Gerhard; Giesselbach, Sven (2022). "Pre-trained Language Models" (https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2). Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 19–78. doi:10.1007/978-3-031-23190-2_2 (https://doi.org/10.1007%2F978-3-031-23190-2_2). ISBN 9783031231902. Retrieved 3 August 2023.
19. Yennie Jun (2023-05-03). "All languages are NOT created (tokenized) equal" (https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized). Language models cost much more in some languages than others. Retrieved 2023-08-17. "In other words, to express the same sentiment, some languages require up to 10 times more tokens."
20. Petrov, Aleksandar; Malfa, Emanuele La; Torr, Philip; Bibi, Adel (June 23, 2023). "Language Model Tokenizers Introduce Unfairness Between Languages" (https://openreview.net/forum?id=Pj4YYuxTq9). NeurIPS. arXiv:2305.15425 (https://arxiv.org/abs/2305.15425) – via openreview.net.
21. Dodge, Jesse; Sap, Maarten; Marasović, Ana; Agnew, William; Ilharco, Gabriel; Groeneveld, Dirk; Mitchell, Margaret; Gardner, Matt (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus". arXiv:2104.08758 (https://arxiv.org/abs/2104.08758) [cs.CL (https://arxiv.org/archive/cs.CL)].
22. Lee, Katherine; Ippolito, Daphne; Nystrom, Andrew; Zhang, Chiyuan; Eck, Douglas; Callison-Burch, Chris; Carlini, Nicholas (May 2022). "Deduplicating Training Data Makes Language Models Better" (https://aclanthology.org/2022.acl-long.577.pdf) (PDF). Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: Long Papers: 8424–8445. doi:10.18653/v1/2022.acl-long.577 (https://doi.org/10.18653%2Fv1%2F2022.acl-long.577).
23. Li, Yuanzhi; Bubeck, Sébastien; Eldan, Ronen; Del Giorno, Allie; Gunasekar, Suriya; Lee, Yin Tat (2023-09-11), Textbooks Are All You Need II: phi-1.5 technical report (http://arxiv.org/abs/2309.05463), arXiv:2309.05463 (https://arxiv.org/abs/2309.05463), retrieved 2024-01-20
24. Brown, Tom B.; et al. (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 (https://arxiv.org/abs/2005.14165) [cs.CL (https://arxiv.org/archive/cs.CL)].
25. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 (https://arxiv.org/abs/2203.02155) [cs.CL (https://arxiv.org/archive/cs.CL)].
26. Wang, Yizhong; Kordi, Yeganeh; Mishra, Swaroop; Liu, Alisa; Smith, Noah A.; Khashabi, Daniel; Hajishirzi, Hannaneh (2022). "Self-Instruct: Aligning Language Model with Self Generated Instructions". arXiv:2212.10560 (https://arxiv.org/abs/2212.10560) [cs.CL (https://arxiv.org/archive/cs.CL)].
27. Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff (2017-01-01). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv:1701.06538 (https://arxiv.org/abs/1701.06538) [cs.LG (https://arxiv.org/archive/cs.LG)].
28. Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2021-01-12). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv:2006.16668 (https://arxiv.org/abs/2006.16668) [cs.CL (https://arxiv.org/archive/cs.CL)].
29. Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM" (https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html). ai.googleblog.com. Retrieved 2023-03-09.
30. Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff; Fedus, William (31 August 2022). "Emergent Abilities of Large Language Models" (https://openreview.net/forum?id=yzkSU5zdwD). Transactions on Machine Learning Research. ISSN 2835-8856 (https://www.worldcat.org/issn/2835-8856).
31. Alammar, Jay. "Illustrated transformer" (https://jalammar.github.io/illustrated-transformer/). Retrieved 2023-07-29.
32. Alammar, Jay. "The Illustrated GPT-2 (Visualizing Transformer Language Models)" (https://jalammar.github.io/illustrated-gpt2/). Retrieved 2023-08-01.
33. "Long context prompting for Claude 2.1" (https://www.anthropic.com/news/claude-2-1-prompting). December 6, 2023. Retrieved January 20, 2024.
34. Schade, Michael. "GPT-4 Turbo: Our latest model" (https://help.openai.com/en/articles/8555510-gpt-4-turbo). Retrieved January 20, 2024.
35. "Rate limits" (https://platform.openai.com/docs/guides/rate-limits). openai.com. Retrieved January 20, 2024.
36. Zaib, Munazza; Sheng, Quan Z.; Emma Zhang, Wei (4 February 2020). "A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP" (https://www.researchgate.net/publication/338931711). Proceedings of the Australasian Computer Science Week Multiconference. pp. 1–4. arXiv:2104.10810 (https://arxiv.org/abs/2104.10810). doi:10.1145/3373017.3373028 (https://doi.org/10.1145%2F3373017.3373028). ISBN 9781450376976. S2CID 211040895 (https://api.semanticscholar.org/CorpusID:211040895).
37. Jurafsky, Dan; Martin, James H. (7 January 2023). Speech and Language Processing (https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf) (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
38. Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter" (https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/). TechCrunch.
39. Sharir, Or; Peleg, Barak; Shoham, Yoav (2020). "The Cost of Training NLP Models: A Concise Overview". arXiv:2004.08900 (https://arxiv.org/abs/2004.08900) [cs.CL (https://arxiv.org/archive/cs.CL)].
40. Biderman, Stella; Schoelkopf, Hailey; Anthony, Quentin; Bradley, Herbie; Khan, Mohammad Aflah; Purohit, Shivanshu; Prashanth, USVSN Sai (April 2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling". arXiv:2304.01373 (https://arxiv.org/abs/2304.01373) [cs.CL (https://arxiv.org/archive/cs.CL)].
41. Vincent, James (3 April 2023). "AI is entering an era of corporate control" (https://www.theverge.com/23667752/ai-progress-2023-report-stanford-corporate-control). The Verge. Retrieved 19 June 2023.
42. Section 2.1 and Table 1, Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 (https://arxiv.org/abs/2001.08361) [cs.LG (https://arxiv.org/archive/cs.LG)].
43. Gao, Luyu; Madaan, Aman; Zhou, Shuyan; Alon, Uri; Liu, Pengfei; Yang, Yiming; Callan, Jamie; Neubig, Graham (2022-11-01). "PAL: Program-aided Language Models". arXiv:2211.10435 (https://arxiv.org/abs/2211.10435) [cs.CL (https://arxiv.org/archive/cs.CL)].
44. "PAL: Program-aided Language Models" (https://reasonwithpal.com/). reasonwithpal.com. Retrieved 2023-06-12.
45. Paranjape, Bhargavi; Lundberg, Scott; Singh, Sameer; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Tulio Ribeiro, Marco (2023-03-01). "ART: Automatic multi-step reasoning and tool-use for large language models". arXiv:2303.09014 (https://arxiv.org/abs/2303.09014) [cs.CL (https://arxiv.org/archive/cs.CL)].
46. Liang, Yaobo; Wu, Chenfei; Song, Ting; Wu, Wenshan; Xia, Yan; Liu, Yu; Ou, Yang; Lu, Shuai; Ji, Lei; Mao, Shaoguang; Wang, Yun; Shou, Linjun; Gong, Ming; Duan, Nan (2023-03-01). "TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs". arXiv:2303.16434 (https://arxiv.org/abs/2303.16434) [cs.AI (https://arxiv.org/archive/cs.AI)].
47. Patil, Shishir G.; Zhang, Tianjun; Wang, Xin; Gonzalez, Joseph E. (2023-05-01). "Gorilla: Large Language Model Connected with Massive APIs". arXiv:2305.15334 (https://arxiv.org/abs/2305.15334) [cs.CL (https://arxiv.org/archive/cs.CL)].
48. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html). Advances in Neural Information Processing Systems. Curran Associates, Inc. 33: 9459–9474. arXiv:2005.11401 (https://arxiv.org/abs/2005.11401).
49. Huang, Wenlong; Abbeel, Pieter; Pathak, Deepak; Mordatch, Igor (2022-06-28). "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents" (https://proceedings.mlr.press/v162/huang22a.html). Proceedings of the 39th International Conference on Machine Learning. PMLR: 9118–9147. arXiv:2201.07207 (https://arxiv.org/abs/2201.07207).
50. Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik; Cao, Yuan (2022-10-01). "ReAct: Synergizing Reasoning and Acting in Language Models". arXiv:2210.03629 (https://arxiv.org/abs/2210.03629) [cs.CL (https://arxiv.org/archive/cs.CL)].
51. Wu, Yue; Prabhumoye, Shrimai; Min, So Yeon (24 May 2023). "SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning". arXiv:2305.15486 (https://arxiv.org/abs/2305.15486) [cs.AI (https://arxiv.org/archive/cs.AI)].
52. Wang, Zihao; Cai, Shaofei; Liu, Anji; Ma, Xiaojian; Liang, Yitao (2023-02-03). "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents". arXiv:2302.01560 (https://arxiv.org/abs/2302.01560) [cs.AI (https://arxiv.org/archive/cs.AI)].
53. Shinn, Noah; Cassano, Federico; Labash, Beck; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023-03-01). "Reflexion: Language Agents with Verbal Reinforcement Learning". arXiv:2303.11366 (https://arxiv.org/abs/2303.11366) [cs.AI (https://arxiv.org/archive/cs.AI)].
54. Hao, Shibo; Gu, Yi; Ma, Haodi; Jiahua Hong, Joshua; Wang, Zhen; Zhe Wang, Daisy; Hu, Zhiting (2023-05-01). "Reasoning with Language Model is Planning with World Model". arXiv:2305.14992 (https://arxiv.org/abs/2305.14992) [cs.CL (https://arxiv.org/archive/cs.CL)].
55. Zhang, Jenny; Lehman, Joel; Stanley, Kenneth; Clune, Jeff (2 June 2023). "OMNI: Open-endedness via Models of human Notions of Interestingness". arXiv:2306.01711 (https://arxiv.org/abs/2306.01711) [cs.AI (https://arxiv.org/archive/cs.AI)].
56. "Voyager | An Open-Ended Embodied Agent with Large Language Models" (https://voyager.minedojo.org/). voyager.minedojo.org. Retrieved 2023-06-09.
57. Park, Joon Sung; O'Brien, Joseph C.; Cai, Carrie J.; Ringel Morris, Meredith; Liang, Percy; Bernstein, Michael S. (2023-04-01). "Generative Agents: Interactive Simulacra of Human Behavior". arXiv:2304.03442 (https://arxiv.org/abs/2304.03442) [cs.HC (https://arxiv.org/archive/cs.HC)].
58. Nagel, Markus; Amjad, Rana Ali; Baalen, Mart Van; Louizos, Christos; Blankevoort, Tijmen (2020-11-21). "Up or Down? Adaptive Rounding for Post-Training Quantization" (https://proceedings.mlr.press/v119/nagel20a.html). Proceedings of the 37th International Conference on Machine Learning. PMLR: 7197–7206.
59. Polino, Antonio; Pascanu, Razvan; Alistarh, Dan (2018-02-01). "Model compression via distillation and quantization". arXiv:1802.05668 (https://arxiv.org/abs/1802.05668) [cs.NE (https://arxiv.org/archive/cs.NE)].
60. Frantar, Elias; Ashkboos, Saleh; Hoefler, Torsten; Alistarh, Dan (2022-10-01). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". arXiv:2210.17323 (https://arxiv.org/abs/2210.17323) [cs.LG (https://arxiv.org/archive/cs.LG)].
61. Dettmers, Tim; Svirschevski, Ruslan; Egiazarian, Vage; Kuznedelev, Denis; Frantar, Elias; Ashkboos, Saleh; Borzunov, Alexander; Hoefler, Torsten; Alistarh, Dan (2023-06-01). "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression". arXiv:2306.03078 (https://arxiv.org/abs/2306.03078) [cs.CL (https://arxiv.org/archive/cs.CL)].
62. Dettmers, Tim; Pagnoni, Artidoro; Holtzman, Ari; Zettlemoyer, Luke (2023-05-01). "QLoRA: Efficient Finetuning of Quantized LLMs". arXiv:2305.14314 (https://arxiv.org/abs/2305.14314) [cs.LG (https://arxiv.org/archive/cs.LG)].
63. Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Rich (2014-06-18). "Multimodal Neural Language Models" (https://proceedings.mlr.press/v32/kiros14.html). Proceedings of the 31st International Conference on Machine Learning. PMLR: 595–603.
64. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html). Advances in Neural Information Processing Systems. Curran Associates, Inc. 25.
65. Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence; Parikh, Devi (2015). "VQA: Visual Question Answering" (https://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html). ICCV: 2425–2433.
66. Li, Junnan; Li, Dongxu; Savarese, Silvio; Hoi, Steven (2023-01-01). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". arXiv:2301.12597 (https://arxiv.org/abs/2301.12597) [cs.CV (https://arxiv.org/archive/cs.CV)].
67. Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022-12-06). "Flamingo: a Visual Language Model for Few-Shot Learning" (https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html). Advances in Neural Information Processing Systems. 35: 23716–23736. arXiv:2204.14198 (https://arxiv.org/abs/2204.14198).
68. Driess, Danny; Xia, Fei; Sajjadi, Mehdi S. M.; Lynch, Corey; Chowdhery, Aakanksha; Ichter, Brian; Wahid, Ayzaan; Tompson, Jonathan; Vuong, Quan; Yu, Tianhe; Huang, Wenlong; Chebotar, Yevgen; Sermanet, Pierre; Duckworth, Daniel; Levine, Sergey (2023-03-01). "PaLM-E: An Embodied Multimodal Language Model". arXiv:2303.03378 (https://arxiv.org/abs/2303.03378) [cs.LG (https://arxiv.org/archive/cs.LG)].
69. Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae (2023-04-01). "Visual Instruction Tuning". arXiv:2304.08485 (https://arxiv.org/abs/2304.08485) [cs.CV (https://arxiv.org/archive/cs.CV)].
70. Zhang, Hang; Li, Xin; Bing, Lidong (2023-06-01). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding". arXiv:2306.02858 (https://arxiv.org/abs/2306.02858) [cs.CL (https://arxiv.org/archive/cs.CL)].
71. OpenAI (2023-03-27). "GPT-4 Technical Report". arXiv:2303.08774 (https://arxiv.org/abs/2303.08774) [cs.CL (https://
arxiv.org/archive/cs.CL)].
72. OpenAI (September 25, 2023). "GPT-4V(ision) System Card" (https://cdn.openai.com/papers/GPTV_System_Card.p
df) (PDF).
73. Pichai, Sundar, Google Keynote (Google I/O '23) (https://www.youtube.com/watch?v=cNfINi5CNbY&t=931s),
timestamp 15:31, retrieved 2023-07-02
74. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza;
Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican,
Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language
Models". arXiv:2203.15556 (https://arxiv.org/abs/2203.15556) [cs.CL (https://arxiv.org/archive/cs.CL)].
75. Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws".
arXiv:2210.14891 (https://arxiv.org/abs/2210.14891) [cs.LG (https://arxiv.org/archive/cs.LG)].
76. "137 emergent abilities of large language models" (https://www.jasonwei.net/blog/emergence). Jason Wei.
Retrieved 2023-06-24.
77. Hahn, Michael; Goyal, Navin (2023-03-14). "A Theory of Emergent In-Context Learning as Implicit Structure
Induction". arXiv:2303.07971 (https://arxiv.org/abs/2303.07971) [cs.LG (https://arxiv.org/archive/cs.LG)].
78. Pilehvar, Mohammad Taher; Camacho-Collados, Jose (June 2019). "WiC: the Word-in-Context Dataset for Evaluating
Context-Sensitive Meaning Representations" (https://aclanthology.org/N19-1128). Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Minneapolis, Minnesota: Association for Computational Linguistics: 1267–1273. doi:10.18653/v1/N19-1128 (https://
doi.org/10.18653%2Fv1%2FN19-1128). S2CID 102353817 (https://api.semanticscholar.org/CorpusID:102353817).
79. "WiC: The Word-in-Context Dataset" (https://pilehvar.github.io/wic/). pilehvar.github.io. Retrieved 2023-06-27.
80. Patel, Roma; Pavlick, Ellie (2021-10-06). "Mapping Language Models to Grounded Conceptual Spaces" (https://ope
nreview.net/forum?id=gJcEM8sxHK). ICLR.
81. Fu, Yao (2022-11-20). "A Closer Look at Large Language Models Emergent Abilities" (https://www.notion.so/A-Closer-Look-at-Large-Langua
ge-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f).
82. Ornes, Stephen (March 16, 2023). "The Unpredictable Abilities Emerging From Large AI Models" (https://www.quant
amagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/). Quanta Magazine.
83. Schaeffer, Rylan; Miranda, Brando; Koyejo, Sanmi (2023-04-01). "Are Emergent Abilities of Large Language Models
a Mirage?". arXiv:2304.15004 (https://arxiv.org/abs/2304.15004) [cs.AI (https://arxiv.org/archive/cs.AI)].
84. Li, Kenneth; Hopkins, Aspen K.; Bau, David; Viégas, Fernanda; Pfister, Hanspeter; Wattenberg, Martin (2022-10-01).
"Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task". arXiv:2210.13382 (h
ttps://arxiv.org/abs/2210.13382) [cs.LG (https://arxiv.org/archive/cs.LG)].
85. "Large Language Model: world models or surface statistics?" (https://thegradient.pub/othello/). The Gradient. 2023-
01-21. Retrieved 2023-06-12.
86. Jin, Charles; Rinard, Martin (2023-05-01). "Evidence of Meaning in Language Models Trained on Programs".
arXiv:2305.11169 (https://arxiv.org/abs/2305.11169) [cs.LG (https://arxiv.org/archive/cs.LG)].
87. Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023-01-01). "Progress measures
for grokking via mechanistic interpretability". arXiv:2301.05217 (https://arxiv.org/abs/2301.05217) [cs.LG (https://arxi
v.org/archive/cs.LG)].
88. Mitchell, Melanie; Krakauer, David C. (28 March 2023). "The debate over understanding in AI's large language
models" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10068812). Proceedings of the National Academy of
Sciences. 120 (13): e2215907120. arXiv:2210.13966 (https://arxiv.org/abs/2210.13966).
Bibcode:2023PNAS..12015907M (https://ui.adsabs.harvard.edu/abs/2023PNAS..12015907M).
doi:10.1073/pnas.2215907120 (https://doi.org/10.1073%2Fpnas.2215907120). PMC 10068812 (https://www.ncbi.nl
m.nih.gov/pmc/articles/PMC10068812). PMID 36943882 (https://pubmed.ncbi.nlm.nih.gov/36943882).
89. Metz, Cade (16 May 2023). "Microsoft Says New A.I. Shows Signs of Human Reasoning" (https://www.nytimes.com/
2023/05/16/technology/microsoft-ai-human-reasoning.html). The New York Times.
90. Bubeck, Sébastien; Chandrasekaran, Varun; Eldan, Ronen; Gehrke, Johannes; Horvitz, Eric; Kamar, Ece; Lee,
Peter; Lee, Yin Tat; Li, Yuanzhi; Lundberg, Scott; Nori, Harsha; Palangi, Hamid; Ribeiro, Marco Tulio; Zhang, Yi
(2023). "Sparks of Artificial General Intelligence: Early experiments with GPT-4". arXiv:2303.12712 (https://arxiv.org/
abs/2303.12712) [cs.CL (https://arxiv.org/archive/cs.CL)].
91. "ChatGPT is more like an 'alien intelligence' than a human brain, says futurist" (https://www.zdnet.com/article/chatgp
t-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/). ZDNET. 2023. Retrieved 12 June 2023.
92. Newport, Cal (13 April 2023). "What Kind of Mind Does ChatGPT Have?" (https://www.newyorker.com/science/anna
ls-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have). The New Yorker. Retrieved 12 June 2023.
93. Roose, Kevin (30 May 2023). "Why an Octopus-like Creature Has Come to Symbolize the State of A.I." (https://www.
nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html) The New York Times. Retrieved 12 June 2023.
94. "The A to Z of Artificial Intelligence" (https://time.com/6271657/a-to-z-of-artificial-intelligence/). Time Magazine. 13
April 2023. Retrieved 12 June 2023.
95. Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Dai, Wenliang;
Madotto, Andrea; Fung, Pascale (November 2022). "Survey of Hallucination in Natural Language Generation" (http
s://dl.acm.org/doi/pdf/10.1145/3571730) (pdf). ACM Computing Surveys. Association for Computing Machinery. 55
(12): 1–38. arXiv:2202.03629 (https://arxiv.org/abs/2202.03629). doi:10.1145/3571730 (https://doi.org/10.1145%2F3
571730). S2CID 246652372 (https://api.semanticscholar.org/CorpusID:246652372). Retrieved 15 January 2023.
96. Varshney, Neeraj; Yao, Wenlin; Zhang, Hongming; Chen, Jianshu; Yu, Dong (2023). "A Stitch in Time Saves Nine:
Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation". arXiv:2307.03987 (http
s://arxiv.org/abs/2307.03987) [cs.CL (https://arxiv.org/archive/cs.CL)].
97. Lakoff, George (1999). Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Philosophy;
Appendix: The Neural Theory of Language Paradigm. New York: Basic Books. pp. 569–583. ISBN 978-0-465-05674-3.
98. Evans, Vyvyan (2014). The Language Myth. Cambridge University Press. ISBN 978-1-107-04396-1.
99. Friston, Karl J. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; Chapter 4 The
Generative Models of Active Inference. The MIT Press. ISBN 978-0-262-36997-8.
100. Huyen, Chip (2019). "Understanding Evaluation Metrics for Language Modeling" (https://thegradient.pub/understand
ing-evaluation-metrics-for-language-models/). The Gradient. Retrieved January 14, 2024.
101. Clark, Christopher; Lee, Kenton; Chang, Ming-Wei; Kwiatkowski, Tom; Collins, Michael; Toutanova, Kristina (2019).
"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions". arXiv:1905.10044 (https://arxiv.org/abs/190
5.10044) [cs.CL (https://arxiv.org/archive/cs.CL)].
102. Zhao, Wayne Xin; Zhou, Kun; Li, Junyi; Tang, Tianyi; Wang, Xiaolei; Hou, Yupeng; Min, Yingqian; Zhang, Beichen;
Zhang, Junjie; Dong, Zican; Du, Yifan; Yang, Chen; Chen, Yushuo; Chen, Zhipeng; Jiang, Jinhao; Ren, Ruiyang; Li,
Yifan; Tang, Xinyu; Liu, Zikang; Liu, Peiyu; Nie, Jian-Yun; Wen, Ji-Rong (2023). "A Survey of Large Language
Models". arXiv:2303.18223 (https://arxiv.org/abs/2303.18223) [cs.CL (https://arxiv.org/archive/cs.CL)].
103. Huyen, Chip (18 October 2019). "Evaluation Metrics for Language Modeling" (https://thegradient.pub/understanding-
evaluation-metrics-for-language-models/). The Gradient.
104. Srivastava, Aarohi; et al. (2022). "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of
language models". arXiv:2206.04615 (https://arxiv.org/abs/2206.04615) [cs.CL (https://arxiv.org/archive/cs.CL)].
105. Lin, Stephanie; Hilton, Jacob; Evans, Owain (2021). "TruthfulQA: Measuring How Models Mimic Human
Falsehoods". arXiv:2109.07958 (https://arxiv.org/abs/2109.07958) [cs.CL (https://arxiv.org/archive/cs.CL)].
106. Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019). "HellaSwag: Can a Machine Really
Finish Your Sentence?". arXiv:1905.07830 (https://arxiv.org/abs/1905.07830) [cs.CL (https://arxiv.org/archive/cs.C
L)].
107. "Prepare for truly useful large language models". Nature Biomedical Engineering. 7 (2): 85–86. 7 March 2023.
doi:10.1038/s41551-023-01012-6 (https://doi.org/10.1038%2Fs41551-023-01012-6). PMID 36882584 (https://pubm
ed.ncbi.nlm.nih.gov/36882584). S2CID 257403466 (https://api.semanticscholar.org/CorpusID:257403466).
108. "Your job is (probably) safe from artificial intelligence" (https://www.economist.com/finance-and-economics/2023/05/
07/your-job-is-probably-safe-from-artificial-intelligence). The Economist. 7 May 2023. Retrieved 18 June 2023.
109. "Generative AI Could Raise Global GDP by 7%" (https://www.goldmansachs.com/intelligence/pages/generative-ai-c
ould-raise-global-gdp-by-7-percent.html). Goldman Sachs. Retrieved 18 June 2023.
110. Peng, Zhencan; Wang, Zhizhi; Deng, Dong (13 June 2023). "Near-Duplicate Sequence Search at Scale for Large
Language Model Memorization Evaluation" (https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf)
(PDF). Proceedings of the ACM on Management of Data. 1 (2): 1–18. doi:10.1145/3589324 (https://doi.org/10.114
5%2F3589324). S2CID 259213212 (https://api.semanticscholar.org/CorpusID:259213212). Retrieved 2024-01-20.
Citing Lee et al. 2022.
111. Peng, Wang & Deng 2023, p. 8.
112. Alba, Davey (1 May 2023). "AI chatbots have been used to create dozens of news content farms" (https://www.japan
times.co.jp/news/2023/05/01/business/tech/ai-fake-news-content-farms/). The Japan Times. Retrieved 18 June
2023.
113. "Could chatbots help devise the next pandemic virus?" (https://www.science.org/content/article/could-chatbots-help-
devise-next-pandemic-virus). Science. 14 June 2023. doi:10.1126/science.adj2463 (https://doi.org/10.1126%2Fscie
nce.adj2463).
114. Stephen Council (1 Dec 2023). "How Googlers cracked an SF rival's tech model with a single word" (https://www.sf
gate.com/tech/article/google-openai-chatgpt-break-model-18525445.php). SFGATE.
115. Hubinger, Evan (10 January 2024). "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety
Training". arXiv:2401.05566 (https://arxiv.org/abs/2401.05566) [cs.CR (https://arxiv.org/archive/cs.CR)].
116. Stokel-Walker, Chris (November 22, 2023). "ChatGPT Replicates Gender Bias in Recommendation Letters" (https://
www.scientificamerican.com/article/chatgpt-replicates-gender-bias-in-recommendation-letters/). Scientific American.
Retrieved 2023-12-29.
117. Luo, Queenie; Puett, Michael J.; Smith, Michael D. (2023-03-28). "A Perspectival Mirror of the Elephant:
Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube". arXiv:2303.16281v2 (https://arxiv.org/
abs/2303.16281v2) [cs.CY (https://arxiv.org/archive/cs.CY)].
118. Cheng, Myra; Durmus, Esin; Jurafsky, Dan (2023-05-29), Marked Personas: Using Natural Language Prompts to
Measure Stereotypes in Language Models, arXiv:2305.18189 (https://arxiv.org/abs/2305.18189)
119. Kotek, Hadas; Dockum, Rikker; Sun, David (2023-11-05). "Gender bias and stereotypes in Large Language
Models" (https://dl.acm.org/doi/10.1145/3582269.3615599). Proceedings of the ACM Collective Intelligence
Conference. CI '23. New York, NY, USA: Association for Computing Machinery. pp. 12–24.
doi:10.1145/3582269.3615599 (https://doi.org/10.1145%2F3582269.3615599). ISBN 979-8-4007-0113-9.
120. Heikkilä, Melissa (August 7, 2023). "AI language models are rife with different political biases" (https://www.technolo
gyreview.com/2023/08/07/1077324/ai-language-models-are-rife-with-political-biases/). MIT Technology Review.
Retrieved 2023-12-29.
121. "finetune-transformer-lm" (https://github.com/openai/finetune-transformer-lm). GitHub. Retrieved 2 January 2024.
122. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 (https://arxiv.org/abs/1810.04805v2)
[cs.CL (https://arxiv.org/archive/cs.CL)].
123. Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models" (https://www.n
extplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/). The Next Platform.
Retrieved 2023-06-20.
124. "BERT" (https://github.com/google-research/bert). March 13, 2023 – via GitHub.
125. Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022).
"Bidirectional Language Models Are Also Few-shot Learners". arXiv:2209.14500 (https://arxiv.org/abs/2209.14500)
[cs.LG (https://arxiv.org/archive/cs.LG)].
126. "BERT, RoBERTa, DistilBERT, XLNet: Which one to use?" (https://www.kdnuggets.com/bert-roberta-distilbert-xlnet-
which-one-to-use.html). KDnuggets.
127. "xlnet" (https://github.com/zihangdai/xlnet/). GitHub. Retrieved 2 January 2024.
128. Naik, Amit Raja (September 23, 2021). "Google Introduces New Architecture To Reduce Cost Of Transformers" (http
s://analyticsindiamag.com/google-introduces-new-architecture-to-reduce-cost-of-transformers/). Analytics India
Magazine.
129. Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January 2020).
"XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv:1906.08237 (https://arxiv.org/a
bs/1906.08237) [cs.CL (https://arxiv.org/archive/cs.CL)].
130. "GPT-2: 1.5B Release" (https://openai.com/blog/gpt-2-1-5b-release/). OpenAI. 2019-11-05. Archived (https://web.arc
hive.org/web/20191114074358/https://openai.com/blog/gpt-2-1-5b-release/) from the original on 2019-11-14.
Retrieved 2019-11-14.
131. "Better language models and their implications" (https://openai.com/research/better-language-models). openai.com.
132. "OpenAI's GPT-3 Language Model: A Technical Overview" (https://lambdalabs.com/blog/demystifying-gpt-3).
lambdalabs.com. 3 June 2020.
133. "gpt-2" (https://github.com/openai/gpt-2). GitHub. Retrieved 13 March 2023.
134. Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla;
Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel;
Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter,
Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack;
Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (May 28, 2020). "Language
Models are Few-Shot Learners". arXiv:2005.14165v4 (https://arxiv.org/abs/2005.14165v4) [cs.CL (https://arxiv.org/ar
chive/cs.CL)].
135. "ChatGPT: Optimizing Language Models for Dialogue" (https://openai.com/blog/chatgpt/). OpenAI. 2022-11-30.
Retrieved 2023-01-13.
136. "GPT Neo" (https://github.com/EleutherAI/gpt-neo). March 15, 2023 – via GitHub.
137. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He,
Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB
Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 (https://arxiv.org/abs/2101.00027) [cs.CL (https://
arxiv.org/archive/cs.CL)].
138. Iyer, Abhishek (15 May 2021). "GPT-3's free alternative GPT-Neo is something to be excited about" (https://ventureb
eat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/). VentureBeat.
139. "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront" (https://www.forefront.ai/blog-posts/g
pt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model). www.forefront.ai. Retrieved 2023-02-28.
140. Dey, Nolan; Gosal, Gurpreet; Chen, Zhiming; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom, Marvin;
Hestness, Joel (2023-04-01). "Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras
Wafer-Scale Cluster". arXiv:2304.03208 (https://arxiv.org/abs/2304.03208) [cs.LG (https://arxiv.org/archive/cs.LG)].
141. Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B,
the World's Largest and Most Powerful Generative Language Model" (https://www.microsoft.com/en-us/research/blo
g/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generati
ve-language-model/). Microsoft Research.
142. Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu,
Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza
Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG
530B, A Large-Scale Generative Language Model". arXiv:2201.11990 (https://arxiv.org/abs/2201.11990) [cs.CL (http
s://arxiv.org/archive/cs.CL)].
143. Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang, Junyuan;
Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai, Yangfan; Chen,
Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu, Tian; Zeng, Wei; Li,
Ge; Gao, Wen; Wang, Haifeng (December 23, 2021). "ERNIE 3.0 Titan: Exploring Larger-scale Knowledge
Enhanced Pre-training for Language Understanding and Generation". arXiv:2112.12731 (https://arxiv.org/abs/2112.
12731) [cs.CL (https://arxiv.org/archive/cs.CL)].
144. "Product" (https://www.anthropic.com/product). Anthropic. Retrieved 14 March 2023.
145. Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. (9 December 2021). "A General Language Assistant as a
Laboratory for Alignment". arXiv:2112.00861 (https://arxiv.org/abs/2112.00861) [cs.CL (https://arxiv.org/archive/cs.C
L)].
146. Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. (15 December 2022). "Constitutional AI: Harmlessness from
AI Feedback". arXiv:2212.08073 (https://arxiv.org/abs/2212.08073) [cs.CL (https://arxiv.org/archive/cs.CL)].
147. "Language modelling at scale: Gopher, ethical considerations, and retrieval" (https://www.deepmind.com/blog/langu
age-modelling-at-scale-gopher-ethical-considerations-and-retrieval). www.deepmind.com. 8 December 2021.
Retrieved 20 March 2023.
148. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. (29 March 2022). "Training Compute-Optimal Large
Language Models". arXiv:2203.15556 (https://arxiv.org/abs/2203.15556) [cs.CL (https://arxiv.org/archive/cs.CL)].
149. Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways (https://storage.googleapis.com/pathwa
ys-language-model/PaLM-paper.pdf)
150. Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-Quality
Dialog Models for Everything" (https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html).
ai.googleblog.com. Retrieved 2023-03-09.
151. Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-Tze; Jin,
Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven; Ghafouri, Amin;
Menegali, Marcelo (2022-01-01). "LaMDA: Language Models for Dialog Applications". arXiv:2201.08239 (https://arxi
v.org/abs/2201.08239) [cs.CL (https://arxiv.org/archive/cs.CL)].
152. Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. (2022-05-01). "GPT-NeoX-20B: An Open-Source
Autoregressive Language Model" (https://aclanthology.org/2022.bigscience-1.9/). Proceedings of BigScience
Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. pp. 95–136.
Retrieved 2022-12-19.
153. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical analysis of
compute-optimal large language model training" (https://www.deepmind.com/blog/an-empirical-analysis-of-compute
-optimal-large-language-model-training). Deepmind Blog.
154. Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to 540 Billion
Parameters for Breakthrough Performance" (https://ai.googleblog.com/2022/04/pathways-language-model-palm-sca
ling-to.html). ai.googleblog.com. Retrieved 2023-03-09.
155. "Democratizing access to large-scale language models with OPT-175B" (https://ai.facebook.com/blog/democratizin
g-access-to-large-scale-language-models-with-opt-175b/). ai.facebook.com.
156. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher;
Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura,
Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer
Language Models". arXiv:2205.01068 (https://arxiv.org/abs/2205.01068) [cs.CL (https://arxiv.org/archive/cs.CL)].
157. Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay (2022-06-22), YaLM 100B (https://github.com/y
andex/YaLM-100B), retrieved 2023-03-18
158. Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone,
Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra,
Vedant (30 June 2022). "Solving Quantitative Reasoning Problems with Language Models". arXiv:2206.14858 (http
s://arxiv.org/abs/2206.14858) [cs.CL (https://arxiv.org/archive/cs.CL)].
159. "Minerva: Solving Quantitative Reasoning Problems with Language Models" (https://ai.googleblog.com/2022/06/min
erva-solving-quantitative-reasoning.html). ai.googleblog.com. 30 June 2022. Retrieved 20 March 2023.
160. Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?" (https://www.nature.com/articles/d41586-023-0
0641-w). Nature. 615 (7951): 202–205. Bibcode:2023Natur.615..202A (https://ui.adsabs.harvard.edu/abs/2023Natur.
615..202A). doi:10.1038/d41586-023-00641-w (https://doi.org/10.1038%2Fd41586-023-00641-w). PMID 36890378
(https://pubmed.ncbi.nlm.nih.gov/36890378). S2CID 257380916 (https://api.semanticscholar.org/CorpusID:2573809
16).
161. "bigscience/bloom · Hugging Face" (https://huggingface.co/bigscience/bloom). huggingface.co.
162. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton,
Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science".
arXiv:2211.09085 (https://arxiv.org/abs/2211.09085) [cs.CL (https://arxiv.org/archive/cs.CL)].
163. "20B-parameter Alexa model sets new marks in few-shot learning" (https://www.amazon.science/blog/20b-paramete
r-alexa-model-sets-new-marks-in-few-shot-learning). Amazon Science. 2 August 2022.
164. Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. (3 August 2022). "AlexaTM 20B: Few-Shot
Learning Using a Large-Scale Multilingual Seq2Seq Model". arXiv:2208.01448 (https://arxiv.org/abs/2208.01448)
[cs.CL (https://arxiv.org/archive/cs.CL)].
165. "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog" (https://aws.amaz
on.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/). aws.amazon.com.
17 November 2022. Retrieved 13 March 2023.
166. "Introducing LLaMA: A foundational, 65-billion-parameter large language model" (https://ai.facebook.com/blog/large-
language-model-llama-meta-ai/). Meta AI. 24 February 2023.
167. "The Falcon has landed in the Hugging Face ecosystem" (https://huggingface.co/blog/falcon). huggingface.co.
Retrieved 2023-06-20.
168. "Stanford CRFM" (https://crfm.stanford.edu/2023/03/13/alpaca.html). crfm.stanford.edu.
169. "GPT-4 Technical Report" (https://cdn.openai.com/papers/gpt-4.pdf) (PDF). OpenAI. 2023. Archived (https://web.arch
ive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf) (PDF) from the original on March 14, 2023.
Retrieved March 14, 2023.
170. Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models" (http
s://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/). Cerebras.
171. "Abu Dhabi-based TII launches its own version of ChatGPT" (https://fastcompanyme.com/news/abu-dhabi-based-tii-
launches-its-own-version-of-chatgpt/). tii.ae.
172. Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli,
Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien (2023-06-01). "The RefinedWeb Dataset for Falcon
LLM: Outperforming Curated Corpora with Web Data, and Web Data Only". arXiv:2306.01116 (https://arxiv.org/abs/2
306.01116) [cs.CL (https://arxiv.org/archive/cs.CL)].
173. "tiiuae/falcon-40b · Hugging Face" (https://huggingface.co/tiiuae/falcon-40b). huggingface.co. 2023-06-09.
Retrieved 2023-06-20.
174. UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free (https://
www.businesswire.com/news/home/20230531005608/en/UAE's-Falcon-40B-World's-Top-Ranked-AI-Model-from-T
echnology-Innovation-Institute-is-Now-Royalty-Free), 31 May 2023
175. Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur,
Prabhanjan; Rosenberg, David; Mann, Gideon (March 30, 2023). "BloombergGPT: A Large Language Model for
Finance". arXiv:2303.17564 (https://arxiv.org/abs/2303.17564) [cs.LG (https://arxiv.org/archive/cs.LG)].
176. Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang,
Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su,
Teng; Liu, Qun; Yao, Jun (March 19, 2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse
Heterogeneous Computing". arXiv:2303.10845 (https://arxiv.org/abs/2303.10845) [cs.CL (https://arxiv.org/archive/cs.
CL)].
177. Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum,
Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David;
Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations -- Democratizing Large Language
Model Alignment". arXiv:2304.07327 (https://arxiv.org/abs/2304.07327) [cs.CL (https://arxiv.org/archive/cs.CL)].
178. Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI" (https://www.timesofisr
ael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/). www.timesofisrael.com. Retrieved
2023-07-24.
179. Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race" (https://techcrunch.com/2023/04/
13/with-bedrock-amazon-enters-the-generative-ai-race/). TechCrunch. Retrieved 2023-07-24.
180. Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its
predecessor" (https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predec
essor.html). CNBC. Retrieved 18 May 2023.
181. "Introducing PaLM 2" (https://blog.google/technology/ai/google-palm-2-ai-large-language-model/). Google. May 10,
2023.
182. "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model" (https://ai.meta.com/llam
a/). Meta AI. 2023. Retrieved 2023-07-19.
183. "Claude 2" (https://www.anthropic.com/index/claude-2). anthropic.com. Retrieved 12 December 2023.
184. "Falcon 180B" (https://falconllm.tii.ae/falcon-180b.html). Technology Innovation Institute. 2023. Retrieved
2023-09-21.
185. "Announcing Mistral 7B" (https://mistral.ai/news/announcing-mistral-7b/). Mistral. 2023. Retrieved 2023-10-06.
186. "Introducing Claude 2.1" (https://www.anthropic.com/index/claude-2-1). anthropic.com. Retrieved 12 December
2023.
187. "Grok-1 model card" (https://x.ai/model-card/). x.ai. Retrieved 12 December 2023.
188. "Gemini - Google DeepMind" (https://deepmind.google/technologies/gemini/#capabilities). deepmind.google.
Retrieved 12 December 2023.
189. "Mixtral of experts" (https://mistral.ai/news/mixtral-of-experts/). mistral.ai. 11 December 2023. Retrieved
12 December 2023.
190. Franzen, Carl (11 December 2023). "Mistral shocks AI community as latest open source model eclipses GPT-3.5
performance" (https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-
5-performance/). VentureBeat. Retrieved 12 December 2023.
191. Hughes, Alyssa (12 December 2023). "Phi-2: The surprising power of small language models" (https://www.microsof
t.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/). Microsoft Research. Retrieved
13 December 2023.
192. Cheah, Eugene. "🦅 Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5)"
(https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers). blog.rwkv.com. Retrieved 31 January 2024.
Further reading
Jurafsky, Dan; Martin, James H. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition (https://web.stanford.edu/~jurafsky/slp3/ed3book_j
an72023.pdf), 3rd Edition draft, 2023.
Phuong, Mary; Hutter, Marcus (2022). "Formal Algorithms for Transformers". arXiv:2207.09238 (https://arxiv.org/abs/
2207.09238) [cs.LG (https://arxiv.org/archive/cs.LG)].
Eloundou, Tyna; Manning, Sam; Mishkin, Pamela; Rock, Daniel (2023). "GPTs are GPTs: An Early Look at the Labor
Market Impact Potential of Large Language Models". arXiv:2303.10130 (https://arxiv.org/abs/2303.10130) [econ.GN
(https://arxiv.org/archive/econ.GN)].
Eldan, Ronen; Li, Yuanzhi (2023). "TinyStories: How Small Can Language Models Be and Still Speak Coherent
English?". arXiv:2305.07759 (https://arxiv.org/abs/2305.07759) [cs.CL (https://arxiv.org/archive/cs.CL)].
Frank, Michael C. (27 June 2023). "Baby steps in evaluating the capacities of large language models" (https://www.
nature.com/articles/s44159-023-00211-x). Nature Reviews Psychology. 2 (8): 451–452. doi:10.1038/s44159-023-
00211-x (https://doi.org/10.1038%2Fs44159-023-00211-x). ISSN 2731-0574 (https://www.worldcat.org/issn/2731-05
74). S2CID 259713140 (https://api.semanticscholar.org/CorpusID:259713140). Retrieved 2 July 2023.
Zhao, Wayne Xin; et al. (2023). "A Survey of Large Language Models". arXiv:2303.18223 (https://arxiv.org/abs/2303.
18223) [cs.CL (https://arxiv.org/archive/cs.CL)].
Kaddour, Jean; et al. (2023). "Challenges and Applications of Large Language Models". arXiv:2307.10169 (https://ar
xiv.org/abs/2307.10169) [cs.CL (https://arxiv.org/archive/cs.CL)].
Yin, Shukang; Fu, Chaoyou; Zhao, Sirui; Li, Ke; Sun, Xing; Xu, Tong; Chen, Enhong (2023-06-01). "A Survey on
Multimodal Large Language Models". arXiv:2306.13549 (https://arxiv.org/abs/2306.13549) [cs.CV (https://arxiv.org/a
rchive/cs.CV)].
Open LLMs repository (https://github.com/eugeneyan/open-llms) on GitHub.