Large Language Model Algorithms in Plain English
for Product Managers and Non-technical Business Professionals
by Adnan Boz
Thank you for joining our event. Your interest was truly inspiring. I hope the training provided
you with useful insights into large language models (LLMs) that you can apply in your product
management efforts. You will find the topics that I presented documented here for your
reference.
To further your career and help you deliver unprecedented value to your customers through
Generative AI or AI, my company, the AI Product Institute, offers workshops on Generative AI
products, training programs on AI product development lifecycles, and AI business strategy,
catered to both individual product managers and corporate product management teams. For
more details on our workshops and training programs, please visit
https://www.aiproductinstitute.com/generative-ai.
Feel free to use this document as a reference for any topics from the event, and share it with
your friends at https://drive.google.com/drive/folders/1VaLRsry9zxRZTOvb8FAGZjxyfyCD2gmD.
If you have any questions, you can reach out to me via LinkedIn at
https://linkedin.com/in/adnanboz or email me at [email protected].
Learnings
1) The ChatGPT chatbot uses a type of LLM called GPT (Generative Pre-trained Transformer); it “generates”, it does not “answer”. It works with a prompt and a completion1. The prompt is the text you enter; the completion is the result you receive. Even the OpenAI API endpoints are named “https://api.openai.com/v1/chat/completions”8 and “https://api.openai.com/v1/completions”9.
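To make the prompt and completion terminology concrete, here is a minimal sketch of calling the chat completions endpoint from Python with the requests library; the model name, the prompt text, and the use of an OPENAI_API_KEY environment variable are illustrative assumptions, not part of the original material.

```python
import os
import requests

# Minimal sketch: send a prompt to the chat completions endpoint and read back the completion.
# Assumes an API key is stored in the OPENAI_API_KEY environment variable.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # example model name
        "messages": [{"role": "user", "content": "Explain LLMs in one sentence."}],
    },
)
completion = response.json()["choices"][0]["message"]["content"]
print(completion)
```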
2) OpenAI has released GPT foundation models that are sequentially numbered to form its “GPT-n” series. However, this doesn’t mean that there is an actual deployed model named GPT-n; in fact, the default LLMs that are deployed on the cloud and are available under the GPT-3 version are ada, babbage, curie, davinci, text-ada-001, text-babbage-001, and text-curie-001. Similarly, the GPT-3.5 and GPT-4 versions have multiple models available for use. Each of these actual models has different capabilities.12
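You can see which concrete model names are available to your account by calling the API’s model-listing endpoint. The sketch below is illustrative and reuses the OPENAI_API_KEY assumption from the earlier example.

```python
import os
import requests

# Minimal sketch: list the concrete model names deployed and available to your account.
resp = requests.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
for model in resp.json()["data"]:
    print(model["id"])  # e.g. "ada", "babbage", "curie", "davinci", ...
```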
4) Most LLMs are trained through a process known as generative pre-training4, where the model learns to predict the next text token from a given training dataset. This training can be generally categorized into two primary methods: autoregressive next-token prediction (GPT-style) and masked-token prediction (BERT-style).
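As a toy illustration of the next-token-prediction objective (not any specific model’s training code), the sketch below shows how a tokenized sentence is turned into input/target pairs during generative pre-training; whole words stand in for real subword tokens.

```python
# Toy illustration of the next-token-prediction objective used in generative pre-training.
# Here "tokens" are whole words; real models operate on subword tokens.
tokens = ["the", "sun", "is", "bright", "today"]

# For each position, the input is the text so far and the target is the very next token.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"input: {context} -> predict: {target}")
```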
5) Both GPT and BERT are built on the transformer6 neural network architecture, a novel concept that revolutionized language processing tasks. Transformer networks depart from traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs); instead, they use a mechanism called ‘attention’ to understand the context of words in a sentence. In simple terms, the attention mechanism allows the model to focus on the important parts of the input sequence when producing an output, rather than processing the input in a fixed order or looking at each word in isolation.
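For readers who want to see what ‘attention’ looks like numerically, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. The tiny random matrices stand in for the learned query, key, and value projections of a real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query (softmax of scaled dot products)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V, weights

# Three toy "word" vectors of dimension 4; in a real model Q, K, V come from learned projections.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights, 2))  # each row shows how much one word attends to every word
```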
6) The training data used for generative pre-training on text is called the corpus. In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts.7 BERT was trained on 3.3 billion words, GPT-2 on about 10 billion tokens, GPT-3 on 499 billion tokens (410B from Common Crawl, 19B from WebText2, 12B from Books1, 12B from Books2, 3B from Wikipedia), LaMDA on 1.56T words (168 billion tokens), and PaLM on 768 billion tokens.4
10) The term "max tokens" is also used to refer to the maximum length of text the model can handle in one pass, often called the "maximum sequence length" or "context window." For instance, GPT-3.5 has a maximum sequence length of 4,096 tokens in total for the prompt and completion, GPT-4 has 8,192, and GPT-4-32k has 32,768.12
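To check how much of the context window a prompt uses, OpenAI’s tiktoken library can count tokens before you send a request. The model name below is just an example.

```python
import tiktoken

# Count tokens in a prompt and see how much of a 4,096-token context window it would use.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # example model name
prompt = "Summarize the benefits of large language models for product managers."
tokens = enc.encode(prompt)

context_window = 4096  # GPT-3.5 limit covers the prompt and the completion together
print(f"{len(tokens)} prompt tokens, leaving {context_window - len(tokens)} for the completion")
```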
11) The number of tokens in the prompt and completion together also determines the price of the OpenAI API. For example, as of February 2023, the rate for using Davinci was $0.06 per 1,000 tokens, while the rate for using Ada was $0.0008 per 1,000 tokens.
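Building on the token counting above, a rough cost estimate is simply the total token count times the per-1,000-token rate; the rates in the sketch are the example figures quoted in this item.

```python
# Rough cost estimate: total tokens (prompt + completion) times the per-1,000-token rate.
rates_per_1k_usd = {"davinci": 0.06, "ada": 0.0008}  # example rates quoted above

total_tokens = 1500  # e.g. 500 prompt tokens plus 1,000 completion tokens
for model, rate in rates_per_1k_usd.items():
    print(f"{model}: ${total_tokens / 1000 * rate:.4f}")
```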
12) Parameters are the learned parts of the model. For GPT-style models, which are transformer-based, the parameters are the weights and biases in the various layers of the model. During training, the model learns the best values for these parameters to predict the next token in the input text. The number of parameters in a model often correlates with its capacity to learn and represent complex patterns. For instance, GPT-3 has 175 billion parameters, BERT has 340 million, LaMDA has 137 billion, PaLM has 540 billion, and GPT-4 is estimated at around 1 trillion.13
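As a back-of-the-envelope illustration of where such parameter counts come from (not an exact architecture), the sketch below counts the weights and biases of a single fully connected layer sized roughly like one GPT-3 feed-forward layer; the layer widths are assumptions for illustration.

```python
# Back-of-the-envelope parameter count for one fully connected layer:
# a weight matrix of shape (inputs, outputs) plus one bias per output.
def dense_layer_params(n_inputs, n_outputs):
    return n_inputs * n_outputs + n_outputs

# A toy layer as wide as GPT-3's hidden size (12,288) feeding a layer four times wider.
hidden, expanded = 12_288, 4 * 12_288
print(f"{dense_layer_params(hidden, expanded):,} parameters in this single layer")
# Dozens of such layers, plus attention and embedding matrices, add up to billions of parameters.
```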
13) Bi-gram language models (LMs), that is, n-gram language models with n=2, are a good way to understand the inner workings of GPT-style LLMs. They also predict the next element of a sequence, whether a word, a token, or even a character, based only on the previous one, essentially considering pairs, or “bi-grams.” A character-based bi-gram model, for instance, would generate a text one character at a time (a runnable sketch follows at the end of this item).
a) Take, for example, the corpus “sunny day.”. The bi-gram model would create the following bi-grams:
i) s -> u (u comes after s),
ii) u -> n (n comes after u),
iii) n -> n (n comes after n),
iv) n -> y (y comes after n),
v) y -> (the space comes after y),
vi) -> d (d comes after the space),
vii) d -> a (a comes after d),
viii) a -> y (y comes after a),
ix) y -> . (the period comes after y)
c) The unique set of these characters is called the “vocabulary”. In this example the vocabulary consists of the unique characters “s”, “u”, “n”, “y”, “d”, “a”, the space, and “.”, so the “vocabulary size” is 8 characters. BERT has a vocabulary size of 30K tokens, the GPT-3 ada model 50K tokens, GPT-3 davinci 60K tokens, and babbage and curie 50K tokens.15
d) This information is enough to predict and generate the next character. If I started a prompt with an “s”, what would you complete it with if you were a bi-gram LM? “u”, of course.
e) You can keep generating until you reach “sun” without any doubt. Once you hit “n”, you encounter a problem because, given “n”, our simple model tells us that there are two possible options: either another “n” or a “y”. Which one would you choose if your logic were limited to the model?
f) This is where the probabilistic behavior of LLMs comes into the picture. In these cases the algorithm rolls a die to pick one. As you can imagine, only 50% of the time would it pick the correct letter. However, if we used a larger corpus such as “sunny day in ny.”, where “y” follows “n” two times but “n” follows “n” only once, then the probability, counting just those two continuations, would be distributed as roughly 66% for n->y and 33% for n->n.
g) As you can imagine, with a much larger corpus of tens of thousands of words, the characters that can come after “n” would range from “a” all the way to “z”, and even include other characters such as “:”, “.” and so on. These possibilities can be expressed as probabilities and represented in what we call a probability distribution. Then we again need to decide which one to pick. Because of how our language is formed, it is highly likely that some probabilities will always be larger than others, which allows us to make a choice. However, what if our corpus is biased, or what if the LLM architecture we picked really cannot store enough logic? This is why LLMs roll the dice over the probability distribution, weighted by those probabilities. For example, if n->y happens 4% of the time and n->n 1% of the time, then every time I run the LLM with this prompt there will be a higher chance of getting “y” after “n” rather than another “n”.
h) This is the fundamental reason why you can get a different completion from ChatGPT each time. But this issue is not specific to LLMs. Every ML model that has to cross the boundary from the probabilistic world to a deterministic one, where only one choice can exist, will encounter the same problem. Since we are not dealing with the quantum world or performing the Schrödinger's cat experiment, we
will always have this problem of choosing one final output over another. In cancer detection, for example, a model that outputs a probability still has to be turned into a single diagnosis.
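Putting item 13 together, here is a minimal character-level bi-gram model in Python built from the toy corpus “sunny day in ny.” used above. It counts character pairs, turns the counts after “n” into a probability distribution, and samples the next character with a weighted “dice roll”, as items f) and g) describe.

```python
import random
from collections import defaultdict, Counter

corpus = "sunny day in ny."

# 1) Count every character bi-gram (which character follows which).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# 2) The vocabulary is the set of unique characters in the corpus.
print("vocabulary:", sorted(set(corpus)))

# 3) Turn the counts into a probability distribution and sample the next character.
def sample_next(char):
    followers = counts[char]
    chars, weights = list(followers.keys()), list(followers.values())
    return random.choices(chars, weights=weights)[0]  # weighted "dice roll"

total_after_n = sum(counts["n"].values())
print({c: f"{n}/{total_after_n}" for c, n in counts["n"].items()})
print("sampled after 'n':", sample_next("n"))
```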
14) To provide some level of control to users, OpenAI provides two parameters: temperature and top_p. You can set them in your API call or try them out in the Playground at https://platform.openai.com/playground.
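If you are calling the API directly, both knobs are simply fields in the request body. The sketch below reuses the same placeholder model name and OPENAI_API_KEY assumption as the earlier examples.

```python
import os
import requests

# Minimal sketch: the same chat completions call as before, now with explicit sampling controls.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # example model name
        "messages": [{"role": "user", "content": "Name a sunny place."}],
        "temperature": 0.2,  # lower -> more focused, potentially more repetitive output
        "top_p": 0.9,        # sample only from the most probable 90% of next tokens
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```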
15) A low temperature makes the output more focused, potentially causing repetition by favoring the most likely next word. At a temperature of 1, the model uses the raw values directly, striking a balance between diversity and coherence. High temperatures increase output diversity but might lead to nonsensical outputs by giving more weight to less likely words. This happens because the raw values (logits) are divided by the temperature before the exponential function is applied, so larger temperature values yield smaller results pre-exponentiation and hence flatten the probability distribution, allowing multiple options to have similar probabilities. This is represented by the formula: softmax(x_i/T) = exp(x_i/T) / Σ exp(x_j/T) for all j.
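Here is a small NumPy sketch of that formula, showing how dividing the raw scores (logits) by the temperature sharpens or flattens the resulting probability distribution; the logit values are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """softmax(x_i / T): divide raw scores by T before exponentiating, then normalize."""
    scaled = np.array(logits) / T
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # made-up raw scores for three candidate tokens
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(logits, T), 2)}")
# Low T sharpens the distribution toward the top token; high T flattens it.
```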
16) Top_p, also known as nucleus sampling, adds an element of randomness instead of always selecting the most probable next word, making the generated text more diverse. A low top_p value means that the model will only consider a small subset of the most probable next words, leading to more focused and coherent, but potentially repetitive, outputs. At a top_p of 1, the model considers all possible next words, leading to more diverse outputs. When the top_p value is set to a specific fraction (say 0.9), the model dynamically selects the smallest set of next words whose cumulative probability exceeds this fraction. This means the model may consider a larger or smaller set of next words depending on their individual probabilities. This approach allows for more randomness than temperature scaling, but still places a higher likelihood on more probable words.
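The sketch below implements the selection rule just described: sort the candidate probabilities, keep the smallest set whose cumulative probability reaches top_p, renormalize within that set, and sample. The probability values are made up for illustration.

```python
import numpy as np

def top_p_sample(probs, top_p, rng):
    """Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability >= top_p."""
    probs = np.array(probs)
    order = np.argsort(probs)[::-1]                  # most probable tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens in the "nucleus"
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

rng = np.random.default_rng(0)
probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print("sampled token index:", top_p_sample(probs, top_p=0.9, rng=rng))  # only indices 0-2 qualify
```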
17) Emergent abilities of large language models are unpredictable behaviors or skills that are not present in smaller models but emerge in larger ones, and they cannot be easily predicted by simply extrapolating the performance of smaller models. For example, large language models have been shown to perform well on tasks that require reasoning or inference, such as question answering, even with limited or no task-specific training data. The existence of emergent abilities in large language models raises important questions about how much further the range of capabilities of these models can expand.16
18) Language model "hallucination" refers to when a model generates information not
present or implied in the input. This can occur due to biases in training data or limitations
in model architecture. For example, if you ask a language model "What is the color of
George Washington's smartphone?" it might respond, "George Washington had a yellow
smartphone." In reality, smartphones didn't exist in Washington's time, so this is a
hallucination: the model is inventing details that aren't factual, based on its training on
modern language and lack of deep understanding of historical context. Hallucination is
often defined as "generated content that is nonsensical or unfaithful to the provided
source content".17
References
1. https://platform.openai.com/docs/guides/completion
2. https://en.wikipedia.org/wiki/Bard_(chatbot)
3. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/, https://blog.google/technology/ai/bard-google-ai-search-updates/
4. https://en.wikipedia.org/wiki/Large_language_model
5. https://en.wikipedia.org/wiki/BERT_(language_model)
6. https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
7. https://en.wikipedia.org/wiki/Text_corpus
8. https://platform.openai.com/docs/api-reference/chat
9. https://platform.openai.com/docs/api-reference/completions
10. https://en.wikipedia.org/wiki/Lexical_analysis#Token
11. https://en.wikipedia.org/wiki/Large_language_model#Tokenization
12. https://platform.openai.com/docs/models/gpt-4
13. https://en.wikipedia.org/wiki/Large_language_model#List_of_large_language_models
14. https://en.wikipedia.org/wiki/N-gram_language_model
15. https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/tokens
16. https://openreview.net/forum?id=yzkSU5zdwD
17. https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)