
Large Language Model Algorithms in Plain English

for Product Managers and Non-technical Business Professionals

by Adnan Boz
Thank you for joining our event. Your interest was truly inspiring. I hope the training provided
you with useful insights into large language models (LLMs) that you can apply in your product
management efforts. You will find the topics that I presented documented here for your
reference.

To further your career and help you deliver unprecedented value to your customers through
Generative AI or AI in general, my company, the AI Product Institute, offers workshops on
Generative AI products, training programs on AI product development lifecycles, and AI business
strategy programs, tailored to both individual product managers and corporate product
management teams. For more details on our workshops and training programs, please visit
https://www.aiproductinstitute.com/generative-ai.

Feel free to use this document as a reference for any topics from the event, and share it with
your friends at https://drive.google.com/drive/folders/1VaLRsry9zxRZTOvb8FAGZjxyfyCD2gmD.
If you have any questions, you can reach out to me via LinkedIn at
https://linkedin.com/in/adnanboz or email me at [email protected].

Learnings
1) The ChatGPT chatbot uses a type of LLM called GPT (Generative Pre-trained Transformer);
it “generates”, it does not “answer”. It works with a prompt and a completion1. The prompt is
the text you enter; the completion is the result you receive. Even the OpenAI API
endpoints are named “https://api.openai.com/v1/chat/completions”8 and
“https://api.openai.com/v1/completions”9.

2) OpenAI has released GPT foundation models that are sequentially numbered, making up
its "GPT-n" series. However, this doesn’t mean that there is an actual deployed model
named GPT-n; in fact, the default LLMs deployed on the cloud under the GPT-3 version
are ada, babbage, curie, davinci, text-ada-001, text-babbage-001, and text-curie-001.
Similarly, the GPT-3.5 and GPT-4 versions have multiple models available for use. Each of
these actual models has different capabilities.12

3) Google Bard, another popular conversational generative AI chatbot, is based on the
LaMDA (Language Model for Dialogue Applications)2 family of large language models,
as well as PaLM 2 for improved multilingual capabilities3.

4) Most LLMs are trained through a process known as generative pretraining4, where the
model learns to predict text tokens from a given training dataset. This training can
generally be categorized into two primary methods, illustrated in the sketch after this list.

a) GPT-style (Generative Pre-trained Transformer), a.k.a. autoregressive ("predict the
next word"): the model is presented with text tokens such as "I enjoy reading"
and is trained to predict the upcoming tokens, for example, "a book in the park".

b) BERT5-style (Bidirectional Encoder Representations from Transformers), a.k.a.
masked ("fill in the blank"): the model is provided with a text segment in which
certain tokens are masked, like "I enjoy reading [MASK] [MASK] in the park", and it is
expected to predict the concealed tokens, in this case, "a book".
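
Here is a minimal sketch of the two styles, assuming the Hugging Face transformers package and the public "gpt2" and "bert-base-uncased" checkpoints (these specific models are my choice for illustration, not something prescribed at the event). It runs a GPT-style next-token generator and a single-blank variant of the BERT-style example:

    # GPT-style vs. BERT-style prediction, using small public checkpoints.
    from transformers import pipeline

    # Autoregressive ("predict the next word"): continue the prompt.
    generator = pipeline("text-generation", model="gpt2")
    print(generator("I enjoy reading", max_new_tokens=5)[0]["generated_text"])

    # Masked ("fill in the blank"): predict the hidden token.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("I enjoy reading a [MASK] in the park.")[0]["token_str"])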

5) GPT and BERT’s foundation lies in the transformer6 neural network architecture, a
novel concept that revolutionized language processing tasks. The transformer departs
from traditional recurrent neural networks (RNNs) and convolutional neural networks
(CNNs); instead, it uses a mechanism called 'attention' to understand the context of
words in a sentence. In simple terms, the 'attention' mechanism allows the model to
focus on the important parts of the input sequence when producing an output, rather
than processing the input in a fixed order or looking at each word in isolation.
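
For readers who want a peek under the hood, here is a toy sketch of the scaled dot-product attention computation at the core of the transformer. The matrix names and sizes are illustrative assumptions, not the internals of any particular model:

    # Toy scaled dot-product attention: each output is a weighted mix of the values.
    import numpy as np

    def attention(Q, K, V):
        # Similarity of each query with every key, scaled by sqrt(key dimension)
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Softmax turns the scores into attention weights that sum to 1 per query
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # "Focus on the important parts": mix the values according to the weights
        return weights @ V

    Q = np.random.rand(4, 8)  # 4 tokens, 8-dimensional vectors (made-up sizes)
    K = np.random.rand(4, 8)
    V = np.random.rand(4, 8)
    print(attention(Q, K, V).shape)  # (4, 8): one context-aware vector per token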

6) The training data used for generative pre-training on text is called the corpus. In linguistics, a
corpus (plural corpora) or text corpus is a language resource consisting of a large and
structured set of texts.7 BERT was trained on 3.3 billion words, GPT-2 on 10 billion tokens,
GPT-3 on 499 billion tokens (410B from Common Crawl, 19B from WebText2, 12B from
Books1, 55B from Books2, 3B from Wikipedia), LaMDA on 1.56T words (168 billion tokens),
and PaLM on 768 billion tokens.4

7) In natural language processing, a token is a piece of a whole, so a "token" could be a
word or part of a word. The way a body of text is split into tokens can vary. For English
language models, tokens are often individual words and punctuation marks. In GPT-3, a
token is more accurately a subword unit.10 In ChatGPT, the model reads one token at a
time and tries to predict the next token, given the previous ones.

8) OpenAI provides a web tool called “Tokenizer” at https://platform.openai.com/tokenizer
that allows anyone to see and count the tokens of a prompt.

9) LLMs are mathematical functions whose input and output are lists of numbers.
Consequently, words must be converted to numbers. In general, an LLM uses a separate
tokenizer, which maps between texts and lists of integers.11 Different models can use
different tokenizers:

a) GPT-3 (OpenAI) uses Byte Pair Encoding (BPE), a form of subword tokenization
that can break words down into smaller parts.

b) BERT (Google) uses WordPiece tokenization, another form of subword
tokenization similar to BPE. It splits words into smaller units and prefixes all but
the first token of a word with '##' to indicate that they are subparts of a larger word.

c) RoBERTa (Facebook), a variant of BERT, uses Byte-Level BPE, which operates at
the byte level, allowing it to handle any possible byte sequence rather than being
limited to Unicode characters.
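
As a concrete illustration, here is a small sketch using OpenAI's open-source tiktoken library (assuming it is installed; "r50k_base" is the BPE encoding used by the original GPT-3 models):

    # Turn text into tokens (integers) and back, the way GPT-3's tokenizer does.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    tokens = enc.encode("I enjoy reading a book in the park.")
    print(tokens)                              # a list of integers, one per token
    print(len(tokens))                         # the token count
    print([enc.decode([t]) for t in tokens])   # the subword pieces behind each integer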

10) The term "max tokens" is also used to refer to the maximum length of text the model can
handle in one pass, often called the "maximum sequence length" or "context
window." For instance, GPT-3.5 has a maximum sequence length of 4,096 tokens in total
for the prompt and completion, GPT-4 has 8,192, and GPT-4-32k has 32,768.12
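
To see how this limit plays out in practice, here is a small sketch, again assuming the tiktoken library, that checks how much of a 4,096-token context window (the GPT-3.5 figure above) is left for the completion after a given prompt:

    # Check how much room a prompt leaves for the completion in the context window.
    import tiktoken

    MAX_CONTEXT = 4096                          # GPT-3.5, per the section above
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 models

    prompt = "Summarize the following meeting notes for a product review ..."
    prompt_tokens = len(enc.encode(prompt))
    print(prompt_tokens, "prompt tokens;",
          MAX_CONTEXT - prompt_tokens, "tokens left for the completion")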

11) The number of tokens in the prompt and completion together also determines the price
of an OpenAI API call. For example, as of February 2023, the rate for using Davinci is $0.06
per 1,000 tokens, while the rate for using Ada is $0.0008 per 1,000 tokens.
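
As a back-of-the-envelope illustration, here is a tiny sketch that turns token counts into an estimated cost using the per-1,000-token rates quoted above (rates change over time, so treat the numbers as placeholders):

    # Estimate the cost of a call from token counts and a per-1,000-token rate.
    def estimate_cost(prompt_tokens, completion_tokens, rate_per_1k):
        return (prompt_tokens + completion_tokens) / 1000 * rate_per_1k

    # e.g. a 500-token prompt and a 300-token completion at the quoted Davinci rate
    print(f"${estimate_cost(500, 300, 0.06):.4f}")  # $0.0480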

12) Parameters are the learned parts of the model. For GPT-style models, which are
transformer-based, the parameters are the weights and biases in the various layers of
the model. During training, the model learns the best values for these parameters to
predict the next token in the input text. The number of parameters in a model often
correlates with its capacity to learn and represent complex patterns. For instance, GPT-3
has 175 billion parameters, BERT has 340 million, LaMDA has 137 billion, PaLM has 540
billion, and GPT-4 is estimated at around 1 trillion.13
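
As a toy illustration of what "parameters" means, the sketch below counts the weights and biases of a single dense layer; the sizes are made-up assumptions, not the internals of any real model:

    # Parameters = learned weights and biases; even one layer has many of them.
    import numpy as np

    d_in, d_out = 768, 768
    W = np.zeros((d_in, d_out))   # weight matrix, learned during training
    b = np.zeros(d_out)           # bias vector, learned during training
    print(W.size + b.size)        # 590,592 parameters for this single layer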

13) Bi-gram language models (LMs), that is, n-gram language models with n = 2, provide a
good way to understand the inner workings of GPT-style LLMs.14 They also predict the
next element in a sequence based on the previous word, token, or even character,
essentially considering pairs, or "bi-grams". A character-based bi-gram model, for
instance, would generate the next character for a text.

a) Take for example, given the corpus “sunny day.”, the bi-gram model would create
the following bi-grams:
i) s -> u (u comes after s),
ii) u -> n (n comes after u),
iii) n -> n (n comes after n),
iv) n -> y (y comes after n),
v) y -> (space comes after y)
vi) -> d (d after space)
vii) d -> a (a after d)
viii) a -> y (y after a)
ix) y -> . (. after y)
b) This complete set of rules is, in essence, a language model.

c) The unique set of these characters is called the “vocabulary”. In this example the
vocabulary consists of the unique characters “s”, “u”, “n”, “y”, “d”, “a”, the space, and “.”,
so the “vocabulary size” is 8 characters. BERT has a vocabulary size of 30K tokens, the
GPT-3 ada model 50K tokens, davinci 60K tokens, and babbage and curie 50K
tokens.15

d) This information is enough to predict and generate the next character. If I started a
prompt with an “s”, what would you complete it with if you were a bi-gram LM?
“u”, of course.

e) You can keep generating up to “sun” without any doubt. Once you hit “n”, though, you
encounter a problem: given “n”, our simple model tells us that there are two possible
options, either another “n” or “y”. Which one would you choose if your logic were
limited to the model?

f) This is where the probabilistic behavior of LLMs comes into the picture. In these
cases the algorithm rolls a die to pick one. As you can imagine, it would land on the
intended letter only 50% of the time. However, if we used a larger corpus such as
“sunny day in ny.”, where “y” follows “n” two times but “n” follows “n” only once, then
(looking at just these two continuations) the probability would be distributed as
roughly 67% for n->y and 33% for n->n.

g) As you can imagine, with a much larger corpus such as tens of thousands of
words, the characters that can come after “n” would range from “a” all the way to
“z”, plus other characters such as “:”, ”.” and so on. These possibilities can be
expressed as probabilities and represented in what we call a probability
distribution. Then we again need to decide which one to pick. Due to how our
language is formed, it is highly likely that some probabilities will always be larger
than others, which would let us make a choice. However, what if our corpus is
biased, or the LLM architecture we picked cannot really store enough logic? This is
why LLMs roll the die over the probability distribution, weighted by those
probabilities (a small sketch of this sampling appears after this list). For example, if
n->y happens 4% of the time and n->n 1% of the time, then every time I run the LLM
with this prompt there will be a higher chance of getting “y” after “n” than another “n”.

h) This is the fundamental reason why you can get a different completion from
ChatGPT each time. But this issue is not specific to LLMs. Every ML model that has to
cross the boundary from the probabilistic world to a deterministic one, where
only one choice can exist, will encounter the same problem. Since we are not
dealing with the quantum world or performing the Schrödinger's cat experiment, we
will always have this problem of choosing one final output over another. In cancer
detection from x-ray images, for instance, although the ML classification
algorithm output provides a probability, something or someone has to make the
final binary decision of benign or malignant. Would you pick the benign output
that is showing 50.0001%?
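
Here is the sketch referenced above: a minimal character-level bi-gram model over the toy corpus “sunny day in ny.”. It counts which character follows which and then samples the next character in proportion to those counts, just as described:

    # A tiny character bi-gram "language model" with probabilistic sampling.
    import random
    from collections import defaultdict

    corpus = "sunny day in ny."

    # Count which character follows which character
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_char(prev):
        options = list(counts[prev].keys())
        weights = list(counts[prev].values())
        # "Roll the die" in proportion to the observed frequencies
        return random.choices(options, weights=weights, k=1)[0]

    # Generate from the prompt "s" until we produce a period (or give up)
    text = "s"
    for _ in range(15):
        nxt = next_char(text[-1])
        text += nxt
        if nxt == ".":
            break
    print(text)   # varies run to run, e.g. "sunny day in ny."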

14) To provide some level of control to users, OpenAI exposes two sampling parameters:
temperature and top_p. You can set them in your API call or try them out in the
Playground at https://platform.openai.com/playground.

15) A low temperature makes the output more focused, potentially causing repetition by
favoring the most likely next word. At a temperature of 1, the model uses the raw values
directly, striking a balance between diversity and coherence. High temperatures increase
output diversity but might lead to nonsensical outputs by giving more weight to less likely
words. This happens because the raw values are divided by the temperature before the
exponential function is applied; a larger temperature shrinks the differences between the
values, flattening the probability distribution so that multiple options end up with similar
probabilities. This is represented by the formula:
softmax(x_i / T) = exp(x_i / T) / Σ_j exp(x_j / T).
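
Here is a small sketch of that formula in action, showing how dividing made-up raw scores by different temperatures sharpens or flattens the resulting distribution:

    # Temperature-scaled softmax over made-up raw scores for three candidate tokens.
    import numpy as np

    def softmax_with_temperature(logits, T):
        z = np.array(logits) / T
        z -= z.max()                 # subtract the max for numerical stability
        probs = np.exp(z)
        return probs / probs.sum()

    logits = [4.0, 2.0, 1.0]         # illustrative raw scores, not real model outputs
    for T in (0.5, 1.0, 2.0):
        print(T, softmax_with_temperature(logits, T).round(3))
    # Low T sharpens the distribution; high T flattens it.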

16) Top_p, also known as nucleus sampling, adds another element of randomness to make
the generated text more diverse, instead of always selecting the most probable next
word. A low top_p value means that the model will only consider a small subset of the most
probable next words, leading to more focused and coherent, but potentially repetitive,
outputs. At a top_p of 1, the model considers all possible next words, leading to more
diverse outputs. When the top_p value is set to a specific fraction (say 0.9), the model
dynamically selects the smallest set of next words whose cumulative probability exceeds
this fraction. This means the model may consider a larger or smaller set of next words
depending on their individual probabilities. This approach allows for more randomness
than temperature scaling, but still places a higher likelihood on more probable words.
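
A companion sketch of nucleus (top_p) sampling over a made-up next-token distribution; the cutoff logic mirrors the description above:

    # Nucleus (top_p) sampling: keep the smallest set of tokens whose cumulative
    # probability exceeds p, renormalize, and sample from that set only.
    import numpy as np

    rng = np.random.default_rng(0)

    def top_p_sample(probs, p):
        probs = np.array(probs)
        order = np.argsort(probs)[::-1]              # most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return rng.choice(nucleus, p=nucleus_probs)

    probs = [0.5, 0.3, 0.15, 0.05]     # made-up next-token probabilities
    print(top_p_sample(probs, p=0.9))  # samples only from the top three tokens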

17) Emergent abilities of large language models are behaviors or skills that are not present
in smaller models but appear in larger ones, and that cannot be easily predicted by simply
extrapolating the performance of smaller models. For example, large language models
have been shown to perform well on tasks that require reasoning or inference, such as
question-answering, even with limited or no training data. The existence of emergent
abilities in large language models raises important questions
about the potential for further expansion of the range of capabilities of these models
through additional scaling. It also motivates research into why such abilities are acquired
and how they can be optimized. If you want to learn more about this topic, please read
“Emergent Abilities of Large Language Models”16.

18) Language model "hallucination" refers to when a model generates information not
present or implied in the input. This can occur due to biases in training data or limitations
in model architecture. For example, if you ask a language model "What is the color of
George Washington's smartphone?" it might respond, "George Washington had a yellow
smartphone." In reality, smartphones didn't exist in Washington's time, so this is a
hallucination: the model is inventing details that aren't factual, based on its training on
modern language and lack of deep understanding of historical context. Hallucination is
often defined as "generated content that is nonsensical or unfaithful to the provided
source content".17

References
1. https://platform.openai.com/docs/guides/completion
2. https://en.wikipedia.org/wiki/Bard_(chatbot)
3. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/, https://blog.google/technology/ai/bard-google-ai-search-updates/
4. https://en.wikipedia.org/wiki/Large_language_model
5. https://en.wikipedia.org/wiki/BERT_(language_model)
6. https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
7. https://en.wikipedia.org/wiki/Text_corpus
8. https://platform.openai.com/docs/api-reference/chat
9. https://platform.openai.com/docs/api-reference/completions
10. https://en.wikipedia.org/wiki/Lexical_analysis#Token
11. https://en.wikipedia.org/wiki/Large_language_model#Tokenization
12. https://platform.openai.com/docs/models/gpt-4
13. https://en.wikipedia.org/wiki/Large_language_model#List_of_large_language_models
14. https://en.wikipedia.org/wiki/N-gram_language_model
15. https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/tokens
16. https://openreview.net/forum?id=yzkSU5zdwD
17. https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)

© 2023 AI Product Institute LLC. All rights reserved. AI Product Institute and the AI Product
Institute logo are trademarks and/or registered trademarks of AI Product Institute LLC in the
U.S. and other countries. Other company and product names may be trademarks of the
respective companies with which they are associated. MAY23
