Large Language Model Algorithms in Plain English
for Product Managers and Non-technical Business Professionals
by Adnan Boz
Thank you for joining our event. Your interest was truly inspiring. I hope the training provided
you with useful insights into large language models (LLMs) that you can apply in your product
management efforts. You will find the topics that I presented documented here for your
reference.
To further your career and help you deliver unprecedented value to your customers through
Generative AI or AI, my company, the AI Product Institute, offers workshops on Generative AI
products, training programs on AI product development lifecycles, and AI business strategy,
catered to both individual product managers and corporate product management teams. For
more details on our workshops and training programs, please visit
https://www.aiproductinstitute.com/generative-ai.
Feel free to use this document as a reference for any topics from the event, and share it with
your friends at https://drive.google.com/drive/folders/1VaLRsry9zxRZTOvb8FAGZjxyfyCD2gmD.
If you have any questions, you can reach out to me via LinkedIn at
https://linkedin.com/in/adnanboz or email me at [email protected].
Learnings
1) The ChatGPT chatbot uses a type of LLM called GPT (Generative Pre-trained Transformer); it “generates”, it does not “answer”. It works with a prompt and a completion1. The prompt is the text you enter; the completion is the result you receive. Even the OpenAI API endpoints are named “https://api.openai.com/v1/chat/completions”8 and “https://api.openai.com/v1/completions”9.
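To make the prompt and completion terminology concrete, here is a minimal sketch of calling the chat completions endpoint from Python with the requests library; the model name, the prompt text, and the use of an OPENAI_API_KEY environment variable are illustrative assumptions, not part of the original material.

```python
import os
import requests

# Minimal sketch: send a prompt to the chat completions endpoint and read back the completion.
# Assumes an API key is stored in the OPENAI_API_KEY environment variable.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # example model name
        "messages": [{"role": "user", "content": "Explain LLMs in one sentence."}],
    },
)
completion = response.json()["choices"][0]["message"]["content"]
print(completion)
```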
2) OpenAI has released GPT foundation models that are sequentially numbered to form its “GPT-n” series. However, this doesn’t mean that there is an actual deployed model named GPT-n; in fact, the default LLMs that are deployed on the cloud and are available under the GPT-3 version are ada, babbage, curie, davinci, text-ada-001, text-babbage-001, and text-curie-001. Similarly, the GPT-3.5 and GPT-4 versions have multiple models available for use. Each of these actual models has different capabilities.12
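You can see which concrete model names are available to your account by calling the API’s model-listing endpoint. The sketch below is illustrative and reuses the OPENAI_API_KEY assumption from the earlier example.

```python
import os
import requests

# Minimal sketch: list the concrete model names deployed and available to your account.
resp = requests.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
for model in resp.json()["data"]:
    print(model["id"])  # e.g. "ada", "babbage", "curie", "davinci", ...
```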
4) Most LLMs are trained through a process known as generative pre-training4, where the model learns to predict the next text token from a given training dataset. This training can be generally categorized into two primary methods: autoregressive next-token prediction (GPT-style) and masked-token prediction (BERT-style).
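As a toy illustration of the next-token-prediction objective (not any specific model’s training code), the sketch below shows how a tokenized sentence is turned into input/target pairs during generative pre-training; whole words stand in for real subword tokens.

```python
# Toy illustration of the next-token-prediction objective used in generative pre-training.
# Here "tokens" are whole words; real models operate on subword tokens.
tokens = ["the", "sun", "is", "bright", "today"]

# For each position, the input is the text so far and the target is the very next token.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"input: {context} -> predict: {target}")
```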
5) Both GPT and BERT are built on the transformer6 neural network architecture, a novel concept that revolutionized language processing tasks. Transformer networks depart from traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs); instead, they use a mechanism called ‘attention’ to understand the context of words in a sentence. In simple terms, the attention mechanism allows the model to focus on the important parts of the input sequence when producing an output, rather than processing the input in a fixed order or looking at each word in isolation.
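For readers who want to see what ‘attention’ looks like numerically, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. The tiny random matrices stand in for the learned query, key, and value projections of a real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query (softmax of scaled dot products)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V, weights

# Three toy "word" vectors of dimension 4; in a real model Q, K, V come from learned projections.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights, 2))  # each row shows how much one word attends to every word
```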
6) The training data used for generative pre-training on text is called the corpus. In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts.7 BERT was trained on 3.3 billion words, GPT-2 on about 10 billion tokens, GPT-3 on 499 billion tokens (410B from Common Crawl, 19B from WebText2, 12B from Books1, 12B from Books2, 3B from Wikipedia), LaMDA on 1.56T words (168 billion tokens), and PaLM on 768 billion tokens.4
10) The term "max tokens" is also used to refer to the maximum length of text the model can handle in one pass, often called the "maximum sequence length" or "context window." For instance, GPT-3.5 has a maximum sequence length of 4,096 tokens in total for the prompt and completion, GPT-4 has 8,192, and GPT-4-32k has 32,768.12
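To check how much of the context window a prompt uses, OpenAI’s tiktoken library can count tokens before you send a request. The model name below is just an example.

```python
import tiktoken

# Count tokens in a prompt and see how much of a 4,096-token context window it would use.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # example model name
prompt = "Summarize the benefits of large language models for product managers."
tokens = enc.encode(prompt)

context_window = 4096  # GPT-3.5 limit covers the prompt and the completion together
print(f"{len(tokens)} prompt tokens, leaving {context_window - len(tokens)} for the completion")
```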
11) The number of tokens in the prompt and completion together also determines the price of the OpenAI API. For example, as of February 2023, the rate for using Davinci was $0.06 per 1,000 tokens, while the rate for using Ada was $0.0008 per 1,000 tokens.
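Building on the token counting above, a rough cost estimate is simply the total token count times the per-1,000-token rate; the rates in the sketch are the example figures quoted in this item.

```python
# Rough cost estimate: total tokens (prompt + completion) times the per-1,000-token rate.
rates_per_1k_usd = {"davinci": 0.06, "ada": 0.0008}  # example rates quoted above

total_tokens = 1500  # e.g. 500 prompt tokens plus 1,000 completion tokens
for model, rate in rates_per_1k_usd.items():
    print(f"{model}: ${total_tokens / 1000 * rate:.4f}")
```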
12) Parameters are the learned parts of the model. For GPT-style models, which are transformer-based, the parameters are the weights and biases in the various layers of the model. During training, the model learns the best values for these parameters to predict the next token in the input text. The number of parameters in a model often correlates with its capacity to learn and represent complex patterns. For instance, GPT-3 has 175 billion parameters, BERT has 340 million, LaMDA has 137 billion, PaLM has 540 billion, and GPT-4 is estimated at around 1 trillion.13
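As a back-of-the-envelope illustration of where such parameter counts come from (not an exact architecture), the sketch below counts the weights and biases of a single fully connected layer sized roughly like one GPT-3 feed-forward layer; the layer widths are assumptions for illustration.

```python
# Back-of-the-envelope parameter count for one fully connected layer:
# a weight matrix of shape (inputs, outputs) plus one bias per output.
def dense_layer_params(n_inputs, n_outputs):
    return n_inputs * n_outputs + n_outputs

# A toy layer as wide as GPT-3's hidden size (12,288) feeding a layer four times wider.
hidden, expanded = 12_288, 4 * 12_288
print(f"{dense_layer_params(hidden, expanded):,} parameters in this single layer")
# Dozens of such layers, plus attention and embedding matrices, add up to billions of parameters.
```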
13) Bi-gram language models (LMs), that is, n-gram language models with n=2, are a good way to understand the inner workings of GPT-style LLMs. They also predict the next element of a sequence, whether a word, a token, or even a character, based only on the previous one, essentially considering pairs, or “bi-grams.” A character-based bi-gram model, for instance, would generate a text one character at a time (a runnable sketch follows at the end of this item).
a) Take, for example, the corpus “sunny day.”. The bi-gram model would create the following bi-grams:
i) s -> u (u comes after s),
ii) u -> n (n comes after u),
iii) n -> n (n comes after n),
iv) n -> y (y comes after n),
v) y -> (the space comes after y),
vi) -> d (d comes after the space),
vii) d -> a (a comes after d),
viii) a -> y (y comes after a),
ix) y -> . (the period comes after y)
c) The unique set of these characters is called the “vocabulary”. In this example the vocabulary consists of the unique characters “s”, “u”, “n”, “y”, “d”, “a”, the space, and “.”, so the “vocabulary size” is 8 characters. BERT has a vocabulary size of 30K tokens, the GPT-3 ada model 50K tokens, GPT-3 davinci 60K tokens, and babbage and curie 50K tokens.15
d) This information is enough to predict and generate the next character. If I started a prompt with an “s”, what would you complete it with if you were a bi-gram LM? “u”, of course.
e) You can keep generating until you reach “sun” without any doubt. Once you hit “n”, you encounter a problem because, given “n”, our simple model tells us that there are two possible options: either another “n” or a “y”. Which one would you choose if your logic were limited to the model?
f) This is where the probabilistic behavior of LLMs comes into the picture. In these cases the algorithm rolls a die to pick one. As you can imagine, only 50% of the time would it pick the correct letter. However, if we used a larger corpus such as “sunny day in ny.”, where “y” follows “n” two times but “n” follows “n” only once, then the probability, counting just those two continuations, would be distributed as roughly 66% for n->y and 33% for n->n.
g) As you can imagine, with a much larger corpus of tens of thousands of words, the characters that can come after “n” would range from “a” all the way to “z”, and even include other characters such as “:”, “.” and so on. These possibilities can be expressed as probabilities and represented in what we call a probability distribution. Then we again need to decide which one to pick. Because of how our language is formed, it is highly likely that some probabilities will always be larger than others, which allows us to make a choice. However, what if our corpus is biased, or what if the LLM architecture we picked really cannot store enough logic? This is why LLMs roll the dice over the probability distribution, weighted by those probabilities. For example, if n->y happens 4% of the time and n->n 1% of the time, then every time I run the LLM with this prompt there will be a higher chance of getting “y” after “n” rather than another “n”.
h) This is the fundamental reason why you can get a different completion from ChatGPT each time. But this issue is not specific to LLMs. Every ML model that has to cross the boundary from the probabilistic world to a deterministic one, where only one choice can exist, will encounter the same problem. Since we are not dealing with the quantum world or performing the Schrödinger's cat experiment, we
will always have this problem of choosing one final output over another. In cancer detection, for example, a model that outputs a probability still has to be turned into a single diagnosis.
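Putting item 13 together, here is a minimal character-level bi-gram model in Python built from the toy corpus “sunny day in ny.” used above. It counts character pairs, turns the counts after “n” into a probability distribution, and samples the next character with a weighted “dice roll”, as items f) and g) describe.

```python
import random
from collections import defaultdict, Counter

corpus = "sunny day in ny."

# 1) Count every character bi-gram (which character follows which).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# 2) The vocabulary is the set of unique characters in the corpus.
print("vocabulary:", sorted(set(corpus)))

# 3) Turn the counts into a probability distribution and sample the next character.
def sample_next(char):
    followers = counts[char]
    chars, weights = list(followers.keys()), list(followers.values())
    return random.choices(chars, weights=weights)[0]  # weighted "dice roll"

total_after_n = sum(counts["n"].values())
print({c: f"{n}/{total_after_n}" for c, n in counts["n"].items()})
print("sampled after 'n':", sample_next("n"))
```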
14) To provide some level of control to users, OpenAI provides two parameters: temperature and top_p. You can set them in your API call or try them out in the Playground at https://platform.openai.com/playground.
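If you are calling the API directly, both knobs are simply fields in the request body. The sketch below reuses the same placeholder model name and OPENAI_API_KEY assumption as the earlier examples.

```python
import os
import requests

# Minimal sketch: the same chat completions call as before, now with explicit sampling controls.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # example model name
        "messages": [{"role": "user", "content": "Name a sunny place."}],
        "temperature": 0.2,  # lower -> more focused, potentially more repetitive output
        "top_p": 0.9,        # sample only from the most probable 90% of next tokens
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```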
15) A low temperature makes the output more focused, potentially causing repetition by favoring the most likely next word. At a temperature of 1, the model uses the raw values directly, striking a balance between diversity and coherence. High temperatures increase output diversity but might lead to nonsensical outputs by giving more weight to less likely words. This happens because the raw values (logits) are divided by the temperature before the exponential function is applied, so larger temperature values yield smaller results pre-exponentiation and hence flatten the probability distribution, allowing multiple options to have similar probabilities. This is represented by the formula: softmax(x_i/T) = exp(x_i/T) / Σ exp(x_j/T) for all j.
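Here is a small NumPy sketch of that formula, showing how dividing the raw scores (logits) by the temperature sharpens or flattens the resulting probability distribution; the logit values are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """softmax(x_i / T): divide raw scores by T before exponentiating, then normalize."""
    scaled = np.array(logits) / T
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # made-up raw scores for three candidate tokens
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(logits, T), 2)}")
# Low T sharpens the distribution toward the top token; high T flattens it.
```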
16) Top_p, also known as nucleus sampling, adds an element of randomness instead of always selecting the most probable next word, making the generated text more diverse. A low top_p value means that the model will only consider a small subset of the most probable next words, leading to more focused and coherent, but potentially repetitive, outputs. At a top_p of 1, the model considers all possible next words, leading to more diverse outputs. When the top_p value is set to a specific fraction (say 0.9), the model dynamically selects the smallest set of next words whose cumulative probability exceeds this fraction. This means the model may consider a larger or smaller set of next words depending on their individual probabilities. This approach allows for more randomness than temperature scaling, but still places a higher likelihood on more probable words.
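The sketch below implements the selection rule just described: sort the candidate probabilities, keep the smallest set whose cumulative probability reaches top_p, renormalize within that set, and sample. The probability values are made up for illustration.

```python
import numpy as np

def top_p_sample(probs, top_p, rng):
    """Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability >= top_p."""
    probs = np.array(probs)
    order = np.argsort(probs)[::-1]                  # most probable tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens in the "nucleus"
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

rng = np.random.default_rng(0)
probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print("sampled token index:", top_p_sample(probs, top_p=0.9, rng=rng))  # only indices 0-2 qualify
```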
17) Emergent abilities of large language models are unpredictable behaviors or skills that are not present in smaller models but emerge in larger ones, and they cannot be easily predicted by simply extrapolating the performance of smaller models. For example, large language models have been shown to perform well on tasks that require reasoning or inference, such as question answering, even with limited or no task-specific training data. The existence of emergent abilities in large language models raises important questions about how much further the range of capabilities of these models can expand.16
18) Language model "hallucination" refers to when a model generates information not
present or implied in the input. This can occur due to biases in training data or limitations
in model architecture. For example, if you ask a language model "What is the color of
George Washington's smartphone?" it might respond, "George Washington had a yellow
smartphone." In reality, smartphones didn't exist in Washington's time, so this is a
hallucination: the model is inventing details that aren't factual, based on its training on
modern language and lack of deep understanding of historical context. Hallucination is
often defined as "generated content that is nonsensical or unfaithful to the provided
source content".17
References
1. https://platform.openai.com/docs/guides/completion
2. https://en.wikipedia.org/wiki/Bard_(chatbot)
3. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/, https://blog.google/technology/ai/bard-google-ai-search-updates/
4. https://en.wikipedia.org/wiki/Large_language_model
5. https://en.wikipedia.org/wiki/BERT_(language_model)
6. https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
7. https://en.wikipedia.org/wiki/Text_corpus
8. https://platform.openai.com/docs/api-reference/chat
9. https://platform.openai.com/docs/api-reference/completions
10. https://en.wikipedia.org/wiki/Lexical_analysis#Token
11. https://en.wikipedia.org/wiki/Large_language_model#Tokenization
12. https://platform.openai.com/docs/models/gpt-4
13. https://en.wikipedia.org/wiki/Large_language_model#List_of_large_language_models
14. https://en.wikipedia.org/wiki/N-gram_language_model
15. https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/tokens
16. https://openreview.net/forum?id=yzkSU5zdwD
17. https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)