
CME 295 – Transformers & Large Language Models    https://cme295.stanford.edu

VIP Cheatsheet: Transformers & Large Language Models

Afshine Amidi and Shervine Amidi

March 23, 2025

This VIP cheatsheet gives an overview of what is in the "Super Study Guide: Transformers & Large Language Models" book, which contains ∼600 illustrations over 250 pages and goes into the following concepts in depth. You can find more details at https://superstudy.guide.

1 Foundations

1.1 Tokens

❒ Definition – A token is an indivisible unit of text, such as a word, subword or character, and is part of a predefined vocabulary.

Remark: The unknown token [UNK] represents unknown pieces of text, while the padding token [PAD] is used to fill empty positions to ensure consistent input sequence lengths.

❒ Tokenizer – A tokenizer T divides text into tokens of an arbitrary level of granularity.

[Illustration: "this teddy bear is reaaaally cute" → T → "this teddy bear is [UNK] cute [PAD] ... [PAD]"]

Here are the main types of tokenizers:

| Type | Pros | Cons | Illustration |
|---|---|---|---|
| Word | Easy to interpret; short sequence | Large vocabulary size; word variations not handled | teddy bear |
| Subword | Word roots leveraged; intuitive embeddings | Increased sequence length; tokenization more complex | ted ##dy bear |
| Character | No out-of-vocabulary concerns | Much longer sequence length; patterns hard to interpret | t e d d y b e a r |
| Byte | Small vocabulary size | Much longer sequence length; patterns hard to interpret because too low-level | |

Remark: Byte-Pair Encoding (BPE) and Unigram are commonly-used subword-level tokenizers.
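As an illustration, here is a minimal sketch of subword tokenization with the Hugging Face transformers library (the bert-base-uncased WordPiece tokenizer and the sample sentence are illustrative choices, not prescribed by the cheatsheet):

```python
# Sketch: subword tokenization with a WordPiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Out-of-vocabulary strings are split into word roots and continuation
# pieces marked with "##".
print(tokenizer.tokenize("this teddy bear is reaaaally cute"))
# e.g. ['this', 'teddy', 'bear', 'is', 'rea', '##aa', '##ally', 'cute']
```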

1.2 Embeddings

❒ Definition – An embedding is a numerical representation of an element (e.g. token, sentence) and is characterized by a vector x ∈ R^n.

❒ Similarity – The cosine similarity between two tokens t1, t2 is quantified by:

similarity(t1, t2) = (t1 · t2) / (‖t1‖ ‖t2‖) = cos(θ) ∈ [−1, 1]

The angle θ characterizes the similarity between the two tokens:

[Illustration: "cute" and "teddy bear" are similar, "unpleasant" and "teddy bear" are dissimilar, "airplane" and "teddy bear" are independent]

Remark: Approximate Nearest Neighbors (ANN) and Locality Sensitive Hashing (LSH) are methods that approximate the similarity operation efficiently over large databases.
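A minimal NumPy sketch of this cosine similarity, using made-up 3-dimensional embeddings for illustration:

```python
import numpy as np

def cosine_similarity(t1: np.ndarray, t2: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

# Toy embeddings (made-up numbers; real embeddings have hundreds of dims).
cute = np.array([0.9, 0.1, 0.2])
teddy_bear = np.array([0.8, 0.2, 0.3])
print(cosine_similarity(cute, teddy_bear))  # close to 1: similar tokens
```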
2 Transformers

2.1 Attention

❒ Formula – Given a query q, we want to know which key k the query should pay "attention" to with respect to the associated value v.

[Illustration: over the sentence "a cute teddy bear is reading .", the query q_teddy bear is compared to the keys k_a, k_cute, k_teddy bear, k_is, k_reading, k_. to weight the associated values v_a, v_cute, v_teddy bear, v_is, v_reading, v_.]

Attention can be efficiently computed using matrices Q, K, V that contain queries q, keys k and values v respectively, along with the dimension d_k of keys:

attention = softmax(QK^T / √d_k) V
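A minimal NumPy sketch of this formula (single head, no masking):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V, with Q: (n_q, d_k), K: (n_k, d_k),
    V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) match scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values
```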
❒ MHA – A Multi-Head Attention (MHA) layer performs attention computations across multiple heads, then projects the result in the output space.

[Illustration: input queries, keys and values go through attention heads 1, ..., h, each with its own projections W_i^Q, W_i^K, W_i^V, before a final projection W^O produces the output]

It is composed of h attention heads as well as matrices W^Q, W^K, W^V that project the input to obtain queries Q, keys K and values V. The projection into the output space is done using matrix W^O.

Remark: Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) are variations of MHA that reduce computational overhead by sharing keys and values across attention heads.
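A minimal NumPy sketch of MHA, under the simplifying assumption that the per-head projections are stacked into square d_model × d_model matrices:

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // h                       # per-head dimension
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Reshape to (h, n, d_k) so each head attends independently.
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax per head
    heads = weights @ Vh                                   # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # output projection
```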




2.2 Architecture

❒ Overview – Transformer is a landmark model relying on the self-attention mechanism and is composed of encoders and decoders. Encoders compute meaningful embeddings of the input that are then used by decoders to predict the next token in the sequence.

[Illustration: encoders read "my teddy bear is cute ." (en-US) and decoders generate the translation starting from "[BOS] mon ours en peluche ..." (fr-FR)]

Remark: Although the Transformer was initially proposed as a model for translation tasks, it is now widely used across many other applications.

❒ Components – The encoder and decoder are two fundamental components of the Transformer and have different roles:

| Encoder | Decoder |
|---|---|
| Encoded embeddings encapsulate the meaning of the input | Decoded embeddings encapsulate the meaning of both the input and the output predicted so far |
| Self-Attention, then a Feed-Forward Neural Network, each followed by a residual connection | Masked Self-Attention, then Cross-Attention (queries from the decoder, keys and values from the encoder), then a Feed-Forward Neural Network, each followed by a residual connection |

❒ Position embeddings – Position embeddings inform where the token is in the sentence and are of the same dimension as the token embeddings. They can either be arbitrarily defined or learned from the data.

Remark: Rotary Position Embeddings (RoPE) are a popular and efficient variation that rotate query and key vectors to incorporate relative position information.
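As an example of the "arbitrarily defined" case, here is a sketch of the fixed sinusoidal position embeddings from the original Transformer paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_position_embeddings(n_positions: int, d_model: int):
    """(n_positions, d_model) matrix added to the token embeddings."""
    positions = np.arange(n_positions)[:, None]      # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
```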

2.3 Variants

❒ Encoder-only – Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based model composed of a stack of encoders that takes some text as input and outputs meaningful embeddings, which can later be used in downstream classification tasks.

[Illustration: N× encoders followed by Linear + Softmax read "[CLS] my teddy bear is cute" and predict "positive"]

A [CLS] token is added at the beginning of the sequence to capture the meaning of the sentence. Its encoded embedding is often used in downstream tasks, such as sentiment extraction.
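A sketch of extracting the encoded [CLS] embedding with the Hugging Face transformers library (bert-base-uncased is an illustrative choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("my teddy bear is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The tokenizer prepends [CLS]; its final hidden state (position 0) is
# what a downstream classification head would consume.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)
```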

❒ Decoder-only – Generative Pre-trained Transformer (GPT) is an autoregressive Transformer-based model that is composed of a stack of decoders. Contrary to BERT and its derivatives, GPT treats all problems as text-to-text problems.

[Illustration: N× decoders followed by Linear + Softmax read "[BOS] my teddy bear is" and predict "cute"]

Most of the current state-of-the-art LLMs rely on a decoder-only architecture, such as the GPT series, LLaMA, Mistral, Gemma, DeepSeek, etc.

Remark: Encoder-decoder models, like T5, are also autoregressive and share many characteristics with decoder-only models.
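A sketch of this autoregressive loop with a decoder-only model, using greedy decoding for simplicity (gpt2 is an illustrative choice; in practice generate() wraps this loop):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("my teddy bear is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                          # predict 5 tokens, one by one
        logits = model(ids).logits              # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(-1)   # most likely next token
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
print(tokenizer.decode(ids[0]))
```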
2.4 Optimizations

❒ Attention approximation – Attention computations are in O(n²), which can be costly as the sequence length n increases. There are two main methods to approximate computations:

• Sparsity: Self-attention does not happen through the whole sequence but only between more relevant tokens.

• Low-rank: The attention formula is simplified as the product of low-rank matrices, which brings down the computation burden.
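As a sketch of the sparsity idea, a sliding-window mask lets each token attend only to its w nearest neighbors, bringing the cost from O(n²) toward O(nw); the window pattern is one illustrative choice among many sparse patterns:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean (n, n) mask: token i may attend to token j iff |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=6, w=1)
# Disallowed positions get -inf before the softmax, so their weight is 0.
scores = np.where(mask, 0.0, -np.inf)
```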
❒ Flash attention – Flash attention is an exact method that optimizes attention computations by cleverly leveraging GPU hardware, using the fast Static Random-Access Memory (SRAM) for matrix operations before writing results to the slower High Bandwidth Memory (HBM).

Remark: In practice, this reduces memory usage and speeds up computations.




3 Large language models

3.1 Overview

❒ Definition – A Large Language Model (LLM) is a Transformer-based model with strong NLP capabilities. It is "large" in the sense that it typically contains billions of parameters.

❒ Lifecycle – An LLM is trained in 3 steps: pretraining, finetuning and preference tuning.

[Illustration: Pretraining (learn generalities about language) → Finetuning (learn specific tasks) → Preference tuning (demote bad answers)]

Finetuning and preference tuning are post-training approaches that aim at aligning the model to perform certain tasks.

3.2 Prompting

❒ Context length – The context length of a model is the maximum number of tokens that can fit into the input. It typically ranges from tens of thousands to millions of tokens.

❒ Decoding sampling – Token predictions are sampled from the predicted probability distribution p_i, which is controlled by the hyperparameter temperature T:

p_i = exp(x_i / T) / Σ_{j=1}^{n} exp(x_j / T)

[Illustration: the distribution becomes peaked when T ≪ 1 and flat when T ≫ 1]

Remark: High temperatures lead to more creative outputs whereas low temperatures lead to more deterministic ones.
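A minimal NumPy sketch of temperature-controlled sampling over toy next-token logits:

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    """Sample a token index from softmax(logits / T)."""
    z = logits / T
    p = np.exp(z - z.max())
    p /= p.sum()                       # predicted probability distribution
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])     # toy logits for a 3-token vocabulary
print(sample_with_temperature(logits, T=0.1, rng=rng))  # near-deterministic
print(sample_with_temperature(logits, T=2.0, rng=rng))  # more creative
```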

❒ Chain-of-thought – Chain-of-Thought (CoT) is a reasoning process in which the model breaks down a complex problem into a series of intermediate steps. This helps the model generate the correct final response. Tree of Thoughts (ToT) is a more advanced version of CoT.

Remark: Self-consistency is a method that aggregates answers across CoT reasoning paths.

3.3 Finetuning

❒ SFT – Supervised FineTuning (SFT) is a post-training approach that aligns the behavior of the model with an end task. It relies on high-quality input-output pairs aligned with the task.

Remark: If the SFT data is about instructions, then this step is called "instruction tuning".

❒ PEFT – Parameter-Efficient FineTuning (PEFT) is a category of methods used to run SFT efficiently. In particular, Low-Rank Adaptation (LoRA) approximates the learnable weights W ∈ R^{d×k} by fixing W_0 and learning low-rank matrices A ∈ R^{r×k} and B ∈ R^{d×r} instead, with rank r ≪ min(d, k):

W ≈ W_0 + BA

Remark: Other PEFT techniques include prefix tuning and adapter layer insertion.
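A minimal NumPy sketch of the LoRA decomposition; following the LoRA paper, B starts at zero so that training begins exactly at W_0:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                  # rank r << min(d, k)

W0 = rng.normal(size=(d, k))           # pretrained weights, kept frozen
B = np.zeros((d, r))                   # learnable, initialized to zero
A = rng.normal(size=(r, k))            # learnable

# Only A and B are trained: d*r + r*k parameters instead of d*k.
W = W0 + B @ A
```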
3.4 Preference tuning

❒ Reward model – A Reward Model (RM) is a model that predicts how well an output ŷ aligns with desired behavior given the input x. Best-of-N (BoN) sampling, also called rejection sampling, is a method that uses a reward model to select the best response among N generations:

k = argmax_{i ∈ [[1, N]]} r(x, ŷ_i)

[Illustration: x → f → ŷ_1, ŷ_2, ..., ŷ_N → RM → k]
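A sketch of BoN sampling; generate and reward are hypothetical stand-ins for an LLM sampler and a reward model:

```python
import numpy as np

def best_of_n(x, generate, reward, n):
    """Sample n candidate answers, keep the one the reward model prefers."""
    candidates = [generate(x) for _ in range(n)]
    rewards = [reward(x, y) for y in candidates]
    return candidates[int(np.argmax(rewards))]
```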

❒ Reinforcement learning – Reinforcement Learning (RL) is an approach that leverages an RM and updates the model f based on rewards for its generated outputs. If the RM is based on human preferences, this process is called Reinforcement Learning from Human Feedback (RLHF).

[Illustration: x → f → ŷ → RM → r(x, ŷ)]

Proximal Policy Optimization (PPO) is a popular RL algorithm that incentivizes higher rewards while keeping the model close to the base model to prevent reward hacking.

Remark: There are also supervised approaches, like Direct Preference Optimization (DPO), that combine RM and RL into one supervised step.

3.5 Optimizations

❒ Mixture of experts – A Mixture of Experts (MoE) is a model that activates only a portion of its neurons at inference time. It is based on a gate G and experts E_1, ..., E_n:

ŷ = Σ_{i=1}^{n} G(x)_i E_i(x)

MoE-based LLMs use this gating mechanism in their FFNNs.

Remark: Training an MoE-based LLM is notoriously challenging, as mentioned in the LLaMA paper, whose authors chose not to use this architecture despite its inference-time efficiency.
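A sketch of the gating mechanism with top-k routing (a common sparse variant; real MoE layers batch tokens and add load-balancing losses):

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """x: (d,); gate_W: (d, n_experts); experts: list of callables."""
    logits = x @ gate_W                      # gating scores G(x)
    top = np.argsort(logits)[-top_k:]        # indices of the active experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                             # renormalized gate weights
    # Only the selected experts are evaluated: sparse activation.
    return sum(w * experts[i](x) for w, i in zip(g, top))
```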

❒ Distillation – Distillation is a process where a (small) student model S is trained on the prediction outputs of a (big) teacher model T. It is trained using the KL divergence loss:

KL(ŷ_T ‖ ŷ_S) = Σ_i ŷ_T^(i) log(ŷ_T^(i) / ŷ_S^(i))

Remark: Training labels are considered "soft" labels since they represent class probabilities.
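A minimal NumPy sketch of this loss on toy class probabilities:

```python
import numpy as np

def kl_divergence(p_teacher, p_student):
    """KL(teacher || student) over class-probability ("soft") labels."""
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

p_T = np.array([0.7, 0.2, 0.1])   # teacher's predicted distribution
p_S = np.array([0.6, 0.3, 0.1])   # student's predicted distribution
print(kl_divergence(p_T, p_S))    # loss minimized during distillation
```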



❒ Quantization – Model quantization is a category of techniques that reduces the precision of model weights while limiting the impact on the resulting model's performance. As a result, this reduces the model's memory footprint and speeds up its inference.

Remark: QLoRA is a commonly-used quantized variant of LoRA.
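A sketch of one simple scheme, symmetric int8 quantization (real systems use more refined per-channel or 4-bit variants):

```python
import numpy as np

def quantize_int8(W):
    """Store int8 weights plus one float scale: ~4x smaller than float32."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale      # approximate original weights

W = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
W_q, s = quantize_int8(W)
print(np.abs(W - dequantize(W_q, s)).max())    # small quantization error
```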
4 Applications

4.1 LLM-as-a-Judge

❒ Definition – LLM-as-a-Judge (LaaJ) is a method that uses an LLM to score given outputs according to some provided criteria. Notably, it is also able to generate a rationale for its score, which helps with interpretability.

[Illustration: criteria "Cuteness" and item to score "Teddy bear" → LaaJ → rationale "Teddy bears are the cutest" and score "10/10"]

Contrary to pre-LLM era metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), LaaJ does not need any reference text, which makes it convenient for evaluating any kind of task. In particular, LaaJ shows strong correlation with human ratings when it relies on a big, powerful model (e.g. GPT-4), as it requires reasoning capabilities to perform well.

Remark: LaaJ is useful to perform quick rounds of evaluations, but it is important to monitor the alignment between LaaJ outputs and human evaluations to make sure there is no divergence.

❒ Common biases – LaaJ models can exhibit the following biases:

| | Position bias | Verbosity bias | Self-enhancement bias |
|---|---|---|---|
| Problem | Favors first position in pairwise comparisons | Favors more verbose content | Favors outputs generated by themselves |
| Solution | Average metric on randomized positions | Add a penalty on the output length | Use a judge built from a different base model |

A remedy to these issues can be to finetune a custom LaaJ, but this requires a lot of effort.

Remark: The list of biases above is not exhaustive.
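As an example of the position-bias mitigation above, a sketch that judges both orders and averages; judge is a hypothetical LaaJ call returning the preference for the first answer in [0, 1]:

```python
def judge_pair(judge, x, answer_a, answer_b):
    """Score answer_a against answer_b, averaged over both positions."""
    s_ab = judge(x, first=answer_a, second=answer_b)  # a shown first
    s_ba = judge(x, first=answer_b, second=answer_a)  # b shown first
    return (s_ab + (1 - s_ba)) / 2
```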

4.2 RAG

❒ Definition – Retrieval-Augmented Generation (RAG) is a method that allows the LLM to access relevant external knowledge to answer a given question. This is particularly useful if we want to incorporate information past the LLM's pretraining knowledge cut-off date.

[Illustration: Q → (Retriever over knowledge base D) → LLM → A]

Given a knowledge base D and a question, a Retriever fetches the most relevant documents, then Augments the prompt with the relevant information before Generating the output.

Remark: The retrieval stage typically relies on embeddings from encoder-only models.

❒ Hyperparameters – The knowledge base D is initialized by chunking the documents into chunks of size n_c and embedding them into vectors in R^d.
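A minimal NumPy sketch of the retrieve-then-augment steps, assuming the chunk and question embeddings already come from an encoder-only model:

```python
import numpy as np

def retrieve(question_emb, chunk_embs, chunks, k=3):
    """Return the k chunks most cosine-similar to the question embedding."""
    sims = chunk_embs @ question_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(question_emb))
    return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]

def build_prompt(question, retrieved):
    """Augment the prompt with the retrieved context before generation."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```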

4.3 Agents

❒ Definition – An agent is a system that autonomously pursues goals and completes tasks on a user's behalf. It may use different chains of LLM calls to do so.

❒ ReAct – Reason + Act (ReAct) is a framework that allows for multiple chains of LLM calls to complete complex tasks:

[Illustration: Input → (Observe → Plan → Act loop) → Output]

This framework is composed of the steps below (a minimal sketch follows the list):

• Observe: Synthesize previous actions and explicitly state what is currently known.

• Plan: Detail what tasks need to be accomplished and what tools to call.

• Act: Perform an action via an API or look for relevant information in a knowledge base.

Remark: Evaluating an agentic system is challenging. However, this can still be done both at the component level via local inputs and outputs, and at the system level via chains of calls.
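A heavily simplified sketch of such a loop; llm and tools are hypothetical stubs, and the plan is assumed to come back as "tool: input" or "FINISH":

```python
def react_loop(llm, tools, task, max_steps=5):
    """Alternate Observe / Plan / Act until the model decides to finish."""
    scratchpad = f"Task: {task}"
    for _ in range(max_steps):
        observation = llm(f"{scratchpad}\nObserve: state what is known.")
        plan = llm(f"{scratchpad}\nPlan: next 'tool: input', or FINISH.")
        if plan.startswith("FINISH"):
            break
        tool_name, tool_input = plan.split(":", 1)
        result = tools[tool_name.strip()](tool_input.strip())  # act via API
        scratchpad += (f"\nObserve: {observation}"
                       f"\nPlan: {plan}\nAct: {result}")
    return llm(f"{scratchpad}\nGive the final answer.")
```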
4.4 Reasoning models

❒ Definition – A reasoning model is a model that relies on CoT-based reasoning traces to solve more complex tasks in math, coding and logic. Examples of reasoning models include OpenAI's o series, DeepSeek-R1 and Google's Gemini Flash Thinking.

Remark: DeepSeek-R1 explicitly outputs its reasoning trace between <think> tags.

❒ Scaling – Two types of scaling methods are used to enhance reasoning capabilities:

| | Description | Illustration |
|---|---|---|
| Train-time scaling | Run RL for longer to let the model learn how to produce CoT-style reasoning traces before giving an answer | Performance increases with RL steps |
| Test-time scaling | Let the model think longer before providing an answer with budget forcing keywords such as "Wait" | Performance increases with CoT length |

