
CME 295 – Transformers & Large Language Models    https://cme295.stanford.edu

VIP Cheatsheet: Transformers & Large Language Models

Afshine Amidi and Shervine Amidi

March 23, 2025

This VIP cheatsheet gives an overview of what is in the "Super Study Guide: Transformers & Large Language Models" book, which contains ∼600 illustrations over 250 pages and goes into the following concepts in depth. You can find more details at https://superstudy.guide.

1 Foundations

1.1 Tokens

❒ Definition – A token is an indivisible unit of text, such as a word, subword or character, and is part of a predefined vocabulary.

Remark: The unknown token [UNK] represents unknown pieces of text, while the padding token [PAD] is used to fill empty positions to ensure consistent input sequence lengths.

❒ Tokenizer – A tokenizer T divides text into tokens of an arbitrary level of granularity.

[Illustration: "this teddy bear is reaaaally cute" → T → "this teddy bear is [UNK] cute [PAD] ... [PAD]"]

Here are the main types of tokenizers:

| Type | Pros | Cons | Illustration |
|---|---|---|---|
| Word | Easy to interpret; short sequence | Large vocabulary size; word variations not handled | teddy bear |
| Subword | Word roots leveraged; intuitive embeddings | Increased sequence length; tokenization more complex | ted ##dy bear |
| Character | No out-of-vocabulary concerns | Much longer sequence length; patterns hard to interpret | t e d d y b e a r |
| Byte | Small vocabulary size | Much longer sequence length; patterns hard to interpret because too low-level | |

Remark: Byte-Pair Encoding (BPE) and Unigram are commonly-used subword-level tokenizers.
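As an illustration, here is a minimal sketch of subword tokenization with the Hugging Face transformers library (the bert-base-uncased WordPiece tokenizer and the sample sentence are illustrative choices, not prescribed by the cheatsheet):

```python
# Sketch: subword tokenization with a WordPiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Out-of-vocabulary strings are split into word roots and continuation
# pieces marked with "##".
print(tokenizer.tokenize("this teddy bear is reaaaally cute"))
# e.g. ['this', 'teddy', 'bear', 'is', 'rea', '##aa', '##ally', 'cute']
```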

1.2 Embeddings

❒ Definition – An embedding is a numerical representation of an element (e.g. token, sentence) and is characterized by a vector x ∈ R^n.

❒ Similarity – The cosine similarity between two tokens t1, t2 is quantified by:

similarity(t1, t2) = (t1 · t2) / (‖t1‖ ‖t2‖) = cos(θ) ∈ [−1, 1]

The angle θ characterizes the similarity between the two tokens:

[Illustration: "cute" and "teddy bear" are similar, "unpleasant" and "teddy bear" are dissimilar, "airplane" and "teddy bear" are independent]

Remark: Approximate Nearest Neighbors (ANN) and Locality Sensitive Hashing (LSH) are methods that approximate the similarity operation efficiently over large databases.
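A minimal NumPy sketch of this cosine similarity, using made-up 3-dimensional embeddings for illustration:

```python
import numpy as np

def cosine_similarity(t1: np.ndarray, t2: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

# Toy embeddings (made-up numbers; real embeddings have hundreds of dims).
cute = np.array([0.9, 0.1, 0.2])
teddy_bear = np.array([0.8, 0.2, 0.3])
print(cosine_similarity(cute, teddy_bear))  # close to 1: similar tokens
```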
2 Transformers

2.1 Attention

❒ Formula – Given a query q, we want to know which key k the query should pay "attention" to with respect to the associated value v.

[Illustration: over the sentence "a cute teddy bear is reading .", the query q_teddy bear is compared to the keys k_a, k_cute, k_teddy bear, k_is, k_reading, k_. to weight the associated values v_a, v_cute, v_teddy bear, v_is, v_reading, v_.]

Attention can be efficiently computed using matrices Q, K, V that contain queries q, keys k and values v respectively, along with the dimension d_k of keys:

attention = softmax(QK^T / √d_k) V
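A minimal NumPy sketch of this formula (single head, no masking):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V, with Q: (n_q, d_k), K: (n_k, d_k),
    V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) match scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values
```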
❒ MHA – A Multi-Head Attention (MHA) layer performs attention computations across multiple heads, then projects the result in the output space.

[Illustration: input queries, keys and values go through attention heads 1, ..., h, each with its own projections W_i^Q, W_i^K, W_i^V, before a final projection W^O produces the output]

It is composed of h attention heads as well as matrices W^Q, W^K, W^V that project the input to obtain queries Q, keys K and values V. The projection into the output space is done using matrix W^O.

Remark: Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) are variations of MHA that reduce computational overhead by sharing keys and values across attention heads.
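A minimal NumPy sketch of MHA, under the simplifying assumption that the per-head projections are stacked into square d_model × d_model matrices:

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // h                       # per-head dimension
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Reshape to (h, n, d_k) so each head attends independently.
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax per head
    heads = weights @ Vh                                   # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # output projection
```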




2.2 Architecture

❒ Overview – Transformer is a landmark model relying on the self-attention mechanism and is composed of encoders and decoders. Encoders compute meaningful embeddings of the input that are then used by decoders to predict the next token in the sequence.

[Illustration: encoders read "my teddy bear is cute ." (en-US) and decoders generate the translation starting from "[BOS] mon ours en peluche ..." (fr-FR)]

Remark: Although the Transformer was initially proposed as a model for translation tasks, it is now widely used across many other applications.

❒ Components – The encoder and decoder are two fundamental components of the Transformer and have different roles:

| Encoder | Decoder |
|---|---|
| Encoded embeddings encapsulate the meaning of the input | Decoded embeddings encapsulate the meaning of both the input and the output predicted so far |
| Self-Attention, then a Feed-Forward Neural Network, each followed by a residual connection | Masked Self-Attention, then Cross-Attention (queries from the decoder, keys and values from the encoder), then a Feed-Forward Neural Network, each followed by a residual connection |

❒ Position embeddings – Position embeddings inform where the token is in the sentence and are of the same dimension as the token embeddings. They can either be arbitrarily defined or learned from the data.

Remark: Rotary Position Embeddings (RoPE) are a popular and efficient variation that rotate query and key vectors to incorporate relative position information.
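As an example of the "arbitrarily defined" case, here is a sketch of the fixed sinusoidal position embeddings from the original Transformer paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_position_embeddings(n_positions: int, d_model: int):
    """(n_positions, d_model) matrix added to the token embeddings."""
    positions = np.arange(n_positions)[:, None]      # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
```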

2.3 Variants

❒ Encoder-only – Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based model composed of a stack of encoders that takes some text as input and outputs meaningful embeddings, which can later be used in downstream classification tasks.

[Illustration: N× encoders followed by Linear + Softmax read "[CLS] my teddy bear is cute" and predict "positive"]

A [CLS] token is added at the beginning of the sequence to capture the meaning of the sentence. Its encoded embedding is often used in downstream tasks, such as sentiment extraction.
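A sketch of extracting the encoded [CLS] embedding with the Hugging Face transformers library (bert-base-uncased is an illustrative choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("my teddy bear is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The tokenizer prepends [CLS]; its final hidden state (position 0) is
# what a downstream classification head would consume.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)
```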

❒ Decoder-only – Generative Pre-trained Transformer (GPT) is an autoregressive Transformer-based model that is composed of a stack of decoders. Contrary to BERT and its derivatives, GPT treats all problems as text-to-text problems.

[Illustration: N× decoders followed by Linear + Softmax read "[BOS] my teddy bear is" and predict "cute"]

Most of the current state-of-the-art LLMs rely on a decoder-only architecture, such as the GPT series, LLaMA, Mistral, Gemma, DeepSeek, etc.

Remark: Encoder-decoder models, like T5, are also autoregressive and share many characteristics with decoder-only models.
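A sketch of this autoregressive loop with a decoder-only model, using greedy decoding for simplicity (gpt2 is an illustrative choice; in practice generate() wraps this loop):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("my teddy bear is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                          # predict 5 tokens, one by one
        logits = model(ids).logits              # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(-1)   # most likely next token
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
print(tokenizer.decode(ids[0]))
```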
2.4 Optimizations

❒ Attention approximation – Attention computations are in O(n²), which can be costly as the sequence length n increases. There are two main methods to approximate computations:

• Sparsity: Self-attention does not happen through the whole sequence but only between more relevant tokens.

• Low-rank: The attention formula is simplified as the product of low-rank matrices, which brings down the computation burden.
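As a sketch of the sparsity idea, a sliding-window mask lets each token attend only to its w nearest neighbors, bringing the cost from O(n²) toward O(nw); the window pattern is one illustrative choice among many sparse patterns:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean (n, n) mask: token i may attend to token j iff |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=6, w=1)
# Disallowed positions get -inf before the softmax, so their weight is 0.
scores = np.where(mask, 0.0, -np.inf)
```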
❒ Flash attention – Flash attention is an exact method that optimizes attention computations by cleverly leveraging GPU hardware, using the fast Static Random-Access Memory (SRAM) for matrix operations before writing results to the slower High Bandwidth Memory (HBM).

Remark: In practice, this reduces memory usage and speeds up computations.




3 Large language models

3.1 Overview

❒ Definition – A Large Language Model (LLM) is a Transformer-based model with strong NLP capabilities. It is "large" in the sense that it typically contains billions of parameters.

❒ Lifecycle – An LLM is trained in 3 steps: pretraining, finetuning and preference tuning.

[Illustration: Pretraining (learn generalities about language) → Finetuning (learn specific tasks) → Preference tuning (demote bad answers)]

Finetuning and preference tuning are post-training approaches that aim at aligning the model to perform certain tasks.

3.2 Prompting

❒ Context length – The context length of a model is the maximum number of tokens that can fit into the input. It typically ranges from tens of thousands to millions of tokens.

❒ Decoding sampling – Token predictions are sampled from the predicted probability distribution p_i, which is controlled by the hyperparameter temperature T:

p_i = exp(x_i / T) / Σ_{j=1}^{n} exp(x_j / T)

[Illustration: the distribution becomes peaked when T ≪ 1 and flat when T ≫ 1]

Remark: High temperatures lead to more creative outputs whereas low temperatures lead to more deterministic ones.
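A minimal NumPy sketch of temperature-controlled sampling over toy next-token logits:

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    """Sample a token index from softmax(logits / T)."""
    z = logits / T
    p = np.exp(z - z.max())
    p /= p.sum()                       # predicted probability distribution
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])     # toy logits for a 3-token vocabulary
print(sample_with_temperature(logits, T=0.1, rng=rng))  # near-deterministic
print(sample_with_temperature(logits, T=2.0, rng=rng))  # more creative
```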

❒ Chain-of-thought – Chain-of-Thought (CoT) is a reasoning process in which the model breaks down a complex problem into a series of intermediate steps. This helps the model generate the correct final response. Tree of Thoughts (ToT) is a more advanced version of CoT.

Remark: Self-consistency is a method that aggregates answers across CoT reasoning paths.

3.3 Finetuning

❒ SFT – Supervised FineTuning (SFT) is a post-training approach that aligns the behavior of the model with an end task. It relies on high-quality input-output pairs aligned with the task.

Remark: If the SFT data is about instructions, then this step is called "instruction tuning".

❒ PEFT – Parameter-Efficient FineTuning (PEFT) is a category of methods used to run SFT efficiently. In particular, Low-Rank Adaptation (LoRA) approximates the learnable weights W ∈ R^{d×k} by fixing W_0 and learning low-rank matrices A ∈ R^{r×k} and B ∈ R^{d×r} instead, with rank r ≪ min(d, k):

W ≈ W_0 + BA

Remark: Other PEFT techniques include prefix tuning and adapter layer insertion.
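A minimal NumPy sketch of the LoRA decomposition; following the LoRA paper, B starts at zero so that training begins exactly at W_0:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                  # rank r << min(d, k)

W0 = rng.normal(size=(d, k))           # pretrained weights, kept frozen
B = np.zeros((d, r))                   # learnable, initialized to zero
A = rng.normal(size=(r, k))            # learnable

# Only A and B are trained: d*r + r*k parameters instead of d*k.
W = W0 + B @ A
```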
3.4 Preference tuning

❒ Reward model – A Reward Model (RM) is a model that predicts how well an output ŷ aligns with desired behavior given the input x. Best-of-N (BoN) sampling, also called rejection sampling, is a method that uses a reward model to select the best response among N generations:

k = argmax_{i ∈ [[1, N]]} r(x, ŷ_i)

[Illustration: x → f → ŷ_1, ŷ_2, ..., ŷ_N → RM → k]
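A sketch of BoN sampling; generate and reward are hypothetical stand-ins for an LLM sampler and a reward model:

```python
import numpy as np

def best_of_n(x, generate, reward, n):
    """Sample n candidate answers, keep the one the reward model prefers."""
    candidates = [generate(x) for _ in range(n)]
    rewards = [reward(x, y) for y in candidates]
    return candidates[int(np.argmax(rewards))]
```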

❒ Reinforcement learning – Reinforcement Learning (RL) is an approach that leverages an RM and updates the model f based on rewards for its generated outputs. If the RM is based on human preferences, this process is called Reinforcement Learning from Human Feedback (RLHF).

[Illustration: x → f → ŷ → RM → r(x, ŷ)]

Proximal Policy Optimization (PPO) is a popular RL algorithm that incentivizes higher rewards while keeping the model close to the base model to prevent reward hacking.

Remark: There are also supervised approaches, like Direct Preference Optimization (DPO), that combine RM and RL into one supervised step.

3.5 Optimizations

❒ Mixture of experts – A Mixture of Experts (MoE) is a model that activates only a portion of its neurons at inference time. It is based on a gate G and experts E_1, ..., E_n:

ŷ = Σ_{i=1}^{n} G(x)_i E_i(x)

MoE-based LLMs use this gating mechanism in their FFNNs.

Remark: Training an MoE-based LLM is notoriously challenging, as mentioned in the LLaMA paper, whose authors chose not to use this architecture despite its inference-time efficiency.
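A sketch of the gating mechanism with top-k routing (a common sparse variant; real MoE layers batch tokens and add load-balancing losses):

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """x: (d,); gate_W: (d, n_experts); experts: list of callables."""
    logits = x @ gate_W                      # gating scores G(x)
    top = np.argsort(logits)[-top_k:]        # indices of the active experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                             # renormalized gate weights
    # Only the selected experts are evaluated: sparse activation.
    return sum(w * experts[i](x) for w, i in zip(g, top))
```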

❒ Distillation – Distillation is a process where a (small) student model S is trained on the prediction outputs of a (big) teacher model T. It is trained using the KL divergence loss:

KL(ŷ_T ‖ ŷ_S) = Σ_i ŷ_T^(i) log(ŷ_T^(i) / ŷ_S^(i))

Remark: Training labels are considered "soft" labels since they represent class probabilities.
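A minimal NumPy sketch of this loss on toy class probabilities:

```python
import numpy as np

def kl_divergence(p_teacher, p_student):
    """KL(teacher || student) over class-probability ("soft") labels."""
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

p_T = np.array([0.7, 0.2, 0.1])   # teacher's predicted distribution
p_S = np.array([0.6, 0.3, 0.1])   # student's predicted distribution
print(kl_divergence(p_T, p_S))    # loss minimized during distillation
```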



❒ Quantization – Model quantization is a category of techniques that reduces the precision of model weights while limiting the impact on the resulting model's performance. As a result, this reduces the model's memory footprint and speeds up its inference.

Remark: QLoRA is a commonly-used quantized variant of LoRA.
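A sketch of one simple scheme, symmetric int8 quantization (real systems use more refined per-channel or 4-bit variants):

```python
import numpy as np

def quantize_int8(W):
    """Store int8 weights plus one float scale: ~4x smaller than float32."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale      # approximate original weights

W = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
W_q, s = quantize_int8(W)
print(np.abs(W - dequantize(W_q, s)).max())    # small quantization error
```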
4 Applications

4.1 LLM-as-a-Judge

❒ Definition – LLM-as-a-Judge (LaaJ) is a method that uses an LLM to score given outputs according to some provided criteria. Notably, it is also able to generate a rationale for its score, which helps with interpretability.

[Illustration: criteria "Cuteness" and item to score "Teddy bear" → LaaJ → rationale "Teddy bears are the cutest" and score "10/10"]

Contrary to pre-LLM era metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), LaaJ does not need any reference text, which makes it convenient for evaluating any kind of task. In particular, LaaJ shows strong correlation with human ratings when it relies on a big, powerful model (e.g. GPT-4), as it requires reasoning capabilities to perform well.

Remark: LaaJ is useful to perform quick rounds of evaluations, but it is important to monitor the alignment between LaaJ outputs and human evaluations to make sure there is no divergence.

❒ Common biases – LaaJ models can exhibit the following biases:

| | Position bias | Verbosity bias | Self-enhancement bias |
|---|---|---|---|
| Problem | Favors first position in pairwise comparisons | Favors more verbose content | Favors outputs generated by themselves |
| Solution | Average metric on randomized positions | Add a penalty on the output length | Use a judge built from a different base model |

A remedy to these issues can be to finetune a custom LaaJ, but this requires a lot of effort.

Remark: The list of biases above is not exhaustive.
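As an example of the position-bias mitigation above, a sketch that judges both orders and averages; judge is a hypothetical LaaJ call returning the preference for the first answer in [0, 1]:

```python
def judge_pair(judge, x, answer_a, answer_b):
    """Score answer_a against answer_b, averaged over both positions."""
    s_ab = judge(x, first=answer_a, second=answer_b)  # a shown first
    s_ba = judge(x, first=answer_b, second=answer_a)  # b shown first
    return (s_ab + (1 - s_ba)) / 2
```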

4.2 RAG

❒ Definition – Retrieval-Augmented Generation (RAG) is a method that allows the LLM to access relevant external knowledge to answer a given question. This is particularly useful if we want to incorporate information past the LLM's pretraining knowledge cut-off date.

[Illustration: Q → (Retriever over knowledge base D) → LLM → A]

Given a knowledge base D and a question, a Retriever fetches the most relevant documents, then Augments the prompt with the relevant information before Generating the output.

Remark: The retrieval stage typically relies on embeddings from encoder-only models.

❒ Hyperparameters – The knowledge base D is initialized by chunking the documents into chunks of size n_c and embedding them into vectors in R^d.
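A minimal NumPy sketch of the retrieve-then-augment steps, assuming the chunk and question embeddings already come from an encoder-only model:

```python
import numpy as np

def retrieve(question_emb, chunk_embs, chunks, k=3):
    """Return the k chunks most cosine-similar to the question embedding."""
    sims = chunk_embs @ question_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(question_emb))
    return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]

def build_prompt(question, retrieved):
    """Augment the prompt with the retrieved context before generation."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```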

4.3 Agents

❒ Definition – An agent is a system that autonomously pursues goals and completes tasks on a user's behalf. It may use different chains of LLM calls to do so.

❒ ReAct – Reason + Act (ReAct) is a framework that allows for multiple chains of LLM calls to complete complex tasks:

[Illustration: Input → (Observe → Plan → Act loop) → Output]

This framework is composed of the steps below (a minimal sketch follows the list):

• Observe: Synthesize previous actions and explicitly state what is currently known.

• Plan: Detail what tasks need to be accomplished and what tools to call.

• Act: Perform an action via an API or look for relevant information in a knowledge base.

Remark: Evaluating an agentic system is challenging. However, this can still be done both at the component level via local inputs and outputs, and at the system level via chains of calls.
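A heavily simplified sketch of such a loop; llm and tools are hypothetical stubs, and the plan is assumed to come back as "tool: input" or "FINISH":

```python
def react_loop(llm, tools, task, max_steps=5):
    """Alternate Observe / Plan / Act until the model decides to finish."""
    scratchpad = f"Task: {task}"
    for _ in range(max_steps):
        observation = llm(f"{scratchpad}\nObserve: state what is known.")
        plan = llm(f"{scratchpad}\nPlan: next 'tool: input', or FINISH.")
        if plan.startswith("FINISH"):
            break
        tool_name, tool_input = plan.split(":", 1)
        result = tools[tool_name.strip()](tool_input.strip())  # act via API
        scratchpad += (f"\nObserve: {observation}"
                       f"\nPlan: {plan}\nAct: {result}")
    return llm(f"{scratchpad}\nGive the final answer.")
```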
4.4 Reasoning models

❒ Definition – A reasoning model is a model that relies on CoT-based reasoning traces to solve more complex tasks in math, coding and logic. Examples of reasoning models include OpenAI's o series, DeepSeek-R1 and Google's Gemini Flash Thinking.

Remark: DeepSeek-R1 explicitly outputs its reasoning trace between <think> tags.

❒ Scaling – Two types of scaling methods are used to enhance reasoning capabilities:

| | Description | Illustration |
|---|---|---|
| Train-time scaling | Run RL for longer to let the model learn how to produce CoT-style reasoning traces before giving an answer | Performance increases with RL steps |
| Test-time scaling | Let the model think longer before providing an answer with budget forcing keywords such as "Wait" | Performance increases with CoT length |

