LLM Cheatsheet

Large Language Models (LLMs)
Large neural networks (transformers with billions of parameters) trained at internet scale to estimate the probability of sequences of words. Abilities (and the computing resources needed) tend to rise with the number of parameters.
Ex: GPT, FLAN-T5, LLaMA, PaLM, BLOOM

USE CASES
– Standard NLP tasks (classification, summarization, etc.)
– Content generation
– Reasoning (Q&A, planning, coding, etc.)

In-context learning Specifying the task to perform directly in the prompt
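
For illustration only, a minimal sketch of in-context (few-shot) learning in Python: the task description and worked examples are supplied in the prompt string itself, with no fine-tuning. The reviews, labels, and wording below are invented.

    # Few-shot prompt: the task and example pairs live in the prompt itself.
    # The reviews and labels are made up for illustration.
    prompt = """Classify the sentiment of each review as Positive or Negative.

    Review: "The battery lasts all day, great purchase."
    Sentiment: Positive

    Review: "Stopped working after a week."
    Sentiment: Negative

    Review: "Setup was effortless and the screen is gorgeous."
    Sentiment:"""

    # Sent as-is to any LLM completion endpoint, the model is expected to
    # continue the pattern and answer "Positive".
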
Transformer components
Token Word or sub-word; the basic unit processed by transformers
Encoder Processes the input sequence to generate a vector representation (or embedding) for each token
Decoder Processes input tokens to produce new tokens
Embedding layer Maps each token to a trainable vector
Positional encoding Vector added to the token embedding vector to keep track of the token's position
Self-Attention Computes the importance of each word in the input sequence to all other words in the sequence
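
To make these components concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention over token embeddings with added positional encodings; the sizes and random weights are arbitrary stand-ins for trained parameters, not any particular model's.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8                       # 4 tokens, toy embedding size

    # Embedding layer output plus positional encoding (random stand-ins here)
    token_embeddings = rng.normal(size=(seq_len, d_model))
    positional_encoding = rng.normal(size=(seq_len, d_model))
    x = token_embeddings + positional_encoding

    # Learned query/key/value projections (random stand-ins)
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Scaled dot-product self-attention: importance of every token to every other token
    scores = Q @ K.T / np.sqrt(d_model)                                    # (seq_len, seq_len)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    output = weights @ V                                                   # context-aware representations

    print(weights.round(2))   # row i = attention of token i over all tokens
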
Decoder only = Autoregressive model
Ex: GPT, BLOOM
PRE-TRAINING OBJECTIVE To predict the next token based on the previous sequence of tokens (= Causal Language Modeling)
OUTPUT Next token
USE CASES Text generation
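
A toy sketch of the autoregressive loop: at each step the model only proposes a distribution over the next token given the tokens generated so far. The hand-written bigram table below stands in for a real decoder-only model and is invented for illustration.

    # Toy "model": next-token probabilities given only the previous token.
    next_token_probs = {
        "<start>": {"the": 0.6, "a": 0.4},
        "the":     {"cat": 0.5, "dog": 0.3, "moon": 0.2},
        "a":       {"cat": 0.5, "dog": 0.5},
        "cat":     {"sleeps": 0.7, "<end>": 0.3},
        "dog":     {"barks": 0.8, "<end>": 0.2},
        "moon":    {"<end>": 1.0},
        "sleeps":  {"<end>": 1.0},
        "barks":   {"<end>": 1.0},
    }

    sequence = ["<start>"]
    while sequence[-1] != "<end>":
        probs = next_token_probs[sequence[-1]]   # distribution over the next token
        next_token = max(probs, key=probs.get)   # greedy pick; see sampling techniques below
        sequence.append(next_token)

    print(" ".join(sequence[1:-1]))              # -> "the cat sleeps"
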
Encoder-Decoder = Seq-to-seq model
Ex: T5, BART
PRE-TRAINING OBJECTIVE Varies from model to model (e.g., span corruption like T5)
OUTPUT Sentinel token + predicted tokens
USE CASES Translation, Q&A, summarization
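
A rough illustration of T5-style span corruption, loosely following the example in the T5 paper: random spans of the input are replaced with sentinel tokens, and the training target is each sentinel followed by the text it replaced. The spans chosen here are arbitrary.

    original = "Thank you for inviting me to your party last week"

    # Input with two corrupted spans replaced by sentinel tokens
    corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week"

    # Target: each sentinel followed by the text it replaced, with a final sentinel marking the end
    target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
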
TECHNIQUES TO CONTROL RANDOM SAMPLING
Random sampling The model chooses an output word at random, using the probability distribution to weight the selection (could be too creative)
– Top K The next token is drawn from the k tokens with the highest probabilities
– Top P The next token is drawn from the tokens with the highest probabilities whose combined probabilities exceed p
– Temperature Influences the shape of the probability distribution through a scaling factor in the softmax layer
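
A compact NumPy sketch of how these knobs reshape the next-token distribution before sampling; the vocabulary, logits, and cut-off values are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = np.array(["cake", "donut", "banana", "apple", "juice"])
    logits = np.array([2.0, 1.5, 0.3, 0.2, -1.0])        # made-up raw model scores

    def softmax_with_temperature(logits, temperature=1.0):
        # Temperature rescales the logits before the softmax: <1 sharpens, >1 flattens
        scaled = logits / temperature
        exp = np.exp(scaled - scaled.max())
        return exp / exp.sum()

    def sample_top_k(probs, k=2):
        # Keep only the k most probable tokens, renormalize, then sample
        top = np.argsort(probs)[::-1][:k]
        return vocab[rng.choice(top, p=probs[top] / probs[top].sum())]

    def sample_top_p(probs, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability reaches p
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        keep = order[:cutoff]
        return vocab[rng.choice(keep, p=probs[keep] / probs[keep].sum())]

    probs = softmax_with_temperature(logits, temperature=0.7)
    print(sample_top_k(probs, k=2))
    print(sample_top_p(probs, p=0.9))
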
© 2024 Dataiku