LLM Foundation Models from the Ground Up
Language Models
Foundation Models from the Ground Up
Matei Zaharia
Co-founder & CTO of Databricks
Associate Professor of Computer Science
at UC Berkeley
Dolly, Stable Diffusion, MPT
● Facebook Llama (February 24, 2023): “Smaller, more performant models such as Llama … democratizes access in this important, fast-changing field.”
● Stanford Alpaca (March 13, 2023): “Alpaca behaves qualitatively similarly to OpenAI … while being surprisingly small and easy/cheap to reproduce.”
● Databricks Dolly (March 24, 2023): “Dolly will help democratize LLMs, transforming them into a commodity every company can own and customize.”
● Mosaic MPT (May 5, 2023): “MPT-7B is trained from scratch on 1T tokens … is open source, available for commercial use, and matches the quality of LLaMA-7B.”
● TII Falcon (May 24, 2023): “Falcon significantly outperforms GPT-3 for … 75% of the training compute budget, and … a fifth of the compute at inference time.”
● Constant development
● Supported by community and industry
Source: Reddit
Where is the neural network? Let's look at the Transformer Block.
[Diagram: the transformer block. An input token vector (Preparation) is enriched through Attention, a Fully Connected Neural Network, and Normalization with a Residual Connection (Enrichment); the output vector is then mapped to a token from the model's vocabulary.]
Source: ArXiv
Attention Mechanism
How important is each word to every other word?
● Allows the building-up of enriched vectors with more context and logic.

Fully Connected Neural Network (FFN): Output Token Selection
● Uses the information gathered from the attention stage for each element in the sequence and translates it to a further enriched state.

Layer Normalization
● Normalizes the inputs across the features rather than the batch (see the sketch below).
● Stabilizes the network's training by ensuring a consistent input distribution, which is crucial in tasks with varying sequence lengths.

Input Token Vector
● This new sequence, with each element a dense vector, is then passed to the first transformer block.
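To make "normalizes across the features rather than the batch" concrete, here is a minimal NumPy sketch of layer normalization (the sizes and epsilon are illustrative assumptions, with trivially initialized gain and bias):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector over its feature dimension (the last axis),
    # not over the batch dimension as batch norm would.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta              # learned scale and shift

batch, seq_len, d_model = 2, 4, 8            # illustrative sizes
x = np.random.randn(batch, seq_len, d_model)
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1).round(6))            # ~0 per token, regardless of sequence length
```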
Transformer Architectures
“Encoders, decoders, T5, oh my!”
Encoder-Decoder Models
Defining Features:
● Two sets of transformer blocks
○ Encoder blocks
○ Decoder blocks
○ Cross-attention
● Typical Uses:
○ Translation
○ Conversion
Source: ArXiv
B is for BERT
Encoder-only models
Novel Features of BERT:
• Segment embeddings ([CLS] and [SEP] special tokens)
Source: IllustratedTransformer
We use three large (millions of elements) matrices to create the Query, Key, and Value vectors in each layer.

1) The input vector (in the first layer, this is the word embedding vector with the position information) is used to create three new vectors:
   a) the Query (Q) vector - made from the current token
   b) the Key (K) vector - made from all other tokens in the sequence
   c) the Value (V) vector - made from all tokens in the sequence

2) A scaled dot-product is then performed on the Query and Key vectors; this is the attention score. A softmax function then produces a final set of Attention Weights that are scaled 0-1.

3) The Attention Weights are multiplied element-wise with the Value vectors and summed to produce the Output Vector.

The Attention equation can be expressed as an operation over all the vectors:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
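As a companion to the equation above, a minimal single-head NumPy sketch of scaled dot-product attention (random toy matrices; a real layer would learn W_Q, W_K, and W_V):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # attention scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> weights in [0, 1]
    return weights @ V, weights                        # weighted sum of Value vectors

seq_len, d_model, d_k = 5, 16, 8                       # toy sizes
x = np.random.randn(seq_len, d_model)                  # input token vectors
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
output, attn_weights = scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)
print(output.shape, attn_weights.shape)                # (5, 8) (5, 5)
```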
• Available Data
  • Task-relevant data
  • Good language coverage
• Architecture:
  • Layer count
  • Context size
Model families: Autoencoding models, Sequence-to-sequence models, Autoregressive models
Source: GitHub, MosaicML

Fine tuning and other methods are required to ensure task success.
• ChatGPT was originally built atop a large language model known as GPT-3.5.

[Figure: model scale comparison - 12x blocks with 512-dimension embeddings, 48x blocks with 1024-dimension embeddings, 96x blocks with 2048-dimension embeddings. Source: LinkedIn]
Generative Pre-trained Transformers (GPT)
Generational Improvements
GPT (2018):
It was pre-trained on the BooksCorpus dataset and contained 117 million parameters. GPT was an autoregressive
model based on the Transformer architecture and employed a unidirectional language modeling objective.
-----------------------------------------------------------------------
GPT-2 (2019):
Pre-trained on a much larger dataset called WebText, roughly 40 GB of text scraped from web pages. GPT-2 came in four different sizes: 117M (small), 345M (medium), 774M (large), and 1.5B (extra-large) parameters.
GPT-3 (2020):
Pre-trained on an even larger portion of the internet (including the WebText2 dataset), roughly 45 terabytes of raw text.
GPT-3 came with a staggering 175 billion parameters.
Exceptional few-shot and zero-shot learning capabilities allow it to perform well on various tasks with minimal or no fine-tuning.
-----------------------------------------------------------------------------------
GPT-4 (2023):
Specific details about the architecture, dataset, and parameter count for GPT-4 have not been
officially released. Likely ~1T parameters in an ensemble of smaller models of ~220B parameters
each.
Just as convolutional layers learn simple features such as edges and lines in earlier layers and more complex visual features in later layers, attention layers work in a similar fashion:
3. Late Attention: integrates the lower layers and generates coherent and contextual outputs.
   • High-level abstractions
   • Discourse structure
   • Sentiment
   • Complex long-range dependencies
Source: Wikipedia
Why so many parameters?
GPT-2 small vs. GPT-2 extra-large (1.5B parameters):
● Layers: 12 → 48 (4x)
● Model dimensionality: 768 → 1600 (~2x)
● Attention heads per layer: 12 → 25 (~2x)
Source: IllustratedGPT2
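A back-of-the-envelope sketch of where the 1.5B figure comes from (it ignores biases, layer norms, and positional embeddings, and assumes the GPT-2 vocabulary size, so the totals are approximate):

```python
# Rough transformer parameter-count estimate (a sketch, not exact bookkeeping).
def approx_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> int:
    embeddings = vocab_size * d_model          # token embedding matrix
    attention = 4 * d_model ** 2               # Wq, Wk, Wv, Wo per layer
    ffn = 2 * d_model * (4 * d_model)          # two projections, 4x hidden width
    return embeddings + n_layers * (attention + ffn)

print(f"GPT-2 small ~ {approx_params(12, 768) / 1e6:.0f}M parameters")    # ~124M
print(f"GPT-2 XL    ~ {approx_params(48, 1600) / 1e9:.2f}B parameters")   # ~1.55B
```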
Training GPT
GPT-1: BooksCorpus
● Over 7,000 unpublished books spanning various genres.
● 800 million words, covering a wide range of topics and styles.
GPT-2: WebText
GPT-3: WebText2
● Even larger than WebText and more diverse
● The increased size and diversity of the WebText2 dataset contribute to
GPT-3's impressive performance on various NLP tasks.
Comparing LLM Architectures

Encoder-only (autoencoding) models
Pros:
● Pre-trained on a masked language modeling task; bidirectional context gives a deeper understanding of input sequences and leads to strong feature extraction capabilities.
● Typically several orders of magnitude smaller than decoder or seq-2-seq models.
Cons:
● Not ideal for text generation tasks due to their bidirectional nature.
● Can have higher computational costs compared to decoder-only models.

Decoder-only (autoregressive) models
Pros:
● Autoregressive nature makes them well-suited for text generation tasks.
● Can generate coherent and contextually relevant text.
Cons:
● Only capture unidirectional context (left-to-right), which can limit their contextual understanding.
● Autoregressive generation can be slow due to sequential token prediction.

Encoder-decoder (sequence-to-sequence) models
Pros:
● Encoder-decoder architecture allows for better handling of complex, structured input-output relationships.
● Attention mechanisms help weigh the importance of different parts of the input when generating the output.
Cons:
● Can require more training data and computational resources compared to encoder-only or decoder-only models.
● Can be more complex to train and fine-tune due to the two-part architecture.
• Learn what parameter-efficient fine-tuning is and what the popular strategies are
• Transfer learning
• Apply a general pre-trained model to a new, but related task
• Fine tuning
• Use a general pre-trained model and then train that model further
Source: tensorflow.org; interview with Andrej Karpathy and Andrew Ng
How to leverage a pre-trained foundation model?
• Improve performance
  • Different pre-trained vs. fine-tuned tasks
  • Different domains
• Ensure regulatory compliance
• Not new: the ULMFiT paper in 2018
[Figure: full fine-tuning vs. fine-tuning only a downstream classifier. Source: Howard and Ruder 2018]
Fine tune = update foundation model weights
AKA parameter fine-tuning

In-context learning = not updating model weights, just prompting:

Prompt:
[Tweet]: …
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:
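A minimal sketch of sending the same in-context prompt to an off-the-shelf model with Hugging Face transformers (the model choice and generation settings are illustrative; no weights are updated):

```python
from transformers import pipeline

prompt = """[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:"""

# Any causal LM works here; gpt2 is just a small, convenient stand-in.
generator = pipeline("text-generation", model="gpt2")
completion = generator(prompt, max_new_tokens=3, do_sample=False)
print(completion[0]["generated_text"][len(prompt):])   # the model's predicted label text
```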
Pros and cons of X-shot learning
Also known as in-context learning

Pros:
• No need for huge labeled training data
• No need to create a copy of the model for each task
• Simplify model serving
• Text prompts feel interpretable

Cons:
• Manual prompt engineering
• Prompts are specific to models
• Context length limitation
  • Add more examples? Less space for instructions
  • Longer context = higher latency
  • LLMs forget the middle portion (Liu et al 2023, released in July)
  • A longer context window is not the solution!
• Performance might still be lackluster
• Instruction fine-tuned
• Multi-task serving
[Figure: Task 1 - Addition; Task 2 - Subtraction, using natural language (NL); Task 3 - Multiplication, mix of NL + mathematical symbols; Task 4 - Division, mix of NL + mathematical symbols]
Instruction-tuned, multi-task LLM
Instruction-tuned = tune general purpose LLMs to follow instructions
• Examples
• T5 -> FLAN-T5
• PaLM -> FLAN-PaLM
• Implementation:
• Acts on the core Transformer block
• Basic multi-head attention and/or feed forward network
• Some act specifically on the weight matrices: Query, Key, Value
• These matrices pass information from one token to another
Q, K, V weight matrices
Source: Vaswani et al 2017
• Analogy:
• Bitcoin: We can’t touch it like cash. We don’t know how it
“looks”, but it exists and works.
• SuperGLUE (2019)
• Styled after GLUE, but more difficult
and diverse
• Boolean questions, comprehension,
etc.
In this example: (virtual) prompt length = 2
• Instability
• Larger models do fine with a prompt length of 1
Source: Lightning AI
• Row rank: 1
  • 2nd row = 3x 1st row
• Column rank: 1
  • 2nd column = 2x 1st column
  • 3rd column = 2x 2nd column
  (e.g. the matrix [[1, 2, 4], [3, 6, 12]] satisfies both)
• W_delta = W_a x W_b (a product of two low-rank matrices)
• 37.7M trainable / 175,255.8M total ≈ 0.0002 = 0.02% of parameters!
Source: Hu et al 2021
Wq = query
Wk = key
Wv = value
Wo = output
Source: Hu et al 2021
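A minimal NumPy sketch of the LoRA idea applied to one weight matrix (toy sizes; the real method trains W_a and W_b by gradient descent while the pretrained weight stays frozen):

```python
import numpy as np

d, r = 512, 8                        # model dim and LoRA rank (toy values)
W_q = np.random.randn(d, d)          # frozen pretrained query weight matrix

# Trainable low-rank factors: only 2 * d * r parameters instead of d * d.
W_a = np.random.randn(d, r) * 0.01
W_b = np.zeros((r, d))               # start at zero so W_delta is zero initially

W_delta = W_a @ W_b                  # low-rank update, rank <= r
W_q_adapted = W_q + W_delta          # used at inference; can be merged into W_q

trainable = W_a.size + W_b.size
print(f"trainable fraction: {trainable / W_q.size:.4%}")   # ~3% at this toy size; ~0.02% at GPT-3 scale
```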
• Future research
• From LoRA authors: If Wdelta is rank-deficient, is W too?
• Newer PEFT technique: IA3 (2022)
• Trains even fewer parameters than LoRA!
Colossal Clean Crawled Corpus (C4)
Source: Wu et al 2023
Source: OpenAI
Source: Zhou et al 2023
Data preparation best practices
• Develop an understanding of the utility of quantization in LLMs for both training and
inferencing
What if we could improve the speed and footprint while preserving quality?
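As a warm-up for that question, a minimal sketch of symmetric 8-bit weight quantization (the scheme and scale choice are simplified compared to what libraries such as bitsandbytes actually do):

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.1f} MB  ->  int8: {q.nbytes / 1e6:.1f} MB")   # ~4x smaller
print(f"mean abs error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```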
Improving Learning Efficiency
How can we train and fine-tune better?
Attention is all you need… to fix! And just with a linear bias (ALiBi).

Observation:
○ The fastest memory in the GPU is SRAM
○ Longer context = larger attention matrices
Problem:
○ SRAM is small relative to the attention matrix needed in the calculations
Solution: Flash Attention!
○ Attention computations are tiled and recomputed on the fly, so the full attention matrix is never materialized
○ More time is spent in SRAM, giving a massive performance boost
Source: Flash Attention
Many queries, fewer keys
Multi-query and Grouped-query Attention
● Multi-Headed Attention
  #Queries = #Keys = #Values
  Each head can focus on different parts of language.
  Inference: slow, accurate.
● Multi-Query Attention
  Many Queries, 1 Key, 1 Value
  Forces the model to rely on its different queries, since keys and values are shared.
  Inference: fast, less accurate.
● Grouped-Query Attention
  #Keys = #Values = #Queries / n (each group of n query heads shares one key/value head)
  Inference: fast, accurate.
Source: Grouped Query Attention
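A minimal shape sketch of grouped-query attention (all sizes are hypothetical; 4 query heads share each key/value head):

```python
import numpy as np

seq_len, d_head = 8, 64
n_q_heads, n_kv_heads = 16, 4
q = np.random.randn(n_q_heads, seq_len, d_head)
k = np.random.randn(n_kv_heads, seq_len, d_head)     # far fewer key/value heads to cache
v = np.random.randn(n_kv_heads, seq_len, d_head)

# Broadcast the shared key/value heads so each query head has one to attend with.
group = n_q_heads // n_kv_heads
k = np.repeat(k, group, axis=0)                      # (16, 8, 64)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (16, 8, 8)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v                                     # (16, 8, 64)
print(out.shape)
```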
Mixture-of-Experts (MoE):
● Input is sent to a router
● Multiple expert networks are trained; the router selects which expert(s) handle each input

Switch Transformer:
● Application of MoE
● The position-wise FFN is multiplied into several expert FFNs
● A single attention network is kept

LLM cascade:
● Send prompts to the smallest models first
● Gather the confidence of the response and escalate to a larger model only when confidence is low (see the sketch below)

Fine-tuning/Inferencing:
● LoRA/QLoRA
● FrugalGPT
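A minimal sketch of an LLM cascade in the FrugalGPT spirit (the model list, scoring, and threshold are hypothetical placeholders, not FrugalGPT's actual components):

```python
# Hypothetical callables: each takes a prompt and returns (answer, confidence in [0, 1]).
# In practice these would wrap real model endpoints plus a learned scoring function.
def small_model(prompt):  return "answer-small", 0.55
def medium_model(prompt): return "answer-medium", 0.80
def large_model(prompt):  return "answer-large", 0.97

CASCADE = [small_model, medium_model, large_model]   # cheapest first

def cascade_answer(prompt, threshold=0.75):
    # Try the smallest model first; escalate only when confidence is too low.
    for model in CASCADE:
        answer, confidence = model(prompt)
        if confidence >= threshold:
            return answer
    return answer                                    # fall back to the largest model's answer

print(cascade_answer("What is 2 + 2?"))
```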
• Understand how transformers accept non-text inputs, e.g. images and audio
Source: OpenAI
Multi-modality mirrors how we perceive info
More user-friendly, flexible, and capable
HuggingGPT Demo
[HuggingGPT demo screenshots, steps 1-8. Figures: music notes, audio, and images fed through self-attention; zero-shot vs. few-shot learning. Source: KDNuggets]
Colored images are 3-D tensors
Grayscale images are 2-D tensors: all 3 channels have the same value
[Figure: a 3-D image tensor with width, height, and channel dimensions]
Limitations:
• Lose vertical spatial relationships
• Memory and computational requirements scale quadratically with sequence length, O(N²)
Source: Chen et al 2021; David Coccomini 2021
Vision Transformer (ViT)
Computes attention on patches of images: image-to-patch embeddings
[Figure: one pixel vs. one 16x16-pixel patch]
ViT: An image is worth 16x16 words
● Split the image into N input patches, each with shape 3 x Pixel (P) x P.
● Linearly project each patch to a D-dimensional vector: the D-dim patch embedding (patches 1, 2, …, 16; see the shape sketch below).
● Add a learned D-dim positional embedding to each patch embedding.
● Prepend CLS, a special classification token.
● Feed the resulting sequence through the original Transformer.
● The output is a C-dim classifier output vector, where C = # classes (e.g. Label = "cat").
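A minimal NumPy sketch of the ViT patch-embedding steps (hypothetical sizes: a 64x64 RGB image, 16x16 patches, D = 128; just the shape bookkeeping, not the official ViT code):

```python
import numpy as np

H = W = 64
P = 16                       # patch size in pixels
D = 128                      # embedding dimension
N = (H // P) * (W // P)      # number of patches (here 16)

image = np.random.rand(H, W, 3)

# Split the image into N patches and flatten each to a 3*P*P vector.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, 3 * P * P)

# Linearly project each patch to a D-dimensional embedding.
W_proj = np.random.randn(3 * P * P, D) * 0.02
patch_embeddings = patches @ W_proj                           # (N, D)

# Add learned positional embeddings and prepend the CLS token.
pos_embeddings = np.random.randn(N + 1, D) * 0.02
cls_token = np.random.randn(1, D) * 0.02
tokens = np.concatenate([cls_token, patch_embeddings], axis=0) + pos_embeddings
print(tokens.shape)          # (N + 1, D) -> fed to the Transformer encoder
```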
ViT only outperforms ResNets on larger datasets
More computationally efficient than ResNet
• Data2vec (Meta AI, 2022)
  • A self-supervised algorithm for speech, vision, and text
  • The only one so far?
Source: Mehrish et al 2023
Training Data for MLLMs
Source: WebVid
Source: VideoChat
Source: CC-3M
Source: LLaVa
Instruction-tuned, model-generated data
Actually: manually design examples first, then ask model to generate more
Disclaimer: Contains mostly copyrighted images. LAION doesn't claim ownership.
Source: LAION
©2023 Databricks Inc. — All rights reserved
X-shot learning for MLLMs
Zero-shot learning and few-shot learning at test time
Big limitation:
• Inflexible
• Cannot generate text
Source: OpenAI
Few-shot, in-context: Flamingo
Unifies treatment of high-dimensional video and image inputs
Architecture highlight 1: Allows interleaved multi-modal inputs
Architecture highlight 2: Perceiver resampler
Architecture highlight 3: Interleaves cross-attention with language-only self-attention layers
Source: Alayrac et al 2022
Flamingo bridges vision and language models
A vision encoder (similar to CLIP) + Chinchilla (language) accept interleaved inputs.
The Perceiver resampler maps a variable-size grid of inputs to a fixed number of output tokens.
[Figure: queries attending over variable-sized inputs. Source: Alayrac et al 2022]
Outperforms 6 out of 16 SOTA fine-tuned models
Curated 3 high-quality datasets: LTIP (Long Text-Image Pairs), VTP
(Video-Text Pairs), and MultiModal Massive Web (M3W)
• Hallucination
• Prompt sensitivity
• Context limit
• Inference compute cost
• Bias, toxicity, etc.
• Copyright issues
Source: Alayrac et al 2022 (Flamingo)
Source: laion.ai
Source: New York Times, April 2023
Source: Next Shark, X Post
MLLMs can lack common sense (like LLMs)
GPT-3 completion
Input: You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can't smell anything. You are very thirsty. So you
Completion: drink it. You are now dead.
Source: Robust AI and NYU
Attention may not be forever
What may remain or rise?
Source: Li et al 2022
Prompt: "It's an apocalypse. Describe how you survive and make allies"
● Lower-resource languages