LLM Foundation Models from the Ground Up
Language Models
Foundation Models from the Ground Up
Matei Zaharia
Co-founder & CTO of Databricks
Associate Professor of Computer Science
at UC Berkeley
Dolly, Stable Diffusion, MPT
● Facebook Llama (February 24, 2023): “Smaller, more performant models such as Llama … democratizes access in this important, fast-changing field.”
● Stanford Alpaca (March 13, 2023): “Alpaca behaves qualitatively similarly to OpenAI … while being surprisingly small and easy/cheap to reproduce.”
● Databricks Dolly (March 24, 2023): “Dolly will help democratize LLMs, transforming them into a commodity every company can own and customize.”
● Mosaic MPT (May 5, 2023): “MPT-7B is trained from scratch on 1T tokens … is open source, available for commercial use, and matches the quality of LLaMA-7B.”
● TII Falcon (May 24, 2023): “Falcon significantly outperforms GPT-3 for … 75% of the training compute budget, and … a fifth of the compute at inference time.”
● Constant development
● Supported by community and industry
Source: Reddit
Where is the neural network? Let's look at the Transformer Block.
[Diagram: the transformer block. An input token vector (Preparation) is enriched through Attention, a Fully Connected Neural Network, and Normalization with a Residual Connection (Enrichment); the output vector is then mapped to a token from the model's vocabulary.]
Source: ArXiv
Attention Mechanism
How important is each word to every other word?
● Allows the building-up of enriched vectors with more context and logic.

Fully Connected Neural Network (FFN): Output Token Selection
● Uses the information gathered from the attention stage for each element in the sequence and translates it to a further enriched state.

Layer Normalization
● Normalizes the inputs across the features rather than the batch (see the sketch below).
● Stabilizes the network's training by ensuring a consistent input distribution, which is crucial in tasks with varying sequence lengths.

Input Token Vector
● This new sequence, with each element a dense vector, is then passed to the first transformer block.
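To make "normalizes across the features rather than the batch" concrete, here is a minimal NumPy sketch of layer normalization (the sizes and epsilon are illustrative assumptions, with trivially initialized gain and bias):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector over its feature dimension (the last axis),
    # not over the batch dimension as batch norm would.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta              # learned scale and shift

batch, seq_len, d_model = 2, 4, 8            # illustrative sizes
x = np.random.randn(batch, seq_len, d_model)
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1).round(6))            # ~0 per token, regardless of sequence length
```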
Transformer Architectures
“Encoders, decoders, T5, oh my!”
Encoder-Decoder Models
Defining Features:
● Two sets of transformer blocks
○ Encoder blocks
○ Decoder blocks
○ Cross-attention
● Typical Uses:
○ Translation
○ Conversion
Source: ArXiv
B is for BERT
Encoder-only models
Novel Features of BERT:
• Segment embeddings ([CLS] and [SEP] special tokens)
Source: IllustratedTransformer
We use three large (millions of elements) matrices to create the Query, Key, and Value vectors in each layer.

1) The input vector (in the first layer, this is the word embedding vector with the position information) is used to create three new vectors:
   a) the Query (Q) vector - made from the current token
   b) the Key (K) vector - made from all other tokens in the sequence
   c) the Value (V) vector - made from all tokens in the sequence

2) A scaled dot-product is then performed on the Query and Key vectors; this is the attention score. A softmax function then produces a final set of Attention Weights that are scaled 0-1.

3) The Attention Weights are multiplied element-wise with the Value vectors and summed to produce the Output Vector.

The Attention equation can be expressed as an operation over all the vectors:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
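As a companion to the equation above, a minimal single-head NumPy sketch of scaled dot-product attention (random toy matrices; a real layer would learn W_Q, W_K, and W_V):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # attention scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> weights in [0, 1]
    return weights @ V, weights                        # weighted sum of Value vectors

seq_len, d_model, d_k = 5, 16, 8                       # toy sizes
x = np.random.randn(seq_len, d_model)                  # input token vectors
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
output, attn_weights = scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)
print(output.shape, attn_weights.shape)                # (5, 8) (5, 5)
```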
• Available Data
  • Task-relevant data
  • Good language coverage
• Architecture:
  • Layer count
  • Context size
Model families: Autoencoding models, Sequence-to-sequence models, Autoregressive models
Source: GitHub, MosaicML

Fine tuning and other methods are required to ensure task success.
• ChatGPT was originally built atop a large language model known as GPT-3.5.

[Figure: model scale comparison - 12x blocks with 512-dimension embeddings, 48x blocks with 1024-dimension embeddings, 96x blocks with 2048-dimension embeddings. Source: LinkedIn]
Generative Pre-trained Transformers (GPT)
Generational Improvements
GPT (2018):
It was pre-trained on the BooksCorpus dataset and contained 117 million parameters. GPT was an autoregressive
model based on the Transformer architecture and employed a unidirectional language modeling objective.
-----------------------------------------------------------------------
GPT-2 (2019):
Pre-trained on a much larger dataset called WebText, roughly 40 GB of text scraped from web pages. GPT-2 came in four different sizes: 117M (small), 345M (medium), 774M (large), and 1.5B (extra-large) parameters.
GPT-3 (2020):
Pre-trained on an even larger portion of the internet (including the WebText2 dataset), roughly 45 terabytes of raw text.
GPT-3 came with a staggering 175 billion parameters.
Exceptional few-shot and zero-shot learning capabilities allow it to perform well on various tasks with minimal or no fine-tuning.
-----------------------------------------------------------------------------------
GPT-4 (2023):
Specific details about the architecture, dataset, and parameter count for GPT-4 have not been
officially released. Likely ~1T parameters in an ensemble of smaller models of ~220B parameters
each.
Just as convolutional layers learn simple features such as edges and lines in earlier layers and more complex visual features in later layers, attention layers work in a similar fashion:
3. Late Attention: integrates the lower layers and generates coherent and contextual outputs.
   • High-level abstractions
   • Discourse structure
   • Sentiment
   • Complex long-range dependencies
Source: Wikipedia
Why so many parameters?
GPT-2 small vs. GPT-2 extra-large (1.5B parameters):
● Layers: 12 → 48 (4x)
● Model dimensionality: 768 → 1600 (~2x)
● Attention heads per layer: 12 → 25 (~2x)
Source: IllustratedGPT2
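A back-of-the-envelope sketch of where the 1.5B figure comes from (it ignores biases, layer norms, and positional embeddings, and assumes the GPT-2 vocabulary size, so the totals are approximate):

```python
# Rough transformer parameter-count estimate (a sketch, not exact bookkeeping).
def approx_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> int:
    embeddings = vocab_size * d_model          # token embedding matrix
    attention = 4 * d_model ** 2               # Wq, Wk, Wv, Wo per layer
    ffn = 2 * d_model * (4 * d_model)          # two projections, 4x hidden width
    return embeddings + n_layers * (attention + ffn)

print(f"GPT-2 small ~ {approx_params(12, 768) / 1e6:.0f}M parameters")    # ~124M
print(f"GPT-2 XL    ~ {approx_params(48, 1600) / 1e9:.2f}B parameters")   # ~1.55B
```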
Training GPT
GPT-1: BooksCorpus
● Over 7,000 unpublished books spanning various genres.
● 800 million words, covering a wide range of topics and styles.
GPT-2: WebText
GPT-3: WebText2
● Even larger than WebText and more diverse
● The increased size and diversity of the WebText2 dataset contribute to
GPT-3's impressive performance on various NLP tasks.
Comparing LLM Architectures

Encoder-only (autoencoding) models
Pros:
● Pre-trained on a masked language modeling task; bidirectional context gives a deeper understanding of input sequences and leads to strong feature extraction capabilities.
● Typically several orders of magnitude smaller than decoder or seq-2-seq models.
Cons:
● Not ideal for text generation tasks due to their bidirectional nature.
● Can have higher computational costs compared to decoder-only models.

Decoder-only (autoregressive) models
Pros:
● Autoregressive nature makes them well-suited for text generation tasks.
● Can generate coherent and contextually relevant text.
Cons:
● Only capture unidirectional context (left-to-right), which can limit their contextual understanding.
● Autoregressive generation can be slow due to sequential token prediction.

Encoder-decoder (sequence-to-sequence) models
Pros:
● Encoder-decoder architecture allows for better handling of complex, structured input-output relationships.
● Attention mechanisms help weigh the importance of different parts of the input when generating the output.
Cons:
● Can require more training data and computational resources compared to encoder-only or decoder-only models.
● Can be more complex to train and fine-tune due to the two-part architecture.
• Learn what parameter-efficient fine-tuning is and what the popular strategies are
• Transfer learning
• Apply a general pre-trained model to a new, but related task
• Fine tuning
• Use a general pre-trained model and then train that model further
Source: tensorflow.org; interview with Andrej Karpathy and Andrew Ng
How to leverage a pre-trained foundation model?
• Improve performance
  • Different pre-trained vs. fine-tuned tasks
  • Different domains
• Ensure regulatory compliance
• Not new: the ULMFiT paper in 2018
[Figure: full fine-tuning vs. fine-tuning only a downstream classifier. Source: Howard and Ruder 2018]
Fine tune = update foundation model weights
AKA parameter fine-tuning

In-context learning = not updating model weights, just prompting:

Prompt:
[Tweet]: …
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:
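A minimal sketch of sending the same in-context prompt to an off-the-shelf model with Hugging Face transformers (the model choice and generation settings are illustrative; no weights are updated):

```python
from transformers import pipeline

prompt = """[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:"""

# Any causal LM works here; gpt2 is just a small, convenient stand-in.
generator = pipeline("text-generation", model="gpt2")
completion = generator(prompt, max_new_tokens=3, do_sample=False)
print(completion[0]["generated_text"][len(prompt):])   # the model's predicted label text
```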
Pros and cons of X-shot learning
Also known as in-context learning

Pros:
• No need for huge labeled training data
• No need to create a copy of the model for each task
• Simplify model serving
• Text prompts feel interpretable

Cons:
• Manual prompt engineering
• Prompts are specific to models
• Context length limitation
  • Add more examples? Less space for instructions
  • Longer context = higher latency
  • LLMs forget the middle portion (Liu et al 2023, released in July)
  • A longer context window is not the solution!
• Performance might still be lackluster
• Instruction fine-tuned
• Multi-task serving
[Figure: Task 1 - Addition; Task 2 - Subtraction, using natural language (NL); Task 3 - Multiplication, mix of NL + mathematical symbols; Task 4 - Division, mix of NL + mathematical symbols]
Instruction-tuned, multi-task LLM
Instruction-tuned = tune general purpose LLMs to follow instructions
• Examples
• T5 -> FLAN-T5
• PaLM -> FLAN-PaLM
• Implementation:
• Acts on the core Transformer block
• Basic multi-head attention and/or feed forward network
• Some act specifically on the weight matrices: Query, Key, Value
• These matrices pass information from one token to another
Q, K, V weight matrices
Source: Vaswani et al 2017
• Analogy:
• Bitcoin: We can’t touch it like cash. We don’t know how it
“looks”, but it exists and works.
• SuperGLUE (2019)
• Styled after GLUE, but more difficult
and diverse
• Boolean questions, comprehension,
etc.
In this example: (virtual) prompt length = 2
• Instability
• Larger models do fine with a prompt length of 1
Source: Lightning AI
• Row rank: 1
  • 2nd row = 3x 1st row
• Column rank: 1
  • 2nd column = 2x 1st column
  • 3rd column = 2x 2nd column
  (e.g. the matrix [[1, 2, 4], [3, 6, 12]] satisfies both)
• W_delta = W_a x W_b (a product of two low-rank matrices)
• 37.7M trainable / 175,255.8M total ≈ 0.0002 = 0.02% of parameters!
Source: Hu et al 2021
Wq = query
Wk = key
Wv = value
Wo = output
Source: Hu et al 2021
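A minimal NumPy sketch of the LoRA idea applied to one weight matrix (toy sizes; the real method trains W_a and W_b by gradient descent while the pretrained weight stays frozen):

```python
import numpy as np

d, r = 512, 8                        # model dim and LoRA rank (toy values)
W_q = np.random.randn(d, d)          # frozen pretrained query weight matrix

# Trainable low-rank factors: only 2 * d * r parameters instead of d * d.
W_a = np.random.randn(d, r) * 0.01
W_b = np.zeros((r, d))               # start at zero so W_delta is zero initially

W_delta = W_a @ W_b                  # low-rank update, rank <= r
W_q_adapted = W_q + W_delta          # used at inference; can be merged into W_q

trainable = W_a.size + W_b.size
print(f"trainable fraction: {trainable / W_q.size:.4%}")   # ~3% at this toy size; ~0.02% at GPT-3 scale
```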
• Future research
• From LoRA authors: If Wdelta is rank-deficient, is W too?
• Newer PEFT technique: IA3 (2022)
• Trains even fewer parameters than LoRA!
Colossal Clean Crawled Corpus (C4)
Source: Wu et al 2023
Source: OpenAI
Source: Zhou et al 2023
Data preparation best practices
• Develop an understanding of the utility of quantization in LLMs for both training and
inferencing
What if we could improve the speed and footprint while preserving quality?
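As a warm-up for that question, a minimal sketch of symmetric 8-bit weight quantization (the scheme and scale choice are simplified compared to what libraries such as bitsandbytes actually do):

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.1f} MB  ->  int8: {q.nbytes / 1e6:.1f} MB")   # ~4x smaller
print(f"mean abs error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```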
Improving Learning Efficiency
How can we train and fine-tune better?
Attention is all you need… to fix! And just with a linear bias (ALiBi).

Observation:
○ The fastest memory in the GPU is SRAM
○ Longer context = larger attention matrices
Problem:
○ SRAM is small relative to the attention matrix needed in the calculations
Solution: Flash Attention!
○ Attention computations are tiled and recomputed on the fly, so the full attention matrix is never materialized
○ More time is spent in SRAM, giving a massive performance boost
Source: Flash Attention
Many queries, fewer keys
Multi-query and Grouped-query Attention
● Multi-Headed Attention
  #Queries = #Keys = #Values
  Each head can focus on different parts of language.
  Inference: slow, accurate.
● Multi-Query Attention
  Many Queries, 1 Key, 1 Value
  Forces the model to rely on its different queries, since keys and values are shared.
  Inference: fast, less accurate.
● Grouped-Query Attention
  #Keys = #Values = #Queries / n (each group of n query heads shares one key/value head)
  Inference: fast, accurate.
Source: Grouped Query Attention
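A minimal shape sketch of grouped-query attention (all sizes are hypothetical; 4 query heads share each key/value head):

```python
import numpy as np

seq_len, d_head = 8, 64
n_q_heads, n_kv_heads = 16, 4
q = np.random.randn(n_q_heads, seq_len, d_head)
k = np.random.randn(n_kv_heads, seq_len, d_head)     # far fewer key/value heads to cache
v = np.random.randn(n_kv_heads, seq_len, d_head)

# Broadcast the shared key/value heads so each query head has one to attend with.
group = n_q_heads // n_kv_heads
k = np.repeat(k, group, axis=0)                      # (16, 8, 64)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (16, 8, 8)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v                                     # (16, 8, 64)
print(out.shape)
```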
Mixture-of-Experts (MoE):
● Input is sent to a router
● Multiple expert networks are trained; the router selects which expert(s) handle each input

Switch Transformer:
● Application of MoE
● The position-wise FFN is multiplied into several expert FFNs
● A single attention network is kept

LLM cascade:
● Send prompts to the smallest models first
● Gather the confidence of the response and escalate to a larger model only when confidence is low (see the sketch below)

Fine-tuning/Inferencing:
● LoRA/QLoRA
● FrugalGPT
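A minimal sketch of an LLM cascade in the FrugalGPT spirit (the model list, scoring, and threshold are hypothetical placeholders, not FrugalGPT's actual components):

```python
# Hypothetical callables: each takes a prompt and returns (answer, confidence in [0, 1]).
# In practice these would wrap real model endpoints plus a learned scoring function.
def small_model(prompt):  return "answer-small", 0.55
def medium_model(prompt): return "answer-medium", 0.80
def large_model(prompt):  return "answer-large", 0.97

CASCADE = [small_model, medium_model, large_model]   # cheapest first

def cascade_answer(prompt, threshold=0.75):
    # Try the smallest model first; escalate only when confidence is too low.
    for model in CASCADE:
        answer, confidence = model(prompt)
        if confidence >= threshold:
            return answer
    return answer                                    # fall back to the largest model's answer

print(cascade_answer("What is 2 + 2?"))
```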
• Understand how transformers accept non-text inputs, e.g. images and audio
Source: OpenAI
Multi-modality mirrors how we perceive info
More user-friendly, flexible, and capable
HuggingGPT Demo
[HuggingGPT demo screenshots, steps 1-8. Figures: music notes, audio, and images fed through self-attention; zero-shot vs. few-shot learning. Source: KDNuggets]
Colored images are 3-D tensors
Grayscale images are 2-D tensors: all 3 channels have the same value
[Figure: a 3-D image tensor with width, height, and channel dimensions]
Limitations:
• Lose vertical spatial relationships
• Memory and computational requirements scale quadratically with sequence length, O(N²)
Source: Chen et al 2021; David Coccomini 2021
Vision Transformer (ViT)
Computes attention on patches of images: image-to-patch embeddings
[Figure: one pixel vs. one 16x16-pixel patch]
ViT: An image is worth 16x16 words
● Split the image into N input patches, each with shape 3 x Pixel (P) x P.
● Linearly project each patch to a D-dimensional vector: the D-dim patch embedding (patches 1, 2, …, 16; see the shape sketch below).
● Add a learned D-dim positional embedding to each patch embedding.
● Prepend CLS, a special classification token.
● Feed the resulting sequence through the original Transformer.
● The output is a C-dim classifier output vector, where C = # classes (e.g. Label = "cat").
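A minimal NumPy sketch of the ViT patch-embedding steps (hypothetical sizes: a 64x64 RGB image, 16x16 patches, D = 128; just the shape bookkeeping, not the official ViT code):

```python
import numpy as np

H = W = 64
P = 16                       # patch size in pixels
D = 128                      # embedding dimension
N = (H // P) * (W // P)      # number of patches (here 16)

image = np.random.rand(H, W, 3)

# Split the image into N patches and flatten each to a 3*P*P vector.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, 3 * P * P)

# Linearly project each patch to a D-dimensional embedding.
W_proj = np.random.randn(3 * P * P, D) * 0.02
patch_embeddings = patches @ W_proj                           # (N, D)

# Add learned positional embeddings and prepend the CLS token.
pos_embeddings = np.random.randn(N + 1, D) * 0.02
cls_token = np.random.randn(1, D) * 0.02
tokens = np.concatenate([cls_token, patch_embeddings], axis=0) + pos_embeddings
print(tokens.shape)          # (N + 1, D) -> fed to the Transformer encoder
```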
ViT only outperforms ResNets on larger datasets
More computationally efficient than ResNet
• Data2vec (Meta AI, 2022)
  • A self-supervised algorithm for speech, vision, and text
  • The only one so far?
Source: Mehrish et al 2023
Training Data for MLLMs
Source: WebVid
Source: VideoChat
Source: CC-3M
Source: LLaVa
Instruction-tuned, model-generated data
Actually: manually design examples first, then ask model to generate more
Disclaimer: Contains mostly copyrighted images. LAION doesn't claim ownership.
Source: LAION
©2023 Databricks Inc. — All rights reserved
X-shot learning for MLLMs
Zero-shot learning and few-shot learning at test time
Big limitation:
• Inflexible
• Cannot generate text
Source: OpenAI
Few-shot, in-context: Flamingo
Unifies treatment of high-dimensional video and image inputs
Architecture highlight 1: Allows interleaved multi-modal inputs
Architecture highlight 2: Perceiver resampler
Architecture highlight 3: Interleaves cross-attention with language-only self-attention layers
Source: Alayrac et al 2022
Flamingo bridges vision and language models
A vision encoder (similar to CLIP) + Chinchilla (language) accept interleaved inputs.
The Perceiver resampler maps a variable-size grid of inputs to a fixed number of output tokens.
[Figure: queries attending over variable-sized inputs. Source: Alayrac et al 2022]
Outperforms 6 out of 16 SOTA fine-tuned models
Curated 3 high-quality datasets: LTIP (Long Text-Image Pairs), VTP
(Video-Text Pairs), and MultiModal Massive Web (M3W)
• Hallucination
• Prompt sensitivity
• Context limit
• Inference compute cost
• Bias, toxicity, etc.
• Copyright issues
Source: Alayrac et al 2022 (Flamingo)
Source: laion.ai
Source: New York Times, April 2023
Source: Next Shark, X Post
MLLMs can lack common sense (like LLMs)
GPT-3 completion
Input: You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can't smell anything. You are very thirsty. So you
Completion: drink it. You are now dead.
Source: Robust AI and NYU
Attention may not be forever
What may remain or rise?
Source: Li et al 2022
Prompt: "It's an apocalypse. Describe how you survive and make allies"
● Lower-resource languages