
Large

Language
Models
Foundation Models
from the Ground Up

©2023 Databricks Inc. — All rights reserved


Course Outline
Course Introduction
Module 1 - Transformers: Attention and the Transformer Architecture
Module 2 - Efficient Fine-Tuning: Doing more with less
Module 3 - Deployment Optimizations: Improving model size and speed
Module 4 - Multi-modal LLMs: Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved



Course Introduction

©2023 Databricks Inc. — All rights reserved


What are LLMs?

Matei Zaharia
Co-founder & CTO of Databricks
Associate Professor of Computer Science
at UC Berkeley

©2023 Databricks Inc. — All rights reserved


Generative AI state of the art is rapidly advancing
No single model to rule them all—trade-offs are required to find
the best model for each use case

Trade-offs: Privacy, Quality, Cost, Latency

Proprietary LLMs vs. Open Source LLMs (e.g. Dolly, Stable Diffusion, MPT)

©2023 Databricks Inc. — All rights reserved


Open Source quality is rapidly advancing – while
fine-tuning cost is rapidly decreasing
Dolly started the trend to open models with a commercially friendly license

• Facebook Llama (February 24, 2023): “Smaller, more performant models such as Llama … democratizes access in this important, fast-changing field.”
• Stanford Alpaca (March 13, 2023): “Alpaca behaves qualitatively similarly to OpenAI … while being surprisingly small and easy/cheap to reproduce”
• Databricks Dolly (March 24, 2023): “Dolly will help democratize LLMs, transforming them into a commodity every company can own and customize”
• Mosaic MPT (May 5, 2023): “MPT-7B is trained from scratch on 1T tokens … is open source, available for commercial use, and matches the quality of LLaMA-7B”
• TII Falcon (May 24, 2023): “Falcon significantly outperforms GPT-3 for … 75% of the training compute budget—and … a fifth of the compute at inference time.”

Non-Commercial Use Only | Commercial Use Permitted


©2023 Databricks Inc. — All rights reserved
OSS LLMs are getting better every day

● Constant development

● Supported by community
and industry

● Rapid innovation cycle

©2023 Databricks Inc. — All rights reserved Source: Huggingface.co


“A strong foundation… is all you need”
What and who this course is for
Research and innovation around LLMs have exploded.
If you want to keep up:
• The fundamentals of LLMs have not changed since 2018
• Most of the innovations are variations of the original architecture

Source: Reddit

©2023 Databricks Inc. — All rights reserved


Enjoy the course!

©2023 Databricks Inc. — All rights reserved


Course Outline
Course Introduction
Module 1 - Transformers: Attention and the Transformer Architecture
Module 2 - Parameter Efficient Fine-Tuning: Doing more with less
Module 3 - Deployment Optimizations: Improving model size and speed
Module 4 - Multi-modal LLMs: Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved


Module 1
Transformers
Attention and the Transformer Architecture

©2023 Databricks Inc. — All rights reserved


Learning Objectives

By the end of this module you will:


• Describe and build the inner workings of a transformer model

• Compare and contrast the different types of transformer architectures

• Recognize the importance and mechanics of attention

• Apply the different types of transformer models to solve problems

©2023 Databricks Inc. — All rights reserved


A Transformative Technology
A breakthrough in Natural Language
Processing

©2023 Databricks Inc. — All rights reserved


Let’s Chat
A new technology paradigm has arrived.

Less than a year old, ChatGPT is the fastest-adopted technology in human history.

LLMs like ChatGPT have ushered in a new golden era of AI.

©2023 Databricks Inc. — All rights reserved Source: Brookings


We’ve seen this before
The Convolution Layer led to models that dominated the vision field

©2023 Databricks Inc. — All rights reserved Source: ResearchGate


We weren’t paying attention before
Attention: “the convolution layer of language”

The attention mechanism measures how words interrelate

The cat sat on the mat

©2023 Databricks Inc. — All rights reserved


Attention is (not) all you need
A new type of deep learning architecture

Attention is a linear, matrix operation.
Where is the neural network?
Let’s look at the Transformer Block.

Diagram: Input token embedding vector → Attention → Fully Connected Neural Network → Normalization and Residual Connection → Output vector

©2023 Databricks Inc. — All rights reserved


The Transformer Block
Building up the most important AI
advancement in years

©2023 Databricks Inc. — All rights reserved


Transforming a sequence
What is the goal of a transformer block?

Transformer blocks:
1. Process input vectors with attention to enrich them
2. Add a nonlinear transformation with a neural network
3. Apply the enriched vectors to select the correct token from the model’s vocabulary

Diagram: Input Token Vector → Preparation → Transformer block (Enrichment) → Prediction → Output Token Selection

©2023 Databricks Inc. — All rights reserved *Image modified from original transformer paper.

Source: ArXiv
Attention Mechanism
How important is each word to each other?

The role of attention is to:

● Measure the importance and relevance of each word relative to the others

● Allow for the building-up of enriched vectors with more context and logic

©2023 Databricks Inc. — All rights reserved
Position-wise Feed-Forward Networks (FFN)
Adding nonlinearity to understanding the input.

The role of the position-wise feed-forward network is to:

● Take the information gathered from the attention stage for each element in the sequence and translate it into a further enriched state

● Allow for nonlinear transformations of each element in the sequence, which build upon themselves in subsequent layers

©2023 Databricks Inc. — All rights reserved
Residual Connections & Layer Normalization
Ensuring robust and reliable training.

Output Token Selection Residual Connections


• Shortcuts in the network that allow information to flow directly from
earlier layers to later layers.
• Help mitigate the vanishing gradient problem in deep neural networks.

Layer Normalization
• Normalizes the inputs across the features rather than the batch.
• Stabilizes the network's training, ensuring consistent input distribution,
crucial in tasks with varying sequence lengths.

Input Token Vector


©2023 Databricks Inc. — All rights reserved
Into and out of the transformer
Input transformations and output transformations

Input to the Transformer
● The initial input to a Transformer model is a sequence of word tokens.
● These tokens are converted into word embeddings.
● Positional encodings are added to these embeddings, providing the model with the necessary positional information.
● This new sequence, with each element a dense vector, is then passed to the first transformer block.

Output from the Transformer
● Exiting the last Transformer block is a sequence of context-aware vectors.
● Each vector in this sequence represents an input token, but its representation is now deeply influenced by its interactions with all other tokens in the sequence.
● Different models use the output from the transformer differently.

©2023 Databricks Inc. — All rights reserved
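As a minimal sketch of the input preparation described above, the snippet below builds token embeddings, adds sinusoidal positional encodings (as in the original Transformer paper), and produces the dense vectors passed to the first block. The vocabulary, dimensions, and random weights are toy assumptions for illustration only.

```python
import numpy as np

# Toy sizes (assumptions): vocabulary V, embedding dim D, sequence length L
V, D, L = 1000, 16, 6
rng = np.random.default_rng(0)

token_ids = rng.integers(0, V, size=L)          # a sequence of word tokens
embedding_table = rng.normal(size=(V, D))       # "learned" word embeddings (random here)
token_embeddings = embedding_table[token_ids]   # (L, D)

# Sinusoidal positional encodings
pos = np.arange(L)[:, None]                     # (L, 1)
i = np.arange(D // 2)[None, :]                  # (1, D/2)
angles = pos / (10000 ** (2 * i / D))
pos_encoding = np.zeros((L, D))
pos_encoding[:, 0::2] = np.sin(angles)
pos_encoding[:, 1::2] = np.cos(angles)

# Dense input vectors passed to the first transformer block
transformer_input = token_embeddings + pos_encoding   # (L, D)
print(transformer_input.shape)                         # (6, 16)
```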
Transformer Architectures
“Encoders, decoders, T5, oh my!”

©2023 Databricks Inc. — All rights reserved


The Transformer Family Tree
Encoder-Decoder
Models Decoder-only Models

Encoder-only Models

©2023 Databricks Inc. — All rights reserved


Source: GitHub
The Original Transformer
An encoder-decoder model

Defining Features:
● Two sets of transformer blocks
○ Encoder blocks
○ Decoder blocks
○ Cross-attention

● Typical Uses:
○ Translation
○ Conversion

Source: ArXiv
©2023 Databricks Inc. — All rights reserved
B is for BERT
Encoder-only models
Novel Features of BERT:
• Segment embeddings ([CLS] and [SEP] special tokens)

• Trained with Masked Language Modeling & Next Sentence Prediction

• Excellent fine-tuning performance

©2023 Databricks Inc. — All rights reserved


Source: ArXiv
Generating text with GPT
Decoder-only models

Decoder models have a single task:
• Look over the sequence and predict the next word (or token).

Decoder-model products: ChatGPT, Bard, Claude, LLaMA, MPT

Source: IllustratedTransformer

©2023 Databricks Inc. — All rights reserved


Important variables in Transformers

Input
• Vocabulary Size (V): The number of unique tokens that the model recognizes.
• Embedding/Model Size (D): The dimensionality of the word embeddings, also known as the hidden size.
• Sequence/Context Length (L): The maximum number of tokens that the model can process in a single pass.

Internal
• Number of Attention Heads (H): In the multi-head attention mechanism, the input is divided into H different parts.
• Intermediate Size (I): The feed-forward network has an intermediate layer whose size is typically larger than the embedding size.
• Number of Layers (N): The number of Transformer blocks/layers.

Training
• Batch Size (B): The number of examples processed together in one forward/backward pass during training.
• Tokens Trained on (T): The total number of tokens that a model sees during training. This is normally reported rather than the number of epochs.

©2023 Databricks Inc. — All rights reserved


Time to Pay Attention
The secret that unlocked the power of
LLMs

©2023 Databricks Inc. — All rights reserved


The inner workings of attention
Learning the weights of attention.

We use three large (millions of elements) matrices to create the Query, Key, and Value vectors in each layer:

Query Matrix: WQ, Key Matrix: WK, Value Matrix: WV

For example, the Query vector is produced by multiplying the word vector by WQ.

The attention equation can be expressed as an operation over all the vectors:
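(The equation shown on the slide is the standard scaled dot-product attention from the original Transformer paper:)

Attention(Q, K, V) = softmax( (Q K^T) / sqrt(d_k) ) V

where d_k is the dimensionality of the Key vectors.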

©2023 Databricks Inc. — All rights reserved


The inner workings of attention
How do we calculate attention?

The mechanism can be broken down into three steps:

1) The input vector (in the first layer, this is the word embedding vector with the position information added) is used to create three new vectors:
   a) the Query (Q) vector - made from the current token
   b) the Key (K) vectors - made from all other tokens in the sequence
   c) the Value (V) vectors - made from all tokens in the sequence

2) A scaled dot-product is then performed on the Query and Key vectors; this is the attention score. A softmax function is applied to produce a final set of Attention Weights that are scaled 0-1.

3) Each Value vector is multiplied by its corresponding attention weight, and then all are summed up to generate the output for the self-attention layer.
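A minimal numpy sketch of single-head self-attention following the three steps above; the matrix sizes and random weights are assumptions for illustration, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, d_k = 6, 16, 16                    # sequence length, model dim, key dim (assumed)
X = rng.normal(size=(L, D))              # word + position embedding vectors

# Step 1: project inputs into Query, Key, and Value vectors
W_q, W_k, W_v = (rng.normal(size=(D, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: scaled dot-product scores, then softmax -> attention weights in [0, 1]
scores = Q @ K.T / np.sqrt(d_k)                                 # (L, L)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 3: weight the Value vectors and sum them up
output = weights @ V                                            # (L, d_k)
print(output.shape)   # (6, 16)
```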

©2023 Databricks Inc. — All rights reserved


Building Base/Foundation
Models
Training transformers, what does it take?

©2023 Databricks Inc. — All rights reserved


Foundation Model Training - Getting Started
Choosing the right options to build your model.

A foundation, or base, model is a transformer that is trained from scratch.


• Model Architecture - decoder? encoder-decoder?
• Tailored to the task/problem
• Different sizes: embedding dimensions, number of transformer blocks, etc.

• Available Data
• Task-relevant data
• Good language coverage

• Available Compute Resources


• Time to allow for training
• Hardware available

©2023 Databricks Inc. — All rights reserved


Foundation Model Training - Architecture
Which transformer flavor is right?

Start with your task:
• Classification?
• Generation?
• Translation?

Architecture:
• Layer count
• Context size

Figure: Autoencoding models, Sequence-to-sequence models, Autoregressive models.

Source: GitHub

©2023 Databricks Inc. — All rights reserved


Foundation Model Training - Data
It’s all about the data

Datasets for foundation LLM:


• Web Text
• Code
• Images
• Digitized Text
• Transcriptions

Source: Stanford CS324

©2023 Databricks Inc. — All rights reserved


Foundation Model Training - Training
Optimizing LLM Losses

• LLMs train just like other deep learning


models.
• Loss functions are typically cross-entropy.
• Familiar optimizers, like AdamW, are used.
• Training LLMs with 100s of billions of parameters can take months on hundreds of GPUs.

Source: MosaicML

©2023 Databricks Inc. — All rights reserved


Now what?
What do you use a foundation LLM for?

Foundation LLMs suffer from “alignment” problems:

• Bias/toxicity in the training data


• Lack of specific task focus -> models like GPT just focus on
selecting the next correct word

Fine tuning and other methods are required to ensure task success.

©2023 Databricks Inc. — All rights reserved


Generative Pretrained Transformer
A journey to discover how GPT-4 and ChatGPT
were built.

©2023 Databricks Inc. — All rights reserved


The journey to ChatGPT
What is GPT?

• ChatGPT was originally built atop a large language model known as GPT-3.5.

• GPT-3.5 is a type of decoder-based transformer model.


• Let’s see what exactly GPT is. Source: OpenAI

©2023 Databricks Inc. — All rights reserved


Generative Pre-trained Transformers (GPT)
Decoder-based transformers

• The first GPT model, introduced in 2018, was just the decoder part of the original transformer. Decoder models only use the second half of the original transformer architecture: there is no cross/encoder attention.

• GPT-2/-3/-4 have mostly just been larger versions, with the key differences coming from training data and training processes.

GPT/GPT-1: 12x blocks, 512-dimension embeddings
GPT-2: 48x blocks, 1024-dimension embeddings
GPT-3: 96x blocks, 2048-dimension embeddings

Source: LinkedIn
©2023 Databricks Inc. — All rights reserved
Generative Pre-trained Transformers (GPT)
Generational Improvements

GPT (2018):

It was pre-trained on the BooksCorpus dataset and contained 117 million parameters. GPT was an autoregressive model based on the Transformer architecture and employed a unidirectional language modeling objective.

-----------------------------------------------------------------------

GPT-2 (2019):

Pre-trained on a much larger dataset called WebText, which contained text from web pages with a total of 45 terabytes of data. GPT-2 came in four different sizes: 117M (small), 345M (medium), 774M (large), and 1.5B (extra-large) parameters.

Figure: GPT-2 model sizes - 117M, 345M, 762M, and 1,542M parameters.

©2023 Databricks Inc. — All rights reserved


Source: IllustratedGPT2
Generative Pre-trained Transformers (GPT)
Generational Improvements

GPT-3 (2020):
Pre-trained on the WebText2 dataset, an even larger portion of the internet, 45 terabytes of text.
GPT-3 came with a staggering 175 billion parameters
Exceptional few-shot and zero-shot learning capabilities, allowing it to perform well on various
tasks with minimal or no fine-tuning.
-----------------------------------------------------------------------------------
GPT-4 (2023):
Specific details about the architecture, dataset, and parameter count for GPT-4 have not been
officially released. Likely ~1T parameters in an ensemble of smaller models of ~220B parameters
each.

©2023 Databricks Inc. — All rights reserved


GPT Architecture

©2023 Databricks Inc. — All rights reserved


So many layers?
Why GPTs keep getting bigger

The main component of each layer is the multi-headed attention block.

Just as the earlier convolutional layers learn features such as edges and lines while later layers learn more complex visual artifacts, attention layers work in a similar fashion:

1. Early Attention: short-range dependencies, relationships between adjacent or close tokens.
   • Word order
   • Part-of-speech
   • Basic sentence structure

2. Middle Attention: overall context of the input sequence.
   • Semantic information
   • Meaning
   • Relationships between phrases
   • Roles of different words within the sentence

3. Late Attention: integrates the lower layers and generates coherent and contextual outputs.
   • High-level abstractions
   • Discourse structure
   • Sentiment
   • Complex long-range dependencies

Diagram: Transformer Block Input → … → Transformer Block Output
Source: Wikipedia
©2023 Databricks Inc. — All rights reserved
Why so many parameters?

Parameter                  | GPT-2 Small | GPT-2 Extra Large | Difference (Extra Large / Small)
Layers                     | 12          | 48                | 4x
Model Dimensionality       | 768         | 1600              | ~2x
Attention Heads per Layer  | 12          | 25                | ~2x
Total                      | 117 Million | 1.5 Billion       | 12.8x

Figure: GPT-2 Small (117M parameters) vs. GPT-2 Extra Large (1.5B parameters).
Source: IllustratedGPT2
©2023 Databricks Inc. — All rights reserved
Training GPT

©2023 Databricks Inc. — All rights reserved


Training Data
The magic behind the model

GPT-1: BooksCorpus
● Over 7,000 unpublished books spanning various genres.
● 800 million words, covering a wide range of topics and styles.

GPT-2: WebText

● Much larger than BooksCorpus
● A collection of text from web pages, consisting of 45 terabytes of data
● Filtered out web pages with low-quality content

GPT-3: WebText2
● Even larger than WebText and more diverse
● The increased size and diversity of the WebText2 dataset contribute to GPT-3's impressive performance on various NLP tasks.

©2023 Databricks Inc. — All rights reserved
Comparing LLM
Architectures

©2023 Databricks Inc. — All rights reserved


BERT vs. GPT vs. T5
Which type of LLM is best?

Encoder-only (e.g. BERT)
• Pros: Pre-trained on a masked language modeling task; bidirectional context gives a deeper understanding of input sequences, leading to strong feature extraction capabilities. Typically several orders of magnitude smaller than decoder or seq-2-seq models.
• Cons: Not ideal for text generation tasks due to their bidirectional nature. Can have higher computational costs compared to decoder-only models.

Decoder-only (e.g. GPT)
• Pros: Autoregressive nature makes them well-suited for text generation tasks. Can generate coherent and contextually relevant text.
• Cons: Only capture unidirectional context (left-to-right), which can limit their contextual understanding. Autoregressive generation can be slow due to sequential token prediction.

Seq-2-Seq (e.g. T5)
• Pros: Encoder-decoder architecture allows for better handling of complex, structured input-output relationships. Attention mechanisms help weigh the importance of different parts of the input when generating the output.
• Cons: Can require more training data and computational resources compared to encoder-only or decoder-only models. Can be more complex to train and fine-tune due to the two-part architecture.

©2023 Databricks Inc. — All rights reserved


Module Summary
Transformers - What have we learned

• Transformers are built from a number of transformer blocks


• Transformer blocks use attention and neural networks to enrich vectors
• Transformer models can be encoder, decoder, or encoder-decoder
• The evolution of GPT required changes in architecture, and in data
• Base/Foundation models require fine tuning to solve most tasks

©2023 Databricks Inc. — All rights reserved


Time for some code!

©2023 Databricks Inc. — All rights reserved


Course Outline
Course Introduction
Module 1 - Transformers: Attention and the Transformer Architecture
Module 2 - Parameter Efficient Fine-Tuning: Doing more with less
Module 3 - Deployment Optimizations: Improving model size and speed
Module 4 - Multi-modal LLMs: Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved


Module 2
Efficient Fine-Tuning
Doing more with less

©2023 Databricks Inc. — All rights reserved


Learning Objectives

By the end of this module you will:


• Understand what fine-tuning is and why we do it

• Learn what parameter-efficient fine-tuning is and what the popular strategies are

• Understand the limitations of parameter-efficient fine-tuning

• Gain knowledge about data preparation best practices

©2023 Databricks Inc. — All rights reserved


Fine tuning vs. transfer learning
They are often referenced interchangeably

• Transfer learning
• Apply a general pre-trained model to a new, but related task

• Fine tuning
• Use a general pre-trained model and then train that model further

• Transfer learning ~= fine tuning


• Train it more
• Train on different data

©2023 Databricks Inc. — All rights reserved Source: tensorflow.org, interview with Andrej Karpathy and Andrew Ng
How to leverage a pre-trained foundation model?

©2023 Databricks Inc. — All rights reserved



Why fine tuning?
Leverage an effective pre-trained model on our own data – it’s not new

• Improve performance
  • Different pre-trained vs fine-tuned tasks
  • Different domains

• Ensure regulatory compliance

• Not new:
  • ULMFiT paper in 2018

Figure: full fine-tuning vs. fine-tuning only a downstream classifier.
Source: Howard and Ruder 2018
©2023 Databricks Inc. — All rights reserved
Fine tune = update foundation model weights
AKA full-parameter fine tuning

• Updating more layers = better model performance

• Full fine-tuning typically produces one model per task
  • Serve one model per task
  • May forget other pre-trained tasks: catastrophic forgetting

• Full fine-tuning LLMs is expensive. How to avoid it?
  • X-shot learning
  • Parameter-efficient fine tuning

Source: Devlin et al 2019
©2023 Databricks Inc. — All rights reserved
X-shot learning
Provide several examples of the new task in the prompt

Prompt engineering
= developing prompts
= prompt design
= hard/discrete prompt tuning

Not updating model weights.

Example prompt (an instruction, few-shot examples, then the new input):

pipeline(
    """For each tweet, describe its sentiment:

    [Tweet]: "I hate it when my phone battery dies."
    [Sentiment]: Negative
    ###
    [Tweet]: "My day has been 👍"
    [Sentiment]: Positive
    ###
    [Tweet]: "This is the link to the article"
    [Sentiment]: Neutral
    ###
    [Tweet]: "This new music video was incredible"
    [Sentiment]:""")
©2023 Databricks Inc. — All rights reserved
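A minimal runnable version of the few-shot prompt above, assuming the Hugging Face transformers library; the "gpt2" checkpoint is an assumption purely for illustration, and any text-generation model can be swapped in.

```python
from transformers import pipeline

# Model choice is an assumption; swap in any text-generation model.
generator = pipeline("text-generation", model="gpt2")

prompt = """For each tweet, describe its sentiment:

[Tweet]: "I hate it when my phone battery dies."
[Sentiment]: Negative
###
[Tweet]: "My day has been 👍"
[Sentiment]: Positive
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
[Tweet]: "This new music video was incredible"
[Sentiment]:"""

# No weights are updated: the frozen model simply completes the pattern in the prompt.
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```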
Pros and cons of X-shot learning
Also known as in-context learning

Pros
• No need for huge labeled training data
• No need to create a copy of the model for each task
• Simplifies model serving
• Text prompts feel interpretable

Cons
• Manual prompt engineering
• Prompts are specific to models
• Context length limitation
  • Add more examples? Less space for instructions
  • Longer context = higher latency
  • LLMs forget the middle portion (Liu et al 2023, released in July)
  • A longer context window is not the solution!
• Performance might still be lackluster

©2023 Databricks Inc. — All rights reserved


Fine-tuning outperforms X-shot learning
Example: GOod at Arithmetic Tasks (Goat-7B)

• Foundation model = LLaMA

• Trained on 1M synthetic data samples

• Accuracy outperforms
  • Few-shot PaLM-540B (a much bigger model!)
  • GPT-4, which typically doesn’t do well in arithmetic (accuracy ~0)
  • Achieves SOTA on the arithmetic benchmark (BIG-bench)

• Supervised instruction fine-tuning
• Trained using LoRA on a 24GB VRAM GPU
  • LoRA to be covered soon!

Source: Liu and Low 2023 (released in May)

©2023 Databricks Inc. — All rights reserved


Important observations about Goat

• Instruction fine-tuned
• Multi-task serving

Tasks:
• Task 1: Addition
• Task 2: Subtraction, using natural language (NL)
• Task 3: Multiplication, mix of NL + mathematical symbols
• Task 4: Division, mix of NL + mathematical symbols

©2023 Databricks Inc. — All rights reserved
Instruction-tuned, multi-task LLM
Instruction-tuned = tune general purpose LLMs to follow instructions

FLAN (Fine-tuned LAnguage Net)
• Foundation model = 137B model
• Instruction-tuned on over 60 NLP datasets with different task types
• Task types: Q/A, translation, reasoning, comprehension, etc.
• Examples
  • T5 -> FLAN-T5
  • PaLM -> FLAN-PaLM

Dolly
• Foundation model = Pythia-12B
• Instruction-tuned on 15k prompt/response pairs
• Task types: Q/A, classification, information extraction, etc.

Source: Wei et al 2022

©2023 Databricks Inc. — All rights reserved
Quick recap
We want efficient training, serving, and storage

• Full fine-tuning can be computationally prohibitive


• Memory usage: activation, optimizer states, gradients, parameters
• This gives the best performance

• Compromise: Do some, but not full, fine-tuning


• Saves cost to use low-memory GPUs

• We want multi-task serving, rather than one model per task


• E.g. one model for Q/A, summarization, classification

Enter parameter-efficient fine-tuning
©2023 Databricks Inc. — All rights reserved
Parameter-efficient
fine-tuning (PEFT)

©2023 Databricks Inc. — All rights reserved


3 categories of PEFT methods

Additive
• Soft prompt
  • Prompt tuning
  • Prefix tuning

Selective
• Akin to updating a few foundation model layers
• BitFit
  • Only updates bias parameters
• Diff Pruning
  • Creates task-specific “diff” vectors and only updates them

Re-parameterization
• Decompose weight matrix updates into smaller-rank matrices
• LoRA

©2023 Databricks Inc. — All rights reserved


We will cover additive and reparameterization

Additive
• Soft prompt
  • Prompt tuning
  • Prefix tuning

Selective (not covered)
• Model quality/performance is not as good
• Akin to updating a few foundation model layers
• BitFit
  • Only updates bias parameters
• DiffPruning
  • Creates task-specific “diff” vectors and only updates them

Re-parameterization
• Decompose a weight matrix into smaller-rank matrices
• LoRA

©2023 Databricks Inc. — All rights reserved
High-level overview of PEFT
Active research area: >100 papers in the last few years!

• Additive: Add new tunable layers to the model
  • Keep the foundation model weights frozen and update only the new layer weights

• Reparameterization: Decompose a weight matrix into lower-rank matrices

• Implementation:
  • Acts on the core Transformer block
  • Basic multi-head attention and/or feed-forward network
  • Some act specifically on the weight matrices: Query (Q), Key (K), Value (V)
  • These matrices pass information from one token to another

Source: Vaswani et al 2017

©2023 Databricks Inc. — All rights reserved


Additive:
Prompt Tuning
(and prefix tuning)

©2023 Databricks Inc. — All rights reserved


Soft prompt tuning
Concatenates trainable parameters with the input embeddings

• Learn a new sequence of task-specific embeddings


• We call this prompt tuning, not model tuning, because we only update
prompt weights

©2023 Databricks Inc. — All rights reserved
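A minimal sketch of soft prompt tuning with the Hugging Face peft library; the base model ("gpt2") and the number of virtual tokens are assumptions for illustration. Only the randomly initialized virtual prompt embeddings are trainable, while the foundation model stays frozen.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # assumption: any causal LM works

# 20 randomly initialized virtual tokens are prepended to every input
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=20,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()   # only the prompt embeddings are trainable
```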


What are these virtual tokens?
Goal: remove manual element of engineering prompts!

• Randomly initialized embedding vectors


• We can also initialize to discrete prompts
• But random initialization is nearly as good as informed
initialization (Qin and Eisner 2021)

• Not part of vocabulary

• Analogy:
• Bitcoin: We can’t touch it like cash. We don’t know how it
“looks”, but it exists and works.

©2023 Databricks Inc. — All rights reserved




Compare full fine-tuning vs prompt tuning
Scenario: full fine-tuning
Backprop: update all weights based on loss

©2023 Databricks Inc. — All rights reserved


Compare full fine-tuning vs prompt tuning
Scenario: prompt tuning
Backprop: update only prompt weights based on loss
• The model learns the optimal representation of the prompt automatically

©2023 Databricks Inc. — All rights reserved


Allows swapping of task prompts
Efficient for multi-task serving

• Each task is a prompt, not a model


• Only need to serve a single copy of the frozen model for multi-task serving

• Prompts for various tasks can be applied to different inputs


• A serving request can be a single, larger mixed task batch

Source: Lester et al 2021

©2023 Databricks Inc. — All rights reserved


Matches fine tuning performance for >11B models

• Comparable with full fine-tuning at the 10B model scale
  • There is a performance gap with smaller models

• SuperGLUE (2019)
  • Styled after GLUE, but more difficult and diverse
  • Boolean questions, comprehension, etc.

Source: Lester et al 2021

©2023 Databricks Inc. — All rights reserved
Prompt length affects larger models less
A prompt length of 20-100 is typical

In this example:
• (Virtual) prompt length = 2
• Smaller models show instability
• Larger models do fine with a prompt length of 1

Source: Lester et al 2021

©2023 Databricks Inc. — All rights reserved


Advantages of prompt tuning

• Use whole training set


• Not limited by # of examples that can fit in the
context

• Automatically learn a new prompt for a


new model
• Backprop helps us find the best representation

• One foundation model copy only

Source: Google AI Blog

• Resilient to domain shift


©2023 Databricks Inc. — All rights reserved
Disadvantages of prompt tuning

Less interpretable
• Need to convert the embeddings back to tokens
• Use cosine distance to find the top-K nearest neighbors

Unstable performance

Source: Lester et al 2021

©2023 Databricks Inc. — All rights reserved


Prefix tuning is very similar to prompt tuning
Adding a tunable layer to each transformer block, rather than just the input layer

Source: Li and Liang 2021

Source: Lightning AI

©2023 Databricks Inc. — All rights reserved


Re-parameterization:
LoRA

©2023 Databricks Inc. — All rights reserved


Low-Rank Adaptation (LoRA)
Decomposes the weight change matrix into lower-rank matrices

©2023 Databricks Inc. — All rights reserved




Rank? Brief visit to linear algebra
Maximum # of linearly independent columns or rows

• How many unique rows or columns?


• Full rank = no redundant row or column in the matrix
• Linear = can multiply by a constant
• Independence = no dependence on each other

• Row rank: 1
• 2nd row = 3x 1st row
• Column rank: 1
• 2nd column = 2x 1st column
• 3rd column = 2x 2nd column

©2023 Databricks Inc. — All rights reserved


How does weight matrix decomposition work?
Observation: the actual rank of the attention weight matrices is low

W_delta = W_a x W_b, where W_a is 100 x 2 and W_b is 2 x 100

• Total LoRA parameters = (100 x 2) + (2 x 100) = 400
• Original parameters = (100 x 100) = 10,000
• Reduction = (10,000 - 400) / 10,000 = 96%!
©2023 Databricks Inc. — All rights reserved
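A small numpy sketch of the decomposition above, assuming a 100 x 100 weight-update matrix and rank r = 2; the random values are placeholders for illustration.

```python
import numpy as np

d, r = 100, 2
rng = np.random.default_rng(0)

W_a = rng.normal(size=(d, r))     # 100 x 2
W_b = rng.normal(size=(r, d))     # 2 x 100
delta_W = W_a @ W_b               # low-rank approximation of the 100 x 100 update

full_params = d * d                        # 10,000
lora_params = W_a.size + W_b.size          # 400
print(delta_W.shape, lora_params / full_params)   # (100, 100) 0.04 -> a 96% reduction
```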
LoRA matches or slightly outperforms full fine-tuning

• 37.7M trainable parameters / 175,255.8M total parameters
  = 0.0002
  = 0.02% of the parameters!

Source: Hu et al 2021

©2023 Databricks Inc. — All rights reserved


LoRA performs well with very small ranks
GPT-3’s validation accuracies are similar across rank sizes

Wq = query
Wk = key
Wv = value
Wo = output
Source: Hu et al 2021

But, small r likely won’t work for all tasks/datasets.


• E.g. downstream task is in a different language

©2023 Databricks Inc. — All rights reserved


Advantages of LoRA
Similar to prompt tuning, majority of the model weights are frozen

• Able to share and re-use the foundation model


• Swap different LoRA weights for serving different tasks

• Improves training efficiency


• Lower hardware barrier (no need to calculate most gradients or optimizer states)

• Adds no additional serving latency


• W_a * W_b can be merged

• Can be combined with other PEFT methods


©2023 Databricks Inc. — All rights reserved
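To make the above concrete, here is a minimal sketch of applying LoRA with the Hugging Face peft library. The base model, rank, and target modules are assumptions for illustration; target module names vary by architecture (e.g. "c_attn" for GPT-2, "q_proj"/"v_proj" for LLaMA-style models).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # assumption: any causal LM

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the update matrices
    lora_alpha=16,                # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],    # assumption: GPT-2's fused attention projection
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()   # only a small fraction of the full model
```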
Limitations of LoRA

• Not straightforward to do multi-task serving


• How to swap different combos of A and B in a single forward pass?
• If dynamically choose A and B based on tasks, there is additional serving latency

• Future research
• From LoRA authors: If Wdelta is rank-deficient, is W too?
• Newer PEFT technique: IA3 (2022)
• Reduces even more trainable parameters than LoRA!

©2023 Databricks Inc. — All rights reserved


PEFT Limitations

©2023 Databricks Inc. — All rights reserved


Model performance limitations

• Difficult to match the performance of full fine-tuning


• Sensitive to hyperparameters
• Unstable performance

• Current research area: where is best to apply PEFT?


• E.g. why apply PEFT to only attention weight matrices? Soft prompts?
• Vu et al 2022: Soft prompt transfer

• We may still need full-parameter fine-tuning


• Lv et al 2023 (released in June): use new optimizer, LOMO, to reduce memory
usage to ~11%

©2023 Databricks Inc. — All rights reserved


Compute limitations

• Doesn’t reduce the time complexity of training
• Doesn’t reduce the cost of storing massive foundation models
• Doesn’t always make inference more efficient
• Requires full forward and backward passes

©2023 Databricks Inc. — All rights reserved


Data Preparation
Best Practices

©2023 Databricks Inc. — All rights reserved


Better models from better training data
Many newer good models use C4, the Colossal Cleaned Crawled Corpus (e.g. MPT-7B)

Llama
• Trained on the 20 most-spoken languages, focusing on those with Latin and Cyrillic alphabets

GPT-Neo and GPT-J
• Trained on the Pile: 22 diverse datasets
• Outperformed GPT-3 in some instances (read more here)

Source: Touvron et al 2023

©2023 Databricks Inc. — All rights reserved Source: Gao et al 2020


Training data makes the biggest difference
Not necessarily the model architecture

• Bloomberg created 363B-token dataset of English financial documents


spanning 40 years
• Augmented with 345B-token public dataset

• Outperforms existing open models on financial tasks

Source: Wu et al 2023

©2023 Databricks Inc. — All rights reserved


How much fine-tuning data do I need?

• Zhou et al 2023 (May): fine-tuned LLaMA 65B on 1,000 high-quality labeled examples
  • When scaling up data quantity, you need to scale up prompt diversity

• OpenAI: at least a couple hundred examples
  • Doubling the dataset size leads to a linear increase in model performance

• How to get more data? Synthetic data (see the sketch after this list)
  • Synonym replacement / rewrite
  • Word deletion: “brilliantly expressed” => “expressed”
  • Word position swapping: “It is lovely” -> “Lovely, it is”
  • Noise injection: introduce typos

©2023 Databricks Inc. — All rights reserved
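A minimal, purely illustrative sketch of the augmentation ideas in the list above (word deletion, position swapping, noise injection); the helper functions are hand-rolled assumptions, not a specific library's API.

```python
import random

random.seed(0)

def delete_words(text, p=0.2):
    """Randomly drop words: 'brilliantly expressed' -> 'expressed'."""
    words = text.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else text

def swap_positions(text):
    """Swap two word positions: 'It is lovely' -> 'lovely is It'."""
    words = text.split()
    if len(words) > 1:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def inject_noise(text, p=0.05):
    """Introduce typos by swapping adjacent characters."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(delete_words("the movie was brilliantly expressed"))
print(swap_positions("It is lovely"))
print(inject_noise("this new music video was incredible"))
```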
Data preparation best practices
Quantity, diversity, and quality

• Don’t provide detailed instructions.


Only prompt and completion.
• Fixed separator \n\n###\n\n to inform
when the prompt ends and completion
begins
• The separator shouldn’t appear anywhere
else

Source: OpenAI
Source: Zhou et al 2023
©2023 Databricks Inc. — All rights reserved
Data preparation best practices

• Remove undesired data


• Offensive, toxic content
• Private or confidential information

• Using LLM output as data is not always the answer


• Imitation models learn style, rather than content (Gudibande et al 2023)
• Consistent with Zhou et al 2023: knowledge is largely learned during
pre-training

• Manually verify data quality

©2023 Databricks Inc. — All rights reserved


Module Summary
Efficient Fine-Tuning - What have we learned?

• Fine-tuning gives the best results, but can be computationally


expensive
• Parameter-efficient fine-tuning reduces # of trainable parameters
• Prompt tuning allows virtual prompts to be learned automatically
• LoRA decomposes the weight change matrix into lower-rank matrices
• Fine-tuning data quality and diversity matters a lot

©2023 Databricks Inc. — All rights reserved


Time for some code!

©2023 Databricks Inc. — All rights reserved


Course Outline
Course Introduction
Module 1 - Transformers: Attention and the Transformer Architecture
Module 2 - Parameter Efficient Fine-Tuning: Doing more with less
Module 3 - Deployment Optimizations: Improving model size and speed
Module 4 - Multi-modal LLMs: Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved


Module 3
Deployment Optimizations
Improving model size and speed

©2023 Databricks Inc. — All rights reserved


Learning Objectives

By the end of this module you will:


• Be able to make the design choices for your LLM development

• Create and design your own pseudo Mixture-of-Experts LLM system

• Develop an understanding of the utility of quantization in LLMs for both training and
inferencing

©2023 Databricks Inc. — All rights reserved


Extra-Large Language Models
What if our models are too big?

©2023 Databricks Inc. — All rights reserved


The issue with high performance LLMs
Paying the price for quality

As models grow in size, they get “better” and “worse”.

• Better - accuracy, alignment, abilities
• Worse - speed, memory footprint, updatability

Small, fast, low-quality model  vs.  Large… slow... high... quality… model.

What if we could improve the speed and footprint while preserving quality?
©2023 Databricks Inc. — All rights reserved
Improving Learning Efficiency
How can we train and fine-tune better?

©2023 Databricks Inc. — All rights reserved


How we interact with LLMs
The importance of context length

LLMs, like us, do better at tasks with more context.

This means a longer input/context length, which scales:

- Computing input embeddings: linearly
- Performing FFNN calculations: linearly
- Calculating attention scores: quadratically

Even worse:
Attention cannot perform as well on longer contexts than it was trained on.

©2023 Databricks Inc. — All rights reserved

Source: ALiBi
Training short but inference long
You’ll need a good Alibi for this one

Attention is all you need… to fix! And just with a linear bias.

©2023 Databricks Inc. — All rights reserved


Source: ALiBi
Faster calculations
Calculating attention in a flash.

Observation:
○ The fastest memory in the GPU is SRAM
○ Longer context = larger attention matrices

Problem:
○ SRAM is small relative to the attention matrix needed in calculations

Solution: Flash Attention!
○ Attention is computed in tiles and recomputed as needed, so the full attention matrix is never materialized
○ More time is spent in SRAM, giving a massive performance boost

Source: Flash Attention
©2023 Databricks Inc. — All rights reserved
Many queries, fewer keys
Multi-query and Grouped-query Attention

● Multi-Head Attention
  #Query heads = #Key heads = #Value heads
  Each head can focus on different parts of language.
  Inference: slow, accurate.
● Multi-Query Attention
  Many query heads share 1 key head and 1 value head, forcing the model to use different queries.
  Inference: fast, less accurate.
● Grouped-Query Attention
  Query heads are split into groups that share key/value heads (#Key heads = #Value heads = #Query heads / n).
  Inference: fast, accurate.

Source: Grouped Query Attention

©2023 Databricks Inc. — All rights reserved


Improving Model Footprint
Doing more with less

©2023 Databricks Inc. — All rights reserved


Storing numbers
Billions of parameters. Each a floating point number.

FP16 and FP32 are IEEE standards.

Google Brain saw this and created BF16:
● Same range as FP32
● Same size as FP16

Source: Google bfloat16

What if we need to go even further?
©2023 Databricks Inc. — All rights reserved
Quantization
Do we need so much precision?

Approximate the values in quantized forms


Create quantized functions

©2023 Databricks Inc. — All rights reserved


Source: LLM.int8()
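A small numpy sketch of absmax 8-bit quantization, the basic idea behind approaches like LLM.int8(); this is a simplified illustration on a toy matrix, not the actual library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)    # a toy FP32 weight matrix

# Absmax quantization: scale so the largest magnitude maps to 127
scale = 127.0 / np.abs(weights).max()
q_weights = np.round(weights * scale).astype(np.int8)   # stored in 1 byte per value

# Dequantize when the values are needed for computation
deq_weights = q_weights.astype(np.float32) / scale
print(np.abs(weights - deq_weights).max())              # small quantization error
```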
QLoRA
Applying quantization to fine tuning

● LoRA was already great!


● QLoRA adds even more:
○ 4-bit quantization
○ Paged optimization

Source: QLoRA

©2023 Databricks Inc. — All rights reserved


Multi-LLM Inferencing
Hybrid and Ensemble-based systems

©2023 Databricks Inc. — All rights reserved


Mixture-of-Experts
A trillion parameters, for a fraction of the training

Mixture-of-Experts (MoE):
● Input is sent to a router
● Multiple expert NNs are trained; the router decides which expert(s) process each input

Switch Transformer:
● Application of MoE
● The position-wise FFNN is replicated into multiple experts
● A single attention network is shared

Source: Switch Transformer

©2023 Databricks Inc. — All rights reserved
LLM Cascades and FrugalGPT
Improving our resource allocation in LLM inferencing

LLM cascade (see the sketch below):
● Send prompts to the smallest model first

● Gather the confidence of the response

● If confidence is too low, move to a larger model

Source: FrugalGPT
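A sketch of the cascade idea under stated assumptions: `small_llm`, `large_llm`, and `score_confidence` are hypothetical stand-ins, not FrugalGPT's actual API.

```python
def llm_cascade(prompt, models, score_confidence, threshold=0.8):
    """Try models from smallest/cheapest to largest; stop once a response
    scores above the confidence threshold."""
    response = None
    for model in models:                      # ordered small -> large
        response = model(prompt)
        if score_confidence(prompt, response) >= threshold:
            return response
    return response                           # fall back to the largest model's answer

# Hypothetical usage: small_llm / large_llm and score_confidence are placeholders.
# answer = llm_cascade("Summarize this ticket...", [small_llm, large_llm], score_confidence)
```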

©2023 Databricks Inc. — All rights reserved


Current Best Practices
If you want to build now, do it right

©2023 Databricks Inc. — All rights reserved


Best Practices

Training from scratch:


● ALiBi
● Flash Attention
● Grouped-Query Attention
● Mixture-of-Experts

Fine-tuning/Inferencing:
● LoRA/QLoRA
● FrugalGPT

©2023 Databricks Inc. — All rights reserved


Source: LLM-Numbers
Module Summary
Deployment Optimizations - What have we learned?

• LLMs are currently outpacing modern compute capacity, necessitating the development of workaround solutions and approaches
• Modifying the original approach to attention has allowed for longer contexts, better use of hardware, and more efficient calculations
• Quantization helps to store and use massive LLMs on smaller hardware
• Combining LLM inferences of different models allows an effective scale-up of parameters with minimal cost changes

©2023 Databricks Inc. — All rights reserved


Time for some code!

©2023 Databricks Inc. — All rights reserved


Course Outline
Course Introduction
Module 1 - Transformers: Attention and the Transformer Architecture
Module 2 - Parameter Efficient Fine-Tuning: Doing more with less
Module 3 - Deployment Optimizations: Improving model size and speed
Module 4 - Multi-modal LLMs: Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved


Module 4
Multi-modal Language
Models (MLLMs)
Beyond text-based transformers

©2023 Databricks Inc. — All rights reserved


Learning Objectives

By the end of this module you will:


• Survey the broad landscape of multi-modality that leverages LLMs

• Understand how transformers accept non-text inputs, e.g. images and audio

• Examine how multi-modal data is procured

• Discuss limitations of multi-modal LLMs and alternative architectures to transformers

• Identify the wide possibilities of multi-modal applications

©2023 Databricks Inc. — All rights reserved


Going beyond uni-modality
LLM-based models that can receive and reason with multimodal info
Source: Yin et al 2023 (released in late June)

Source: OpenAI
©2023 Databricks Inc. — All rights reserved
Multi-modality mirrors how we perceive info
More user-friendly, flexible, and capable

Source: Zhu et al 2023


Source: Zhang et al 2023
Demo for Video LLaMA

©2023 Databricks Inc. — All rights reserved


Chain-of-Thought MLLMs
We can also supply multi-modal information as “in-context”

Source: Zhang et al 2023

Source: Himakunthala et al 2023

©2023 Databricks Inc. — All rights reserved


MLLMs can process multi-modalities
simultaneously

Source: Su et al 2023 (released in May)


PandaGPT Demo

©2023 Databricks Inc. — All rights reserved


MLLMs also call tools/models to finish tasks

HuggingGPT Demo

©2023 Databricks Inc. — All rights reserved


Transformers beyond text
One architecture to rule them all

©2023 Databricks Inc. — All rights reserved


Transformer: a general sequence processing tool
We can treat many things as a sequence

Examples: music notes; audio; images (as a sequence of patches); video frames; proteins; game actions.

©2023 Databricks Inc. — All rights reserved
Cross attention bridges between modalities
Allows different modalities to influence each other

Diagram: Modality A (self attention) ↔ cross attention ↔ Modality B (self attention)

A and B could be:

• Images, audio, text, time series, or any sequence!

Example:
• Stable Diffusion uses cross attention to bridge between text and images

©2023 Databricks Inc. — All rights reserved


Computer vision

©2023 Databricks Inc. — All rights reserved


Well-researched area
We need to first understand how to represent images as numbers

Timeline of vision(-language) models:
• Vision Transformer (ViT), 2021
• Swin Transformer, 2021; MLP-Mixer, 2021 (neither transformer nor CNN)
• Zero-shot learning: CLIP, 2021; SimVLM, 2022 (prefix LM for image/text)
• Few-shot learning (as pioneered by GPT-3, 2020): VisualGPT, 2022; Flamingo, 2023

©2023 Databricks Inc. — All rights reserved

We chop an image up into small pixels
A colored image is made up of Red, Green, Blue (RGB) levels

Each pixel channel value is in the range [0, 256)

Source: KDNuggets
©2023 Databricks Inc. — All rights reserved
Colored images are 3-D tensors
Grayscale images are 2-D tensors: all 3 channels have the same value

• Scalar: 0-D tensor
• Vector: 1-D tensor
• Matrix: 2-D tensor
• Colored image: 3-D tensor (height x width x channel)

Adapted from TowardDataScience and KDNuggets

©2023 Databricks Inc. — All rights reserved


Initial idea: Turn pixels into a sequence
Use self-attention to predict the next pixel, instead of word token

Limitations:
• Lose vertical spatial
relationships

Source: Chen et al 2021

©2023 Databricks Inc. — All rights reserved


Initial idea: Turn pixels into a sequence
Use self-attention to predict the next pixel, instead of word token

Limitations:
• Lose vertical spatial
relationships

• Memory and computational requirements scale quadratically with sequence length, O(N²)

Source: Chen et al 2021
Source: David Coccomini 2021
©2023 Databricks Inc. — All rights reserved
Vision Transformer (ViT)
Computes attention on patches of images: image-to-patch embeddings

Figure: a 16 x 16 pixel patch is treated as one token.

©2023 Databricks Inc. — All rights reserved
ViT: An image is worth 16x16 words

The image is processed as follows:
1. Split the image into N input patches, each with shape 3 x Pixel (P) x P.
2. Linearly project each patch to a D-dimensional patch embedding.
3. Add a learned D-dimensional positional embedding to each patch embedding (positions 1, 2, 3, … 16).
4. Prepend a special classification token (CLS), also a learned D-dimensional embedding.
5. Pass the sequence through the original Transformer encoder.
6. The output at the CLS position is mapped to a C-dimensional classifier output vector, where C = # classes, producing a label such as “cat”.

©2023 Databricks Inc. — All rights reserved
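A numpy sketch of the image-to-patch embedding described above. The sizes (224x224 RGB image, 16x16 patches, D = 768) and the random projection/embedding weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 224; C = 3; P = 16; D = 768             # image size, channels, patch size, embed dim

image = rng.random((C, H, W))                   # a toy RGB image, channels first
n_patches = (H // P) * (W // P)                 # 14 * 14 = 196 patches

# Split into N patches of shape C x P x P, then flatten each patch
patches = image.reshape(C, H // P, P, W // P, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(n_patches, C * P * P)           # (196, 768)

# Linear projection to D-dim patch embeddings, plus CLS token and positional embeddings
W_proj = rng.normal(size=(C * P * P, D))
cls_token = rng.normal(size=(1, D))                       # special [CLS] token
pos_embed = rng.normal(size=(n_patches + 1, D))

tokens = np.concatenate([cls_token, patches @ W_proj]) + pos_embed   # (197, 768)
print(tokens.shape)   # sequence fed to the standard Transformer encoder
```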
ViT only outperforms ResNets on larger datasets
More computationally efficient than ResNet

• ViT is ~4 times faster to train
• On smaller datasets ViT is worse than ResNet; on larger datasets ViT outperforms ResNet

Source: Dosovitskiy et al 2021

©2023 Databricks Inc. — All rights reserved

Many other vision-text models
Not necessarily revolutionary, but an evolution in computer vision research

• Vision Transformer (ViT), 2021 – inspires many other follow-ups
• Swin Transformer, 2021; MLP-Mixer, 2021 (neither transformer nor CNN)
• Zero-shot learning: CLIP, 2021; SimVLM, 2022 (prefix LM for image/text)
• Few-shot learning (as pioneered by GPT-3, 2020): VisualGPT, 2022; Flamingo, 2023

©2023 Databricks Inc. — All rights reserved


Audio

©2023 Databricks Inc. — All rights reserved


Audio signals are 2-dim spectrograms
We create an embedding vector for each audio time frame

• The spectrogram axes are frequency and time
• Time = length of the audio = # of columns in the matrix
• Each column (audio frame) is mapped to an embedding vector

©2023 Databricks Inc. — All rights reserved


Audio is usually much longer than text length
Need to apply convolution layers with large strides to reduce dimensions

Speech Transformer (2018)
• Encoder-decoder Transformer
• Extracts features using different optional modules: ResCNN, ResCNN-LSTM, etc.

©2023 Databricks Inc. — All rights reserved


Few multi-modal advances
Also much harder: emotion, acoustics, tone, speed, speaker identification

• Most models focus on only text-speech, speech-text, or speech-speech

• Data2vec (Meta AI, 2022)
  • Self-supervised algorithm for speech, vision, and text
  • The only one so far?

Source: Mehrish et al 2023
©2023 Databricks Inc. — All rights reserved
Training Data for MLLMs

©2023 Databricks Inc. — All rights reserved


Hand-crafted training data
Text-audio or text-video data is much harder to procure

Source: WebVid

Source: VideoChat

Source: Himakunthala et al 2023


©2023 Databricks Inc. — All rights reserved
Instruction-tuned, hand-crafted data

Source: CC-3M

Source: LLaVa
©2023 Databricks Inc. — All rights reserved
Instruction-tuned, model-generated data
Actually: manually design examples first, then ask model to generate more

Source: Liu et al 2023

©2023 Databricks Inc. — All rights reserved


LAION-5B: open source image-text data
Original data: Common Crawl; filtered with OpenAI’s CLIP model

Disclaimer:
Contains mostly
copyrighted
images. LAION
doesn’t claim
ownership.

Source: LAION
©2023 Databricks Inc. — All rights reserved
X-shot learning for MLLMs

©2023 Databricks Inc. — All rights reserved


Computer Vision

©2023 Databricks Inc. — All rights reserved


X-shot learning to the rescue?
Gathering multi-modal data is harder than just images or text data

• Vision Transformer (ViT), 2021 – inspires many other follow-ups
• Swin Transformer, 2021; MLP-Mixer, 2021 (neither transformer nor CNN)
• Zero-shot learning: CLIP, 2021; SimVLM, 2022 (prefix LM for image/text)
• Few-shot learning (as pioneered by GPT-3, 2020): VisualGPT, 2022; Flamingo, 2023

©2023 Databricks Inc. — All rights reserved


Zero-shot: Contrastive Language-Image Pre-training (CLIP)
CLIP predicts which image-text pair actually occurs in the training data

• Collects 400M image-text pairs from the internet as training data

• At test time, it predicts the probability of each candidate caption matching the image

Source: OpenAI
©2023 Databricks Inc. — All rights reserved
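A minimal sketch of zero-shot classification with CLIP via the Hugging Face transformers library; the local image path and candidate captions are assumptions for illustration.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")   # assumption: any local image file
captions = ["a photo of a dog", "a photo of a cat", "a photo of a plane"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Probability of each caption matching the image
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```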
CLIP performs better across settings

Big limitation:
• Inflexible
• Cannot generate text

CLIP performs much better in non-ImageNet settings.

Source: OpenAI
©2023 Databricks Inc. — All rights reserved
Few-shot, in-context: Flamingo
Unifies treatment of high-dimensional video and image inputs

Architecture highlights:
1. Allows interleaved multi-modal inputs
2. Perceiver resampler
3. Interleaves cross-attention with language-only self-attention layers

Source: Alayrac et al 2022
©2023 Databricks Inc. — All rights reserved
Flamingo bridges vision and language models
Vision encoder similar to CLIP + Chinchilla (Language) accept interleaved
inputs

Source: Alayrac et al 2022

Image source: Samuel Albanie

©2023 Databricks Inc. — All rights reserved


Perceiver resampler outputs fixed-sized tokens

• Maps a variable-size grid of inputs to a fixed number of output tokens (5 output tokens in the figure)
• The queries are learned; the keys and values are the spatio-temporal visual features plus temporal embeddings
• Inputs can be variable-sized

Source: Alayrac et al 2022
©2023 Databricks Inc. — All rights reserved
Outperforms 6 out of 16 SOTA fine-tuned models
Curated 3 high-quality datasets: LTIP (Long Text-Image Pairs), VTP
(Video-Text Pairs), and MultiModal Massive Web (M3W)

Source: Alayrac et al 2022

©2023 Databricks Inc. — All rights reserved


Qualitative inspection on selected samples
Supported input format: (image, text) or (video, text) + visual query

Source: Alayrac et al 2022

©2023 Databricks Inc. — All rights reserved


Audio

©2023 Databricks Inc. — All rights reserved


Zero-shot: OpenAI’s Whisper
Encoder-decoder transformer: splits input audio into 30-second frames

Spectrogram

Source: Radford et al 2022

©2023 Databricks Inc. — All rights reserved


Whisper matches human robustness
Without fine-tuning on benchmark data

• WER = word error rate


• LibriSpeech
• 1K hours of read English speech

©2023 Databricks Inc. — All rights reserved


Challenges
We haven’t figured it all out yet

©2023 Databricks Inc. — All rights reserved


MLLMs are not immune from LLM limitations
They inherit LLM risks

• Hallucination
• Prompt sensitivity
• Context limit
• Inference compute cost
• Bias, toxicity, etc.
• Copyrights issues Source: Alayrac et al 2022 (Flamingo)

Source: laion.ai
Source: New York Times, April 2023
Source: Next Shark, X Post
©2023 Databricks Inc. — All rights reserved
MLLMs can lack common sense (like LLMs)

• Thrush et al 2022: tested on many models, including CLIP and VisualBERT
• Ramesh et al 2022: used CLIP and DALL-E 2; prompt: “a red cube on top of a blue cube”

GPT-3 completion
Input: You poured yourself a glass of cranberry, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you
Completion: drink it. You are now dead.

Source: Robust AI and NYU
©2023 Databricks Inc. — All rights reserved
Attention may not be forever
What may remain or rise?

©2023 Databricks Inc. — All rights reserved


Reinforcement learning with human feedback
Human feedback trains the reward model (LM). KL loss ensures minimal
divergence from the original LLM. Proximal Policy Optimization (PPO)
updates the LLM.

Source: Ahead of AI on Substack

©2023 Databricks Inc. — All rights reserved


Hyena Hierarchy
Convolutional neural networks are making a comeback?

• Good few-shot learners for languages


• Matches performances of Vision Transformers (ViT)

Source: Poli et al 2023

©2023 Databricks Inc. — All rights reserved


Retentive Networks
A new attention variant: a retention mechanism to connect recurrence
and attention, without compromising performance

Source: Sun et al 2023 (released in July)

©2023 Databricks Inc. — All rights reserved


Emerging applications
It’s a great time to be alive

©2023 Databricks Inc. — All rights reserved


DreamFusion
Generates 3D objects from text caption

Source: Poole et al 2022

©2023 Databricks Inc. — All rights reserved


Make-a-Video
Generates video from text: “Cat watching TV with a remote in hand”

Source: Singer et al 2022


©2023 Databricks Inc. — All rights reserved
PaLM-E-bot
Robotics application: “bring me the rice chips from the drawer”

Source: Driess et al 2023

©2023 Databricks Inc. — All rights reserved


AlphaCode: generate code
Problem: minimum # of minutes to make pizzas of N slices

Source: Li et al 2022

©2023 Databricks Inc. — All rights reserved


Multi-lingual models: Bactrian-X
An instruction-following model

Prompt: “It’s an apocalypse. Describe how you survive and make allies”
(Figure shows outputs for lower-resource languages.)

Source: Li et al 2023 (GitHub and Paper); released in May


©2023 Databricks Inc. — All rights reserved
Gato: a generalist AI agent
Can play Atari, caption images, chat, stack blocks with a real robot arm

Source: Reed et al 2022

©2023 Databricks Inc. — All rights reserved


Textless NLP
Generate speech from raw audio without text transcription

Helps with low-resource languages

Click here for demo on Meta AI


©2023 Databricks Inc. — All rights reserved
AlphaFold
Uses attention to predict protein folding

Source: Google DeepMind and Timeline of Breakthrough

©2023 Databricks Inc. — All rights reserved


Module Summary
Multi-modal LLMs - What have we learned?

• MLLMs are gaining traction


• Transformers are general sequence-processing architectures that can
accept non-text sequences
• MLLMs inherit limitations from LLMs
• Transformers may not be the last architecture standing
• More exciting and unimaginable MLLM applications are on the horizon

©2023 Databricks Inc. — All rights reserved


Time for some code!

©2023 Databricks Inc. — All rights reserved


©2023 Databricks Inc. — All rights reserved
