
Advanced Natural Language Processing

Lecture 21: Multimodal AI

陈冠华 CHEN Guanhua


Department of Statistics and Data Science
Examples of Modalities

• Texts
• Images or videos
• Documents
• With texts, images, videos, tables
• Speech, music
• Physiological signals
• Electrocardiogram (ECG), Electroencephalogram (EEG), electromyography (EMG)
• Other modalities
• Infrared images, depth images, fMRI
• Haptics / touch, smell, taste and self-motion

Why Multimodal

• Many application domains, such as healthcare, robotics, and retail, deal with a mixture of data modalities
• Incorporating data from other modalities can help boost model performance
• Multimodal models provide a more flexible human-machine interface

Why Multimodal

• The second example asks GPT-4 to draw a unicorn in TikZ.
• We gave GPT-4 a transformed version of the TikZ code it produced for Figure 1.1, with the part drawing the horn removed.
• We asked for code to add back the horn and display the result.
• This demonstrates that GPT-4 can “see” despite being a pure language model.
• We emphasize again that the version we test with is not multimodal.

[2303.12712] Sparks of Artificial General Intelligence: Early experiments with GPT-4 (arxiv.org)
Multimodal Tasks

Multimodal can mean one or more of the following:


• Input and output are of different modalities
• E.g. text-to-image, image-to-text, image-text matching (classification, retrieval)
• Inputs are multimodal
• E.g. a system that can process both text and images, Visual question answering
• Outputs are multimodal
• E.g. a system that can generate both text and images, multimodal dialog

Unique Challenges in Vision: Modeling

a) Different types of inputs
• Temporality: static image, video sequence
• Multi-modality: w/ text, w/ audio, etc.

b) Different granularities of tasks
• Image-level: classification, captioning, etc.
• Region-level: object detection, grounding, etc.
• Pixel-level: segmentation, depth, etc.

c) Different types of outputs
• Spatial: edges, boxes, masks, etc.
• Semantic: class labels, descriptions, etc.

Unique Challenges in Vision: Data

Annotation types range from poorer to richer semantics, and from coarser to finer grain:
• Image annotation (ImageNet, LAION)
• Box annotation (COCO, O365)
• Mask annotation (COCO, LVIS)

Scales differ significantly across different types of annotations.
Content

• CLIP
• Flamingo
• BLIP2
• LLaVA

CLIP
Learning image representations from web-scale noisy text supervision
• Training: simple contrastive learning, and the beauty lies in large-scale pre-training
• Downstream: zero-shot image classification and image-text retrieval
• Image classification can be reformatted as a retrieval task by considering the semantics behind label names (see the zero-shot sketch below)

1 Learning transferable visual models from natural language supervision, ICML 2021
2 Scaling up visual and vision-language representation learning with noisy text supervision, ICML 2021
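
To make the retrieval view concrete, here is a minimal zero-shot classification sketch using the open_clip library: each label name is wrapped in a prompt, encoded with the text encoder, and the image is assigned to the most similar prompt. The model name, checkpoint tag, prompt template, image path, and label set are illustrative assumptions; any available open_clip checkpoint works the same way.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

# Assumed model/checkpoint names for illustration.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

labels = ["cat", "dog", "unicorn"]                        # hypothetical label set
prompts = [f"a photo of a {label}" for label in labels]   # label names as text

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image path
text = tokenizer(prompts)

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(text), dim=-1)
    # Classification = retrieval: pick the prompt most similar to the image
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```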

CLIP

Trained with contrastive loss


• Each image can be paired with N possible texts, and the model tries to predict
the correct one

• Each text can be paired with N possible images, and the model tries to predict
the correct image

How to Improve CLIP
• Since the release of CLIP, there have been many follow-up works and applications, along three directions:

1. Data scaling up
2. Model design
3. Objective functions

Data Scaling Up

• Reproducible scaling laws for CLIP training


• Open large-scale LAION-2B dataset
• Pre-training OpenCLIP across various scales

• DataComp: In search of the next-generation image-text datasets


• Instead of fixing the dataset and designing different algorithms, the authors propose to fix the CLIP training method and select the datasets instead
• A benchmark for designing new filtering techniques or curating new data sources
• New datasets are evaluated by running standardized CLIP training code and testing the resulting model on 38 downstream test sets (see the filtering sketch below)

1 Reproducible scaling laws for contrastive language-image learning, CVPR 2023


2 Datacomp: In search of the next generation of multimodal datasets, 2023
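
As one concrete example of a filtering technique in this spirit, the sketch below scores candidate image-text pairs with a pretrained CLIP model's image-text cosine similarity (embeddings assumed precomputed, e.g. as in the zero-shot example above) and keeps only the highest-scoring fraction. The keep fraction is an illustrative assumption, not a DataComp-prescribed value.

```python
import torch
import torch.nn.functional as F

def clip_score_filter(image_features, text_features, keep_fraction=0.3):
    """Illustrative CLIP-score filtering for image-text data curation.

    image_features, text_features: (N, d) embeddings of candidate pairs,
    produced by a pretrained CLIP model. Returns a boolean mask over the
    N pairs, keeping the top `keep_fraction` by cosine similarity.
    """
    scores = F.cosine_similarity(image_features, text_features, dim=-1)  # (N,)
    k = max(1, int(keep_fraction * scores.numel()))
    threshold = scores.topk(k).values.min()   # score of the k-th best pair
    return scores >= threshold
```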

CLIP: Pseudocode

See the loss implementation in open_clip/loss.py; a simplified sketch follows.
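
Below is a simplified, self-contained sketch of the symmetric contrastive (InfoNCE) loss that open_clip implements, not the library code itself: the batch's image and text embeddings form an N x N similarity matrix, and the matching pairs on the diagonal are treated as the correct classes in two cross-entropy terms.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N aligned image-text pairs.

    image_features, text_features: (N, d) tensors from the two encoders.
    logit_scale: learnable inverse temperature (CLIP stores its log and
    exponentiates it before use).
    """
    # L2-normalize so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching text/image for each example sits on the diagonal
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    loss_i = F.cross_entropy(logits_per_image, labels)  # image -> text direction
    loss_t = F.cross_entropy(logits_per_text, labels)   # text -> image direction
    return (loss_i + loss_t) / 2
```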

Few-shot Performance

• Zero-shot CLIP outperforms other few-shot baselines
• Few-shot CLIP improves further with a small number of labeled examples

FILIP
• FILIP: Fine-grained supervision
• Still a dual encoder, not a fusion encoder
• Learns word-patch alignment, which also makes for good visualizations (see the sketch below)

[1] FILIP: Fine-grained Interactive Language-Image Pre-Training, ICLR 2022
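
A minimal sketch of the fine-grained (late-interaction) similarity that FILIP-style models compute, assuming L2-normalized patch and word embeddings: each image patch is matched to its most similar word, and the per-patch maxima are averaged to give the image-to-text similarity. The text-to-image direction is symmetric (max over patches, mean over words), and both feed the same CLIP-style contrastive loss. Tensor shapes and the masking convention are assumptions of this sketch.

```python
import torch

def filip_image_to_text_similarity(image_tokens, text_tokens, text_mask):
    """Token-wise late-interaction similarity in the spirit of FILIP.

    image_tokens: (N, P, d) patch embeddings, L2-normalized
    text_tokens:  (M, T, d) word embeddings, L2-normalized
    text_mask:    (M, T) with 1 for real tokens, 0 for padding
    Returns an (N, M) matrix of image-to-text similarities.
    """
    # (N, M, P, T): similarity of every patch with every word
    sim = torch.einsum('npd,mtd->nmpt', image_tokens, text_tokens)
    # Ignore padded words when taking the per-patch maximum
    sim = sim.masked_fill(text_mask[None, :, None, :] == 0, float('-inf'))
    # For each patch, take its best-matching word, then average over patches
    return sim.max(dim=-1).values.mean(dim=-1)
```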

CoCa
• CoCa: Contrastive Captioner
• Uses mixed image-text and image-label (JFT-3B) data for pre-training
• Adds a generative (captioning) branch on top of the contrastive objective for enhanced performance and new capabilities (image captioning and VQA); see the sketch below
• CoCa aims to learn a better image encoder from scratch

[1] Coca: Contrastive captioners are image-text foundation models, 2022
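
A minimal sketch of a CoCa-style combined objective: a CLIP-like contrastive loss on pooled image/text embeddings plus a captioning (next-token) loss from the generative branch. The loss weights and padding id below are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    logit_scale, lambda_con=1.0, lambda_cap=2.0, pad_id=0):
    """Contrastive + captioning objective in the spirit of CoCa.

    image_emb, text_emb: (N, d) pooled embeddings for the contrastive branch
    caption_logits: (N, T, V) decoder outputs from the generative branch
    caption_targets: (N, T) token ids of the paired text
    """
    # Contrastive term: symmetric cross-entropy over the batch (as in CLIP)
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = logit_scale * img @ txt.t()
    labels = torch.arange(img.size(0), device=img.device)
    l_con = (F.cross_entropy(logits, labels) +
             F.cross_entropy(logits.t(), labels)) / 2

    # Captioning term: standard language-modeling cross-entropy on the paired text
    l_cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                            caption_targets.reshape(-1), ignore_index=pad_id)
    return lambda_con * l_con + lambda_cap * l_cap
```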


Flamingo

• Flamingo can generate text responses conditioned on both visual and text inputs

[2204.14198] Flamingo: a Visual Language Model for Few-Shot Learning (arxiv.org)

Flamingo

• Vision encoder
• A CLIP-like model is trained using contrastive learning
• The text encoder of this model is then discarded
• The vision encoder is frozen and used in the main model
• Language model
• Flamingo adapts Chinchilla to generate text tokens conditioned on visual and text inputs, using a language-modeling loss
• The pretrained LM layers are kept frozen; two additional trainable components are added: the Perceiver Resampler and gated xattn-dense layers

Flamingo

• Four datasets
• 2 (image, text) pair datasets,
• 1 (video, text) pair dataset,
• 1 interleaved image and text dataset.

Flamingo

• Perceiver Resampler
• Converts the variable number of visual features into a fixed set of 64 visual outputs (see the sketch below)
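
A minimal sketch of the Perceiver Resampler idea: a small set of learned latent vectors cross-attends to however many visual features the vision encoder produces and returns a fixed-size set of outputs (64 in Flamingo). The layer count, hidden size, and the simplified attention pattern (the real module also lets latents attend over the visual features concatenated with the latents and adds time embeddings for video) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Learned latents cross-attend to a variable-length visual sequence."""

    def __init__(self, dim=1024, num_latents=64, num_layers=2, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ff": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                    nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, visual_features):
        # visual_features: (B, N, dim), with N varying across images/videos
        x = self.latents.unsqueeze(0).expand(visual_features.size(0), -1, -1)
        for layer in self.layers:
            # Latents are the queries; visual features are keys/values
            attn_out, _ = layer["attn"](x, visual_features, visual_features)
            x = x + attn_out
            x = x + layer["ff"](x)
        return x  # (B, 64, dim): a fixed-size visual summary
```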

Flamingo

Gated xattn-dense layers
• Newly added layers, inserted between the existing frozen LM layers (see the sketch below)
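
A minimal sketch of a gated xattn-dense block, with illustrative dimensions: freshly initialized cross-attention and feed-forward sublayers are placed in front of a frozen LM block, each scaled by tanh of a learnable gate initialized to zero, so the frozen LM's behaviour is unchanged at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlockSketch(nn.Module):
    """Tanh-gated cross-attention + feed-forward inserted before a frozen LM block."""

    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, lm_hidden, visual_tokens):
        # lm_hidden: (B, T, dim) text states; visual_tokens: (B, V, dim) from the resampler
        attn_out, _ = self.xattn(lm_hidden, visual_tokens, visual_tokens)
        x = lm_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x  # passed on to the frozen LM block
```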

BLIP2

• Proposes Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM
• Extracts a fixed number of output features from the image encoder
• Q-Former consists of
• An image transformer that interacts with the frozen image encoder for visual feature extraction
• A text transformer that can function as both a text encoder and a text decoder
• Q-Former is initialized with the pre-trained weights of BERT_base (see the bridging sketch below)

[2301.12597] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (arxiv.org)
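
A minimal sketch of the bridging idea, not the full Q-Former: a fixed set of learned query tokens cross-attends to the frozen image encoder's features, and the resulting query outputs are linearly projected into the frozen LLM's input-embedding space as soft visual prompts. The real Q-Former is a BERT_base-sized transformer with self-attention shared between queries and text; the dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridgeSketch(nn.Module):
    """Learned queries pull a fixed number of features from a frozen image encoder."""

    def __init__(self, num_queries=32, qformer_dim=768, vision_dim=1408, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)   # match widths
        self.cross_attn = nn.MultiheadAttention(qformer_dim, 12, batch_first=True)
        self.to_llm = nn.Linear(qformer_dim, llm_dim)           # into the frozen LLM

    def forward(self, frozen_image_features):
        # frozen_image_features: (B, N, vision_dim) from the frozen image encoder
        kv = self.vision_proj(frozen_image_features)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)     # (B, 32, qformer_dim)
        return self.to_llm(out)                 # soft visual prompts for the LLM
```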
BLIP2

Jointly optimize three pre-training objectives
• Image-Text Contrastive Learning (ITC)
• Contrasting the image-text similarity of a positive pair against those of negative pairs
• Use in-batch negatives
• Image-grounded Text Generation (ITG)
• Generate texts, given input images as the condition
• Employ a multimodal causal self-attention mask to control query-text interaction
• Image-Text Matching (ITM)
• A binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched); see the sketch below
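
As a small illustration of the ITM objective, the sketch below assumes the setup described for BLIP-2-style models: each query output embedding is fed through a two-class linear head and the logits are averaged over queries to produce one matched/unmatched score per image-text pair. The hidden size and the averaging detail are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ITMHeadSketch(nn.Module):
    """Binary image-text matching head over Q-Former query outputs."""

    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)  # two classes: matched / unmatched

    def forward(self, query_outputs):
        # query_outputs: (B, 32, dim) from the Q-Former with joint query-text attention
        logits = self.classifier(query_outputs)   # (B, 32, 2)
        return logits.mean(dim=1)                 # (B, 2), trained with cross-entropy
```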

LLaVA

• Architecture: a vision encoder connected to an LLM through a trainable projection matrix
• Two-stage training
• Stage 1: Pre-training for feature alignment. Only the projection matrix is updated, based on a subset of CC3M (see the sketch below).
• Stage 2: Fine-tuning end-to-end. Both the projection matrix and the LLM are updated, for two settings:
• Visual Chat: generated multimodal instruction-following data for daily user-oriented applications
• Science QA: a multimodal reasoning dataset for the science domain
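
A minimal sketch of the LLaVA-style connection, with illustrative dimensions: features from a frozen vision encoder are mapped by a trainable projection matrix into the LLM's token-embedding space and prepended to the text embeddings. In stage 1 only the projection would be trained; in stage 2 the LLM parameters are unfrozen as well.

```python
import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Project frozen vision features into the LLM embedding space (a sketch)."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the trainable matrix in stage 1

    def forward(self, patch_features, text_embeds):
        # patch_features: (B, P, vision_dim) from a frozen vision encoder
        # text_embeds:    (B, T, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(patch_features)           # (B, P, llm_dim)
        return torch.cat([visual_tokens, text_embeds], 1)   # sequence fed to the LLM
```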

Thank you
