Lecture 22: Multimodal
• Texts
• Images or videos
• Documents
• With texts, images, videos, tables
• Speech, music
• Physiological signals
• Electrocardiogram (ECG), electroencephalogram (EEG), electromyography (EMG)
• Other modalities
• Infrared images, depth images, fMRI
• Haptics / touch, smell, taste and self-motion
Sparks of Artificial General Intelligence: Early Experiments with GPT-4, arXiv:2303.12712
Multimodal Tasks
• Different granularities of tasks:
  • Image-level: classification, captioning, etc.
  • Region-level: object detection, grounding, etc.
  • Pixel-level: segmentation, depth, etc.
• Different types of outputs:
  • Spatial: edges, boxes, masks, etc.
  • Semantic: class labels, descriptions, etc.
[Figure: annotation data at different scales: Mask Annotation (COCO, LVIS), Image Annotation (ImageNet, LAION)]
• CLIP
• Flamingo
• BLIP2
• LLaVA
1. Learning transferable visual models from natural language supervision, ICML 2021
2. Scaling up visual and vision-language representation learning with noisy text supervision, ICML 2021
• Each text can be paired with N possible images, and the model tries to predict the correct image; symmetrically, each image is scored against the N texts in the batch
  1. Data scaling up
  2. Model design
  3. Objective functions
• Reference implementation: open_clip/loss.py (a minimal sketch of the loss follows below)
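Below is a minimal PyTorch sketch of this symmetric contrastive objective, in the spirit of open_clip's loss.py but not its exact API; the function name and the assumption that the embeddings arrive already L2-normalized are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N (image, text) pairs.

    image_features, text_features: [N, d] tensors, assumed L2-normalized.
    logit_scale: temperature factor (CLIP learns it as the exp of a scalar).
    """
    # Cosine-similarity logits: row i compares image i against all N texts.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching pair for each row sits on the diagonal.
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Each image must pick out its text, and each text its image.
    loss_img = F.cross_entropy(logits_per_image, labels)
    loss_txt = F.cross_entropy(logits_per_text, labels)
    return (loss_img + loss_txt) / 2
```

open_clip's actual ClipLoss additionally gathers features across GPUs so that the negatives span the global batch; the core computation is the same symmetric cross-entropy.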
• Vision encoder
  • A CLIP-like model is trained using contrastive learning.
  • The text encoder of this model is then discarded.
  • The vision encoder is frozen and used in the main model.
• Language model
  • Flamingo finetunes Chinchilla to generate text tokens, conditioned on visual and text inputs, using a language-model loss.
  • Two additional components are added: the Perceiver Resampler and gated xattn-dense layers (sketched just below).
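A hedged sketch of the idea behind a gated xattn-dense layer, not DeepMind's implementation: tanh gates initialized at zero make the block an identity map at the start of training, so the frozen language model is initially unchanged and visual information is admitted gradually. Class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative gated xattn-dense layer in the spirit of Flamingo.

    Language hidden states cross-attend to visual tokens; the tanh gates are
    zero-initialized, so at the start of training the block is an identity
    map and the frozen language model's behavior is preserved.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scalar gates, zero-initialized: tanh(0) = 0.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: [B, T, d]; visual_tokens: [B, V, d]
        attn_out, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x
```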
• Four datasets
• 2 (image, text) pair datasets,
• 1 (video, text) pair dataset,
• 1 interleaved image and text dataset.
• Perceiver Resampler
  • Converts the variable number of visual features produced by the vision encoder into a fixed set of 64 visual output tokens (see the sketch below)
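A simplified, single-layer sketch of the resampling idea (the real Flamingo module stacks several such layers and also handles video frames): a fixed set of learned latent queries cross-attends to however many visual features the encoder produces, so the output always has exactly 64 tokens. The class name and widths are assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Maps a variable number of visual features to a fixed set of 64 tokens
    via learned latent queries (single cross-attention layer for illustration).
    """

    def __init__(self, d_model: int, num_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, visual_features):
        # visual_features: [B, N, d], where N varies with image size or video length.
        batch = visual_features.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)    # [B, 64, d]
        out, _ = self.cross_attn(queries, visual_features, visual_features)
        out = out + self.ffw(out)
        return out                                                    # always [B, 64, d]
```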
• BLIP-2 proposes the Q-Former as the trainable module that bridges the gap between a frozen image encoder and a frozen LLM
• It extracts a fixed number of output features from the image encoder
• The Q-Former consists of
  • an image transformer that interacts with the frozen image encoder for visual feature extraction,
  • a text transformer that can function as both a text encoder and a text decoder
• The Q-Former is initialized with the pre-trained weights of BERT_base (a simplified skeleton follows the reference below)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv:2301.12597
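A heavily simplified skeleton of the Q-Former's image-transformer side, for intuition only: a fixed set of learned query tokens self-attends and cross-attends to the frozen image encoder's features, producing a fixed-length output. The real Q-Former is initialized from BERT_base, shares layers with its text transformer, and inserts cross-attention only in every other block; the hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Skeleton of the Q-Former's image side: learned query tokens attend to
    frozen image features and come out as a fixed-length set of vectors.
    """

    def __init__(self, d_model: int = 768, num_queries: int = 32, n_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, image_features):
        # image_features: [B, N, d] from the frozen image encoder.
        batch = image_features.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)            # [B, 32, d]
        q = q + self.self_attn(q, q, q)[0]                             # queries interact with each other
        q = q + self.cross_attn(q, image_features, image_features)[0]  # and with the image features
        q = q + self.ffw(q)
        # A linear projection (not shown) maps these vectors into the LLM's input space.
        return q
```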
LLaVA
• Architecture
• Two-stage Training
  • Stage 1: Pre-training for Feature Alignment. Only the projection matrix is updated, trained on a subset of CC3M (a sketch of this projection follows below).
  • Stage 2: Fine-tuning End-to-End. Both the projection matrix and the LLM are updated.
• Visual Chat: LLaVA's generated multimodal instruction-following data for daily user-oriented applications
• Science QA: a multimodal reasoning dataset for the science domain
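A minimal sketch of the stage-1 setup, assuming CLIP ViT features of width 1024 and an LLM embedding width of 4096 (both illustrative numbers): only the linear projection is trainable, mapping frozen vision features into the LLM's word-embedding space so they can be read as "visual tokens". LLaVA's original design uses a single linear layer; later versions replace it with a small MLP.

```python
import torch
import torch.nn as nn

class LLaVAProjectionSketch(nn.Module):
    """Stage-1 trainable part of a LLaVA-style model: a linear projection that
    turns frozen CLIP vision features into visual tokens in the LLM's
    word-embedding space. The vision encoder and the LLM stay frozen here.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projection = nn.Linear(vision_dim, llm_dim)  # the only weights updated in stage 1

    def forward(self, vision_features, text_embeddings):
        # vision_features: [B, N_patches, vision_dim] from the frozen CLIP ViT.
        # text_embeddings: [B, T, llm_dim] from the frozen LLM's embedding table.
        visual_tokens = self.projection(vision_features)              # [B, N_patches, llm_dim]
        # Prepend the visual tokens to the text tokens and feed the sequence to the LLM.
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```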