
AI VIETNAM
Seminar

Multimodal Deep Learning Explorations:
From CLIP and Beyond

Minh-Duc Bui
Year 2024

Outline
▪ Motivation for Multimodal Learning
▪ CLIP and Its Variants
▪ Beyond Text and Image
▪ Multimodal with LLMs
▪ From Large Language Model to Large Vision Model
Foundation Models

https://blogs.nvidia.com/blog/what-are-foundation-models/

Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
(a) and (b) require fine-tuning for each specific task with task-specific labelled data. (c) VLMs enable effective use of web data and zero-shot predictions without task-specific fine-tuning.

Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." arXiv preprint arXiv:2304.00685 (2023).
Number of publications on visual recognition VLMs (from Google Scholar): publications have grown exponentially since the pioneering CLIP study in 2021.

Dataset for VLMs
Motivation
v CNN and Transformer
[Figure: CNN vs. Transformer architectures]
[Figure: a dual-encoder setup. An Image Encoder (ResNet, VGG, ViT, ...) maps an image to a 512-d image embedding; a Text Encoder (BERT, RoBERTa, DistilBERT, ...) maps a caption such as "an image of a cat" to a 512-d text embedding.]
v Dataset – WIT (WebImageText)
400M (image, text) pairs collected from the Internet.
v Bag-of-Words Prediction
Multi-label classification over a vocabulary of 10,000–100,000 words: a CNN maps the image to a 1024-d embedding, and a classification head predicts which vocabulary words (ground truth in the example: "plane", "sky") occur in the text accompanying the image.

Joulin, Armand, et al. "Learning visual features from large weakly supervised data." Computer Vision – ECCV 2016, Part VII. Springer International Publishing, 2016.
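The multi-label, bag-of-words target can be sketched as follows. Note this is an illustrative stand-in: `vocab` and the caption below are toy examples, not the actual vocabulary or data used in the paper.

```python
def bow_target(caption: str, vocab: list[str]) -> list[int]:
    """Multi-hot target vector: 1 if the vocabulary word occurs in the caption."""
    words = set(caption.lower().split())
    return [1 if w in words else 0 for w in vocab]

# Toy vocabulary and caption (illustrative only)
vocab = ["the", "cat", "plane", "dog", "is", "go", "on", "sky", "white"]
target = bow_target("the plane is in the sky", vocab)
# marks "the", "plane", "is", "sky"
```

The model is then trained with a multi-label loss against this multi-hot vector, which is exactly why the task is hard: it demands predicting the exact accompanying words.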
v Image Captioning
v Results of BoW and Image Captioning Models

Both approaches share a key similarity:
• They try to predict the exact words of the text accompanying each image.
• This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images.
v What do we want?
[Figure: a shared feature space in which the embedding of a cat image lies near the embedding of the text "an image of a cat" and far from unrelated embeddings such as "the dog is jumping".]
CLIP and Its Variants
CLIP – 3/21
v Original CLIP paper

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.
v Training
v Cosine Similarity
https://www.learndatasci.com/glossary/cosine-similarity/
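As a quick plain-Python sketch (no library assumptions), cosine similarity between two embedding vectors is the dot product of the vectors divided by the product of their norms:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

CLIP compares L2-normalized embeddings, in which case cosine similarity reduces to a plain dot product.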
v Training – Pseudo-code
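The pseudo-code in the paper computes a symmetric cross-entropy over the batch's image-text similarity matrix: the matched (i, i) pairs are positives, every other pairing is a negative. A runnable NumPy sketch of just the loss step, with the encoder outputs as stand-in arrays; `temperature` here plays the role of the learned scale exp(t) in the original pseudo-code:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp; keepdims so the result broadcasts back
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of n (image, text) pairs."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature          # [n, n]
    n = logits.shape[0]
    diag = np.arange(n)
    # Cross-entropy in both directions: image->text (rows), text->image (cols)
    loss_i = -(logits - logsumexp(logits, axis=1))[diag, diag].mean()
    loss_t = -(logits - logsumexp(logits, axis=0))[diag, diag].mean()
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))
txt = img + 0.01 * rng.normal(size=(8, 512))  # nearly matched pairs
print(clip_loss(img, txt))  # small loss for well-aligned pairs
```

For perfectly aligned pairs the loss approaches 0; in the real model the temperature is a learned parameter (stored as a log-value t and applied as exp(t)), not a fixed constant.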
v CLIP's high-level architecture
https://huyenchip.com/2023/10/10/multimodal.html
v Inference
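At inference time, zero-shot classification embeds one prompt per class (e.g. "a photo of a {label}") and picks the class whose text embedding is most similar to the image embedding. A sketch with made-up 2-d vectors standing in for real encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def zero_shot_classify(image_emb, class_embs: dict):
    """Return the class whose prompt embedding is nearest by cosine similarity."""
    return max(class_embs, key=lambda c: cosine(image_emb, class_embs[c]))

# Toy stand-ins: in real CLIP these come from the image and text encoders
class_embs = {
    "a photo of a cat": [0.9, 0.1],
    "a photo of a dog": [0.1, 0.9],
}
print(zero_shot_classify([0.8, 0.2], class_embs))  # "a photo of a cat"
```

No task-specific fine-tuning is needed: adding a new class only requires embedding a new prompt.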
v Zero-shot Performance
v Generalization to Out-of-Distribution (OOD) Data
Finetune CLIP

Goyal, Sachin, et al. "Finetune like you pretrain: Improved finetuning of zero-shot vision models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
CLIP Variants
v DALL-E 2

Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).

v AudioCLIP

Guzhov, Andrey, et al. "AudioCLIP: Extending CLIP to image, text and audio." ICASSP 2022. IEEE, 2022.
v GridCLIP

Lin, Jiayi, and Shaogang Gong. "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning." arXiv preprint arXiv:2303.09252 (2023).

v PointCLIP

Zhang, Renrui, et al. "PointCLIP: Point cloud understanding by CLIP." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
v SAM

Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).

v SAM-CLIP (~41M images)

Wang, Haoxiang, et al. "SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding." arXiv preprint arXiv:2310.15308 (2023).
Different Architecture Styles of Image-Text Models

Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
Beyond Text and Image
ImageBind – 5/23

Girdhar, Rohit, et al. "ImageBind: One embedding space to bind them all." CVPR. 2023.

[Figure: modality-specific encoders (image, video, depth, ...) aligned to the shared CLIP embedding space.]
v Modality Encoder Design
1. We use the Vision Transformer (ViT) for images.
2. For audio, we convert a 2-second clip sampled at 16 kHz into a spectrogram with 128 mel-spectrogram bins. As the spectrogram is also a 2D signal like an image, we use a ViT with a patch size of 16 and a stride of 10.
3. We treat thermal images and depth images as one-channel images and also use a ViT to encode them.
4. We extract the IMU signal consisting of accelerometer and gyroscope measurements across the X, Y, and Z axes. We use 5-second clips, resulting in 2K time-step IMU readings, which are projected using a 1D convolution with a kernel size of 8. The resulting sequence is encoded with a Transformer.
5. Finally, we follow the text encoder design from CLIP, which is a Transformer.

We add a modality-specific linear projection head on each encoder to obtain a fixed-size d-dimensional embedding. We use a Transformer architecture for all modality encoders.
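The patch and stride choices above determine how many tokens each Transformer receives, via the usual strided-window formula floor((size - window) / stride) + 1. A small sketch of that shape arithmetic; the spectrogram's time-frame count and the IMU conv stride below are illustrative assumptions, not values stated in the paper:

```python
def num_windows(size: int, window: int, stride: int) -> int:
    """How many positions a window of width `window` takes in `size` with `stride`."""
    return (size - window) // stride + 1

# Audio: 128 mel bins, ViT patch size 16 with stride 10 along both axes.
mel_bins, time_frames = 128, 204   # time_frames: assumed example value
audio_tokens = num_windows(mel_bins, 16, 10) * num_windows(time_frames, 16, 10)

# IMU: 5 s clips -> ~2000 readings, 1D conv with kernel 8 (stride 8 assumed)
imu_tokens = num_windows(2000, 8, 8)
print(audio_tokens, imu_tokens)
```

The overlapping stride (10 < 16) for audio yields overlapping patches, which is a common trick for spectrogram ViTs.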
v Results
Meta-Transformer – 7/23
v 12 Modalities

Zhang, Yiyuan, et al. "Meta-Transformer: A unified framework for multimodal learning." arXiv preprint arXiv:2307.10802 (2023).

v Pipeline
v Data-to-Sequence Tokenization
v Results – Image
v Results – Text / Infrared / Hyperspectral
LanguageBind – 11/23
v ImageBind vs. LanguageBind

Zhu, Bin, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv preprint arXiv:2310.01852 (2023).

v Pipeline
v Results
Multimodal with LLMs
ChatSpot – 7/23

Zhao, Liang, et al. "ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning." arXiv preprint arXiv:2307.09474 (2023).

v Datasets
v Pipeline
v Results
ImageBind-LLM – 9/23

Han, Jiaming, et al. "ImageBind-LLM: Multi-modality instruction tuning." arXiv preprint arXiv:2309.03905 (2023).

v Multi-modal Instruction Examples
v Bind Network
v Results
From Large Language Model to Large Vision Model
Text Encoder → LLM
Image Encoder → LVM
Scalable Pre-training of Large Autoregressive Image Models – 1/24

El-Nouby, Alaaeldin, et al. "Scalable Pre-training of Large Autoregressive Image Models." arXiv preprint arXiv:2401.08541 (2024).
