
AI VIETNAM
Seminar

Multimodal Deep Learning Explorations:
From CLIP and Beyond

Minh-Duc Bui
Year 2024

Outline
▪ Motivation for Multimodal Learning
▪ CLIP and Its Variants
▪ Beyond Text and Image
▪ Multimodal with LLMs
▪ From Large Language Model to Large Vision Model
Foundation Models

https://blogs.nvidia.com/blog/what-are-foundation-models/

Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
(a) and (b) require fine-tuning for each specific task with task-specific labelled data. (c) VLMs enable effective use of web data and zero-shot predictions without task-specific fine-tuning.

Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." arXiv preprint arXiv:2304.00685 (2023).
Number of publications on visual recognition VLMs (from Google Scholar): publications have grown exponentially since the pioneering CLIP study in 2021.

Dataset for VLMs
Motivation
v CNN and Transformer
[Figure: CNN vs. Transformer architectures]
[Figure: a dual-encoder setup. An Image Encoder (ResNet, VGG, ViT, ...) maps an image to a 512-d image embedding; a Text Encoder (BERT, RoBERTa, DistilBERT, ...) maps a caption such as "an image of a cat" to a 512-d text embedding.]
v Dataset – WIT (WebImageText)
400M (image, text) pairs collected from the Internet.
v Bag-of-Words Prediction
Multi-label classification over a vocabulary of 10,000–100,000 words: a CNN maps the image to a 1024-d embedding, and a classification head predicts which vocabulary words (ground truth in the example: "plane", "sky") occur in the text accompanying the image.

Joulin, Armand, et al. "Learning visual features from large weakly supervised data." Computer Vision – ECCV 2016, Part VII. Springer International Publishing, 2016.
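The multi-label, bag-of-words target can be sketched as follows. Note this is an illustrative stand-in: `vocab` and the caption below are toy examples, not the actual vocabulary or data used in the paper.

```python
def bow_target(caption: str, vocab: list[str]) -> list[int]:
    """Multi-hot target vector: 1 if the vocabulary word occurs in the caption."""
    words = set(caption.lower().split())
    return [1 if w in words else 0 for w in vocab]

# Toy vocabulary and caption (illustrative only)
vocab = ["the", "cat", "plane", "dog", "is", "go", "on", "sky", "white"]
target = bow_target("the plane is in the sky", vocab)
# marks "the", "plane", "is", "sky"
```

The model is then trained with a multi-label loss against this multi-hot vector, which is exactly why the task is hard: it demands predicting the exact accompanying words.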
v Image Captioning
v Results of BoW and Image Captioning Models

Both approaches share a key similarity:
• They try to predict the exact words of the text accompanying each image.
• This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images.
v What do we want?
[Figure: a shared feature space in which the embedding of a cat image lies near the embedding of the text "an image of a cat" and far from unrelated embeddings such as "the dog is jumping".]
CLIP and Its Variants
CLIP – 3/21
v Original CLIP paper

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.
v Training
v Cosine Similarity
https://www.learndatasci.com/glossary/cosine-similarity/
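As a quick plain-Python sketch (no library assumptions), cosine similarity between two embedding vectors is the dot product of the vectors divided by the product of their norms:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

CLIP compares L2-normalized embeddings, in which case cosine similarity reduces to a plain dot product.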
v Training – Pseudo-code
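The pseudo-code in the paper computes a symmetric cross-entropy over the batch's image-text similarity matrix: the matched (i, i) pairs are positives, every other pairing is a negative. A runnable NumPy sketch of just the loss step, with the encoder outputs as stand-in arrays; `temperature` here plays the role of the learned scale exp(t) in the original pseudo-code:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp; keepdims so the result broadcasts back
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of n (image, text) pairs."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature          # [n, n]
    n = logits.shape[0]
    diag = np.arange(n)
    # Cross-entropy in both directions: image->text (rows), text->image (cols)
    loss_i = -(logits - logsumexp(logits, axis=1))[diag, diag].mean()
    loss_t = -(logits - logsumexp(logits, axis=0))[diag, diag].mean()
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))
txt = img + 0.01 * rng.normal(size=(8, 512))  # nearly matched pairs
print(clip_loss(img, txt))  # small loss for well-aligned pairs
```

For perfectly aligned pairs the loss approaches 0; in the real model the temperature is a learned parameter (stored as a log-value t and applied as exp(t)), not a fixed constant.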
v CLIP's high-level architecture
https://huyenchip.com/2023/10/10/multimodal.html
v Inference
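At inference time, zero-shot classification embeds one prompt per class (e.g. "a photo of a {label}") and picks the class whose text embedding is most similar to the image embedding. A sketch with made-up 2-d vectors standing in for real encoder outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def zero_shot_classify(image_emb, class_embs: dict):
    """Return the class whose prompt embedding is nearest by cosine similarity."""
    return max(class_embs, key=lambda c: cosine(image_emb, class_embs[c]))

# Toy stand-ins: in real CLIP these come from the image and text encoders
class_embs = {
    "a photo of a cat": [0.9, 0.1],
    "a photo of a dog": [0.1, 0.9],
}
print(zero_shot_classify([0.8, 0.2], class_embs))  # "a photo of a cat"
```

No task-specific fine-tuning is needed: adding a new class only requires embedding a new prompt.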
v Zero-shot Performance
v Generalization to Out-of-Distribution (OOD) Data
Finetune CLIP

Goyal, Sachin, et al. "Finetune like you pretrain: Improved finetuning of zero-shot vision models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
CLIP Variants
v DALL-E 2

Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).

v AudioCLIP

Guzhov, Andrey, et al. "AudioCLIP: Extending CLIP to image, text and audio." ICASSP 2022. IEEE, 2022.
v GridCLIP

Lin, Jiayi, and Shaogang Gong. "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning." arXiv preprint arXiv:2303.09252 (2023).

v PointCLIP

Zhang, Renrui, et al. "PointCLIP: Point cloud understanding by CLIP." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
v SAM

Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).

v SAM-CLIP (~41M images)

Wang, Haoxiang, et al. "SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding." arXiv preprint arXiv:2310.15308 (2023).
Different Architecture Styles of Image-Text Models

Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
Beyond Text and Image
ImageBind – 5/23

Girdhar, Rohit, et al. "ImageBind: One embedding space to bind them all." CVPR. 2023.

[Figure: modality-specific encoders (image, video, depth, ...) aligned to the shared CLIP embedding space.]
v Modality Encoder Design
1. We use the Vision Transformer (ViT) for images.
2. For audio, we convert a 2-second clip sampled at 16 kHz into a spectrogram with 128 mel-spectrogram bins. As the spectrogram is also a 2D signal like an image, we use a ViT with a patch size of 16 and a stride of 10.
3. We treat thermal images and depth images as one-channel images and also use a ViT to encode them.
4. We extract the IMU signal consisting of accelerometer and gyroscope measurements across the X, Y, and Z axes. We use 5-second clips, resulting in 2K time-step IMU readings, which are projected using a 1D convolution with a kernel size of 8. The resulting sequence is encoded with a Transformer.
5. Finally, we follow the text encoder design from CLIP, which is a Transformer.

We add a modality-specific linear projection head on each encoder to obtain a fixed-size d-dimensional embedding. We use a Transformer architecture for all modality encoders.
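The patch and stride choices above determine how many tokens each Transformer receives, via the usual strided-window formula floor((size - window) / stride) + 1. A small sketch of that shape arithmetic; the spectrogram's time-frame count and the IMU conv stride below are illustrative assumptions, not values stated in the paper:

```python
def num_windows(size: int, window: int, stride: int) -> int:
    """How many positions a window of width `window` takes in `size` with `stride`."""
    return (size - window) // stride + 1

# Audio: 128 mel bins, ViT patch size 16 with stride 10 along both axes.
mel_bins, time_frames = 128, 204   # time_frames: assumed example value
audio_tokens = num_windows(mel_bins, 16, 10) * num_windows(time_frames, 16, 10)

# IMU: 5 s clips -> ~2000 readings, 1D conv with kernel 8 (stride 8 assumed)
imu_tokens = num_windows(2000, 8, 8)
print(audio_tokens, imu_tokens)
```

The overlapping stride (10 < 16) for audio yields overlapping patches, which is a common trick for spectrogram ViTs.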
v Results
Meta-Transformer – 7/23
v 12 Modalities

Zhang, Yiyuan, et al. "Meta-Transformer: A unified framework for multimodal learning." arXiv preprint arXiv:2307.10802 (2023).

v Pipeline
v Data-to-Sequence Tokenization
v Results – Image
v Results – Text / Infrared / Hyperspectral
LanguageBind – 11/23
v ImageBind vs. LanguageBind

Zhu, Bin, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv preprint arXiv:2310.01852 (2023).

v Pipeline
v Results
Multimodal with LLMs
ChatSpot – 7/23

Zhao, Liang, et al. "ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning." arXiv preprint arXiv:2307.09474 (2023).

v Datasets
v Pipeline
v Results
ImageBind-LLM – 9/23

Han, Jiaming, et al. "ImageBind-LLM: Multi-modality instruction tuning." arXiv preprint arXiv:2309.03905 (2023).

v Multi-modal Instruction Examples
v Bind Network
v Results
From Large Language Model to Large Vision Model
Text Encoder → LLM
Image Encoder → LVM
Scalable Pre-training of Large Autoregressive Image Models – 1/24

El-Nouby, Alaaeldin, et al. "Scalable Pre-training of Large Autoregressive Image Models." arXiv preprint arXiv:2401.08541 (2024).
