Seminar CLIP
Minh-Duc Bui
Year 2024
AI VIETNAM
Outline
▪ Motivation of Multimodal
▪ CLIP and Its Variants
▪ Beyond Text and Image
▪ Multimodal with LLMs
▪ From Large Language Model to Large Vision Model
Foundation Models
https://blogs.nvidia.com/blog/what-are-foundation-models/
Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." arXiv preprint arXiv:2304.00685 (2023).
Dataset for VLMs
Motivation
▪ CNN and Transformer
[Figure: a dual-encoder setup. An Image Encoder (ResNet, VGG, ViT, ...) maps an input image to a 512-dim image embedding; a Text Encoder (BERT, RoBERTa, DistilBERT, ...) maps the caption "an image of a cat" to a 512-dim text embedding.]
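To make the figure concrete, here is a minimal PyTorch sketch of such a dual encoder. The slide only names the encoder families and the 512-dim embedding size; the specific backbones (ResNet-50, bert-base-uncased) and the projection layers below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import AutoTokenizer, AutoModel

class DualEncoder(nn.Module):
    """Maps an image and a text into embeddings of the same size (512-dim, as on the slide)."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Image tower: a CNN (ResNet-50) with its classifier replaced by a projection.
        self.image_encoder = resnet50(weights="IMAGENET1K_V2")
        self.image_encoder.fc = nn.Linear(self.image_encoder.fc.in_features, embed_dim)
        # Text tower: a Transformer (BERT) followed by a projection of the [CLS] state.
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, images):           # images: (B, 3, 224, 224)
        return self.image_encoder(images)     # (B, 512)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.text_proj(out.last_hidden_state[:, 0])  # [CLS] token -> (B, 512)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DualEncoder()
tokens = tokenizer(["an image of a cat"], return_tensors="pt")
img_emb = model.encode_image(torch.randn(1, 3, 224, 224))   # dummy image batch
txt_emb = model.encode_text(tokens["input_ids"], tokens["attention_mask"])
print(img_emb.shape, txt_emb.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```

Once both modalities live in the same 512-dim space, their similarity can be compared directly, which is the idea CLIP builds on.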
▪ Dataset – WIT (WebImageText)
400M (image, text) pairs collected from the Internet
▪ Bag-of-Words Prediction
Multi-label classification over a fixed vocabulary (10,000 or 100,000 words): a CNN maps the image to a 1024-dim embedding, and a classification head scores every vocabulary word (the, cat, plane, dog, is, go, on, sky, white, ...). The ground truth marks the words that occur in the paired text, here "plane" and "sky".
Joulin, Armand, et al. "Learning visual features from large weakly supervised data." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14. Springer International Publishing, 2016.
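A minimal sketch of this weakly supervised setup; the ResNet-50 backbone, the 10,000-word vocabulary, and the word indices below are illustrative assumptions rather than details taken from Joulin et al. Each image gets a 0/1 target vector over the vocabulary, and the head is trained with one independent binary cross-entropy term per word.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

VOCAB_SIZE = 10_000          # caption-word vocabulary (10,000 or 100,000 on the slide)

# CNN backbone -> 1024-dim embedding -> linear head over the whole vocabulary.
backbone = resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 1024)
head = nn.Linear(1024, VOCAB_SIZE)

images = torch.randn(8, 3, 224, 224)             # dummy image batch
targets = torch.zeros(8, VOCAB_SIZE)             # multi-label ground truth
targets[:, [42, 1337]] = 1.0                     # e.g., indices of "plane" and "sky" (hypothetical)

logits = head(backbone(images))                  # (8, VOCAB_SIZE)
loss = nn.BCEWithLogitsLoss()(logits, targets)   # independent sigmoid per vocabulary word
loss.backward()
```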
▪ Image Captioning
▪ Results of BoW and Image-Captioning models
CLIP – 3/21
▪ Original CLIP paper
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.
▪ Training
▪ Cosine Similarity
https://www.learndatasci.com/glossary/cosine-similarity/#:~:text=For%20example%3A,a%20cosine%20similarity%20of%20%2D1.
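As a quick refresher (not from the slides): cosine similarity is the dot product of two vectors divided by the product of their norms, so it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). A minimal sketch:

```python
import torch
import torch.nn.functional as F

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """cos(a, b) = (a . b) / (||a|| * ||b||), always in [-1, 1]."""
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1))

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])
print(cosine_similarity(a, b))              # tensor(0.) -- orthogonal vectors
print(F.cosine_similarity(a, b, dim=0))     # same result via the built-in
```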
▪ Training – Pseudo-code
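The slide presumably reproduces the numpy-style pseudo-code from Radford et al.; the following is a runnable PyTorch sketch of the same symmetric contrastive objective. Variable names and the fixed temperature value are assumptions (in CLIP the temperature is a learned parameter).

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over the (n x n) image-text similarity matrix.

    image_features, text_features: (n, d) encoder outputs for n aligned pairs;
    pair i is the positive for row/column i, all other entries are negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Scaled pairwise cosine similarities: logits[i, j] = cos(img_i, txt_j) / T.
    logits = image_features @ text_features.t() / temperature

    # The matching pair sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_image = F.cross_entropy(logits, labels)      # each image picks its text
    loss_text = F.cross_entropy(logits.t(), labels)   # each text picks its image
    return (loss_image + loss_text) / 2

# Toy usage with random "embeddings" for a batch of 4 (image, text) pairs.
print(clip_loss(torch.randn(4, 512), torch.randn(4, 512)))
```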
▪ CLIP's high-level architecture
https://huyenchip.com/2023/10/10/multimodal.html
▪ Inference
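For reference, a minimal zero-shot classification sketch in the spirit of these slides, using OpenAI's clip package; the model tag, image path, and label set are placeholders, not values from the slides. Each class name is wrapped in a prompt such as "a photo of a {class}", and the class whose text embedding is most similar to the image embedding wins.

```python
import torch
import clip              # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog", "plane"]                              # hypothetical label set
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # temperature-scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```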
▪ Zero-shot Performance
▪ Generalize to Out-of-Distribution (OOD)
Finetune CLIP
Goyal, Sachin, et al. "Finetune like you pretrain: Improved finetuning of zero-shot vision models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
CLIP Variants
▪ DALL-E 2
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
▪ AudioCLIP
Guzhov, Andrey, et al. "Audioclip: Extending clip to image, text and audio." ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
▪ GridCLIP
Lin, Jiayi, and Shaogang Gong. "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning." arXiv preprint arXiv:2303.09252 (2023).
▪ PointCLIP
Zhang, Renrui, et al. "Pointclip: Point cloud understanding by clip." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
▪ SAM
Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).
▪ SAM-CLIP
~41M images
Wang, Haoxiang, et al. "Sam-clip: Merging vision foundation models towards semantic and spatial understanding." arXiv preprint arXiv:2310.15308 (2023).
Different Architecture Styles of Image-Text Models
Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
ImageBind – 5/23
Girdhar, Rohit, et al. "Imagebind: One embedding space to bind them all." CVPR. 2023.
[Figure: modality-specific encoders (e.g., Video Encoder, Depth Encoder) feeding one shared embedding space.]
▪ Modality Encoder Design
1. We use the Vision Transformer (ViT) for images.
2. We encode audio and convert a 2 second audio sampled at 16kHz into spectrograms using 128 mel-spectrogram bins. As the spectrogram is also a 2D signal like an image, we use a ViT with a patch size of 16 and stride 10.
3. We treat thermal images and depth images as one-channel images and also use a ViT to encode them.
4. We extract the IMU signal consisting of accelerometer and gyroscope measurements across the X, Y, and Z axes. We use 5 second clips resulting in 2K time step IMU readings, which are projected using a 1D convolution with a kernel size of 8. The resulting sequence is encoded using a Transformer.
5. Finally, we follow the text encoder design from CLIP, which is a Transformer.
We add a modality-specific linear projection head on each encoder to obtain a fixed-size d-dimensional embedding.
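To illustrate item 2 and the projection head above, here is a minimal PyTorch/torchaudio sketch that turns 2 seconds of 16 kHz audio into overlapping mel-spectrogram patches and projects the encoded result into a shared embedding space. The FFT/hop sizes, encoder depth, and output dimension d used below are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torchaudio

# 2 seconds of 16 kHz audio -> 128-bin mel-spectrogram (n_fft/hop_length are assumptions).
waveform = torch.randn(1, 2 * 16_000)                          # (channels, samples)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=128
)(waveform)                                                     # (1, 128, ~201)

# Treat the spectrogram like an image: overlapping patches of size 16 with stride 10,
# implemented as a strided convolution, then flattened into a token sequence.
embed_dim = 768
patchify = nn.Conv2d(1, embed_dim, kernel_size=16, stride=10)
tokens = patchify(mel.unsqueeze(0))                             # (1, 768, H', W')
tokens = tokens.flatten(2).transpose(1, 2)                      # (1, num_patches, 768)

# Any Transformer encoder can stand in here for the ViT used in the paper.
layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
audio_features = nn.TransformerEncoder(layer, num_layers=2)(tokens).mean(dim=1)

# Modality-specific linear projection head into the fixed d-dimensional joint space.
proj = nn.Linear(embed_dim, 1024)                               # d = 1024 is an assumption
audio_embedding = proj(audio_features)                          # (1, 1024)
print(audio_embedding.shape)
```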
▪ Results
Meta-Transformer – 7/23
▪ 12 Modalities
Zhang, Yiyuan, et al. "Meta-transformer: A unified framework for multimodal learning." arXiv preprint arXiv:2307.10802 (2023).
▪ Pipeline
▪ Data-to-Sequence Tokenization
▪ Results: Image
▪ Results: Text, Infrared, Hyperspectral
LanguageBind – 11/23
▪ ImageBind vs. LanguageBind
Zhu, Bin, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv preprint arXiv:2310.01852 (2023).
▪ Pipeline
▪ Results
ChatSpot – 7/23
Zhao, Liang, et al. "Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning." arXiv preprint arXiv:2307.09474 (2023).
▪ Datasets
▪ Pipeline
▪ Results
ImageBind-LLM – 9/23
Han, Jiaming, et al. "Imagebind-llm: Multi-modality instruction tuning." arXiv preprint arXiv:2309.03905 (2023).
▪ Multi-modal Instruction Examples
▪ Bind Network
▪ Results
From Large Language Model to Large Vision Model
Text Encoder → LLM
Image Encoder → LVM
Scalable Pre-training of Large Autoregressive Image Models – 1/24
Alaaeldin El-Nouby, et al. "Scalable Pre-training of Large Autoregressive Image Models." arXiv preprint arXiv:2401.08541 (2024).