ICML 2023 Tutorial: Multimodal Machine Learning
Multimodal AI Technologies
Robots, personal vehicles, ubiquitous computing, video conferencing, mobile, online, and wearable devices
What is Multimodal?
What is a Modality?
Definition: a modality is the way in which something is expressed or perceived, as captured from a sensor.
Modalities range from raw (closest to the sensor) to abstract (farthest from the sensor).
What is Multimodal?
A dictionary definition…
A research-oriented definition…
Heterogeneous Modalities
Dimensions of Heterogeneity (between Modality A and Modality B)
1. Element representations: discrete, continuous, granularity
2. Element distributions: density, frequency
3. Structure: temporal, spatial, latent, explicit
4. Information: abstraction, entropy
5. Noise: uncertainty, noise, missing data
6. Relevance: task and context dependence
Connected Modalities
unconnected
unique
stronger
weaker
Modality A
Modality B unique
Statistical Semantic
10
Interacting Modalities
Modality A and Modality B are combined through inference into a representation z, which produces a response. Interactions happen during inference!
Taxonomy of Interaction Responses – A Behavioral Science View
Redundancy (signals a and b elicit the same response):
- Equivalence: a+b elicits the same response as each signal alone
- Enhancement: a+b elicits a stronger response
Nonredundancy (signals a and b elicit different responses):
- Independence: a+b preserves both responses
- Dominance: a+b follows one signal's response
- Modulation: one signal modifies the response to the other
- Emergence: a+b produces a new response
[Partan and Marler (2005). Issues in the classification of multimodal communication signals. American Naturalist, 166(2)]
Cross-modal Interaction Mechanics
- Noninteracting (shared information): redundancy → equivalence
- Additive: redundancy → enhancement
- Noninteracting (union of shared and unique information): nonredundancy → independence
- Asymmetric: nonredundancy → dominance
What is Multimodal? Why is it hard? What is next?
Heterogeneous
Multimodal Machine Learning
Modalities A, B, C → Multimodal ML → prediction ŷ
- Supervised
- Self-supervised
- Reinforcement, …
Multimodal Technical Challenges – Surveys, Tutorials and Courses (2016–2022)
Tutorials: ICML 2023, CVPR 2022, NAACL 2022
Multimodal Machine Learning (11th edition): https://cmu-multicomp-lab.github.io/mmml-course/fall2020/
Multimodal Machine Learning (12th edition): https://cmu-multicomp-lab.github.io/mmml-course/fall2022/
Advanced Topics in Multimodal Machine Learning: https://cmu-multicomp-lab.github.io/adv-mmml-course/spring2022/
Core Multimodal Challenges
Representation, Alignment, Reasoning, Generation, Transference, Quantification
Challenge 1: Representation
Individual elements:
Challenge 1: Representation
Sub-challenges: Fusion, Coordination, Fission
Challenge 2: Alignment
Aligning elements across Modality A and Modality B: spatial, hierarchical
Challenge 2: Alignment
Sub-challenges: Discrete Alignment, Continuous Alignment, Contextualized Representation
Challenge 3: Reasoning
Modality A + Modality B → reasoning → ŷ
Challenge 3: Reasoning
Reasoning composes multimodal evidence (e.g., words from each modality) through multiple inferential steps toward ŷ, often exploiting external knowledge.
Challenge 5: Transference
Sub-challenge: transference (A → B)
An enriched Modality A, only available during training, is used to transfer knowledge that improves Modality B.
Challenge 6: Quantification
Sub-challenges: Heterogeneity, Connections & Interactions, Learning (e.g., loss curves across training epochs)
Core Multimodal Challenges
Representation, Alignment, Reasoning, Generation, Transference, Quantification
Sub-Challenge: Representation Fusion
Homogeneous fusion (Modalities A and B of similar type) vs. heterogeneous fusion (Modalities A and B of different types)
Feature Interactions: From Additive to Multiplicative
300 book reviews. y: audience score; x₁: percentage of smiling; x₂: professional status (0 = non-critic, 1 = critic).
H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?
Linear regression: y = w₀ + w₁x₁ + ε

            Estimate   95% CI
w₀          4.63       [4.20, 5.06]
w₁ (slope)  1.20       [0.83, 1.57]
Feature Interactions: From Additive to Multiplicative
Same setup: 300 book reviews; y: audience score; x₁: percentage of smiling; x₂: professional status (0 = non-critic, 1 = critic).
Linear regression: y = w₀ + w₁x₁ + w₂x₂ + ε

     Estimate   95% CI
w₀   5.29       [4.86, 5.73]
w₁   1.19       [0.85, 1.53]     (positive effect of smiling)
w₂   −1.69      [−2.14, −1.24]   (negative effect of critic status)
Feature Interactions: From Additive to Multiplicative
Same setup, now with an interaction term:
y = w₀ + w₁x₁ + w₂x₂ + w₃(x₁ × x₂) + ε

     Estimate   95% CI
w₀   5.79       [5.29, 6.29]
w₁   0.68       [0.25, 1.11]
w₂   −2.94      [−3.73, −2.15]
w₃   1.29       [0.61, 1.97]     (multiplicative interaction!)
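A minimal sketch of this analysis with ordinary least squares (illustrative code on synthetic, hypothetical data, so the fitted values will differ from the slide's estimates):

```python
# Fit the three regressions from the slides; a clearly non-zero w3
# signals a multiplicative interaction between smiling and critic status.
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(0, 1, n)                  # percentage of smiling
x2 = rng.integers(0, 2, n).astype(float)   # 0 = non-critic, 1 = critic
# Hypothetical ground truth that includes an interaction term.
y = 5.8 + 0.7 * x1 - 2.9 * x2 + 1.3 * x1 * x2 + rng.normal(0, 0.5, n)

def ols(X, y):
    """Least-squares estimate of w in y = X w + eps."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

ones = np.ones(n)
w_h1 = ols(np.column_stack([ones, x1]), y)               # y = w0 + w1*x1
w_h2 = ols(np.column_stack([ones, x1, x2]), y)           # ... + w2*x2
w_h3 = ols(np.column_stack([ones, x1, x2, x1 * x2]), y)  # ... + w3*x1*x2
print(w_h1, w_h2, w_h3)
```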
Basic Fusion – Additive Interactions
Additive fusion: z = w₁x_A + w₂x_B
(A 1-layer neural network over the concatenated inputs can be seen as additive fusion.)
With unimodal encoders f_A and f_B, additive fusion becomes: z = f_A(x_A) + f_B(x_B)
This can be seen as an ensemble approach (late fusion).
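A minimal PyTorch sketch of additive fusion with unimodal encoders (the class and dimension names are illustrative, not from the tutorial):

```python
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.f_a = nn.Sequential(nn.Linear(dim_a, dim_z), nn.ReLU())  # encoder A
        self.f_b = nn.Sequential(nn.Linear(dim_b, dim_z), nn.ReLU())  # encoder B

    def forward(self, x_a, x_b):
        # z = f_A(x_A) + f_B(x_B): each modality contributes independently,
        # which is why additive fusion resembles a late-fusion ensemble.
        return self.f_a(x_a) + self.f_b(x_b)

z = AdditiveFusion(64, 32, 128)(torch.randn(8, 64), torch.randn(8, 32))
```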
Multiplicative Interactions
Simple multiplicative fusion: z = w(x_A × x_B)
Bilinear fusion: Z = x_Aᵀ W x_B
[Jayakumar et al., Multiplicative Interactions and Where to Find Them. ICLR 2020]
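A minimal sketch of both multiplicative variants above (illustrative code; names are mine):

```python
import torch
import torch.nn as nn

class ElementwiseMultFusion(nn.Module):
    """z = W (x_A * x_B): element-wise product, requires equal dimensions."""
    def __init__(self, dim, dim_z):
        super().__init__()
        self.w = nn.Linear(dim, dim_z)

    def forward(self, x_a, x_b):
        return self.w(x_a * x_b)

class BilinearFusion(nn.Module):
    """Z = x_A^T W x_B: every pair of features from A and B interacts."""
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.bilinear = nn.Bilinear(dim_a, dim_b, dim_z)

    def forward(self, x_a, x_b):
        return self.bilinear(x_a, x_b)

x_a, x_b = torch.randn(8, 64), torch.randn(8, 32)
z = BilinearFusion(64, 32, 128)(x_a, x_b)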
Tensor Fusion
Tensor fusion (bimodal): Z = w([x_A; 1] ⊗ [x_B; 1])
Appending a constant 1 to each modality makes the outer product contain the unimodal (additive) terms alongside the bimodal (multiplicative) term; adding a third modality x_C also yields trimodal (multiplicative) terms.
[Zadeh et al., Tensor Fusion Network for Multimodal Sentiment Analysis. EMNLP 2017]
[Hou et al., Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling. NeurIPS 2019]
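A minimal sketch of bimodal tensor fusion (illustrative code): appending 1 to each modality makes the flattened outer product contain the unimodal terms, the bimodal products, and a bias, which a linear layer then projects:

```python
import torch
import torch.nn as nn

def tensor_fusion(x_a, x_b):
    ones = torch.ones(x_a.size(0), 1)
    xa1 = torch.cat([x_a, ones], dim=1)          # [x_A; 1]
    xb1 = torch.cat([x_b, ones], dim=1)          # [x_B; 1]
    # Batched outer product, flattened: (B, (d_a+1)*(d_b+1))
    return torch.einsum("bi,bj->bij", xa1, xb1).flatten(1)

x_a, x_b = torch.randn(8, 16), torch.randn(8, 12)
z = tensor_fusion(x_a, x_b)                      # (8, 17*13) = (8, 221)
proj = nn.Linear(z.size(1), 64)(z)               # the learned weights w
```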
Low-rank Tensor Fusion
Factorize the fusion weights into r rank-1 components, W = Σᵢ₌₁ʳ w_A⁽ⁱ⁾ ⊗ w_B⁽ⁱ⁾, so the fused representation can be computed without materializing the full tensor Z:
h = W · Z = (Σᵢ₌₁ʳ w_A⁽ⁱ⁾ · [x_A; 1]) ∘ (Σᵢ₌₁ʳ w_B⁽ⁱ⁾ · [x_B; 1])
[Liu et al., Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. ACL 2018]
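A minimal sketch of low-rank fusion (illustrative code): each modality is projected with r rank-1 factors, and the summed projections are combined with an element-wise product, avoiding the explicit tensor Z:

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_h, rank=4):
        super().__init__()
        # r factors per modality: w_A^(i), w_B^(i), each of shape (dim+1, dim_h)
        self.w_a = nn.Parameter(torch.randn(rank, dim_a + 1, dim_h) * 0.1)
        self.w_b = nn.Parameter(torch.randn(rank, dim_b + 1, dim_h) * 0.1)

    def forward(self, x_a, x_b):
        ones = torch.ones(x_a.size(0), 1)
        xa1 = torch.cat([x_a, ones], dim=1)   # [x_A; 1]
        xb1 = torch.cat([x_b, ones], dim=1)   # [x_B; 1]
        # sum_i (x_A^T w_A^(i)) and sum_i (x_B^T w_B^(i)), fused with ∘
        proj_a = torch.einsum("bd,rdh->bh", xa1, self.w_a)
        proj_b = torch.einsum("bd,rdh->bh", xb1, self.w_b)
        return proj_a * proj_b                # h, without building Z

h = LowRankFusion(16, 12, 64)(torch.randn(8, 16), torch.randn(8, 12))
```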
Gated Fusion
Each modality passes through a learned gate before fusion into z; the gating output can be one weight for the whole modality.
[Arevalo et al., Gated Multimodal Units for Information Fusion. ICLR Workshop 2017]
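A minimal sketch of gated fusion with one scalar gate per modality, as described above (illustrative code; the GMU of Arevalo et al. differs in details):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.h_a = nn.Linear(dim_a, dim_z)
        self.h_b = nn.Linear(dim_b, dim_z)
        # The gate sees both modalities and outputs one weight per modality.
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_b, 2), nn.Softmax(dim=-1))

    def forward(self, x_a, x_b):
        g = self.gate(torch.cat([x_a, x_b], dim=1))        # (B, 2)
        return g[:, :1] * torch.tanh(self.h_a(x_a)) + \
               g[:, 1:] * torch.tanh(self.h_b(x_b))

z = GatedFusion(64, 32, 128)(torch.randn(8, 64), torch.randn(8, 32))
```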
Modality-Shifting Fusion
The primary modality x_A provides the base representation; gated secondary modalities x_B, x_C produce a shift that displaces it, giving the fused z.
Example with the language modality: a negatively-shifted representation of the word "expectations".
[Wang et al., Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. AAAI 2019]
[Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers. ACL 2020]
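A minimal sketch in the spirit of these papers (illustrative code, not their exact implementation): the secondary modalities produce a gated shift added to the primary (word) representation:

```python
import torch
import torch.nn as nn

class ModalityShift(nn.Module):
    def __init__(self, dim_l, dim_v, dim_a):
        super().__init__()
        self.gate_v = nn.Linear(dim_l + dim_v, dim_l)  # gate from language+vision
        self.gate_a = nn.Linear(dim_l + dim_a, dim_l)  # gate from language+audio
        self.shift_v = nn.Linear(dim_v, dim_l)
        self.shift_a = nn.Linear(dim_a, dim_l)

    def forward(self, x_l, x_v, x_a):
        g_v = torch.sigmoid(self.gate_v(torch.cat([x_l, x_v], dim=-1)))
        g_a = torch.sigmoid(self.gate_a(torch.cat([x_l, x_a], dim=-1)))
        shift = g_v * self.shift_v(x_v) + g_a * self.shift_a(x_a)
        return x_l + shift                             # shifted word representation

z = ModalityShift(300, 64, 32)(torch.randn(8, 300),
                               torch.randn(8, 64), torch.randn(8, 32))
```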
Mixture of Fusions
Multiple fusion strategies combined: unimodal paths, additive fusion, and bilinear fusion are mixed into a single representation z.
[Zadeh et al., Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. ACL 2018]
[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]
Nonlinear Fusion
Nonlinear fusion: ŷ = f(x_A, x_B) ∈ ℝᵈ, for any nonlinear model f mapping both modalities through fusion to a prediction ŷ.
This can be seen as early fusion, with a single fusion + prediction model over the concatenated inputs: ŷ = f([x_A, x_B])
Measuring Non-Additive Interactions
Nonlinear fusion: ŷ = f(x_A, x_B)
Can f be projected onto an additive fusion ŷ′ = f_A(x_A) + f_B(x_B)?
f̃(x_A, x_B) = 𝔼_{x_B}[f(x_A, x_B)] + 𝔼_{x_A}[f(x_A, x_B)]
i.e., an additive approximation with f_A(x_A) = 𝔼_{x_B}[f(x_A, x_B)] and f_B(x_B) = 𝔼_{x_A}[f(x_A, x_B)].
[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]
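A minimal numpy sketch of this projection (illustrative, not the authors' code): the per-modality terms average f over the empirical distribution of the other modality, and a constant recenters the sum (written here as subtracting the grand mean, one common form of the μ offset on the next slide):

```python
import numpy as np

def emap(f, X_a, X_b):
    """Additive projection f̂_A(x_A) + f̂_B(x_B) + const of a fusion model f,
    with expectations estimated by averaging over the dataset."""
    n = len(X_a)
    # f evaluated on all n x n cross-pairings of the two modalities
    grid = np.array([[f(X_a[i], X_b[j]) for j in range(n)] for i in range(n)])
    f_a = grid.mean(axis=1)        # E over x_B, one value per x_A^i
    f_b = grid.mean(axis=0)        # E over x_A, one value per x_B^j
    return f_a + f_b - grid.mean() # centered additive predictions

f = lambda a, b: np.tanh(a @ b)    # some nonlinear fusion function
X_a, X_b = np.random.randn(50, 8), np.random.randn(50, 8)
additive_preds = emap(f, X_a, X_b)
```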
Measuring Non-Additive Interactions
EMAP projection: project the trained nonlinear fusion model onto its additive approximation ŷ′ = f̂_A(x_A) + f̂_B(x_B) + μ.
Empirically, across tasks and model families (nonlinear, polynomial), the differences between the best nonlinear models and their additive EMAP projections are small! Additive fusion is always a good baseline.
[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]
Learning Non-additive Bimodal and Trimodal Interactions
Idea: prioritize simpler interactions: a unimodal (additive) head, then a bimodal (non-additive) head, then a trimodal (non-additive) head, each fitting the residual left by the previous one.
Multimodal residual optimization:
ℒ(y, ŷ_uni) + ℒ(y − ŷ_uni, ŷ_bi) + ℒ(y − ŷ_uni − ŷ_bi, ŷ_tri)
where ŷ_uni sums unimodal terms from x_A, x_B, x_C; ŷ_bi sums terms from the pairs (x_A, x_B), (x_A, x_C), (x_B, x_C); and ŷ_tri uses (x_A, x_B, x_C).
[Wortwein et al., Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions. Findings-EMNLP 2022]
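A minimal sketch of this residual objective (illustrative PyTorch code; detaching the targets so higher-order heads do not move lower-order ones is one plausible choice, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def residual_loss(y, y_uni, y_bi, y_tri):
    """L(y, ŷ_uni) + L(y − ŷ_uni, ŷ_bi) + L(y − ŷ_uni − ŷ_bi, ŷ_tri)."""
    loss_uni = F.mse_loss(y_uni, y)
    loss_bi = F.mse_loss(y_bi, (y - y_uni).detach())          # fit the residual
    loss_tri = F.mse_loss(y_tri, (y - y_uni - y_bi).detach()) # and its residual
    return loss_uni + loss_bi + loss_tri
```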
Fusion with Heterogeneous Modalities
Fusion transformer: self-attention over a single joint sequence of language and vision tokens ([cls] x₁ x₂ x₃ [mask] x₄ [sep] x′₁ [mask] x′₃ x′₅ x′₄ …), with masked tokens in either modality predicted from the full multimodal context.
Image Representation Learning: Masked Autoencoder (MAE)
Mask a random subset (~70%) of image patches. A Vision Transformer (ViT) encodes the visible patches; a transformer decoder, used only during pre-training, performs reconstruction, with a loss function over the whole image.
[He et al., Masked Autoencoders Are Scalable Vision Learners. CVPR 2022]
Multimodal Masked Autoencoder
Dynamic Early Fusion
Heterogeneous fusion of Modality A and Modality B
Dynamic Early Fusion
The fusion architecture is fully learned from optimization and data:
1. Define basic representation building blocks: ReLU, layer norm, conv, self-attention
2. Search over compositions of these blocks (add, fuse) via differentiable architecture search to predict ŷ
[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]
[Liu et al., DARTS: Differentiable Architecture Search. ICLR 2019]
Heterogeneity-aware Fusion
Information transfer, from a transfer-learning perspective:
1a. Estimate modality heterogeneity via transfer
1b. Estimate interaction heterogeneity via transfer
2a. Compute the modality heterogeneity matrix
2b. Compute the interaction heterogeneity matrix
3. Determine parameter clustering (implicitly captures heterogeneity)
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Improving Optimization
[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]
Improving Optimization
Relevance heterogeneity. Two explanations for the drop in performance when fusing modalities:
1. Multimodal networks are more prone to overfitting due to increased complexity
2. Different modalities overfit and generalize at different rates
[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]
Improving Optimization
Relevance heterogeneity.
[Figure: three training configurations compared: unimodal prediction heads (x_A → ŷ, x_B → ŷ), a single fusion + prediction model, and fusion combined with auxiliary unimodal prediction heads]
Heterogeneity in Noise: Studying Robustness
Strong tradeoffs between performance and robustness:
- Noise within a modality (e.g., typos: "noise" → "nosie")
- Missing modalities (e.g., only the text "Today was great!" is observed)
Heterogeneity in Noise: Studying Robustness
Several approaches toward more robust models:
- Translation model: predict the missing modality (x̂_B from x_A) before fusion + prediction
- Joint probabilistic model over modalities x_A and x_B
Sub-Challenge 1a: Representation Fusion
Definition: Learn a joint representation that models cross-modal interactions between modalities.
Homogeneous modalities: late fusion, additive fusion, multiplicative fusion, tensor fusion, polynomial fusion, gated fusion, modality-shift fusion, nonlinear fusion
Heterogeneous modalities: heterogeneity-aware fusion, improving optimization, improving robustness
Core Multimodal Challenges
Representation, Alignment, Reasoning, Generation, Transference, Quantification
Challenge 1: Representation
Sub-challenges: Fusion, Coordination, Fission
Sub-Challenge: Representation Coordination
Learning with a coordination function: ℒ = g(f_A, f_B), with model parameters θ_g, θ_{f_A}, and θ_{f_B}
Sub-Challenge: Representation Coordination
Encoders f_A and f_B produce z_A and z_B, projected into coordinated views via U and V.
Learning with a coordination function: ℒ = g(f_A, f_B), with model parameters θ_g, θ_{f_A}, and θ_{f_B}.
Canonical Correlation Analysis: argmax_{V, U, f_A, f_B} corr(z_A, z_B)
Coordination with Contrastive Learning
Encoders f_A and f_B produce coordinated representations z_A and z_B.
Contrastive loss: brings positive pairs closer and pushes negative pairs apart; simple instances follow in the next examples.
Example – Visual-Semantic Embeddings
Language encoder f_L → z_L, visual encoder f_V → z_V. Two contrastive (max-margin) loss terms, one per direction:
ℒ = max(0, α + sim(z_L, z_V⁻) − sim(z_L, z_V)) + max(0, α + sim(z_L⁻, z_V) − sim(z_L, z_V))
where (z_L, z_V) is a matched pair and z_V⁻, z_L⁻ are mismatched negatives.
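A minimal sketch of this max-margin loss with in-batch negatives (illustrative PyTorch code):

```python
import torch
import torch.nn.functional as F

def triplet_contrastive(z_l, z_v, alpha=0.2):
    """max(0, α + sim(neg) − sim(pos)), in both directions."""
    z_l = F.normalize(z_l, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    sim = z_l @ z_v.t()                          # cosine similarities
    pos = sim.diag().unsqueeze(1)                # matched pairs on the diagonal
    off = ~torch.eye(len(sim), dtype=torch.bool) # mask selecting negatives
    loss_lv = F.relu(alpha + sim - pos)[off].mean()       # negatives for z_L
    loss_vl = F.relu(alpha + sim.t() - pos)[off].mean()   # negatives for z_V
    return loss_lv + loss_vl

loss = triplet_contrastive(torch.randn(8, 64), torch.randn(8, 64))
```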
Example – CLIP (Contrastive Language–Image Pre-training)
Language encoder f_L → z_L, visual encoder f_V → z_V (image).
Popular contrastive loss: InfoNCE
ℒ = −(1/N) Σᵢ₌₁ᴺ log [ exp(sim(z_Lⁱ, z_Vⁱ)) / Σⱼ₌₁ᴺ exp(sim(z_Lⁱ, z_Vʲ)) ]
Positive pairs (matched image–text, the numerator) are pulled together; negative pairs (the mismatched terms in the denominator) are pushed apart. The similarity function can be cosine similarity.
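A minimal sketch of the InfoNCE objective above (illustrative code; CLIP additionally learns the temperature):

```python
import torch
import torch.nn.functional as F

def info_nce(z_l, z_v, temperature=0.07):
    z_l = F.normalize(z_l, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    logits = (z_l @ z_v.t()) / temperature   # pairwise similarities
    targets = torch.arange(len(z_l))         # positives lie on the diagonal
    # cross-entropy over rows/columns = −(1/N) Σ_i log softmax(sim(i, i))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```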
Contrastive Learning and Connected Modalities
The same InfoNCE objective, seen through the lens of connections: it coordinates the two encoders around the information shared between paired modalities.
Multiview Redundancy and Contrastive Learning (open challenges)
Multi-view redundancy: contrastive learning works best when the information shared between views is "just right", neither more nor less than the task-relevant information. Multi-view redundancy may not hold for multimodal problems!
[Tian et al., What Makes for Good Views for Contrastive Learning? NeurIPS 2020]
[Tosh et al., Contrastive Learning, Multi-view Redundancy, and Linear Models. ALT 2021]
Sub-Challenge 1c: Representation Fission
Factorize multimodal information into components, e.g., information unique to modality 1 and task Y, and information unique to modality 2 and task Y.
Partial Information Decomposition
Classical information theory: the three-way interaction information can be negative! Partial information decomposition instead splits the task-relevant multimodal information into non-negative redundant, unique, and synergistic components.
[Williams and Beer. Nonnegative Decomposition of Multivariate Information. 2010]
Quantifying Interactions (open challenges)
These interactions can be efficiently estimated, giving a path towards understanding interactions (e.g., for a VQA question such as "Is there a red shape above a circle?").
The estimates also match human judgment of interactions, and pass other sanity checks on synthetic datasets.
They can also be used to choose the most appropriate models; can they be used to better train and design new models?
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
On Agreement, Disagreement, and Synergy (open challenges)
- Agreement (co-training: classifiers f₁: y₁ and f₂: y₂ trained to agree); agreement + synergy: future work?
- Disagreement + uniqueness: MRMR / feature selection
- Disagreement + synergy: future work?
[Blum and Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998]
[Peng et al., Feature Selection Based on Mutual Information Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. TPAMI 2005]
[Liang et al., Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications. arXiv 2023]
Factorized Learning of Shared + Unique Information
Learning Task-relevant Unique Information (open challenges)
Challenge 1: Representation
Sub-challenges: Fusion, Coordination, Fission
Challenge 2: Alignment
Sub-challenges: Discrete Alignment, Continuous Alignment, Contextualized Representation
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the cross-modal interactions of the structured multimodal data.
[Figure: alignment (attentions) over time between spoken words and nonverbal behaviors such as rolling eyes and sighing]
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the cross-modal interactions of the structured multimodal data.
Joint Undirected Contextualized Representations
Transformer self-attention
[Li et al., VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019]
Contextualized Representations Pre-training
Transformer self-attention over the joint multimodal sequence, pre-trained with objectives such as masked prediction and cross-modal alignment.
Directed Cross-Modal Alignment
Attention with language as queries and the image as keys/values: the attention weights are language-vision similarities, and the output is a new visually-contextualized representation of language.
Cross-modal multi-head attention over the two sequences yields a cross-modal representation summary.
Inputs: text and image tokens, each with modality embeddings and position embeddings added.
[Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019]
[Lu et al., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019]
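A minimal sketch of directed cross-modal attention (illustrative code): language tokens query image tokens, producing a visually-contextualized representation of each language token:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        # query = text, key/value = image: the attention weights are the
        # language-vision similarities from the slide above
        out, weights = self.attn(text_tokens, image_tokens, image_tokens)
        return out, weights

text = torch.randn(2, 10, 256)    # (batch, text length, dim)
image = torch.randn(2, 49, 256)   # (batch, image patches, dim)
contextualized_text, attn = CrossModalAttention(256)(text, image)
```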
High-Modality Multimodal Transformers
Transfer across partially observable modalities: a unified model with parameter sharing plus multitask and transfer learning.
HighMMT: non-parallel multitask learning across video classification, visual QA, sentiment and emotions, Atari games, and robot manipulation. Task-specific classifiers sit on top of a shared multimodal model (same model architecture, same parameters!), over modality-specific embeddings.
High-Modality Models (open challenges)
Modality-specific embeddings?
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the cross-modal interactions of the structured multimodal data.
Conditioning Pretrained Language Models
Conditioning via prefix tuning: a vision encoder produces a prefix for a frozen pretrained language model, which generates, e.g., "A small red boat on the water."
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning Pretrained Language Models
Conditioning via prefix tuning. 0-shot VQA: the adapted + pretrained model computes p(x|c) and answers, e.g., "Blue".
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning Pretrained Language Models
Conditioning via prefix tuning. 1-shot outside-knowledge VQA: adapted + pretrained p(x|c) answers, e.g., "Steve Jobs".
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning Pretrained Language Models
Conditioning via prefix tuning. Few-shot image classification: adapted + pretrained, e.g., "This is a dax."
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning Pretrained Language Models
MiniGPT-4
[Zhu et al., MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 2023]
Conditioning Pretrained Language Models
LLaMA-Adapter
[Girdhar et al., ImageBind: One Embedding Space To Bind Them All. CVPR 2023]
[Gao et al., LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv 2023]
Structure in Contextualized Representations
Compositionality requires knowing the structure.
In the continuous case (i.e., if structure is given or can be learned easily in a differentiable manner):
Interactive Structure
Structure defined through an interactive environment. Main difference from temporal structure: actions taken at previous time steps affect future states.
A policy maps local and aligned representations of the Modality A and Modality B streams to actions a.
Interactive Structure
Structure defined through an interactive environment. Main difference from temporal structure: actions taken at previous time steps affect future states.
Language-conditional RL vs. language-assisted RL
[Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019]
Language Models for Planning (open challenges)
Streams of Modality A and Modality B feed policies producing actions a₁, a₂, a₃, …
Hierarchical Structure
Leverage the syntactic structure of language: a query is parsed into composable modules, e.g., Attend[red], Attend[circle], Re-attend[above], Combine[and], which are executed to answer it.
Enables better interpretability, but requires parsing + engineering each specific module.
Hierarchical Structure + Multimodal Pretraining (open challenges)
Multimodal input → multimodal output models? More in Challenge 4: Generation.
[Koh et al., Grounding Language Models to Images for Multimodal Inputs and Outputs. ICML 2023]
Hierarchical Structure + Multimodal Pretraining (open challenges)
[Chen et al., Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. ACM Multimedia 2022]
[Zhang et al., Multimodal Analogical Reasoning over Knowledge Graphs. ICLR 2023]
Hierarchical Structure + Multimodal Pretraining (open challenges)
[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. 2023]
Hierarchical Structure + Multimodal Pretraining (open challenges)
[Thrush et al., Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. CVPR 2022]
[Pandey et al., Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment. ACL 2023]
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the cross-modal interactions of the structured multimodal data.
Core Multimodal Challenges
Representation, Alignment, Reasoning, Generation, Transference, Quantification
What is Multimodal? Why is it hard? What is next?
Heterogeneous, connected, and interacting modalities motivate the six challenges: Representation, Alignment, Reasoning, Generation, Transference, Quantification.
Future Direction: Heterogeneity
Homogeneity vs. heterogeneity
Challenges: arbitrary tokenization, beyond differentiable interactions
Future Direction: High-modality
MultiBench: https://github.com/pliang279/MultiBench
Language, vision, audio, graphs, control, LIDAR, sensors, sets, tables, financial, medical
Future Direction: Long-term
From short-term (seconds or minutes) to long-term interactions
Challenges: compositionality, memory, personalization
Future Direction: Interaction
Social-IQ: https://www.thesocialiq.com/
Social intelligence: perception, reasoning, generation, multimodal interaction
Challenges: multi-party, causality, ethics
Future Direction: Real-world
MultiViz: https://github.com/pliang279/MultiViz
Challenges: robustness, fairness, generalization, interpretation
What is Multimodal? Heterogeneous, connected, and interacting modalities.
Why is it hard? Representation, Alignment, Reasoning, Generation, Transference, Quantification.
What is next? High-modality, heterogeneity, long-term, interaction, real-world.
https://cmu-multicomp-lab.github.io/mmml-course/fall2022/
Liang, Zadeh, and Morency. Foundations & Trends in Multimodal Machine Learning. 2022