
Tutorial on

Multimodal Machine Learning

Louis-Philippe Morency and Paul Liang


Multimodal AI Technologies

Robots · Personal vehicles · Ubiquitous computing · Video conferencing · Mobile · Online · Wearable

3
What is
Multimodal?
What is a Modality?

Multimodal Behaviors and Signals

Language
§ Lexicon: words
§ Syntax: part-of-speech, dependencies
§ Pragmatics: discourse acts

Acoustic
§ Prosody: intonation, voice quality
§ Vocal expressions: laughter, moans

Visual
§ Gestures: head, eye, and arm gestures
§ Body language: body posture, proxemics
§ Eye contact: head gaze, eye gaze
§ Facial expressions: FACS action units, smile, frowning

Touch
§ Haptics
§ Motion

Physiological
§ Skin conductance
§ Electrocardiogram

Mobile
§ GPS location
§ Accelerometer
§ Light sensors

5
What is a Modality?

Definition

Modality refers to the way in which something is expressed or perceived.

Raw modalities (closest to the sensor)  →  Abstract modalities (farthest from the sensor)

Examples:
Speech signal → Language → Sentiment intensity
Image → Detected objects → Object categories

6
What is Multimodal?

A dictionary definition…

Multimodal: with multiple modalities

A research-oriented definition…

Multimodal is the scientific study of heterogeneous and interconnected data
(connected + interacting)

7
Heterogeneous Modalities

Heterogeneous: diverse qualities, structures, and representations.

Homogeneous modalities (with similar qualities) vs. heterogeneous modalities (with diverse qualities).

Examples along this spectrum: images from two cameras (homogeneous), text from two different languages, and language paired with vision (heterogeneous).

Abstract modalities are more likely to be homogeneous.

8
Dimensions of Heterogeneity (comparing Modality A and Modality B)

1 Element representations: discrete, continuous, granularity
2 Element distributions: density, frequency
3 Structure: temporal, spatial, latent, explicit
4 Information: abstraction, entropy H(A) vs. H(B)
5 Noise: uncertainty, noise, missing data
6 Relevance: task and context dependence (y_A vs. y_B)

9
Connected Modalities

Connected: shared information that relates modalities (the remaining information stays unique to each modality); connections can be stronger or weaker.

Connections range from statistical to semantic:

Statistical
§ Association (e.g., correlation, co-occurrence)
§ Dependency (e.g., causal, temporal)

Semantic
§ Correspondence (e.g., grounding)
§ Relationship (e.g., function: "a laptop is used for …")

10
Interacting Modalities

Interacting: a process that affects each modality and creates a new response when the modalities are brought together.

Interactions happen during inference!

"Inference" examples:
• Representation fusion → a fused representation
• Prediction task → a prediction ŷ
• Modality translation → a new modality C

11
Taxonomy of Interaction Responses – A Behavioral Science View

In multimodal communication, unimodal signals a and b each elicit a response; the combined signal a+b elicits a multimodal response.

Redundancy
§ Equivalence
§ Enhancement

Nonredundancy
§ Independence
§ Dominance
§ Modulation
§ Emergence

Partan and Marler (2005). Issues in the classification of multimodal communication signals. American Naturalist, 166(2).

12
Cross-modal Interaction Mechanics

Mapping interaction mechanics to the behavioral-science responses:

Redundancy (shared information)
§ Noninteracting (shared) → Equivalence
§ Additive → Enhancement

Nonredundancy (unique information)
§ Noninteracting (union) → Independence
§ Asymmetric → Dominance
§ Contextualized (transference) → Modulation
§ Non-additive (nonlinear) → Emergence

13
What is Multimodal? | Why is it hard? | What is next?

Heterogeneous · Connected · Interacting

Multimodal is the scientific study of heterogeneous and interconnected data.

14
Multimodal Machine Learning

Modality A, Modality B, Modality C → Multimodal ML → a prediction ŷ, a representation, or a new modality

Learning paradigms: self-supervised, reinforcement, supervised, …

What are the core multimodal technical challenges, understudied in conventional machine learning?

15
Multimodal Technical Challenges – Surveys, Tutorials and Courses

2016: Tadas Baltrusaitis, Chaitanya Ahuja and Louis-Philippe Morency
(arXiv 2017; IEEE TPAMI journal, February 2019)
https://arxiv.org/abs/1705.09406
Tutorials: CVPR 2016, ACL 2016, ICMI 2016, …

2022: Paul Liang, Amir Zadeh and Louis-Philippe Morency
☑ 6 core challenges  ☑ 50+ taxonomic classes  ☑ 700+ referenced papers
https://arxiv.org/abs/2209.03430
Tutorials: ICML 2023, CVPR 2022, NAACL 2022

Multimodal Machine Learning (11th edition)
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/

Advanced Topics in Multimodal Machine Learning
https://cmu-multicomp-lab.github.io/adv-mmml-course/spring2022/

Multimodal Machine Learning (12th edition)
https://cmu-multicomp-lab.github.io/mmml-course/fall2022/

16
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

17
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

This is a core building block for most multimodal modeling problems!

Individual elements (from Modality A and Modality B): this can be seen as a "local" representation, or as a representation using holistic features.

18
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

19
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Most modalities have internal structure with multiple elements

Elements with temporal structure: Other structured examples:

Modality A

Modality B
Spatial Hierarchical
20

20
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Sub-challenges:
Discrete Continuous Contextualized
Alignment Alignment Representation

Discrete elements Segmentation and Alignment + representation


and connections continuous warping
21

21
Challenge 3: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps,


exploiting multimodal alignment and problem structure

Modality A

or 𝑦!
Modality B

22

22
Challenge 3: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps,


exploiting multimodal alignment and problem structure

Modality A

words
or 𝑦!

words
Modality B

words
External
knowledge

23

23
Challenge 4: Generation

Definition: Learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence

Sub-challenges (by how much information/content is kept):
Summarization – information is reduced
Translation – information is maintained
Creation – information is expanded

24
Challenge 5: Transference

Definition: Transfer knowledge between modalities, usually to help the


target modality which may be noisy or with limited resources

Enriched Modality A

only available
during training
Transference A B

Modality A Modality B
25

25
Challenge 6: Quantification

Definition: Empirical and theoretical study to better understand heterogeneity, cross-modal interactions and the multimodal learning process

Sub-challenges:
Heterogeneity · Connections & Interactions · Learning (e.g., how training/validation loss evolves over epochs)

26
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

27
Sub-Challenge: Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities

Basic fusion (homogeneous fusion): fuse already-encoded features from Modality A and Modality B.
Raw-modality fusion (heterogeneous fusion): fuse the raw modalities directly.

28
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + ε

      Estimate   95% CI
w₀    4.63       [4.20, 5.06]
w₁    1.20       [0.83, 1.57]

29
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + w₂x₂ + ε

      Estimate   95% CI
w₀    5.29       [4.86, 5.73]
w₁    1.19       [0.85, 1.53]     Positive effect
w₂    −1.69      [−2.14, −1.24]   Negative effect

30
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + w₂x₂ + w₃(x₁ × x₂) + ε

      Estimate   95% CI
w₀    5.79       [5.29, 6.29]
w₁    0.68       [0.25, 1.11]
w₂    −2.94      [−3.73, −2.15]
w₃    1.29       [0.61, 1.97]     Multiplicative interaction!

31
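A minimal sketch of fitting the interaction model above with ordinary least squares; the data here is synthetic (the 300-review dataset is not available), so the coefficients and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(0, 1, n)             # percentage of smiling
x2 = rng.integers(0, 2, n)            # professional status (0=non-critic, 1=critic)
# Hypothetical ground-truth coefficients, roughly in the range of the slide.
y = 5.8 + 0.7 * x1 - 2.9 * x2 + 1.3 * x1 * x2 + rng.normal(0, 0.5, n)

# Design matrix with an explicit multiplicative (interaction) column x1*x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["w0", "w1", "w2", "w3"], np.round(w, 2))))
```

If w₃ is reliably nonzero, the effect of smiling depends on professional status, i.e., the two features interact multiplicatively rather than purely additively.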
Basic Fusion – Additive Interactions

Additive fusion on features:
z = w₁ x_A + w₂ x_B
A 1-layer neural network can be seen as additive.

With unimodal encoders:
z = f_A(x_A) + f_B(x_B)
It can be seen as an ensemble approach (late fusion).

32
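A minimal sketch of additive fusion with unimodal encoders (the late-fusion view); encoder choices and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.f_a = nn.Linear(dim_a, dim_z)   # unimodal encoder for modality A
        self.f_b = nn.Linear(dim_b, dim_z)   # unimodal encoder for modality B

    def forward(self, x_a, x_b):
        # z = f_A(x_A) + f_B(x_B): additive (ensemble-like) fusion
        return self.f_a(x_a) + self.f_b(x_b)

z = AdditiveFusion(300, 128, 64)(torch.randn(8, 300), torch.randn(8, 128))
print(z.shape)  # torch.Size([8, 64])
```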
Multiplicative Interactions

Simple multiplicative fusion:
z = w (x_A × x_B)

Bilinear fusion (a weight tensor lets every pair of features interact):
Z = x_Aᵀ W x_B

[Jayakumar et al., Multiplicative Interactions and Where to Find Them. ICLR 2020]

33
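A minimal sketch of both forms, assuming illustrative feature dimensions; `torch.nn.Bilinear` provides the bilinear map z_k = x_Aᵀ W_k x_B + b_k directly.

```python
import torch
import torch.nn as nn

x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)

# Simple multiplicative fusion: element-wise product, then a linear projection.
w = nn.Linear(64, 32)
z_mult = w(x_a * x_b)

# Bilinear fusion: each output unit has its own weight matrix over (x_A, x_B).
bilinear = nn.Bilinear(64, 64, 32)
z_bilinear = bilinear(x_a, x_b)

print(z_mult.shape, z_bilinear.shape)  # torch.Size([8, 32]) torch.Size([8, 32])
```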
Tensor Fusion

Tensor fusion (bimodal):
Z = w ([x_A ; 1] ⊗ [x_B ; 1])
Appending a constant 1 to each modality preserves the unimodal (additive) terms, while the outer product creates the bimodal (multiplicative) terms.

Tensor fusion (trimodal):
Z = w ([x_A ; 1] ⊗ [x_B ; 1] ⊗ [x_C ; 1])
This contains unimodal (additive), bimodal (multiplicative) and trimodal (multiplicative) terms … but the weight matrix may end up quite large!

[Zadeh et al., Tensor Fusion Network for Multimodal Sentiment Analysis. EMNLP 2017]
[Hou et al., Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling. NeurIPS 2019]

34
Low-rank Tensor Fusion

Idea: avoid materializing the full fusion tensor Z and its weight tensor W.

1. Full tensor fusion computes h = W · Z, where Z = [x_A ; 1] ⊗ [x_B ; 1].
2. Decompose the weight tensor into a sum of rank-1 factors: W = Σᵢ w_A⁽ⁱ⁾ ⊗ w_B⁽ⁱ⁾.
3. Rearranging, the output can be computed from per-modality projections only:
   h = (Σᵢ w_A⁽ⁱ⁾ · [x_A ; 1]) ∘ (Σᵢ w_B⁽ⁱ⁾ · [x_B ; 1])
so the full tensor is never formed.

[Liu et al., Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. ACL 2018]

37
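A minimal LMF-style sketch under the rearrangement above: each modality is projected with r low-rank factors, summed over the rank dimension, and fused with an element-wise product. Rank and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_out, rank=4):
        super().__init__()
        # One factor per rank and modality (the +1 mimics the appended constant 1).
        self.wa = nn.Parameter(torch.randn(rank, dim_a + 1, dim_out) * 0.1)
        self.wb = nn.Parameter(torch.randn(rank, dim_b + 1, dim_out) * 0.1)

    def forward(self, x_a, x_b):
        ones = torch.ones(x_a.size(0), 1)
        za = torch.cat([x_a, ones], dim=1)
        zb = torch.cat([x_b, ones], dim=1)
        # Project each modality and sum over the rank dimension ...
        pa = torch.einsum("bi,rio->bo", za, self.wa)
        pb = torch.einsum("bj,rjo->bo", zb, self.wb)
        # ... then fuse with an element-wise product; no full tensor is built.
        return pa * pb

h = LowRankFusion(32, 16, 10)(torch.randn(8, 32), torch.randn(8, 16))
print(h.shape)  # torch.Size([8, 10])
```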
Gated Fusion

Example with additive fusion:
z = g_A(x_A, x_B) · x_A + g_B(x_A, x_B) · x_B

The gates g_A and g_B can be seen as attention functions.
The gating output can also be a single weight for the whole modality.

[Arevalo et al., Gated Multimodal Units for Information Fusion, ICLR Workshop 2017]

38
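A minimal sketch of gated fusion; both modalities are first projected to a shared dimension, and the gates look at both inputs. Sizes and the per-feature (rather than per-modality) gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_z)
        self.proj_b = nn.Linear(dim_b, dim_z)
        # Gates conditioned on both modalities; sigmoid gives one weight per feature.
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_b, 2 * dim_z), nn.Sigmoid())

    def forward(self, x_a, x_b):
        g_a, g_b = self.gate(torch.cat([x_a, x_b], dim=1)).chunk(2, dim=1)
        # z = g_A(x_A, x_B) * proj(x_A) + g_B(x_A, x_B) * proj(x_B)
        return g_a * self.proj_a(x_a) + g_b * self.proj_b(x_b)

z = GatedFusion(32, 16, 24)(torch.randn(8, 32), torch.randn(8, 16))
print(z.shape)  # torch.Size([8, 24])
```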
Modality-Shifting Fusion

A primary modality x_A provides the base representation z; the secondary modalities x_B, x_C gate a shift that is applied to it.

Example with the language modality:
Primary modality: language
Secondary modalities: acoustic and visual
The same word (e.g., "expectations") can receive a negative-shifted or positive-shifted representation depending on the nonverbal context.

[Wang et al., Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors, AAAI 2019]
[Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020]

39
Mixture of Fusions

Multiple fusion strategies (e.g., unimodal, additive, bilinear) are applied in parallel, and a gate selects or weights among them. Gating can use soft or hard attention.

[Zadeh et al., Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, ACL 2018]
[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records, AAAI 2021]

40
Nonlinear Fusion

Nonlinear fusion:
ŷ = f(x_A, x_B) ∈ ℝᵈ, for any nonlinear model f

This can be seen as early fusion (fusion + prediction in one model):
ŷ = f([x_A, x_B])

… but will our neural network learn the nonlinear interactions?

41
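A minimal sketch of early (potentially nonlinear) fusion: concatenate the raw features and let a small MLP model the interactions; sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

early_fusion = nn.Sequential(
    nn.Linear(32 + 16, 64),  # operates on the concatenated modalities [x_A, x_B]
    nn.ReLU(),
    nn.Linear(64, 1),        # prediction head producing y_hat
)
x_a, x_b = torch.randn(8, 32), torch.randn(8, 16)
y_hat = early_fusion(torch.cat([x_a, x_b], dim=1))
print(y_hat.shape)  # torch.Size([8, 1])
```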
Measuring Non-Additive Interactions

Nonlinear fusion: ŷ = f(x_A, x_B)
Additive fusion: ŷ′ = f_A(x_A) + f_B(x_B)

Can we project the nonlinear model onto its best additive approximation?

Projection from nonlinear to additive (using EMAP):
f̂(x_A, x_B) = 𝔼_{x_B}[f(x_A, x_B)] + 𝔼_{x_A}[f(x_A, x_B)] − 𝔼_{x_A, x_B}[f(x_A, x_B)]
             ≈ f_A(x_A) + f_B(x_B)   (additive approximation)

[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]

42
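A minimal sketch of estimating the EMAP-style additive projection of a trained black-box predictor by averaging over an evaluation set; the toy model `f` and data are assumptions, and the quadratic loop is fine only for small evaluation sets.

```python
import numpy as np

def emap_projection(f, A, B):
    """A, B: paired modality features (one row per example); f(a, b) -> scalar."""
    n = len(A)
    grid = np.array([[f(A[i], B[j]) for j in range(n)] for i in range(n)])
    mean_over_b = grid.mean(axis=1)    # E_{x_B}[f(x_A = a_i, x_B)]
    mean_over_a = grid.mean(axis=0)    # E_{x_A}[f(x_A, x_B = b_i)]
    grand_mean = grid.mean()           # E_{x_A, x_B}[f]
    # Additive approximation evaluated on the original pairs (i, i).
    return mean_over_b + mean_over_a - grand_mean

f = lambda a, b: float(a.sum() * b.sum())      # toy nonlinear "model"
A, B = np.random.randn(20, 4), np.random.randn(20, 3)
print(emap_projection(f, A, B)[:3])
```

Comparing task metrics of f against its additive projection indicates how much the model actually relies on non-additive cross-modal interactions.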
Measuring Non-Additive Interactions

Nonlinear fusion: ŷ = f(x_A, x_B)
EMAP projection → additive fusion: ŷ′ = f̂_A(x_A) + f̂_B(x_B) + μ

Empirically, across several multimodal benchmarks the best nonlinear/polynomial models and their additive EMAP projections score very similarly: the differences are small! Additive fusion is always a good baseline.

[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]

43
Learning Non-additive Bimodal and Trimodal Interactions

Idea: prioritize simpler interactions – unimodal (additive) first, then bimodal (non-additive), then trimodal (non-additive) – by fitting each stage to the residual of the previous ones.

Multimodal residual optimization:
ℒ(y, ŷ_uni) + ℒ(y − ŷ_uni, ŷ_bi) + ℒ(y − ŷ_uni − ŷ_bi, ŷ_tri)

where ŷ_uni is predicted from x_A, x_B, x_C individually, ŷ_bi from the pairs (x_A, x_B), (x_A, x_C), (x_B, x_C), and ŷ_tri from (x_A, x_B, x_C).

[Wortwein et al., Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions, Findings-EMNLP 2022]

44
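A minimal sketch of the residual objective above with a squared-error loss; the predictions are placeholders, and stopping gradients on the residual targets is an assumption made here to keep each head focused on what the simpler heads missed.

```python
import torch
import torch.nn.functional as F

def multimodal_residual_loss(y, y_uni, y_bi, y_tri):
    loss_uni = F.mse_loss(y_uni, y)
    loss_bi = F.mse_loss(y_bi, (y - y_uni).detach())           # residual of unimodal
    loss_tri = F.mse_loss(y_tri, (y - y_uni - y_bi).detach())  # residual of bimodal
    return loss_uni + loss_bi + loss_tri

y = torch.randn(8, 1)
y_uni, y_bi, y_tri = torch.randn(8, 1), torch.randn(8, 1), torch.randn(8, 1)
print(multimodal_residual_loss(y, y_uni, y_bi, y_tri))
```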
Fusion with Heterogeneous Modalities

Goal: fusion with raw modalities. Can the same fusion algorithm handle raw heterogeneous modalities?

Example: a fusion transformer applies self-attention over a single sequence mixing language tokens (e.g., "I do not [like] it", "I [enjoy] my time here", with some tokens masked) and vision inputs, together with cls/sep/mask tokens.

45
Image Representation Learning: Masked Auto-Encoder (MAE)

Mask a random subset (~70%) of image patches; a Visual Transformer (ViT) encodes the visible patches, and a decoder transformer (only used during pre-training) reconstructs the image, with a reconstruction loss over the whole image.

[He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022]

46
Multimodal Masked Autoencoder

[Geng et al., Multimodal Masked Autoencoders Learn Transferable Representations, 2022]

47
Dynamic Early Fusion

Modality A
Heterogeneous Fusion

Modality B

Idea: Deciding when to fuse in early fusion

Unimodal Unimodal Unimodal


Visual
Fuse Fuse Fuse
Acoustic
Unimodal Unimodal Unimodal
[Xue and Marculescu, Dynamic Multimodal Fusion, arxiv 2022] 48

48
Dynamic Early Fusion

Fusion fully learned from optimization and data:
1. Define basic representation building blocks (ReLU, layer norm, conv, self-attention)
2. Define basic fusion building blocks (concat fuse, attention fuse, add fuse)
3. Automatically search for the composition using neural architecture search

[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]
[Liu et al., DARTS: Differentiable Architecture Search. ICLR 2019]

49
Heterogeneity-aware Fusion
Information transfer, transfer learning perspective

1a. Estimate modality heterogeneity via transfer
1b. Estimate interaction heterogeneity via transfer
2a. Compute a modality heterogeneity matrix
2b. Compute an interaction heterogeneity matrix
3. Determine parameter clustering (implicitly captures heterogeneity)

[Zamir et al., Taskonomy: Disentangling Task Transfer Learning. CVPR 2018]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

50
Heterogeneity-aware Fusion
Information transfer, transfer learning perspective

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

51
Improving Optimization

Kinetics dataset Adding more modalities should always help?


Modalities: RGB (video clips)
A (Audio features)
OF (optical flow - motion)

But sometimes multimodal doesn’t help! Why?

[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

52
Improving Optimization
Relevance heterogeneity
2 explanations for drop in performance:
1. Multimodal networks are more prone to overfitting due to
increased complexity
2. Different modalities overfit and generalize at different rates

Key idea 1: compute the overfitting-to-generalization ratio (OGR) – the gap between training and validation loss relative to the improvement in validation loss over the training epochs.

The OGR with respect to each modality tells us how much to train that modality.

[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

53
Improving Optimization
Relevance heterogeneity

Conventional approach Proposed approach

Prediction !
𝒚
𝒙) 𝒙) 𝒙)
Fusion +
prediction
!
𝒚 Fusion +
prediction
!
𝒚
Prediction !
𝒚
𝒙* 𝒙* 𝒙*

Key idea 2: Simultaneously train unimodal


networks to estimate OGR wrt each modality
Reweight multimodal loss
using unimodal OGR values
Allows to better balance generalization &
overfitting rate of different modalities
[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

54
Heterogeneity in Noise: Studying Robustness
Strong tradeoffs between performance and robustness
Noise within Modality
noise → nosie

[Belinkov & Bisk, 2018; Subramaniam et al.,


2009; Boyat & Joshi, 2015]

Missing Modalities
Today was great!

[Zadeh et al., 2020]


rate of accuracy drops
[Liang et al., MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. NeurIPS 2021]

55
Heterogeneity in Noise: Studying Robustness
Several approaches towards more robust models

Robust data + training Infer missing modalities

Modality A Modality A
𝒙)
𝒙) 𝒙,𝑨
Fusion +
prediction
!
𝒚 Fusion +
!
𝒚
prediction
Modality B
𝒙* 𝒙,𝑩
Modality B
𝒙*

Translation model
Joint probabilistic model

[Ngiam et al., Multimodal Deep Learning. ICML 2011]


[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines. JMLR 2014]
[Tran et al., Missing Modalities Imputation via Cascaded Residual Autoencoder. CVPR 2017]
[Pham et al., Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities. AAAI 2019

56
Sub-Challenge 1a: Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities

From homogeneous to heterogeneous modalities:
Late fusion
Additive fusion
Multiplicative fusion
Tensor fusion
Polynomial fusion
Gated fusion
Modality-shift fusion
Nonlinear fusion
Very early fusion
Dynamic early fusion
Heterogeneity-aware fusion
Improving optimization
Improving robustness

57
What is Multimodal?

Heterogeneous · Connected · Interacting

Multimodal is the scientific study of heterogeneous and interconnected data.

58
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

59
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

60
Sub-Challenge: Representation Coordination

Definition: Learn multimodally-contextualized representations that are coordinated through their cross-modal interactions

1. Specialized encoders capture heterogeneity: z_A = f_A(x_A), z_B = f_B(x_B)
2. A coordination function g(z_A, z_B) captures the interconnections

Learning with a coordination function:
ℒ = g(f_A, f_B), with model parameters θ_g, θ_{f_A} and θ_{f_B}

61
Sub-Challenge: Representation Coordination

Coordination function, example 1 – cosine similarity:
g(z_A, z_B) = (z_Aᵀ z_B) / (‖z_A‖ ‖z_B‖)

62
Sub-Challenge: Representation Coordination

Coordination function, example 2 – kernel similarity functions:
g(z_A, z_B) = k(z_A, z_B), with k linear, polynomial, exponential, or RBF

63
Sub-Challenge: Representation Coordination

Coordination function, example 3 – Canonical Correlation Analysis:
argmax over U, V, θ_{f_A}, θ_{f_B} of corr(z_A, z_B), where the views z_A and z_B are projections (U, V) of the encoder outputs f_A(x_A) and f_B(x_B)

64
Coordination with Contrastive Learning

Contrastive loss: brings positive pairs closer and pushes negative pairs apart.

Paired data (e.g., images and text descriptions) give the positive pairs; mismatched pairings serve as negative pairs.

Simple (margin-based) contrastive loss:
max(0, α − sim(z_A, z_B⁺) + sim(z_A, z_B⁻))

Similarity functions are often cosine similarity.

65
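A minimal sketch of this margin-based contrastive loss for coordination; the representations are placeholders and the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def margin_contrastive(z_a, z_b_pos, z_b_neg, alpha=0.2):
    # Pull the paired (positive) representations together, push the mismatched
    # (negative) pair apart by at least the margin alpha.
    sim = lambda u, v: F.cosine_similarity(u, v, dim=-1)
    return torch.clamp(alpha - sim(z_a, z_b_pos) + sim(z_a, z_b_neg), min=0).mean()

z_a, z_pos, z_neg = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
print(margin_contrastive(z_a, z_pos, z_neg))
```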
Example – Visual-Semantic Embeddings

A language encoder f_L and a visual encoder f_V produce z_L and z_V. Two contrastive loss terms, one per retrieval direction:

ℒ = max(0, α − sim(z_L, z_V⁺) + sim(z_L, z_V⁻))
  + max(0, α − sim(z_V, z_L⁺) + sim(z_V, z_L⁻))

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. NeurIPS 2014]

66
Example – CLIP (Contrastive Language–Image Pre-training)

Popular contrastive loss: InfoNCE
ℒ = −(1/N) Σᵢ log [ exp(sim(z_Lⁱ, z_Vⁱ)) / Σⱼ exp(sim(z_Lⁱ, z_Vʲ)) ]
where (z_Lⁱ, z_Vⁱ) are positive pairs and (z_Lⁱ, z_Vʲ), j ≠ i, are negative pairs within the batch.

Positive and negative pairs come from contrastive pre-training on paired image–text data; the similarity function can be cosine similarity.

CLIP encoders (f_L and f_V) are great for language–vision tasks.
z_L and z_V are coordinated but not identical representation spaces.

[Radford et al., Learning Transferable Visual Models From Natural Language Supervision. ICML 2021]

67
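A minimal sketch of a CLIP-style symmetric InfoNCE objective: within a batch, the matched image–text pair is the positive and all other pairings are negatives. The encoder outputs are placeholders and the temperature is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_infonce(z_lang, z_vis, temperature=0.07):
    z_lang = F.normalize(z_lang, dim=-1)
    z_vis = F.normalize(z_vis, dim=-1)
    logits = z_lang @ z_vis.t() / temperature      # [N, N] cosine similarities
    targets = torch.arange(z_lang.size(0))         # positives lie on the diagonal
    loss_l2v = F.cross_entropy(logits, targets)    # language -> vision direction
    loss_v2l = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_l2v + loss_v2l)

print(clip_infonce(torch.randn(16, 128), torch.randn(16, 128)))
```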
Contrastive Learning and Connected Modalities

The InfoNCE loss (as in CLIP) focuses on the shared connections between modalities: it is a lower bound on the mutual information
I(X; Y) = 𝔼_{x,y}[ log P_{XY}(x, y) / (P_X(x) P_Y(y)) ]
so the information unique to each modality is not directly modeled.

[Oord et al., Representation Learning with Contrastive Predictive Coding. 2018]

68
Open
Multiview Redundancy and Contrastive Learning challenges

How much information should be shared?

Multi-view redundancy:

Not enough signal

Just right

Too much noise

Multi-view redundancy
may not hold for
multimodal problems!

[Tian et al., What makes for Good Views for Contrastive Learning? NeurIPS 2020]
[Tosh et al., Contrastive Learning, Multi-view Redundancy, and Linear models. ALT 2021]
69

69
Sub-Challenge 1c: Representation Fission

Definition: Learning a new set of representations


that reflects individual multimodal interactions and
data clustering.

Unique to modality 1
and task Y

Redundancy: Shared by Synergy: Emerging information


both modalities and task from multimodal interaction

Unique to modality 2
and task Y
70

70
Partial Information Decomposition
Classical Information Theory Partial Information Decomposition

Can be negative!

No synergy! Explains negative!

Task-relevant
multimodal info
[Williams and Beer. Non-negative Decomposition of Mutual Information. 2010] 71

71
Partial Information Decomposition

One type of information decomposition


Unimodal marginal-matching distributions:

Task-relevant Task-relevant multimodal


multimodal info info without synergy:

[Bertschinger et al., Quantifying Unique Information, Entropy 2014] 72

72
Partial Information Decomposition

One type of information decomposition


Unimodal marginal-matching distributions:

+ consistency equations relating interactions with information theory:


Only need unimodal marginals to infer redundancy and uniqueness:

[Bertschinger et al., Quantifying Unique Information, Entropy 2014] 73

73
Open
Quantifying Interactions challenges

These interactions can be efficiently estimated – gives a path towards understanding interactions

Is there a
red shape
above a
circle?

Sentiment Sarcasm VQA

Language/Agreement Multimodal Transformer Multiplicative/Transformer

Also matches human judgment of interactions, and other sanity checks on synthetic datasets
Can also be used to choose most appropriate models – can they be used to better train/design new models?
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition
74
Framework. arXiv 2023]

74
Open
On Agreement, Disagreement, and Synergy challenges

A (mini) taxonomy of interactions Agreement redundancy


Co-training
Agreement Multi-view learning

Agreement synergy
𝑓!: 𝑦! Future work?

𝑓": 𝑦"
Disagreement uniqueness
MRMR/feature selection
Disagreement

Disagreement synergy
Future work?

[Blum and Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998
[Peng et al., Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. TPAMI 2005]
[Liang et al., Multimodal Learning Without Labeled Multimodal Data: Guarantees and 75
Applications, arxiv 2023]

75
Factorized Learning of Shared + Unique Information

Modeling unique information (in addition to the shared information between modalities):
1. Maximize the mutual information I(z; x_A) and I(z; x_B)
2. Minimize the conditional entropy H(z | x_A) and H(z | x_B)

[Tsai et al., Learning Factorized Multimodal Representations. ICLR 2019]
[Wang et al., Rethinking Minimal Sufficient Representation in Contrastive Learning. CVPR 2022]

76
Learning Task-relevant Unique Information (open challenges)

Modeling task-relevant unique information:
1. Maximize the task-relevant shared information between the two modality representations
2. Maximize the task-relevant information unique to modality A (conditioned on modality B)
3. Maximize the task-relevant information unique to modality B (conditioned on modality A)

Approximate task relevance Y using multi-view data augmentations; new scalable lower and upper bounds on mutual information.

[Liang et al., Factorized Contrastive Learning: Going Beyond Multi-view Redundancy, arXiv 2023]

77
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

78
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Sub-challenges:
Discrete Continuous Contextualized
Alignment Alignment Representation

Discrete elements Segmentation and Alignment + representation


and connections continuous warping
79

79
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Communicative cues Aligned representations Sarcasm


Oh I definitely enjoyed that..

Alignment
(attentions)

Rolls eyes

(sigh)

time
80
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Joint undirected Cross-modal Alignment with Structured


alignment directed alignment unimodal models alignment

81
Joint Undirected Contextualized Representations

Early fusion in sequence dimension

Transformer self-attention

Text Image Audio


+ modality embedding + modality embedding + modality embedding
+ position embedding + position embedding + position embedding

[Li et al., VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019]

82
Contextualized Representations Pre-training

Various pre-training objectives: masked language/image/audio modeling

Transformer self-attention

Text Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

83
Contextualized Representations Pre-training

Various pre-training objectives: global image-text alignment


Match? 0/1

Transformer self-attention

Text Positive/negative Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

84
Contextualized Representations Pre-training

Various pre-training objectives: local image region-token alignment


Optimal transport local alignment

Transformer self-attention

Text Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

85
Contextualized Representations Pre-training

And representation objectives before contextualization:

Transformer self-attention

Alignment

Text Image Audio


[Li et al., Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. NeurIPS 2021]

86
Directed Cross-Modal Alignment

Attention

New visually-contextualized
Language-vision similarities representation of language

Text Image
+ modality embedding + modality embedding
+ position embedding + position embedding

[Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019]
[Lu et al., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019]

87
Directed Cross-Modal Alignment

Cross-modal multi-head attention produces a cross-modal representation (and a summary representation) from the Text and Image inputs, each with modality and position embeddings.

[Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019]
[Lu et al., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019]

88
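A minimal sketch of directed cross-modal attention: language tokens act as queries and image regions as keys/values, yielding a visually-contextualized language representation. `nn.MultiheadAttention` handles the attention; shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

text = torch.randn(2, 20, dim)    # 20 language tokens (queries)
image = torch.randn(2, 49, dim)   # 49 image regions/patches (keys and values)

contextualized_text, attn_weights = cross_attn(query=text, key=image, value=image)
print(contextualized_text.shape, attn_weights.shape)  # [2, 20, 256] [2, 20, 49]
```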
High-Modality Multimodal Transformers
Transfer across partially observable modalities
Unified model + parameter sharing + multitask and transfer learning
Video Visual Sentiment, Atari Robot
Non-parallel multitask learning
classification QA emotions games manipulation

Task-specific classifiers
Same model
architecture!
HighMMT Shared multimodal model
Same
parameters!
Modality-specific embeddings

Standardized input sequence

Language Video Audio Audio Video Image Proprioception Action

[Reed et al., A Generalist Agent. TMLR 2022]


[Liang et al., HighMMT: Quantifying Modality and Task Heterogeneity for High-Modality Representation Learning. TMLR 2022]

89
Open
High-Modality Models challenges

Some implicit assumptions:


- All modalities can be represented as sequences without losing information.
- Dimensions of heterogeneity can be perfectly captured by modality-specific embeddings.
- Cross-modal connections & interactions are shared across modalities and tasks.

Video classification Sentiment, emotions Robot dynamics

Gato/ HighMMT Shared multimodal model?

Modality-specific embeddings?

Standardized input sequence?

Language Video Audio Audio Video Video Time-series

90
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Joint undirected Cross-modal Alignment with Structured


alignment directed alignment unimodal models alignment

91
Conditioning Pretrained Language Models
Conditioning via prefix tuning A small red boat on the water.

Adapted + pretrained p(x|c)

Adapter Pretrained p(x)

A small red boat on the water.

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

92
Conditioning Pretrained Language Models
Conditioning via prefix tuning
Blue

0-shot VQA:
Adapted + pretrained p(x|c)

Adapter Pretrained p(x)

What color is the car?

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

93
Conditioning Pretrained Language Models
Conditioning via prefix tuning Steve Jobs

1-shot outside
knowledge VQA: Adapted + pretrained p(x|c)

Adapter Adapter p(x)

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

94
Conditioning Pretrained Language Models
Conditioning via prefix tuning This is a dax.

Few-shot image
classification: Adapted + pretrained

Adapter Adapter Adapter

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

95
Conditioning Pretrained Language Models
Mini-GPT4

Stage 1: Alignment using


paired image-text data.

Stage 2: Instruction tuning


using image + text instructions
and example completions.

[Zhu et al., MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models. 2023]

96
Conditioning Pretrained Language Models
LLaMA-Adapter

Can be combined with ImageBind


– alignment of many modalities to
language (i.e., high-modality
coordination model)

[Girdhar et al., ImageBind: One Embedding Space To Bind Them All. CVPR 2023]
[Gao et al., LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arxiv 2023]

97
Structure in Contextualized Representations
Compositionality requires knowing the structure.

In the continuous case (i.e., if structure is given or can be learned easily in a differentiable manner):

Sequential Interactive Hierarchical Graphical

98

98
Interactive Structure
Structure defined through interactive environment
Main difference from temporal - actions taken at previous time steps affect future states

Integrates multimodality into the


reinforcement learning framework

Modality A …

Policy 𝒂
Modality B …

Local representation
+ Aligned representation
99

99
Interactive Structure
Structure defined through interactive environment
Main difference from temporal - actions taken at previous time steps affect future states

Language-conditional RL Language-assisted RL

Instruction following Reading instruction manuals

Big topic, see full multimodal RL lectures at: https://cmu-multicomp-lab.github.io/mmml-course/fall2022/

[Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019]

100
Open
Language Models for Planning challenges

Multi-step planning afforded by language models

Policy 𝒂𝟏

Modality A … Policy 𝒂𝟐

Modality B … Policy 𝒂𝟑

Task: get glass of milk


Policy 𝒂𝟒

[Huang et al., Language Models as Zero-shot Planners. ICML 2022] 101

101
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts

Attend [red]
Attend [red]

“red”

Is there a red shape


above a circle?

[Andreas et al., Neural Module Networks. CVPR 2016]

102
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts
Attend [red]

Attend [circle]

Is there a red shape


above a circle?
Attend [circle] “circle”

[Andreas et al., Neural Module Networks. CVPR 2016]

103
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts
Attend [red]

Re-attend [above]

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

104
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts

Combine [and]
Attend [red] Combine [and]

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

105
Hierarchical Structure
Leverage syntactic structure of language

Enables better interpretability, but requires parsing + engineering each specific module

Attend [red] Combine [and] Measure [is] YES

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

106
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval

Multimodal input
multimodal output
models?
More in Challenge 4:
Generation

[Koh et al., Grounding Language Models to Images for Multimodal Inputs and Outputs. ICML 2023]

107
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval
+ knowledge graphs

[Chen et al., Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. ACM Multimedia 2022]
[Zhang et al., Multimodal Analogical Reasoning over Knowledge Graphs. ICLR 2023]

108
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval
+ knowledge graphs
+ multi-step reasoning

[Zhang et al., Multimodal Chain-of-Thought Reasoning in Language Models. 2023]


[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. 2023]

109
Open
Hierarchical Structure + Multimodal Pretraining challenges

Socratic models All modalities as natural language?

[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. 2023]

110
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models Compositionality? Tokenization? Reasoning?


+ retrieval
+ knowledge graphs
+ multi-step reasoning
+ relational alignment

[Thrush et al., Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. CVPR 2023]
[Pandey et al., Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment. ACL 2023]

111
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

More in Challenge 3: Reasoning


Joint undirected Cross-modal Alignment with Structured
alignment directed alignment unimodal models alignment

112
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

113
What is
Why is it hard? What is next?
Multimodal?

Heterogeneous Representation

Alignment
Connected
Reasoning
Interacting
Generation

Transference
𝑧
Quantification
114

114
Future Direction: Heterogeneity

Homogeneity vs Heterogeneity

Challenges:
Arbitrary tokenization Beyond differentiable interactions

Causal, logical, brain-inspired interactions

Theoretical study of interactions

115
MultiBench
Future Direction: High-modality — MultiBench: https://github.com/pliang279/MultiBench

Few modalities High-modality

Language Vision Audio Graphs Control LIDAR Sensors Set Table Financial Medical

Challenges: Non-parallel learning Limited resources

116

116
Future Direction: Long-term

Short-term Long-term

seconds
or minutes

Challenges:
Compositionality Memory Personalization

117
Social-IQ
Future Direction: Interaction — Social-IQ: https://www.thesocialiq.com/

Social Intelligence
Reasoning

Perception Generation

Multimodal
Interaction

Challenges:
Multi-Party Causality Ethical
118

118
MultiViz
Future Direction: Real-world — MultiViz: https://github.com/pliang279/MultiViz

Healthcare Intelligent Interfaces and Online Learning


Decision Support Vehicles and Education

Challenges:
Robustness Fairness Generalization Interpretation

119
What is
Why is it hard? What is next?
Multimodal?

Representation High-modality
Heterogeneous
Alignment
Heterogeneity
Reasoning
Connected Long-term
Generation
Transference Interaction
Interacting
Quantification Real-world

https://cmu-multicomp-lab.github.io/mmml-course/fall2022/
Liang, Zadeh, and Morency. Foundations and Trends on Multimodal Machine Learning. 2022

120

120
