
Tutorial on

Multimodal Machine Learning

Louis-Philippe Morency and Paul Liang


Multimodal AI Technologies

Robots · Personal vehicles · Ubiquitous computing · Video conferencing · Mobile · Online · Wearable

3
What is
Multimodal?
What is a Modality?

Multimodal Behaviors and Signals

Language
§ Lexicon: words
§ Syntax: part-of-speech, dependencies
§ Pragmatics: discourse acts

Acoustic
§ Prosody: intonation, voice quality
§ Vocal expressions: laughter, moans

Visual
§ Gestures: head, eye, and arm gestures
§ Body language: body posture, proxemics
§ Eye contact: head gaze, eye gaze
§ Facial expressions: FACS action units, smile, frowning

Touch
§ Haptics
§ Motion

Physiological
§ Skin conductance
§ Electrocardiogram

Mobile
§ GPS location
§ Accelerometer
§ Light sensors

5
What is a Modality?

Definition

Modality refers to the way in which something is expressed or perceived.

Raw modalities (closest to the sensor)  →  Abstract modalities (farthest from the sensor)

Examples:
Speech signal → Language → Sentiment intensity
Image → Detected objects → Object categories

6
What is Multimodal?

A dictionary definition…

Multimodal: with multiple modalities

A research-oriented definition…

Multimodal is the scientific study of heterogeneous and interconnected data
(connected + interacting)

7
Heterogeneous Modalities

Heterogeneous: diverse qualities, structures, and representations.

Homogeneous modalities (with similar qualities) vs. heterogeneous modalities (with diverse qualities).

Examples along this spectrum: images from two cameras (homogeneous), text from two different languages, and language paired with vision (heterogeneous).

Abstract modalities are more likely to be homogeneous.

8
Dimensions of Heterogeneity (comparing Modality A and Modality B)

1 Element representations: discrete, continuous, granularity
2 Element distributions: density, frequency
3 Structure: temporal, spatial, latent, explicit
4 Information: abstraction, entropy H(A) vs. H(B)
5 Noise: uncertainty, noise, missing data
6 Relevance: task and context dependence (y_A vs. y_B)

9
Connected Modalities

Connected: shared information that relates modalities (the remaining information stays unique to each modality); connections can be stronger or weaker.

Connections range from statistical to semantic:

Statistical
§ Association (e.g., correlation, co-occurrence)
§ Dependency (e.g., causal, temporal)

Semantic
§ Correspondence (e.g., grounding)
§ Relationship (e.g., function: "a laptop is used for …")

10
Interacting Modalities

Interacting: a process that affects each modality and creates a new response when the modalities are brought together.

Interactions happen during inference!

"Inference" examples:
• Representation fusion → a fused representation
• Prediction task → a prediction ŷ
• Modality translation → a new modality C

11
Taxonomy of Interaction Responses – A Behavioral Science View

In multimodal communication, unimodal signals a and b each elicit a response; the combined signal a+b elicits a multimodal response.

Redundancy
§ Equivalence
§ Enhancement

Nonredundancy
§ Independence
§ Dominance
§ Modulation
§ Emergence

Partan and Marler (2005). Issues in the classification of multimodal communication signals. American Naturalist, 166(2).

12
Cross-modal Interaction Mechanics

Mapping interaction mechanics to the behavioral-science responses:

Redundancy (shared information)
§ Noninteracting (shared) → Equivalence
§ Additive → Enhancement

Nonredundancy (unique information)
§ Noninteracting (union) → Independence
§ Asymmetric → Dominance
§ Contextualized (transference) → Modulation
§ Non-additive (nonlinear) → Emergence

13
What is Multimodal? | Why is it hard? | What is next?

Heterogeneous · Connected · Interacting

Multimodal is the scientific study of heterogeneous and interconnected data.

14
Multimodal Machine Learning

Modality A, Modality B, Modality C → Multimodal ML → a prediction ŷ, a representation, or a new modality

Learning paradigms: self-supervised, reinforcement, supervised, …

What are the core multimodal technical challenges, understudied in conventional machine learning?

15
Multimodal Technical Challenges – Surveys, Tutorials and Courses

2016: Tadas Baltrusaitis, Chaitanya Ahuja and Louis-Philippe Morency
(arXiv 2017; IEEE TPAMI journal, February 2019)
https://arxiv.org/abs/1705.09406
Tutorials: CVPR 2016, ACL 2016, ICMI 2016, …

2022: Paul Liang, Amir Zadeh and Louis-Philippe Morency
☑ 6 core challenges  ☑ 50+ taxonomic classes  ☑ 700+ referenced papers
https://arxiv.org/abs/2209.03430
Tutorials: ICML 2023, CVPR 2022, NAACL 2022

Multimodal Machine Learning (11th edition)
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/

Advanced Topics in Multimodal Machine Learning
https://cmu-multicomp-lab.github.io/adv-mmml-course/spring2022/

Multimodal Machine Learning (12th edition)
https://cmu-multicomp-lab.github.io/mmml-course/fall2022/

16
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

17
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

This is a core building block for most multimodal modeling problems!

Individual elements (from Modality A and Modality B): this can be seen as a "local" representation, or as a representation using holistic features.

18
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

19
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Most modalities have internal structure with multiple elements

Elements with temporal structure: Other structured examples:

Modality A

Modality B
Spatial Hierarchical
20

20
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Sub-challenges:
Discrete Continuous Contextualized
Alignment Alignment Representation

Discrete elements Segmentation and Alignment + representation


and connections continuous warping
21

21
Challenge 3: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps,


exploiting multimodal alignment and problem structure

Modality A

or 𝑦!
Modality B

22

22
Challenge 3: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps,


exploiting multimodal alignment and problem structure

Modality A

words
or 𝑦!

words
Modality B

words
External
knowledge

23

23
Challenge 4: Generation

Definition: Learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence

Sub-challenges (by how much information/content is kept):
Summarization – information is reduced
Translation – information is maintained
Creation – information is expanded

24
Challenge 5: Transference

Definition: Transfer knowledge between modalities, usually to help the


target modality which may be noisy or with limited resources

Enriched Modality A

only available
during training
Transference A B

Modality A Modality B
25

25
Challenge 6: Quantification

Definition: Empirical and theoretical study to better understand heterogeneity, cross-modal interactions and the multimodal learning process

Sub-challenges:
Heterogeneity · Connections & Interactions · Learning (e.g., how training/validation loss evolves over epochs)

26
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

27
Sub-Challenge: Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities

Basic fusion (homogeneous fusion): fuse already-encoded features from Modality A and Modality B.
Raw-modality fusion (heterogeneous fusion): fuse the raw modalities directly.

28
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + ε

      Estimate   95% CI
w₀    4.63       [4.20, 5.06]
w₁    1.20       [0.83, 1.57]

29
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + w₂x₂ + ε

      Estimate   95% CI
w₀    5.29       [4.86, 5.73]
w₁    1.19       [0.85, 1.53]     Positive effect
w₂    −1.69      [−2.14, −1.24]   Negative effect

30
Feature Interactions: From Additive to Multiplicative

300 book reviews; y: audience score
x₁: percentage of smiling
x₂: professional status (0 = non-critic, 1 = critic)

H1: Does smiling reveal what the audience score was?
H2: Does the effect of smiling depend on professional status?

Linear regression:
y = w₀ + w₁x₁ + w₂x₂ + w₃(x₁ × x₂) + ε

      Estimate   95% CI
w₀    5.79       [5.29, 6.29]
w₁    0.68       [0.25, 1.11]
w₂    −2.94      [−3.73, −2.15]
w₃    1.29       [0.61, 1.97]     Multiplicative interaction!

31
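A minimal sketch of fitting the interaction model above with ordinary least squares; the data here is synthetic (the 300-review dataset is not available), so the coefficients and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(0, 1, n)             # percentage of smiling
x2 = rng.integers(0, 2, n)            # professional status (0=non-critic, 1=critic)
# Hypothetical ground-truth coefficients, roughly in the range of the slide.
y = 5.8 + 0.7 * x1 - 2.9 * x2 + 1.3 * x1 * x2 + rng.normal(0, 0.5, n)

# Design matrix with an explicit multiplicative (interaction) column x1*x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["w0", "w1", "w2", "w3"], np.round(w, 2))))
```

If w₃ is reliably nonzero, the effect of smiling depends on professional status, i.e., the two features interact multiplicatively rather than purely additively.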
Basic Fusion – Additive Interactions

Additive fusion on features:
z = w₁ x_A + w₂ x_B
A 1-layer neural network can be seen as additive.

With unimodal encoders:
z = f_A(x_A) + f_B(x_B)
It can be seen as an ensemble approach (late fusion).

32
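A minimal sketch of additive fusion with unimodal encoders (the late-fusion view); encoder choices and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.f_a = nn.Linear(dim_a, dim_z)   # unimodal encoder for modality A
        self.f_b = nn.Linear(dim_b, dim_z)   # unimodal encoder for modality B

    def forward(self, x_a, x_b):
        # z = f_A(x_A) + f_B(x_B): additive (ensemble-like) fusion
        return self.f_a(x_a) + self.f_b(x_b)

z = AdditiveFusion(300, 128, 64)(torch.randn(8, 300), torch.randn(8, 128))
print(z.shape)  # torch.Size([8, 64])
```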
Multiplicative Interactions

Simple multiplicative fusion:
z = w (x_A × x_B)

Bilinear fusion (a weight tensor lets every pair of features interact):
Z = x_Aᵀ W x_B

[Jayakumar et al., Multiplicative Interactions and Where to Find Them. ICLR 2020]

33
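A minimal sketch of both forms, assuming illustrative feature dimensions; `torch.nn.Bilinear` provides the bilinear map z_k = x_Aᵀ W_k x_B + b_k directly.

```python
import torch
import torch.nn as nn

x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)

# Simple multiplicative fusion: element-wise product, then a linear projection.
w = nn.Linear(64, 32)
z_mult = w(x_a * x_b)

# Bilinear fusion: each output unit has its own weight matrix over (x_A, x_B).
bilinear = nn.Bilinear(64, 64, 32)
z_bilinear = bilinear(x_a, x_b)

print(z_mult.shape, z_bilinear.shape)  # torch.Size([8, 32]) torch.Size([8, 32])
```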
Tensor Fusion

Tensor fusion (bimodal):
Z = w ([x_A ; 1] ⊗ [x_B ; 1])
Appending a constant 1 to each modality preserves the unimodal (additive) terms, while the outer product creates the bimodal (multiplicative) terms.

Tensor fusion (trimodal):
Z = w ([x_A ; 1] ⊗ [x_B ; 1] ⊗ [x_C ; 1])
This contains unimodal (additive), bimodal (multiplicative) and trimodal (multiplicative) terms … but the weight matrix may end up quite large!

[Zadeh et al., Tensor Fusion Network for Multimodal Sentiment Analysis. EMNLP 2017]
[Hou et al., Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling. NeurIPS 2019]

34
Low-rank Tensor Fusion

Idea: avoid materializing the full fusion tensor Z and its weight tensor W.

1. Full tensor fusion computes h = W · Z, where Z = [x_A ; 1] ⊗ [x_B ; 1].
2. Decompose the weight tensor into a sum of rank-1 factors: W = Σᵢ w_A⁽ⁱ⁾ ⊗ w_B⁽ⁱ⁾.
3. Rearranging, the output can be computed from per-modality projections only:
   h = (Σᵢ w_A⁽ⁱ⁾ · [x_A ; 1]) ∘ (Σᵢ w_B⁽ⁱ⁾ · [x_B ; 1])
so the full tensor is never formed.

[Liu et al., Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. ACL 2018]

37
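A minimal LMF-style sketch under the rearrangement above: each modality is projected with r low-rank factors, summed over the rank dimension, and fused with an element-wise product. Rank and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_out, rank=4):
        super().__init__()
        # One factor per rank and modality (the +1 mimics the appended constant 1).
        self.wa = nn.Parameter(torch.randn(rank, dim_a + 1, dim_out) * 0.1)
        self.wb = nn.Parameter(torch.randn(rank, dim_b + 1, dim_out) * 0.1)

    def forward(self, x_a, x_b):
        ones = torch.ones(x_a.size(0), 1)
        za = torch.cat([x_a, ones], dim=1)
        zb = torch.cat([x_b, ones], dim=1)
        # Project each modality and sum over the rank dimension ...
        pa = torch.einsum("bi,rio->bo", za, self.wa)
        pb = torch.einsum("bj,rjo->bo", zb, self.wb)
        # ... then fuse with an element-wise product; no full tensor is built.
        return pa * pb

h = LowRankFusion(32, 16, 10)(torch.randn(8, 32), torch.randn(8, 16))
print(h.shape)  # torch.Size([8, 10])
```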
Gated Fusion

Example with additive fusion:
z = g_A(x_A, x_B) · x_A + g_B(x_A, x_B) · x_B

The gates g_A and g_B can be seen as attention functions.
The gating output can also be a single weight for the whole modality.

[Arevalo et al., Gated Multimodal Units for Information Fusion, ICLR Workshop 2017]

38
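A minimal sketch of gated fusion; both modalities are first projected to a shared dimension, and the gates look at both inputs. Sizes and the per-feature (rather than per-modality) gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_z):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_z)
        self.proj_b = nn.Linear(dim_b, dim_z)
        # Gates conditioned on both modalities; sigmoid gives one weight per feature.
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_b, 2 * dim_z), nn.Sigmoid())

    def forward(self, x_a, x_b):
        g_a, g_b = self.gate(torch.cat([x_a, x_b], dim=1)).chunk(2, dim=1)
        # z = g_A(x_A, x_B) * proj(x_A) + g_B(x_A, x_B) * proj(x_B)
        return g_a * self.proj_a(x_a) + g_b * self.proj_b(x_b)

z = GatedFusion(32, 16, 24)(torch.randn(8, 32), torch.randn(8, 16))
print(z.shape)  # torch.Size([8, 24])
```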
Modality-Shifting Fusion

A primary modality x_A provides the base representation z; the secondary modalities x_B, x_C gate a shift that is applied to it.

Example with the language modality:
Primary modality: language
Secondary modalities: acoustic and visual
The same word (e.g., "expectations") can receive a negative-shifted or positive-shifted representation depending on the nonverbal context.

[Wang et al., Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors, AAAI 2019]
[Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020]

39
Mixture of Fusions

Multiple fusion strategies (e.g., unimodal, additive, bilinear) are applied in parallel, and a gate selects or weights among them. Gating can use soft or hard attention.

[Zadeh et al., Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, ACL 2018]
[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records, AAAI 2021]

40
Nonlinear Fusion

Nonlinear fusion:
ŷ = f(x_A, x_B) ∈ ℝᵈ, for any nonlinear model f

This can be seen as early fusion (fusion + prediction in one model):
ŷ = f([x_A, x_B])

… but will our neural network learn the nonlinear interactions?

41
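A minimal sketch of early (potentially nonlinear) fusion: concatenate the raw features and let a small MLP model the interactions; sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

early_fusion = nn.Sequential(
    nn.Linear(32 + 16, 64),  # operates on the concatenated modalities [x_A, x_B]
    nn.ReLU(),
    nn.Linear(64, 1),        # prediction head producing y_hat
)
x_a, x_b = torch.randn(8, 32), torch.randn(8, 16)
y_hat = early_fusion(torch.cat([x_a, x_b], dim=1))
print(y_hat.shape)  # torch.Size([8, 1])
```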
Measuring Non-Additive Interactions

Nonlinear fusion: ŷ = f(x_A, x_B)
Additive fusion: ŷ′ = f_A(x_A) + f_B(x_B)

Can we project the nonlinear model onto its best additive approximation?

Projection from nonlinear to additive (using EMAP):
f̂(x_A, x_B) = 𝔼_{x_B}[f(x_A, x_B)] + 𝔼_{x_A}[f(x_A, x_B)] − 𝔼_{x_A, x_B}[f(x_A, x_B)]
             ≈ f_A(x_A) + f_B(x_B)   (additive approximation)

[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]

42
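A minimal sketch of estimating the EMAP-style additive projection of a trained black-box predictor by averaging over an evaluation set; the toy model `f` and data are assumptions, and the quadratic loop is fine only for small evaluation sets.

```python
import numpy as np

def emap_projection(f, A, B):
    """A, B: paired modality features (one row per example); f(a, b) -> scalar."""
    n = len(A)
    grid = np.array([[f(A[i], B[j]) for j in range(n)] for i in range(n)])
    mean_over_b = grid.mean(axis=1)    # E_{x_B}[f(x_A = a_i, x_B)]
    mean_over_a = grid.mean(axis=0)    # E_{x_A}[f(x_A, x_B = b_i)]
    grand_mean = grid.mean()           # E_{x_A, x_B}[f]
    # Additive approximation evaluated on the original pairs (i, i).
    return mean_over_b + mean_over_a - grand_mean

f = lambda a, b: float(a.sum() * b.sum())      # toy nonlinear "model"
A, B = np.random.randn(20, 4), np.random.randn(20, 3)
print(emap_projection(f, A, B)[:3])
```

Comparing task metrics of f against its additive projection indicates how much the model actually relies on non-additive cross-modal interactions.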
Measuring Non-Additive Interactions

Nonlinear fusion: ŷ = f(x_A, x_B)
EMAP projection → additive fusion: ŷ′ = f̂_A(x_A) + f̂_B(x_B) + μ

Empirically, across several multimodal benchmarks the best nonlinear/polynomial models and their additive EMAP projections score very similarly: the differences are small! Additive fusion is always a good baseline.

[Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020]

43
Learning Non-additive Bimodal and Trimodal Interactions

Idea: prioritize simpler interactions – unimodal (additive) first, then bimodal (non-additive), then trimodal (non-additive) – by fitting each stage to the residual of the previous ones.

Multimodal residual optimization:
ℒ(y, ŷ_uni) + ℒ(y − ŷ_uni, ŷ_bi) + ℒ(y − ŷ_uni − ŷ_bi, ŷ_tri)

where ŷ_uni is predicted from x_A, x_B, x_C individually, ŷ_bi from the pairs (x_A, x_B), (x_A, x_C), (x_B, x_C), and ŷ_tri from (x_A, x_B, x_C).

[Wortwein et al., Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions, Findings-EMNLP 2022]

44
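A minimal sketch of the residual objective above with a squared-error loss; the predictions are placeholders, and stopping gradients on the residual targets is an assumption made here to keep each head focused on what the simpler heads missed.

```python
import torch
import torch.nn.functional as F

def multimodal_residual_loss(y, y_uni, y_bi, y_tri):
    loss_uni = F.mse_loss(y_uni, y)
    loss_bi = F.mse_loss(y_bi, (y - y_uni).detach())           # residual of unimodal
    loss_tri = F.mse_loss(y_tri, (y - y_uni - y_bi).detach())  # residual of bimodal
    return loss_uni + loss_bi + loss_tri

y = torch.randn(8, 1)
y_uni, y_bi, y_tri = torch.randn(8, 1), torch.randn(8, 1), torch.randn(8, 1)
print(multimodal_residual_loss(y, y_uni, y_bi, y_tri))
```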
Fusion with Heterogeneous Modalities

Goal: fusion with raw modalities. Can the same fusion algorithm handle raw heterogeneous modalities?

Example: a fusion transformer applies self-attention over a single sequence mixing language tokens (e.g., "I do not [like] it", "I [enjoy] my time here", with some tokens masked) and vision inputs, together with cls/sep/mask tokens.

45
Image Representation Learning: Masked Auto-Encoder (MAE)

Mask a random subset (~70%) of image patches; a Visual Transformer (ViT) encodes the visible patches, and a decoder transformer (only used during pre-training) reconstructs the image, with a reconstruction loss over the whole image.

[He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022]

46
Multimodal Masked Autoencoder

[Geng et al., Multimodal Masked Autoencoders Learn Transferable Representations, 2022]

47
Dynamic Early Fusion

Modality A
Heterogeneous Fusion

Modality B

Idea: Deciding when to fuse in early fusion

Unimodal Unimodal Unimodal


Visual
Fuse Fuse Fuse
Acoustic
Unimodal Unimodal Unimodal
[Xue and Marculescu, Dynamic Multimodal Fusion, arxiv 2022] 48

48
Dynamic Early Fusion

Fusion fully learned from optimization and data:
1. Define basic representation building blocks (ReLU, layer norm, conv, self-attention)
2. Define basic fusion building blocks (concat fuse, attention fuse, add fuse)
3. Automatically search for the composition using neural architecture search

[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]
[Liu et al., DARTS: Differentiable Architecture Search. ICLR 2019]

49
Heterogeneity-aware Fusion
Information transfer, transfer learning perspective

1a. Estimate modality heterogeneity via transfer
1b. Estimate interaction heterogeneity via transfer
2a. Compute a modality heterogeneity matrix
2b. Compute an interaction heterogeneity matrix
3. Determine parameter clustering (implicitly captures heterogeneity)

[Zamir et al., Taskonomy: Disentangling Task Transfer Learning. CVPR 2018]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

50
Heterogeneity-aware Fusion
Information transfer, transfer learning perspective

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

51
Improving Optimization

Kinetics dataset Adding more modalities should always help?


Modalities: RGB (video clips)
A (Audio features)
OF (optical flow - motion)

But sometimes multimodal doesn’t help! Why?

[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

52
Improving Optimization
Relevance heterogeneity
2 explanations for drop in performance:
1. Multimodal networks are more prone to overfitting due to
increased complexity
2. Different modalities overfit and generalize at different rates

Key idea 1: compute the overfitting-to-generalization ratio (OGR) – the gap between training and validation loss relative to the improvement in validation loss over the training epochs.

The OGR with respect to each modality tells us how much to train that modality.

[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

53
Improving Optimization
Relevance heterogeneity

Conventional approach Proposed approach

Prediction !
𝒚
𝒙) 𝒙) 𝒙)
Fusion +
prediction
!
𝒚 Fusion +
prediction
!
𝒚
Prediction !
𝒚
𝒙* 𝒙* 𝒙*

Key idea 2: Simultaneously train unimodal


networks to estimate OGR wrt each modality
Reweight multimodal loss
using unimodal OGR values
Allows to better balance generalization &
overfitting rate of different modalities
[Wang et al., What Makes Training Multi-modal Classification Networks Hard? CVPR 2020]
[Wu et al., Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks. ICML 2022]

54
Heterogeneity in Noise: Studying Robustness
Strong tradeoffs between performance and robustness
Noise within Modality
noise → nosie

[Belinkov & Bisk, 2018; Subramaniam et al.,


2009; Boyat & Joshi, 2015]

Missing Modalities
Today was great!

[Zadeh et al., 2020]


rate of accuracy drops
[Liang et al., MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. NeurIPS 2021]

55
Heterogeneity in Noise: Studying Robustness
Several approaches towards more robust models

Robust data + training Infer missing modalities

Modality A Modality A
𝒙)
𝒙) 𝒙,𝑨
Fusion +
prediction
!
𝒚 Fusion +
!
𝒚
prediction
Modality B
𝒙* 𝒙,𝑩
Modality B
𝒙*

Translation model
Joint probabilistic model

[Ngiam et al., Multimodal Deep Learning. ICML 2011]


[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines. JMLR 2014]
[Tran et al., Missing Modalities Imputation via Cascaded Residual Autoencoder. CVPR 2017]
[Pham et al., Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities. AAAI 2019

56
Sub-Challenge 1a: Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities

From homogeneous to heterogeneous modalities:
Late fusion
Additive fusion
Multiplicative fusion
Tensor fusion
Polynomial fusion
Gated fusion
Modality-shift fusion
Nonlinear fusion
Very early fusion
Dynamic early fusion
Heterogeneity-aware fusion
Improving optimization
Improving robustness

57
What is Multimodal?

Heterogeneous · Connected · Interacting

Multimodal is the scientific study of heterogeneous and interconnected data.

58
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

59
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

60
Sub-Challenge: Representation Coordination

Definition: Learn multimodally-contextualized representations that are coordinated through their cross-modal interactions

1. Specialized encoders capture heterogeneity: z_A = f_A(x_A), z_B = f_B(x_B)
2. A coordination function g(z_A, z_B) captures the interconnections

Learning with a coordination function:
ℒ = g(f_A, f_B), with model parameters θ_g, θ_{f_A} and θ_{f_B}

61
Sub-Challenge: Representation Coordination

Coordination function, example 1 – cosine similarity:
g(z_A, z_B) = (z_Aᵀ z_B) / (‖z_A‖ ‖z_B‖)

62
Sub-Challenge: Representation Coordination

Coordination function, example 2 – kernel similarity functions:
g(z_A, z_B) = k(z_A, z_B), with k linear, polynomial, exponential, or RBF

63
Sub-Challenge: Representation Coordination

Coordination function, example 3 – Canonical Correlation Analysis:
argmax over U, V, θ_{f_A}, θ_{f_B} of corr(z_A, z_B), where the views z_A and z_B are projections (U, V) of the encoder outputs f_A(x_A) and f_B(x_B)

64
Coordination with Contrastive Learning

Contrastive loss: brings positive pairs closer and pushes negative pairs apart.

Paired data (e.g., images and text descriptions) give the positive pairs; mismatched pairings serve as negative pairs.

Simple (margin-based) contrastive loss:
max(0, α − sim(z_A, z_B⁺) + sim(z_A, z_B⁻))

Similarity functions are often cosine similarity.

65
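A minimal sketch of this margin-based contrastive loss for coordination; the representations are placeholders and the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def margin_contrastive(z_a, z_b_pos, z_b_neg, alpha=0.2):
    # Pull the paired (positive) representations together, push the mismatched
    # (negative) pair apart by at least the margin alpha.
    sim = lambda u, v: F.cosine_similarity(u, v, dim=-1)
    return torch.clamp(alpha - sim(z_a, z_b_pos) + sim(z_a, z_b_neg), min=0).mean()

z_a, z_pos, z_neg = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
print(margin_contrastive(z_a, z_pos, z_neg))
```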
Example – Visual-Semantic Embeddings

A language encoder f_L and a visual encoder f_V produce z_L and z_V. Two contrastive loss terms, one per retrieval direction:

ℒ = max(0, α − sim(z_L, z_V⁺) + sim(z_L, z_V⁻))
  + max(0, α − sim(z_V, z_L⁺) + sim(z_V, z_L⁻))

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. NeurIPS 2014]

66
Example – CLIP (Contrastive Language–Image Pre-training)

Popular contrastive loss: InfoNCE
ℒ = −(1/N) Σᵢ log [ exp(sim(z_Lⁱ, z_Vⁱ)) / Σⱼ exp(sim(z_Lⁱ, z_Vʲ)) ]
where (z_Lⁱ, z_Vⁱ) are positive pairs and (z_Lⁱ, z_Vʲ), j ≠ i, are negative pairs within the batch.

Positive and negative pairs come from contrastive pre-training on paired image–text data; the similarity function can be cosine similarity.

CLIP encoders (f_L and f_V) are great for language–vision tasks.
z_L and z_V are coordinated but not identical representation spaces.

[Radford et al., Learning Transferable Visual Models From Natural Language Supervision. ICML 2021]

67
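A minimal sketch of a CLIP-style symmetric InfoNCE objective: within a batch, the matched image–text pair is the positive and all other pairings are negatives. The encoder outputs are placeholders and the temperature is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_infonce(z_lang, z_vis, temperature=0.07):
    z_lang = F.normalize(z_lang, dim=-1)
    z_vis = F.normalize(z_vis, dim=-1)
    logits = z_lang @ z_vis.t() / temperature      # [N, N] cosine similarities
    targets = torch.arange(z_lang.size(0))         # positives lie on the diagonal
    loss_l2v = F.cross_entropy(logits, targets)    # language -> vision direction
    loss_v2l = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_l2v + loss_v2l)

print(clip_infonce(torch.randn(16, 128), torch.randn(16, 128)))
```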
Contrastive Learning and Connected Modalities

The InfoNCE loss (as in CLIP) focuses on the shared connections between modalities: it is a lower bound on the mutual information
I(X; Y) = 𝔼_{x,y}[ log P_{XY}(x, y) / (P_X(x) P_Y(y)) ]
so the information unique to each modality is not directly modeled.

[Oord et al., Representation Learning with Contrastive Predictive Coding. 2018]

68
Open
Multiview Redundancy and Contrastive Learning challenges

How much information should be shared?

Multi-view redundancy:

Not enough signal

Just right

Too much noise

Multi-view redundancy
may not hold for
multimodal problems!

[Tian et al., What makes for Good Views for Contrastive Learning? NeurIPS 2020]
[Tosh et al., Contrastive Learning, Multi-view Redundancy, and Linear models. ALT 2021]
69

69
Sub-Challenge 1c: Representation Fission

Definition: Learning a new set of representations


that reflects individual multimodal interactions and
data clustering.

Unique to modality 1
and task Y

Redundancy: Shared by Synergy: Emerging information


both modalities and task from multimodal interaction

Unique to modality 2
and task Y
70

70
Partial Information Decomposition
Classical Information Theory Partial Information Decomposition

Can be negative!

No synergy! Explains negative!

Task-relevant
multimodal info
[Williams and Beer. Non-negative Decomposition of Mutual Information. 2010] 71

71
Partial Information Decomposition

One type of information decomposition


Unimodal marginal-matching distributions:

Task-relevant Task-relevant multimodal


multimodal info info without synergy:

[Bertschinger et al., Quantifying Unique Information, Entropy 2014] 72

72
Partial Information Decomposition

One type of information decomposition


Unimodal marginal-matching distributions:

+ consistency equations relating interactions with information theory:


Only need unimodal marginals to infer redundancy and uniqueness:

[Bertschinger et al., Quantifying Unique Information, Entropy 2014] 73

73
Open
Quantifying Interactions challenges

These interactions can be efficiently estimated – gives a path towards understanding interactions

Is there a
red shape
above a
circle?

Sentiment Sarcasm VQA

Language/Agreement Multimodal Transformer Multiplicative/Transformer

Also matches human judgment of interactions, and other sanity checks on synthetic datasets
Can also be used to choose most appropriate models – can they be used to better train/design new models?
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition
74
Framework. arXiv 2023]

74
Open
On Agreement, Disagreement, and Synergy challenges

A (mini) taxonomy of interactions Agreement redundancy


Co-training
Agreement Multi-view learning

Agreement synergy
𝑓!: 𝑦! Future work?

𝑓": 𝑦"
Disagreement uniqueness
MRMR/feature selection
Disagreement

Disagreement synergy
Future work?

[Blum and Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998
[Peng et al., Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. TPAMI 2005]
[Liang et al., Multimodal Learning Without Labeled Multimodal Data: Guarantees and 75
Applications, arxiv 2023]

75
Factorized Learning of Shared + Unique Information

Modeling unique information (in addition to the shared information between modalities):
1. Maximize the mutual information I(z; x_A) and I(z; x_B)
2. Minimize the conditional entropy H(z | x_A) and H(z | x_B)

[Tsai et al., Learning Factorized Multimodal Representations. ICLR 2019]
[Wang et al., Rethinking Minimal Sufficient Representation in Contrastive Learning. CVPR 2022]

76
Learning Task-relevant Unique Information (open challenges)

Modeling task-relevant unique information:
1. Maximize the task-relevant shared information between the two modality representations
2. Maximize the task-relevant information unique to modality A (conditioned on modality B)
3. Maximize the task-relevant information unique to modality B (conditioned on modality A)

Approximate task relevance Y using multi-view data augmentations; new scalable lower and upper bounds on mutual information.

[Liang et al., Factorized Contrastive Learning: Going Beyond Multi-view Redundancy, arXiv 2023]

77
Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions


between individual elements, across different modalities

Sub-challenges:
Fusion Coordination Fission

# modalities > # representations # modalities = # representations # modalities < # representations

78
Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all


elements of multiple modalities, building from the data structure

Sub-challenges:
Discrete Continuous Contextualized
Alignment Alignment Representation

Discrete elements Segmentation and Alignment + representation


and connections continuous warping
79

79
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Communicative cues Aligned representations Sarcasm


Oh I definitely enjoyed that..

Alignment
(attentions)

Rolls eyes

(sigh)

time
80
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Joint undirected Cross-modal Alignment with Structured


alignment directed alignment unimodal models alignment

81
Joint Undirected Contextualized Representations

Early fusion in sequence dimension

Transformer self-attention

Text Image Audio


+ modality embedding + modality embedding + modality embedding
+ position embedding + position embedding + position embedding

[Li et al., VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019]

82
Contextualized Representations Pre-training

Various pre-training objectives: masked language/image/audio modeling

Transformer self-attention

Text Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

83
Contextualized Representations Pre-training

Various pre-training objectives: global image-text alignment


Match? 0/1

Transformer self-attention

Text Positive/negative Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

84
Contextualized Representations Pre-training

Various pre-training objectives: local image region-token alignment


Optimal transport local alignment

Transformer self-attention

Text Image Audio


[Kim et al., ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021]

85
Contextualized Representations Pre-training

And representation objectives before contextualization:

Transformer self-attention

Alignment

Text Image Audio


[Li et al., Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. NeurIPS 2021]

86
Directed Cross-Modal Alignment

Attention

New visually-contextualized
Language-vision similarities representation of language

Text Image
+ modality embedding + modality embedding
+ position embedding + position embedding

[Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019]
[Lu et al., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019]

87
Directed Cross-Modal Alignment

Cross-modal multi-head attention produces a cross-modal representation (and a summary representation) from the Text and Image inputs, each with modality and position embeddings.

[Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019]
[Lu et al., ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019]

88
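A minimal sketch of directed cross-modal attention: language tokens act as queries and image regions as keys/values, yielding a visually-contextualized language representation. `nn.MultiheadAttention` handles the attention; shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

text = torch.randn(2, 20, dim)    # 20 language tokens (queries)
image = torch.randn(2, 49, dim)   # 49 image regions/patches (keys and values)

contextualized_text, attn_weights = cross_attn(query=text, key=image, value=image)
print(contextualized_text.shape, attn_weights.shape)  # [2, 20, 256] [2, 20, 49]
```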
High-Modality Multimodal Transformers
Transfer across partially observable modalities
Unified model + parameter sharing + multitask and transfer learning
Video Visual Sentiment, Atari Robot
Non-parallel multitask learning
classification QA emotions games manipulation

Task-specific classifiers
Same model
architecture!
HighMMT Shared multimodal model
Same
parameters!
Modality-specific embeddings

Standardized input sequence

Language Video Audio Audio Video Image Proprioception Action

[Reed et al., A Generalist Agent. TMLR 2022]


[Liang et al., HighMMT: Quantifying Modality and Task Heterogeneity for High-Modality Representation Learning. TMLR 2022]

89
Open
High-Modality Models challenges

Some implicit assumptions:


- All modalities can be represented as sequences without losing information.
- Dimensions of heterogeneity can be perfectly captured by modality-specific embeddings.
- Cross-modal connections & interactions are shared across modalities and tasks.

Video classification Sentiment, emotions Robot dynamics

Gato/ HighMMT Shared multimodal model?

Modality-specific embeddings?

Standardized input sequence?

Language Video Audio Audio Video Video Time-series

90
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

Joint undirected Cross-modal Alignment with Structured


alignment directed alignment unimodal models alignment

91
Conditioning Pretrained Language Models
Conditioning via prefix tuning A small red boat on the water.

Adapted + pretrained p(x|c)

Adapter Pretrained p(x)

A small red boat on the water.

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

92
Conditioning Pretrained Language Models
Conditioning via prefix tuning
Blue

0-shot VQA:
Adapted + pretrained p(x|c)

Adapter Pretrained p(x)

What color is the car?

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

93
Conditioning Pretrained Language Models
Conditioning via prefix tuning Steve Jobs

1-shot outside
knowledge VQA: Adapted + pretrained p(x|c)

Adapter Adapter p(x)

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

94
Conditioning Pretrained Language Models
Conditioning via prefix tuning This is a dax.

Few-shot image
classification: Adapted + pretrained

Adapter Adapter Adapter

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

95
Conditioning Pretrained Language Models
Mini-GPT4

Stage 1: Alignment using


paired image-text data.

Stage 2: Instruction tuning


using image + text instructions
and example completions.

[Zhu et al., MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models. 2023]

96
Conditioning Pretrained Language Models
LLaMA-Adapter

Can be combined with ImageBind


– alignment of many modalities to
language (i.e., high-modality
coordination model)

[Girdhar et al., ImageBind: One Embedding Space To Bind Them All. CVPR 2023]
[Gao et al., LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arxiv 2023]

97
Structure in Contextualized Representations
Compositionality requires knowing the structure.

In the continuous case (i.e., if structure is given or can be learned easily in a differentiable manner):

Sequential Interactive Hierarchical Graphical

98

98
Interactive Structure
Structure defined through interactive environment
Main difference from temporal - actions taken at previous time steps affect future states

Integrates multimodality into the


reinforcement learning framework

Modality A …

Policy 𝒂
Modality B …

Local representation
+ Aligned representation
99

99
Interactive Structure
Structure defined through interactive environment
Main difference from temporal - actions taken at previous time steps affect future states

Language-conditional RL Language-assisted RL

Instruction following Reading instruction manuals

Big topic, see full multimodal RL lectures at: https://cmu-multicomp-lab.github.io/mmml-course/fall2022/

[Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019]

100
Open
Language Models for Planning challenges

Multi-step planning afforded by language models

Policy 𝒂𝟏

Modality A … Policy 𝒂𝟐

Modality B … Policy 𝒂𝟑

Task: get glass of milk


Policy 𝒂𝟒

[Huang et al., Language Models as Zero-shot Planners. ICML 2022] 101

101
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts

Attend [red]
Attend [red]

“red”

Is there a red shape


above a circle?

[Andreas et al., Neural Module Networks. CVPR 2016]

102
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts
Attend [red]

Attend [circle]

Is there a red shape


above a circle?
Attend [circle] “circle”

[Andreas et al., Neural Module Networks. CVPR 2016]

103
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts
Attend [red]

Re-attend [above]

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

104
Hierarchical Structure
Leverage syntactic structure of language

Local composition with


interpretable output concepts

Combine [and]
Attend [red] Combine [and]

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

105
Hierarchical Structure
Leverage syntactic structure of language

Enables better interpretability, but requires parsing + engineering each specific module

Attend [red] Combine [and] Measure [is] YES

Is there a red shape


above a circle?
Attend [circle] Re-attend [above]

[Andreas et al., Neural Module Networks. CVPR 2016]

106
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval

Multimodal input
multimodal output
models?
More in Challenge 4:
Generation

[Koh et al., Grounding Language Models to Images for Multimodal Inputs and Outputs. ICML 2023]

107
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval
+ knowledge graphs

[Chen et al., Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. ACM Multimedia 2022]
[Zhang et al., Multimodal Analogical Reasoning over Knowledge Graphs. ICLR 2023]

108
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models


+ retrieval
+ knowledge graphs
+ multi-step reasoning

[Zhang et al., Multimodal Chain-of-Thought Reasoning in Language Models. 2023]


[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. 2023]

109
Open
Hierarchical Structure + Multimodal Pretraining challenges

Socratic models All modalities as natural language?

[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. 2023]

110
Open
Hierarchical Structure + Multimodal Pretraining challenges

Multimodal pretrained models Compositionality? Tokenization? Reasoning?


+ retrieval
+ knowledge graphs
+ multi-step reasoning
+ relational alignment

[Thrush et al., Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. CVPR 2023]
[Pandey et al., Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment. ACL 2023]

111
Sub-Challenge 2c: Contextualized Representations
Definition: Learn representations that reflect the
cross-modal interactions of the
structured multimodal data

More in Challenge 3: Reasoning


Joint undirected Cross-modal Alignment with Structured
alignment directed alignment unimodal models alignment

112
Core Multimodal Challenges

Representation · Alignment · Reasoning · Generation · Transference · Quantification

113
What is
Why is it hard? What is next?
Multimodal?

Heterogeneous Representation

Alignment
Connected
Reasoning
Interacting
Generation

Transference
𝑧
Quantification
114

114
Future Direction: Heterogeneity

Homogeneity vs Heterogeneity

Challenges:
Arbitrary tokenization Beyond differentiable interactions

Causal, logical, brain-inspired interactions

Theoretical study of interactions

115
MultiBench
Future Direction: High-modality — MultiBench: https://github.com/pliang279/MultiBench

Few modalities High-modality

Language Vision Audio Graphs Control LIDAR Sensors Set Table Financial Medical

Challenges: Non-parallel learning Limited resources

116

116
Future Direction: Long-term

Short-term Long-term

seconds
or minutes

Challenges:
Compositionality Memory Personalization

117
Social-IQ
Future Direction: Interaction — Social-IQ: https://www.thesocialiq.com/

Social Intelligence
Reasoning

Perception Generation

Multimodal
Interaction

Challenges:
Multi-Party Causality Ethical
118

118
MultiViz
Future Direction: Real-world — MultiViz: https://github.com/pliang279/MultiViz

Healthcare Intelligent Interfaces and Online Learning


Decision Support Vehicles and Education

Challenges:
Robustness Fairness Generalization Interpretation

119
What is
Why is it hard? What is next?
Multimodal?

Representation High-modality
Heterogeneous
Alignment
Heterogeneity
Reasoning
Connected Long-term
Generation
Transference Interaction
Interacting
Quantification Real-world

https://cmu-multicomp-lab.github.io/mmml-course/fall2022/
Liang, Zadeh, and Morency. Foundations and Trends on Multimodal Machine Learning. 2022

120

120
