
Advanced

Multimodal Machine Learning


Lecture 8.1: Multimodal alignment
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

Upcoming Schedule

▪ First project assignment:
  ▪ Proposal presentation (10/3 and 10/5)
  ▪ First project report (10/8)
▪ Midterm project assignment
  ▪ Midterm presentations (Tuesday 11/6 & Thursday 11/8)
  ▪ Midterm report (Sunday 11/11) – no extensions
▪ Final project assignment
  ▪ Final presentation (TBD)
  ▪ Final report (12/11 at 11:59pm ET)
Midterm Presentation Instructions

▪ 7-8 minute presentations (max: 8 mins)
  ▪ +1.5 minutes for written feedback and notes
▪ All team members should be involved.
▪ The ordering of the presentations (Tuesday vs. Thursday) is the inverse of the proposal ordering.
▪ The presentations will be from 4:30pm – 6pm
▪ Please arrive on time!
Midterm Presentation Instructions

▪ General definition of your research problem, including a mathematical formalization of the problem. Include definitions of the main variables and the overall objective function (2-3 slides)
▪ Explain at least two multimodal baseline models for your research problem (2-4 slides)
▪ Present current results of these baseline models on your dataset. You should study the failure cases of the baseline models (3-5 slides)
▪ Describe the research directions you are planning to explore. Discuss how they will address some of the shortcomings of your baseline model. (2-3 slides)
Midterm Project Report Instructions

▪ Main sections:
▪ Abstract
▪ Introduction
▪ Related work
▪ Problem statement
▪ Multimodal baseline models
▪ Experimental methodology
▪ Results and discussion
▪ Proposed approaches
Lecture objectives

▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and
explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multi-modal alignment
Multimodal alignment

▪ Multimodal alignment – finding relationships and correspondences between two or more modalities
▪ Examples
  ▪ Images with captions
  ▪ Recipe steps with a how-to video
  ▪ Phrases/words of translated sentences
▪ Two types
  ▪ Explicit – alignment is the task in itself
  ▪ Latent – alignment helps when solving a different task (for example "attention" models)

[Figure: elements t1, t2, ..., tn of Modality 1 matched to elements of Modality 2 by an alignment algorithm]
Explicit multimodal-alignment

▪ Explicit alignment – the goal is to find correspondences between modalities
  ▪ Aligning a speech signal to a transcript
  ▪ Aligning two out-of-sync sequences
  ▪ Co-referring expressions
Implicit multimodal-alignment

▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Machine translation
  ▪ Cross-modal retrieval
  ▪ Image & video captioning
  ▪ Visual question answering
Explicit alignment
Temporal sequence alignment

Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping

▪ We have two unaligned temporal unimodal signals:
  ▪ $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$
  ▪ $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$
▪ Find the set of indices that minimizes the alignment difference:

$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$

▪ where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of the same length $l$
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Restrictions
  ▪ Monotonicity – no going back in time
  ▪ Continuity – no gaps
  ▪ Boundary conditions – start and end at the same points
  ▪ Warping window – don't get too far from the diagonal
  ▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Solved using dynamic programming while respecting the restrictions
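To make the dynamic-programming solution concrete, here is a minimal NumPy sketch (not from the lecture; the function name and the simple three-move step rule are choices of this sketch, and the warping-window and slope constraints are omitted). It fills the cumulative cost table and backtracks the lowest-cost path, returning index vectors analogous to $\mathbf{p}^x$ and $\mathbf{p}^y$ above.

```python
# Minimal DTW sketch, assuming X and Y are d x n_x and d x n_y matrices as defined above.
import numpy as np

def dtw(X, Y):
    """Return the DTW cost and the aligned index vectors (p_x, p_y)."""
    n_x, n_y = X.shape[1], Y.shape[1]
    # Pairwise squared Euclidean distances between columns of X and Y
    dist = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)

    # Dynamic-programming table of accumulated costs
    D = np.full((n_x + 1, n_y + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            # Monotonicity and continuity: only left, up, and diagonal moves are allowed
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack from (n_x, n_y) to (1, 1) to recover the warping path
    i, j, path = n_x, n_y, []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    p_x, p_y = zip(*reversed(path))
    return D[n_x, n_y], np.array(p_x), np.array(p_y)
```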
DTW alternative formulation
$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$

▪ Replication doesn't change the objective! The index vectors can be written as binary warping matrices, so the warped signals become $\mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}\mathbf{W}_y$.

Alternative objective:

$$L(\mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \right\|_F^2$$

▪ $\mathbf{X}, \mathbf{Y}$ – original signals (same number of rows, possibly different numbers of columns)
▪ $\mathbf{W}_x, \mathbf{W}_y$ – alignment (warping) matrices
▪ Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j a_{i,j}^2$
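As a quick sanity check of the equivalence, the sketch below (a hypothetical helper, reusing the `dtw` function sketched earlier) converts the index vectors into binary warping matrices and verifies that the Frobenius objective equals the path cost.

```python
import numpy as np

def warping_matrix(p, n):
    """Build an n x l binary warping matrix from an index vector p of length l."""
    l = len(p)
    W = np.zeros((n, l))
    W[p, np.arange(l)] = 1.0   # column t selects (replicates) frame p[t]
    return W

# Example usage (continuing the earlier dtw sketch):
# cost, p_x, p_y = dtw(X, Y)
# W_x = warping_matrix(p_x, X.shape[1]); W_y = warping_matrix(p_y, Y.shape[1])
# np.allclose(((X @ W_x - Y @ W_y) ** 2).sum(), cost)  # should hold
```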
DTW - limitations

▪ Computationally complex, especially when jointly aligning m sequences
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder
maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$ and $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$

1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors

[Figure: text features $\mathbf{X}$ and image features $\mathbf{Y}$ projected by $\mathbf{U}$ and $\mathbf{V}$ into correlated spaces $\mathbf{H}_x$ and $\mathbf{H}_y$]
Canonical Correlation Analysis reminder

▪ When the data is normalized, CCA is actually equivalent to smallest-RMSE reconstruction
▪ The CCA loss can also be re-written as:

$$L(\mathbf{U}, \mathbf{V}) = \left\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \right\|_F^2$$

subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$ and $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$

[Figure: the same projection diagram as on the previous slide]
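For reference, a small NumPy sketch of CCA via whitening and an SVD of the cross-covariance (not from the lecture; the function name, the regularization term, and the Cholesky-based whitening are implementation choices of this sketch). The returned projections satisfy the unit-variance and orthogonality constraints above.

```python
# Assumes X (d_x x n) and Y (d_y x n) hold n paired, mean-centered samples in their columns.
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Return projections U (d_x x k), V (d_y x k) maximizing correlation."""
    n = X.shape[1]
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])   # covariance of X (regularized)
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])   # covariance of Y (regularized)
    Sxy = X @ Y.T / n                              # cross-covariance

    # Whiten each view, then take the SVD of the whitened cross-covariance
    Sxx_inv_sqrt = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Syy_inv_sqrt = np.linalg.inv(np.linalg.cholesky(Syy)).T
    A, corr, Bt = np.linalg.svd(Sxx_inv_sqrt.T @ Sxy @ Syy_inv_sqrt)

    U = Sxx_inv_sqrt @ A[:, :k]       # satisfies U^T Sxx U = I
    V = Syy_inv_sqrt @ Bt.T[:, :k]    # satisfies V^T Syy V = I
    return U, V, corr[:k]
```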
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X} \mathbf{W}_x - \mathbf{V}^T \mathbf{Y} \mathbf{W}_y \right\|_F^2$$

▪ Allows aligning multimodal or multi-view data (same modality but from a different point of view)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal alignment
  ▪ $\mathbf{U}, \mathbf{V}$ – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X} \mathbf{W}_x - \mathbf{V}^T \mathbf{Y} \mathbf{W}_y \right\|_F^2$$

▪ Optimized by coordinate descent – fix one set of parameters, optimize the other:
  ▪ $\mathbf{U}, \mathbf{V}$ – generalized eigen-decomposition (the CCA step)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal warping (Gauss-Newton step)

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
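A high-level sketch of the coordinate descent, assuming the `dtw` and `cca` helpers sketched earlier; for simplicity the temporal step here is plain DTW rather than the Gauss-Newton warping, so this is illustrative only.

```python
import numpy as np

def ctw(X, Y, k, n_iters=20):
    """Alternate between temporal alignment and spatial (CCA) projection."""
    U = np.eye(X.shape[0])[:, :k]      # initial spatial projections
    V = np.eye(Y.shape[0])[:, :k]
    for _ in range(n_iters):
        # (1) Temporal step: align the projected signals
        _, p_x, p_y = dtw(U.T @ X, V.T @ Y)
        # (2) Spatial step: CCA on the time-aligned (warped) signals
        U, V, _ = cca(X[:, p_x], Y[:, p_y], k)
    return U, V, p_x, p_y
```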
Generalized Time Warping

▪ Generalizes to multiple ($m$) sequences, each possibly from a different modality:

$$L(\{\mathbf{U}_i\}, \{\mathbf{W}_i\}) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left\| \mathbf{U}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{U}_j^T \mathbf{X}_j \mathbf{W}_j \right\|_F^2$$

▪ $\mathbf{W}_i$ – set of temporal alignments
▪ $\mathbf{U}_i$ – set of cross-modal (spatial) alignments
▪ The optimization alternates between (1) time warping and (2) spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]


Alignment examples (unimodal)
▪ CMU Motion Capture – Subject 1: 199 frames, Subject 2: 217 frames, Subject 3: 222 frames
▪ Weizmann – Subject 1: 40 frames, Subject 2: 44 frames, Subject 3: 43 frames

[Figure: temporal alignment of the three subjects' sequences for each dataset]
Alignment examples (multimodal)
Canonical time warping - limitations

▪ Linear transform between modalities


▪ How to address this?
Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X}) \mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y}) \mathbf{W}_y \right\|_F^2$$

▪ Can be seen as a generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]

Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X}) \mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y}) \mathbf{W}_y \right\|_F^2$$

▪ The projections are orthogonal (as in DCCA)
▪ Optimization is again iterative:
  ▪ Solve for the alignment ($\mathbf{W}_x, \mathbf{W}_y$) with fixed projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$)
    ▪ Eigen-decomposition
  ▪ Solve for the projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$) with fixed alignment ($\mathbf{W}_x, \mathbf{W}_y$)
    ▪ Gradient descent
  ▪ Repeat until convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]
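An illustrative PyTorch-style sketch of the alternating optimization (not the authors' code): two small feed-forward feature extractors are trained under the Frobenius alignment loss while the temporal alignment is periodically re-estimated with the earlier `dtw` helper. The orthogonality constraint on the projections is omitted for brevity.

```python
import numpy as np
import torch
import torch.nn as nn

def dctw_sketch(X, Y, hidden=32, out_dim=8, n_outer=10, n_inner=100, lr=1e-3):
    """X: (d_x, n_x), Y: (d_y, n_y) as NumPy arrays."""
    f1 = nn.Sequential(nn.Linear(X.shape[0], hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
    f2 = nn.Sequential(nn.Linear(Y.shape[0], hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
    opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=lr)
    Xt = torch.tensor(X.T, dtype=torch.float32)   # frames as rows: (n_x, d_x)
    Yt = torch.tensor(Y.T, dtype=torch.float32)

    for _ in range(n_outer):
        # (1) Fix the networks and re-estimate the temporal alignment (DTW here)
        with torch.no_grad():
            _, p_x, p_y = dtw(f1(Xt).numpy().T, f2(Yt).numpy().T)
        ix = torch.as_tensor(np.asarray(p_x), dtype=torch.long)
        iy = torch.as_tensor(np.asarray(p_y), dtype=torch.long)
        # (2) Fix the alignment and update the networks by gradient descent
        for _ in range(n_inner):
            diff = f1(Xt)[ix] - f2(Yt)[iy]
            loss = (diff ** 2).sum()               # Frobenius-norm alignment loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f1, f2, p_x, p_y
```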


Implicit alignment
Implicit alignment

▪ We looked at how to explicitly align temporal data
▪ Could we use that as an internal (hidden) step in our models?
▪ Can we instead encourage the model to align data when solving a different problem?
▪ Yes!
  ▪ Graphical models
  ▪ Neural attention models (the focus of today's lecture)
Attention models
Attention in humans

▪ Foveal vision – we only see in "high resolution" within about 2 degrees of our visual field
▪ We focus our attention selectively on certain words (for example, our names)
▪ We attend to relevant speech in a noisy room
Attention models in deep learning

▪ Many examples of attention models in recent years!


▪ Why:
  ▪ Allows for implicit data alignment
  ▪ Good results empirically
  ▪ In some cases faster (no need to attend to the whole image)
  ▪ Better interpretability
Types of Attention Models

▪ Recent attention models can be roughly split into three major categories:
  1. Soft attention
    ▪ Acts like a gate function; deterministic inference.
  2. Transform networks
    ▪ Warp the input to better align it with a canonical view.
  3. Hard attention
    ▪ Involves stochastic processes; related to reinforcement learning.
Soft attention
Machine Translation

▪ Given a sentence in one language, translate it to another

  Dog on the beach → le chien sur la plage

▪ Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs

▪ A quick reminder about encoder-decoder frameworks
▪ First we encode the sentence
▪ Then we decode it in a different language

[Figure: an encoder compresses "le chien sur la plage" into a context / embedding / sentence representation, which the decoder expands into "Dog on the beach"]
Machine Translation with RNNs

▪ What is the problem with this?
▪ What happens when the sentences are very long?
▪ We expect the encoder's hidden state to capture everything in a sentence, a very complex state packed into a single vector, such as:

  The agreement on the European Economic Area was signed in August 1992.

  L'accord sur la zone économique européenne a été signé en août 1992.
Decoder – attention model

▪ Before, the decoder would just take the encoder's final hidden state; now we actually care about the intermediate hidden states

[Figure: an attention module / gate combines the encoder states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ for "le chien sur la plage" into a context $\mathbf{z}_0$, which together with the decoder hidden state $\mathbf{s}_0$ produces the first output word "Dog"]

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
Decoder – attention model

▪ The same figure repeats for the following decoding steps: context $\mathbf{z}_1$ with decoder state $\mathbf{s}_1$ produces "on", and context $\mathbf{z}_2$ with state $\mathbf{s}_2$ produces "the". A new context vector is computed at every step.

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
How do we encode attention

▪ Before:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$ and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
  ▪ We have an attention "gate": a different context $\mathbf{z}_i$ is used at each time step!
  ▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$, where $\alpha_{ij}$ is the (scalar) attention on word $j$ at generation step $i$


MT with attention

So how do we determine $\alpha_{ij}$?

▪ $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$ – a softmax, making sure the weights sum to 1

where:

▪ $e_{ij} = \mathbf{v}^T \sigma(W \mathbf{s}_{i-1} + U \mathbf{h}_j)$ – a feedforward network that tells us, given the current state of the decoder, how important the current encoder state is
▪ $\mathbf{v}, W, U$ – learnable weights
▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ – the expectation of the context (a fancy way of saying it's a weighted average)
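A minimal NumPy sketch of this additive (Bahdanau-style) attention step; the shapes, the use of tanh for $\sigma$, and the function name are assumptions of the sketch, and $W$, $U$, $\mathbf{v}$ would be learned jointly with the rest of the model.

```python
import numpy as np

def attention_context(s_prev, H, W, U, v):
    """s_prev: decoder state (d_s,); H: encoder states (T_x, d_h).
    Returns the context vector z_i and the attention weights alpha_i."""
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v   # e_ij for every source position j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax over source positions
    z = alpha @ H                                  # expected (weighted-average) context
    return z, alpha
```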
MT with attention

▪ Basically we are using a neural network to tell us where another neural network should be looking!
▪ We can use it with RNNs, LSTMs or GRUs
▪ The encoder has the same structure as before
  ▪ Can be uni-directional
  ▪ Can be bi-directional
▪ The model can be trained using regular back-propagation through time, since all of the modules are differentiable
▪ Does it work?
MT with attention recap

▪ We get good translation results (especially for long sentences)
▪ We also get a (soft) alignment between sentences in different languages
▪ Extra interpretability of how the method is functioning
▪ How do we move to multimodal?
Visual captioning with soft attention

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Recap RNN for Captioning

▪ Example caption: "Bird in the sky"
▪ Why might we not want to focus only on the final layer? Looking at earlier convolutional layers gives more fine-grained, spatially localized features.

[Figure: at each step the decoder state $s_t$ produces a distribution $a_t$ over the $L$ image locations and an output word $d_t$; the context $z_t$ is the expectation over the $D$-dimensional features at those locations and is fed to the next step together with the previous word (starting from the first word $y_0$)]
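The same soft-attention computation can be applied to a grid of image features. A sketch (names and shapes are illustrative): the $L \times D$ feature matrix could be, for example, a $14 \times 14 \times 512$ convolutional map reshaped to $196 \times 512$.

```python
import numpy as np

def soft_attend(features, s, W_f, W_s, v):
    """features: (L, D) image features at L locations; s: decoder state (d_s,)."""
    scores = np.tanh(features @ W_f.T + s @ W_s.T) @ v   # one score per location, shape (L,)
    a = np.exp(scores - scores.max())
    a /= a.sum()                                         # distribution over the L locations
    z = a @ features                                     # expected D-dimensional feature (a "soft" glimpse)
    return z, a
```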
Soft attention

▪ Allows for latent data alignment


▪ Allows us to get an idea of what the network “sees”
▪ Can be optimized using back propagation

▪ Good at paper naming!


▪ Show, Attend and Tell (extension of Show and Tell)
▪ Listen, Attend and Walk
▪ Listen, Attend and Spell
▪ Ask, Attend and Answer
Spatial Transformer Networks
Some limitations of grid based attention

▪ Can we fixate on small parts of the image but still have easy end-to-end training?
Spatial Transformer Networks

Can we make this function differentiable?
Spatial Transformer Networks

Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input.

Can we make this function differentiable?

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
Spatial Transformer Networks

Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input.

Can we make this function differentiable?

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

The network "attends" to the input by predicting $\theta$.
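A minimal NumPy sketch of the affine coordinate mapping above plus bilinear sampling, the differentiable "cropping"; a real spatial transformer predicts theta with a small localization network, whereas here theta is simply passed in, and the function and variable names are illustrative.

```python
import numpy as np

def spatial_transform(image, theta, out_h, out_w):
    """image: (H, W); theta: (2, 3) affine parameters; returns an (out_h, out_w) output."""
    H, W = image.shape
    # Target (output) coordinates on a normalized [-1, 1] grid
    yt, xt = np.meshgrid(np.linspace(-1, 1, out_h), np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])   # (3, out_h*out_w)
    xs, ys = theta @ grid                                               # source coordinates
    # Map normalized source coordinates back to pixel indices
    xs = np.clip((xs + 1) * (W - 1) / 2, 0, W - 1)
    ys = np.clip((ys + 1) * (H - 1) / 2, 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    wx, wy = xs - x0, ys - y0
    # Bilinear interpolation of the four neighbouring pixels
    out = (image[y0, x0] * (1 - wx) * (1 - wy) + image[y0, x0 + 1] * wx * (1 - wy)
           + image[y0 + 1, x0] * (1 - wx) * wy + image[y0 + 1, x0 + 1] * wx * wy)
    return out.reshape(out_h, out_w)
```

For example, theta = [[0.5, 0, 0], [0, 0.5, 0]] samples a zoomed-in central crop; because the output is a smooth function of theta, gradients can flow back to whatever network predicts theta.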
Spatial Transformer Networks

Examples on real world data

▪ Results on traffic sign recognition

Code available: http://torch.ch/blog/2015/09/07/spatial_transformers.html

Recap on Spatial Transformer Networks

▪ Differentiable, so we can just use back-prop for training end-to-end
▪ Can use complex transformation models for focusing on an image
  ▪ Affine, piece-wise affine, perspective, thin plate splines
▪ Can be used to focus on certain parts of an image
▪ We can use it instead of grid-based soft and hard attention for multimodal tasks
Glimpse Network
(Hard Attention)
Hard attention

▪ Soft attention requires computing a representation for the whole image or sentence
▪ Hard attention, on the other hand, forces the model to look at only one part at a time
▪ The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
▪ A saccade followed by a glimpse – this is how the human visual system works

[Recurrent Models of Visual Attention, Mnih et al., 2014]
[Multiple Object Recognition with Visual Attention, Ba et al., 2015]
Hard attention examples
Glimpse Sensor

▪ Looks at a part of an image at different scales
▪ The different scales are combined into a single multi-channel image (a human-retina-like representation)
▪ Given a location $l_t$, it outputs an image summary at that location

[Recurrent Models of Visual Attention, Mnih et al., 2014]
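A rough sketch of such a glimpse sensor (the patch size, number of scales, and nearest-neighbour resizing are assumptions of this sketch): it extracts progressively larger patches centred at the given location, resizes them to a common resolution, and stacks them like a retina.

```python
import numpy as np

def glimpse(image, l, base=8, n_scales=3):
    """image: (H, W); l: (row, col) centre; returns a (n_scales, base, base) stack."""
    H, W = image.shape
    patches = []
    for s in range(n_scales):
        half = base * (2 ** s) // 2            # larger, coarser patch at each scale
        r0, c0 = max(l[0] - half, 0), max(l[1] - half, 0)
        patch = image[r0:min(l[0] + half, H), c0:min(l[1] + half, W)]
        # Nearest-neighbour downsample to base x base (crude but dependency-free)
        rows = np.linspace(0, patch.shape[0] - 1, base).astype(int)
        cols = np.linspace(0, patch.shape[1] - 1, base).astype(int)
        patches.append(patch[np.ix_(rows, cols)])
    return np.stack(patches)
```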
Glimpse network

▪ Combines the glimpse and the location of the glimpse into a joint network
▪ The glimpse is followed by a feedforward network (a CNN or a DNN)
▪ The exact formulation of how the location and appearance are combined varies; the important thing is combining "what" and "where"
▪ Differentiable with respect to the glimpse parameters, but not with respect to the location
Overall Architecture - Emission network

▪ Given an image, a glimpse location $l_t$, and optionally an action $a_t$
▪ The action can be:
  ▪ Some action in a dynamic system – pressing a button, etc.
  ▪ Classification of an object
  ▪ Word output
▪ This is an RNN with two output gates and a slightly more complex input gate!
Recurrent Model of Visual Attention (RAM)

▪ Sample locations of glimpses, leading to updates in the network
▪ Use gradient descent to update the weights (the glimpse network weights are differentiable)
▪ The emission network is an RNN
▪ Not as simple as back-propagation, but doable
▪ It turns out this is very similar to, and in some cases equivalent to, reinforcement learning with the REINFORCE learning rule [Williams, 1992]
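To make that connection concrete, here is a toy sketch of the REINFORCE (score-function) update for the location choice; a Gaussian policy centred on a predicted mean is one common choice, and everything here is illustrative rather than the exact RAM training procedure.

```python
import numpy as np

def reinforce_location_grad(mu, sigma, reward, baseline=0.0):
    """mu: predicted mean location (2,); returns (sampled location, gradient w.r.t. mu)."""
    l = np.random.normal(mu, sigma)                  # stochastic "hard" choice of where to look
    # gradient of log N(l; mu, sigma^2) w.r.t. mu, scaled by the baseline-subtracted reward
    grad_mu = (reward - baseline) * (l - mu) / (sigma ** 2)
    return l, grad_mu
```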
Multi-modal alignment recap
Multimodal alignment recap

▪ Explicit alignment – aligns two or more modalities (or views) as an actual task; the goal is to find correspondences between modalities
  ▪ Dynamic Time Warping
  ▪ Canonical Time Warping
  ▪ Deep Canonical Time Warping
▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Attention models
    ▪ Soft attention
    ▪ Spatial transformer networks
    ▪ Hard attention
