
Advanced

Multimodal Machine Learning


Lecture 8.1: Multimodal alignment
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

Upcoming Schedule

▪ First project assignment:
  ▪ Proposal presentation (10/3 and 10/5)
  ▪ First project report (10/8)
▪ Midterm project assignment
  ▪ Midterm presentations (Tuesday 11/6 & Thursday 11/8)
  ▪ Midterm report (Sunday 11/11) – no extensions
▪ Final project assignment
  ▪ Final presentation (TBD)
  ▪ Final report (12/11 at 11:59pm ET)
Midterm Presentation Instructions

▪ 7-8 minute presentations (max: 8 mins)
  ▪ +1.5 minutes for written feedback and notes
▪ All team members should be involved.
▪ The ordering of the presentations (Tuesday vs. Thursday) is the inverse of the proposal ordering.
▪ The presentations will be from 4:30pm – 6pm
▪ Please arrive on time!
Midterm Presentation Instructions

▪ General definition of your research problem, including a mathematical formalization of the problem. Include definitions of the main variables and the overall objective function (2-3 slides)
▪ Explain at least two multimodal baseline models for your research problem (2-4 slides)
▪ Present current results of these baseline models on your dataset. You should study the failure cases of the baseline models (3-5 slides)
▪ Describe the research directions you are planning to explore. Discuss how they will address some of the shortcomings of your baseline model. (2-3 slides)
Midterm Project Report Instructions

▪ Main sections:
▪ Abstract
▪ Introduction
▪ Related work
▪ Problem statement
▪ Multimodal baseline models
▪ Experimental methodology
▪ Results and discussion
▪ Proposed approaches
Lecture objectives

▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and
explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multi-modal alignment
Multimodal alignment

▪ Multimodal alignment – finding relationships and correspondences between two or more modalities
▪ Examples
  ▪ Images with captions
  ▪ Recipe steps with a how-to video
  ▪ Phrases/words of translated sentences
▪ Two types
  ▪ Explicit – alignment is the task in itself
  ▪ Latent – alignment helps when solving a different task (for example "attention" models)

[Figure: elements t1, t2, ..., tn of Modality 1 matched to elements of Modality 2 by an alignment algorithm]
Explicit multimodal-alignment

▪ Explicit alignment – the goal is to find correspondences between modalities
  ▪ Aligning a speech signal to a transcript
  ▪ Aligning two out-of-sync sequences
  ▪ Co-referring expressions
Implicit multimodal-alignment

▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Machine translation
  ▪ Cross-modal retrieval
  ▪ Image & video captioning
  ▪ Visual question answering
Explicit alignment
Temporal sequence alignment

Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping

▪ We have two unaligned temporal unimodal signals:
  ▪ $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$
  ▪ $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$
▪ Find the set of indices that minimizes the alignment difference:

$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$

▪ where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of the same length $l$
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Restrictions
  ▪ Monotonicity – no going back in time
  ▪ Continuity – no gaps
  ▪ Boundary conditions – start and end at the same points
  ▪ Warping window – don't get too far from the diagonal
  ▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Solved using dynamic programming while respecting the restrictions
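To make the dynamic-programming solution concrete, here is a minimal NumPy sketch (not from the lecture; the function name and the simple three-move step rule are choices of this sketch, and the warping-window and slope constraints are omitted). It fills the cumulative cost table and backtracks the lowest-cost path, returning index vectors analogous to $\mathbf{p}^x$ and $\mathbf{p}^y$ above.

```python
# Minimal DTW sketch, assuming X and Y are d x n_x and d x n_y matrices as defined above.
import numpy as np

def dtw(X, Y):
    """Return the DTW cost and the aligned index vectors (p_x, p_y)."""
    n_x, n_y = X.shape[1], Y.shape[1]
    # Pairwise squared Euclidean distances between columns of X and Y
    dist = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)

    # Dynamic-programming table of accumulated costs
    D = np.full((n_x + 1, n_y + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            # Monotonicity and continuity: only left, up, and diagonal moves are allowed
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack from (n_x, n_y) to (1, 1) to recover the warping path
    i, j, path = n_x, n_y, []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    p_x, p_y = zip(*reversed(path))
    return D[n_x, n_y], np.array(p_x), np.array(p_y)
```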
DTW alternative formulation
$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$

▪ Replication doesn't change the objective! The index vectors can be written as binary warping matrices, so the warped signals become $\mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}\mathbf{W}_y$.

Alternative objective:

$$L(\mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \right\|_F^2$$

▪ $\mathbf{X}, \mathbf{Y}$ – original signals (same number of rows, possibly different numbers of columns)
▪ $\mathbf{W}_x, \mathbf{W}_y$ – alignment (warping) matrices
▪ Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j a_{i,j}^2$
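As a quick sanity check of the equivalence, the sketch below (a hypothetical helper, reusing the `dtw` function sketched earlier) converts the index vectors into binary warping matrices and verifies that the Frobenius objective equals the path cost.

```python
import numpy as np

def warping_matrix(p, n):
    """Build an n x l binary warping matrix from an index vector p of length l."""
    l = len(p)
    W = np.zeros((n, l))
    W[p, np.arange(l)] = 1.0   # column t selects (replicates) frame p[t]
    return W

# Example usage (continuing the earlier dtw sketch):
# cost, p_x, p_y = dtw(X, Y)
# W_x = warping_matrix(p_x, X.shape[1]); W_y = warping_matrix(p_y, Y.shape[1])
# np.allclose(((X @ W_x - Y @ W_y) ** 2).sum(), cost)  # should hold
```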
DTW - limitations

▪ Computationally complex, especially when jointly aligning m sequences
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder
maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$ and $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$

1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors

[Figure: text features $\mathbf{X}$ and image features $\mathbf{Y}$ projected by $\mathbf{U}$ and $\mathbf{V}$ into correlated spaces $\mathbf{H}_x$ and $\mathbf{H}_y$]
Canonical Correlation Analysis reminder

▪ When the data is normalized, CCA is actually equivalent to smallest-RMSE reconstruction
▪ The CCA loss can also be re-written as:

$$L(\mathbf{U}, \mathbf{V}) = \left\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \right\|_F^2$$

subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$ and $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$

[Figure: the same projection diagram as on the previous slide]
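For reference, a small NumPy sketch of CCA via whitening and an SVD of the cross-covariance (not from the lecture; the function name, the regularization term, and the Cholesky-based whitening are implementation choices of this sketch). The returned projections satisfy the unit-variance and orthogonality constraints above.

```python
# Assumes X (d_x x n) and Y (d_y x n) hold n paired, mean-centered samples in their columns.
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Return projections U (d_x x k), V (d_y x k) maximizing correlation."""
    n = X.shape[1]
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])   # covariance of X (regularized)
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])   # covariance of Y (regularized)
    Sxy = X @ Y.T / n                              # cross-covariance

    # Whiten each view, then take the SVD of the whitened cross-covariance
    Sxx_inv_sqrt = np.linalg.inv(np.linalg.cholesky(Sxx)).T
    Syy_inv_sqrt = np.linalg.inv(np.linalg.cholesky(Syy)).T
    A, corr, Bt = np.linalg.svd(Sxx_inv_sqrt.T @ Sxy @ Syy_inv_sqrt)

    U = Sxx_inv_sqrt @ A[:, :k]       # satisfies U^T Sxx U = I
    V = Syy_inv_sqrt @ Bt.T[:, :k]    # satisfies V^T Syy V = I
    return U, V, corr[:k]
```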
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X} \mathbf{W}_x - \mathbf{V}^T \mathbf{Y} \mathbf{W}_y \right\|_F^2$$

▪ Allows aligning multimodal or multi-view data (same modality but from a different point of view)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal alignment
  ▪ $\mathbf{U}, \mathbf{V}$ – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X} \mathbf{W}_x - \mathbf{V}^T \mathbf{Y} \mathbf{W}_y \right\|_F^2$$

▪ Optimized by coordinate descent – fix one set of parameters, optimize the other:
  ▪ $\mathbf{U}, \mathbf{V}$ – generalized eigen-decomposition (the CCA step)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal warping (Gauss-Newton step)

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
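A high-level sketch of the coordinate descent, assuming the `dtw` and `cca` helpers sketched earlier; for simplicity the temporal step here is plain DTW rather than the Gauss-Newton warping, so this is illustrative only.

```python
import numpy as np

def ctw(X, Y, k, n_iters=20):
    """Alternate between temporal alignment and spatial (CCA) projection."""
    U = np.eye(X.shape[0])[:, :k]      # initial spatial projections
    V = np.eye(Y.shape[0])[:, :k]
    for _ in range(n_iters):
        # (1) Temporal step: align the projected signals
        _, p_x, p_y = dtw(U.T @ X, V.T @ Y)
        # (2) Spatial step: CCA on the time-aligned (warped) signals
        U, V, _ = cca(X[:, p_x], Y[:, p_y], k)
    return U, V, p_x, p_y
```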
Generalized Time Warping

▪ Generalizes to multiple ($m$) sequences, each possibly from a different modality:

$$L(\{\mathbf{U}_i\}, \{\mathbf{W}_i\}) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left\| \mathbf{U}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{U}_j^T \mathbf{X}_j \mathbf{W}_j \right\|_F^2$$

▪ $\mathbf{W}_i$ – set of temporal alignments
▪ $\mathbf{U}_i$ – set of cross-modal (spatial) alignments
▪ The optimization alternates between (1) time warping and (2) spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]


Alignment examples (unimodal)
▪ CMU Motion Capture – Subject 1: 199 frames, Subject 2: 217 frames, Subject 3: 222 frames
▪ Weizmann – Subject 1: 40 frames, Subject 2: 44 frames, Subject 3: 43 frames

[Figure: temporal alignment of the three subjects' sequences for each dataset]
Alignment examples (multimodal)
Canonical time warping - limitations

▪ Linear transform between modalities


▪ How to address this?
Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X}) \mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y}) \mathbf{W}_y \right\|_F^2$$

▪ Can be seen as a generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]

Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X}) \mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y}) \mathbf{W}_y \right\|_F^2$$

▪ The projections are orthogonal (as in DCCA)
▪ Optimization is again iterative:
  ▪ Solve for the alignment ($\mathbf{W}_x, \mathbf{W}_y$) with fixed projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$)
    ▪ Eigen-decomposition
  ▪ Solve for the projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$) with fixed alignment ($\mathbf{W}_x, \mathbf{W}_y$)
    ▪ Gradient descent
  ▪ Repeat until convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]
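An illustrative PyTorch-style sketch of the alternating optimization (not the authors' code): two small feed-forward feature extractors are trained under the Frobenius alignment loss while the temporal alignment is periodically re-estimated with the earlier `dtw` helper. The orthogonality constraint on the projections is omitted for brevity.

```python
import numpy as np
import torch
import torch.nn as nn

def dctw_sketch(X, Y, hidden=32, out_dim=8, n_outer=10, n_inner=100, lr=1e-3):
    """X: (d_x, n_x), Y: (d_y, n_y) as NumPy arrays."""
    f1 = nn.Sequential(nn.Linear(X.shape[0], hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
    f2 = nn.Sequential(nn.Linear(Y.shape[0], hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
    opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=lr)
    Xt = torch.tensor(X.T, dtype=torch.float32)   # frames as rows: (n_x, d_x)
    Yt = torch.tensor(Y.T, dtype=torch.float32)

    for _ in range(n_outer):
        # (1) Fix the networks and re-estimate the temporal alignment (DTW here)
        with torch.no_grad():
            _, p_x, p_y = dtw(f1(Xt).numpy().T, f2(Yt).numpy().T)
        ix = torch.as_tensor(np.asarray(p_x), dtype=torch.long)
        iy = torch.as_tensor(np.asarray(p_y), dtype=torch.long)
        # (2) Fix the alignment and update the networks by gradient descent
        for _ in range(n_inner):
            diff = f1(Xt)[ix] - f2(Yt)[iy]
            loss = (diff ** 2).sum()               # Frobenius-norm alignment loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f1, f2, p_x, p_y
```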


Implicit alignment
Implicit alignment

▪ We looked at how to explicitly align temporal data
▪ Could we use that as an internal (hidden) step in our models?
▪ Can we instead encourage the model to align data when solving a different problem?
▪ Yes!
  ▪ Graphical models
  ▪ Neural attention models (the focus of today's lecture)
Attention models
Attention in humans

▪ Foveal vision – we only see in "high resolution" within about 2 degrees of our visual field
▪ We focus our attention selectively on certain words (for example, our names)
▪ We attend to relevant speech in a noisy room
Attention models in deep learning

▪ Many examples of attention models in recent years!


▪ Why:
  ▪ Allows for implicit data alignment
  ▪ Good results empirically
  ▪ In some cases faster (no need to attend to the whole image)
  ▪ Better interpretability
Types of Attention Models

▪ Recent attention models can be roughly split into three major categories:
  1. Soft attention
    ▪ Acts like a gate function; deterministic inference.
  2. Transform networks
    ▪ Warp the input to better align it with a canonical view.
  3. Hard attention
    ▪ Involves stochastic processes; related to reinforcement learning.
Soft attention
Machine Translation

▪ Given a sentence in one language, translate it to another

  Dog on the beach → le chien sur la plage

▪ Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs

▪ A quick reminder about encoder-decoder frameworks
▪ First we encode the sentence
▪ Then we decode it in a different language

[Figure: an encoder compresses "le chien sur la plage" into a context / embedding / sentence representation, which the decoder expands into "Dog on the beach"]
Machine Translation with RNNs

▪ What is the problem with this?
▪ What happens when the sentences are very long?
▪ We expect the encoder's hidden state to capture everything in a sentence, a very complex state packed into a single vector, such as:

  The agreement on the European Economic Area was signed in August 1992.

  L'accord sur la zone économique européenne a été signé en août 1992.
Decoder – attention model

▪ Before, the decoder would just take the encoder's final hidden state; now we actually care about the intermediate hidden states

[Figure: an attention module / gate combines the encoder states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ for "le chien sur la plage" into a context $\mathbf{z}_0$, which together with the decoder hidden state $\mathbf{s}_0$ produces the first output word "Dog"]

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
Decoder – attention model

▪ The same figure repeats for the following decoding steps: context $\mathbf{z}_1$ with decoder state $\mathbf{s}_1$ produces "on", and context $\mathbf{z}_2$ with state $\mathbf{s}_2$ produces "the". A new context vector is computed at every step.

[Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015]
How do we encode attention

▪ Before:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$ and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
  ▪ We have an attention "gate": a different context $\mathbf{z}_i$ is used at each time step!
  ▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$, where $\alpha_{ij}$ is the (scalar) attention on word $j$ at generation step $i$


MT with attention

So how do we determine $\alpha_{ij}$?

▪ $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$ – a softmax, making sure the weights sum to 1

where:

▪ $e_{ij} = \mathbf{v}^T \sigma(W \mathbf{s}_{i-1} + U \mathbf{h}_j)$ – a feedforward network that tells us, given the current state of the decoder, how important the current encoder state is
▪ $\mathbf{v}, W, U$ – learnable weights
▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ – the expectation of the context (a fancy way of saying it's a weighted average)
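A minimal NumPy sketch of this additive (Bahdanau-style) attention step; the shapes, the use of tanh for $\sigma$, and the function name are assumptions of the sketch, and $W$, $U$, $\mathbf{v}$ would be learned jointly with the rest of the model.

```python
import numpy as np

def attention_context(s_prev, H, W, U, v):
    """s_prev: decoder state (d_s,); H: encoder states (T_x, d_h).
    Returns the context vector z_i and the attention weights alpha_i."""
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v   # e_ij for every source position j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax over source positions
    z = alpha @ H                                  # expected (weighted-average) context
    return z, alpha
```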
MT with attention

▪ Basically we are using a neural network to tell us where another neural network should be looking!
▪ We can use it with RNNs, LSTMs or GRUs
▪ The encoder has the same structure as before
  ▪ Can be uni-directional
  ▪ Can be bi-directional
▪ The model can be trained using regular back-propagation through time, since all of the modules are differentiable
▪ Does it work?
MT with attention recap

▪ We get good translation results (especially for long sentences)
▪ We also get a (soft) alignment between sentences in different languages
▪ Extra interpretability of how the method is functioning
▪ How do we move to multimodal?
Visual captioning with soft attention

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Recap RNN for Captioning

▪ Example caption: "Bird in the sky"
▪ Why might we not want to focus only on the final layer? Looking at earlier convolutional layers gives more fine-grained, spatially localized features.

[Figure: at each step the decoder state $s_t$ produces a distribution $a_t$ over the $L$ image locations and an output word $d_t$; the context $z_t$ is the expectation over the $D$-dimensional features at those locations and is fed to the next step together with the previous word (starting from the first word $y_0$)]
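The same soft-attention computation can be applied to a grid of image features. A sketch (names and shapes are illustrative): the $L \times D$ feature matrix could be, for example, a $14 \times 14 \times 512$ convolutional map reshaped to $196 \times 512$.

```python
import numpy as np

def soft_attend(features, s, W_f, W_s, v):
    """features: (L, D) image features at L locations; s: decoder state (d_s,)."""
    scores = np.tanh(features @ W_f.T + s @ W_s.T) @ v   # one score per location, shape (L,)
    a = np.exp(scores - scores.max())
    a /= a.sum()                                         # distribution over the L locations
    z = a @ features                                     # expected D-dimensional feature (a "soft" glimpse)
    return z, a
```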
Soft attention

▪ Allows for latent data alignment


▪ Allows us to get an idea of what the network “sees”
▪ Can be optimized using back propagation

▪ Good at paper naming!


▪ Show, Attend and Tell (extension of Show and Tell)
▪ Listen, Attend and Walk
▪ Listen, Attend and Spell
▪ Ask, Attend and Answer
Spatial Transformer Networks
Some limitations of grid based attention

▪ Can we fixate on small parts of the image but still have easy end-to-end training?
Spatial Transformer Networks

Can we make this function differentiable?
Spatial Transformer Networks

Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input.

Can we make this function differentiable?

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
Spatial Transformer Networks

Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input.

Can we make this function differentiable?

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

The network "attends" to the input by predicting $\theta$.
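A minimal NumPy sketch of the affine coordinate mapping above plus bilinear sampling, the differentiable "cropping"; a real spatial transformer predicts theta with a small localization network, whereas here theta is simply passed in, and the function and variable names are illustrative.

```python
import numpy as np

def spatial_transform(image, theta, out_h, out_w):
    """image: (H, W); theta: (2, 3) affine parameters; returns an (out_h, out_w) output."""
    H, W = image.shape
    # Target (output) coordinates on a normalized [-1, 1] grid
    yt, xt = np.meshgrid(np.linspace(-1, 1, out_h), np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])   # (3, out_h*out_w)
    xs, ys = theta @ grid                                               # source coordinates
    # Map normalized source coordinates back to pixel indices
    xs = np.clip((xs + 1) * (W - 1) / 2, 0, W - 1)
    ys = np.clip((ys + 1) * (H - 1) / 2, 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    wx, wy = xs - x0, ys - y0
    # Bilinear interpolation of the four neighbouring pixels
    out = (image[y0, x0] * (1 - wx) * (1 - wy) + image[y0, x0 + 1] * wx * (1 - wy)
           + image[y0 + 1, x0] * (1 - wx) * wy + image[y0 + 1, x0 + 1] * wx * wy)
    return out.reshape(out_h, out_w)
```

For example, theta = [[0.5, 0, 0], [0, 0.5, 0]] samples a zoomed-in central crop; because the output is a smooth function of theta, gradients can flow back to whatever network predicts theta.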
Spatial Transformer Networks

Examples on real world data

▪ Results on traffic sign recognition

Code available: http://torch.ch/blog/2015/09/07/spatial_transformers.html

Recap on Spatial Transformer Networks

▪ Differentiable, so we can just use back-prop for training end-to-end
▪ Can use complex transformation models for focusing on an image
  ▪ Affine, piece-wise affine, perspective, thin plate splines
▪ Can be used to focus on certain parts of an image
▪ We can use it instead of grid-based soft and hard attention for multimodal tasks
Glimpse Network
(Hard Attention)
Hard attention

▪ Soft attention requires computing a representation for the whole image or sentence
▪ Hard attention, on the other hand, forces the model to look at only one part at a time
▪ The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
▪ A saccade followed by a glimpse – this is how the human visual system works

[Recurrent Models of Visual Attention, Mnih et al., 2014]
[Multiple Object Recognition with Visual Attention, Ba et al., 2015]
Hard attention examples
Glimpse Sensor

▪ Looks at a part of an image at different scales
▪ The different scales are combined into a single multi-channel image (a human-retina-like representation)
▪ Given a location $l_t$, it outputs an image summary at that location

[Recurrent Models of Visual Attention, Mnih et al., 2014]
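A rough sketch of such a glimpse sensor (the patch size, number of scales, and nearest-neighbour resizing are assumptions of this sketch): it extracts progressively larger patches centred at the given location, resizes them to a common resolution, and stacks them like a retina.

```python
import numpy as np

def glimpse(image, l, base=8, n_scales=3):
    """image: (H, W); l: (row, col) centre; returns a (n_scales, base, base) stack."""
    H, W = image.shape
    patches = []
    for s in range(n_scales):
        half = base * (2 ** s) // 2            # larger, coarser patch at each scale
        r0, c0 = max(l[0] - half, 0), max(l[1] - half, 0)
        patch = image[r0:min(l[0] + half, H), c0:min(l[1] + half, W)]
        # Nearest-neighbour downsample to base x base (crude but dependency-free)
        rows = np.linspace(0, patch.shape[0] - 1, base).astype(int)
        cols = np.linspace(0, patch.shape[1] - 1, base).astype(int)
        patches.append(patch[np.ix_(rows, cols)])
    return np.stack(patches)
```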
Glimpse network

▪ Combines the glimpse and the location of the glimpse into a joint network
▪ The glimpse is followed by a feedforward network (a CNN or a DNN)
▪ The exact formulation of how the location and appearance are combined varies; the important thing is combining "what" and "where"
▪ Differentiable with respect to the glimpse parameters, but not with respect to the location
Overall Architecture - Emission network

▪ Given an image, a glimpse location $l_t$, and optionally an action $a_t$
▪ The action can be:
  ▪ Some action in a dynamic system – pressing a button, etc.
  ▪ Classification of an object
  ▪ Word output
▪ This is an RNN with two output gates and a slightly more complex input gate!
Recurrent Model of Visual Attention (RAM)

▪ Sample locations of glimpses, leading to updates in the network
▪ Use gradient descent to update the weights (the glimpse network weights are differentiable)
▪ The emission network is an RNN
▪ Not as simple as back-propagation, but doable
▪ It turns out this is very similar to, and in some cases equivalent to, reinforcement learning with the REINFORCE learning rule [Williams, 1992]
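To make that connection concrete, here is a toy sketch of the REINFORCE (score-function) update for the location choice; a Gaussian policy centred on a predicted mean is one common choice, and everything here is illustrative rather than the exact RAM training procedure.

```python
import numpy as np

def reinforce_location_grad(mu, sigma, reward, baseline=0.0):
    """mu: predicted mean location (2,); returns (sampled location, gradient w.r.t. mu)."""
    l = np.random.normal(mu, sigma)                  # stochastic "hard" choice of where to look
    # gradient of log N(l; mu, sigma^2) w.r.t. mu, scaled by the baseline-subtracted reward
    grad_mu = (reward - baseline) * (l - mu) / (sigma ** 2)
    return l, grad_mu
```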
Multi-modal alignment recap
Multimodal alignment recap

▪ Explicit alignment – aligns two or more modalities (or views) as an actual task; the goal is to find correspondences between modalities
  ▪ Dynamic Time Warping
  ▪ Canonical Time Warping
  ▪ Deep Canonical Time Warping
▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Attention models
    ▪ Soft attention
    ▪ Spatial transformer networks
    ▪ Hard attention
