
Advanced Multimodal Machine Learning

Lecture 7.1: Multivariate Statistics and Coordinated Representations
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

1
Lecture Objectives

▪ Quick recap
▪ Temporal Joint Representation
▪ Multivariate statistical analysis
▪ Basic concepts (multivariate, covariance,…)
▪ Principal component analysis (+SVD)
▪ Canonical Correlation Analysis
▪ Deep Correlation Networks
▪ Deep CCA, DCCA-AutoEncoder
▪ (Deep) Correlational neural networks
▪ Matrix Factorization
▪ Nonnegative Matrix Factorization
Temporal Joint Representation
3
Sequence Representation with LSTM

[Diagram: an LSTM unrolled over time. Inputs 𝒙𝟏, 𝒙𝟐, 𝒙𝟑, …, 𝒙𝜏 are fed to the LSTM cell at each time step, producing outputs 𝒚𝟏, 𝒚𝟐, 𝒚𝟑, …, 𝒚𝜏]
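As a concrete reference point, here is a minimal sketch of a unimodal LSTM sequence encoder, assuming PyTorch (the class name SequenceEncoder and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Minimal sketch: encode a sequence x_1..x_tau into per-step outputs y_1..y_tau
# with a single-layer LSTM (PyTorch assumed; all names are illustrative).
class SequenceEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):                 # x: (batch, tau, input_dim)
        h, _ = self.lstm(x)               # h: (batch, tau, hidden_dim)
        return self.out(h)                # y: (batch, tau, output_dim)

# y = SequenceEncoder(64, 128, 10)(torch.randn(8, 20, 64))
```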
Multimodal Sequence Representation – Early Fusion

[Diagram: early fusion. At each time step the features of three modalities 𝒙𝒕(1), 𝒙𝒕(2), 𝒙𝒕(3) are concatenated into one input vector and fed to a single LSTM, producing outputs 𝒚𝟏, …, 𝒚𝜏]
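For comparison with the multi-view model on the following slides, a minimal early-fusion sketch (PyTorch assumed; names and dimensions are illustrative) simply concatenates the per-step features of each modality before a single LSTM:

```python
import torch
import torch.nn as nn

# Early-fusion sketch: concatenate the per-step features of each modality,
# then encode the fused sequence with one LSTM (illustrative names).
class EarlyFusionLSTM(nn.Module):
    def __init__(self, dims, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(sum(dims), hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, views):             # list of (batch, tau, dim_m) tensors
        x = torch.cat(views, dim=-1)      # fuse the modalities at the input level
        h, _ = self.lstm(x)
        return self.out(h)

# model = EarlyFusionLSTM([40, 20, 300], 128, 1)
# y = model([torch.randn(8, 50, 40), torch.randn(8, 50, 20), torch.randn(8, 50, 300)])
```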
Multi-View Long Short-Term Memory (MV-LSTM)

[Diagram: a Multi-View LSTM unrolled over time. Each modality 𝒙𝒕(1), 𝒙𝒕(2), 𝒙𝒕(3) enters its own region of the MV-LSTM cell at every time step, and the model produces outputs 𝒚𝟏, …, 𝒚𝜏]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Multi-View Long Short-Term Memory

[Diagram: MV-LSTM cell internals. The multi-view topology determines how the candidate inputs 𝒈𝒕(1), 𝒈𝒕(2), 𝒈𝒕(3) (tanh) and the sigmoid gates are computed from the view inputs 𝒙𝒕(1), 𝒙𝒕(2), 𝒙𝒕(3) and the previous hidden states 𝒉𝒕−𝟏(1), 𝒉𝒕−𝟏(2), 𝒉𝒕−𝟏(3). The cell keeps multiple memory cells 𝒄𝒕(1), 𝒄𝒕(2), 𝒄𝒕(3), one per view, yielding view-specific hidden states 𝒉𝒕(1), 𝒉𝒕(2), 𝒉𝒕(3)]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Topologies for Multi-View LSTM

Design parameters (a schematic code sketch follows after this slide):
▪ α: memory from the current view
▪ β: memory from the other views

Multi-view topologies:
▪ View-specific: α = 1, β = 0
▪ Fully-connected: α = 1, β = 1
▪ Coupled: α = 0, β = 1
▪ Hybrid: α = 2/3, β = 1/3

[Diagram: for each topology, the candidate input 𝒈𝒕(k) of view k is computed from 𝒙𝒕(k) and the selected subset of previous hidden states 𝒉𝒕−𝟏(1), 𝒉𝒕−𝟏(2), 𝒉𝒕−𝟏(3)]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
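The following is a schematic sketch of this gating idea, not the authors' exact MV-LSTM formulation: each view keeps its own memory cell, and α and β weight how much a view's gates attend to its own versus the other views' previous hidden states (PyTorch assumed; all names are illustrative):

```python
import torch
import torch.nn as nn

class MVLSTMCellSketch(nn.Module):
    """Schematic multi-view LSTM cell: one memory cell and hidden state per view.

    alpha weights a view's own previous hidden state, beta weights the other
    views' previous hidden states when forming the gate inputs (illustrative,
    not the authors' exact formulation).
    """
    def __init__(self, input_dims, hidden_dim, alpha=1.0, beta=1.0):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        n_views = len(input_dims)
        # One gate block per view: input, forget, output gates and candidate memory.
        self.gates = nn.ModuleList(
            nn.Linear(d + n_views * hidden_dim, 4 * hidden_dim) for d in input_dims
        )

    def forward(self, xs, hs, cs):
        # xs, hs, cs: lists with one (batch, dim) tensor per view.
        new_h, new_c = [], []
        for k, x in enumerate(xs):
            # Mix own-view and other-view memories according to the topology.
            mixed = [(self.alpha if j == k else self.beta) * h for j, h in enumerate(hs)]
            z = self.gates[k](torch.cat([x] + mixed, dim=-1))
            i, f, o, g = z.chunk(4, dim=-1)
            c = torch.sigmoid(f) * cs[k] + torch.sigmoid(i) * torch.tanh(g)
            new_c.append(c)
            new_h.append(torch.sigmoid(o) * torch.tanh(c))
        return new_h, new_c
```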
Multi-View Long Short-Term Memory (MV-LSTM)

Multimodal prediction of children's engagement

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Quick Recap
10
Multimodal Representation Learning

Learn (unsupervised) a joint representation between multiple
modalities where similar unimodal concepts are closely projected.

❑ Deep Multimodal Boltzmann machines

[Diagram: text input 𝑿 and image input 𝒀 pass through modality-specific layers into a shared joint layer, topped by a softmax]
11
Multimodal Representation Learning

Learn (unsupervised) a joint representation between multiple
modalities where similar unimodal concepts are closely projected.

❑ Deep Multimodal Boltzmann machines
❑ Stacked Autoencoder

[Diagram: stacked autoencoder. Text 𝑿 and image 𝒀 are encoded into a shared layer and decoded back into reconstructions 𝑿′ and 𝒀′]
12
Multimodal Representation Learning

Learn (unsupervised) a joint representation between multiple
modalities where similar unimodal concepts are closely projected.

❑ Deep Multimodal Boltzmann machines
❑ Stacked Autoencoder
❑ Encoder-Decoder

[Diagram: an encoder-decoder architecture connecting text 𝑿 and image 𝒀 through a shared intermediate representation]
13
Multimodal Representation Learning

Learn (unsupervised) a joint representation between multiple
modalities where similar unimodal concepts are closely projected.

❑ Deep Multimodal Boltzmann machines
❑ Stacked Autoencoder
❑ Encoder-Decoder
❑ Tensor Fusion representation

[Diagram: unimodal representations 𝒉𝒙 and 𝒉𝒚 of text 𝑿 and image 𝒀 are combined into a bimodal representation 𝒉𝒎, followed by a softmax for the prediction task (e.g., sentiment)]

How Can We Learn Better Representations?
14
Coordinated Multimodal Representations
15
Coordinated multimodal embeddings

▪ Instead of projecting to a joint space, enforce similarity between the
  unimodal embeddings

[Diagram: Modality 1 and Modality 2 are each encoded into their own representations (Repres. 1 and Repres. 2), which are coordinated through a similarity constraint]
Coordinated Multimodal Representations

Learn (unsupervised) two or more coordinated representations from
multiple modalities. A loss function is defined to bring these
representations closer together, using a similarity metric
(e.g., cosine distance).

[Diagram: text 𝑿 and image 𝒀 are encoded by separate networks into coordinated representations connected by the similarity metric]
17
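As an illustration of how such a similarity constraint can be trained, here is a minimal max-margin sketch over cosine similarity (PyTorch assumed; this is a generic ranking loss in the spirit of coordinated embeddings, not the exact loss of any cited paper):

```python
import torch
import torch.nn.functional as F

# Coordinated-embedding loss sketch (illustrative): pull matching text/image
# pairs together under cosine similarity and push mismatched pairs apart by a margin.
def coordinated_margin_loss(h_x, h_y, margin=0.2):
    h_x = F.normalize(h_x, dim=-1)            # (batch, d) text embeddings
    h_y = F.normalize(h_y, dim=-1)            # (batch, d) image embeddings
    sim = h_x @ h_y.t()                       # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)             # similarity of matching pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    cost = (margin + sim - pos).clamp(min=0) * mask   # hinge on mismatched pairs only
    return cost.mean()

# loss = coordinated_margin_loss(torch.randn(32, 128), torch.randn(32, 128))
```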
Coordinated Multimodal Embeddings

[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Multimodal Vector Space Arithmetic

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Structured coordinated embeddings

▪ Instead of (or in addition to) a similarity constraint, enforce an
  alternative structure on the coordinated space

[Vendrov et al., Order-Embeddings of Images and Language, 2016]
[Jiang and Li, Deep Cross-Modal Hashing]
Multivariate Statistical Analysis
22
Multivariate Statistical Analysis

“Statistical approaches to understand the


relationships in high dimensional data”

▪ Example of multivariate analysis approaches:


▪ Multivariate analysis of variance (MANOVA)
▪ Principal components analysis (PCA)
▪ Factor analysis
▪ Linear discriminant analysis (LDA)
▪ Canonical correlation analysis (CCA)
Random Variables

Definition: A variable whose possible values are numerical outcomes
of a random phenomenon.
❑ A discrete random variable may take on only a countable number of
  distinct values, such as 0, 1, 2, 3, 4, …
❑ A continuous random variable takes an uncountably infinite number
  of possible values.

Examples of random variables (discrete or continuous? correlated?):
• Someone's age
• Someone's height
• Someone's weight
24
Definitions
Given two random variables X and Y:

Expected value: the probability-weighted average of all possible values
    \mu = E[X] = \sum_i x_i P(x_i)
➢ If all observations x_i are equally probable, this equals the arithmetic mean

Variance: measures the spread of the observations
    \sigma^2 = \mathrm{Var}(X) = E[(X - \mu)(X - \mu)] = E[\bar{X}\bar{X}]  (if the data is centered, \bar{X} = X - \mu)
➢ The variance is the square of the standard deviation \sigma

Covariance: measures how much two random variables change together
    \mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[\bar{X}\bar{Y}]

25
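A quick numerical check of these definitions (NumPy assumed; the sample values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

mu_x = x.mean()                                    # expected value (equal weights)
var_x = ((x - mu_x) ** 2).mean()                   # population variance E[X_bar X_bar]
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()  # covariance E[X_bar Y_bar]

print(mu_x, var_x, cov_xy)                         # 5.0 5.0 2.75
```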
Definitions

Pearson correlation: measures the extent to which two variables have
a linear relationship with each other
    \rho_{X,Y} = \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}

26
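The same check extended to the Pearson correlation (NumPy assumed):

```python
import numpy as np

# Pearson correlation computed directly from the definition above.
def pearson_corr(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
print(pearson_corr(x, y))          # ~0.83, matches np.corrcoef(x, y)[0, 1]
```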
Pearson Correlation Examples

27
Definitions
Multivariate (multidimensional) random variables (aka random vectors):
    X = [X_1, X_2, X_3, \dots, X_M]
    Y = [Y_1, Y_2, Y_3, \dots, Y_N]

Covariance matrix: generalizes the notion of variance
    \Sigma_X = \Sigma_{X,X} = \mathrm{var}(X) = E[(X - E[X])(X - E[X])^T] = E[\bar{X}\bar{X}^T]

Cross-covariance matrix: generalizes the notion of covariance
    \Sigma_{X,Y} = \mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])^T] = E[\bar{X}\bar{Y}^T]

28
Definitions
Multivariate (multidimensional) random variables (aka random vectors):
    X = [X_1, X_2, X_3, \dots, X_M]
    Y = [Y_1, Y_2, Y_3, \dots, Y_N]

Covariance matrix: generalizes the notion of variance
    \Sigma_X = \Sigma_{X,X} = \mathrm{var}(X) = E[(X - E[X])(X - E[X])^T] = E[\bar{X}\bar{X}^T]

Cross-covariance matrix: generalizes the notion of covariance
    \Sigma_{X,Y} = \mathrm{cov}(X, Y) =
    \begin{bmatrix}
      \mathrm{cov}(X_1, Y_1) & \mathrm{cov}(X_2, Y_1) & \cdots & \mathrm{cov}(X_M, Y_1) \\
      \mathrm{cov}(X_1, Y_2) & \mathrm{cov}(X_2, Y_2) & \cdots & \mathrm{cov}(X_M, Y_2) \\
      \vdots                 & \vdots                 & \ddots & \vdots                 \\
      \mathrm{cov}(X_1, Y_N) & \mathrm{cov}(X_2, Y_N) & \cdots & \mathrm{cov}(X_M, Y_N)
    \end{bmatrix}
29
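In code, the covariance and cross-covariance matrices of centered data reduce to simple matrix products (NumPy assumed; rows are dimensions, columns are samples):

```python
import numpy as np

n = 1000
X = np.random.randn(3, n)            # M = 3 dimensional random vector, n samples
Y = np.random.randn(2, n)            # N = 2 dimensional random vector, n samples
Xc = X - X.mean(axis=1, keepdims=True)
Yc = Y - Y.mean(axis=1, keepdims=True)

Sigma_XX = Xc @ Xc.T / n             # (M x M) covariance matrix, E[X_bar X_bar^T]
Sigma_XY = Xc @ Yc.T / n             # (M x N) cross-covariance matrix, E[X_bar Y_bar^T]
print(Sigma_XX.shape, Sigma_XY.shape)   # (3, 3) (3, 2)
```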
Definitions – Matrix Operations

Trace: the sum of the elements on the main diagonal of a square matrix X
    \mathrm{tr}(X) = \sum_{i=1}^{n} x_{ii}

30
Principal component analysis

PCA converts a set of observations of possibly correlated variables
into a set of values of linearly uncorrelated variables called
principal components
▪ The eigenvectors are orthogonal to each other and have unit length
▪ The first few eigenvectors explain most of the variance observed in
  the data
▪ Components with low eigenvalues can be omitted with little loss of
  information

31
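A minimal PCA-via-SVD sketch on centered data (NumPy assumed; the function name and return values are illustrative):

```python
import numpy as np

# Project centered data onto its top-k principal components using the SVD.
def pca(X, k):
    """X: (n_samples, n_features). Returns (projections, components, explained variances)."""
    Xc = X - X.mean(axis=0)                   # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # top-k principal directions
    explained_var = (S ** 2) / (len(X) - 1)   # eigenvalues of the covariance matrix
    return Xc @ components.T, components, explained_var[:k]

Z, W, ev = pca(np.random.randn(200, 5), k=2)
print(Z.shape, W.shape)                       # (200, 2) (2, 5)
```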
Eigenvalues and Eigenvectors
Eigenvalue decomposition
If A is an n×n matrix, do there exist nonzero vectors x in R^n such
that Ax is a scalar multiple of x?
➢ (The term eigenvalue is from the German word Eigenwert,
meaning "proper value")

Eigenvalue equation:
    A x = \lambda x
    A: an n×n matrix
    \lambda: a scalar eigenvalue (could be zero)
    x: a nonzero eigenvector in R^n

[Diagram: geometric interpretation. Applying A to an eigenvector x only rescales it by \lambda]
Singular Value Decomposition (SVD)

▪ SVD expresses any matrix A as
    A = U S V^T

▪ The columns of U are eigenvectors of A A^T, and the columns of V
  are eigenvectors of A^T A:
    A A^T u_i = s_i^2 u_i
    A^T A v_i = s_i^2 v_i
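These relations are easy to verify numerically (NumPy assumed):

```python
import numpy as np

# Quick numerical check of the SVD relations above (illustrative).
A = np.random.randn(4, 3)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(S) @ Vt))               # A = U S V^T
# Columns of U are eigenvectors of A A^T with eigenvalues s_i^2:
print(np.allclose(A @ A.T @ U[:, 0], S[0] ** 2 * U[:, 0]))
# Columns of V are eigenvectors of A^T A:
print(np.allclose(A.T @ A @ Vt[0], S[0] ** 2 * Vt[0]))
```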
Canonical Correlation Analysis
34
Multi-view Learning

    𝑿                            𝒀
    demographic properties       responses to a survey
    audio features at time i     video features at time i

35
Canonical Correlation Analysis

"canonical": reduced to the simplest or clearest schema possible

1  Learn two linear projections, one for each view, that are
   maximally correlated:
    u^*, v^* = \mathrm{argmax}_{u,v}\, \mathrm{corr}(H_x, H_y)
             = \mathrm{argmax}_{u,v}\, \mathrm{corr}(u^T X, v^T Y)

[Diagram: text 𝑿 and image 𝒀 are linearly projected by 𝑼 and 𝑽 into 𝑯𝒙 and 𝑯𝒚; a scatter plot shows the projection of X against the projection of Y]
36
Correlated Projection

1  Learn two linear projections, one for each view, that are
   maximally correlated:
    u^*, v^* = \mathrm{argmax}_{u,v}\, \mathrm{corr}(u^T X, v^T Y)

[Diagram: two views 𝑿 and 𝒀, where the same instances share the same color; the projection directions 𝒖 and 𝒗 align the corresponding instances]

37
Canonical Correlation Analysis

1  Learn two linear projections, one for each view, that are
   maximally correlated:
    u^*, v^* = \mathrm{argmax}_{u,v}\, \mathrm{corr}(u^T X, v^T Y)

    = \mathrm{argmax}_{u,v}\, \frac{\mathrm{cov}(u^T X, v^T Y)}{\sqrt{\mathrm{var}(u^T X)\,\mathrm{var}(v^T Y)}}

    = \mathrm{argmax}_{u,v}\, \frac{u^T X Y^T v}{\sqrt{u^T X X^T u \; v^T Y Y^T v}}
      (if both X, Y have zero mean, \mu_X = 0 and \mu_Y = 0, then \Sigma_{XY} = \mathrm{cov}(X, Y) = X Y^T)

    = \mathrm{argmax}_{u,v}\, \frac{u^T \Sigma_{XY} v}{\sqrt{u^T \Sigma_{XX} u \; v^T \Sigma_{YY} v}}
38
Canonical Correlation Analysis

We want to learn multiple projection pairs (u_{(i)}^T X, v_{(i)}^T Y):

    u_{(i)}^*, v_{(i)}^* = \mathrm{argmax}_{u_{(i)}, v_{(i)}}\, \frac{u_{(i)}^T \Sigma_{XY} v_{(i)}}{\sqrt{u_{(i)}^T \Sigma_{XX} u_{(i)} \; v_{(i)}^T \Sigma_{YY} v_{(i)}}}

2  We want these multiple projection pairs to be orthogonal
   ("canonical") to each other:

    u_{(i)}^T \Sigma_{XY} v_{(j)} = u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \quad \text{for } i \neq j

    \sum_i u_{(i)}^T \Sigma_{XY} v_{(i)} = \mathrm{tr}(U^T \Sigma_{XY} V)
    \quad \text{where } U = [u_{(1)}, u_{(2)}, \dots, u_{(k)}] \text{ and } V = [v_{(1)}, v_{(2)}, \dots, v_{(k)}]
39
Canonical Correlation Analysis

    U^*, V^* = \mathrm{argmax}_{U,V}\, \frac{\mathrm{tr}(U^T \Sigma_{XY} V)}{\sqrt{U^T \Sigma_{XX} U \; V^T \Sigma_{YY} V}}

3  Since this objective function is invariant to scaling, we can
   constrain the projections to have unit variance:
    U^T \Sigma_{XX} U = I \qquad V^T \Sigma_{YY} V = I

Canonical Correlation Analysis:
    maximize:   \mathrm{tr}(U^T \Sigma_{XY} V)
    subject to: U^T \Sigma_{XX} U = V^T \Sigma_{YY} V = I, \quad u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \text{ for } i \neq j

40
Canonical Correlation Analysis
    maximize:   \mathrm{tr}(U^T \Sigma_{XY} V)
    subject to: U^T \Sigma_{XX} U = V^T \Sigma_{YY} V = I, \quad u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \text{ for } i \neq j

After projecting with U and V, the joint covariance matrix
    \Sigma = \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix}
becomes
    \begin{bmatrix} I & \Lambda \\ \Lambda & I \end{bmatrix}
    \quad \text{with } \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \dots)
so the within-view blocks are identity and the cross-view block is diagonal
with the canonical correlations.

41
Canonical Correlation Analysis
    maximize:   \mathrm{tr}(U^T \Sigma_{XY} V)
    subject to: U^T \Sigma_{XX} U = V^T \Sigma_{YY} V = I, \quad u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \text{ for } i \neq j

How to solve it? ➢ Lagrange multipliers!

Lagrange function:
    L = \mathrm{tr}(U^T \Sigma_{XY} V) + \alpha (U^T \Sigma_{XX} U - I) + \beta (V^T \Sigma_{YY} V - I)

➢ Then find the stationary points of L:  \partial L / \partial U = 0, \quad \partial L / \partial V = 0

    \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \, U = \lambda U
    \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \, V = \lambda V \qquad \text{where } \lambda = 4\alpha\beta

42
Canonical Correlation Analysis
    maximize:   \mathrm{tr}(U^T \Sigma_{XY} V)
    subject to: U^T \Sigma_{XX} U = V^T \Sigma_{YY} V = I, \quad u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \text{ for } i \neq j

Eigenvalue equations:
    \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \, U = \lambda U
    \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \, V = \lambda V \qquad \text{where } \lambda = 4\alpha\beta

➢ These eigenvalue equations can be solved with a Singular Value
  Decomposition (SVD). Define
    T \triangleq \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}
  The singular values of T are the canonical correlations (eigenvalues)
  and its singular vectors give the projections (eigenvectors):
    U^*, V^* = (\Sigma_{XX}^{-1/2} U_{SVD}, \; \Sigma_{YY}^{-1/2} V_{SVD})

43
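Putting the recipe together, a compact linear-CCA sketch (NumPy assumed; the small ridge term reg is an implementation choice for numerical stability, not part of the derivation):

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    # Inverse matrix square root via eigendecomposition (S symmetric PSD).
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ Q.T

def linear_cca(X, Y, k, reg=1e-4):
    """X: (dx, n), Y: (dy, n) with samples in columns. Returns U, V, canonical correlations."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / n + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / n + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / n
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)      # T = Sxx^{-1/2} Sxy Syy^{-1/2}
    Usvd, s, Vt = np.linalg.svd(T)
    U = inv_sqrt(Sxx) @ Usvd[:, :k]              # U* = Sxx^{-1/2} U_SVD
    V = inv_sqrt(Syy) @ Vt.T[:, :k]              # V* = Syy^{-1/2} V_SVD
    return U, V, s[:k]                           # s holds the canonical correlations

# U, V, corrs = linear_cca(np.random.randn(10, 500), np.random.randn(8, 500), k=3)
```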
Canonical Correlation Analysis
    maximize:   \mathrm{tr}(U^T \Sigma_{XY} V)
    subject to: U^T \Sigma_{XX} U = V^T \Sigma_{YY} V = I, \quad u_{(j)}^T \Sigma_{XY} v_{(i)} = 0 \text{ for } i \neq j

1  Linear projections maximizing correlation
2  Orthogonal projections
3  Unit variance of the projection vectors

[Diagram: text 𝑿 and image 𝒀 are linearly projected by 𝑼 and 𝑽 into 𝑯𝒙 and 𝑯𝒚; a scatter plot shows the projection of X against the projection of Y]
44
Exploring Deep Correlation Networks
45
Deep Canonical Correlation Analysis

Same objective function as CCA:
    \mathrm{argmax}_{V, U, W_x, W_y}\, \mathrm{corr}(H_x, H_y)

And we need to compute the gradients:
    \frac{\partial\, \mathrm{corr}(H_x, H_y)}{\partial U}, \qquad \frac{\partial\, \mathrm{corr}(H_x, H_y)}{\partial V}

[Diagram: text 𝑿 and image 𝒀 are encoded by deep networks with weights 𝑾𝒙 and 𝑾𝒚, then projected by 𝑼 and 𝑽 into views 𝑯𝒙 and 𝑯𝒚 whose correlation is maximized]

Andrew et al., ICML 2013
46
Deep Canonical Correlation Analysis

Training procedure:
1. Pre-train the model parameters using denoising autoencoders

[Diagram: each modality network is first trained to reconstruct its input (text 𝑿 → 𝑿′, image 𝒀 → 𝒀′) before the CCA objective is applied to 𝑯𝒙 and 𝑯𝒚]

Andrew et al., ICML 2013
47
Deep Canonical Correlation Analysis

Training procedure:
1. Pre-train the model parameters using denoising autoencoders
2. Optimize the CCA objective function using large mini-batches or
   full-batch optimization (L-BFGS)

[Diagram: deep networks with weights 𝑾𝒙, 𝑾𝒚 and projections 𝑼, 𝑽 producing the correlated views 𝑯𝒙 and 𝑯𝒚]

Andrew et al., ICML 2013
48
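A compact sketch of a DCCA-style training step (PyTorch assumed; illustrative, not the reference implementation): the negative sum of canonical correlations of the two network outputs is used as the loss, computed with the whitening-then-SVD recipe from the CCA slides, with a small ridge term added for stability:

```python
import torch
import torch.nn as nn

# DCCA-style correlation objective (illustrative sketch): two modality networks
# produce H_x, H_y; the loss is minus the sum of the canonical correlations.
def neg_total_correlation(Hx, Hy, reg=1e-3):
    n = Hx.shape[0]                                   # Hx, Hy: (batch, d)
    Hx = Hx - Hx.mean(0); Hy = Hy - Hy.mean(0)
    Sxx = Hx.t() @ Hx / n + reg * torch.eye(Hx.shape[1])
    Syy = Hy.t() @ Hy / n + reg * torch.eye(Hy.shape[1])
    Sxy = Hx.t() @ Hy / n
    def inv_sqrt(S):
        w, Q = torch.linalg.eigh(S)
        return Q @ torch.diag(w.clamp_min(1e-8).rsqrt()) @ Q.t()
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return -torch.linalg.svdvals(T).sum()             # maximize total correlation

net_x = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 10))
net_y = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(list(net_x.parameters()) + list(net_y.parameters()), lr=1e-3)

x, y = torch.randn(256, 40), torch.randn(256, 300)    # one large mini-batch
loss = neg_total_correlation(net_x(x), net_y(y))
opt.zero_grad(); loss.backward(); opt.step()
```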
Deep Canonically Correlated Autoencoders (DCCAE)

Jointly optimize the DCCA and autoencoder loss functions
➢ A trade-off between the multi-view correlation and the
  reconstruction error of the individual views

[Diagram: the coordinated representations 𝑯𝒙 and 𝑯𝒚 are trained with the CCA objective while decoders reconstruct the inputs (text 𝑿 → 𝑿′, image 𝒀 → 𝒀′)]

Wang et al., ICML 2015
49
Deep Correlational Neural Network

1. Learn a shallow CCA autoencoder (similar to a one-layer DCCAE model)
2. Use the learned weights to initialize the autoencoder layer
3. Repeat the procedure

Chandar et al., Neural Computation, 2015


Matrix Factorization
51
Data Clustering

How do we discover groups in our data?

K-means is a simple clustering algorithm based on competitive learning
• Iterative approach:
  o Assign each data point to one cluster (based on a distance metric)
  o Update the cluster centers
  o Repeat until convergence
• "Winner takes all"
52
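A minimal k-means sketch of the two alternating steps above (NumPy assumed; names are illustrative):

```python
import numpy as np

# Alternate the two steps on the slide: assign each point to its nearest
# center, then update each center as the mean of its assigned points.
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center under Euclidean distance ("winner takes all").
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # converged
            break
        centers = new_centers
    return labels, centers

# labels, centers = kmeans(np.random.randn(500, 2), k=3)
```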
Enforcing Data Clustering in Deep Networks

How can we enforce data clustering in our (multimodal) deep
learning algorithms?

[Diagram: text 𝑿 and image 𝒀 encoded by deep networks into representations on which a clustering structure could be enforced]
53
Nonnegative Matrix Factorization (NMF)

Given: a nonnegative n × m matrix 𝑿 (all entries ≥ 0)

[Diagram: 𝑿 ≈ 𝑭 𝑮]

Want: nonnegative matrices 𝑭 (n × r) and 𝑮 (r × m) such that 𝑿 ≈ 𝑭𝑮
➢ easier to interpret
➢ provides better results in information retrieval and clustering

54
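A minimal NMF sketch using Lee-and-Seung-style multiplicative updates, a standard algorithm assumed here rather than taken from the slides (NumPy; it minimizes the Frobenius reconstruction error while keeping F and G nonnegative):

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, r)) + eps
    G = rng.random((r, m)) + eps
    for _ in range(n_iter):
        G *= (F.T @ X) / (F.T @ F @ G + eps)   # update G, stays nonnegative
        F *= (X @ G.T) / (F @ G @ G.T + eps)   # update F, stays nonnegative
    return F, G

X = np.abs(np.random.randn(20, 30))
F, G = nmf(X, r=5)
print(np.linalg.norm(X - F @ G))               # reconstruction error
```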
Semi-NMF and Other Extensions

Ding et al., TPAMI 2015
55
Deep Matrix Factorization

Li and Tang, MMML 2015

56
Deep Semi-NMF Model

Trigeorgis et al., TPAMI 2015

57
Multivariate Statistics

▪ Multivariate analysis of variance (MANOVA)


▪ Principal components analysis (PCA)
▪ Factor analysis
▪ Linear discriminant analysis (LDA)
▪ Canonical correlation analysis (CCA)
▪ Correspondence analysis
▪ Canonical correspondence analysis
▪ Multidimensional scaling
▪ Multivariate regression
▪ Discriminant analysis

58
