Lecture 7.1: Coordinated Representations
Lecture Objectives
▪ Quick recap
▪ Temporal Joint Representation
▪ Multivariate statistical analysis
▪ Basic concepts (multivariate, covariance,…)
▪ Principal component analysis (+SVD)
▪ Canonical Correlation Analysis
▪ Deep Correlation Networks
▪ Deep CCA, DCCA-AutoEncoder
▪ (Deep) Correlational neural networks
▪ Matrix Factorization
▪ Nonnegative Matrix Factorization
Temporal Joint Representation
Sequence Representation with LSTM
[Figure: an LSTM unrolled over time, mapping an input sequence $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \dots, \mathbf{x}_\tau$ to a sequence of hidden representations $\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3, \dots, \mathbf{y}_\tau$]
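A minimal PyTorch sketch of this unimodal sequence encoder (the layer sizes, the batch layout, and the use of the final hidden state as a sequence summary are illustrative assumptions, not from the slide):

```python
import torch
import torch.nn as nn

# Toy unimodal sequence: batch of 4 sequences, 20 time steps, 32 features each.
x = torch.randn(4, 20, 32)

# LSTM that maps inputs x_1..x_tau to hidden states y_1..y_tau.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
y, (h_n, c_n) = lstm(x)

print(y.shape)    # (4, 20, 64): one representation y_t per time step
print(h_n.shape)  # (1, 4, 64): final hidden state, usable as a sequence summary
```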
Multimodal Sequence Representation – Early Fusion
[Figure: early fusion, where the input sequences of the modalities are concatenated at each time step and fed to a single LSTM producing the joint representations $\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3, \dots, \mathbf{y}_\tau$]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Multi-View Long Short-Term Memory
[Figure: MV-LSTM cell, where the view-specific inputs $\mathbf{x}_t^{(1)}, \mathbf{x}_t^{(2)}, \mathbf{x}_t^{(3)}$ are processed by view-specific sigmoid gates inside a shared memory cell]
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Topologies for Multi-View LSTM

Design parameters:
▪ α: memory from the current view
▪ β: memory from the other views

Multi-view topologies:
▪ View-specific: α = 1, β = 0
▪ Fully-connected: α = 1, β = 1
▪ Coupled: α = 0, β = 1
▪ Hybrid: α = 2/3, β = 1/3

[Figure: for each topology, the MV-LSTM gate $\mathbf{g}_t^{(k)}$ of view $k$ (tanh) combines the input $\mathbf{x}_t^{(k)}$ with the previous hidden states $\mathbf{h}_{t-1}^{(1)}, \mathbf{h}_{t-1}^{(2)}, \mathbf{h}_{t-1}^{(3)}$, apportioned according to α and β]

[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
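A rough numpy sketch of the gating idea (a simplification: the paper partitions the memory cell by view according to α and β, whereas here the previous hidden states are simply blended in those proportions; all names and sizes are illustrative assumptions):

```python
import numpy as np

def mv_gate(x_k, h_prev, k, W_in, W_hid, alpha, beta):
    """Candidate memory for view k, in the spirit of the MV-LSTM gate.

    x_k:    input of view k at time t
    h_prev: list of previous hidden states h_{t-1}^(j), one per view
    alpha:  weight on memory from the current view
    beta:   weight on memory from the other views
    """
    others = [h for j, h in enumerate(h_prev) if j != k]
    # alpha=1, beta=0 -> view-specific; alpha=0, beta=1 -> coupled; etc.
    h_mix = alpha * h_prev[k] + beta * np.mean(others, axis=0)
    return np.tanh(W_in @ x_k + W_hid @ h_mix)

rng = np.random.default_rng(0)
d_in, d = 8, 16
h_prev = [rng.standard_normal(d) for _ in range(3)]       # three views
W_in = rng.standard_normal((d, d_in))
W_hid = rng.standard_normal((d, d))
g = mv_gate(rng.standard_normal(d_in), h_prev, k=0,
            W_in=W_in, W_hid=W_hid, alpha=2/3, beta=1/3)  # hybrid topology
```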
Multi-View Long Short-Term Memory (MV-LSTM)
[Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016]
Quick Recap
Multimodal Representation Learning
[Figure: two separate deep networks, one over text $\mathbf{X}$ and one over image $\mathbf{Y}$]
Multimodal Representation Learning
Learn (unsupervised) a joint representation between multiple modalities where similar unimodal concepts are closely projected.
❑ Deep Multimodal Boltzmann machines
❑ Stacked Autoencoder
[Figure: a multimodal stacked autoencoder, where text $\mathbf{X}$ and image $\mathbf{Y}$ are encoded into a shared representation and decoded back into reconstructions $\mathbf{X}'$ and $\mathbf{Y}'$]
Multimodal Representation Learning
❑ Deep Multimodal Boltzmann machines
❑ Stacked Autoencoder
❑ Encoder-Decoder
[Figure: encoder-decoder, where one modality is encoded into a joint representation from which the other modality is decoded]
Multimodal Representation Learning
❑ Encoder-Decoder
❑ Tensor Fusion representation
[Figure: tensor fusion, a joint representation built from Representation 1 (Modality 1) and Representation 2 (Modality 2)]
Coordinated Multimodal Representations
[Figure: separate networks over text $\mathbf{X}$ and image $\mathbf{Y}$ learn representations that are kept in coordinated spaces rather than fused into a single joint one]
Coordinated Multimodal Embeddings
[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic
In a joint visual-semantic embedding space, vector arithmetic can mix modalities: for example, the embedding of an image of a blue car, minus "blue", plus "red", retrieves images of red cars.
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Structured coordinated embeddings
Definitions
Given two random variables $X$ and $Y$:
Expected value: the probability-weighted average of all possible values
$$\mu = E[X] = \sum_i x_i \, P(x_i)$$
➢ If all observations $x_i$ have the same probability, this equals the arithmetic mean
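A quick numerical check of this definition (the toy distribution below is invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # possible values x_i
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities P(x_i), summing to 1

mu = np.sum(x * p)                   # probability-weighted average
print(mu)                            # 3.0

# With equal probabilities, the expected value reduces to the arithmetic mean.
print(np.sum(x * 0.25), x.mean())    # 2.5  2.5
```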
Definitions
Variance: the expected squared deviation from the mean
$$\sigma^2 = \mathrm{var}(X) = E\big[(X - \mu)^2\big]$$
Covariance: how two random variables vary together
$$\mathrm{cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$
Pearson correlation: covariance normalized to lie in $[-1, 1]$
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
Pearson Correlation Examples
[Figure: scatter plots of $(X, Y)$ pairs with their Pearson correlation coefficients, ranging from strongly negative to strongly positive]
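A small numpy sketch in the same spirit (synthetic data; np.corrcoef returns the matrix of pairwise Pearson correlations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

y_pos = 0.9 * x + 0.1 * rng.standard_normal(1000)  # strongly positively correlated
y_neg = -x + 0.5 * rng.standard_normal(1000)       # negatively correlated
y_ind = rng.standard_normal(1000)                  # (nearly) uncorrelated

for y in (y_pos, y_neg, y_ind):
    r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient
    print(round(r, 2))
```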
Definitions
Multivariate (multidimensional) random variables (aka random vectors):
$$\mathbf{X} = [X_1, X_2, X_3, \dots, X_M], \qquad \mathbf{Y} = [Y_1, Y_2, Y_3, \dots, Y_N]$$
The covariance matrix generalizes the notion of variance:
$$\Sigma_{\mathbf{X}} = \Sigma_{\mathbf{X},\mathbf{X}} = \mathrm{var}(\mathbf{X}) = E\big[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\big] = E[\bar{\mathbf{X}}\bar{\mathbf{X}}^T]$$
The cross-covariance matrix generalizes the notion of covariance:
$$\Sigma_{\mathbf{X},\mathbf{Y}} = \mathrm{cov}(\mathbf{X}, \mathbf{Y}) = E\big[(\mathbf{X} - E[\mathbf{X}])(\mathbf{Y} - E[\mathbf{Y}])^T\big] = E[\bar{\mathbf{X}}\bar{\mathbf{Y}}^T]$$
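Both definitions in a few lines of numpy (the sample data is arbitrary; np.cov is used only to cross-check the hand-computed covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 500))        # M=3 variables, 500 observations
Y = rng.standard_normal((2, 500))        # N=2 variables, 500 observations

Xc = X - X.mean(axis=1, keepdims=True)   # X - E[X]
Yc = Y - Y.mean(axis=1, keepdims=True)   # Y - E[Y]

Sigma_XX = Xc @ Xc.T / (X.shape[1] - 1)  # 3x3 covariance matrix
Sigma_XY = Xc @ Yc.T / (X.shape[1] - 1)  # 3x2 cross-covariance matrix

print(np.allclose(Sigma_XX, np.cov(X)))  # True: matches numpy's estimator
```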
Definitions
The trace of a square matrix is the sum of its diagonal entries:
$$\mathrm{tr}(\mathbf{X}) = \sum_{i=1}^{n} x_{ii}$$
Principal component analysis
Eigenvalues and Eigenvectors
Eigenvalue decomposition: if $A$ is an $n \times n$ matrix, do there exist nonzero vectors $\mathbf{x}$ in $\mathbb{R}^n$ such that $A\mathbf{x}$ is a scalar multiple of $\mathbf{x}$?
➢ (The term eigenvalue is from the German word Eigenwert, meaning “proper value”)
Eigenvalue equation:
$$A\mathbf{x} = \lambda\mathbf{x}$$
▪ $A$: an $n \times n$ matrix
▪ $\lambda$: a scalar eigenvalue (could be zero)
▪ $\mathbf{x}$: a nonzero eigenvector in $\mathbb{R}^n$
[Figure: geometric interpretation, where multiplying an eigenvector $\mathbf{x}$ by $A$ rescales it by $\lambda$ without changing its direction]
Singular Value Decomposition (SVD)
$$\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^T$$
The left and right singular vectors are eigenvectors of $\mathbf{A}\mathbf{A}^T$ and $\mathbf{A}^T\mathbf{A}$ respectively:
$$\mathbf{A}\mathbf{A}^T\mathbf{u}_i = s_i^2\,\mathbf{u}_i, \qquad \mathbf{A}^T\mathbf{A}\,\mathbf{v}_i = s_i^2\,\mathbf{v}_i$$
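A minimal numpy check of these relations (the matrix is random; note that numpy returns $\mathbf{V}^T$ as Vt):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

# Columns of U are eigenvectors of A A^T with eigenvalues s_i^2 ...
print(np.allclose(A @ A.T @ U, U * s**2))
# ... and columns of V are eigenvectors of A^T A with the same eigenvalues.
print(np.allclose(A.T @ A @ Vt.T, Vt.T * s**2))
```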
Canonical Correlation Analysis
Multi-view Learning
[Figure: two views of the same data, $\mathbf{X}$ and $\mathbf{Y}$]
Canonical Correlation Analysis
Learn two linear projections, one for each view, that are maximally correlated:
$$(\mathbf{u}^*, \mathbf{v}^*) = \underset{\mathbf{u},\mathbf{v}}{\operatorname{argmax}}\ \mathrm{corr}\big(\mathbf{H}_x, \mathbf{H}_y\big) = \underset{\mathbf{u},\mathbf{v}}{\operatorname{argmax}}\ \mathrm{corr}\big(\mathbf{u}^T\mathbf{X}, \mathbf{v}^T\mathbf{Y}\big)$$
[Figure: text $\mathbf{X}$ and image $\mathbf{Y}$ are linearly projected by $\mathbf{U}$ and $\mathbf{V}$ into the coordinated spaces $\mathbf{H}_x$ (projection of X) and $\mathbf{H}_y$ (projection of Y)]
Correlated Projection
$$(\mathbf{u}^*, \mathbf{v}^*) = \underset{\mathbf{u},\mathbf{v}}{\operatorname{argmax}}\ \mathrm{corr}\big(\mathbf{u}^T\mathbf{X}, \mathbf{v}^T\mathbf{Y}\big)$$
[Figure: the data sets $\mathbf{X}$ and $\mathbf{Y}$ with their learned projection directions $\mathbf{u}$ and $\mathbf{v}$]
Canonical Correlation Analysis
With multiple projection directions stacked as the columns of $\mathbf{U}$ and $\mathbf{V}$:
$$(\mathbf{U}^*, \mathbf{V}^*) = \underset{\mathbf{U},\mathbf{V}}{\operatorname{argmax}}\ \frac{\mathrm{tr}\big(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V}\big)}{\sqrt{\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U}}\ \sqrt{\mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V}}}$$
Canonical Correlation Analysis
maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$, $\ \mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$
Under these constraints, the covariance matrix of the projected views becomes (shown for three components):
$$\Sigma = \begin{bmatrix} \boldsymbol{\Sigma}_{XX} & \boldsymbol{\Sigma}_{XY} \\ \boldsymbol{\Sigma}_{YX} & \boldsymbol{\Sigma}_{YY} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \lambda_1 & 0 & 0 \\ 0 & 1 & 0 & 0 & \lambda_2 & 0 \\ 0 & 0 & 1 & 0 & 0 & \lambda_3 \\ \lambda_1 & 0 & 0 & 1 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & \lambda_3 & 0 & 0 & 1 \end{bmatrix}$$
Canonical Correlation Analysis
maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$, $\ \mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$
The optimum satisfies the eigenvalue equations
$$\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}\boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{XY}^T\,\mathbf{U} = \lambda\mathbf{U}, \qquad \boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{XY}^T\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}\,\mathbf{V} = \lambda\mathbf{V}$$
➢ These eigenvalue equations can be solved with a Singular Value Decomposition (SVD): define
$$\mathbf{T} \triangleq \boldsymbol{\Sigma}_{XX}^{-1/2}\,\boldsymbol{\Sigma}_{XY}\,\boldsymbol{\Sigma}_{YY}^{-1/2} = \mathbf{U}_{SVD}\,\mathbf{S}\,\mathbf{V}_{SVD}^T$$
then
$$(\mathbf{U}^*, \mathbf{V}^*) = \big(\boldsymbol{\Sigma}_{XX}^{-1/2}\,\mathbf{U}_{SVD},\ \boldsymbol{\Sigma}_{YY}^{-1/2}\,\mathbf{V}_{SVD}\big)$$
where the singular values of $\mathbf{T}$ give the eigenvalues (canonical correlations) and the singular vectors give the eigenvectors (projections).
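A compact numpy sketch of this SVD recipe on synthetic two-view data (the shared latent signal, the ridge term eps, and all sizes are illustrative assumptions):

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(0)
N, dx, dy = 2000, 5, 4
Z = rng.standard_normal((2, N))                    # shared 2-dim latent signal
X = rng.standard_normal((dx, 2)) @ Z + 0.5 * rng.standard_normal((dx, N))
Y = rng.standard_normal((dy, 2)) @ Z + 0.5 * rng.standard_normal((dy, N))

Xc = X - X.mean(axis=1, keepdims=True)
Yc = Y - Y.mean(axis=1, keepdims=True)
eps = 1e-6                                         # small ridge for numerical stability
Sxx = Xc @ Xc.T / N + eps * np.eye(dx)
Syy = Yc @ Yc.T / N + eps * np.eye(dy)
Sxy = Xc @ Yc.T / N

T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)            # T = Sxx^{-1/2} Sxy Syy^{-1/2}
U_svd, s, Vt_svd = np.linalg.svd(T)
U = inv_sqrt(Sxx) @ U_svd                          # U* = Sxx^{-1/2} U_SVD
V = inv_sqrt(Syy) @ Vt_svd.T                       # V* = Syy^{-1/2} V_SVD

print(np.round(s, 2))  # canonical correlations; the top two are large (shared signal)
```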
Canonical Correlation Analysis
maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$, $\ \mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$
In summary, CCA learns:
1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors
[Figure: text $\mathbf{X}$ and image $\mathbf{Y}$ projected by $\mathbf{U}$ and $\mathbf{V}$ into the coordinated spaces $\mathbf{H}_x$ (projection of X) and $\mathbf{H}_y$ (projection of Y)]
Exploring Deep Correlation Networks
Deep Canonical Correlation Analysis
Learn a deep nonlinear transformation of each view such that the resulting representations are maximally correlated:
$$\underset{\mathbf{W}_x,\,\mathbf{W}_y,\,\mathbf{U},\,\mathbf{V}}{\operatorname{argmax}}\ \mathrm{corr}\big(\mathbf{H}_x, \mathbf{H}_y\big)$$
Training computes the gradients of the correlation objective with respect to the top-layer projections,
$$\frac{\partial\,\mathrm{corr}(\mathbf{H}_x, \mathbf{H}_y)}{\partial \mathbf{U}}, \qquad \frac{\partial\,\mathrm{corr}(\mathbf{H}_x, \mathbf{H}_y)}{\partial \mathbf{V}},$$
and backpropagates them through the view-specific networks $\mathbf{W}_x$ and $\mathbf{W}_y$.
[Figure: deep networks $\mathbf{W}_x$ (over text $\mathbf{X}$) and $\mathbf{W}_y$ (over image $\mathbf{Y}$), followed by projections $\mathbf{U}$ and $\mathbf{V}$ into $\mathbf{H}_x$ and $\mathbf{H}_y$]
Andrew et al., ICML 2013
Deep Canonical Correlation Analysis
Training procedure:
1. Pre-train the model parameters using denoising autoencoders
2. Optimize the CCA objective function using large mini-batches or full-batch optimization (L-BFGS)
[Figure: each view network is first pre-trained to reconstruct its own input ($\mathbf{X}'$ from text $\mathbf{X}$, $\mathbf{Y}'$ from image $\mathbf{Y}$); the full model is then fine-tuned with the CCA objective on $\mathbf{H}_x$ and $\mathbf{H}_y$]
Andrew et al., ICML 2013
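A simplified PyTorch sketch of the correlation objective on top of two small view networks (a didactic reduction of the DCCA loss: the total correlation is the trace norm of $T = \Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yy}^{-1/2}$ estimated on a mini-batch; the layer sizes, the ridge term eps, and the optimizer choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

def cca_loss(Hx, Hy, eps=1e-4):
    """Negative total correlation of two views (simplified DCCA objective)."""
    N = Hx.shape[0]
    Hx = Hx - Hx.mean(dim=0)                       # center each view
    Hy = Hy - Hy.mean(dim=0)
    Sxx = Hx.T @ Hx / (N - 1) + eps * torch.eye(Hx.shape[1])
    Syy = Hy.T @ Hy / (N - 1) + eps * torch.eye(Hy.shape[1])
    Sxy = Hx.T @ Hy / (N - 1)

    def inv_sqrt(S):                               # S^{-1/2} via eigendecomposition
        w, Q = torch.linalg.eigh(S)
        return Q @ torch.diag(w.clamp_min(eps) ** -0.5) @ Q.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return -torch.linalg.svdvals(T).sum()          # trace norm = sum of singular values

net_x = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
net_y = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(list(net_x.parameters()) + list(net_y.parameters()), lr=1e-3)

x, y = torch.randn(256, 32), torch.randn(256, 20)  # one large mini-batch
opt.zero_grad()
loss = cca_loss(net_x(x), net_y(y))
loss.backward()                                    # gradients flow into both view networks
opt.step()
```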
Deep Canonically Correlated Autoencoders (DCCAE)
Jointly optimize the DCCA and autoencoder loss functions
➢ A trade-off between multi-view correlation and reconstruction error from the individual views
[Figure: the DCCA architecture augmented with decoders that reconstruct $\mathbf{X}'$ and $\mathbf{Y}'$ from $\mathbf{H}_x$ and $\mathbf{H}_y$]
Wang et al., ICML 2015
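Written out, the combined objective has the following shape (a paraphrase of the trade-off above; the weight $\lambda$ and the squared-error reconstruction terms are assumed notation, in the spirit of Wang et al.):
$$\min\ -\,\mathrm{corr}\big(\mathbf{H}_x, \mathbf{H}_y\big) \;+\; \lambda \sum_{i=1}^{N} \Big( \lVert \mathbf{x}_i - \mathbf{x}_i' \rVert^2 + \lVert \mathbf{y}_i - \mathbf{y}_i' \rVert^2 \Big)$$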
Deep Correlational Neural Network
[Figure: a correlational neural network over text $\mathbf{X}$ and image $\mathbf{Y}$]
Enforcing Data Clustering in Deep Networks
[Figure: deep multimodal networks over text $\mathbf{X}$ and image $\mathbf{Y}$ with a clustering constraint on the learned representations]
Nonnegative Matrix Factorization (NMF)
Factorize a nonnegative data matrix $\mathbf{X}$ into two nonnegative factors, a basis $\mathbf{F}$ and coefficients $\mathbf{G}$:
$$\mathbf{X} \approx \mathbf{F}\,\mathbf{G}, \qquad \mathbf{X}, \mathbf{F}, \mathbf{G} \geq 0$$
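A minimal scikit-learn sketch (the data and the choice of rank are illustrative; in scikit-learn's naming, fit_transform returns the first factor and components_ the second):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((100, 40)))            # nonnegative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500)
F = model.fit_transform(X)                            # nonnegative factor F (100 x 5)
G = model.components_                                 # nonnegative factor G (5 x 40)

print(np.linalg.norm(X - F @ G) / np.linalg.norm(X))  # relative reconstruction error
```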
Semi-NMF and Other Extensions
Semi-NMF relaxes the nonnegativity constraints: the data $\mathbf{X}$ and the factor $\mathbf{F}$ may have mixed signs, and only $\mathbf{G}$ is constrained to be nonnegative.
[Figure: multimodal extensions applied to text $\mathbf{X}$ and image $\mathbf{Y}$]
Ding et al., TPAMI 2015
Deep Matrix Factorization
Deep Semi-NMF Model
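A deep Semi-NMF stacks several factor matrices (the form below follows Trigeorgis et al., ICML 2014, reusing the $\mathbf{F}/\mathbf{G}$ naming from above; the depth $m$ is arbitrary):
$$\mathbf{X} \approx \mathbf{F}_1 \mathbf{F}_2 \cdots \mathbf{F}_m \mathbf{G}_m, \qquad \mathbf{G}_m \geq 0$$
Each intermediate representation $\mathbf{G}_i \approx \mathbf{F}_{i+1} \cdots \mathbf{F}_m \mathbf{G}_m$ then captures an increasingly abstract level of attributes.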
Multivariate Statistics