Ensi3 PRML s6 Encoders
James L. Crowley
Generative Networks:
EigenSpace Coding, Auto-Encoders, Variational
Autoencoders and Generative Adversarial Networks
Outline:
Notation
Key Equations
AutoEncoders
The Sparsity Parameter
Kullback-Leibler Divergence
Auto-Encoders vs Principal Components Analysis
Variational Autoencoders
Background Reading
• Turk, M. and Pentland, A., "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991.
• Kingma, D.P., Mohamed, S., Rezende, D.J. and Welling, M., "Semi-supervised Learning with Deep Generative Models", Advances in Neural Information Processing Systems (NIPS), pp. 3581-3589, 2014.
• Radford, A., Metz, L. and Chintala, S., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", arXiv:1511.06434, 2015.
Notation
W(i,j)   An R×C image window (imagette).
$\vec{W}$   A flattened 1-D vector representing the imagette.
$\{\vec{W}_m\}$   A training set of M imagettes.
$\vec{\mu} = E\{\vec{W}_m\}$   Average (mean) imagette.
$\vec{V}_m = \vec{W}_m - \vec{\mu}$   Zero-mean normalized imagette.
$V = \left(\vec{V}_1 \;\; \vec{V}_2 \;\; \cdots \;\; \vec{V}_M\right)$   Training matrix of zero-mean imagettes.
$\Sigma = VV^T$   Covariance matrix of the imagettes.
$\vartheta$   N×N rotation matrix (the N eigenvectors of Σ).
$\Lambda$   Diagonal matrix of the eigenvalues of Σ.
$x_d$   A feature: an observed or measured value.
$\vec{X}$   A vector of D features.
D   The number of dimensions of the vector.
$\{\vec{X}_m\}, \{y_m\}$   Training samples for learning.
M   The number of training samples.
$a_j^{(l)}$   The activation output of the j-th neuron of layer l.
$w_{ij}^{(l)}$   The weight from unit i of layer l–1 to unit j of layer l.
$b_j^{(l)}$   The bias term for the j-th unit of layer l.
$\rho$   The sparsity parameter.
Key Equations
Principal Components Analysis: $\vartheta^T \Sigma\, \vartheta = \Lambda$

The average activation of hidden unit j (layer 1): $\hat{\rho}_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(1)}$

The autoencoder cost function: $L_{sparse}(W, B; \vec{X}_m, y_m) = \frac{1}{2}\left\|\vec{a}_m^{(2)} - \vec{X}_m\right\|^2 + \beta \sum_{j=1}^{N^{(1)}} KL(\rho \,\|\, \hat{\rho}_j)$

The Kullback-Leibler divergence: $\sum_{j=1}^{N^{(1)}} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{N^{(1)}} \left(\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right)$
Generative Networks
Deep learning was originally invented for recognition, but the same technology can be used for generation. Up to now we have looked at what are called "discriminative" techniques: techniques that attempt to discriminate a class label y from a feature vector $\vec{X}$.

$\vec{X} \;\rightarrow\; D(\vec{X}) \;\rightarrow\; \hat{y}$

The same process can be used to learn a network that generates $\vec{X}$ given a code y. This is called a "generative" process.

$y \;\rightarrow\; G(y) \;\rightarrow\; \vec{X}$

Given an observable random variable $\vec{X}$ and a target variable $\vec{Y}$, gradient descent allows us to learn a joint probability distribution $P(\vec{X}, \vec{Y})$, where $\vec{X}$ is generally composed of continuous variables and $\vec{Y}$ is generally a discrete set of classes represented by a binary vector.

A discriminative model gives a conditional probability distribution $P(\vec{Y} \mid \vec{X})$.
A generative model gives a conditional probability distribution $P(\vec{X} \mid \vec{Y})$.

We can combine a discriminative process for one data set with a generative process from another and use these to make synthetic outputs.

$\vec{X} \;\rightarrow\; D(\vec{X}) \;\rightarrow\; \hat{y} \qquad y \;\rightarrow\; G(y) \;\rightarrow\; \vec{X}$
Principal Components Analysis and Eigen-Space Coding
Principal Components Analysis (PCA) is a popular method to reduce the number of
dimensions in high dimensional feature vectors. In some cases this can provide an
important reduction in computing time with little or no impact on recognition rates. It
can also be used to determine an orthogonal basis set for highly redundant features,
such as the raw pixels in small windows extracted from images.
While PCA is primarily a data compression method for encoding, it has been
successfully used as a method for generating features for detection and recognition.
An important example occurred in 1991, with the thesis of Matthew Turk at the MIT Media Lab (Turk and Pentland, 1991). This paper won the "best paper" award at CVPR and marked the beginning of the use of appearance-based techniques in Computer Vision.
We can also train a classifier for subsets of face images, for example all images of a
particular person. Other recognition techniques are also possible.
Principal Components Analysis of Face Imagettes.
For notational convenience, it is often useful to map (or flatten) the 2-D window W(i,j) of each imagette onto a 1-D vector $\vec{W}$. This allows us to express the analysis using classical vector algebra.

The orthogonal basis is provided by the principal components of the covariance matrix of $\{\vec{W}_m\}$. First, normalize the imagettes to zero mean.

Compute the average vector: $\vec{\mu} = E\{\vec{W}_m\}$, that is, $\mu(n) = \frac{1}{M}\sum_{m=1}^{M} W_m(n)$

Normalize the training data to zero mean: $\vec{V}_m = \vec{W}_m - \vec{\mu}$

The covariance matrix is then constructed from the matrix of normalized training imagettes. Compose the matrix V as

$V = \left(\vec{V}_1 \;\; \vec{V}_2 \;\; \cdots \;\; \vec{V}_M\right)$

V has N rows and M columns. Each column is a training imagette, m. Each row is a pixel, n ← (i, j).
The outer product $VV^T$ is the covariance matrix Σ:

$\Sigma = V V^T$

with elements

$\sigma_{ij}^2 = \frac{1}{M}\sum_{m=1}^{M} V_m(i)\, V_m(j)$
For small imagettes, the covariance matrix Σ is easily diagonalized using standard algorithms for diagonalizing large matrices, such as Householder's method (see Numerical Recipes in C or any other numerical methods toolkit). This will work for imagettes of up to 32 × 32 pixels (covariance matrices of 1024 × 1024).

The diagonal terms of Λ are the eigenvalues, λn. The N eigenvalues give the average energy of each eigenvector over the imagettes in the training set. This can be used to estimate the average squared error that would result from reconstructing an image without the corresponding eigenvector.
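As a concrete illustration (not from the lecture; the array names and sizes below are assumptions), the following NumPy sketch flattens a set of imagettes, subtracts the mean, forms the covariance matrix, and diagonalizes it:

```python
import numpy as np

# Hypothetical training set: M imagettes of R x C pixels.
M, R, C = 200, 32, 32
imagettes = np.random.rand(M, R, C)        # replace with real face windows

# Flatten each R x C window into a 1-D vector of N = R*C pixels (one column per imagette).
W = imagettes.reshape(M, R * C).T          # N x M training matrix

mu = W.mean(axis=1, keepdims=True)         # average imagette (N x 1)
V = W - mu                                 # zero-mean training matrix V (N x M)

Sigma = (V @ V.T) / M                      # covariance matrix (N x N)

# Sigma is symmetric, so np.linalg.eigh diagonalizes it reliably.
eigvals, phi = np.linalg.eigh(Sigma)       # eigenvalues (ascending) and eigenvectors

# Sort by decreasing eigenvalue: the first eigenvectors carry the most energy.
order = np.argsort(eigvals)[::-1]
eigvals, phi = eigvals[order], phi[:, order]
```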
Diagonalization generally works well for matrices up to 1024 × 1024. Thus we can easily use this method for imagettes of up to 32 × 32 pixels. Other, more exotic algorithms can be used for larger matrices.
The eigenvectors provide an orthogonal basis for the training data. We can project any imagette $\vec{W}$ onto this basis with:

$\vec{X} = \varphi^T \vec{W}$

where the coefficients are $x_d = \sum_{n=1}^{N} W(n)\,\varphi_d(n)$ for d = 1, …, D, with D ≤ N.
This projection acts as a "code" for the imagette. We can reconstruct an imagette from this code as a weighted sum of the basis vectors plus the average image.

This reconstruction will only produce imagettes that resemble the training data. Patterns not in the training data will not appear in the reconstruction! This is classically used as a filter to eliminate noise. It can also be used as a pattern detector!

The residual error energy gives an indication of how similar the input imagette is to the training data. The reconstructed image shows where the differences occur. When used with a set of face imagettes, the error energy is called the "distance from Face Space".
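A minimal sketch of the encoding and reconstruction step, assuming a precomputed orthonormal basis `phi` and mean imagette `mu` (the stand-in values below are placeholders, not the lecture's data):

```python
import numpy as np

N, D = 1024, 32                               # pixels per imagette, code size (illustrative)
phi = np.linalg.qr(np.random.randn(N, N))[0]  # stand-in orthonormal basis (use the eigenvectors in practice)
mu = np.zeros(N)                              # stand-in average imagette
w = np.random.rand(N)                         # a flattened imagette to encode

v = w - mu                                    # zero-mean imagette
x = phi[:, :D].T @ v                          # code: the D projection coefficients x_d
w_hat = mu + phi[:, :D] @ x                   # reconstruction: weighted sum of bases plus the mean
residual = v - phi[:, :D] @ x
error_energy = float(residual @ residual)     # "distance from Face Space" for face imagettes
```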
Example

[Figure: plot of the eigenvalues (average energy, on the order of 10⁷ for the first components) against eigenvector index 0 to 15 for a face-imagette training set.]
Note that for this to work well for face images, the images should be normalized in
position and scale. This is generally accomplished by aligning the eyes to standard
positions.
If the images are not aligned, then the eigenvalues will remain large and a larger code is required.
[Figure: reconstruction of an input imagette and the corresponding image error (residue).]
The residue image can be used to determine whether a new face imagette, W(n), is "similar" to the eigenspace (linear subspace). In this case, the residue is called the "Distance from Face Space" (DFS).

We scan the image with windows of different sizes, texture-map each window to a standard size, then compute the residual distance from face space. If the distance is small, the window contains a face similar to the face space.
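A hedged sketch of this scanning procedure (the lecture gives no code): the window size, step, and threshold below are illustrative, and OpenCV's `cv2.resize` is assumed to be available for the texture mapping.

```python
import numpy as np
import cv2   # assumed available; used only to resize windows to the standard imagette size

def distance_from_face_space(window, phi, mu, D=32):
    """Texture-map a window to the standard imagette size, project it onto the
    first D eigenfaces, and return the residual error energy (DFS)."""
    side = int(np.sqrt(mu.size))
    w = cv2.resize(window.astype(np.float32), (side, side)).astype(np.float64).ravel()
    v = w - mu
    x = phi[:, :D].T @ v
    residual = v - phi[:, :D] @ x
    return float(residual @ residual)

def detect_faces(image, phi, mu, win=64, step=16, threshold=1e4):
    """Scan the image with square windows; keep windows whose DFS is below a threshold."""
    detections = []
    for i in range(0, image.shape[0] - win, step):
        for j in range(0, image.shape[1] - win, step):
            if distance_from_face_space(image[i:i + win, j:j + win], phi, mu) < threshold:
                detections.append((i, j, win))
    return detections
```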
If all N bases are used, then any imagette from $\{\vec{W}_m\}$ can be perfectly reconstructed (except for round-off error).

Test if an imagette is a face: IF ($\varepsilon_R^2$ < Threshold) THEN Face ELSE NOT Face
In practice, this method is less effective and more expensive than the cascade
classifiers seen last lecture.
In 1996 we were able to demonstrate video telephony in real time (video rates) over a 9600 baud serial line! We ran a video conference with MIT over a very low-bandwidth phone line. In this demo, we used 32 coefficients per image (32 × 4 = 128 bytes/image).
Eigenspace Coding for Face Recognition.
We can use the coefficients $\vec{X} = \langle \vec{W}(n), \vec{\varphi}(n) \rangle$ as a feature vector for recognition. The feature vectors for a class of faces can be modeled as a mixture of Gaussians:

$p(\vec{X}; \vec{\nu}) = \sum_{i=1}^{I} \alpha_i\, \mathcal{N}(\vec{X}; \vec{\mu}_i, \Sigma_i)$  where  $\mathcal{N}(\vec{X}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\,\det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(\vec{X}-\vec{\mu})^T \Sigma^{-1} (\vec{X}-\vec{\mu})}$

Given an unknown face, W(n), compute its code: $x_d = \langle \vec{W}(n), \vec{\varphi}_d(n) \rangle = \sum_{n=1}^{N} W(n)\,\varphi_d(n)$
In practice, if there are variations in illumination, these will dominate the first
eigenvectors. In this case the corresponding eigenvectors are not useful for
recognition and can be omitted.
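A possible implementation sketch, assuming SciPy is available; it fits a single Gaussian per person (a simplification of the mixture density above), and the number of skipped illumination components is an illustrative choice, not the lecture's recipe.

```python
import numpy as np
from scipy.stats import multivariate_normal   # assumed available

def encode(w, phi, mu, skip=3, D=32):
    """Project a flattened imagette onto eigenvectors skip .. skip+D,
    omitting the first components (often dominated by illumination)."""
    return phi[:, skip:skip + D].T @ (w - mu)

def fit_person_models(codes_by_person):
    """codes_by_person maps a person id to an (M_p x D) array of codes.
    Fit one Gaussian per person (a single-component simplification of the mixture)."""
    return {p: (c.mean(axis=0), np.cov(c, rowvar=False) + 1e-6 * np.eye(c.shape[1]))
            for p, c in codes_by_person.items()}

def recognize(x, models):
    """Return the person whose Gaussian gives the unknown code x the highest density."""
    return max(models, key=lambda p: multivariate_normal.logpdf(x, models[p][0], models[p][1]))
```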
AutoEncoders
Using the notation from our two-layer network: given an input feature vector $\vec{X}_m$, the auto-encoder learns $\{w_{ij}^{(1)}, b_j^{(1)}\}$ and $\{w_{jk}^{(2)}, b_k^{(2)}\}$ such that for each training sample

$\vec{a}_m^{(2)} = \hat{X}_m \approx \vec{X}_m$

using as few hidden units as possible. Note that $N^{(2)} = D$ and that $N^{(1)} \ll N^{(2)}$.
When the number of hidden units $N^{(1)}$ is less than the number of input units, D, $\vec{a}_m^{(2)} = \hat{X}_m \approx \vec{X}_m$ is necessarily an approximation. The hidden units provide a "lossy" encoding for $\vec{X}_m$. This encoding can be used to suppress noise!

The error for back-propagation for each unit is a vector $\vec{\delta}_m^{(2)} = \vec{a}_m^{(2)} - \vec{X}_m$, with a component $\delta_{i,m}$ for each component $x_{i,m}$ of the training sample $\vec{X}_m$.

Level 1 (the code vector $\vec{Y}_m$): $a_{j,m}^{(1)} = f\!\left(\sum_{i=1}^{D} w_{ij}^{(1)} x_{i,m} + b_j^{(1)}\right)$

Level 2 (the reconstruction of the input, $\hat{X}_m$): $a_{k,m}^{(2)} = f\!\left(\sum_{j=1}^{N^{(1)}} w_{jk}^{(2)} a_{j,m}^{(1)} + b_k^{(2)}\right)$

$\vec{a}_m^{(2)} = \begin{pmatrix} a_1^{(2)} \\ \vdots \\ a_D^{(2)} \end{pmatrix} = \hat{X}_m \approx \vec{X}_m$, with error $\vec{\delta}_m^{(2)} = \vec{a}_m^{(2)} - \vec{X}_m$
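A minimal NumPy sketch of this two-layer forward pass, assuming a sigmoid activation f; the sizes D and N¹ and the random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: D input/output units, N1 hidden units with N1 << D.
D, N1 = 64, 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (D, N1)), np.zeros(N1)    # encoder weights w_ij^(1), b_j^(1)
W2, b2 = rng.normal(0, 0.1, (N1, D)), np.zeros(D)     # decoder weights w_jk^(2), b_k^(2)

def forward(X):
    """X: (M x D) batch of input vectors.  Returns the code and the reconstruction."""
    A1 = sigmoid(X @ W1 + b1)      # level 1: a_j^(1), the code vector
    A2 = sigmoid(A1 @ W2 + b2)     # level 2: a_k^(2), the reconstruction of the input
    return A1, A2

X = rng.random((32, D))            # a stand-in training batch
A1, A2 = forward(X)
delta2 = A2 - X                    # output error delta_m^(2) = a_m^(2) - X_m
```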
The Sparsity Parameter
The sparsity $\hat{\rho}_j$ for each hidden unit (code component) is computed as the average activation over the M training samples:

$\hat{\rho}_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(1)}$
Standard back-propagation tries to minimize a loss based on the sum of squared errors. The loss for each sample is:

$L_m(\vec{X}_m, y_m) = \frac{1}{2}\left(\vec{a}_m^{(L)} - y_m\right)^2$

For an auto-encoder, the target output is the input vector, and the loss is the squared difference from the input vector:

$L_m(\vec{X}_m, y_m) = \frac{1}{2}\left\|\vec{a}_m^{(L)} - \vec{X}_m\right\|^2$

To encourage a sparse code, a sparsity penalty is added:

$L_m(\vec{X}_m, y_m) = \frac{1}{2}\left\|\vec{a}_m^{(L)} - \vec{X}_m\right\|^2 + \beta\sum_{j=1}^{N^{(1)}} KL(\rho \,\|\, \hat{\rho}_j)$

where $\sum_{j=1}^{N^{(1)}} KL(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between the target sparsity ρ and the average activations of the hidden units.
Kullback-Leibler Divergence
The KL divergence between the desired and average activations is:

$\sum_{j=1}^{N^{(1)}} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{N^{(1)}}\left(\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right)$
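A small sketch of this sparse loss in NumPy (the values of ρ and β are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def sparse_autoencoder_loss(X, A1, A2, rho=0.05, beta=3.0):
    """Reconstruction loss plus KL sparsity penalty, averaged over a batch.
    X: (M x D) inputs, A2: (M x D) reconstructions, A1: (M x N1) hidden activations.
    rho is the target sparsity and beta its weight (illustrative values)."""
    reconstruction = 0.5 * np.mean(np.sum((A2 - X) ** 2, axis=1))
    rho_hat = A1.mean(axis=0)                          # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return reconstruction + beta * kl
```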
To incorporate the KL divergence into back-propagation, we replace

$\delta_j^{(1)} = \frac{\partial f(z_j^{(1)})}{\partial z_j^{(1)}} \sum_{k=1}^{N^{(2)}} w_{jk}^{(2)}\,\delta_k^{(2)}$

with

$\delta_j^{(1)} = \frac{\partial f(z_j^{(1)})}{\partial z_j^{(1)}} \left(\sum_{k=1}^{N^{(2)}} w_{jk}^{(2)}\,\delta_k^{(2)} + \beta\left(-\frac{\rho}{\hat{\rho}_j} + \frac{1-\rho}{1-\hat{\rho}_j}\right)\right)$

where $N^{(2)} = D$, the size of the input and output vectors. (The network output has the same number of components as the input.)
The average activation $\hat{\rho}_j$ is used to compute the correction. Thus you need to compute a forward pass on a batch of training data before computing the back-propagation, so learning is necessarily batch mode.
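A sketch of this modified hidden-layer delta for a sigmoid autoencoder, computed over a batch as required; the names and the values of ρ and β are illustrative.

```python
import numpy as np

def hidden_delta_with_sparsity(A1, delta2, W2, rho=0.05, beta=3.0):
    """Back-propagated error for the hidden (code) layer, including the sparsity term.
    A1: (M x N1) hidden activations for the batch, delta2: (M x D) output errors,
    W2: (N1 x D) decoder weights."""
    rho_hat = A1.mean(axis=0)                             # requires a full forward pass over the batch
    sparsity_term = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    # For the sigmoid activation, f'(z) = a (1 - a).
    return A1 * (1 - A1) * (delta2 @ W2.T + sparsity_term)
```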
The auto-encoder forces the hidden units to become approximately orthogonal, allowing a small correlation determined by the target sparsity, ρ. Thus the hidden units act as a form of basis space for the input vectors. The values of the hidden code layer are referred to as latent variables. The latent variables provide a compressed representation that reduces dimensionality and eliminates random noise.
Auto-Encoders vs Principal Components Analysis
What is the difference between an Auto-Encoder and Principal Components Analysis? Both techniques project a high-dimensional data set onto a lower-dimensional manifold (variété différentielle in French).

This is the manifold learning hypothesis: examples concentrate near a lower-dimensional "manifold", a region of high density where small changes are only allowed in certain directions. Minimizing the reconstruction error forces the latent representation of "similar inputs" to stay on the manifold.

[Figures: affine transformations of a bitmap image; face expressions for an individual; reconstruction error relative to the manifold. Illustrations from the NAACL 2013 lecture by R. Socher and C. Manning.]

So which one works better as a general face detector? Try it and see.
Variational Autoencoders
For a fully connected network, decoding is fairly straightforward. The network input is a binary vector $\vec{Y}$ with K binary values $y_k$, one for each target class. This is a code. The output for a training sample $\vec{Y}_m$ is an approximation of a feature vector belonging to the coded class, $\hat{X}_m$:

$\vec{a}_m^{(2)} = \hat{X}_m \approx \vec{X}_m$

and the error is the difference between the output and the actual members of the class:

$\vec{\delta}_m^{(2)} = \vec{a}_m^{(2)} - \vec{X}_m$

The average error over a training set $\{\vec{Y}_m\}, \{\vec{X}_m\}$ can be used to drive back-propagation.
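The lecture gives no implementation here; the sketch below (a plain conditional decoder in PyTorch, not a full variational autoencoder) maps a one-hot class code to a feature vector and minimizes the squared reconstruction error described above. The layer sizes, hidden layer, and optimizer are assumptions.

```python
import torch
import torch.nn as nn

K, D = 10, 64                                # number of classes, feature dimension (illustrative)
decoder = nn.Sequential(nn.Linear(K, 128), nn.ReLU(), nn.Linear(128, D))
optimizer = torch.optim.SGD(decoder.parameters(), lr=0.01)
mse = nn.MSELoss()

def train_step(Y, X):
    """Y: (M x K) one-hot class codes, X: (M x D) target feature vectors.
    Minimizes the squared difference between generated and real features."""
    X_hat = decoder(Y)
    loss = mse(X_hat, X)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# usage with stand-in data (replace with real features of the coded classes):
Y = torch.eye(K)[torch.randint(0, K, (32,))]
X = torch.rand(32, D)
train_step(Y, X)
```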
Generative Adversarial Networks.
It is possible to put a discriminative network together with a generative network and
have them train each other. This is called a Generative Adversarial Network (GAN).
The two networks compete in a zero-sum game, where each network attempts to fool
the other network. The generative network generates examples of an image and the
discriminative network attempts to recognize whether the generated image is realistic
or not. Each network provides feedback to the other, and together they train each
other. The result is a technique for unsupervised learning that can learn to create
realistic patterns. Applications include synthesis of images, video, speech or
coordinated actions for robots.
Generally, the discriminator is first trained on real data. The discriminator is then
frozen and used to train the generator. The generator is trained by using random
inputs to generate fake outputs. Feedback from the discriminator drives gradient
ascent by back propagation. When the generator is sufficiently trained, the two
networks are put in competition.
The perceptrons D() and G() play a two-player zero-sum min-max game with a value function V(D, G):

$\min_G \max_D V(D, G) = E_{\vec{X} \sim p_{data}}\left[\log D(\vec{X})\right] + E_{\vec{z} \sim p_z}\left[\log\left(1 - D(G(\vec{z}))\right)\right]$
In practice, this objective may not give the generator a sufficient gradient to learn at the start of training. This is why, as described above, the discriminator is first trained on real data and the generator is then trained with the discriminator held constant. When the generator is sufficiently trained, the two networks are put in competition, providing unsupervised learning.
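A minimal PyTorch sketch of this alternating training scheme; the network sizes, learning rates, and the non-saturating generator loss are assumptions, not the lecture's specification.

```python
import torch
import torch.nn as nn

data_dim, code_dim = 64, 16   # hypothetical sizes
G = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    b = real_batch.size(0)
    # 1) Train the discriminator: real examples -> 1, generated examples -> 0.
    z = torch.randn(b, code_dim)
    fake = G(z).detach()
    loss_D = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # 2) Train the generator with the discriminator held constant:
    #    maximize log D(G(z)) (the usual non-saturating form of the min-max game).
    z = torch.randn(b, code_dim)
    loss_G = bce(D(G(z)), torch.ones(b, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# usage: train_step(torch.randn(32, data_dim))  # replace with a batch of real training data
```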