
Pattern Recognition and Machine Learning

James L. Crowley

ENSIMAG 3 - MMIS Fall Semester 2020


Lesson 5: 9 December 2020

Generative Networks:
EigenSpace Coding, Auto-Encoders, Variational
Autoencoders and Generative Adversarial Networks

Outline:

Notation
Key Equations
Generative Networks
Principal Components Analysis and Eigen-Space Coding
    Principal Components Analysis of Face Imagettes
    Example
    Reconstruction
    Face Detection using Distance from Face Space
    Eigenspace Coding for Transmission
    Eigenspace Coding for Face Recognition
AutoEncoders
    The Sparsity Parameter
    Kullback-Leibler Divergence
    Auto-Encoders vs Principal Components Analysis
    Variational Autoencoders
Generative Adversarial Networks
    Generative Networks
    GAN Learning as Min-Max Optimization

Background Reading
• Turk, M. and Pentland, A., "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
• Kingma, D.P., Mohamed, S., Rezende, D.J. and Welling, M., "Semi-supervised Learning with Deep Generative Models", Advances in Neural Information Processing Systems (NIPS 2014), pp. 3581-3589.
• Radford, A., Metz, L. and Chintala, S., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks".

Notation
$W(i,j)$                 An R×C image window (imagette).
$\vec W$                 A flattened 1-D vector representing the imagette.
$\{\vec W_m\}$           A training set of M imagettes.
$\vec\mu = E\{\vec W_m\}$    Average (mean) imagette.
$\vec V_m = \vec W_m - \vec\mu$   Zero-mean (normalized) imagette.
$V = (\vec V_1\ \vec V_2\ \cdots\ \vec V_M)$   Training matrix of zero-mean imagettes.
$\Sigma = V V^T$         Covariance of the imagettes.
$\vartheta$              N×N rotation matrix (the N eigenvectors of $\Sigma$).
$\Lambda$                Diagonal matrix of eigenvalues of $\Sigma$.
$x_d$                    A feature: an observed or measured value.
$\vec X$                 A vector of D features.
$D$                      The number of dimensions of the vector.
$\{\vec X_m\}, \{y_m\}$  Training samples for learning.
$M$                      The number of training samples.
$a_j^{(l)}$              The activation output of the j-th neuron of layer l.
$w_{ij}^{(l)}$           The weight from unit i of layer l−1 to unit j of layer l.
$b_j^{(l)}$              The bias term for unit j of layer l.
$\rho$                   The sparsity parameter.


Key Equations

Principal Components Analysis: $\vartheta^T \Sigma\, \vartheta = \Lambda$

The average activation of hidden unit j: $\hat\rho_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(1)}$

The auto-encoder cost function: $L_{sparse}(W, B; \vec X_m, y_m) = \frac{1}{2}\left(\vec a_m^{(2)} - \vec X_m\right)^2 + \beta \sum_{j=1}^{N^{(1)}} KL(\rho\,\|\,\hat\rho_j)$

The Kullback-Leibler divergence: $\sum_{j=1}^{N^{(1)}} KL(\rho\,\|\,\hat\rho_j) = \sum_{j=1}^{N^{(1)}} \left( \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j} \right)$


Generative Networks
Deep learning was originally invented for recognition. The same technology can be used for generation. Up to now we have looked at what are called "discriminative" techniques: techniques that attempt to discriminate a class label, $y$, from a feature vector, $\vec X$.

$\vec X \rightarrow D(\vec X) \rightarrow \hat y$

The same process can be used to learn a network that generates $\vec X$ given a code $y$. This is called a "generative" process.

$y \rightarrow G(y) \rightarrow \vec X$

Given an observable random variable $\vec X$ and a target variable $\vec Y$, gradient descent allows us to learn a joint probability distribution $P(\vec X, \vec Y)$, where $\vec X$ is generally composed of continuous variables and $\vec Y$ is generally a discrete set of classes represented by a binary vector.

A discriminative model gives a conditional probability distribution $P(\vec Y \mid \vec X)$.
A generative model gives a conditional probability distribution $P(\vec X \mid \vec Y)$.

We can combine a discriminative process for one data set with a generative process from another, and use these to make synthetic outputs.

$\vec X \rightarrow D(\vec X) \rightarrow \hat y \qquad\qquad y \rightarrow G(y) \rightarrow \hat{\vec X}$

A classic example is an autoencoder. But before we see how to learn an autoencoder, I would like to review a similar classic technique from computer vision: Eigen-Space Coding.

Principal Components Analysis and Eigen-Space Coding
Principal Components Analysis (PCA) is a popular method to reduce the number of dimensions in high-dimensional feature vectors. In some cases this can provide an important reduction in computing time with little or no impact on recognition rates. It can also be used to determine an orthogonal basis set for highly redundant features, such as the raw pixels in small windows extracted from images.

While PCA is primarily a data-compression method for encoding, it has been successfully used as a method for generating features for detection and recognition. An important example occurred in 1991, with the thesis of Matthew Turk at the MIT Media Lab (Turk and Pentland, CVPR 1991). This paper won "best paper" at CVPR and marked the beginning of the use of appearance-based techniques in Computer Vision.

To use the method we require a training set of M face windows (imagettes) $\{\vec W_m\}$ of size R rows and C columns. We will use the set $\{\vec W_m\}$ to learn a set of D orthogonal basis images $\vec\varphi_d$. These can serve as feature detectors for detection and recognition.

We can then project an imagette onto the basis vectors to obtain a feature vector:

$\vec X_m = \varphi^T \vec W_m$,  where each component is  $x_{dm} = \langle \vec W_m, \vec\varphi_d \rangle = \sum_{i=1}^{C}\sum_{j=1}^{R} W_m(i,j)\,\varphi_d(i,j)$

We can reconstruct the imagette as a weighted sum of the basis images:

$\hat W(i,j) = \mu(i,j) + \sum_{d=1}^{D} x_d\,\varphi_d(i,j)$

where $\mu(i,j)$ is the "average" imagette from the training data.

The energy of the difference between the original imagette and the reconstructed imagette is a measure of the similarity of the imagette to a face:

$\varepsilon_R^2 = \sum_{j=1}^{R}\sum_{i=1}^{C} R(i,j)^2$  where  $R(i,j) = W(i,j) - \hat W(i,j)$

We can also train a classifier for subsets of face images, for example all images of a particular person. Other recognition techniques are also possible.

Principal Components Analysis of Face Imagettes.

For notational reasons, it is often convenient to map (or flatten) the 2-D window W(i,j) of each imagette onto a 1-D vector $\vec W$. This allows us to express the analysis using classical vector algebra.

For an imagette of size C columns by R rows:
allocate a vector W(n) with N = R × C coefficients;
for any pixel (i, j), compute n = j·C + i;
then W(n) = W(i, j).
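
As a concrete illustration, here is a minimal numpy sketch of the flattening step, assuming 0-based indices and row-major storage so that n = j·C + i matches numpy's default ordering (the array values are random placeholders):

    import numpy as np

    R, C = 16, 16                     # imagette size: R rows, C columns
    W_img = np.random.rand(R, C)      # a 2-D imagette W(i, j), indexed [row j, column i]
    W_vec = W_img.reshape(R * C)      # flattened 1-D vector W(n), with n = j*C + i
    assert W_vec[3 * C + 5] == W_img[3, 5]   # pixel at row j=3, column i=5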

Principal components analysis is a method to determine a linear subspace that is optimal for reconstructing a set of vectors.

Assume a set of M training vectors (imagettes) $\{\vec W_m\}$ of N pixels. The training data are used to compute an orthogonal basis set of D vectors, $\vec\varphi_d$ with D ≤ N, to represent the training set.

This basis is provided by the principal components of the covariance matrix of $\{\vec W_m\}$. First, normalize the imagettes to zero mean.

Compute the average vector: $\vec\mu = E\{\vec W_m\}$, that is: $\mu(n) = \frac{1}{M}\sum_{m=1}^{M} W_m(n)$

Normalize the training data to zero mean: $\vec V_m = \vec W_m - \vec\mu$

The covariance matrix is then constructed from the matrix of normalized training imagettes. Compose the matrix V as

$V = (\vec V_1, \vec V_2, \ldots, \vec V_M) = \begin{pmatrix} V_1(1) & V_2(1) & \cdots & V_M(1) \\ V_1(2) & V_2(2) & \cdots & V_M(2) \\ \vdots & \vdots & \ddots & \vdots \\ V_1(N) & V_2(N) & \cdots & V_M(N) \end{pmatrix}$

V has N rows and M columns. Each column is a training image $\vec V_m$. Each row is a pixel n ← (i, j).

The outer product $V \cdot V^T$ is a covariance matrix, $\Sigma$:

$\Sigma = V \cdot V^T$

This covariance matrix has N × N = N² coefficients. The coefficients of $\Sigma$ are the covariances of the pixels i and j:

$\sigma_{ij}^2 = \frac{1}{M}\sum_{m=1}^{M} V_m(i)\, V_m(j)$

For small imagettes, the covariance matrix $\Sigma$ is easily diagonalized using standard algorithms for diagonalizing large matrices, such as Householder's method (see Numerical Recipes in C or any other numerical methods toolkit). This will work for imagettes up to 32 × 32 (covariance matrices of 1024 × 1024).

$(\varphi, \Lambda) \leftarrow PCA(\Sigma)$,  giving:  $\varphi^T \Sigma\, \varphi = \Lambda$

where $\varphi$ is an N×N rotation matrix and $\Lambda$ is a diagonal matrix with N non-zero diagonal values.

The N columns of the rotation matrix $\varphi$ are the eigenvectors of $\Sigma$. These eigenvectors provide an orthogonal set of N normalized vectors, $\vec\varphi_n$, that serve as orthogonal code vectors for describing imagettes of size N.

The diagonal terms of $\Lambda$ are the eigenvalues, $\lambda_n$. The N eigenvalues give the average energy of the training imagettes along each eigenvector. This can be used to indicate the average squared error that would result from reconstructing an image without the corresponding eigenvector.

Diagonalization generally works well for matrices up to 1024 × 1024, so we can easily use this method for imagettes up to 32 × 32 pixels. Other, more exotic algorithms can be used for larger matrices.
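
The following is a minimal numpy sketch of this training step. The variable names follow the notation above, but the training data here are random stand-ins:

    import numpy as np

    M, R, C = 200, 16, 16
    N = R * C
    W = np.random.rand(M, N)            # M flattened training imagettes W_m(n)

    mu = W.mean(axis=0)                 # average imagette mu(n)
    V = (W - mu).T                      # N x M matrix of zero-mean imagettes
    Sigma = V @ V.T                     # N x N covariance matrix Sigma = V V^T

    # Diagonalize: eigh returns eigenvalues of a symmetric matrix in ascending
    # order, so reverse to put the principal components first.
    lam, phi = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, phi = lam[order], phi[:, order]    # phi^T Sigma phi = diag(lam)

    D = 32                              # keep the D strongest eigenvectors
    phi_D = phi[:, :D]                  # N x D basis of "eigen-imagettes"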

The eigenvectors provide an orthogonal basis for the training data. We can project any imagette $\vec W$ onto this basis with:

$\vec X = \varphi^T \vec W$

where the coefficients are $x_d = \sum_{n=1}^{N} W(n)\,\varphi_d(n)$ for d = 1, …, D with D ≤ N.

This projection acts as a "code" for the imagette. We can reconstruct an imagette from this code as a weighted sum of the basis vectors plus the average image.

Reconstruction: $\hat W(n) = \mu(n) + \sum_{d=1}^{D} x_d\,\varphi_d(n)$

This reconstruction will only produce imagettes that resemble the training data. Patterns not in the training data will not appear in the reconstruction! This is classically used as a filter to eliminate noise. It can also be used as a pattern detector!

We can determine an error image and its energy:

Error image: $R(n) = W(n) - \hat W(n)$

Error energy: $\varepsilon_R^2 = \sum_{n=1}^{N} R(n)^2$

The error energy gives an indication of how similar the input imagette is to the training data. The reconstructed image shows where the differences occur.

When used with a set of face imagettes, the error energy is called the "difference from Face Space".

Example

Sixteen images randomly selected from a 2-minute video of François Bérard (1995).

[Figures: the average image, a plot of the eigenvalues for the first 16 components, and the principal component images.]

Note that for this to work well for face images, the images should be normalized in position and scale. This is generally accomplished by aligning the eyes to standard positions. If the images are not aligned, then the eigenvalues will remain large and a larger code is required.
Reconstruction

The code can be used to reconstruct an image: $\hat W(n) = \mu(n) + \sum_{d=1}^{D} x_d\,\varphi_d(n)$

[Figures: an example reconstructed image (coded with 120 bytes) and the corresponding error image.]

Face Detection using Distance from Face Space

The residue image can be used to determine if a new face imagette, W(n), is "similar" to the eigenspace (linear subspace). In this case, the residue is called the "Distance from Face Space" (DFS).

The distance from Face Space can be used as a face detector!

We scan the image with windows of different sizes, texture-map each window to a standard size, and then compute the residue distance from face space. If the distance is small, the window contains a face similar to the Face Space.
If all N basis vectors are used, then any imagette from $\{\vec W_m\}$ can be perfectly reconstructed (except for round-off error). With only D < N basis vectors:

Reconstruction: $\hat W(n) = \mu(n) + \sum_{d=1}^{D} x_d\,\varphi_d(n)$

Residue image: $R(n) = W(n) - \hat W(n)$

Residue energy: $\varepsilon_R^2 = \sum_{n=1}^{N} R(n)^2$

Test if an imagette is a face: IF $\varepsilon_R^2$ < Threshold THEN Face ELSE Not Face

In practice, this method is less effective and more expensive than the cascade classifiers seen in the last lecture.
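
For completeness, here is a rough sketch of such a detector. It scans with a single, fixed window size rather than texture-mapping several window sizes to a standard size, and the threshold is an assumption to be tuned; mu and phi_D are a trained basis as in the earlier sketches:

    import numpy as np

    def dfs(W_vec, mu, phi_D):
        """Distance from Face Space: residue energy after projecting onto the basis."""
        x = phi_D.T @ (W_vec - mu)                 # code for the window
        R = W_vec - (mu + phi_D @ x)               # residue image
        return float(np.sum(R ** 2))

    def scan_for_faces(image, mu, phi_D, threshold, win=(16, 16), stride=4):
        """Return (top, left) corners of windows whose residue energy is below threshold."""
        Rw, Cw = win
        hits = []
        for top in range(0, image.shape[0] - Rw + 1, stride):
            for left in range(0, image.shape[1] - Cw + 1, stride):
                W_vec = image[top:top + Rw, left:left + Cw].reshape(Rw * Cw)
                if dfs(W_vec, mu, phi_D) < threshold:    # small distance => face-like
                    hits.append((top, left))
        return hits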

Eigenspace coding for transmission.

Eigenspace coding is a very effective method for signal compression!

In 1996 we were able to demonstrate video telephony in real time (video rates) over a 9600-baud serial line! We ran a video conference with MIT over a very low-bandwidth phone line. In this demo we used 32 coefficients per image (32 × 4 = 128 bytes per image).
Eigenspace Coding for Face Recognition.
We can use the coefficients $\vec X = \langle \vec W(n), \vec\varphi(n)\rangle$ as a feature vector for recognition. This technique is typically used with Gaussian Mixture Models.

Let us define the parameters for an I-component model as $\vec\nu = (\alpha_i, \vec\mu_i, \Sigma_i)$. Then

$p(\vec X; \vec\nu) = \sum_{i=1}^{I} \alpha_i\, \mathcal{N}(\vec X; \vec\mu_i, \Sigma_i)$  where:  $\mathcal{N}(\vec X; \vec\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(\vec X - \vec\mu)^T \Sigma^{-1} (\vec X - \vec\mu)}$

We use an algorithm such as EM to learn a model $\vec\nu_k$ from $M_k$ sample images $\{W_{mk}(n)\}$ of the face of each individual, k, from a population of K individuals.

Given an unknown face W(n): $\vec X = \langle \vec W(n), \vec\varphi(n)\rangle$, with components $x_d = \sum_{n=1}^{N} W(n)\,\varphi_d(n)$

From Bayes' rule we can determine the most probable individual, $C_k$, as:

$P(C_k \mid \vec X) = \frac{p(\vec X \mid C_k)\, P(C_k)}{\sum_{k=1}^{K} p(\vec X \mid C_k)\, P(C_k)}$
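
A hedged numpy sketch of this recognition step. The EM training of each model $\vec\nu_k$ is not shown; each entry of models below is assumed to be a tuple (alphas, mus, Sigmas) for one person, and the class priors P(C_k) are assumed to be given. The function names are mine:

    import numpy as np

    def gaussian(x, mu, Sigma):
        """Multivariate normal density N(x; mu, Sigma)."""
        D = len(mu)
        diff = x - mu
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
               np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

    def mixture_density(x, alphas, mus, Sigmas):
        """p(x; nu) = sum_i alpha_i N(x; mu_i, Sigma_i)."""
        return sum(a * gaussian(x, m, S) for a, m, S in zip(alphas, mus, Sigmas))

    def recognize(x, models, priors):
        """Return arg max_k P(C_k | x), with P(C_k | x) from Bayes' rule."""
        likelihoods = np.array([mixture_density(x, *nu_k) for nu_k in models])
        posteriors = likelihoods * priors
        return int(np.argmax(posteriors / posteriors.sum()))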

In 1991, Eigen-space coding was a revolutionary technique for face recognition, because it was the first technique to actually seem to work!

However, testing soon revealed that it could only work with:


1) Controlled lighting
2) Pre-defined face orientation (i.e. Cooperating Subject)
3) Normalized image position and size.
4) Minimal occlusions
5) Limited size population
In other conditions the results are unreliable.

In practice, if there are variations in illumination, these will dominate the first
eigenvectors. In this case the corresponding eigenvectors are not useful for
recognition and can be omitted.

AutoEncoders

An auto-encoder is an unsupervised learning algorithm that uses back-propagation to learn a sparse set of features for describing the training data. Rather than trying to learn a target variable, $y_m$, the auto-encoder tries to learn to reconstruct the input $\vec X$ using a minimum set of features (latent variables).

An auto-encoder learns to reconstruct (generate) clean copies of the data without noise.

The key concepts are:
1) The training data is the target. The error is the difference between input and output.
2) Training is with standard back-propagation (or gradient descent), with the addition of a "sparsity term" to the loss function.
3) Sparsity encodes the data with a minimum number of independent hidden units (code vectors).

Using the notation from our 2-layer network, given an input feature vector $\vec X_m$, the auto-encoder learns $\{w_{ij}^{(1)}, b_j^{(1)}\}$ and $\{w_{jk}^{(2)}, b_k^{(2)}\}$ such that for each training sample $\vec a_m^{(2)} = \hat X_m \approx \vec X_m$, using as few hidden units as possible.

Note that $N^{(2)} = D$ and that $N^{(1)} \ll N^{(2)}$.

When the number of hidden units $N^{(1)}$ is less than the number of input units, D, $\vec a_m^{(2)} = \hat X_m \approx \vec X_m$ is necessarily an approximation. The hidden units provide a "lossy" encoding for $\vec X_m$. This encoding can be used to suppress noise!

The error for back-propagation for each sample is a vector $\vec\delta_m^{(2)} = \vec a_m^{(2)} - \vec X_m$, with a component $\delta_{i,m}$ for each component $x_{i,m}$ of the training sample $\vec X_m$.

The hidden code is composed of independent "features" that each capture some component of the input vector. Each cell of the code vector is driven by a receptive field whose sum of products with the receptive fields of the other code cells is almost zero; that is, the code vectors are almost orthogonal. However, rather than minimizing the product of code vectors, sparsity seeks to generate the smallest set of code vectors that can reconstruct the training data without the noise. With an auto-encoder the components may have some slight overlap. The average degree of independence is captured by a "sparsity parameter", $\hat\rho$.

The Sparsity Parameter

The sparsity $\hat\rho_j$ is the average activation for each of the hidden units j = 1 to $N^{(1)}$. The auto-encoder will learn weights subject to a sparseness constraint specified by a target sparsity parameter $\rho$, typically set close to zero.

The simple, 2-layer auto-encoder is described by:

Level 0: $\vec X_m = (x_{1,m}, \ldots, x_{D,m})^T$, an input vector.

Level 1: $\vec Y_m$, with components $a_{j,m}^{(1)} = f\!\left(\sum_{i=1}^{D} w_{ij}^{(1)} x_{i,m} + b_j^{(1)}\right)$, the code vector.

Level 2: $\hat X_m$, with components $a_{k,m}^{(2)} = f\!\left(\sum_{j=1}^{N^{(1)}} w_{jk}^{(2)} a_{j,m}^{(1)} + b_k^{(2)}\right)$, the reconstruction of the input.

The output should approximate the input:

$\vec a_m^{(2)} = (a_1^{(2)}, \ldots, a_D^{(2)})^T = \hat X_m \approx \vec X_m$, with error $\vec\delta_m^{(2)} = \vec a_m^{(2)} - \vec X_m$
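
A small numpy sketch of this forward pass, using a sigmoid for the activation function f (an assumption; the lecture does not fix f) and random weights as placeholders. The names W1, b1, W2, b2 are mine:

    import numpy as np

    def f(z):                       # activation function (sigmoid assumed)
        return 1.0 / (1.0 + np.exp(-z))

    D, N1 = 64, 16                  # input size D and hidden (code) size N^(1)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.1, (D, N1)), np.zeros(N1)    # level 1 parameters
    W2, b2 = rng.normal(0, 0.1, (N1, D)), np.zeros(D)     # level 2 parameters

    X = rng.random(D)               # an input vector X_m
    a1 = f(X @ W1 + b1)             # code vector a_j^(1)
    a2 = f(a1 @ W2 + b2)            # reconstruction a_k^(2)
    delta2 = a2 - X                 # output error delta_m^(2) = a_m^(2) - X_m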



The sparsity $\hat\rho_j$ for each hidden unit (code component) is computed as the average activation over the M training samples:

$\hat\rho_j = \frac{1}{M}\sum_{m=1}^{M} a_{j,m}^{(1)}$

The auto-encoder is trained to keep this average activation close to the target sparsity. This is accomplished using back-propagation, with a simple tweak to the cost function.

Standard back-propagation tries to minimize a loss based on the sum of squared errors. The loss for each sample is:

$L_m(\vec X_m, y_m) = \frac{1}{2}\left(\vec a_m^{(L)} - y_m\right)^2$

For an auto-encoder, the target output is the input vector, and the loss is the squared difference from the input vector:

$L_m(\vec X_m, y_m) = \frac{1}{2}\left(\vec a_m^{(L)} - \vec X_m\right)^2$

To impose "sparsity" we add an additional term to the loss:

$L_m(\vec X_m, y_m) = \frac{1}{2}\left(\vec a_m^{(L)} - \vec X_m\right)^2 + \beta \sum_{j=1}^{N^{(1)}} KL(\rho\,\|\,\hat\rho_j)$

where $\sum_{j=1}^{N^{(1)}} KL(\rho\,\|\,\hat\rho_j)$ is the Kullback-Leibler divergence between the target sparsity and the vector of average hidden-unit activations, and $\beta$ controls the importance of the sparsity term.


Kullback-Leibler Divergence

The KL divergence between the desired and average activation is:

$\sum_{j=1}^{N^{(1)}} KL(\rho\,\|\,\hat\rho_j) = \sum_{j=1}^{N^{(1)}} \left( \rho \log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j} \right)$
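
A sketch of this sparse loss computed over a batch of M samples, following the equations above. Here the reconstruction term is averaged over the batch, and the values of rho and beta are illustrative:

    import numpy as np

    def kl_penalty(rho, rho_hat):
        """sum_j [ rho log(rho/rho_hat_j) + (1-rho) log((1-rho)/(1-rho_hat_j)) ]"""
        return np.sum(rho * np.log(rho / rho_hat) +
                      (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

    def sparse_loss(X, X_hat, A1, rho=0.05, beta=3.0):
        """Mean reconstruction error plus beta times the KL sparsity term.
        X, X_hat: (M, D) inputs and reconstructions; A1: (M, N1) hidden activations."""
        rho_hat = A1.mean(axis=0)                      # average activation rho_hat_j
        recon = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))
        return recon + beta * kl_penalty(rho, rho_hat)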

To incorporate the KL divergence into back-propagation, we replace

$\delta_j^{(1)} = \frac{\partial f(z_j^{(1)})}{\partial z_j^{(1)}} \sum_{k=1}^{N^{(2)}} w_{jk}^{(2)}\,\delta_k^{(2)}$

with

$\delta_j^{(1)} = \frac{\partial f(z_j^{(1)})}{\partial z_j^{(1)}} \left( \sum_{k=1}^{N^{(2)}} w_{jk}^{(2)}\,\delta_k^{(2)} + \beta\left( -\frac{\rho}{\hat\rho_j} + \frac{1-\rho}{1-\hat\rho_j} \right) \right)$

where $N^{(2)} = D$, the size of the input and output vectors. (The network output has the same number of components as the input.)

The average activation $\hat\rho_j$ is used to compute the correction. Thus you need to compute a forward pass on a batch of training data before computing the back-propagation, and learning is necessarily batch mode.
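
A sketch of this modified delta, assuming a sigmoid activation so that f'(z) = a(1 - a), with the batch of hidden activations A1, output errors delta2, and weights W2 shaped as in the earlier forward-pass sketch:

    import numpy as np

    def hidden_delta(A1, delta2, W2, rho, rho_hat, beta):
        """delta_j^(1) = f'(z_j^(1)) * ( sum_k w_jk^(2) delta_k^(2)
                                         + beta * (-rho/rho_hat_j + (1-rho)/(1-rho_hat_j)) )
        A1: (M, N1) hidden activations, delta2: (M, D) output errors, W2: (N1, D)."""
        sparsity_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))   # shape (N1,)
        return A1 * (1 - A1) * (delta2 @ W2.T + sparsity_grad)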

The auto-encoder forces the hidden units to become approximately orthogonal, allowing a small correlation determined by the target sparsity, $\rho$. Thus the hidden units act as a form of basis space for the input vectors. The values of the hidden code layer are referred to as latent variables. The latent variables provide a compressed representation that reduces dimensionality and eliminates random noise.
Auto-Encoders vs Principal Components Analysis

What is the difference between an Auto-Encoder and Principal Components Analysis? Both techniques project a high-dimensional data set onto a lower-dimensional "manifold" (variété différentielle in French): a region of high density where small changes are only allowed in certain directions.

[Figure: affine transformations of a bitmap image, and face expressions for an individual, as examples of data concentrated near a lower-dimensional manifold. Illustration from the NAACL 2013 lecture by R. Socher and C. Manning.]

PCA projects the data onto a linear manifold: for zero-mean input x the code is $h(x) = W x$, the reconstruction is $W^T h(x) = W^T W x$, and W is the principal eigen-basis of Cov(X). An auto-encoder projects the data onto a non-linear manifold and learns the salient variations, like a non-linear PCA: minimizing the reconstruction error forces the latent representations of "similar inputs" to stay on the manifold, which (should) provide a better representation of the latent space.

So which one works better as a general face detector? Try it and see.

An experimental comparison of face detection using the difference from Face Space computed with PCA vs AutoEncoders is worth +2 points on the second programming project.

Variational Autoencoders

The output of an auto-encoder can be used to drive a decoder to produce a filtered version of the encoded data, or of another training set. However, the output from an auto-encoder is discrete. We can adapt an auto-encoder to generate a *nearly* continuous output by replacing the code with a probabilistic code represented by a mean and a variance.

This is called a Variational Autoencoder (VAE). VAEs combine a discriminative network with a generative network. VAEs can be used to generate "deep fake" video sequences.

For a fully connected network, decoding is fairly obvious. The network input is a binary vector $\vec Y$ with K binary values $y_k$, one for each target class. This is a code. The output for a training sample $\vec Y_m$ is an approximation of a feature vector belonging to the code class, $\hat X_m$:

$\vec a_m^{(2)} = \hat X_m \approx \vec X_m$

and the error is the difference between the output and the actual members of the class:

$\vec\delta_m^{(2)} = \vec a_m^{(2)} - \vec X_m$

The average error for a training set $\{\vec Y_m\}, \{\vec X_m\}$ can be used to drive back-propagation.
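
A minimal sketch of the key change that makes an auto-encoder variational: the encoder produces a mean and a (log) variance for the code, and the decoder is fed a sample drawn from that Gaussian (the "reparameterization trick"). The encode and decode functions below are stand-ins, not a real network:

    import numpy as np

    rng = np.random.default_rng(0)

    def encode(x):
        """Stand-in for the discriminative half: returns (mean, log-variance) of the code."""
        mu = x[:8] - x[:8].mean()
        log_var = np.full(8, -1.0)
        return mu, log_var

    def sample_code(mu, log_var):
        """Draw z = mu + sigma * eps, eps ~ N(0, I), so gradients can flow through mu and sigma."""
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def decode(z):
        """Stand-in for the generative half: maps a sampled code back to data space."""
        return np.repeat(z, 8)

    x = rng.random(64)                          # an input vector X_m
    x_hat = decode(sample_code(*encode(x)))     # a "nearly continuous" reconstruction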

Generative Adversarial Networks.
Generative Networks
It is possible to put a discriminative network together with a generative network and
have them train each other. This is called a Generative Adversarial Network (GAN).

A Generative Adversarial Network places a generative network in competition with a


Discriminative network.

The two networks compete in a zero-sum game, where each network attempts to fool
the other network. The generative network generates examples of an image and the
discriminative network attempts to recognize whether the generated image is realistic
or not. Each network provides feedback to the other, and together they train each
other. The result is a technique for unsupervised learning that can learn to create
realistic patterns. Applications include synthesis of images, video, speech or
coordinated actions for robots.

Generally, the discriminator is first trained on real data. The discriminator is then
frozen and used to train the generator. The generator is trained by using random
inputs to generate fake outputs. Feedback from the discriminator drives gradient
ascent by back propagation. When the generator is sufficiently trained, the two
networks are put in competition.

GAN Learning as Min-Max Optimization.


The generator is a function $\hat X = G(\vec z, \theta_g)$, where G() is a differentiable function computed as a multi-layer perceptron with trainable parameters $\theta_g$, and $\vec z$ is an input random vector drawn from a model $p_z(\vec z)$; $\hat X$ is a synthetic (fake) pattern.

The discriminator is a differentiable function $D(\vec X, \theta_d)$, computed as a multi-layer perceptron with parameters $\theta_d$, that estimates the likelihood that $\vec X$ belongs to the set described by the model $\theta_d$.

The generator $\hat X = G(\vec z, \theta_g)$ is trained to minimize $\log\left(1 - D(G(\vec z, \theta_g))\right)$.

The perceptrons D() and G() play a two-player zero-sum min-max game with a value function V(D, G):

$\min_G \max_D V(D, G) = E_{\vec X \sim p_{data}}\left[\log D(\vec X)\right] + E_{\vec z \sim p_z}\left[\log\left(1 - D(G(\vec z))\right)\right]$

In practice, this may not give sufficient gradient to learn. To avoid this, the discriminator is first trained on real data. The generator is then trained with the discriminator held constant. When the generator is sufficiently trained, the two networks are put in competition, providing unsupervised learning.

The discriminator is trained by ascending its stochastic gradient to seek a maximum:

$\nabla_{\theta_d} \frac{1}{M}\sum_{m=1}^{M}\left[\log D(\vec X_m) + \log\left(1 - D(G(\vec z_m))\right)\right]$

The generator is trained by descending its stochastic gradient to seek a minimum:

$\nabla_{\theta_g} \frac{1}{M}\sum_{m=1}^{M} \log\left(1 - D(G(\vec z_m))\right)$
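
To make the alternating optimization concrete, here is a toy numerical sketch of the min-max game on a 1-D problem. The generator is linear, the discriminator is logistic, and the gradients are taken by finite differences purely for illustration; this is not the lecture's setup, which uses multi-layer perceptrons trained by back-propagation:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

    def V(theta_d, theta_g, x_real, z):
        """Value function: mean log D(x) + mean log(1 - D(G(z)))."""
        D = lambda x: sigmoid(theta_d[0] * x + theta_d[1])
        G = lambda z: theta_g[0] * z + theta_g[1]
        return np.mean(np.log(D(x_real) + 1e-9)) + np.mean(np.log(1 - D(G(z)) + 1e-9))

    def grad(fun, theta, eps=1e-5):
        """Finite-difference gradient of a scalar function of a parameter vector."""
        g = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta); e[i] = eps
            g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
        return g

    theta_d, theta_g, lr = np.array([1.0, 0.0]), np.array([0.1, 0.0]), 0.05
    for step in range(2000):
        x_real = rng.normal(4.0, 0.5, 64)         # samples from the "real" distribution
        z = rng.normal(0.0, 1.0, 64)              # generator input noise
        # Discriminator: gradient ASCENT on V (the max player).
        theta_d += lr * grad(lambda td: V(td, theta_g, x_real, z), theta_d)
        # Generator: gradient DESCENT on V (the min player).
        theta_g -= lr * grad(lambda tg: V(theta_d, tg, x_real, z), theta_g)

    print(theta_g)   # after training, G(z) should have drifted toward the real data around 4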

