Mod5_Slides

The document discusses methods for comparing two distributions using two-sample tests, focusing on hypotheses regarding whether samples come from the same distribution. It introduces the concept of using a discriminator in generative models, such as GANs, to distinguish between real and generated samples, and outlines the training objectives for both the generator and discriminator. Additionally, it highlights challenges in GAN training, including mode collapse and optimization difficulties, while presenting various divergence measures for evaluating distribution similarity.


Comparing distributions via samples

Given a finite set of samples from two distributions, S1 = {x ∼ P} and S2 = {x ∼ Q}, how can we tell if these samples are from the same distribution (i.e., P = Q)?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 8/1


Two-sample tests

Given S1 = {x ∼ P} and S2 = {x ∼ Q}, a two-sample test considers the following hypotheses:
Null hypothesis H0 : P = Q
Alternative hypothesis H1 : P ̸= Q
Test statistic T compares S1 and S2, for example the difference in means (or variances) of the two sets of samples:

T(S1, S2) = (1/|S1|) ∑_{x∈S1} x − (1/|S2|) ∑_{x∈S2} x

If T is larger than a threshold α, reject H0; otherwise we say H0 is consistent with the observation.
Key observation: Test statistic is likelihood-free since it does not
involve the densities P or Q (only samples)
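A minimal NumPy sketch of this statistic (the Gaussian toy data and the idea of comparing against a threshold α are illustrative; in practice the threshold would be calibrated, e.g., by permutation):

```python
import numpy as np

def mean_diff_statistic(S1, S2):
    # T(S1, S2): norm of the difference between the empirical means of the two sample sets.
    return np.linalg.norm(S1.mean(axis=0) - S2.mean(axis=0))

rng = np.random.default_rng(0)
S1 = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))        # samples from P
S2_same = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # Q = P
S2_diff = rng.normal(loc=1.0, scale=1.0, size=(1000, 2))   # Q != P (shifted mean)

print(mean_diff_statistic(S1, S2_same))   # small: consistent with H0
print(mean_diff_statistic(S1, S2_diff))   # large: reject H0 if above a threshold alpha
```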

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 9/1


Generative modeling and two-sample tests

A priori we assume direct access to S1 = D = {x ∼ pdata }


In addition, we have a model distribution pθ
Assume that the model distribution permits efficient sampling (e.g.,
directed models). Let S2 = {x ∼ pθ }
Alternative notion of distance between distributions: Train the
generative model to minimize a two-sample test objective between S1
and S2

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 10 / 1


Two-Sample Test via a Discriminator

Finding a good two-sample test objective in high dimensions is hard

In the generative model setup, we know that S1 and S2 come from different distributions pdata and pθ, respectively
Key idea: Learn a statistic to automatically identify in what way the
two sets of samples S1 and S2 differ from each other
How? Train a classifier (called a discriminator)!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 11 / 1


Two-Sample Test via a Discriminator


Train any binary classifier Dϕ (e.g., a neural network) to distinguish “real” (y = 1) samples from the dataset from “fake” (y = 0) samples generated by the model
Test statistic: the negative loss of the classifier. A low loss means real and fake samples are easy to distinguish (the distributions differ); a high loss means they are hard to distinguish (the distributions are similar).
Goal: Maximize the two-sample test statistic (in support of the
alternative hypothesis pdata ̸= pθ ), or equivalently minimize
classification loss

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 12 / 1


Two-Sample Test via a Discriminator
Training objective for discriminator:
max V(pθ, Dϕ) = Ex∼pdata[log Dϕ(x)] + Ex∼pθ[log(1 − Dϕ(x))]
            ≈ ∑_{x∈S1} log Dϕ(x) + ∑_{x∈S2} log(1 − Dϕ(x))

For a fixed generative model pθ, the discriminator is performing binary classification with the cross-entropy objective:
Assign probability 1 to true data points x ∼ pdata (in set S1 )
Assign probability 0 to fake samples x ∼ pθ (in set S2 )
Optimal discriminator:

Dθ∗(x) = pdata(x) / (pdata(x) + pθ(x))

Sanity check: if pθ = pdata, the classifier cannot do better than chance (Dθ∗(x) = 1/2).
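A minimal PyTorch sketch of this cross-entropy objective (the 2-dimensional inputs and the small MLP discriminator are assumptions for illustration, not from the slides):

```python
import torch
import torch.nn as nn

# Discriminator D_phi: any binary classifier; here a small MLP over 2-D data.
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()

def discriminator_loss(x_real, x_fake):
    # Maximizing V(p_theta, D_phi) is equivalent to minimizing binary cross-entropy
    # with label 1 for real samples (S1) and label 0 for generated samples (S2).
    loss_real = bce(D(x_real), torch.ones(x_real.size(0), 1))
    loss_fake = bce(D(x_fake), torch.zeros(x_fake.size(0), 1))
    return loss_real + loss_fake
```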
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 13 / 1
Generative Adversarial Networks

A two-player minimax game between a generator and a discriminator.
Generator:
Directed, latent variable model with a deterministic mapping between z
and x given by Gθ
Sample z ∼ p(z), where p(z) is a simple prior, e.g. Gaussian
Set x = Gθ (z)
Similar to a flow model, but mapping Gθ need not be invertible
The distribution pθ(x) over x is implicitly defined (no likelihood!)
Minimizes a two-sample test objective (in support of the null
hypothesis pdata = pθ )
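A sketch of this sampling procedure (the latent/data dimensions and the MLP form of Gθ are illustrative assumptions):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2   # assumed sizes for illustration

# Deterministic mapping G_theta: z -> x. Unlike a flow model it need not be
# invertible; p_theta(x) is defined only implicitly through this sampling process.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

z = torch.randn(128, latent_dim)   # z ~ p(z), a simple Gaussian prior
x = G(z)                           # a batch of samples from p_theta(x)
```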

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 14 / 1


Example of GAN objective
Training objective for generator:

min_G max_D V(G, D) = Ex∼pdata[log D(x)] + Ex∼pG[log(1 − D(x))]

For the optimal discriminator DG∗(·), we have

V(G, DG∗) = Ex∼pdata[ log ( pdata(x) / (pdata(x) + pG(x)) ) ] + Ex∼pG[ log ( pG(x) / (pdata(x) + pG(x)) ) ]

= Ex∼pdata[ log ( pdata(x) / ((pdata(x) + pG(x))/2) ) ] + Ex∼pG[ log ( pG(x) / ((pdata(x) + pG(x))/2) ) ] − log 4

= DKL( pdata, (pdata + pG)/2 ) + DKL( pG, (pdata + pG)/2 ) − log 4    (the first two terms equal 2 × the Jensen-Shannon divergence)

= 2 DJSD[pdata, pG] − log 4

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 15 / 1


Jensen-Shannon Divergence
Also called the symmetric KL divergence:

DJSD[p, q] = (1/2) [ DKL( p, (p + q)/2 ) + DKL( q, (p + q)/2 ) ]

Properties
DJSD[p, q] ≥ 0
DJSD[p, q] = 0 iff p = q
DJSD[p, q] = DJSD[q, p]
√(DJSD[p, q]) satisfies the triangle inequality → Jensen-Shannon distance
Optimal generator for the JSD / negative cross-entropy GAN: pG = pdata

For the optimal generator G∗(·) and its optimal discriminator D∗G∗(·), we have V(G∗, D∗G∗) = − log 4
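A small NumPy check of these properties on discrete distributions (a sketch, not part of the slides):

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions; assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.4, 0.5])
print(jsd(p, q), jsd(q, p))   # symmetric and non-negative
print(jsd(p, p))              # 0 iff p = q
print(np.sqrt(jsd(p, q)))     # square root: the Jensen-Shannon distance (a metric)
```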
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 16 / 1
Recap of GANs

Choose d(pdata , pθ ) to be a two-sample test statistic


Learn the statistic by training a classifier (discriminator)
Under ideal conditions, equivalent to choosing d(pdata , pθ ) to be
DJSD [pdata , pθ ]
Pros:
Loss only requires samples from pθ . No likelihood needed!
Lots of flexibility for the neural network architecture, any Gθ defines a
valid sampling procedure
Fast sampling (single forward pass)
Cons: very difficult to train in practice
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 17 / 1
The GAN training algorithm

Sample a minibatch of m training points x(1), x(2), . . . , x(m) from D
Sample a minibatch of m noise vectors z(1), z(2), . . . , z(m) from pz
Update the discriminator parameters ϕ by stochastic gradient ascent:

∇ϕ V(Gθ, Dϕ) = (1/m) ∑_{i=1}^{m} ∇ϕ [ log Dϕ(x(i)) + log(1 − Dϕ(Gθ(z(i)))) ]

Update the generator parameters θ by stochastic gradient descent:

∇θ V(Gθ, Dϕ) = (1/m) ∑_{i=1}^{m} ∇θ log(1 − Dϕ(Gθ(z(i))))

Repeat for a fixed number of epochs
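A compact PyTorch sketch of this alternating procedure (the toy MLPs, synthetic 2-D data, and hyperparameters are assumptions for illustration):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, m = 16, 2, 64
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
data = torch.randn(10_000, data_dim)   # placeholder for the training set D
eps = 1e-8                             # numerical stability inside the logs

for epoch in range(5):
    for i in range(0, len(data), m):
        x = data[i:i + m]                        # minibatch of training points
        z = torch.randn(x.size(0), latent_dim)   # minibatch of noise vectors

        # Discriminator step: gradient ascent on V == gradient descent on -V
        opt_D.zero_grad()
        d_loss = -(torch.log(D(x) + eps) + torch.log(1 - D(G(z).detach()) + eps)).mean()
        d_loss.backward()
        opt_D.step()

        # Generator step: gradient descent on V
        opt_G.zero_grad()
        g_loss = torch.log(1 - D(G(z)) + eps).mean()
        g_loss.backward()
        opt_G.step()
```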

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 18 / 1


Alternating optimization in GANs

min_θ max_ϕ V(Gθ, Dϕ) = Ex∼pdata[log Dϕ(x)] + Ez∼p(z)[log(1 − Dϕ(Gθ(z)))]

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 19 / 1


Which one is real?

Both images are generated via GANs!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 20 / 1


Frontiers in GAN research

GANs have been successfully applied to several domains and tasks


However, working with GANs can be very challenging in practice
Unstable optimization
Mode collapse
Evaluation
Bag of tricks needed to train GANs successfully

Image Source: Ian Goodfellow. Samples from Goodfellow et al., 2014, Radford et
al., 2015, Liu et al., 2016, Karras et al., 2017, Karras et al., 2018
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 21 / 1
Optimization challenges
Theorem (informal): If the generator updates are made in function space and the discriminator is optimal at every step, then the generator is guaranteed to converge to the data distribution.
Unrealistic assumptions!
In practice, the generator and discriminator losses keep oscillating during GAN training

Source: Mirantha Jayathilaka

No robust stopping criteria in practice (unlike MLE)


Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 22 / 1
Mode Collapse

GANs are notorious for suffering from mode collapse


Intuitively, this refers to the phenomenon where the generator of a GAN collapses to one or a few samples (dubbed “modes”)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 23 / 1


Mode Collapse

True distribution is a mixture of Gaussians

The generator distribution keeps oscillating between different modes


Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 24 / 1
Mode Collapse

Fixes to mode collapse are mostly empirically driven: alternative architectures, alternative GAN losses, adding regularization terms, etc.
https://github.com/soumith/ganhacks
How to Train a GAN? Tips and tricks to make GANs work by
Soumith Chintala

Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 25 / 1


Beauty lies in the eyes of the discriminator

Source: Robbie Barrat, Obvious

GAN generated art auctioned at Christie’s.


Expected Price: $7,000–$10,000
True Price: $432,500
Stefano Ermon (AI Lab) Deep Generative Models Lecture 9 26 / 1
Recap.

Likelihood-free training
Training objective for GANs

min_G max_D V(G, D) = Ex∼pdata[log D(x)] + Ex∼pG[log(1 − D(x))]

With the optimal discriminator DG∗, we see that the GAN objective minimizes a scaled and shifted Jensen-Shannon divergence:

min_G 2 DJSD[pdata, pG] − log 4

Parameterize D by ϕ and G by θ. Prior distribution p(z).

min_θ max_ϕ Ex∼pdata[log Dϕ(x)] + Ez∼p(z)[log(1 − Dϕ(Gθ(z)))]

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 2 / 28


Selected GANs

https://github.com/hindupuravinash/the-gan-zoo
The GAN Zoo: List of all named GANs
Today
Rich class of likelihood-free objectives via f -GANs
Wasserstein GAN
Inferring latent representations via BiGAN
Application: Image-to-image translation via CycleGANs

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 3 / 28


Beyond KL and Jensen-Shannon Divergence

What choices do we have for d(·)?


KL divergence: Autoregressive Models, Flow models
(scaled and shifted) Jensen-Shannon divergence (approximately):
original GAN objective

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 4 / 28


f divergences
Given two densities p and q, the f-divergence is given by

Df(p, q) = Ex∼q[ f( p(x)/q(x) ) ]

where f is any convex, lower-semicontinuous function with f(1) = 0.
Convex: the line segment joining any two points on the graph lies above the function
Lower-semicontinuous: near any point x0, the function values are either close to f(x0) or greater than f(x0)

Jensen’s inequality: Ex∼q[f(p(x)/q(x))] ≥ f(Ex∼q[p(x)/q(x)]) = f(∫ q(x) p(x)/q(x) dx) = f(∫ p(x) dx) = f(1) = 0, so Df(p, q) ≥ 0
Example: KL divergence with f (u) = u log u
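A NumPy sketch on discrete distributions using the KL example f(u) = u log u (an illustration, not from the slides):

```python
import numpy as np

def f_divergence(p, q, f):
    # D_f(p, q) = E_{x~q}[ f(p(x)/q(x)) ] for discrete p, q with q > 0 everywhere.
    return np.sum(q * f(p / q))

f_kl = lambda u: u * np.log(u)      # convex, f(1) = 0; recovers D_KL(p || q)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(f_divergence(p, q, f_kl))     # equals sum(p * log(p / q)) >= 0
print(f_divergence(p, p, f_kl))     # 0 when p = q, since f(1) = 0
```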
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 5 / 28
f divergences

Many more f-divergences!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 6 / 28


Training with f -divergences
Given pdata and pθ , we could minimize Df (pθ , pdata ) or Df (pdata , pθ )
as learning objectives. Non-negative, and zero if pθ = pdata
However, both depend on the density ratio, which is unknown:

Df(pθ, pdata) = Ex∼pdata[ f( pθ(x)/pdata(x) ) ]    (the expectation can be approximated with samples; the ratio pθ(x)/pdata(x) is unknown)

Df(pdata, pθ) = Ex∼pθ[ f( pdata(x)/pθ(x) ) ]    (likewise: samples are available, but the ratio is unknown)

To use f-divergences as a two-sample test objective for likelihood-free learning, we need to be able to estimate the objective using only samples (e.g., training data and samples from the model)
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 7 / 28
Towards Variational Divergence Minimization
Fenchel conjugate: For any function f(·), its convex conjugate is

f∗(t) = sup_{u∈dom f} (ut − f(u))

where dom f is the domain of the function f.

f∗ is convex (a pointwise supremum of convex functions is convex) and lower semicontinuous.
Let f∗∗ be the Fenchel conjugate of f∗:

f∗∗(u) = sup_{t∈dom f∗} (tu − f∗(t))

f∗∗ ≤ f. Proof: By definition, for all t and u,

f∗(t) ≥ ut − f(u), or equivalently f(u) ≥ ut − f∗(t)

Taking the supremum over t: f(u) ≥ sup_t (ut − f∗(t)) = f∗∗(u)

Strong duality: f∗∗ = f when f(·) is convex and lower semicontinuous.
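As a worked example (not on the slides), the conjugate of the KL generator f(u) = u log u can be computed directly; the stationarity step below assumes u > 0:

```latex
\begin{aligned}
f^*(t) &= \sup_{u>0}\,\bigl(ut - u\log u\bigr), \\
0 &= \frac{\partial}{\partial u}\bigl(ut - u\log u\bigr) = t - \log u - 1
   \;\Longrightarrow\; u^\star = e^{\,t-1}, \\
f^*(t) &= u^\star t - u^\star \log u^\star = e^{\,t-1}t - e^{\,t-1}(t-1) = e^{\,t-1}.
\end{aligned}
```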


Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 8 / 28
f -GAN: Variational Divergence Minimization
We obtain a lower bound to an f-divergence via the Fenchel conjugate:

Df(p, q) = Ex∼q[ f( p(x)/q(x) ) ] = Ex∼q[ f∗∗( p(x)/q(x) ) ]

= Ex∼q[ sup_{t∈dom f∗} ( t · p(x)/q(x) − f∗(t) ) ]

= ∫X q(x) ( T∗(x) · p(x)/q(x) − f∗(T∗(x)) ) dx    (where T∗(x) attains the supremum pointwise)

= ∫X ( T∗(x) p(x) − f∗(T∗(x)) q(x) ) dx

≥ sup_{T∈T} ∫X ( T(x) p(x) − f∗(T(x)) q(x) ) dx

= sup_{T∈T} ( Ex∼p[T(x)] − Ex∼q[f∗(T(x))] )

where T : X 7→ R is an arbitrary class of functions

Note: the lower bound is likelihood-free w.r.t. p and q
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 9 / 28
f -GAN: Variational Divergence Minimization
Variational lower bound:

Df(p, q) ≥ sup_{T∈T} ( Ex∼p[T(x)] − Ex∼q[f∗(T(x))] )

Choose any f -divergence


Let p = pdata and q = pG
Parameterize T by ϕ and G by θ
Consider the following f -GAN objective

min_θ max_ϕ F(θ, ϕ) = Ex∼pdata[Tϕ(x)] − Ex∼pGθ[f∗(Tϕ(x))]

The generator Gθ tries to minimize the divergence estimate, and the discriminator Tϕ tries to tighten the lower bound
Substitute any f-divergence and optimize the f-GAN objective
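A minimal PyTorch sketch of this objective for the KL generator f(u) = u log u, whose conjugate is f∗(t) = e^(t−1) (the architectures and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))   # generator G_theta
T = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))            # critic T_phi

def f_star(t):
    # Fenchel conjugate of f(u) = u log u (see the worked example above).
    return torch.exp(t - 1)

def f_gan_objective(x_real, z):
    # F(theta, phi) = E_{p_data}[T_phi(x)] - E_{p_G}[f*(T_phi(x))]
    return T(x_real).mean() - f_star(T(G(z))).mean()

# T_phi is updated by gradient ascent on F (tightening the lower bound on D_f);
# G_theta is updated by gradient descent on F (minimizing the divergence estimate).
```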
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 10 / 28
Beyond KL and Jensen-Shannon Divergence

What choices do we have for d(·)?


KL divergence: Autoregressive Models, Flow models
(scaled and shifted) Jensen-Shannon divergence (approximately): via
the original GAN objective
Any other f -divergence (approximately): via the f -GAN objective

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 11 / 28


Wasserstein GAN: beyond f -divergences
The f-divergence is defined as

Df(p, q) = Ex∼q[ f( p(x)/q(x) ) ]

The support of q has to cover the support of p; otherwise discontinuities arise in f-divergences.

Let p(x) = 1 if x = 0 and 0 otherwise, and qθ(x) = 1 if x = θ and 0 otherwise. Then

DKL(p, qθ) = 0 if θ = 0, and ∞ if θ ≠ 0
DJS(p, qθ) = 0 if θ = 0, and log 2 if θ ≠ 0
We need a “smoother” distance D(p, q) that is defined when p and q
have disjoint supports.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 12 / 28
Wasserstein (Earth-Mover) distance

Wasserstein distance:

Dw(p, q) = inf_{γ∈Π(p,q)} E(x,y)∼γ[ ∥x − y∥1 ]

where Π(p, q) contains all joint distributions of (x, y) whose marginal over x is p(x) = ∫ γ(x, y) dy and whose marginal over y is q(y) = ∫ γ(x, y) dx.
γ(y | x): a probabilistic earth-moving plan that warps p(x) into q(y).

For the point masses above, p(x) = 1 if x = 0 (0 otherwise) and qθ(x) = 1 if x = θ (0 otherwise):

Dw(p, qθ) = |θ|
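A quick numerical check of this example using SciPy's one-dimensional earth-mover distance (the sample-based approximation of the two point masses is an illustration):

```python
import numpy as np
from scipy.stats import wasserstein_distance

theta = 0.3
p_samples = np.zeros(1000)         # all mass at x = 0
q_samples = np.full(1000, theta)   # all mass at x = theta

# Unlike KL (infinite) or JSD (constant log 2) for theta != 0,
# the earth-mover distance varies smoothly with theta.
print(wasserstein_distance(p_samples, q_samples))   # ~0.3 == |theta|
```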
Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 13 / 28
Wasserstein GAN (WGAN)

Kantorovich-Rubinstein duality

Dw(p, q) = sup_{∥f∥L ≤ 1} ( Ex∼p[f(x)] − Ex∼q[f(x)] )

∥f∥L ≤ 1 means the Lipschitz constant of f(x) is at most 1. Technically,

∀x, y : |f(x) − f(y)| ≤ ∥x − y∥1

Intuitively, f cannot change too rapidly.


Wasserstein GAN with discriminator Dϕ (x) and generator Gθ (z):

min_θ max_ϕ Ex∼pdata[Dϕ(x)] − Ez∼p(z)[Dϕ(Gθ(z))]

Lipschitzness of Dϕ(x) is enforced through weight clipping or a gradient penalty on ∇x Dϕ(x) (see the sketch below).
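A sketch of the gradient-penalty option, in the spirit of WGAN-GP (Gulrajani et al., 2017); the penalty weight and the interpolation scheme are standard choices, not specified in the slides:

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Penalize deviations of ||grad_x D(x)|| from 1 on interpolates of real and fake samples.
    # D is an unconstrained critic (no sigmoid output).
    eps = torch.rand(x_real.size(0), 1)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def wgan_critic_loss(D, G, x_real, z):
    x_fake = G(z).detach()
    # The critic maximizes E_p_data[D(x)] - E_p(z)[D(G(z))];
    # we minimize the negative of that plus the penalty.
    return -(D(x_real).mean() - D(x_fake).mean()) + gradient_penalty(D, x_real, x_fake)
```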

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 14 / 28


Wasserstein GAN (WGAN)

More stable training, and less mode collapse.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 15 / 28


Inferring latent representations in GANs

The generator of a GAN is typically a directed, latent variable model with latent variables z and observed variables x. How can we infer the latent feature representations in a GAN?
Unlike a normalizing flow model, the mapping G : z 7→ x need not be
invertible
Unlike a variational autoencoder, there is no inference network q(·)
which can learn a variational posterior over latent variables
Solution 1: For any point x, use the activations of the prefinal layer
of a discriminator as a feature representation
Intuition: Similar to supervised deep neural networks, the
discriminator would have learned useful representations for x while
distinguishing real and fake x

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 16 / 28


Inferring latent representations in GANs

If we want to directly infer the latent variables z of the generator, we need a different learning algorithm
A regular GAN optimizes a two-sample test objective that compares
samples of x from the generator and the data distribution
Solution 2: To infer latent representations, we will compare samples
of x, z from the joint distributions of observed and latent variables as
per the model and the data distribution
For any x generated via the model, we have access to z (sampled
from a simple prior p(z))
For any x from the data distribution, the z is however unobserved
(latent). Need an encoder!

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 17 / 28


Bidirectional Generative Adversarial Networks (BiGAN)

In a BiGAN, we have an encoder network E in addition to the generator network G
The encoder network only observes x ∼ pdata (x) during training to
learn a mapping E : x 7→ z
As before, the generator network only observes the samples from the
prior z ∼ p(z) during training to learn a mapping G : z 7→ x

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 18 / 28


Bidirectional Generative Adversarial Networks (BiGAN)

The discriminator D observes samples from the generative model, (z, G(z)), and from the encoding distribution, (E(x), x)
The goal of the discriminator is to maximize the two-sample test objective between (z, G(z)) and (E(x), x)
After training is complete, new samples are generated via G and
latent representations are inferred via E
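A minimal PyTorch sketch of the BiGAN discriminator objective (architectures and dimensions are assumptions; labeling the data/encoder pairs as "real" is one common convention):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # G: z -> x
E = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))  # E: x -> z
D = nn.Sequential(nn.Linear(data_dim + latent_dim, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())                                 # D scores (x, z) pairs

def bigan_d_loss(x_real, z_prior, eps=1e-8):
    pair_gen = torch.cat([G(z_prior), z_prior], dim=1)   # (G(z), z) from the generative model
    pair_enc = torch.cat([x_real, E(x_real)], dim=1)     # (x, E(x)) from the data + encoder
    # Two-sample test objective between the two joint distributions over (x, z).
    return -(torch.log(D(pair_enc) + eps) + torch.log(1 - D(pair_gen) + eps)).mean()
```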

Stefano Ermon (AI Lab) Deep Generative Models Lecture 10 19 / 28
