
Lecture 5: Adversarial Networks & Variants

Background Reading: GANs, f-GAN, conditional mutual information, Lipschitz functions, primal-dual optimization.

Based on the basic principles of adversarial learning, multiple improvisations have been proposed in the literature. We shall study a few of them, namely InfoGAN, BiGAN, CycleGAN, StyleGAN & WGAN.


InfoGAN

Objective: To learn a GAN with a latent space that is semantically disentangled.

Proposal: The input noise vector to the generator is decomposed into two parts:
  z - incompressible noise
  c - structured latent code.

c is denoted by L latent variables c_1, c_2, ..., c_L, with a factored distribution given by

  P(c_1, c_2, ..., c_L) = Π_{i=1}^{L} P(c_i)
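A minimal sketch (in PyTorch) of sampling such a factored code, assuming the InfoGAN paper's MNIST setup of one 10-way categorical code & two continuous codes; the exact choice of factors here is an assumption:

```python
import torch
import torch.nn.functional as F

def sample_latent_code(batch_size, num_cats=10, num_cont=2):
    # Each factor c_i is sampled independently, so p(c) = prod_i p(c_i).
    c_cat = F.one_hot(torch.randint(0, num_cats, (batch_size,)), num_cats).float()
    c_cont = torch.rand(batch_size, num_cont) * 2 - 1   # two Uniform(-1, 1) codes
    return torch.cat([c_cat, c_cont], dim=1)
```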

Since the Generator network takes both z & c as inputs, it is denoted by G_θ(z, c). In a standard GAN, the latent code c may be ignored & there is no way to enforce that the generated distribution should utilize both z & c. Thus, InfoGAN proposes an information-theoretic regularization over the standard GAN objective as follows:

  L_InfoGAN = min_θ max_w F(θ, w) − λ I(c; G_θ(z, c))

Here I(c; G_θ) is the mutual information between the latent codes & the Generator distribution.


Variational Mutual Information Maximization

In practice, the mutual information term I(c; G_θ) is not possible to be directly optimized as it requires access to the latent posterior p(c|x). Thus, a variational lower bound is optimized instead as follows:

Let q(c|x) be the variational approximation to p(c|x).

  I(c; G) = H(c) − H(c|G)
          = E_{x~G} [ E_{c'~p(c|x)} [ log p(c'|x) ] ] + H(c)
          = E_{x~G} [ E_{c'~p(c|x)} [ log q(c'|x) ] + D_KL( p(·|x) || q(·|x) ) ] + H(c)
          ≥ E_{x~G} [ E_{c'~p(c|x)} [ log q(c'|x) ] ] + H(c)          (since D_KL ≥ 0)
The above term however needs samples from p(c|x) to compute the inner expectation, which is avoided using the following trick:

  E_{x~G(z,c)} [ E_{c'~p(c|x)} [ log q(c'|x) ] ] = E_{c~p(c), x~G(z,c)} [ log q(c|x) ]          (1)

from Lemma 5.1 in the InfoGAN paper. The expectation in the RHS of Eq. 1 can be computed using Monte Carlo estimates.

In practice, the distribution q is approximated using another neural network in addition to the Generator & Discriminator networks, & L_InfoGAN is optimized.

Post training, it is shown that variation in a single component of c corresponds to variation in a single semantic factor in the generated data space.
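A minimal sketch (in PyTorch) of a Monte Carlo estimate of the RHS of Eq. 1, assuming a single categorical code & hypothetical `generator` and `q_network` modules (in the paper, Q shares most layers with the discriminator). Maximizing this term, scaled by λ, tightens the mutual-information lower bound:

```python
import torch
import torch.nn.functional as F

def mi_lower_bound_term(generator, q_network, batch_size, noise_dim, num_cats):
    z = torch.randn(batch_size, noise_dim)                 # incompressible noise
    c = torch.randint(0, num_cats, (batch_size,))          # c ~ p(c), uniform categorical
    x_fake = generator(torch.cat([z, F.one_hot(c, num_cats).float()], dim=1))   # x ~ G(z, c)
    logits = q_network(x_fake)                             # parameterizes q(c | x)
    return -F.cross_entropy(logits, c)                     # Monte Carlo estimate of E[log q(c|x)]
```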
Bidirectional GANs (BiGAN)

Objective: Standard GANs do not have a means to learn the inverse mapping from the data space to the latent space. The objective of a BiGAN is to learn both the mappings, from latent space to data space & vice-versa, simultaneously.

Proposal: In addition to the standard Generator, an Encoder network E: X → Z is trained. Let p_E(z|x) denote the density induced by the Encoder network. The standard discriminator is also modified to predict P(Y | x, z), where Y = 1 if x ~ p_X and Y = 0 if x ~ G(z).

With this, the BiGAN optimizes the following objective:

  L_BiGAN = min_{θ,φ} max_w F(θ, φ, w)

  F(θ, φ, w) = E_{x~p_X} [ E_{z~p_E(·|x)} [ log D_w(x, z) ] ] + E_{z~p_Z} [ E_{x~p_G(·|z)} [ log (1 − D_w(x, z)) ] ]

It is shown that the optimal point is reached when P_EX = P_GZ, where

  P_EX(dx, dz) = p_X(x) p_E(z|x) dx dz,        P_GZ(dx, dz) = p_Z(z) p_G(x|z) dx dz

BiGAN optimizes for the JS divergence between the joint distributions over the data & the latent spaces.
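A minimal sketch (in PyTorch) of the value function F(θ, φ, w), assuming hypothetical `encoder`, `generator` & `discriminator` modules, with the discriminator scoring joint (x, z) pairs and a deterministic encoder standing in for p_E(z|x):

```python
import torch
import torch.nn.functional as F

def bigan_value(encoder, generator, discriminator, x_real, noise_dim):
    z_fake = torch.randn(x_real.size(0), noise_dim)   # z ~ p_Z
    x_fake = generator(z_fake)                        # x ~ p_G(.|z)
    z_real = encoder(x_real)                          # z = E(x), a point estimate of p_E(.|x)
    d_real = discriminator(x_real, z_real)            # logit of D_w(x, E(x))
    d_fake = discriminator(x_fake, z_fake)            # logit of D_w(G(z), z)
    # E[log D_w(x, z)] + E[log(1 - D_w(x, z))], maximized over w, minimized over theta & phi
    return (F.logsigmoid(d_real) + F.logsigmoid(-d_fake)).mean()
```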
Cycle-GAN

Objective: To learn to translate between the distributions of samples of a pair of domains.

Proposal: Use adversarial learning in a conditional setting, wherein the source distribution is used as input for a GAN instead of the noise variable. In addition, incorporate a two-way loss that enforces transitivity (cycle consistency).
Given a pair of domains X & Y with p_X & p_Y as densities, Cycle-GAN has two mapping functions G_X: X → Y and G_Y: Y → X, which are simultaneously learned along with the corresponding discriminator functions D_Y & D_X respectively. In addition to the usual GAN loss, a cycle-consistency loss is introduced for enforcing transitivity:

  L_cycle-cons = E_{x~p_X} [ || G_Y(G_X(x)) − x ||_1 ] + E_{y~p_Y} [ || G_X(G_Y(y)) − y ||_1 ]

Therefore the final objective function is as follows:

  L_CycleGAN = E_{x~p_X} [ log (1 − D_Y(G_X(x))) ] + E_{y~p_Y} [ log D_Y(y) ]
             + E_{y~p_Y} [ log (1 − D_X(G_Y(y))) ] + E_{x~p_X} [ log D_X(x) ] + L_cycle-cons
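A minimal sketch (in PyTorch) of the cycle-consistency term, assuming hypothetical generator modules `g_xy` (X → Y) & `g_yx` (Y → X):

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_xy, g_yx, x, y):
    x_rec = g_yx(g_xy(x))   # X -> Y -> X round trip should reproduce x
    y_rec = g_xy(g_yx(y))   # Y -> X -> Y round trip should reproduce y
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)
```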
Style-GAN

Objective: To learn unsupervised attribute separation or disentanglement in the generated space of a GAN.

Proposal: Learn multiple latent vectors corresponding to different possible styles in the feature space of the Generator. Given a latent code z, first a latent transformation is learned via a mapping network f: Z → W, w ∈ W. Subsequently, the w vector is fed separately to each of the feature maps in the generator via an adaptive instance normalization (AdaIN) layer as follows:

  AdaIN(x_i, y) = y_{s,i} · (x_i − μ(x_i)) / σ(x_i) + y_{b,i}

where x_i is the i-th feature map of the generator, with μ(x_i) & σ(x_i) being the corresponding statistics. y = (y_s, y_b) is the output of an affine transformation layer with w as the input.

Note that the StyleGAN modification is in the generator architecture but does not involve any loss / metric modification.
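A minimal sketch (in PyTorch) of an AdaIN layer along these lines, folding the affine map from w to (y_s, y_b) into the layer; dimensions & naming are assumptions:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * num_channels)   # w -> y = (y_s, y_b)

    def forward(self, x, w):                     # x: (N, C, H, W) feature maps, w: (N, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]              # broadcast per-channel scale
        y_b = y_b[:, :, None, None]              # broadcast per-channel bias
        mu = x.mean(dim=(2, 3), keepdim=True)    # mu(x_i), per feature map
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8   # sigma(x_i)
        return y_s * (x - mu) / sigma + y_b
```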
Wasserstein GAN

GANs in the naive formulation are known to be very unstable to train. This is ascribed to the non-alignment of the manifolds that form the supports of the distributions between which the divergence is calculated. It is shown that, for a pair of distributions whose supports do not have full dimension & do not perfectly align, the usual f-divergences such as JSD and forward/reverse KLD will be maxed out with the existence of a perfect discriminator. This calls for a 'softer' divergence metric between distributions that does not max out when the manifolds of the supports do not perfectly align.

Earth Mover's or Wasserstein distance:

Let P & Q denote two distributions over a space X. The Wasserstein distance between P & Q is defined as

  W(P, Q) = inf_{γ ∈ Π(P,Q)} E_{(x,y)~γ} [ || x − y || ]

Here, the infimum is over the set of all joint distributions Π(P, Q) whose marginals are respectively P & Q. γ(x, y) is the amount of "mass" that is transported from x to y in order to transform P into Q, which when multiplied with || x − y || specifies the amount of 'work' done in the said transportation. Thus, EMD or WD is the cost incurred by the optimal transport plan. Note that every γ can be thought of as a transport plan, out of which the optimal one is sought in EMD via the infimum over Π.
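As a small numeric illustration of the definition: for two 1-D empirical distributions the optimal transport plan has a closed form, which SciPy exposes as scipy.stats.wasserstein_distance. Shifting a distribution by one unit costs exactly one unit of work:

```python
from scipy.stats import wasserstein_distance

p_samples = [0.0, 1.0, 2.0]   # empirical P
q_samples = [1.0, 2.0, 3.0]   # empirical Q: P shifted right by 1
print(wasserstein_distance(p_samples, q_samples))   # 1.0, since every unit of mass moves a distance of 1
```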

EMD is shown to possess some 'good' properties. For instance, suppose P_r is a density over X & z is a RV over a space Z. Let g_θ: Z × R^d → X be a parametric function [a neural network] with distribution P_θ. Then it can be shown that, if g_θ is continuous in θ, so is W(P_r, P_θ), unlike JSD(P_r, P_θ) and KL(P_r, P_θ). Therefore it is desirable to use the Wasserstein distance, rather than any of the other divergence metrics, to learn generative samplers. However, the infimum in the definition of the WD is intractable, albeit a dual definition may be used to optimize the WD in practice.
WGAN:

The Kantorovich-Rubinstein duality provides a dual definition for the WD as follows:

  W(P_r, P_θ) = sup_{||f||_L ≤ 1} E_{x~P_r} [ f(x) ] − E_{x~P_θ} [ f(x) ]

The supremum is over all the 1-Lipschitz functions f: X → R. Typically, the function f is approximated by a neural network, called the critic network, & the supremum is replaced by the maximum over the parameters of the neural network.

The distribution P_θ is approximated using a sampler or generator neural network g_θ(z), z ~ N(0, I). With these, the objective for a WGAN to optimize will be as follows:

  L_WGAN = min_θ max_{w : ||f_w||_L ≤ 1} E_{x~P_r} [ f_w(x) ] − E_{z~p_z} [ f_w(g_θ(z)) ]

In practice, f_w is a neural network which is made Lipschitz by weight clipping after every gradient update or by weight regularization.
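A minimal sketch (in PyTorch) of one critic update with weight clipping, assuming hypothetical `critic` (f_w) & `generator` (g_θ) modules; the clip value 0.01 is the default used in the WGAN paper:

```python
import torch

def critic_step(critic, generator, critic_opt, x_real, noise_dim, clip=0.01):
    z = torch.randn(x_real.size(0), noise_dim)
    x_fake = generator(z).detach()                   # sample from P_theta without generator gradients
    # maximize E[f_w(x)] - E[f_w(g_theta(z))]  <=>  minimize its negative
    loss = -(critic(x_real).mean() - critic(x_fake).mean())
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    for p in critic.parameters():                    # crude enforcement of the Lipschitz constraint
        p.data.clamp_(-clip, clip)
    return loss.item()
```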

References

1. https://arxiv.org/abs/1606.03657
2. https://arxiv.org/abs/1605.09782
3. https://arxiv.org/abs/1703.10593
4. https://arxiv.org/abs/1701.07875
5. https://arxiv.org/abs/1812.04948
6. https://arxiv.org/abs/1701.04862
7. https://vincentherrmann.github.io/blog/wasserstein
