
Lecture 4: Variational Divergence Minimization / Adversarial Learning

Background reading:

i)   Convex functions & Fenchel conjugates
ii)  Euler-Lagrange equations
iii) Definitions of supremum / infimum


Issue with the KL divergence

Recall that maximization of the data log-likelihood is equivalent to minimization of the KL divergence between the model and the true data distributions.

Suppose p_X and p_\theta represent the densities imposed by the true data and the model, respectively. We have

    D_{KL}(P_X \| P_\theta) = \int_X p_X(x) \log \frac{p_X(x)}{p_\theta(x)} \, dx

Let us study the behavior of D_KL over different regions of X.

Consider a subset X_1 \subset X where p_X(x) >> p_\theta(x) for x \in X_1. D_KL is high over X_1, which is desirable.

However, suppose there exists a subset X_2 \subset X where p_\theta(x) >> p_X(x) for x \in X_2. D_KL will be relatively lower over X_2, which is undesirable, since optimizing D_KL does not discourage coverage of regions with low data density.
Now suppose we consider the reverse KL,

    D_{KL}(P_\theta \| P_X) = \int_X p_\theta(x) \log \frac{p_\theta(x)}{p_X(x)} \, dx

While the reverse KL is high over X_2, it is relatively lower over X_1, which does not encourage the model to cover the entire support of the data.

Therefore, it is desirable to consider alternate metrics other than the KL divergence.
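The asymmetry described above is easy to see numerically. Below is a minimal sketch (my own illustration, not part of the notes) comparing forward and reverse KL when a unimodal model covers only one mode of a bimodal data density; the densities and grid are arbitrary choices.

```python
# Toy comparison of forward vs. reverse KL by numerical integration on a grid.
import numpy as np

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# "Data": a bimodal density. "Model": a single Gaussian covering only one mode.
p_data = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)
p_model = gauss(x, 3, 0.5)

eps = 1e-12  # avoid log(0) in the numerical integration
kl_forward = np.sum(p_data * np.log((p_data + eps) / (p_model + eps))) * dx
kl_reverse = np.sum(p_model * np.log((p_model + eps) / (p_data + eps))) * dx

print(f"forward KL(data || model) = {kl_forward:.3f}")   # large: a data mode is missed
print(f"reverse KL(model || data) = {kl_reverse:.3f}")   # small: model hides inside one mode
```

Here the model misses a data mode (the X_1 situation): the forward KL reacts strongly, while the reverse KL barely notices, so minimizing the reverse KL would not push the model to cover the full support.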
The f-family of divergences

The f-divergence constitutes a family of divergence metrics between distributions.

Definition: Given two distributions P and Q with respective absolutely continuous density functions p and q defined on the domain X, the f-divergence between them is defined as below:

    D_f(P \| Q) := \int_X q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx

where f: \mathbb{R}_+ \to \mathbb{R} is called the generator function, which is convex, lower semi-continuous, and satisfies f(1) = 0.

According to this definition, there can exist infinitely many divergence metrics depending upon the choice of f. Below are a few popular examples of D_f:

i)   f(u) = u log u,                             D_f = KL divergence
ii)  f(u) = -log u,                              D_f = Reverse KL
iii) f(u) = (u - 1)^2,                           D_f = Pearson \chi^2
iv)  f(u) = u log u - (u + 1) log((u + 1)/2),    D_f = Jensen-Shannon divergence
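As a quick sanity check of the definition, here is a small sketch (mine, not from the notes) that encodes the four generators above and evaluates D_f(P || Q) by numerical quadrature of q(x) f(p(x)/q(x)) for two arbitrary 1-D Gaussians.

```python
import numpy as np

generators = {
    "KL":             lambda u: u * np.log(u),
    "Reverse KL":     lambda u: -np.log(u),
    "Pearson chi^2":  lambda u: (u - 1.0) ** 2,
    "Jensen-Shannon": lambda u: u * np.log(u) - (u + 1.0) * np.log((u + 1.0) / 2.0),
}

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-15, 15, 30001)
dx = x[1] - x[0]
p = gauss(x, 0.0, 1.0)   # "data" density p(x)
q = gauss(x, 1.0, 1.5)   # "model" density q(x)

for name, f in generators.items():
    u = p / q                       # density ratio p(x)/q(x)
    d_f = np.sum(q * f(u)) * dx     # D_f(P || Q) = \int q(x) f(p(x)/q(x)) dx
    print(f"{name:15s} D_f = {d_f:.4f}")

# sanity check: every generator satisfies f(1) = 0, so D_f(P || P) = 0
assert all(abs(f(1.0)) < 1e-12 for f in generators.values())
```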

With this, the goal is to develop a variational method for estimation / optimization of D_f given samples from p and q.
Variational Estimation of f-divergences

Fenchel conjugate: Every convex, lower semi-continuous function f has a convex conjugate function f^*, defined as follows:

    f^*(t) = \sup_{u \in \mathrm{dom}_f} \{ ut - f(u) \}
The function f^* can be shown to be convex and lower semi-continuous, with f^{**} = f. Therefore, f(u) can also be represented as below:

    f(u) = \sup_{t \in \mathrm{dom}_{f^*}} \{ tu - f^*(t) \}

Each fixed t gives a point-wise (affine) lower bound on f. Intuitively, this implies that any convex function can be represented as the point-wise maximum of many linear functions.
[Sketch: the graph of a convex f(u) together with several affine lower bounds tu - f^*(t); at any given u, one of them is tight.]
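A small numerical sketch (my own, not from the notes) makes the conjugate pair concrete for the KL generator f(u) = u log u, whose conjugate is known in closed form as f*(t) = exp(t - 1); both the conjugate and the biconjugate representation are approximated by grid search.

```python
import numpy as np

f = lambda u: u * np.log(u)

# f*(t) = sup_u { ut - f(u) }, approximated by a grid search over u > 0
u_grid = np.linspace(1e-4, 50.0, 200001)
def f_star_numeric(t):
    return np.max(t * u_grid - f(u_grid))

for t in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(t, f_star_numeric(t), np.exp(t - 1.0))   # numeric vs. closed form agree

# biconjugate: f(u) = sup_t { tu - f*(t) }, again by grid search over t
t_grid = np.linspace(-10.0, 5.0, 200001)
def f_biconj(u):
    return np.max(t_grid * u - np.exp(t_grid - 1.0))

for u in [0.5, 1.0, 2.0, 4.0]:
    print(u, f_biconj(u), f(u))                    # recovers the original f
```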

Let us use this representation of f in D_f:

    D_f(P \| Q) = \int_X q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx
                = \int_X q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t \, \frac{p(x)}{q(x)} - f^*(t) \right\} dx

Since the supremum is over t and q(x) is non-negative, q(x) can be moved inside the supremum without changing its value. However, the argument inside the supremum is point-wise over x, and thus when the supremum is taken out of the integral, it has to range over a class of functions T: X \to \mathbb{R}:

    D_f(P \| Q) \ge \sup_{T \in \mathcal{T}} \left( \int_X p(x) \, T(x) \, dx - \int_X q(x) \, f^*(T(x)) \, dx \right)

The above becomes a lower bound since \mathcal{T} may not contain all functions that form the solutions of the point-wise optimization problems.


:

Df [ PILE] ≥ sup
TET
[ El B-
(17×1) -
E-
Qx
( f- ( TCM ))]
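This bound is what makes sample-based estimation possible: both expectations can be replaced by Monte Carlo averages. Below is a minimal sketch (my construction, not from the notes) for the KL generator with two hand-picked critics T; the distributions, sample sizes and critics are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=100_000)   # samples from P = N(0, 1)
x_q = rng.normal(1.0, 1.5, size=100_000)   # samples from Q = N(1, 1.5)

f_star = lambda t: np.exp(t - 1.0)         # conjugate of the KL generator f(u) = u log u

def lower_bound(T):
    # sample-based estimate of E_{x~P}[T(x)] - E_{x~Q}[f*(T(x))]
    return T(x_p).mean() - f_star(T(x_q)).mean()

T_const = lambda x: np.zeros_like(x)                  # trivial constant critic
T_quad  = lambda x: 1.6 - 0.44 * x - 0.28 * x ** 2    # hand-tuned quadratic critic

print(lower_bound(T_const))   # ~ -0.37: a valid but very loose lower bound
print(lower_bound(T_quad))    # ~  0.35: close to the true KL(P || Q) of ~0.35
```

Any measurable T in the class gives a valid lower bound; a better critic gives a tighter one.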
Question: supposing \mathcal{T} can represent all functions, what T(x) would drive the gap between D_f and the lower bound to zero?

Solution: the bound becomes tight for

    T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right)

Proof: Notice that D_f is a functional of T(x). To optimize D_f over a class of functions T(x), one needs to consider the Euler-Lagrange equations.
    D_f = \int_X p(x) \, T(x) \, dx - \int_X q(x) \, f^*(T(x)) \, dx    [equality, assuming \mathcal{T} can recover all functions]
        = \int_X \big[ \, p(x) \, T(x) - q(x) \, f^*(T(x)) \, \big] \, dx =: \int_X L \, dx

To get the optimal T(x), we need to solve the following Euler-Lagrange equation:

    \frac{\partial L}{\partial T(x)} - \frac{d}{dx} \frac{\partial L}{\partial T'(x)} = 0

Since T'(x) does not explicitly appear in L, the second term vanishes for all T(x). Therefore, consider

    \frac{\partial L}{\partial T(x)} = 0

    p(x) - q(x) \, (f^*)'(T(x)) = 0

    (f^*)'(T(x)) = \frac{p(x)}{q(x)}

    T(x) = \big( (f^*)' \big)^{-1}\!\left(\frac{p(x)}{q(x)}\right) = f'\!\left(\frac{p(x)}{q(x)}\right)

where the last step uses the fact that, for a convex f, the derivative maps f' and (f^*)' are inverses of each other.
Therefore, one can find the optimal T depending on the choice made for f.
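The following sketch (mine, not from the notes) checks this numerically for the KL generator: evaluating the variational form at T*(x) = f'(p(x)/q(x)) reproduces D_f computed directly from the definition. The two Gaussians are arbitrary toy choices.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-20, 20, 40001)
dx = x[1] - x[0]
p = gauss(x, 0.0, 1.0)
q = gauss(x, 1.0, 1.5)

f       = lambda u: u * np.log(u)       # KL generator
f_prime = lambda u: np.log(u) + 1.0
f_star  = lambda t: np.exp(t - 1.0)

# direct definition: D_f = \int q(x) f(p(x)/q(x)) dx
d_f_direct = np.sum(q * f(p / q)) * dx

# variational form evaluated at the optimal critic T* = f'(p/q)
T_star = f_prime(p / q)
d_f_variational = np.sum(p * T_star) * dx - np.sum(q * f_star(T_star)) * dx

print(d_f_direct, d_f_variational)   # the two values coincide (up to quadrature error)
```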

Variational Divergence Minimization

We have so far established a lower bound on the f-divergence. We now seek to obtain a sampler for the unknown distribution via minimization of the thus obtained lower bound.

First, to apply the lower bound, the domain of the conjugate function f^* has to be respected. To achieve this, the variational function T_w is represented as follows:

    T_w(x) = g_f(V_w(x))

where V_w: X \to \mathbb{R} is without range constraints, and g_f: \mathbb{R} \to \mathrm{dom}_{f^*} is an output activation function specific to the f-divergence used.
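As an illustration (my own summary following the f-GAN construction of Nowozin et al., 2016, not stated in the notes), here are output activations g_f for two of the divergences listed earlier; for other choices of f, g_f must be picked so that its range matches dom_{f*}.

```python
import numpy as np

# KL: f*(t) = exp(t - 1) is finite for all t, so dom_{f*} = R and the identity works.
g_f_kl = lambda v: v

# Jensen-Shannon: f*(t) = -log(2 - exp(t)) requires t < log 2,
# so squash the unconstrained critic output into (-inf, log 2).
g_f_js = lambda v: np.log(2.0) - np.log1p(np.exp(-v))

v = np.linspace(-10, 10, 5)
print(g_f_kl(v))          # unconstrained
print(g_f_js(v))          # always strictly below log 2 ~= 0.693
```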

Collecting everything together, the final objective would be

    F(\theta, w) = \mathbb{E}_{x \sim P}\big[ g_f(V_w(x)) \big] - \mathbb{E}_{x \sim Q_\theta}\big[ f^*\big( g_f(V_w(x)) \big) \big]
G_\theta is a parametric function, typically a Deep Neural Network, which takes samples from an arbitrary distribution as input. V_w is another neural network which operates on the data space X.

G_\theta is often called the Generator network and V_w the Discriminator network.


[Diagram: noise z ~ N(0, I) is passed through G_\theta to produce samples x_\theta; both x_\theta and real samples x ~ P are fed to V_w, whose output (after g_f) lies in dom_{f^*}.]

The optimal \theta^* and w^* are found by solving the following saddle-point problem, alternating between \theta and w:

    \theta^*, w^* = \arg\min_{\theta} \max_{w} F(\theta, w)
Since the networks \theta and w are respectively trying to minimize and maximize the same objective function, this procedure is often referred to as Adversarial Learning.
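To make the alternating procedure concrete, below is a compact PyTorch sketch (my own, not from the notes) of the min-max loop on a 1-D toy problem using the Jensen-Shannon generator, for which both g_f(v) = log 2 - softplus(-v) and the composition f*(g_f(v)) = softplus(v) - log 2 reduce to stable softplus expressions. Architectures, learning rates and the target distribution are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn as nn
from torch.nn.functional import softplus

torch.manual_seed(0)
LOG2 = math.log(2.0)

G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # generator G_theta
V = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # critic V_w

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_V = torch.optim.Adam(V.parameters(), lr=1e-3)

def objective(x_real, x_fake):
    # F(theta, w) = E_P[g_f(V(x))] - E_{Q_theta}[f*(g_f(V(x)))] for the JS generator
    return (LOG2 - softplus(-V(x_real))).mean() - (softplus(V(x_fake)) - LOG2).mean()

for step in range(5000):
    x_real = torch.randn(256, 1) * 0.5 + 2.0      # samples from the "data" distribution P
    x_fake = G(torch.randn(256, 1))               # samples from Q_theta via the generator

    # critic step: maximize F over w (generator held fixed)
    loss_V = -objective(x_real, x_fake.detach())
    opt_V.zero_grad(); loss_V.backward(); opt_V.step()

    # generator step: minimize F over theta
    loss_G = objective(x_real, x_fake)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(10000, 1)).mean().item())     # should drift towards the data mean of 2.0
```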
