Lecture 4 - Variational Divergence Minimization or Adversarial Learning
Background reading:
i) Convex functions & Fenchel conjugates
ii) Euler-Lagrange equations
Recall that maximization of the data log likelihood is equivalent to minimization of the KL divergence. We have

D_KL(p_X || q_X) = ∫_X p_X(x) log( p_X(x) / q_X(x) ) dx
Let us study the behavior of D_KL in different regions of X.

Consider a subset X_1 ⊂ X where p_X(x) >> q_X(x) for x ∈ X_1. Over X_1, D_KL is high, which is desirable.
However, suppose there exists a subset X_2 ⊂ X where q_X(x) >> p_X(x) for x ∈ X_2. D_KL will be relatively lower over X_2, which is undesirable since optimizing D_KL does not discourage coverage of regions with low data density.
Suppose we reverse the KL divergence and consider D_KL(q_X || p_X). While reverse KL is high over X_2, it is relatively lower over X_1, which does not encourage the model to cover the entire support of the data.
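To make this asymmetry concrete, the following small sketch (assuming NumPy; the two toy discrete distributions are made up for illustration and are not from the lecture) compares the per-state contributions to forward and reverse KL:

import numpy as np

# Toy discrete distributions over 4 states: p is the "data", q is the "model".
p = np.array([0.45, 0.45, 0.05, 0.05])
q = np.array([0.05, 0.05, 0.45, 0.45])

# Per-state contributions to forward KL, D(p || q) = sum_x p(x) log(p(x)/q(x)).
forward_terms = p * np.log(p / q)
# Per-state contributions to reverse KL, D(q || p) = sum_x q(x) log(q(x)/p(x)).
reverse_terms = q * np.log(q / p)

print("forward KL terms:", forward_terms)  # large where p >> q, small where q >> p
print("reverse KL terms:", reverse_terms)  # large where q >> p, small where p >> q
print("D(p||q) =", forward_terms.sum(), "  D(q||p) =", reverse_terms.sum())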
f-divergence constitutes a family of divergence metrics between distributions:

D_f(P || Q) = ∫_X q(x) f( p(x) / q(x) ) dx

where f : R_+ → R is called the generator function, which is convex, lower semi-continuous and satisfies f(1) = 0.
According to this definition, there can exist infinitely many divergence metrics depending upon the choice of f.
Below are a few popular examples of D_f:
i) f(u) = u log u, D_f = KL divergence
ii) f(u) = -log u, D_f = Reverse KL divergence
iii) f(u) = (u - 1)^2, D_f = Pearson χ² divergence
iv) f(u) = -(u + 1) log( (1 + u)/2 ) + u log u, D_f = Jensen-Shannon divergence
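As a quick sanity check on the definition, here is a minimal sketch (assuming NumPy and discrete distributions with full support; the particular p and q are illustrative) that evaluates D_f = Σ_x q(x) f(p(x)/q(x)) for the generators listed above:

import numpy as np

def f_kl(u):         return u * np.log(u)                                           # i) KL
def f_reverse_kl(u): return -np.log(u)                                              # ii) Reverse KL
def f_pearson(u):    return (u - 1.0) ** 2                                          # iii) Pearson chi-squared
def f_js(u):         return -(u + 1.0) * np.log((1.0 + u) / 2.0) + u * np.log(u)    # iv) Jensen-Shannon

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete p, q with full support."""
    u = p / q
    return np.sum(q * f(u))

p = np.array([0.45, 0.45, 0.05, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])

# The generator f(u) = u log u recovers the ordinary KL divergence.
assert np.isclose(f_divergence(p, q, f_kl), np.sum(p * np.log(p / q)))

for name, f in [("KL", f_kl), ("Reverse KL", f_reverse_kl),
                ("Pearson", f_pearson), ("Jensen-Shannon", f_js)]:
    print(name, f_divergence(p, q, f))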
Next, we look at a method for the estimation/optimization of D_f given only samples from p_X.

Variational Estimation of f-divergences
Fenchel Conjugate:
Every convex, lower semi-continuous function f has a convex conjugate function f* defined as follows:

f*(t) = sup_{u ∈ dom_f} { ut - f(u) }
The function f* can be shown to be convex and lower semi-continuous.
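As a worked example, take the KL generator f(u) = u log u: f*(t) = sup_{u > 0} { ut - u log u }. Setting the derivative to zero gives t - log u - 1 = 0, i.e. u = e^(t-1), and substituting back yields f*(t) = e^(t-1).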
f(u) can also be represented as below:

f(u) = sup_{t ∈ dom_{f*}} { tu - f*(t) }

For a fixed t, the function u ↦ tu - f*(t) is a point-wise (linear) lower bound on f. Intuitively, this implies that any convex function can be represented as the point-wise maximum of many linear functions.
(Figure: several linear functions tu - f*(t), each a lower bound on f(u) and tight at one point.)
Substituting the conjugate representation of f into D_f:

D_f(P || Q) = ∫_X q(x) f( p(x)/q(x) ) dx
            = ∫_X q(x) sup_{t ∈ dom_{f*}} { t · p(x)/q(x) - f*(t) } dx

Since the supremum is over t and q(x) is non-negative, q(x) can be moved inside the supremum. The supremum can then be taken out of the integral, but it then ranges over a class of functions T : X → R (a choice of t for every x):

D_f ≥ sup_T ∫_X ( p(x) T(x) - q(x) f*( T(x) ) ) dx
The above becomes a lower bound since a given T may not attain the point-wise supremum for every x.
D_f(P || Q) ≥ sup_{T ∈ 𝒯} ( E_{x~P}[ T(x) ] - E_{x~Q}[ f*( T(x) ) ] )
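As a small numerical sketch of this bound (assuming NumPy; the Gaussians P = N(0,1), Q = N(1,1) and the affine family of T's are illustrative choices, not from the lecture), using the KL generator with f*(t) = e^(t-1):

import numpy as np

rng = np.random.default_rng(0)

# Samples from P = N(0, 1) and Q = N(1, 1); for these, D_KL(P || Q) = 0.5.
xp = rng.normal(0.0, 1.0, size=200_000)
xq = rng.normal(1.0, 1.0, size=200_000)

def f_star_kl(t):
    # Fenchel conjugate of f(u) = u log u
    return np.exp(t - 1.0)

def lower_bound(T, xp, xq, f_star):
    """Monte Carlo estimate of E_P[T(x)] - E_Q[f*(T(x))]."""
    return T(xp).mean() - f_star(T(xq)).mean()

# Every fixed T gives a valid lower bound on D_f; better T's give tighter bounds.
for a in [0.0, 0.3, 0.6, 1.0]:
    T = lambda x, a=a: 1.0 - a * x
    print(f"a = {a}: bound = {lower_bound(T, xp, xq, f_star_kl):.3f}  (<= 0.5)")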
Question: supposing 𝒯 can represent all functions T(x), what is the optimal T*(x)?

To optimize D_f over this class of functions T : X → R, we use the Euler-Lagrange equations.
D_f = ∫_X [ p(x) T(x) - q(x) f*( T(x) ) ] dx    [equality, assuming 𝒯 contains all functions]

Denote the integrand by L = p(x) T(x) - q(x) f*( T(x) ).
To get the optimal T*(x), we need to solve the following equation:

∂L/∂T(x) - d/dx( ∂L/∂T'(x) ) = 0

Since L does not depend on T'(x), the second term vanishes and the condition reduces to ∂L/∂T(x) = 0:

p(x) - q(x) f*'( T(x) ) = 0
f*'( T(x) ) = p(x) / q(x)
T*(x) = (f*')^{-1}( p(x)/q(x) ) = f'( p(x)/q(x) )

where the last equality uses (f*')^{-1} = f'.
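As a check, for the KL generator f(u) = u log u we have f'(u) = 1 + log u, so T*(x) = 1 + log( p(x)/q(x) ). Substituting this back, together with f*(t) = e^(t-1), gives E_P[ 1 + log(p/q) ] - E_Q[ p/q ] = 1 + D_KL(P || Q) - 1 = D_KL(P || Q), i.e. the bound is tight at the optimal T*.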
Therefore, one can find the optimal T* depending on the choice of f.

Variational Divergence Minimization
In this framework, the variational function T_w is represented as follows:

T_w(x) = g_f( V_w(x) )

where V_w : X → R is a function without range constraints and g_f : R → dom_{f*} is an output activation function.
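For concreteness, a few commonly used pairings (following the standard f-GAN parameterization): for KL, dom_{f*} = R and one may take g_f(v) = v; for reverse KL, dom_{f*} = (-∞, 0) and g_f(v) = -e^(-v); for Jensen-Shannon, g_f(v) = log 2 - log(1 + e^(-v)).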
Collecting everything together, the final objective would be

F(θ, w) = E_{x~P}[ g_f( V_w(x) ) ] + E_{x~Q_θ}[ -f*( g_f( V_w(x) ) ) ]

where V_w is a parametric function, typically a neural network. Samples from Q_θ are produced by passing z ~ N(0, I) through a generator network parameterized by θ; both these generated samples and the real samples x ~ P are fed through V_w and the output activation g_f. The optimization problem is

θ̂ = argmin_θ max_w F(θ, w)
Since the networks parameterized by θ and w are trying to minimize and maximize the same objective function respectively, this procedure is often referred to as Adversarial Learning.
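To make the min-max procedure concrete, here is a minimal training-loop sketch (assuming PyTorch; the network sizes, the synthetic 2-D data, and the use of the KL activation g_f(v) = v with f*(t) = e^(t-1) are illustrative assumptions, not prescriptions from the lecture):

import torch
import torch.nn as nn

# Variational function V_w: X -> R (the critic / discriminator).
V = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
# Generator G_theta: z -> x, defining Q_theta through x = G(z), z ~ N(0, I).
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

opt_V = torch.optim.Adam(V.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)

g_f = lambda v: v                       # output activation for the KL generator
f_star = lambda t: torch.exp(t - 1.0)   # Fenchel conjugate of f(u) = u log u

def F(x_real, x_fake):
    # F(theta, w) = E_P[g_f(V_w(x))] + E_{Q_theta}[-f*(g_f(V_w(x)))]
    return g_f(V(x_real)).mean() - f_star(g_f(V(x_fake))).mean()

for step in range(1000):
    x_real = torch.randn(128, 2) + torch.tensor([3.0, 3.0])   # stand-in "data" from P
    z = torch.randn(128, 8)

    # Inner maximization over w: ascend F (equivalently, minimize -F).
    loss_V = -F(x_real, G(z).detach())
    opt_V.zero_grad(); loss_V.backward(); opt_V.step()

    # Outer minimization over theta: descend F through the generated samples.
    loss_G = F(x_real, G(z))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

(In practice the exponential in f*(t) = e^(t-1) can be numerically unstable for large critic outputs, which is one reason other generator/activation pairs are often preferred.)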