Zohar Ringel - Mappings From DNNs To GPS - Part 1


Here we discuss several relations between highly overparametrized DNNs and Gaussian Processes (GPs).

Motivation:
1. Most real-world DNNs have #Params ≫ #Datapoints; often performance improves with over-parametrization.
2. Over-parametrization makes training less glassy/finicky (less prone to bad local minima).
3. GPs are easier to analyze analytically.
4. This led to new state-of-the-art GPs (based on wide DNNs).
Outline:
I.   DNNs at initialization as GPs
     - DNNs as universal approximators
     - Easy and hard to learn functions
     - Relations between DNN architecture and (ultra-wide) DNN GP kernels
II.  DNNs following gradient descent + noisy training as GPs
     - Predicting DNN outputs, up to a large matrix inversion

III. Caveats, oversights, and improvements
     - SGD vs. GD; low and high learning rates
     - How wide is wide enough? (and feature learning)
     - Renormalized GPs, perturbation theories

I. Wide DNNs at initialization as GPs (first noticed by Neal, 1996)

Consider a fully connected DNN with input x = (x_1, ..., x_d), hidden layers of width N, and output f(x|θ):

  z_i^(1)(x) = Σ_j a_ij^(0) x_j,
  z_i^(l+1)(x) = Σ_j a_ij^(l) φ(z_j^(l)(x)),
  f(x|θ) = Σ_i a_i φ(z_i^(L)(x)),

where φ is the non-linearity and θ = {a} collects all the weights.

[Sketch: a network diagram with inputs x_1, ..., x_d at the bottom, hidden layers of width N, and the output f at the top.]

Consider the a's drawn from some i.i.d. distribution. Given such a draw of θ, the DNN generates a certain random function f(x|θ) = Σ_i a_i φ(z_i(x|θ)). Thus we find that

  a distribution over the a's + the DNN architecture  ⟹  a distribution over function space.
Let's consider this distribution first by sampling the functions it generates on a single data point, f(x*). Since f(x*) = Σ_i a_i φ(z_i(x*|θ)) and the a's are assumed i.i.d., all the a_i φ(z_i(x*)) are independent and identical random variables. Consequently their sum, as N → ∞, when properly normalized, tends to a Gaussian distribution. This proper normalization, by 1/√N, is part of all standard DNN initializations (a.k.a. Xavier initialization).

Thus we find that for any fixed vector x*,

  P(f(x*)) ∝ exp( − f(x*)² / 2K(x*,x*) ),   i.e.  f(x*) ~ N(0, K(x*,x*)),

where the variance K(x*,x*) = N · Var[a] · Var[φ(z_i(x*))] for any specific neuron i, and Var[a] is chosen ∝ 1/N so that K stays finite.
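This CLT statement is easy to check numerically. A minimal sketch, assuming for illustration a one-hidden-layer ReLU network with a_i ~ N(0,1) and standard-normal pre-activations (so the predicted variance is E[ReLU(z)²] = 1/2); all sizes here are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_nets = 2000, 4000

# For an input x with |x|^2 = d and first-layer weights ~ N(0, 1/d),
# the pre-activations z_i = w_i . x are iid N(0, 1), so we sample them directly.
z = rng.normal(size=(n_nets, N))
a = rng.normal(size=(n_nets, N))

# f = (1/sqrt(N)) sum_i a_i relu(z_i), one value per randomly drawn network
f = (a * np.maximum(z, 0.0)).sum(axis=1) / np.sqrt(N)

# For z ~ N(0,1): E[relu(z)^2] = 1/2, so Var[f] should be close to 0.5
print(f.mean(), f.var())
```

The empirical mean comes out ≈ 0 and the variance ≈ 1/2, and a histogram of f is Gaussian to good accuracy already at this width.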

To calculate Var[φ(z(x*))] we note that z_i^(l)(x*) = Σ_j a_ij^(l−1) φ(z_j^(l−1)(x*)), and so its variance can be derived from Var[a^(l−1)], Var[z^(l−1)(x*)], and the non-linearity φ. This can proceed recursively, layer by layer, until an explicit answer is obtained.

Exercise: Consider weights drawn as a ~ N(0, σ²) and depth L = 2 (i.e. one layer with a non-linearity φ), and calculate K(x*, x*) for a generic x*.

Next consider n data points x_1, ..., x_n. (f(x_1), ..., f(x_n)) can be viewed as a multivariate distribution, since

  f(x_k) = Σ_i a_i φ(z_i(x_k)).

The random vector on the l.h.s. is a sum of many uncorrelated random vectors (one per hidden neuron i, through the i.i.d. a_i variables). Hence, by the multivariate CLT, as N → ∞,

  (f(x_1), ..., f(x_n)) ~ N(0, K),   K_kl = K(x_k, x_l),

which is precisely the statement that f is drawn from a GP with kernel K.
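The joint Gaussianity can likewise be checked by sampling many independent networks and comparing the empirical covariance of (f(x_1), f(x_2)) with K(x_k, x_l) = ⟨φ(w·x_k) φ(w·x_l)⟩_w. A sketch, again using an illustrative one-hidden-layer ReLU network (all sizes and scales are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n_nets = 5, 500, 4000
X = rng.normal(size=(2, d))                    # two test points x1, x2

# n_nets independent networks f(x) = (1/sqrt(N)) sum_i a_i relu(w_i . x)
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(n_nets, N, d))
a = rng.normal(size=(n_nets, N))
pre = np.einsum('bni,pi->bnp', W, X)           # pre-activations, (nets, N, 2)
F = (a[:, :, None] * np.maximum(pre, 0.0)).sum(axis=1) / np.sqrt(N)

emp_cov = F.T @ F / n_nets                     # empirical <f(x_p) f(x_q)>

# Kernel K(x,y) = E_w[relu(w.x) relu(w.y)], estimated by Monte Carlo over w
Wmc = rng.normal(0.0, np.sqrt(1.0 / d), size=(200_000, d))
Z = np.maximum(Wmc @ X.T, 0.0)
K = Z.T @ Z / len(Wmc)
print(emp_cov)
print(K)
```

The two 2×2 matrices agree to within Monte Carlo error, as the GP picture predicts.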

Next consider datapoints drawn from a distribution dμ(x) = p(x) dx. Say, in an image-classification scenario, x will be the vector of image pixels and p(x) the chance that some pixel vector x is an image.

At N → ∞ and n → ∞ one gets that f(x) is a field on x whose distribution is governed by the partition function

  Z = ∫ Df exp( −½ ∫ dμ(x) dμ(y) f(x) K⁻¹(x,y) f(y) )     (a path integral),

where K⁻¹ is defined by

  ∫ dμ(y) K⁻¹(x,y) K(y,z) = δ_μ(x − z).

This is a quadratic field theory with action (energy)

  S[f] = ½ ∫ dμ(x) dμ(y) f(x) K⁻¹(x,y) f(y).

Diagonalizing the kernel K via solving

  ∫ dμ(y) K(x,y) φ_λ(y) = λ φ_λ(x),

and expanding f(x) in the resulting basis,

  f(x) = Σ_λ f_λ φ_λ(x),

we get

  Z = Π_λ ∫ df_λ exp( − f_λ² / 2λ ).

[Sketch: a sample function drawn from the GP with kernel K(x,y), pictured as a fluctuating membrane over input space.]
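On a finite grid of points this factorized form of Z gives a direct recipe for sampling the field: diagonalize the kernel matrix, draw each mode f_λ ~ N(0, λ), and sum. A sketch, using an RBF kernel as an illustrative stand-in for a concrete DNN-induced kernel:

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid of inputs and a kernel matrix K(x_i, x_j); the RBF kernel here is
# an illustrative stand-in for a DNN-induced kernel.
x = np.linspace(-1.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

lam, phi = np.linalg.eigh(K)                   # K = phi diag(lam) phi^T
lam = np.clip(lam, 0.0, None)                  # guard tiny negative round-off

# Draw each mode f_lambda ~ N(0, lambda) and sum: f(x) = sum f_lambda phi_lambda(x)
f_modes = rng.normal(size=(len(x), 100_000)) * np.sqrt(lam)[:, None]
f = phi @ f_modes                              # one GP sample per column

emp = f @ f.T / f.shape[1]                     # empirical covariance of the field
print(np.abs(emp - K).max())
```

The empirical covariance of the sampled fields reproduces K, mode variances and all.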

A comment is in order regarding our measure dμ: for any finite n we are not required to introduce dμ, and choosing any reasonable dμ does not affect the fact that

  ⟨f(x) f(y)⟩_Z = K(x,y),

i.e. the covariance is dμ-independent. Later on, when we discuss trained DNNs, we'll see that taking dμ = the data distribution is the best choice.

Example: a linear, fully connected DNN.

  z(x|θ) = Σ_j θ_j x_j   ⟹   K(x,y) = σ² Σ_j x_j y_j = σ² x·y,

using ⟨θ_i θ_j⟩ = σ² δ_ij. The eigenvalue equation

  ∫ dμ(y) σ² (x·y) φ(y) = λ φ(x)

can only be solved by eigenfunctions φ(x) linear in x. Trying φ(x) = Σ_i c_i x_i,

  ∫ dμ(y) σ² (x·y)(c·y) = σ² Σ_i x_i Σ_j A_ij c_j = λ Σ_i c_i x_i,

where A_ij = ∫ dμ(y) y_i y_j is the covariance of the data measure dμ, so the equation reduces to σ² A c = λ c.

Diagonalizing the covariance matrix A, we find that all eigenfunctions are linear. Strongly (weakly) fluctuating linear combinations of the x_i's correspond to high (low) eigenvalues of K(x,y), i.e. to low (high) energy in the membrane picture of the GP.
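Numerically, this reduction to the data covariance is the familiar fact that the Gram matrix X Xᵀ and the covariance Xᵀ X share their nonzero eigenvalues. A small check with an arbitrary anisotropic data set (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 200, 5, 1.0

# Anisotropic data: each coordinate fluctuates with a different strength
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Kernel operator for K(x,y) = sigma^2 x.y, discretized with dmu = the
# empirical measure: its matrix is (sigma^2/n) X X^T, and its nonzero
# eigenvalues equal those of the data covariance sigma^2 (1/n) X^T X.
gram_eigs = np.linalg.eigvalsh(sigma2 * X @ X.T / n)[-d:]
cov_eigs = np.linalg.eigvalsh(sigma2 * X.T @ X / n)
print(gram_eigs, cov_eigs)
```

The two spectra coincide, confirming that diagonalizing the kernel operator is the same as diagonalizing σ²A.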

A simplification occurring in fully connected DNN kernels:

  K(x,y) = f(|x|, |y|, x·y),

since |x|, |y|, x·y are a complete set of rotationally invariant quantities in two vectors. Normalizing |x| = 1, K(x,y) = f(x·y) becomes a dot-product kernel. Assuming dμ is uniform on the hypersphere, one finds:

  φ's = hyperspherical harmonics Y_lm,
  λ's = degenerate within each l ("angular momentum"), suppressed by factors of d^(−l),

where Y_lm is an lth-order polynomial in x.

Expressibility and implicit bias in DNNs

Expressibility: what is the scope of functions that a DNN can approximate within some small error ε?

Implicit bias: any non-explicit (not directly controlled) tendency of the DNN to favor certain functions.

Using the above GP viewpoint, expressibility at N → ∞ of some function f*(x) amounts to checking whether

  ∫ dμ(x) dμ(y) f*(x) K⁻¹(x,y) f*(y) < ∞,

i.e., expanding f*(x) = Σ_λ f*_λ φ_λ(x), every mode with f*_λ ≠ 0 must have λ > 0. For most DNNs, with φ = ReLU / tanh / erf activations and one hidden layer, all reasonable functions are possible to express.

In particular, for such fully connected DNNs,

  K(x,y) = Σ_l b_l (x·y)^l   with all b_l > 0,

implying all spherical harmonics are possible.

If all functions are possible, and there is a huge number of them (for MNIST input dimension, the number of degree-l polynomials is already astronomical at moderate l), how come we don't need huge amounts of samples to learn?

Implicit bias: according to the GP limit, not all expressible functions are equally likely. In fact, there are astronomical suppression factors for some of them.

Consider a fully connected DNN with dμ on the d = 784 hypersphere (MNIST input dimension). As we claimed before, λ_l ~ d^(−l) for an lth-order polynomial. Take

  f_linear(x) = Σ_i c_i x_i,
  f_parity(x) = x_1 x_2 ⋯ x_d   (a dth-order polynomial).

Then

  GP probability of f_linear ∝ exp( −½ ∫∫ f_linear K⁻¹ f_linear ) ~ e^(−O(d)),
  GP probability of f_parity ∝ exp( −½ ∫∫ f_parity K⁻¹ f_parity ) ~ e^(−O(d^d)) = e^(−O(784^784)).

A fully connected DNN with randomly drawn weights is extremely more likely to generate a linear function on the hypersphere in input space than a specific high-order polynomial, provided d_input ≫ 1.
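The same suppression can be seen exactly at small d by evaluating the quadratic form f K⁻¹ f on the full hypercube {±1}^d, where monomials are exact eigenfunctions of any dot-product kernel. The kernel exp(x·y/d) below is an illustrative choice with all b_l > 0, not one derived in these notes:

```python
import numpy as np
from itertools import product

# Quadratic-form cost f K^{-1} f for a linear function vs the full parity
# function on the hypercube {-1,1}^d.  K(x,y) = exp(x.y/d) is an
# illustrative dot-product kernel with all b_l > 0; Fourier characters
# (monomials) are its exact eigenfunctions on the hypercube.
d = 10
X = np.array(list(product([-1.0, 1.0], repeat=d)))     # all 2^d inputs
K = np.exp(X @ X.T / d)
Kinv = np.linalg.inv(K)

f_lin = X[:, 0]                # f(x) = x_1          (degree 1)
f_par = X.prod(axis=1)         # f(x) = x_1 ... x_d  (degree d)

cost_lin = f_lin @ Kinv @ f_lin
cost_par = f_par @ Kinv @ f_par
print(cost_par / cost_lin)
```

Already at d = 10 the cost ratio is of order 10⁹; at d = 784 it becomes the astronomical factor discussed above.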

Obtaining K(x,y) from the DNN architecture

Refs: "Kernel Methods for Deep Learning", Cho and Saul, 2009; "Deep Neural Networks as Gaussian Processes", Lee et al., 2017.

Our strategy: layer-wise, bottom-up. Recall the recursive structure

  z_i^(1)(x) = Σ_j a_ij^(0) x_j,
  z_i^(l+1)(x) = Σ_j a_ij^(l) φ(z_j^(l)(x)).
to

Notice that each (z_i^(l+1)(x_1), ..., z_i^(l+1)(x_n)) is a sum of many random vectors, namely a_ij^(l) · (φ(z_j^(l)(x_1)), ..., φ(z_j^(l)(x_n))). Hence at width N → ∞, (z_i^(l+1)(x_1), ..., z_i^(l+1)(x_n)) is multivariate Gaussian, characterized by

  K^(l+1)(x,y) = ⟨ z_i^(l+1)(x) z_i^(l+1)(y) ⟩,

since the a_ij^(l) are centered and independent random variables.

(An efficient layer-by-layer calculation of K^(l+1) given K^(l) of this simple form is only true for fully connected layers, not for CNNs.)

The challenge: calculate K^(l+1) given that the z^(l)(x)'s are Gaussian with kernel K^(l). Write

  K^(l+1)(x_1, x_2) = σ_a² ∫ dz_1 dz_2 P^(l)(z_1, z_2) φ(z_1) φ(z_2),

where, simplifying notation via z_1 = z^(l)(x_1), z_2 = z^(l)(x_2),

  P^(l)(z_1, z_2) = (1 / 2π√(det K)) exp( −½ (z_1, z_2) K⁻¹ (z_1, z_2)ᵀ ),

  K = [ K^(l)(x_1,x_1)  K^(l)(x_1,x_2) ]
      [ K^(l)(x_2,x_1)  K^(l)(x_2,x_2) ].

(Check that this P^(l) indeed yields ⟨z_1 z_2⟩ = K^(l)(x_1, x_2), as expected.)

Kernel recursion relation, for any fully connected DNN:

  K^(l+1)(x_1, x_2) = (σ_a² / 2π√(det K^(l))) ∫ dz_1 dz_2 exp( −½ zᵀ [K^(l)]⁻¹ z ) φ(z_1) φ(z_2),

with z = (z_1, z_2). This is a tricky two-dimensional integration, but it is analytically doable for φ(x) = erf(x), ReLU(x), and sums of these.
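For φ = erf this integral has a known closed form (Williams, 1998): ⟨erf(z_1) erf(z_2)⟩ = (2/π) arcsin( 2K_12 / √((1+2K_11)(1+2K_22)) ). A Monte Carlo sketch verifying it, with σ_a² = 1 and arbitrary illustrative entries for the previous-layer kernel:

```python
import math
import numpy as np

erf = np.vectorize(math.erf)
rng = np.random.default_rng(4)

# Previous-layer 2x2 kernel K^(l) (entries are arbitrary illustrative values)
K = np.array([[1.0, 0.6],
              [0.6, 0.8]])

# Monte Carlo estimate of the recursion integral <erf(z1) erf(z2)>
z = rng.multivariate_normal([0.0, 0.0], K, size=1_000_000)
mc = np.mean(erf(z[:, 0]) * erf(z[:, 1]))

# Closed form (Williams, 1998)
exact = (2 / np.pi) * np.arcsin(
    2 * K[0, 1] / np.sqrt((1 + 2 * K[0, 0]) * (1 + 2 * K[1, 1])))
print(mc, exact)
```

The two numbers agree to within Monte Carlo error.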

Solving the recursion relation for polynomial activations: the required expectations over the bivariate Gaussian P_B(z_1, z_2) with covariance B follow from multivariate Gaussian (Wick) identities, e.g.

  ⟨z_1 z_2⟩_B = B_12,   ⟨z_1² z_2²⟩_B = B_11 B_22 + 2 B_12²,   ⟨z_1⁴⟩_B = 3 B_11²,

so K^(l+1) comes out as an explicit polynomial in the entries of K^(l).

Results for φ = ReLU (see Cho & Saul, 2009):

  K^(l+1)(x_1, x_2) = (σ_a² / 2π) √(K_11 K_22) ( sin θ + (π − θ) cos θ ),   cos θ = K_12 / √(K_11 K_22),

where K_ab = K^(l)(x_a, x_b). Since K can be thought of as an inner product, θ plays the role of the angle between x_1 and x_2.
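A quick Monte Carlo sanity check of this arc-cosine formula (taking σ_a² = 1 and an arbitrary example K^(l)):

```python
import numpy as np

rng = np.random.default_rng(5)

# Previous-layer 2x2 kernel K^(l) (entries are arbitrary illustrative values)
K = np.array([[1.5, 0.5],
              [0.5, 1.0]])

# Cho & Saul (2009) arc-cosine formula for phi = ReLU, sigma_a^2 = 1
cos_t = K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
t = np.arccos(cos_t)
exact = np.sqrt(K[0, 0] * K[1, 1]) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

# Monte Carlo estimate of <relu(z1) relu(z2)> under the same Gaussian
z = rng.multivariate_normal([0.0, 0.0], K, size=1_000_000)
mc = np.mean(np.maximum(z[:, 0], 0.0) * np.maximum(z[:, 1], 0.0))
print(mc, exact)
```

Sanity checks of the formula itself: at θ = 0 it gives K_11/2 = E[ReLU(z)²], and at θ = π/2 it gives √(K_11 K_22)/2π = E[ReLU(z_1)]E[ReLU(z_2)] for independent z's.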
