Zohar Ringel - Mappings From DNNs To GPS - Part 1


Here we discuss several relations between highly overparametrized DNNs and Gaussian Processes (GPs).

Motivation:
1. Most real-world DNNs have #Params ≫ #Datapoints; often performance improves with over-parametrization.
2. Over-parametrization makes training less glassy/finicky (less prone to bad local minima).
3. GPs are easier to analyze analytically.
4. This led to new state-of-the-art GPs (based on wide DNNs).
Outline:
I.   DNNs at initialization as GPs
     - DNNs as universal approximators
     - Easy and hard to learn functions
     - Relations between DNN architecture and (ultra-wide) DNN GP kernels
II.  DNNs following gradient descent + noisy training as GPs
     - Predicting DNN outputs, up to a large matrix inversion

III. Caveats, oversights, and improvements
     - SGD vs. GD; low and high learning rates
     - How wide is wide enough? (and feature learning)
     - Renormalized GPs, perturbation theories

I. Wide DNNs at initialization as GPs (first noticed by Neal, 1996)

Consider a fully connected DNN with input x = (x_1, ..., x_d), hidden layers of width N, and output f(x|θ):

  z_i^(1)(x) = Σ_j a_ij^(0) x_j,
  z_i^(l+1)(x) = Σ_j a_ij^(l) φ(z_j^(l)(x)),
  f(x|θ) = Σ_i a_i φ(z_i^(L)(x)),

where φ is the non-linearity and θ = {a} collects all the weights.

[Sketch: a network diagram with inputs x_1, ..., x_d at the bottom, hidden layers of width N, and the output f at the top.]

Consider the a's drawn from some i.i.d. distribution. Given such a draw of θ, the DNN generates a certain random function f(x|θ) = Σ_i a_i φ(z_i(x|θ)). Thus we find that

  a distribution over the a's + the DNN architecture  ⟹  a distribution over function space.
Let's consider this distribution first by sampling the functions it generates on a single data point, f(x*). Since f(x*) = Σ_i a_i φ(z_i(x*|θ)) and the a's are assumed i.i.d., all the a_i φ(z_i(x*)) are independent and identical random variables. Consequently their sum, as N → ∞, when properly normalized, tends to a Gaussian distribution. This proper normalization, by 1/√N, is part of all standard DNN initializations (a.k.a. Xavier initialization).

Thus we find that for any fixed vector x*,

  P(f(x*)) ∝ exp( − f(x*)² / 2K(x*,x*) ),   i.e.  f(x*) ~ N(0, K(x*,x*)),

where the variance K(x*,x*) = N · Var[a] · Var[φ(z_i(x*))] for any specific neuron i, and Var[a] is chosen ∝ 1/N so that K stays finite.
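This CLT statement is easy to check numerically. A minimal sketch, assuming for illustration a one-hidden-layer ReLU network with a_i ~ N(0,1) and standard-normal pre-activations (so the predicted variance is E[ReLU(z)²] = 1/2); all sizes here are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_nets = 2000, 4000

# For an input x with |x|^2 = d and first-layer weights ~ N(0, 1/d),
# the pre-activations z_i = w_i . x are iid N(0, 1), so we sample them directly.
z = rng.normal(size=(n_nets, N))
a = rng.normal(size=(n_nets, N))

# f = (1/sqrt(N)) sum_i a_i relu(z_i), one value per randomly drawn network
f = (a * np.maximum(z, 0.0)).sum(axis=1) / np.sqrt(N)

# For z ~ N(0,1): E[relu(z)^2] = 1/2, so Var[f] should be close to 0.5
print(f.mean(), f.var())
```

The empirical mean comes out ≈ 0 and the variance ≈ 1/2, and a histogram of f is Gaussian to good accuracy already at this width.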

To calculate Var[φ(z(x*))] we note that z_i^(l)(x*) = Σ_j a_ij^(l−1) φ(z_j^(l−1)(x*)), and so its variance can be derived from Var[a^(l−1)], Var[z^(l−1)(x*)], and the non-linearity φ. This can proceed recursively, layer by layer, until an explicit answer is obtained.

Exercise: Consider weights drawn as a ~ N(0, σ²) and depth L = 2 (i.e. one layer with a non-linearity φ), and calculate K(x*, x*) for a generic x*.

Next consider n data points x_1, ..., x_n. (f(x_1), ..., f(x_n)) can be viewed as a multivariate distribution, since

  f(x_k) = Σ_i a_i φ(z_i(x_k)).

The random vector on the l.h.s. is a sum of many uncorrelated random vectors (one per hidden neuron i, through the i.i.d. a_i variables). Hence, by the multivariate CLT, as N → ∞,

  (f(x_1), ..., f(x_n)) ~ N(0, K),   K_kl = K(x_k, x_l),

which is precisely the statement that f is drawn from a GP with kernel K.
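The joint Gaussianity can likewise be checked by sampling many independent networks and comparing the empirical covariance of (f(x_1), f(x_2)) with K(x_k, x_l) = ⟨φ(w·x_k) φ(w·x_l)⟩_w. A sketch, again using an illustrative one-hidden-layer ReLU network (all sizes and scales are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n_nets = 5, 500, 4000
X = rng.normal(size=(2, d))                    # two test points x1, x2

# n_nets independent networks f(x) = (1/sqrt(N)) sum_i a_i relu(w_i . x)
W = rng.normal(0.0, np.sqrt(1.0 / d), size=(n_nets, N, d))
a = rng.normal(size=(n_nets, N))
pre = np.einsum('bni,pi->bnp', W, X)           # pre-activations, (nets, N, 2)
F = (a[:, :, None] * np.maximum(pre, 0.0)).sum(axis=1) / np.sqrt(N)

emp_cov = F.T @ F / n_nets                     # empirical <f(x_p) f(x_q)>

# Kernel K(x,y) = E_w[relu(w.x) relu(w.y)], estimated by Monte Carlo over w
Wmc = rng.normal(0.0, np.sqrt(1.0 / d), size=(200_000, d))
Z = np.maximum(Wmc @ X.T, 0.0)
K = Z.T @ Z / len(Wmc)
print(emp_cov)
print(K)
```

The two 2×2 matrices agree to within Monte Carlo error, as the GP picture predicts.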

Next consider datapoints drawn from a distribution dμ(x) = p(x) dx. Say, in an image-classification scenario, x will be the vector of image pixels and p(x) the chance that some pixel vector x is an image.

At N → ∞ and n → ∞ one gets that f(x) is a field on x whose distribution is governed by the partition function

  Z = ∫ Df exp( −½ ∫ dμ(x) dμ(y) f(x) K⁻¹(x,y) f(y) )     (a path integral),

where K⁻¹ is defined by

  ∫ dμ(y) K⁻¹(x,y) K(y,z) = δ_μ(x − z).

This is a quadratic field theory with action (energy)

  S[f] = ½ ∫ dμ(x) dμ(y) f(x) K⁻¹(x,y) f(y).

Diagonalizing the kernel K via solving

  ∫ dμ(y) K(x,y) φ_λ(y) = λ φ_λ(x),

and expanding f(x) in the resulting basis,

  f(x) = Σ_λ f_λ φ_λ(x),

we get

  Z = Π_λ ∫ df_λ exp( − f_λ² / 2λ ).

[Sketch: a sample function drawn from the GP with kernel K(x,y), pictured as a fluctuating membrane over input space.]
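On a finite grid of points this factorized form of Z gives a direct recipe for sampling the field: diagonalize the kernel matrix, draw each mode f_λ ~ N(0, λ), and sum. A sketch, using an RBF kernel as an illustrative stand-in for a concrete DNN-induced kernel:

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid of inputs and a kernel matrix K(x_i, x_j); the RBF kernel here is
# an illustrative stand-in for a DNN-induced kernel.
x = np.linspace(-1.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

lam, phi = np.linalg.eigh(K)                   # K = phi diag(lam) phi^T
lam = np.clip(lam, 0.0, None)                  # guard tiny negative round-off

# Draw each mode f_lambda ~ N(0, lambda) and sum: f(x) = sum f_lambda phi_lambda(x)
f_modes = rng.normal(size=(len(x), 100_000)) * np.sqrt(lam)[:, None]
f = phi @ f_modes                              # one GP sample per column

emp = f @ f.T / f.shape[1]                     # empirical covariance of the field
print(np.abs(emp - K).max())
```

The empirical covariance of the sampled fields reproduces K, mode variances and all.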

A comment is in order regarding our measure dμ: for any finite n we are not required to introduce dμ, and choosing any reasonable dμ does not affect the fact that

  ⟨f(x) f(y)⟩_Z = K(x,y),

i.e. the covariance is dμ-independent. Later on, when we discuss trained DNNs, we'll see that taking dμ = the data distribution is the best choice.

Example: a linear, fully connected DNN.

  z(x|θ) = Σ_j θ_j x_j   ⟹   K(x,y) = σ² Σ_j x_j y_j = σ² x·y,

using ⟨θ_i θ_j⟩ = σ² δ_ij. The eigenvalue equation

  ∫ dμ(y) σ² (x·y) φ(y) = λ φ(x)

can only be solved by eigenfunctions φ(x) linear in x. Trying φ(x) = Σ_i c_i x_i,

  ∫ dμ(y) σ² (x·y)(c·y) = σ² Σ_i x_i Σ_j A_ij c_j = λ Σ_i c_i x_i,

where A_ij = ∫ dμ(y) y_i y_j is the covariance of the data measure dμ, so the equation reduces to σ² A c = λ c.

Diagonalizing the covariance matrix A, we find that all eigenfunctions are linear. Strongly (weakly) fluctuating linear combinations of the x_i's correspond to high (low) eigenvalues of K(x,y), i.e. to low (high) energy in the membrane picture of the GP.
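Numerically, this reduction to the data covariance is the familiar fact that the Gram matrix X Xᵀ and the covariance Xᵀ X share their nonzero eigenvalues. A small check with an arbitrary anisotropic data set (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 200, 5, 1.0

# Anisotropic data: each coordinate fluctuates with a different strength
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Kernel operator for K(x,y) = sigma^2 x.y, discretized with dmu = the
# empirical measure: its matrix is (sigma^2/n) X X^T, and its nonzero
# eigenvalues equal those of the data covariance sigma^2 (1/n) X^T X.
gram_eigs = np.linalg.eigvalsh(sigma2 * X @ X.T / n)[-d:]
cov_eigs = np.linalg.eigvalsh(sigma2 * X.T @ X / n)
print(gram_eigs, cov_eigs)
```

The two spectra coincide, confirming that diagonalizing the kernel operator is the same as diagonalizing σ²A.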

A simplification occurring in fully connected DNN kernels:

  K(x,y) = f(|x|, |y|, x·y),

since |x|, |y|, x·y are a complete set of rotationally invariant quantities in two vectors. Normalizing |x| = 1, K(x,y) = f(x·y) becomes a dot-product kernel. Assuming dμ is uniform on the hypersphere, one finds:

  φ's = hyperspherical harmonics Y_lm,
  λ's = degenerate within each l ("angular momentum"), suppressed by factors of d^(−l),

where Y_lm is an lth-order polynomial in x.

Expressibility and implicit bias in DNNs

Expressibility: what is the scope of functions that a DNN can approximate within some small error ε?

Implicit bias: any non-explicit (not directly controlled) tendency of the DNN to favor certain functions.

Using the above GP viewpoint, expressibility at N → ∞ of some function f*(x) amounts to checking whether

  ∫ dμ(x) dμ(y) f*(x) K⁻¹(x,y) f*(y) < ∞,

i.e., expanding f*(x) = Σ_λ f*_λ φ_λ(x), every mode with f*_λ ≠ 0 must have λ > 0. For most DNNs, with φ = ReLU / tanh / erf activations and one hidden layer, all reasonable functions are possible to express.

In particular, for such fully connected DNNs,

  K(x,y) = Σ_l b_l (x·y)^l   with all b_l > 0,

implying all spherical harmonics are possible.

If all functions are possible, and there is a huge number of them (for MNIST input dimension, the number of degree-l polynomials is already astronomical at moderate l), how come we don't need huge amounts of samples to learn?

Implicit bias: according to the GP limit, not all expressible functions are equally likely. In fact, there are astronomical suppression factors for some of them.

Consider a fully connected DNN with dμ on the d = 784 hypersphere (MNIST input dimension). As we claimed before, λ_l ~ d^(−l) for an lth-order polynomial. Take

  f_linear(x) = Σ_i c_i x_i,
  f_parity(x) = x_1 x_2 ⋯ x_d   (a dth-order polynomial).

Then

  GP probability of f_linear ∝ exp( −½ ∫∫ f_linear K⁻¹ f_linear ) ~ e^(−O(d)),
  GP probability of f_parity ∝ exp( −½ ∫∫ f_parity K⁻¹ f_parity ) ~ e^(−O(d^d)) = e^(−O(784^784)).

A fully connected DNN with randomly drawn weights is extremely more likely to generate a linear function on the hypersphere in input space than a specific high-order polynomial, provided d_input ≫ 1.
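The same suppression can be seen exactly at small d by evaluating the quadratic form f K⁻¹ f on the full hypercube {±1}^d, where monomials are exact eigenfunctions of any dot-product kernel. The kernel exp(x·y/d) below is an illustrative choice with all b_l > 0, not one derived in these notes:

```python
import numpy as np
from itertools import product

# Quadratic-form cost f K^{-1} f for a linear function vs the full parity
# function on the hypercube {-1,1}^d.  K(x,y) = exp(x.y/d) is an
# illustrative dot-product kernel with all b_l > 0; Fourier characters
# (monomials) are its exact eigenfunctions on the hypercube.
d = 10
X = np.array(list(product([-1.0, 1.0], repeat=d)))     # all 2^d inputs
K = np.exp(X @ X.T / d)
Kinv = np.linalg.inv(K)

f_lin = X[:, 0]                # f(x) = x_1          (degree 1)
f_par = X.prod(axis=1)         # f(x) = x_1 ... x_d  (degree d)

cost_lin = f_lin @ Kinv @ f_lin
cost_par = f_par @ Kinv @ f_par
print(cost_par / cost_lin)
```

Already at d = 10 the cost ratio is of order 10⁹; at d = 784 it becomes the astronomical factor discussed above.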

Obtaining K(x,y) from the DNN architecture

Refs: "Kernel Methods for Deep Learning", Cho and Saul, 2009; "Deep Neural Networks as Gaussian Processes", Lee et al., 2017.

Our strategy: layer-wise, bottom-up. Recall the recursive structure

  z_i^(1)(x) = Σ_j a_ij^(0) x_j,
  z_i^(l+1)(x) = Σ_j a_ij^(l) φ(z_j^(l)(x)).
to

Notice that each (z_i^(l+1)(x_1), ..., z_i^(l+1)(x_n)) is a sum of many random vectors, namely a_ij^(l) · (φ(z_j^(l)(x_1)), ..., φ(z_j^(l)(x_n))). Hence at width N → ∞, (z_i^(l+1)(x_1), ..., z_i^(l+1)(x_n)) is multivariate Gaussian, characterized by

  K^(l+1)(x,y) = ⟨ z_i^(l+1)(x) z_i^(l+1)(y) ⟩,

since the a_ij^(l) are centered and independent random variables.

(An efficient layer-by-layer calculation of K^(l+1) given K^(l) of this simple form is only true for fully connected layers, not for CNNs.)

The challenge: calculate K^(l+1) given that the z^(l)(x)'s are Gaussian with kernel K^(l). Write

  K^(l+1)(x_1, x_2) = σ_a² ∫ dz_1 dz_2 P^(l)(z_1, z_2) φ(z_1) φ(z_2),

where, simplifying notation via z_1 = z^(l)(x_1), z_2 = z^(l)(x_2),

  P^(l)(z_1, z_2) = (1 / 2π√(det K)) exp( −½ (z_1, z_2) K⁻¹ (z_1, z_2)ᵀ ),

  K = [ K^(l)(x_1,x_1)  K^(l)(x_1,x_2) ]
      [ K^(l)(x_2,x_1)  K^(l)(x_2,x_2) ].

(Check that this P^(l) indeed yields ⟨z_1 z_2⟩ = K^(l)(x_1, x_2), as expected.)

Kernel recursion relation, for any fully connected DNN:

  K^(l+1)(x_1, x_2) = (σ_a² / 2π√(det K^(l))) ∫ dz_1 dz_2 exp( −½ zᵀ [K^(l)]⁻¹ z ) φ(z_1) φ(z_2),

with z = (z_1, z_2). This is a tricky two-dimensional integration, but it is analytically doable for φ(x) = erf(x), ReLU(x), and sums of these.
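For φ = erf this integral has a known closed form (Williams, 1998): ⟨erf(z_1) erf(z_2)⟩ = (2/π) arcsin( 2K_12 / √((1+2K_11)(1+2K_22)) ). A Monte Carlo sketch verifying it, with σ_a² = 1 and arbitrary illustrative entries for the previous-layer kernel:

```python
import math
import numpy as np

erf = np.vectorize(math.erf)
rng = np.random.default_rng(4)

# Previous-layer 2x2 kernel K^(l) (entries are arbitrary illustrative values)
K = np.array([[1.0, 0.6],
              [0.6, 0.8]])

# Monte Carlo estimate of the recursion integral <erf(z1) erf(z2)>
z = rng.multivariate_normal([0.0, 0.0], K, size=1_000_000)
mc = np.mean(erf(z[:, 0]) * erf(z[:, 1]))

# Closed form (Williams, 1998)
exact = (2 / np.pi) * np.arcsin(
    2 * K[0, 1] / np.sqrt((1 + 2 * K[0, 0]) * (1 + 2 * K[1, 1])))
print(mc, exact)
```

The two numbers agree to within Monte Carlo error.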

Solving the recursion relation for polynomial activations: the required expectations over the bivariate Gaussian P_B(z_1, z_2) with covariance B follow from multivariate Gaussian (Wick) identities, e.g.

  ⟨z_1 z_2⟩_B = B_12,   ⟨z_1² z_2²⟩_B = B_11 B_22 + 2 B_12²,   ⟨z_1⁴⟩_B = 3 B_11²,

so K^(l+1) comes out as an explicit polynomial in the entries of K^(l).

Results for φ = ReLU (see Cho & Saul, 2009):

  K^(l+1)(x_1, x_2) = (σ_a² / 2π) √(K_11 K_22) ( sin θ + (π − θ) cos θ ),   cos θ = K_12 / √(K_11 K_22),

where K_ab = K^(l)(x_a, x_b). Since K can be thought of as an inner product, θ plays the role of the angle between x_1 and x_2.
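A quick Monte Carlo sanity check of this arc-cosine formula (taking σ_a² = 1 and an arbitrary example K^(l)):

```python
import numpy as np

rng = np.random.default_rng(5)

# Previous-layer 2x2 kernel K^(l) (entries are arbitrary illustrative values)
K = np.array([[1.5, 0.5],
              [0.5, 1.0]])

# Cho & Saul (2009) arc-cosine formula for phi = ReLU, sigma_a^2 = 1
cos_t = K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
t = np.arccos(cos_t)
exact = np.sqrt(K[0, 0] * K[1, 1]) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

# Monte Carlo estimate of <relu(z1) relu(z2)> under the same Gaussian
z = rng.multivariate_normal([0.0, 0.0], K, size=1_000_000)
mc = np.mean(np.maximum(z[:, 0], 0.0) * np.maximum(z[:, 1], 0.0))
print(mc, exact)
```

Sanity checks of the formula itself: at θ = 0 it gives K_11/2 = E[ReLU(z)²], and at θ = π/2 it gives √(K_11 K_22)/2π = E[ReLU(z_1)]E[ReLU(z_2)] for independent z's.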
