Unit 3: Multilayer Perceptrons & Backpropagation
Richard Zemel
$\lambda_1 x_1 + \cdots + \lambda_N x_N \in S$ for $\lambda_i > 0$, $\lambda_1 + \cdots + \lambda_N = 1$.
Hard Threshold:
$y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$

Logistic:
$y = \dfrac{1}{1 + e^{-z}}$

Hyperbolic Tangent (tanh):
$y = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
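As a quick reference, here are the three activations as NumPy functions (a minimal sketch; the function names are my own, not from the lecture):

```python
import numpy as np

def hard_threshold(z):
    # y = 1 if z > 0 else 0; not differentiable at 0, zero gradient elsewhere
    return (z > 0).astype(float)

def logistic(z):
    # y = 1 / (1 + e^{-z}); squashes z smoothly into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # y = (e^z - e^{-z}) / (e^z + e^{-z}); squashes z into (-1, 1)
    return np.tanh(z)
```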
The goal:
There are 256 first-level features total. Here are some of them.
We’ve seen that there are some functions that linear classifiers can’t
represent. Are deep networks any better?
Any sequence of linear layers can be equivalently represented with a
single linear layer.
$y = \underbrace{\mathbf{W}^{(3)} \mathbf{W}^{(2)} \mathbf{W}^{(1)}}_{\triangleq \mathbf{W}'} \mathbf{x} = \mathbf{W}' \mathbf{x}$
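A quick numerical check of this claim, with arbitrary made-up layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer
W2 = rng.standard_normal((5, 4))   # second linear layer
W3 = rng.standard_normal((2, 5))   # third linear layer
x = rng.standard_normal(3)

# Applying the layers one at a time...
y_layers = W3 @ (W2 @ (W1 @ x))
# ...gives the same result as the single collapsed matrix W' = W3 W2 W1.
W_prime = W3 @ W2 @ W1
y_single = W_prime @ x

assert np.allclose(y_layers, y_single)
```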
[Figure: the logistic function $y = \sigma(x)$ compared with the sharper $y = \sigma(5x)$.]
This is good: logistic units are differentiable, so we can tune them
with gradient descent. (Stay tuned!)
Limits of universality
You may need to represent an exponentially large network.
If you can learn any function, you’ll just overfit.
Really, we desire a compact representation!
We’ve derived units which compute the functions AND, OR, and
NOT. Therefore, any Boolean circuit can be translated into a
feed-forward neural net.
This suggests you might be able to learn compact representations of some complicated functions.
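For instance, single hard-threshold units with hand-picked weights can compute AND, OR, and NOT, and composing them gives XOR, which no single linear unit can represent. The sketch below uses one valid choice of weights and biases (mine, for illustration):

```python
import numpy as np

def unit(w, b):
    # A single hard-threshold unit: y = 1 if w.x + b > 0 else 0.
    return lambda x: float(np.dot(w, x) + b > 0)

AND = unit(np.array([1.0, 1.0]), -1.5)   # fires only when both inputs are 1
OR  = unit(np.array([1.0, 1.0]), -0.5)   # fires when at least one input is 1
NOT = unit(np.array([-1.0]), 0.5)        # inverts a single binary input

def XOR(x):
    # Two-layer circuit: XOR(x) = AND(OR(x), NOT(AND(x)))
    h = [OR(x), NOT(np.array([AND(x)]))]
    return AND(np.array(h))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, XOR(np.array([a, b])))  # prints 0, 1, 1, 0
```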
We’ve seen that multilayer neural networks are powerful. But how can
we actually learn them?
Backpropagation is the central algorithm in this course.
It’s an algorithm for computing gradients.
Really, it’s an instance of reverse-mode automatic differentiation, which is much more broadly applicable than just neural nets.
This is “just” a clever and efficient use of the Chain Rule for derivatives.
We’ll see how to implement an automatic differentiation system next
week.
Weight space for a multilayer neural net: one coordinate for each weight or
bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — just higher
dimensional and harder to visualize!
We want to compute the cost gradient dJ/dw, which is the vector of partial derivatives.
This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.
Univariate Chain Rule
$\frac{d}{dt} f(x(t)) = \frac{df}{dx} \frac{dx}{dt}$
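A quick finite-difference sanity check of the rule, on a made-up example $f(x) = \sin x$, $x(t) = t^2$:

```python
import numpy as np

def f(x):      return np.sin(x)
def x_of_t(t): return t ** 2

t = 1.3
# Chain rule: d/dt f(x(t)) = f'(x(t)) * x'(t) = cos(t^2) * 2t
analytic = np.cos(t ** 2) * 2 * t

# Central finite-difference estimate of d/dt f(x(t))
eps = 1e-6
numeric = (f(x_of_t(t + eps)) - f(x_of_t(t - eps))) / (2 * eps)

print(analytic, numeric)  # the two values agree closely
```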
Univariate logistic least squares model:
$z = wx + b, \quad y = \sigma(z), \quad L = \tfrac{1}{2}(y - t)^2$

With an $L_2$ regularizer:
$R = \tfrac{1}{2} w^2, \quad L_{\text{reg}} = L + \lambda R$

Multiclass logistic regression:
$z_\ell = \sum_j w_{\ell j} x_j + b_\ell, \quad y_k = \dfrac{e^{z_k}}{\sum_\ell e^{z_\ell}}, \quad L = -\sum_k t_k \log y_k$
Multivariate Chain Rule
Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. (All the variables here are scalar-valued.) Then
$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$
Example:
$f(x, y) = y + e^{xy}, \quad x(t) = \cos t, \quad y(t) = t^2$
Plug in to the Chain Rule:
$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt} = (y e^{xy}) \cdot (-\sin t) + (1 + x e^{xy}) \cdot 2t$
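A finite-difference check of this example (the particular value of $t$ below is arbitrary):

```python
import numpy as np

def f(x, y):   return y + np.exp(x * y)
def x_of_t(t): return np.cos(t)
def y_of_t(t): return t ** 2

t = 0.7
x, y = x_of_t(t), y_of_t(t)

# Multivariate chain rule: df/dt = (y e^{xy})(-sin t) + (1 + x e^{xy})(2t)
analytic = y * np.exp(x * y) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * (2 * t)

# Central finite-difference estimate of d/dt f(x(t), y(t))
eps = 1e-6
numeric = (f(x_of_t(t + eps), y_of_t(t + eps)) -
           f(x_of_t(t - eps), y_of_t(t - eps))) / (2 * eps)

print(analytic, numeric)  # agree to roughly 1e-8
```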
In our notation (where $\bar{v}$ denotes the derivative of the final output with respect to $v$, the error signal passed backward):
$\bar{t} = \bar{x}\,\frac{dx}{dt} + \bar{y}\,\frac{dy}{dt}$
Forward pass:
$z = wx + b$
$y = \sigma(z)$
$L = \tfrac{1}{2}(y - t)^2$
$R = \tfrac{1}{2} w^2$
$L_{\text{reg}} = L + \lambda R$

Backward pass:
$\bar{L}_{\text{reg}} = 1$
$\bar{R} = \bar{L}_{\text{reg}} \frac{dL_{\text{reg}}}{dR} = \bar{L}_{\text{reg}}\, \lambda$
$\bar{L} = \bar{L}_{\text{reg}} \frac{dL_{\text{reg}}}{dL} = \bar{L}_{\text{reg}}$
$\bar{y} = \bar{L} \frac{dL}{dy} = \bar{L}\,(y - t)$
$\bar{z} = \bar{y} \frac{dy}{dz} = \bar{y}\, \sigma'(z)$
$\bar{w} = \bar{z} \frac{\partial z}{\partial w} + \bar{R} \frac{dR}{dw} = \bar{z}\, x + \bar{R}\, w$
$\bar{b} = \bar{z} \frac{\partial z}{\partial b} = \bar{z}$
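These equations transcribe directly into NumPy. The sketch below mirrors the math (the `*_bar` variables hold the derivatives of $L_{\text{reg}}$) and checks the result against finite differences; the specific inputs are made up:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, t, lam):
    # Forward pass
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    L_reg = L + lam * R

    # Backward pass (each bar variable is dL_reg / d(that quantity))
    L_reg_bar = 1.0
    R_bar = L_reg_bar * lam
    L_bar = L_reg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)        # sigma'(z) = sigma(z) * (1 - sigma(z))
    w_bar = z_bar * x + R_bar * w
    b_bar = z_bar
    return L_reg, w_bar, b_bar

w, b, x, t, lam = 0.3, -0.2, 1.5, 1.0, 0.1
L_reg, w_bar, b_bar = forward_backward(w, b, x, t, lam)

# Finite-difference check of dL_reg/dw
eps = 1e-6
num = (forward_backward(w + eps, b, x, t, lam)[0] -
       forward_backward(w - eps, b, x, t, lam)[0]) / (2 * eps)
print(w_bar, num)  # should agree closely
```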
Backprop rules:
$\bar{z}_j = \sum_k \bar{y}_k \frac{\partial y_k}{\partial z_j}$, or in vectorized form, $\bar{\mathbf{z}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{z}}\right)^{\!\top} \bar{\mathbf{y}}$
Examples
Matrix-vector product
$\mathbf{z} = \mathbf{W}\mathbf{x} \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W} \qquad \bar{\mathbf{x}} = \mathbf{W}^{\top} \bar{\mathbf{z}}$
Elementwise operations
$\mathbf{y} = \exp(\mathbf{z}) \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{z}} = \begin{pmatrix} \exp(z_1) & & 0 \\ & \ddots & \\ 0 & & \exp(z_D) \end{pmatrix} \qquad \bar{\mathbf{z}} = \exp(\mathbf{z}) \circ \bar{\mathbf{y}}$
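Both rules in NumPy, with made-up shapes (`z_bar` and `y_bar` play the role of $\bar{\mathbf{z}}$ and $\bar{\mathbf{y}}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix-vector product: z = W x  =>  x_bar = W^T z_bar
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
z_bar = rng.standard_normal(4)        # incoming error signal for z
x_bar = W.T @ z_bar                   # backprop rule for x

# Elementwise operation: y = exp(z)  =>  z_bar = exp(z) * y_bar
z = rng.standard_normal(5)
y_bar = rng.standard_normal(5)        # incoming error signal for y
z_bar_elem = np.exp(z) * y_bar        # the Jacobian is diagonal, so this is elementwise
```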
Forward pass:
$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$
$\mathbf{h} = \sigma(\mathbf{z})$
$\mathbf{y} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$
$L = \tfrac{1}{2} \lVert \mathbf{t} - \mathbf{y} \rVert^2$

Backward pass:
$\bar{L} = 1$
$\bar{\mathbf{y}} = \bar{L}\,(\mathbf{y} - \mathbf{t})$
$\bar{\mathbf{W}}^{(2)} = \bar{\mathbf{y}}\, \mathbf{h}^{\top}$
$\bar{\mathbf{b}}^{(2)} = \bar{\mathbf{y}}$
$\bar{\mathbf{h}} = \mathbf{W}^{(2)\top} \bar{\mathbf{y}}$
$\bar{\mathbf{z}} = \bar{\mathbf{h}} \circ \sigma'(\mathbf{z})$
$\bar{\mathbf{W}}^{(1)} = \bar{\mathbf{z}}\, \mathbf{x}^{\top}$
$\bar{\mathbf{b}}^{(1)} = \bar{\mathbf{z}}$
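The same equations as NumPy code, for a single training example with made-up layer sizes:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, K = 3, 4, 2                      # input, hidden, output sizes (arbitrary)
W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = rng.standard_normal((K, H)), np.zeros(K)
x, t = rng.standard_normal(D), rng.standard_normal(K)

# Forward pass
z = W1 @ x + b1
h = sigma(z)
y = W2 @ h + b2
L = 0.5 * np.sum((t - y) ** 2)

# Backward pass (bar variables are dL / d(that quantity))
y_bar = y - t                          # L_bar = 1, so y_bar = y - t
W2_bar = np.outer(y_bar, h)
b2_bar = y_bar
h_bar = W2.T @ y_bar
z_bar = h_bar * h * (1 - h)            # sigma'(z) = h * (1 - h)
W1_bar = np.outer(z_bar, x)
b1_bar = z_bar
```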
Computational Cost
Computational cost of the forward pass: one add-multiply operation per weight:
$z_i = \sum_j w_{ij}^{(1)} x_j + b_i^{(1)}$
By now, we’ve seen three different ways of looking at gradients:
Geometric: visualization of gradient in weight space
Algebraic: mechanics of computing the derivatives
Implementational: efficient implementation on the computer
When thinking about neural nets, it’s important to be able to shift
between these different perspectives!