Lecture 20: Backprop
Neural Networks and Backpropagation
Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Matt Gormley
Lecture 20
April 3, 2017
1
Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  (Expectation: you should spend at most 1 hour on your reviews.)
• Peer Tutoring
2
Neural Networks Outline
• Logistic Regression (Recap)  (Last Lecture)
  – Data, Model, Learning, Prediction
• Neural Networks  (Last Lecture)
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network
• Neural Net Architectures  (This Lecture)
  – Objective Functions
  – Activation Functions
• Backpropagation  (This Lecture)
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation Algorithm
  – Module-based Automatic Differentiation (Autodiff)
3
DECISION BOUNDARY EXAMPLES
4
Example #1: Diagonal Band
5
Example #2: One Pocket
6
Example #3: Four Gaussians
7
Example #4: Two Pockets
8
Example #1: Diagonal Band (slides 9–15: sequence of decision boundary plots)
Example #2: One Pocket (slides 16–23: sequence of decision boundary plots)
Example #3: Four Gaussians (slides 24–29, 36–38: sequence of decision boundary plots)
Example #4: Two Pockets (slides 39–47: sequence of decision boundary plots)
ARCHITECTURES
54
Neural Network Architectures
Even for a basic Neural Network, there are
many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
55
Activation Functions
[Slides 56–57: neural network diagram; (A) Input: given x_i, ∀i]
Activation Functions
Sigmoid / Logistic Function:
  logistic(u) ≡ 1 / (1 + e^(−u))
So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function…
58
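For reference, a minimal Python sketch of the logistic function above and its derivative (the function names are illustrative, not from the slides):

import math

def logistic(u):
    """Sigmoid activation: logistic(u) = 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def logistic_derivative(u):
    """d/du logistic(u) = logistic(u) * (1 - logistic(u))."""
    s = logistic(u)
    return s * (1.0 - s)

print(logistic(0.0))             # 0.5
print(logistic_derivative(0.0))  # 0.25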
Activation Functions
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
• Alternate 1: tanh
[Figure: sigmoid vs. tanh (depth 4)]

Objective Functions

Forward:  Quadratic loss       J = ½ (y − y*)^2
Backward:                      dJ/dy = y − y*

Forward:  Cross-entropy loss   J = y* log(y) + (1 − y*) log(1 − y)
Backward:                      dJ/dy = y*/y + (1 − y*)/(y − 1)
63
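A minimal Python sketch of the two objectives in this table and their derivatives dJ/dy, written with the same sign convention as the slide (function names are illustrative):

import math

def quadratic_loss(y, y_star):
    """J = 0.5 * (y - y*)^2 and its derivative dJ/dy = y - y*."""
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy(y, y_star):
    """J = y* log(y) + (1 - y*) log(1 - y) and dJ/dy = y*/y + (1 - y*)/(y - 1),
    written as on the slide (a log-likelihood; the usual loss is -J)."""
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    return J, dJ_dy

print(quadratic_loss(0.8, 1.0))   # (0.02..., -0.2)
print(cross_entropy(0.8, 1.0))    # (log(0.8), 1.25)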
Cross-entropy vs. Quadratic loss
67
Objective Functions
Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer. Match each training objective (1–4) with the estimate it gives (5–8):

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value
69
A Recipe for Machine Learning (Background)
[Recap slide; visible text: 1. Given training data: …  3. Define goal: …]
70
Training: Approaches to Differentiation
• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
71
Training: Approaches to Differentiation
1. Finite Difference Method
   – Pro: Great for testing implementations of backpropagation
   – Con: Slow for high dimensional inputs / outputs
   – Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   – Note: The method you learned in high school
   – Note: Used by Mathematica / Wolfram Alpha / Maple
   – Pro: Yields easily interpretable derivatives
   – Con: Leads to exponential computation time if not carefully implemented
   – Required: Mathematical expression that defines f(x)
3. Automatic Differentiation – Reverse Mode
   – Note: Called Backpropagation when applied to Neural Nets
   – Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional outputs (e.g. vector-valued functions)
   – Required: Algorithm for computing f(x)
4. Automatic Differentiation – Forward Mode
   – Note: Easy to implement. Uses dual numbers (see the sketch below).
   – Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional inputs (e.g. vector-valued x)
   – Required: Algorithm for computing f(x)
72
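A minimal sketch of forward-mode automatic differentiation with dual numbers, referenced in item 4 above. The Dual class and operator overloads are illustrative assumptions, and the test function is the J = cos(sin(x^2) + 3x^2) example used later in the lecture:

import math

class Dual:
    """Dual number value + deriv*eps with eps^2 = 0; deriv carries the derivative."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def sin(x):
    # Overload that propagates the derivative: d sin(v)/dv = cos(v)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def cos(x):
    # d cos(v)/dv = -sin(v)
    return Dual(math.cos(x.value), -math.sin(x.value) * x.deriv)

# f(x) = cos(sin(x^2) + 3x^2): seed dx/dx = 1, then read off df/dx.
x = Dual(2.0, 1.0)
J = cos(sin(x * x) + 3 * x * x)
print(J.value, J.deriv)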
Training Finite Difference Method
Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon (see the gradient-check sketch below)
73
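A minimal gradient-check sketch along these lines; it uses a centered difference, and the helper name and test function are illustrative assumptions:

def finite_difference_grad(f, x, epsilon=1e-5):
    """Centered finite differences: df/dx_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
    Requires only the ability to evaluate f; useful for checking a
    backpropagation implementation on small examples."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += epsilon
        x_minus[i] -= epsilon
        grad.append((f(x_plus) - f(x_minus)) / (2 * epsilon))
    return grad

# Example: check the gradient of f(x) = x_0^2 + 3*x_0*x_1 at (2, 1).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
print(finite_difference_grad(f, [2.0, 1.0]))  # approximately [7.0, 6.0]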
Training Symbolic Differentiation
74
Training Symbolic Differentiation
75
Training Chain Rule
Whiteboard
– Chain Rule of Calculus
76
Training Chain Rule
Chain Rule of Calculus:
Given a function f defined as the composition of two functions g and h, where the inputs and outputs of g and h are vector-valued variables:
  g : R^J → R^I and h : R^K → R^J  ⇒  f : R^K → R^I
Given an input x = {x_1, x_2, …, x_K}, we compute the output y = {y_1, y_2, …, y_I} in terms of intermediate quantities u = {u_1, u_2, …, u_J}. That is, the computation y = f(x) = g(h(x)) proceeds in a feed-forward manner: y = g(u) and u = h(x). Then the chain rule gives:

  dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k),  ∀ i, k        (2.3)

If the inputs and outputs of f, g, and h are all scalars, then we obtain the familiar form:

  dy/dx = (dy/du)(du/dx)                                          (2.4)
77

Training Chain Rule
Backpropagation is just repeated application of the chain rule from Calculus 101.
(Chain rule as above, Eqs. 2.3–2.4.)
78
Training Backpropagation
Whiteboard
– Example: Backpropagation for Calculus Quiz #1
79
Training Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the “computation graph”).
2. Visit each node in topological order.
   For variable u_i with inputs v_1,…, v_N:
   a. Compute u_i = g_i(v_1,…, v_N)
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order.
   For variable u_i = g_i(v_1,…, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)
81
Training Backpropagation
Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.

Forward:  J = cos(u)
Backward: dJ/du = −sin(u)

Forward:  u = u_1 + u_2
Backward: dJ/du_1 = (dJ/du)(du/du_1), du/du_1 = 1;  dJ/du_2 = (dJ/du)(du/du_2), du/du_2 = 1

Forward:  u_1 = sin(t)
Backward: dJ/dt = (dJ/du_1)(du_1/dt), du_1/dt = cos(t)

Forward:  u_2 = 3t
Backward: dJ/dt = (dJ/du_2)(du_2/dt), du_2/dt = 3

Forward:  t = x^2
Backward: dJ/dx = (dJ/dt)(dt/dx), dt/dx = 2x

(The two dJ/dt rows are contributions that get summed, per the “increment” step of the backward computation.)
82
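A minimal Python sketch that mirrors this forward/backward table line by line (variable names follow the slide; the code itself is an illustration, not course-provided):

import math

def forward_backward(x):
    # Forward pass: evaluate each intermediate quantity in topological order.
    t  = x ** 2
    u1 = math.sin(t)
    u2 = 3 * t
    u  = u1 + u2
    J  = math.cos(u)

    # Backward pass: apply the chain rule in reverse topological order.
    dJ_du  = -math.sin(u)
    dJ_du1 = dJ_du * 1.0                            # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                            # du/du2 = 1
    dJ_dt  = dJ_du1 * math.cos(t) + dJ_du2 * 3.0    # du1/dt = cos(t), du2/dt = 3
    dJ_dx  = dJ_dt * 2 * x                          # dt/dx = 2x
    return J, dJ_dx

print(forward_backward(2.0))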
Training Backpropagation
Whiteboard
– SGD for Neural Network
– Example: Backpropagation for Neural Network
83
Training Backpropagation
Case 1: Logistic Regression
[Diagram: inputs x_1,…,x_M, parameters θ_1, θ_2, θ_3,…, θ_M, output y]

Forward:  J = y* log(y) + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward:  y = 1 / (1 + exp(−a))
Backward: dJ/da = (dJ/dy)(dy/da),  dy/da = exp(−a) / (exp(−a) + 1)^2

Forward:  a = Σ_{j=0}^{D} θ_j x_j
Backward: dJ/dθ_j = (dJ/da)(da/dθ_j),  da/dθ_j = x_j
          dJ/dx_j = (dJ/da)(da/dx_j),  da/dx_j = θ_j
84
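A minimal Python sketch of Case 1 for a single training example, following the table above; the helper name and the example inputs are illustrative assumptions:

import math

def logistic_regression_backprop(x, theta, y_star):
    """One example; x and theta are lists of equal length (x[0] = 1 for a bias)."""
    # Forward
    a = sum(theta_j * x_j for theta_j, x_j in zip(theta, x))
    y = 1.0 / (1.0 + math.exp(-a))
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

    # Backward (chain rule, row by row as in the table)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = math.exp(-a) / (math.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * x_j for x_j in x]               # da/dtheta_j = x_j
    dJ_dx = [dJ_da * theta_j for theta_j in theta]       # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

print(logistic_regression_backprop([1.0, 2.0], [0.5, -0.3], 1.0))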
Training Backpropagation
[Slides 85–86: neural network computation diagram, from (A) Input: given x_i, ∀i, to (F) Loss: J = ½ (y − y*)^2]
Training Backpropagation
Case 2: Neural Network
[Diagram: two-layer network with hidden units z_j between inputs x_i and output y]

Forward:  J = y* log(y) + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward:  y = 1 / (1 + exp(−b))
Backward: dJ/db = (dJ/dy)(dy/db),  dy/db = exp(−b) / (exp(−b) + 1)^2

Forward:  b = Σ_{j=0}^{D} β_j z_j
Backward: dJ/dβ_j = (dJ/db)(db/dβ_j),  db/dβ_j = z_j
          dJ/dz_j = (dJ/db)(db/dz_j),  db/dz_j = β_j

Forward:  z_j = 1 / (1 + exp(−a_j))
Backward: dJ/da_j = (dJ/dz_j)(dz_j/da_j),  dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)^2

Forward:  a_j = Σ_{i=0}^{M} α_{ji} x_i
Backward: dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}),  da_j/dα_{ji} = x_i
          dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j)(da_j/dx_i),  da_j/dx_i = α_{ji}
87
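A minimal Python sketch of Case 2 for a single training example, following the table above; the weight names alpha and beta follow the slide, while the helper names, example sizes, and omission of bias terms are illustrative assumptions:

import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def nn_backprop(x, alpha, beta, y_star):
    """One hidden layer, sigmoid activations, cross-entropy objective as on the slide.
    alpha[j][i] is the weight from input i to hidden unit j; beta[j] is the weight
    from hidden unit j to the output."""
    # Forward
    a = [sum(alpha[j][i] * x[i] for i in range(len(x))) for j in range(len(alpha))]
    z = [sigmoid(a_j) for a_j in a]
    b = sum(beta[j] * z[j] for j in range(len(beta)))
    y = sigmoid(b)
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * math.exp(-b) / (math.exp(-b) + 1) ** 2
    dJ_dbeta = [dJ_db * z[j] for j in range(len(beta))]           # db/dbeta_j = z_j
    dJ_dz = [dJ_db * beta[j] for j in range(len(beta))]           # db/dz_j = beta_j
    dJ_da = [dJ_dz[j] * math.exp(-a[j]) / (math.exp(-a[j]) + 1) ** 2
             for j in range(len(a))]                              # dz_j/da_j
    dJ_dalpha = [[dJ_da[j] * x[i] for i in range(len(x))]         # da_j/dalpha_ji = x_i
                 for j in range(len(alpha))]
    return J, dJ_dalpha, dJ_dbeta

print(nn_backprop([1.0, 2.0], [[0.1, -0.2], [0.3, 0.4]], [0.5, -0.5], 1.0))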
Training Backpropagation
Backpropagation (Automatic Differentiation – Reverse Mode)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the “computation graph”).
2. Visit each node in topological order.
   a. Compute the corresponding variable’s value
   b. Store the result at the node
Backward Computation
3. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
4. Visit each node in reverse topological order.
   For variable u_i = g_i(v_1,…, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)
Return partial derivatives dy/du_i for all variables.
88
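A minimal Python sketch of this procedure on an explicit computation graph, using the earlier example J = cos(sin(x^2) + 3x^2); the node representation (name, function, parents, local partials) is an illustrative assumption rather than a prescribed data structure:

import math

# Each node: (name, function, parent names, local partial derivatives w.r.t. parents).
# The list below is already in topological order for J = cos(sin(x^2) + 3x^2).
nodes = [
    ("t",  lambda v: v["x"] ** 2,        ["x"],        [lambda v: 2 * v["x"]]),
    ("u1", lambda v: math.sin(v["t"]),   ["t"],        [lambda v: math.cos(v["t"])]),
    ("u2", lambda v: 3 * v["t"],         ["t"],        [lambda v: 3.0]),
    ("u",  lambda v: v["u1"] + v["u2"],  ["u1", "u2"], [lambda v: 1.0, lambda v: 1.0]),
    ("J",  lambda v: math.cos(v["u"]),   ["u"],        [lambda v: -math.sin(v["u"])]),
]

def backprop(x):
    # Forward: visit nodes in topological order, storing each value.
    v = {"x": x}
    for name, fn, _, _ in nodes:
        v[name] = fn(v)
    # Backward: initialize dJ/dJ = 1 and all other adjoints to 0,
    # then visit nodes in reverse topological order.
    adj = {name: 0.0 for name in v}
    adj["J"] = 1.0
    for name, _, parents, partials in reversed(nodes):
        for parent, partial in zip(parents, partials):
            adj[parent] += adj[name] * partial(v)   # dJ/dparent += dJ/dnode * dnode/dparent
    return v["J"], adj["x"]

print(backprop(2.0))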
Training Backpropagation
Case 2: Neural Network (module view)

Module 5 (loss):
  Forward:  J = y* log(y) + (1 − y*) log(1 − y)
  Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Module 4 (output sigmoid):
  Forward:  y = 1 / (1 + exp(−b))
  Backward: dJ/db = (dJ/dy)(dy/db),  dy/db = exp(−b) / (exp(−b) + 1)^2

Module 3 (output linear layer):
  Forward:  b = Σ_{j=0}^{D} β_j z_j
  Backward: dJ/dβ_j = (dJ/db)(db/dβ_j),  db/dβ_j = z_j
            dJ/dz_j = (dJ/db)(db/dz_j),  db/dz_j = β_j

Module 2 (hidden sigmoid):
  Forward:  z_j = 1 / (1 + exp(−a_j))
  Backward: dJ/da_j = (dJ/dz_j)(dz_j/da_j),  dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)^2

Module 1 (input linear layer):
  Forward:  a_j = Σ_{i=0}^{M} α_{ji} x_i
  Backward: dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}),  da_j/dα_{ji} = x_i
            dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j)(da_j/dx_i),  da_j/dx_i = α_{ji}
89
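A minimal sketch of the module-based view suggested by the labels above: each module implements a forward() that caches what it needs and a backward() that applies only its own local derivative. The class names, the plain-list linear algebra, and the omission of bias terms are illustrative assumptions:

import math

class Sigmoid:
    """Element-wise sigmoid module: forward caches the output, backward applies dz/da."""
    def forward(self, a):
        self.z = [1.0 / (1.0 + math.exp(-a_j)) for a_j in a]
        return self.z
    def backward(self, dJ_dz):
        return [dJ_dz[j] * self.z[j] * (1 - self.z[j]) for j in range(len(self.z))]

class Linear:
    """Fully connected module: forward computes w.x, backward returns (dJ/dw, dJ/dx)."""
    def __init__(self, w):
        self.w = w                      # w[j][i]: weight from input i to output j
    def forward(self, x):
        self.x = x
        return [sum(wj[i] * x[i] for i in range(len(x))) for wj in self.w]
    def backward(self, dJ_dout):
        dJ_dw = [[dJ_dout[j] * self.x[i] for i in range(len(self.x))]
                 for j in range(len(self.w))]
        dJ_dx = [sum(dJ_dout[j] * self.w[j][i] for j in range(len(self.w)))
                 for i in range(len(self.x))]
        return dJ_dw, dJ_dx

# Composing modules as in the slide (Module 1: linear, 2: sigmoid, 3: linear,
# 4: sigmoid, 5: loss); each module only needs its own local backward rule.
lin1, act1 = Linear([[0.1, -0.2], [0.3, 0.4]]), Sigmoid()
lin2, act2 = Linear([[0.5, -0.5]]), Sigmoid()

x, y_star = [1.0, 2.0], 1.0
y = act2.forward(lin2.forward(act1.forward(lin1.forward(x))))[0]

dJ_dy = [y_star / y + (1 - y_star) / (y - 1)]      # Module 5 backward (cross-entropy)
dJ_db = act2.backward(dJ_dy)
dJ_dbeta, dJ_dz = lin2.backward(dJ_db)
dJ_da = act1.backward(dJ_dz)
dJ_dalpha, dJ_dx = lin1.backward(dJ_da)
print(y, dJ_dbeta, dJ_dalpha)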
A Recipe for Machine Learning (Background): Gradients
1. Given training data:        3. Define goal:
2. Choose each of these:
   – Decision function
   – Loss function
4. Train with SGD:
   (take small steps opposite the gradient; see the sketch below)

Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
90
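A one-line sketch of the SGD update in step 4 (the parameter values and gradient here are made-up numbers; the gradient would come from a backward pass):

def sgd_step(theta, grad_J, learning_rate=0.1):
    """One SGD update on parameters theta: a small step opposite the gradient of the loss J."""
    return [t - learning_rate * g for t, g in zip(theta, grad_J)]

theta = [0.5, -0.3]
grad_J = [0.2, -0.1]       # e.g., dJ/dtheta from backpropagation on one example
theta = sgd_step(theta, grad_J)
print(theta)               # [0.48, -0.29]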
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic
regression classifiers
– discover useful hidden representations of the
input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic
differentiation
91