
10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Neural Networks and Backpropagation

Matt Gormley
Lecture 20
April 3, 2017

Neural Net Readings:
Murphy --
Bishop 5
HTF 11
Mitchell 4

1
Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

2
Neural Networks Outline
• Logistic Regression (Recap) [Last Lecture]
  – Data, Model, Learning, Prediction
• Neural Networks [Last Lecture]
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network
• Neural Net Architectures [This Lecture]
  – Objective Functions
  – Activation Functions
• Backpropagation [This Lecture]
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation Algorithm
  – Module-based Automatic Differentiation (Autodiff)

3
DECISION BOUNDARY EXAMPLES

4
Example #1: Diagonal Band
Example #2: One Pocket
Example #3: Four Gaussians
Example #4: Two Pockets

[Slides 5-47: for each of the four examples above, a series of plots of the learned decision boundary as the number of hidden units increases.]

Error in slides: "layers" should read "number of hidden units". All the neural networks in this section used 1 hidden layer.
ARCHITECTURES

54
Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

55
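To make the list above concrete, here is a minimal sketch (Python; the `MLPConfig` name and defaults are my own illustration, not part of the lecture) of the four decisions expressed as explicit hyperparameters:

```python
# A minimal sketch; names and defaults are illustrative, not from the lecture.
from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np

@dataclass
class MLPConfig:
    num_hidden_layers: int = 1                                      # 1. depth
    hidden_units: List[int] = field(default_factory=lambda: [16])   # 2. width per layer
    activation: Callable = np.tanh                                   # 3. nonlinearity
    loss: str = "cross_entropy"                                      # 4. objective ("quadratic" or "cross_entropy")
```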
Activation Functions

Neural Network with sigmoid activation functions:

[Network diagram: Input layer → Hidden Layer → Output]

(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

56
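As a companion to the equations (A)-(F) above, here is a minimal NumPy sketch of the forward pass for this one-hidden-layer sigmoid network; the function name, shapes, and bias handling are my own assumptions rather than the course's reference code:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_star):
    """Forward pass for the 1-hidden-layer sigmoid network above.
    x: (M+1,) input with x[0] = 1 as the bias term (assumption).
    alpha: (D, M+1) input-to-hidden weights; beta: (D+1,) hidden-to-output weights.
    """
    a = alpha @ x                      # (B) hidden linear:  a_j = sum_i alpha_ji x_i
    z = sigmoid(a)                     # (C) hidden sigmoid: z_j = 1 / (1 + exp(-a_j))
    z = np.concatenate(([1.0], z))     # prepend z_0 = 1 so beta_0 acts as a bias (assumption)
    b = beta @ z                       # (D) output linear:  b = sum_j beta_j z_j
    y = sigmoid(b)                     # (E) output sigmoid
    J = 0.5 * (y - y_star) ** 2        # (F) quadratic loss
    return a, z, b, y, J
```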
Activation Functions

Neural Network with arbitrary nonlinear activation functions:

[Network diagram: Input layer → Hidden Layer → Output]

(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (nonlinear): $y = \sigma(b)$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (nonlinear): $z_j = \sigma(a_j), \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

57
Activation Functions

Sigmoid / Logistic Function:
$\mathrm{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$

So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function…

58
Activation Functions
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Alternate 1: tanh
Like the logistic function, but shifted to the range [-1, +1].

Slide from William Cohen


AI Stats 2010

[Figure: sigmoid vs. tanh; annotation: "depth 4?"]

Figure from Glorot & Bengio (2010)


Activation Functions
• A new change: modifying the nonlinearity
  – ReLU is often used in vision tasks

Alternate 2: rectified linear unit (ReLU)
Linear with a cutoff at zero.
(Implementation: clip the gradient when you pass zero)

Slide from William Cohen


Activation Functions
• A new change: modifying the nonlinearity
  – ReLU is often used in vision tasks

Alternate 2: rectified linear unit (ReLU)
Soft version (softplus): log(exp(x) + 1)
• Doesn't saturate (at one end)
• Sparsifies outputs
• Helps with the vanishing gradient problem

Slide from William Cohen
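For reference, a minimal NumPy sketch of the activation functions discussed in this section (sigmoid, tanh, ReLU, and the softplus soft version); the function names are my own:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))      # logistic: range (0, 1)

def tanh(u):
    return np.tanh(u)                    # like the logistic but range (-1, +1)

def relu(u):
    return np.maximum(0.0, u)            # linear with a cutoff at zero

def softplus(u):
    return np.log1p(np.exp(u))           # soft version of ReLU: log(exp(u) + 1)
```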


Objective Functions for NNs
• Regression:
  – Use the same objective as Linear Regression
  – Quadratic loss (i.e. mean squared error)
• Classification:
  – Use the same objective as Logistic Regression
  – Cross-entropy (i.e. negative log likelihood)
  – This requires probabilities, so we add an additional "softmax" layer at the end of our network

Quadratic — Forward: $J = \frac{1}{2}(y - y^*)^2$; Backward: $dJ/dy = y - y^*$
Cross Entropy — Forward: $J = y^* \log(y) + (1 - y^*)\log(1 - y)$; Backward: $dJ/dy = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$

63
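A minimal NumPy sketch of the Forward/Backward table above, returning the loss value J and the derivative dJ/dy exactly as written on the slide (function names are my own):

```python
import numpy as np

def quadratic_loss(y, y_star):
    """Forward: J = 1/2 (y - y*)^2.  Backward: dJ/dy = y - y*."""
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy_loss(y, y_star):
    """Forward: J = y* log y + (1 - y*) log(1 - y), as written on the slide.
    Backward: dJ/dy = y*/y + (1 - y*)/(y - 1)."""
    J = y_star * np.log(y) + (1.0 - y_star) * np.log(1.0 - y)
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    return J, dJ_dy
```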
Cross-entropy vs. Quadratic loss

Figure from Glorot & Bengio (2010)


Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

67
Objective Functions

Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming the target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value

A. 1=5, 2=7, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6
C. 1=7, 2=5, 3=5, 4=7
D. 1=7, 2=5, 3=6, 4=8
E. 1=8, 2=6, 3=5, 4=7
F. 1=8, 2=6, 3=8, 4=6

68
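For reference, the standard derivation behind the squared-error matches (not spelled out on the slide): minimizing the sum of squared errors is MLE under zero-mean Gaussian output noise, and adding the squared Euclidean norm of the weights corresponds to MAP with a zero-mean Gaussian prior. A sketch, assuming a model $f_\theta$ and i.i.d. training examples:

```latex
% Assume y^{(d)} = f_\theta(x^{(d)}) + \epsilon, with \epsilon \sim \mathcal{N}(0, \sigma^2).
\hat{\theta}_{\text{MLE}}
  = \arg\max_\theta \sum_d \log \mathcal{N}\!\left(y^{(d)} \mid f_\theta(x^{(d)}), \sigma^2\right)
  = \arg\min_\theta \sum_d \tfrac{1}{2}\left(y^{(d)} - f_\theta(x^{(d)})\right)^2
% Adding a zero-mean Gaussian prior \theta \sim \mathcal{N}(0, \tau^2 I) turns MLE into MAP and
% contributes the squared Euclidean norm term \frac{1}{2\tau^2}\|\theta\|_2^2 to the objective.
```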
BACKPROPAGATION

69
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

70
Training: Approaches to Differentiation
• Question 1:
  When can we compute the gradients of the parameters of an arbitrary neural network?
• Question 2:
  When can we make the gradient computation efficient?

71
Training: Approaches to Differentiation
1. Finite Difference Method
   – Pro: Great for testing implementations of backpropagation
   – Con: Slow for high dimensional inputs / outputs
   – Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   – Note: The method you learned in high school
   – Note: Used by Mathematica / Wolfram Alpha / Maple
   – Pro: Yields easily interpretable derivatives
   – Con: Leads to exponential computation time if not carefully implemented
   – Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   – Note: Called Backpropagation when applied to Neural Nets
   – Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional outputs (e.g. vector-valued functions)
   – Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   – Note: Easy to implement. Uses dual numbers.
   – Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional inputs (e.g. vector-valued x)
   – Required: Algorithm for computing f(x)

72
Training: Finite Difference Method

Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon

73
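A minimal sketch of the finite difference method used as a gradient checker; the centered-difference formula and the function name are my own choices, not taken from the slide:

```python
import numpy as np

def finite_difference_grad(J, theta, epsilon=1e-5):
    """Approximate dJ/dtheta_i with the centered difference
    (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps) for every parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = epsilon
        grad[i] = (J(theta + e) - J(theta - e)) / (2.0 * epsilon)
    return grad

# Usage sketch: compare against a backpropagation implementation on a small example,
# e.g. assert np.allclose(finite_difference_grad(J, theta), backprop_grad, atol=1e-6),
# where backprop_grad is whatever your backprop code returns (hypothetical name).
```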
Training: Symbolic Differentiation

Calculus Quiz #1:
Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

74
Training: Symbolic Differentiation

Calculus Quiz #2:

75
Training: Chain Rule

Whiteboard
– Chain Rule of Calculus

76
Training: Chain Rule

Consider a function $f$ defined as the composition of two functions $g$ and $h$, where the inputs and outputs of $g$ and $h$ are vector-valued variables. Given $h : \mathbb{R}^K \to \mathbb{R}^J$ and $g : \mathbb{R}^J \to \mathbb{R}^I$, then $f : \mathbb{R}^K \to \mathbb{R}^I$. Given an input $\mathbf{x} = \{x_1, x_2, \ldots, x_K\}$, we compute the output $\mathbf{y} = \{y_1, y_2, \ldots, y_I\}$ in terms of intermediate quantities $\mathbf{u} = \{u_1, u_2, \ldots, u_J\}$. That is, the computation $\mathbf{y} = f(\mathbf{x}) = g(h(\mathbf{x}))$ proceeds in a feed-forward manner: $\mathbf{y} = g(\mathbf{u})$ and $\mathbf{u} = h(\mathbf{x})$. Then the chain rule gives:

$$\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k \qquad (2.3)$$

If the outputs of $f$, $g$, and $h$ are all scalars, then we obtain the familiar form:

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} \qquad (2.4)$$

77
Training: Chain Rule

Consider a function $f$ defined as the composition of two functions $g$ and $h$, where the inputs and outputs of $g$ and $h$ are vector-valued variables. Given $h : \mathbb{R}^K \to \mathbb{R}^J$ and $g : \mathbb{R}^J \to \mathbb{R}^I$, then $f : \mathbb{R}^K \to \mathbb{R}^I$. Given an input $\mathbf{x} = \{x_1, x_2, \ldots, x_K\}$, we compute the output $\mathbf{y} = \{y_1, y_2, \ldots, y_I\}$ in terms of intermediate quantities $\mathbf{u} = \{u_1, u_2, \ldots, u_J\}$. That is, the computation $\mathbf{y} = f(\mathbf{x}) = g(h(\mathbf{x}))$ proceeds in a feed-forward manner: $\mathbf{y} = g(\mathbf{u})$ and $\mathbf{u} = h(\mathbf{x})$. Then the chain rule gives:

$$\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k \qquad (2.3)$$

If the outputs of $f$, $g$, and $h$ are all scalars, then we obtain the familiar form:

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} \qquad (2.4)$$

Backpropagation is just repeated application of the chain rule from Calculus 101.

78
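A quick worked instance of the scalar form (2.4), using my own example rather than one from the slides:

```latex
% Let y = g(u) = u^2 and u = h(x) = 3x + 1, so y = f(x) = (3x + 1)^2. Then
\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = (2u)(3) = 6(3x + 1) = 18x + 6,
% which matches differentiating f(x) = 9x^2 + 6x + 1 directly: f'(x) = 18x + 6.
```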
Training: Backpropagation

Whiteboard
– Example: Backpropagation for Calculus Quiz #1

Calculus Quiz #1:
Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

79
Training: Backpropagation

Automatic Differentiation – Reverse Mode (aka. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   For variable u_i with inputs v_1, …, v_N:
   a. Compute u_i = g_i(v_1, …, v_N)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order.
   For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.

80
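The forward/backward procedure above maps directly onto a tiny reverse-mode autodiff sketch in Python; the Node class and helper functions are my own illustration, not the course's implementation:

```python
import math

class Node:
    """One node of the computation graph: a value plus how it was computed."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # forward result, stored at the node
        self.parents = parents          # input nodes v_1, ..., v_N
        self.local_grads = local_grads  # du_i/dv_j for each parent
        self.grad = 0.0                 # dy/du_i, initialized to 0

def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))
def sin(a):    return Node(math.sin(a.value), (a,), (math.cos(a.value),))
def cos(a):    return Node(math.cos(a.value), (a,), (-math.sin(a.value),))

def backward(output, topo_order):
    """Reverse-topological sweep: increment dy/dv_j by (dy/du_i)(du_i/dv_j)."""
    output.grad = 1.0                   # dy/dy = 1
    for node in reversed(topo_order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

# Usage sketch, on the simple example J = cos(sin(x^2) + 3x^2) at x = 2:
# x = Node(2.0); t = mul(x, x); u1 = sin(t); u2 = mul(Node(3.0), t); u = add(u1, u2)
# J = cos(u); backward(J, [x, t, u1, u2, u, J]); x.grad now holds dJ/dx.
```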


Training: Backpropagation

Simple Example: The goal is to compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and the derivative $dJ/dx$ on the backward pass.

Forward: $J = \cos(u)$
Backward: $dJ/du = -\sin(u)$

Forward: $u = u_1 + u_2$
Backward: $dJ/du_1 = (dJ/du)(du/du_1)$, $du/du_1 = 1$; $dJ/du_2 = (dJ/du)(du/du_2)$, $du/du_2 = 1$

Forward: $u_1 = \sin(t)$
Backward: $dJ/dt$ += $(dJ/du_1)(du_1/dt)$, $du_1/dt = \cos(t)$

Forward: $u_2 = 3t$
Backward: $dJ/dt$ += $(dJ/du_2)(du_2/dt)$, $du_2/dt = 3$

Forward: $t = x^2$
Backward: $dJ/dx = (dJ/dt)(dt/dx)$, $dt/dx = 2x$

81
Training: Backpropagation

Simple Example (repeated): compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and $dJ/dx$ on the backward pass, exactly as in the table above.

82
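As a sanity check, here is a small Python sketch (my own, not from the slides) that evaluates the forward and backward tables for this simple example and compares the result against a finite-difference estimate:

```python
import math

def simple_example(x):
    # Forward pass, following the table above
    t = x ** 2
    u1 = math.sin(t)
    u2 = 3 * t
    u = u1 + u2
    J = math.cos(u)
    # Backward pass: walk the table bottom-up, accumulating dJ/dt from both branches
    dJ_du = -math.sin(u)
    dJ_du1 = dJ_du * 1.0
    dJ_du2 = dJ_du * 1.0
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3.0
    dJ_dx = dJ_dt * 2 * x
    return J, dJ_dx

x = 2.0
J, dJ_dx = simple_example(x)
eps = 1e-6
fd = (simple_example(x + eps)[0] - simple_example(x - eps)[0]) / (2 * eps)
print(J, dJ_dx, fd)   # dJ_dx and the finite-difference estimate should agree closely
```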
Training: Backpropagation

Whiteboard
– SGD for Neural Network
– Example: Backpropagation for Neural Network

83
Training: Backpropagation

Case 1: Logistic Regression

[Diagram: inputs with parameters $\theta_1, \theta_2, \theta_3, \ldots, \theta_M$ feeding a single output]

Forward: $J = y^* \log(y) + (1 - y^*)\log(1 - y)$
Backward: $dJ/dy = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$

Forward: $y = \frac{1}{1 + \exp(-a)}$
Backward: $dJ/da = (dJ/dy)(dy/da)$, $dy/da = \frac{\exp(-a)}{(\exp(-a) + 1)^2}$

Forward: $a = \sum_{j=0}^{D} \theta_j x_j$
Backward: $dJ/d\theta_j = (dJ/da)(da/d\theta_j)$, $da/d\theta_j = x_j$; $dJ/dx_j = (dJ/da)(da/dx_j)$, $da/dx_j = \theta_j$

84
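A minimal NumPy sketch of the Case 1 table (logistic regression), computing the forward pass and then the backward pass with the derivatives exactly as written above; the function name and bias convention are my own:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logreg_forward_backward(x, theta, y_star):
    """Case 1: forward pass, then backward pass via the chain rule.
    x, theta: arrays of the same length (x[0] = 1 can serve as the bias term)."""
    # Forward
    a = theta @ x
    y = sigmoid(a)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x        # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta        # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx
```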
Training: Backpropagation

[Network diagram: Input layer → Hidden Layer → Output]

(F) Loss: $J = \frac{1}{2}(y - y^{(d)})^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

85
Training: Backpropagation

[Network diagram: Input layer → Hidden Layer → Output]

(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

86
Training: Backpropagation

Case 2: Neural Network

Forward: $J = y^* \log(y) + (1 - y^*)\log(1 - y)$
Backward: $dJ/dy = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$

Forward: $y = \frac{1}{1 + \exp(-b)}$
Backward: $dJ/db = (dJ/dy)(dy/db)$, $dy/db = \frac{\exp(-b)}{(\exp(-b) + 1)^2}$

Forward: $b = \sum_{j=0}^{D} \beta_j z_j$
Backward: $dJ/d\beta_j = (dJ/db)(db/d\beta_j)$, $db/d\beta_j = z_j$; $dJ/dz_j = (dJ/db)(db/dz_j)$, $db/dz_j = \beta_j$

Forward: $z_j = \frac{1}{1 + \exp(-a_j)}$
Backward: $dJ/da_j = (dJ/dz_j)(dz_j/da_j)$, $dz_j/da_j = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2}$

Forward: $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$
Backward: $dJ/d\alpha_{ji} = (dJ/da_j)(da_j/d\alpha_{ji})$, $da_j/d\alpha_{ji} = x_i$; $dJ/dx_i = \sum_{j=0}^{D} (dJ/da_j)(da_j/dx_i)$, $da_j/dx_i = \alpha_{ji}$

87
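A minimal NumPy sketch of the Case 2 table (one hidden layer, sigmoid activations, and the loss as written above); the shapes and the absence of explicit bias terms are my own simplifications:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nn_forward_backward(x, alpha, beta, y_star):
    """Case 2: x: (M,), alpha: (D, M), beta: (D,), y_star: scalar target."""
    # Forward
    a = alpha @ x                 # (B) hidden linear
    z = sigmoid(a)                # (C) hidden sigmoid
    b = beta @ z                  # (D) output linear
    y = sigmoid(b)                # (E) output sigmoid
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)   # (F) loss, as written above
    # Backward (apply the chain rule module by module, top to bottom)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)             # dy/db = y(1 - y), same as exp(-b)/(exp(-b)+1)^2
    dJ_dbeta = dJ_db * z                    # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta                    # db/dz_j = beta_j
    dJ_da = dJ_dz * z * (1 - z)             # dz_j/da_j = z_j(1 - z_j)
    dJ_dalpha = np.outer(dJ_da, x)          # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da                 # dJ/dx_i = sum_j dJ/da_j * alpha_ji
    return J, dJ_dalpha, dJ_dbeta, dJ_dx
```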
Training: Backpropagation

Backpropagation (Auto. Diff. - Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
3. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
4. Visit each node in reverse topological order.
   For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.

88
Training: Backpropagation

Case 2: Neural Network

Module 5 — Forward: $J = y^* \log(y) + (1 - y^*)\log(1 - y)$
           Backward: $dJ/dy = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$

Module 4 — Forward: $y = \frac{1}{1 + \exp(-b)}$
           Backward: $dJ/db = (dJ/dy)(dy/db)$, $dy/db = \frac{\exp(-b)}{(\exp(-b) + 1)^2}$

Module 3 — Forward: $b = \sum_{j=0}^{D} \beta_j z_j$
           Backward: $dJ/d\beta_j = (dJ/db)(db/d\beta_j)$, $db/d\beta_j = z_j$; $dJ/dz_j = (dJ/db)(db/dz_j)$, $db/dz_j = \beta_j$

Module 2 — Forward: $z_j = \frac{1}{1 + \exp(-a_j)}$
           Backward: $dJ/da_j = (dJ/dz_j)(dz_j/da_j)$, $dz_j/da_j = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2}$

Module 1 — Forward: $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$
           Backward: $dJ/d\alpha_{ji} = (dJ/da_j)(da_j/d\alpha_{ji})$, $da_j/d\alpha_{ji} = x_i$; $dJ/dx_i = \sum_{j=0}^{D} (dJ/da_j)(da_j/dx_i)$, $da_j/dx_i = \alpha_{ji}$

89
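A minimal sketch of the module-based organization suggested by this slide: each module stores what it needs from the forward pass and exposes forward/backward methods. Class names and interfaces are my own illustration, not the course's starter code.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class Linear:
    """Modules 1 and 3: out_j = sum_i W_ji * in_i."""
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, d_out):
        self.dW = np.outer(d_out, self.x)   # dJ/dW_ji = dJ/dout_j * in_i
        return self.W.T @ d_out             # dJ/din_i = sum_j dJ/dout_j * W_ji

class Sigmoid:
    """Modules 2 and 4: z = 1 / (1 + exp(-a))."""
    def forward(self, a):
        self.z = sigmoid(a)
        return self.z
    def backward(self, d_out):
        return d_out * self.z * (1.0 - self.z)

class SlideLoss:
    """Module 5: J = y* log y + (1 - y*) log(1 - y), as written on the slide."""
    def forward(self, y, y_star):
        self.y, self.y_star = y, y_star
        return y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    def backward(self):
        return self.y_star / self.y + (1 - self.y_star) / (self.y - 1)

# Usage sketch: call forward() on Modules 1 -> 5 in order, then backward() on
# Modules 5 -> 1, feeding each module's output gradient into the previous module.
```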
Background: A Recipe for Machine Learning

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

90
Summary
1. Neural Networks…
   – provide a way of learning features
   – are highly nonlinear prediction functions
   – (can be) a highly parallel network of logistic regression classifiers
   – discover useful hidden representations of the input
2. Backpropagation…
   – provides an efficient way to compute gradients
   – is a special case of reverse-mode automatic differentiation
91
