Lec 02 Computation Graphs
2.1 Logistic Regression
Supervised Learning

Regression (example output: a price such as 143,52 €):
- Mapping: f_w : R^N → R

Classification (example output: a label such as "Beach")
Logistic Regression

Conditional Maximum Likelihood Estimator for w:

ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)

- But how to choose f_w(x)?
- Requirement: f_w(x) ∈ [0, 1]
- Choose f_w(x) = σ(w^⊤ x)

With ŷ_i = f_w(x_i) = σ(w^⊤ x_i):

ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)
     = argmax_w ∑_{i=1}^N log [ ŷ_i^{y_i} (1 − ŷ_i)^{1 − y_i} ]
     = argmin_w ∑_{i=1}^N −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)    (Binary Cross Entropy Loss L(ŷ_i, y_i))

- In ML, we use the more general term "loss function" rather than "error function".
- Interpretation: We minimize the dissimilarity between the empirical data distribution p_data (defined by the training set) and the model distribution p_model.
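To make the binary cross entropy concrete, here is a minimal NumPy sketch; the helper names (sigmoid, bce_loss) and the example numbers are illustrative, not taken from the lecture.

import numpy as np

def sigmoid(u):
    # logistic sigmoid, maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def bce_loss(y_hat, y, eps=1e-12):
    # L(y_hat, y) = -y log(y_hat) - (1 - y) log(1 - y_hat)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Example: prediction y_hat = sigma(w^T x) for a positive label y = 1
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
y_hat = sigmoid(w @ x)        # sigmoid(0.0) = 0.5
print(bce_loss(y_hat, 1.0))   # -log(0.5) ≈ 0.693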
Logistic Regression

A simple 1D example:

[Figure: a simple 1D logistic regression example, shown across several slides]
Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
Logistic Regression

Maximum Likelihood for Logistic Regression:

ŵ_ML = argmin_w ∑_{i=1}^N −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)    (Binary Cross Entropy Loss L(ŷ_i, y_i))

with ŷ = f_w(x) = σ(w^⊤ x) and σ(x) = 1 / (1 + e^{−x})

Gradient Descent:
- Pick step size η and tolerance ε
- Initialize w_0
- Repeat until ‖v‖ < ε:
  - v = ∇_w L(ŷ, y) = ∑_{i=1}^N ∇_w L(ŷ_i, y_i)
  - w_{t+1} = w_t − η v

Variants:
- Line search
- Conjugate gradients
- L-BFGS
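As an illustration of the recipe above, a minimal NumPy sketch of batch gradient descent for logistic regression; the function name, stopping constants, and the tiny synthetic data set are assumptions for the example, not part of the lecture.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic_regression(X, y, eta=0.1, tol=1e-6, max_iter=1000):
    # X: (N, D) inputs (first column = 1 for the bias), y: (N,) labels in {0, 1}
    w = np.zeros(X.shape[1])            # initialize w_0
    for _ in range(max_iter):
        y_hat = sigmoid(X @ w)          # forward pass for all N points
        v = X.T @ (y_hat - y)           # v = sum_i grad_w L(y_hat_i, y_i) for the BCE loss
        if np.linalg.norm(v) < tol:     # stop when the gradient is small
            break
        w = w - eta * v                 # w_{t+1} = w_t - eta * v
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_hat = fit_logistic_regression(X, y)
print(w_hat, sigmoid(X @ w_hat))        # predictions approach the labels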
Logistic Regression

Examples with two-dimensional inputs (x_1, x_2) ∈ R^2:

[Figure: two examples plotted over x_1 and x_2 (ranges roughly −10 to 10), with a color scale from 0.0 to 1.0 for the model output]
Logistic Regression

ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i | x_i, w)              (Log-Likelihood)

     = argmax_w E_{p_data}[ log p_model(y | x, w) ]

     = argmin_w −E_{p_data}[ log p_model(y | x, w) ]              (Cross Entropy H(p_data, p_model))

     = argmin_w E_{p_data}[ log p_data(y | x) − log p_model(y | x, w) ]

     = argmin_w D_KL(p_data ‖ p_model)                            (KL Divergence)

[Figure: p_data and p_model plotted over x; when the KL divergence is small, p_model closely matches p_data]
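The identity behind the last steps, H(p_data, p_model) = H(p_data) + D_KL(p_data ‖ p_model), can be checked numerically; the two discrete distributions below are illustrative values only.

import numpy as np

p_data  = np.array([0.1, 0.4, 0.5])
p_model = np.array([0.2, 0.3, 0.5])

cross_entropy = -np.sum(p_data * np.log(p_model))            # H(p_data, p_model)
entropy       = -np.sum(p_data * np.log(p_data))             # H(p_data), independent of the model
kl            =  np.sum(p_data * np.log(p_data / p_model))   # D_KL(p_data || p_model)

# Minimizing the cross entropy over the model therefore minimizes the KL divergence.
print(np.isclose(cross_entropy, entropy + kl))               # True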
2.2 Computation Graphs
Logistic Regression

Maximum Likelihood for Logistic Regression:

ŵ_ML = argmin_w ∑_{i=1}^N −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)    (Binary Cross Entropy Loss L(ŷ_i, y_i))

with ŷ = f_w(x) = σ(w^⊤ x) and σ(x) = 1 / (1 + e^{−x})
Computation Graphs

A computation graph has three kinds of nodes: input nodes, parameter nodes, and compute nodes.

Example: a linear model with squared-error loss, decomposed into fine-grained compute nodes:
(1) u = w_1 x
(2) ŷ = w_0 + u
(3) z = ŷ − y
(4) L = z²

The same model with coarser compute nodes:
(1) ŷ = w_0 + w_1 x
(2) z = ŷ − y
(3) L = z²

or even:
(1) ŷ = w_0 + w_1 x
(2) L = (ŷ − y)²

Logistic regression with a 1D input:
(1) u = w_0 + w_1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

Logistic regression with a vector-valued input:
(1) u = w^⊤ x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

A model with one hidden layer:
(1) h = σ(W_1^⊤ x)
(2) ŷ = σ(w_2^⊤ h)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)
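As a sketch of how such a decomposition maps to code, here is the vector-valued logistic regression graph evaluated step by step in NumPy (plain code, not the lecture's framework; the numbers are arbitrary).

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Input nodes x, y and parameter node w
x = np.array([1.0, 2.0, -1.0])
y = 1.0
w = np.array([0.3, -0.2, 0.5])

# Compute nodes, evaluated in order
u = w @ x                                                 # (1) u = w^T x
y_hat = sigmoid(u)                                        # (2) y_hat = sigma(u)
L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)      # (3) binary cross entropy
print(u, y_hat, L)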
2.3 Backpropagation
Backpropagation

We need the gradient of the training loss,

∇_w ∑_{i=1}^N −log p_model(y_i | x_i, w)

where each summand is the per-example loss L(y_i, x_i, w).

Chain Rule:

d/dx f(g(x)) = (df/dg)(dg/dx)

d/dx f(g_1(x), ..., g_M(x)) = ∑_{i=1}^M (∂f/∂g_i)(dg_i/dx)
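A quick numerical sanity check of the multivariate chain rule, using the illustrative choice f(g_1, g_2) = g_1 · g_2 with g_1(x) = x² and g_2(x) = sin(x) (not an example from the slides).

import numpy as np

def F(x):
    # F(x) = f(g1(x), g2(x)) with f(g1, g2) = g1 * g2, g1 = x**2, g2 = sin(x)
    return x**2 * np.sin(x)

x = 1.3
g1, g2 = x**2, np.sin(x)
# Multivariate chain rule: dF/dx = (df/dg1) dg1/dx + (df/dg2) dg2/dx
analytic = g2 * 2 * x + g1 * np.cos(x)

h = 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)   # central finite difference
print(analytic, numeric)                    # the two values agree closely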
Backpropagation

For now: no distinction between node types (input, parameter, compute).

Forward Pass (Loss: L = 2x²):
(1) y = x²
(2) L = 2y

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = 2
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x) = 2x (∂L/∂y)

The same scheme for a general loss L(y(x)):

Forward Pass:
(1) y = y(x)
(2) L = L(y)

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = ∂L/∂y
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)
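The concrete example above, written out in plain Python (an illustrative sketch).

# Forward pass for L = 2 * x**2, decomposed as y = x**2 and L = 2 * y
x = 3.0
y = x**2
L = 2 * y

# Backward pass: start from dL/dL = 1 and apply the chain rule node by node
dL_dL = 1.0
dL_dy = dL_dL * 2.0       # (2) dL/dy = (dL/dL)(dL/dy) = 2
dL_dx = dL_dy * 2 * x     # (1) dL/dx = (dL/dy)(dy/dx) = 2x * dL/dy

print(dL_dx, 4 * x)       # both 12.0, since d(2x^2)/dx = 4x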
Backpropagation: Fan-Out > 1

Forward Pass (Loss: L( u(y(x)), v(y(x)) )):
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Backward Pass:
(3) ∂L/∂u = (∂L/∂L)(∂L/∂u) = ∂L/∂u
(3) ∂L/∂v = (∂L/∂L)(∂L/∂v) = ∂L/∂v
(2) ∂L/∂y = (∂L/∂u)(∂u/∂y) + (∂L/∂v)(∂v/∂y)
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)

This uses the multivariate chain rule: d/dy L(u(y), v(y)) = (∂L/∂u)(du/dy) + (∂L/∂v)(dv/dy).
All incoming gradients must be summed up!
Backpropagation: Fan-Out > 1

Implementation: Each variable/node is an object and has attributes x.value and x.grad. Values are computed forward and gradients backward:

Forward:
    x.value = Input
    y.value = y(x.value)
    u.value = u(y.value)
    v.value = v(y.value)
    L.value = L(u.value, v.value)

Backward:
    x.grad = y.grad = u.grad = v.grad = 0
    L.grad = 1
    u.grad += L.grad * (∂L/∂u)(u.value, v.value)
    v.grad += L.grad * (∂L/∂v)(u.value, v.value)
    y.grad += u.grad * (∂u/∂y)(y.value)
    y.grad += v.grad * (∂v/∂y)(y.value)
    x.grad += y.grad * (∂y/∂x)(x.value)
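A runnable sketch of this value/grad pattern for one concrete choice of functions, u(y) = y², v(y) = sin(y), L = u + v, y(x) = 3x (these functions, and the Node class, are illustrative assumptions).

import math

class Node:
    # Each node stores a value (set in the forward pass) and an
    # accumulated gradient (summed up in the backward pass).
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

x, y, u, v, L = Node(2.0), Node(), Node(), Node(), Node()

# Forward pass: compute values in topological order
y.value = 3 * x.value
u.value = y.value ** 2
v.value = math.sin(y.value)
L.value = u.value + v.value

# Backward pass: gradients flow in reverse; fan-out gradients are summed
L.grad = 1.0
u.grad += L.grad * 1.0                  # dL/du = 1
v.grad += L.grad * 1.0                  # dL/dv = 1
y.grad += u.grad * 2 * y.value          # du/dy = 2y
y.grad += v.grad * math.cos(y.value)    # dv/dy = cos(y)
x.grad += y.grad * 3                    # dy/dx = 3

# Analytic check: dL/dx = d/dx [(3x)^2 + sin(3x)] = 18x + 3 cos(3x)
print(x.grad, 18 * x.value + 3 * math.cos(3 * x.value))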
Backpropagation: Logistic Regression with 1D Inputs

Forward Pass (Loss: L = BCE(σ(w_0 + w_1 x), y)):
(1) u = w_0 + w_1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ) = BCE(ŷ, y)

Backward Pass:
(3) ∂L/∂ŷ = (∂L/∂L)(∂L/∂ŷ) = (ŷ − y) / (ŷ(1 − ŷ))
(2) ∂L/∂u = (∂L/∂ŷ)(∂ŷ/∂u) = (∂L/∂ŷ) σ(u)(1 − σ(u))
(1) ∂L/∂w_0 = (∂L/∂u)(∂u/∂w_0) = ∂L/∂u
(1) ∂L/∂w_1 = (∂L/∂u)(∂u/∂w_1) = (∂L/∂u) x
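The same forward and backward pass in NumPy, with a finite-difference check of ∂L/∂w_1; the parameter and data values are arbitrary.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_backward(w0, w1, x, y):
    # Forward pass
    u = w0 + w1 * x
    y_hat = sigmoid(u)
    L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    # Backward pass, following the steps above
    dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))
    dL_du = dL_dyhat * sigmoid(u) * (1 - sigmoid(u))
    dL_dw0 = dL_du
    dL_dw1 = dL_du * x
    return L, dL_dw0, dL_dw1

w0, w1, x, y = 0.5, -1.5, 2.0, 1.0
L, g0, g1 = forward_backward(w0, w1, x, y)

# Finite-difference check for dL/dw1
h = 1e-6
num = (forward_backward(w0, w1 + h, x, y)[0] - forward_backward(w0, w1 - h, x, y)[0]) / (2 * h)
print(g1, num)   # should agree closely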
Summary

- We can write mathematical expressions as a computation graph.
- Values are efficiently computed forward, gradients backward.
- Multiple incoming gradients are summed up (multivariate chain rule).
- Modularity: Each node must only "know" how to compute gradients wrt. its own arguments.
- One fw/bw pass per data point:

  ∇_w L(y, X, w) = ∑_{i=1}^N ∇_w L(y_i, x_i, w)    (each term computed by backpropagation)

[Diagram: training loop alternating between "Compute Loss" (forward pass) and "Compute Derivatives" (backward pass)]
Disclaimer: So far we discussed backpropagation
only for scalar values. In the next lecture, we will
discuss backpropagation with arrays and tensors.
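Putting the pieces together, a minimal end-to-end sketch of training 1D logistic regression with one forward/backward pass per data point; the toy data, step size, and iteration count are illustrative. (For the BCE loss, ∂L/∂u simplifies to ŷ − y, which the backward pass uses directly.)

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy 1D data set
xs = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])
ys = np.array([ 0.0,  0.0, 1.0, 1.0, 1.0])

w0, w1, eta = 0.0, 0.0, 0.5
for epoch in range(200):
    g0, g1 = 0.0, 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w0 + w1 * x)          # forward pass for one data point
        du = y_hat - y                        # backward pass: dL/du = y_hat - y
        g0 += du                              # accumulate dL/dw0
        g1 += du * x                          # accumulate dL/dw1
    w0, w1 = w0 - eta * g0, w1 - eta * g1     # gradient descent step

print(w0, w1)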
2.4 Educational Framework

Simple Training Recipe
Educational Framework

- 150 lines of Python/NumPy code that implement a deep learning framework
- Allows us to understand the inner workings of a deep learning framework in depth
- Variables are bound to objects
  - Parents: x, y
  - Values: value
  - Gradients: grad
- Nodes are implemented as classes:
  - Input
  - Parameter
  - CompNode

Credit: David McAllester, TTI Chicago
Educational Framework

Input nodes of the computation graph:

class Input:
    def __init__(self):
        pass

    def addgrad(self, delta):
        # gradients are not accumulated for plain inputs (no-op)
        pass
Educational Framework

Forward Pass:

def Forward():
    # evaluate every compute node in the CompNodes list
    for c in CompNodes:
        c.forward()
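The slides show only the forward driver at this point. For orientation, a plausible backward counterpart in the same style could look as follows; this is a sketch under the assumption that every compute node implements a backward() method and stores a grad attribute, not necessarily the framework's actual code.

def Backward(loss):
    # Mirror of Forward(): reset gradients, seed dL/dL = 1, then let every
    # compute node push its gradient to its parents in reverse order.
    for c in CompNodes:
        c.grad = 0
    loss.grad = 1
    for c in reversed(CompNodes):
        c.backward()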
Educational Framework

Execution Example:

import edf