
Deep Learning

Lecture 2 – Computation Graphs

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

2.1 Logistic Regression

2.2 Computation Graphs

2.3 Backpropagation

2.4 Educational Framework

2
2.1
Logistic Regression
Supervised Learning

Input → Model → Output

I Learning: Estimate parameters w from training data {(xi, yi)}_{i=1}^N
I Inference: Make novel predictions: y = fw(x)

4
Regression

Input Model Output

143,52 €

I Mapping: fw : RN → R

4
Classification

Input Model Output

"Beach"

I Mapping: fw : RW ×H → {“Beach”, “No Beach”}


I Classification will be the topic of today's lecture

4
Logistic Regression
Conditional Maximum Likelihood Estimator for w:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)

I We now want to perform binary classification: yi ∈ {0, 1}
I How should we choose pmodel(y | x, w) in this case?
I Answer: Bernoulli distribution

pmodel(y | x, w) = ŷ^y (1 − ŷ)^(1−y)

with ŷ predicted by a model: ŷ = fw(x)


5
Logistic Regression
We assumed a Bernoulli distribution

pmodel(y | x, w) = ŷ^y (1 − ŷ)^(1−y)

with ŷ shorthand for ŷ = fw(x).

I But how to choose fw(x)?
I Requirement: fw(x) ∈ [0, 1]
I Choose fw(x) = σ(wᵀx), where σ is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

[Figure: plot of the sigmoid function σ(x) for x ∈ [−10, 10]]
6
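The sigmoid squashes any real-valued score into (0, 1), so its output can be read as a probability. A minimal NumPy sketch (our own illustration; the clipping bound of 30 is an arbitrary choice to avoid overflow, while the lecture's framework later clips inputs to [−10, 10]):

import numpy as np

def sigmoid(x):
    # Numerically stable logistic sigmoid: 1 / (1 + exp(-x)).
    x = np.clip(x, -30.0, 30.0)  # avoid overflow in exp for large |x|
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# outputs lie in (0, 1) and approach 0 and 1 at the extremes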
Logistic Regression
Putting it together:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)
    = argmax_w ∑_{i=1}^N log [ ŷi^yi (1 − ŷi)^(1−yi) ]
    = argmin_w ∑_{i=1}^N [ −yi log ŷi − (1 − yi) log(1 − ŷi) ]

where each summand is the Binary Cross Entropy Loss L(ŷi, yi).

I In ML, we use the more general term "loss function" rather than "error function"
I Interpretation: We minimize the dissimilarity between the empirical data
  distribution pdata (defined by the training set) and the model distribution pmodel
7
Logistic Regression

Binary Cross Entropy Loss:

L(ŷi, yi) = −yi log ŷi − (1 − yi) log(1 − ŷi)
          = −log ŷi          if yi = 1
          = −log(1 − ŷi)     if yi = 0

I For yi = 1 the loss L is minimized if ŷi = 1
I For yi = 0 the loss L is minimized if ŷi = 0
I Thus, L is minimal if ŷi = yi
I Can be extended to > 2 classes

[Figure: −log(ŷi) (case yi = 1) and −log(1 − ŷi) (case yi = 0) plotted over ŷi ∈ [0, 1]]

8
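To make the loss concrete, here is a small NumPy sketch of the binary cross entropy for a batch of predictions (our own illustration; the epsilon clamp is added to avoid log(0)):

import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    # L(y_hat, y) = -y*log(y_hat) - (1-y)*log(1-y_hat), averaged over the batch.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep the arguments of log inside (0, 1)
    return np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(bce_loss(y_hat, y))  # small loss, since all predictions lean the right way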
Logistic Regression

A simple 1D example, built up in steps:

I Dataset X with positive (yi = 1) and negative (yi = 0) samples
I Logistic regressor fw(x) = σ(w0 + w1 x) fit to dataset X
I Probabilities of classifier fw(xi) for positive samples (yi = 1)
I Probabilities of classifier fw(xi) for negative samples (yi = 0)
I Putting both together
I Let's get rid of the x axis
I And finally compute the negative logarithm: −log(fw(xi))

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

9
Logistic Regression
Maximum Likelihood for Logistic Regression:

ŵML = argmin_w ∑_{i=1}^N −yi log ŷi − (1 − yi) log(1 − ŷi)
                         (Binary Cross Entropy Loss L(ŷi, yi))

with ŷ = fw(x) = σ(wᵀx) and σ(x) = 1 / (1 + e^(−x))

How do we find the minimizer ŵ?

I In contrast to linear regression, the loss L(ŷi, yi) is not quadratic in w
I We must apply iterative gradient-based optimization. The gradient is given by:

∇w L(ŷi, yi) = (ŷi − yi) xi


10
Logistic Regression

Gradient Descent:
I Pick step size η and tolerance ε
I Initialize w0
I Repeat until ‖v‖ < ε:
  I v = ∇w L(ŷ, y) = ∑_{i=1}^N ∇w L(ŷi, yi)
  I w^(t+1) = w^t − ηv

Variants:
I Line search (green)
I Conjugate gradients (red)
I L-BFGS
11
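A minimal NumPy sketch of this recipe for logistic regression (the toy 1D dataset, step size and tolerance below are our own choices for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Toy 1D dataset: positives around +2, negatives around -2; prepend a bias feature.
x = np.concatenate([rng.normal(2, 1, 50), rng.normal(-2, 1, 50)])
y = np.concatenate([np.ones(50), np.zeros(50)])
X = np.stack([np.ones_like(x), x], axis=1)     # rows are [1, x_i], so w = (w0, w1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

w = np.zeros(2)                                # initialize w^0
eta, eps = 0.01, 1e-3                          # step size and tolerance
for t in range(10000):
    y_hat = sigmoid(X @ w)                     # forward pass
    v = X.T @ (y_hat - y)                      # sum_i (y_hat_i - y_i) x_i
    if np.linalg.norm(v) < eps:
        break                                  # repeat until ||v|| < eps
    w = w - eta * v                            # w^{t+1} = w^t - eta * v
print(w)                                       # w1 > 0: larger x means class 1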
Logistic Regression
Examples with two-dimensional inputs (x1, x2) ∈ R²:

[Figure: two example 2D datasets with the predicted probability fw(x1, x2) ∈ [0, 1] shown over the (x1, x2) plane]

I Logistic regression model: fw(x1, x2) = σ(w0 + w1 x1 + w2 x2)


12
Information Theory
Maximizing the Log-Likelihood is equivalent to minimizing Cross Entropy or KL Divergence:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)                   (Log-Likelihood)
    = argmax_w E_pdata[log pmodel(y | x, w)]
    = argmin_w −E_pdata[log pmodel(y | x, w)]                     (Cross Entropy H(pdata, pmodel))
    = argmin_w E_pdata[log pdata(y | x) − log pmodel(y | x, w)]
    = argmin_w DKL(pdata ‖ pmodel)                                (KL Divergence)

[Figures: pdata and pmodel densities with large KL divergence (top) and small KL divergence (bottom)]
13
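As a quick sanity check (a toy example of our own, not from the slides): for discrete distributions, the cross entropy equals the entropy of pdata plus the KL divergence, so minimizing either over the model parameters yields the same solution:

import numpy as np

p_data = np.array([0.7, 0.2, 0.1])    # toy empirical distribution
p_model = np.array([0.5, 0.3, 0.2])   # some model distribution

cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))
kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))

# H(p_data, p_model) = H(p_data) + D_KL(p_data || p_model)
print(cross_entropy, entropy + kl)    # both print the same number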
2.2
Computation Graphs
Logistic Regression
Maximum Likelihood for Logistic Regression:

ŵML = argmin_w ∑_{i=1}^N −yi log ŷi − (1 − yi) log(1 − ŷi)
                         (Binary Cross Entropy Loss L(ŷi, yi))

with ŷ = fw(x) = σ(wᵀx) and σ(x) = 1 / (1 + e^(−x))

I Minimization of a non-linear objective requires the calculation of gradients ∇w
I Luckily, in the above case the gradient is simple: ∇w L(ŷi, yi) = (ŷi − yi) xi
I But this is not true for more complex models such as deep neural networks
I How can we efficiently compute gradients in the general case?
15
Computation Graphs
Key Idea:
I Decompose complex computations into a sequence of atomic assignments
I We call this sequence of assignments a computation graph or source code
I The forward pass takes a training point (x, y) as input and computes a loss, e.g.:

L = − log pmodel (y|x, w)

I As we will see, gradients ∇w L can be computed using a backward pass


I Both the forward pass and the backward pass are efficient due to the use of
  dynamic programming, i.e., storing and reusing intermediate results
I This decomposition and reuse of computation is key to the success of the
backpropagation algorithm, the primary workhorse of deep learning
16
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) u = w1 x
(2) ŷ = w0 + u
(3) z = ŷ − y
(4) L = z²
17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) ŷ = w0 + w1 x
(2) z = ŷ − y
(3) L = z²

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) ŷ = w0 + w1 x
(2) L = (ŷ − y)²

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Logistic Regression

(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Logistic Regression

(1) u = wᵀx
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Multi-Layer Perceptron

(1) h = σ(W1ᵀ x)
(2) ŷ = σ(w2ᵀ h)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
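These atomic assignments translate directly into code. A minimal NumPy sketch of the multi-layer perceptron forward pass above (the input dimension, hidden size, random initialization and label are our own choices for illustration):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.normal(size=3)               # input node
y = 1.0                              # input node (label)
W1 = 0.1 * rng.normal(size=(3, 4))   # parameter node: first-layer weights
w2 = 0.1 * rng.normal(size=4)        # parameter node: second-layer weights

# Compute nodes: forward pass as a sequence of atomic assignments
h = sigmoid(W1.T @ x)                                     # (1)
y_hat = sigmoid(w2.T @ h)                                 # (2)
L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)      # (3)
print(L)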
2.3
Backpropagation
Backpropagation

Goal: Find gradients of negative log likelihood

∇w ∑_{i=1}^N −log pmodel(yi | xi, w)

where each summand is denoted L(yi, xi, w), or more generally of a loss function

∇w L(y, X, w) = ∇w ∑_{i=1}^N L(yi, xi, w) = ∑_{i=1}^N ∇w L(yi, xi, w)

given a dataset X = {(xi, yi)}_{i=1}^N with N elements. In the following, we consider the
computation of gradients wrt. a single data point: ∇w L(yi, xi, w). The gradient with
respect to the entire dataset X is obtained by summing up all individual gradients.
19
Chain Rule

Chain Rule:

(d/dx) f(g(x)) = (df/dg)(dg/dx)

Multivariate Chain Rule:

(d/dx) f(g1(x), ..., gM(x)) = ∑_{i=1}^M (∂f/∂gi)(dgi/dx)

20
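A quick numerical check of the multivariate chain rule on a toy function of our own choosing, f(g1, g2) = g1 · g2 with g1(x) = x² and g2(x) = sin(x):

import numpy as np

def F(x):
    # f(g1(x), g2(x)) = x**2 * sin(x)
    return x**2 * np.sin(x)

x = 1.3
# Chain rule: dF/dx = (df/dg1)(dg1/dx) + (df/dg2)(dg2/dx)
#                   = sin(x) * 2x      + x**2 * cos(x)
analytic = np.sin(x) * 2 * x + x**2 * np.cos(x)

h = 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)   # central finite difference
print(analytic, numeric)                     # the two values agree closely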
Backpropagation
For now: no distinction between node types (input, parameter, compute)

Forward Pass:                                    Loss: L = 2x²
(1) y = x²
(2) L = 2y

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = 2
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x) = (∂L/∂y) · 2x

I Red: back-propagated gradients   I Blue: local gradients


21
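In code, this forward/backward pass is just a few lines. A minimal sketch (the finite-difference check at the end is our own addition):

# Forward pass for L = 2 * x**2
x = 3.0
y = x**2          # (1)
L = 2 * y         # (2)

# Backward pass, in reverse order
dL_dL = 1.0
dL_dy = dL_dL * 2.0        # (2): local gradient dL/dy = 2
dL_dx = dL_dy * 2.0 * x    # (1): local gradient dy/dx = 2x

# Check against a central finite difference of L(x) = 2x**2
h = 1e-6
numeric = (2 * (x + h)**2 - 2 * (x - h)**2) / (2 * h)
print(dL_dx, numeric)      # both approximately 12.0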
Backpropagation: A more abstract Example
For now: no distinction between node types (input, parameter, compute)

Forward Pass:
Loss: L(y(x))
(1) y = y(x)
(2) L = L(y)

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = ∂L/∂y
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)

I Red: back-propagated gradients I Blue: local gradients


22
Backpropagation: Fan-Out > 1
Forward Pass:                                    Loss: L( u(y(x)), v(y(x)) )
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Backward Pass:
(3) ∂L/∂u = (∂L/∂L)(∂L/∂u) = ∂L/∂u
(3) ∂L/∂v = (∂L/∂L)(∂L/∂v) = ∂L/∂v
(2) ∂L/∂y = (∂L/∂u)(∂u/∂y) + (∂L/∂v)(∂v/∂y)
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)

Multivariate chain rule: (d/dy) L(u(y), v(y)) = (∂L/∂u)(du/dy) + (∂L/∂v)(dv/dy)
All incoming gradients must be summed up!

23
Backpropagation: Fan-Out > 1
Implementation: Each variable/node is an object and has attributes x.value and
x.grad. Values are computed forward and gradients backward:

Forward pass (values):
x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)

Backward pass (gradients):
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)

24
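A runnable Python sketch of this value/grad pattern for concrete functions of our own choosing, y(x) = 3x, u(y) = y², v(y) = sin(y), L(u, v) = u + v:

import numpy as np

class Node:
    # Minimal node: stores a value (set in the forward pass) and an
    # accumulated gradient (summed up in the backward pass).
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

x, y, u, v, L = Node(2.0), Node(), Node(), Node(), Node()

# Forward pass: compute values
y.value = 3.0 * x.value
u.value = y.value**2
v.value = np.sin(y.value)
L.value = u.value + v.value

# Backward pass: the fan-out at y receives two summed contributions
L.grad = 1.0
u.grad += L.grad * 1.0                 # dL/du = 1
v.grad += L.grad * 1.0                 # dL/dv = 1
y.grad += u.grad * 2.0 * y.value       # du/dy = 2y
y.grad += v.grad * np.cos(y.value)     # dv/dy = cos(y)
x.grad += y.grad * 3.0                 # dy/dx = 3

# dL/dx should equal 3 * (2y + cos(y)) with y = 3x = 6
print(x.grad, 3 * (2 * 6.0 + np.cos(6.0)))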
Backpropagation: Logistic Regression with 1D Inputs
Forward Pass:                                    Loss: L = BCE(σ(w0 + w1 x), y)
(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ) = BCE(ŷ, y)

Backward Pass:
(3) ∂L/∂ŷ = (∂L/∂L)(∂L/∂ŷ) = (ŷ − y) / (ŷ(1 − ŷ))
(2) ∂L/∂u = (∂L/∂ŷ)(∂ŷ/∂u) = (∂L/∂ŷ) σ(u)(1 − σ(u))
(1) ∂L/∂w0 = (∂L/∂u)(∂u/∂w0) = ∂L/∂u
(1) ∂L/∂w1 = (∂L/∂u)(∂u/∂w1) = (∂L/∂u) x

25
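The same computation in NumPy, with a check against the closed-form gradient ∇w L(ŷ, y) = (ŷ − y)x derived earlier (the input, label and weights are arbitrary example values):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x, y = 1.5, 1.0        # arbitrary input and label
w0, w1 = 0.2, -0.4     # arbitrary weights

# Forward pass
u = w0 + w1 * x
y_hat = sigmoid(u)
L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Backward pass
dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))    # (3)
dL_du = dL_dyhat * sigmoid(u) * (1 - sigmoid(u))  # (2)
dL_dw0 = dL_du                                    # (1)
dL_dw1 = dL_du * x                                # (1)

# Closed form from the logistic regression slides: (y_hat - y) * [1, x]
print(dL_dw0, dL_dw1, (y_hat - y), (y_hat - y) * x)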
Summary
I We can write mathematical expressions as a computation graph
I Values are efficiently computed forward, gradients backward
I Multiple incoming gradients are summed up (multivariate chain rule)
I Modularity: Each node must only "know" how to compute gradients
  wrt. its own arguments
I One fw/bw pass per data point:

∇w L(y, X, w) = ∑_{i=1}^N ∇w L(yi, xi, w)
                (each term computed via backpropagation)

[Figure: the forward pass computes the loss, the backward pass computes the derivatives]

26
Disclaimer: So far we discussed backpropagation
only for scalar values. In the next lecture, we will
discuss backpropagation with arrays and tensors.
2.4
Educational Framework
Simple Training Recipe

Gradient Descent with Backpropagation:

I Pick step size η and tolerance ε
I Initialize w0
I Repeat until ‖v‖ < ε:
  I For i = 1..N
    I Forward Pass  ⇒ L(ŷi = fw(xi), yi)
    I Backward Pass ⇒ ∇w L(ŷi, yi)
  I Gradient v = ∑_{i=1}^N ∇w L(ŷi, yi)
  I Update w^(t+1) = w^t − ηv

Let us now implement this in Python code ..

29
Educational Framework
I 150 lines of Python-NumPy code that
implement a deep learning framework
I Allows us to understand the inner workings
of a deep learning framework in depth
I Variables are bound to objects
I Parents: x, y
I Values: value
I Gradients: grad
I Nodes are implemented as classes:
I Input
I Parameter
I CompNode

David McAllester (TTI Chicago)

30
Educational Framework

Computation Graph:
I Input nodes
I Parameter nodes
I Compute nodes

Remark: Specific compute node classes (e.g., Sigmoid) inherit from the abstract
base class CompNode.

class Input:
    def __init__(self):
        pass

    def addgrad(self, delta):
        pass

class Parameter:
    def __init__(self, value):
        self.value = DT(value)
        Parameters.append(self)

    def addgrad(self, delta):
        self.grad += np.sum(delta, axis=0)

    def UpdateParameters(self):
        self.value -= learning_rate * self.grad

class CompNode:
    def addgrad(self, delta):
        self.grad += delta

31
Educational Framework

Forward Pass, Backward Pass and Parameter Update:

def Forward():
    for c in CompNodes: c.forward()

def Backward(loss):
    for c in CompNodes + Parameters:
        c.grad = np.zeros(c.value.shape, dtype=DT)
    loss.grad = np.ones(loss.value.shape) / len(loss.value)
    for c in CompNodes[::-1]:
        c.backward()

def UpdateParameters():
    for p in Parameters: p.UpdateParameters()

Parameter Update: w^(t+1) = w^t − η ∑_{i=1}^N ∇w L(ŷi, yi)

Remark: Forward() and Backward() compute the forward/backward pass over the
entire dataset. Vectorization is more efficient than looping. Parallel computing
can be exploited on GPUs.
32
Educational Framework

Computation Node Sigmoid:

σ(x) = 1 / (1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))

class Sigmoid(CompNode):
    def __init__(self, x):
        CompNodes.append(self)
        self.x = x

    def forward(self):
        bounded = np.maximum(-10, np.minimum(10, self.x.value))
        self.value = 1 / (1 + np.exp(-bounded))

    def backward(self):
        self.x.addgrad(self.grad * self.value * (1 - self.value))

Remark: In the backward pass, the gradient is sent to the parent node self.x.

33
Educational Framework

Execution Example:
I Load data X and labels y
I Initialize parameters w0
I Define computation graph
I For all iterations do:
  I Forward Pass  ⇒ L(ŷi = fw(xi), yi)
  I Backward Pass ⇒ ∇w L(ŷi, yi)
  I Gradient Update w^(t+1) = w^t − η ∑_{i=1}^N ∇w L(ŷi, yi)

import edf

# data loading
edf.clear_compgraph()
x = edf.Input()
y = edf.Input()
x.value = Load(data)
y.value = Load(labels)

# initialization of parameters
params_1 = edf.AffineParams(nInputs, nHiddens)
params_2 = edf.AffineParams(nHiddens, nLabels)

# definition of computation graph
h = edf.Sigmoid(edf.Affine(params_1, x))
p = edf.Softmax(edf.Affine(params_2, h))
L = edf.CrossEntropyLoss(p, y)

# gradient descent
for i in range(iterations):
    edf.Forward()
    edf.Backward(L)
    edf.UpdateParameters()

34
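The Affine node and AffineParams used above belong to the edf framework but are not shown on these slides. Purely as an illustration of how such a compute node might look in the same style (the attribute names w and b on the params object are our assumption; the actual edf implementation may differ):

class Affine(CompNode):
    # Hypothetical sketch: per-sample affine map x @ w + b for a batch of inputs.
    def __init__(self, params, x):
        CompNodes.append(self)
        self.params = params   # assumed to hold Parameter objects params.w and params.b
        self.x = x

    def forward(self):
        # x.value: (batch, nInputs), w.value: (nInputs, nOutputs), b.value: (nOutputs,)
        self.value = self.x.value @ self.params.w.value + self.params.b.value

    def backward(self):
        # Parameter.addgrad sums the per-sample gradients over the batch dimension
        self.params.w.addgrad(self.x.value[:, :, None] * self.grad[:, None, :])
        self.params.b.addgrad(self.grad)
        self.x.addgrad(self.grad @ self.params.w.value.T)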
