
Deep Learning

Lecture 2 – Computation Graphs

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

2.1 Logistic Regression

2.2 Computation Graphs

2.3 Backpropagation

2.4 Educational Framework

2
2.1
Logistic Regression
Supervised Learning

Input → Model → Output

I Learning: Estimate parameters w from training data {(xi, yi)}_{i=1}^N
I Inference: Make novel predictions: y = fw(x)

4
Regression

Input Model Output

143,52 €

I Mapping: fw : RN → R

4
Classification

Input Model Output

"Beach"

I Mapping: fw : RW ×H → {“Beach”, “No Beach”}


I Classification will be the topic of today's lecture

4
Logistic Regression
Conditional Maximum Likelihood Estimator for w:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)

I We now want to perform binary classification: yi ∈ {0, 1}
I How should we choose pmodel(y | x, w) in this case?
I Answer: Bernoulli distribution

pmodel(y | x, w) = ŷ^y (1 − ŷ)^(1−y)

with ŷ predicted by a model: ŷ = fw(x)


5
Logistic Regression
We assumed a Bernoulli distribution

pmodel(y | x, w) = ŷ^y (1 − ŷ)^(1−y)

with ŷ shorthand for ŷ = fw(x).

I But how to choose fw(x)?
I Requirement: fw(x) ∈ [0, 1]
I Choose fw(x) = σ(wᵀx), where σ is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

[Figure: plot of the sigmoid function σ(x) for x ∈ [−10, 10]]
6
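The sigmoid squashes any real-valued score into (0, 1), so its output can be read as a probability. A minimal NumPy sketch (our own illustration; the clipping bound of 30 is an arbitrary choice to avoid overflow, while the lecture's framework later clips inputs to [−10, 10]):

import numpy as np

def sigmoid(x):
    # Numerically stable logistic sigmoid: 1 / (1 + exp(-x)).
    x = np.clip(x, -30.0, 30.0)  # avoid overflow in exp for large |x|
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# outputs lie in (0, 1) and approach 0 and 1 at the extremes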
Logistic Regression
Putting it together:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)
    = argmax_w ∑_{i=1}^N log [ ŷi^yi (1 − ŷi)^(1−yi) ]
    = argmin_w ∑_{i=1}^N [ −yi log ŷi − (1 − yi) log(1 − ŷi) ]

where each summand is the Binary Cross Entropy Loss L(ŷi, yi).

I In ML, we use the more general term "loss function" rather than "error function"
I Interpretation: We minimize the dissimilarity between the empirical data
  distribution pdata (defined by the training set) and the model distribution pmodel
7
Logistic Regression

Binary Cross Entropy Loss:

L(ŷi, yi) = −yi log ŷi − (1 − yi) log(1 − ŷi)
          = −log ŷi          if yi = 1
          = −log(1 − ŷi)     if yi = 0

I For yi = 1 the loss L is minimized if ŷi = 1
I For yi = 0 the loss L is minimized if ŷi = 0
I Thus, L is minimal if ŷi = yi
I Can be extended to > 2 classes

[Figure: −log(ŷi) (case yi = 1) and −log(1 − ŷi) (case yi = 0) plotted over ŷi ∈ [0, 1]]

8
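To make the loss concrete, here is a small NumPy sketch of the binary cross entropy for a batch of predictions (our own illustration; the epsilon clamp is added to avoid log(0)):

import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    # L(y_hat, y) = -y*log(y_hat) - (1-y)*log(1-y_hat), averaged over the batch.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep the arguments of log inside (0, 1)
    return np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(bce_loss(y_hat, y))  # small loss, since all predictions lean the right way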
Logistic Regression

A simple 1D example, built up in steps:

I Dataset X with positive (yi = 1) and negative (yi = 0) samples
I Logistic regressor fw(x) = σ(w0 + w1 x) fit to dataset X
I Probabilities of classifier fw(xi) for positive samples (yi = 1)
I Probabilities of classifier fw(xi) for negative samples (yi = 0)
I Putting both together
I Let's get rid of the x axis
I And finally compute the negative logarithm: −log(fw(xi))

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

9
Logistic Regression
Maximum Likelihood for Logistic Regression:

ŵML = argmin_w ∑_{i=1}^N −yi log ŷi − (1 − yi) log(1 − ŷi)
                         (Binary Cross Entropy Loss L(ŷi, yi))

with ŷ = fw(x) = σ(wᵀx) and σ(x) = 1 / (1 + e^(−x))

How do we find the minimizer ŵ?

I In contrast to linear regression, the loss L(ŷi, yi) is not quadratic in w
I We must apply iterative gradient-based optimization. The gradient is given by:

∇w L(ŷi, yi) = (ŷi − yi) xi


10
Logistic Regression

Gradient Descent:
I Pick step size η and tolerance ε
I Initialize w0
I Repeat until ‖v‖ < ε:
  I v = ∇w L(ŷ, y) = ∑_{i=1}^N ∇w L(ŷi, yi)
  I w^(t+1) = w^t − ηv

Variants:
I Line search (green)
I Conjugate gradients (red)
I L-BFGS
11
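A minimal NumPy sketch of this recipe for logistic regression (the toy 1D dataset, step size and tolerance below are our own choices for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Toy 1D dataset: positives around +2, negatives around -2; prepend a bias feature.
x = np.concatenate([rng.normal(2, 1, 50), rng.normal(-2, 1, 50)])
y = np.concatenate([np.ones(50), np.zeros(50)])
X = np.stack([np.ones_like(x), x], axis=1)     # rows are [1, x_i], so w = (w0, w1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

w = np.zeros(2)                                # initialize w^0
eta, eps = 0.01, 1e-3                          # step size and tolerance
for t in range(10000):
    y_hat = sigmoid(X @ w)                     # forward pass
    v = X.T @ (y_hat - y)                      # sum_i (y_hat_i - y_i) x_i
    if np.linalg.norm(v) < eps:
        break                                  # repeat until ||v|| < eps
    w = w - eta * v                            # w^{t+1} = w^t - eta * v
print(w)                                       # w1 > 0: larger x means class 1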
Logistic Regression
Examples with two-dimensional inputs (x1, x2) ∈ R²:

[Figure: two example 2D datasets with the predicted probability fw(x1, x2) ∈ [0, 1] shown over the (x1, x2) plane]

I Logistic regression model: fw(x1, x2) = σ(w0 + w1 x1 + w2 x2)


12
Information Theory
Maximizing the Log-Likelihood is equivalent to minimizing Cross Entropy or KL Divergence:

ŵML = argmax_w ∑_{i=1}^N log pmodel(yi | xi, w)                   (Log-Likelihood)
    = argmax_w E_pdata[log pmodel(y | x, w)]
    = argmin_w −E_pdata[log pmodel(y | x, w)]                     (Cross Entropy H(pdata, pmodel))
    = argmin_w E_pdata[log pdata(y | x) − log pmodel(y | x, w)]
    = argmin_w DKL(pdata ‖ pmodel)                                (KL Divergence)

[Figures: pdata and pmodel densities with large KL divergence (top) and small KL divergence (bottom)]
13
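As a quick sanity check (a toy example of our own, not from the slides): for discrete distributions, the cross entropy equals the entropy of pdata plus the KL divergence, so minimizing either over the model parameters yields the same solution:

import numpy as np

p_data = np.array([0.7, 0.2, 0.1])    # toy empirical distribution
p_model = np.array([0.5, 0.3, 0.2])   # some model distribution

cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))
kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))

# H(p_data, p_model) = H(p_data) + D_KL(p_data || p_model)
print(cross_entropy, entropy + kl)    # both print the same number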
2.2
Computation Graphs
Logistic Regression
Maximum Likelihood for Logistic Regression:

ŵML = argmin_w ∑_{i=1}^N −yi log ŷi − (1 − yi) log(1 − ŷi)
                         (Binary Cross Entropy Loss L(ŷi, yi))

with ŷ = fw(x) = σ(wᵀx) and σ(x) = 1 / (1 + e^(−x))

I Minimization of a non-linear objective requires the calculation of gradients ∇w
I Luckily, in the above case the gradient is simple: ∇w L(ŷi, yi) = (ŷi − yi) xi
I But this is not true for more complex models such as deep neural networks
I How can we efficiently compute gradients in the general case?
15
Computation Graphs
Key Idea:
I Decompose complex computations into a sequence of atomic assignments
I We call this sequence of assignments a computation graph or source code
I The forward pass takes a training point (x, y) as input and computes a loss, e.g.:

L = − log pmodel (y|x, w)

I As we will see, gradients ∇w L can be computed using a backward pass


I Both the forward pass and the backward pass are efficient due to the use of
  dynamic programming, i.e., storing and reusing intermediate results
I This decomposition and reuse of computation is key to the success of the
backpropagation algorithm, the primary workhorse of deep learning
16
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) u = w1 x
(2) ŷ = w0 + u
(3) z = ŷ − y
(4) L = z²
17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) ŷ = w0 + w1 x
(2) z = ŷ − y
(3) L = z²

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Linear Regression

(1) ŷ = w0 + w1 x
(2) L = (ŷ − y)²

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Logistic Regression

(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Logistic Regression

(1) u = wᵀx
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
Computation Graphs
A computation graph has three kinds of nodes:

Input nodes
Parameter nodes
Compute nodes

Example: Multi-Layer Perceptron

(1) h = σ(W1ᵀ x)
(2) ŷ = σ(w2ᵀ h)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
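These atomic assignments translate directly into code. A minimal NumPy sketch of the multi-layer perceptron forward pass above (the input dimension, hidden size, random initialization and label are our own choices for illustration):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.normal(size=3)               # input node
y = 1.0                              # input node (label)
W1 = 0.1 * rng.normal(size=(3, 4))   # parameter node: first-layer weights
w2 = 0.1 * rng.normal(size=4)        # parameter node: second-layer weights

# Compute nodes: forward pass as a sequence of atomic assignments
h = sigmoid(W1.T @ x)                                     # (1)
y_hat = sigmoid(w2.T @ h)                                 # (2)
L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)      # (3)
print(L)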
2.3
Backpropagation
Backpropagation

Goal: Find gradients of negative log likelihood

∇w ∑_{i=1}^N −log pmodel(yi | xi, w)

where each summand is denoted L(yi, xi, w), or more generally of a loss function

∇w L(y, X, w) = ∇w ∑_{i=1}^N L(yi, xi, w) = ∑_{i=1}^N ∇w L(yi, xi, w)

given a dataset X = {(xi, yi)}_{i=1}^N with N elements. In the following, we consider the
computation of gradients wrt. a single data point: ∇w L(yi, xi, w). The gradient with
respect to the entire dataset X is obtained by summing up all individual gradients.
19
Chain Rule

Chain Rule:

(d/dx) f(g(x)) = (df/dg)(dg/dx)

Multivariate Chain Rule:

(d/dx) f(g1(x), ..., gM(x)) = ∑_{i=1}^M (∂f/∂gi)(dgi/dx)

20
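A quick numerical check of the multivariate chain rule on a toy function of our own choosing, f(g1, g2) = g1 · g2 with g1(x) = x² and g2(x) = sin(x):

import numpy as np

def F(x):
    # f(g1(x), g2(x)) = x**2 * sin(x)
    return x**2 * np.sin(x)

x = 1.3
# Chain rule: dF/dx = (df/dg1)(dg1/dx) + (df/dg2)(dg2/dx)
#                   = sin(x) * 2x      + x**2 * cos(x)
analytic = np.sin(x) * 2 * x + x**2 * np.cos(x)

h = 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)   # central finite difference
print(analytic, numeric)                     # the two values agree closely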
Backpropagation
For now: no distinction between node types (input, parameter, compute)

Forward Pass:                                    Loss: L = 2x²
(1) y = x²
(2) L = 2y

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = 2
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x) = (∂L/∂y) · 2x

I Red: back-propagated gradients   I Blue: local gradients


21
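In code, this forward/backward pass is just a few lines. A minimal sketch (the finite-difference check at the end is our own addition):

# Forward pass for L = 2 * x**2
x = 3.0
y = x**2          # (1)
L = 2 * y         # (2)

# Backward pass, in reverse order
dL_dL = 1.0
dL_dy = dL_dL * 2.0        # (2): local gradient dL/dy = 2
dL_dx = dL_dy * 2.0 * x    # (1): local gradient dy/dx = 2x

# Check against a central finite difference of L(x) = 2x**2
h = 1e-6
numeric = (2 * (x + h)**2 - 2 * (x - h)**2) / (2 * h)
print(dL_dx, numeric)      # both approximately 12.0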
Backpropagation: A more abstract Example
For now: no distinction between node types (input, parameter, compute)

Forward Pass:
Loss: L(y(x))
(1) y = y(x)
(2) L = L(y)

Backward Pass:
(2) ∂L/∂y = (∂L/∂L)(∂L/∂y) = ∂L/∂y
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)

I Red: back-propagated gradients I Blue: local gradients


22
Backpropagation: Fan-Out > 1
Forward Pass:                                    Loss: L( u(y(x)), v(y(x)) )
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Backward Pass:
(3) ∂L/∂u = (∂L/∂L)(∂L/∂u) = ∂L/∂u
(3) ∂L/∂v = (∂L/∂L)(∂L/∂v) = ∂L/∂v
(2) ∂L/∂y = (∂L/∂u)(∂u/∂y) + (∂L/∂v)(∂v/∂y)
(1) ∂L/∂x = (∂L/∂y)(∂y/∂x)

Multivariate chain rule: (d/dy) L(u(y), v(y)) = (∂L/∂u)(du/dy) + (∂L/∂v)(dv/dy)
All incoming gradients must be summed up!

23
Backpropagation: Fan-Out > 1
Implementation: Each variable/node is an object and has attributes x.value and
x.grad. Values are computed forward and gradients backward:

Forward pass (values):
x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)

Backward pass (gradients):
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)

24
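A runnable Python sketch of this value/grad pattern for concrete functions of our own choosing, y(x) = 3x, u(y) = y², v(y) = sin(y), L(u, v) = u + v:

import numpy as np

class Node:
    # Minimal node: stores a value (set in the forward pass) and an
    # accumulated gradient (summed up in the backward pass).
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

x, y, u, v, L = Node(2.0), Node(), Node(), Node(), Node()

# Forward pass: compute values
y.value = 3.0 * x.value
u.value = y.value**2
v.value = np.sin(y.value)
L.value = u.value + v.value

# Backward pass: the fan-out at y receives two summed contributions
L.grad = 1.0
u.grad += L.grad * 1.0                 # dL/du = 1
v.grad += L.grad * 1.0                 # dL/dv = 1
y.grad += u.grad * 2.0 * y.value       # du/dy = 2y
y.grad += v.grad * np.cos(y.value)     # dv/dy = cos(y)
x.grad += y.grad * 3.0                 # dy/dx = 3

# dL/dx should equal 3 * (2y + cos(y)) with y = 3x = 6
print(x.grad, 3 * (2 * 6.0 + np.cos(6.0)))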
Backpropagation: Logistic Regression with 1D Inputs
Forward Pass:                                    Loss: L = BCE(σ(w0 + w1 x), y)
(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ) = BCE(ŷ, y)

Backward Pass:
(3) ∂L/∂ŷ = (∂L/∂L)(∂L/∂ŷ) = (ŷ − y) / (ŷ(1 − ŷ))
(2) ∂L/∂u = (∂L/∂ŷ)(∂ŷ/∂u) = (∂L/∂ŷ) σ(u)(1 − σ(u))
(1) ∂L/∂w0 = (∂L/∂u)(∂u/∂w0) = ∂L/∂u
(1) ∂L/∂w1 = (∂L/∂u)(∂u/∂w1) = (∂L/∂u) x

25
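The same computation in NumPy, with a check against the closed-form gradient ∇w L(ŷ, y) = (ŷ − y)x derived earlier (the input, label and weights are arbitrary example values):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x, y = 1.5, 1.0        # arbitrary input and label
w0, w1 = 0.2, -0.4     # arbitrary weights

# Forward pass
u = w0 + w1 * x
y_hat = sigmoid(u)
L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Backward pass
dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))    # (3)
dL_du = dL_dyhat * sigmoid(u) * (1 - sigmoid(u))  # (2)
dL_dw0 = dL_du                                    # (1)
dL_dw1 = dL_du * x                                # (1)

# Closed form from the logistic regression slides: (y_hat - y) * [1, x]
print(dL_dw0, dL_dw1, (y_hat - y), (y_hat - y) * x)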
Summary
I We can write mathematical expressions as a computation graph
I Values are efficiently computed forward, gradients backward
I Multiple incoming gradients are summed up (multivariate chain rule)
I Modularity: Each node must only "know" how to compute gradients
  wrt. its own arguments
I One fw/bw pass per data point:

∇w L(y, X, w) = ∑_{i=1}^N ∇w L(yi, xi, w)
                (each term computed via backpropagation)

[Figure: the forward pass computes the loss, the backward pass computes the derivatives]

26
Disclaimer: So far we discussed backpropagation
only for scalar values. In the next lecture, we will
discuss backpropagation with arrays and tensors.
2.4
Educational Framework
Simple Training Recipe

Gradient Descent with Backpropagation:

I Pick step size η and tolerance ε
I Initialize w0
I Repeat until ‖v‖ < ε:
  I For i = 1..N
    I Forward Pass  ⇒ L(ŷi = fw(xi), yi)
    I Backward Pass ⇒ ∇w L(ŷi, yi)
  I Gradient v = ∑_{i=1}^N ∇w L(ŷi, yi)
  I Update w^(t+1) = w^t − ηv

Let us now implement this in Python code ..

29
Educational Framework
I 150 lines of Python-NumPy code that
implement a deep learning framework
I Allows us to understand the inner workings
of a deep learning framework in depth
I Variables are bound to objects
I Parents: x, y
I Values: value
I Gradients: grad
I Nodes are implemented as classes:
I Input
I Parameter
I CompNode

David McAllester (TTI Chicago)

30
Educational Framework

Computation Graph:
I Input nodes
I Parameter nodes
I Compute nodes

Remark: Specific compute node classes (e.g., Sigmoid) inherit from the abstract
base class CompNode.

class Input:
    def __init__(self):
        pass

    def addgrad(self, delta):
        pass

class Parameter:
    def __init__(self, value):
        self.value = DT(value)
        Parameters.append(self)

    def addgrad(self, delta):
        self.grad += np.sum(delta, axis=0)

    def UpdateParameters(self):
        self.value -= learning_rate * self.grad

class CompNode:
    def addgrad(self, delta):
        self.grad += delta

31
Educational Framework

Forward Pass, Backward Pass and Parameter Update:

def Forward():
    for c in CompNodes: c.forward()

def Backward(loss):
    for c in CompNodes + Parameters:
        c.grad = np.zeros(c.value.shape, dtype=DT)
    loss.grad = np.ones(loss.value.shape) / len(loss.value)
    for c in CompNodes[::-1]:
        c.backward()

def UpdateParameters():
    for p in Parameters: p.UpdateParameters()

Parameter Update: w^(t+1) = w^t − η ∑_{i=1}^N ∇w L(ŷi, yi)

Remark: Forward() and Backward() compute the forward/backward pass over the
entire dataset. Vectorization is more efficient than looping. Parallel computing
can be exploited on GPUs.
32
Educational Framework

Computation Node Sigmoid:

σ(x) = 1 / (1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))

class Sigmoid(CompNode):
    def __init__(self, x):
        CompNodes.append(self)
        self.x = x

    def forward(self):
        bounded = np.maximum(-10, np.minimum(10, self.x.value))
        self.value = 1 / (1 + np.exp(-bounded))

    def backward(self):
        self.x.addgrad(self.grad * self.value * (1 - self.value))

Remark: In the backward pass, the gradient is sent to the parent node self.x.

33
Educational Framework

Execution Example:
I Load data X and labels y
I Initialize parameters w0
I Define computation graph
I For all iterations do:
  I Forward Pass  ⇒ L(ŷi = fw(xi), yi)
  I Backward Pass ⇒ ∇w L(ŷi, yi)
  I Gradient Update w^(t+1) = w^t − η ∑_{i=1}^N ∇w L(ŷi, yi)

import edf

# data loading
edf.clear_compgraph()
x = edf.Input()
y = edf.Input()
x.value = Load(data)
y.value = Load(labels)

# initialization of parameters
params_1 = edf.AffineParams(nInputs, nHiddens)
params_2 = edf.AffineParams(nHiddens, nLabels)

# definition of computation graph
h = edf.Sigmoid(edf.Affine(params_1, x))
p = edf.Softmax(edf.Affine(params_2, h))
L = edf.CrossEntropyLoss(p, y)

# gradient descent
for i in range(iterations):
    edf.Forward()
    edf.Backward(L)
    edf.UpdateParameters()

34
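The Affine node and AffineParams used above belong to the edf framework but are not shown on these slides. Purely as an illustration of how such a compute node might look in the same style (the attribute names w and b on the params object are our assumption; the actual edf implementation may differ):

class Affine(CompNode):
    # Hypothetical sketch: per-sample affine map x @ w + b for a batch of inputs.
    def __init__(self, params, x):
        CompNodes.append(self)
        self.params = params   # assumed to hold Parameter objects params.w and params.b
        self.x = x

    def forward(self):
        # x.value: (batch, nInputs), w.value: (nInputs, nOutputs), b.value: (nOutputs,)
        self.value = self.x.value @ self.params.w.value + self.params.b.value

    def backward(self):
        # Parameter.addgrad sums the per-sample gradients over the batch dimension
        self.params.w.addgrad(self.x.value[:, :, None] * self.grad[:, None, :])
        self.params.b.addgrad(self.grad)
        self.x.addgrad(self.grad @ self.params.w.value.T)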
