
Deep Learning

Lecture 3 – Deep Neural Networks

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Agenda

3.1 Backpropagation with Tensors

3.2 The XOR Problem

3.3 Multi-Layer Perceptrons

3.4 Universal Approximation

3.1
Backpropagation with Tensors
Recap: Backpropagation with Scalars

Forward Pass:
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Loss: L( u(y(x)), v(y(x)) )

Backward Pass:
(3) ∂L/∂u = (∂L/∂L) · (∂L/∂u) = ∂L/∂u
(3) ∂L/∂v = (∂L/∂L) · (∂L/∂v) = ∂L/∂v
(2) ∂L/∂y = (∂L/∂u) · (∂u/∂y) + (∂L/∂v) · (∂v/∂y)
(1) ∂L/∂x = (∂L/∂y) · (∂y/∂x)
Implementation: Each variable/node is an object with attributes x.value and x.grad. Values are computed forward, gradients backward:

Forward:
x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)

Backward:
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)
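To make the recipe concrete, here is a minimal sketch for one particular choice of functions (the choice of y, u, v, L below is only for illustration and not from the lecture):

import numpy as np

# Toy graph: y = x^2, u = sin(y), v = cos(y), L = u * v
x_value = 0.7
y_value = x_value ** 2
u_value = np.sin(y_value)
v_value = np.cos(y_value)
L_value = u_value * v_value

# Backward pass: initialize gradients, then accumulate along each edge
x_grad = y_grad = u_grad = v_grad = 0.0
L_grad = 1.0
u_grad += L_grad * v_value              # dL/du = v
v_grad += L_grad * u_value              # dL/dv = u
y_grad += u_grad * np.cos(y_value)      # du/dy = cos(y)
y_grad += v_grad * (-np.sin(y_value))   # dv/dy = -sin(y)
x_grad += y_grad * 2 * x_value          # dy/dx = 2x

# Check against the closed form: L = 0.5 sin(2 x^2), so dL/dx = 2x cos(2 x^2)
assert np.isclose(x_grad, 2 * x_value * np.cos(2 * x_value ** 2))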
Scalar vs. Matrix Operations

So far we have considered computations on scalars:

y = σ(w1 x + w0 )

We now consider computations on vectors and matrices:

y = σ(Ax + b)

- Matrix A and vector b are objects with attributes value and grad
- A.grad stores ∇A L and b.grad stores ∇b L
- A.grad has the same shape/dimensions as A.value (since L is a scalar)
Backpropagation on Loops

The matrix/vector computation

y = σ(Ax + b),   with u = Ax

can be written as loops over scalar operations:

for i      u.value[i] = 0
for i,j    u.value[i] += A.value[i, j] ∗ x.value[j]
for i      y.value[i] = σ(u.value[i] + b.value[i])
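Written out as runnable code, the forward loops look as follows; a minimal NumPy sketch (shapes and random values are only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def affine_sigmoid_forward(A, b, x):
    """Compute y = sigma(Ax + b) with explicit scalar loops, mirroring the slides."""
    M, D = A.shape
    u = np.zeros(M)                      # for i: u[i] = 0
    for i in range(M):
        for j in range(D):               # for i,j: u[i] += A[i,j] * x[j]
            u[i] += A[i, j] * x[j]
    y = np.empty(M)
    for i in range(M):                   # for i: y[i] = sigma(u[i] + b[i])
        y[i] = sigmoid(u[i] + b[i])
    return u, y

# The loops agree with the vectorized form:
rng = np.random.default_rng(0)
A, b, x = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
u, y = affine_sigmoid_forward(A, b, x)
assert np.allclose(y, sigmoid(A @ x + b))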
Backpropagation on Loops

The backpropagated gradients for

for i      y.value[i] = σ(u.value[i] + b.value[i])

are:

for i      u.grad[i] += y.grad[i] ∗ σ′(u.value[i] + b.value[i])
for i      b.grad[i] += y.grad[i] ∗ σ′(u.value[i] + b.value[i])

- Red: back-propagated gradients   - Blue: local gradients
Backpropagation on Loops

The backpropagated gradients for

for i,j    u.value[i] += A.value[i, j] ∗ x.value[j]

are:

for i,j    A.grad[i, j] += u.grad[i] ∗ x.value[j]
for i,j    x.grad[j] += u.grad[i] ∗ A.value[i, j]

- Red: back-propagated gradients   - Blue: local gradients
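These backward rules can be checked numerically; a small sketch comparing them against a finite-difference approximation (the helper functions and the treatment of the upstream gradient y.grad are illustrative assumptions, not lecture code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(A, b, x):
    return sigmoid(A @ x + b)

def backward(A, b, x, y_grad):
    """Backward loops from the slides, written with NumPy broadcasting."""
    u = A @ x
    s = sigmoid(u + b)
    u_grad = y_grad * s * (1.0 - s)      # y.grad * sigma'(u + b)
    b_grad = u_grad
    A_grad = np.outer(u_grad, x)         # A.grad[i,j] += u.grad[i] * x[j]
    x_grad = A.T @ u_grad                # x.grad[j]  += u.grad[i] * A[i,j]
    return A_grad, b_grad, x_grad

rng = np.random.default_rng(1)
A, b, x = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
y_grad = rng.normal(size=3)              # pretend upstream gradient dL/dy
A_grad, b_grad, x_grad = backward(A, b, x, y_grad)

# Finite-difference check of one entry of A.grad
eps = 1e-6
E = np.zeros_like(A); E[0, 1] = eps
num = y_grad @ (forward(A + E, b, x) - forward(A - E, b, x)) / (2 * eps)
assert np.isclose(num, A_grad[0, 1], atol=1e-5)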
Backpropagation on Loops
In practice, all deep learning operations can be written using loops over scalar assignments. Example for a higher-order tensor:

for h,i,j,k    U.value[h, i, j] += A.value[h, i, k] ∗ B.value[h, j, k]
for h,i,j      Y.value[h, i, j] = σ(U.value[h, i, j])

Backpropagation loops:

for h,i,j      U.grad[h, i, j] += Y.grad[h, i, j] ∗ σ′(U.value[h, i, j])
for h,i,j,k    A.grad[h, i, k] += U.grad[h, i, j] ∗ B.value[h, j, k]
for h,i,j,k    B.grad[h, j, k] += U.grad[h, i, j] ∗ A.value[h, i, k]
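The higher-order forward loop corresponds to a batched matrix product; a brief NumPy sketch confirming the equivalence (shapes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
H, I, J, K = 2, 3, 4, 5
A = rng.normal(size=(H, I, K))
B = rng.normal(size=(H, J, K))

# Loop version from the slides
U = np.zeros((H, I, J))
for h in range(H):
    for i in range(I):
        for j in range(J):
            for k in range(K):
                U[h, i, j] += A[h, i, k] * B[h, j, k]
Y = sigmoid(U)

# Equivalent vectorized form: U[h,i,j] = sum_k A[h,i,k] * B[h,j,k]
U_fast = np.einsum('hik,hjk->hij', A, B)      # or A @ B.transpose(0, 2, 1)
assert np.allclose(U, U_fast)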
Minibatching

Source code has two components:

- Slow part: sequential operations (Python)
- Fast part: vector/matrix operations (NumPy, BLAS, CUDA)

Goal:
- The fast part should dominate computation (wall-clock time)
- Reduce the number of slow sequential operations (e.g., Python loops) by running the fast vector/matrix operations on several data points jointly
- This is called minibatching and is used in stochastic gradient descent
Minibatching
Affine + Sigmoid (applied to N data points simultaneously):

Y = σ(XA + B),   with U = XA

- Each row of X ∈ R^{N×D} is a data point; the bias b ∈ R^M is broadcast to B ∈ R^{N×M}

The loops now include a batch index b:

for b,i      U.value[b, i] = 0
for b,i,j    U.value[b, i] += X.value[b, j] ∗ A.value[j, i]
for b,i      Y.value[b, i] = σ(U.value[b, i] + B.value[i])

- Only inputs and outputs depend on the batch index b, not the parameters (e.g., A, B)
- By convention, the gradients are averaged over the batch
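Vectorized over the batch, the same computation becomes a single matrix product; a minimal sketch (shapes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N, D, M = 8, 4, 3                       # batch size, input dim, output dim
X = rng.normal(size=(N, D))             # one data point per row
A = rng.normal(size=(D, M))
b = rng.normal(size=M)

Y = sigmoid(X @ A + b)                  # bias b is broadcast across the N rows

# Same result as explicit loops over the batch index
Y_loop = np.empty((N, M))
for n in range(N):
    Y_loop[n] = sigmoid(A.T @ X[n] + b)
assert np.allclose(Y, Y_loop)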
Implementation
Affine Transformation (applied to N data points simultaneously):

Y = XA + B

- Each row of X ∈ R^{N×D} is a data point; the bias b ∈ R^M is broadcast to B ∈ R^{N×M}

Implementation in EDF:

def forward(self):
    # Y = X A + b : (N, D) @ (D, M) + (M,) -> (N, M), bias broadcast over rows
    self.value = np.matmul(self.x.value, self.w.A.value) + self.w.b.value

def backward(self):
    # dL/dX = dL/dY A^T : (N, M) @ (M, D) -> (N, D)
    self.x.addgrad(np.matmul(self.grad, self.w.A.value.transpose()))
    # dL/dB: pass the upstream gradient dL/dY; addgrad accumulates it
    # (and handles any reduction over the batch dimension)
    self.w.b.addgrad(self.grad)
    # dL/dA: per-sample outer products x_n (dL/dy_n), shape (N, D, M)
    self.w.A.addgrad(self.x.value[:, :, np.newaxis] * self.grad[:, np.newaxis, :])

- Computation graphs are easy to understand using the loop notation
- An efficient implementation using NumPy primitives is not always obvious
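Regarding the last line of backward: summing those per-sample outer products over the batch yields the familiar matrix form of the weight gradient; a quick check (variable names are mine, not from EDF):

import numpy as np

rng = np.random.default_rng(4)
N, D, M = 8, 4, 3
X = rng.normal(size=(N, D))             # inputs, one row per data point
G = rng.normal(size=(N, M))             # upstream gradient dL/dY

per_sample = X[:, :, np.newaxis] * G[:, np.newaxis, :]    # shape (N, D, M)
assert np.allclose(per_sample.sum(axis=0), X.T @ G)       # dL/dA = X^T dL/dY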
3.2
The XOR Problem
The XOR Problem

Logistic Regression Model:

ŷ = σ(w⊤x)   with   σ(x) = 1 / (1 + e⁻ˣ)

- Which problems can we solve with such a simple linear classifier?
The XOR Problem

Example: 2D Logistic Regression

ŷ = σ(w⊤x + w₀)   with   σ(x) = 1 / (1 + e⁻ˣ)

- Let x ∈ R²
- Decision boundary: w⊤x + w₀ = 0
- Decide for class 1 ⇔ w⊤x > −w₀
- Decide for class 0 ⇔ w⊤x < −w₀

[Figure: sigmoid σ(x); the decision boundary corresponds to σ = 0.5, with class 0 on one side and class 1 on the other]
The XOR Problem

Linear Classifier:

Class 1 ⇔ w⊤x > −w₀,   here: (1  1) · (x₁, x₂)⊤ > 0.5   (i.e., w⊤ = (1, 1), −w₀ = 0.5)

  x₁  x₂  OR(x₁,x₂)
   0   0      0
   0   1      1
   1   0      1
   1   1      1

[Figure: the four inputs in the unit square; (0,0) is class 0, the other three corners are class 1, separated by a single line]
The XOR Problem

Linear Classifier:

Class 1 ⇔ w⊤x > −w₀,   here: (1  1) · (x₁, x₂)⊤ > 1.5   (i.e., w⊤ = (1, 1), −w₀ = 1.5)

  x₁  x₂  AND(x₁,x₂)
   0   0      0
   0   1      0
   1   0      0
   1   1      1

[Figure: the four inputs in the unit square; only (1,1) is class 1, separated by a single line]
The XOR Problem

Linear Classifier:

Class 1 ⇔ w⊤x > −w₀,   here: (−1  −1) · (x₁, x₂)⊤ > −1.5   (i.e., w⊤ = (−1, −1), −w₀ = −1.5)

  x₁  x₂  NAND(x₁,x₂)
   0   0      1
   0   1      1
   1   0      1
   1   1      0

[Figure: the four inputs in the unit square; only (1,1) is class 0, separated by a single line]
The XOR Problem

Linear Classifier:

Class 1 ⇔ w⊤x > −w₀,   here: (?  ?) · (x₁, x₂)⊤ > ?

  x₁  x₂  XOR(x₁,x₂)
   0   0      0
   0   1      1
   1   0      1
   1   1      0

[Figure: the four inputs in the unit square; (0,1) and (1,0) are class 1, (0,0) and (1,1) are class 0 — which single line could separate them?]
The XOR Problem

[Figure: the four XOR inputs in the unit square; (0,1) and (1,0) belong to class 1, (0,0) and (1,1) to class 0]

- Visually it is obvious that XOR is not linearly separable
- How can we formally prove this?
Convex Sets

- A set S is convex if any line segment connecting two points in S lies entirely within S:

  x₁, x₂ ∈ S  ⇒  λx₁ + (1 − λ)x₂ ∈ S   for λ ∈ [0, 1]
The XOR Problem

- Half-spaces (e.g., decision regions) are convex sets
- Suppose there were a feasible hypothesis. If the positive examples lie in the positive half-space, then the green line segment connecting them must lie there as well.
- Similarly, the red line segment connecting the negative examples must lie within the negative half-space.
- But the intersection point of the two segments cannot lie in both half-spaces. Contradiction! (An algebraic version of the same argument is given below.)

[Figure: the XOR points with the segment between the two class-1 points and the segment between the two class-0 points crossing at (0.5, 0.5)]
The XOR Problem

Some Historical Remarks:

- Linear classification showed some promising results in the 50s and 60s on simple image classification problems (Perceptron)
- However, its limitations became clear very soon (e.g., Minsky and Papert's book “Perceptrons”, 1969)
- The XOR problem is simple, yet it cannot be solved because the model capacity is limited to linear decision boundaries
- This led to a decline in neural network research in the 70s
- How can we solve non-linear problems?
The XOR Problem
Linear classifier with non-linear features ψ:

  w⊤ (x₁  x₂  x₁x₂)⊤ > −w₀,   with ψ(x) = (x₁, x₂, x₁x₂)⊤

  x₁  x₂  ψ₁(x)  ψ₂(x)  ψ₃(x)  XOR
   0   0    0      0      0     0
   0   1    0      1      0     1
   1   0    1      0      0     1
   1   1    1      1      1     0

- Non-linear features allow a linear classifier to solve non-linear classification problems!
- Analogous to polynomial curve fitting

[Figure: with the additional feature x₁x₂, the XOR classes become linearly separable]
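One concrete choice of weights that works (picked by hand for this sketch, not taken from the lecture): with ψ(x) = (x₁, x₂, x₁x₂)⊤, the weights w = (1, 1, −2)⊤ and threshold −w₀ = 0.5 classify XOR correctly:

import numpy as np

def psi(x1, x2):
    """Non-linear feature map psi(x) = (x1, x2, x1*x2)."""
    return np.array([x1, x2, x1 * x2])

w = np.array([1.0, 1.0, -2.0])     # hand-picked weights (illustrative)
threshold = 0.5                     # i.e. -w0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    pred = int(w @ psi(x1, x2) > threshold)
    assert pred == (x1 ^ x2)        # matches XOR on all four inputs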
Representation Matters
Cartesian Coordinates vs. Polar Coordinates

[Figure: the same data shown in Cartesian coordinates (x, y) and in polar coordinates (r, θ); the change of representation makes the classes easy to separate]

- But how do we choose the transformation? This can be very hard in practice.
- Yet, hand-designed features were the dominant approach until the 2000s (vision, speech, ...)
- In this class we want to learn the representation ⇒ Representation learning
- The human then needs to choose the right function family rather than the correct function
The XOR Problem

Linear Classifier:   Class 1 ⇔ w⊤x > −w₀

XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂))

  x₁  x₂  XOR(x₁,x₂)
   0   0      0
   0   1      1
   1   0      1
   1   1      0

[Figure: the OR and NAND decision boundaries together carve out the XOR (class 1) region between them]
The XOR Problem

XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂))

The above expression can be rewritten as a program of logistic regressors:

  h₁ = σ(w_OR⊤ x + w_OR,0)
  h₂ = σ(w_NAND⊤ x + w_NAND,0)
  ŷ = σ(w_AND⊤ h + w_AND,0)

Note that h(x) is a non-linear feature of x. We call h(x) a hidden layer.
The XOR Problem

XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂))

Writing the two 1D mappings h₁(x) and h₂(x) as a single 2D mapping h(x) yields:

  h = σ(W x + w),   where the rows of W are w_OR⊤ and w_NAND⊤, and w = (w_OR,0, w_NAND,0)⊤
  ŷ = σ(w_AND⊤ h + w_AND,0)

The parameters can be learned using backprop. This is our first Multi-Layer Perceptron!
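A minimal numerical sketch of this two-layer construction with hand-picked (not learned) OR/NAND/AND weights, scaled so the sigmoids are nearly saturated; the particular numbers are illustrative choices, not from the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s = 10.0                                    # scale factor: sharper sigmoids
W = s * np.array([[ 1.0,  1.0],             # row 1: OR   weights
                  [-1.0, -1.0]])            # row 2: NAND weights
w = s * np.array([-0.5, 1.5])               # biases: OR fires if x1+x2 > 0.5, NAND if x1+x2 < 1.5
w_and = s * np.array([1.0, 1.0])            # AND over the two hidden units
w_and0 = s * (-1.5)

def xor_mlp(x):
    h = sigmoid(W @ x + w)                  # hidden layer: (OR(x), NAND(x))
    return sigmoid(w_and @ h + w_and0)      # output layer: AND(h1, h2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_hat = xor_mlp(np.array([x1, x2], dtype=float))
    assert round(y_hat) == (x1 ^ x2)

In practice these weights would of course be learned with backprop rather than set by hand.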
3.3
Multi-Layer Perceptrons
Multi-Layer Perceptrons
- MLPs are feedforward neural networks (no feedback connections)
- They compose several non-linear functions f(x) = ŷ(h₃(h₂(h₁(x)))), where the hᵢ(·) are called hidden layers and ŷ(·) is the output layer
- The data specifies only the behavior of the output layer (thus the name “hidden”)
- Each layer i comprises multiple neurons j, implemented as affine transformations (a⊤x + b) followed by non-linear activation functions g:

  h_{ij} = g(a_{ij}⊤ h_{i−1} + b_{ij})

- Each neuron in each layer is fully connected to all neurons of the previous layer
- The overall length of the chain is the depth of the model ⇒ “Deep Learning”
- The name MLP is misleading, as we don't use threshold units as in Perceptrons
MLP Network Architecture
Network Depth = #Computation Layers = 4 (in the example below)
Layer Width = #Neurons in a Layer

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output Layer]

- Neurons are grouped into layers; each neuron is fully connected to all neurons of the previous layer
- Hidden layer hᵢ = g(Aᵢ hᵢ₋₁ + bᵢ) with activation function g(·) and weights Aᵢ, bᵢ (see the sketch below)
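A generic forward pass through such a stack of layers, as a minimal sketch (layer sizes and the sigmoid activation are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params, g=sigmoid):
    """Apply h_i = g(A_i h_{i-1} + b_i) for each layer; params is a list of (A_i, b_i)."""
    h = x
    for A, b in params:
        h = g(A @ h + b)
    return h

# Example: a 2 -> 3 -> 3 -> 1 network with random weights
rng = np.random.default_rng(5)
sizes = [2, 3, 3, 1]
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y_hat = mlp_forward(np.array([0.5, -1.0]), params)
print(y_hat.shape)    # (1,)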
Feature Learning Perspective
Linear Regressor / Classifier

[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output Layer; the hidden layers transform the data so that class 0 and class 1, which are not linearly separable in input space, become linearly separable for the output layer, which acts as a simple linear regressor/classifier]
Activation Functions g(·)

[Figure: plots of common activation functions]
Neural Motivation

- Neurons in the brain are structured in layers
- They receive input from many other units and compute their own activation
- The sigmoid activation function is guided by neuroscientific observations
- However, the architecture and training of modern networks differ radically from the brain
- Our main goal is not to model the brain, but to achieve statistical generalization
Training
Algorithm for training an MLP using (stochastic) gradient descent:
1. Initialize weights w, pick learning rate η and minibatch size |X_batch|
2. Draw a (random) minibatch X_batch ⊆ X
3. For all elements (x, y) ∈ X_batch of the minibatch (in parallel) do:
   3.1 Forward propagate x through the network to calculate h₁, h₂, ..., ŷ
   3.2 Backpropagate gradients through the network to obtain ∇_w L(ŷ, y)
4. Update weights:  w_{t+1} = w_t − η · (1/|X_batch|) Σ_{(x,y)∈X_batch} ∇_w L(ŷ, y)
5. If the validation error decreases, go to step 2, otherwise stop

Remarks:
- Large datasets typically do not fit into GPU memory ⇒ |X_batch| < |X|
- Our examples on the next slides are small ⇒ |X_batch| = |X|
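A compact sketch of this training loop for a one-hidden-layer network with a squared-error loss; the toy dataset, architecture, loss and hyperparameters are illustrative choices, not the lecture's:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy dataset: XOR
y = np.array([0.0, 1.0, 1.0, 0.0])

# 1. Initialize weights, pick learning rate eta and minibatch size (here: the full dataset)
A1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
a2, b2 = rng.normal(size=3), 0.0
eta, batch_size = 1.0, 4

for step in range(5000):
    # 2. Draw a (random) minibatch
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # 3.1 Forward propagate (all minibatch elements in parallel)
    H = sigmoid(Xb @ A1.T + b1)                   # hidden layer, shape (batch, 3)
    y_hat = sigmoid(H @ a2 + b2)                  # output, shape (batch,)
    # 3.2 Backpropagate gradients of L = mean (y_hat - y)^2, averaged over the batch
    d_out = 2.0 * (y_hat - yb) * y_hat * (1.0 - y_hat) / batch_size
    g_a2, g_b2 = H.T @ d_out, d_out.sum()
    d_hid = np.outer(d_out, a2) * H * (1.0 - H)
    g_A1, g_b1 = d_hid.T @ Xb, d_hid.sum(axis=0)
    # 4. Update the weights
    A1 -= eta * g_A1
    b1 -= eta * g_b1
    a2 -= eta * g_a2
    b2 -= eta * g_b2

# Ideally prints [0. 1. 1. 0.]; success depends on the random initialization
print(np.round(sigmoid(sigmoid(X @ A1.T + b1) @ a2 + b2)))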
Levels of Abstraction

- When designing neural networks and machine learning algorithms, you will need to simultaneously think at multiple levels of abstraction
- “The psychological profiling [of a programmer] is mostly the ability to shift levels of abstraction, from low level to high level. To see something in the small and to see something in the large.” [Donald E. Knuth]
The XOR Problem

- Note that we have learned a Boolean circuit! ⇒ differentiable programming

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
A More Challenging Problem

[Figure: decision boundaries learned by networks with 2, 5, and 15 hidden neurons on a harder 2D dataset]

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Expressiveness
The following two-layer MLP

  h = g(A₁x + b₁)
  y = g(A₂h + b₂)

can be written as

  y = g(A₂ g(A₁x + b₁) + b₂)

What if we used a linear activation function g(x) = x?

  y = A₂(A₁x + b₁) + b₂ = A₂A₁x + A₂b₁ + b₂ = Ax + b

- With linear activations, a multi-layer network can only express linear functions
- What is the model capacity of MLPs with non-linear activation functions?
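A quick numerical confirmation of this collapse (shapes and values are illustrative):

import numpy as np

rng = np.random.default_rng(7)
A1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
A2, b2 = rng.normal(size=(4, 3)), rng.normal(size=4)
x = rng.normal(size=2)

two_layers = A2 @ (A1 @ x + b1) + b2           # y = A2 (A1 x + b1) + b2
A, b = A2 @ A1, A2 @ b1 + b2                   # collapsed single affine map
assert np.allclose(two_layers, A @ x + b)      # identical: no extra capacity gained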
3.4
Universal Approximation
Universal Approximation Theorem
Theorem 1
Let σ be any continuous discriminatory function. Then finite sums of the form

  G(x) = Σ_{j=1}^{N} αⱼ σ(aⱼ⊤ x + bⱼ)

are dense in the space of continuous functions C(Iₙ) on the n-dimensional unit cube Iₙ. In other words, given any f ∈ C(Iₙ) and ε > 0, there is a sum G(x) for which

  |G(x) − f(x)| < ε   for all x ∈ Iₙ

Remark: This has been proven for various activation functions (e.g., sigmoid, ReLU).

Cybenko: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 1989.
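To illustrate such a finite sum in 1D, one can fix random aⱼ, bⱼ and fit only the outer coefficients αⱼ by least squares; this fitting procedure is an illustration, not part of the theorem or the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
N = 50                                          # number of hidden sigmoid units
a = rng.normal(scale=10.0, size=N)              # random inner weights a_j
b = rng.uniform(-10.0, 10.0, size=N)            # random inner biases b_j

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)                       # target continuous function on [0, 1]

Phi = sigmoid(np.outer(x, a) + b)               # Phi[m, j] = sigma(a_j * x_m + b_j)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # fit the outer coefficients alpha_j
G = Phi @ alpha                                 # G(x) = sum_j alpha_j sigma(a_j x + b_j)

print(np.max(np.abs(G - f)))                    # approximation error shrinks as N grows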
Example: Binary Case
  x₁  x₂  x₃   y
   ⋮   ⋮   ⋮   ⋮
   0   1   0   0
   0   1   1   1
   1   0   0   0
   ⋮   ⋮   ⋮   ⋮

  ŷ = Σᵢ [aᵢ⊤ x + bᵢ > 0],   where each summand hᵢ = [aᵢ⊤ x + bᵢ > 0] is a linear threshold unit

- Each hidden linear threshold unit hᵢ recognizes one possible input vector
- We need 2^D hidden units to recognize all 2^D possible inputs in the binary case (see the sketch below)
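A sketch of this construction for D = 3: one linear threshold unit per binary pattern, whose output weight is the stored label of that pattern. The specific weight/bias choice aᵢ = 2pᵢ − 1, bᵢ = 0.5 − |pᵢ| is one way (chosen here, not in the lecture) to make unit i fire only on input pᵢ:

import itertools
import numpy as np

D = 3
patterns = np.array(list(itertools.product([0, 1], repeat=D)))    # all 2^D binary inputs
targets = np.array([int(p.sum() % 2) for p in patterns])          # an arbitrary Boolean function (here: parity)

# One hidden linear threshold unit per pattern: unit i fires only on input p_i
A = 2 * patterns - 1                    # a_i = 2 p_i - 1 (entries in {-1, +1})
b = 0.5 - patterns.sum(axis=1)          # b_i = 0.5 - |p_i|

def predict(x):
    h = (A @ x + b > 0).astype(int)     # h_i = [a_i^T x + b_i > 0]; exactly one unit fires
    return int(targets @ h)             # output the stored label of the matching pattern

assert all(predict(x) == t for x, t in zip(patterns, targets))    # memorizes all 2^D inputs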
Soft Thresholds
[Figure: sigmoids σ(x), σ(2x), σ(5x), σ(50x); with increasing input weight the sigmoid approaches a hard step function]

- Learning linear threshold units is hard, as their gradient is 0 almost everywhere
- Solution: replace the hard threshold with a soft threshold (e.g., a sigmoid)
- Sigmoids approximate step functions when the input weight is increased
Network Width vs. Depth
- Universality of 2-layer networks is appealing but requires exponential width
- This leads to an exponential increase in memory and computation time
- Moreover, it does not lead to generalization ⇒ the network simply memorizes its inputs
- Deep networks can represent functions more compactly (with fewer parameters)
- Inductive bias: complex functions are modeled as compositions of simple functions
- This leads to more compact models and better generalization performance
- Example: the parity function

  f(x₁, ..., x_D) = 1 if Σᵢ xᵢ is odd, 0 otherwise

  requires an exponentially large shallow network, but can be computed by a deep network whose size is linear in the number of inputs D (see the sketch below).
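A sketch of the linear-size deep construction: parity is a chain of XORs, and each XOR can be computed by the small two-layer block from Section 3.2. The hand-picked weights below are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s = 10.0                                        # scale for near-saturated sigmoids
W = s * np.array([[1.0, 1.0], [-1.0, -1.0]])    # OR and NAND rows
w = s * np.array([-0.5, 1.5])
w_and, w_and0 = s * np.array([1.0, 1.0]), s * (-1.5)

def xor_block(a, b):
    """Two-layer XOR unit: a constant number of neurons."""
    h = sigmoid(W @ np.array([a, b]) + w)
    return sigmoid(w_and @ h + w_and0)

def deep_parity(x):
    """Chain of D-1 XOR blocks: depth and size grow only linearly with D."""
    acc = x[0]
    for xi in x[1:]:
        acc = xor_block(acc, xi)
    return round(acc)

rng = np.random.default_rng(9)
for _ in range(20):
    x = rng.integers(0, 2, size=8)
    assert deep_parity(x.astype(float)) == int(x.sum() % 2)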
Space Folding Intuition

Space folding intuition for the case of absolute value rectification units:
- Geometric explanation of the exponential advantage of deeper networks
- The mirror axis of symmetry is given by the hyperplane (defined by weights and bias)
- Complex functions arise as mirrored images of simpler patterns

Montufar, Pascanu, Cho and Bengio: On the Number of Linear Regions of Deep Neural Networks. NIPS, 2014.
Effect of Network Depth

- Deeper networks generalize better (task: multi-digit number classification)

Goodfellow, Bulatov, Ibarz, Arnoud and Shet: Multi-digit number recognition from Street View imagery using deep convolutional neural networks. ICLR, 2014.
Effect of Network Depth

- Increasing the number of parameters is not as effective as increasing depth
- Shallow models even overfit at around 20 million parameters in this example
- Compositionality is a useful prior over the space of functions the model can learn

Goodfellow, Bulatov, Ibarz, Arnoud and Shet: Multi-digit number recognition from Street View imagery using deep convolutional neural networks. ICLR, 2014.
