Lec 03 Deep Networks 1
3.1
Backpropagation with Tensors
Recap: Backpropagation with Scalars

Forward Pass:
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Loss: L( u(y(x)), v(y(x)) )

Backward Pass:
(3) ∂L/∂u = (∂L/∂L) (∂L/∂u) = ∂L/∂u
(3) ∂L/∂v = (∂L/∂L) (∂L/∂v) = ∂L/∂v
(2) ∂L/∂y = (∂L/∂u) (∂u/∂y) + (∂L/∂v) (∂v/∂y)
(1) ∂L/∂x = (∂L/∂y) (∂y/∂x)
Recap: Backpropagation with Scalars

Implementation: Each variable/node is an object and has attributes x.value and x.grad. Values are computed forward and gradients backward:

Forward:
x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)

Backward:
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)
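A minimal Python sketch of this value/grad bookkeeping. The concrete functions y(x) = 2x, u(y) = y², v(y) = sin(y) and L(u, v) = u + v are illustrative assumptions, not part of the slides:

import math

class Node:
    """Scalar graph node holding a value (forward) and a gradient (backward)."""
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

x, y, u, v, L = Node(1.5), Node(), Node(), Node(), Node()

# Forward pass: compute values from the input towards the loss.
y.value = 2.0 * x.value
u.value = y.value ** 2
v.value = math.sin(y.value)
L.value = u.value + v.value

# Backward pass: accumulate gradients from the loss back to the input.
L.grad = 1.0
u.grad += L.grad * 1.0                  # ∂L/∂u = 1
v.grad += L.grad * 1.0                  # ∂L/∂v = 1
y.grad += u.grad * 2.0 * y.value        # ∂u/∂y = 2y
y.grad += v.grad * math.cos(y.value)    # ∂v/∂y = cos(y)
x.grad += y.grad * 2.0                  # ∂y/∂x = 2

print(L.value, x.grad)                  # x.grad now holds ∂L/∂x at x = 1.5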
Scalar vs. Matrix Operations
y = σ(w1 x + w0 )
y = σ(Ax + b)
I Matrix A and vector b are objects with attributes value and grad
I A.grad stores ∇A L and b.grad stores ∇b L
I A.grad has the same shape/dimensions as A.value (since L is scalar)
Backpropagation on Loops
y = σ(Ax + b)   with   u = Ax + b
for i u.value[i] = 0
for i,j u.value[i] += A.value[i, j] ∗ x.value[j]
for i y.value[i] = σ(u.value[i] + b.value[i])
Backpropagation on Loops

The corresponding backward-pass loops are:
[loop listing not recovered from the original slides]
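A sketch of what these backward loops could look like for y = σ(Ax + b), derived from the chain rule; this is my reconstruction (including the shapes and the dummy ∂L/∂y = 1), not the original slide content:

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

class Tensor:
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)

A, x, b = Tensor(np.random.randn(3, 2)), Tensor(np.random.randn(2)), Tensor(np.random.randn(3))
u, y = Tensor(np.zeros(3)), Tensor(np.zeros(3))

# Forward loops (as on the previous slide)
for i in range(3):
    u.value[i] = 0.0
for i in range(3):
    for j in range(2):
        u.value[i] += A.value[i, j] * x.value[j]
for i in range(3):
    y.value[i] = sigma(u.value[i] + b.value[i])

# Backward loops (sketch): chain rule applied element-wise
y.grad[:] = 1.0                                        # pretend ∂L/∂y = 1 for the demo
for i in range(3):
    s = y.grad[i] * y.value[i] * (1.0 - y.value[i])    # σ'(t) = σ(t)(1 − σ(t))
    u.grad[i] += s
    b.grad[i] += s
for i in range(3):
    for j in range(2):
        A.grad[i, j] += u.grad[i] * x.value[j]
        x.grad[j] += u.grad[i] * A.value[i, j]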
Backpropagation on Loops
In practice, all deep learning operations can be written as loops over scalar
assignments. Example for a higher-order tensor and the corresponding backpropagation loops:
[loop listings not recovered from the original slides]
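As a stand-in for the missing example, a sketch for one possible higher-order operation, the batched matrix product Y[n, i, j] = Σk X[n, i, k] W[k, j]; the choice of operation and all dimensions are mine:

import numpy as np

N, I, K, J = 2, 3, 4, 5                      # illustrative tensor dimensions
X = np.random.randn(N, I, K)
W = np.random.randn(K, J)
Y = np.zeros((N, I, J))
X_grad, W_grad = np.zeros_like(X), np.zeros_like(W)
Y_grad = np.ones((N, I, J))                  # pretend ∂L/∂Y = 1 for the demo

# Forward loops: Y[n, i, j] = Σ_k X[n, i, k] * W[k, j]
for n in range(N):
    for i in range(I):
        for j in range(J):
            for k in range(K):
                Y[n, i, j] += X[n, i, k] * W[k, j]

# Backpropagation loops: each scalar assignment contributes to two gradients
for n in range(N):
    for i in range(I):
        for j in range(J):
            for k in range(K):
                X_grad[n, i, k] += Y_grad[n, i, j] * W[k, j]
                W_grad[k, j] += Y_grad[n, i, j] * X[n, i, k]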
Minibatching
Goal:
I The fast part should dominate the computation (wall-clock time)
I Reduce the number of slow sequential operations (e.g., Python loops)
  by running the fast vector/matrix operations on several data points jointly
I This is called minibatching and is used in stochastic gradient descent
Minibatching
Affine + Sigmoid: (applied to N data points simultaneously)

Y = σ(XA + B)   with   U = XA + B

I Each row of X ∈ R^(N×D) is a data point; the bias b ∈ R^M is broadcast to B ∈ R^(N×M)
I Only the inputs and outputs depend on the batch index, not the parameters (e.g., A, B)
I By convention, the gradients are averaged over the batch
Implementation
Affine Transformation: (applied to N data points simultaneously)
Y = XA + B
Implementation in EDF:
def forward(self):
    self.value = np.matmul(self.x.value, self.w.A.value) + self.w.b.value
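The matching backward method is not shown here; a sketch of what it could look like in the same style, wrapped in a small self-contained class (the attribute names and the convention that the incoming gradient ∂L/∂Y is stored in self.grad are my assumptions, not taken from the slides or guaranteed to match EDF):

import numpy as np

class Value:
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)

class Affine:
    """Sketch of an EDF-style node computing Y = X A + B (conventions assumed)."""
    def __init__(self, x, A, b):
        self.x, self.A, self.b = x, A, b
        self.value, self.grad = None, None

    def forward(self):
        self.value = np.matmul(self.x.value, self.A.value) + self.b.value
        self.grad = np.zeros_like(self.value)

    def backward(self):
        # ∂L/∂X = (∂L/∂Y) Aᵀ,   ∂L/∂A = Xᵀ (∂L/∂Y),   ∂L/∂b = column sums of ∂L/∂Y
        self.x.grad += np.matmul(self.grad, self.A.value.T)
        self.A.grad += np.matmul(self.x.value.T, self.grad)
        self.b.grad += np.sum(self.grad, axis=0)

# Usage sketch: N = 4 data points, D = 3 inputs, M = 2 outputs
x, A, b = Value(np.random.randn(4, 3)), Value(np.random.randn(3, 2)), Value(np.zeros(2))
node = Affine(x, A, b)
node.forward()
node.grad[:] = 1.0 / 4          # pretend ∂L/∂Y = 1, averaged over the batch
node.backward()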
ŷ = σ(w> x) with σ(x) = 1 / (1 + e^(−x))
The XOR Problem

ŷ = σ(w> x + w0) with σ(x) = 1 / (1 + e^(−x))
I Let x ∈ R²
[Figures: decision boundary separating Class 0 and Class 1, and the sigmoid σ(x)]
The XOR Problem

Linear Classifier:
Class 1 ⇔ w> x > −w0
Here: (1  1) (x1, x2)> > 0.5, i.e. w> = (1  1) and −w0 = 0.5
The XOR Problem

Linear Classifier:
Class 1 ⇔ w> x > −w0
Here: (1  1) (x1, x2)> > 1.5, i.e. w> = (1  1) and −w0 = 1.5
The XOR Problem

Linear Classifier:
Class 1 ⇔ w> x > −w0
Here: (−1  −1) (x1, x2)> > −1.5, i.e. w> = (−1  −1) and −w0 = −1.5
The XOR Problem

Linear Classifier:
Class 1 ⇔ w> x > −w0
Here: (?  ?) (x1, x2)> > ?   — which w> and −w0 would solve XOR?
The XOR Problem

[Figure: the four XOR inputs in the unit square, labeled Class 0 and Class 1]
Convex Sets
I A set S is convex if any line segment connecting 2 points in S lies entirely within S:
The XOR Problem

[Figure: the XOR data (Class 0 / Class 1) in the unit square, repeated]
The XOR Problem

Linear classifier with non-linear features ψ:

w> ψ(x) > −w0   with   ψ(x) = (x1, x2, x1 x2)>

x1  x2  ψ1(x)  ψ2(x)  ψ3(x)  XOR
0   0   0      0      0      0
0   1   0      1      0      1
1   0   1      0      0      1
1   1   1      1      1      0

I Non-linear features allow a linear classifier to solve non-linear classification problems (numerical check below)!
I Analogous to polynomial curve fitting

[Figure: XOR data (Class 0 / Class 1) in the unit square]
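A quick numerical check of this feature trick; the specific weights w = (1, 1, −2) and threshold −w0 = 0.5 are my choice, not from the slides:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

# Non-linear feature map ψ(x) = (x1, x2, x1*x2)
Psi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

w = np.array([1.0, 1.0, -2.0])     # assumed weights
w0 = -0.5                          # Class 1 ⇔ w·ψ(x) > −w0 = 0.5

pred = (Psi @ w > -w0).astype(int)
print(pred)                        # [0 1 1 0] — matches XOR
assert np.array_equal(pred, xor)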
Representation Matters
[Figure: the same data shown in two panels, Cartesian Coordinates (x, y) and Polar Coordinates (r, θ); the labels XOR, AND, OR appear in the original figure]
The XOR Problem
Parameters can be learned using backprop. This is our first Multi-Layer Perceptron!
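A minimal sketch of such an MLP for XOR, trained with gradient descent and manual backprop; the 2-4-1 architecture, squared-error loss, learning rate, and iteration count are my illustrative choices, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2-4-1 MLP with sigmoid activations
A1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
A2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr = 1.0
for step in range(10000):
    # forward pass
    h = sigma(X @ A1 + b1)
    y = sigma(h @ A2 + b2)
    # backward pass (squared-error loss, gradients averaged over the batch)
    dy = (y - t) * y * (1 - y) / len(X)
    dA2, db2 = h.T @ dy, dy.sum(0)
    dh = (dy @ A2.T) * h * (1 - h)
    dA1, db1 = X.T @ dh, dh.sum(0)
    # gradient descent update
    A1 -= lr * dA1; b1 -= lr * db1
    A2 -= lr * dA2; b2 -= lr * db2

# Typically converges to approximately [[0], [1], [1], [0]]
print(np.round(sigma(sigma(X @ A1 + b1) @ A2 + b2), 2))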
3.3
Multi-Layer Perceptrons
Multi-Layer Perceptrons
I MLPs are feedforward neural networks (no feedback connections)
I They compose several non-linear functions, e.g. f(x) = ŷ(h3(h2(h1(x)))),
  where the hi(·) are called hidden layers and ŷ(·) is the output layer
I The data specifies only the behavior of the output layer (thus the name “hidden”)
I Each layer i comprises multiple neurons j, each implemented as an affine
  transformation (a> x + b) followed by a non-linear activation function g:
  hij = g(aij> hi−1 + bij)
I Each neuron in each layer is fully connected to all neurons of the previous layer
I The overall length of the chain is the depth of the model ⇒ “Deep Learning”
I The name MLP is misleading, as we don’t use threshold units as in Perceptrons
MLP Network Architecture
Network Depth = #Computation Layers = 4
[Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output Layer]
I Neurons are grouped into layers; each neuron is fully connected to all neurons of the previous layer
I Hidden layer hi = g(Ai hi−1 + bi) with activation function g(·) and weights Ai, bi (see the sketch below)
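A minimal forward-pass sketch of such a layer stack; the layer sizes and the tanh activation are illustrative assumptions:

import numpy as np

def g(z):                       # activation function (tanh chosen for illustration)
    return np.tanh(z)

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 5, 1]         # input, three hidden layers, output (assumed sizes)
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(m))    # weights (A_i, b_i)
          for n, m in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x):
    h = x
    for i, (A, b) in enumerate(params):
        z = A @ h + b
        h = z if i == len(params) - 1 else g(z)   # keep the output layer linear here
    return h

print(mlp_forward(np.array([1.0, -2.0, 0.5])))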
Feature Learning Perspective
[Figure: Input Layer → Hidden Layers 1–3 → Output Layer acting as a linear regressor/classifier; the transformation by the hidden layers makes Class 0 and Class 1 linearly separable]
Activation Functions g(·)
Neural Motivation
Remarks:
I Large datasets typically do not fit into GPU memory ⇒ |X_batch| < |X|
I Our examples on the next slides are small ⇒ |X_batch| = |X|
Levels of Abstraction
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
A More Challenging Problem
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Expressiveness
The following two-layer MLP

h = g(A1 x + b1)
y = g(A2 h + b2)

can be written as

y = g(A2 g(A1 x + b1) + b2)

If the activation g is linear (the identity), this collapses into a single affine map:

y = A2 (A1 x + b1) + b2 = A2 A1 x + A2 b1 + b2 = Ax + b

I With linear activations, a multi-layer network can only express linear functions (numerical check below)
I What is the model capacity of MLPs with non-linear activation functions?
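A quick numerical check of this collapse for randomly chosen weights; the shapes are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
A2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Collapse the two linear layers into one affine map y = A x + b
A = A2 @ A1
b = A2 @ b1 + b2

x = rng.normal(size=3)
y_two_layers = A2 @ (A1 @ x + b1) + b2
y_collapsed = A @ x + b
print(np.allclose(y_two_layers, y_collapsed))   # True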
3.4
Universal Approximation
Universal Approximation Theorem
Theorem 1
Let σ be any continuous discriminatory function. Then finite sums of the form

G(x) = Σ_{j=1}^{N} αj σ(aj> x + bj)

are dense in the space of continuous functions C(In) on the n-dimensional unit cube In.
In other words, given any f ∈ C(In) and ε > 0, there is a sum G(x) of the above form for which
|G(x) − f(x)| < ε for all x ∈ In.

Remark: Has been proven for various activation functions (e.g., Sigmoid, ReLU).

Cybenko: Approximation by superposition of a sigmoidal function. Mathematics of Control, Signals, and Systems, 1989.
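To make the theorem concrete, a small sketch that fits a sum of exactly this form to a 1-D target by least squares over the coefficients αj, with fixed random aj, bj; the target f(x) = sin(2πx) and all settings are my illustrative choices:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 50                                    # number of sigmoid terms
a = rng.normal(scale=10.0, size=N)        # fixed random slopes a_j
b = rng.uniform(-10.0, 10.0, size=N)      # fixed random offsets b_j

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)                 # target function on the unit interval

Phi = sigma(np.outer(x, a) + b)           # Phi[i, j] = σ(a_j x_i + b_j)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # least-squares fit of the α_j

G = Phi @ alpha
print(np.max(np.abs(G - f)))              # typically a small approximation error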
Example: Binary Case
x1  x2  x3  y
⋮   ⋮   ⋮   ⋮
0   1   0   0
0   1   1   1
1   0   0   0
⋮   ⋮   ⋮   ⋮

ŷ = Σi [ai> x + bi > 0],   where each summand is a hidden unit hi

I Each hidden linear threshold unit hi recognizes one possible input vector
I We need 2^D hidden units to recognize all 2^D possible inputs in the binary case
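A sketch of this lookup-table construction for D = 3; the particular choice ai = 2p − 1, bi = 0.5 − Σk pk for the unit assigned to pattern p is my construction (it fires exactly on that pattern), not taken from the slide:

import numpy as np
from itertools import product

D = 3
patterns = np.array(list(product([0, 1], repeat=D)))    # all 2^D binary inputs
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=len(patterns))              # arbitrary target truth table

# One threshold unit per pattern: fires exactly on "its" input vector
A = 2 * patterns - 1
b = 0.5 - patterns.sum(axis=1)

def predict(x):
    h = (A @ x + b > 0).astype(int)     # 2^D hidden threshold units (one-hot)
    return int(h @ y)                   # sum only the units of patterns with y = 1

assert all(predict(p) == yi for p, yi in zip(patterns, y))
print("the truth table is reproduced exactly with", len(patterns), "hidden units")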
Soft Thresholds
[Figure: σ(kx) for k = 1, 2, 5, 50 on x ∈ [−10, 10]; larger k makes the sigmoid approach a hard threshold]
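The curves can be reproduced with a few lines (a sketch; the plotting details are mine):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 500)
for k in [1, 2, 5, 50]:
    plt.plot(x, 1.0 / (1.0 + np.exp(-k * x)), label=f"σ({k}x)")
plt.xlabel("x")
plt.legend()
plt.show()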
Space folding intuition for the case of absolute value rectification units:
I Geometric explanation of the exponential advantage of deeper networks
I Mirror axis of symmetry given by the hyperplane (defined by weights and bias)
I Complex functions arise as mirrored images of simpler patterns
Montufar, Pascanu, Cho and Bengio: On the Number of Linear Regions of Deep Neural Networks. NIPS, 2014.
Effect of Network Depth
Goodfellow, Bulatov, Ibarz, Arnoud and Shet: Multi-digit number recognition from Street View imagery using deep convolutional neural networks. ICLR, 2014.