01 - Introduction To Deep Learning
Zürich, 9/7/2020
Literature
• Courses
– Convolutional Neural Networks for Visual Recognition: http://cs231n.stanford.edu
Introduction to Deep Learning
– what's the hype about?
AI, Machine Learning, Deep Learning
Human: 5% misclassification; GoogLeNet: 6.7%
• With DL it took approx. 3 years to solve object detection and other computer vision tasks
• Further examples: NVIDIA course
Focus of these lectures:
Probabilistic Viewpoint
Probabilistic vs deterministic models
(Slide table: examples of deterministic vs. probabilistic models for "Classification" and "Regression".)
Topics
• Day 1
– Introduction to DL
– Fully connected neural networks (fcNN)
– Introduction to TensorFlow and Keras
• Day 2
– Convolutional Neural Networks (CNN) for image data
– Classification and Regression with fcNN and CNNs
• Day 3
– Probabilistic DL
– Extending the GLM with DL for scalar features and image data
• Day 4
– Extending deep GLMs by deep transformation models
– Deep interpretable ordinal regression models
Fully Connected Neural Networks
(fcNN)
The Single Cell: Biological Motivation
An artificial neuron
(Diagram: the input vector x1, x2 plus a constant 1 for the bias is combined with the weights w1, w2 and the bias b into z, which is transformed into the output y.)
Different non-linear transformations (activation functions) are used to get from z to the output y.
Sigmoid:
$y = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$
The sigmoid of z ensures a number between 0 and 1, which can be interpreted as a probability.
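A quick R illustration of this activation (the range of z values is arbitrary):
# Sigmoid: maps any real z to a value between 0 and 1
sigmoid = function(z) 1 / (1 + exp(-z))
z = seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l", ylab = "y = sigmoid(z)")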
Exercise: Part 1
(Diagram: a single artificial neuron with inputs $x_1$, $x_2$, weights $w_1$, $w_2$, and bias $b$.)
Model: The above network models the probability $p_1$ that a given banknote is false.
TASK
The weights (determined by a training procedure later) are given by $w_1 = 0.3$, $w_2 = 0.1$, and $b = 1.0$.
The probability can be calculated from z using the function $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$:
$p_1 = \mathrm{sigmoid}\left( (x_1 \; x_2) \cdot \binom{w_1}{w_2} + b \right)$
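A minimal R sketch of this computation; the feature values x1 = 1.5 and x2 = -2.0 are made-up inputs, not part of the exercise:
sigmoid = function(z) 1 / (1 + exp(-z))
# Weights and bias given in the exercise
w1 = 0.3; w2 = 0.1; b = 1.0
# Hypothetical banknote features (illustrative values only)
x1 = 1.5; x2 = -2.0
z  = x1*w1 + x2*w2 + b
p1 = sigmoid(z)   # modelled probability that the banknote is false
p1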
Result*
(Plot: the resulting decision boundary in the $(x_1, x_2)$ plane.)
General rule: networks without a hidden layer have a linear decision boundary.
We stack single neurons in layers and use the output of one neuron as input to the next neuron.
Notation: $W^l_{\mathrm{from},\mathrm{to}}$ for the weights and e.g. $b^1_2$ for the bias of hidden neuron $h_2$.
(Diagram: inputs $x_1$, $x_2$ feed the hidden neurons $h_1, h_2, \dots$, which feed the output $p_1$.)
A single neuron, as before:
$p_1 = \mathrm{sigmoid}\left( (x_1 \; x_2) \cdot \binom{w_1}{w_2} + b \right)$
A hidden neuron, e.g. $h_2$:
$h_2 = \mathrm{sigmoid}\left( (x_1 \; x_2) \cdot \binom{W^1_{1,2}}{W^1_{2,2}} + b^1_2 \right)$
Matrix Notation (later we drop the vector arrows)
Note: column vectors!
$\vec{h} = \mathrm{sigmoid}\left( \vec{x} \cdot W^1 + \vec{b}^1 \right)$
Complete Network:
$p_1 = \mathrm{sigmoid}\left( \vec{h} \cdot W^2 + b^2_1 \right)$
Code:
h  = sigmoid(x %*% W1 + b1)   # hidden layer activations
p1 = sigmoid(h %*% W2 + b2)   # output probability
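A self-contained R sketch of this forward pass; the network size (2 inputs, 3 hidden neurons, 1 output) and all numbers are made up for illustration:
sigmoid = function(z) 1 / (1 + exp(-z))
# Made-up weights: 2 inputs -> 3 hidden neurons -> 1 output
W1 = matrix(c( 0.5, -0.3, 0.8,
              -0.2,  0.7, 0.1), nrow = 2, byrow = TRUE)
b1 = c(0.1, -0.1, 0.2)
W2 = matrix(c(0.6, -0.4, 0.9), ncol = 1)
b2 = 0.05
x  = matrix(c(1.0, 2.0), nrow = 1)   # one input example as a row vector
h  = sigmoid(x %*% W1 + b1)          # hidden activations (1 x 3)
p1 = sigmoid(h %*% W2 + b2)          # output probability (1 x 1)
p1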
Increasing number of neurons in the hidden layer
http://cs231n.github.io/neural-networks-1/
DL uses many hidden layers
https://www.reddit.com/r/ProgrammerHumor/comments/8c1i45/stack_more_layers/
Experiment yourself, play at home
http://playground.tensorflow.org
In code:
## Solution: 2 hidden layers
hidden_1 = sigmoid(X %*% W1 + b1)          # first hidden layer
hidden_2 = sigmoid(hidden_1 %*% W2 + b2)   # second hidden layer
res      = sigmoid(hidden_2 %*% W3 + b3)   # output
$p = f\big(f\big(f(\mathbf{x}\, W^1)\, W^2\big)\, W^3\big)$ (biases omitted for brevity)
2 Class
So far: Logistic Regression / Binary Classification
$p_1 = \mathrm{sigmoid}\left( (x_1 \; x_2) \cdot \binom{w_1}{w_2} + b \right)$
Classification: Softmax Activation
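A minimal R sketch of the softmax (the score values are made up):
# Softmax: turns a vector of scores z into probabilities that sum to 1
softmax = function(z) {
  z = z - max(z)            # subtract the maximum for numerical stability
  exp(z) / sum(exp(z))
}
scores = c(3.2, 0.9, -1.2, 0.3)   # made-up scores for 4 classes
softmax(scores)                   # probabilities, summing to 1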
Training NN
Training
(Diagram: a neural network with many weights W maps images to predicted labels; e.g. a tiger predicted as "seal" 👎, a tiger predicted as "tiger" 👍, a seahorse predicted as "seahorse" 👍, ...)
Training principle: the weights are tuned so that a loss function gets minimized.
Typically about 1 million training examples.
$\mathrm{loss} = \mathrm{loss}(y_i, x_i, W)$
Loss for classification (‘categorical cross-entropy’)
$p_0, p_1, \dots, p_9$ are the probabilities for the classes 0 to 9.
$l_i = -\log p(y_i \mid x_i)$: the negative log of the predicted probability for the true class $y_i$ of example $x_i$.
For all examples, just average: $\mathrm{loss} = \frac{1}{N} \sum_i l_i$
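A small R sketch of this loss; the predicted probabilities and labels are made up, with 3 classes instead of 10 to keep it short:
# Categorical cross-entropy for a batch of examples
# p: matrix of predicted class probabilities (rows = examples, columns = classes 0, 1, 2, ...)
# y: true class labels, coded 0, 1, 2, ...
cross_entropy = function(p, y) {
  n = nrow(p)
  p_true = p[cbind(1:n, y + 1)]   # predicted probability of the true class
  mean(-log(p_true))              # average loss over the examples
}
p = rbind(c(0.7, 0.2, 0.1),
          c(0.1, 0.1, 0.8))
y = c(0, 2)
cross_entropy(p, y)   # average of -log(0.7) and -log(0.8)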
Training / Gradient Descent
Optimization in DL
The parameters of the network are the weights.
• DL: many parameters
– The loss is optimized by simple gradient descent
(Plot: loss as a function of a single parameter a.)
• Imagine you are a blindfolded wanderer who only knows the loss and the slope at the current position. How do you reach the minimum?
– Take a large step if the slope is steep (you are far from the minimum)
• The slope of the loss function is given by the gradient (a local quantity)
• Iterative update of the parameters:
– $a_{k+1} = a_k - \epsilon \, \mathrm{grad}_a(\mathrm{loss})$
Proper learning rate (Important parameter for DL)
See: https://developers.google.com/machine-learning/crash-course/fitter/graph
$w_i(t) = w_i(t-1) - \epsilon(t) \left. \frac{\partial L(w)}{\partial w_i} \right|_{w = w(t-1)}$
(Plot: gradient-descent steps on the loss surface over the weights $w_1$ and $w_2$.)
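A minimal R sketch of this update rule on a toy one-dimensional loss; the quadratic loss, starting point, and learning rate are assumptions for illustration:
loss      = function(a) (a - 3)^2      # toy loss with minimum at a = 3
grad_loss = function(a) 2 * (a - 3)    # analytic gradient dL/da
a   = -1.0    # starting point
eps = 0.1     # learning rate
for (t in 1:50) {
  a = a - eps * grad_loss(a)   # a(t) = a(t-1) - eps * gradient
}
c(a = a, loss = loss(a))       # a ends up close to 3, the loss close to 0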
Summary: Simple Network no hidden layer
(Slide figure: an input image is flattened to a vector with $k^2$ elements and mapped, without a hidden layer, to the probabilities $p_0, \dots, p_9$ for the classes 0 to 9.)
Weight update from the loss of a mini-batch:
$w_i(t) = w_i(t-1) - \epsilon(t) \left. \frac{\partial L(w)}{\partial w_i} \right|_{w = w(t-1)}$
The miracle of gradient descent in DL
The loss surface in DL is not convex, but SGD magically also works for non-convex problems.
Modern deep learning: no distinction between the network (model) and the training (SGD).
(Diagram: forward pass and backward pass through the network.)
Motivation (plot from AlexNet, Krizhevsky et al. 2012): green = sigmoid, red = ReLU; ReLU gives faster convergence.
Deep Learning Frameworks
Recap: The first network
Typical Tensors in Deep Learning
(Diagram: two layers L1 and L2 connected by weights, e.g. $W_{2,4}$.)
• The weights going from e.g. layer L1 to layer L2 can be written as a matrix (often called W)
Keras Workflow
(Workflow diagram: define the network → compile → fit → evaluate → use in production.)
A first run through
Define the network
(Slide shows the Keras code; annotations: the number of neurons in the first hidden layer, and the dimension of the input, here a vector of size 784.)
Alternative: input_dim = 784
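A sketch of this step with the keras R package; the number of hidden neurons (100) and the 10 output classes are assumptions, only the 784-dimensional input is taken from the slide:
library(keras)
model = keras_model_sequential() %>%
  layer_dense(units = 100, activation = "sigmoid", input_shape = c(784)) %>%
  layer_dense(units = 10, activation = "softmax")
summary(model)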
Compile the network
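Continuing the sketch above; categorical cross-entropy, plain SGD, and accuracy as the metric are assumptions:
model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_sgd(),
  metrics   = "accuracy"
)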
Fit the network
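Continuing the sketch; x_train and y_train (one-hot encoded, e.g. via to_categorical()), the batch size, epoch count, and validation split are assumptions:
history = model %>% fit(
  x_train, y_train,
  batch_size       = 128,
  epochs           = 10,
  validation_split = 0.2
)
plot(history)   # training curves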
Evaluate the network
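Continuing the sketch; x_test and y_test are assumed to be held-out test data prepared like the training data:
model %>% evaluate(x_test, y_test)    # loss and accuracy on the test set
probs = model %>% predict(x_test)     # predicted class probabilities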
More layers
• Dropout (a usage sketch follows this list)
– keras.layers.Dropout
• Convolutional (see lecture on CNN)
– keras.layers.Conv2D
– keras.layers.Conv1D
• Pooling (see lecture on CNN)
– keras.layers.MaxPooling2D
• Recurrent (not in course)
– keras.layers.SimpleRNNCell
– keras.layers.GRU
– keras.layers.LSTM
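As an illustration of adding one of these layers, a sketch with a dropout layer between two dense layers; the dropout rate of 0.3 and the layer sizes are assumptions:
model_do = keras_model_sequential() %>%
  layer_dense(units = 100, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")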
How to use TF and Keras in the course
• Use RStudio
– Installation is a bit tedious, especially for tfprobability
– You might be lucky, though
– Banknote example:
https://colab.research.google.com/drive/1_kWrocpNxlzYYySIi__55ucwtuvgAflv?usp=sharing