Basics of Deep Learning: Pierre-Marc Jodoin and Christian Desrosiers
Let's start with a simple example

Training dataset D:

Patient      x = (temp, freq)    t = diagnostic
Patient 1    (37.5, 72)          Healthy
Patient 2    (39.1, 103)         Sick
Patient 3    (38.3, 100)         Sick
(…)          …                   …
Patient N    (36.7, 88)          Healthy

[Scatter plot: heart rate (freq) vs. temperature (temp); the Sick and Healthy samples form two clusters]
Solution

[Plot: a straight line in the (temp, freq) plane separating the Sick samples from the Healthy ones]
More formally

$y(\mathbf{x})$ = Healthy if $\mathbf{x}$ is in the blue region, Sick otherwise

[Plot: (temp, freq) plane with the t = Sick and t = Healthy samples on either side of a line; image from Wikimedia Commons, the free media repository]

How to split the feature space?
Definition … a line!

$y = mx + b$  (m: slope, b: bias)

with $m = \frac{\Delta y}{\Delta x}$. Multiplying both sides by $\Delta x$:

$y\,\Delta x = \Delta y\,x + b\,\Delta x$
$0 = \Delta y\,x - \Delta x\,y + b\,\Delta x$

Rename variables

$0 = \underbrace{\Delta y}_{w_1}\,x + \underbrace{(-\Delta x)}_{w_2}\,y + \underbrace{b\,\Delta x}_{w_0}$

$0 = w_1 x + w_2 y + w_0$

Rename $x \to x_1$ and $y \to x_2$:

$0 = w_1 x_1 + w_2 x_2 + w_0$
Classification function

$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0$

[Plot: the line $y_\mathbf{w}(\mathbf{x}) = 0$ splits the $(x_1, x_2)$ plane into $y_\mathbf{w}(\mathbf{x}) > 0$ and $y_\mathbf{w}(\mathbf{x}) < 0$; the vector $\vec{N} = (w_1, w_2)$ is normal to the line]
Classification function

$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0$ with $w_1 = 1.0$, $w_2 = -2.0$, $w_0 = 4.0$

Point (1, 6): $w_1 x_1 + w_2 x_2 + w_0 = -7 < 0$
Point (2, 3): $w_1 x_1 + w_2 x_2 + w_0 = 0$ (on the line)
Point (4, 1): $w_1 x_1 + w_2 x_2 + w_0 = 6 > 0$
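As a quick check, here is a minimal numpy sketch (variable names are illustrative) that evaluates the three points above:

import numpy as np

# weights in augmented order (w0, w1, w2); each point as (1, x1, x2)
w = np.array([4.0, 1.0, -2.0])
X = np.array([[1.0, 1.0, 6.0],
              [1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0]])

print(X @ w)   # [-7.  0.  6.] : below, on, and above the line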
Classification function

$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0 = (w_0, w_1, w_2) \cdot (1, x_1, x_2) = \mathbf{w}^T\mathbf{x}'$

i.e. a DOT product between the weight vector $\mathbf{w} = (w_0, w_1, w_2)$ and the augmented input $\mathbf{x}' = (1, x_1, x_2)$.

[Plot: the line $y_\mathbf{w}(\mathbf{x}) = 0$ with normal vector $\vec{N} = (w_1, w_2)$]
To simplify notation

$y_\mathbf{w}(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$
Learning

With the training dataset D, the GOAL is to find $w_0, w_1, w_2$ that minimize $L(y_\mathbf{w}(\mathbf{x}), D)$.

[Plot: Sick/Healthy samples with candidate separating lines]
Today
• Perceptron
• Logistic regression
• Multi-layer perceptron
• Conv Nets
Perceptron

Rosenblatt, Frank (1958), "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain", Psychological Review, v65, No. 6, pp. 386–408.
Perceptron (2D and 2 classes)

[Diagram: inputs 1, x1, x2 with weights w0, w1, w2, summed into $\mathbf{w}^T\mathbf{x}$; the line $y_\mathbf{w}(\mathbf{x}) = 0$ splits the plane into $y_\mathbf{w}(\mathbf{x}) > 0$ and $y_\mathbf{w}(\mathbf{x}) < 0$]

$y_\mathbf{w}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{w}^T\mathbf{x}$

Adding an activation function on the output:

$y_\mathbf{w}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}) \in \{-1, +1\}$

Neuron = dot product + activation function
So far…
1. Training dataset: D
2. Classification function (a line in 2D): $y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0$
3. Loss function: $L(y_\mathbf{w}(\mathbf{x}), D)$
4. Training: find $w_0, w_1, w_2$ that minimize $L(y_\mathbf{w}(\mathbf{x}), D)$
Linear classifiers have limits

Non-linearly separable training data may become separable when we add a feature (here: headache):

D: x = (temp, freq), t = diagnostic     D: x = (temp, freq, headache), t = diagnostic
Patient 1 (37.5, 72) healthy            Patient 1 (37.5, 72, 2) healthy
Patient 2 (39.1, 103) sick              Patient 2 (39.1, 103, 8) sick
Patient 3 (38.3, 100) sick              Patient 3 (38.3, 100, 6) sick
(…) …                                   (…) …
Patient N (36.7, 88) healthy            Patient N (36.7, 88, 0) healthy
$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0$ (line)  →  $y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$ (plane)

Perceptron

[Diagram: 2D perceptron with $y_\mathbf{w}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}) \in \{-1, +1\}$ and $y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0$ (line)]
Example 3D

Perceptron

[Diagram: inputs 1, x1, x2, x3 with weights w0, w1, w2, w3; the output is $y_\mathbf{w}(\mathbf{x}) = +1$ or $-1$ on either side of the plane]

$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$ (plane)
Perceptron (K-D and 2 classes)

[Diagram: inputs 1, x1, x2, …, xK with weights w0, w1, …, wK → $\mathbf{w}^T\mathbf{x}$ → sign]

$y_\mathbf{w}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}) \in \{-1, +1\}$
$y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_K x_K + w_0$ (hyperplane)
Learning a machine

The goal: with a set of training data $D = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \dots, (\mathbf{x}_N, t_N)\}$, estimate $\mathbf{w}$ so that:

$y_\mathbf{w}(\mathbf{x}_n) \approx t_n \quad \forall n$

In other words, minimize the training loss

$L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} l(y_\mathbf{w}(\mathbf{x}_n), t_n)$

⇒ an optimization problem.
Loss function

[Surface plot: the loss as a function of $w_1$ and $w_2$]
Perceptron

Question: how to find the best solution, i.e. $L(y_\mathbf{w}(\mathbf{x}), D) = 0$?

Random initialization:

[w1, w2] = np.random.randn(2)

[Plot: the loss surface $L(y_\mathbf{w}(\mathbf{x}), D)$ over $(w_1, w_2)$ with the randomly initialized starting point]
Gradient descent

Question: how to find the best solution, i.e. $L(y_\mathbf{w}(\mathbf{x}), D) = 0$? Follow the loss surface downhill, in the direction of the negative gradient.
Perceptron Criterion (loss)

Observation: a sample is wrongly classified when

$\mathbf{w}^T\mathbf{x}_n > 0$ and $t_n = -1$,  or
$\mathbf{w}^T\mathbf{x}_n < 0$ and $t_n = +1$.

Consequently $-\mathbf{w}^T\mathbf{x}_n t_n$ is ALWAYS positive for wrongly classified samples.
Perceptron Criterion (loss)

$L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{\mathbf{x}_n \in V} \mathbf{w}^T\mathbf{x}_n t_n$, where V is the set of wrongly classified samples.

[Plot: example configuration with $L(y_\mathbf{w}(\mathbf{x}), D) = 464.15$]
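A minimal numpy sketch of this criterion (illustrative names; labels are ±1):

import numpy as np

def perceptron_loss(w, X, t):
    # X: N x (K+1) augmented inputs, t: N labels in {-1, +1}
    scores = (X @ w) * t                # w^T x_n t_n for every sample
    return -scores[scores < 0].sum()    # sum only over wrongly classified samples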
So far…
1. Training dataset: D
2. Linear classification function: $y_\mathbf{w}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \dots + w_M x_M + w_0$
   [Diagram: inputs 1, x1, …, xK with weights w0, …, wK → $\mathbf{w}^T\mathbf{x}$ → sign → $y_\mathbf{w}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}) \in \{-1, +1\}$]
3. Loss function: $L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{\mathbf{x}_n \in V} \mathbf{w}^T\mathbf{x}_n t_n$
4. Training: find $\mathbf{w}$ that minimizes the loss:
   $\mathbf{w} = \arg\min_\mathbf{w} L(y_\mathbf{w}(\mathbf{x}), D)$
Optimisation

$\mathbf{w}^{[k+1]} = \mathbf{w}^{[k]} - \eta^{[k]}\,\nabla L$

where $\nabla L$ is the gradient of the loss function and $\eta^{[k]}$ the learning rate.

Init w
k = 0
DO
  k = k + 1
  FOR n = 1 to N
    w ← w − η[k] ∇L(x_n)
UNTIL every data point is well classified or k == MAX_ITER
Perceptron gradient descent

$L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{\mathbf{x}_n \in V} \mathbf{w}^T\mathbf{x}_n t_n$
$\nabla L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{\mathbf{x}_n \in V} t_n \mathbf{x}_n$

Init w
k = 0
DO
  k = k + 1
  FOR n = 1 to N
    IF w^T x_n t_n ≤ 0 THEN   /* wrongly classified */
      w ← w + η t_n x_n
UNTIL every data point is well classified or k == MAX_ITER
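The same algorithm as a runnable numpy sketch (a minimal version, assuming ±1 labels and a fixed learning rate):

import numpy as np

def train_perceptron(X, t, eta=1.0, max_iter=100):
    # X: N x (K+1) augmented inputs (first column = 1), t: labels in {-1, +1}
    w = np.random.randn(X.shape[1])
    for k in range(max_iter):
        errors = 0
        for xn, tn in zip(X, t):
            if (w @ xn) * tn <= 0:     # wrongly classified
                w += eta * tn * xn     # w <- w + eta * t_n * x_n
                errors += 1
        if errors == 0:                # every sample well classified
            break
    return w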
The same criterion can be written per sample:

$L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} \max(0, -t_n \mathbf{w}^T\mathbf{x}_n)$

and, adding a margin of 1:

$L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} \max(0, 1 - t_n \mathbf{w}^T\mathbf{x}_n)$ — the “Hinge Loss” or “SVM” Loss
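The hinge loss is a one-liner in numpy (an illustrative sketch):

import numpy as np

def hinge_loss(w, X, t):
    # sum_n max(0, 1 - t_n * w^T x_n)
    return np.maximum(0.0, 1.0 - t * (X @ w)).sum()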
Multiclass Perceptron (2D and 3 classes)

[Diagram: inputs 1, x1, x2 fully connected to three linear units $\mathbf{w}_0^T\mathbf{x}$, $\mathbf{w}_1^T\mathbf{x}$, $\mathbf{w}_2^T\mathbf{x}$; the predicted class is $\arg\max_i y_{W,i}(\mathbf{x})$, i.e. the region where $y_{W,i}(\mathbf{x})$ is max]

$y_{\mathbf{w},0}(\mathbf{x}) = \mathbf{w}_0^T\mathbf{x} = w_{0,0} + w_{0,1} x_1 + w_{0,2} x_2$
$y_{\mathbf{w},1}(\mathbf{x}) = \mathbf{w}_1^T\mathbf{x} = w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2$
$y_{\mathbf{w},2}(\mathbf{x}) = \mathbf{w}_2^T\mathbf{x} = w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2$
Multiclass Perceptron (2D and 3 classes)

In matrix form:

$y_W(\mathbf{x}) = W\mathbf{x} = \begin{pmatrix} w_{0,0} & w_{0,1} & w_{0,2} \\ w_{1,0} & w_{1,1} & w_{1,2} \\ w_{2,0} & w_{2,1} & w_{2,2} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ x_2 \end{pmatrix}$
Multiclass Perceptron (2D and 3 classes)

Example with input $\mathbf{x} = (1.1, -2.0)$: compute the three scores and pick the max.
Multiclass Perceptron

Loss function:

$L(y_W(\mathbf{x}), D) = \sum_{\mathbf{x}_n \in V} \left( \mathbf{w}_j^T\mathbf{x}_n - \mathbf{w}_{t_n}^T\mathbf{x}_n \right)$

where $j$ is the wrongly predicted class of $\mathbf{x}_n$. The gradient with respect to each row is $\pm\mathbf{x}_n$:

$\nabla_{\mathbf{w}_j} L = \mathbf{x}_n, \qquad \nabla_{\mathbf{w}_{t_n}} L = -\mathbf{x}_n$
Multiclass Perceptron

init W
k = 0
DO
  k = k + 1
  FOR n = 1 to N
    j = argmax_i w_i^T x_n
    IF j ≠ t_n THEN   /* wrongly classified sample */
      w_j ← w_j − η x_n
      w_{t_n} ← w_{t_n} + η x_n
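A runnable numpy sketch of this update rule (illustrative names; integer class labels assumed):

import numpy as np

def train_multiclass_perceptron(X, t, n_classes, eta=1.0, max_iter=100):
    # X: N x (K+1) augmented inputs, t: integer labels in {0, ..., n_classes-1}
    W = np.zeros((n_classes, X.shape[1]))   # one weight vector per class
    for k in range(max_iter):
        for xn, tn in zip(X, t):
            j = int(np.argmax(W @ xn))      # predicted class
            if j != tn:                     # wrongly classified sample
                W[j]  -= eta * xn
                W[tn] += eta * xn
    return W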
Perceptron

Advantages:
• Very simple
• Does NOT assume the data follows a Gaussian distribution
• If the data is linearly separable, convergence is guaranteed

Limitations:
• Zero gradient for many solutions ⇒ several “perfect solutions”
• Data must be linearly separable

[Plots: many “optimal” solutions; a suboptimal solution; non-separable data on which training will never converge]
Two famous ways of improving the Perceptron:
1. A new activation function and loss ⇒ logistic regression
2. A new network architecture ⇒ the multi-layer perceptron

Logistic regression
Logistic regression (2D, 2 classes)

New activation function: the sigmoid

$\sigma(t) = \frac{1}{1 + e^{-t}}$

[Diagram: inputs 1, x1, x2 → $\mathbf{w}^T\mathbf{x}$ → σ → $y_\mathbf{w}(\mathbf{x}) \in [0, 1]$]

Neuron:

$y_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}$
Logistic regression (2D, 2 classes)

[Plot: $y_\mathbf{w}(\mathbf{x}) \to 1$ on one side of the line, $y_\mathbf{w}(\mathbf{x}) = 0.5$ on it, $y_\mathbf{w}(\mathbf{x}) \to 0$ on the other side]

$y_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})$
Logistic regression (2D, 2 classes)

Example: $\mathbf{x}_n = (0.4, 1.0)$, $\mathbf{w} = (2.0, 3.6, 0.5)$

$\mathbf{w}^T\mathbf{x}_n = -1.94$

$y_\mathbf{w}(\mathbf{x}_n) = \sigma(-1.94) = \frac{1}{1 + e^{1.94}} = 0.125$

Since 0.125 is lower than 0.5, $\mathbf{x}_n$ is behind the plane.
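A quick numerical check of the sigmoid step above:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(-1.94))   # ~0.125: below 0.5, so x_n falls on the negative side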
Logistic regression (K-D, 2 classes)

Like the Perceptron, logistic regression accommodates K-D vectors:

[Diagram: inputs 1, x1, x2, …, xK with weights w0, …, wK → $\mathbf{w}^T\mathbf{x}$ → σ]

$y_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})$
With a sigmoid, we can simulate a conditional probability of c1 GIVEN x:

$y_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = P(c_1 \mid \mathbf{w}, \mathbf{x})$
$P(c_0 \mid \mathbf{w}, \mathbf{x}) = 1 - y_\mathbf{w}(\mathbf{x})$
Cost function is −ln of the likelihood (the cross entropy):

$L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{n=1}^{N} \left[ t_n \ln y_\mathbf{w}(\mathbf{x}_n) + (1 - t_n) \ln(1 - y_\mathbf{w}(\mathbf{x}_n)) \right]$
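In numpy, this loss is a direct transcription of the formula (a sketch; targets are 0/1 and predictions strictly inside (0,1)):

import numpy as np

def binary_cross_entropy(y, t):
    # y: predicted probabilities in (0,1); t: targets in {0,1}
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))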
And for K>2 classes?

New activation function: Softmax

[Diagram: inputs 1, x1, x2 → three linear units $\mathbf{w}_0^T\mathbf{x}$, $\mathbf{w}_1^T\mathbf{x}$, $\mathbf{w}_2^T\mathbf{x}$ → exp → normalization → $P(c_0 \mid W, \mathbf{x})$, $P(c_1 \mid W, \mathbf{x})$, $P(c_2 \mid W, \mathbf{x})$]

Softmax:

$y_{w_i}(\mathbf{x}) = \frac{e^{\mathbf{w}_i^T\mathbf{x}}}{\sum_c e^{\mathbf{w}_c^T\mathbf{x}}}$
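A minimal numpy softmax (the max-subtraction is a standard numerical-stability trick, not part of the slide's formula; it does not change the result):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1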
And for K>2 classes?

Targets become one-hot vectors (example: Cifar10):

'airplane'    t = (1,0,0,0,0,0,0,0,0,0)
'automobile'  t = (0,1,0,0,0,0,0,0,0,0)
'bird'        t = (0,0,1,0,0,0,0,0,0,0)
'cat'         t = (0,0,0,1,0,0,0,0,0,0)
'deer'        t = (0,0,0,0,1,0,0,0,0,0)
'dog'         t = (0,0,0,0,0,1,0,0,0,0)
'frog'        t = (0,0,0,0,0,0,1,0,0,0)
'horse'       t = (0,0,0,0,0,0,0,1,0,0)
'ship'        t = (0,0,0,0,0,0,0,0,1,0)
'truck'       t = (0,0,0,0,0,0,0,0,0,1)
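One-hot targets are easy to build in numpy (a sketch; the label indices follow the Cifar10 list above):

import numpy as np

labels = np.array([3, 0, 9])       # e.g. 'cat', 'airplane', 'truck'
one_hot = np.eye(10)[labels]       # each row is a 10-D one-hot target
print(one_hot[0])                  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]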
K-Class cross entropy loss

$L(y_W(\mathbf{x}), D) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_{W_k}(\mathbf{x}_n)$

With a softmax output, its gradient takes a simple form:

$\nabla L = \sum_{n=1}^{N} \mathbf{x}_n \left( y_W(\mathbf{x}_n) - \mathbf{t}_n \right)$
Regularization

Different weights may give the same score:

$\mathbf{x} = (1.0, 1.0, 1.0)$
$\mathbf{w}_1 = (1, 0, 0)^T$
$\mathbf{w}_2 = (1/3, 1/3, 1/3)^T$
$\mathbf{w}_1^T\mathbf{x} = \mathbf{w}_2^T\mathbf{x} = 1$

Solution: Maximum a posteriori

Maximum a posteriori — Regularization

$\arg\min_W \; L(y_\mathbf{w}(\mathbf{x}), D) + \lambda R(W)$

with λ a constant, $L$ the loss function and $R$ the regularization term; in general L1 or L2: $R(W) = \|W\|_1$ or $\|W\|_2$.
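A sketch of the regularized objective (lambda_ and the function names are illustrative):

import numpy as np

def regularized_loss(W, data_loss, lambda_=0.01, norm="l2"):
    # data_loss: the value of L(y_w(x), D); R(W): L1 or L2 penalty
    R = np.abs(W).sum() if norm == "l1" else (W ** 2).sum()
    return data_loss + lambda_ * R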
Wow! Loooots of information!
Let's recap…
Neural networks — 2 classes

[Diagrams:
Perceptron, sign activation: 1, x1, …, xK → $\mathbf{w}^T\mathbf{x}$ → sign → $y_\mathbf{w}(\mathbf{x}) \in \{-1, +1\}$
Logistic regression, sigmoid activation: 1, x1, …, xK → $\mathbf{w}^T\mathbf{x}$ → σ → $y_\mathbf{w}(\mathbf{x}) = P(c_1 \mid \mathbf{x})$]
K-Class Neural networks

[Diagrams:
Perceptron: inputs 1, x1, x2 → linear units $\mathbf{w}_0^T\mathbf{x}$, $\mathbf{w}_1^T\mathbf{x}$, $\mathbf{w}_2^T\mathbf{x}$ → $\arg\max_i y_{W_i}(\mathbf{x})$
Logistic regression: same linear units → exp → normalization (Softmax activation) → $P(c_i \mid \mathbf{x})$]
Loss functions — 2 classes

Perceptron criterion: $L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{\mathbf{x}_n \in V} t_n \mathbf{w}^T\mathbf{x}_n$, where V is the set of wrongly classified samples

$L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} \max(0, -t_n \mathbf{w}^T\mathbf{x}_n)$

“Hinge Loss” or “SVM” Loss: $L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} \max(0, 1 - t_n \mathbf{w}^T\mathbf{x}_n)$

Cross entropy loss: $L(y_\mathbf{w}(\mathbf{x}), D) = -\sum_{n=1}^{N} \left[ t_n \ln y_\mathbf{w}(\mathbf{x}_n) + (1 - t_n) \ln(1 - y_\mathbf{w}(\mathbf{x}_n)) \right]$
Loss functions — K classes

Perceptron criterion: $L(y_W(\mathbf{x}), D) = \sum_{\mathbf{x}_n \in V} \left( \mathbf{w}_j^T\mathbf{x}_n - \mathbf{w}_{t_n}^T\mathbf{x}_n \right)$, where V is the set of wrongly classified samples

$L(y_W(\mathbf{x}), D) = \sum_{n=1}^{N} \max_j \left( 0, \mathbf{w}_j^T\mathbf{x}_n - \mathbf{w}_{t_n}^T\mathbf{x}_n \right)$

“Hinge Loss” or “SVM” Loss: $L(y_W(\mathbf{x}), D) = \sum_{n=1}^{N} \max_j \left( 0, 1 + \mathbf{w}_j^T\mathbf{x}_n - \mathbf{w}_{t_n}^T\mathbf{x}_n \right)$

Cross entropy loss with a Softmax: $L(y_W(\mathbf{x}), D) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_{W_k}(\mathbf{x}_n)$
Maximum a posteriori

$L(y_\mathbf{w}(\mathbf{x}), D) = \sum_{n=1}^{N} l(y_W(\mathbf{x}_n), t_n) + \lambda R(W)$

with λ a constant, $l$ the loss function and $R$ the regularization term: $R(W) = \|W\|_1$ or $\|W\|_2$
Optimisation

$\mathbf{w}^{[k+1]} = \mathbf{w}^{[k]} - \eta^{[k]}\,\nabla L$

where $\nabla L$ is the gradient of the loss function and $\eta^{[k]}$ the learning rate.

Init w
k = 0
DO
  k = k + 1
  FOR n = 1 to N
    w ← w − η[k] ∇L(x_n)
UNTIL every data point is well classified or k == MAX_ITER
Now, let's go DEEPER
Non-linearly separable training data

A single neuron can only draw a line:

[Diagram: 1, x1, x2 → $\mathbf{w}^T\mathbf{x}$ → σ → $y_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) \in \mathbb{R}$]
Let's add 3 neurons

[Diagram: inputs 1, x1, x2 fully connected to three sigmoid units $\sigma(\mathbf{w}_0^T\mathbf{x})$, $\sigma(\mathbf{w}_1^T\mathbf{x})$, $\sigma(\mathbf{w}_2^T\mathbf{x}) \in \mathbb{R}$]

Input layer → First layer (3 “neurons”)

In matrix form, the first layer computes $\sigma(W^{[0]}\mathbf{x})$.
2-D, 2-Class, 1 hidden layer

If we want a 2-class classification via a logistic regression (a cross entropy loss), we must add an output neuron:

$y_W(\mathbf{x}) = \sigma\!\left(\mathbf{w}^{[1]T}\,\sigma\!\left(W^{[0]}\mathbf{x}\right)\right)$

[Diagram: hidden layer with weights $W^{[0]}$, then an output layer (1 neuron) with weights $\mathbf{w}^{[1]}$]
Visual simplification

[Compact diagram: $\mathbf{x} \xrightarrow{W^{[0]}} \text{hidden layer} \xrightarrow{\mathbf{w}^{[1]}} y_W(\mathbf{x}) = \sigma(\mathbf{w}^{[1]T}\sigma(W^{[0]}\mathbf{x}))$]
2-D, 2-Class, 1 hidden layer

Input layer → Hidden layer → Output layer

Each layer also has a bias neuron. This network contains a total of 13 parameters: 3×3 in $W^{[0]}$ (3 hidden units × 2 inputs + bias) and 1×4 in $\mathbf{w}^{[1]}$ (3 hidden units + bias).

$y_W(\mathbf{x}) = \sigma(\mathbf{w}^{[1]T}\sigma(W^{[0]}\mathbf{x}))$

Interactive demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
kD, 2 Classes, 1 hidden layer

Input layer → Hidden layer → Output layer

[Diagram: inputs 1, x1, …, xk fully connected to the hidden layer, then to one output]

$y_W(\mathbf{x}) = \sigma(\mathbf{w}^{[1]T}\sigma(W^{[0]}\mathbf{x}))$
x1
Example x2
1
1
1
x
yW (x )
kD, 4 Classes, 4 hidden layer network

Input layer → Hidden Layers 1, 2, 3, 4 → Output layer

$y_W(\mathbf{x}) = W^{[4]}\,\sigma\!\left(W^{[3]}\,\sigma\!\left(W^{[2]}\,\sigma\!\left(W^{[1]}\,\sigma\!\left(W^{[0]}\mathbf{x}\right)\right)\right)\right)$

[Diagram: inputs x1, …, xk; four outputs $y_{W,0}(\mathbf{x}), \dots, y_{W,3}(\mathbf{x})$]
kD, 4 Classes, 4 hidden layer network

Input layer → Hidden Layers 1–4 → Output layer → Softmax → Cross entropy

[Diagram: layers $W^{[0]}, W^{[1]}, W^{[2]}, W^{[3]}, W^{[4]}$; the four outputs go through exp and a normalization (NORM) to give $y_{W,0}(\mathbf{x}), \dots, y_{W,3}(\mathbf{x})$]

$y_W(\mathbf{x}) = \mathrm{softmax}\!\left(W^{[4]}\,\sigma\!\left(W^{[3]}\,\sigma\!\left(W^{[2]}\,\sigma\!\left(W^{[1]}\,\sigma\!\left(W^{[0]}\mathbf{x}\right)\right)\right)\right)\right)$
How to make a prediction?

Forward pass: propagate the input through the network one step at a time.

$\mathbf{x}$
$W^{[0]}\mathbf{x}$
$\sigma(W^{[0]}\mathbf{x})$
$W^{[1]}\sigma(W^{[0]}\mathbf{x})$
$\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x}))$
$\mathbf{w}^T\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x}))$
$y_W(\mathbf{x}) = \sigma(\mathbf{w}^T\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x})))$

then evaluate the loss:

$l(y_W(\mathbf{x}), t) = -\left[ t \ln y_W(\mathbf{x}) + (1 - t) \ln(1 - y_W(\mathbf{x})) \right]$
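The whole forward pass fits in a few numpy lines (a minimal sketch; W0, W1 and w are assumed already trained, and bias terms are omitted to keep it short):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W0, W1, w):
    # x: input vector; W0, W1: hidden layer weights; w: output weights
    B = sigmoid(W0 @ x)        # first hidden layer
    D = sigmoid(W1 @ B)        # second hidden layer
    return sigmoid(w @ D)      # output probability y_W(x)

def loss(y, t):
    # binary cross entropy for one sample
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))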
How to optimize the network?

0 - From:

$W = \arg\min_W \sum_{n=1}^{N} l(y_W(\mathbf{x}_n), t_n) + \lambda R(W)$, with $R(W) = \|W\|_1$ or $\|W\|_2$
How to optimize the network?

1 - Choose a loss $l(y_W(\mathbf{x}_n), t_n)$, for example:
• Hinge loss
• Cross entropy

2 - Update every weight with gradient descent:

$W_{a,b}^{[c]} \leftarrow W_{a,b}^{[c]} - \eta\,\frac{\partial}{\partial W_{a,b}^{[c]}}\left[\sum_{n=1}^{N} l(y_W(\mathbf{x}_n), t_n) + \lambda R(W)\right]$
How to optimize the network?

[Diagram: network computing $y_W(\mathbf{x}) = \sigma(\mathbf{w}^{[2]T}\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x})))$]

The derivatives $\frac{\partial}{\partial W_{a,b}^{[c]}}\sum_{n=1}^{N} l(y_W(\mathbf{x}_n), t_n)$ are computed by Backpropagation.
[Diagram: the same network; the forward pass produces $y_W(\mathbf{x}) = \sigma(\mathbf{w}^T\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x})))$ and the loss $l(y_W(\mathbf{x}), t) = -\left[t \ln y_W(\mathbf{x}) + (1 - t)\ln(1 - y_W(\mathbf{x}))\right]$]
Name the intermediate results of the forward pass:

$A = W^{[0]}\mathbf{x}$
$B = \sigma(A)$
$C = W^{[1]}B$
$D = \sigma(C)$
$E = \mathbf{w}^T D$
$y_W(\mathbf{x}) = \sigma(E)$
$l(y_W(\mathbf{x}), t)$

[Diagram: the network annotated with A, B, C, D, E from input to output, so that $y_W(\mathbf{x}) = \sigma(\mathbf{w}^T\sigma(W^{[1]}\sigma(W^{[0]}\mathbf{x})))$ and the loss $l(y_W(\mathbf{x}), t)$ is evaluated at the end]
Training requires $\frac{\partial}{\partial W^{[l]}}\sum_{n=1}^{N} l(y_W(\mathbf{x}_n), t_n)$ for every layer $l$ ⇒ the Chain rule.
Chain rule recap

$f(u) = u^2$, $u(v) = 2v$, $v(x) = 1/x$. What is $\frac{\partial f}{\partial x}$?

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial v}\frac{\partial v}{\partial x} = 2u \cdot 2 \cdot \left(-\frac{1}{x^2}\right) = -\frac{8}{x^3}$
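A quick numerical check of this result (central finite difference vs. the $-8/x^3$ formula):

x = 2.0
h = 1e-6
f = lambda x: (2.0 * (1.0 / x)) ** 2          # f(u(v(x))) = (2/x)^2
numeric = (f(x + h) - f(x - h)) / (2 * h)     # central finite difference
analytic = -8.0 / x ** 3
print(numeric, analytic)                      # both ~ -1.0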
[Diagram: the network annotated with A, B, C, D, E]

$\frac{\partial\, l(y_W(\mathbf{x}), t)}{\partial W^{[0]}} = \frac{\partial\, l(y_W(\mathbf{x}), t)}{\partial y_W(\mathbf{x})}\,\frac{\partial y_W(\mathbf{x})}{\partial E}\,\frac{\partial E}{\partial D}\,\frac{\partial D}{\partial C}\,\frac{\partial C}{\partial B}\,\frac{\partial B}{\partial A}\,\frac{\partial A}{\partial W^{[0]}}$
Back propagation

$\frac{\partial\, l(y_W(\mathbf{x}), t)}{\partial W^{[0]}} = \frac{\partial\, l(y_W(\mathbf{x}), t)}{\partial y_W(\mathbf{x})}\,\frac{\partial y_W(\mathbf{x})}{\partial E}\,\frac{\partial E}{\partial D}\,\frac{\partial D}{\partial C}\,\frac{\partial C}{\partial B}\,\frac{\partial B}{\partial A}\,\frac{\partial A}{\partial W^{[0]}}$

[Diagram: the same network; each factor is computed from the output back to the input]
Activation functions

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
3 problems:
• Output is not zero-centered
• Small gradient when |input| is large (saturation)
• exp() is expensive to compute

Tanh(x): zero-centered, but still saturates.

ReLU(x) (Rectified Linear Unit):
• No gradient saturation (for positive inputs)
• Super fast

Leaky ReLU(x):
• No gradient saturation
• Super fast
• The 0.01 slope is a hyperparameter

Parametric ReLU(x): the slope is learned during training.

(Figures: https://ml4a.github.io/ml4a/neural_networks/)
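The activations above, each a one-liner in numpy (for PReLU the 0.01 would be a learned parameter instead of a constant):

import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)   # 0.01: the fixed slope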
How to classify an image?

[Diagram: every pixel (Pixel 1 … Pixel 14, plus a bias) is fed to a fully connected layer of neurons; image from https://ml4a.github.io/ml4a/neural_networks/]
Many parameters (7,850 in Layer 1)

[Same diagram with a 28×28 image: 784 pixels, every one connected to every neuron; image from https://ml4a.github.io/ml4a/neural_networks/]
Too many parameters (655,370 in Layer 1)

[Same diagram with a 256×256 image: 65,536 pixels (Pixel 1 … Pixel 65536); image from https://ml4a.github.io/ml4a/neural_networks/]
Waaay too many parameters (160M in Layer 1)

[Same diagram with a 256×256×256 volume: 16M inputs (Pixel 1 … Pixel 16M); image from https://ml4a.github.io/ml4a/neural_networks/]
Full connections are too many

[Diagram: every input connects to every neuron; each neuron outputs $\sigma(F_i(\mathbf{x})) \in [0, 1]$]

A 150-D input vector with 150 neurons in Layer 1 ⇒ 22,500 parameters!!
No full connection

[Diagram: each neuron connects only to a small window (here 3 values) of the input]

A 150-D input vector with 150 neurons in Layer 1 ⇒ 450 parameters!!
Share weights

[Diagram: every neuron applies the same 3 weights (w0, w1, w2) to its own window of the input]

All neurons share the same weights ⇒ Layer 1 now has only 3 parameters, whatever the input size.
[Worked example: the input sequence (2.3, 1.7, 4.0, 2.8, 5.7, 4.4, …) is processed with the shared weights (0.1, 0.2, 0.3); sliding the 3-value window one step at a time produces the outputs (0.25, 0.45, 0.50, …)]

This sliding dot product is a Convolution!
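Ignoring any activation applied on the slides' outputs, the sliding dot product is a plain 1-D convolution (numbers taken from the example above):

import numpy as np

x = np.array([2.3, 1.7, 4.0, 2.8, 5.7, 4.4])
k = np.array([0.1, 0.2, 0.3])                 # the shared weights

# slide the kernel over the input: one dot product per position
out = np.array([x[i:i+3] @ k for i in range(len(x) - 2)])
print(out)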
$F(\mathbf{x}) = \left( \mathbf{x} * W_0^{[0]},\; \mathbf{x} * W_1^{[0]},\; \mathbf{x} * W_2^{[0]},\; \mathbf{x} * W_3^{[0]},\; \mathbf{x} * W_4^{[0]} \right)$

Five kernels give a 5-feature-map convolution layer; K kernels give a K-feature-map convolution layer.
Conv layer 1 → POOLING LAYER

Pooling layer — Goals:
• Reduce the spatial resolution of feature maps
• Lower memory and computation requirements
• Provide partial invariance to position, scale and rotation

[Illustrations: max pooling and average pooling (stride = 1); images from https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/]
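A minimal numpy sketch of 2×2 max pooling (stride 2 here, for clarity):

import numpy as np

def max_pool_2x2(f):
    # f: H x W feature map with even H and W; keep the max of each 2x2 block
    H, W = f.shape
    return f.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

f = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(f))   # [[ 5.  7.] [13. 15.]]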
2 Class CNN

Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → Fully connected layers → $y_W(\mathbf{x})$

$l(y_W(\mathbf{x}), t) = -\left[t \ln y_W(\mathbf{x}) + (1 - t)\ln(1 - y_W(\mathbf{x}))\right]$
K Class CNN

Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → Fully connected layers → SOFTMAX

[exp of each output $y_{w_0}(\mathbf{x}), y_{w_1}(\mathbf{x}), y_{w_2}(\mathbf{x}), \dots$ followed by a normalization]

$l(y_W(\mathbf{x}), \mathbf{t}) = -\sum_k t_k \ln y_{W_k}(\mathbf{x})$
Nice example from the literature

S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks", arXiv:1806.07589, 2018
Learn image-based characteristics

http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf
Batch processing

Example: $\mathbf{x} = (0.4, 1.0)$, $\mathbf{w} = (2.0, 3.6, 0.5)$

[Diagram: 1, x1, x2 → $\mathbf{w}^T\mathbf{x}$ → σ]

$\mathbf{w}^T\mathbf{x} = -1.94$, so $y_\mathbf{w}(\mathbf{x}) = \sigma(-1.94) = \frac{1}{1 + e^{1.94}} = 0.125$

Written as a matrix product over the augmented input:

$y_\mathbf{w}(\mathbf{x}_a) = \sigma\!\left(\mathbf{w}^T \begin{pmatrix} 1 \\ 0.4 \\ 1.0 \end{pmatrix}\right) = 0.125$
Two samples: $\mathbf{x}_a = (0.4, 1.0)$, $\mathbf{x}_b = (2.1, 3.0)$

$y_\mathbf{w}(\mathbf{x}_a) = \sigma\!\left(\mathbf{w}^T \begin{pmatrix} 1 \\ 0.4 \\ 1.0 \end{pmatrix}\right) = 0.125 \qquad y_\mathbf{w}(\mathbf{x}_b) = \sigma\!\left(\mathbf{w}^T \begin{pmatrix} 1 \\ 2.1 \\ 3.0 \end{pmatrix}\right) = 0.99$
Mini-batch processing

Stack the samples as the columns of a matrix and apply the weights once:

$y_\mathbf{w}([\mathbf{x}_a, \mathbf{x}_b]) = \sigma\!\left(\mathbf{w}^T \begin{pmatrix} 1 & 1 \\ 0.4 & 2.1 \\ 1.0 & 3.0 \end{pmatrix}\right) = (0.125, 0.99)$

With a mini-batch of four samples, the same single product returns four predictions at once: $(0.89, 0.2, 0.125, 0.99)$.
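In numpy, a mini-batch is indeed just one matrix product (a sketch with illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, 3.6, 0.5])     # (w0, w1, w2)
X = np.array([[1.0, 1.0],         # row of ones (bias inputs)
              [0.4, 2.1],         # x1 of each sample
              [1.0, 3.0]])        # x2 of each sample

print(sigmoid(w @ X))             # one prediction per column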
Mini-batch processing

[Illustration: a mini-batch of 4 images (horse, dog, truck, …) goes through the network → 4 predictions]
Classical applications of ConvNets

Classification

S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks", arXiv:1806.07589, 2018
Classical applications of ConvNets

Image segmentation

Tran, P. V. (2016), "A Fully Convolutional Neural Network for Cardiac Segmentation in Short-Axis MRI", arXiv:1604.00494
Fang Liu, Zhaoye Zhou, et al. (2018), "Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging", Magnetic Resonance in Medicine, DOI:10.1002/mrm.26841
Havaei M., Davy A., Warde-Farley D., Biard A., Courville A., Bengio Y., Pal C., Jodoin P.-M., Larochelle H. (2017), "Brain Tumor Segmentation with Deep Neural Networks", Medical Image Analysis, Vol. 35, 18-31
Classical applications of ConvNets

Localization

S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks", arXiv:1806.07589, 2018
Conclusion
• Linear classification (1-neuron network)
• Logistic regression
• Multilayer perceptron
• Conv Nets
• Many buzz words
  – Softmax
  – Loss
  – Batch
  – Gradient descent
  – Etc.