
Basics of deep learning

By

Pierre-Marc Jodoin and Christian Desrosiers


Let's start with a simple example

Training dataset D — each patient is a feature vector x = (temp, freq) with a label t = diagnostic:

  Patient 1   (37.5, 72)    Healthy
  Patient 2   (39.1, 103)   Sick
  Patient 3   (38.3, 100)   Sick
  (…)
  Patient N   (36.7, 88)    Healthy

[Scatter plot: pulse rate (freq) vs. temperature (temp); the Sick and Healthy patients form two clusters]
Let's start with a simple example

A new patient comes to the hospital.
How can we determine whether they are sick or not?

[Scatter plot: the new patient's (temp, freq) point plotted among the Sick and Healthy clusters]
Solution

Divide the feature space into 2 regions: sick and healthy.

[Scatter plot: a line splits the (temp, freq) plane into a Sick region and a Healthy region]
More formally

y(x) = Healthy  if x is in the blue region
       Sick     otherwise

[Scatter plot: the t = Sick and t = Healthy regions in the (temp, freq) plane]

How to split the feature space?
Definition … a line!

y = m x + b        (m: slope, b: bias)

With m = Δy / Δx:

y = (Δy / Δx) x + b
y Δx = Δy x + b Δx
0 = Δy x − Δx y + b Δx

Rename variables

0 = Δy x − Δx y + b Δx   becomes   0 = w1 x + w2 y + w0

and, writing x as x1 and y as x2:

0 = w1 x1 + w2 x2 + w0
Classification function

y_w(x) = w1 x1 + w2 x2 + w0

[Plot in the (x1, x2) plane: the line y_w(x) = 0 separates the region y_w(x) > 0 from the region y_w(x) < 0; N = (w1, w2) is the normal vector of the line]
Classification function

y_w(x) = w1 x1 + w2 x2 + w0,   with w1 = −1.0, w2 = 2.0, w0 = −4.0

Point (1, 6):  w1 x1 + w2 x2 + w0 = 7 > 0
Point (2, 3):  w1 x1 + w2 x2 + w0 = 0     (on the line)
Point (4, 1):  w1 x1 + w2 x2 + w0 = −6 < 0
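A minimal NumPy check of this example. The weight signs are reconstructed from the three scores above (w1 = −1.0, w2 = 2.0, w0 = −4.0), so treat them as an assumption rather than the slide's exact values:

import numpy as np

# Linear classification function y_w(x) = w0 + w1*x1 + w2*x2 (bias first).
w = np.array([-4.0, -1.0, 2.0])

def y_w(x1, x2):
    return w @ np.array([1.0, x1, x2])   # dot product with the bias folded in

for point in [(1, 6), (2, 3), (4, 1)]:
    print(point, y_w(*point))            # 7.0 (above), 0.0 (on), -6.0 (below the line)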
Classification function

y_w(x) = w1 x1 + w2 x2 + w0
       = (w0, w1, w2) · (1, x1, x2)      (DOT product)
       = w^T x'                          with x' = (1, x1, x2)

[Plot: the line y_w(x) = 0 splits the (x1, x2) plane into the regions y_w(x) > 0 and y_w(x) < 0; N = (w1, w2) is its normal vector]

To simplify notation

Linear classification = dot product with the bias included:

y_w(x) = w^T x
Learning

With the training dataset D, the GOAL is to find the parameters (w0, w1, w2) that best separate the two classes.

[Plot: the line y_w(x) = 0 separates the Sick region (y_w(x) > 0) from the Healthy region (y_w(x) < 0) in the (temp, freq) plane]

How do we know if a model is good?

Loss function

[Three plots of the same training data with three different separating lines]

L(y_w(x), D) ≈ 0        L(y_w(x), D) > 0        L(y_w(x), D) >> 0
Good!                   Medium                  BAD!

So far…

1. Training dataset: D
2. Classification function (a line in 2D): y_w(x) = w1 x1 + w2 x2 + w0
3. Loss function: L(y_w(x), D)
4. Training: find (w0, w1, w2) that minimize L(y_w(x), D)
Today

Perceptron
Logistic regression
Multi-layer perceptron
Conv Nets

Perceptron

Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in
the Brain, Psychological Review, v65, No. 6, pp. 386–408
Perceptron
(2D and 2 classes)

[Network diagram: inputs 1, x1, x2 weighted by w0, w1, w2 feed a node computing w^T x]
[Plot: the line y_w(x) = 0 splits the (x1, x2) plane into y_w(x) > 0 and y_w(x) < 0]

y_w(x) = w0 + w1 x1 + w2 x2 = w^T x

Activation function: the sign

y_w(x) = sign(w^T x) ∈ {−1, +1}

[Network diagram: w^T x is followed by a sign activation]

Neuron = dot product + activation function
So far…

1. Training dataset: D
2. Classification function (a line in 2D): y_w(x) = sign(w^T x), with w^T x = w1 x1 + w2 x2 + w0
3. Loss function: L(y_w(x), D)
4. Training: find (w0, w1, w2) that minimize L(y_w(x), D)

Linear classifiers have limits

Non-linearly separable training data: a linear classifier gives a large error rate.

Three classical solutions:

1. Acquire more observations
2. Use a non-linear classifier
3. Transform the data
Acquire more data

Original dataset D — x = (temp, freq), t = diagnostic:
  Patient 1   (37.5, 72)    healthy
  Patient 2   (39.1, 103)   sick
  Patient 3   (38.3, 100)   sick
  (…)
  Patient N   (36.7, 88)    healthy

Augmented dataset D — x = (temp, freq, headache), t = diagnostic:
  Patient 1   (37.5, 72, 2)    healthy
  Patient 2   (39.1, 103, 8)   sick
  Patient 3   (38.3, 100, 6)   sick
  (…)
  Patient N   (36.7, 88, 0)    healthy
Non-linearly separable training data

In 2D:  y_w(x) = w1 x1 + w2 x2 + w0            (line)
In 3D:  y_w(x) = w1 x1 + w2 x2 + w3 x3 + w0    (plane)

Perceptron
(2D and 2 classes)

y_w(x) = sign(w^T x) ∈ {−1, +1},   w^T x = w1 x1 + w2 x2 + w0   (line)

[Network diagram: inputs 1, x1, x2 weighted by w0, w1, w2, then w^T x and a sign activation]

Example in 3D

Perceptron
(3D and 2 classes)

y_w(x) = sign(w^T x) ∈ {−1, +1},   w^T x = w1 x1 + w2 x2 + w3 x3 + w0   (plane)

[Network diagram: inputs 1, x1, x2, x3 weighted by w0, …, w3; the regions y_w(x) = +1 and y_w(x) = −1 lie on either side of the plane]

Perceptron
(K-D and 2 classes)

y_w(x) = sign(w^T x) ∈ {−1, +1},   w^T x = w1 x1 + w2 x2 + … + wK xK + w0   (hyperplane)

[Network diagram: inputs 1, x1, …, xK weighted by w0, w1, …, wK, then w^T x and a sign activation]
Learning a machine

The goal: given a training set D = {(x1, t1), (x2, t2), …, (xN, tN)}, estimate w so that

y_w(xn) ≈ tn   for every n.

In other words, minimize the training loss

L(y_w(x), D) = Σ_{n=1}^{N} l( y_w(xn), tn )

=> an optimization problem.

Loss function

[Three plots of the training data with different separating lines, from good (small loss) to bad (large loss)]
[Surface plot: the loss L(y_w(x), D) as a function of the weights (w1, w2)]

Perceptron

Question: how do we find the best solution, i.e. the w for which L(y_w(x), D) ≈ 0?

Random initialization:

[w1,w2]=np.random.randn(2)

[Surface plot: the randomly initialized point on the loss surface over (w1, w2)]

Gradient descent

w[k+1] = w[k] − η ∇L(y_w[k](x), D)

where ∇L is the gradient of the loss function and η the learning rate.
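A minimal sketch of this update rule, assuming a generic differentiable loss; grad_L below is a placeholder for the gradient of whatever loss is chosen:

import numpy as np

def gradient_descent(grad_L, w0, lr=0.1, max_iter=100):
    """Repeats w[k+1] = w[k] - lr * grad_L(w[k])."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w = w - lr * grad_L(w)
    return w

# Toy usage on L(w) = ||w||^2, whose gradient is 2w; the minimum is w = 0.
print(gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0]))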
Perceptron criterion (loss)

Observation: a sample is wrongly classified when

w^T xn > 0 and tn = −1,   or   w^T xn < 0 and tn = +1.

Consequently, −w^T xn tn is ALWAYS positive for wrongly classified samples.

Perceptron criterion (loss)

L(y_w(x), D) = − Σ_{xn ∈ V} w^T xn tn,   where V is the set of wrongly classified samples.

[Plot: for the separating line shown, L(y_w(x), D) = 464.15]
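A small sketch of this criterion under the conventions above (labels tn ∈ {−1, +1}, inputs already augmented with a leading 1); only the samples with w^T xn tn ≤ 0 contribute. The toy data below is made up:

import numpy as np

def perceptron_loss(w, X, t):
    """L(w) = -sum over the misclassified samples of (w^T xn) * tn  (always >= 0)."""
    scores = (X @ w) * t                 # (w^T xn) * tn for every sample
    wrong = scores <= 0                  # V: the wrongly classified samples
    return -np.sum(scores[wrong])

X = np.array([[1, 2.0, 1.0], [1, -1.0, -2.0], [1, 0.5, -0.5]])   # rows = augmented samples
t = np.array([1, -1, -1])
print(perceptron_loss(np.array([0.0, 1.0, -1.0]), X, t))          # 2.0: two samples are misclassified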
So far…

1. Training dataset: D
2. Linear classification function: y_w(x) = sign(w^T x) ∈ {−1, +1}, with w^T x = w1 x1 + w2 x2 + … + wM xM + w0
   [Network diagram: inputs 1, x1, …, xK weighted by w0, w1, …, wK, then w^T x and a sign activation]
3. Loss function: L(y_w(x), D) = − Σ_{xn ∈ V} w^T xn tn
4. Training: find the w that minimizes L(y_w(x), D):

   w* = arg min_w L(y_w(x), D)

   [Surface plot: the minimum of the loss surface, where L(y_w(x), D) ≈ 0]

Optimization

w[k+1] = w[k] − η[k] ∇L

where ∇L is the gradient of the loss function and η[k] the learning rate.

Stochastic gradient descent (SGD):

Init w
k = 0
DO k = k+1
   FOR n = 1 to N
      w ← w − η[k] ∇L(xn)
UNTIL every sample is well classified or k == MAX_ITER
Perceptron gradient descent

L(y_w(x), D) = − Σ_{xn ∈ V} w^T xn tn
∇L(y_w(x), D) = − Σ_{xn ∈ V} tn xn

Stochastic gradient descent (SGD):

Init w
k = 0
DO k = k+1
   FOR n = 1 to N
      IF w^T xn tn ≤ 0 THEN   /* wrongly classified */
         w ← w + η tn xn
UNTIL every sample is well classified OR k == k_MAX

NOTE on the learning rate η:
• Too low  => slow convergence
• Too large => might not converge (may even diverge)
• Can decrease at each iteration (e.g. η[k] = cst / k)
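A runnable sketch of this SGD loop (NumPy, labels in {−1, +1}, bias folded into the inputs); the separable toy data below is made up:

import numpy as np

def perceptron_sgd(X, t, lr=1.0, max_iter=100):
    """Perceptron SGD: w += lr * tn * xn on each misclassified sample."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        errors = 0
        for xn, tn in zip(X, t):
            if tn * (w @ xn) <= 0:           # wrongly classified
                w += lr * tn * xn
                errors += 1
        if errors == 0:                       # every sample is well classified
            break
    return w

X = np.array([[1, 2.0, 3.0], [1, 3.0, 4.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron_sgd(X, t)
print(w, np.sign(X @ w))                      # the predictions should match t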
Similar loss functions

L(y_w(x), D) = − Σ_{xn ∈ V} w^T xn tn,   where V is the set of wrongly classified samples

L(y_w(x), D) = Σ_{n=1}^{N} max(0, −tn w^T xn)

L(y_w(x), D) = Σ_{n=1}^{N} max(0, 1 − tn w^T xn)     “Hinge Loss” or “SVM” Loss
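A small sketch of the hinge loss under the same conventions (labels in {−1, +1}, bias folded into the inputs); data and weights are illustrative only:

import numpy as np

def hinge_loss(w, X, t):
    """Sum over n of max(0, 1 - tn * w^T xn)  (the "SVM" loss above)."""
    margins = 1 - t * (X @ w)
    return np.sum(np.maximum(0.0, margins))

X = np.array([[1, 2.0, 3.0], [1, 3.0, 4.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
t = np.array([1, 1, -1, -1])
print(hinge_loss(np.array([0.0, 0.2, 0.2]), X, t))   # 0.8: two samples are inside the margin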
Multiclass Perceptron
(2D and 3 classes)

[Network diagram: inputs 1, x1, x2 feed three linear units w0^T x, w1^T x, w2^T x; the predicted class is arg max_i y_W,i(x)]
[Plot: the plane is split into three regions, one where each y_W,i(x) is max]

y_W,0(x) = w0^T x = w0,0 + w0,1 x1 + w0,2 x2
y_W,1(x) = w1^T x = w1,0 + w1,1 x1 + w1,2 x2
y_W,2(x) = w2^T x = w2,0 + w2,1 x1 + w2,2 x2

In matrix form, with x' = (1, x1, x2):

y_W(x) = W x',   where the rows of W are the wi^T

Example: for the input x = (1.1, −2.0), y_W(x) = W (1, 1.1, −2.0)^T produces one score per class; the class with the largest score is predicted.
Multiclass Perceptron
Loss function

L(y_W(x), D) = Σ_{xn ∈ V} ( wj^T xn − w_{tn}^T xn )

• sum over all wrongly classified samples (xn ∈ V)
• wj^T xn: score of the wrong (predicted) class
• w_{tn}^T xn: score of the true class

Its gradient with respect to each weight vector is ±xn: +xn for wj and −xn for w_{tn} (hence the SGD update below).
Multiclass Perceptron

Stochastic gradient descent (SGD):

Init W
k = 0
DO k = k+1
   FOR n = 1 to N
      j = arg max_i  wi^T xn
      IF j ≠ tn THEN   /* wrongly classified sample */
         wj  ← wj  − η xn
         wtn ← wtn + η xn
UNTIL every sample is well classified or k > K_MAX
Perceptron
Advantages:
• Very simple
• Does NOT assume the data follows a Gaussian distribution.
• If data is linearly separable, convergence is guaranteed.

Limitations:
• Zero gradient for many solutions => several “perfect solutions”
• Data must be linearly separable
[Plots: many “optimal” solutions on separable data; a suboptimal solution; non-separable data on which the perceptron will never converge]
Two famous ways of improving the Perceptron

1. New activation function + new loss => Logistic regression
2. New network architecture => Multilayer Perceptron / CNN
Logistic regression

Logistic regression
(2D, 2 classes)

New activation function: the sigmoid

σ(t) = 1 / (1 + e^(−t))

[Network diagram: inputs 1, x1, x2 weighted by w0, w1, w2, then w^T x followed by σ; output y_w(x) ∈ [0, 1]]

Neuron:

y_w(x) = σ(w^T x) = 1 / (1 + e^(−w^T x))

[Plot: far on one side of the line w^T x = 0, y_w(x) → 1; on the line, y_w(x) = 0.5; far on the other side, y_w(x) → 0]
Logistic regression
(2D, 2 classes)

Example: xn = (0.4, 1.0), w = (2.0, 3.6, 0.5)

w^T xn = −1.94
y_w(xn) = σ(−1.94) = 1 / (1 + e^{1.94}) = 0.125

Since 0.125 is lower than 0.5, xn lies on the negative side of the plane.
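A minimal sketch of the logistic neuron; the pre-activation −1.94 from the example is reused directly, since only the final score matters here:

import numpy as np

def sigmoid(t):
    """sigma(t) = 1 / (1 + exp(-t))"""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_neuron(w, x):
    """y_w(x) = sigma(w^T x'), with x' = (1, x1, ..., xK) so the bias is folded in."""
    return sigmoid(w @ np.concatenate(([1.0], x)))

print(sigmoid(-1.94))    # ~0.125, as in the example above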
Logistic regression
(K-D, 2 classes)

Like the perceptron, logistic regression accommodates K-D input vectors:

y_w(x) = σ(w^T x)

[Network diagram: inputs 1, x1, …, xK weighted by w0, w1, …, wK, then w^T x followed by σ]

What is the loss function?

With a sigmoid, we can simulate the conditional probability of class c1 GIVEN x:

y_w(x) = σ(w^T x) ≈ P(c1 | x)
P(c0 | w, x) = 1 − y_w(x)
The cost function is the −ln of the likelihood:

L(y_w(x), D) = − Σ_{n=1}^{N} [ tn ln y_w(xn) + (1 − tn) ln(1 − y_w(xn)) ]

=> the 2-class cross entropy.

We can also show that

dL(y_w(x), D)/dw = Σ_{n=1}^{N} ( y_w(xn) − tn ) xn

As opposed to the perceptron, the gradient does not depend only on the wrongly classified samples: every sample contributes.
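A small sketch of the 2-class cross entropy and its gradient for a logistic neuron, assuming labels tn ∈ {0, 1} and inputs already augmented with a leading 1; the data is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, t):
    """L = -sum_n [ tn*ln(y_n) + (1-tn)*ln(1-y_n) ],  with y_n = sigma(w^T xn)."""
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def cross_entropy_grad(w, X, t):
    """dL/dw = sum_n (y_n - tn) * xn  -- every sample contributes."""
    y = sigmoid(X @ w)
    return X.T @ (y - t)

X = np.array([[1, 0.5, 1.0], [1, -1.0, 0.2], [1, 2.0, -0.3]])
t = np.array([1.0, 0.0, 1.0])
w = np.zeros(3)
print(cross_entropy(w, X, t), cross_entropy_grad(w, X, t))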
Logistic network
Advantages:
• More stable than the perceptron
• More effective when the data is non-separable

[Plots: decision boundaries of the perceptron vs. the logistic net on the same data]

And for K > 2 classes?
New activation function: the Softmax

[Network diagram: inputs 1, x1, x2 feed three linear units w0^T x, w1^T x, w2^T x; their exponentials are normalized so that y_w,i(x) ≈ P(ci | W, x)]

Softmax:

y_w,i(x) = e^{wi^T x} / Σ_c e^{wc^T x}
And for K > 2 classes?

Class labels: one-hot vectors (CIFAR-10)

'airplane'    t = (1 0 0 0 0 0 0 0 0 0)
'automobile'  t = (0 1 0 0 0 0 0 0 0 0)
'bird'        t = (0 0 1 0 0 0 0 0 0 0)
'cat'         t = (0 0 0 1 0 0 0 0 0 0)
'deer'        t = (0 0 0 0 1 0 0 0 0 0)
'dog'         t = (0 0 0 0 0 1 0 0 0 0)
'frog'        t = (0 0 0 0 0 0 1 0 0 0)
'horse'       t = (0 0 0 0 0 0 0 1 0 0)
'ship'        t = (0 0 0 0 0 0 0 0 1 0)
'truck'       t = (0 0 0 0 0 0 0 0 0 1)

K-class cross-entropy loss

L(y_W(x), D) = − Σ_{n=1}^{N} Σ_{k=1}^{K} tkn ln y_W,k(xn)

and, with a softmax output, its gradient has the same form as in the 2-class case:

∇L(y_W(x), D) = Σ_{n=1}^{N} ( y_W(xn) − tn ) xn^T
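A sketch of the softmax activation and the K-class cross entropy with a one-hot target, under the same conventions as above (the scores are W x with the bias folded in); the numbers are illustrative:

import numpy as np

def softmax(scores):
    """exp(score_i) / sum_c exp(score_c), computed in a numerically stable way."""
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

def k_class_cross_entropy(probs, t_onehot):
    """L = -sum_k t_k * ln(y_k) for one sample."""
    return -np.sum(t_onehot * np.log(probs))

scores = np.array([2.0, 0.5, -1.0])        # W x for 3 classes
t = np.array([1.0, 0.0, 0.0])              # one-hot label: class 0
p = softmax(scores)
print(p, k_class_cross_entropy(p, t))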
Regularization

Different weights may give the same score:

x = (1.0, 1.0, 1.0)
w1 = (1, 0, 0)^T
w2 = (1/3, 1/3, 1/3)^T

w1^T x = w2^T x = 1

Solution: maximum a posteriori.

Maximum a posteriori (regularization)

arg min_W  [ L(y_w(x), D) + λ R(W) ]

where L is the loss function, R the regularization term and λ a constant.
In general R is the L1 or L2 norm:  R(W) = ||W||_1 or ||W||_2
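A tiny sketch of this objective: a data loss plus λ times a penalty. The squared L2 norm used below is a common variant of the L2 penalty named on the slide, and data_loss stands for any of the losses above:

import numpy as np

def l2_penalty(W):
    """R(W) = ||W||^2 (squared L2 norm of all the weights)."""
    return np.sum(W ** 2)

def map_objective(data_loss, W, lam=0.01):
    """L(y_W(x), D) + lambda * R(W)."""
    return data_loss + lam * l2_penalty(W)

W = np.array([[1.0, 0.0, 0.0], [1/3, 1/3, 1/3]])
print(map_objective(data_loss=0.7, W=W, lam=0.01))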
Wow! Loooots of information!

Let's recap…
Neural networks (2 classes)

Perceptron:            y_w(x) = sign(w^T x) ∈ {−1, +1}     (sign activation)
Logistic regression:   y_w(x) = σ(w^T x) ≈ P(c1 | x)       (sigmoid activation)

[Network diagrams: inputs 1, x1, …, xK feed w^T x, followed by a sign (perceptron) or a sigmoid (logistic regression)]

K-class neural networks

Perceptron:            predict arg max_i y_W,i(x), with y_W,i(x) = wi^T x
Logistic regression:   softmax activation, y_W,i(x) = e^{wi^T x} / Σ_c e^{wc^T x} ≈ P(ci | x)

[Network diagrams: inputs 1, x1, x2 feed K linear units wi^T x; the perceptron takes the arg max, the logistic net applies exp and normalizes (softmax)]

Loss functions (2 classes)

L(y_w(x), D) = − Σ_{xn ∈ V} tn w^T xn,   where V is the set of wrongly classified samples

L(y_w(x), D) = Σ_{n=1}^{N} max(0, −tn w^T xn)

L(y_w(x), D) = Σ_{n=1}^{N} max(0, 1 − tn w^T xn)     “Hinge Loss” or “SVM” Loss

L(y_w(x), D) = − Σ_{n=1}^{N} [ tn ln y_w(xn) + (1 − tn) ln(1 − y_w(xn)) ]     Cross-entropy loss
Loss functions (K classes)

L(y_W(x), D) = Σ_{xn ∈ V} ( wj^T xn − w_{tn}^T xn ),   where V is the set of wrongly classified samples

L(y_W(x), D) = Σ_{n=1}^{N} Σ_{j ≠ tn} max(0, wj^T xn − w_{tn}^T xn)

L(y_W(x), D) = Σ_{n=1}^{N} Σ_{j ≠ tn} max(0, 1 + wj^T xn − w_{tn}^T xn)     “Hinge Loss” or “SVM” Loss

L(y_W(x), D) = − Σ_{n=1}^{N} Σ_{k=1}^{K} tkn ln y_W,k(xn)     Cross-entropy loss with a Softmax

Maximum a posteriori

L(y_w(x), D) = Σ_{n=1}^{N} l( y_W(xn), tn ) + λ R(W)

where l is the per-sample loss, R the regularization term and λ a constant; R(W) = ||W||_1 or ||W||_2.

Optimization

w[k+1] = w[k] − η[k] ∇L

where ∇L is the gradient of the loss function and η[k] the learning rate.

Stochastic gradient descent (SGD):

Init w
k = 0
DO k = k+1
   FOR n = 1 to N
      w ← w − η[k] ∇L(xn)
UNTIL every sample is well classified or k == MAX_ITER
Now, let's go DEEPER

Non-linearly separable training data

Three classical solutions:

1. Acquire more data
2. Use a non-linear classifier
3. Transform the data
2D, 2 classes: linear logistic regression

y_w(x) = σ(w^T x) ∈ R

[Network diagram: an input layer (3 “neurons”: 1, x1, x2) connected to an output layer (1 neuron with a sigmoid)]

Let's add 3 neurons

[Network diagram: the input layer (1, x1, x2) fully connected to a first layer of 3 neurons computing σ(w0^T x), σ(w1^T x), σ(w2^T x) ∈ R]

NOTE: the output of the first layer is a vector of 3 real values,

σ( [ w0,0  w0,1  w0,2 ]   [ 1  ] )
 ( [ w1,0  w1,1  w1,2 ] · [ x1 ] )  ∈ R^3
 ( [ w2,0  w2,1  w2,2 ]   [ x2 ] )

i.e.  σ( W^[0] x ).
2-D, 2-class, 1 hidden layer

If we want a 2-class classification via a logistic regression (a cross-entropy loss), we must add an output neuron:

y_W(x) = σ( w^[1]T σ( W^[0] x ) ) ∈ R

[Network diagram: input layer (3 “neurons”: 1, x1, x2) → hidden layer (3 neurons) → output layer (1 neuron)]

Visual simplification:

[Simplified diagram: input (1, x1, x2) → W^[0] → hidden layer (+ bias neuron) → w^[1] → output]

This network contains a total of 13 parameters: 3×3 (input → hidden) + 1×4 (hidden → output).
2-D, 2-class, 1 hidden layer (more hidden neurons)

y_W(x) = σ( w^[1]T σ( W^[0] x ) )

[Diagram: input layer (1, x1, x2) → hidden layer of 5 neurons (+ bias) → output neuron]

Increasing the number of neurons = increasing the capacity of the model.
This network has 5×3 + 1×6 = 21 parameters.

Number of neurons vs. capacity

No hidden neuron           12 hidden neurons            60 hidden neurons
Linear classification      Non-linear classification    Non-linear classification
Underfitting               Good result                  Overfitting
(low capacity)             (good capacity)              (too large capacity)

http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

k-D, 2 classes, 1 hidden layer

y_W(x) = σ( w^[1]T σ( W^[0] x ) )

[Diagram: input layer (1, x1, …, xk) → hidden layer of 5 neurons (+ bias) → output neuron]

Increasing the dimensionality of the data = more columns in W^[0].
This network has 5×(k+1) + 1×6 parameters.
k-D, 2 classes, 2 hidden layers

y_W(x) = σ( w^[2]T σ( W^[1] σ( W^[0] x ) ) )

with W^[0] ∈ R^{5×(k+1)}, W^[1] ∈ R^{3×6}, w^[2] ∈ R^4

[Diagram: input layer → hidden layer 1 (5 neurons) → hidden layer 2 (3 neurons) → output neuron, each layer with a bias neuron]

Adding a hidden layer = adding a matrix multiplication.
This network has 5×(k+1) + 6×3 + 1×4 parameters.

k-D, 2 classes, 4 hidden layers

y_W(x) = σ( w^[4]T σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) ) )

with W^[0] ∈ R^{5×(k+1)}, W^[1] ∈ R^{3×6}, W^[2] ∈ R^{4×4}, W^[3] ∈ R^{7×5}, w^[4] ∈ R^8

[Diagram: input layer → hidden layers 1–4 → output neuron, each layer with a bias neuron]

This network has 5×(k+1) + 6×3 + 4×4 + 7×5 + 1×8 parameters.

NOTE: more hidden layers = deeper network = more capacity.
Multilayer Perceptron

[Diagram: the input vector x goes through several fully connected layers, f(x), producing the output y_W(x)]

x  →  f(x)  →  y_W(x)

Example

[Diagram: input data x → output of the last hidden layer → output of the network y_W(x)]

A K-class neural network has K output neurons.
k-D, 4 classes, 4 hidden layers

y_W(x) = W^[4] σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) )

with W^[0] ∈ R^{5×(k+1)}, W^[1] ∈ R^{3×6}, W^[2] ∈ R^{4×4}, W^[3] ∈ R^{7×5}, W^[4] ∈ R^{8×4}

[Diagram: input layer → hidden layers 1–4 → output layer with 4 neurons y_W,0(x), …, y_W,3(x)]

k-D, 4 classes, 4 hidden layers, trained with a hinge loss

y_W(x) = W^[4] σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) )

[Diagram: the 4 output scores y_W,0(x), …, y_W,3(x) feed a hinge loss]

k-D, 4 classes, 4 hidden layers, trained with a cross-entropy loss

y_W(x) = softmax( W^[4] σ( W^[3] σ( W^[2] σ( W^[1] σ( W^[0] x ) ) ) ) )

[Diagram: the 4 output scores are exponentiated and normalized (softmax) before the cross-entropy loss]
How to make a prediction?

Forward pass: propagate the input through the network one layer at a time,

x
W^[0] x
σ( W^[0] x )
W^[1] σ( W^[0] x )
σ( W^[1] σ( W^[0] x ) )
w^T σ( W^[1] σ( W^[0] x ) )
y_W(x) = σ( w^T σ( W^[1] σ( W^[0] x ) ) )

then evaluate the loss:

l( y_W(x), t ) = −t ln y_W(x) − (1 − t) ln( 1 − y_W(x) )
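A minimal NumPy sketch of this forward pass for a 2-D input, two hidden layers of 3 neurons and a sigmoid output followed by the cross-entropy loss; the weights are random placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W0, W1, w2, t):
    """x -> sigma(W0 x) -> sigma(W1 .) -> y = sigma(w2^T .) -> loss."""
    x = np.concatenate(([1.0], x))          # bias input
    h0 = sigmoid(W0 @ x)                    # first hidden layer
    h0 = np.concatenate(([1.0], h0))        # bias neuron
    h1 = sigmoid(W1 @ h0)                   # second hidden layer
    h1 = np.concatenate(([1.0], h1))
    y = sigmoid(w2 @ h1)                    # output neuron
    loss = -t * np.log(y) - (1 - t) * np.log(1 - y)
    return y, loss

rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 3))                # 3 neurons, inputs (1, x1, x2)
W1 = rng.normal(size=(3, 4))                # 3 neurons, inputs (1, h0)
w2 = rng.normal(size=4)                     # output neuron, inputs (1, h1)
print(forward(np.array([0.4, 1.0]), W0, W1, w2, t=1.0))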
How to optimize the network?

0- Start from

   W* = arg min_W  [ Σ_{n=1}^{N} l( y_W(xn), tn ) + λ R(W) ]

   and choose a regularization function R(W) = ||W||_1 or ||W||_2.

1- Choose a loss l( y_W(xn), tn ), for example a hinge loss or a cross entropy.
   Do not forget to adjust the output layer to the loss you have chosen (cross entropy => Softmax).

2- Compute the gradient of the loss with respect to each parameter,

   ∂/∂W[c]_{a,b}  [ Σ_{n=1}^{N} l( y_W(xn), tn ) + λ R(W) ]

   and launch a gradient descent algorithm to update the parameters:

   W[c]_{a,b} ← W[c]_{a,b} − η ∂/∂W[c]_{a,b} [ Σ_{n=1}^{N} l( y_W(xn), tn ) + λ R(W) ]

[Diagram: a 2-hidden-layer network computing y_W(x) = σ( w^[2]T σ( W^[1] σ( W^[0] x ) ) )]

The partial derivatives ∂/∂W[c]_{a,b} are computed by Backpropagation.

The network output feeds the loss:

l( y_W(x), t ) = −t ln y_W(x) − (1 − t) ln( 1 − y_W(x) )
Decompose the forward pass into intermediate values:

A = W^[0] x
B = σ( A )
C = W^[1] B
D = σ( C )
E = w^T D
y_W(x) = σ( E )
l( y_W(x), t )

[Diagram: each intermediate value A, B, C, D, E corresponds to one stage of the network, ending at the loss l( y_W(x), t )]

To compute ∂/∂W^[l] [ Σ_{n=1}^{N} l( y_W(xn), tn ) ]  =>  chain rule.
Chain rule recap

f(u) = u²,  u(v) = 2v,  v(x) = 1/x

∂f/∂x = ∂f/∂u · ∂u/∂v · ∂v/∂x = 2u · 2 · (−1/x²) = −8/x³
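A quick numerical check of this chain-rule result, comparing the analytic derivative with a finite difference:

import numpy as np

def f_of_x(x):
    """f(u) = u^2, u(v) = 2v, v(x) = 1/x  =>  f(x) = 4 / x^2."""
    v = 1.0 / x
    u = 2.0 * v
    return u ** 2

def analytic_grad(x):
    """Chain rule: df/dx = 2u * 2 * (-1/x^2) = -8 / x^3."""
    return -8.0 / x ** 3

x, eps = 1.5, 1e-6
numeric = (f_of_x(x + eps) - f_of_x(x - eps)) / (2 * eps)   # finite-difference estimate
print(numeric, analytic_grad(x))                            # both ~ -2.37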
With A = W^[0] x,  B = σ(A),  C = W^[1] B,  D = σ(C),  E = w^T D,  y_W(x) = σ(E):

∂l(y_W(x), t)/∂W^[0] = ∂l/∂y_W(x) · ∂y_W(x)/∂E · ∂E/∂D · ∂D/∂C · ∂C/∂B · ∂B/∂A · ∂A/∂W^[0]

Back propagation

[Diagram: the gradient flows backward through the network, from the loss l(y_W(x), t) down to W^[0]]
Activation functions

Sigmoid:  σ(x) = 1 / (1 + e^(−x))
3 problems:
• Gradient saturates when the input is large (−)
• Not zero-centered (−)
• exp() is an expensive operation (−)

tanh(x):
• Output is zero-centered (+)
• Small gradient when the input is large (−)
[LeCun et al., 1991]

ReLU(x) = max(0, x)   (Rectified Linear Unit)
• Large gradient for x > 0 (+)
• Super fast (+)
• Output not centered at zero (−)
• No gradient when x < 0 (−)
[Krizhevsky et al., 2012]

Leaky ReLU(x) = max(0.01x, x)
• No gradient saturation (+)
• Super fast (+)
• 0.01 is a hyperparameter (−)
[Maas et al., 2013] [He et al., 2015]

Parametric ReLU: PReLU(x) = max(αx, x)
• No gradient saturation (+)
• Super fast (+)
• α is learned with backprop (+)
[Maas et al., 2013] [He et al., 2015]
In practice
• By default, people use ReLU.

• Try Leaky ReLU / PReLU / ELU

• Try tanh, but it might be sub-optimal
• Do not use the sigmoid except at the output of a 2-class net.
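A small NumPy sketch of the activation functions discussed above; alpha in prelu stands for the learned parameter:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # saturates, not zero-centered

def tanh(x):
    return np.tanh(x)                       # zero-centered, still saturates

def relu(x):
    return np.maximum(0.0, x)               # the default choice

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)         # small slope for x < 0 (hyperparameter)

def prelu(x, alpha):
    return np.maximum(alpha * x, x)         # alpha is learned by backprop

z = np.linspace(-3, 3, 7)
print(relu(z), leaky_relu(z), prelu(z, alpha=0.2))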


How to classify an image?

Feed every pixel (plus a bias) into a fully connected layer:

• a small grayscale image already gives many parameters (7,850 in layer 1)
• a 256×256 image gives too many parameters (655,370 in layer 1)
• a 256×256×256 volume gives waaay too many parameters (160M in layer 1)

[Diagrams: each pixel (Pixel 1, Pixel 2, …) plus a bias is connected to every neuron of layer 1]

https://ml4a.github.io/ml4a/neural_networks/
Full connections are too many

[Diagram: a 150-D input vector fully connected to 150 neurons in layer 1, ending in output scores S_F1(x), …, S_F4(x) ∈ [0, 1]]
150-D input vector with 150 neurons in layer 1 => 22,500 parameters!!

No full connection

[Diagram: each neuron of layer 1 is connected to only 3 neighbouring inputs]
150-D input vector with 150 neurons in layer 1 => 450 parameters!!

Share weights

[Diagram: every neuron of layer 1 uses the same 3 weights (w0, w1, w2)]

1- We are learning convolution filters!
2- Small number of parameters = we can make the network deep!

150-D input vector with 150 neurons in layer 1 => 3 parameters!!
[Diagram: the shared 3-tap filter slides over the input vector]

Example: the input vector (2.3, 1.7, 4.0, 2.8, 5.7, 4.4, …, 2.8, 5.7, 4.4) is filtered with the shared weights (0.1, 0.2, 0.3).
Sliding the filter one position at a time produces one output per position (0.25, 0.45, 0.50, …, 0.50): this is a Convolution!

For images: each neuron of layer 1 is connected to 3×3 pixels, so layer 1 has 9 parameters!!
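A sketch of this sliding dot product in NumPy. With the exact numbers above the outputs do not reproduce the values printed on the slide, so treat them as illustrative of the operation only:

import numpy as np

def conv1d_valid(signal, kernel):
    """out[i] = sum_j kernel[j] * signal[i + j]: the same shared weights at every position."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(kernel, signal[i:i + len(kernel)]) for i in range(n)])

signal = np.array([2.3, 1.7, 4.0, 2.8, 5.7, 4.4])
kernel = np.array([0.1, 0.2, 0.3])                        # the 3 shared weights
print(conv1d_valid(signal, kernel))
print(np.convolve(signal, kernel[::-1], mode="valid"))    # same result with NumPy's convolve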
Convolution operation

Feature map:  F = σ( x ∗ W^[0] )

A convolution layer applies several filters to the same input:
σ( x ∗ W0^[0] ), σ( x ∗ W1^[0] ), σ( x ∗ W2^[0] ), σ( x ∗ W3^[0] ), σ( x ∗ W4^[0] )  =>  a 5-feature-map convolution layer
(K filters  =>  a K-feature-map convolution layer)
Pooling layer

[Diagram: Conv layer 1 followed by a POOLING LAYER (stride = 1)]

Goals:
• Reduce the spatial resolution of the feature maps
• Lower memory and computation requirements
• Provide partial invariance to position, scale and rotation
Images: https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/
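A minimal sketch of one common pooling operator, max pooling. The slide's figure uses stride 1; the sketch below uses the also-common 2×2 windows with stride 2, so treat the window size and stride as assumptions:

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window: lower resolution, partial invariance."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))          # a 4x4 feature map becomes a 2x2 pooled map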
2-class CNN

[Diagram: Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → fully connected layers → y_W(x)]

l( y_W(x), t ) = −t ln y_W(x) − (1 − t) ln( 1 − y_W(x) )

K-class CNN

[Diagram: Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → fully connected layers → exp + normalization (SOFTMAX)]

l( y_W(x), t ) = − Σ_{k=1}^{K} tk ln y_W,k(x)
Nice example from the literature

S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018

Learn image-based characteristics

http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf
Batch processing

A single sample: x = (0.4, 1.0), w = (2.0, 3.6, 0.5)

w^T x = −1.94
y_w(x) = σ(−1.94) = 1 / (1 + e^{1.94}) = 0.125

In matrix form, the sample is augmented with a leading 1 and written as a column vector:

y_w(xa) = σ( w^T (1, 0.4, 1.0)^T ) = 0.125

With a second sample xb = (2.1, 3.0):

y_w(xb) = σ( w^T (1, 2.1, 3.0)^T ) = 0.99

Mini-batch processing

Stack the samples as the columns of a matrix X and apply the network to all of them at once:

y_w([xa, xb]) = σ( w^T [ 1    1   ] ) = (0.125, 0.99)
                       [ 0.4  2.1 ]
                       [ 1.0  3.0 ]

With a mini-batch of 4 samples:

y_w(X) = σ( w^T X ) = (0.89, 0.2, 0.125, 0.99)
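A sketch of the batched matrix product (one column per sample, bias in the first row). With the weight signs as printed above, the numbers it produces will not match the slide's 0.125 and 0.99, so the point here is only the shape of the computation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_batch(w, X):
    """One matrix product gives the prediction for every column of X."""
    return sigmoid(w @ X)

w = np.array([2.0, 3.6, 0.5])          # (w0, w1, w2) as printed on the slide
X = np.array([[1.0, 1.0],               # bias row
              [0.4, 2.1],               # x1 of xa and xb
              [1.0, 3.0]])              # x2 of xa and xb
print(predict_batch(w, X))              # one prediction per column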
Mini-batch processing

[Diagram: a mini-batch of 4 images goes through the network and produces 4 predictions (e.g. Horse, Dog, Truck, …)]
Classical applications of ConvNets

Classification:
S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018

Image segmentation:
Tran, P. V., 2016. A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv:1604.00494.
Fang Liu, Zhaoye Zhou, et al., Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magnetic Resonance in Medicine, 2018. DOI:10.1002/mrm.26841
Havaei M., Davy A., Warde-Farley D., Biard A., Courville A., Bengio Y., Pal C., Jodoin P-M., Larochelle H. (2017). Brain Tumor Segmentation with Deep Neural Networks, Medical Image Analysis, Vol 35, 18-31

Localization:
S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks, arXiv:1806.07589, 2018
Conclusion
• Linear classification (1 neuron network)
• Logistic regression
• Multilayer perceptron
• Conv Nets
• Many buzz words
– Softmax
– Loss
– Batch
– Gradient descent
– Etc.
