Deep-Learning
Alexandre Thiéry
Department of Statistics and Data Sciences
1 Deep-Learning: motivations
Why can Deep-Learning sometimes be helpful?
Performance vs. Compute vs. Model size
Figure: extracted from Prof. Fleuret's lectures
We will focus on the supervised learning setting
The data is composed of pairs (xi, yi)
Goal: learn a function F that maps xi to yi as well as possible
The function F is a neural network parametrized by weights w
A loss function is used to measure the quality of the prediction
Fitting a Deep-Learning model: a non-convex optimization problem
Many algorithms available: SGD, Adam, RMSprop, etc.
No guarantee to find the global minimum, but that's OK...
Understanding Deep-Learning models is still an active area of research
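For illustration, a minimal PyTorch sketch of how these optimizers are typically instantiated; the model and learning rates below are arbitrary placeholders:

    import torch

    # placeholder model: any torch.nn.Module would do here
    model = torch.nn.Linear(10, 1)

    # the algorithms mentioned above; in practice one of them is chosen
    sgd     = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    adam    = torch.optim.Adam(model.parameters(), lr=1e-3)
    rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)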
We will focus on the Multi-Layer Perceptron (MLP) architecture
It is composed of a sequence of dense layers
This section describes how to fit a simple regression model using an MLP
Toy Problem: 1D regression
[Figure: training data (xi, yi) for the 1D regression problem]
Goal: find weights w such that yi ≈ Fw(xi).
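For concreteness, a minimal PyTorch sketch of fitting an MLP to such 1D data; the target function, layer sizes and learning rate are arbitrary illustrative choices:

    import torch

    # toy 1D data: inputs in [-4, 4] and a noisy target (illustrative choice)
    x = torch.linspace(-4, 4, 200).unsqueeze(1)        # shape (200, 1)
    y = torch.sin(x) + 0.1 * torch.randn_like(x)

    # a small MLP: a sequence of dense (Linear) layers with ReLU activations
    model = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, 1),
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()                       # squared-error loss for regression

    for step in range(2000):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)                    # measures the quality of the prediction
        loss.backward()                                # gradients by back-propagation
        optimizer.step()                               # one optimization update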
The ReLU activation function is the most commonly used in deep-learning
It is defined as:
ReLU(z) = max(0, z)
Commonly used activation functions
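For reference, a short NumPy sketch of a few commonly used activation functions (an illustrative selection):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)                  # ReLU(z) = max(0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))            # squashes values into (0, 1)

    def tanh(z):
        return np.tanh(z)                          # squashes values into (-1, 1)

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)       # small negative slope instead of 0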
Big Data required
Weight initialization:
- Xavier (Glorot) initialization with weights: wij ∼ N(0, σ² = 1/(nin + nout))
- He initialization with weights: wij ∼ N(0, σ² = 2/nin)
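A short NumPy sketch of how these two initializations can be drawn, matching the formulas above; the layer sizes are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 128, 64                  # fan-in / fan-out of a dense layer

    # Xavier (Glorot) initialization: variance 1 / (n_in + n_out)
    w_xavier = rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_out, n_in))

    # He initialization: variance 2 / n_in (well suited to ReLU activations)
    w_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))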
Deep-Learning Training
We will now focus on the multi-class classification setting
Data: {(xi, yi)}, i = 1, . . . , N, with
- Categorical response yi ∈ {0, 1, . . . , C − 1}
- Input xi ∈ X (e.g. images, documents, time-series, etc.)
Probabilistic classification: we would like to build a model that takes x ∈ X as input and produces a probability vector as output,
x −→ ŷ ≡ (p0(x), p1(x), . . . , pC−1(x)) ∈ PC ⊂ R^C,
where PC denotes the set of probability vectors (the probability simplex).
Probabilistic classification:
x −→ z ∈ R^C −→ ŷ ∈ PC ⊂ R^C
The softmax function σ maps the score vector z ∈ R^C to a probability vector σ(z) = ŷ ∈ PC, with coordinates
ŷi = σ(z)i = exp(zi) / (exp(z0) + . . . + exp(zC−1))
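A small NumPy sketch of the softmax map; subtracting max(z) before exponentiating is a standard trick that avoids numerical overflow without changing the result:

    import numpy as np

    def softmax(z):
        # map a score vector z in R^C to a probability vector in PC
        z = z - np.max(z)              # shift for numerical stability (output unchanged)
        e = np.exp(z)
        return e / e.sum()

    z = np.array([2.0, 1.0, -1.0])
    print(softmax(z), softmax(z).sum())    # a probability vector, summing to 1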
Another operation that is often used is the Log-Sum-Exp operation. It is a
convex function that maps a vector z ∈ RC to a scalar:
LSE(z) = log( Σ_{i=0}^{C−1} exp(zi) )
Its gradient is the softmax function:
σ(z) = ∇ LSE(z)
Log-Sum-Exp approximates the maximum
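More precisely, for any z ∈ R^C,
max(z0, . . . , zC−1) ≤ LSE(z) ≤ max(z0, . . . , zC−1) + log(C),
so LSE(z) is always within log(C) of the largest coordinate of z.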
Negative log-likelihood
The weights are fitted by minimizing the negative log-likelihood of the training data,
NLL(w) = − Σ_{i=1}^{N} log P(yi | xi, w),
where P(yi | xi, w) is the likelihood of the observation yi given the input xi and the model parameters w.
With models parametrized by neural weights, minimizing the negative log-likelihood is typically a non-convex optimization problem.
NLL = Cross-Entropy Loss
The NLL is often referred to as the Cross-Entropy (CE) loss.
Consider a true label k ∈ {0, 1, . . . , C − 1} and a predicted probability vector ŷ = (ŷ0, . . . , ŷC−1). We have:
− log P(Y = k | ŷ) = − log(ŷk) = − Σ_{j=0}^{C−1} yj^Hot log(ŷj) = CE(y^Hot, ŷ),
where the One-Hot encoded version y^Hot of y = k is the probability vector
y^Hot = (0, 0, . . . , 0, 1, 0, . . . , 0),
with the 1 in position k.
The Cross-Entropy between two probability vectors t, p ∈ PC is
CE(t, p) = − Σ_{k=0}^{C−1} tk log(pk) ≥ 0.
To summarize: to train a neural network with weights w for multi-class classification, producing a probability prediction vector ŷ(x) from an input x, one typically minimizes the averaged Cross-Entropy loss over the training data {(xi, yi)}, i = 1, . . . , N:
Loss(w) = (1/N) Σ_{i=1}^{N} CE(yi^Hot, ŷ(xi)).
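A self-contained NumPy sketch of this averaged loss; the scores, labels and sizes are made-up illustrations:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))   # row-wise, numerically stable
        return e / e.sum(axis=1, keepdims=True)

    def one_hot(y, C):
        return np.eye(C)[y]                            # (N, C) matrix, a single 1 per row

    # illustrative example: N = 4 samples, C = 3 classes
    z = np.random.randn(4, 3)                          # network scores
    y = np.array([0, 2, 1, 2])                         # true labels

    probs = softmax(z)                                 # predicted probability vectors
    loss = -np.mean(np.sum(one_hot(y, 3) * np.log(probs), axis=1))
    print(loss)                                        # averaged Cross-Entropy loss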
Dealing with images as input
Convolutional Neural Networks (CNNs) are especially designed to process images
The operation of convolution is at the heart of CNNs
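As an illustration, a 2D convolution applied to a batch of images in PyTorch; the image and kernel sizes are arbitrary assumptions:

    import torch

    # a batch of 8 RGB images of size 32 x 32: (batch, channels, height, width)
    images = torch.randn(8, 3, 32, 32)

    # a convolutional layer: 3 input channels -> 16 feature maps, 3x3 kernels
    conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

    features = conv(images)
    print(features.shape)          # torch.Size([8, 16, 32, 32])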
The max-pooling operation is used to down-sample the data. This allows the neural network to abstract away fine details and capture the global structure of the image.
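Continuing the same illustrative sketch, max-pooling halves the spatial resolution by keeping the maximum of each 2x2 patch:

    import torch

    features = torch.randn(8, 16, 32, 32)       # feature maps from a convolutional layer

    pool = torch.nn.MaxPool2d(kernel_size=2)    # maximum over each 2x2 patch
    print(pool(features).shape)                 # torch.Size([8, 16, 16, 16])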
As an image is processed through the network, the extracted features become more and more abstract. For example, the first layers might detect edges, intermediate layers might detect shapes, while the last layers might detect objects.
After being processed by a series of convolutions, an image is transformed into a
tensor of dimension W × H × D, where D is the number of features. A global
max-pooling operation is often applied to convert this tensor into a vector of
length D. It is obtained by taking the maximum value of each feature map.
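A one-line PyTorch sketch of global max-pooling; the shapes below are arbitrary assumptions:

    import torch

    features = torch.randn(8, 64, 7, 7)     # (batch, D = 64 feature maps, H = 7, W = 7)
    pooled = features.amax(dim=(2, 3))      # maximum over the spatial dimensions
    print(pooled.shape)                     # torch.Size([8, 64]): one vector of length D per image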
A typical CNN architecture for image classification is composed of a series of convolutions and max-pooling operations that extract features from the image, followed by a global pooling operation that transforms the tensor into a vector. This vector is then processed by a series of dense layers and a final softmax operation that outputs the final classification probability vector.
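A compact PyTorch sketch of such an architecture; the layer sizes and the number of classes are arbitrary assumptions:

    import torch

    C = 10                                               # number of classes (assumption)

    cnn = torch.nn.Sequential(
        # feature extraction: convolutions + max-pooling
        torch.nn.Conv2d(3, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                           # 32x32 -> 16x16
        torch.nn.Conv2d(16, 32, kernel_size=3, padding=1), torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                           # 16x16 -> 8x8
        # global max-pooling: (batch, 32, 8, 8) -> (batch, 32)
        torch.nn.AdaptiveMaxPool2d(1), torch.nn.Flatten(),
        # dense layers and a final softmax producing the probability vector
        torch.nn.Linear(32, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, C), torch.nn.Softmax(dim=1),
    )

    x = torch.randn(8, 3, 32, 32)                        # a batch of 8 RGB 32x32 images
    print(cnn(x).shape)                                  # torch.Size([8, 10]); each row sums to 1

In practice the final Softmax is usually dropped and torch.nn.CrossEntropyLoss is applied directly to the scores, since it combines the Log-Sum-Exp and the negative log-likelihood in a numerically stable way.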
Overfitting is a common issue in deep-learning, as the number of parameters is often very large compared to the amount of data. Different strategies can be used to mitigate overfitting, although discussing these important issues in detail is beyond the scope of this lecture.
Data Augmentation: one can often artificially enlarge the size of the dataset by perturbing the training samples. This simple strategy helps mitigate overfitting and is often crucial when training data is limited, as is frequently the case in industrial and medical applications where data collection is typically slow and expensive.
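A typical sketch using torchvision; the particular transforms chosen here are illustrative:

    import torchvision.transforms as T

    # random perturbations applied on the fly to each training image
    augment = T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomCrop(32, padding=4),                 # random shifts of a 32x32 image
        T.ColorJitter(brightness=0.2, contrast=0.2),
        T.ToTensor(),
    ])
    # e.g. passed as transform=augment when building a torchvision dataset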
Some practical advice
1. If possible, use standard neural architectures that have been proven to work
2. The quality and quantity of the data are often crucial, more so than the architecture
3. The choice of the optimizer is often not extremely important
4. Data-augmentation is often an easy way to improve the performance of a model
5. Identify bottlenecks (e.g. slow data transfer or data-augmentation)
6. Start with simple neural architectures before moving to more complex ones
7. There are many situations where deep-learning is not the right tool