
Deep Learning

Alexandre Thiéry
Department of Statistics and Data Sciences
Outline

1 Deep-Learning: motivations

2 Deep-Regression: toy example

3 Deep-Learning for classification

4 Convolutional Neural Network (CNN)


Why can Deep-Learning sometimes be helpful?
Performance vs. Compute vs. Model size

Figure: Extracted from (Brown et al., 2020)
Success stories in computer vision, natural language processing, speech recognition,
game playing, art generation, molecular design, automated driving, ...
GPUs were initially designed for video-games
GPUs are very good at matrix operations and parallel computing
For deep-learning, GPUs are often used to accelerate the training process
Note: high electrical consumption can be an issue

Figure: Extracted from Prof. Fleuret's lectures

The CPU is in charge of the overall computation
Transfer of data between the CPU and other devices (RAM, GPU) is slow
Understanding these bottlenecks is crucial for efficient computations
There exist many high-level libraries to build and train neural networks
We will implement simple neural nets from scratch for pedagogical purposes
Neural networks are often built by stacking layers and blocks
Each block represents a differentiable function
The overall function is the composition of these blocks
Each block is parametrized by a set of weights
These weights are learned by minimizing a loss function

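To make this concrete, here is a minimal NumPy sketch (the class name and interface are invented for illustration): each block implements a forward pass and a backward pass, and the overall network is simply the composition of its blocks.

import numpy as np

class DenseReLU:
    """One block: an affine map followed by a ReLU non-linearity."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                          # cache the input for the backward pass
        self.z = x @ self.W + self.b
        return np.maximum(self.z, 0.0)      # ReLU activation

    def backward(self, grad_out):
        grad_z = grad_out * (self.z > 0)    # derivative of the ReLU
        self.grad_W = self.x.T @ grad_z     # gradient w.r.t. this block's weights
        self.grad_b = grad_z.sum(axis=0)
        return grad_z @ self.W.T            # gradient passed to the previous block

def forward_network(blocks, x):
    """The overall function is the composition of the blocks."""
    for block in blocks:
        x = block.forward(x)
    return x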
We will focus on the supervised learning setting
The data is composed of pairs (x_i, y_i)
Goal: learn a function F that maps x_i to y_i as well as possible
The function F is a neural network parametrized by weights w
A loss function is used to measure the quality of the prediction
Fitting a Deep-Learning model is a non-convex optimization problem
Many algorithms are available: SGD, Adam, RMSprop, etc.
There is no guarantee of finding the global minimum, but that's OK...
Understanding Deep-Learning models is still an active area of research
Outline

1 Deep-Learning: motivations

2 Deep-Regression: toy example

3 Deep-Learning for classification

4 Convolutional Neural Network (CNN)

We will focus on the Multi-Layer Perceptron (MLP) architecture
It is composed of a sequence of dense layers
This section describes how to fit a simple regression model using a MLP

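As a rough sketch of what such a model looks like in code (layer sizes and function names here are arbitrary choices, not the exact model used later), an MLP for regression can be written with plain NumPy:

import numpy as np

def init_mlp(sizes, rng):
    """One (W, b) pair per dense layer, e.g. sizes = [1, 32, 32, 1]."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """F_w(x): ReLU on the hidden layers, linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W_out, b_out = params[-1]
    return h @ W_out + b_out

rng = np.random.default_rng(0)
params = init_mlp([1, 32, 32, 1], rng)              # p = 1 input, scalar output
y_hat = mlp_forward(params, np.linspace(-4, 4, 100).reshape(-1, 1))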
Toy Problem: 1D regression

[Figure: training data for the 1D case]

Training Data: {(x_i, y_i)}_{i=1}^N with
  - scalar response y_i ∈ R
  - vector covariate x_i = (x_i^(1), ..., x_i^(p)) ∈ R^p
Remark: the figure above illustrates the 1D case where p = 1
Reminder on Linear Regression

Model: the response is a linear combination of the covariates

    y_i ≈ α_0 + α_1 x_i^(1) + α_2 x_i^(2) + ... + α_p x_i^(p) = F_α(x_i)

with the vector of parameters α = (α_0, α_1, ..., α_p) and F_α : R^p → R the function

    F_α(x) = α_0 + α_1 x^(1) + α_2 x^(2) + ... + α_p x^(p)

Fitting: the parameters α ∈ R^(p+1) are estimated by minimizing the MSE:

    MSE(α) = (1/N) Σ_{i=1}^N (y_i − F_α(x_i))²

Remark: in this simple setting, algebraic manipulations show that the parameters can be computed in closed form with a simple matrix inversion. The point of the discussion is to illustrate the general idea of fitting a model.
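The closed-form remark can be checked numerically; the sketch below (synthetic data with made-up coefficients) fits the linear model by solving the least-squares problem directly with np.linalg.lstsq.

import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))                       # covariates x_i in R^p
alpha_true = np.array([1.0, -2.0, 0.5, 3.0])      # (alpha_0, alpha_1, ..., alpha_p)
y = alpha_true[0] + X @ alpha_true[1:] + 0.1 * rng.normal(size=N)

X1 = np.column_stack([np.ones(N), X])             # column of ones for the intercept alpha_0
alpha_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

mse = np.mean((y - X1 @ alpha_hat) ** 2)          # MSE(alpha_hat)
print(alpha_hat, mse)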
General Regression

Model: a general regression function F_w : R^p → R such that

    y_i ≈ F_w(x_i).

The vector w ∈ R^D is the set of parameters of the model

Fitting: the parameters w ∈ R^D are estimated by minimizing the MSE:

    MSE(w) = (1/N) Σ_{i=1}^N (y_i − F_w(x_i))²

Remark: the MSE is often referred to as the loss function
The ReLU activation function is the most commonly used activation in deep-learning
It is defined as:

    ReLU(z) = max(0, z)
Commonly used activation functions

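For reference, a few commonly used activation functions written in NumPy (the list is illustrative, not exhaustive):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)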
Big Data required

Because F_w is represented as a neural network, it is computationally expensive to compute F_w (i.e. more than a linear model)
The number of parameters is large: w ∈ R^D with D ≫ 1
Deep Learning typically needs a lot of data to work well: the number of training samples is often large, N ≫ 1
Dealing with a lot of data: SGD

During the training/fitting, instead of processing all the N training samples at each iteration of the optimization procedure, only a subset of the data is considered: it is called a mini-batch and contains B ≥ 1 samples.
The full training set is composed of a large number of mini-batches: once all the mini-batches have been processed once, that is called an epoch.
A large number of epochs is often necessary for training a neural net
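A minimal sketch of the mini-batch/epoch logic (the grad_loss function, learning rate and batch size are placeholders, not prescriptions):

import numpy as np

def sgd_train(w, X, y, grad_loss, lr=1e-2, batch_size=32, n_epochs=10, seed=0):
    """Plain mini-batch SGD: one epoch = one full pass over the shuffled training set."""
    rng = np.random.default_rng(seed)
    N = len(y)
    for epoch in range(n_epochs):
        perm = rng.permutation(N)                        # reshuffle the data at each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]         # one mini-batch of B samples
            w = w - lr * grad_loss(w, X[idx], y[idx])    # gradient step on the mini-batch
    return w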
Weight initialization is often crucial for training a neural network
Heuristic: choose random initial values so that the distribution of the neuron activations is roughly centered with unit variance
LeCun initialization is a popular choice for initializing dense layers. Consider a layer with n_in input neurons and n_out output neurons. The weights are initialized as

    w_ij ∼ N(0, σ² = 1/n_in)
Other popular initialization schemes include:
  - Xavier initialization with weights:

        w_ij ∼ N(0, σ² = 1/(n_in + n_out))

  - He initialization with weights:

        w_ij ∼ N(0, σ² = 2/n_in)
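These three schemes translate directly from the formulas above (a sketch; the function names are ours):

import numpy as np

def lecun_init(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_in, n_out))

def he_init(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W = lecun_init(128, 64, np.random.default_rng(0))    # weights of a 128 -> 64 dense layer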
Deep-Learning Training

Outline

1 Deep-Learning: motivations

2 Deep-Regression: toy example

3 Deep-Learning for classification

4 Convolutional Neural Network (CNN)

We will now focus on the multi-class classification setting

Data: {(x_i, y_i)}_{i=1}^N with
  - Categorical response y_i ∈ {0, 1, ..., C − 1}
  - Input x_i ∈ X (e.g. images, documents, time-series, etc.)
Probabilistic classification: we would like to build a model that takes x ∈ X as input and produces a probability vector as output,

    x −→ ŷ ≡ (p_0(x), p_1(x), ..., p_{C−1}(x)) ∈ P_C ⊂ R^C
Probabilistic classification:

    x −→ z ∈ R^C −→ ŷ ∈ P_C ⊂ R^C

  - A neural network transforms x ∈ X into z ∈ R^C
  - The softmax operation σ : R^C → P_C transforms an arbitrary vector z ∈ R^C into a probability vector:

        σ(z) = ŷ ∈ P_C
The softmax operation σ : R^C → P_C transforms an arbitrary vector z = (z_0, ..., z_{C−1}) ∈ R^C into a probability vector ŷ = (ŷ_0, ..., ŷ_{C−1}),

    σ(z) = ŷ ∈ P_C

The constraints on the probability vector ŷ ∈ P_C are:

    ŷ_i ≥ 0   and   ŷ_0 + ... + ŷ_{C−1} = 1

The softmax operation reads:

    ŷ_i = σ(z)_i = exp(z_i) / (exp(z_0) + ... + exp(z_{C−1}))
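In code, the softmax is usually implemented with the maximum subtracted first for numerical stability; this does not change the result since adding a constant to every z_i leaves the ratios unchanged. A NumPy sketch:

import numpy as np

def softmax(z):
    """Map an arbitrary vector z in R^C to a probability vector in P_C."""
    z = z - np.max(z)                  # shift for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

y_hat = softmax(np.array([2.0, 1.0, -1.0]))
print(y_hat, y_hat.sum())              # non-negative entries summing to 1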
Another operation that is often used is the Log-Sum-Exp operation. It is a convex function that maps a vector z ∈ R^C to a scalar:

    LSE(z) = log( Σ_{i=0}^{C−1} exp(z_i) )

The Log-Sum-Exp is deeply connected to the softmax operation since

    σ(z) = ∇ LSE(z)

Furthermore, the Log-Sum-Exp is a smooth approximation of the maximum:

    max_{0 ≤ i ≤ C−1} z_i = lim_{ε→0} ε × LSE(z/ε)
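Both properties can be checked numerically with a small sketch (again factoring out the maximum for stability):

import numpy as np

def lse(z):
    """Log-Sum-Exp, computed stably by factoring out the maximum."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([2.0, 1.0, -1.0])

# softmax(z)_i = exp(z_i - LSE(z)), i.e. the gradient of LSE at z
print(np.exp(z - lse(z)))

# eps * LSE(z / eps) approaches max(z) as eps -> 0
for eps in [1.0, 0.1, 0.01]:
    print(eps * lse(z / eps), np.max(z))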
Log-Sum-Exp approximates the maximum

Negative log-likelihood

Data: {(x_i, y_i)}_{i=1}^N
Model: predictions ŷ parametrized by (neural) weights w
Model fitting by minimizing the averaged Negative Log-Likelihood (NLL)

    Loss(w) = −(1/N) Σ_{i=1}^N log P(y_i | x_i, w)

where P(y_i | x_i, w) is the likelihood of the data y_i given the model parameters w.
With models parametrized by neural weights, minimizing the negative log-likelihood is typically a non-convex optimization problem.
NLL = Cross-Entropy Loss

The NLL is often referred to as the Cross-Entropy (CE) loss.
Consider a true label k ∈ {0, 1, ..., C − 1} and a prediction probability vector ŷ = (ŷ_0, ..., ŷ_{C−1}). We have:

    − log P(Y = k | ŷ) = − log(ŷ_k) = − Σ_{j=0}^{C−1} y_j^Hot log(ŷ_j) = CE(y^Hot, ŷ)

where the One-Hot encoded version y^Hot of y = k is the probability vector with a 1 in position k:

    y^Hot = (0, 0, ..., 0, 1, 0, ..., 0)

The Cross-Entropy between two probability vectors t, p ∈ P_C is

    CE(t, p) = − Σ_{k=0}^{C−1} t_k log(p_k) ≥ 0.
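A direct translation of these definitions into NumPy (the small eps guards against log(0)):

import numpy as np

def one_hot(k, C):
    """Probability vector with a 1 in position k."""
    t = np.zeros(C)
    t[k] = 1.0
    return t

def cross_entropy(t, p, eps=1e-12):
    """CE(t, p) = -sum_k t_k log(p_k)."""
    return -np.sum(t * np.log(p + eps))

y_hat = np.array([0.7, 0.2, 0.1])            # prediction for C = 3 classes
print(cross_entropy(one_hot(0, 3), y_hat))   # equals -log(0.7), the NLL of class 0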
To summarize: to train a neural network with weights w for multi-class classification that creates a probability prediction vector ŷ(x) from an input x, one typically minimizes the averaged Cross-Entropy loss over the training data {(x_i, y_i)}_{i=1}^N:

    Loss(w) = (1/N) Σ_{i=1}^N CE(y_i^Hot, ŷ(x_i)).

This is typically done using a variation of Stochastic Gradient Descent (SGD).
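In practice the averaged loss is often computed directly from the network outputs z using the identity CE(y^Hot, σ(z)) = LSE(z) − z_k, which is more stable than forming the probabilities explicitly; a sketch (array shapes are assumptions made for the example):

import numpy as np

def ce_loss_from_logits(Z, labels):
    """Averaged cross-entropy; Z has shape (N, C), labels has shape (N,) with entries in 0..C-1."""
    m = Z.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(Z - m).sum(axis=1))   # LSE of each row, computed stably
    z_true = Z[np.arange(len(labels)), labels]               # z_k for the true class of each sample
    return np.mean(lse - z_true)                             # (1/N) sum_i CE(y_i^Hot, softmax(z_i))

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 1.2,  0.3]])
print(ce_loss_from_logits(Z, np.array([0, 1])))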
Outline

1 Deep-Learning: motivations

2 Deep-Regression: toy example

3 Deep-Learning for classification

4 Convolutional Neural Network (CNN)

Dealing with images as input

Nearby pixels are likely to be similar
Some notion of translation and rotation invariance
Processing of local information
In 1980, Professor Fukushima published the so-called neocognitron, the original Convolutional Neural Network architecture
Convolutional Neural Networks (CNNs) are especially designed to process images
The operation of convolution is at the heart of CNNs

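A naive NumPy sketch of the operation (what most libraries actually compute is the cross-correlation, i.e. no kernel flip; single channel, 'valid' output size, for illustration only):

import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take dot products at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(6, 6))
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge detector
print(conv2d(image, edge_kernel).shape)          # (4, 4)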
The max-pooling operation is used to down-sample the data. This allows the neural network to abstract away the details of the image and capture its global structure.

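For example, 2x2 max-pooling with stride 2 on a single feature map keeps the largest activation in each 2x2 window (a sketch; assumes even spatial dimensions):

import numpy as np

def max_pool_2x2(x):
    """Down-sample a (H, W) feature map by taking the max over non-overlapping 2x2 windows."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))    # [[ 5.  7.] [13. 15.]]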
As an image is processed through the network, the features extracted become more
and more abstract. For example, the first layers might detect edges, the second
layers might detect shapes, while the last layers might detect objects.

After being processed by a series of convolutions, an image is transformed into a
tensor of dimension W × H × D, where D is the number of features. A global
max-pooling operation is often applied to convert this tensor into a vector of
length D. It is obtained by taking the maximum value of each feature map.

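In code, this is a single reduction over the spatial dimensions (a sketch, assuming the feature dimension D is stored last):

import numpy as np

def global_max_pool(features):
    """(W, H, D) tensor -> length-D vector: the maximum of each feature map."""
    return features.max(axis=(0, 1))

features = np.random.default_rng(0).normal(size=(7, 7, 64))
print(global_max_pool(features).shape)    # (64,)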
A typical CNN architecture for image classification is composed of a series of convolutions and max-pooling operations that extract features from the image, followed by a global pooling operation that transforms the tensor into a vector. This vector is then processed by a series of dense layers and a final softmax operation that outputs the final classification probability vector.
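Using a high-level library such as PyTorch, one possible version of such an architecture could look as follows (layer sizes and the number of classes are placeholders; in practice the softmax is usually folded into the cross-entropy loss during training):

import torch.nn as nn

num_classes = 10   # placeholder

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # feature extraction
    nn.MaxPool2d(2),                                         # spatial down-sampling
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveMaxPool2d(1),                                 # global max-pooling -> (64, 1, 1)
    nn.Flatten(),                                            # -> vector of length 64
    nn.Linear(64, 128), nn.ReLU(),                           # dense layers
    nn.Linear(128, num_classes),                             # class scores; softmax applied at the end
)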
Overfitting is a common issue in deep-learning, as the number of parameters is often very large compared to the amount of data. Different strategies can be used to mitigate overfitting, although a detailed discussion of these important issues is beyond the scope of this lecture.
Data Augmentation: one can often artificially enlarge the size of the dataset by perturbing the training samples. This simple strategy helps mitigate overfitting and is often crucial in situations where training data is limited, as is often the case in industrial and medical applications where data collection is typically slow and expensive.
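As an illustration, simple perturbations such as random flips and small shifts can be generated on the fly during training; a NumPy sketch (the specific perturbations are examples and should be chosen per application):

import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly perturbed copy of a (H, W) image: horizontal flip + small translation."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                                   # random horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))          # small random translation
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28))
augmented = [augment(image, rng) for _ in range(8)]          # 8 perturbed versions of one sample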
Some practical advice

1. If possible, use standard neural architectures that have been proven to work
2. The quality and quantity of the data are often crucial, more than the architecture
3. The choice of the optimizer is often not extremely important
4. Data augmentation is often an easy way to improve the performance of a model
5. Identify bottlenecks (e.g. slow data transfer or data augmentation)
6. Start with simple neural architectures before moving to more complex ones
7. There are many situations where deep-learning is not the right tool
