Deep-Learning
Alexandre Thiéry
Department of Statistics and Data Sciences
1 Deep-Learning: motivations
Why can Deep-Learning sometimes be helpful?
Performance vs. Compute vs. Model size
Figure: extracted from Prof. Fleuret's lectures
We will focus on the supervised learning setting
The data is composed of pairs (xi, yi)
Goal: learn a function F that maps xi to yi as well as possible
The function F is a neural network parametrized by weights w
A loss function is used to measure the quality of the prediction
Fitting a Deep-Learning model: a non-convex optimization problem
Many algorithms available: SGD, Adam, RMSprop, etc.
No guarantee to find the global minimum, but that's OK...
Understanding Deep-Learning models is still an active area of research
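For illustration, a minimal PyTorch sketch of how these optimizers are typically instantiated; the model and learning rates below are arbitrary placeholders:

    import torch

    # placeholder model: any torch.nn.Module would do here
    model = torch.nn.Linear(10, 1)

    # the algorithms mentioned above; in practice one of them is chosen
    sgd     = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    adam    = torch.optim.Adam(model.parameters(), lr=1e-3)
    rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)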
We will focus on the Multi-Layer Perceptron (MLP) architecture
It is composed of a sequence of dense layers
This section describes how to fit a simple regression model using an MLP
Toy Problem: 1D regression
[Figure: training data (xi, yi) for the 1D regression problem]
Goal: find weights w such that yi ≈ Fw(xi).
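For concreteness, a minimal PyTorch sketch of fitting an MLP to such 1D data; the target function, layer sizes and learning rate are arbitrary illustrative choices:

    import torch

    # toy 1D data: inputs in [-4, 4] and a noisy target (illustrative choice)
    x = torch.linspace(-4, 4, 200).unsqueeze(1)        # shape (200, 1)
    y = torch.sin(x) + 0.1 * torch.randn_like(x)

    # a small MLP: a sequence of dense (Linear) layers with ReLU activations
    model = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, 1),
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()                       # squared-error loss for regression

    for step in range(2000):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)                    # measures the quality of the prediction
        loss.backward()                                # gradients by back-propagation
        optimizer.step()                               # one optimization update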
The ReLU activation function is the most commonly used in deep-learning
It is defined as:
ReLU(z) = max(0, z)
Commonly used activation functions
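For reference, a short NumPy sketch of a few commonly used activation functions (an illustrative selection):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)                  # ReLU(z) = max(0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))            # squashes values into (0, 1)

    def tanh(z):
        return np.tanh(z)                          # squashes values into (-1, 1)

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)       # small negative slope instead of 0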
Big Data required
Weight initialization:
- Xavier (Glorot) initialization with weights: wij ∼ N(0, σ² = 1/(nin + nout))
- He initialization with weights: wij ∼ N(0, σ² = 2/nin)
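A short NumPy sketch of how these two initializations can be drawn, matching the formulas above; the layer sizes are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 128, 64                  # fan-in / fan-out of a dense layer

    # Xavier (Glorot) initialization: variance 1 / (n_in + n_out)
    w_xavier = rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_out, n_in))

    # He initialization: variance 2 / n_in (well suited to ReLU activations)
    w_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))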
Deep-Learning Training
We will now focus on the multi-class classification setting
Data: {(xi, yi)}, i = 1, . . . , N, with
- Categorical response yi ∈ {0, 1, . . . , C − 1}
- Input xi ∈ X (e.g. images, documents, time-series, etc.)
Probabilistic classification: we would like to build a model that takes x ∈ X as input and produces a probability vector as output,
x −→ ŷ ≡ (p0(x), p1(x), . . . , pC−1(x)) ∈ PC ⊂ R^C,
where PC denotes the set of probability vectors (the probability simplex).
Probabilistic classification:
x −→ z ∈ R^C −→ ŷ ∈ PC ⊂ R^C
The softmax function σ maps the score vector z ∈ R^C to a probability vector σ(z) = ŷ ∈ PC, with coordinates
ŷi = σ(z)i = exp(zi) / (exp(z0) + . . . + exp(zC−1))
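A small NumPy sketch of the softmax map; subtracting max(z) before exponentiating is a standard trick that avoids numerical overflow without changing the result:

    import numpy as np

    def softmax(z):
        # map a score vector z in R^C to a probability vector in PC
        z = z - np.max(z)              # shift for numerical stability (output unchanged)
        e = np.exp(z)
        return e / e.sum()

    z = np.array([2.0, 1.0, -1.0])
    print(softmax(z), softmax(z).sum())    # a probability vector, summing to 1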
Another operation that is often used is the Log-Sum-Exp operation. It is a
convex function that maps a vector z ∈ RC to a scalar:
LSE(z) = log( Σ_{i=0}^{C−1} exp(zi) )
Its gradient is the softmax function:
σ(z) = ∇ LSE(z)
Log-Sum-Exp approximates the maximum
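More precisely, for any z ∈ R^C,
max(z0, . . . , zC−1) ≤ LSE(z) ≤ max(z0, . . . , zC−1) + log(C),
so LSE(z) is always within log(C) of the largest coordinate of z.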
Negative log-likelihood
The weights are fitted by minimizing the negative log-likelihood of the training data,
NLL(w) = − Σ_{i=1}^{N} log P(yi | xi, w),
where P(yi | xi, w) is the likelihood of the observation yi given the input xi and the model parameters w.
With models parametrized by neural weights, minimizing the negative log-likelihood is typically a non-convex optimization problem.
NLL = Cross-Entropy Loss
The NLL is often referred to as the Cross-Entropy (CE) loss.
Consider a true label k ∈ {0, 1, . . . , C − 1} and a predicted probability vector ŷ = (ŷ0, . . . , ŷC−1). We have:
− log P(Y = k | ŷ) = − log(ŷk) = − Σ_{j=0}^{C−1} yj^Hot log(ŷj) = CE(y^Hot, ŷ),
where the One-Hot encoded version y^Hot of y = k is the probability vector
y^Hot = (0, 0, . . . , 0, 1, 0, . . . , 0),
with the 1 in position k.
The Cross-Entropy between two probability vectors t, p ∈ PC is
CE(t, p) = − Σ_{k=0}^{C−1} tk log(pk) ≥ 0.
To summarize: to train a neural network with weights w for multi-class classification, producing a probability prediction vector ŷ(x) from an input x, one typically minimizes the averaged Cross-Entropy loss over the training data {(xi, yi)}, i = 1, . . . , N:
Loss(w) = (1/N) Σ_{i=1}^{N} CE(yi^Hot, ŷ(xi)).
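A self-contained NumPy sketch of this averaged loss; the scores, labels and sizes are made-up illustrations:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))   # row-wise, numerically stable
        return e / e.sum(axis=1, keepdims=True)

    def one_hot(y, C):
        return np.eye(C)[y]                            # (N, C) matrix, a single 1 per row

    # illustrative example: N = 4 samples, C = 3 classes
    z = np.random.randn(4, 3)                          # network scores
    y = np.array([0, 2, 1, 2])                         # true labels

    probs = softmax(z)                                 # predicted probability vectors
    loss = -np.mean(np.sum(one_hot(y, 3) * np.log(probs), axis=1))
    print(loss)                                        # averaged Cross-Entropy loss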
Dealing with images as input
Convolutional Neural Networks (CNNs) are especially designed to process images
The operation of convolution is at the heart of CNNs
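As an illustration, a 2D convolution applied to a batch of images in PyTorch; the image and kernel sizes are arbitrary assumptions:

    import torch

    # a batch of 8 RGB images of size 32 x 32: (batch, channels, height, width)
    images = torch.randn(8, 3, 32, 32)

    # a convolutional layer: 3 input channels -> 16 feature maps, 3x3 kernels
    conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

    features = conv(images)
    print(features.shape)          # torch.Size([8, 16, 32, 32])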
The max-pooling operation is used to down-sample the data. This allows the neural network to abstract away fine details and capture the global structure of the image.
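Continuing the same illustrative sketch, max-pooling halves the spatial resolution by keeping the maximum of each 2x2 patch:

    import torch

    features = torch.randn(8, 16, 32, 32)       # feature maps from a convolutional layer

    pool = torch.nn.MaxPool2d(kernel_size=2)    # maximum over each 2x2 patch
    print(pool(features).shape)                 # torch.Size([8, 16, 16, 16])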
As an image is processed through the network, the extracted features become more and more abstract. For example, the first layers might detect edges, intermediate layers might detect shapes, while the last layers might detect objects.
After being processed by a series of convolutions, an image is transformed into a
tensor of dimension W × H × D, where D is the number of features. A global
max-pooling operation is often applied to convert this tensor into a vector of
length D. It is obtained by taking the maximum value of each feature map.
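A one-line PyTorch sketch of global max-pooling; the shapes below are arbitrary assumptions:

    import torch

    features = torch.randn(8, 64, 7, 7)     # (batch, D = 64 feature maps, H = 7, W = 7)
    pooled = features.amax(dim=(2, 3))      # maximum over the spatial dimensions
    print(pooled.shape)                     # torch.Size([8, 64]): one vector of length D per image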
A typical CNN architecture for image classification is composed of a series of convolutions and max-pooling operations that extract features from the image, followed by a global pooling operation that transforms the tensor into a vector. This vector is then processed by a series of dense layers and a final softmax operation that outputs the final classification probability vector.
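A compact PyTorch sketch of such an architecture; the layer sizes and the number of classes are arbitrary assumptions:

    import torch

    C = 10                                               # number of classes (assumption)

    cnn = torch.nn.Sequential(
        # feature extraction: convolutions + max-pooling
        torch.nn.Conv2d(3, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                           # 32x32 -> 16x16
        torch.nn.Conv2d(16, 32, kernel_size=3, padding=1), torch.nn.ReLU(),
        torch.nn.MaxPool2d(2),                           # 16x16 -> 8x8
        # global max-pooling: (batch, 32, 8, 8) -> (batch, 32)
        torch.nn.AdaptiveMaxPool2d(1), torch.nn.Flatten(),
        # dense layers and a final softmax producing the probability vector
        torch.nn.Linear(32, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, C), torch.nn.Softmax(dim=1),
    )

    x = torch.randn(8, 3, 32, 32)                        # a batch of 8 RGB 32x32 images
    print(cnn(x).shape)                                  # torch.Size([8, 10]); each row sums to 1

In practice the final Softmax is usually dropped and torch.nn.CrossEntropyLoss is applied directly to the scores, since it combines the Log-Sum-Exp and the negative log-likelihood in a numerically stable way.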
Overfitting is a common issue in deep-learning, as the number of parameters is often very large compared to the amount of data. Different strategies can be used to mitigate overfitting, although discussing these important issues in detail is beyond the scope of this lecture.
Data Augmentation: one can often artificially enlarge the size of the dataset by perturbing the training samples. This simple strategy helps mitigate overfitting and is often crucial when training data is limited, as is frequently the case in industrial and medical applications where data collection is typically slow and expensive.
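A typical sketch using torchvision; the particular transforms chosen here are illustrative:

    import torchvision.transforms as T

    # random perturbations applied on the fly to each training image
    augment = T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomCrop(32, padding=4),                 # random shifts of a 32x32 image
        T.ColorJitter(brightness=0.2, contrast=0.2),
        T.ToTensor(),
    ])
    # e.g. passed as transform=augment when building a torchvision dataset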
Some practical advice
1. If possible, use standard neural architectures that have been proven to work
2. The quality and quantity of the data are often crucial, more so than the architecture
3. The choice of the optimizer is often not extremely important
4. Data-augmentation is often an easy way to improve the performance of a model
5. Identify bottlenecks (e.g. slow data transfer or data-augmentation)
6. Start with simple neural architectures before moving to more complex ones
7. There are many situations where deep-learning is not the right tool