Deep Learning - A Gentle Introduction
Tanujit Chakraborty
@ Sorbonne
Webpage: https://www.ctanujit.org
Lecture Outline
• Machine learning basics
Supervised and unsupervised learning
Linear and non-linear classification methods
• Introduction to deep learning
• Elements of neural networks (NNs)
Activation functions
• Training NNs
Gradient descent
Regularization methods
• NN architectures
Convolutional NNs
Recurrent NNs
LSTM
Transformers
GAN
* Most of the slides are adapted from the Deep Learning (DL) and Dive into Deep Learning (D2L) books and from several online resources and slides that are acknowledged at the end of the presentation.
Machine Learning Basics
• Artificial Intelligence is a scientific field concerned with building systems that exhibit intelligent behavior
• Machine Learning is a branch of Artificial Intelligence that focuses on methods which learn from data, without being explicitly programmed, and make predictions on unseen data
[Figure: the training phase learns a model from labeled data; the prediction phase applies the learned model to unseen data]
Machine Learning Types
[Figure: examples of classification (separating class A from class B), regression, and clustering]
Supervised and Unsupervised Learning
• Supervised learning categories and techniques
Numerical classifier functions
o Linear classifier, perceptron, logistic regression, support vector machines (SVM), neural networks
Parametric (probabilistic) functions
o Naïve Bayes, Gaussian discriminant analysis (GDA), hidden Markov models (HMM), probabilistic graphical models
Non-parametric (instance-based) functions
o k-nearest neighbors, kernel regression, kernel density estimation, local regression
Symbolic functions
o Decision trees, classification and regression trees (CART)
Aggregation (ensemble) learning
o Bagging, boosting (AdaBoost), random forest
• Unsupervised learning categories and techniques
Clustering
o k-means clustering, mean-shift clustering, spectral clustering
Density estimation
o Gaussian mixture model (GMM), graphical models
Dimensionality reduction
o Principal component analysis (PCA), factor analysis
Machine Learning Timeline
Nearest Neighbor Classifier
• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance)
It does not require learning a set of weights
[Figure: training examples from class 1 and class 2, and a test example to be classified]
Nearest Neighbor Classifier
• For image classification, the distance between two images is computed over all pixels (e.g., using the ℓ1 norm or the ℓ2 norm)
Accuracy on CIFAR-10: 38.6%
• Disadvantages:
The classifier must remember all training data and store it for future comparisons with the test data
Classifying a test image is expensive since it requires a comparison to all training images
ℓ1 norm
(Manhattan distance)
• k-Nearest Neighbors approach considers multiple neighboring data points to classify a test data point
E.g., 3-nearest neighbors
o The test example in the figure is the + mark
o The class of the test example is obtained by voting (based on the distance to the 3 closest points)
[Figure: 2D scatter plot (axes x1, x2) of training points from two classes (x and o); the test point + is assigned the majority class among its 3 nearest neighbors]
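A minimal NumPy sketch of k-nearest-neighbor classification with the ℓ1 (Manhattan) distance; the toy data and the choice k = 3 below are illustrative and not taken from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Assign the majority class among the k training points closest to x_test (L1 distance)."""
    dists = np.abs(X_train - x_test).sum(axis=1)          # Manhattan distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # majority vote

# toy 2D data: three points from class 0 and three from class 1
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # -> 0
```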
Linear Classifier
• Linear classifier
Find a linear function f of the inputs xi that separates the classes
𝑓(𝑥ᵢ, 𝑊, 𝑏) = 𝑊𝑥ᵢ + 𝑏
Use pairs of inputs and labels to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
Several methods have been used to find the optimal set of parameters of a linear classifier
o A common method of choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single
layer, does not use backpropagation)
Linear classifier is a simple approach, but it is a building block of advanced classification algorithms, such as SVM and
neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
Linear Classifier
• The decision boundary is linear
A straight line in 2D, a flat plane in 3D, and a hyperplane in higher-dimensional spaces
• Example: classify an input image
The selected parameters in this example are not good, because the
predicted cat score is low
Linear vs Non-linear Techniques
Non-linear Techniques
• Non-linear classification
Features 𝑧𝑖 are obtained as non-linear functions of the inputs 𝑥𝑖
It results in non-linear decision boundaries
Can deal with non-linearly separable data
Features: 𝑧ᵢ = (𝑥ₙ₁², 𝑥ₙ₂², 𝑥ₙ₁ ∙ 𝑥ₙ₂, 𝑥ₙ₁, 𝑥ₙ₂)
Outputs: 𝑓(𝑥ᵢ, 𝑊, 𝑏) = 𝑊𝑧ᵢ + 𝑏
Non-linear Support Vector Machines
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space where the training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision boundary in the original feature space
Φ: 𝑥 ↦ 𝜙(𝑥)
Binary vs Multi-class Classification
• A classification problem with only 2 classes is
referred to as binary classification
The output labels are 0 or 1
E.g., benign or malignant tumor, spam or no-spam
email
• A problem with 3 or more classes is referred to as
multi-class classification
Computer Vision Tasks
No-Free-Lunch Theorem
• Wolpert (2002) - The Supervised Learning No-Free-Lunch Theorems
• The derived classification models for supervised learning are simplifications of reality
The simplifications are based on certain assumptions
The assumptions fail in some situations
o E.g., due to inability to perfectly estimate ML model parameters from limited data
Why is DL Useful?
• DL provides a flexible, learnable framework for representing visual, textual, and linguistic information
Can learn in a supervised or unsupervised manner
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
Input image pixels → Edges → Textures → Parts → Objects
Deep Learning Timeline
Representational Power
• NNs with at least one hidden layer are universal approximators
Given any continuous function h(x) and some ε > 0, there exists a NN with one hidden layer (and with a reasonable choice of non-linearity), described by the function f(x), such that ∀x, |h(x) − f(x)| < ε
I.e., a NN can approximate an arbitrarily complex continuous function
Introduction to Neural Networks
[Figure: a 16 × 16 = 256-pixel image of a handwritten digit is fed to the network as inputs x1, …, x256 (ink → 1, no ink → 0); each output dimension y1, …, y10 represents the confidence of a digit, e.g., y2 = 0.7 → the image is “2”]
Introduction to Neural Networks
[Figure: the network implements a function 𝑓: ℝ²⁵⁶ → ℝ¹⁰ that maps the input pixels x1, …, x256 to the ten digit scores]
Elements of Neural Networks
z = a₁w₁ + a₂w₂ + ⋯ + a_K w_K + b
a = σ(z)
[Figure: a single neuron: the inputs a₁, …, a_K are multiplied by the weights w₁, …, w_K, the bias b is added to give z, and the activation function σ produces the output a]
Elements of Neural Networks
[Figure: a hidden layer 𝒉 is defined by its weights, biases, and activation functions]
Elements of Neural Networks
• A neural network playground link
Elements of Neural Networks
• Deep NNs have many hidden layers
Fully-connected (dense) layers (a.k.a. Multi-Layer Perceptron or MLP)
Each neuron is connected to all neurons in the succeeding layer
[Figure: a fully-connected network with an input layer x1, …, xN, several hidden layers, and an output layer y1, …, yM]
Elements of Neural Networks
1 ∙ 1 + (−1) ∙ (−2) + 1 = 4
1 ∙ (−1) + (−1) ∙ 1 + 0 = −2
[Figure: the computations of the two neurons in the first hidden layer for the input (1, −1)]
Elements of Neural Networks
𝑓: ℝ² → ℝ², 𝑓(1, −1) = (0.62, 0.83)
[Figure: the whole network seen as a function that maps the input vector to the output vector]
Matrix Operation
• Matrix operations are helpful when working with multidimensional inputs and outputs
a = σ(W x + b)
Example: W = [[1, −2], [−1, 1]], x = (1, −1), b = (1, 0)
σ( [[1, −2], [−1, 1]] (1, −1) + (1, 0) ) = σ( (4, −2) ) = (0.98, 0.12)
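A short NumPy check of this computation; the sigmoid helper below is the standard logistic function (not shown on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1., -2.],
              [-1., 1.]])          # weight matrix from the example
b = np.array([1., 0.])             # bias vector
x = np.array([1., -1.])            # input

a = sigmoid(W @ x + b)             # W x + b = [4, -2]
print(np.round(a, 2))              # -> [0.98 0.12]
```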
Matrix Operation
a¹ = σ(W¹ x + b¹)
[Figure: the first hidden layer computes a¹ from the input vector x = (x1, …, xN) using the weight matrix W¹ and the bias vector b¹]
Matrix Operation
• Multilayer NN, matrix calculations for all layers
a¹ = σ(W¹ x + b¹)
a² = σ(W² a¹ + b²)
⋯
y = σ(Wᴸ aᴸ⁻¹ + bᴸ)
[Figure: the layers with parameters W¹, b¹, W², b², …, Wᴸ, bᴸ applied in sequence to map x to y]
Matrix Operation
• Multilayer NN, function f maps inputs x to outputs y, i.e., 𝑦 = 𝑓(𝑥)
y = 𝑓(x) = σ( Wᴸ ⋯ σ( W² σ( W¹ x + b¹ ) + b² ) ⋯ + bᴸ )
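A minimal NumPy sketch of this layer-by-layer forward pass; the layer sizes and random weights below are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x) = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)"""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)     # one layer: affine transformation followed by the activation
    return a

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 2]               # a 3-layer network: 4 inputs -> 5 -> 5 -> 2 outputs
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))
```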
Softmax Layer
• In multi-class classification tasks, the output layer is typically a softmax layer
I.e., it employs a softmax activation function
If a layer with a sigmoid activation function is used as the output layer instead, the predictions by the NN may not be
easy to interpret
o Note that an output layer with sigmoid activations can still be used for binary classification
[Figure: example of the softmax computation; e.g., a raw output z₃ = −3 gives e^{z₃} ≈ 0.05, which contributes almost nothing after normalization]
Softmax Layer
The softmax activation normalizes the raw outputs into a probability distribution:
yᵢ = e^{zᵢ} / Σ_{j=1}^{3} e^{zⱼ}
[Figure: worked example with three outputs; each e^{zⱼ} is divided by the sum Σⱼ e^{zⱼ} (≈ 20.75 in the example)]
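A small NumPy sketch of the softmax computation (subtracting the maximum is a standard numerical-stability trick not shown on the slide; the example values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()             # normalize so the outputs sum to 1

z = np.array([3., 1., -3.])        # example raw outputs
print(np.round(softmax(z), 3))     # e^{-3} ~ 0.05 contributes almost nothing
```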
Activation Functions
Activation: Sigmoid
• Sigmoid function σ: takes a real-valued number and “squashes” it into the range between 0 and 1
The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; Fully firing = 1
When the neuron’s activations are close to 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
Sigmoid activations are less common in modern NNs
𝑓(𝑥) = σ(𝑥) = 1 / (1 + e⁻ˣ), σ: ℝ → (0, 1) (applied elementwise)
Activation: Tanh
• Tanh function: takes a real-valued number and “squashes” it into range between -1 and 1
Like sigmoid, tanh neurons saturate
Unlike sigmoid, the output is zero-centered
o It is therefore preferred to the sigmoid
Tanh is a scaled sigmoid: tanh(𝑥) = 2 ∙ 𝜎(2𝑥) − 1
𝑓(𝑥) = tanh(𝑥): ℝ → (−1, 1) (applied elementwise)
Activation: ReLU
• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero, 𝑓(𝑥) = max(0, 𝑥): ℝ → ℝ₊
Most modern deep NNs use ReLU activations
ReLU is fast to compute
o Compared to sigmoid and tanh
o Simply threshold a matrix at zero
Accelerates the convergence of gradient descent
o Due to linear, non-saturating form
Helps mitigate the vanishing gradient problem
Activation: Leaky ReLU
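Leaky ReLU replaces the hard zero for negative inputs with a small slope α. A compact NumPy sketch of the activation functions above (α = 0.01 is a common but arbitrary choice; the input values are illustrative):

```python
import numpy as np

def sigmoid(x):                        # squashes inputs into (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                           # zero-centered, squashes inputs into (-1, 1)
    return np.tanh(x)

def relu(x):                           # thresholds at zero; cheap and non-saturating for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):         # keeps a small gradient alpha for x < 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```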
Activation: Linear Function
• Linear function means that the output signal is proportional to the input signal to the neuron
𝑓(𝑥) = 𝑐𝑥, 𝑓: ℝ → ℝ
If the value of the constant c is 1, it is also called the identity activation function
This activation type is used in regression problems
o E.g., the last layer can have linear activation function, in
order to output a real number (and not a class
membership)
Training NNs
• The network parameters 𝜃 include the weight matrices and bias vectors from all layers
𝜃 = {𝑊¹, 𝑏¹, 𝑊², 𝑏², ⋯, 𝑊ᴸ, 𝑏ᴸ}
Often, the model parameters 𝜃 are referred to as weights
• Training a model to learn a set of parameters 𝜃 that are optimal (according to a criterion) is one of the greatest
challenges in ML
[Figure: the digit-recognition network from before, with inputs x1, …, x256 (16 × 16 = 256 pixels), a softmax output layer, and outputs y1, …, y10 giving the confidence of each digit]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Training NNs
• Data preprocessing – helps convergence during training
Mean subtraction, to obtain zero-centered data
o Subtract the mean for each individual data dimension (feature)
Standardization
o In zero-centered data, divide each feature by its standard deviation
• To obtain standard deviation of 1 and mean of 0 for each data dimension (feature)
Normalization
o Scale the data within the range [0,1] or [-1, 1]
• E.g., image pixel intensities are divided by 255 to be scaled in the [0,1] range
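A minimal NumPy sketch of these three preprocessing options; the toy data below is illustrative:

```python
import numpy as np

X = np.random.default_rng(0).uniform(0, 255, size=(100, 3))   # toy data: 100 samples, 3 features

X_centered = X - X.mean(axis=0)                   # mean subtraction: zero-centered data
X_standard = X_centered / X_centered.std(axis=0)  # standardization: zero mean, unit std per feature
X_scaled = X / 255.0                              # normalization of pixel intensities to [0, 1]
```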
Training NNs
• Define a loss function/objective function/cost function ℒ 𝜃 that calculates the difference (error) between the
model prediction and the true label
E.g., ℒ 𝜃 can be mean-squared error, cross-entropy, etc.
[Figure: for an input image of the digit “1”, the network outputs (e.g., y1 = 0.2, y2 = 0.3, …, y10 = 0.5) are compared with the one-hot true label to compute the loss ℒ(𝜃)]
Training NNs
• For a training set of N images, calculate the total loss over all images: ℒ(𝜃) = Σ_{n=1}^{N} ℒₙ(𝜃)
• Find the optimal parameters 𝜃* that minimize the total loss ℒ(𝜃)
[Figure: each training example xₙ is passed through the NN to produce a prediction ŷₙ, which is compared with the true label yₙ to give the per-example loss ℒₙ(𝜃)]
Loss Functions
• Classification tasks
Loss function: cross-entropy, ℒ(𝜃) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ yₖ⁽ⁱ⁾ log ŷₖ⁽ⁱ⁾ + (1 − yₖ⁽ⁱ⁾) log(1 − ŷₖ⁽ⁱ⁾) ]
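A NumPy sketch of this cross-entropy loss for one-hot labels and predicted probabilities; the clipping avoids log(0), and the toy values are illustrative:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -(1/N) sum_i sum_k [ y_k log(yhat_k) + (1 - y_k) log(1 - yhat_k) ]"""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum(axis=1)
    return per_example.mean()

y_true = np.array([[0, 1, 0], [1, 0, 0]])              # one-hot labels for 2 examples, 3 classes
y_pred = np.array([[0.2, 0.7, 0.1], [0.8, 0.1, 0.1]])  # predicted probabilities
print(cross_entropy(y_true, y_pred))
```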
Loss Functions
• Regression tasks
Output layer: linear (identity) or sigmoid activation
Loss function: Mean Squared Error, ℒ(𝜃) = (1/n) Σ_{i=1}^{n} (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)², or Mean Absolute Error, ℒ(𝜃) = (1/n) Σ_{i=1}^{n} |y⁽ⁱ⁾ − ŷ⁽ⁱ⁾|
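The two regression losses in a few lines of NumPy (toy values for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)    # Mean Squared Error

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(mse(y_true, y_pred), mae(y_true, y_pred))
```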
Training NNs
• Optimizing the loss function ℒ 𝜃
Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm
GD applies iterative refinement of the network parameters 𝜃
GD uses the opposite direction of the gradient of the loss with respect to the NN parameters (i.e., ∇ℒ(𝜃) = (∂ℒ/∂𝜃ᵢ)) for updating 𝜃
o The gradient of the loss function ∇ℒ(𝜃) gives the direction of fastest increase of the loss function ℒ(𝜃) when the parameters 𝜃 are changed
[Figure: a 1D loss curve ℒ(𝜃) with the slope ∂ℒ/∂𝜃ᵢ shown at one value of 𝜃ᵢ]
Training NNs
• The loss functions for most DL tasks are defined over very high-dimensional spaces
E.g., ResNet50 NN has about 23 million parameters
This makes the loss function impossible to visualize
• We can still gain intuitions by studying 1-dimensional and 2-dimensional examples of loss functions
[Figure: a 1D loss (the minimum point is obvious) and a 2D loss surface (blue = low loss, red = high loss)]
Picture from: https://cs231n.github.io/optimization-1/
Gradient Descent Algorithm
• Steps in the gradient descent algorithm:
1. Randomly initialize the model parameters, 𝜃 0
2. Compute the gradient of the loss function at the initial parameters 𝜃⁰: ∇ℒ(𝜃⁰)
3. Update the parameters as: 𝜃_new = 𝜃⁰ − α∇ℒ(𝜃⁰)
o Where α is the learning rate
4. Go to step 2 and repeat (until a terminating criterion is reached)
[Figure: a loss curve ℒ(𝜃); starting from the initial parameters 𝜃⁰, the gradient ∇ℒ = ∂ℒ/∂𝜃 determines the direction of the update]
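A minimal NumPy sketch of these four steps on a simple 1D quadratic loss; the loss, its gradient, and the learning rate are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_steps=100):
    theta = np.asarray(theta0, dtype=float)      # step 1: (randomly) initialized parameters
    for _ in range(n_steps):
        grad = grad_fn(theta)                    # step 2: gradient of the loss at theta
        theta = theta - alpha * grad             # step 3: theta_new = theta - alpha * grad
    return theta                                 # step 4: repeat until the step budget is used

# L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3); the minimum is at theta = 3
print(gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0]))   # -> approximately [3.]
```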
Gradient Descent Algorithm
• Example: a NN with only 2 parameters 𝑤₁ and 𝑤₂, i.e., 𝜃 = (𝑤₁, 𝑤₂)
The different colors represent the values of the loss (minimum loss 𝜃* is ≈ 1.3)
1. Randomly pick a starting point 𝜃⁰
2. Compute the gradient at 𝜃⁰: ∇ℒ(𝜃⁰) = (∂ℒ(𝜃⁰)/∂𝑤₁, ∂ℒ(𝜃⁰)/∂𝑤₂)
3. Multiply by the learning rate α and update 𝜃: 𝜃¹ = 𝜃⁰ − α∇ℒ(𝜃⁰)
4. Go to step 2 and repeat
[Figure: contour plot of the loss over (𝑤₁, 𝑤₂); the updates 𝜃⁰ → 𝜃¹ → … move in the direction −∇ℒ(𝜃⁰) toward the minimum 𝜃*]
Gradient Descent Algorithm
• Example (contd.): the updates continue in the same way, multiplying the gradient by the learning rate α and updating 𝜃_new = 𝜃_old − α∇ℒ(𝜃_old), e.g., 𝜃² = 𝜃¹ − α∇ℒ(𝜃¹), then going back to step 2
[Figure: the trajectory 𝜃⁰ → 𝜃¹ → 𝜃² → … on the loss contours over (𝑤₁, 𝑤₂)]
Gradient Descent Algorithm
• Gradient descent algorithm stops when a local minimum of the loss surface is
reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
• For most tasks, the loss surface ℒ(𝜃) is highly complex (and non-convex)
Backpropagation
• Modern NNs employ the backpropagation method for calculating the gradients of the loss function, ∇ℒ(𝜃) = (∂ℒ/∂𝜃ᵢ)
Backpropagation is short for “backward propagation of errors”
• For training NNs, forward propagation (forward pass) refers to passing the inputs 𝑥 through the hidden layers
to obtain the model outputs (predictions) 𝑦
The loss function ℒ(𝑦, ŷ) is then calculated
Backpropagation traverses the network in reverse order, from the outputs 𝑦 backward toward the inputs 𝑥 to calculate
the gradients of the loss 𝛻ℒ 𝜃
The chain rule is used for calculating the partial derivatives of the loss function with respect to the parameters 𝜃 in the
different layers in the network
• Each update of the model parameters 𝜃 during training takes one forward and one backward pass (e.g., of a
batch of inputs)
• Automatic calculation of the gradients (automatic differentiation) is available in all current deep learning
libraries
It significantly simplifies the implementation of deep learning algorithms, since it obviates deriving the partial
derivatives of the loss function by hand
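A tiny sketch of automatic differentiation with TensorFlow's GradientTape, fitting a toy linear model with one forward and one backward pass; the data and learning rate are illustrative:

```python
import tensorflow as tf

w = tf.Variable(0.0)
b = tf.Variable(0.0)
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([2.0, 4.0, 6.0])

with tf.GradientTape() as tape:                  # forward pass is recorded on the tape
    loss = tf.reduce_mean((w * x + b - y) ** 2)  # mean squared error
dw, db = tape.gradient(loss, [w, b])             # backward pass: gradients via the chain rule
w.assign_sub(0.1 * dw)                           # one gradient-descent update of each parameter
b.assign_sub(0.1 * db)
```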
Mini-batch Gradient Descent
• It is wasteful to compute the loss over the entire training dataset to perform a single parameter update for large
datasets
E.g., ImageNet has 14M images
Therefore, GD (a.k.a. vanilla GD) is almost always replaced with mini-batch GD
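A sketch of how a dataset can be split into mini-batches each epoch; the loss and parameter update inside the loop are only indicated by a comment, since they depend on the model:

```python
import numpy as np

def minibatches(n_examples, batch_size, rng):
    """Shuffle the example indices once per epoch and yield mini-batches."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
for epoch in range(2):
    for batch_idx in minibatches(n_examples=10, batch_size=4, rng=rng):
        # compute the loss and gradient on this mini-batch only, then update the parameters
        print(epoch, batch_idx)
```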
Stochastic Gradient Descent
Problems with Gradient Descent
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at
saddle points
[Figure: a loss curve ℒ(𝜃) with a plateau where ∇ℒ(𝜃) ≈ 0, a saddle point where ∇ℒ(𝜃) = 0, and a local minimum where ∇ℒ(𝜃) = 0]
Gradient Descent with Momentum
• Gradient descent with momentum uses an accumulated momentum term of past gradients for the parameter updates
Movement = negative of gradient + momentum
[Figure: loss curve ℒ(𝜃); at a point where the gradient is 0, the accumulated momentum can carry the update past the flat region]
Gradient Descent with Momentum
• Parameter update in GD with momentum at iteration t: 𝜃ᵗ = 𝜃ᵗ⁻¹ − Vᵗ
o Where: Vᵗ = βVᵗ⁻¹ + α∇ℒ(𝜃ᵗ⁻¹)
o I.e., 𝜃ᵗ = 𝜃ᵗ⁻¹ − α∇ℒ(𝜃ᵗ⁻¹) − βVᵗ⁻¹
• Compare to vanilla GD: 𝜃ᵗ = 𝜃ᵗ⁻¹ − α∇ℒ(𝜃ᵗ⁻¹)
Where 𝜃ᵗ⁻¹ are the parameters from the previous iteration t − 1
• The term Vᵗ is called momentum
This term accumulates the gradients from the past several steps, i.e.,
Vᵗ = βVᵗ⁻¹ + α∇ℒ(𝜃ᵗ⁻¹)
= β(βVᵗ⁻² + α∇ℒ(𝜃ᵗ⁻²)) + α∇ℒ(𝜃ᵗ⁻¹)
= β²Vᵗ⁻² + βα∇ℒ(𝜃ᵗ⁻²) + α∇ℒ(𝜃ᵗ⁻¹)
= β³Vᵗ⁻³ + β²α∇ℒ(𝜃ᵗ⁻³) + βα∇ℒ(𝜃ᵗ⁻²) + α∇ℒ(𝜃ᵗ⁻¹)
This term is analogous to the momentum of a heavy ball rolling down a hill
• This method updates the parameters 𝜃 in the direction of a weighted average of the past gradients
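A minimal NumPy sketch of the momentum update Vᵗ = βVᵗ⁻¹ + α∇ℒ(𝜃ᵗ⁻¹), 𝜃ᵗ = 𝜃ᵗ⁻¹ − Vᵗ, applied to a toy quadratic loss; the values of α and β are illustrative:

```python
import numpy as np

def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    velocity = beta * velocity + alpha * grad    # accumulate past gradients
    theta = theta - velocity                     # move in the direction of the momentum term
    return theta, velocity

theta = np.array([5.0])
velocity = np.zeros_like(theta)
for _ in range(200):                             # minimize L(theta) = theta^2 (gradient 2*theta)
    theta, velocity = momentum_step(theta, 2 * theta, velocity)
print(theta)                                     # close to the minimum at 0
```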
Nesterov Accelerated Momentum
[Figure: comparison of the update directions for GD with momentum and GD with Nesterov accelerated momentum]
Adam
Learning Rate
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along
the opposite direction we should step
Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN
training
[Figure: with a learning rate that is too small, convergence is very slow; with a learning rate that is too large, the updates overshoot the minimum and may diverge]
Learning Rate
• Training loss for different learning rates
High learning rate: the loss increases or plateaus too quickly
Low learning rate: the loss decreases too slowly (takes many epochs to reach a solution)
Learning Rate Scheduling
• Learning rate scheduling is applied to change the values of the learning rate during the training
Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
o Approach 1: reduce the learning rate by some factor every few epochs
• Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
o Approach 2: exponential or cosine decay gradually reduce the learning rate over time
o Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss stops improving
• In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
• Monitor: validation loss, factor: 0.1 (i.e., divide by 10), patience: 10 (how many epochs to wait before reducing), minimum learning rate: 1e-6 (lower bound on the learning rate)
Warmup is gradually increasing the learning rate at the start of training, and afterward letting it cool down until the end of the training
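A sketch of Approach 3 using the Keras callback named above with the listed settings; the model.fit call is only indicated in a comment:

```python
import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.1,           # new_lr = lr * factor, i.e., divide by 10
    patience=10,          # epochs to wait without improvement before reducing
    min_lr=1e-6,          # lower bound on the learning rate
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[reduce_lr])
```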
Vanishing Gradient Problem
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
They result in very small or very large updates of the parameters
Solutions: change learning rate, ReLU activations, regularization, LSTM units in RNNs
[Figure: in a deep fully-connected network, the layers close to the input may receive very small gradients and therefore learn very slowly]
Generalization
• Underfitting
The model is too “simple” to represent all the relevant
class characteristics
E.g., model with too few parameters
Produces high error on the training set and high error
on the validation set
• Overfitting
The model is too “complex” and fits irrelevant
characteristics (noise) in the data
E.g., model with too many parameters
Produces low error on the training set and high error
on the validation set
Overfitting
• Overfitting – a model with high capacity fits the noise in the data instead of the
underlying relationship
• The model may fit the training data very well, but
fails to generalize to new examples (test or
validation data)
Regularization: Weight Decay
• ℓ𝟐 weight decay
A regularization term that penalizes large weights is added to the loss function
ℒ_reg(𝜃) = ℒ(𝜃) + λ Σₖ 𝜃ₖ²
For every weight in the network, we add the regularization term to the loss value
o During gradient descent parameter update, every weight is decayed linearly toward zero
The weight decay coefficient 𝜆 determines how dominant the regularization is during the gradient computation
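A small NumPy sketch of ℓ2 weight decay: the regularized loss, and the corresponding extra term 2λ𝜃 in the gradient update that decays every weight toward zero (λ and α below are illustrative values):

```python
import numpy as np

def l2_regularized_loss(data_loss, theta, lam=1e-4):
    """L_reg(theta) = L(theta) + lambda * sum_k theta_k^2"""
    return data_loss + lam * np.sum(theta ** 2)

def sgd_step_with_weight_decay(theta, grad_data, alpha=0.01, lam=1e-4):
    # the regularizer adds 2*lambda*theta to the gradient, shrinking every weight toward zero
    return theta - alpha * (grad_data + 2 * lam * theta)
```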
Regularization: Weight Decay
Regularization: Weight Decay
• ℓ𝟏 weight decay
The regularization term is based on the ℓ1 norm of the weights
ℒ_reg(𝜃) = ℒ(𝜃) + λ Σₖ |𝜃ₖ|
ℓ1 weight decay is less common with NNs
o Often performs worse than ℓ2 weight decay
It is also possible to combine ℓ1 and ℓ2 regularization
o Called elastic net regularization
ℒ_reg(𝜃) = ℒ(𝜃) + λ₁ Σₖ |𝜃ₖ| + λ₂ Σₖ 𝜃ₖ²
Regularization: Dropout
• Dropout
Randomly drop units (along with their connections) during training
Each unit is dropped with a fixed probability p (the dropout rate), independently of the other units
The hyper-parameter p needs to be chosen (tuned)
o Often, between 20% and 50% of the units are dropped
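A NumPy sketch of (inverted) dropout: each unit is zeroed with probability p during training and the survivors are rescaled so the expected activation is unchanged; at test time all units are kept. The toy activations are illustrative:

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    if not training:
        return a                                   # no units are dropped at test time
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = (rng.random(a.shape) >= p) / (1.0 - p)  # drop with probability p, rescale the rest
    return a * mask

activations = np.ones((2, 8))
print(dropout(activations, p=0.5))                 # roughly half of the units are zeroed out
```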
Regularization: Dropout
Regularization: Early Stopping
• Early-stopping
During model training, use a validation set along with the training set
o E.g., a validation/train split of about 25%/75% is often used
Stop when the validation accuracy (or loss) has not improved after n subsequent epochs
o The parameter n is called patience
[Figure: training and validation error curves over epochs; training is stopped at the point where the validation error starts to increase]
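A sketch of early stopping with the Keras EarlyStopping callback; patience = 5 is an illustrative choice for the slide's n, and the 25% validation split follows the bullet above:

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",           # watch the validation loss
    patience=5,                   # the "n subsequent epochs" without improvement
    restore_best_weights=True,    # roll back to the best parameters seen on the validation set
)
# model.fit(x_train, y_train, validation_split=0.25, callbacks=[early_stop])
```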
Batch Normalization
• Batch normalization layers act similarly to the data preprocessing steps mentioned earlier
They calculate the mean μ and standard deviation σ of a batch of input data, and normalize the data x to zero mean and unit variance
i.e., 𝑥̂ = (x − μ) / σ
• BatchNorm layers alleviate the problems of proper initialization of the parameters and hyper-parameters
They result in faster convergence during training and allow larger learning rates
Reduce the internal covariate shift
• BatchNorm layers are inserted immediately after convolutional layers or fully-connected layers, and before
activation layers
They are very common with convolutional NNs
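A NumPy sketch of the batch normalization computation; the learnable scale γ and shift β used in practice are included here as fixed arguments, and the toy batch is illustrative:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(batch)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```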
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
Check all values in a range with a step value
• Random search
Randomly sample values for the parameter
Often preferred to grid search
k-Fold Cross-Validation
• Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small
It also leads to a better and less noisy estimate of the model performance by averaging the results across several folds
k-Fold Cross-Validation
• Illustration of a 5-fold cross-validation
Ensemble Learning
• Ensemble learning is training multiple classifiers separately and combining their predictions
Ensemble learning often outperforms individual classifiers
Better results are obtained with higher model variety in the ensemble
Bagging (bootstrap aggregating)
o Randomly draw subsets from the training set (i.e., bootstrap samples)
o Train separate classifiers on each subset of the training set
o Perform classification based on the average vote of all classifiers
Boosting
o Train a classifier, and apply weights on the training set (apply higher weights on misclassified examples, focus on “hard examples”)
o Train new classifier, reweight training set according to prediction error
o Repeat
o Perform classification based on weighted vote of the classifiers
Deep vs Shallow Networks
[Figure: a shallow NN (one wide hidden layer) and a deep NN (many hidden layers) mapping the same inputs x1, …, xN to the output]
Convolutional Neural Networks (CNNs)
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
Allows parameter sharing
Efficient to train
Have less parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
[Figure: a 3×3 convolutional filter sliding across the input matrix]
Convolutional Neural Networks (CNNs)
• When the convolutional filters are scanned over the image, they capture useful features
E.g., edge detection by convolutions
Filter:
0  1  0
1 −4  1
0  1  0
[Figure: the pixel-intensity matrix of an input image; convolving it with the filter above highlights the edges of the digit]
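A NumPy sketch of sliding the 3×3 edge-detection filter over a small synthetic image; CNN libraries actually compute cross-correlation, which is identical here because the filter is symmetric, and the image below is illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution of a single-channel image with a small filter."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]])                       # the edge-detection filter from the slide
image = np.zeros((8, 8)); image[2:6, 2:6] = 1.0      # a bright square on a dark background
print(conv2d(image, kernel))                         # non-zero responses only along the edges
```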
Convolutional Neural Networks (CNNs)
• In CNNs, hidden units in a layer are only connected to a small region of the layer before it (called local receptive
field)
The depth of each feature map corresponds to the number of convolutional filters used at each layer
[Figure: two convolutional filters applied to the input image produce the Layer 1 and Layer 2 feature maps; each hidden unit is connected only to a local receptive field, and the weights w1, …, w8 are shared across positions]
Convolutional Neural Networks (CNNs)
[Figure: convolving the input matrix with the filter produces the output (feature) matrix]
Convolutional Neural Networks (CNNs)
[Figure: a CNN for scene classification (living room, bedroom, kitchen, bathroom, outdoor): stacked conv layers with 64, 128, 256, and 512 filters, interleaved with max pooling, followed by fully connected layers]
Residual CNNs
Recurrent Neural Networks (RNNs)
• Recurrent NNs are used for modeling sequential data and data with varying length of inputs and outputs
Videos, text, speech, DNA sequences, human skeletal data
• RNN variants:
Basic (Vanilla) RNN networks
o Are sensitive to the vanishing gradient problem
Long Short-Term Memory (LSTM) networks
o LSTM mitigates the vanishing/exploding gradient problem
• Solution: a Memory Cell, updated at each step in the sequence
o Three gates control the flow of information to and from the Memory Cell
Gated Recurrent Unit (GRU) networks
o Similar to LSTM, but less commonly used
Recurrent Neural Networks (RNNs)
• RNNs use the same set of weights 𝑤ℎ and 𝑤𝑥 across all time steps
A sequence of hidden states ℎ₀, ℎ₁, ℎ₂, ℎ₃, … is learned, which represents the memory of the network
The hidden state at step t, h(t), is calculated based on the previous hidden state h(t − 1) and the input at the current step x(t), i.e., h(t) = 𝑓ℎ(𝑤ℎ ∙ h(t − 1) + 𝑤𝑥 ∙ x(t))
The function 𝑓ℎ ∙ is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time
[Figure: the same weights 𝑤𝑥 and 𝑤ℎ are applied at every step of the input sequence 𝑥₁, 𝑥₂, 𝑥₃, …]
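A scalar NumPy sketch of the recurrence h(t) = 𝑓ℎ(𝑤ℎ·h(t−1) + 𝑤𝑥·x(t)) with tanh as the activation; the weights and input sequence are illustrative:

```python
import numpy as np

def rnn_forward(x_seq, w_x, w_h, h0=0.0):
    h = h0
    states = []
    for x_t in x_seq:
        h = np.tanh(w_h * h + w_x * x_t)   # the same weights are reused at every time step
        states.append(h)
    return states                          # the sequence of hidden states h1, h2, h3, ...

print(rnn_forward(x_seq=[1.0, 0.5, -1.0], w_x=0.8, w_h=0.5))
```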
Recurrent Neural Networks (RNNs)
• RNNs can have one or many inputs and one or many outputs
[Figure: examples of image captioning (an image is mapped to the caption “A person riding a motorbike on dirt road”) and machine translation (“Happy Diwali” is mapped to “शुभ दीपावली”)]
Slide credit: Param Vir Singh – Deep Learning
Bidirectional RNNs
• Bidirectional RNNs incorporate both forward and backward passes through sequential data
The output may not only depend on the previous elements in the sequence, but also on future elements in the sequence
It resembles two RNNs stacked on top of each other
𝑦ₜ = 𝑓(ℎ→ₜ ; ℎ←ₜ), combining the forward and backward hidden states
LSTM Networks
• Most modern RNN models use either LSTM units or other more advanced types of recurrent units (e.g., GRU
units)
LSTM Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences
Transformer Networks
• Transformer networks were initially designed for processing text data in Large Language Models, such as GPT-3, ChatGPT, etc.
Later, they have also been used for image tasks and for tabular data processing
• The main block of transformers is the self-attention mechanism, which uses scaled dot-product attention to force
the model to attend to portions of the data
Several self-attention modules are combined into a multi-head attention layer
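A NumPy sketch of scaled dot-product self-attention, softmax(QKᵀ/√d_k)V, applied with Q = K = V = X for a toy sequence; this is a single head, and multi-head attention runs several such computations in parallel:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # a toy sequence of 4 tokens with 8-dim embeddings
print(scaled_dot_product_attention(X, X, X).shape)             # self-attention output is (4, 8)
```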
Generative adversarial network (GAN)
• Generative models are capable of making new plausible data.
• The term “adversarial” in this case pertains to a unique architecture
for training an effective generator network.
• Random input vectors are given to the generator to make plausible
samples that are ideally discriminated 50:50 in a fully trained model.
However, some issues in training GANs include:
• Vanishing gradient
It is best if the discriminator starts in a less robust state, so that it improves along with the generator rather than being too sophisticated from the start
The modified minimax loss proposed in the original paper can help with this: the generator maximizes log(D(G(z))) instead of minimizing log(1 − D(G(z)))
• Mode Collapse
A local minimum of the adversarial training in which the generator produces only a limited variety of samples; the Wasserstein loss can help with this, using a critic model that scores D(x) − D(G(z)) rather than a threshold-valued discriminator
• Failure to Converge
Regularization of discriminator can help with this
Other Useful Resources
Happy Learning
https://www.deeplearningbook.org/
https://d2l.ai/