Lecture 2 Deep Learning Overview
Lecture Outline
Supervised Learning
Machine Learning Basics
[Figure: in supervised learning, labeled data (class A, class B) are used during training to obtain a learned model, which is then used for prediction; typical tasks include classification and regression.]
Unsupervised Learning
Machine Learning Basics
[Figure: unsupervised learning works with unlabeled data; a typical task is clustering.]
• Nearest Neighbor – for each test data point, assign the class label of
the nearest training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class
of the nearest data point (minimum distance)
It does not require learning a set of weights
[Figure: training examples from class 1 (x) and class 2 (o) in a 2D feature space (x1, x2); a test example (+) is assigned the class of its nearest training example under the chosen distance, e.g., the L1 norm (Manhattan distance).]
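As an added illustration (not from the slides), here is a minimal NumPy sketch of a 1-nearest-neighbor classifier using the Manhattan (L1) distance; the toy data arrays are made up for the example.

import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """Assign each test point the label of its closest training point (L1 distance)."""
    preds = []
    for x in X_test:
        dists = np.abs(X_train - x).sum(axis=1)   # Manhattan distance to every training example
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

# Toy 2D data: class 1 ("x") and class 2 ("o")
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([1, 1, 2, 2])
X_test = np.array([[0.1, 0.2], [1.0, 0.9]])
print(nearest_neighbor_predict(X_train, y_train, X_test))  # [1 2]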
Linear Classifier
Machine Learning Basics
• Linear classifier
Find a linear function f of the inputs xi that separates the classes
Use pairs of inputs and labels to find the weights matrix W and the bias
vector b
o The weights and biases are the parameters of the function f
Several methods have been used to find the optimal set of parameters of a
linear classifier
o A common method of choice is the Perceptron algorithm, where the parameters
are updated until a minimal error is reached (single layer, does not use
backpropagation)
A linear classifier is a simple approach, but it is a building block of more
advanced classification algorithms, such as SVMs and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons
(MLPs)
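As an added sketch (not from the slides), the Perceptron algorithm mentioned above can be written in a few lines for a binary linear classifier f(x) = sign(w·x + b); the toy data and learning rate are assumptions.

import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron: update weights/bias only on misclassified examples (labels in {-1, +1})."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:     # misclassified example
                w += lr * yi * xi          # move the decision boundary toward it
                b += lr * yi
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.]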
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
Inputs: x; Features: non-linear functions of the inputs, φ(x); Outputs: predictions computed from the features, e.g., y = f(W φ(x) + b)
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space
where the training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision
boundary in the original feature space
Φ : x ↦ φ(x)
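As a small added illustration of the kernel idea: for a degree-2 polynomial kernel, an explicit feature map φ gives the same inner products as the kernel computed directly in the original input space, so the mapping never has to be formed explicitly. The inputs below are made-up values.

import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))     # 1.0
print(poly_kernel(x, z))   # 1.0  -> identical, without building phi(x) explicitly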
Binary vs Multi-class Classification
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
Why is DL Useful?
Introduction to Deep Learning
Representational Power
Introduction to Deep Learning
[Example: handwritten digit recognition. A 16 x 16 image (256 pixels) is flattened into inputs x1, ..., x256, with ink → 1 and no ink → 0. The network outputs y1, ..., y10, where each dimension represents the confidence of a digit: e.g., y1 = 0.1 (is "1"), y2 = 0.7 (is "2"), y10 = 0.2 (is "0"), so the image is recognized as "2".]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
The machine implements a function f : R^256 → R^10 that maps the pixel inputs x1, ..., x256 to the output scores y1, ..., y10 (here recognizing the digit "2"). The function is represented by a neural network.
Single neuron: given inputs a1, ..., aK with weights w1, ..., wK and bias b, the neuron computes z = a1 w1 + a2 w2 + ... + aK wK + b and outputs a = σ(z), where σ is the activation function.
Weights and biases: a hidden layer computes h = σ(W1 x + b1), where W1 and b1 are the layer's weights and biases and σ is the activation function.
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
[Figure: a fully connected network mapping inputs x1, ..., xN to outputs y1, ..., yM. Worked example for one neuron: with inputs (1, −1), weights (1, −2), and bias 1, it computes (1 · 1) + (−1) · (−2) + 1 = 4.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
The network defines a function f : R^2 → R^2, with f([1, −1]^T) = [0.62, 0.83]^T.
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Matrix Operation
Introduction to Neural Networks
a = σ(W x + b). For the example network: W = [1 −2; −1 1], x = [1; −1], b = [1; 0], so W x + b = [4; −2] and a = σ([4; −2]) = [0.98; 0.12].
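A quick NumPy check of this single-layer computation (an added sketch; the sigmoid is the activation used in this example, giving outputs 0.98 and 0.12):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
b = np.array([1.0, 0.0])
print(W @ x + b)            # [ 4. -2.]
print(sigmoid(W @ x + b))   # [0.98 0.12] (approximately)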
Matrix Operation
Introduction to Neural Networks
[Figure: the first layer maps the input x = (x1, ..., xN) to a1 = σ(W1 x + b1) on the way to the outputs y1, ..., yM.]
Matrix Operation
Introduction to Neural Networks
[Figure: layer-by-layer computation with weights W1, W2, ..., WL and biases b1, b2, ..., bL:]
a1 = σ(W1 x + b1), a2 = σ(W2 a1 + b2), ..., y = σ(WL aL−1 + bL)
Matrix Operation
Introduction to Neural Networks
[Figure: the full network from x to y.]
y = f(x) = σ( WL · · · σ( W2 σ( W1 x + b1 ) + b2 ) · · · + bL )
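The composed function above can be written as a short forward-pass loop; this is an added NumPy sketch with randomly initialized weights and assumed layer sizes, not code from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)"""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [256, 64, 32, 10]    # assumed layer sizes (e.g., the digit example)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x = rng.standard_normal(256)
y = forward(x, weights, biases)
print(y.shape)               # (10,)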
Softmax Layer
Introduction to Neural Networks
[Example: applying a sigmoid to each output independently gives z1 = 3 → y1 = 0.95, z2 = 1 → y2 = 0.73, z3 = −3 → y3 = 0.05; these values do not sum to 1.]
Softmax Layer
Introduction to Neural Networks
Softmax: y_i = e^{z_i} / Σ_{j=1}^{3} e^{z_j}
Example: z = (3, 1, −3) gives (e^3, e^1, e^{−3}) = (20, 2.7, 0.05), so y ≈ (0.88, 0.12, ≈0); the outputs are positive and sum to 1.
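An added sketch of a numerically stable softmax (subtracting the maximum before exponentiating is a standard trick, not something stated on the slide):

import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(np.round(softmax(z), 2))  # [0.88 0.12 0.  ]
print(softmax(z).sum())         # 1.0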
Activation Functions
Introduction to Neural Networks
Activation: Sigmoid
Introduction to Neural Networks
f(x) = 1 / (1 + e^{−x}), mapping ℝ^n → [0, 1] elementwise
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: Tanh
Introduction to Neural Networks
f(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}), mapping ℝ^n → [−1, 1] elementwise
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: ReLU
Introduction to Neural Networks
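For reference (an added sketch, not slide content): ReLU is f(x) = max(0, x), and the three activations above can be implemented and compared in a few lines of NumPy.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # range (0, 1)

def tanh(x):
    return np.tanh(x)                  # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # range [0, inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))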
Training NNs
Training Neural Networks
[Figure: the digit-recognition network with inputs x1, ..., x256 (16 x 16 = 256 pixels), a softmax output layer, and outputs y1, ..., y10 (e.g., y1 = 0.1 for "1", y2 = 0.7 for "2", y10 = 0.2 for "0").]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Training NNs
Training Neural Networks
• To train a NN, set the parameters so that, for each image in a training subset, the output element corresponding to its true class has the maximum value
Training NNs
Training Neural Networks
[Figure: for an input image with true label "1", the network outputs (y1 = 0.2, y2 = 0.3, ..., y10 = 0.5) are compared with the one-hot target (1, 0, ..., 0); the mismatch defines the cost ℒ(θ).]
Training NNs
Training Neural Networks
• For a training set of N images, calculate the total loss over all images: ℒ(θ) = Σ_{n=1}^{N} ℒ_n(θ)
• Find the optimal parameters θ* that minimize the total loss ℒ(θ)
[Figure: each training example x_i is passed through the NN to produce a prediction ŷ_i, which is compared with the label y_i to give a per-example loss ℒ_i(θ), for i = 1, ..., N.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Loss Functions
Training Neural Networks
• Classification tasks
Cross-entropy loss function:
ℒ(θ) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_k^{(i)} log ŷ_k^{(i)} + (1 − y_k^{(i)}) log(1 − ŷ_k^{(i)}) ]
where y_k^{(i)} are the ground-truth class labels and ŷ_k^{(i)} are the model's predicted class labels
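An added NumPy sketch of this cross-entropy computation on a tiny made-up batch (the clipping constant is an assumption to avoid log(0)):

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -(1/N) * sum_i sum_k [ y log(yhat) + (1 - y) log(1 - yhat) ]"""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_example = np.sum(y_true * np.log(y_pred)
                         + (1.0 - y_true) * np.log(1.0 - y_pred), axis=1)
    return -np.mean(per_example)

# Two examples, three classes (one-hot labels, predicted probabilities)
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_pred))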
Loss Functions
Training Neural Networks
• Regression tasks
Output layer: linear (identity) or sigmoid activation
Mean Squared Error loss function: ℒ(θ) = (1/n) Σ_{i=1}^{n} ( y^{(i)} − ŷ^{(i)} )^2
Mean Absolute Error loss function: ℒ(θ) = (1/n) Σ_{i=1}^{n} | y^{(i)} − ŷ^{(i)} |
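An added sketch of the two regression losses on made-up predictions:

import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: (1/n) * sum_i (y_i - yhat_i)^2"""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: (1/n) * sum_i |y_i - yhat_i|"""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
print(mae(y_true, y_pred))  # 0.5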
Training NNs
Training Neural Networks
[Figure: the loss ℒ(θ) plotted against a parameter θ_i; the derivative ∂ℒ/∂θ_i gives the slope used for the parameter update.]
Parameter update: θ^{new} = θ^{old} − η ∇ℒ(θ^{old}), where θ denotes the parameters and η the learning rate
• Example (gradient descent over two parameters w1 and w2):
1. Start from initial parameters θ^0
2. Compute the gradient at θ^0: ∇ℒ(θ^0) = [ ∂ℒ(θ^0)/∂w1, ∂ℒ(θ^0)/∂w2 ]^T
3. Multiply by the learning rate η and update: θ^1 = θ^0 − η ∇ℒ(θ^0)
4. Go to step 2, repeat
[Figure: the loss over (w1, w2), with the updates moving from θ^0 toward the minimum θ*.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Example (contd.)
[Figure: repeating the update moves the parameters from θ^0 step by step toward a minimum.]
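An added sketch of this update loop on a simple quadratic loss (the loss function and learning rate are made up for illustration):

import numpy as np

def loss(theta):
    """Toy convex loss in (w1, w2) with minimum at (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])    # theta^0: initial parameters
lr = 0.1                        # learning rate (eta)
for step in range(100):
    theta = theta - lr * grad(theta)   # theta <- theta - eta * grad L(theta)
print(theta, loss(theta))       # close to [3, -1], loss near 0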
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
Gradient descent may reach different minima at every run
Therefore, the NN will produce different predicted outputs
• In addition, we currently don't have algorithms that guarantee reaching a global minimum for an arbitrary loss function
[Figure: a non-convex loss surface ℒ over the parameters (w1, w2), with multiple local minima.]
Backpropagation
Training Neural Networks
• Besides the local minima problem, the GD algorithm can be very slow
at plateaus, and it can get stuck at saddle points
[Figure: a loss curve over θ showing a plateau where ∇ℒ(θ) ≈ 0, a saddle point where ∇ℒ(θ) = 0, and a local minimum where ∇ℒ(θ) = 0.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
[Figure: gradient descent with momentum on a loss curve: the real movement is the sum of the negative gradient and the momentum term, which can carry the parameters past flat regions and points where the gradient = 0.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
This term is analogous to the momentum of a heavy ball rolling down the hill
• The parameter is referred to as the coefficient of momentum
A typical value of the parameter is 0.9
• This method updates the parameters in the direction of a weighted average of the past gradients
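An added sketch of a common form of the momentum update, v ← β v − η ∇ℒ(θ) and θ ← θ + v, using β for the momentum coefficient mentioned above; the toy loss, learning rate, and step count are assumptions:

import numpy as np

def grad(theta):
    """Gradient of a toy quadratic loss with minimum at (3, -1)."""
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)        # velocity: weighted average of past gradients
lr, beta = 0.05, 0.9            # learning rate and momentum coefficient
for step in range(200):
    v = beta * v - lr * grad(theta)
    theta = theta + v
print(theta)                    # close to [3, -1]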
[Figure: comparison of GD with momentum and GD with Nesterov momentum.]
Adam
Training Neural Networks
Learning Rate
Training Neural Networks
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate
of increase, but it does not tell us how far along the opposite direction we
should step
Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training
[Figure: loss curves when the learning rate is too small (very slow convergence) vs. too large (the loss oscillates or diverges).]
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
They result in very small or very large updates of the parameters
Solutions: change the learning rate, ReLU activations, regularization, LSTM units in RNNs
[Figure: a deep fully connected network from inputs x1, ..., xN to outputs y1, ..., yM, through which gradients must be propagated across many layers.]
Generalization
Generalization
• Underfitting
The model is too “simple” to
represent all the relevant class
characteristics
E.g., model with too few
parameters
Produces high error on the training
set and high error on the validation
set
• Overfitting
The model is too "complex" and fits irrelevant characteristics (noise) in the data
E.g., model with too many parameters
Produces low error on the training set, but high error on the validation set
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data
instead of the underlying relationship
• weight decay
A regularization term that penalizes large weights is added to the loss
function
Total loss = data loss + regularization loss
For every weight in the network, we add the regularization term to the
loss value
o During gradient descent parameter update, every weight is decayed linearly
toward zero
The weight decay coefficient determines how dominant the regularization
is during the gradient computation
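An added sketch of weight decay, assuming the squared-L2 form of the penalty (which matches the "decayed linearly toward zero" behavior described above); lam stands for the weight decay coefficient:

import numpy as np

def total_loss(data_loss, weights, lam):
    """Total loss = data loss + lam * sum of squared weights (L2 regularization)."""
    reg_loss = sum(np.sum(W ** 2) for W in weights)
    return data_loss + lam * reg_loss

def sgd_step_with_decay(W, grad_W, lr, lam):
    """Gradient of lam*||W||^2 is 2*lam*W, so each weight decays toward zero."""
    return W - lr * (grad_W + 2.0 * lam * W)

W = np.array([[0.5, -1.2], [2.0, 0.1]])
grad_W = np.zeros_like(W)               # pretend the data-loss gradient is zero
print(sgd_step_with_decay(W, grad_W, lr=0.1, lam=0.1))  # weights shrink toward 0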
• weight decay
The regularization term is based on the norm of the weights
Regularization: Dropout
Regularization
• Dropout
Randomly drop units (along with their connections) during training
Each unit is retained with a fixed probability p, independent of the other units
The hyper-parameter p needs to be chosen (tuned)
o Often, between 20% and 50% of the units are dropped
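An added sketch of dropout applied to a layer's activations during training; the retention probability and activations are made up, and the 1/p scaling is the standard "inverted dropout" convention rather than something stated on the slide:

import numpy as np

def dropout(h, p_keep=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero units; scale the survivors by 1/p_keep (inverted dropout)."""
    if not training:
        return h                       # no dropout at test time
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

h = np.array([0.2, 1.5, -0.7, 0.9, 0.3, -1.1])
print(dropout(h, p_keep=0.5))          # about half the units are zeroed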
Regularization: Dropout
Regularization
• Early-stopping
During model training, use a validation set
o E.g., validation/train ratio of about 25% to 75%
Stop when the validation accuracy (or loss) has not improved after n
epochs
o The parameter n is called patience
[Figure: training and validation error curves over training epochs; training is stopped when the validation error stops improving.]
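An added sketch of early stopping with a patience counter; the training and validation functions are placeholders for whatever model is being trained:

def train_with_early_stopping(train_one_epoch, eval_validation, patience=5, max_epochs=100):
    """Stop when validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = eval_validation()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break
    return best_loss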
Batch Normalization
Regularization
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
Check all values in a range with a step value
• Random search
Randomly sample values for the parameter
Often preferred to grid search
• Bayesian hyper-parameter optimization
Is an active area of research
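An added sketch of random search over two hyper-parameters; the learning rate and dropout retention probability are example choices, and the evaluation function is a placeholder for training a model and returning its validation score:

import numpy as np

def random_search(evaluate, n_trials=20, rng=np.random.default_rng(0)):
    """Randomly sample hyper-parameter values and keep the best configuration."""
    best_score, best_cfg = -np.inf, None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-4, -1),     # log-uniform learning rate
            "p_keep": rng.uniform(0.5, 0.9),     # dropout retention probability
        }
        score = evaluate(cfg)                    # e.g., validation accuracy
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Toy stand-in for "train a model and return its validation score"
best = random_search(lambda cfg: -abs(np.log10(cfg["lr"]) + 3) - abs(cfg["p_keep"] - 0.8))
print(best)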
k-Fold Cross-Validation
k-Fold Cross-Validation
Ensemble Learning
Ensemble Learning
[Figure: a shallow NN (one wide hidden layer) vs. a deep NN (many stacked layers), both mapping the inputs x1, x2, ..., xN to the output.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Convolutional Neural Networks
[Figure: the convolution operation: a 3x3 filter is slid over the input matrix.]
• When the convolutional filters are scanned over the image, they
capture useful features
E.g., edge detection by convolutions
Filter (Laplacian-style edge detector): [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
[Figure: the pixel values of an input image and the resulting convolved (edge-detected) output image.]
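An added sketch of 2D convolution (computed as cross-correlation, as is common in CNN libraries) with this edge-detection filter; the tiny input image is made up:

import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[0, 1, 0],
                        [1, -4, 1],
                        [0, 1, 0]], dtype=float)
image = np.zeros((6, 6))
image[:, 2:4] = 1.0                 # a vertical bar of "ink"
print(conv2d(image, edge_filter))   # nonzero responses along the bar's edges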
[Figure: convolutional filters with weights w1, ..., w8: Filter 1 applied to the input image produces the Layer 1 feature map, and Filter 2 applied to the Layer 1 feature map produces the Layer 2 feature map.]
[Figure: a CNN for scene classification (Bedroom, Kitchen, Bathroom, Outdoor): stacked convolutional layers with increasing numbers of filters (64, 64, 128, 128, 256, 256, 512, ...), interleaved with max-pooling layers and followed by fully connected layers.]
Residual CNNs
Convolutional Neural Networks
• Recurrent NNs are used for modeling sequential data and data with
varying length of inputs and outputs
Videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively
passing information across a sequence
Memory of the previous inputs is stored in the model's internal state and affects the model's predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than
CNNs
• RNNs use the same set of weights (w_h, w_x, and w_y) across all time steps
A sequence of hidden states is learned, which represents the memory of the network
The hidden state at step t, h_t, is calculated based on the previous hidden state h_{t−1} and the input at the current step x_t, i.e., h_t = f(w_h h_{t−1} + w_x x_t)
The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time
[Figure: the unrolled RNN: the input sequence x1, x2, x3 enters through weights w_x, the hidden states h0 → h1 → h2 → h3 are connected through weights w_h, and the output is produced through weights w_y.]
Slide credit: Param Vir Singh – Deep Learning
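An added sketch of this recurrence unrolled over a short input sequence; the weights and inputs are random placeholders, and tanh is used as the nonlinearity f:

import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """h_t = tanh(W_h h_{t-1} + W_x x_t), with the same weights at every step."""
    h = h0
    hidden_states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
hidden_dim, input_dim, seq_len = 4, 3, 5
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.5
W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.5
xs = [rng.standard_normal(input_dim) for _ in range(seq_len)]
hs = rnn_forward(xs, W_h, W_x, h0=np.zeros(hidden_dim))
print(len(hs), hs[-1].shape)   # 5 hidden states, each of shape (4,)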
• RNNs can have one or many inputs and one or many outputs
Image captioning: image → "A person riding a motorbike on dirt road"
Machine translation: "Happy Diwali" → "शुभ दीपावली"
Bidirectional RNNs
Recurrent Neural Networks
Forward hidden state: →h_t = σ( →W^(hh) →h_{t−1} + →W^(hx) x_t )
Backward hidden state: ←h_t = σ( ←W^(hh) ←h_{t+1} + ←W^(hx) x_t )
Output: y_t = f( [ →h_t ; ←h_t ] )
LSTM Networks
Recurrent Neural Networks
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences
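An added sketch of a single LSTM cell step using the standard formulation with the four components named above (input, forget, and output gates plus the memory cell); all weights below are random placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: input (i), forget (f), output (o) gates and candidate memory (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate memory
    c = f * c_prev + i * g                               # memory cell update
    h = o * np.tanh(c)                                   # hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
W = {k: rng.standard_normal((d_hid, d_in)) for k in "ifog"}
U = {k: rng.standard_normal((d_hid, d_hid)) for k in "ifog"}
b = {k: np.zeros(d_hid) for k in "ifog"}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
print(h.shape, c.shape)   # (4,) (4,)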
References