Lecture 2: Deep Learning Overview
CS 404/504, Fall 2021
Lecture Outline
Machine Learning
Machine Learning Basics
(Figure: during training, labeled data and a learning algorithm produce a learned model; during prediction, the learned model assigns labels, e.g., class A or class B, to new data)
Supervised Learning and Unsupervised Learning
Machine Learning Basics
(Figure: classification and regression are supervised learning tasks; clustering is an unsupervised learning task)
• Nearest Neighbor – for each test data point, assign the class label of the nearest
training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest
data point (minimum distance)
It does not require learning a set of weights
(Figure: a test example and training examples from class 1 and class 2)
• For image classification, the distance between all pixels is calculated (e.g., using the ℓ1 norm or the ℓ2 norm)
Accuracy on CIFAR-10: 38.6%
• Disadvantages:
The classifier must remember all training data and store it for future comparisons with
the test data
Classifying a test image is expensive since it requires a comparison to all training
images
(Figure: decision regions of a nearest-neighbor classifier in a 2-D feature space (x1, x2), using the ℓ1 norm (Manhattan distance) as the distance function)
Linear Classifier
Machine Learning Basics
• Linear classifier
 Find a linear function of the inputs xi that separates the classes: f(xi, W, b) = W xi + b
 Use pairs of inputs xi and labels yi to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
 Several methods have been used to find the optimal set of parameters of a linear classifier
o A common method of choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation; see the sketch after this list)
 The linear classifier is a simple approach, but it is a building block of more advanced classification algorithms, such as SVMs and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
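A minimal NumPy sketch of the perceptron update mentioned above, on a tiny made-up dataset; the data values, learning rate, and epoch count are illustrative assumptions and not part of the original slides.

```python
import numpy as np

# Toy linearly separable data: two features, labels in {-1, +1} (illustrative).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

W = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (assumed)

for epoch in range(20):
    for xi, yi in zip(X, y):
        # Prediction: sign of the linear function f(x) = Wx + b
        pred = np.sign(W @ xi + b)
        if pred != yi:            # update the parameters only on misclassified examples
            W += lr * yi * xi
            b += lr * yi

print(W, b, np.sign(X @ W + b))   # reproduces y on this toy set
```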
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
 Inputs: the raw data 𝑥
 Features: non-linear functions of the inputs, 𝜙(𝑥)
 Outputs: a linear function of the features, e.g., 𝑓(𝑥) = 𝑊𝜙(𝑥) + 𝑏
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space where the
training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision boundary in the
original feature space
Φ : 𝑥 ↦ 𝜙 (𝑥 )
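As one possible illustration of the kernel idea (not the slide's own example), the sketch below uses scikit-learn's SVC with an RBF kernel on a made-up non-linearly separable dataset; the data and hyper-parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: class 1 inside a disk, class 0 outside (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.sum(X**2, axis=1) < 0.5).astype(int)

# The RBF kernel implicitly maps x -> phi(x) into a higher-dimensional feature space,
# where the classes become (approximately) linearly separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy, close to 1.0 on this toy problem
```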
• Both the binary and the multi-class classification problems can be linearly or non-linearly separated
 Figure: linearly and non-linearly separated data for a binary classification problem
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
• Deep learning (DL) is a machine learning subfield that uses multiple layers for
learning data representations
DL is exceptionally effective at learning patterns
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data
representations)
Input image pixels → Edges → Textures → Parts → Objects
Why is DL Useful?
Introduction to Deep Learning
Representational Power
Introduction to Deep Learning
Example: handwritten digit recognition
(Figure: a 16 × 16 = 256 pixel image is flattened into inputs x1, …, x256, with ink → 1 and no ink → 0; each output y1, …, y10 represents the confidence of a digit, e.g., y1 = 0.1 for "1", y2 = 0.7 for "2", y10 = 0.2 for "0", so the image is recognized as "2")
Slide credit: Hung-yi Lee – Deep Learning Tutorial
(Figure: the digit classifier is a function f : R^256 → R^10 that maps the input pixels x1, …, x256 to the output confidences y1, …, y10, recognizing the image as "2")
The function f is represented by a neural network
• A single neuron computes a weighted sum of its inputs plus a bias, and passes it through an activation function:
z = a1 w1 + a2 w2 + ⋯ + aK wK + b,   a = σ(z)
where a1, …, aK are the inputs, w1, …, wK are the weights, b is the bias, σ is the activation function, and a is the output
• A hidden layer applies this computation to all of its units at once, using a weights matrix W1 and a bias vector b1:
h = σ(W1 x + b1)
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
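A minimal NumPy sketch of the single-neuron computation above, using the illustrative numbers (inputs 1 and −1, weights 1 and −2, bias 1) that also appear in the following slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_in = np.array([1.0, -1.0])      # inputs a_1 ... a_K (illustrative values)
w = np.array([1.0, -2.0])         # weights w_1 ... w_K
b = 1.0                           # bias

z = np.dot(w, a_in) + b           # pre-activation: z = sum_k a_k w_k + b
a_out = sigmoid(z)                # activation: a = sigma(z)
print(z, a_out)                   # 4.0 and ~0.98
```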
(Figure: a fully-connected network maps inputs x1, …, xN to outputs y1, …, yM; e.g., a neuron with inputs (1, −1), weights (1, −2), and bias 1 computes (1 ∙ 1) + (−1) ∙ (−2) + 1 = 4)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• The network defines a function f : R^2 → R^2; for this example, f([1, −1]ᵀ) = [0.62, 0.83]ᵀ
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Matrix Operation
Introduction to Neural Networks
• Matrix operations are helpful when working with multidimensional inputs and
outputs
Example, for the first layer with input x = [1, −1]ᵀ:
σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]
i.e., the layer computes a = σ(W x + b)
Matrix Operation
Introduction to Neural Networks
(Figure: the first hidden layer computes its activations from the input vector x = (x1, …, xN) using the weights W1 and biases b1)
a1 = σ(W1 x + b1)
Matrix Operation
Introduction to Neural Networks
(Figure: a network with L layers, weight matrices W1, …, WL and bias vectors b1, …, bL, mapping inputs x1, …, xN to outputs y1, …, yM)
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
Matrix Operation
Introduction to Neural Networks
(Figure: the same L-layer network viewed as a single composed function)
y = f(x) = σ(WL ⋯ σ(W2 σ(W1 x + b1) + b2) ⋯ + bL)
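A NumPy sketch of the layer-by-layer forward pass y = f(x) = σ(WL ⋯ σ(W1 x + b1) ⋯ + bL); the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: a_l = sigma(W_l a_{l-1} + b_l), starting from a_0 = x."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Illustrative 2-16-10 network (2 inputs, one hidden layer of 16 units, 10 outputs)
weights = [rng.normal(size=(16, 2)), rng.normal(size=(10, 16))]
biases = [np.zeros(16), np.zeros(10)]

y = forward(np.array([1.0, -1.0]), weights, biases)
print(y.shape)   # (10,)
```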
Softmax Layer
Introduction to Neural Networks
• A softmax layer is applied to the output scores z1, …, zK to produce normalized outputs:
y_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}
 E.g., with K = 3 scores, a score z3 = −3 gives e^{z3} = 0.05 ≈ 0, so y3 ≈ 0
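A NumPy sketch of the softmax computation; the scores (3, 1, −3) follow the example above, and subtracting the maximum is a standard numerical-stability trick rather than something from the slides.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])    # illustrative output scores z1, z2, z3
y = softmax(z)
print(y.round(3))                 # ~[0.88, 0.12, 0.00]; e^{-3} = 0.05 contributes almost nothing
```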
Activation Functions
Introduction to Neural Networks
Activation: Sigmoid
Introduction to Neural Networks
• Sigmoid function σ: takes a real-valued number and “squashes” it into the range
between 0 and 1
The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; Fully firing = 1
 When the neuron’s activations are close to 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
Sigmoid activations are less common in modern NNs
(Figure: the sigmoid curve f(x) = 1 / (1 + e^{−x}), which squashes ℝ^n → [0, 1])
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: Tanh
Introduction to Neural Networks
• Tanh function: takes a real-valued number and “squashes” it into range between
-1 and 1
 Like sigmoid, tanh neurons saturate
 Unlike sigmoid, the output is zero-centered
o It is therefore preferred to the sigmoid
 Tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1
(Figure: the tanh curve f(x), which squashes ℝ^n → [−1, 1])
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: ReLU
Introduction to Neural Networks
• Linear activation: the output signal is proportional to the input signal to the neuron, f(x) = c x, mapping ℝ^n → ℝ^n
 If the value of the constant c is 1, it is also called the identity activation function
 This activation type is used in regression problems
o E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership)
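A NumPy sketch of the activation functions discussed in this section (sigmoid, tanh, ReLU, linear); the sample input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)          # 0 for x < 0, identity for x >= 0

def linear(x, c=1.0):
    return c * x                       # identity activation when c = 1 (used for regression outputs)

x = np.linspace(-5, 5, 5)
print(sigmoid(x), tanh(x), relu(x), linear(x), sep="\n")
```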
Training NNs
Training Neural Networks
• The network parameters include the weight matrices and bias vectors from all
layers
𝜃 = {W1, b1, W2, b2, ⋯, WL, bL}
Often, the model parameters are referred to as weights
• Training a model to learn a set of parameters that are optimal (according to a
criterion) is one of the greatest challenges in ML
(Figure: the digit-recognition network with 16 × 16 = 256 inputs x1, …, x256, a softmax output layer, and outputs y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"))
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Training NNs
Training Neural Networks
• To train a NN, set the parameters such that, for the training images, the elements of the predicted outputs that correspond to the true classes have maximum values
Training NNs
Training Neural Networks
(Figure: for an input image with true label "1", the network predicts ŷ = (0.2, 0.3, …, 0.5) while the target output is y = (1, 0, …, 0); the cost ℒ(𝜃) measures the difference between the predicted and target outputs)
Training NNs
Training Neural Networks
• For a training set of images, calculate the total loss over all images: ℒ(𝜃) = Σ_{i=1}^{N} ℒ_i(𝜃)
• Find the optimal parameters that minimize the total loss
(Figure: each training input x_i is passed through the NN to produce a prediction ŷ_i, which is compared with its label y_i to give a per-example loss ℒ_i(𝜃); the per-example losses are summed over the N training examples)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Loss Functions
Training Neural Networks
• Classification tasks
 Training examples: pairs of N inputs x^(i) and ground-truth class labels y^(i)
 Loss function: cross-entropy
ℒ(𝜃) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
where y_k^(i) are the ground-truth class labels and ŷ_k^(i) are the model's predicted class labels
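A NumPy sketch of the cross-entropy loss above; the clipping constant and the toy labels/predictions are illustrative assumptions.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy averaged over N examples, as in the formula above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum(axis=1)
    return per_example.mean()

# Illustrative: N = 2 examples, K = 3 classes, one-hot ground truth
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_pred))
```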
Loss Functions
Training Neural Networks
• Regression tasks
 Training examples: pairs of n inputs and ground-truth output values
 Output layer: linear (identity) or sigmoid activation
 Loss function: mean squared error
ℒ(𝜃) = (1/n) Σ_{i=1}^{n} ( y^(i) − ŷ^(i) )²
or mean absolute error
ℒ(𝜃) = (1/n) Σ_{i=1}^{n} | y^(i) − ŷ^(i) |
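A NumPy sketch of the two regression losses above, on illustrative targets and predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)      # Mean Squared Error

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error

y_true = np.array([1.5, 0.0, 2.0])              # illustrative targets
y_pred = np.array([1.2, 0.1, 2.4])              # illustrative predictions
print(mse(y_true, y_pred), mae(y_true, y_pred))
```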
Training NNs
Training Neural Networks
(Figure: the loss ℒ(𝜃) plotted against a single parameter 𝜃_i; the partial derivative ∂ℒ/∂𝜃_i is the slope of the loss at the current parameter value)
• Gradient descent: update the parameters in the direction of the negative gradient of the loss
 Parameter update: 𝜃 ← 𝜃 − α ∇ℒ(𝜃), where α is the learning rate
• Example, for a model with two parameters w1 and w2:
1. Randomly pick a starting point 𝜃⁰
2. Compute the gradient at 𝜃⁰:  ∇ℒ(𝜃⁰) = [ ∂ℒ(𝜃⁰)/∂w1 ; ∂ℒ(𝜃⁰)/∂w2 ]
3. Multiply by the learning rate α and update:  𝜃¹ = 𝜃⁰ − α ∇ℒ(𝜃⁰)
4. Go to step 2, repeat
(Figure: contours of the loss over (w1, w2); the step −α∇ℒ(𝜃⁰) moves 𝜃⁰ toward the minimum 𝜃*)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Example (contd.)
(Figure: repeating steps 2–4 produces a sequence of points 𝜃⁰, 𝜃¹, 𝜃², … that moves toward a minimum of the loss; a minimal code sketch of this loop follows below)
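A minimal NumPy sketch of the gradient descent loop above; the quadratic loss in (w1, w2), the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def loss(theta):
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 0.5 * (w2 + 1.0) ** 2          # illustrative convex loss

def grad(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 1.0 * (w2 + 1.0)])   # [dL/dw1, dL/dw2]

theta = np.random.default_rng(0).normal(size=2)   # 1. random starting point
alpha = 0.1                                        # learning rate
for step in range(200):                            # 4. repeat
    g = grad(theta)                                # 2. compute the gradient
    theta = theta - alpha * g                      # 3. update in the negative gradient direction

print(theta)   # approaches the minimum at (3, -1)
```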
• Gradient descent algorithm stops when a local minimum of the loss surface is
reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
 Gradient descent may reach different minima at every run
 Therefore, the NN will produce different predicted outputs
• In addition, we currently don’t have algorithms that guarantee reaching a global minimum for an arbitrary loss function
(Figure: a non-convex loss surface ℒ over the parameters w1 and w2, with multiple local minima)
Backpropagation
Training Neural Networks
• Modern NNs employ the backpropagation method for calculating the gradients
of the loss function
Backpropagation is short for “backward propagation”
• For training NNs, forward propagation (forward pass) refers to passing the
inputs through the hidden layers to obtain the model outputs (predictions)
The loss function is then calculated
Backpropagation traverses the network in reverse order, from the outputs backward
toward the inputs to calculate the gradients of the loss
The chain rule is used for calculating the partial derivatives of the loss function with
respect to the parameters in the different layers in the network
• Each update of the model parameters during training takes one forward and
one backward pass (e.g., of a batch of inputs)
• Automatic calculation of the gradients (automatic differentiation) is available in
all current deep learning libraries
It significantly simplifies the implementation of deep learning algorithms, since it
obviates deriving the partial derivatives of the loss function by hand
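As one illustration of automatic differentiation, the sketch below uses TensorFlow's GradientTape to record a forward pass and compute the gradients of the loss with respect to the parameters; the tiny model and data are assumptions.

```python
import tensorflow as tf

x = tf.constant([[1.0, -1.0]])      # illustrative input
y_true = tf.constant([[1.0]])       # illustrative target

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])

with tf.GradientTape() as tape:
    y_pred = model(x)                                            # forward pass
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)   # loss value
grads = tape.gradient(loss, model.trainable_variables)           # backward pass via autodiff
print([g.shape for g in grads])                                  # gradients for the kernel and bias
```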
• For large datasets, it is wasteful to compute the loss over the entire training set just to perform a single parameter update
E.g., ImageNet has 14M images
Therefore, GD (a.k.a. vanilla GD) is almost always replaced with mini-batch GD
• Mini-batch gradient descent
Approach:
o Compute the loss on a mini-batch of images, update the parameters, and repeat until all images are used (see the sketch after this list)
o At the next epoch, shuffle the training data, and repeat the above process
Mini-batch GD results in much faster training
Typical mini-batch size: 32 to 256 images
It works because the gradient from a mini-batch is a good approximation of the
gradient from the entire training set
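A sketch of the mini-batch loop described above; compute_grad is a hypothetical placeholder for the gradient of the loss on one mini-batch (e.g., obtained by backpropagation), and the linear-regression usage example is illustrative.

```python
import numpy as np

def minibatch_gd(X, y, theta, compute_grad, lr=0.01, batch_size=32, epochs=10):
    """Generic mini-batch gradient descent loop."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        idx = rng.permutation(n)                   # shuffle the training data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # indices of the current mini-batch
            g = compute_grad(theta, X[batch], y[batch])
            theta = theta - lr * g                 # parameter update from this mini-batch
    return theta

# Illustrative usage on linear regression (gradient of 0.5*||X theta - y||^2 per batch)
X = np.random.default_rng(1).normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad_fn = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
print(minibatch_gd(X, y, np.zeros(3), grad_fn, lr=0.1, epochs=50))   # ~[1, -2, 0.5]
```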
• Besides the local minima problem, the GD algorithm can be very slow at
plateaus, and it can get stuck at saddle points
(Figure: a 1-D loss curve with a plateau where ∇ℒ(𝜃) ≈ 0, a saddle point where ∇ℒ(𝜃) = 0, and a local minimum where ∇ℒ(𝜃) = 0)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Gradient descent with momentum uses the momentum of the gradient for
parameter optimization
(Figure: on the loss curve, the real movement is the sum of the negative gradient and the momentum term; the momentum can carry the parameters past points where the gradient is 0)
Movement = negative of gradient + momentum
Slide credit: Hung-yi Lee – Deep Learning Tutorial
This term is analogous to the momentum of a heavy ball rolling down the hill
• The parameter weighting the momentum term is referred to as the coefficient of momentum
 A typical value of this parameter is 0.9
• This method updates the parameters in the direction of a weighted average of the past gradients (a minimal sketch follows below)
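A sketch of one common formulation of the momentum update, v ← βv − α∇ℒ(θ) and θ ← θ + v (the slides do not give the exact formula); the 1-D quadratic loss, learning rate, and iteration count are illustrative.

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)           # gradient of the illustrative loss (theta - 3)^2

theta = 0.0
v = 0.0                                  # accumulated "velocity"
alpha, beta = 0.1, 0.9                   # learning rate and momentum coefficient (typical value 0.9)

for _ in range(100):
    v = beta * v - alpha * grad(theta)   # weighted average of past gradients
    theta = theta + v                    # move by the velocity instead of the raw gradient

print(theta)   # approaches the minimum at 3.0
```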
(Figure: comparison of GD with momentum and GD with Nesterov momentum)
Adam
Training Neural Networks
Learning Rate
Training Neural Networks
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate of increase,
but it does not tell us how far along the opposite direction we should step
Choosing the learning rate (also called the step size) is one of the most important
hyper-parameter settings for NN training
(Figure: loss curves when the learning rate is too small vs. too large)
Learning Rate
Training Neural Networks
• Learning rate scheduling is applied to change the values of the learning rate
during the training
Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
o Approach 1: reduce the learning rate by some factor every few epochs
– Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
o Approach 2: exponential or cosine decay, which gradually reduce the learning rate over time
o Approach 3: reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving
– In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau() (see the example after this list)
» Monitor: validation loss, factor: 0.1 (i.e., divide the learning rate by 10), patience: 10 (how many epochs to wait before applying it), minimum learning rate: 1e-6 (when to stop)
Warmup is gradually increasing the learning rate initially, and afterward let it cool
down until the end of the training
(Figure: exponential decay, cosine decay, and warmup learning-rate schedules)
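An illustrative use of the TensorFlow callback named above, with the monitor/factor/patience/minimum-learning-rate settings from the slide; the commented model.fit() call assumes a model and training data defined elsewhere.

```python
import tensorflow as tf

# Reduce the learning rate when the validation loss stops improving
# (factor 0.1, patience 10 epochs, minimum learning rate 1e-6).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=10, min_lr=1e-6)

# Illustrative usage (model, x_train, y_train assumed to be defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.25, epochs=100, callbacks=[reduce_lr])
```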
• In some cases during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
 They result in very small or very large updates of the parameters
Solutions: change learning rate, ReLU activations, regularization, LSTM units in RNNs
(Figure: a deep fully-connected network with inputs x1, …, xN and outputs y1, …, yM)
Generalization
Generalization
• Underfitting
The model is too “simple” to represent
all the relevant class characteristics
E.g., model with too few parameters
Produces high error on the training set
and high error on the validation set
• Overfitting
The model is too “complex” and fits
irrelevant characteristics (noise) in the
data
E.g., model with too many parameters
 Produces low error on the training set and high error on the validation set
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the
underlying relationship
• ℓ2 weight decay
 A regularization term that penalizes large weights is added to the loss function
 For every weight in the network, a term proportional to its squared value is added to the loss
o During the gradient descent parameter update, every weight is decayed linearly toward zero
 The weight decay coefficient determines how dominant the regularization is during the gradient computation
• ℓ1 weight decay
 The regularization term is based on the ℓ1 norm of the weights (the sum of their absolute values); a sketch of both penalties follows below
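A NumPy sketch of adding ℓ2 and ℓ1 regularization terms to a data loss; the coefficient lam and the example weight matrix are illustrative assumptions.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    # L_reg = L + lam * sum(w^2): penalizes large weights ("weight decay")
    return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

def l1_regularized_loss(data_loss, weights, lam=1e-4):
    # L_reg = L + lam * sum(|w|): regularization term based on the l1 norm
    return data_loss + lam * sum(np.sum(np.abs(W)) for W in weights)

weights = [np.array([[0.5, -1.0], [2.0, 0.1]])]    # illustrative weight matrix
print(l2_regularized_loss(0.3, weights), l1_regularized_loss(0.3, weights))
```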
Regularization: Dropout
Regularization
• Dropout
Randomly drop units (along with their connections) during training
 Each unit is retained with a fixed probability p, independent of the other units
 The hyper-parameter p needs to be chosen (tuned)
o Often, between 20% and 50% of the units are dropped (a sketch follows below)
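A NumPy sketch of "inverted" dropout at training time, one common way dropout is implemented (not necessarily the exact formulation intended in the slides); the drop probability is illustrative.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: randomly zero units with probability p_drop during training,
    and rescale the surviving units so the expected activation is unchanged at test time."""
    if not training:
        return h
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

h = np.ones(10)                  # illustrative activations of a hidden layer
print(dropout(h, p_drop=0.5))    # roughly half the units zeroed, the rest scaled to 2.0
```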
• Early-stopping
During model training, use a validation set
o E.g., validation/train ratio of about 25% to 75%
Stop when the validation accuracy (or loss) has not improved after n epochs
o The parameter n is called patience
(Figure: training and validation loss curves over epochs, and the point where training is stopped)
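An illustrative use of TensorFlow's EarlyStopping callback with patience n = 10; the commented model.fit() call assumes a model and data defined elsewhere.

```python
import tensorflow as tf

# Stop training when the validation loss has not improved for n = 10 epochs (patience).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# Illustrative usage (model, x_train, y_train assumed to be defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.25, epochs=200, callbacks=[early_stop])
```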
Batch Normalization
Regularization
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
 Check all values in a range with a step value
• Random search
 Randomly sample values for the hyper-parameters
 Often preferred to grid search (a comparison sketch follows below)
• Bayesian hyper-parameter optimization
 An active area of research
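A sketch contrasting grid search and random search over two hyper-parameters; evaluate() is a hypothetical placeholder for training a model and returning its validation score, and the search ranges are assumptions.

```python
import numpy as np

def evaluate(lr, dropout):
    # Placeholder: in practice, train a model with these hyper-parameters
    # and return its validation accuracy; here an illustrative surrogate.
    return -((np.log10(lr) + 3) ** 2) - (dropout - 0.3) ** 2

# Grid search: check all values in a range with a step value
grid = [(lr, d) for lr in 10.0 ** np.arange(-5, -1) for d in np.arange(0.1, 0.6, 0.1)]

# Random search: randomly sample values (learning rate on a log scale)
rng = np.random.default_rng(0)
rand = [(10.0 ** rng.uniform(-5, -1), rng.uniform(0.1, 0.6)) for _ in range(len(grid))]

best_grid = max(grid, key=lambda p: evaluate(*p))
best_rand = max(rand, key=lambda p: evaluate(*p))
print(best_grid, best_rand)
```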
k-Fold Cross-Validation
k-Fold Cross-Validation
Ensemble Learning
Ensemble Learning
(Figure: a shallow NN and a deep NN with the same inputs x1, x2, …, xN and output)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
Allows parameter sharing
Efficient to train
 Have fewer parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
(Figure: a 3×3 convolutional filter sliding across an input matrix)
• When the convolutional filters are scanned over the image, they capture useful
features
 E.g., edge detection by convolutions (a code sketch follows below)
Filter:
  0  1  0
  1 −4  1
  0  1  0
(Figure: the pixel intensity values of an example grayscale image to which the edge-detection filter is applied)
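A sketch of edge detection by convolution with the 3×3 filter shown above, using scipy.signal.convolve2d on a made-up image; the image contents and the "same" padding mode are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

# The 3x3 filter from the slide: strong responses at intensity changes (edges).
edge_filter = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]], dtype=float)

# Illustrative image: a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

feature_map = convolve2d(image, edge_filter, mode="same")
print(np.round(feature_map, 1))   # non-zero values concentrate along the square's edges
```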
• In CNNs, hidden units in a layer are only connected to a small region of the
layer before it (called local receptive field)
The depth of each feature map corresponds to the number of convolutional filters
used at each layer
(Figure: convolutional filters with shared weights w1, …, w8 connect the input image to the Layer 1 feature map, and the Layer 1 feature map to the Layer 2 feature map)
(Figure: a CNN with stacked convolutional layers of 64 to 512 filters and max-pooling layers, classifying images into scene categories such as living room, bedroom, kitchen, bathroom, and outdoor)
Residual CNNs
Convolutional Neural Networks
• Recurrent NNs are used for modeling sequential data and data with varying
length of inputs and outputs
Videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively passing
information across a sequence
 Memory of the previous inputs is stored in the model’s internal state and affects the model predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
• RNNs use the same set of weights, w_h and w_x, across all time steps
 A sequence of hidden states is learned, which represents the memory of the network
 The hidden state at step t, h_t, is calculated based on the previous hidden state h_{t−1} and the input at the current step x_t, i.e., h_t = f(w_h h_{t−1} + w_x x_t)
 The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time
(Figure: the unrolled RNN processes the input sequence x1, x2, x3; the hidden states h0 → h1 → h2 → h3 are connected through the shared weights w_h, each input enters through the shared weights w_x, and the output is computed from the final hidden state through w_y)
Slide credit: Param Vir Singh – Deep Learning
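A NumPy sketch of the vanilla RNN recurrence h_t = f(w_h h_{t−1} + w_x x_t), with an output computed from the final hidden state via w_y as in the figure; the tanh choice, dimensions, and random weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, Wy, h0):
    """Unroll a vanilla RNN over an input sequence, reusing the same weights at every step."""
    h = h0
    for x in xs:                         # process the sequence one element at a time
        h = np.tanh(Wh @ h + Wx @ x)     # h_t = f(W_h h_{t-1} + W_x x_t)
    return Wy @ h                        # output from the final hidden state

rng = np.random.default_rng(0)
D, H, O, T = 4, 8, 3, 5                  # input dim, hidden dim, output dim, sequence length (illustrative)
xs = rng.normal(size=(T, D))
Wh, Wx, Wy = rng.normal(size=(H, H)), rng.normal(size=(H, D)), rng.normal(size=(O, H))
print(rnn_forward(xs, Wh, Wx, Wy, np.zeros(H)).shape)   # (3,)
```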
• RNNs can have one of many inputs and one of many outputs
(Examples: image captioning – an image is mapped to the sentence “A person riding a motorbike on dirt road”; machine translation – “Happy Diwali” is translated to “शुभ दीपावली”)
Bidirectional RNNs
Recurrent Neural Networks
Forward:  h⃗_t = σ(W^(hh) h⃗_{t−1} + W^(hx) x_t)
Backward:  h⃖_t = σ(W′^(hh) h⃖_{t+1} + W′^(hx) x_t), with separate weights W′
Output:  y_t = f([ h⃗_t ; h⃖_t ])
LSTM Networks
Recurrent Neural Networks
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences