Lecture 2: Deep Learning Overview
CS 404/504, Fall 2021
Lecture Outline
Machine Learning
Machine Learning Basics
(Figure: during training, labeled data and a learning algorithm produce a learned model; during prediction, the learned model assigns labels, e.g., class A or class B, to new data)
Supervised Learning and Unsupervised Learning
Machine Learning Basics
(Figure: classification and regression are supervised learning tasks; clustering is an unsupervised learning task)
• Nearest Neighbor – for each test data point, assign the class label of the nearest
training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest
data point (minimum distance)
It does not require learning a set of weights
(Figure: a test example and training examples from class 1 and class 2)
• For image classification, the distance between all pixels is calculated (e.g., using the ℓ1 norm or the ℓ2 norm)
Accuracy on CIFAR-10: 38.6%
• Disadvantages:
The classifier must remember all training data and store it for future comparisons with
the test data
Classifying a test image is expensive since it requires a comparison to all training
images
(Figure: decision regions of a nearest-neighbor classifier in a 2-D feature space (x1, x2), using the ℓ1 norm (Manhattan distance) as the distance function)
Linear Classifier
Machine Learning Basics
• Linear classifier
 Find a linear function of the inputs xi that separates the classes: f(xi, W, b) = W xi + b
 Use pairs of inputs xi and labels yi to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
 Several methods have been used to find the optimal set of parameters of a linear classifier
o A common method of choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation; see the sketch after this list)
 The linear classifier is a simple approach, but it is a building block of more advanced classification algorithms, such as SVMs and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
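A minimal NumPy sketch of the perceptron update mentioned above, on a tiny made-up dataset; the data values, learning rate, and epoch count are illustrative assumptions and not part of the original slides.

```python
import numpy as np

# Toy linearly separable data: two features, labels in {-1, +1} (illustrative).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

W = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (assumed)

for epoch in range(20):
    for xi, yi in zip(X, y):
        # Prediction: sign of the linear function f(x) = Wx + b
        pred = np.sign(W @ xi + b)
        if pred != yi:            # update the parameters only on misclassified examples
            W += lr * yi * xi
            b += lr * yi

print(W, b, np.sign(X @ W + b))   # reproduces y on this toy set
```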
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
 Inputs: the raw data 𝑥
 Features: non-linear functions of the inputs, 𝜙(𝑥)
 Outputs: a linear function of the features, e.g., 𝑓(𝑥) = 𝑊𝜙(𝑥) + 𝑏
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space where the
training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision boundary in the
original feature space
Φ : 𝑥 ↦ 𝜙 (𝑥 )
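As one possible illustration of the kernel idea (not the slide's own example), the sketch below uses scikit-learn's SVC with an RBF kernel on a made-up non-linearly separable dataset; the data and hyper-parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: class 1 inside a disk, class 0 outside (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.sum(X**2, axis=1) < 0.5).astype(int)

# The RBF kernel implicitly maps x -> phi(x) into a higher-dimensional feature space,
# where the classes become (approximately) linearly separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy, close to 1.0 on this toy problem
```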
• Both the binary and the multi-class classification problems can be linearly or non-linearly separated
 Figure: linearly and non-linearly separated data for a binary classification problem
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
• Deep learning (DL) is a machine learning subfield that uses multiple layers for
learning data representations
DL is exceptionally effective at learning patterns
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data
representations)
Input image pixels → Edges → Textures → Parts → Objects
Why is DL Useful?
Introduction to Deep Learning
Representational Power
Introduction to Deep Learning
Example: handwritten digit recognition
(Figure: a 16 × 16 = 256 pixel image is flattened into inputs x1, …, x256, with ink → 1 and no ink → 0; each output y1, …, y10 represents the confidence of a digit, e.g., y1 = 0.1 for "1", y2 = 0.7 for "2", y10 = 0.2 for "0", so the image is recognized as "2")
Slide credit: Hung-yi Lee – Deep Learning Tutorial
(Figure: the digit classifier is a function f : R^256 → R^10 that maps the input pixels x1, …, x256 to the output confidences y1, …, y10, recognizing the image as "2")
The function f is represented by a neural network
• A single neuron computes a weighted sum of its inputs plus a bias, and passes it through an activation function:
z = a1 w1 + a2 w2 + ⋯ + aK wK + b,   a = σ(z)
where a1, …, aK are the inputs, w1, …, wK are the weights, b is the bias, σ is the activation function, and a is the output
• A hidden layer applies this computation to all of its units at once, using a weights matrix W1 and a bias vector b1:
h = σ(W1 x + b1)
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
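A minimal NumPy sketch of the single-neuron computation above, using the illustrative numbers (inputs 1 and −1, weights 1 and −2, bias 1) that also appear in the following slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_in = np.array([1.0, -1.0])      # inputs a_1 ... a_K (illustrative values)
w = np.array([1.0, -2.0])         # weights w_1 ... w_K
b = 1.0                           # bias

z = np.dot(w, a_in) + b           # pre-activation: z = sum_k a_k w_k + b
a_out = sigmoid(z)                # activation: a = sigma(z)
print(z, a_out)                   # 4.0 and ~0.98
```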
(Figure: a fully-connected network maps inputs x1, …, xN to outputs y1, …, yM; e.g., a neuron with inputs (1, −1), weights (1, −2), and bias 1 computes (1 ∙ 1) + (−1) ∙ (−2) + 1 = 4)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• The network defines a function f : R^2 → R^2; for this example, f([1, −1]ᵀ) = [0.62, 0.83]ᵀ
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Matrix Operation
Introduction to Neural Networks
• Matrix operations are helpful when working with multidimensional inputs and
outputs
Example, for the first layer with input x = [1, −1]ᵀ:
σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]
i.e., the layer computes a = σ(W x + b)
Matrix Operation
Introduction to Neural Networks
(Figure: the first hidden layer computes its activations from the input vector x = (x1, …, xN) using the weights W1 and biases b1)
a1 = σ(W1 x + b1)
Matrix Operation
Introduction to Neural Networks
(Figure: a network with L layers, weight matrices W1, …, WL and bias vectors b1, …, bL, mapping inputs x1, …, xN to outputs y1, …, yM)
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
Matrix Operation
Introduction to Neural Networks
(Figure: the same L-layer network viewed as a single composed function)
y = f(x) = σ(WL ⋯ σ(W2 σ(W1 x + b1) + b2) ⋯ + bL)
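A NumPy sketch of the layer-by-layer forward pass y = f(x) = σ(WL ⋯ σ(W1 x + b1) ⋯ + bL); the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: a_l = sigma(W_l a_{l-1} + b_l), starting from a_0 = x."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Illustrative 2-16-10 network (2 inputs, one hidden layer of 16 units, 10 outputs)
weights = [rng.normal(size=(16, 2)), rng.normal(size=(10, 16))]
biases = [np.zeros(16), np.zeros(10)]

y = forward(np.array([1.0, -1.0]), weights, biases)
print(y.shape)   # (10,)
```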
Softmax Layer
Introduction to Neural Networks
• A softmax layer is applied to the output scores z1, …, zK to produce normalized outputs:
y_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}
 E.g., with K = 3 scores, a score z3 = −3 gives e^{z3} = 0.05 ≈ 0, so y3 ≈ 0
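A NumPy sketch of the softmax computation; the scores (3, 1, −3) follow the example above, and subtracting the maximum is a standard numerical-stability trick rather than something from the slides.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])    # illustrative output scores z1, z2, z3
y = softmax(z)
print(y.round(3))                 # ~[0.88, 0.12, 0.00]; e^{-3} = 0.05 contributes almost nothing
```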
Activation Functions
Introduction to Neural Networks
Activation: Sigmoid
Introduction to Neural Networks
• Sigmoid function σ: takes a real-valued number and “squashes” it into the range
between 0 and 1
The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; Fully firing = 1
 When the neuron’s activations are close to 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
Sigmoid activations are less common in modern NNs
(Figure: the sigmoid curve f(x) = 1 / (1 + e^{−x}), which squashes ℝ^n → [0, 1])
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: Tanh
Introduction to Neural Networks
• Tanh function: takes a real-valued number and “squashes” it into range between
-1 and 1
 Like sigmoid, tanh neurons saturate
 Unlike sigmoid, the output is zero-centered
o It is therefore preferred to the sigmoid
 Tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1
(Figure: the tanh curve f(x), which squashes ℝ^n → [−1, 1])
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Activation: ReLU
Introduction to Neural Networks
• Linear activation: the output signal is proportional to the input signal to the neuron, f(x) = c x, mapping ℝ^n → ℝ^n
 If the value of the constant c is 1, it is also called the identity activation function
 This activation type is used in regression problems
o E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership)
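A NumPy sketch of the activation functions discussed in this section (sigmoid, tanh, ReLU, linear); the sample input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)          # 0 for x < 0, identity for x >= 0

def linear(x, c=1.0):
    return c * x                       # identity activation when c = 1 (used for regression outputs)

x = np.linspace(-5, 5, 5)
print(sigmoid(x), tanh(x), relu(x), linear(x), sep="\n")
```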
Training NNs
Training Neural Networks
• The network parameters include the weight matrices and bias vectors from all
layers
𝜃 = {W1, b1, W2, b2, ⋯, WL, bL}
Often, the model parameters are referred to as weights
• Training a model to learn a set of parameters that are optimal (according to a
criterion) is one of the greatest challenges in ML
(Figure: the digit-recognition network with 16 × 16 = 256 inputs x1, …, x256, a softmax output layer, and outputs y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"))
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Training NNs
Training Neural Networks
• To train a NN, set the parameters such that, for the training images, the elements of the predicted outputs that correspond to the true classes have maximum values
Training NNs
Training Neural Networks
(Figure: for an input image with true label "1", the network predicts ŷ = (0.2, 0.3, …, 0.5) while the target output is y = (1, 0, …, 0); the cost ℒ(𝜃) measures the difference between the predicted and target outputs)
Training NNs
Training Neural Networks
• For a training set of images, calculate the total loss over all images: ℒ(𝜃) = Σ_{i=1}^{N} ℒ_i(𝜃)
• Find the optimal parameters that minimize the total loss
(Figure: each training input x_i is passed through the NN to produce a prediction ŷ_i, which is compared with its label y_i to give a per-example loss ℒ_i(𝜃); the per-example losses are summed over the N training examples)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Loss Functions
Training Neural Networks
• Classification tasks
 Training examples: pairs of N inputs x^(i) and ground-truth class labels y^(i)
 Loss function: cross-entropy
ℒ(𝜃) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
where y_k^(i) are the ground-truth class labels and ŷ_k^(i) are the model's predicted class labels
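A NumPy sketch of the cross-entropy loss above; the clipping constant and the toy labels/predictions are illustrative assumptions.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy averaged over N examples, as in the formula above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum(axis=1)
    return per_example.mean()

# Illustrative: N = 2 examples, K = 3 classes, one-hot ground truth
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_pred))
```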
Loss Functions
Training Neural Networks
• Regression tasks
 Training examples: pairs of n inputs and ground-truth output values
 Output layer: linear (identity) or sigmoid activation
 Loss function: mean squared error
ℒ(𝜃) = (1/n) Σ_{i=1}^{n} ( y^(i) − ŷ^(i) )²
or mean absolute error
ℒ(𝜃) = (1/n) Σ_{i=1}^{n} | y^(i) − ŷ^(i) |
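A NumPy sketch of the two regression losses above, on illustrative targets and predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)      # Mean Squared Error

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error

y_true = np.array([1.5, 0.0, 2.0])              # illustrative targets
y_pred = np.array([1.2, 0.1, 2.4])              # illustrative predictions
print(mse(y_true, y_pred), mae(y_true, y_pred))
```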
Training NNs
Training Neural Networks
(Figure: the loss ℒ(𝜃) plotted against a single parameter 𝜃_i; the partial derivative ∂ℒ/∂𝜃_i is the slope of the loss at the current parameter value)
• Gradient descent: update the parameters in the direction of the negative gradient of the loss
 Parameter update: 𝜃 ← 𝜃 − α ∇ℒ(𝜃), where α is the learning rate
• Example, for a model with two parameters w1 and w2:
1. Randomly pick a starting point 𝜃⁰
2. Compute the gradient at 𝜃⁰:  ∇ℒ(𝜃⁰) = [ ∂ℒ(𝜃⁰)/∂w1 ; ∂ℒ(𝜃⁰)/∂w2 ]
3. Multiply by the learning rate α and update:  𝜃¹ = 𝜃⁰ − α ∇ℒ(𝜃⁰)
4. Go to step 2, repeat
(Figure: contours of the loss over (w1, w2); the step −α∇ℒ(𝜃⁰) moves 𝜃⁰ toward the minimum 𝜃*)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Example (contd.)
(Figure: repeating steps 2–4 produces a sequence of points 𝜃⁰, 𝜃¹, 𝜃², … that moves toward a minimum of the loss; a minimal code sketch of this loop follows below)
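A minimal NumPy sketch of the gradient descent loop above; the quadratic loss in (w1, w2), the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def loss(theta):
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 0.5 * (w2 + 1.0) ** 2          # illustrative convex loss

def grad(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 1.0 * (w2 + 1.0)])   # [dL/dw1, dL/dw2]

theta = np.random.default_rng(0).normal(size=2)   # 1. random starting point
alpha = 0.1                                        # learning rate
for step in range(200):                            # 4. repeat
    g = grad(theta)                                # 2. compute the gradient
    theta = theta - alpha * g                      # 3. update in the negative gradient direction

print(theta)   # approaches the minimum at (3, -1)
```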
• Gradient descent algorithm stops when a local minimum of the loss surface is
reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
 Gradient descent may reach different minima at every run
 Therefore, the NN will produce different predicted outputs
• In addition, we currently don’t have algorithms that guarantee reaching a global minimum for an arbitrary loss function
(Figure: a non-convex loss surface ℒ over the parameters w1 and w2, with multiple local minima)
Backpropagation
Training Neural Networks
• Modern NNs employ the backpropagation method for calculating the gradients
of the loss function
Backpropagation is short for “backward propagation”
• For training NNs, forward propagation (forward pass) refers to passing the
inputs through the hidden layers to obtain the model outputs (predictions)
The loss function is then calculated
Backpropagation traverses the network in reverse order, from the outputs backward
toward the inputs to calculate the gradients of the loss
The chain rule is used for calculating the partial derivatives of the loss function with
respect to the parameters in the different layers in the network
• Each update of the model parameters during training takes one forward and
one backward pass (e.g., of a batch of inputs)
• Automatic calculation of the gradients (automatic differentiation) is available in
all current deep learning libraries
It significantly simplifies the implementation of deep learning algorithms, since it
obviates deriving the partial derivatives of the loss function by hand
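As one illustration of automatic differentiation, the sketch below uses TensorFlow's GradientTape to record a forward pass and compute the gradients of the loss with respect to the parameters; the tiny model and data are assumptions.

```python
import tensorflow as tf

x = tf.constant([[1.0, -1.0]])      # illustrative input
y_true = tf.constant([[1.0]])       # illustrative target

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])

with tf.GradientTape() as tape:
    y_pred = model(x)                                            # forward pass
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)   # loss value
grads = tape.gradient(loss, model.trainable_variables)           # backward pass via autodiff
print([g.shape for g in grads])                                  # gradients for the kernel and bias
```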
• For large datasets, it is wasteful to compute the loss over the entire training set just to perform a single parameter update
E.g., ImageNet has 14M images
Therefore, GD (a.k.a. vanilla GD) is almost always replaced with mini-batch GD
• Mini-batch gradient descent
Approach:
o Compute the loss on a mini-batch of images, update the parameters, and repeat until all images are used (see the sketch after this list)
o At the next epoch, shuffle the training data, and repeat the above process
Mini-batch GD results in much faster training
Typical mini-batch size: 32 to 256 images
It works because the gradient from a mini-batch is a good approximation of the
gradient from the entire training set
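A sketch of the mini-batch loop described above; compute_grad is a hypothetical placeholder for the gradient of the loss on one mini-batch (e.g., obtained by backpropagation), and the linear-regression usage example is illustrative.

```python
import numpy as np

def minibatch_gd(X, y, theta, compute_grad, lr=0.01, batch_size=32, epochs=10):
    """Generic mini-batch gradient descent loop."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        idx = rng.permutation(n)                   # shuffle the training data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # indices of the current mini-batch
            g = compute_grad(theta, X[batch], y[batch])
            theta = theta - lr * g                 # parameter update from this mini-batch
    return theta

# Illustrative usage on linear regression (gradient of 0.5*||X theta - y||^2 per batch)
X = np.random.default_rng(1).normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad_fn = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
print(minibatch_gd(X, y, np.zeros(3), grad_fn, lr=0.1, epochs=50))   # ~[1, -2, 0.5]
```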
• Besides the local minima problem, the GD algorithm can be very slow at
plateaus, and it can get stuck at saddle points
(Figure: a 1-D loss curve with a plateau where ∇ℒ(𝜃) ≈ 0, a saddle point where ∇ℒ(𝜃) = 0, and a local minimum where ∇ℒ(𝜃) = 0)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Gradient descent with momentum uses the momentum of the gradient for
parameter optimization
(Figure: on the loss curve, the real movement is the sum of the negative gradient and the momentum term; the momentum can carry the parameters past points where the gradient is 0)
Movement = negative of gradient + momentum
Slide credit: Hung-yi Lee – Deep Learning Tutorial
This term is analogous to the momentum of a heavy ball rolling down the hill
• The parameter weighting the momentum term is referred to as the coefficient of momentum
 A typical value of this parameter is 0.9
• This method updates the parameters in the direction of a weighted average of the past gradients (a minimal sketch follows below)
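A sketch of one common formulation of the momentum update, v ← βv − α∇ℒ(θ) and θ ← θ + v (the slides do not give the exact formula); the 1-D quadratic loss, learning rate, and iteration count are illustrative.

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)           # gradient of the illustrative loss (theta - 3)^2

theta = 0.0
v = 0.0                                  # accumulated "velocity"
alpha, beta = 0.1, 0.9                   # learning rate and momentum coefficient (typical value 0.9)

for _ in range(100):
    v = beta * v - alpha * grad(theta)   # weighted average of past gradients
    theta = theta + v                    # move by the velocity instead of the raw gradient

print(theta)   # approaches the minimum at 3.0
```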
(Figure: comparison of GD with momentum and GD with Nesterov momentum)
Adam
Training Neural Networks
Learning Rate
Training Neural Networks
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate of increase,
but it does not tell us how far along the opposite direction we should step
Choosing the learning rate (also called the step size) is one of the most important
hyper-parameter settings for NN training
(Figure: loss curves when the learning rate is too small vs. too large)
Learning Rate
Training Neural Networks
• Learning rate scheduling is applied to change the values of the learning rate
during the training
Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
o Approach 1: reduce the learning rate by some factor every few epochs
– Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
o Approach 2: exponential or cosine decay, which gradually reduce the learning rate over time
o Approach 3: reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving
– In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau() (see the example after this list)
» Monitor: validation loss, factor: 0.1 (i.e., divide the learning rate by 10), patience: 10 (how many epochs to wait before applying it), minimum learning rate: 1e-6 (when to stop)
Warmup is gradually increasing the learning rate initially, and afterward let it cool
down until the end of the training
(Figure: exponential decay, cosine decay, and warmup learning-rate schedules)
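An illustrative use of the TensorFlow callback named above, with the monitor/factor/patience/minimum-learning-rate settings from the slide; the commented model.fit() call assumes a model and training data defined elsewhere.

```python
import tensorflow as tf

# Reduce the learning rate when the validation loss stops improving
# (factor 0.1, patience 10 epochs, minimum learning rate 1e-6).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=10, min_lr=1e-6)

# Illustrative usage (model, x_train, y_train assumed to be defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.25, epochs=100, callbacks=[reduce_lr])
```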
• In some cases during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
 They result in very small or very large updates of the parameters
Solutions: change learning rate, ReLU activations, regularization, LSTM units in RNNs
(Figure: a deep fully-connected network with inputs x1, …, xN and outputs y1, …, yM)
Generalization
Generalization
• Underfitting
The model is too “simple” to represent
all the relevant class characteristics
E.g., model with too few parameters
Produces high error on the training set
and high error on the validation set
• Overfitting
The model is too “complex” and fits
irrelevant characteristics (noise) in the
data
E.g., model with too many parameters
 Produces low error on the training set and high error on the validation set
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the
underlying relationship
• ℓ2 weight decay
 A regularization term that penalizes large weights is added to the loss function
 For every weight in the network, a term proportional to its squared value is added to the loss
o During the gradient descent parameter update, every weight is decayed linearly toward zero
 The weight decay coefficient determines how dominant the regularization is during the gradient computation
• ℓ1 weight decay
 The regularization term is based on the ℓ1 norm of the weights (the sum of their absolute values); a sketch of both penalties follows below
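A NumPy sketch of adding ℓ2 and ℓ1 regularization terms to a data loss; the coefficient lam and the example weight matrix are illustrative assumptions.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    # L_reg = L + lam * sum(w^2): penalizes large weights ("weight decay")
    return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

def l1_regularized_loss(data_loss, weights, lam=1e-4):
    # L_reg = L + lam * sum(|w|): regularization term based on the l1 norm
    return data_loss + lam * sum(np.sum(np.abs(W)) for W in weights)

weights = [np.array([[0.5, -1.0], [2.0, 0.1]])]    # illustrative weight matrix
print(l2_regularized_loss(0.3, weights), l1_regularized_loss(0.3, weights))
```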
Regularization: Dropout
Regularization
• Dropout
Randomly drop units (along with their connections) during training
 Each unit is retained with a fixed probability p, independent of the other units
 The hyper-parameter p needs to be chosen (tuned)
o Often, between 20% and 50% of the units are dropped (a sketch follows below)
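A NumPy sketch of "inverted" dropout at training time, one common way dropout is implemented (not necessarily the exact formulation intended in the slides); the drop probability is illustrative.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: randomly zero units with probability p_drop during training,
    and rescale the surviving units so the expected activation is unchanged at test time."""
    if not training:
        return h
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

h = np.ones(10)                  # illustrative activations of a hidden layer
print(dropout(h, p_drop=0.5))    # roughly half the units zeroed, the rest scaled to 2.0
```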
• Early-stopping
During model training, use a validation set
o E.g., validation/train ratio of about 25% to 75%
Stop when the validation accuracy (or loss) has not improved after n epochs
o The parameter n is called patience
(Figure: training and validation loss curves over epochs, and the point where training is stopped)
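An illustrative use of TensorFlow's EarlyStopping callback with patience n = 10; the commented model.fit() call assumes a model and data defined elsewhere.

```python
import tensorflow as tf

# Stop training when the validation loss has not improved for n = 10 epochs (patience).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# Illustrative usage (model, x_train, y_train assumed to be defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.25, epochs=200, callbacks=[early_stop])
```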
Batch Normalization
Regularization
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
 Check all values in a range with a step value
• Random search
 Randomly sample values for the hyper-parameters
 Often preferred to grid search (a comparison sketch follows below)
• Bayesian hyper-parameter optimization
 An active area of research
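A sketch contrasting grid search and random search over two hyper-parameters; evaluate() is a hypothetical placeholder for training a model and returning its validation score, and the search ranges are assumptions.

```python
import numpy as np

def evaluate(lr, dropout):
    # Placeholder: in practice, train a model with these hyper-parameters
    # and return its validation accuracy; here an illustrative surrogate.
    return -((np.log10(lr) + 3) ** 2) - (dropout - 0.3) ** 2

# Grid search: check all values in a range with a step value
grid = [(lr, d) for lr in 10.0 ** np.arange(-5, -1) for d in np.arange(0.1, 0.6, 0.1)]

# Random search: randomly sample values (learning rate on a log scale)
rng = np.random.default_rng(0)
rand = [(10.0 ** rng.uniform(-5, -1), rng.uniform(0.1, 0.6)) for _ in range(len(grid))]

best_grid = max(grid, key=lambda p: evaluate(*p))
best_rand = max(rand, key=lambda p: evaluate(*p))
print(best_grid, best_rand)
```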
k-Fold Cross-Validation
k-Fold Cross-Validation
Ensemble Learning
Ensemble Learning
(Figure: a shallow NN and a deep NN with the same inputs x1, x2, …, xN and output)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
Allows parameter sharing
Efficient to train
 Have fewer parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
(Figure: a 3×3 convolutional filter sliding across an input matrix)
• When the convolutional filters are scanned over the image, they capture useful
features
 E.g., edge detection by convolutions (a code sketch follows below)
Filter:
  0  1  0
  1 −4  1
  0  1  0
(Figure: the pixel intensity values of an example grayscale image to which the edge-detection filter is applied)
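A sketch of edge detection by convolution with the 3×3 filter shown above, using scipy.signal.convolve2d on a made-up image; the image contents and the "same" padding mode are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

# The 3x3 filter from the slide: strong responses at intensity changes (edges).
edge_filter = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]], dtype=float)

# Illustrative image: a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

feature_map = convolve2d(image, edge_filter, mode="same")
print(np.round(feature_map, 1))   # non-zero values concentrate along the square's edges
```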
• In CNNs, hidden units in a layer are only connected to a small region of the
layer before it (called local receptive field)
The depth of each feature map corresponds to the number of convolutional filters
used at each layer
(Figure: convolutional filters with shared weights w1, …, w8 connect the input image to the Layer 1 feature map, and the Layer 1 feature map to the Layer 2 feature map)
(Figure: a CNN with stacked convolutional layers of 64 to 512 filters and max-pooling layers, classifying images into scene categories such as living room, bedroom, kitchen, bathroom, and outdoor)
Residual CNNs
Convolutional Neural Networks
• Recurrent NNs are used for modeling sequential data and data with varying
length of inputs and outputs
Videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively passing
information across a sequence
 Memory of the previous inputs is stored in the model’s internal state and affects the model predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
• RNNs use the same set of weights, w_h and w_x, across all time steps
 A sequence of hidden states is learned, which represents the memory of the network
 The hidden state at step t, h_t, is calculated based on the previous hidden state h_{t−1} and the input at the current step x_t, i.e., h_t = f(w_h h_{t−1} + w_x x_t)
 The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time
(Figure: the unrolled RNN processes the input sequence x1, x2, x3; the hidden states h0 → h1 → h2 → h3 are connected through the shared weights w_h, each input enters through the shared weights w_x, and the output is computed from the final hidden state through w_y)
Slide credit: Param Vir Singh – Deep Learning
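A NumPy sketch of the vanilla RNN recurrence h_t = f(w_h h_{t−1} + w_x x_t), with an output computed from the final hidden state via w_y as in the figure; the tanh choice, dimensions, and random weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, Wy, h0):
    """Unroll a vanilla RNN over an input sequence, reusing the same weights at every step."""
    h = h0
    for x in xs:                         # process the sequence one element at a time
        h = np.tanh(Wh @ h + Wx @ x)     # h_t = f(W_h h_{t-1} + W_x x_t)
    return Wy @ h                        # output from the final hidden state

rng = np.random.default_rng(0)
D, H, O, T = 4, 8, 3, 5                  # input dim, hidden dim, output dim, sequence length (illustrative)
xs = rng.normal(size=(T, D))
Wh, Wx, Wy = rng.normal(size=(H, H)), rng.normal(size=(H, D)), rng.normal(size=(O, H))
print(rnn_forward(xs, Wh, Wx, Wy, np.zeros(H)).shape)   # (3,)
```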
• RNNs can have one of many inputs and one of many outputs
(Examples: image captioning – an image is mapped to the sentence “A person riding a motorbike on dirt road”; machine translation – “Happy Diwali” is translated to “शुभ दीपावली”)
Bidirectional RNNs
Recurrent Neural Networks
Forward:  h⃗_t = σ(W^(hh) h⃗_{t−1} + W^(hx) x_t)
Backward:  h⃖_t = σ(W′^(hh) h⃖_{t+1} + W′^(hx) x_t), with separate weights W′
Output:  y_t = f([ h⃗_t ; h⃖_t ])
LSTM Networks
Recurrent Neural Networks
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences