Lec5 CNN RNN Attention
INTRODUCTION
TO ARTIFICIAL
INTELLIGENCE
Tianyi Zhou
09/13/2023
University of Maryland
Some slides are adapted from Song & Russell @ Berkeley,
Fei-Fei @ Stanford, and Shao @ William and Mary
Seven Components of this course
[Course overview diagram: Agent, World, Human users; Acting, Action Prediction, Probabilistic Reasoning, Language Models, Embodied & Multi-modal AI, Neural Networks]
Underfitting and Overfitting
§ Overfitting: the model fits the (few) training data too closely, leading to low training error but high test error.
§ Underfitting: the model is too simple for the amount of data, so both training and test errors are high.
Weight decay and L2 regularization
• Weight decay discourages large weights by shrinking them in each step: 𝑤 ← 𝑤 − 𝜂∇𝐸(𝑤) − 𝜆𝑤.
• It is similar to, but not exactly the same as, L2 regularization, which leads to 𝑤 ← 𝑤 − 𝜂(∇𝐸(𝑤) + 𝜆𝑤).
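A minimal numpy sketch contrasting the two update rules above (the toy loss gradient, learning rate, and λ value are illustrative assumptions):

```python
import numpy as np

def sgd_weight_decay(w, grad_E, lr=0.1, lam=0.01):
    # Decoupled weight decay: shrink w directly, independent of the loss gradient.
    return w - lr * grad_E(w) - lam * w

def sgd_l2(w, grad_E, lr=0.1, lam=0.01):
    # L2 regularization: the penalty gradient lam*w is also scaled by the learning rate.
    return w - lr * (grad_E(w) + lam * w)

# Toy gradient of an illustrative loss: grad E(w) = w + 1.
grad_E = lambda w: w + 1.0
w = np.ones(3)
print(sgd_weight_decay(w, grad_E))  # -> approx [0.79  0.79  0.79]
print(sgd_l2(w, grad_E))            # -> approx [0.799 0.799 0.799]
```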
Dropout: Prevent Overfitting
Dropout refers to dropping out units (neurons) in a neural network during training. It is used to prevent overfitting.
Each unit is kept or dropped independently:
dropout(𝑝) = 0 with probability 𝑝, 1 with probability 1 − 𝑝
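A minimal numpy sketch of dropout at training time; the rescaling by 1/(1 − p) ("inverted dropout") is a common convention assumed here so that the expected activation is unchanged at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Zero each unit with probability p; at test time, return x unchanged."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)  # 0 w.p. p, 1 w.p. 1 - p
    return x * mask / (1.0 - p)  # inverted dropout: keep the expected activation equal to x

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, p=0.5))                   # e.g. [2. 0. 6. 0.] (random)
print(dropout(h, p=0.5, training=False))   # [1. 2. 3. 4.]
```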
Dropout vs. Ensemble
Ø An ensemble combines several base models in order to produce one stronger predictive model, e.g., by majority voting.
Ø Dropout can be viewed as implicitly training many "thinned" sub-networks that share weights and averaging them at test time, i.e., a cheap approximation to an ensemble.
Batch Normalization (BN)
Batch normalization is a method that normalizes activations in a network across the mini-batch. For each feature, it computes the mean and variance of that feature over the mini-batch and uses them to standardize the activations.

          Feature 1   Feature 2   Feature 3
Image 1       1           2           5
Image 2       4           1           3

Here BN normalizes each feature column across the two images (the per-feature means are 2.5, 1.5, and 4).
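A minimal numpy sketch of batch normalization applied to the 2✕3 mini-batch above (the learnable scale/shift γ, β and the ε constant are standard details assumed here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features). Normalize each feature (column) across the mini-batch.
    mean = x.mean(axis=0)              # per-feature mean
    var = x.var(axis=0)                # per-feature variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta        # learnable scale and shift

batch = np.array([[1., 2., 5.],        # Image 1
                  [4., 1., 3.]])       # Image 2
print(batch.mean(axis=0))              # [2.5 1.5 4. ]
print(np.round(batch_norm(batch)))     # [[-1.  1.  1.]  [ 1. -1. -1.]]
```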
Advantages of Batch Normalization (BN)
q Stabilizes and speeds up training, allowing larger learning rates
q Reduces sensitivity to weight initialization
q Adds a mild regularization effect through mini-batch noise
Layer Normalization
Layer normalization normalizes the activations along the feature dimension instead of the mini-batch dimension: each input sample is normalized across its own features.

          Feature 1   Feature 2   Feature 3
Image 1       1           2           5
Image 2       4           1           3

Here LN normalizes each row (image) across its three features, independently of the other images in the batch.
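For contrast with BN, a minimal numpy sketch of layer normalization on the same 2✕3 mini-batch; only the normalization axis changes (per sample instead of per feature):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features). Normalize each sample (row) across its features.
    mean = x.mean(axis=1, keepdims=True)   # per-sample mean
    var = x.var(axis=1, keepdims=True)     # per-sample variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1., 2., 5.],            # Image 1
                  [4., 1., 3.]])           # Image 2
print(batch.mean(axis=1))                  # approx [2.667 2.667]: statistics no longer depend on the batch
print(layer_norm(batch))                   # Image 1 -> approx [-0.98, -0.39, 1.37]
```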
Layer Normalization
Advantages:
q Removes the dependency on the batch of data samples (works with any batch size, including 1)
q Makes normalization easier to apply to recurrent neural networks and other settings where batch statistics are unreliable
Other normalization
Early Stopping
Early stopping is a technique used to reduce overfitting without compromising model accuracy: training is stopped once the validation loss stops improving.
[Plot: loss vs. epochs (0-30); stop at the epoch where the validation loss starts to rise.]
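A minimal sketch of an early-stopping training loop; the model, train_step, val_loss_fn, and patience value are placeholders, and the state_dict/load_state_dict calls assume a PyTorch-style model:

```python
def train_with_early_stopping(model, train_step, val_loss_fn, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                       # one pass over the training data
        val_loss = val_loss_fn(model)           # evaluate on held-out validation data
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = model.state_dict()     # remember the best checkpoint (deep-copy in practice)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # validation loss stopped improving
    model.load_state_dict(best_state)           # roll back to the best checkpoint
    return model
```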
Any Questions?
Where's Waldo? Why Convolutional Neural Networks (CNN)?
Multiple-Layer Perceptrons (MLP)
[Diagram: an 8✕8 single-channel image (spatial information) is flattened into a 64✕1 vector before being fed to an MLP.]
Vectorizing images for an MLP may destroy their spatial information.
WHY CNN?
[Figure: a grayscale image has one channel, i.e., it is a 2D matrix of pixel values; a small kernel slides over it.]
Convolution
Convolution: slide the kernel over the input and sum the element-wise products at each position.

Input (4✕4):        Kernel (2✕2):
0 1 1 1             1 1
1 0 1 1             1 0
1 1 0 0
0 1 1 0

Output (3✕3), stride 1, no padding. The first row is computed as:
0∗1 + 1∗1 + 1∗1 + 0∗0 = 2
1∗1 + 1∗1 + 0∗1 + 1∗0 = 2
1∗1 + 1∗1 + 1∗1 + 1∗0 = 3

2 2 3
? ? ?
? ? ?

The remaining entries are filled in the same way as the kernel slides down the input.
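A minimal numpy sketch of this sliding-window computation (CNN "convolution" is implemented as cross-correlation, i.e., the kernel is not flipped), reproducing the example above and filling in the remaining output entries:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution (cross-correlation, as in CNNs): slide the kernel and
    sum the element-wise products at every position."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0]])
kernel = np.array([[1, 1],
                   [1, 0]])
print(conv2d(image, kernel))
# [[2. 2. 3.]
#  [2. 2. 2.]
#  [2. 2. 1.]]
```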
CNN: Translation invariance
& Weight sharing via kernels
• How to detect/track a moving dog in the scene?
• Can we use a Perceptron or MLP, i.e., train w to maximize <w, x> where x is the vector of all pixels?
• Too many parameters for an MLP: with 1024✕1024 input pixels, 4096 hidden neurons, and 1000 output classes, a 2-layer MLP has over 10⁹ parameters!
[Figure: input image (5✕5, zero padded) ∗ kernel/filter (3✕3) = output featuremap (3✕3).]
• In general:
– Input: n ✕ H ✕ W ✕ Nin
– Weights: Nout ✕ k ✕ k ✕ Nin
– Output: n ✕ H’ ✕ W’ ✕ Nout
– H′ = (H − k + 2P)/s + 1, W′ = (W − k + 2P)/s + 1, with padding P and stride s
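A small sketch of the output-size formula and of the parameter counts compared above; the 3→64-channel conv layer is an illustrative assumption:

```python
def conv_output_size(H, W, k, stride=1, padding=0):
    """H' = (H - k + 2P) / s + 1 (same for W); floor division for non-exact fits."""
    H_out = (H - k + 2 * padding) // stride + 1
    W_out = (W - k + 2 * padding) // stride + 1
    return H_out, W_out

def conv_params(N_in, N_out, k, bias=True):
    """A conv layer has N_out * k * k * N_in weights (+ N_out biases)."""
    return N_out * k * k * N_in + (N_out if bias else 0)

# MLP from the slide: 1024x1024 pixels -> 4096 hidden -> 1000 classes.
mlp_params = 1024 * 1024 * 4096 + 4096 * 1000            # ~4.3e9 weights
conv_layer = conv_params(N_in=3, N_out=64, k=3)           # assumed example: 3 -> 64 channels, 3x3 kernel
print(f"{mlp_params:.2e}", conv_layer)                    # 4.30e+09 vs 1792
print(conv_output_size(5, 5, k=3))                        # (3, 3), as in the 5x5 zero-padded example
```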
Depth-wise convolution (left) &
depth-wise separable convolution (right)
• Can we reduce the computation and number of parameters for multi-channel input and output?
https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
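A PyTorch sketch of the depth-wise separable factorization (channel sizes are illustrative assumptions); the groups argument makes the first convolution depth-wise, and the 1✕1 convolution then mixes channels:

```python
import torch.nn as nn

N_in, N_out, k = 64, 128, 3

# Standard convolution: every output channel mixes all input channels.
standard = nn.Conv2d(N_in, N_out, kernel_size=k, padding=1)

# Depth-wise separable = depth-wise conv (one k x k filter per input channel, groups=N_in)
# followed by a 1x1 point-wise conv that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(N_in, N_in, kernel_size=k, padding=1, groups=N_in),
    nn.Conv2d(N_in, N_out, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 73856  (N_out*k*k*N_in + N_out)
print(count(separable))  # 8960   (N_in*k*k + N_in) + (N_out*N_in + N_out)
```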
CNN: Receptive field
• Each pixel in the output featuremap of layer-l is
produced from a region of the input image (and thus
captures the feature of the region).
• The size (height x width) of the region is called
receptive field (RF).
• Different RFs capture different scales of features.
• Multi-scale features can be helpful to many tasks.
https://www.baeldung.com/cs/cnn-receptive-field-size
https://theaisummer.com/receptive-field/
Dilated convolution
• The RF increases exponentially with the number of layers, and lower-layer features have smaller RFs.
• How can we increase the RF without increasing the depth or changing the kernel configuration? Dilated (atrous) convolution inserts gaps between kernel entries, enlarging the RF at the same parameter cost.
CNN: pooling
• How to quickly reduce the
dimensionality of a featuremap?
• Pooling: a summary of the featuremap.
• Compared to convolution: there are no learnable kernel weights, and the output size is computed with the same formula.
• Max-pooling is not differentiable! How
does backprop go through it?
– Locally linear at the selected maximal value
with slope of 1
– Zero for other neurons (similar to ReLU)
• Advanced: attentional pooling in
Transformer models such as ViT.
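A minimal numpy sketch of non-overlapping 2✕2 max pooling:

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling: the output size follows the same rule as
    convolution with k=2, stride=2, but there are no learnable weights."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j] = x[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 2.],
                 [2., 2., 1., 3.]])
print(max_pool2x2(fmap))
# [[4. 2.]
#  [2. 5.]]
# In backprop, the gradient flows only to the position of each selected maximum
# (slope 1) and is zero elsewhere, similar to ReLU.
```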
CNN: multi-scale features
https://distill.pub/2017/feature-visualization/
CNN: flatten layer + MLP
How to compute the final prediction from the CNN featuremap?
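A PyTorch sketch of a flatten layer followed by an MLP classification head; the featuremap shape (64 channels, 8✕8) and the number of classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Flatten the final featuremap and feed it to an MLP classifier.
head = nn.Sequential(
    nn.Flatten(),                 # (N, 64, 8, 8) -> (N, 64*8*8)
    nn.Linear(64 * 8 * 8, 256),
    nn.ReLU(),
    nn.Linear(256, 10),           # class logits
)

featuremap = torch.randn(2, 64, 8, 8)   # a batch of 2 featuremaps from the conv layers
print(head(featuremap).shape)           # torch.Size([2, 10])
```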
CNN: design
Any Questions?
Data Augmentation
Data augmentation: a technique used to increase the amount of training data by adding slightly modified copies of existing data.
Flip
Data Augmentation
Crop
Data Augmentation
Rotation
Data Augmentation
Translation simply shifts the image along the X or Y direction.
Data Augmentation
Cutout
Data Augmentation
CutMix
Data Augmentation
MixUp
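A sketch of these augmentations with torchvision (parameter values are illustrative; RandomErasing is used here as a Cutout-style stand-in, and the MixUp helper assumes one-hot/soft float labels):

```python
import torch
import torchvision.transforms as T

# Common augmentations applied on the fly to training images.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                     # Flip
    T.RandomCrop(32, padding=4),                       # Crop (with zero padding)
    T.RandomRotation(degrees=15),                      # Rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # Translation
    T.ToTensor(),
    T.RandomErasing(p=0.5),                            # Cutout-style erasing (on tensors)
])

def mixup(x, y, alpha=0.2):
    """MixUp: a convex combination of two training examples and their (soft) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```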
ResNet: practical CNN
ResNet makes it possible to train networks with 100+ layers by adding skip connections: each residual block outputs 𝐹(𝑥) + 𝑥, where the identity shortcut 𝑥 bypasses (skips) the block's transformation 𝐹(𝑥).
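A PyTorch sketch of a basic residual block computing ReLU(F(x) + x); for simplicity it assumes the number of channels does not change across the block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where the skip connection
    adds the input x back to the block's transformation F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # F(x)
        return F.relu(out + x)               # F(x) + x: the skip connection

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)            # torch.Size([1, 16, 8, 8])
```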
Transposed convolution

Input (2✕2):        Kernel (2✕2):
3 2                 1 1
4 0                 1 1

Each input element multiplies the whole kernel, and the scaled kernels are added into overlapping positions of the output:
3∗Kernel = [3 3; 3 3] at the top-left, 2∗Kernel = [2 2; 2 2] shifted one column right, 4∗Kernel = [4 4; 4 4] shifted one row down, 0∗Kernel = [0 0; 0 0] at the bottom-right. Summing the overlaps gives:

3 5 2
7 9 2
4 4 0
Transposed convolution (practice)

Input (2✕2):        Kernel (2✕2):
1 2                 1 0
3 0                 0 1

= ___ + ___ + ___ + ___

= ___
Transposed convolution (answer)

Input (2✕2):        Kernel (2✕2):
1 2                 1 0
3 0                 0 1

1∗Kernel = [1 0; 0 1] at the top-left, 2∗Kernel = [2 0; 0 2] shifted one column right, 3∗Kernel = [3 0; 0 3] shifted one row down, 0∗Kernel = [0 0; 0 0] at the bottom-right. Summing the overlaps gives:

1 2 0
3 1 2
0 3 0
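A minimal numpy sketch of the scatter-and-add view of transposed convolution (stride 1), reproducing both worked examples:

```python
import numpy as np

def transposed_conv2d(x, kernel):
    """Transposed convolution (stride 1): each input element scales the whole kernel,
    and the scaled copies are added into overlapping positions of the output."""
    H, W = x.shape
    k, _ = kernel.shape
    out = np.zeros((H + k - 1, W + k - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + k, j:j + k] += x[i, j] * kernel
    return out

print(transposed_conv2d(np.array([[3., 2.], [4., 0.]]),
                        np.array([[1., 1.], [1., 1.]])))
# [[3. 5. 2.]
#  [7. 9. 2.]
#  [4. 4. 0.]]
print(transposed_conv2d(np.array([[1., 2.], [3., 0.]]),
                        np.array([[1., 0.], [0., 1.]])))
# [[1. 2. 0.]
#  [3. 1. 2.]
#  [0. 3. 0.]]
```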
U-Net
[Figure: U-Net architecture, an encoder-decoder CNN whose encoder downsamples and whose decoder upsamples (e.g., via transposed convolutions), with skip connections between mirrored layers.]
Diffusion model
• Forward diffusion (image to
noise): progressively add noise to
an image for T steps.
• Backward diffusion (noise to
image): apply U-Net recursively
for T denoising steps.
• Train the U-Net to predict the noise added at each step t, and then remove the predicted noise.
• Diffusion model aims to learn a
generative model capturing the
distribution of images.
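A minimal numpy sketch of the forward (noising) process using the standard DDPM closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε; the linear β schedule and the number of steps T are illustrative assumptions:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def forward_diffusion(x0, t, rng=np.random):
    """Sample x_t directly from x_0: progressively noisier as t grows."""
    eps = rng.normal(size=x0.shape)       # the noise the U-Net is trained to predict
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps                        # training target: predict eps from (xt, t)

x0 = np.random.rand(32, 32)               # a toy "image"
xt, eps = forward_diffusion(x0, t=500)
```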
Stable diffusion & AIGC
Any Questions?
Why Recurrent neural networks?
• How to model time-series data (speech, text, videos, house prices, trajectories, sensor signals), i.e., (x_1, x_2, ⋯, x_T)?
– How to capture the time dependency?
– How to represent a sequence?
• What kinds of tasks do we expect to address?
– Denoising of sequence: T inputs T outputs
– Generation of sequences: T inputs T’ outputs
– Sequential labeling: T inputs T outputs
– Sequence classification: T inputs 1 output
• Can we apply a neuron to input at each
time step?
• How to process input data with
different lengths using the same DNN?
Sequential generation of non-sequential data
• Image captioning: image to a sequence of words
• Action recognition: video frames to an action class
• Machine translation: Chinese to English
• Speech recognition: word-level prediction
Sequence to sequence =
many to one + many to many
• Encoder-decoder architecture that is widely used in machine translation
RNN: update hidden state from input
RNN: generate output from hidden state
Unrolling RNN
Vanilla RNN
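A minimal numpy sketch of a vanilla RNN: the same weights update the hidden state from the input and generate the output at every time step, so sequences of any length can be processed:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: the same weights are reused at every time step."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # update hidden state from input
    y_t = W_hy @ h_t + b_y                            # generate output from hidden state
    return h_t, y_t

def rnn_forward(xs, h0, params):
    """Unroll the RNN over a sequence with the same parameters at every step."""
    h, ys = h0, []
    for x_t in xs:                                    # works for sequences of different lengths
        h, y = rnn_step(x_t, h, *params)
        ys.append(y)
    return h, ys

d_in, d_h, d_out = 3, 5, 2
params = (np.random.randn(d_h, d_in), np.random.randn(d_h, d_h),
          np.random.randn(d_out, d_h), np.zeros(d_h), np.zeros(d_out))
h_T, ys = rnn_forward([np.random.randn(d_in) for _ in range(7)], np.zeros(d_h), params)
```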
Backpropagation through time
• “Many to One” task as an example.
Gradient vanishing & explosion
• Recurrent: the hidden state output at each step is fed as input to the next step, for T time steps.
• RNN: MLP with T layers but they all share the same weight matrix.
• Training RNNs can be a problem: backpropagation through time (BPTT) leads to gradient vanishing or explosion.
https://towardsdatascience.com/the-exploding-and-
vanishing-gradients-problem-in-time-series-6b87d558d22
RNNs equipped with gates
LSTM: Long Short-Term Memory; GRU: Gated Recurrent Unit
https://towardsdatascience.com/illustrated-guide-to-
lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
LSTM: Forget gate
LSTM: Input gate
LSTM: Cell gate
LSTM: Output gate
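A minimal numpy sketch of the standard LSTM cell update, matching the forget/input/cell/output gates above; packing the four gates' parameters into single W, U, b matrices is an implementation choice assumed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates
    (forget f, input i, cell candidate g, output o), stacked in that order."""
    z = W @ x_t + U @ h_prev + b          # shape (4*d_h,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)                        # forget gate: what to erase from the cell state
    i = sigmoid(i)                        # input gate: how much new information to write
    g = np.tanh(g)                        # candidate cell content
    o = sigmoid(o)                        # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g              # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)                # new hidden state (short-term output)
    return h_t, c_t

d_in, d_h = 3, 4
W, U, b = np.random.randn(4 * d_h, d_in), np.random.randn(4 * d_h, d_h), np.zeros(4 * d_h)
h, c = lstm_step(np.random.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```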
Seq2seq task using LSTM
From perceptron to attention
• Perceptron: f_j(x) = σ(Σ_{i=1..n} w_{i,j} x_i), i.e., nm parameters w_{i,j} for n inputs and m outputs.
• Convolution: f_j(x) = σ(Σ_{0≤p,q<k} w_{p,q} x_{j1+p, j2+q}), only k² parameters w_{p,q}, shared across all output positions j = (j1, j2).