Lec5 CNN RNN Attention
INTRODUCTION
TO ARTIFICIAL
INTELLIGENCE
Tianyi Zhou
09/13/2023
University of Maryland
Some slides are adapted from Song & Russell @ Berkeley,
Fei-Fei @ Stanford, and Shao @ William and Mary
Seven Components of this course
[Course overview diagram: Agent, World, Human users; Acting, Action Prediction, Probabilistic Reasoning, Language Models, Embodied & Multi-modal AI, Neural Networks]
Underfitting and Overfitting
§ Overfitting: the model fits the (few) training data too closely, leading to low training error but high test error.
§ Underfitting: the model is too simple for the amount of data, so both training and test errors are high.
Weight decay and L2 regularization
• Weight decay discourages large weights by shrinking them in each step: 𝑤 ← 𝑤 − 𝜂∇𝐸(𝑤) − 𝜆𝑤.
• It is similar to, but not exactly the same as, L2 regularization, which leads to 𝑤 ← 𝑤 − 𝜂(∇𝐸(𝑤) + 𝜆𝑤).
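A minimal numpy sketch contrasting the two update rules above (the toy loss gradient, learning rate, and λ value are illustrative assumptions):

```python
import numpy as np

def sgd_weight_decay(w, grad_E, lr=0.1, lam=0.01):
    # Decoupled weight decay: shrink w directly, independent of the loss gradient.
    return w - lr * grad_E(w) - lam * w

def sgd_l2(w, grad_E, lr=0.1, lam=0.01):
    # L2 regularization: the penalty gradient lam*w is also scaled by the learning rate.
    return w - lr * (grad_E(w) + lam * w)

# Toy gradient of an illustrative loss: grad E(w) = w + 1.
grad_E = lambda w: w + 1.0
w = np.ones(3)
print(sgd_weight_decay(w, grad_E))  # -> approx [0.79  0.79  0.79]
print(sgd_l2(w, grad_E))            # -> approx [0.799 0.799 0.799]
```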
Dropout: Prevent Overfitting
Dropout refers to dropping out units (neurons) in a neural network during training. It is used to prevent overfitting.
Each unit is kept or dropped independently:
dropout(𝑝) = 0 with probability 𝑝, 1 with probability 1 − 𝑝
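A minimal numpy sketch of dropout at training time; the rescaling by 1/(1 − p) ("inverted dropout") is a common convention assumed here so that the expected activation is unchanged at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Zero each unit with probability p; at test time, return x unchanged."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)  # 0 w.p. p, 1 w.p. 1 - p
    return x * mask / (1.0 - p)  # inverted dropout: keep the expected activation equal to x

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, p=0.5))                   # e.g. [2. 0. 6. 0.] (random)
print(dropout(h, p=0.5, training=False))   # [1. 2. 3. 4.]
```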
Dropout vs. Ensemble
Ø An ensemble combines several base models in order to produce one stronger predictive model, e.g., by majority voting.
Ø Dropout can be viewed as implicitly training many "thinned" sub-networks that share weights and averaging them at test time, i.e., a cheap approximation to an ensemble.
Batch Normalization (BN)
Batch normalization is a method that normalizes activations in a network across the mini-batch. For each feature, it computes the mean and variance of that feature over the mini-batch and uses them to standardize the activations.

          Feature 1   Feature 2   Feature 3
Image 1       1           2           5
Image 2       4           1           3

Here BN normalizes each feature column across the two images (the per-feature means are 2.5, 1.5, and 4).
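A minimal numpy sketch of batch normalization applied to the 2✕3 mini-batch above (the learnable scale/shift γ, β and the ε constant are standard details assumed here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features). Normalize each feature (column) across the mini-batch.
    mean = x.mean(axis=0)              # per-feature mean
    var = x.var(axis=0)                # per-feature variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta        # learnable scale and shift

batch = np.array([[1., 2., 5.],        # Image 1
                  [4., 1., 3.]])       # Image 2
print(batch.mean(axis=0))              # [2.5 1.5 4. ]
print(np.round(batch_norm(batch)))     # [[-1.  1.  1.]  [ 1. -1. -1.]]
```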
Advantages of Batch Normalization (BN)
q Stabilizes and speeds up training, allowing larger learning rates
q Reduces sensitivity to weight initialization
q Adds a mild regularization effect through mini-batch noise
Layer Normalization
Layer normalization normalizes the activations along the feature dimension instead of the mini-batch dimension: each input sample is normalized across its own features.

          Feature 1   Feature 2   Feature 3
Image 1       1           2           5
Image 2       4           1           3

Here LN normalizes each row (image) across its three features, independently of the other images in the batch.
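For contrast with BN, a minimal numpy sketch of layer normalization on the same 2✕3 mini-batch; only the normalization axis changes (per sample instead of per feature):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features). Normalize each sample (row) across its features.
    mean = x.mean(axis=1, keepdims=True)   # per-sample mean
    var = x.var(axis=1, keepdims=True)     # per-sample variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1., 2., 5.],            # Image 1
                  [4., 1., 3.]])           # Image 2
print(batch.mean(axis=1))                  # approx [2.667 2.667]: statistics no longer depend on the batch
print(layer_norm(batch))                   # Image 1 -> approx [-0.98, -0.39, 1.37]
```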
Layer Normalization
Advantages:
q Removes the dependency on the batch of data samples (works with any batch size, including 1)
q Makes normalization easier to apply to recurrent neural networks and other settings where batch statistics are unreliable
Other normalization
Early Stopping
Early stopping is a technique used to reduce overfitting without compromising model accuracy: training is stopped once the validation loss stops improving.
[Plot: loss vs. epochs (0-30); stop at the epoch where the validation loss starts to rise.]
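A minimal sketch of an early-stopping training loop; the model, train_step, val_loss_fn, and patience value are placeholders, and the state_dict/load_state_dict calls assume a PyTorch-style model:

```python
def train_with_early_stopping(model, train_step, val_loss_fn, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                       # one pass over the training data
        val_loss = val_loss_fn(model)           # evaluate on held-out validation data
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = model.state_dict()     # remember the best checkpoint (deep-copy in practice)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # validation loss stopped improving
    model.load_state_dict(best_state)           # roll back to the best checkpoint
    return model
```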
Any Questions?
Where's Waldo? Why Convolutional Neural Networks (CNN)?
Multiple-Layer Perceptrons (MLP)
[Diagram: an 8✕8 single-channel image (spatial information) is flattened into a 64✕1 vector before being fed to an MLP.]
Vectorizing images for an MLP may destroy their spatial information.
WHY CNN?
[Figure: a grayscale image has one channel, i.e., it is a 2D matrix of pixel values; a small kernel slides over it.]
Convolution
Convolution: slide the kernel over the input and sum the element-wise products at each position.

Input (4✕4):        Kernel (2✕2):
0 1 1 1             1 1
1 0 1 1             1 0
1 1 0 0
0 1 1 0

Output (3✕3), stride 1, no padding. The first row is computed as:
0∗1 + 1∗1 + 1∗1 + 0∗0 = 2
1∗1 + 1∗1 + 0∗1 + 1∗0 = 2
1∗1 + 1∗1 + 1∗1 + 1∗0 = 3

2 2 3
? ? ?
? ? ?

The remaining entries are filled in the same way as the kernel slides down the input.
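A minimal numpy sketch of this sliding-window computation (CNN "convolution" is implemented as cross-correlation, i.e., the kernel is not flipped), reproducing the example above and filling in the remaining output entries:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution (cross-correlation, as in CNNs): slide the kernel and
    sum the element-wise products at every position."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0]])
kernel = np.array([[1, 1],
                   [1, 0]])
print(conv2d(image, kernel))
# [[2. 2. 3.]
#  [2. 2. 2.]
#  [2. 2. 1.]]
```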
CNN: Translation invariance
& Weight sharing via kernels
• How to detect/track a moving dog in the scene?
• Can we use a Perceptron or MLP, i.e., train w to maximize <w, x> where x is the vector of all pixels?
• Too many parameters for an MLP: with 1024✕1024 input pixels, 4096 hidden neurons, and 1000 output classes, a 2-layer MLP has over 10⁹ parameters!
[Figure: input image (5✕5, zero padded) ∗ kernel/filter (3✕3) = output featuremap (3✕3).]
• In general:
– Input: n ✕ H ✕ W ✕ Nin
– Weights: Nout ✕ k ✕ k ✕ Nin
– Output: n ✕ H’ ✕ W’ ✕ Nout
– H′ = (H − k + 2P)/s + 1, W′ = (W − k + 2P)/s + 1, with padding P and stride s
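A small sketch of the output-size formula and of the parameter counts compared above; the 3→64-channel conv layer is an illustrative assumption:

```python
def conv_output_size(H, W, k, stride=1, padding=0):
    """H' = (H - k + 2P) / s + 1 (same for W); floor division for non-exact fits."""
    H_out = (H - k + 2 * padding) // stride + 1
    W_out = (W - k + 2 * padding) // stride + 1
    return H_out, W_out

def conv_params(N_in, N_out, k, bias=True):
    """A conv layer has N_out * k * k * N_in weights (+ N_out biases)."""
    return N_out * k * k * N_in + (N_out if bias else 0)

# MLP from the slide: 1024x1024 pixels -> 4096 hidden -> 1000 classes.
mlp_params = 1024 * 1024 * 4096 + 4096 * 1000            # ~4.3e9 weights
conv_layer = conv_params(N_in=3, N_out=64, k=3)           # assumed example: 3 -> 64 channels, 3x3 kernel
print(f"{mlp_params:.2e}", conv_layer)                    # 4.30e+09 vs 1792
print(conv_output_size(5, 5, k=3))                        # (3, 3), as in the 5x5 zero-padded example
```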
Depth-wise convolution (left) &
depth-wise separable convolution (right)
• Can we reduce the computation and number of parameters for multi-channel input and output?
https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
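A PyTorch sketch of the depth-wise separable factorization (channel sizes are illustrative assumptions); the groups argument makes the first convolution depth-wise, and the 1✕1 convolution then mixes channels:

```python
import torch.nn as nn

N_in, N_out, k = 64, 128, 3

# Standard convolution: every output channel mixes all input channels.
standard = nn.Conv2d(N_in, N_out, kernel_size=k, padding=1)

# Depth-wise separable = depth-wise conv (one k x k filter per input channel, groups=N_in)
# followed by a 1x1 point-wise conv that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(N_in, N_in, kernel_size=k, padding=1, groups=N_in),
    nn.Conv2d(N_in, N_out, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 73856  (N_out*k*k*N_in + N_out)
print(count(separable))  # 8960   (N_in*k*k + N_in) + (N_out*N_in + N_out)
```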
CNN: Receptive field
• Each pixel in the output featuremap of layer-l is
produced from a region of the input image (and thus
captures the feature of the region).
• The size (height x width) of the region is called
receptive field (RF).
• Different RFs capture different scales of features.
• Multi-scale features can be helpful to many tasks.
https://www.baeldung.com/cs/cnn-receptive-field-size
https://theaisummer.com/receptive-field/
Dilated convolution
• The RF increases exponentially with the number of layers, and lower-layer features have smaller RFs.
• How can we increase the RF without increasing the depth or changing the kernel configuration? Dilated (atrous) convolution inserts gaps between kernel entries, enlarging the RF at the same parameter cost.
CNN: pooling
• How to quickly reduce the
dimensionality of a featuremap?
• Pooling: a summary of the featuremap.
• Compared to convolution: there are no learnable kernel weights, and the output size is computed with the same formula.
• Max-pooling is not differentiable! How
does backprop go through it?
– Locally linear at the selected maximal value
with slope of 1
– Zero for other neurons (similar to ReLU)
• Advanced: attentional pooling in
Transformer models such as ViT.
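A minimal numpy sketch of non-overlapping 2✕2 max pooling:

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling: the output size follows the same rule as
    convolution with k=2, stride=2, but there are no learnable weights."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j] = x[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 2.],
                 [2., 2., 1., 3.]])
print(max_pool2x2(fmap))
# [[4. 2.]
#  [2. 5.]]
# In backprop, the gradient flows only to the position of each selected maximum
# (slope 1) and is zero elsewhere, similar to ReLU.
```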
CNN: multi-scale features
https://distill.pub/2017/feature-visualization/
CNN: flatten layer + MLP
How to compute the final prediction from the CNN featuremap?
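A PyTorch sketch of a flatten layer followed by an MLP classification head; the featuremap shape (64 channels, 8✕8) and the number of classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Flatten the final featuremap and feed it to an MLP classifier.
head = nn.Sequential(
    nn.Flatten(),                 # (N, 64, 8, 8) -> (N, 64*8*8)
    nn.Linear(64 * 8 * 8, 256),
    nn.ReLU(),
    nn.Linear(256, 10),           # class logits
)

featuremap = torch.randn(2, 64, 8, 8)   # a batch of 2 featuremaps from the conv layers
print(head(featuremap).shape)           # torch.Size([2, 10])
```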
CNN: design
Any Questions?
Data Augmentation
Data augmentation: a technique used to increase the amount of training data by adding slightly modified copies of existing data.
Flip
Data Augmentation
Crop
Data Augmentation
Rotation
Data Augmentation
Translation simply shifts the image along the X or Y direction.
Data Augmentation
Cutout
Data Augmentation
CutMix
Data Augmentation
MixUp
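A sketch of these augmentations with torchvision (parameter values are illustrative; RandomErasing is used here as a Cutout-style stand-in, and the MixUp helper assumes one-hot/soft float labels):

```python
import torch
import torchvision.transforms as T

# Common augmentations applied on the fly to training images.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                     # Flip
    T.RandomCrop(32, padding=4),                       # Crop (with zero padding)
    T.RandomRotation(degrees=15),                      # Rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # Translation
    T.ToTensor(),
    T.RandomErasing(p=0.5),                            # Cutout-style erasing (on tensors)
])

def mixup(x, y, alpha=0.2):
    """MixUp: a convex combination of two training examples and their (soft) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```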
ResNet: practical CNN
ResNet makes it possible to train networks with 100+ layers by adding skip connections: each residual block outputs 𝐹(𝑥) + 𝑥, where the identity shortcut 𝑥 bypasses (skips) the block's transformation 𝐹(𝑥).
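A PyTorch sketch of a basic residual block computing ReLU(F(x) + x); for simplicity it assumes the number of channels does not change across the block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where the skip connection
    adds the input x back to the block's transformation F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # F(x)
        return F.relu(out + x)               # F(x) + x: the skip connection

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)            # torch.Size([1, 16, 8, 8])
```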
Transposed convolution

Input (2✕2):        Kernel (2✕2):
3 2                 1 1
4 0                 1 1

Each input element multiplies the whole kernel, and the scaled kernels are added into overlapping positions of the output:
3∗Kernel = [3 3; 3 3] at the top-left, 2∗Kernel = [2 2; 2 2] shifted one column right, 4∗Kernel = [4 4; 4 4] shifted one row down, 0∗Kernel = [0 0; 0 0] at the bottom-right. Summing the overlaps gives:

3 5 2
7 9 2
4 4 0
Transposed convolution (practice)

Input (2✕2):        Kernel (2✕2):
1 2                 1 0
3 0                 0 1

= ___ + ___ + ___ + ___

= ___
Transposed convolution (answer)

Input (2✕2):        Kernel (2✕2):
1 2                 1 0
3 0                 0 1

1∗Kernel = [1 0; 0 1] at the top-left, 2∗Kernel = [2 0; 0 2] shifted one column right, 3∗Kernel = [3 0; 0 3] shifted one row down, 0∗Kernel = [0 0; 0 0] at the bottom-right. Summing the overlaps gives:

1 2 0
3 1 2
0 3 0
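A minimal numpy sketch of the scatter-and-add view of transposed convolution (stride 1), reproducing both worked examples:

```python
import numpy as np

def transposed_conv2d(x, kernel):
    """Transposed convolution (stride 1): each input element scales the whole kernel,
    and the scaled copies are added into overlapping positions of the output."""
    H, W = x.shape
    k, _ = kernel.shape
    out = np.zeros((H + k - 1, W + k - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + k, j:j + k] += x[i, j] * kernel
    return out

print(transposed_conv2d(np.array([[3., 2.], [4., 0.]]),
                        np.array([[1., 1.], [1., 1.]])))
# [[3. 5. 2.]
#  [7. 9. 2.]
#  [4. 4. 0.]]
print(transposed_conv2d(np.array([[1., 2.], [3., 0.]]),
                        np.array([[1., 0.], [0., 1.]])))
# [[1. 2. 0.]
#  [3. 1. 2.]
#  [0. 3. 0.]]
```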
U-Net
[Figure: U-Net architecture, an encoder-decoder CNN whose encoder downsamples and whose decoder upsamples (e.g., via transposed convolutions), with skip connections between mirrored layers.]
Diffusion model
• Forward diffusion (image to
noise): progressively add noise to
an image for T steps.
• Backward diffusion (noise to
image): apply U-Net recursively
for T denoising steps.
• Train the U-Net to predict the noise added at each step t, and then remove the predicted noise.
• Diffusion model aims to learn a
generative model capturing the
distribution of images.
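A minimal numpy sketch of the forward (noising) process using the standard DDPM closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε; the linear β schedule and the number of steps T are illustrative assumptions:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def forward_diffusion(x0, t, rng=np.random):
    """Sample x_t directly from x_0: progressively noisier as t grows."""
    eps = rng.normal(size=x0.shape)       # the noise the U-Net is trained to predict
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps                        # training target: predict eps from (xt, t)

x0 = np.random.rand(32, 32)               # a toy "image"
xt, eps = forward_diffusion(x0, t=500)
```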
Stable diffusion & AIGC
Any Questions?
Why Recurrent neural networks?
• How to model time-series data (speech, text, videos, house prices, trajectories, sensor signals), i.e., (x_1, x_2, ⋯, x_T)?
– How to capture the time dependency?
– How to represent a sequence?
• What kinds of tasks do we expect to address?
– Denoising of sequence: T inputs T outputs
– Generation of sequences: T inputs T’ outputs
– Sequential labeling: T inputs T outputs
– Sequence classification: T inputs 1 output
• Can we apply a neuron to input at each
time step?
• How to process input data with
different lengths using the same DNN?
Sequential generation of non-sequential data
• Image captioning: image to a sequence of words
• Action recognition: video frames to an action class
• Machine translation: Chinese to English
• Speech recognition: word-level prediction
Sequence to sequence =
many to one + many to many
• Encoder-decoder architecture that is widely used in machine translation
RNN: update hidden state from input
RNN: generate output from hidden state
Unrolling RNN
Vanilla RNN
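A minimal numpy sketch of a vanilla RNN: the same weights update the hidden state from the input and generate the output at every time step, so sequences of any length can be processed:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: the same weights are reused at every time step."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # update hidden state from input
    y_t = W_hy @ h_t + b_y                            # generate output from hidden state
    return h_t, y_t

def rnn_forward(xs, h0, params):
    """Unroll the RNN over a sequence with the same parameters at every step."""
    h, ys = h0, []
    for x_t in xs:                                    # works for sequences of different lengths
        h, y = rnn_step(x_t, h, *params)
        ys.append(y)
    return h, ys

d_in, d_h, d_out = 3, 5, 2
params = (np.random.randn(d_h, d_in), np.random.randn(d_h, d_h),
          np.random.randn(d_out, d_h), np.zeros(d_h), np.zeros(d_out))
h_T, ys = rnn_forward([np.random.randn(d_in) for _ in range(7)], np.zeros(d_h), params)
```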
Backpropagation through time
• “Many to One” task as an example.
Gradient vanishing & explosion
• Recurrent: the hidden state output at each step is fed as input to the next step, for T time steps.
• RNN: MLP with T layers but they all share the same weight matrix.
• Training RNNs can be a problem: backpropagation through time (BPTT) leads to gradient vanishing or explosion.
https://towardsdatascience.com/the-exploding-and-
vanishing-gradients-problem-in-time-series-6b87d558d22
RNNs equipped with gates
LSTM: Long Short-Term Memory; GRU: Gated Recurrent Unit
https://towardsdatascience.com/illustrated-guide-to-
lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
LSTM: Forget gate
LSTM: Input gate
LSTM: Cell gate
LSTM: Output gate
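A minimal numpy sketch of the standard LSTM cell update, matching the forget/input/cell/output gates above; packing the four gates' parameters into single W, U, b matrices is an implementation choice assumed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates
    (forget f, input i, cell candidate g, output o), stacked in that order."""
    z = W @ x_t + U @ h_prev + b          # shape (4*d_h,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)                        # forget gate: what to erase from the cell state
    i = sigmoid(i)                        # input gate: how much new information to write
    g = np.tanh(g)                        # candidate cell content
    o = sigmoid(o)                        # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g              # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)                # new hidden state (short-term output)
    return h_t, c_t

d_in, d_h = 3, 4
W, U, b = np.random.randn(4 * d_h, d_in), np.random.randn(4 * d_h, d_h), np.zeros(4 * d_h)
h, c = lstm_step(np.random.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```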
Seq2seq task using LSTM
From perceptron to attention
• Perceptron: f_j(x) = σ(Σ_{i=1..n} w_{i,j} x_i), i.e., nm parameters w_{i,j} for n inputs and m outputs.
• Convolution: f_j(x) = σ(Σ_{0≤p,q<k} w_{p,q} x_{j1+p, j2+q}), only k² parameters w_{p,q}, shared across all output positions j = (j1, j2).