Lecture 08
Deep Learning 1
learning end-to-end
ImageNet [2] Benchmark
Task: 1000-class classification (∼3M images)
Remarks: 1): Another key ingredient that helped facilitate the progress on the ImageNet task is the utilization of GPUs in training large CNNs [9]; 2): More up-to-date
Recap: Multi-layer perceptrons
Could we use MLPs for Images?
One might flatten images into (very) tall vectors and feed these vectors as inputs to MLPs.
Image → flatten → MLP
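To see the scale of what flattening implies, here is a minimal sketch; the image resolution and hidden-layer width below are assumed for illustration, not taken from the slides:

    import numpy as np

    h, w, c = 224, 224, 3                     # assumed input resolution
    x = np.random.rand(h, w, c)               # a dummy image
    x_flat = x.reshape(-1)                    # the (very) tall vector

    hidden = 1000                             # assumed hidden-layer width
    n_params = x_flat.size * hidden + hidden  # weights + biases of layer 1 alone
    print(x_flat.shape, n_params)             # (150528,) 150529000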
Could we use MLPs for Images? (cont.)
Could we use MLPs for Images? (cont.)
Issue 2: Statistical Inefficiency
After flattening, we do not exploit local relationships between neighbouring
pixels.
For example, pixels in the regions of car wheels often correlate, and the
correlation might be useful for some learning tasks.
Figure adapted from the Stanford Car dataset [12]; the annotation boxes are annotated by Pat.
Overview of CNNs
CNNs = learning hierarchical features (from low- to high-level features)
using convolution (learning correlations between neighbouring pixels)
and pooling layers (enlarging the size of the neighbourhoods)¹
Remarks: 1): In practice, there are other components in CNNs that help increase the performance of the models, but convolution and pooling layers are the two most important ingredients; 2) Figure is taken from LeCun et al., 2015 [15].
Convolution Operator
We refer to
▶ x as the input,
▶ w as the kernel,
▶ and (x ∗ w)(t) as the feature map, denoted by a(t).
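A minimal sketch of the discrete case, a(t) = (x ∗ w)(t) = Σ_τ x(τ) w(t − τ), for finite 1-D sequences, checked against np.convolve:

    import numpy as np

    def conv1d(x, w):
        """a(t) = (x * w)(t) = sum_tau x(tau) * w(t - tau)."""
        T, K = len(x), len(w)
        a = np.zeros(T + K - 1)
        for t in range(len(a)):
            for tau in range(T):
                if 0 <= t - tau < K:       # w is zero outside its support
                    a[t] += x[tau] * w[t - tau]
        return a

    x = np.array([1.0, 2.0, 3.0])
    w = np.array([0.0, 1.0, 0.5])
    print(conv1d(x, w))                    # [0.  1.  2.5 4.  1.5]
    print(np.convolve(x, w))               # same result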
Convolution Operator (cont.)
\[ (x * w)(t) = \sum_{\tau} x(\tau)\, w(t - \tau) \qquad (\text{define } \tau' := t - \tau) \]
\[ = \sum_{\tau'} x(t - \tau')\, w(\tau') = (w * x)(t) \]
Remarks: 1): cross-correlation (⋆) is what the underlying implementation of convolution layers in deep learning frameworks generally computes; nevertheless, we shall stick with the name convolution and use the symbol ⋆ to indicate the actual operator; 2) The range
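A small sketch of remark 1: what the frameworks compute is cross-correlation, which equals convolution with a flipped kernel:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    w = np.array([0.0, 1.0, 0.5])

    corr = np.correlate(x, w, mode="valid")       # what "conv" layers compute
    conv = np.convolve(x, w[::-1], mode="valid")  # convolution, kernel flipped
    print(corr, conv)                             # [3.5 5.] [3.5 5.]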
Figure from Wikipedia: Convolution.
Two-Dimensional Discrete Convolution
Weight W (k = 3, d_in = 1, d_out = 1)
Numerical Example
Input (5×5):
3 3 2 1 0
0 0 1 3 1
3 1 2 2 3
2 0 0 2 2
2 0 0 0 1

Kernel (3×3):
0 1 2
2 2 0
0 1 2

Output (3×3), stride 1, no padding:
12.0 12.0 17.0
10.0 17.0 19.0
 9.0  6.0 14.0
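A sketch reproducing the example above with a direct 2-D cross-correlation (stride 1, no padding):

    import numpy as np

    x = np.array([[3, 3, 2, 1, 0],
                  [0, 0, 1, 3, 1],
                  [3, 1, 2, 2, 3],
                  [2, 0, 0, 2, 2],
                  [2, 0, 0, 0, 1]], dtype=float)
    w = np.array([[0, 1, 2],
                  [2, 2, 0],
                  [0, 1, 2]], dtype=float)

    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(x[m:m + k, n:n + k] * w)  # slide the kernel
    print(out)  # [[12. 12. 17.] [10. 17. 19.] [ 9.  6. 14.]]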
Stride and Padding
Apart from the parameters (W, b), discrete convolution also has two important hyper-parameters, namely
▶ Stride (the amount the kernel is translated at each step)
▶ Padding (the amount of zero values added around the border of the input)
Discrete Convolution when d_out > 1
So far, we have mainly discussed the case when d_out = 1. When d_out > 1, we repeat the process d_out times and concatenate the outputs together.
Two-Dimensional Discrete Convolution
Let a ∈ R^(d_in × h_in × w_in) be an input and W ∈ R^(d_out × d_in × k × k) be a convolution weight. Two-dimensional discrete convolution can be expressed as
\[ z_{c',m,n} = (a \star W)_{c',m,n} = \sum_{\tau_1=1}^{k} \sum_{\tau_2=1}^{k} \sum_{c=1}^{d_{\text{in}}} a_{c,\,m+(\tau_1-1),\,n+(\tau_2-1)}\, W_{c',c,\tau_1,\tau_2}. \]
The exact values of h_out and w_out depend on the kernel size, stride, and padding.
Suppose we have a square input (h_in = w_in) and use stride 1 and no padding. We have
\[ h_{\text{out}} = w_{\text{out}} = h_{\text{in}} - k + 1. \]
See [3] for the relationship between input and output size in other settings.
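A sketch of the bookkeeping, using the general relationship from [3] for a square input with kernel size k, stride s, and padding p:

    def out_size(i, k, s=1, p=0):
        """Output size o = floor((i + 2p - k) / s) + 1, following [3]."""
        return (i + 2 * p - k) // s + 1

    print(out_size(5, 3))              # 3, as in the numerical example above
    print(out_size(5, 3, s=1, p=1))    # 5, "same" padding preserves the size
    print(out_size(224, 7, s=2, p=3))  # 112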
Translation Equivariance
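The slide's figure is not reproduced here; as a minimal sketch of the property, shifting the input shifts the feature map by the same amount (up to border effects):

    import numpy as np

    def corr2d(x, w):
        k = w.shape[0]
        out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
        for m in range(out.shape[0]):
            for n in range(out.shape[1]):
                out[m, n] = np.sum(x[m:m + k, n:n + k] * w)
        return out

    rng = np.random.default_rng(0)
    x, w = rng.random((6, 6)), rng.random((3, 3))

    shifted = np.roll(x, 1, axis=1)                # shift the input right by one
    print(np.allclose(np.roll(corr2d(x, w), 1, axis=1)[:, 1:],
                      corr2d(shifted, w)[:, 1:]))  # True away from the border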
Pooling Layer
Convolution can detect patterns that are not larger than the kernel size k.
A stack of convolutions can be used to increase this pattern-responsive region, commonly referred to as the receptive field.
In practice, it is more effective to increase the receptive field by subsampling the input, i.e., pooling.
Commonly used pooling layers are average¹ and max pooling.
Input (4×4):
0 4 1 0
1 3 0 1
2 2 0 2
1 3 1 1

Average Pooling (2×2, stride 2):
2 0.5
2 1

Max Pooling (2×2, stride 2):
4 1
3 2
1): One can express average pooling as convolution with a constant weight.
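A sketch reproducing the pooling example above (2×2 windows, stride 2):

    import numpy as np

    x = np.array([[0, 4, 1, 0],
                  [1, 3, 0, 1],
                  [2, 2, 0, 2],
                  [1, 3, 1, 1]], dtype=float)

    def pool2d(x, k, reduce):
        out = np.zeros((x.shape[0] // k, x.shape[1] // k))
        for m in range(out.shape[0]):
            for n in range(out.shape[1]):
                out[m, n] = reduce(x[m * k:(m + 1) * k, n * k:(n + 1) * k])
        return out

    print(pool2d(x, 2, np.mean))  # [[2.  0.5] [2.  1. ]]
    print(pool2d(x, 2, np.max))   # [[4. 1.] [3. 2.]]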
(Local) Translation Invariance of Max Pooling
Input (4×4):
0 4 1 0
1 3 0 1
2 2 0 2
1 3 1 1
Max Pooling → 4 (the top-left output entry)
For this example, the top-left output entry of the max pooling remains the
same if the input is shifted left and down one step.
How the Size of Receptive Field Evolves
Screenshot from [1], https://distill.pub/2019/computing-receptive-fields/. What happens to the receptive field if we change some of the hyper-parameters?
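A sketch of the closed-form computation described in [1]: each layer grows the receptive field by (k − 1) times the product of all earlier strides. The layer parameters below are made up for illustration:

    def receptive_field(layers):
        """layers: (kernel_size, stride) pairs, ordered input -> output."""
        r, jump = 1, 1
        for k, s in layers:
            r += (k - 1) * jump     # growth scaled by the accumulated stride
            jump *= s
        return r

    # e.g. conv3x3 -> conv3x3 -> pool2x2 (stride 2) -> conv3x3:
    print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # 10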
Blueprint of CNNs for Classification
Feature Extractor
Case Study: LeNet-5 [17]
Building on earlier work on CNNs by LeCun [16], which was inspired by the Neocognitron by Fukushima [5].
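A minimal PyTorch sketch of the LeNet-5 layout for 32×32 grayscale inputs; this is a simplification, since the original used trainable subsampling and different non-linearities in places:

    import torch
    import torch.nn as nn

    lenet5 = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28
        nn.AvgPool2d(2),                             # -> 14x14
        nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # -> 10x10
        nn.AvgPool2d(2),                             # -> 5x5
        nn.Flatten(),                                # feature extractor ends
        nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
        nn.Linear(120, 84), nn.Tanh(),
        nn.Linear(84, 10),                           # 10 digit classes
    )
    print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])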
Case Study: AlexNet [13]
Case Study: VGG [20]
Case Study: ResNets [8]
Case Study: ResNets [8] (cont.)
Residual Connections
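The slide's figure is not reproduced here; a minimal sketch of the idea from [8], where the block outputs x + F(x) so that the layers only need to learn the residual F:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Sequential(                  # the residual branch F
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(x + self.f(x))          # identity skip connection

    x = torch.randn(1, 64, 56, 56)
    print(ResidualBlock(64)(x).shape)                # shape is preserved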
What features do CNNs learn? [23]
Shortcoming of CNNs: Vulnerable to Noise in Input
If the input is slightly (adversarially) perturbed, the prediction of CNNs can
dramatically change.
Top: Adversarial noise causes the prediction of a CNN to change from Panda to Gibbon.
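One standard way to craft such a perturbation is the fast gradient sign method; the sketch below uses a stand-in linear model and is not necessarily the attack behind the figure:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in
    x = torch.rand(1, 3, 32, 32, requires_grad=True)  # "image" in [0, 1]
    y = torch.tensor([0])                             # true label

    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                                   # gradient w.r.t. the input

    eps = 0.01                                        # perturbation budget
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1)     # small but adversarial
    print((x_adv - x).abs().max().item())             # <= eps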
Shortcoming of CNNs: Texture Bias [6]
Unlike humans, who often rely on shape information in visual processing [14], CNNs rely heavily on textural features.
Applications of CNNs
Apart from image classification, CNNs are often used as feature extractors in image-based learning tasks, and the concept of CNNs can be generalized to inputs from other modalities:
▶ Different Tasks: Object Detection, Image Segmentation, Image Captioning, ...
▶ Other Modalities: Text Classification, Text-to-Speech, ...
Object Detection: You Only Look Once (YOLO) [18]
Image Segmentation: U-net [19]
Key concepts: the contracting path (increase "what", reduce "where") and the expanding path (precise localization).
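A toy sketch of these two paths, much smaller than the real U-net [19]: a contracting path that trades resolution for semantics, an expanding path that recovers resolution, and a skip connection that concatenates high-resolution features back in for precise localization:

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        def __init__(self, in_ch=1, out_ch=2):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool2d(2)                       # reduce "where"
            self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2) # recover resolution
            self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, out_ch, 1))

        def forward(self, x):
            e = self.enc(x)                                   # high-res features
            m = self.mid(self.down(e))                        # low-res "what"
            u = self.up(m)
            return self.dec(torch.cat([u, e], dim=1))         # skip connection

    print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)        # [1, 2, 64, 64]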
Image Segmentation: U-net [19] (cont.)
Image Captioning [21, 10, 22]
Convolution and Pooling for Text Classification [11]
sentence-classification-using-convolutional-neural-networks/
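A minimal sketch in the spirit of [11], with all sizes assumed: word embeddings, 1-D convolutions of several widths, and max-over-time pooling feeding a classifier:

    import torch
    import torch.nn as nn

    vocab, emb_dim, n_classes = 1000, 50, 2     # assumed sizes
    embed = nn.Embedding(vocab, emb_dim)
    convs = nn.ModuleList([nn.Conv1d(emb_dim, 100, k) for k in (3, 4, 5)])
    classifier = nn.Linear(3 * 100, n_classes)

    tokens = torch.randint(0, vocab, (1, 20))   # a 20-token "sentence"
    e = embed(tokens).transpose(1, 2)           # (batch, emb_dim, seq_len)
    feats = [conv(e).relu().max(dim=2).values for conv in convs]  # max over time
    print(classifier(torch.cat(feats, dim=1)).shape)              # [1, 2]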
Transformers: Attention Mechanism
▶ Idea: find global interactions within an input sequence or between two input sequences.
▶ In our case: K = V = x ∈ R^T and Q = y ∈ R^(T′).
▶ "Query with target y_t′ all source keys x_t on how much attention to pay to source value x_t at each target timestep t′."
Transformers: Attention Mechanism
▶ The original attention mechanism in Bahdanau et al. used an MLP as a scalar projector.
▶ QK^⊤ is a dot product, so we can use arbitrary feature dimensions: x ∈ R^(T×F) and y ∈ R^(T′×F), giving QK^⊤ ∈ R^(T′×T).
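A minimal sketch of dot-product attention with these shapes; the dimensions are assumed, and the 1/√d scaling follows the Transformer convention:

    import torch
    import torch.nn.functional as F

    T, T_prime, d = 5, 3, 8              # assumed source/target lengths, width
    x = torch.randn(T, d)                # source sequence
    y = torch.randn(T_prime, d)          # target sequence

    Q, K, V = y, x, x                    # K = V = x and Q = y, as above
    scores = Q @ K.T / d ** 0.5          # (T', T): one row per target step
    weights = F.softmax(scores, dim=-1)  # attention paid to each source step
    out = weights @ V                    # (T', d): weighted source values
    print(weights.shape, out.shape)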
Summary
Bibliography I
[1] A. Araujo, W. Norris, and J. Sim.
Computing receptive fields of convolutional neural networks.
Distill, 2019.
https://distill.pub/2019/computing-receptive-fields.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.
ImageNet: A Large-Scale Hierarchical Image Database.
In CVPR09, 2009.
[3] V. Dumoulin and F. Visin.
A guide to convolution arithmetic for deep learning.
arXiv preprint arXiv:1603.07285, 2016.
[4] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song.
Robust physical-world attacks on deep learning visual classification.
In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1625–1634. Computer Vision Foundation / IEEE Computer Society, 2018.
[5] K. Fukushima.
A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biol. Cybern., 36:193–202, 1980.
Bibliography II
[8] K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[9] S. Hooker.
The hardware lottery.
Communications of the ACM, 64(12):58–65, 2021.
[10] A. Karpathy and L. Fei-Fei.
Deep visual-semantic alignments for generating image descriptions.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
[11] Y. Kim.
Convolutional neural networks for sentence classification.
In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751. ACL, 2014.
Bibliography III
[15] Y. LeCun, Y. Bengio, and G. Hinton.
Deep learning.
Nature, 521(7553):436–444, 2015.
Bibliography IV