
WiSe 2023/24

Deep Learning 1

Lecture 8: Convolutional Neural Networks


Machine Learning for Computer Vision

▶ Traditional approach (before 2012): handcrafted features

Image → Handcrafted Feature Extractor → Trainable Classifier

e.g., SIFT, HoG

▶ Deep Learning (2012-...): end-to-end hierarchical feature learning

Image → Low-level Features → Mid-level Features → High-level Features → Trainable Classifier

learning end-to-end

Figures are adapted from Canziani and LeCun, 2021.

1/42
ImageNet [2] Benchmark
Task: 1000-class classification (∼3M images)

ImageNet Classification Top-5 Error (%). Gordon Cooper, 2019

Remarks: 1): Another key ingredient that helped facilitate the progress on the ImageNet task is the utilization of GPUs in training large CNNs [9]; 2): For more up-to-date results, see https://paperswithcode.com/sota/image-classification-on-imagenet.

2/42
Recap: Multi-layer perceptrons

A sequence of affine and thresholding transformations:

z^(1) = W^(1) x + b^(1)
a^(1) = σ(z^(1))
⋮
a^(j) = σ(W^(j) a^(j−1) + b^(j))
⋮
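
As an illustration (not part of the original slides), a minimal NumPy sketch of this forward pass; the logistic sigmoid for σ and the layer sizes are hypothetical choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    # each layer j computes a^(j) = sigmoid(W^(j) a^(j-1) + b^(j))
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# hypothetical sizes: 3072 -> 128 -> 10
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((128, 3072)), 0.01 * rng.standard_normal((10, 128))]
biases = [np.zeros(128), np.zeros(10)]
print(mlp_forward(rng.standard_normal(3072), weights, biases).shape)  # (10,)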

3/42
Could we use MLPs for Images?

One might flatten images to be (very) tall vectors and feed these vectors as
inputs to MLPs.

Image → flatten → MLP

For example, a 3-color 32 × 32 image (e.g., R^(32×32×3)) flattens to a tall vector in R^3072.


Issues
▶ space and algorithmic complexity
▶ statistical (learning) inefficiency: We do not exploit correlations of
neighbouring pixels.

4/42
Could we use MLPs for Images? (cont.)

Issue 1: Space and Algorithmic Complexity

Let m be the number of input dimensions (after flattening), e.g., m = 3072, and n be the number of neurons in the first layer of an MLP.

Space Complexity: We need mn + n parameters.


→ mn for weights W^(1) and n for biases b^(1).

Algorithmic Complexity: O(mn)
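
A quick sanity check of this count; the layer width n = 1000 is a hypothetical choice:

m, n = 3072, 1000          # input dimensions after flattening, neurons in the first layer
num_params = m * n + n     # mn weights + n biases
print(num_params)          # 3073000 parameters for a single fully connected layer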

5/42
Could we use MLPs for Images? (cont.)
Issue 2: Statistical Inefficiency
After flattening, we do not exploit local relationships between neighbouring
pixels.

For example, pixels in the regions of car wheels often correlate, and this
correlation might be useful for some learning tasks.

The images are from the Stanford Car dataset [12]; the annotation boxes are annotated by Pat.

6/42
Overview of CNNs
CNNs = learning hierarchical features (from low- to high-level features)
using convolution (learning correlations between neighbouring pixels)
and pooling layers (enlarging the size of neighbourhoods)1

Remarks: 1): In practice, there are other components in CNNs that help increase the performance of the models, but convolution
and pooling layers are the two most important ingredients; 2): Figure is taken from LeCun et al., 2015 [15].

7/42
Convolution Operator

Let x(t) ∈ R be a one-dimensional signal at time t ∈ R and w(τ) be a
weighting function. The convolution operator ∗ is

(x ∗ w)(t) = ∫ x(τ) w(t − τ) dτ                (continuous setting)
           ≈ Σ_{τ=−∞}^{∞} x(τ) w(t − τ)        (discrete setting)

We refer to
▶ x as the input,
▶ w as the kernel,
▶ and (x ∗ w)(t) as the feature map, denoted with a(t).
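
A minimal NumPy sketch of the discrete case, assuming a finite kernel so the infinite sum reduces to the kernel's support (np.convolve computes the same full convolution):

import numpy as np

def conv1d(x, w):
    # (x * w)(t) = sum_tau x(tau) * w(t - tau); "full" output of length len(x) + len(w) - 1
    T, K = len(x), len(w)
    out = np.zeros(T + K - 1)
    for t in range(T + K - 1):
        for tau in range(T):
            if 0 <= t - tau < K:
                out[t] += x[tau] * w[t - tau]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])
assert np.allclose(conv1d(x, w), np.convolve(x, w))  # matches NumPy's convolution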

8/42
Convolution Operator (cont.)

We can see that the convolution operator is commutative:

(x ∗ w)(t) = Σ_τ x(τ) w(t − τ)
           = Σ_{τ′} x(t − τ′) w(τ′)        (define τ′ := t − τ)
           = (w ∗ x)(t)

In deep learning, we instead use a related operator called cross-correlation1,
denoted with ⋆ (NOT the asterisk ∗):

(x ⋆ w)(t) = Σ_{τ′} x(t + τ′) w(τ′).

Remarks: 1): This operator is what deep learning frameworks actually implement in their convolution layers; we therefore stick with the name convolution and use the symbol ⋆ to indicate the actual operator; 2): The range of τ′ will be made clear in the following slides.
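
A short NumPy check of these identities with an arbitrary example signal and kernel (np.correlate computes the cross-correlation; flipping the kernel turns it back into a convolution):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.1, 0.3, 0.6])

# convolution is commutative: x * w == w * x
assert np.allclose(np.convolve(x, w), np.convolve(w, x))

# the cross-correlation used by deep learning "convolution" layers equals
# convolution with a flipped kernel
assert np.allclose(np.correlate(x, w, mode="valid"),
                   np.convolve(x, w[::-1], mode="valid"))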

9/42
Figure from Wikipedia: Convolution.

10/42
Two-Dimensional Discrete Convolution

A two-dimensional discrete convolution layer consists of

▶ weight (W ∈ R^(dout×din×k×k), or kernel), whose dimensions are governed by
▶ number of input channels din, e.g., din = 3 for a 3-color image
▶ number of output channels dout, chosen by the user
▶ kernel size k, chosen by the user
▶ bias (b ∈ R, optional)

Weight W (k = 3, din = 1, dout = 1)

Discrete convolution using W (no bias) on an input (din = 1).
Figures are adapted from [3].
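
A short PyTorch sketch of these shapes; the specific channel counts, kernel size, and input size are arbitrary choices for illustration:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=True)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) -> (dout, din, k, k)
print(conv.bias.shape)    # torch.Size([16]) -> one bias per output channel (b ∈ R when dout = 1)
x = torch.randn(1, 3, 32, 32)           # a single 3-channel 32x32 image
print(conv(x).shape)      # torch.Size([1, 16, 30, 30]) with stride 1, no padding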

11/42
Numerical Example

Input (5 × 5):
3 3 2 1 0
0 0 1 3 1
3 1 2 2 3
2 0 0 2 2
2 0 0 0 1

Weight W (k = 3):
0 1 2
2 2 0
0 1 2

Output (3 × 3), stride 1, no padding:
12 12 17
10 17 19
 9  6 14

In the original figure, the weight is shown applied at two locations of the input,
with the corresponding (shaded) output entries highlighted.
Figures are from [3].
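
A small sketch that reproduces the output above (valid cross-correlation with stride 1 and no padding), as a check:

import numpy as np

x = np.array([[3, 3, 2, 1, 0],
              [0, 0, 1, 3, 1],
              [3, 1, 2, 2, 3],
              [2, 0, 0, 2, 2],
              [2, 0, 0, 0, 1]], dtype=float)
W = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]], dtype=float)

k = W.shape[0]
out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
for m in range(out.shape[0]):
    for n in range(out.shape[1]):
        out[m, n] = np.sum(x[m:m + k, n:n + k] * W)  # element-wise product over the window
print(out)  # [[12. 12. 17.] [10. 17. 19.] [ 9.  6. 14.]]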

12/42
Stride and Padding

Apart from the parameters (W, b), discrete convolution also has two important hyperparameters, namely

▶ Stride (the amount by which the kernel is translated)

Left: stride = 1 (previous example), Right: stride = 2

▶ Padding: How do we handle regions at the boundary of the input?

Padding with sizes 1 and 2, respectively.
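
A brief PyTorch illustration of how stride and padding change the output size; the 5 × 5 input is a hypothetical choice:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)(x).shape)  # [1, 1, 3, 3]
print(nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=0)(x).shape)  # [1, 1, 2, 2]
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)(x).shape)  # [1, 1, 5, 5]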

13/42
Discrete Convolution when dout > 1

So far, we have mainly discussed the case dout = 1. When dout > 1, we repeat
the process dout times and concatenate the outputs together.

Example when dout = 2 (stride = 1 and no padding)

14/42
Two-Dimensional Discrete Convolution

Let a ∈ R^(din×hin×win) be an input and W ∈ R^(dout×din×k×k) be a convolution
weight. Two-dimensional discrete convolution can be expressed as

z_{c′,m,n} = (a ⋆ W)_{c′,m,n} = Σ_{c=1}^{din} Σ_{τ1=1}^{k} Σ_{τ2=1}^{k} a_{c, m+(τ1−1), n+(τ2−1)} W_{c′,c,τ1,τ2},

where ∀c′ ∈ {1, . . . , dout}, ∀m ∈ {1, . . . , hout}, and ∀n ∈ {1, . . . , wout}.

The exact value of hout and wout depends on the kernel size, stride, and
padding.
Suppose we have a square input (hin = win) and use stride = 1 and no padding.
Then we have

hout = wout = hin − k + 1        (Relationship 2 in [3])

See [3] for the relationship between input and output size in other settings.
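
As a sketch, the general relationship derived in [3] can be written as a small helper; with stride = 1 and no padding it reduces to hin − k + 1:

def conv_output_size(h_in, k, stride=1, padding=0):
    # general relationship from [3]: floor((h_in + 2*padding - k) / stride) + 1
    return (h_in + 2 * padding - k) // stride + 1

assert conv_output_size(5, k=3) == 5 - 3 + 1       # Relationship 2 in [3]
assert conv_output_size(5, k=3, stride=2) == 2
assert conv_output_size(5, k=3, padding=1) == 5    # "same" size with stride 1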

15/42
Translation Equivariance

Credit: Christian Wolf

16/42
Pooling Layer
Convolution can detect patterns that are not larger than the kernel size k.
A stack of convolutions can be used to increase this pattern-responding region,
commonly referred to as the receptive field.
Practically, it is more effective to increase the receptive field by subsampling
the input, i.e., pooling.
Commonly used pooling layers are average1 and max pooling.

Input (4 × 4):
0 4 1 0
1 3 0 1
2 2 0 2
1 3 1 1

Average Pooling →
2   0.5
2   1

Max Pooling →
4 1
3 2

Average and max pooling with k = stride = 2.

1): One can express the average pooling using convolution with constant weight
W_{τ1,τ2} = 1/k² and no bias.
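
A minimal NumPy sketch of both pooling operations on the 4 × 4 example above (k = stride = 2):

import numpy as np

x = np.array([[0, 4, 1, 0],
              [1, 3, 0, 1],
              [2, 2, 0, 2],
              [1, 3, 1, 1]], dtype=float)

def pool2d(x, k, reduce_fn):
    h, w = x.shape[0] // k, x.shape[1] // k
    out = np.zeros((h, w))
    for m in range(h):
        for n in range(w):
            out[m, n] = reduce_fn(x[m * k:(m + 1) * k, n * k:(n + 1) * k])
    return out

print(pool2d(x, 2, np.mean))  # [[2.  0.5] [2.  1. ]]
print(pool2d(x, 2, np.max))   # [[4. 1.] [3. 2.]]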

17/42
(Local) Translation Invariance of Max Pooling

Input (4 × 4):
0 4 1 0
1 3 0 1
2 2 0 2
1 3 1 1

Max Pooling → top-left output entry: 4

For this example, the top-left output entry of the max pooling remains the
same if the input is shifted left and down one step.
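
A short self-contained check of this local invariance (the shift fills the vacated border with zeros, an assumption for illustration):

import numpy as np

x = np.array([[0, 4, 1, 0],
              [1, 3, 0, 1],
              [2, 2, 0, 2],
              [1, 3, 1, 1]], dtype=float)

def max_pool_topleft(x, k=2):
    return x[:k, :k].max()

# shift the input one pixel to the left and one pixel down (zero-filled borders)
shifted = np.zeros_like(x)
shifted[1:, :-1] = x[:-1, 1:]
print(max_pool_topleft(x), max_pool_topleft(shifted))  # 4.0 4.0 -> unchanged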

18/42
How the Size of Receptive Field Evolves

Screenshot from [1]. What happens to the receptive field if we change some of
these parameters? Try it at
https://distill.pub/2019/computing-receptive-fields/.
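
A minimal sketch of the recursive receptive-field computation described in [1], assuming a plain chain of layers (r: receptive field size, j: cumulative stride between feature-map positions):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, applied in order
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j   # each layer grows the field by (k - 1) * cumulative stride
        j = j * s
    return r

# e.g., two 3x3 convs with stride 1 followed by 2x2 max pooling with stride 2
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # 6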

19/42
Blueprint of CNNs for Classification

Feature Extractor: [Conv. → Act. → Pooling] × L
Classifier: Global Average Pooling → MLP
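
A compact PyTorch sketch of this blueprint; the channel sizes, L = 3 blocks, and 10 output classes are hypothetical choices:

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # one [Conv -> Activation -> Pooling] block of the feature extractor
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                         nn.ReLU(),
                         nn.MaxPool2d(kernel_size=2))

model = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64), conv_block(64, 128),  # L = 3 blocks
    nn.AdaptiveAvgPool2d(1),   # global average pooling
    nn.Flatten(),
    nn.Linear(128, 10),        # MLP classifier (a single layer here)
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])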

20/42
Case Study: LeNet-5 [17]

Key Contribution: Pioneering work on using modern CNNs for handwritten
character recognition with gradient-based learning.

Building on earlier work on CNNs by LeCun [16], which was inspired by the Neocognitron by Fukushima [5].

21/42
Case Study: AlexNet [13]

Key Contributions: First CNN winning ImageNet Challenge (2012); Making


use of GPUs in training

22/42
Case Study: VGG [20]

Key Contributions: Winner⋆ of ImageNet Challenge (2014); demonstrating
the benefit of depth.

⋆: 1st and 2nd places on the localization and classification tracks.

23/42
Case Study: ResNets [8]

Key Contributions: Winner of ImageNet Challenge (2015) and many other
challenges; introducing the residual connection, which allows training CNNs with
many more layers (e.g., 8× deeper than VGG).

24/42
Case Study: ResNets [8] (cont.)

25/42
Residual Connections

▶ Core idea of ResNet by He et al.

▶ For layer l and intermediate representation x_l with layers NN_l:

x_{l+1} = x_l + NN_l(x_l; θ_l)
x_{l+2} = x_{l+1} + NN_{l+1}(x_{l+1}; θ_{l+1}) = x_l + NN_l(x_l; θ_l) + NN_{l+1}(x_{l+1}; θ_{l+1})
...

▶ Better gradient flow by 'shortcutting' over a high number of intermediate
layers between the loss and layer NN_l; terms that do not depend on the
differentiated parameters vanish:

∂_{θ_l} x_{l+1} = ∂_{θ_l} NN_l(x_l; θ_l)
∂_{θ_{l+1}} x_{l+2} = ∂_{θ_{l+1}} NN_{l+1}(x_{l+1}; θ_{l+1})
...
...
▶ Allows training far deeper networks with more parameters, which results
in better performance
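
A minimal PyTorch sketch of one residual block following this idea (a simplified variant of the blocks in [8]: no batch normalization, and the shortcut assumes matching shapes; channel counts are illustrative):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # NN_l: two 3x3 convolutions; padding=1 keeps the spatial size so x + NN_l(x) is valid
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # x_{l+1} = x_l + NN_l(x_l; θ_l)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])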

26/42
What features do CNNs learn? [23]

27/42
Shortcoming of CNNs: Vulnerable to Noise in Input
If the input is slightly (adversarially) perturbed, the prediction of CNNs can
dramatically change.

Top: Adversarial noise causes the prediction of a CNN to change from Panda to
Gibbon; Bottom: A physical adversarial sticker causes stop signs to be detected as a
target speed limit sign. Figures are from [7, 4], respectively.

28/42
Shortcoming of CNNs: Texture Bias [6]

Unlike humans, who often rely on shape information in visual processing [14],
CNNs rely heavily on textural features.

29/42
Applications of CNNs

Apart from image classification, CNNs are often used as feature extractors
in image-based learning tasks, and the concept of CNNs can be generalized
to inputs from other modalities.
▶ Different Tasks: Object Detection, Image Segmentation, Image
Captioning, ...
▶ Other Modalities: Text Classification, Text-to-Speech, ...

30/42
Object Detection: You Only Look Once (YOLO) [18]

The architecture image is from https://dinghow.site/2019/08/24/object-detection-part1.

31/42
Image Segmentation: U-net [19]

Key Contributions: Fast architecture for precise (biomedical) image segmentation.

Key concepts: contracting path (increase what, reduce where) and expanding path (precise localization).

32/42
Image Segmentation: U-net [19] (cont.)

33/42
Image Captioning [21, 10, 22]

Figure from [22].

34/42
Convolution and Pooling for Text Classification [11]

Figure is from https://indiantechwarrior.com/sentence-classification-using-convolutional-neural-networks/

35/42
Transformers: Attention Mechanism
▶ Idea: find global interactions within an
input sequence or between two input
sequences.
▶ In our case: K = V = x ∈ R^T and
Q = y ∈ R^(T′)
▶ "Query with target y_{t′} all source keys x_t
on how much attention to pay to source
value x_t at each target timestep t′"

▶ Compute the pairwise product
QK^T ∈ R^(T′×T) to obtain similarities

▶ Normalize over the source dimension with a
softmax probability

▶ Scale the source values V = x:

Attention(Q, K, V) = Softmax_T(QK^T) V,  where Softmax_T(QK^T) ∈ R^(T′×T)
(1)
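
A minimal NumPy sketch of Eq. (1): single head, no scaling factor, matching the notation on this slide; the sequence lengths and feature dimension are hypothetical:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (T', F), K: (T, F), V: (T, F); attention weights have shape (T', T)
    weights = softmax(Q @ K.T, axis=-1)   # normalize over the source dimension T
    return weights @ V

T_src, T_tgt, F = 5, 3, 4
x = np.random.randn(T_src, F)    # source sequence: keys and values
y = np.random.randn(T_tgt, F)    # target sequence: queries
print(attention(y, x, x).shape)  # (3, 4) -> one attended source summary per target step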

36/42
Transformers: Attention Mechanism
▶ The original attention mechanism in Bahdanau et al. used an MLP as a scalar
projector
▶ QK^T is a dot product, so we can use arbitrary dimensions:
x ∈ R^(T×F) and y ∈ R^(T′×F), giving QK^T ∈ R^(T′×T)

Figure: "Neural Machine Translation by Jointly Learning to Align and Translate"


by Bahndanau et al.

37/42
Summary

▶ CNNs exploit the local structure of (2D) signals and learn hierarchical
representations for a given task.
▶ CNNs share parameters between spatial locations, and they are thus
suitable for learning from signals where features can potentially appear at any
location.
▶ Main ingredients of CNNs: convolution and pooling layers.
▶ CNNs are widely used in many applications and domains (beyond image
data).
▶ CNNs enforce local interactions in the first layers. For some tasks, it is
beneficial to capture global interactions (this can be achieved with
attention layers).

38/42
Bibliography I
[1] A. Araujo, W. Norris, and J. Sim.
Computing receptive fields of convolutional neural networks.
Distill, 2019.
https://distill.pub/2019/computing-receptive-fields.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.
ImageNet: A Large-Scale Hierarchical Image Database.
In CVPR09, 2009.
[3] V. Dumoulin and F. Visin.
A guide to convolution arithmetic for deep learning.
arXiv preprint arXiv:1603.07285, 2016.

[4] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song.
Robust physical-world attacks on deep learning visual classification.
In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City,
UT, USA, June 18-22, 2018, pages 1625-1634. Computer Vision Foundation / IEEE Computer Society,
2018.
[5] K. Fukushima.
A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in
position.
Biological Cybernetics, 36:193-202, 1980.

[6] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel.


Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and
robustness.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May
6-9, 2019. OpenReview.net, 2019.

[7] I. J. Goodfellow, J. Shlens, and C. Szegedy.


Explaining and harnessing adversarial examples.
In Y. Bengio and Y. LeCun, editors, 3rd International
Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

39/42
Bibliography II
[8] K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[9] S. Hooker.
The hardware lottery.
Communications of the ACM, 64(12):58-65, 2021.
[10] A. Karpathy and L. Fei-Fei.
Deep visual-semantic alignments for generating image descriptions.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128-3137,
2015.
[11] Y. Kim.
Convolutional neural networks for sentence classification.
In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting
of SIGDAT, a Special Interest Group of the ACL, pages 1746-1751. ACL, 2014.

[12] J. Krause, M. Stark, J. Deng, and L. Fei-Fei.
3d object representations for fine-grained categorization.
In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney,
Australia, 2013.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural networks.
In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances
in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing
Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States,
pages 1106-1114, 2012.
[14] B. Landau, L. B. Smith, and S. S. Jones.
The importance of shape in early lexical learning.
Cognitive development, 3(3):299-321, 1988.

40/42
Bibliography III
[15] Y. LeCun, Y. Bengio, and G. Hinton.
Deep learning.
Nature, 521(7553):436-444, 2015.

[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel.
Backpropagation applied to handwritten zip code recognition.
Neural computation, 1(4):541-551, 1989.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278-2324, 1998.

[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi.
You only look once: Unified, real-time object detection.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779-788,
2016.
[19] O. Ronneberger, P. Fischer, and T. Brox.
U-net: Convolutional networks for biomedical image segmentation.
In International Conference on Medical image computing and computer-assisted intervention, pages
234-241. Springer, 2015.
[20] K. Simonyan and A. Zisserman.
Very deep convolutional networks for large-scale image recognition.
In Y. Bengio and Y. LeCun, editors, 3rd International Conference on
Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[21] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
Show and tell: A neural image caption generator.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156-3164,
2015.
[22] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio.
Show, attend and tell: Neural image caption generation with visual attention.
In International conference on machine learning, pages 2048-2057. PMLR, 2015.

41/42
Bibliography IV

[23] M. D. Zeiler and R. Fergus.
Visualizing and understanding convolutional networks.
In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th
European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, volume 8689 of
Lecture Notes in Computer Science, pages 818-833. Springer, 2014.

42/42
