
Lecture 13

Introduction to Convolutional Neural Networks, Part 2

STAT 453: Deep Learning, Spring 2020
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/

https://github.com/rasbt/stat453-deep-learning-ss20/tree/master/L13-cnns-part2
Quiz

Lecture Overview

1. Padding (control output size in addition to stride)
2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
A. VGG16 (simple, deep CNN)
B. ResNet and skip connections
C. Fully convolutional networks (no fully connected layers)
D. Inception (parallel convolutions and auxiliary losses)
5. Transfer learning
Padding

$o = \lfloor (i + 2p - k) / s \rfloor + 1$

where $o$ is the output size, $i$ the input size, $p$ the number of padding pixels per side, $k$ the kernel size, $s$ the stride, and $\lfloor \cdot \rfloor$ the "floor" function.
Examples:

No padding, stride=1:  $(4 - 3 + 2 \cdot 0)/1 + 1 = 2$  (4x4 input, 2x2 output)
padding=2, stride=1:   $(5 - 4 + 2 \cdot 2)/1 + 1 = 6$  (5x5 input, 6x6 output)
No padding, stride=2:  $(5 - 3 + 2 \cdot 0)/2 + 1 = 2$  (5x5 input, 2x2 output)

Highly recommended:
Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).
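As a quick sanity check, here is a minimal sketch (not from the slides) comparing the formula against the output shapes nn.Conv2d actually produces for the three examples above:

import torch
import torch.nn as nn

def conv_out_size(i, k, p, s):
    # o = floor((i + 2p - k) / s) + 1
    return (i + 2 * p - k) // s + 1

for i, k, p, s in [(4, 3, 0, 1), (5, 4, 2, 1), (5, 3, 0, 2)]:
    conv = nn.Conv2d(1, 1, kernel_size=k, padding=p, stride=s)
    o = conv(torch.randn(1, 1, i, i)).shape[-1]
    print(o == conv_out_size(i, k, p, s))  # True for all three examples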
Padding Jargon

"valid" convolution: no padding (feature map may shrink)

"same" convolution: padding such that the output size is equal to the input size

Common kernel size conventions: 3x3, 5x5, 7x7 (sometimes 1x1 in later layers to reduce channels)
Padding

$o = \lfloor (i + 2p - k) / s \rfloor + 1$

Assume you want to use a convolutional operation with stride 1 and maintain the input dimensions in the output feature map. How much padding do you need for "same" convolution?

With $s = 1$:  $o = i + 2p - k + 1$
$\Leftrightarrow p = (o - i + k - 1)/2$
$\Leftrightarrow p = (k - 1)/2$  (setting $o = i$)
Padding

$o = i + 2p - k + 1 \;\Leftrightarrow\; p = (o - i + k - 1)/2 \;\Leftrightarrow\; p = (k - 1)/2$ for $o = i$

This probably explains the common kernel size conventions: $p = (k - 1)/2$ is an integer only for odd kernel sizes, hence 3x3, 5x5, 7x7 (sometimes 1x1 in later layers to reduce channels).
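A minimal sketch (my own, with assumed channel counts) showing that padding = (k - 1)/2 preserves the spatial size for odd kernels:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
for k in (3, 5, 7):
    # "same" convolution for stride 1: p = (k - 1) / 2
    conv = nn.Conv2d(3, 8, kernel_size=k, stride=1, padding=(k - 1) // 2)
    print(k, conv(x).shape[-2:])  # spatial size stays 32x32 for each k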
Spatial Dropout -- Dropout2D

• Problem with regular dropout and CNNs: adjacent pixels are likely highly correlated (thus, dropping individual pixels may not reduce the "dependency" between units as much as originally intended by dropout)

• Hence, it may be better to drop entire feature maps

Idea comes from:
Tompson, Jonathan, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. "Efficient object localization using convolutional networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648-656. 2015.
Spatial Dropout -- Dropout2D

• Dropout2d will drop full feature maps (channels)
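A minimal sketch (not from the slides) illustrating that Dropout2d zeroes whole channels rather than individual pixels:

import torch
import torch.nn as nn

torch.manual_seed(1)
drop = nn.Dropout2d(p=0.5)      # drops each channel with probability 0.5
x = torch.ones(1, 4, 3, 3)      # [N, C, H, W]: four feature maps of ones
out = drop(x)

# In training mode, each channel is either all zeros (dropped) or scaled
# by 1/(1 - p) = 2; all pixels within a channel share the same fate.
print(out[0].sum(dim=(1, 2)))   # e.g., tensor([18., 0., 0., 18.])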
BatchNorm 2D

Source: https://pytorch.org/docs/stable/nn.html
BatchNorm 2D

BatchNorm1d: inputs are rank-2 tensors, [N, num_features]

BatchNorm2d: inputs are rank-4 tensors, [N, C, H, W]

In BatchNorm2d, the mean and standard deviation are computed over the N*H*W values of each channel, i.e., one mean and standard deviation per channel.

Source: https://pytorch.org/docs/stable/nn.html
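A minimal sketch (assuming the default affine initialization, weight 1 and bias 0) verifying which axes BatchNorm2d averages over:

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                  # [N, C, H, W]
bn = nn.BatchNorm2d(num_features=3)
out = bn(x)

# One mean/std per channel, each computed from N*H*W = 8*32*32 values:
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)
print(torch.allclose(out, manual, atol=1e-5))  # True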
Computing Convolutions on the GPU

• There are many different approaches to compute (approximate) convolution operations

• DL libraries usually use NVIDIA's CUDA & cuDNN libraries, which implement many different convolution algorithms

• These algorithms are usually more efficient than the CPU variants (convolutions on the CPU usually take up much more memory due to the algorithm choice, compared to using the GPU)

If you are interested, you can find more info in:
Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Lavin_Fast_Algorithms_for_CVPR_2016_paper.pdf
Computing Convolutions on the GPU

• cuDNN is geared more towards engineers & speed than towards scientists, and is unfortunately not deterministic/reproducible by default

• I.e., it automatically determines which convolution algorithm to use at run time, based on predicted speeds given the data flow

• For reproducibility and consistent results, I recommend setting the deterministic flag (speed is about the same, often even a bit faster, sometimes a bit slower)

import torch

if torch.cuda.is_available():
    # restrict cuDNN to deterministic convolution algorithms
    torch.backends.cudnn.deterministic = True
Common Architectures Revisited

We will discuss some additional common CNN architectures, since the field has evolved quite a bit since 2012 ...

[Figure 2 from the reference below: Top-1 one-crop accuracy versus the amount of operations required for a single forward pass; the size of the blobs is proportional to the number of network parameters.]

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
VGG-16

PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/vgg16.ipynb

[Table 1 from the paper: ConvNet configurations A-E (11 to 19 weight layers), shown in columns; depth increases from left (A) to right (E). Convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>"; the ReLU activation function is not shown for brevity. VGG-16 is configuration D: blocks of conv3-64 (x2), conv3-128 (x2), conv3-256 (x3), conv3-512 (x3), conv3-512 (x3), each followed by maxpool, then FC-4096, FC-4096, FC-1000, and soft-max. Table 2: the number of parameters ranges from 133 million (A) to 144 million (E); configuration D has 138 million.]

Advantages: very simple architecture; 3x3 convs, stride=1, "same" padding, 2x2 max pooling

Disadvantage: very large number of parameters, and slow (see previous slide)

From the paper: a stack of three 3x3 conv. layers has a 7x7 effective receptive field. So what have we gained by using, for instance, a stack of three 3x3 conv. layers instead of a single 7x7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters.

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
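A quick back-of-the-envelope check of the paper's parameter argument (a sketch; C is an assumed channel count, biases ignored):

# Three stacked 3x3 conv layers vs. one 7x7 conv layer, both with C input
# and C output channels and the same 7x7 effective receptive field:
C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2 = 1,769,472
one_7x7 = 7 * 7 * C * C           # 49 * C^2 = 3,211,264
print(three_3x3 / one_7x7)        # ~0.55, i.e., ~45% fewer parameters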
VGG-16

PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/vgg16.ipynb

[Visualization of the VGG-16 architecture from https://www.cs.toronto.edu/~frossard/post/vgg16/]

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
ResNets

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.

With their simple trick of allowing skip connections (the possibility to learn identity functions and skip layers that are not useful), ResNets allow us to implement very, very deep architectures.

[Figure 2 from the paper: residual learning, a building block. The input x passes through two weight layers (with a ReLU in between) computing F(x), while an identity shortcut carries x around them; the block outputs relu(F(x) + x).]
ResNets

[Figure 3 from the paper: VGG-19, a 34-layer plain network, and the corresponding 34-layer residual network side by side. The residual network inserts a shortcut connection around every pair of 3x3 conv layers. Identity shortcuts can be used directly when input and output are of the same dimensions; dotted shortcuts increase dimensions, either via identity mapping with extra zero padding (option A, parameter-free) or via 1x1 projection convolutions (option B).]

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
ResNets

[Residual block: x → Convolution → BatchNorm → ReLU → Convolution → BatchNorm → (+) → ReLU, with a shortcut carrying x directly to the addition.]

In general: $a^{(l+2)} = z^{(l+2)} + a^{(l)}$
ResNets

[Residual block: x → Convolution → BatchNorm → ReLU → Convolution → BatchNorm → (+) → ReLU, with a shortcut carrying x directly to the addition.]

$a^{(l+2)} = z^{(l+2)} + a^{(l)} = a^{(l+1)} W^{(l+2)} + b^{(l+2)} + a^{(l)}$

If all weights and the bias are zero, then $a^{(l+2)} = a^{(l)}$ (identity function); the final ReLU changes nothing here because $a^{(l)}$, as a ReLU output, is already non-negative.
ResNets

[Residual block: x → Convolution → BatchNorm → ReLU → Convolution → BatchNorm → (+) → ReLU, with a shortcut carrying x directly to the addition.]

$a^{(l+2)} = z^{(l+2)} + a^{(l)}$

We assume these have the same dimension (e.g., via "same" convolution).
ResNets

[Alternative residual block: as above, but with a Convolution → BatchNorm branch on the shortcut as well.]

Alternative residual blocks use skip connections such that the input passed via the shortcut is resized to the dimensions of the main path's output.
ResNet Block Implementation

PyTorch implementations of the previous slides:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-blocks.ipynb
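A minimal sketch of the first block type (my own condensed version with assumed layer sizes; see the linked notebook for the instructor's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Conv-BN-ReLU-Conv-BN, then the shortcut is added before the final
    ReLU; "same" convolutions keep input and output dimensions equal."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        shortcut = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + shortcut)   # skip connection

x = torch.randn(2, 64, 28, 28)
print(ResidualBlock(64)(x).shape)       # torch.Size([2, 64, 28, 28])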
ResNet-34 and ResNet-152

PyTorch implementations:

https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-34.ipynb

https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-152.ipynb

(Can be substantially improved with more hyperparameter tuning)
Side Note: It is Possible to Replace Fully Connected Layers by Convolutional Layers

Remember, convolutions also involve dot products between the receptive fields and kernels.

Fully connected layer on a 4-dimensional input $\mathbf{x}$:

$\mathbf{w}_1^T \mathbf{x} + b_1$
$\mathbf{w}_2^T \mathbf{x} + b_2$

Equivalent convolution, reshaping $\mathbf{x}$ into a 2x2 image and using two 2x2 kernels:

$W_1 * \mathbf{x} + b_1$, where $W_1 = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{1,3} & w_{1,4} \end{bmatrix}$

$W_2 * \mathbf{x} + b_2$, where $W_2 = \begin{bmatrix} w_{2,1} & w_{2,2} \\ w_{2,3} & w_{2,4} \end{bmatrix}$
Side Note: It is Possible to Replace Fully Connected Layers by Convolutional Layers

$\mathbf{w}_1^T \mathbf{x} + b_1$
$\mathbf{w}_2^T \mathbf{x} + b_2$

Or, we can concatenate the inputs into 1x1 images with 4 channels and then use 2 kernels (remember, each kernel then also has 4 channels).

Example: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/fc-to-conv.ipynb
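A minimal sketch of the 1x1-image variant (my own, with assumed sizes; the linked notebook contains the full example):

import torch
import torch.nn as nn

fc = nn.Linear(4, 2)                    # fully connected: 4 inputs, 2 outputs

# Equivalent 1x1 convolution over a 1x1 "image" with 4 channels, 2 kernels
conv = nn.Conv2d(4, 2, kernel_size=1)
conv.weight.data = fc.weight.data.view(2, 4, 1, 1)  # reuse the FC weights
conv.bias.data = fc.bias.data

x = torch.randn(1, 4)
out_fc = fc(x)
out_conv = conv(x.view(1, 4, 1, 1)).view(1, 2)
print(torch.allclose(out_fc, out_conv))  # True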
Network in Network (NiN)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Key Ideas

1)
• A convolution kernel can be thought of as a generalized linear model (GLM)
• Using a "sophisticated" nonlinear function approximator (e.g., an MLP) may enhance the abstraction ability of the local model
• => Replace the GLM by a "micro network" (sliding an MLP over the feature map)
2)
• Replace the MLP "micro structure" via convolutions (explanation on the previous slides)
• Replace the fully connected layers in the last layers by global average pooling
Network in Network (NiN)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

[Figure 1 from the paper: Comparison of linear convolution layer and mlpconv layer. The linear convolution layer includes a linear filter, while the mlpconv layer includes a micro network (the authors choose a multilayer perceptron). Both layers map the local receptive field to a confidence value of the latent concept.]

• Using a "sophisticated" nonlinear function approximator (e.g., an MLP) may enhance the abstraction ability of the local model

From the paper: the overall structure of NIN is the stacking of multiple mlpconv layers. Instead of adopting traditional fully connected layers for classification, NIN directly outputs the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and the resulting vector is fed into the softmax layer.
Global Average Pooling in Last Layer

[Figure 16 from the source below: Global average pooling layer replacing the fully connected layers. The output layer implements a softmax operation with p_1, p_2, ..., p_n the predicted probabilities for each class.]

Global average pooling can remove the need for fully connected layers in a CNN-based classifier. The NiN authors consider it a structural regularizer of the CNN, transforming feature activations into confidence maps by creating correspondences between features and classes.

Figure source: Singh, Anshuman Vikram. "Content-based image retrieval using deep learning." (2015).
Network in Network (NiN)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

[Figure 2 from the paper: The overall structure of Network In Network, here a stack of three mlpconv layers and one global average pooling layer. The output has 1 node per class (like usually), followed by a softmax.]

• Replace the fully connected layers in the last layers by global average pooling
Network in Network (NiN)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Example implementation:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/nin-cifar10.ipynb

• Replace the MLP "micro structure" with convolutions (explanation on the previous slides)
• Replace fully connected layers with global average pooling (explanation on the previous slides; a minimal sketch follows below)
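A minimal sketch of such a classifier head (my own, with assumed channel counts; see the linked notebook for the full NiN model):

import torch
import torch.nn as nn

# The final conv produces one feature map per class; global average pooling
# then collapses each map to a single logit -- no fully connected layer.
head = nn.Sequential(
    nn.Conv2d(192, 10, kernel_size=1),  # 1x1 conv: 192 channels -> 10 classes
    nn.AdaptiveAvgPool2d(1),            # global average pooling to 1x1
    nn.Flatten(),                       # [N, 10, 1, 1] -> [N, 10]
)
x = torch.randn(4, 192, 8, 8)
print(head(x).shape)                    # torch.Size([4, 10])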
Network in Network (NiN)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Why it might work well in practice

Using the micro-networks allows us to extract more sophisticated features (non-linear functions); we may need fewer extractors and can avoid learning too simple or redundant abstractions.

Fully-connected layers have a lot of parameters and may cause overfitting; replacing them by global average pooling might help with better generalization (nice side-effect: we can make the network somewhat agnostic to the input size).
"All-Convolutional Network"

Different but related idea:

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Key Idea: Replace max-pooling with strided convolutions (i.e., conv layers with stride=2).

From the paper: let $f$ denote a feature map of size $W \times H \times N$. Then $p$-norm subsampling (pooling) with pooling size $k$ and stride $r$ applied to $f$ is a 3-dimensional array $s(f)$ with the following entries:

$$s_{i,j,u}(f) = \left( \sum_{h=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{w=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} |f_{g(h,w,i,j,u)}|^p \right)^{1/p} \qquad (1)$$

where $g(h, w, i, j, u) = (r \cdot i + h, r \cdot j + w, u)$ is the function mapping from positions in $s$ to positions in $f$ respecting the stride, and $p$ is the order of the $p$-norm. For $p \to \infty$, this is the definition of max-pooling (with stride=2 when $r = 2$). Compare this to the standard definition of a convolutional layer $c$ applied to $f$ (with stride=2 when $r = 2$):

$$c_{i,j,o}(f) = \sigma\left( \sum_{h=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{w=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{u=1}^{N} \theta_{h,w,u,o} \cdot f_{g(h,w,i,j,u)} \right) \qquad (2)$$

where $\theta$ are the convolutional weights (the kernel weights, or filters), $\sigma(\cdot)$ is the activation function (here a rectified linear activation, $\mathrm{ReLU}(x) = \max(x, 0)$), and $o \in [1, M]$ is the number of the output feature (or channel). Both operations depend on the same elements of the previous-layer feature map; pooling can be seen as performing a feature-wise convolution in which the activation function is the $p$-norm.
We can think of "strided convolutions" as learnable pooling (a minimal shape check follows below).
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

From the paper: to unify the architecture further, if the image area covered by units in the topmost convolutional layer covers a portion of the image large enough to recognize its content, then fully connected layers can also be replaced by simple 1-by-1 convolutions. This leads to predictions of object classes at different positions, which can then simply be averaged over the whole image; the 1-by-1 convolution also has far fewer parameters than a fully connected layer. Overall, the architecture is thus reduced to consist only of convolutional layers with rectified linear non-linearities and an averaging + softmax layer.

Experimental Ablation Study with 3 Base Models:

Table 1: The three base networks used for classification on CIFAR-10 and CIFAR-100.

Model A              Model B              Model C
Input: 32x32 RGB image
5x5 conv. 96 ReLU    5x5 conv. 96 ReLU    3x3 conv. 96 ReLU
                     1x1 conv. 96 ReLU    3x3 conv. 96 ReLU
3x3 max-pooling stride 2 (all models)
5x5 conv. 192 ReLU   5x5 conv. 192 ReLU   3x3 conv. 192 ReLU
                     1x1 conv. 192 ReLU   3x3 conv. 192 ReLU
3x3 max-pooling stride 2 (all models), then for all models:
3x3 conv. 192 ReLU, 1x1 conv. 192 ReLU, 1x1 conv. 10 ReLU,
global averaging over 6x6 spatial dimensions, 10 or 100-way softmax
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

From the paper: model C replaces all 5x5 convolutions by simple 3x3 convolutions. This serves two purposes: 1) it unifies the architecture to consist only of layers operating on 3x3 spatial neighborhoods of the previous layer feature map (with occasional subsampling); 2) if max-pooling is replaced by a convolutional layer, then 3x3 is the minimum filter size to allow overlapping convolution with stride 2. Model C resembles the very deep models used by Simonyan & Zisserman (2014).

Table 2: The three networks derived from base model C, used for evaluating the importance of pooling on CIFAR-10 and CIFAR-100. Derived models for base models A and B are built analogously; the higher layers are the same as in Table 1. Shown are the modifications for Model C (input: 32x32 RGB image):

Strided-CNN-C (remove pooling & increase the stride of the previous layer):
  3x3 conv. 96 ReLU / 3x3 conv. 96 ReLU with stride r=2 /
  3x3 conv. 192 ReLU / 3x3 conv. 192 ReLU with stride r=2 / ...

ConvPool-CNN-C (extra conv. layer before each pooling layer):
  3x3 conv. 96 ReLU (x3) / 3x3 max-pooling stride 2 /
  3x3 conv. 192 ReLU (x3) / 3x3 max-pooling stride 2 / ...

All-CNN-C (remove pooling & add a strided conv. layer):
  3x3 conv. 96 ReLU (x2) / 3x3 conv. 96 ReLU with stride r=2 /
  3x3 conv. 192 ReLU (x2) / 3x3 conv. 192 ReLU with stride r=2 / ...
the ImageNet database. Although we expect that a larger network using only 3 ⇥ 3 convolutions
and having stride 1 in the first layer (and thus similar in style to Simonyan & Zisserman (2014))
"All-Convolutional Network"
would perform even better on this dataset, training it would take several weeks and could thus not
be completed in time for this manuscript.

3.2 C LASSIFICATION
Springenberg, RESULTS
Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).
3.2.1 CIFAR-10

Table 3: Comparison between the base and derived models on the CIFAR-10 dataset
(classification error, without data augmentation).

Model              Error (%)   # parameters
Model A            12.47%      ≈ 0.9 M
Strided-CNN-A      13.46%      ≈ 0.9 M
ConvPool-CNN-A     10.21%      ≈ 1.28 M
ALL-CNN-A          10.30%      ≈ 1.28 M
Model B            10.20%      ≈ 1 M
Strided-CNN-B      10.98%      ≈ 1 M
ConvPool-CNN-B     9.33%       ≈ 1.35 M
ALL-CNN-B          9.10%       ≈ 1.35 M
Model C            9.74%       ≈ 1.3 M
Strided-CNN-C      10.19%      ≈ 1.3 M
ConvPool-CNN-C     9.31%       ≈ 1.4 M
ALL-CNN-C          9.08%       ≈ 1.4 M

"In our first experiment we compared all models from Section 3.1 on the CIFAR-10 dataset without
using any augmentations. All networks were trained using stochastic gradient descent with fixed
momentum of 0.9. The learning rate was adapted using a schedule S = e1, e2, e3 in which the
learning rate is multiplied by a fixed multiplier of 0.1 after e1, e2, and e3 epochs, respectively. [...]"

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 53
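This kind of step-wise schedule maps onto PyTorch's MultiStepLR scheduler. Below is a minimal
sketch; the model, learning rate, and milestone epochs are placeholders of my choosing, not the
values from the paper:

import torch

model = torch.nn.Linear(10, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Multiply the learning rate by 0.1 after epochs e1, e2, e3
# (the milestone values below are made up for illustration)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 150, 175], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over the minibatches would go here ...
    scheduler.step()                                 # advance the schedule once per epoch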
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Table 3: Comparison between the base and derived models on the CIFAR-10 dataset
(classification error, without data augmentation).

Model              Error (%)   # parameters
Model A            12.47%      ≈ 0.9 M
Strided-CNN-A      13.46%      ≈ 0.9 M
ConvPool-CNN-A     10.21%      ≈ 1.28 M
ALL-CNN-A          10.30%      ≈ 1.28 M
Model B            10.20%      ≈ 1 M
Strided-CNN-B      10.98%      ≈ 1 M
ConvPool-CNN-B     9.33%       ≈ 1.35 M
ALL-CNN-B          9.10%       ≈ 1.35 M
Model C            9.74%       ≈ 1.3 M
Strided-CNN-C      10.19%      ≈ 1.3 M
ConvPool-CNN-C     9.31%       ≈ 1.4 M
ALL-CNN-C          9.08%       ≈ 1.4 M

Annotation: Removing max-pooling and simply increasing the stride of the previous layer
(the Strided-CNN variants) performs worse.

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 54
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Table 3: Comparison between the base and derived models on the CIFAR-10 dataset
(classification error, without data augmentation).

Model              Error (%)   # parameters
Model A            12.47%      ≈ 0.9 M
Strided-CNN-A      13.46%      ≈ 0.9 M
ConvPool-CNN-A     10.21%      ≈ 1.28 M
ALL-CNN-A          10.30%      ≈ 1.28 M
Model B            10.20%      ≈ 1 M
Strided-CNN-B      10.98%      ≈ 1 M
ConvPool-CNN-B     9.33%       ≈ 1.35 M
ALL-CNN-B          9.10%       ≈ 1.35 M
Model C            9.74%       ≈ 1.3 M
Strided-CNN-C      10.19%      ≈ 1.3 M
ConvPool-CNN-C     9.31%       ≈ 1.4 M
ALL-CNN-C          9.08%       ≈ 1.4 M

Annotation: Replacing max-pooling with a convolutional layer (stride = 2), as in the
ALL-CNN variants, improves the performance.

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 55
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Table 3: Comparison between the base and derived models on the CIFAR-10 dataset
(classification error, without data augmentation).

Model              Error (%)   # parameters
Model A            12.47%      ≈ 0.9 M
Strided-CNN-A      13.46%      ≈ 0.9 M
ConvPool-CNN-A     10.21%      ≈ 1.28 M
ALL-CNN-A          10.30%      ≈ 1.28 M
Model B            10.20%      ≈ 1 M
Strided-CNN-B      10.98%      ≈ 1 M
ConvPool-CNN-B     9.33%       ≈ 1.35 M
ALL-CNN-B          9.10%       ≈ 1.35 M
Model C            9.74%       ≈ 1.3 M
Strided-CNN-C      10.19%      ≈ 1.3 M
ConvPool-CNN-C     9.31%       ≈ 1.4 M
ALL-CNN-C          9.08%       ≈ 1.4 M

Annotation: Replacing max-pooling with a convolutional layer (stride = 2) improves the
performance (this may be unfair because of the additional parameters).

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 56
Difference to "All-Convolutional Network"
Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Code example: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/convnet-allconv.ipynb
(a minimal sketch of the pooling-vs-strided-convolution swap follows below)

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 57
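As a rough sketch of the idea (separate from the notebook linked above), replacing the pooling
layer with a stride-2 convolution might look like this; the channel counts are illustrative:

import torch.nn as nn

# Conventional downsampling: convolution followed by max-pooling
pool_block = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

# "All-convolutional" downsampling: a stride-2 convolution takes over
# the subsampling (and adds learnable parameters)
allconv_block = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1),
    nn.ReLU())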
Lecture Overview

1. Padding (control output size in addition to stride)


2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
A. VGG16 (simple, deep CNN)
B. ResNet and skip connections
C. Fully convolutional networks (no fully connected layers)
D. Inception (parallel convolutions and auxiliary losses)
5. Transfer learning

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 58
Common Architectures Revisited

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical
applications. arXiv preprint arXiv:1605.07678.

[Figure 2 from Canziani et al.: Top-1 one-crop accuracy versus amount of operations required for a
single forward pass. The size of the blobs is proportional to the number of network parameters.]

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 59
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

"In this paper, we will focus on an efficient deep neural


network architecture for computer vision, codenamed
Inception, which derives its name from the Network in
network paper by Lin et al [12] in conjunction with the
famous “we need to go deeper” internet meme"

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 60
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

Key Ideas/Features:

• 1x1 convolutions: An efficient way to reduce the number of channels (from NiN)

• Global Average Pooling at the last layer (from NiN)

• Use of auxiliary losses that are added to the total loss (see the sketch after this slide)
• New: Inception module

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 61
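Two of these ideas in a rough PyTorch sketch. The tensor shapes and the random logits below are
placeholders of my choosing; the 0.3 discount weight for the auxiliary losses is the value reported
in the GoogLeNet paper:

import torch
import torch.nn.functional as F

# Global average pooling: collapse each feature map to a single value
feature_maps = torch.randn(8, 1024, 7, 7)     # batch of 7x7 maps (made-up shape)
pooled = feature_maps.mean(dim=(2, 3))        # -> shape (8, 1024)

# Auxiliary losses: classifiers attached to intermediate layers contribute
# to the total training loss with a discount weight (0.3 in the paper);
# the logits below are random placeholders standing in for classifier outputs
targets = torch.randint(0, 1000, (8,))
main_logits, aux1_logits, aux2_logits = [torch.randn(8, 1000) for _ in range(3)]

total_loss = (F.cross_entropy(main_logits, targets)
              + 0.3 * F.cross_entropy(aux1_logits, targets)
              + 0.3 * F.cross_entropy(aux2_logits, targets))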
GoogLeNet / Inception v1

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

Full Architecture

[Figure 3 from Szegedy et al.: GoogLeNet network with all the bells and whistles.]

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 62
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

Zoomed in: [detail view of the GoogLeNet architecture diagram from Figure 3]

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 63
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

[Figure 2(a) from Szegedy et al.: Inception module, naïve version.]
Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 64
GoogLeNet / Inception v1

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

NiN 1x1 conv idea

[Figure 2 from Szegedy et al.: (a) Inception module, naïve version;
(b) Inception module with dimensionality reduction.]

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 65
GoogLeNet / Inception v1

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

[Figure 2(b) from Szegedy et al.: Inception module with dimensionality reduction.]

Example: a 1×1 convolution with 16 output channels reduces the channel dimension before the
expensive 5×5 convolution (see the code sketch after this slide):

28x28x128 --(1x1@16 conv)--> 28x28x16 --(5x5@64 conv)--> 28x28x64

"[...] The merging of outputs of the pooling layer with outputs of the convolutional layers would
lead to an inevitable increase in the number of outputs from stage to stage. While this architecture
might cover the optimal sparse structure, it would do it very inefficiently."

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 66
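A minimal sketch of this bottleneck in PyTorch, using the shapes from the example above (the
layer names are mine):

import torch
import torch.nn as nn

x = torch.randn(1, 128, 28, 28)                         # 28x28 feature map, 128 channels

reduce_1x1 = nn.Conv2d(128, 16, kernel_size=1)          # 1x1 conv: 128 -> 16 channels
conv_5x5 = nn.Conv2d(16, 64, kernel_size=5, padding=2)  # 5x5 conv on the reduced tensor

out = conv_5x5(reduce_1x1(x))
print(out.shape)                                        # torch.Size([1, 64, 28, 28])

# Weight-count comparison (ignoring bias terms):
# direct 5x5:  128 * 64 * 5 * 5          = 204,800 weights
# 1x1 + 5x5:   128*16*1*1 + 16*64*5*5    =  27,648 weights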
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
[Figure 2 from Szegedy et al.: (a) Inception module, naïve version;
(b) Inception module with dimensionality reduction.]

Feature maps from parallel paths are concatenated along the channel dimension
(see the module sketch after this slide).

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 67
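A hedged sketch of such a module in PyTorch; the channel counts are arbitrary choices for
illustration, not the values used in GoogLeNet:

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Simplified Inception block: four parallel paths whose outputs are
    # concatenated along the channel dimension (dim=1)
    def __init__(self, in_channels):
        super().__init__()
        self.path1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.path2 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),            # 1x1 reduction
            nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.path3 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),            # 1x1 reduction
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.path4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1))            # 1x1 projection

    def forward(self, x):
        # every path preserves height and width, so concatenation is valid
        return torch.cat([self.path1(x), self.path2(x),
                          self.path3(x), self.path4(x)], dim=1)

x = torch.randn(1, 128, 28, 28)
print(InceptionModule(128)(x).shape)   # torch.Size([1, 192, 28, 28])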
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 68
Inception v2
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 2818-2826).

Factorized the traditional large convolutions into multiple smaller convolutions


[Figure 2 from Szegedy et al.: "Factorization with Linear vs ReLU activation"; top-1 accuracy over
training iterations. One of several control experiments between two Inception models, one of them
uses factorization into linear layers, the other uses two ReLU layers. After 3.86 million operations,
the former settles at 76.2%, while the latter reaches 77.2% top-1 accuracy on the validation set.]

[Figure 1 from Szegedy et al.: Mini-network replacing the 5 × 5 convolutions.]

[Figure 3 from Szegedy et al.: Mini-network replacing the 3 × 3 convolutions. The lower layer of
this network consists of a 3 × 1 convolution with 3 output units.]

"[...] Since Inception networks are fully convolutional, each weight corresponds to one
multiplication per activation. Therefore, any reduction in computational cost results in reduced
number of parameters. This means that with suitable factorization, we can end up with more
disentangled parameters and therefore with faster training."

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 69
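The factorizations sketched in PyTorch; the channel counts are illustrative:

import torch.nn as nn

# A 5x5 convolution ...
conv_5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2)

# ... can be replaced by two stacked 3x3 convolutions with the
# same receptive field but fewer weights (2*9 vs. 25 per channel pair)
two_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU())

# A 3x3 convolution can itself be factorized asymmetrically into
# a 3x1 convolution followed by a 1x3 convolution (6 vs. 9 weights)
asym_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU())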
Inception v2

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 2818-2826).

Factorized the traditional large convolutions into multiple smaller convolutions

[Figure 4 from Szegedy et al.: Original Inception module as described in the GoogLeNet paper.]

[Figure 5: Inception modules where each 5 × 5 convolution is replaced by two 3 × 3 convolutions,
as suggested by principle 3 of Section 2.]

[Figure 6: Inception modules after the factorization of the n × n convolutions; the proposed
architecture uses n = 7 for the 17 × 17 grid.]

[Figure 7: Inception modules with expanded filter bank outputs. This architecture is used on the
coarsest (8 × 8) grids to promote high-dimensional representations.]

"[...] the two-layer solution is 33% cheaper for the same number of output filters, if the number
of input and output filters is equal."

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 70
Inception v3

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 2818-2826).

Table 3: Single crop experimental results comparing the cumulative effects of the various
contributing factors.

Network                                 Top-1 Error   Top-5 Error   Cost (Bn Ops)
GoogLeNet [20]                          29%           9.2%          1.5
BN-GoogLeNet                            26.8%         -             1.5
BN-Inception [7]                        25.2%         7.8           2.0
Inception-v3-basic                      23.4%         -             3.8
Inception-v3-rmsprop (RMSProp)          23.1%         6.3           3.8
Inception-v3-smooth (Label Smoothing)   22.8%         6.1           3.8
Inception-v3-fact (Factorized 7 × 7)    21.6%         5.8           4.8
Inception-v3 (BN-auxiliary)             21.2%         5.6%          4.8

"Table 3 shows the experimental results about the recognition performance of our proposed
architecture (Inception-v2) as described in Section 6. Each Inception-v2 line shows the result
of the cumulative changes including the highlighted new modification plus all the earlier ones.
[...] BN auxiliary refers to the version in which the fully connected layer of the auxiliary
classifier is also normalized, not just convolutions. We are referring to the model
[Inception-v2 + BN auxiliary] as Inception-v3."

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 71
Inception v4
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017, February). Inception-v4, Inception-resnet and the impact of
residual connections on learning. In Thirty-first AAAI Conference on Artificial Intelligence.

Here we give clear empirical evidence that training with residual


connections accelerates the training of Inception networks significantly.

We also present several new streamlined architectures for both residual and
non-residual Inception networks. These variations improve the single-frame
recognition performance on the ILSVRC 2012 classification task
significantly.

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 72
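The residual connections mentioned here can be sketched generically: wrap a block f so that the
output becomes x + f(x). This is an illustrative sketch (assuming the wrapped block preserves the
tensor shape), not the paper's actual architecture:

import torch.nn as nn

class ResidualWrapper(nn.Module):
    # Generic skip connection: output = x + f(x); assumes the
    # wrapped block f preserves the shape of its input tensor
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block(x)

# Usage sketch: wrap any inception-style block with a skip connection
# residual_block = ResidualWrapper(some_inception_module)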
Based on neural architecture search (NAS) and stochastic network generators
https://arxiv.org/abs/1904.01569

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 73
https://arxiv.org/abs/1904.01569

Based on neural architecture search (NAS) and stochastic network generators

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 74
https://arxiv.org/abs/1904.01569

Also utilizes an
LSTM controller with
probabilistic behavior
(will discuss LSTMs in a different
context next lecture)

Based on neural architecture search (NAS) and stochastic network generators

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 75
Evolving Normalization-Activation Layers

Hanxiao Liu, Andrew Brock, Karen Simonyan, Quoc V. Le
https://arxiv.org/abs/2004.02967

Abstract (excerpt): "Normalization layers and activation functions are fundamental components
in deep neural networks that frequently co-locate with each other. Instead of designing them
separately, we unify them into a single computation graph, and evolve its structure starting
from low-level primitives. Our layer search algorithm leads to the discovery of EvoNorms, a set
of new normalization-activation layers that go beyond existing design patterns. Several of these
layers enjoy the property of being independent from the batch statistics. Our experiments show
that EvoNorms not only excel on image classification models including ResNets, MobileNets and
EfficientNets, but also transfer well to Mask R-CNN for instance segmentation."

[Figure 1: Left: Computation graph of a searched normalization-activation layer that is
batch-independent, named EvoNorm-S0. The corresponding expression is

    x * sigma(v_1 * x) / sqrt(s^2_{w,h,c/g}(x)) * gamma + beta,

in contrast to

    max( (x - mu_{w,h,c/g}(x)) / sqrt(s^2_{w,h,c/g}(x)) * gamma + beta, 0 )

for GroupNorm-ReLU. v_1, mu_{w,h,c/g}, and s^2_{w,h,c/g} refer to a learnable variable, group
mean, and group variance, respectively. Right: ResNet-50 results.]

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 76
Lecture Overview

1. Padding (control output size in addition to stride)


2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
A. VGG16 (simple, deep CNN)
B. ResNet and skip connections
C. Fully convolutional networks (no fully connected layers)
D. Inception (parallel convolutions and auxiliary losses)
5. Transfer learning

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 77
Transfer Learning

• A technique that may be useful for your class projects


• Key idea:
✦ Feature extraction layers may be generally useful
✦ Use a pre-trained model (e.g., pre-trained on ImageNet)
✦ Freeze the weights: only train the last layer (or last few layers); see the sketch after this slide
• Related approach: Fine-tuning, train a pre-trained network on your
smaller dataset

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 78
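A minimal sketch of this recipe with a pre-trained VGG16 from torchvision; the two-class output
layer and the learning rate are made-up example values:

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)      # weights pre-trained on ImageNet

# Freeze all weights of the pre-trained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer; the new (randomly initialized) layer has
# requires_grad=True, so it is the only part being trained
model.classifier[6] = nn.Linear(4096, 2)   # e.g., for a 2-class problem

# Only pass the trainable parameters to the optimizer
optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)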
Which Layers to Replace & Train?

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition (pp. 1725-1732).

Table 3: Results on UCF-101 for various Transfer Learning approaches using the Slow Fusion network.

Model                             3-fold Accuracy
Soomro et al [22]                 43.9%
Feature Histograms + Neural Net   59.0%
Train from scratch                41.3%
Fine-tune top layer               64.1%
Fine-tune top 3 layers            65.4%
Fine-tune all layers              62.2%

Table 4: Mean Average Precision of our network on UCF-101 classes broken down by group.

Group                             mAP
Human-Object Interaction          0.26
Body-Motion Only                  0.32
Human-Human Interaction           0.40
Playing Musical Instruments       0.42
Sports                            0.57
All groups                        0.44

"4.2. Transfer Learning Experiments on UCF-101
The results of our analysis on the Sports-1M dataset indicate that the networks learn powerful
motion features. A natural question that arises is whether these features also generalize to other
datasets and class categories. We examine this question in detail by performing transfer learning
experiments on the UCF-101 [22] Activity Recognition dataset."

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 79
Transfer Learning
PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-
ss20/blob/master/L13-cnns-part2/code/vgg16-transferlearning.ipynb

Visualization from
https://www.cs.toronto.edu/~frossard/post/vgg16/

VGG-16
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 80
Transfer Learning
PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-
ss20/blob/master/L13-cnns-part2/code/vgg16-transferlearning.ipynb

Visualization from
https://www.cs.toronto.edu/~frossard/post/vgg16/

Freeze

VGG-16 Replace

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 81
Transfer Learning
https://pytorch.org/docs/stable/torchvision/models.html

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 82
Transfer Learning
https://pytorch.org/docs/stable/torchvision/models.html

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 83
Transfer Learning Example

PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-


ss20/blob/master/L13-cnns-part2/code/vgg16-transferlearning.ipynb

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 84
Stuff in the News

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 85
Optional Reading Material

http://www.deeplearningbook.org/contents/convnets.html

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 86
AI Gahaku

https://ai-art.tokyo/en/

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 87
AI Gahaku

The AI artist named “AI Gahaku”


generates a masterpiece
from your photo.
https://ai-art.tokyo/en/

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 88
More Coronavirus Datasets

https://spectrum.ieee.org/the-human-os/biomedical/imaging/hospitals-deploy-ai-tools-detect-covid19-chest-scans

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 89
More Coronavirus Datasets
"The COVIDx dataset now consists of 16,756 chest X-Rays across 13,645 patients"

https://github.com/lindawangg/COVID-Net

"COVID-Net, a deep convolutional neural network design tailored for the detection of
COVID-19 cases from chest radiography images"

Based on a ResNet-like architecture


https://arxiv.org/pdf/2003.09871.pdf

https://medium.com/@sheldon.fernandez/covid-net-larger-dataset-new-models-and-covid-risknet-fd8e7c451c

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 90
More Coronavirus Datasets

https://www.nytimes.com/article/coronavirus-county-data-us.html
Note: you need to open this article in a private browser window to
be able to view it.

https://github.com/nytimes/covid-19-data
Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 91
Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers.
Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your
(de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to
AGI.
Losslessly compress the 1GB file enwik9 to less than 116MB. More precisely:
• Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that
S:=S1+S2 < L := 116'673'681 = previous record.
• If run, archive.exe produces (without input from other sources) a 10^9 byte file that is identical to enwik9.
• If we can verify your claim, you are eligible for a prize of 500'000€×(1-S/L). Minimum claim is 5'000€ (1% improvement).
• Restrictions: Must run in ≲100 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine.

http://prize.hutter1.net

Sebastian Raschka STAT 453: Intro to Deep Learning and Generative Models SS 2020 92
