L13_intro-cnn-part2__slides
Introduction to
Convolutional Neural Networks
Part 2
STAT 453: Deep Learning, Spring 2020
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/
https://github.com/rasbt/stat453-deep-learning-ss20/tree/master/L13-cnns-part2
Quiz
Lecture Overview
Padding
o = ⌊(i + 2p − k) / s⌋ + 1

where i = input size, p = padding, k = kernel size, s = stride, and ⌊·⌋ is the "floor" function.
(4 − 3 + 2·0)/1 + 1 = 2   (4×4 input, 3×3 kernel, no padding, stride 1)
(5 − 4 + 2·2)/1 + 1 = 6   (5×5 input, 4×4 kernel, padding 2, stride 1)
(5 − 3 + 2·0)/2 + 1 = 2   (5×5 input, 3×3 kernel, no padding, stride 2)
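A quick way to sanity-check these numbers is to code up the formula and compare against PyTorch (a minimal sketch, not part of the course notebooks):

import torch

def conv_output_size(i, k, p, s):
    # o = floor((i + 2p - k) / s) + 1
    return (i + 2 * p - k) // s + 1

# The three examples above:
print(conv_output_size(4, 3, 0, 1))  # 2
print(conv_output_size(5, 4, 2, 1))  # 6
print(conv_output_size(5, 3, 0, 2))  # 2

# Cross-check against an actual convolution:
conv = torch.nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=0)
print(conv(torch.rand(1, 1, 5, 5)).shape)  # torch.Size([1, 1, 2, 2])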
Highly recommended:
Dumoulin, Vincent, and Francesco Visin. "A guide to
convolution arithmetic for deep learning." arXiv preprint
arXiv:1603.07285 (2016).
No padding, stride=2
Padding Jargon
Padding
o = ⌊(i + 2p − k) / s⌋ + 1

For stride s = 1: o = i + 2p − k + 1, so p = (o − i + k − 1)/2.
For "same" padding (o = i): p = (k − 1)/2.
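For odd kernel sizes and stride 1, p = (k − 1)/2 keeps the spatial size unchanged; a minimal check in PyTorch (illustrative, not from the slides):

import torch

k = 3
p = (k - 1) // 2  # "same" padding for stride 1
conv = torch.nn.Conv2d(16, 16, kernel_size=k, stride=1, padding=p)
print(conv(torch.rand(8, 16, 32, 32)).shape)  # torch.Size([8, 16, 32, 32])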
Lecture Overview
Spatial Dropout -- Dropout2D
• Dropout2d will drop full feature maps (channels)
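A minimal sketch showing this behavior (illustrative values, not from the course notebooks):

import torch
import torch.nn as nn

torch.manual_seed(123)
drop = nn.Dropout2d(p=0.5)   # drops entire feature maps with probability 0.5
x = torch.ones(1, 4, 2, 2)   # 4 feature maps of size 2x2
print(drop(x))               # each 2x2 map is either all zeros or
                             # scaled by 1/(1-p) = 2 during training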
BatchNorm 2D
Source: https://pytorch.org/docs/stable/nn.html
(For comparison, inputs to BatchNorm1d are rank-2 tensors: [N, num_features].)
In BatchNorm2d, the mean and standard deviation are computed over the N*H*W elements of each channel, i.e., separately for each of the num_features channels
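A small sketch to verify the per-channel normalization (illustrative, not from the course notebooks):

import torch
import torch.nn as nn

x = torch.rand(16, 3, 28, 28)        # [N, C, H, W]
bn = nn.BatchNorm2d(num_features=3)
out = bn(x)
# Statistics are computed over the N*H*W elements of each channel,
# so every channel ends up with ~zero mean and ~unit variance:
print(out.mean(dim=(0, 2, 3)))       # approx. [0., 0., 0.]
print(out.var(dim=(0, 2, 3)))        # approx. [1., 1., 1.]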
Lecture Overview
Computing Convolutions on the GPU
• There are many different approaches to compute (approximate) convolution operations
• The GPU algorithms are usually more efficient than the CPU variants (convolutions on the CPU typically take up much more memory due to the choice of algorithm, compared to using the GPU)
import torch
import torch.nn as nn
import torch.nn.functional as F

if torch.cuda.is_available():
    # ask cuDNN to select deterministic convolution algorithms
    torch.backends.cudnn.deterministic = True
Lecture Overview
Common Architectures Revisited
Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.

[Figure 2 from the paper: Top-1 one-crop accuracy versus amount of operations required for a single forward pass; the size of the blobs is proportional to the number of network parameters. This chart motivates the choice of architectures used throughout this lecture.]
Common Architectures Revisited
[Flattened residue of the VGG configuration table (Table 1 in Simonyan & Zisserman, 2014): stacks of 3×3 conv layers (conv3-256, conv3-512, ...) with "same" padding, separated by 2×2 max pooling, followed by FC-4096, FC-4096, FC-1000, and a soft-max layer.]

Disadvantage: very large number of parameters, and slow (see previous slide)
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

Table 2: Number of parameters (in millions).
Network: A, A-LRN: 133 | B: 133 | C: 134 | D: 138 | E: 144

From the paper: "such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a ..."
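The parameter savings from stacking 3×3 convolutions are easy to verify with a quick calculation (C input and output channels assumed, biases ignored):

C = 512
single_7x7 = 7 * 7 * C * C       # 49 C^2 weights for one 7x7 conv layer
stack_3x3 = 3 * (3 * 3 * C * C)  # 27 C^2 weights for three stacked 3x3 conv layers
print(single_7x7, stack_3x3)     # 12845056 6815744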
VGG-16
PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-
ss20/blob/master/L13-cnns-part2/code/vgg16.ipynb
Visualization from
https://www.cs.toronto.edu/~frossard/post/vgg16/
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).
Lecture Overview
Common Architectures Revisited
[Figure residue: ResNet diagram of stacked 3x3 conv layers; shortcut connections must match dimensions, and when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.]
ResNets

[Figure: residual blocks with shortcut connections; the main path consists of Convolution and Batch Norm layers.]

Alternative residual blocks use skip connections such that the input passed via the shortcut is resized to the dimensions of the main path's output.
ResNet Block Implementation
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-
cnns-part2/code/resnet-blocks.ipynb
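For reference, a minimal residual block in the spirit of the notebook (a simplified sketch, not the notebook's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # identity shortcut; assumes the main path preserves the input shape
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # add the shortcut before the final ReLU

block = ResidualBlock(64)
print(block(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])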
ResNet-34 and ResNet-152
PyTorch implementations:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-
cnns-part2/code/resnet-34.ipynb
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-
cnns-part2/code/resnet-152.ipynb
Lecture Overview
Common Architectures Revisited
Fully connected layer: $W_1 \ast x + b_1$, where

$W_1 = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{1,3} & w_{1,4} \end{bmatrix}$, $\quad W_2 = \begin{bmatrix} w_{2,1} & w_{2,2} \\ w_{2,3} & w_{2,4} \end{bmatrix}$
Side Note: It is Possible to Replace Fully
Connected Layers by Convolutional Layers
$w_1^T x + b_1$ and $w_2^T x + b_2$ (fully connected layer)

Or, we can concatenate the inputs into 1x1 images with 4 channels and then use 2 kernels (remember, each kernel then also has 4 channels).
Example: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/
code/fc-to-conv.ipynb
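A minimal sketch of the equivalence (illustrative, not the notebook's exact code): a fully connected layer with 4 inputs and 2 outputs gives the same result as a 1x1 convolution over a 1x1 "image" with 4 channels:

import torch

fc = torch.nn.Linear(4, 2)
conv = torch.nn.Conv2d(4, 2, kernel_size=1)
conv.weight.data = fc.weight.data.view(2, 4, 1, 1)  # reuse the FC weights
conv.bias.data = fc.bias.data

x = torch.rand(1, 4)
print(fc(x))
print(conv(x.view(1, 4, 1, 1)).flatten())  # same two values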
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
Key Ideas
1)
• A convolution kernel can be thought of as a generalized linear model (GLM)
• Using a "sophisticated" nonlinear function approximator (e.g., an MLP) may
enhance the abstraction ability of the local model
• => Replace GLM by "micro network" (sliding an MLP over the feature map)
2)
• Replace the MLP "micro structure" with convolutions
(explanation on the previous slides)
• Replace the fully connected layers in the last layers by
global average pooling
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
[Figure 1 from the paper: "Comparison of linear convolution layer and mlpconv layer. The linear convolution layer includes a linear filter while the mlpconv layer includes a micro network (we choose the multilayer perceptron in this paper). Both layers map the local receptive field to a confidence value of the latent concept."]

• Using a "sophisticated" nonlinear function approximator (e.g., an MLP) may enhance the abstraction ability of the local model

From the paper: "The overall structure of the NIN is the stacking of multiple mlpconv layers. It is called 'Network In Network' (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers. ... Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer."
From the figure source: "This can remove the need for fully connected layers in a CNN based classifier. It is considered by the authors as a structural regularizer of the CNN, transforming feature maps ..."

Figure 16: "Global average pooling layer replacing the fully connected layers. The output layer implements a Softmax operation with p_1, p_2, ..., p_n the predicted probabilities for each class."

Figure Source: Singh, Anshuman Vikram. "Content-based image retrieval using deep learning." (2015).

"Seeking an analogy in nature for pooling layers, it has been already studied by Neuroscience a property of biological neurons called lateral inhibition. The transmitted activations of feed-forward cells are pooled and an inhibitory signal is sent back to them, thus modifying the amount of input that reaches the receiver cells. Neurons that have stronger responses tend to inhibit the response of neighboring neurons. We can think of max pooling ..."
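Global average pooling itself is a one-liner; a minimal sketch (illustrative, not from the course notebooks):

import torch

feature_maps = torch.rand(16, 10, 6, 6)  # [N, C, H, W]: one map per class
pooled = feature_maps.mean(dim=(2, 3))   # spatial average -> [16, 10]
probas = torch.softmax(pooled, dim=1)    # fed directly into the softmax
print(probas.shape)                      # torch.Size([16, 10])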
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
[Figure 2 from the paper: "The overall structure of Network In Network. In this paper the NINs include the stacking of three mlpconv layers and one global average pooling layer."]

• Replace the fully connected layers in the last layers by global average pooling

From the paper: "Comparison to maxout layers: the maxout layers in the maxout network performs max pooling across multiple affine feature maps [8]. The feature maps of maxout layers are calculated as follows:

$$ f_{i,j,k} = \max_m \left( w_{k_m}^T x_{i,j} \right) \qquad (3) $$

Maxout over linear functions forms a piecewise linear function which is capable of modeling any convex function. ..."
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
Example Implementation:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/
code/nin-cifar10.ipynb
• Replace the MLP "micro structure" with convolutions
(explanation on the previous slides)
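A minimal mlpconv-style sketch via 1x1 convolutions (an assumption loosely following the NiN CIFAR-10 architecture, not the notebook's exact code):

import torch
import torch.nn as nn

mlpconv = nn.Sequential(
    nn.Conv2d(3, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(192, 160, kernel_size=1), nn.ReLU(),  # "MLP" layer 1
    nn.Conv2d(160, 96, kernel_size=1), nn.ReLU(),   # "MLP" layer 2
)
print(mlpconv(torch.rand(1, 3, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])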
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
Example Implementation:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/
code/nin-cifar10.ipynb
• Replace fully connected layers with global average pooling
(explanation on the previous slides)
Network in Network (NiN)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Key Idea: Replace max-pooling with strided convolutions (i.e., conv layers with stride=2)

From the paper (Model Description: The All Convolutional Network): "... we replace the pooling layers, which are present in practically all modern CNNs used for object recognition, with standard convolutional layers with stride two. To understand why this procedure can work it helps to recall the standard formulation for defining convolution and pooling operations in CNNs. Let f denote a feature map produced by some layer of a CNN. It can be described as a 3-dimensional array of size W × H × N where W and H are the width and height and N is the number of channels (in case f is the output of a convolutional layer, N is the number of filters in this layer). Then p-norm subsampling (or pooling) with pooling size k (or half-length k/2) applied to the feature map f is a 3-dimensional array s(f) with the following entries:

$$ s_{i,j,u}(f) = \left( \sum_{h=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{w=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \left| f_{g(h,w,i,j,u)} \right|^p \right)^{1/p}, \qquad (1) $$

where g(h, w, i, j, u) = (r · i + h, r · j + w, u) is the function mapping from positions in s to positions in f respecting the stride, p is the order of the p-norm (for p → ∞, it becomes the commonly used max pooling). If r > k, pooling regions do not overlap; however, current CNN architectures typically include overlapping pooling with k = 3 and r = 2. Let us now compare the pooling operation defined by Eq. 1 to the standard definition of a convolutional layer c applied to feature map f, given as:

$$ c_{i,j,o}(f) = \sigma\!\left( \sum_{h=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{w=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{u=1}^{N} \theta_{h,w,u,o} \cdot f_{g(h,w,i,j,u)} \right), \qquad (2) $$

where θ are the convolutional weights (or the kernel weights, or filters), σ(·) is the activation function, typically a rectified linear activation ReLU(x) = max(x, 0), and o ∈ [1, M] is the number of output feature (or channel) of the convolutional layer."

Annotations on the slide: Eq. 1 with stride r = 2 and p → ∞ is the definition of max-pooling with stride=2; Eq. 2 with r = 2 is the definition of a convolutional layer with stride=2.
We can think of "strided convolutions" as learnable pooling.
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

From the paper: "Additionally, to unify the architecture further, we make use of the fact that if the image area covered by units in the topmost convolutional layer covers a portion of the image large enough to recognize its content (i.e. the object we want to recognize) then fully connected layers can also be replaced by simple 1-by-1 convolutions. This leads to predictions of object classes at different positions which can then simply be averaged over the whole image. This scheme was first described by Lin et al. (2014) and further regularizes the network as the one by one convolution has much less parameters than a fully connected layer. Overall our architecture is thus reduced to consist only of convolutional layers with rectified linear non-linearities and an averaging + softmax layer to produce predictions over the whole image."

Experimental Ablation Study with 3 Base Models:

Table 1: The three base networks used for classification on CIFAR-10 and CIFAR-100.

Model A | Model B | Model C
Input: 32 × 32 RGB image (all models)
5 × 5 conv. 96 ReLU | 5 × 5 conv. 96 ReLU | 3 × 3 conv. 96 ReLU
— | 1 × 1 conv. 96 ReLU | 3 × 3 conv. 96 ReLU
3 × 3 max-pooling stride 2 (all models)
5 × 5 conv. 192 ReLU | 5 × 5 conv. 192 ReLU | 3 × 3 conv. 192 ReLU
— | 1 × 1 conv. 192 ReLU | 3 × 3 conv. 192 ReLU
3 × 3 max-pooling stride 2 (all models)
3 × 3 conv. 192 ReLU (all models)
1 × 1 conv. 192 ReLU (all models)
1 × 1 conv. 10 ReLU (all models)
global averaging over 6 × 6 spatial dimensions (all models)
10 or 100-way softmax (all models)
"All-Convolutional Network"

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

From the paper: "... Third, model C replaces all 5 × 5 convolutions by simple 3 × 3 convolutions. This serves two purposes: 1) it unifies the architecture to consist only of layers operating on 3 × 3 spatial neighborhoods of the previous layer feature map (with occasional subsampling); 2) if max-pooling is replaced by a convolutional layer, then 3 × 3 is the minimum filter size to allow overlapping convolution with stride 2. We also highlight that model C resembles the very deep models used by Simonyan & Zisserman (2014) in this years ImageNet competition."

Experimental Ablation Study with 3 Base Models (shown are the modifications for Model C):

Table 2: Model description of the three networks derived from base model C used for evaluating the importance of pooling in case of classification on CIFAR-10 and CIFAR-100. The derived models for base models A and B are built analogously. The higher layers are the same as in Table 1.

Strided-CNN-C | ConvPool-CNN-C | All-CNN-C
Input: 32 × 32 RGB image (all models)
3 × 3 conv. 96 ReLU (all models)
3 × 3 conv. 96 ReLU with stride r = 2 | 3 × 3 conv. 96 ReLU, then 3 × 3 max-pooling stride 2 | 3 × 3 conv. 96 ReLU, then 3 × 3 conv. 96 ReLU with stride r = 2
3 × 3 conv. 192 ReLU (all models)
3 × 3 conv. 192 ReLU with stride r = 2 | 3 × 3 conv. 192 ReLU, then 3 × 3 max-pooling stride 2 | 3 × 3 conv. 192 ReLU, then 3 × 3 conv. 192 ReLU with stride r = 2
...
3.2 Classification Results (3.2.1 CIFAR-10)

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).

Table 3: Comparison between the base and derived models on the CIFAR-10 dataset (classification error, without data augmentation).

Model | Error (%) | # parameters
Model A | 12.47% | ≈ 0.9 M
Strided-CNN-A | 13.46% | ≈ 0.9 M
ConvPool-CNN-A | 10.21% | ≈ 1.28 M
ALL-CNN-A | 10.30% | ≈ 1.28 M
Model B | 10.20% | ≈ 1 M
Strided-CNN-B | 10.98% | ≈ 1 M
ConvPool-CNN-B | 9.33% | ≈ 1.35 M
ALL-CNN-B | 9.10% | ≈ 1.35 M
Model C | 9.74% | ≈ 1.3 M
Strided-CNN-C | 10.19% | ≈ 1.3 M
ConvPool-CNN-C | 9.31% | ≈ 1.4 M
ALL-CNN-C | 9.08% | ≈ 1.4 M

From the paper: "In our first experiment we compared all models from Section 3.1 on the CIFAR-10 dataset without using any augmentations. All networks were trained using stochastic gradient descent with fixed momentum of 0.9. The learning rate γ was adapted using a schedule S = e1, e2, e3 in which γ is multiplied by a fixed multiplier of 0.1 after e1, e2, and e3 epochs respectively."
Observation (Table 3): Removing max-pooling and merely increasing the stride of the previous layer (Strided-CNN) performs worse than the base models.
Observation (Table 3): Replacing max-pooling with a convolutional layer with stride=2 (All-CNN) improves the performance (though this comparison may be unfair because of the additional parameters).
Difference to "All-Convolutional Network"
Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all
convolutional net." arXiv preprint arXiv:1412.6806 (2014).
Lecture Overview
Common Architectures Revisited
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
Key Ideas/Features:
GoogLeNet / Inception v1: Full Architecture

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. 2015.

[Figure 3 from the paper: "GoogLeNet network with all the bells and whistles." The surrounding text describes the ILSVRC setup: classifying images into one of 1000 leaf-node categories, with about 1.2 million training images, 50,000 for validation and 100,000 for testing; reported are the top-1 accuracy rate and the top-5 error rate, where an image is deemed correctly classified if the ground truth is among the top-5 predicted classes.]
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
Zoomed in
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
[Figure 2(a) from the paper: Inception module, naïve version.]
GoogLeNet / Inception v1

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. 2015.

NiN 1x1 conv idea

[Figure 2 from the paper: (a) Inception module, naïve version; (b) Inception module with dimensionality reduction.]
GoogLeNet / Inception v1

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. 2015.

From the paper: "The merging of outputs of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently."

Example: 1x1 convolutions for dimensionality reduction before an expensive 5x5 convolution:
28x28x128 input → 1x1 conv @ 16 filters → 28x28x16 → 5x5 conv @ 64 filters → 28x28x64
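A minimal sketch of the bottleneck arithmetic (illustrative, not from the paper's code):

import torch
import torch.nn as nn

x = torch.rand(1, 128, 28, 28)               # 28x28x128 input
reduce = nn.Conv2d(128, 16, kernel_size=1)   # 1x1 conv, 16 filters
conv5x5 = nn.Conv2d(16, 64, kernel_size=5, padding=2)
print(conv5x5(reduce(x)).shape)              # torch.Size([1, 64, 28, 28])

# Weight counts (biases ignored):
print(5 * 5 * 128 * 64)                      # 204800 for a direct 5x5 conv
print(1 * 1 * 128 * 16 + 5 * 5 * 16 * 64)    # 27648 with the 1x1 reduction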
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
(b) Inception module with dimensionality reduction
GoogLeNet / Inception v1
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
Inception v2
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 2818-2826).
[Plot from the paper: Top-1 accuracy over training iterations (×10^6), comparing ReLU and Linear variants.]
We also present several new streamlined architectures for both residual and
non-residual Inception networks. These variations improve the single-frame
recognition performance on the ILSVRC 2012 classification task
significantly.
Based on neural architecture search (NAS) and stochastic network generators
https://arxiv.org/abs/1904.01569
Also utilizes an
LSTM controller with
probabilistic behavior
(will discuss LSTMs in a different
context next lecture)
Evolving Normalization-Activation Layers
https://arxiv.org/abs/2004.02967
Hanxiao Liu 1 Andrew Brock 2 Karen Simonyan 2 Quoc V. Le 1
Abstract: "Normalization layers and activation functions are fundamental components in deep neural networks that frequently co-locate with each other. Instead of designing them separately, we unify them into a single computation graph, and evolve its structure starting from low-level primitives. Our layer search algorithm leads to the discovery of EvoNorms, a set of new normalization-activation layers that go beyond existing design patterns. Several of these layers enjoy the property of being independent from the batch statistics. Our experiments show that EvoNorms not only excel on image classification models including ResNets, MobileNets and EfficientNets, but also transfer well to Mask R-CNN for instance segmentation. ..."

Figure 1: "Left: Computation graph of a searched normalization-activation layer that is batch-independent, named EvoNorm-S0. The corresponding expression is $\frac{x\,\sigma(v_1 x)}{\sqrt{s^2_{w,h,c/g}(x) + \epsilon}}$, in contrast to $\max\!\left(\frac{x - \mu_{w,h,c/g}(x)}{\sqrt{s^2_{w,h,c/g}(x) + \epsilon}},\, 0\right)$ for GroupNorm-ReLU. $v_1$, $\mu_{w,h,c/g}$ and $s^2_{w,h,c/g}$ refer to a learnable variable, group mean and group variance, respectively. Right: ResNet-50 results with ..."
Lecture Overview
Transfer Learning
Which Layers to Replace & Train?
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition (pp. 1725-1732).
From the paper, Section 4.2 (Transfer Learning Experiments on UCF-101): "The results of our analysis on the Sports-1M dataset indicate that the networks learn powerful motion features. A natural question that arises is whether these features also generalize to other datasets and class categories. We examine this question in detail by performing transfer learning experiments on the UCF-101 Activity Recognition dataset. The dataset consists of 13,320 videos belonging ... We use the Slow Fusion network in these experiments as it provides the best performance. ... Interestingly, retraining the softmax layer ..."
Transfer Learning
PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-
ss20/blob/master/L13-cnns-part2/code/vgg16-transferlearning.ipynb
Visualization from
https://www.cs.toronto.edu/~frossard/post/vgg16/
VGG-16
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).
Transfer learning with VGG-16: freeze the pretrained layers, replace (and retrain) the final layers.
Transfer Learning
https://pytorch.org/docs/stable/torchvision/models.html
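The typical pattern looks like this (a minimal sketch, assuming the torchvision API of spring 2020; not the notebook's exact code):

import torch
import torchvision

model = torchvision.models.vgg16(pretrained=True)

# Freeze all pretrained parameters:
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully connected layer (1000 ImageNet classes -> 10 new
# classes); the new layer's parameters are trainable by default:
model.classifier[6] = torch.nn.Linear(4096, 10)

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)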
Transfer Learning Example
Stuff in the News
Optional Reading Material
http://www.deeplearningbook.org/contents/convnets.html
AI Gahaku
https://ai-art.tokyo/en/
AI Gahaku
More Coronavirus Datasets
https://spectrum.ieee.org/the-human-os/biomedical/imaging/hospitals-deploy-ai-tools-detect-covid19-chest-scans
More Coronavirus Datasets
"The COVIDx dataset now consists of 16,756 chest X-Rays across 13,645 patients"
https://github.com/lindawangg/COVID-Net
"COVID-Net, a deep convolutional neural network design tailored for the detection of
COVID-19 cases from chest radiography images"
https://medium.com/@sheldon.fernandez/covid-net-larger-dataset-new-models-and-covid-risknet-fd8e7c451c
More Coronavirus Datasets
https://www.nytimes.com/article/coronavirus-county-data-us.html
Note: you need to open this article in a private browser window to
be able to view it.
https://github.com/nytimes/covid-19-data
Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers.
Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your
(de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to
AGI.
Losslessly compress the 1GB file enwik9 to less than 116MB. More precisely:
• Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that
S:=S1+S2 < L := 116'673'681 = previous record.
• If run, archive.exe produces (without input from other sources) a 10^9 byte file that is identical to enwik9.
• If we can verify your claim, you are eligible for a prize of 500'000€×(1-S/L). Minimum claim is 5'000€ (1% improvement).
• Restrictions: Must run in ≲100 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine.
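Worked example (hypothetical numbers): if your total size is S = 110'840'000 bytes, then S/L ≈ 0.95 and the prize would be 500'000€ × (1 − 0.95) ≈ 25'000€.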
http://prize.hutter1.net