
Deep Learning Workshop Series

CNN Architectures
Plan A

CONTENT

★ CNN Architectures (with interactive implementation)


○ VGG-Net: 3x3 vs 11x11 Convolution
○ Inception-Net: “1x1 convolution” vs “Fully Connected”
○ MobileNet: Depthwise (Separable) Convolutions for Training Light Models
○ ShuffleNet (not implemented)
○ SqueezeNet: Distributed Training of Networks
○ ResNet: Residuals in Convolution Operations
○ DenseNet: Dense Connections in Convolution Operations
★ Extras (short summary only)
○ Feature Pyramid Networks (FPNs)
○ Neural ODEs
Plan A

PROGRAM FLOW

10:00 - 10:30   3x3 vs 11x11
10:30 - 11:10   "1x1 conv" vs "FC"
11:10 - 12:00   Depthwise (separable) conv
12:00 - 12:45   BREAK
12:45 - 13:05   Channel shuffling
13:05 - 13:50   SqueezeNet
13:50 - 14:20   Residuals & Dense Connections in ConvNets
14:20 - 14:30   BREAK
14:30 - 15:00   Implementation of ResNet
15:00 - 15:30   Implementation of DenseNet
15:30 - 15:45   Extras
16:00 -         Q&A
Plan B

CONTENT

★ Part 1 : A Brief Intro to Convolution Operations


★ Part 2 : Popular CNN Architectures (interactive implementation)
○ VGG-Net: 3x3 vs 11x11 Convolution
○ Inception-Net: “1x1 convolution” vs “Fully Connected”
○ MobileNet: Depthwise (Separable) Convolutions for Training Light Models
○ SqueezeNet: Distributed Training of Networks
○ ResNet: Residuals in Convolution Operations
○ DenseNet: Dense Connections in Convolution Operations
★ Extras (short summary only)
○ Feature Pyramid Networks (FPNs)
○ Neural ODEs
Plan B

PROGRAM FLOW

10:00 - 10:30   Part 1: A Historical Review on Deep CNNs
10:30 - 11:00   3x3 vs 11x11
11:00 - 12:45   "1x1 conv" vs "FC"
12:45 - 13:30   BREAK
13:30 - 14:20   Depthwise (separable) conv
14:20 - 15:00   SqueezeNet
15:00 - 15:10   BREAK
15:10 - 15:30   Residuals & Dense Connections in ConvNets
15:30 - 16:00   Implementation of ResNet
16:00 - 16:30   Implementation of DenseNet
16:30 -         CLOSING / Q&A
Part 1 : A Brief Intro to Convolution Operations
Computer Vision

Computer vision is a field that deals with how to gain high-level understanding from images or videos.

Images are represented by pixel values, e.g. a 64 x 64 x 3 array for a 3-channel RGB image.
Goal:
Extracting meaningful features from an image:
- edges, corners, colors
- shapes, patterns
- statistical features (histograms, ...)

Tool:
Algorithms
- gray-scaling, thresholding, ...
- complex descriptors (HOG, SIFT, SURF, etc.)

...full of hand-engineering.
...well, then?

...by generalizing all these techniques: applying convolution operations along with neural nets.
● Convolutions are filtering operations

● Different filter (kernel) sizes unveil different image feature information

● Commonly in two steps:

1. Slide the same fixed kernel across the image
2. Calculate the dot product between the kernel and the image patch it covers

Example kernels: blurring, sharpening, edge detection.

Let's try together (a minimal sketch follows below).
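Below is a minimal NumPy/SciPy sketch of the filtering idea above: slide a small kernel over an image and take the dot product at each location. The blur/sharpen/edge kernels are standard textbook choices and the random "image" is a placeholder, not material from the workshop notebooks.

```python
# Minimal sketch of convolution-as-filtering, using SciPy's 2D correlation
# (CNN-style "convolution" is technically correlation). The example kernels
# (box blur, sharpen, edge) are standard choices, not necessarily the exact
# ones shown on the slide.
import numpy as np
from scipy.ndimage import correlate

image = np.random.rand(64, 64)          # a toy grayscale image

kernels = {
    "blur":    np.ones((3, 3)) / 9.0,                     # box blur
    "sharpen": np.array([[ 0, -1,  0],
                         [-1,  5, -1],
                         [ 0, -1,  0]], dtype=float),
    "edge":    np.array([[-1, -1, -1],
                         [-1,  8, -1],
                         [-1, -1, -1]], dtype=float),     # Laplacian-style edge kernel
}

for name, k in kernels.items():
    # "Slide the kernel across the image and take the dot product at each location"
    filtered = correlate(image, k, mode="constant")
    print(name, filtered.shape)          # same spatial size as the input
```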
Extracting Image Features via ConvOps

Extracting useful information from an image: sliding windows (kernels or filters) are used to convolve an input image.

(Example: an image classified as "panda".)
Feature Learning

Simple features (edges, colors) → Complex features (shapes, textures)
Network Structure

Example layer settings:
- Convolution: kernel size 3x3, stride 1, padding "same"
- Max pooling: 2x2 window, stride 2

Example 4x4 input for the max-pooling step (see the sketch below):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Convolution Visualizer
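A minimal PyTorch sketch of the layer settings above (3x3 convolution, stride 1, "same" padding, followed by 2x2 max pooling with stride 2). PyTorch is assumed here only as a convenient framework; the channel counts are illustrative.

```python
# 3x3 convolution with stride 1 and "same" padding, followed by 2x2 max pooling
# with stride 2. Layer sizes are illustrative, not taken from a specific network.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3,
                 stride=1, padding=1)    # padding=1 == "same" for a 3x3 kernel, stride 1
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 64, 64)            # (batch, channels, height, width)
print(conv(x).shape)                      # torch.Size([1, 8, 64, 64]) -- "same" padding keeps 64x64
print(pool(conv(x)).shape)                # torch.Size([1, 8, 32, 32]) -- pooling halves the spatial size

# The 4x4 example from the slide, max-pooled with a 2x2 window and stride 2:
t = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)
print(pool(t).reshape(2, 2))              # tensor([[6., 8.], [3., 4.]])
```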
A brief history...
Convolution operations were first introduced into machine learning by Yann LeCun at AT&T Bell Laboratories (LeCun et al. 1989; see also Fukushima 1980, Waibel 1987).

Backpropagation was first applied to convolutional networks by LeCun et al. (1989), later improved with gradient-based learning in the LeNet architecture (1998).

(Figure: LeNet-5 by LeCun et al., 1998 — input image and successive layers.)
REFERENCES
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition", AT&T Bell Laboratories
[2] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791
[3] The History of Neural Networks, Eugenio Culurciello, https://dataconomy.com/2017/04/history-neural-networks/
[4] Convolutions by AI Shack, Utkarsh Sinha, http://aishack.in/tutorials/image-convolution-examples/
[5] The History of Neural Networks, Andrew Fogg, https://www.import.io/post/history-of-deep-learning/
[6] Overview of Convolutional Neural Networks for Image Classification, Intel Academy, https://software.intel.com/en-us/articles/hands-on-ai-part-15-overview-of-convolutional-neural-networks-for-image-classification
[7] Convolution Arithmetic, https://github.com/vdumoulin/conv_arithmetic
Snippet Implementation
Part 2 : Convolutions in Deep Architectures
3x3 vs 11x11

● Filter size 11x11: bigger filter size, more global information captured
● Filter size 3x3: smaller filter size, more local information captured

(Figure: filters slide over the B, G, R channels of the input.)
AlexNet vs VGG

● Convolution filter sizes:


○ Alexnet : 11x11 , 3x3 , 5x5
○ VGGNet : 3x3

● Network depth:

○ AlexNet : 8 layers
○ VGGNet : 16 layers
Efficiency & Computation

                          AlexNet    VGG-16
# of convolution layers   5          13
Convolution parameters    3.8M       15M
# of FC layers            3          3
FC layer parameters       59M        59M
Total parameters          62M        138M
ImageNet error            17%        7.3%
REFERENCES

[1] Different Kinds of Convolutional Filters, Soham Chatterjee, https://www.saama.com/blog/different-kinds-convolutional-filters/
[2] Image recognition by Deep Learning on mobile, https://qiita.com/negi111111/items/c46635b5d70058ebae93
[3] Paper Explanation: VGGNet, Mohit Jain, https://mohitjain.me/2018/06/07/vggnet/
[4] CNN Architectures - VGGNet, Gary Chang, https://medium.com/deep-learning-g/cnn-architectures-vggnet-e09d7fe79c45
Interactive Implementation
1x1 (pointwise) conv
Proposed in "Network in Network" by Min Lin et al. (2013)

● Micro network - mlpconv
○ uses a multilayer perceptron (FC layers) in place of a linear filter
● Better discrimination for local patches
● Cross-channel pooling
○ weighted linear combination of feature maps → ReLU
○ complex, learnable interactions of cross-channel information
1x1 (pointwise) conv

● adds non-linearity
● doesn't care about spatial information
● feature pooling: shrinks the # of channels
● can replace fully-connected layers

Example: a 6x6x32 input convolved with 16 filters of size 1x1x32 gives a 6x6x16 output (sketch below).

❏ Decreases the computation (NxN conv → 1x1 conv)
❏ Decreases the parameters (FC → 1x1 conv)
❏ Gets more valuable combinations of filters: represents "M" features with "N" features
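A minimal PyTorch sketch of the 6x6x32 → 6x6x16 example above; the framework choice and the ReLU placement are assumptions for illustration.

```python
# A 1x1 (pointwise) convolution acts as a learned, per-pixel linear combination
# of the 32 input channels, shrinking them to 16 while leaving the spatial size
# untouched.
import torch
import torch.nn as nn

x = torch.randn(1, 32, 6, 6)                      # 6x6 feature map with 32 channels
pointwise = nn.Conv2d(32, 16, kernel_size=1)      # 16 filters of shape 1x1x32
y = torch.relu(pointwise(x))                      # ReLU adds the non-linearity
print(y.shape)                                    # torch.Size([1, 16, 6, 6])

# Parameter count: 32*16 weights + 16 biases = 528, far fewer than an FC layer
# over the flattened 6*6*32 input.
print(sum(p.numel() for p in pointwise.parameters()))   # 528
```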
Inception Module
● CNN design has a lot of parameters:
○ Conv: 3x3? 5x5? 1x1?
○ Pooling: 3x3?
● Do them all (at once)!!!
● Let the network learn whatever parameters, and whatever combination of these filter sizes, it wants to learn
● → Inception Layer
Inception → GoogLeNet

Computation (multiplications) per inception layer, for a 28x28x192 input:

● 5x5 conv, 32 filters ⇒ (28x28) x (5x5x192) x (32) ≈ 120M
● 3x3 conv, 128 filters ⇒ (28x28) x (3x3x192) x (128) ≈ 170M
● 1x1 conv, 64 filters ⇒ (28x28) x (1x1x192) x (64) ≈ 10M
● In total: ≈ 300M computations

"Bottleneck layer": 1x1 conv before the 5x5 conv
○ (28x28) x (1x1x192) x (16) ≈ 2.4M
○ + (28x28) x (5x5x16) x (32) ≈ 10M
○ ≈ 12.4M in total
● ~10x less computation! (arithmetic check below)
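A quick arithmetic check of the multiply counts above, written as plain Python; the 28x28x192 input, 32 output channels and 16-channel bottleneck follow the standard GoogLeNet example quoted on the slide.

```python
# Multiplications for a direct 5x5 convolution vs. a 1x1 bottleneck followed by
# the 5x5 convolution, on a 28x28x192 input with 32 output channels.
H = W = 28

direct_5x5 = H * W * (5 * 5 * 192) * 32            # ~120M multiplies
bottleneck = (H * W * (1 * 1 * 192) * 16           # 1x1 bottleneck to 16 channels
              + H * W * (5 * 5 * 16) * 32)         # 5x5 conv on the reduced tensor

print(f"{direct_5x5/1e6:.1f}M vs {bottleneck/1e6:.1f}M "
      f"(~{direct_5x5/bottleneck:.0f}x fewer multiplies)")
# 120.4M vs 12.4M (~10x fewer multiplies)
```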
Inception → GoogLeNet
GoogLeNet (Inception v1)

Vanishing gradient problem?

● Add two auxiliary classifiers
○ prevent the middle part of the network from "dying out"
○ regularization effect: helps prevent overfitting
● Total loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2 (sketch below)
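A minimal sketch of how the total loss above could be assembled during training; the tensors below are random placeholders standing in for the outputs of the main and auxiliary classifier heads, not GoogLeNet's actual implementation.

```python
# Combine the main loss with the two auxiliary losses (weighted by 0.3, as
# quoted above). `main_logits`, `aux1_logits`, `aux2_logits` are placeholders
# for the three classifier heads.
import torch
import torch.nn.functional as F

targets = torch.randint(0, 1000, (8,))                 # dummy labels
main_logits = torch.randn(8, 1000, requires_grad=True)
aux1_logits = torch.randn(8, 1000, requires_grad=True)
aux2_logits = torch.randn(8, 1000, requires_grad=True)

total_loss = (F.cross_entropy(main_logits, targets)
              + 0.3 * F.cross_entropy(aux1_logits, targets)
              + 0.3 * F.cross_entropy(aux2_logits, targets))
total_loss.backward()   # gradients now also flow from the middle of the network
```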
Inception v2

The premise:
● Reduce the "representational bottleneck" - loss of information caused by reducing the dimensions too much
● Smart factorization methods

The solution:
● Factorize a 5x5 conv into two 3x3 convs
○ improves computational speed
○ a 5x5 conv is 2.78x more expensive than a 3x3 conv
REFERENCES
[1] Network In Network, Min Lin, Qiang Chen, Shuicheng Yan, https://arxiv.org/pdf/1312.4400v3.pdf
[2] Network in Networks and 1x1 Convolutions, Andrew Ng, https://www.coursera.org/lecture/convolutional-neural-networks/networks-in-networks-and-1x1-convolutions-ZTb8x
[3] One by One [1x1] Convolution - counter-intuitively useful, Aaditya Prakash, https://iamaaditya.github.io/2016/03/one-by-one-convolution/
[4] Deep Learning series: Convolutional Neural Networks, Mike Cavaioni, https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524
[5] Going Deeper with Convolutions, Christian Szegedy et al., http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
[6] A Simple Guide to the Versions of the Inception Network, Bharath Raj, https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
Interactive Implementation
Depthwise convolution

Difference:
● So far:
○ 2D convolutions are performed over all input channels
○ this lets us mix channels
● Depthwise convolution:
○ each channel is kept separate

Approach:
● Split the input tensor into channels & split the kernel into channels
● For each channel, convolve the input with the corresponding filter → a 2D tensor
● Stack the output (2D) tensors back together
Depthwise separable conv
● Depthwise convolution is commonly used in combination with an additional step → depthwise separable convolution
○ 1. Filtering
○ 2. Combining

Depthwise separable convolution:
● Depthwise convolution → 1x1 convolution across channels
● Much fewer operations (sketch below)
○ Input: 8x8x3, output: 8x8x256, 5x5 kernels
○ Original conv:
■ (8x8) x (5x5x3) x (256) → 1,228,800
○ Depthwise separable conv:
■ (8x8) x (5x5x1) x (3) → 4,800
■ (8x8) x (1x1x3) x (256) → 49,152
■ Total: 53,952
○ 1,228,800 / 53,952 ≈ 23x fewer multiplications
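A minimal PyTorch sketch of a depthwise separable convolution, matching the 8x8x3 → 8x8x256 example above; `groups=in_channels` is what keeps each channel separate, and the module layout is an illustrative assumption, not MobileNet's exact block.

```python
# Depthwise separable convolution: a depthwise 2D convolution (one 5x5 filter
# per input channel, groups=in_channels) followed by a 1x1 pointwise
# convolution that mixes channels.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=5, padding=2):
        super().__init__()
        # Step 1 (filtering): one kernel per input channel, channels kept separate
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # Step 2 (combining): 1x1 convolution across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 8, 8)
print(DepthwiseSeparableConv(3, 256)(x).shape)   # torch.Size([1, 256, 8, 8])
```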
MobileNet

● Depthwise separable convolution


● Shrinking hyperparameters:
○ Width multiplier (α): adjusts the number of channels
○ Resolution multiplier (ρ): adjusts the input image and feature map spatial dimensions
REFERENCES
[1] Depthwise separable convolutions for machine learning, Eli Bendersky, https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
[2] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Andrew G. Howard et al. (Google Inc.), https://arxiv.org/pdf/1704.04861.pdf
[3] Xception: Deep Learning with Depthwise Separable Convolutions, Francois Chollet, https://arxiv.org/pdf/1610.02357.pdf
[4] A Basic Introduction to Separable Convolutions, Chi-Feng Wang, https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
Interactive Implementation
Group Convolutions

Grouped Convolutions:
● First proposed in AlexNet
○ due to memory constraints
● Decreases the number of operations
○ 2 groups → 2x fewer operations
● (+) Learns better representations
○ feature relationships are sparse
● (-) Outputs from a certain channel are derived from only a small fraction of the input channels
Channel shuffling

Channel shuffling:
● Eliminates the main side effect of grouped convolutions:
○ outputs from a certain channel are derived from only a small fraction of the input channels
● Solution:
○ conv1_output → channel shuffling → conv2_input
● Applies group convolutions on the 1x1 layers as well
● (!) channel shuffling is also differentiable (sketch below)

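A minimal sketch of the channel shuffle operation itself (as used in ShuffleNet); the tiny 8-channel tensor is only there to make the interleaving visible.

```python
# Channel shuffle: reshape the channels into (groups, channels_per_group),
# transpose, and flatten back, so the next grouped convolution sees channels
# from every group. Built only from reshape/transpose, hence differentiable.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()            # interleave the groups
    return x.reshape(n, c, h, w)

x = torch.arange(8).float().reshape(1, 8, 1, 1)   # channels 0..7
print(channel_shuffle(x, groups=2).flatten())     # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```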
ShuffleNet

Channel shuffling:
● Applies group convolutions on the 1x1 layers as well
○ by grouping filters, computation decreases significantly
● (!) remember the side effect of grouped convolutions
○ channel shuffling addresses this issue
● (!) channel shuffling is also differentiable

(Figure: bottleneck unit with depthwise convolution vs. ShuffleNet unit with pointwise group convolution.)
REFERENCES
[1] Convolutions Types, Illarion Khlestov, https://ikhlestov.github.io/pages/machine-learning/convolutions-types/#depthwise-separable-convolutions-separable-convolutions
[2] A Tutorial on Filter Groups (Grouped Convolution), Yani Ioannou, https://blog.yani.io/filter-group-tutorial/
[3] ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, https://arxiv.org/abs/1707.01083
SqueezeNet

...is:
● a smart and small architecture which proposes:
● → AlexNet-level accuracy (on ImageNet) with
○ 50x fewer parameters
○ 500x fewer parameters after compression
● → 3 times faster
● → a Fully Convolutional Network (FCN), i.e. no FC layer
SqueezeNet

SqueezeNet uses multiple tricks (trick → gain or analysis outcome):

● Replace 3x3 filters with 1x1 (pointwise) filters → reduces the computation by 1/9
● Use 1x1 filters as a bottleneck layer → reduces the depth (number of channels) fed to the 3x3 filters, reducing computation
● Use 3x3 filters in the fire module → affects final accuracy
● Late downsampling → preserves feature map spatial dimensions
● Network compression → smaller networks can also be compressed
● 1x1 vs. 3x3 ratio analysis → accuracy trade-off
● Bypass connections → help alleviate the representational bottleneck caused by the squeeze layer (in the fire module)
SqueezeNet

Fire module (sketch below):
● Squeeze layer: only 1x1 filters (bottleneck)
● Expand layer: 1x1 and 3x3 filters
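A minimal PyTorch sketch of a fire module along the lines described above; the channel sizes (16 squeeze, 64 + 64 expand) are illustrative, and batch norm/pooling details are omitted.

```python
# Fire module: a 1x1 "squeeze" layer acting as a bottleneck, followed by an
# "expand" layer whose 1x1 and 3x3 outputs are concatenated along channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze   = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)

    def forward(self, x):
        s = F.relu(self.squeeze(x))                       # bottleneck: few channels
        return torch.cat([F.relu(self.expand1x1(s)),
                          F.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64, 64)(x).shape)   # torch.Size([1, 128, 55, 55])
```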
REFERENCES
[1] Notes on SqueezeNet, Hao Gao, https://medium.com/@smallfishbigsea/notes-of-squeezenet-4137d51feef4
[2] Review: SqueezeNet (Image Classification), Sik-Ho Tsang, https://towardsdatascience.com/review-squeezenet-image-classification-e7414825581a
[3] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer, https://arxiv.org/abs/1602.07360
Interactive Implementation
Residuals in ConvNets
Power of going deeper = richer contextual information

(Figure: stacking n-layer networks; low-level features → high-level features.)

Is there any limitation to having more depth?

Gradients vanish after repeated layer operations.

(Plots: accuracy vs. epochs; gradient norm vs. iteration.)
Recap : Backpropagation

● Backpropagation computes the relation of outputs to inputs (gradients).
● These are the gradients of a single node - now imagine all the calculations across the network.
● All we want is a good parameter update!
● (Figure: forward operations vs. backward gradient flow — gradients that start strong near the output become weak by the time they reach the early layers.)
The core idea of ResNet is introducing an "identity shortcut (residual) connection":

● Standard connection vs. a shortcut that skips one or more layers
● Easy gradient flow via the identity shortcuts (sketch below)

(Figures: Plain, ResNet, and ResNeXt blocks, each taking the same input.)
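A minimal PyTorch sketch of the identity shortcut idea: the block computes a residual F(x) and adds the input back. Real ResNet blocks also use batch norm and a projection shortcut when shapes change; those details are left out here.

```python
# Basic residual block: two 3x3 convolutions learn a residual F(x), and the
# input is added back so gradients can flow straight through the addition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))   # F(x): the learned residual
        return F.relu(x + residual)                    # identity shortcut: x + F(x)

x = torch.randn(1, 64, 32, 32)
print(BasicResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```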
REFERENCES
[1] DenResNet: Ensembling Dense Networks and Residual Networks, Victor Cheung, http://cs231n.stanford.edu/reports/2017/pdfs/933.pdf
[2] An Overview of ResNet and its Variants, Vincent Fung, https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
[3] The Efficiency of Densenet, Hao Gao, https://medium.com/@smallfishbigsea/densenet-2b0889854a92
[4] Understanding and Implementing Architectures of ResNet and ResNeXt, Prakash Jay, https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cc5d0adf648e
[5] Hand-Gesture Classification using Deep Convolution and Residual Neural Network, Sandipan Dey, https://sandipanweb.wordpress.com/2018/01/20/hand-gesture-classification-using-deep-convolution-and-residual-neural-network-with-tensorflow-keras-in-python/
Extras:

ResNet vs ResNeXt vs Inception-ResNet

● Trend of split-transform-merge
● Minor changes on ResNet
● Inception style in ResNet
● Similar convolution topology

(Figures: ResNet, ResNeXt, and Inception-ResNet blocks.)
Interactive Implementation
Dense Connections in ConvNets

● Connect every layer to one another
● Transition layer: 1x1 conv + pooling

● Standard connectivity: successive convolutions
● ResNet connectivity: element-wise feature summation
● DenseNet connectivity: feature concatenation (sketch below)

● Power of feature reuse
● More shortcut connections → better gradient flow
● Supervision to gradients
● Fewer parameters, computationally efficient
● Maintains low complexity
★ Bottleneck layer

(Plots: error vs. number of parameters & computation for ResNet and DenseNet.)
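A minimal PyTorch sketch of DenseNet-style connectivity: each layer takes the concatenation of all previous feature maps and adds `growth_rate` new channels. Channel sizes and the number of layers are illustrative; batch norm and transition layers are omitted.

```python
# Tiny dense block: every layer receives the concatenation of all previous
# feature maps (feature reuse) and contributes `growth_rate` new channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            out = F.relu(conv(torch.cat(features, dim=1)))  # feature concatenation
            features.append(out)                            # reuse all earlier features
        return torch.cat(features, dim=1)

x = torch.randn(1, 24, 16, 16)
print(TinyDenseBlock(24)(x).shape)   # torch.Size([1, 60, 16, 16])  (24 + 3*12)
```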
REFERENCES
[1] DenResNet: Ensembling Dense Networks and Residual Networks, Victor Cheung, http://cs231n.stanford.edu/reports/2017/pdfs/933.pdf
[2] An Overview of ResNet and its Variants, Vincent Fung, https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
[3] The Efficiency of Densenet, Hao Gao, https://medium.com/@smallfishbigsea/densenet-2b0889854a92
[4] Understanding and Implementing Architectures of ResNet and ResNeXt, Prakash Jay, https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cc5d0adf648e
[5] Hand-Gesture Classification using Deep Convolution and Residual Neural Network, Sandipan Dey, https://sandipanweb.wordpress.com/2018/01/20/hand-gesture-classification-using-deep-convolution-and-residual-neural-network-with-tensorflow-keras-in-python/
Part 3 : Extras
State-of-the-art:

(Plots: accuracy vs. number of parameters and number of FLOPs for state-of-the-art models.)
Source: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
Neural ODEs

● So far we have been using the layered (discrete) approach in neural networks
● It works well to differentiate classes
● but falls short when it comes to continuous events
● such as health records taken at random times...

Is it possible to achieve continuity?
A Neural Net

Let's look a bit closer then:

● A plain network vs. a residual network: the residual update is very similar to an Euler equation
● Re-parameterizing continuous dynamics of hidden states by an ODE

(Figure: ResNet vs. ODE-Net; a sketch of the Euler-step analogy follows below.)
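A minimal sketch of the Euler analogy above: a residual update h ← h + f(h) is one Euler step of dh/dt = f(h, t) with step size 1, and shrinking the step size moves toward the continuous (ODE-Net) view. The dynamics function `f` below is a toy stand-in, not a trained network.

```python
# One Euler step with dt=1 looks exactly like a ResNet-style residual update;
# many small steps approximate the continuous trajectory of the ODE.
import torch

def f(h, t):                       # stand-in for a small neural net f(h, t, theta)
    return -0.5 * h

def euler_integrate(h0, t0=0.0, t1=1.0, num_steps=4):
    h, t = h0, t0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        h = h + dt * f(h, t)       # residual update, scaled by the step size dt
        t = t + dt
    return h

h0 = torch.ones(3)
print(euler_integrate(h0, num_steps=1))    # one big step = one residual block
print(euler_integrate(h0, num_steps=100))  # many small steps ≈ exp(-0.5) * h0
```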
Feature Pyramid Networks

Top-down pathway restores resolution with rich semantic information.

Bottom-up pathway:
● applies ResNet to downscale via convolutions

Top-down pathway:
● applies 1x1 (lateral) convolutions and nearest-neighbour upsampling, followed by element-wise addition of feature maps (sketch below)
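A minimal PyTorch sketch of one top-down merge step in an FPN: a 1x1 lateral convolution on the bottom-up map, nearest-neighbour upsampling of the coarser top-down map, then element-wise addition. The 512/256 channel counts and 28x28/14x14 sizes are illustrative assumptions.

```python
# One FPN merge step: lateral 1x1 conv + nearest-neighbour upsampling +
# element-wise addition of the two feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)     # 1x1 conv on the bottom-up map

c4 = torch.randn(1, 512, 28, 28)                 # bottom-up feature map (e.g. a ResNet stage)
p5 = torch.randn(1, 256, 14, 14)                 # coarser top-down feature map

p4 = lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
print(p4.shape)                                  # torch.Size([1, 256, 28, 28])
```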
Interactive Implementation
Appendix : State-of-the-art
Cheat Sheet
Convolution Operations
> Convolutions are basically a filtering operation used in the CV world
> Extract useful information from images
> Sliding windows (kernels or filters) are used to convolve an input image

Convolution in CNN Architectures
> convolution: filtering
> stride: sliding step size
> padding: controls output size
> pooling: downsampling

1x1 conv
> feature pooling
> decreases parameters
> decreases computation
> adds nonlinearity
> example: 6x6x32 input * 1x1x32x16 filters → 6x6x16 output

Inception/GoogLeNet
> use bottleneck layers
> decreases computation (10x)
> auxiliary loss layers
> factorize bigger conv layers
> regularization

Depthwise Convolution
> each kernel is kept separate
> split input & kernels into channels
> convolve each input channel with the corresponding filter channel
> stack the output (2D) tensors back together

MobileNet
> depthwise separable convolutions
> shrinking hyperparameters
> width multiplier: adjusts # of channels
> resolution multiplier: adjusts input image and feature map resolutions

Channel Shuffling - ShuffleNet
> eliminates the main side effect of grouped convs
> side effect: outputs are derived only from certain channels; shuffle the channels after grouped convolutions
> apply group convolutions also on 1x1 layers
> note: channel shuffling is also differentiable!

11x11 vs 3x3
> Bigger filters capture more global information
> Smaller filters capture more local information
> AlexNet uses 11x11, 5x5 and 3x3
> VGGNet uses only 3x3 filters
> VGGNet proved the effectiveness of going deeper

Residual Nets
> Identity shortcut (residual) connections
> Help gradient flow
> Skip one or more layers
> Deeper architectures work better

ResNeXt
> Inception style in ResNet
> Depth concatenation, same convolution topology
> Having high cardinality helps in decreasing validation error
> New hyper-parameter: cardinality → width size

DenseNet
> Connecting all layers to one another
> Strong gradient flow
> More diversified features
> Allows feature re-use
> More memory hungry, but computationally more efficient

THANK YOU FOR JOINING US TODAY!
Machine Learning Tokyo
