
Modern CNN Architectures

The document discusses the LeNet convolutional neural network architecture. It consists of two parts: a convolutional encoder with two convolutional layers, and a dense block with three fully connected layers. LeNet takes in handwritten digit images and outputs the probability of the digit being one of 10 possible classes. The convolutional block uses 5x5 kernels, sigmoid activations, and 2x2 average pooling with stride 2. The feature map is flattened before passing to the dense block, which has three fully connected layers with 120, 84, and 10 neurons respectively.

Uploaded by

Arafat Hossain

CSE 471: MACHINE LEARNING

Modern CNN architectures


LeNet

• Input: Handwritten digits (single channel)
• Output: Probability over 10 possible outcomes
• At a high level, LeNet (LeNet-5) consists of two parts:
  • A convolutional encoder consisting of two convolutional layers
  • A dense block consisting of three fully connected layers

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324.
LeNet
• Convolution block
  • Convolutional layer (5 x 5 kernel)
  • Sigmoid activation
  • Average pooling
    • 2 x 2 (stride 2)
    • Spatial downsampling
  • Output channels
    • Layer 1: 6 @ 28 x 28
    • Layer 2: 16 @ 10 x 10
• The feature map is flattened before being passed to the dense block
LeNet
• Dense block
  • 3 fully connected layers
    • Layer 1: 120 neurons
    • Layer 2: 84 neurons
    • Layer 3: 10 neurons
LeNet (PyTorch)
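The slide's code is not reproduced in this transcript; a minimal PyTorch sketch matching the layer sizes above might look like the following. It assumes a 28 x 28 single-channel input with padding 2 in the first conv layer (so layer 1 outputs 6 @ 28 x 28); the variable name `lenet` is mine.

```python
import torch
from torch import nn

# Sketch of LeNet-5 per the slides: two 5x5 conv layers with sigmoid
# activations and 2x2 average pooling (stride 2), followed by three
# fully connected layers with 120, 84, and 10 neurons.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),  # 6 @ 28x28
    nn.AvgPool2d(kernel_size=2, stride=2),                    # 6 @ 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),            # 16 @ 10x10
    nn.AvgPool2d(kernel_size=2, stride=2),                    # 16 @ 5x5
    nn.Flatten(),                                             # 400 features
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),                                        # 10 classes
)
```

Passing a dummy batch through the network is a quick way to check that the flattened feature count (16 x 5 x 5 = 400) lines up with the first linear layer.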
Xavier Initialization
• Let
  • O_i – the output of some fully connected layer (without nonlinearities)
  • There are n_in inputs x_j with associated weights w_ij
  • Weights are drawn independently from the same distribution, with mean 0 and variance σ²
  • The x_j's also have mean 0 and variance γ²
    • Independent of the weights
    • Independent of each other

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).
Xavier Initialization

• The output variance can be kept fixed if
  • n_in σ² = 1
• Following the same reasoning for the gradients during backpropagation, their variance can be kept fixed if
  • n_out σ² = 1
• Since both conditions cannot generally hold at once, we instead try to achieve
  • ½ (n_in + n_out) σ² = 1
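The forward condition comes from pushing variances through one layer. Writing σ² for the weight variance and γ² for the input variance (both zero-mean and independent), a sketch of the bookkeeping:

```latex
O_i = \sum_{j=1}^{n_{\text{in}}} w_{ij} x_j,
\qquad
E[O_i] = \sum_{j} E[w_{ij}]\, E[x_j] = 0,
\qquad
\operatorname{Var}[O_i] = \sum_{j} E[w_{ij}^2]\, E[x_j^2]
                        = n_{\text{in}}\, \sigma^2 \gamma^2 .
```

Keeping Var[O_i] equal to the input variance γ² therefore requires n_in σ² = 1.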
Xavier Initialization
• Sampling weights from a Gaussian N(0, σ²), with σ = √(2 / (n_in + n_out))
• Sampling weights from a uniform distribution U(−a, a), with a = √(6 / (n_in + n_out))
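In PyTorch these two schemes correspond to `nn.init.xavier_normal_` and `nn.init.xavier_uniform_`. A small sketch (the layer sizes are arbitrary, chosen for illustration):

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

# A fully connected layer with n_in = 256 inputs and n_out = 128 outputs.
layer = nn.Linear(256, 128)

# Gaussian Xavier: weights drawn from N(0, sigma^2),
# sigma = sqrt(2 / (n_in + n_out)).
nn.init.xavier_normal_(layer.weight)
sigma = math.sqrt(2 / (256 + 128))

# Uniform Xavier: weights drawn from U(-a, a),
# a = sqrt(6 / (n_in + n_out)); note a^2 / 3 = sigma^2,
# so both choices give the same variance.
nn.init.xavier_uniform_(layer.weight)
a = math.sqrt(6 / (256 + 128))
```

The uniform bound is larger than σ by a factor of √3 precisely so that the two distributions match in variance.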
LeNet
AlexNet
• Runs on GPU hardware
• Won the ImageNet Large Scale Visual Recognition Challenge 2012 by a phenomenally large margin
• Architecture
  • 5 convolutional layers
  • 3 fully connected layers
  • ReLU activation

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems (pp. 1097–1105).
AlexNet
• Input: 224 x 224, 3-channel
• 11 x 11 filter in the first layer
• 10 times more convolution channels/filters than LeNet
• Uses dropout
• Image augmentation
  • Flipping
  • Cropping
  • Color changes
Dropout
• Drop out some neurons during training
  • On each iteration
  • Layer by layer
• Different neurons will get dropped in different iterations
• Breaks up co-adaptation

Co-adaptation: neural network overfitting is characterized by a state in which each layer relies on a specific pattern of activations in the previous layer.
Dropout
• Need to normalize the activations of the retained nodes
• Each intermediate activation h is replaced by a random variable h′
• The expectation remains unchanged, i.e., E[h′] = h
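The rescaling described above (often called inverted dropout) can be sketched as follows; the function name `dropout_layer` is mine:

```python
import torch

def dropout_layer(h, p):
    """Zero each activation with probability p and scale the survivors
    by 1/(1-p), so that E[h'] = h."""
    assert 0 <= p < 1
    if p == 0:
        return h
    # Keep each element independently with probability 1 - p.
    mask = (torch.rand_like(h) > p).float()
    return mask * h / (1.0 - p)
```

Each retained activation becomes h / (1 − p) and is kept with probability 1 − p, so the expectation is (1 − p) · h / (1 − p) = h, as the slide requires.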
Learned filters (96)
AlexNet

LeNet vs. AlexNet (architecture comparison figure)
AlexNet (PyTorch)
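The slide's code is not reproduced in this transcript; a PyTorch sketch consistent with the slides (five conv layers, three FC layers, ReLU, dropout, 224 x 224 3-channel input) might look like the following. Channel counts follow the original paper, but the single-stream layout and exact strides/paddings are assumptions on my part, and the output is 10 classes for the demo rather than ImageNet's 1000.

```python
import torch
from torch import nn

# Sketch of AlexNet per the slides: 5 conv + 3 FC layers, ReLU
# activations, and dropout in the classifier.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # 96 @ 54x54
    nn.MaxPool2d(kernel_size=3, stride=2),                    # 96 @ 26x26
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # 256 @ 26x26
    nn.MaxPool2d(kernel_size=3, stride=2),                    # 256 @ 12x12
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                    # 256 @ 5x5
    nn.Flatten(),
    nn.Linear(256 * 5 * 5, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 10),
)
```

Compared with the LeNet sketch earlier, note the roughly 10x wider channel counts, ReLU in place of sigmoid, max pooling in place of average pooling, and the dropout layers in the dense block.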
VGG
• Visual Geometry Group (VGG) at Oxford University
• Neurons → Layers → Blocks
• Basic VGG block
  • A convolution layer with padding
  • A nonlinearity (e.g., ReLU)
  • A pooling layer (e.g., max pooling)

In the original VGG paper, the authors employed convolutions with 3 x 3 kernels
with padding of 1 (keeping height and width) and 2 x 2 max pooling with stride of 2
(halving the resolution after each block).

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
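The basic VGG block described above can be sketched in PyTorch as follows, using the paper's 3 x 3 / padding-1 convolutions and 2 x 2 / stride-2 max pooling; the function name `vgg_block` is mine:

```python
import torch
from torch import nn

def vgg_block(num_convs, in_channels, out_channels):
    """num_convs 3x3 convolutions (padding 1, so height/width are kept),
    each followed by ReLU, then 2x2 max pooling with stride 2
    (halving the resolution)."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```

Stacking such blocks, e.g. `vgg_block(1, 3, 64)` followed by `vgg_block(1, 64, 128)` and so on, gives the block-level design the slide summarizes as "Neurons → Layers → Blocks".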
VGG
VGG
Original VGG network
• 5 convolutional blocks
  • Blocks 1, 2: 1 conv. layer each
  • Blocks 3, 4, 5: 2 conv. layers each
• Fully connected block
  • Same as AlexNet
• Called VGG-11
  • 8 conv. layers
  • 3 FC layers
• Uses dropout
GoogLeNet
• Won the ImageNet challenge in 2014
• Investigated which kernel sizes are best
• Employs a combination of variously sized kernels
• The basic block is called the Inception block

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with
convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
Inception Block
• 4 parallel paths
  • Path 1: 1 x 1 filter
  • Path 2: 3 x 3 filter, pad = 1
    • 1 x 1 filter used beforehand to reduce channels
  • Path 3: 5 x 5 filter, pad = 2
    • 1 x 1 filter used beforehand to reduce channels
  • Path 4: 3 x 3 max pooling, pad = 1
    • 1 x 1 filter used afterwards to reduce channels
• Input and output have the same height and width
• Channel counts vary across the paths, and the paths' outputs are concatenated
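The four-path structure above can be sketched in PyTorch as follows. The class and parameter names (`c1`–`c4`, with `c2` and `c3` as (reduce, out) pairs) are my own convention, not from the slides:

```python
import torch
from torch import nn
import torch.nn.functional as F

class Inception(nn.Module):
    """Inception block: four parallel paths whose outputs, all with the
    input's height and width, are concatenated along the channel axis."""
    def __init__(self, in_channels, c1, c2, c3, c4):
        super().__init__()
        # Path 1: 1x1 conv.
        self.p1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Path 2: 1x1 reduction, then 3x3 conv (pad 1).
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Path 3: 1x1 reduction, then 5x5 conv (pad 2).
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Path 4: 3x3 max pool (pad 1), then 1x1 conv.
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        return torch.cat((p1, p2, p3, p4), dim=1)
```

For example, `Inception(192, 64, (96, 128), (16, 32), 32)` matches the first GoogLeNet inception block described below: 64 + 128 + 32 + 32 = 256 output channels.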
GoogLeNet
• 7 x 7 filter, stride = 2, pad = 3, 64 channels
• 3 x 3 max pooling
• 1 x 1 filter – 64 channels
• 3 x 3 filter – 192 channels
• 2 Inception blocks in series
  • Block 1: 64 + 128 + 32 + 32 = 256 channels
  • Block 2: 128 + 192 + 96 + 64 = 480 channels
• And so on …
Residual networks (ResNet)

In a traditional network, each layer completely replaces the representation
from the preceding layer. Whereas traditional networks must therefore learn to
propagate information, and are subject to catastrophic failure of information
propagation for bad choices of the parameters, residual networks propagate
information by default.
Functional classes

For non-nested function classes, a larger function class (indicated by area) does
not guarantee getting closer to the "truth" function (f*). This problem does not
arise with nested function classes.
ResNet (Intuition)
• For deep neural networks, if we can train the newly added layer into an identity function f(x) = x, the new model will be as effective as the original model.
• As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.

Won the ImageNet Large Scale Visual Recognition Challenge in 2015.
ResNet Block
• Two 3 x 3 convolution layers
  • Same number of output channels
• Batch normalization
• ReLU activation
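The block above can be sketched in PyTorch as follows. The class name `Residual` and the `use_1x1conv` flag are my naming; the 1 x 1 convolution handles the case where input and output channels differ (or a stride > 1 changes the resolution), so the skip connection can still be added:

```python
import torch
from torch import nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Two 3x3 convs with the same number of output channels, batch
    normalization, ReLU, and a skip connection added before the final
    ReLU."""
    def __init__(self, in_channels, out_channels,
                 use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1)
        # 1x1 conv on the skip path for non-identical channels/resolution.
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                stride=stride) if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3 is not None:
            x = self.conv3(x)
        return F.relu(y + x)
```

With `use_1x1conv=False` the block computes f(x) + x directly, so it can trivially represent the identity by driving f toward zero, which is the intuition from the previous slide.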
ResNet Block

Identical input/output channels vs. non-identical input/output channels (figure)
ResNeXt block

Simplified diagram.
Grouped convolution with g groups is g times faster than a dense convolution. It is a
bottleneck residual block when the number of intermediate channels b is less than c.
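The factor-of-g saving comes directly from grouped convolution: with g groups, each output channel only sees b/g of the input channels, so the 3 x 3 convolution carries 1/g of the weights (and FLOPs) of a dense one. A small comparison (the channel counts below are illustrative, not from the slides):

```python
import torch
from torch import nn

b, g = 32, 8  # intermediate channels and number of groups (illustrative)

# Dense 3x3 conv: every output channel sees all b input channels.
dense = nn.Conv2d(b, b, kernel_size=3, padding=1)

# Grouped 3x3 conv: each output channel sees only b/g input channels,
# so the weight tensor is g times smaller.
grouped = nn.Conv2d(b, b, kernel_size=3, padding=1, groups=g)
```

Here `dense.weight` has b · b · 3 · 3 parameters while `grouped.weight` has b · (b/g) · 3 · 3, the factor-of-g reduction the caption refers to.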
