CSE 471: MACHINE LEARNING
Modern CNN architectures
LeNet
Input: Handwritten digits (single channel)
Output: Probability over 10 possible outcomes
At a high level, LeNet (LeNet-5) consists of two parts:
A convolutional encoder consisting of two convolutional layers
A dense block consisting of three fully connected layers
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324.
LeNet
Convolution block
Convolutional layer (5 x 5 kernel)
Sigmoid activation
Average pooling
2 x 2 (stride 2)
Spatial downsampling
Output channels
Layer 1: 6 @ 28 x 28
Layer 2: 16 @ 10 x 10
The feature map is flattened before being passed to the dense block
LeNet
Dense block
3 Fully connected layers
Layer 1: 120 neurons
Layer 2: 84 neurons
Layer 3: 10 neurons
LeNet (PyTorch)
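A minimal PyTorch sketch of the architecture as described on the preceding slides (layer sizes follow the slides; the original LeNet-5 differs in a few details, e.g. its final Gaussian connections):

```python
import torch
from torch import nn

# LeNet-5 per the slides: two conv + sigmoid + average-pool stages,
# then a dense block of 120 -> 84 -> 10 units.
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),  # 6 @ 28x28
    nn.AvgPool2d(kernel_size=2, stride=2),                    # 6 @ 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),            # 16 @ 10x10
    nn.AvgPool2d(kernel_size=2, stride=2),                    # 16 @ 5x5
    nn.Flatten(),                                             # 16*5*5 = 400
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

x = torch.randn(1, 1, 28, 28)  # a batch of one 28x28 grayscale digit
print(net(x).shape)            # torch.Size([1, 10])
```

Note how the comments track the slide's channel counts (6 @ 28x28, 16 @ 10x10) and the 400-dimensional flattened feature map feeding the dense block.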
Xavier Initialization
Let
o_i be the output of a fully connected layer (without nonlinearity)
There are n_in inputs x_j with associated weights w_ij, so o_i = Σ_j w_ij x_j
Weights are drawn independently from the same distribution, with mean 0 and variance σ²
The inputs x_j also have mean 0 and variance γ²
Independent of the weights
Independent of each other
Then E[o_i] = 0 and Var[o_i] = n_in σ² γ²
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).
Xavier Initialization
Variance can be kept fixed if
n_in σ² = 1
Following the same reasoning during backpropagation,
the gradients' variance can be kept fixed if
n_out σ² = 1
Both conditions cannot hold simultaneously (unless n_in = n_out), therefore we try to achieve the compromise
½ (n_in + n_out) σ² = 1, i.e., σ² = 2 / (n_in + n_out)
Xavier Initialization
Sampling weights from a Gaussian N(0, σ²) with σ² = 2 / (n_in + n_out)
Sampling weights from a uniform distribution U(−a, a) with a = √(6 / (n_in + n_out)), since U(−a, a) has variance a² / 3
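The uniform variant can be sketched in a few lines of NumPy (the function name and layer sizes are illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # U(-a, a) has variance a^2 / 3; choosing a = sqrt(6 / (n_in + n_out))
    # yields the target variance sigma^2 = 2 / (n_in + n_out).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(400, 120, rng)        # e.g. LeNet's first dense layer
print(W.var())                           # empirically close to 2 / (400 + 120)
```

The empirical variance of the sampled matrix should be close to 2 / (n_in + n_out) ≈ 0.0038 for this layer size.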
LeNet
AlexNet
Runs on GPU hardware
Won the ImageNet Large Scale Visual Recognition Challenge 2012 by a phenomenally large margin
Architecture
5 Convolutional layers
3 fully connected layers
ReLU activation
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional
neural networks. Advances in neural information processing systems (pp. 1097–1105).
AlexNet
Input: 224 x 224, 3-channel
11 x 11 filters in the first layer
10 times more convolution channels/filters than LeNet
Uses dropout
Image augmentation
Flipping
Cropping
Color changes
Dropout
Dropout
Drop out some neurons during training
On each iteration
Layer by layer
Different neurons get dropped in different iterations
Breaks up co-adaptation
Co-adaptation. Neural network overfitting is characterized by a state in which each layer relies on a specific pattern of activations in the previous layer.
Dropout
Need to normalize the activations of the retained nodes
Each intermediate activation h is replaced by a random variable h′:
h′ = 0 with probability p, and h′ = h / (1 − p) otherwise
The expectation remains unchanged, i.e., E[h′] = h
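This "inverted dropout" rescaling can be sketched in NumPy (the function name is illustrative):

```python
import numpy as np

def dropout(h, p, rng):
    # Zero each activation with probability p and rescale the survivors
    # by 1 / (1 - p), so that E[h'] = h elementwise.
    mask = rng.random(h.shape) >= p
    return mask * h / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100_000)
h_prime = dropout(h, p=0.5, rng=rng)
print(h_prime.mean())  # close to 1.0, matching E[h'] = h
```

Because the rescaling is done at training time, nothing needs to be changed at test time: the layer is simply left out.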
Learned filters (96)
AlexNet
LeNet vs. AlexNet
AlexNet (PyTorch)
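A PyTorch sketch of the AlexNet layer stack (a single-branch version; the original paper split computation across two GPUs, and exact padding choices vary between descriptions):

```python
import torch
from torch import nn

# AlexNet-style stack: 5 conv layers, 3 fully connected layers,
# ReLU activations, and dropout in the dense block.
net = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),                                  # 256 * 5 * 5 = 6400
    nn.Linear(256 * 5 * 5, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),                         # 1000 ImageNet classes
)

x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 1000])
```

Compared with the LeNet sketch, note the 11 x 11 first-layer filters, the roughly tenfold increase in channels, ReLU in place of sigmoid, and max pooling in place of average pooling.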
VGG
Visual Geometry Group (VGG) at Oxford University
Neurons → Layers → Blocks
Basic VGG block
A convolution layer with padding
A nonlinearity (e.g. ReLU)
A pooling layer (e.g. max pooling)
In the original VGG paper, the authors employed convolutions with 3 x 3 kernels and padding of 1 (keeping height and width) and 2 x 2 max pooling with stride of 2 (halving the resolution after each block)
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
VGG
VGG
Original VGG network
5 convolutional blocks
Block# 1, 2: 1 Conv. layer each
Block# 3, 4, 5: 2 Conv. layers each
Fully connected block
Same as AlexNet
Called VGG-11
8 Conv. Layers
3 FC layers
Uses dropout
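The block and the VGG-11 layout above can be sketched in PyTorch as follows (the `vgg_block` and `conv_arch` names are illustrative; the classifier sizes assume a 224 x 224 input, which five halvings reduce to 7 x 7):

```python
import torch
from torch import nn

def vgg_block(num_convs, in_channels, out_channels):
    # (Conv 3x3, pad 1) + ReLU repeated, then 2x2 max pooling with stride 2.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-11: (num_convs, out_channels) per block -> 1+1+2+2+2 = 8 conv layers.
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

blocks, in_channels = [], 3
for num_convs, out_channels in conv_arch:
    blocks.append(vgg_block(num_convs, in_channels, out_channels))
    in_channels = out_channels

net = nn.Sequential(
    *blocks, nn.Flatten(),
    # AlexNet-style fully connected block with dropout.
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)

x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 1000])
```

Deeper VGG variants (VGG-16, VGG-19) change only the `conv_arch` tuple, which is the point of the block abstraction.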
GoogLeNet
Won ImageNet challenge in 2014.
Investigated which kernel sizes work best
Employs a combination of variously sized kernels
The basic block is called the Inception block
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with
convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
Inception Block
4 parallel paths
Path# 1: 1 x 1 filter
Path# 2: 3 x 3 filter, pad = 1
1 x 1 filter used beforehand to reduce channels
Path# 3: 5 x 5 filter, pad = 2
1 x 1 filter used beforehand to reduce channels
Path# 4: 3 x 3 MaxPool, pad = 1
1 x 1 filter used afterwards to reduce channels
Input and output have the same height and width
Channel counts vary across the different paths; the path outputs are concatenated along the channel dimension
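The four-path block can be sketched in PyTorch as follows (interface follows a common presentation: `c2` and `c3` are (reduce, out) channel pairs for the 1 x 1 reduction plus main convolution):

```python
import torch
from torch import nn
from torch.nn import functional as F

class Inception(nn.Module):
    # c1..c4 are the output channels of the four parallel paths.
    def __init__(self, in_channels, c1, c2, c3, c4):
        super().__init__()
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # Concatenate along the channel dimension; H and W are unchanged.
        return torch.cat((p1, p2, p3, p4), dim=1)

# First inception block of GoogLeNet: 64 + 128 + 32 + 32 = 256 channels.
blk = Inception(192, 64, (96, 128), (16, 32), 32)
print(blk(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```

The padding values (1 for the 3 x 3 path, 2 for the 5 x 5 path, 1 for the pooling path) are exactly what keeps all four outputs at the same height and width so that concatenation is possible.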
GoogLeNet
GoogLeNet
7x7 filter, stride=2, pad=3, 64 channels
3x3 maxpooling
1x1 filter – 64 channels
3x3 filter – 192 channels
2 inception blocks in series
Block# 1: 64 + 128 + 32 + 32 = 256 channels
Block #2: 128 + 192 + 96 + 64 = 480 channels
And so on …
Residual networks (ResNet)
In a traditional network, each layer completely replaces the representation from the preceding layer.
Whereas traditional networks must learn to propagate information, and are subject to catastrophic failure of information propagation for bad choices of the parameters, residual networks propagate information by default.
Functional classes
For non-nested function classes, a larger function class (indicated by area) does not guarantee getting closer to the "truth" function f*. With nested function classes, by contrast, enlarging the class can never move it farther from f*.
ResNet (Intuition)
For deep neural networks, if we can train the newly added layer into an identity function f(x) = x, the new model will be at least as effective as the original model.
Since the new model may find a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.
Won the ImageNet Large Scale Visual Recognition Challenge in 2015.
ResNet Block
ResNet Block
Two 3x3 convolution layers
Same number of output channels
Batch normalization
ReLU activation
ResNet Block
Left: identical input/output channels. Right: non-identical input/output channels (a 1 x 1 convolution adjusts the skip connection)
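The two cases can be sketched in one PyTorch module (the `use_1x1conv` flag selects the 1 x 1 convolution path used when channel counts or resolution differ):

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    """Two 3x3 convs with batch norm; the output is relu(f(x) + x)."""
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, padding=1)
        # Optional 1x1 conv so the skip connection matches f(x)'s shape.
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                stride=stride) if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3 is not None:
            x = self.conv3(x)
        return F.relu(y + x)

blk = Residual(3, 3)                              # identical channels
print(blk(torch.randn(4, 3, 6, 6)).shape)         # torch.Size([4, 3, 6, 6])
blk = Residual(3, 6, use_1x1conv=True, stride=2)  # non-identical channels
print(blk(torch.randn(4, 3, 6, 6)).shape)         # torch.Size([4, 6, 3, 3])
```

If the two convolutions learn weights near zero, `forward` reduces to (approximately) the identity, which is exactly the intuition from the previous slide.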
ResNeXt block
Simplified diagram
The use of grouped convolution with g groups is g times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels b is less than the number of output channels c.
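The g-fold saving is easy to verify in PyTorch via the `groups` argument of `nn.Conv2d` (the layer sizes here are illustrative):

```python
import torch
from torch import nn

# Dense 3x3 convolution: 64 * 64 * 3 * 3 = 36,864 weights.
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)
# Grouped version with g = 8: each group maps 8 channels to 8 channels,
# so the weight count drops by a factor of g to 4,608.
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=8)

print(dense.weight.numel() // grouped.weight.numel())  # 8

# The output shapes are identical; only the channel mixing is restricted
# (each output channel sees just its group's input channels).
x = torch.randn(1, 64, 16, 16)
print(dense(x).shape == grouped(x).shape)  # True
```

ResNeXt typically follows the grouped convolution with a 1 x 1 convolution, which restores mixing across groups.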