CNN Slides Part2

The document discusses convolutional layers in neural networks, providing detailed examples of 2D image convolution with various filters. It explains how filters are applied to images to detect features such as edges and isolated points, and highlights the importance of shared parameters across different locations. Additionally, it introduces the concept of activation maps and the stacking of multiple filters to create new image representations in the context of convolutional networks.
CS60050 Machine learning

Neural Network Architectures

Sudeshna Sarkar
Convolutional Layer: 2D example

A 2D image:
1 0 1 0 0
1 0 1 0 1
1 1 1 0 0
1 0 1 0 1
1 0 1 0 1

A filter:
-1 -1 -1
-1  1 -1
-1 -1 -1

First output value (top-left 3x3 window of the image):
  -1 +  0 + -1
+ -1 +  0 + -1
+ -1 + -1 + -1
= -7

After convolution:
-7 -2 -4
-5 -2 -5
-7 -2 -5

Convolutional Layer: 2D example

A 2D image (zero-padded):
0 0 0 0 0 0 0
0 1 0 1 0 0 0
0 1 0 1 0 1 0
0 1 1 1 0 0 0
0 1 0 1 0 1 0
0 1 0 1 0 1 0
0 0 0 0 0 0 0

A filter:
-1 -1 -1
-1  1 -1
-1 -1 -1

After convolution (before ReLU):
 0 -4  0 -3 -1
-2 -7 -2 -4  1
-2 -5 -2 -5 -2
-2 -7 -2 -5  0
 0 -4  0 -4  0

After convolution & ReLU:
0 0 0 0 0
0 0 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

Detecting isolated 1’s.
How do we detect all the 1’s?
-- Filter: 1 surrounded by 0’s.
How do we detect edges?
-- Exercise?
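
The worked example above can be checked with a few lines of NumPy. This is a minimal sketch, not the course's reference code: the helper names `conv2d` and `relu` are made up for illustration, the stride is fixed at 1, and, as is usual in CNNs, the filter is applied without flipping (cross-correlation), which makes no difference for this symmetric filter.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0, pad=0):
    """Slide `kernel` over `image` (stride 1); each output is a dot product plus bias."""
    image = np.pad(image, pad)          # zero padding on all four sides
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

def relu(x):
    return np.maximum(x, 0)

image = np.array([[1, 0, 1, 0, 0],
                  [1, 0, 1, 0, 1],
                  [1, 1, 1, 0, 0],
                  [1, 0, 1, 0, 1],
                  [1, 0, 1, 0, 1]])
kernel = np.array([[-1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1, -1]])

valid = conv2d(image, kernel)           # 3x3 output, no padding
padded = conv2d(image, kernel, pad=1)   # 5x5 output, zero padding of 1
detected = relu(padded)                 # only the isolated 1 stays positive
print(valid)
print(detected)
```

The `bias` argument is included so the same helper can be reused for the "with bias 2" variant of the example.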
Convolutional Layer: 2D example

A 2D image (zero-padded):
0 0 0 0 0 0 0
0 1 0 1 0 0 0
0 1 0 1 0 1 0
0 1 1 1 0 0 0
0 1 0 1 0 1 0
0 1 0 1 0 1 0
0 0 0 0 0 0 0

A filter:
-1 -1 -1
-1  1 -1
-1 -1 -1
with bias 2

After convolution (partially filled in, as extracted):
2 -2 ? ? ?
? 2
? 1
? 2
? ? ? ? 2

What does it detect?
■ Corner points and
■ isolated ones
Convolutional Layer: 2D example

A 2D image:
1 0 1 0 0
1 0 1 0 1
1 1 1 0 0
1 0 1 0 1
1 0 1 0 1

A filter:
w11 w12 w13
w21 w22 w23
w31 w32 w33
with bias b

What is the output size?

Convolutional Layer

Share the same parameters across different locations (assuming input is stationary):
Convolutions with learned kernels

Convolutional Layer
Learn multiple filters.

E.g.: 200x200 image


100 Filters
Filter size: 10x10
10K parameters
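
The 10K figure follows directly from weight sharing. The short sketch below redoes the arithmetic; the fully connected comparison is an added assumption for contrast and is not on the slide, and biases are ignored.

```python
# Parameter count when each filter's weights are shared across all image locations.
num_filters = 100
filter_h, filter_w = 10, 10

# One weight per position in the 10x10 window, per filter (biases ignored).
conv_params = num_filters * filter_h * filter_w

# For contrast (assumption, not on the slide): 100 units that each see
# the whole 200x200 image would need one weight per pixel.
image_h, image_w = 200, 200
dense_params = num_filters * image_h * image_w

print(conv_params)    # 10000
print(dense_params)   # 4000000
```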

Convolution Layer
32x32x3 image: 32 height, 32 width, 3 depth (color)
Slide based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
32x32x3 image, 5x5x3 filter

Filters always extend to the full depth of the input volume.

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Convolution Layer
32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolution Layer
32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

Consider a second, green filter: it produces a second 28x28x1 activation map.
Convolution Layer

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:
32x32x3 image → Convolution Layer → 6 activation maps, each 28x28

We stack these up to get a “new image” of size 28x28x6!
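
To make the shapes concrete, here is a small NumPy sketch of a bank of six 5x5x3 filters applied to a 32x32x3 input. The random image and filter values, the zero biases and the plain triple loop are illustrative assumptions; real libraries use much faster implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))      # height x width x depth
filters = rng.standard_normal((6, 5, 5, 3))   # 6 filters, each spanning the full depth
biases = np.zeros(6)

out_size = 32 - 5 + 1                         # 28 for stride 1 and no padding
activation_maps = np.empty((out_size, out_size, 6))

for k in range(6):
    for i in range(out_size):
        for j in range(out_size):
            chunk = image[i:i + 5, j:j + 5, :]            # 5x5x3 chunk of the image
            # 75-dimensional dot product plus bias gives one number
            activation_maps[i, j, k] = np.dot(chunk.ravel(), filters[k].ravel()) + biases[k]

print(activation_maps.shape)   # (28, 28, 6)
```

Each slice `activation_maps[:, :, k]` is one activation map; stacking them along the last axis gives the 28x28x6 volume.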


Multiple Input Channels

“Tensors”

Preview: ConvNet is a sequence of Convolution Layers, interspersed with non-linear activation functions

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….

Recall

Convolution

Filter

Image

Image courtesy: Vincent Dumoulin

02 Feb 2022 CS60010 / Deep Learning | ConvNets (c) Abir Das
Convolution with Zero Padding

Filter

Image



Convolution with Strides

Convolving a 3x3 kernel over a 5x5 input using 2x2 strides

Image courtesy:Vincent Dumoulin


Convolution with Strides and Zero Padding

Convolving a 3x3 kernel over a 5x5 input using 1x1 strides and zero-padding of 1

Image courtesy: Vincent Dumoulin
Padding
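
The output size in each of the stride and padding cases follows the standard relation (N + 2P - F) / S + 1 for input size N, filter size F, stride S and zero padding P. A minimal sketch follows; the function name is hypothetical, and raising an error when the filter does not tile the input evenly is a choice made here (many libraries round down instead).

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size for input size n, filter size f, stride and zero padding."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

print(conv_output_size(5, 3))                   # 3: 5x5 input, 3x3 kernel
print(conv_output_size(5, 3, stride=2))         # 2: 2x2 strides
print(conv_output_size(5, 3, stride=1, pad=1))  # 5: zero padding of 1 keeps the size
print(conv_output_size(32, 5))                  # 28
print(conv_output_size(28, 5))                  # 24
```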

Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:

Why Pooling
[Figure: an image labeled “bird” and its subsampled version, also labeled “bird”]
Subsampling pixels will not change the object.
We can subsample the pixels to make the image smaller.

Insertion of pooling layer:
• reduces the spatial size of the representation
• reduces the number of parameters and the amount of computation in the network, and hence also controls overfitting.
• The depth dimension remains unchanged.
Pool layers
• There are two types of pool layers.
• If window size = stride, this is traditional pooling.
• If window size > stride, this is overlapping pooling.
• Larger window sizes and strides are very destructive.

Max pooling (single depth slice):
1 2 5 6
3 4 2 8    →    4 8
3 4 4 2         5 6
1 5 6 3

Translational invariance at each level: by averaging four neighboring replicated detectors to give a single output to the next level.
Problem: After several levels of pooling, we have lost information about the precise positions of things.
This makes it impossible to use the precise spatial relationships
General pooling
• Other pooling functions: Average pooling, L2-norm pooling

• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
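
The forward pass with switches and the gradient routing can be sketched as follows. This is an illustrative NumPy version for a single depth slice with window size equal to stride; the function names are hypothetical, and the 4x4 input is one reading of the digits in the max pooling example.

```python
import numpy as np

def maxpool_forward(x, size=2):
    """Non-overlapping max pooling (window size = stride) on one depth slice."""
    h, w = x.shape
    # Gather the pixels of each size x size window into the last axis.
    windows = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    windows = windows.reshape(h // size, w // size, size * size)
    switches = windows.argmax(axis=2)    # position of the max inside each window
    return windows.max(axis=2), switches

def maxpool_backward(grad_out, switches, size=2):
    """Route each upstream gradient to the input position that held the max."""
    out_h, out_w = grad_out.shape
    grad_in = np.zeros((out_h * size, out_w * size))
    for i in range(out_h):
        for j in range(out_w):
            di, dj = divmod(switches[i, j], size)
            grad_in[i * size + di, j * size + dj] = grad_out[i, j]
    return grad_in

x = np.array([[1, 2, 5, 6],
              [3, 4, 2, 8],
              [3, 4, 4, 2],
              [1, 5, 6, 3]])
out, switches = maxpool_forward(x)
grad_in = maxpool_backward(np.ones((2, 2)), switches)
print(out)       # pooled 2x2 slice
print(grad_in)   # gradient is nonzero only where each window's max was
```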
Receptive field
3x3 convolutions, stride 1

The receptive field of a unit is the region of the input feature map whose values contribute to the response of that unit (either in the previous layer or in the initial image).

Receptive field size after one, two and three stacked 3x3 convolutions: 3, 5, 7
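
The growth of the receptive field with depth can be computed with a short recurrence. The sketch below is a standard formulation rather than anything stated on the slides: each layer adds (kernel - 1) times the current spacing between neighbouring units, measured in input pixels. The function name is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # widen by (k - 1) steps at the current spacing
        jump *= stride                # spacing between neighbouring units, in input pixels
    return size

for depth in (1, 2, 3):
    # stacked 3x3 convolutions with stride 1: sizes 3, 5, 7
    print(depth, receptive_field([(3, 1)] * depth))
```

With stride 1 every extra 3x3 layer adds 2 to the receptive field; a strided layer makes the later layers grow it faster.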


CNNs: typical architecture

input → Conv + ReLU → Max pool → Conv + ReLU → Max pool → flatten → fully connected → softmax

feature learning: Conv + ReLU and Max pool stages
classification: flatten, fully connected, softmax

Recall: we wanted to encode
• Spatial locality
• Translation invariance

[ https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html ]
[Figure: 32x32 input → (CONV, ReLU, CONV, ReLU, POOL) x 3 → Fully-connected]
Weights (six CONV layers, then fully connected): 280 910 910 910 910 910 1600
Size (after each POOL): 16x16 8x8 4x4
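
The weight counts 280, 910 and 1600 are consistent with one particular configuration, shown in the sketch below: 3x3 filters, 10 filters per CONV layer, a 3-channel input, one bias per filter, and a fully connected layer from the 4x4x10 volume to 10 outputs without biases. None of these settings is stated in the figure, so treat them as assumptions that happen to reproduce the numbers.

```python
def conv_weights(kernel, c_in, c_out, bias=True):
    """Weights in a conv layer: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

first_conv = conv_weights(3, 3, 10)     # assumed: 3x3 filters on an RGB input
later_conv = conv_weights(3, 10, 10)    # assumed: 3x3 filters on 10 input channels
fully_connected = 4 * 4 * 10 * 10       # assumed: 4x4x10 volume to 10 outputs, no bias

print(first_conv, later_conv, fully_connected)   # 280 910 1600
```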
CNNs: a taste of backpropagation

[Figure: input image → convolutional layer (filter, no bias) → fully connected layer → output (identity/no activation)]
LeNet
The history of deep CNNs began with the appearance of LeNet (handwritten
character recognition)
Trained on MNIST digit dataset with 60K training examples

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon MTurk
• ImageNet Large-Scale Visual Recognition
Challenge (ILSVRC):
• 1.2 million training images, 1000 classes
• 100k test

ILSVRC: ImageNet Large Scale Visual Recognition Challenge

The Problem: Classification
Classify an image into 1000 possible classes:
e.g. Abyssinian cat, Bulldog, French Terrier, Cormorant, Chickadee, red fox, banjo, barbell, hourglass, knot, maze, viaduct, etc.

Example prediction:
cat, tabby cat (0.71)
Egyptian cat (0.22)
red fox (0.11)
…..
www.image-net.org/challenges/LSVRC/
The Evaluation Metric: Top K-error
True label: Abyssinian cat

Predictions:
cat, tabby cat (0.61)
Egyptian cat (0.22)
red fox (0.11)
Abyssinian cat (0.10)
French terrier (0.03)
…..

Top-1 error: 1.0    Top-1 accuracy: 0.0
Top-2 error: 1.0    Top-2 accuracy: 0.0
Top-3 error: 1.0    Top-3 accuracy: 0.0
Top-4 error: 0.0    Top-4 accuracy: 1.0
Top-5 error: 0.0    Top-5 accuracy: 1.0
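
The top-K error for this single example can be reproduced in a few lines. This is a minimal sketch for one image: the function name is hypothetical, and a benchmark score would average this value over all test images.

```python
def top_k_error(scores, true_label, k):
    """Return 0.0 if true_label is among the k highest-scoring classes, else 1.0."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 0.0 if true_label in ranked[:k] else 1.0

scores = {
    "tabby cat": 0.61,
    "Egyptian cat": 0.22,
    "red fox": 0.11,
    "Abyssinian cat": 0.10,
    "French terrier": 0.03,
}
errors = [top_k_error(scores, "Abyssinian cat", k) for k in range(1, 6)]
print(errors)   # [1.0, 1.0, 1.0, 0.0, 0.0]
```

The true label is ranked fourth, so the error drops to 0.0 from K = 4 onward.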
Top-5 error on this competition (2012)
Imagenet Leaderboard
Alexnet
• Preprocessing and Data Augmentation
  – image translations and horizontal reflections
  – performed Principal Component Analysis (PCA) on the RGB pixel values to change the intensities of RGB channels
• Using ReLUs instead of Sigmoid or Tanh
• Dropout (randomly sets unit outputs to zero during training)
• Momentum + Weight Decay
• GPU Computation!
VGG Network: ILSVRC 2014 2nd place
Simonyan and Zisserman, 2014.
VGGnet
Imagenet Leaderboard
GoogLeNet: ILSVRC 2014 winner
• Deeper networks with computational efficiency
• 22 layers
• Efficient Inception modules
• Networks with Parallel Concatenations
• Employ a combination of variously-sized kernels.
GoogLeNet: Aggressive stem
Stem network at the start aggressively downsamples input

          Input size                                   Output size
Layer     C    H/W   filters  kernel  stride  pad     C    H/W   memory (KB)  params (K)  flop (M)
conv      3    224   64       7       2       3       64   112   3136         9           118
max-pool  64   112   -        3       2       1       64   56    784          0           2
conv      64   56    64       1       1       0       64   56    784          4           13
conv      64   56    192      3       1       1       192  56    2352         111         347
max-pool  192  56    -        3       2       1       192  28    588          0           1

Total from 224 to 28 resolution:
Memory: 7.5 MB
Params: 124K
MFLOP: 481 (sum of the per-layer values above)

Source: J. Johnson
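
The per-layer numbers in the stem table can be recomputed from the layer shapes. The sketch below assumes 4-byte activations, counts weights without biases, and counts one multiply-add per FLOP; these conventions are assumptions chosen because they reproduce the table rows. Note that the per-layer FLOPs sum to about 481 M.

```python
def conv_row(c_in, hw_in, c_out, kernel, stride, pad):
    hw_out = (hw_in + 2 * pad - kernel) // stride + 1
    memory_kb = c_out * hw_out * hw_out * 4 / 1024          # assuming 4-byte floats
    params_k = c_out * c_in * kernel * kernel / 1000        # weights only, no biases
    mflop = c_out * hw_out * hw_out * c_in * kernel * kernel / 1e6   # multiply-adds
    return hw_out, memory_kb, params_k, mflop

def pool_row(c, hw_in, kernel, stride, pad):
    hw_out = (hw_in + 2 * pad - kernel) // stride + 1
    memory_kb = c * hw_out * hw_out * 4 / 1024
    mflop = c * hw_out * hw_out * kernel * kernel / 1e6
    return hw_out, memory_kb, 0.0, mflop                    # pooling has no parameters

rows = [
    conv_row(3, 224, 64, 7, 2, 3),
    pool_row(64, 112, 3, 2, 1),
    conv_row(64, 56, 64, 1, 1, 0),
    conv_row(64, 56, 192, 3, 1, 1),
    pool_row(192, 56, 3, 2, 1),
]
total_memory_kb = sum(r[1] for r in rows)
total_params_k = sum(r[2] for r in rows)
total_mflop = sum(r[3] for r in rows)

for hw_out, memory_kb, params_k, mflop in rows:
    print(hw_out, round(memory_kb), round(params_k), round(mflop))
print(round(total_memory_kb / 1024, 1), round(total_params_k), round(total_mflop))
```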
GoogLeNet: Inception module
• Design a good network topology (network within network) and stack these
modules
• Parallel paths with different receptive field sizes and operations are meant to
capture sparse patterns of correlations in the stack of feature maps
• Use 1x1 convolutions for dimensionality reduction before expensive convolutions

Source: J. Johnson
The first three paths use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information
from different spatial sizes.
The middle two paths perform a 1×1 convolution on the input to reduce the number of channels, reducing
the model’s complexity.
The fourth path uses a 3×3 maximum pooling layer, followed by a 1×1 convolutional layer to change the
number of channels.
The four paths all use appropriate padding to give the input and output the same height and width.
Finally, the outputs along each path are concatenated along the channel dimension and comprise the
block’s output.
1x1 convolution
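
A 1x1 convolution is a per-pixel linear map across channels, which is why it is a cheap way to reduce the channel count before a larger convolution. The sketch below illustrates this with NumPy; the sizes (28x28 feature maps, 192 input channels reduced to 16, then a 5x5 convolution producing 32 channels) are hypothetical values chosen for illustration, and biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # H x W x C_in (hypothetical sizes)
w = rng.standard_normal((192, 16))       # 1x1 kernel: C_in x C_out

# A 1x1 convolution mixes channels independently at every pixel.
reduced = np.einsum("hwc,cd->hwd", x, w)
print(reduced.shape)                     # (28, 28, 16)

# Multiply-add counts for a 5x5 convolution producing 32 channels,
# without and with the 1x1 reduction in front of it.
direct = 28 * 28 * 32 * (5 * 5 * 192)
with_reduction = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)
print(direct, with_reduction)            # 120422400 12443648
```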
GoogLeNet Model
• Uses a stack of 9 inception blocks
• 22 total layers with weights
• Maximum pooling between inception blocks
reduces the dimensionality.
• After the last conv layer, a global average
pooling layer is used that spatially averages
across each feature map before final FC layer.

GAP
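
Global average pooling reduces each feature map to a single number before the final fully connected layer. A minimal NumPy sketch, with hypothetical sizes (7x7 maps, 1024 channels, 1000 classes) and random values standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((7, 7, 1024))   # H x W x C (hypothetical sizes)
w_fc = rng.standard_normal((1024, 1000))           # final FC layer weights

pooled = feature_maps.mean(axis=(0, 1))            # one spatial average per feature map
logits = pooled @ w_fc                             # class scores

print(pooled.shape, logits.shape)                  # (1024,) (1000,)
```

Because the spatial dimensions are averaged away, the FC layer needs 1024 inputs instead of 7 * 7 * 1024.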
ResNet
• Deep models have more representation power (more parameters)
than shallower models.
• But deeper models are harder to optimize
• What should the deeper model learn to be at least as good as the
shallower model?
• A solution by construction is copying the learned layers from the
shallower model and setting additional layers to identity mapping.
Resnet
• Naïve solution
– If extra layers are an identity mapping, then training errors do not increase

Very deep networks using residual connections
• 152 layers for ImageNet
• ILSVRC 2015 classification winner
Residual Blocks

• The portion within the dotted-line box needs to learn the residual mapping.
• The solid line carrying the layer input x to the addition operator is called a residual connection (or shortcut connection).
• Inputs can forward propagate faster through the residual connections across layers.
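
The residual idea can be sketched in a few lines. For brevity this illustration uses two small dense layers in place of convolutions, which is an assumption made here and not how ResNet blocks are actually built; the point is only that the block computes F(x) + x, so setting the weights of F to zero yields the identity mapping.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F made of two linear maps and a ReLU."""
    residual = relu(x @ w1) @ w2     # F(x): the part that learns the residual mapping
    return relu(residual + x)        # the shortcut connection adds the input back

x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))
y = residual_block(x, zeros, zeros)
print(y)   # equals x: with F(x) = 0 the block is an identity mapping
```

The input here is non-negative, so the final ReLU leaves it unchanged.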
Convnets Summary
