Lecture 7-8

The document discusses convolutional and pooling layers in deep learning, highlighting their roles in image classification tasks such as distinguishing between dogs and cats. It explains key concepts like 2-D convolution, translation invariance, locality, padding, stride, and the architecture of convolutional networks, including the LeNet model. Additionally, it covers the importance of pooling layers for achieving invariance to translation and reducing computational complexity.


Deep Learning

Convolutional and Pooling Layers

Dr. Ahsen Tahir

The slides have been adapted in part from Ian Goodfellow's book slides and Alex Smola's Dive into Deep Learning book slides
Convolutional Networks
Classifying Dogs and Cats in Images

• Use a good camera


• A 12-megapixel RGB image has 36M elements
• A single-hidden-layer MLP with 100 hidden units
would have 3.6 billion parameters
• That exceeds the population of dogs
and cats on Earth
(900M dogs + 600M cats)
Flashback - Network with one hidden layer

36M features → 100 hidden neurons

3.6B parameters = 14GB (at 4 bytes each)

h = σ(Wx + b)
Convolution
2-D Convolution (Cross Correlation)

0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19,
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25,
3 × 0 + 4 × 1 + 6 × 2 + 7 × 3 = 37,
4 × 0 + 5 × 1 + 7 × 2 + 8 × 3 = 43.

(animation: vdumoulin @ GitHub)
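The sliding-window computation above can be sketched in a few lines of plain Python (a minimal illustration, not a library implementation; the name `corr2d` is chosen here):

```python
def corr2d(X, K):
    """2-D cross-correlation of input X with kernel K (lists of lists)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]
print(corr2d(X, K))  # [[19, 25], [37, 43]]
```

The four printed values match the four dot products worked out on the slide.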
Two Principles
• Translation
Invariance
• Locality
Idea #1 - Translation Invariance

h_{i,j} = ∑_{a,b} v_{i,j,a,b} x_{i+a,j+b}

• A shift in x also leads to a shift in h

• v should not depend on (i,j). Fix via v_{i,j,a,b} = v_{a,b}

h_{i,j} = ∑_{a,b} v_{a,b} x_{i+a,j+b}

That's a 2-D convolution (strictly, a cross-correlation)
Idea #2 - Locality

h_{i,j} = ∑_{a,b} v_{a,b} x_{i+a,j+b}

• We shouldn't look very far from x(i,j) in order to assess
what's going on at h(i,j)
• Outside the range |a|, |b| > Δ the parameters vanish: v_{a,b} = 0

h_{i,j} = ∑_{a=−Δ}^{Δ} ∑_{b=−Δ}^{Δ} v_{a,b} x_{i+a,j+b}
2-D Convolution Layer

• X : n_h × n_w input matrix
• W : k_h × k_w kernel matrix
• b : scalar bias
• Y : (n_h − k_h + 1) × (n_w − k_w + 1) output matrix

Y = X ⋆ W + b
• W and b are learnable parameters
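Adding the scalar bias to the cross-correlation gives the full layer. A minimal pure-Python sketch (the name `conv2d_layer` is chosen here) that also confirms the (n_h − k_h + 1) × (n_w − k_w + 1) output shape:

```python
def conv2d_layer(X, W, b):
    """2-D convolution layer: cross-correlate X with W, then add scalar bias b."""
    kh, kw = len(W), len(W[0])
    nh, nw = len(X), len(X[0])
    # Output shape: (nh - kh + 1) x (nw - kw + 1)
    return [[b + sum(X[i + a][j + c] * W[a][c]
                     for a in range(kh) for c in range(kw))
             for j in range(nw - kw + 1)]
            for i in range(nh - kh + 1)]

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
W = [[0, 1], [2, 3]]
Y = conv2d_layer(X, W, b=1)
print(Y)                   # [[20, 26], [38, 44]]
print(len(Y), len(Y[0]))   # 2 2  ->  (3 - 2 + 1) x (3 - 2 + 1)
```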
Examples
Edge Detection

Sharpen

(wikipedia)

Gaussian Blur
Examples

(Rob Fergus)
Gabor filters

@medium
Cross Correlation vs Convolution

• 2-D Cross-Correlation

y_{i,j} = ∑_{a=1}^{h} ∑_{b=1}^{w} w_{a,b} x_{i+a,j+b}

• 2-D Convolution

y_{i,j} = ∑_{a=1}^{h} ∑_{b=1}^{w} w_{−a,−b} x_{i+a,j+b}

• No difference in practice, due to symmetry: a learned kernel
simply comes out flipped
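The relationship can be checked directly: true convolution is just cross-correlation with the kernel flipped along both axes, so the two coincide exactly when the kernel is symmetric under a 180° flip (a small illustrative sketch; names chosen here):

```python
def corr2d(X, K):
    """2-D cross-correlation."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def conv2d(X, K):
    # True convolution = cross-correlation with the kernel flipped in both axes
    flipped = [row[::-1] for row in K[::-1]]
    return corr2d(X, flipped)

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
sym = [[0, 1], [1, 0]]    # unchanged by a 180-degree flip
asym = [[0, 1], [2, 3]]
print(conv2d(X, sym) == corr2d(X, sym))    # True
print(conv2d(X, asym) == corr2d(X, asym))  # False
```

For learning this distinction is irrelevant: if cross-correlation can learn kernel K, convolution can learn its flip.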


1-D and 3-D Cross Correlations

• 1-D

y_i = ∑_{a=1}^{h} w_a x_{i+a}

• Text
• Voice
• Time series

• 3-D

y_{i,j,k} = ∑_{a=1}^{h} ∑_{b=1}^{w} ∑_{c=1}^{d} w_{a,b,c} x_{i+a,j+b,k+c}

• Video
• Medical images
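The 1-D case is the same sliding dot product over a sequence (a small sketch; the kernel and names here are illustrative):

```python
def corr1d(x, w):
    """1-D cross-correlation: y_i = sum_a w_a * x_{i+a}."""
    k = len(w)
    return [sum(w[a] * x[i + a] for a in range(k))
            for i in range(len(x) - k + 1)]

# A moving-difference kernel [-1, 1] highlights changes in a sequence,
# useful for, e.g., time series
print(corr1d([1, 1, 2, 4, 4], [-1, 1]))  # [0, 1, 2, 0]
```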
Padding and Stride

courses.d2l.ai/berkeley-stat-157
Padding

• Given a 32 x 32 input image


• Apply convolutional layer with 5 x 5 kernel
• 28 x 28 output with 1 layer
• 4 x 4 output with 7 layers
• Shape decreases faster with larger kernels
• Shape reduces from n_h × n_w to
(n_h − k_h + 1) × (n_w − k_w + 1)
Padding

Padding adds rows/columns around input

0×0+0×1+0×2+0×3=0
Padding

• With padding p = 1 (one ring of zeros around the image),
each output dimension becomes

(n + 2p − k + 1)

• A common choice is 2p = k − 1, which keeps the output
the same size as the input
Stride

• Even with padding, the shape shrinks only linearly with the
number of layers

• Given a 224 x 224 input with a 5 x 5 kernel, it takes 55
layers to reduce the shape to 4 x 4
• Requires a large amount of computation
Stride

• Stride is the number of rows/columns the window moves per step


Strides of 3 and 2 for height and width

0×0+0×1+1×2+2×3=8
0×0+6×1+0×2+0×3=6
Stride

• Given stride s_h for the height and stride s_w for the width,
each output dimension is

⌊(n + 2p − k + s)/s⌋

• With 2p = k − 1 this is roughly n/s, so the output shape is

(n_h/s_h) × (n_w/s_w)
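The shape arithmetic is easy to get wrong, so here is the formula as a one-line helper (the name `conv_out_size` is chosen here) checked against the earlier examples:

```python
def conv_out_size(n, k, p=0, s=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_out_size(32, 5))             # 28: the 32x32 input with a 5x5 kernel
print(conv_out_size(224, 5, p=2))       # 224: 2p = k - 1 preserves the shape
print(conv_out_size(224, 5, p=2, s=2))  # 112: stride 2 halves it, n/s
```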
Multiple Input and
Output Channels

Multiple Input Channels

• Color image may have three RGB channels


• Converting to grayscale loses information
Multiple Input Channels

• Color image may have three RGB channels


• Converting to grayscale loses information
Multiple Input Channels

• Have a kernel for each channel, and then sum results


over channels

(1 × 1 + 2 × 2 + 4 × 3 + 5 × 4)
+(0 × 0 + 1 × 1 + 3 × 2 + 4 × 3)
= 56
Multiple Input Channels

• X : c_i × n_h × n_w input
• W : c_i × k_h × k_w kernel
• Y : m_h × m_w output

Y = ∑_{i=1}^{c_i} X_{i,:,:} ⋆ W_{i,:,:}
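The per-channel cross-correlations followed by a sum can be sketched directly, reproducing the slide's value of 56 in the top-left position (pure-Python illustration; names chosen here):

```python
def corr2d(X, K):
    """2-D cross-correlation on a single channel."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def corr2d_multi_in(Xs, Ks):
    """Cross-correlate each input channel with its kernel, then sum over channels."""
    outs = [corr2d(X, K) for X, K in zip(Xs, Ks)]
    return [[sum(o[i][j] for o in outs) for j in range(len(outs[0][0]))]
            for i in range(len(outs[0]))]

Xs = [[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
      [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
Ks = [[[0, 1], [2, 3]],
      [[1, 2], [3, 4]]]
print(corr2d_multi_in(Xs, Ks))  # [[56, 72], [104, 120]]
```

The 56 is exactly (1×1 + 2×2 + 4×3 + 5×4) + (0×0 + 1×1 + 3×2 + 4×3) from the slide.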
Multiple Output Channels

• No matter how many input channels, so far we always
get a single output channel
• We can have multiple 3-D kernels; each one generates an
output channel
• Input X : c_i × n_h × n_w
• Kernel W : c_o × c_i × k_h × k_w
• Output Y : c_o × m_h × m_w

Y_{i,:,:} = X ⋆ W_{i,:,:,:}   for i = 1, …, c_o

TensorFlow → channels last (NHWC, default)

PyTorch → channels first (NCHW, default)
Multiple Input/Output Channels

• Each output channel may recognize a particular pattern

• Kernels over the input channels recognize and combine
patterns in the inputs
1 x 1 Convolutional Layer

k_h = k_w = 1 is a popular choice. It doesn't recognize spatial
patterns, but fuses information across channels.
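A 1 × 1 convolution is just a per-pixel linear map over the channel dimension, i.e. a matrix multiply applied at every spatial position (a small sketch; the weight matrix and names here are illustrative):

```python
def corr2d_1x1(Xs, W):
    """1x1 convolution: at every pixel, mix the c_i input channels into
    c_o output channels with a weight matrix W of shape (c_o, c_i)."""
    ci, h, w = len(Xs), len(Xs[0]), len(Xs[0][0])
    return [[[sum(W[o][c] * Xs[c][i][j] for c in range(ci))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(W))]

Xs = [[[1, 2], [3, 4]],      # channel 0
      [[10, 20], [30, 40]]]  # channel 1
W = [[1, 1],    # output channel 0: sum of the inputs
     [1, -1]]   # output channel 1: difference of the inputs
print(corr2d_1x1(Xs, W))
# [[[11, 22], [33, 44]], [[-9, -18], [-27, -36]]]
```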
2-D Convolution Layer Summary

• Input X : c_i × n_h × n_w
• Kernel W : c_o × c_i × k_h × k_w
• Bias B : c_o × c_i
• Output Y : c_o × m_h × m_w

Y = X ⋆ W + B

• Complexity (number of floating point operations, FLOPs):
O(c_i c_o k_h k_w m_h m_w)
• With c_i = c_o = 100, k_h = k_w = 5, m_h = m_w = 64:
about 1 GFLOP per example
• 10 layers, 1M examples: 10 PFLOPs
(CPU: 0.15 TFLOPS = 18h, GPU: 12 TFLOPS = 14min)
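The slide's 1 GFLOP figure can be reproduced by multiplying the six factors directly (counting one multiply-accumulate per kernel element and output position; the helper name is chosen here):

```python
def conv_macs(ci, co, kh, kw, mh, mw):
    """Multiply-accumulate count for one conv layer forward pass."""
    return ci * co * kh * kw * mh * mw

macs = conv_macs(ci=100, co=100, kh=5, kw=5, mh=64, mw=64)
print(macs)  # 1024000000 -> about 1 GFLOP, as on the slide
```

Scaling by 10 layers and 1M examples gives roughly 10^16 operations, matching the 10 PFLOPs estimate.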
Pooling Layer

Pooling
• Convolution is sensitive to position
• Example: detect vertical edges — a 1-pixel shift of the
input X yields a 0 output in Y at the old edge location
• We need some degree of invariance to translation


• Lighting, object positions, scales, appearance vary
among images
2-D Max Pooling

• Returns the maximal value in the


sliding window

max(0,1,3,4) = 4
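The sliding-window maximum can be sketched like the convolution earlier, with `max` (or the mean, for average pooling) replacing the dot product (pure-Python illustration; names chosen here):

```python
def pool2d(X, k, mode="max"):
    """k x k pooling with stride 1 over a 2-D list-of-lists input."""
    out_h, out_w = len(X) - k + 1, len(X[0]) - k + 1
    agg = max if mode == "max" else (lambda v: sum(v) / len(v))
    return [[agg([X[i + a][j + b] for a in range(k) for b in range(k)])
             for j in range(out_w)]
            for i in range(out_h)]

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(pool2d(X, 2))         # [[4, 5], [7, 8]]   (max(0,1,3,4) = 4, ...)
print(pool2d(X, 2, "avg"))  # [[2.0, 3.0], [5.0, 6.0]]
```

Note there are no learnable parameters, unlike the convolution kernel.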
2-D Max Pooling

• Returns the maximal value in the sliding window

(figure: vertical edge detection → conv output → 2 x 2 max pooling,
tolerant to a 1-pixel shift)
Padding, Stride, and Multiple Channels

• Pooling layers have similar padding


and stride as convolutional layers
• No learnable parameters
• Apply pooling for each input channel to
obtain the corresponding output
channel

#output channels = #input channels


Average Pooling

• Max pooling: the strongest pattern signal in a window


• Average pooling: replace max with mean in max pooling
• The average signal strength in a window
(figure: max pooling vs. average pooling)
LeNet Architecture
Handwritten Digit
Recognition

MNIST
• Centered and scaled
• 60,000 training images
• 10,000 test images
• 28 x 28 images
• 10 classes

Y. LeCun, L. Bottou, Y. Bengio, P. Haffner (1998).
Gradient-based learning applied to document recognition.
The fully-connected layers at the end are expensive if we
have many outputs

(gluon-cv.mxnet.io)
LeNet in MXNet

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='tanh'))
    net.add(gluon.nn.AvgPool2D(pool_size=2))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='tanh'))
    net.add(gluon.nn.AvgPool2D(pool_size=2))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(500, activation='tanh'))
    net.add(gluon.nn.Dense(10))

loss = gluon.loss.SoftmaxCrossEntropyLoss()

(size and shape inference is automatic)
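To see what the Flatten layer receives, we can trace the spatial size through the network with the output-shape formula (a sketch assuming 28 × 28 MNIST inputs, no padding, and pooling stride equal to the pool size, which match the Gluon defaults above):

```python
def conv_out(n, k, p=0, s=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 28                     # MNIST image side
n = conv_out(n, k=5)       # Conv2D(20, 5):  28 -> 24
n = conv_out(n, k=2, s=2)  # AvgPool2D(2):   24 -> 12
n = conv_out(n, k=5)       # Conv2D(50, 5):  12 -> 8
n = conv_out(n, k=2, s=2)  # AvgPool2D(2):    8 -> 4
print(n, 50 * n * n)       # 4 800: Flatten feeds 800 features to Dense(500)
```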


Summary

• Convolutional layer
• Reduced model capacity compared to a dense layer
• Efficient at detecting spatial patterns
• High computational complexity
• Control output shape via padding, strides and
channels
• Max/average pooling layer
• Provides some degree of invariance to translation

