Lecture 7-8
The slides have in part been adapted from Ian Goodfellow's book slides and Alex Smola's Dive into Deep Learning book slides
Convolutional Networks
Classifying Dogs and Cats in Images
100 neurons
36M features
h = σ(Wx + b)
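The point of the slide can be made concrete with a bit of arithmetic (illustrative numbers, not from the slides): a single dense layer h = σ(Wx + b) connecting 36M input features to 100 hidden neurons already needs billions of parameters.

```python
# Parameter count for one fully connected layer h = sigma(W x + b).
# Illustrative sizes: ~36M pixel features in, 100 hidden neurons out.
inputs = 36_000_000   # flattened input features
hidden = 100          # neurons in the hidden layer

weights = inputs * hidden   # entries in W
biases = hidden             # entries in b
print(f"{weights + biases:,} parameters")  # ~3.6 billion
```

This parameter explosion is exactly what motivates replacing the dense layer with convolutions.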
Convolution
2-D Convolution (Cross Correlation)
0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19,
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25,
3 × 0 + 4 × 1 + 6 × 2 + 7 × 3 = 37,
(vdumoulin @ GitHub)
4 × 0 + 5 × 1 + 7 × 2 + 8 × 3 = 43.
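The four products above can be reproduced with a short cross-correlation sketch (NumPy assumed; `corr2d` is an illustrative name, not a library function):

```python
import numpy as np

def corr2d(X, K):
    """2-D cross-correlation: slide kernel K over X, no padding, stride 1."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = np.arange(9).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]
K = np.arange(4).reshape(2, 2)   # [[0,1],[2,3]]
print(corr2d(X, K))              # [[19. 25.] [37. 43.]]
```

Each output entry is one of the four sums worked out above.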
Two Principles
• Translation invariance
• Locality
Idea #1 - Translation Invariance
h_{i,j} = Σ_{a=−Δ}^{Δ} Σ_{b=−Δ}^{Δ} v_{a,b} x_{i+a,j+b}
2-D Convolution Layer
Sharpen
(wikipedia)
Gaussian Blur
Examples
(Rob Fergus)
Gabor filters
@medium
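The classic kernels named above can be written out explicitly; this is a sketch using the standard 3×3 variants (values are the common textbook/Wikipedia ones, not taken from the slides):

```python
import numpy as np

# Sharpen: boosts the center pixel, subtracts its 4 neighbours.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# 3x3 Gaussian blur: a weighted average of the neighbourhood.
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Both kernels sum to 1, so flat image regions pass through unchanged.
print(sharpen.sum(), gaussian.sum())  # 1 1.0
```

Plugging either array in as the kernel of a 2-D cross-correlation yields the corresponding filtered image.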
Cross Correlation vs Convolution
• 1-D: y_i = Σ_{a=1}^{h} w_a x_{i+a}
• 3-D: y_{i,j,k} = Σ_{a=1}^{h} Σ_{b=1}^{w} Σ_{c=1}^{d} w_{a,b,c} x_{i+a,j+b,k+c}
• Text • Video
• Voice • Medical images
• Time series
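The 1-D case in the formula above (used for text, voice, and time series) is a direct specialization of the 2-D code; a minimal sketch with NumPy:

```python
import numpy as np

def corr1d(x, w):
    """1-D cross-correlation: y_i = sum_a w_a * x_{i+a} (0-indexed)."""
    h = len(w)
    return np.array([np.dot(w, x[i:i + h]) for i in range(len(x) - h + 1)])

x = np.array([0, 1, 2, 3, 4])
w = np.array([1, 2])
print(corr1d(x, w))  # [ 2  5  8 11]
```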
Padding and Stride
courses.d2l.ai/berkeley-stat-157
Padding
0×0+0×1+0×2+0×3=0
Padding
• A common choice is 2p = k − 1
Stride
0×0+0×1+1×2+2×3=8
0×0+6×1+0×2+0×3=6
Stride
• Given stride s_h for the height and stride s_w for the width, the output shape per dimension is ⌊(n + 2p − k + s)/s⌋
• With 2p = k − 1, the stride-1 output n + 2p − k + 1 becomes n, so the overall output shape is roughly (n_h/s_h) × (n_w/s_w) (exact when n is divisible by s)
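The output-shape rule is easy to check numerically; a small helper (illustrative, `p` is padding per side so total padding is 2p):

```python
from math import floor

def conv_out(n, k, p, s):
    """Output size for input n, kernel k, padding p per side, stride s."""
    return floor((n + 2 * p - k + s) / s)

print(conv_out(28, 5, 0, 1))  # 24: stride 1, no padding
print(conv_out(28, 5, 2, 1))  # 28: 2p = k - 1 keeps the size
print(conv_out(28, 5, 2, 2))  # 14: stride 2 halves it (28 / 2)
```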
Multiple Input and
Output Channels
Multiple Input Channels
(1 × 1 + 2 × 2 + 4 × 3 + 5 × 4)
+(0 × 0 + 1 × 1 + 3 × 2 + 4 × 3)
= 56
Multiple Input Channels
Y = Σ_{i=0}^{c_i − 1} X_{i,:,:} ⋆ W_{i,:,:}
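The worked sum of 56 above can be reproduced by summing one per-channel cross-correlation per input channel, as in the formula; a self-contained sketch (NumPy assumed, function names illustrative):

```python
import numpy as np

def corr2d(X, K):
    """Single-channel 2-D cross-correlation, no padding, stride 1."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    """Sum per-channel cross-correlations: Y = sum_i X_i * K_i."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

# Two input channels: channel 0 uses kernel [[0,1],[2,3]] on [[0..8]],
# channel 1 uses kernel [[1,2],[3,4]] on [[1..9]], matching the example.
X = np.stack([np.arange(9).reshape(3, 3),
              np.arange(1, 10).reshape(3, 3)])
K = np.stack([np.arange(4).reshape(2, 2),
              np.arange(1, 5).reshape(2, 2)])
print(corr2d_multi_in(X, K))  # top-left entry is 56
```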
Multiple Output Channels
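For multiple output channels, each output channel gets its own full set of c_i kernels, so the kernel tensor has shape (c_o, c_i, k_h, k_w); a sketch under that assumption, stacking one multi-input result per output channel:

```python
import numpy as np

def corr2d(X, K):
    """Single-channel 2-D cross-correlation, no padding, stride 1."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    return sum(corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    """K: (c_o, c_i, kh, kw). One multi-input correlation per output channel."""
    return np.stack([corr2d_multi_in(X, k) for k in K])

X = np.stack([np.arange(9).reshape(3, 3),
              np.arange(1, 10).reshape(3, 3)])    # c_i = 2
K1 = np.stack([np.arange(4).reshape(2, 2),
               np.arange(1, 5).reshape(2, 2)])    # kernels for one output channel
K = np.stack([K1, K1 + 1, K1 + 2])                # c_o = 3
print(corr2d_multi_in_out(X, K).shape)            # (3, 2, 2)
```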
Pooling
• Convolution is sensitive to position: shifting the input X by 1 pixel can turn the output Y to all zeros
• Detect vertical edges
max(0,1,3,4) = 4
2-D Max Pooling
Tolerant to 1-pixel shift
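The max(0,1,3,4) = 4 entry above comes from a 2×2 max-pooling window; a minimal NumPy sketch (stride 1, matching the example):

```python
import numpy as np

def pool2d_max(X, p):
    """p x p max pooling, stride 1 (as in the example above)."""
    Y = np.zeros((X.shape[0] - p + 1, X.shape[1] - p + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = X[i:i + p, j:j + p].max()
    return Y

X = np.arange(9).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]
print(pool2d_max(X, 2))          # [[4. 5.] [7. 8.]]
```

Because each output takes the maximum over a window, a 1-pixel shift of the input often leaves the pooled output unchanged.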
Padding, Stride, and Multiple Channels
MNIST
• Centered and scaled
• 50,000 training examples
• 10,000 test examples
• 28 x 28 images
• 10 classes
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, 1998: Gradient-based learning applied to document recognition
Expensive if we have many outputs
gluon-cv.mxnet.io
LeNet in MXNet
from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='tanh'))
    net.add(gluon.nn.AvgPool2D(pool_size=2))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='tanh'))
    net.add(gluon.nn.AvgPool2D(pool_size=2))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(500, activation='tanh'))
    net.add(gluon.nn.Dense(10))
loss = gluon.loss.SoftmaxCrossEntropyLoss()
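It can help to trace the feature-map sizes through this network for a 28×28 MNIST input, using the output-shape formula from earlier (a sketch; it assumes the pooling layers use stride equal to their pool size, which is Gluon's default):

```python
def conv_out(n, k, p=0, s=1):
    """Output size for input n, kernel k, padding p per side, stride s."""
    return (n + 2 * p - k + s) // s

n = 28                   # MNIST input
n = conv_out(n, 5)       # Conv2D 20 ch, 5x5 -> 24
n = conv_out(n, 2, s=2)  # AvgPool 2x2      -> 12
n = conv_out(n, 5)       # Conv2D 50 ch, 5x5 -> 8
n = conv_out(n, 2, s=2)  # AvgPool 2x2      -> 4
print(n, 50 * n * n)     # 4 800: Flatten feeds 800 features into Dense(500)
```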
• Convolutional layer
  • Reduced model capacity compared to a dense layer
  • Efficient at detecting spatial patterns
  • High computational complexity
  • Control the output shape via padding, strides, and channels
• Max/Average Pooling layer
  • Provides some degree of invariance to translation