Convolutional Neural Networks - Part 1
Hina Arora
This deck is Copyright ©Hina Arora and Arizona Board of Regents. All rights reserved.
• Thus far, we have built a strong foundation in Fully Connected Feed Forward
Deep NNs.
• In the next few lectures, we will look at the application of Deep Learning in
Computer Vision tasks (specifically Image Analysis) such as Image Classification,
Object Detection, Image Segmentation, etc.
• Let’s say I’d like to build a binary classifier using a Fully Connected Dense Network
with one hidden layer of 128 units.
• I would have to first “flatten” the images before I can feed them to the Fully
Connected Network.
o That is, I would have to convert the 28x28-dimensional 2D image representations
to 784-dimensional 1D vector representations.
o Note: the number of nodes in the input layer of this dense network would be 784
(that is, each pixel in the image is considered a feature).
• The input layer then feeds into a fully connected hidden layer with 128 hidden units,
which means that for layer 1 we end up with a total of 100,480 parameters (784*128 =
100,352 weight parameters and 128 bias parameters).
[Figure: the fully connected network.
Input image: 28 x 28.
Layer 0 (Flattened Input): 784 x 1.
Layer 1 (Hidden Layer): 128 nodes, with W of shape 784 x 128 and b of shape 1 x 128, for a total of 100,480 model parameters.
Layer 2 (Output Layer): 1 node, with W of shape 128 x 1 and b of shape 1 x 1.]
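To make the parameter count concrete, here is a minimal sketch (assuming TensorFlow/Keras; layer sizes as above) of this fully connected baseline:

```python
# Minimal sketch (assuming TensorFlow/Keras) of the fully connected baseline:
# flatten 28x28 images, one hidden layer of 128 units, one sigmoid output unit.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 784 input features
    tf.keras.layers.Dense(128, activation="relu"),   # 784*128 + 128 = 100,480 params
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 128*1 + 1 = 129 params
])
model.summary()  # layer 1 shows 100,480 trainable parameters
```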
• There are several issues with this approach…
• Flattening the images means we are ignoring the spatial structure of the images.
However, images in fact have a spatial structure where neighboring pixels are
related to each other.
• Also, fully connected networks are invariant to the order of the features (in the
case of flattened images, the pixels). We could therefore shuffle the pixels in the
input image, and still get similar results – further proof that we were ignoring the
spatial relationship between the pixels.
• Finally, we are typically dealing with higher resolution color images, and a higher
number of hidden units. This can cause an explosion of model parameters,
leading to overfitting or a need for lots of training data.
• Convolutional Neural Networks (or CNNs or ConvNets)* are specialized neural
networks suitable for processing data with grid-like topologies.
• Data with grid-like topologies are data that contain a spatial relationship between
neighboring datapoints, such as 1D time-series data, and 2D image data.
• By definition, CNNs employ the “convolution” operation in at least one of their
layers. Typically, CNNs are a combination of convolution layers, pooling layers and
fully connected layers.
• Together, the convolution operations and the pooling operations help capture the
spatial relationship between pixels in CNNs. They also help reduce the number of
parameters required, making CNNs more computationally efficient.
• Let’s first review image representation. We’ll then walk through the key operations
that make up CNNs: the convolution operation, the pooling operation, and the
concept of stride and padding.
* Yann LeCun pioneered the use of CNNs - we’ll look at the classic LeNet-5 architecture later
Image Representation
Image: Height, Width, and Channels
Greyscale Image with 1 channel. Each pixel has a value 0 – 255: 0 is pure black, and 255 is pure white.
Color Image with 3 channels. Each pixel has an (R,G,B) value: (255,0,0) is pure red, (0,255,0) is pure green, (0,0,255) is pure blue, (0,0,0) is black, and (255,255,255) is white.
[Figure: a greyscale image represented as a single 2D grid of pixel values, and a color image represented as three stacked 2D grids of pixel values (the R, G, and B channels).]
• Moving forward:
o In general, we’ll use 4D tensor representations – for instance (m, h, w, c) to
represent the number of samples, height, width, and number of channels. So, for
instance, 5 greyscale images can be represented as (5, h, w, 1), and 10 colored
images with 3 channels (RGB) can be represented as (10, h, w, 3).
o Note: channels can also be used to represent stacked outputs of a convolution
layer. We’ll see more examples of that representation soon.
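As a quick illustration (a sketch using NumPy; the 28 x 28 image size is just an example), the (m, h, w, c) convention looks like this:

```python
# Illustration of the (m, h, w, c) tensor convention (28x28 size is illustrative).
import numpy as np

gray_batch = np.zeros((5, 28, 28, 1))     # 5 greyscale images, 1 channel each
rgb_batch = np.zeros((10, 28, 28, 3))     # 10 color images, 3 channels (RGB) each
print(gray_batch.shape, rgb_batch.shape)  # (5, 28, 28, 1) (10, 28, 28, 3)
```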
• We’ll now walk through the key operations that make up CNNs: the convolution
operation, the pooling operation, and the concept of stride and padding.
2D Convolutions
Note: Strictly speaking, what we refer to as the “convolution” operation in
deep learning is in fact a “cross-correlation” operation. But we’ll continue
to use the term “convolution” in keeping with standard DL terminology.
• Let’s say I’m trying to build a multiclass classifier for handwritten digits. What kind of
features might I be interested in extracting? Perhaps vertical edges, horizontal edges,
corners, circles, etc?
• As it turns out, the convolution layers let us extract such features - and a whole lot
more - when stacked up in a CNN!
• Before we get into the mathematics of convolution operations, let’s see it in action.
• Here’s an example of the convolution of an image with a kernel resulting in a feature map
which captures the edges of the input image.
Input Image  ∎  Kernel / Filter (3 x 3)  =  Feature Map (edges)

Kernel / Filter (3 x 3):
−1  −1  −1
−1  +8  −1
−1  −1  −1

Note: The convolution operation is typically denoted by “∗”. But we’ll use “∎” for clarity.
• Here’s another example where we’ve convolved the image with two different kernels
resulting in two different feature maps that capture the vertical and horizontal edges.
“Input” (28 x 28)  ∎  Kernel / Filter (3 x 3)  =  “Output” / Feature Map (26 x 26)

Vertical-edge kernel / filter (3 x 3):
+1   0  −1
+2   0  −2
+1   0  −1

Horizontal-edge kernel / filter (3 x 3):
+1  +2  +1
 0   0   0
−1  −2  −1

https://en.wikipedia.org/wiki/Kernel_(image_processing)
https://en.wikipedia.org/wiki/Sobel_operator
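As a sketch of how these fixed kernels could be applied in code (assuming SciPy is available; img is a stand-in for a real 28 x 28 greyscale image):

```python
# Sketch: applying the vertical and horizontal edge kernels above with SciPy.
# Note: correlate2d performs cross-correlation, i.e. the DL "convolution".
import numpy as np
from scipy.signal import correlate2d

img = np.random.rand(28, 28)  # stand-in for a real greyscale image

k_vertical = np.array([[+1, 0, -1],
                       [+2, 0, -2],
                       [+1, 0, -1]])
k_horizontal = np.array([[+1, +2, +1],
                         [ 0,  0,  0],
                         [-1, -2, -1]])

v_edges = correlate2d(img, k_vertical, mode="valid")    # (26, 26) feature map
h_edges = correlate2d(img, k_horizontal, mode="valid")  # (26, 26) feature map
print(v_edges.shape, h_edges.shape)
```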
• In these examples, we already knew what kernels / filters to use in order to detect
useful feature maps (such as vertical and horizontal edges).
• However, we would prefer training a network to learn these kernels on its own.
• This is exactly what CNNs achieve - we essentially train CNNs to learn the appropriate
weights of the kernels for the task at hand (representation learning).
Kernel 0 (3 x 3), with learnable weights:
w^0_00  w^0_01  w^0_02
w^0_10  w^0_11  w^0_12
w^0_20  w^0_21  w^0_22

Kernel 1 (3 x 3), with learnable weights:
w^1_00  w^1_01  w^1_02
w^1_10  w^1_11  w^1_12
w^1_20  w^1_21  w^1_22

[Figure: the input image is convolved (∎) with each kernel to produce one feature map per kernel.]
• So, what exactly does the convolution operation do? Let’s take a look!
2D Convolution - Implementation
Input X, of dimension (h_X, w_X)
Kernel K, of dimension (h_K, w_K)
Output Y, of dimension (h_Y, w_Y), where:
h_Y = h_X − h_K + 1
w_Y = w_X − w_K + 1

Example:
1 2 3
4 5 6   ∎   1 2   =   37 47
7 8 9       3 4       67 77

Left Image Source: Deep Learning, Goodfellow, Bengio, Courville, MIT Press, 2016 (Chapter 9)
Convolution of a 4x4 input image with a 2x2 kernel results in a convolved image of dimension 3x3 (using stride=1 and padding=valid).
Input          Kernel       Feature Map
1 2 3
4 5 6    ∎     1 0      =   ???
7 8 9          0 1

Input          Kernel       Feature Map
1 2 3
4 5 6    ∎     1 0      =    6  8
7 8 9          0 1          12 14
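Here is a naive sketch (in NumPy; the function name conv2d is just illustrative) of this operation with stride 1 and valid padding, checked against the worked examples above:

```python
# Naive sketch of the 2D "convolution" (really cross-correlation),
# with stride=1 and padding=valid.
import numpy as np

def conv2d(X, K):
    hX, wX = X.shape
    hK, wK = K.shape
    hY, wY = hX - hK + 1, wX - wK + 1  # output size for valid padding
    Y = np.zeros((hY, wY))
    for i in range(hY):
        for j in range(wY):
            # slide the kernel over the input, multiply element-wise, and sum
            Y[i, j] = np.sum(X[i:i+hK, j:j+wK] * K)
    return Y

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(conv2d(X, np.array([[1, 2], [3, 4]])))  # [[37. 47.] [67. 77.]]
print(conv2d(X, np.array([[1, 0], [0, 1]])))  # [[ 6.  8.] [12. 14.]]
```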
• So, the convolution of a kernel with the input image essentially allows us to capture
the spatial relationship between neighboring pixels in the input image, thereby
enabling us to extract meaningful features in the image.
• The idea with the convolutional layer, then, is to stack up enough of these kernels,
enabling the layers to learn meaningful representations at each stage. The kernel
weights are essentially what will be learned during training.
• In the examples above, we shifted the kernel over by one pixel at a time. This is called
a stride of 1. We will look at longer strides next.
• We also restricted the output to only those positions where the kernel lies entirely
within the input image. This is called “valid” padding. It has the effect of reducing
the dimension of the output relative to the input. Also, the pixels on the edges of
the image are involved in fewer convolutions than the pixels in the center of the
image. We’ll look at a different scheme called “same” padding in a bit, which attempts
to fix this by adding one or more layers of (zero) pixels along the edges of the image.
Stride
• Stride defines how many pixels the kernel should be shifted over at a time along
the height and width of the input.
• In the previous example, we were using a stride of 1 along the height and width
of the input.
• If we use a stride of s_h pixels along the height, and s_w pixels along the width:
Input X, of dimension (h_X, w_X)
Kernel K, of dimension (h_K, w_K)
Output Y, of dimension (h_Y, w_Y), where:
h_Y = ⌈(h_X − h_K + 1) / s_h⌉
w_Y = ⌈(w_X − w_K + 1) / s_w⌉
Notice that stride is in the denominator. So higher strides can be used to quickly down-sample the input. For instance, a stride of 2 would result in the output having approximately half the height and width of the input.
• Note: typically, we set s_h = s_w = s.
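A quick sketch of this output-size calculation (the helper name out_dim is just illustrative):

```python
# Output size for valid padding with stride: ceil((input - kernel + 1) / stride).
import math

def out_dim(n_in, k, s):
    return math.ceil((n_in - k + 1) / s)

print(out_dim(4, 2, 1))   # 3  -> 4x4 input, 2x2 kernel, stride 1 gives 3x3
print(out_dim(4, 2, 2))   # 2  -> 4x4 input, 2x2 kernel, stride 2 gives 2x2
print(out_dim(28, 3, 2))  # 13 -> stride 2 roughly halves the height/width
```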
Convolution of a 4x4 input image with a 2x2 kernel results in a convolved image of dimension 2x2 (using stride=2 and padding=valid).
[Figure: an image convolved with the 3 x 3 edge-detection kernel
−1  −1  −1
−1  +8  −1
−1  −1  −1
once with stride=1, padding=valid, and once with stride=2, padding=valid; the stride-2 feature map has roughly half the height and width of the stride-1 feature map.]

Image (6x6)  ∎  Kernel (2x2)  =  Output (??? x ???)     [stride=1, padding=valid]
stride=1 and padding=valid:
Input X, of dimension (h_X, w_X)
Kernel K, of dimension (h_K, w_K)
Output Y, of dimension (h_Y, w_Y), where:
h_Y = h_X − h_K + 1
w_Y = w_X − w_K + 1

stride=1 and padding=same:
Input X, of dimension (h_X, w_X)
Kernel K, of dimension (h_K, w_K)
Output Y, of dimension (h_Y, w_Y), where:
h_Y = h_X
w_Y = w_X
Example (stride=1, padding=valid):
Input (H, W, C) = (7, 7, 1)   ∎   Kernel (H, W, C) = (3, 3, 1)   =   Output (H, W, C) = (5, 5, 1)
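A minimal sketch (assuming TensorFlow/Keras; the 7 x 7 input and random weights are illustrative) contrasting the two padding schemes at stride 1:

```python
# Contrast padding="valid" vs padding="same" for a 3x3 kernel at stride 1.
import tensorflow as tf

x = tf.random.normal((1, 7, 7, 1))  # one 7x7, single-channel image
y_valid = tf.keras.layers.Conv2D(1, (3, 3), padding="valid")(x)
y_same = tf.keras.layers.Conv2D(1, (3, 3), padding="same")(x)
print(y_valid.shape)  # (1, 5, 5, 1): 7 - 3 + 1 = 5
print(y_same.shape)   # (1, 7, 7, 1): output height/width match the input
```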
Convolution with Multi-Channel Input (Stride=1, Padding=Valid)

Input (H, W, C) = (7, 7, 3)   ∎   Kernel (H, W, C) = (3, 3, 3)   =   Output (H, W, C) = (5, 5, 1)
(The output is the sum of the per-channel convolutions: conv_ch1 + conv_ch2 + conv_ch3.)

• The kernel must have the same number of channels as the input (here, 3).
• Each channel in the kernel can have different weights.
• Convolutions occur along corresponding channels, and the results are
summed up to produce the output. So even though we’re convolving
across 3 channels, the output is still only 1 channel.
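A NumPy sketch of this (the function name conv2d_multichannel is illustrative): each kernel channel is correlated with the matching input channel and the results are summed, so a 3-channel input still yields a single-channel output.

```python
# Multi-channel convolution: sum over height, width, AND channels at each position.
import numpy as np

def conv2d_multichannel(X, K):
    # X: (h, w, c) input; K: (kh, kw, c) kernel with matching channel count
    hX, wX, _ = X.shape
    hK, wK, _ = K.shape
    hY, wY = hX - hK + 1, wX - wK + 1
    Y = np.zeros((hY, wY))
    for i in range(hY):
        for j in range(wY):
            Y[i, j] = np.sum(X[i:i+hK, j:j+wK, :] * K)  # per-channel products, summed
    return Y

X = np.random.rand(7, 7, 3)
K = np.random.rand(3, 3, 3)
print(conv2d_multichannel(X, K).shape)  # (5, 5): a single-channel output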
Image (6x6x3)  ∎  Kernel (2x2x???)  =  Output (??? x ??? x ???)     [stride=1, padding=valid]
Input (7, 7, 3)  ∎  Kernel #1 (3, 3, 3)  =  Output #1 (5, 5, 1)
Input (7, 7, 3)  ∎  Kernel #2 (3, 3, 3)  =  Output #2 (5, 5, 1)
Stacking the two outputs along the channel dimension gives Output (5, 5, 2).
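A minimal sketch (assuming TensorFlow/Keras) of the same idea: the filters argument sets the number of kernels, and therefore the number of output channels.

```python
# Two 3x3x3 kernels over a (7, 7, 3) input -> stacked output with 2 channels.
import tensorflow as tf

x = tf.random.normal((1, 7, 7, 3))
y = tf.keras.layers.Conv2D(filters=2, kernel_size=(3, 3), padding="valid")(x)
print(y.shape)  # (1, 5, 5, 2)
```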
stride=2, padding=same:

Image (6x6x3)  ∎  Kernel (2x2x???)
               ∎  Kernel (??? x ??? x ???)
               ∎  Kernel (??? x ??? x ???)
               ∎  Kernel (??? x ??? x ???)
= Output (??? x ??? x ???)

Image (6x6x3)  ∎  Kernel (2x2x3)
               ∎  Kernel (2x2x3)
               ∎  Kernel (2x2x3)
               ∎  Kernel (2x2x3)
= Output (3x3x4)
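To confirm the output shape, here is a minimal sketch (assuming TensorFlow/Keras; random weights are illustrative):

```python
# Four 2x2x3 kernels, stride 2, "same" padding, on a (6, 6, 3) input.
import tensorflow as tf

x = tf.random.normal((1, 6, 6, 3))
y = tf.keras.layers.Conv2D(filters=4, kernel_size=(2, 2),
                           strides=2, padding="same")(x)
print(y.shape)  # (1, 3, 3, 4): ceil(6/2) = 3 along height and width, 4 channels
```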