
Convolutional Neural Networks – Part 1

Hina Arora

This deck is Copyright ©Hina Arora and Arizona Board of Regents. All rights reserved.
• Thus far, we have built a strong foundation in Fully Connected Feed Forward
Deep NNs.

• In the next few lectures, we will look at the application of Deep Learning in
Computer Vision tasks (specifically Image Analysis) such as Image Classification,
Object Detection, Image Segmentation, etc.

• We will predominantly use Convolutional Neural Networks (a specific Deep
Learning architecture) for image analysis.

• And we will use Keras/Tensorflow for the implementation of these ideas.


Motivating CNNs
• Let’s say I have a set of 28x28 greyscale images that I’d like to classify as cat/no-cat.

• Let’s say I’d like to build a binary classifier using a Fully Connected Dense Network
with one hidden layer with 128 units.

• I would have to first “flatten” the images before I can feed them to the Fully
Connected Network.
o That is, I would have to convert the 28x28-dimensional 2D image representations
to 784-dimensional 1D vector representations.
o Note: the number of nodes in the input layer of this dense network would be 784
(that is, each pixel in the image is considered a feature).

• The input layer then feeds into a fully connected hidden layer with 128 hidden units.
Which means for layer 1, we end up with a total of 100480 parameters (784*128 =
100352 weight parameters and 128 bias parameters).
[Figure: fully connected network for a 28x28 input.
Layer 0 (Flattened Input): 784 x 1
Layer 1 (Hidden Layer): 128 nodes, with 𝑊 of dimension 784 x 128 and 𝑏 of dimension 1 x 128 → 100480 model params!
Layer 2 (Output Layer): 1 node, with 𝑊 of dimension 128 x 1 and 𝑏 of dimension 1 x 1]
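As a sanity check on the parameter count, here is a minimal Keras sketch of this baseline (layer sizes as above; the ReLU/sigmoid activation choices are assumptions, since the slide doesn’t specify them):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                       # 28x28 -> 784 features
    tf.keras.layers.Dense(128, activation="relu"),   # 784*128 + 128 = 100480 params
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary cat/no-cat output
])
model.summary()  # confirms 100480 parameters in the hidden layer
```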
• There are several issues with this approach…
• Flattening the images means we are ignoring the spatial structure of the images.
However, images in fact have a spatial structure where neighboring pixels are
related to each other.
• Also, fully connected networks are invariant to the order of the features (in the
case of flattened images, the pixels). We could therefore shuffle the pixels of the
input images (applying the same permutation to every image) and still get similar
results – further proof that we were ignoring the spatial relationship between the pixels.

• Finally, we are typically dealing with higher resolution color images, and a higher
number of hidden units. This can cause an explosion of model parameters,
leading to overfitting or a need for lots of training data.
• Convolutional Neural Networks (or CNNs or ConvNets)* are specialized neural
networks suitable for processing data with grid-like topologies.
• Data with grid-like topologies are data that contain a spatial relationship between
neighboring datapoints, such as 1D time-series data, and 2D image data.
• By definition, CNNs employ the “convolution” operation in at least one of their
layers. Typically, CNNs are a combination of convolution layers, pooling layers and
fully connected layers.
• Together, the convolution operations and the pooling operations help capture the
spatial relationship between pixels in CNNs. They also help reduce the number of
parameters required, making CNNs more computationally efficient.
• Let’s first review image representation. We’ll then walk through the key operations
that make up CNNs: the convolution operation, the pooling operation, and the
concept of stride and padding.

* Yann LeCun pioneered the use of CNNs - we’ll look at the classic LeNet-5 architecture later
Image Representation
Image: Height, Width, and Channels
Greyscale Image with 1 channel. Each pixel has a value 0 – 255:
0 is pure black, and 255 is pure white. (H:194, W:259, C:1)

Color Image with 3 channels. Each pixel has an (R,G,B) value:
(255,0,0) is pure red, (0,255,0) is pure green, and (0,0,255) is pure blue. (H:194, W:259, C:3)

[Figure: the same 194x259 fish image shown as a greyscale image and as a color image. Pixel coordinates run from (0,0) at the top-left to (193,258) at the bottom-right. Sample greyscale pixel values: 0, 160, 96, 255, 44; sample RGB pixel values: (243,140,48), (70,222,52), (82,1,122).]
Image Source: https://github.com/opencv/opencv/blob/master/samples/data/HappyFish.jpg


Splitting Multi-Channel Image into Individual Channels
Color Image with 3 channels (H:194, W:259, C:3). Each pixel is associated with an (R,G,B) value.

Extract the Rs, Gs, and Bs to obtain three single-channel images, each of dimension H:194, W:259, C:1.

[Figure: the 3-channel image represented as three stacked 2D arrays of pixel values, one per channel (R, G, B); extracting each channel yields three separate single-channel arrays.]
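A small NumPy sketch of this channel split (the random array is a stand-in for a real image):

```python
import numpy as np

img = np.random.randint(0, 256, size=(194, 259, 3), dtype=np.uint8)  # stand-in RGB image
r, g, b = img[..., 0], img[..., 1], img[..., 2]  # extract the Rs, Gs, and Bs
print(r.shape, g.shape, b.shape)                 # (194, 259) each: single-channel
```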
• Moving forward:
o In general, we’ll use 4D tensor representations – for instance (m, h, w, c) to
represent the number of samples, height, width, and number of channels. So, for
instance, 5 greyscale images can be represented as (5, h, w, 1), and 10 colored
images with 3 channels (RGB) can be represented as (10, h, w, 3) – see the short
sketch after this list.
o Note: channels can also be used to represent stacked outputs of a convolution
layer. We’ll see more examples of that representation soon.
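In NumPy terms, these 4D representations look like the following (h = w = 28 is an assumed example size):

```python
import numpy as np

grey_batch = np.zeros((5, 28, 28, 1))   # 5 greyscale images: (m, h, w, c) = (5, h, w, 1)
rgb_batch  = np.zeros((10, 28, 28, 3))  # 10 RGB images:      (m, h, w, c) = (10, h, w, 3)
```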

• We’ll now walk through the key operations that make up CNNs: the convolution
operation, the pooling operation, and the concept of stride and padding.
2D Convolutions
Note: Strictly speaking, what we refer to as the “convolution” operation in
deep learning is in fact a “cross-correlation” operation. But we’ll continue
to use the term “convolution” in keeping with standard DL terminology.
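A tiny sketch of the distinction, using scipy (true convolution flips the kernel before sliding it over the input; cross-correlation does not):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
K = np.array([[1, 2], [3, 4]])

print(correlate2d(X, K, mode="valid"))  # what DL "convolution" computes: [[37 47] [67 77]]
print(convolve2d(X, K, mode="valid"))   # true convolution (K flipped):   [[23 33] [53 63]]
```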
• Let’s say I’m trying to build a multiclass classifier for handwritten digits. What kind of
features might I be interested in extracting? Perhaps vertical edges, horizontal edges,
corners, circles, etc?

• As it turns out, the convolution layers let us extract such features - and a whole lot
more - when stacked up in a CNN!
• Before we get into the mathematics of convolution operations, let’s see it in action.
• Here’s an example of the convolution of an image with a kernel resulting in a feature map
which captures the edges of the input image.

“Input” (28 x 28) ∎ Kernel / Filter (3 x 3) = “Output” / Feature Map (26 x 26)

Kernel / Filter (3 x 3):
−1 −1 −1
−1 +8 −1
−1 −1 −1

Note: the convolution operation is typically denoted by “∗”, but we’ll use “∎” for clarity.
• Here’s another example where we’ve convolved the image with two different kernels
resulting in two different feature maps that capture the vertical and horizontal edges.
“Input” (28 x 28) ∎ Kernel / Filter (3 x 3) = “Output” / Feature Map (26 x 26)

Kernel / Filter (3 x 3) – vertical edges:
+1 0 −1
+2 0 −2
+1 0 −1

Kernel / Filter (3 x 3) – horizontal edges:
+1 +2 +1
 0  0  0
−1 −2 −1
https://en.wikipedia.org/wiki/Kernel_(image_processing)
https://en.wikipedia.org/wiki/Sobel_operator
• In these examples, we already knew what kernels / filters to use in order to detect
useful feature maps (such as vertical and horizontal edges).
• However, we would prefer training a network to learn these kernels on its own.
• This is exactly what CNNs achieve - we essentially train CNNs to learn the appropriate
weights of the kernels for the task at hand (representation learning).
Kernel 0 (3 x 3, weights to be learned):
𝑤⁰₀₀ 𝑤⁰₀₁ 𝑤⁰₀₂
𝑤⁰₁₀ 𝑤⁰₁₁ 𝑤⁰₁₂
𝑤⁰₂₀ 𝑤⁰₂₁ 𝑤⁰₂₂

Kernel 1 (3 x 3, weights to be learned):
𝑤¹₀₀ 𝑤¹₀₁ 𝑤¹₀₂
𝑤¹₁₀ 𝑤¹₁₁ 𝑤¹₁₂
𝑤¹₂₀ 𝑤¹₂₁ 𝑤¹₂₂
• So, what exactly does the convolution operation do? Let’s take a look!
2D Convolution - Implementation

Input 𝑋, of dimension ℎ𝑋 , 𝑤𝑋
Kernel 𝐾, of dimension ℎ𝐾 , 𝑤𝐾
Output 𝑌, of dimension ℎ𝑌 , 𝑤𝑌
ℎ𝑌 = ℎ𝑋 − ℎ𝐾 + 1
𝑤𝑌 = 𝑤𝑋 − 𝑤𝐾 + 1

Input (3 x 3):       Kernel (2 x 2):      Output (2 x 2):
1 2 3                1 2                  37 47
4 5 6           ∎    3 4             =    67 77
7 8 9

Left Image Source: Deep Learning, Goodfellow, Bengio, Courville, MIT Press, 2016 (Chapter 9)
Convolution of a 4x4 input image with 2x2 kernel results in convolved image of dimension 3x3
(using stride=1 and padding=valid)
Quiz:
Input (3 x 3)        Kernel (2 x 2)       Feature Map
1 2 3                1 0
4 5 6           ∎    0 1             =    ???
7 8 9

Answer:
Input (3 x 3)        Kernel (2 x 2)       Feature Map (2 x 2)
1 2 3                1 0                   6  8
4 5 6           ∎    0 1             =    12 14
7 8 9
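A minimal NumPy sketch of this operation (stride=1, valid padding), reproducing the quiz answer above:

```python
import numpy as np

def conv2d_valid(X, K):
    hX, wX = X.shape
    hK, wK = K.shape
    hY, wY = hX - hK + 1, wX - wK + 1     # output dims: h_Y = h_X - h_K + 1, etc.
    Y = np.zeros((hY, wY))
    for i in range(hY):
        for j in range(wY):
            # elementwise product of the kernel with the patch under it, then sum
            Y[i, j] = np.sum(X[i:i + hK, j:j + wK] * K)
    return Y

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
K = np.array([[1, 0], [0, 1]])
print(conv2d_valid(X, K))   # [[ 6.  8.] [12. 14.]]
```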
• So, the convolution of a kernel with the input image essentially allows us to capture
the spatial relationship between neighboring pixels in the input image, thereby
enabling us to extract meaningful features in the image.
• The idea with the convolutional layer, then, is going to be to stack up enough of these
kernels, enabling the layers to learn meaningful representations at each stage. The
kernel weights are essentially what will be learned during training.
• In the examples above, we shifted the kernel over by one pixel at a time. This is called
a stride of 1. We will look at longer strides next.
• We also restricted the output to only those positions where the kernel lies entirely
within the input image. This is called “valid” padding. It has the effect of making
the output smaller than the input. Also, the pixels on the edges of the image are
involved in fewer convolutions than the pixels in the center of the image. We’ll look
at a different scheme called “same” padding in a bit, which attempts to fix this by
adding one or more layers of (zero) pixels along the edges of the image.
Stride
• Stride defines how many pixels the kernel should be shifted over at a time along
the height and width of the input.
• In the previous example, we were using a stride of 1 along the height and width
of the input.
• If we use a stride of 𝑠ℎ pixels along the height, and 𝑠𝑤 pixels along the width:

Input 𝑋, of dimension ℎ𝑋, 𝑤𝑋
Kernel 𝐾, of dimension ℎ𝐾, 𝑤𝐾
Output 𝑌, of dimension ℎ𝑌, 𝑤𝑌

ℎ𝑌 = ⌈(ℎ𝑋 − ℎ𝐾 + 1) / 𝑠ℎ⌉
𝑤𝑌 = ⌈(𝑤𝑋 − 𝑤𝐾 + 1) / 𝑠𝑤⌉

(rounding up when the stride does not divide evenly)

• Notice that stride is in the denominator. So higher strides can be used to quickly
down-sample the input. For instance, a stride of 2 would result in an output with
approximately half the height and width of the input.
• Note: typically, we set 𝑠ℎ = 𝑠𝑤 = 𝑠.
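• For example (a concrete case, assumed for illustration): a 28x28 input convolved
with a 3x3 kernel using stride 𝑠 = 2 and valid padding gives
ℎ𝑌 = 𝑤𝑌 = ⌈(28 − 3 + 1)/2⌉ = 13, i.e. a 13x13 output.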
Convolution of a 4x4 input image with 2x2 kernel results in convolved image of dimension 2x2
(using stride=2 and padding=valid)
[Figure: an image convolved with the 3x3 edge-detection kernel
−1 −1 −1
−1 +8 −1
−1 −1 −1
once with stride=1, padding=valid, and once with stride=2, padding=valid; the stride-2 feature map has roughly half the height and width.]
Quiz:
Image(6x6) ∎ Kernel(2x2) = Output(???x???) (stride=1, padding=valid)
Image(6x6) ∎ Kernel(2x2) = Output(???x???) (stride=2, padding=valid)

Answers:
Image(6x6) ∎ Kernel(2x2) = Output(5x5) (stride=1, padding=valid)
Image(6x6) ∎ Kernel(2x2) = Output(3x3) (stride=2, padding=valid)
Padding
• When we convolve an image with a kernel (with “valid” padding), the dimension of the output image is
reduced compared to the dimension of the input image. Also, the pixels on the edges of the image are
involved in fewer convolutions than the pixels in the center of the image.
• “Same” padding attempts to fix this by adding one or more layers of (zero) pixels along the edges of the
image.
• For instance, if we convolved a 3x3 image with a 3x3 filter – with stride 1 and valid padding - we would
end up with an output of dimension 1x1. But if we convolved a 3x3 image with a 3x3 filter – with stride 1
and same padding - we would end up with an output of dimension 3x3. This is because the input image
would be appropriately zero-padded before getting convolved.

[Figure: the “Input” image is zero-padded to form the padded “Input” image, which is then convolved (∎) with the kernel to produce an “Output” image of the same dimension as the original input.]
stride=1 and padding=valid:
Input 𝑋, of dimension ℎ𝑋, 𝑤𝑋; Kernel 𝐾, of dimension ℎ𝐾, 𝑤𝐾; Output 𝑌, of dimension ℎ𝑌, 𝑤𝑌
ℎ𝑌 = ℎ𝑋 − ℎ𝐾 + 1
𝑤𝑌 = 𝑤𝑋 − 𝑤𝐾 + 1

stride=1 and padding=same:
Input 𝑋, of dimension ℎ𝑋, 𝑤𝑋; Kernel 𝐾, of dimension ℎ𝐾, 𝑤𝐾; Output 𝑌, of dimension ℎ𝑌, 𝑤𝑌
ℎ𝑌 = ℎ𝑋
𝑤𝑌 = 𝑤𝑋

stride>1 and padding=valid:
Input 𝑋, of dimension ℎ𝑋, 𝑤𝑋; Kernel 𝐾, of dimension ℎ𝐾, 𝑤𝐾; Output 𝑌, of dimension ℎ𝑌, 𝑤𝑌
ℎ𝑌 = ⌈(ℎ𝑋 − ℎ𝐾 + 1) / 𝑠ℎ⌉
𝑤𝑌 = ⌈(𝑤𝑋 − 𝑤𝐾 + 1) / 𝑠𝑤⌉

stride>1 and padding=same:
Input 𝑋, of dimension ℎ𝑋, 𝑤𝑋; Kernel 𝐾, of dimension ℎ𝐾, 𝑤𝐾; Output 𝑌, of dimension ℎ𝑌, 𝑤𝑌
ℎ𝑌 = ⌈ℎ𝑋 / 𝑠ℎ⌉
𝑤𝑌 = ⌈𝑤𝑋 / 𝑠𝑤⌉

(rounding up when the stride does not divide evenly)
https://www.tensorflow.org/api_docs/python/tf/nn#notes_on_padding_2
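A quick Keras/TensorFlow sketch (shapes assumed, kernel weights random) that checks all four formulas on a 6x6 input with a 2x2 kernel:

```python
import tensorflow as tf

x = tf.random.normal((1, 6, 6, 1))  # one 6x6 single-channel image, (m, h, w, c)
for padding in ("valid", "same"):
    for stride in (1, 2):
        y = tf.keras.layers.Conv2D(filters=1, kernel_size=2,
                                   strides=stride, padding=padding)(x)
        print(padding, stride, y.shape)
# valid 1 -> (1, 5, 5, 1); valid 2 -> (1, 3, 3, 1); same 1 -> (1, 6, 6, 1); same 2 -> (1, 3, 3, 1)
```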
Quiz:
Image(6x6) ∎ Kernel(2x2) = Output(???x???) (stride=1, padding=same)
Image(6x6) ∎ Kernel(2x2) = Output(???x???) (stride=2, padding=same)

Answers:
Image(6x6) ∎ Kernel(2x2) = Output(6x6) (stride=1, padding=same)
Image(6x6) ∎ Kernel(2x2) = Output(3x3) (stride=2, padding=same)
Convolutions with Multi-Channel Input
• So far, we’ve only considered convolutions where the input had a single
channel.
• Often though, inputs have multiple channels (such as colored images with
RGB channels).
• Convolution of a multi-channel input can occur with a (multi-channel) kernel
with the same number of channels as the input.
• Convolutions occur along corresponding channels, and the results are
summed up to produce the output. Convolving a multi-channel input with a
single (multi-channel) kernel therefore still results in a single-channel output.
Input 𝑋, of dimension ℎ𝑋 , 𝑤𝑋 , 𝑐
Kernel 𝐾, of dimension ℎ𝐾 , 𝑤𝐾 , 𝑐
Output 𝑌, of dimension ℎ𝑌 , 𝑤𝑌 , 1
Convolution with Single-Channel Input (Stride=1, Padding=Valid)

[Figure: Input (7, 7, 1) ∎ Kernel (3, 3, 1) = Output (5, 5, 1), with dimensions given as (H, W, C).]

Convolution with Multi-Channel Input (Stride=1, Padding=Valid)

[Figure: Input (7, 7, 3) ∎ Kernel (3, 3, 3) = Output (5, 5, 1), with dimensions given as (H, W, C). Each kernel channel is convolved with the corresponding input channel (conv_ch1, conv_ch2, conv_ch3), and the three results are summed (conv_ch1 + conv_ch2 + conv_ch3) to produce the single-channel output.]

• The kernel must have the same number of channels as the input (here, 3).
• Each channel in the kernel can have different weights.
• Convolutions occur along corresponding channels, and the results are
summed up to produce the outputs. So even though we’re convolving
across 3 channels, the output is still only 1 channel.
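A short NumPy sketch of this (the random arrays are stand-ins for a real image and a learned kernel):

```python
import numpy as np

def conv2d_multichannel(X, K):
    # X: (hX, wX, c) input; K: (hK, wK, c) kernel with matching channel count.
    hK, wK, _ = K.shape
    hY = X.shape[0] - hK + 1
    wY = X.shape[1] - wK + 1
    Y = np.zeros((hY, wY))              # single-channel output
    for i in range(hY):
        for j in range(wY):
            # convolve along corresponding channels, then sum across all of them
            Y[i, j] = np.sum(X[i:i + hK, j:j + wK, :] * K)
    return Y

X = np.random.rand(7, 7, 3)
K = np.random.rand(3, 3, 3)
print(conv2d_multichannel(X, K).shape)  # (5, 5) – still only 1 channel
```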
Quiz:
Image(6x6x3) ∎ Kernel(2x2x???) = Output(???x???x???) (stride=1, padding=valid)
Image(6x6x1) ∎ Kernel(2x2x???) = Output(???x???x???) (stride=2, padding=same)

Answers:
Image(6x6x3) ∎ Kernel(2x2x3) = Output(5x5x1) (stride=1, padding=valid)
Image(6x6x1) ∎ Kernel(2x2x1) = Output(3x3x1) (stride=2, padding=same)
Convolutions with Multiple Filters
• We often like to run multiple kernels on an input. Remember, we’re using the
kernels to extract useful features from the image, and a single kernel can only
extract one kind of feature.
• When we use multiple kernels on an input, each kernel’s convolution with the input
would result in a corresponding single-channel output.
• These outputs can be stacked and represented as channels in a single multi-channel
output.
• Note:
o HxW of all kernels in a kernel bank are the same
o (number of channels in each kernel) = (number of channels in input)
o (number of channels in output) = (number of kernels)
• We’ll use this idea of multiple kernels soon when we build out a convolutional layer.
Convolution with Multiple Filters (Stride=1, Padding=Valid)

[Figure: Input (7, 7, 3) ∎ Kernel #1 (3, 3, 3) = Output #1 (5, 5, 1), and
Input (7, 7, 3) ∎ Kernel #2 (3, 3, 3) = Output #2 (5, 5, 1);
the two single-channel outputs are stacked into Output (5, 5, 2).]
• HxW of all kernels in a kernel bank are the same
• Number of channels in each kernel = Number of channels in input
• Number of channels in output = Number of kernels
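In Keras, the number of kernels is just the filters argument; a quick sketch matching the figure above:

```python
import tensorflow as tf

x = tf.random.normal((1, 7, 7, 3))  # one (7, 7, 3) input
y = tf.keras.layers.Conv2D(filters=2, kernel_size=3,
                           strides=1, padding="valid")(x)
print(y.shape)  # (1, 5, 5, 2): the two single-channel outputs, stacked
```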
Quiz (stride=2, padding=same):
Image(6x6x3) ∎ [Kernel #1 (2x2x???), Kernel #2 (???x???x???), Kernel #3 (???x???x???), Kernel #4 (???x???x???)] = Output(???x???x???)

Answer (stride=2, padding=same):
Image(6x6x3) ∎ [Kernel #1 (2x2x3), Kernel #2 (2x2x3), Kernel #3 (2x2x3), Kernel #4 (2x2x3)] = Output(3x3x4)
