
Convolutional Neural Networks – Part 2

Hina Arora

This deck is Copyright ©Hina Arora and Arizona Board of Regents. All rights reserved.
The Convolution Layer
tf.keras.layers.Conv2D
Recall: Fully Connected Layer

Input: 8 x 8, flattened to X of dimension 64 x 1. Fully Connected Layer: 32 nodes.

A_(1x32) = g(X^T_(1x64) @ W_(64x32) + b_(1x32))

Note, each node computes: a = g(z)

Fully Connected Layer:
• Input Dimension: 64
• Output Dimension: 32
• Number of parameters: 2080
  • Weights: 64 x 32 = 2048
  • Bias Terms: 32
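A minimal tf.keras sketch of this recall example (the layer sizes come from the slide; the ReLU activation and variable names are my own illustrative choices):

```python
import tensorflow as tf

# 8x8 single-channel input, flattened to 64 values, feeding 32 fully connected nodes.
inputs = tf.keras.Input(shape=(8, 8, 1))
x = tf.keras.layers.Flatten()(inputs)                        # 8*8*1 = 64 values
outputs = tf.keras.layers.Dense(32, activation="relu")(x)    # A = g(X^T @ W + b)
model = tf.keras.Model(inputs, outputs)
model.summary()   # Dense layer: 64*32 weights + 32 biases = 2,080 parameters
```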
Convolutional Layer Example: Single-Channel Input and Conv Layer with 2 Filters

Input to Conv Layer: 8 x 8 x 1
Convolutional Layer: 2 (5 x 5 x 1) kernels
Output from Conv Layer: 4 x 4 x 2

A1_(4x4x1) = g(X_(8x8x1) ∎ K1_(5x5x1) + b1)
A2_(4x4x1) = g(X_(8x8x1) ∎ K2_(5x5x1) + b2)

Like before, we add b and push through the non-linear g to get the activation. We are not flattening anymore! The convolution picks up spatial connectivity!

• # of channels in each kernel in the convolutional layer = # of channels in input to the layer (here, 1)
• # of channels in the output of the convolutional layer = # of kernels in the layer (here, 2)

Convolutional Layer:
• Input Dimension: 8*8*1 = 64
• Output Dimension: 4*4*2 = 32
• Number of parameters: 52
  • Kernel Weights: 5*5*1*2 = 50
  • Bias Terms: 2

The I/O dimensions are similar to the FC layer, but the number of parameters is much lower!
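The same example as a hedged tf.keras sketch (the activation is chosen as ReLU for illustration; the slide only specifies a generic non-linearity g):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8, 8, 1))                     # 8x8x1 input
outputs = tf.keras.layers.Conv2D(
    filters=2, kernel_size=(5, 5), strides=1,
    padding="valid", activation="relu")(inputs)              # 2 kernels of 5x5x1
model = tf.keras.Model(inputs, outputs)
model.summary()   # output shape (4, 4, 2); 5*5*1*2 + 2 = 52 parameters
```

Swapping the input to shape=(8, 8, 3) gives the multi-channel example on the next slide: the kernels automatically become 5 x 5 x 3, for 5*5*3*2 + 2 = 152 parameters.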
Convolutional Layer Example: Multi-Channel Input and Conv Layer with 2 Filters

Input to Conv Layer: 8 x 8 x 3
Convolutional Layer: 2 (5 x 5 x 3) kernels
Output from Conv Layer: 4 x 4 x 2

A1_(4x4x1) = g(X_(8x8x3) ∎ K1_(5x5x3) + b1)
A2_(4x4x1) = g(X_(8x8x3) ∎ K2_(5x5x3) + b2)

• Number of channels in each kernel in the convolutional layer = Number of channels in input to the layer (here, 3)
• Number of channels in the output of the convolutional layer = Number of kernels in the layer (here, 2)

Convolutional Layer:
• Input Dimension: 8*8*3 = 192
• Output Dimension: 4*4*2 = 32
• Number of parameters: 152
  • Kernel Weights: 5*5*3*2 = 150
  • Bias Terms: 2
Convolutional Layer, generalizing:

Note, each kernel ("node") i computes: g(A[l−1] ∎ kernel_i + bias_i)

Input to Layer [l]:
• A[l−1] : (m, h_in, w_in, c_in)

Layer [l]:
• number of kernels: n_k
• all kernels in a layer must have identical dimensions: (h_k, w_k, c_k = c_in)
• all kernels in a layer must have identical stride and padding
• each kernel in a layer will have different weights
• each kernel in a layer will be associated with a bias

Output of Layer [l]:
• A[l] : (m, h_out, w_out, c_out = n_k)

Number of Parameters in Layer [l]:
• Kernel Weights: h_k * w_k * c_k * n_k
• Bias Terms: n_k
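As a minimal sketch, the formulas above can be wrapped in a small helper (the function name and the 'valid' padding assumption are mine):

```python
def conv_layer_stats(h_in, w_in, c_in, h_k, w_k, n_k, stride=1):
    """Output shape and parameter count for one conv layer ('valid' padding)."""
    h_out = (h_in - h_k) // stride + 1
    w_out = (w_in - w_k) // stride + 1
    weights = h_k * w_k * c_in * n_k       # c_k = c_in
    biases = n_k                           # one bias per kernel
    return (h_out, w_out, n_k), weights + biases

# The two worked examples above:
print(conv_layer_stats(8, 8, 1, 5, 5, 2))   # ((4, 4, 2), 52)
print(conv_layer_stats(8, 8, 3, 5, 5, 2))   # ((4, 4, 2), 152)
```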
tf.keras.layers.Conv2D

Notes:
• Kernels are also referred to as filters
• We don’t specify number of channels (c), since that is decided based on the number of channels
in the input to the Conv Layer
• All filters in a Conv layer will have the same kernel size (h, w), the same number of channels (c),
the same strides, the same amount of padding, the same activation, etc.
• A CNN will typically have one or more Convolution Layers.
• Each convolution layer has kernels that are convolved with the inputs to that layer.
• The outputs (feature maps) of each layer become inputs to the next layer.
• This results in the idea of the “effective receptive field”, and how the network picks up
more complex abstractions of the input image as we go from the shallower to the
deeper layers.
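A hedged tf.keras sketch of such a stack (filter counts and activations are illustrative; only the 3x3 kernels and the shrinking spatial size mirror the figure that follows):

```python
import tensorflow as tf

# Each layer's feature maps become the input to the next layer.
inputs = tf.keras.Input(shape=(7, 7, 1))
x = tf.keras.layers.Conv2D(4, (3, 3), activation="relu")(inputs)   # -> (5, 5, 4)
x = tf.keras.layers.Conv2D(8, (3, 3), activation="relu")(x)        # -> (3, 3, 8)
x = tf.keras.layers.Conv2D(16, (3, 3), activation="relu")(x)       # -> (1, 1, 16)
model = tf.keras.Model(inputs, x)
model.summary()   # spatial size shrinks 7 -> 5 -> 3 -> 1 with 'valid' padding
```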
Receptive Field
• Let’s assume we’ve built out this CNN network:

[Figure: a stack of three 3x3 convolution layers. Input Layer (7 x 7) → Conv Layer 1 Output (5 x 5) → Conv Layer 2 Output (3 x 3) → Conv Layer 3 Output (1 x 1).]

Image Source: https://www.researchgate.net/figure/A-stack-of-three-3-3-convolution-layers-has-an-effective-receptive-field-of-7-7_fig1_369105912


• The number of input pixels that affect an output pixel is called the receptive field
of the output pixel. The receptive field represents the local region of the previous
layer that a pixel is "connected" to.

[Figures: the receptive field of a pixel in Conv Layer 1, 2, and 3 outputs, traced back to the immediately preceding layer.]
• The effective receptive field defines the region of the input image (and not just the
previous layer) that affects a pixel in a given layer.

[Figures: the receptive field vs. the effective receptive field of a pixel in Conv Layer 1, 2, and 3 outputs, traced back to the input layer.]
• The effective receptive field of pixels in the deeper layers is therefore larger than the
effective receptive field of pixels in the shallow layers.
• So, in some sense, with CNNs, we’re building this hierarchy of features, where shallow
layers capture a more local view of the input image (such as edges and corners), and
deeper layers capture a more global view of the input image (such as entire objects).
• We like that!
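A minimal sketch of the standard effective-receptive-field recurrence (the helper is my own, not from the slides): each layer grows the field by (kernel − 1) times the product of the strides of the layers before it.

```python
def effective_receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, shallowest layer first."""
    r, j = 1, 1                       # field of one output pixel; jump in input pixels
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Three stacked 3x3 convolutions with stride 1, as in the figure above:
print(effective_receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7  (a 7x7 input region)
```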

Characteristics of Convolutions
Deep Learning, Goodfellow, Bengio, Courville, MIT Press, 2016 (Chapter 9)
Convolutions help achieve three important properties in data with grid-like
topologies (such as 1D time-series data and 2D image data):

a) sparse interactions

b) parameter sharing

c) translation equivariance
a) Sparse Interactions

• In fully connected layers, every output from a layer interacts with every input to the layer
(𝐴@𝑊). A separate parameter describes the interaction between each output and input.

• In convolutional layers, each output pixel from a layer only interacts with a small subset
of the input pixels to the layer via the kernel weights (𝐴∎𝐾).

• This enables lower memory requirements (due to fewer model parameters), and higher
computational efficiency (due to fewer floating-point operations).
b) Parameter Sharing

• In a fully connected layer, each element of the weight matrix is used exactly once
when computing the output of a layer.

• In a convolutional layer, the same set of kernel weights is used over the entire input
image of the layer to create the output image of a layer.
c) Translation Equivariance

• Parameter sharing in convolutions enables a property called translation equivariance.

• That is, if we shift (translate) an object in the input, its convolved representation will
move the same amount in the output.

• Together with parameter sharing, translation equivariance enables a trained kernel to


pick up a specific relationship between neighboring pixels (such as edges) at multiple
locations in the input.

• Note: convolutions are not naturally equivariant to other transformations, such as


changes in the scale or rotation of an image.
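A small numpy sketch of translation equivariance (the helper and random data are mine): shifting the input by one pixel shifts the convolved output by the same pixel.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2D cross-correlation with 'valid' padding and stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))
k = rng.random((3, 3))

x_shifted = np.roll(x, shift=1, axis=1)       # translate the input 1 pixel to the right
y, y_shifted = conv2d_valid(x, k), conv2d_valid(x_shifted, k)

# Away from the wrapped-around border column, the output shifts by the same 1 pixel:
print(np.allclose(y[:, :-1], y_shifted[:, 1:]))   # True
```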
Translation Equivariance
There is also a nice simulation demonstrating translation equivariance here:
https://miro.medium.com/max/658/1*le9_VzIejK-aezFWYCVc8Q.gif

Image Source: https://chriswolfvision.medium.com/what-is-translation-equivariance-and-why-do-we-use-convolutions-to-get-it-6f18139d4c59


2D Pooling
• Pooling is typically used to (deterministically) down-sample the input representation.

• Like the convolution operation, the pooling operation consists of sliding a window
across the input, and uses the same ideas of strides and padding.

• However, unlike the convolution operation, the pooling operation contains no


parameters (i.e., there is no pooling kernel).

• Instead, pooling operators are deterministic – they either compute the max (max
pooling) or avg value (average pooling) of the elements in the pooling window.
Pooling with Single-Channel Input

If we use a stride of (s_h, s_w), and pad of (p_h, p_w):
• Input X, of dimension (h_X, w_X)
• Pooling Window, of dimension (h_K, w_K)
• Max/Avg Pooling Output Y, of dimension (h_Y, w_Y)
• Output dimensions are calculated as in convolutions

Example (2x2 pooling window, stride=1, padding=valid):

Input:      Avg Pool Output:      Max Pool Output:
1 2 3       3 4                   5 6
4 5 6       6 7                   8 9
7 8 9

Note that the pooling window is empty. It only
defines the height and width of the window –
there are no model parameters in pooling!
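The worked example above can be checked directly with the pooling layers named later in this deck (a hedged sketch; the batch/channel reshaping is just what Keras expects):

```python
import tensorflow as tf

x = tf.constant([[1., 2., 3.],
                 [4., 5., 6.],
                 [7., 8., 9.]])
x = tf.reshape(x, (1, 3, 3, 1))   # (batch, height, width, channels)

avg = tf.keras.layers.AveragePooling2D(pool_size=2, strides=1, padding="valid")(x)
mx  = tf.keras.layers.MaxPool2D(pool_size=2, strides=1, padding="valid")(x)

print(tf.squeeze(avg))   # [[3. 4.] [6. 7.]]
print(tf.squeeze(mx))    # [[5. 6.] [8. 9.]]
```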
[Figures: avg_pool applied to the same input with stride=1, padding=valid and with stride=2, padding=valid.]
• For multi-channel inputs, pooling occurs along corresponding channels, but there is no
aggregation across channels.
• So, the output of a pooling operation will have the same number of channels as the
input.

Input 𝑋, of dimension ℎ𝑋 , 𝑤𝑋 , 𝑐
Pooling Window of dimension ℎ𝐾 , 𝑤𝐾 , 𝑐
Max/Avg Pooling Output 𝑌, of dimension ℎ𝑌 , 𝑤𝑌 , 𝑐
Pooling with Multi-Channel Input
Example (stride=1, padding=valid), dimensions given as (H, W, C):
• Input: (7, 7, 3)
• Pooling Window: (3, 3, 3)
• Max/Avg Pool Output: (5, 5, 3)
• The pooling window must have the same number of channels as the input.
• Pooling has no model parameters – it is deterministic (max or avg).
• Pooling occurs along corresponding channels, but there is no
aggregation across channels. So the output will have the same number
of channels as the input (here, 3).
Input:                  Pool Window:                                      Output:

1 5 3 7
0 1 2 1      Max Pool, 2x2 window, stride=2, padding=valid       =       ???
9 0 4 1
0 0 8 0

Image_(6x6x3)   Max Pool, PoolWindow_(2x2x???), stride=1, padding=valid   =   Output_(???x???x???)
Input:                  Pool Window:                                      Output:

1 5 3 7
0 1 2 1      Max Pool, 2x2 window, stride=2, padding=valid       =       5 7
9 0 4 1                                                                   9 8
0 0 8 0

Image_(6x6x3)   Max Pool, PoolWindow_(2x2x3), stride=1, padding=valid   =   Output_(5x5x3)
The Pooling Layer
tf.keras.layers.MaxPool2D
tf.keras.layers.AveragePooling2D
Pooling Layer (Max or Average Pool):

Note, the layer computes: pool(A[l−1], poolWinDims)

Input to Layer [l]:
• A[l−1] : (m, h_in, w_in, c_in)

Layer [l]:
• pooling window dimension: (h_k, w_k, c_k = c_in)
• pooling window has a stride and padding

Output of Layer [l]:
• A[l] : (m, h_out, w_out, c_out = c_in)

Number of Parameters in Layer [l]:
• None!!

tf.keras.layers.MaxPool2D
tf.keras.layers.AveragePooling2D
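A minimal tf.keras sketch of a pooling layer (the input size and 2x2 window are illustrative choices):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8, 8, 3))
outputs = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2, padding="valid")(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()   # output shape (4, 4, 3): c_out = c_in = 3, and 0 trainable parameters
```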
Characteristics of Pooling
Deep Learning, Goodfellow, Bengio, Courville, MIT Press, 2016 (Chapter 9)
• Pooling helps leverage 2 important ideas:
a) down-sampling
b) translation invariance

• And together with convolutions, pooling achieves an important property:
c) learned invariance
a) Down-Sampling

• Pooling with large strides is often used to down-sample in a network (reduce the
spatial dimensions for the next layer in a network).

• This is similar to how we achieve down-sampling with convolutions of large strides.

• Why we'd pick one over the other is based on whether we're looking to also learn
weights related to this down-sampling (which convolutions would do for us, but
pooling won't).
b) Translation Invariance

• Small shifts (translations) of the input will not significantly change the pooled output.

• This can be a useful property in situations where we are more concerned with the presence or
absence of a feature, rather than its exact location.

• For instance, if we’re trying to determine whether an image contains a face, we may only care
about the presence of a nose in the center – not so much its exact pixel coordinates.
https://www.kaggle.com/code/viroviro/introduction-to-cnns/notebook

• The max pool layer outputs for A and B are identical: a small translation shift (1 pixel here) in the
input does not change the output of the max pool layer.
• However, the max pool layer outputs for A and C are different: a larger translation shift (2
pixels here) in the input changes the output of the max pool layer.
• Such invariance (even if it is limited) can be useful in cases where the prediction should not
depend on these details, such as in classification tasks.
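A small numpy sketch mirroring the A/B/C idea described above (the inputs and helper are mine): a single "feature" pixel, max-pooled with a 2x2 window and stride 2.

```python
import numpy as np

def max_pool_2x2_stride2(x):
    """2x2 max pooling with stride 2 on a 4x4 array."""
    return x.reshape(2, 2, 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # feature at column 0
b = np.zeros((4, 4)); b[0, 1] = 1.0   # same feature shifted 1 pixel right
c = np.zeros((4, 4)); c[0, 2] = 1.0   # same feature shifted 2 pixels right

print(max_pool_2x2_stride2(a))   # [[1. 0.] [0. 0.]]
print(max_pool_2x2_stride2(b))   # [[1. 0.] [0. 0.]]  -> 1-pixel shift: output unchanged
print(max_pool_2x2_stride2(c))   # [[0. 1.] [0. 0.]]  -> 2-pixel shift: output changes
```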
c) Learned Invariance

• While max-pooling is naturally invariant to translation, it is not invariant to other


transformations such as rotation.

• However, pooling over the outputs of separately parametrized convolutions allows
the network to learn invariance to other transformations, such as rotation.

• Let’s look at an example of learned rotational invariance


[Worked example: three input variants, each convolved (stride=3, padding=valid) with the edge-detection kernel

+1  0  -1
+2  0  -2
+1  0  -1

followed by max_pool with size=3, stride=3, padding=valid.]
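A hedged numpy sketch of the idea (the rotated filter bank, the toy inputs, and the helper are my own additions; only the edge-detection kernel comes from the slides): max-pooling across the responses of separately parameterized, differently oriented filters gives the same strong response whether the edge is vertical or rotated.

```python
import numpy as np

sobel_v = np.array([[+1, 0, -1],
                    [+2, 0, -2],
                    [+1, 0, -1]], dtype=float)          # vertical-edge kernel from the slides
filters = [np.rot90(sobel_v, k) for k in range(4)]      # 0/90/180/270-degree variants

def conv_valid(x, k):
    """Plain 'valid' cross-correlation, stride 1."""
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

edge = np.zeros((3, 3)); edge[:, 0] = 1.0                # a small vertical edge
edge_rot = np.rot90(edge)                                # the same edge, rotated 90 degrees

# Pooling (max) across the filter responses is invariant to the rotation:
print(max(conv_valid(edge, k).max() for k in filters))       # 4.0
print(max(conv_valid(edge_rot, k).max() for k in filters))   # 4.0
```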
Basic/Typical CNN Architecture
A basic/typical CNN architecture starts out by alternating between Convolution and Pooling layers, and ends
with a few Fully Connected layers.

• Shallow convolutional layers focus on the relationship between neighboring pixels to extract elementary
features such as edges, end-points, and corners. These features are then combined by subsequent layers
to detect higher-order features such as shapes and objects.

• Once a feature has been detected, its exact location becomes less important – only its approximate position
relative to other features is relevant. A simple way to achieve this is to reduce the spatial resolution (height
and width) of the feature map. This is typically achieved through the pooling layers.

• Successive layers of conv and pool layers are typically alternated in the CNN, and as we proceed from
shallower layers to deeper layers, the number of feature maps (channels) is increased using Convolution
layers, and the spatial resolution (height and width) is decreased using Pooling layers.

• This progressive reduction of spatial resolution (height and width) combined with a progressive increase of
the richness of the representation (number of channels) provides the network with the ability to achieve a
large degree of invariance to geometric transformations of the input.

• Note: this architecture is only providing the CNN the capability to learn. We still need to provide the
appropriate training data (for instance, rotated/translated data) for it to actually learn.
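A minimal tf.keras sketch of this alternating Conv/Pool pattern (the input size, filter counts, kernel sizes, and the 10-class head are all illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D((2, 2)),                      # 32x32 -> 16x16
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D((2, 2)),                      # 16x16 -> 8x8
    tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D((2, 2)),                      # 8x8 -> 4x4; channels 16 -> 32 -> 64
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()   # H, W shrink and #channels grow as we go deeper, then a small FC head
```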
We'll look at several CNN architectures soon, but here are some basic examples that illustrate the general idea.

LeNet-5:
<<<Input>>>
Conv2D
AvgPooling2D
Conv2D
AvgPooling2D
<<<Flatten>>>
Dense
Dense
Dense

AlexNet:
<<<Input>>>
Conv2D
MaxPool2D
Conv2D
MaxPool2D
Conv2D
Conv2D
Conv2D
MaxPool2D
<<<Flatten>>>
Dense
<<<Dropout>>>
Dense
<<<Dropout>>>
Dense

VGG-16:
<<<Input>>>
2 * Conv2D
MaxPool2D
2 * Conv2D
MaxPool2D
3 * Conv2D
MaxPool2D
3 * Conv2D
MaxPool2D
3 * Conv2D
MaxPool2D
<<<Flatten>>>
Dense
Dense
Dense

VGG-19:
<<<Input>>>
2 * Conv2D
MaxPool2D
2 * Conv2D
MaxPool2D
4 * Conv2D
MaxPool2D
4 * Conv2D
MaxPool2D
4 * Conv2D
MaxPool2D
<<<Flatten>>>
Dense
Dense
Dense
CNN Feature Hierarchy
When humans recognize a face:
• The neurons in the lower visual neuron areas (V1) have small effective receptive fields, and are sensitive to basic features such as edges and lines.
• The neurons in the higher visual neuron areas (V2, V4) have larger effective receptive fields, and are sensitive to complex features such as shapes and objects.
• The neurons in the IT area have the largest and most comprehensive receptive fields, and are sensitive to the entire face.

CNNs seem to follow a similar process:
• The lower convolutional layers extract small features such as edges and colored blobs.
• The deeper layers extract general shapes and partial objects.
• And the last layer extracts the final classification.

Image and Text Source: How CNNs see the world – A survey of CNN visualization methods, Qin et al, 2018
Deeper layers:
• Low-resolution, semantically strong features
• More global (less accurately localized) since down-sampled more times
• Global information resolves "what"

Shallower layers:
• High-resolution, semantically weak features
• More accurately localized since down-sampled fewer times
• Local information resolves "where"

As we go deeper: reduce H, W; increase #C.

Image Source: Deep Learning, Goodfellow, Bengio, Courville, MIT Press, 2016
CNNs:
1) Lower layers (CL1, CL2) capture small edges, corners, and parts.
2) CL3 captures similar textures such as mesh patterns.
3) Higher layers (CL4, CL5) are more class-specific, and show almost entire objects.

Image and Text Source: How CNNs see the world – A survey of CNN visualization methods, Qin et al, 2018
4) lower layers preserve much more detailed information, such as locations of objects.
5) higher CLs and even FLs preserve approximate object location information.
6) unrelated information is gradually filtered from low layers to high layers.

Image and Text Source: How CNNs see the world – A survey of CNN visualization methods, Qin et al, 2018
