Ch3: Convolutional Neural Networks (CNN)
Basics of CNN
• The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution.
• A convolution is an element-wise multiplication of two matrices (a small kernel and a same-sized region of the input) followed by a sum of those products.
• Convolutions vs. cross-correlation: a true convolution "flips" the kernel before sliding it across the input, while cross-correlation applies the kernel as-is.
Fig: Input convolved with a kernel produces the feature map/output.
Fig: Animation of a kernel convolving across an input image.
• However, nearly all machine learning and deep learning libraries use the
simplified cross-correlation function.
• All this math amounts to is a sign change in how we access the coordinates of the image I; that is, we do not have to "flip" the kernel relative to the input when applying cross-correlation.
• A convolutional filter slides (i.e., convolves) across the image
Fig: A 3x3 convolutional filter sliding over an input matrix.
• Essentially, this tiny kernel sits on top of the big image and slides from left-to-
right and top-to-bottom, applying a mathematical operation (i.e., a convolution)
at each (x, y)-coordinate of the original image
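To make this sliding-window operation concrete, here is a minimal NumPy sketch (not code from the slides) of a stride-1, no-padding cross-correlation, which is the "convolution" that deep learning libraries actually implement. The 5x5 input and averaging kernel are illustrative choices.

import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` left-to-right, top-to-bottom, applying an
    element-wise multiply-and-sum at each (x, y) position (no kernel flip)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            region = image[y:y + kh, x:x + kw]   # kernel-sized window
            out[y, x] = np.sum(region * kernel)  # multiply element-wise, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging (blur) kernel
print(convolve2d(image, kernel).shape)            # (3, 3): the output shrinks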
Kernels
• We slide the kernel (the red region in the figure) from left to right and top to bottom along the original image.
• Kernels can be of arbitrary rectangular size M×N, provided that both M and N are odd integers.
• We use an odd kernel size to ensure there is a valid integer (x, y)-coordinate at the center of the kernel.
• On the left, we have a 3×3 matrix. The center of the matrix is located at x = 1, y = 1, where the top-left corner of the matrix is used as the origin and our coordinates are zero-indexed.
• But on the right, we have a 2×2 matrix. The center of this matrix would be located at x = 0.5, y = 0.5, which is not a valid integer coordinate.
Types of Kernels
• Prewitt Filters:
Used to detect vertical and horizontal edges. The horizontal (x-direction) filter helps detect edges in the image that cut perpendicularly through the horizontal axis, and vice versa for the vertical (y-direction) filter.
• Sobel Filters:
Just like the Prewitt operator, the Sobel operator is made up of a vertical and a horizontal edge detection filter. Edges detected using the Sobel filters are sharper than those from the Prewitt filters.
• Laplacian Filter:
The Laplacian filter is a single filter that detects edges of all orientations. From a mathematical standpoint, it computes second-order derivatives of pixel values, unlike the Prewitt and Sobel filters, which compute first-order derivatives.
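The slide images with the filter matrices are not reproduced here, so below are the standard textbook definitions of the Prewitt, Sobel, and Laplacian kernels as NumPy arrays; scipy.signal.correlate2d applies them without flipping, matching the cross-correlation convention discussed earlier. The random 8x8 array is a stand-in for a grayscale image.

import numpy as np
from scipy.signal import correlate2d

prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])   # responds to vertical edges
prewitt_y = prewitt_x.T              # responds to horizontal edges

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])     # like Prewitt, but the center row is weighted
sobel_y = sobel_x.T

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])   # second-order derivative, all orientations

gray = np.random.rand(8, 8)                      # stand-in for a grayscale image
edges = correlate2d(gray, sobel_x, mode="same")  # same-size edge response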
Three extremely simple but effective filters are the sharpen, Laplacian, and emboss filters, each given by a 3x3 matrix (common definitions are sketched below).
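The 3x3 matrices themselves are not shown on the slide; the versions below are the commonly used textbook definitions and are therefore an assumption.

import numpy as np

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])   # boosts the center pixel against its neighbors

emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])     # directional relief/shadow effect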
• Depending on its element values, a kernel can produce a wide range of effects.
• We must manually hand-define each of our kernels for each of our various image processing operations, such as smoothing, sharpening, and edge detection.
• Is it possible to define a machine learning algorithm that can look at our input images
and eventually learn these types of operators?
• This process of using the lower-level layers to learn high-level features is exactly the
compositionality of CNNs that we were referring to earlier.
• But exactly how do CNNs do this? The answer is by stacking a specific set of layers
in a purposeful manner.
• The last layer of a neural network (i.e., the “output layer”) is also fully-
connected and represents the final output classifications of the network.
• Let's again consider the CIFAR-10 dataset. Each image in CIFAR-10 is 32x32 with a Red, Green, and Blue channel, yielding a total of 32x32x3 = 3,072 total inputs to our network.
• A total of 3,072 inputs does not seem to amount to much, but consider if we were using 250x250 pixel images – the total number of inputs and weights would jump to 250x250x3 = 187,500 – and this number is only for the input layer alone!
• Surely, we would want to add multiple hidden layers with varying numbers of nodes per layer – these parameters can quickly add up, and given the poor performance of standard neural networks on raw pixel intensities, this bloat is hardly worth it.
• Instead, we can use Convolutional Neural Networks (CNNs) that take advantage of
the input image structure and define a network architecture in a more sensible way.
• Again consider the CIFAR-10 dataset: the input volume will have dimensions 32x32x3 (width, height, and depth, respectively).
• Neurons in subsequent layers will only be connected to a small region of the layer before it (rather than the fully-connected structure of a standard neural network) – we call this local connectivity, which enables us to save a huge number of parameters in our network.
Layer Types
• There are many types of layers used to build Convolutional Neural Networks,
but the ones you are most likely to encounter include:
• Convolutional (CONV)
• Activation (ACT or RELU, where we use the name of the actual activation function)
• Pooling (POOL)
• Fully-connected (FC)
• Batch normalization (BN)
• Dropout (DO)
• Stacking a series of these layers in a specific manner yields a CNN. We often use simple text diagrams to describe one, e.g., INPUT => CONV => RELU => FC => SOFTMAX.
• Of these layer types, CONV and FC (and to a lesser extent, BN) are the only layers that contain parameters that are learned during the training process. Activation and dropout layers are not considered true "layers" themselves, but are often included in network diagrams to make the architecture explicitly clear.
• Pooling layers (POOL), of equal importance to CONV and FC, are also included in network diagrams as they have a substantial impact on the spatial dimensions of an image as it moves through a CNN.
• CONV, POOL, RELU, and FC are the most important when defining your actual network architecture. That's not to say the other layers are not critical, but they take a backseat to this critical set of four, which define the actual architecture itself.
Convolutional Layers
• The CONV layer is the core building block of a Convolutional Neural Network.
• The CONV layer parameters consist of a set of K learnable filters (i.e., "kernels"), where each filter has a width and a height and is nearly always square.
• These filters are small (in terms of their spatial dimensions) but extend throughout the full
depth of the volume.
• For inputs to the CNN, the depth is the number of channels in the image (i.e., a depth of
three when working with RGB images, one for each channel). For volumes deeper in the
network, the depth will be the number of filters applied in the previous layer.
• Let's consider the forward pass of a CNN, where we convolve each of the K filters across the width and height of the input volume.
Fig: Left: At each convolutional layer in a CNN, there are K kernels applied to the
input volume. Middle: Each of the K kernels is convolved with the input volume. Right:
Each kernel produces a 2D output, called an activation map.
• We can think of each of our K kernels sliding across the input region, computing an element-wise multiplication, summing, and then storing the output value in a 2-dimensional activation map, as in the figure.
• After applying all K filters to the input volume, we have K 2-dimensional activation maps.
• We then stack our K activation maps along the depth dimension of our array to form the final output volume, as in the sketch below.
Fig: After obtaining the K activation maps, they are stacked together to form the
input volume to the next layer in the network.
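A minimal NumPy sketch (illustrative, not the lecture's code) of the forward pass just described: K filters of full input depth each produce a 2D activation map, and the maps are stacked along the depth axis. The 32x32x3 input and K = 8 filters mirror the CIFAR-10 example; the random values are placeholders.

import numpy as np

def conv_forward(volume, filters):
    """volume: (H, W, D); filters: (K, F, F, D). Stride 1, no padding."""
    H, W, D = volume.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                                # one activation map per filter
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                region = volume[y:y + F, x:x + F, :]  # F x F x D local region
                out[y, x, k] = np.sum(region * filters[k])
    return out

volume = np.random.rand(32, 32, 3)           # e.g., one CIFAR-10 image
filters = np.random.rand(8, 3, 3, 3)         # K = 8 filters, each 3x3x3
print(conv_forward(volume, filters).shape)   # (30, 30, 8): 8 stacked activation maps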
• Every entry in the output volume is thus an output of a neuron that “looks” at
only a small region of the input. In this manner, the network “learns” filters that
activate when they see a specific type of feature at a given spatial location in the
input volume.
• In lower layers of the network, filters may activate when they see edge-like or
corner-like regions.
• Then, in the deeper layers of the network, filters may activate in the presence of
high-level features, such as parts of the face, the paw of a dog, the hood of a car, etc.
• This activation concept goes back to our neural network analogy: these neurons become "excited" and "activate" when they see a particular pattern in an input image.
• The concept of convolving a small filter with a large(r) input volume has special
meaning in Convolutional Neural Networks – specifically, the local connectivity
and the receptive field of a neuron.
• When utilizing CNNs, we choose to connect each neuron to only a local region of the input volume – we call the size of this local region the receptive field (or simply, the variable F) of the neuron.
• Let's return to our CIFAR-10 dataset, where the input volume has a size of 32x32x3. If our receptive field is of size 3x3, then each neuron in the CONV layer will connect to a 3x3 local region of the image, for a total of 3x3x3 = 27 weights.
• Simply put, the receptive field F is the size of the filter, yielding an FxF kernel that is convolved with the input volume.
• There are three parameters that control the size of an output volume:
• the depth,
• stride, and
• zero-padding
Depth
• The depth of an output volume controls the number of neurons (i.e., filters) in the CONV layer that connect to a local region of the input volume. Each filter produces an activation map that "activates" in the presence of oriented edges or blobs of color.
• For a given CONV layer, the depth of the activation map will be K, or simply the number of filters we are learning in the current layer. The set of filters that are "looking at" the same (x, y)-location of the input is called the depth column.
Stride
• In the convolution examples above, we took a step of one pixel each time. In the context of CNNs, the same principle applies: for each step, we create a new depth column around the local region of the image, where we convolve each of the K filters with the region and store the output in a 3D volume.
• When creating our CONV layers we normally use a stride step size S of either S
= 1 or S = 2.
• Smaller strides will lead to overlapping receptive fields and larger output
volumes. Conversely, larger strides will result in less overlapping receptive
fields and smaller output volumes.
• To make the concept of convolutional stride more concrete, consider the comparison below.
• Thus, we can see how convolution layers can be used to reduce the spatial
dimensions of the input volumes simply by changing the stride of the kernel.
Fig: Convolution of the same input with stride = 1 vs. stride = 2 (larger stride, smaller output).
Zero-padding
• Sometimes this shrinking of the output is desirable, and other times it is not; it simply depends on your application.
• However, in most cases, we want our output image to have the same
dimensions as our input image. To ensure the dimensions are the same, we
apply padding.
• Here we are simply replicating the pixels along the border of the image, such
that the output image will match the dimensions of the input image.
• We need to "pad" the borders of an image to retain the original image size when applying a convolution – the same is true for filters inside of a CNN.
• Using zero-padding, we can “pad” our input along the borders such that our
output volume size matches our input volume size. The amount of padding we
apply is controlled by the parameter P.
Fig: Convolution with padding = 1 and stride = 1 (the output size matches the input size).
• If we instead set P = 1, we can pad our input volume with zeros (right) to create
a 7x7 volume and then apply the convolution operation, leading to an output
volume size that matches the original input volume size of 5x5 (bottom).
• We can compute the size of an output volume as a function of the input volume size (W, assuming the input images are square, which they nearly always are), the receptive field size F, the stride S, and the amount of zero-padding P:
output size = ((W - F + 2P) / S) + 1
• If this value is not an integer, then the strides are set incorrectly, and the neurons cannot be tiled such that they fit across the input volume in a symmetric way.
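The formula translates directly into a small helper; this is a sketch, with the example values taken from the 5x5 input discussed above.

def conv_output_size(W, F, S, P):
    """Spatial output size of a CONV layer: ((W - F + 2P) / S) + 1."""
    if (W - F + 2 * P) % S != 0:
        raise ValueError("W, F, S, P do not tile the input symmetrically")
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=5, F=3, S=1, P=1))  # 5: padding preserves the input size
print(conv_output_size(W=5, F=3, S=2, P=0))  # 2: a larger stride shrinks the output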
Activation Layers
• After each CONV layer in a CNN, we apply a nonlinear activation function, such as ReLU, ELU, or one of the Leaky ReLU variants.
• Activation layers are not technically “layers” (due to the fact that no
parameters/weights are learned inside an activation layer) and are sometimes
omitted from network architecture diagrams as it’s assumed that an activation
immediately follows a convolution.
• An activation layer accepts an input volume of size W×H×D and applies the given activation function. Since the activation function is applied element-wise, the output of an activation layer always has the same dimensions as its input.
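For instance (a one-line sketch), ReLU applied element-wise leaves the volume's shape untouched:

import numpy as np

volume = np.random.randn(32, 32, 16)    # some intermediate W x H x D volume
activated = np.maximum(0, volume)       # element-wise ReLU
assert activated.shape == volume.shape  # dimensions are unchanged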
Pooling Layers
• There are two methods to reduce the size of an input volume: CONV layers with a stride > 1, and POOL layers. It is common to insert POOL layers in between consecutive CONV layers in a CNN architecture:
INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC
• The primary function of the POOL layer is to progressively reduce the spatial size
(i.e., width and height) of the input volume.
Benefits:
• Faster computation due to the reduced spatial dimensions
• Lower memory consumption
• Robustness to small shifts in feature positions: if a feature appears in a slightly different position than it did in the training data, it can still be classified accurately (translation invariance)
• Reduced overfitting
Average Pooling
o In average pooling, the features in a region are summarized by the average value of that region. Average pooling smooths the harsh edges of a picture and is used when such edges are not important.
Global Pooling
o Each channel in the feature map is reduced to a single value. The value depends on the type of global pooling, which can be min, max, or average.
o Global pooling is equivalent to applying a pooling window with the exact dimensions of the feature map.
• Max pooling is typically done in the middle of the CNN architecture to reduce spatial size, whereas average pooling is normally used as the final layer of the network (e.g., GoogLeNet, SqueezeNet, ResNet) when we wish to avoid FC layers entirely. The most common type of POOL layer is max pooling.
• Typically we'll use a pool size of 2x2, although deeper CNNs that use larger input images (> 200 pixels) may use a 3x3 pool size early in the network architecture.
• In summary, POOL layers accept an input volume of size W_in x H_in x D_in. They then require two parameters:
The receptive field size F (also called the "pool size").
The stride S.
• Applying the POOL operation yields an output volume of size W_out x H_out x D_out, where:
W_out = ((W_in - F) / S) + 1
H_out = ((H_in - F) / S) + 1
D_out = D_in
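A sketch of these formulas in action, using the common F = 2, S = 2 (non-overlapping) max pooling configuration; the 28x28x6 input is illustrative.

import numpy as np

def max_pool(volume, F=2, S=2):
    """Max pooling: replace each F x F block (per channel) with its maximum."""
    H, W, D = volume.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((out_h, out_w, D))
    for y in range(out_h):
        for x in range(out_w):
            block = volume[y * S:y * S + F, x * S:x * S + F, :]
            out[y, x, :] = block.max(axis=(0, 1))   # max taken per channel
    return out

volume = np.random.rand(28, 28, 6)
print(max_pool(volume).shape)   # (14, 14, 6): the depth D is unchanged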
• In practice, we tend to see two types of max pooling variations:
• Type #1: F = 3, S = 2, which is called overlapping pooling and normally applied to images/input volumes with large spatial dimensions.
• Type #2: F = 2, S = 2, which is called non-overlapping pooling. This is the most common type of pooling and is applied to images with smaller spatial dimensions.
• For network architectures that accept smaller input images (in the range of 32-64 pixels) you may also see F = 2, S = 1 as well.
• To POOL or CONV?
In their 2014 paper, Striving for Simplicity: The All Convolutional Net, Springenberg et al. recommend discarding the POOL layer entirely and simply relying on CONV layers with a larger stride to handle downsampling the spatial dimensions of the volume.
It is becoming increasingly common not to use POOL layers in the middle of the network architecture and to use only average pooling at the end of the network, if FC layers are to be avoided.
Flattening:
We take the pooled feature map and convert it into a column: we read the numbers row by row and put them into one long column.
For each pooled feature map in the pooling layer, we apply this flattening, and the resulting columns together become one long vector of inputs for an artificial neural network (see the sketch below).
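Flattening is just a reshape; a two-line sketch (the 5x5x16 shape anticipates the LeNet table later in the chapter):

import numpy as np

pooled = np.random.rand(5, 5, 16)   # pooled feature maps
flat = pooled.reshape(-1)           # read out row by row into one long vector
print(flat.shape)                   # (400,)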
Fully-connected Layers
• Neurons in FC layers are fully connected to all activations in the previous layer, as is standard for feedforward neural networks. FC layers are typically placed at the end of the network, for example:
INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC => FC
Batch Normalization
• Batch normalization was introduced by Ioffe and Szegedy in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
• Although our input X was normalized, the outputs of later layers will no longer be on the same scale. As the data go through multiple layers of the neural network and activation functions are applied, the data undergo an internal covariate shift.
What exactly is internal covariate shift?
• Suppose we get a new set of images, consisting of non-white dogs. These new images will have a slightly different distribution from the previous images.
• The model will then change its parameters according to these new images. Hence the distribution of the hidden activations will also change, and the results will degrade.
• Batch normalization layers (or BN for short), as the name suggests, are used to
normalize the activations of a given input volume before passing it into the
next layer in the network.
• If we consider x to be our mini-batch of activations, then we can compute the normalized x̂ via the following equation:
x̂_i = (x_i - μ_B) / sqrt(σ_B² + ε)
where μ_B and σ_B² are the mean and variance of the mini-batch.
• We set ε equal to a small positive value such as 1e-7 to avoid dividing by zero. Applying this equation implies that the activations leaving a batch normalization layer will have approximately zero mean and unit variance (i.e., zero-centered).
• At testing time, we replace the mini-batch μ_B and σ_B with running averages of μ_B and σ_B computed during the training process.
• This ensures that we can pass images through our network and still obtain accurate predictions without being biased by the μ_B and σ_B of the final mini-batch passed through the network at training time.
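A NumPy sketch of the normalization equation above, applied over a mini-batch (the learnable scale-and-shift parameters and the running averages used at test time are omitted for brevity):

import numpy as np

def batch_norm(x, eps=1e-7):
    """Normalize each feature over the mini-batch (axis 0)."""
    mu = x.mean(axis=0)               # per-feature mini-batch mean
    var = x.var(axis=0)               # per-feature mini-batch variance
    return (x - mu) / np.sqrt(var + eps)

x = np.random.rand(64, 128) * 10 + 5  # mini-batch of 64 activation vectors
x_hat = batch_norm(x)
print(x_hat.mean(), x_hat.std())      # approximately 0 and 1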
• Benefits of BN:
Extremely effective at reducing the number of epochs it takes to train a neural network.
Helps "stabilize" training, allowing for a larger variety of learning rates and regularization strengths.
Helps prevent overfitting and allows us to obtain significantly higher classification accuracy in fewer epochs compared to the same network architecture without batch normalization.
Dropout
• Dropout is actually a form of regularization that aims to help prevent overfitting by
increasing testing accuracy, perhaps at the expense of training accuracy.
• For each mini-batch in our training set, dropout layers, with probability p, randomly
disconnect inputs from the preceding layer to the next layer in the network
architecture.
• After the forward and backward pass are computed for the minibatch, we re-connect
the dropped connections, and then sample another set of connections to drop.
• Dropout ensures there are multiple, redundant nodes that will activate when presented with similar inputs – this in turn helps our model to generalize.
... CONV => RELU => POOL => FC => DO => FC => DO => FC
• We may also apply dropout with smaller probabilities (i.e., p = 0.10 – 0.25) in earlier layers of the network as well (normally following a downsampling operation, either via max pooling or convolution). A minimal sketch of the operation follows.
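A sketch of dropout as it is usually implemented ("inverted" dropout, which rescales the surviving activations so test time needs no change; the slides describe the basic form, so the rescaling here is an implementation choice):

import numpy as np

def dropout(x, p=0.5, training=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training:
        return x                             # test time: no-op
    mask = np.random.rand(*x.shape) >= p     # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.random.rand(4, 8)
print(dropout(x, p=0.5))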
• By far, the most common form of CNN architecture is to stack a few CONV and
RELU layers, following them with a POOL operation.
• We repeat this sequence until the volume width and height are small, at which point we apply one or more FC layers. Therefore, we can derive the most common CNN architecture using the following pattern:
INPUT => [[CONV => RELU]*N => POOL?]*M => [FC => RELU]*K => FC
Here the * operator implies one or more and the ? indicates an optional operation. Common choices for each repetition include:
0 <= N <= 3
M >= 0
0 <= K <= 2
Examples :
INPUT => [CONV => RELU => POOL] * 2 => FC => RELU => FC
INPUT => [CONV => RELU => CONV => RELU => POOL] * 3 => [FC => RELU] * 2 => FC
Here is an example of a very shallow CNN with only one CONV layer (N = M = K = 0):
INPUT => CONV => FC
And a deeper, AlexNet-like architecture:
INPUT => [CONV => RELU => POOL] * 2 => [CONV => RELU] * 3 => POOL => [FC => RELU => DO] * 2 => SOFTMAX
For deeper network architectures, such as VGGNet, we’ll stack two (or more) layers before
every POOL layer :
INPUT => [CONV => RELU] * 2 => POOL => [CONV => RELU] * 2 => POOL => [CONV =>
RELU] * 3 => POOL => [CONV => RELU] * 3 => POOL => [FC => RELU => DO] * 2 =>
SOFTMAX
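As a concrete (illustrative) realization of the first example pattern, here is a Keras sketch of INPUT => [CONV => RELU => POOL] * 2 => FC => RELU => FC for 32x32x3 inputs and 10 classes; the filter counts (32, 64) and FC width (128) are assumptions, not values from the slides.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # CONV => RELU
    layers.MaxPooling2D(pool_size=(2, 2)),                         # POOL
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),  # CONV => RELU
    layers.MaxPooling2D(pool_size=(2, 2)),                         # POOL
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                          # FC => RELU
    layers.Dense(10, activation="softmax"),                        # final FC
])
model.summary()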
• We can build our own CNN architecture by varying the layers of the CNN model.
• Before we see the CNN Architectures, we should know some of the history
regarding why these architectures were developed.
• There is a worldwide competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which the ImageNet dataset is used for object detection and image classification at large scale. Work on this dataset was begun by Fei-Fei Li in 2006.
• ImageNet contains more than 20,000 categories, with a typical category, such as "balloon," "strawberry," "table," or "chair," consisting of several hundred images.
CNN Architectures:
The history of deep CNNs began with the appearance of LeNet. At that time, CNNs were restricted to handwritten digit recognition tasks and could not scale to all image classes.
• The LeNet architecture consists of two series of CONV => TANH =>
POOL layer sets followed by a fully-connected layer and softmax
output.
LAYER TYPE    OUTPUT SIZE    FILTER SIZE / STRIDE
INPUT IMAGE   32x32x1
CONV          28x28x6        5x5, K=6
POOL          14x14x6        pool 2x2, stride 2
CONV          10x10x16       5x5, K=16
POOL          5x5x16         pool 2x2, stride 2
FLATTEN       400
FC            120
FC            84
SOFTMAX       10
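A Keras sketch of the table above (LeNet's original subsampling layers are approximated here with average pooling, an assumption, and tanh activations follow the CONV => TANH => POOL pattern):

from tensorflow.keras import layers, models

lenet = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, (5, 5), activation="tanh"),           # -> 28x28x6
    layers.AveragePooling2D(pool_size=(2, 2), strides=2),  # -> 14x14x6
    layers.Conv2D(16, (5, 5), activation="tanh"),          # -> 10x10x16
    layers.AveragePooling2D(pool_size=(2, 2), strides=2),  # -> 5x5x16
    layers.Flatten(),                                      # -> 400
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),
])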
• But LeNet was not popular at the time because of the lack of hardware, especially GPUs.
• Since the success of AlexNet in 2012, CNNs have become the best choice for computer vision applications.
VGGNet Architecture
• VGG is a classical convolutional neural network architecture. It was based on an analysis
of how to increase the depth of such networks.
• The network utilises small 3x3 filters. Otherwise, the network is characterized by its simplicity: the only other components are pooling layers and fully connected layers.
• The input to a VGG-based ConvNet is a 224x224 RGB image. A preprocessing layer takes the RGB image with pixel values in the range 0-255 and subtracts the mean image values, calculated over the entire ImageNet training set. A block sketch follows below.
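A sketch of the VGG building pattern described above: stacked 3x3 same-padded convolutions followed by 2x2/stride-2 max pooling. Only the first two of VGG16's five blocks are shown, and mean subtraction is assumed to happen in preprocessing.

from tensorflow.keras import layers, models

vgg_start = models.Sequential([
    layers.Input(shape=(224, 224, 3)),                             # 224x224 RGB input
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),              # 224 -> 112
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),              # 112 -> 56
])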