Convolutional Neural Network
Introduction
Convolutional neural networks (CNNs) are designed to work with data that has a grid-like structure, such
as images. Images have pixels that are closely related to each other, and CNNs take advantage of this. They're
especially good at recognizing image patterns, like shapes and colors.
While images are the most common use case for CNNs, they can also be used for other data types, like text,
sound, and time-series data. A key property of images is "translation invariance," meaning that an object looks the same no matter where it appears in the image. This property makes it easier for CNNs to recognize patterns and features.
A key feature of convolutional neural networks (CNNs) is the "convolution" operation. This operation
involves multiplying weights with input data from nearby areas. It's particularly useful for data with spatial
patterns, like images. A CNN is defined as a network that uses this convolution operation in at least one layer,
often in multiple layers.
History of Convolutional Neural Networks (CNNs)
CNNs were one of the first successful deep learning architectures. They were inspired by the structure of
the visual cortex in animals. In the 1950s and 60s, scientists Hubel and Wiesel discovered that the visual cortex
has cells that respond to specific regions and shapes in the visual field.
This discovery led to the development of the first neural model, the neocognitron. Later, the LeNet-5
model was developed, which was used to recognize handwritten numbers on checks.
Over time, CNNs have evolved to use more layers, stable activation functions, and powerful hardware. The
annual ImageNet competition has also driven innovation in CNNs.
Today, CNNs are a key technology in computer vision, achieving human-level performance in image
recognition tasks.
The Basic Structure of a Convolutional Network
A CNN consists of multiple layers, each with a 3D structure (height, width, and depth). The input layer
represents the image data, with each pixel having a set of values (e.g., RGB colors). The "depth" of a layer refers
to the number of feature channels (e.g., colors in an image like red, green, blue) and should not be confused with
the number of layers in the network.
Key points:
1. CNNs preserve spatial relationships between grid cells across layers, as this is essential for operations
like convolution.
2. Layers in CNNs include:
o Input Layer: This layer holds raw input images or a sequence of images to provide to our model.
o Convolution layers: Extract features like shapes or patterns.
o ReLU layers: Apply activation to make the network non-linear.
o Pooling layers: Reduce dimensions while keeping key information.
o Flattening layer: Plays a key role in transitioning data from the convolutional and pooling layers (which detect features like edges, textures, and shapes) to the fully connected layers (which handle the actual classification or regression task).
o Fully connected layers: Connect the final feature representation to the outputs, such as class scores.
o Output Layer: Converts the score for each class into a probability for that class.
3. Inputs are 2D grids (like images), with depth added for features (e.g., RGB colors). For example, a 32x32
image with three RGB channels would have dimensions 32x32x3.
4. Each layer refines features extracted from earlier layers, transitioning from simple properties (like colors)
in the first layer to complex shapes in hidden layers.
Motivation
Sparse Interactions
Each output unit is connected to (affected by) only a subset of the input units.
Sparse connectivity (upper) vs full connectivity (lower). The grey shaded nodes in the input show the
receptive field of the node in the first layer (source)
If there are m input units and n output units, a fully connected layer would require mn parameters (one
per connection) and correspondingly the number of operations would scale as O(mn). On the other hand, if each
output unit is sparsely connected to k input units, the layer requires kn parameters and O(kn) computations.
In general, for a convolutional layer, the number of output units are a function of kernel size, stride and
padding. This actually makes n a function of m. Keeping this in mind O(mn) ~ O(m²) while O(kn) ~ O(km). By
keeping k several orders of magnitude smaller than m, we see that the computational saving from sparse
connections is huge.
As a practical example, consider a 3x3 kernel operating on a black-and-white image of dimensions 224x224 (a very standard setting of kernel size and image size, seen for instance in the first layer of VGGNet). With "same" padding and a stride of 1 (discussed in detail later), the output size will also be 224x224. If this first layer were a fully connected layer, the number of parameters would be ~2.5 billion (= 224² x 224²).
On the other hand, using a sparse layer with each output connected to 9 (= 3x3) inputs, the number of parameters is ~451 thousand (= 224² x 9). In fact, a convolutional layer also incorporates parameter sharing (see below), which decreases this number further.
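A minimal Python sketch of these counts, assuming the same 224x224 grayscale image and 3x3 kernel as above:

```python
m = 224 * 224        # number of input units
n = 224 * 224        # number of output units ("same" padding, stride 1)
k = 3 * 3            # inputs connected to each output unit

fully_connected_params = m * n   # one weight per connection
sparse_params = k * n            # each output sees only k inputs (no sharing)
shared_conv_params = k           # with parameter sharing: a single 3x3 kernel

print(f"{fully_connected_params:,}")   # 2,517,630,976  (~2.5 billion)
print(f"{sparse_params:,}")            # 451,584        (~451 thousand)
print(f"{shared_conv_params:,}")       # 9
```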
Parameter Sharing
In the previous section, we saw that the output units are only connected to a small number of input units. In a
convolutional layer, each kernel weight is used at every input position (except maybe at boundaries where
different padding rules apply as discussed below), i.e. parameters used to compute different output units
are tied together. By tied together, we mean that at all times their values are same. This means that even during
training, they are updated by the same amount and by collecting the gradients from all output units.
Parameter sharing allows models to capture local connectivity while simultaneously computing the same
features at different spatial locations. We will see the use of this property soon.
Here we make a short detour to section 5 for discussing locally connected layers and tiled convolution.
• Locally connected layer/unshared convolution: The connectivity graph of convolution operation and
locally connected layer is the same. The only difference is that parameter sharing is not performed, i.e.
each output unit performs a linear operation on its neighbourhood but the parameters are not shared
across output units. This allows models to capture local connectivity while allowing different features to
be computed at different spatial locations. This however requires much more parameters than the
convolution operation.
• Tiled convolution is a sort of middle step between locally connected layer and traditional convolution. It
uses a set of kernels that are cycled through. This reduces the number of parameters in the model while
allowing for some freedom provided by unshared convolution.
Comparison of connectivity and parameters of locally-connected (top), tiled (middle) and standard convolution
(bottom) (source)
The parameter complexity and computation complexity can be compared as follows. Note that:
• m = number of input units
• n = number of output units
• k = kernel size
• l = number of kernels in the set (for tiled convolution)
A locally connected (unshared) layer requires kn parameters, tiled convolution requires kl parameters, and traditional convolution requires only k parameters; in all three cases the computation scales as O(kn).
You can now see that the quantity of ~451 thousand parameters from the earlier example corresponds to the locally connected (unshared) layer. If we use a set of 200 kernels, the number of parameters for tiled convolution is 1.8 thousand (= 200 x 9). For a traditional convolution operation, this number is just 9 parameters.
Equivariance
A function f is said to be equivariant to a function g if
f(g(x)) = g(f(x))
i.e. if input changes, the output changes in the same way.
Parameter sharing in a convolutional network provides equivariance to translation. What this means is that
translation of the image results in corresponding translation in the output map (except maybe for boundary
pixels). The reason for this is very intuitive: the same feature is being computed at all input points.
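A small sketch of this equivariance, assuming NumPy and SciPy (scipy.signal.correlate2d performs the sliding dot product used here):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image, kernel = rng.random((8, 8)), rng.random((3, 3))

shifted_input = np.roll(image, shift=1, axis=1)                          # translate right by 1
a = correlate2d(shifted_input, kernel, mode="valid")                     # f(g(x))
b = np.roll(correlate2d(image, kernel, mode="valid"), shift=1, axis=1)   # g(f(x))

# Away from the wrap-around boundary the two agree: the feature map shifts
# by exactly the same amount as the input.
print(np.allclose(a[:, 1:], b[:, 1:]))   # True
```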
Key Components of a CNN Layer
1. Filters/Kernels: Small, 3D structures that slide over the input data, performing a dot product to generate
feature maps.
2. Feature Maps: The output of a filter, representing the presence of specific features in the input data.
3. Depth: The number of feature maps in a layer, controlled by the number of filters used.
Convolution Operation
The filter slides over the input data, performing a dot product at each position to generate a feature map. The
number of possible positions defines the spatial dimensions of the next layer.
Example
Input layer: 32x32x3 (RGB image)
Filter: 5x5x3
Output layer: 28x28x5 (5 feature maps)
The number of filters controls the capacity of the model and the number of feature maps in the next layer.
Different layers can have varying numbers of feature maps, depending on the number of filters used.
In Convolutional Neural Networks (CNNs), filters (or kernels) are small 3D structures used to extract
features (like patterns or edges) from input data. These filters slide across the input grid, performing a
mathematical operation called a dot product. Here's the core idea:
1. Feature Maps: Hidden layers generate "feature maps" that represent patterns or activations (like edges
or textures) detected in the data. The more filters used, the more feature maps created, leading to greater
depth in the next layer.
2. Filter Application: Filters are smaller than the input layer. Their depth matches the input depth. For
example, a filter of size 5×5×3 works with an input depth of 3 (e.g., RGB image channels). Sliding a filter
across the input produces an output of smaller dimensions because portions of the filter at the edges
don’t fully overlap the input.
Example:
o Input: 32×32×3
o Filter: 5×5×3
o Output: 28×28×depth (where depth = number of filters, e.g., 5 filters result in depth = 5).
3. Key Idea: Filters specialize in detecting different patterns (e.g., horizontal or vertical edges). Multiple
filters enable the model to understand a broader variety of patterns and combine them into meaningful
representations.
4. Depth and Parameters: Layers closer to the input handle simple patterns, while deeper layers focus on
complex combinations. Later layers are typically smaller in width/height but have greater depth (more
feature maps), enabling richer feature extraction.
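A minimal sketch of the 32x32x3 example above, assuming PyTorch; nn.Conv2d with five 5x5 filters reproduces the 28x28x5 output:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)  # five 5x5x3 filters
x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(conv(x).shape)            # torch.Size([1, 5, 28, 28])
```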
In convolutional neural networks (CNNs), the convolution operation defines how filters (3D tensors) interact
with input data to extract features. Here's the breakdown:
1. Filter Representation:
o The pth filter in the qth layer is represented by a 3D tensor W(p,q) = [w(p,q)_ijk], where:
▪ i, j, k index the height, width, and depth of the filter.
o Filters have a specific size (e.g., 5×5×3) and slide across the input grid.
2. Input and Feature Maps:
o The feature maps in the qth layer are represented by the 3-dimensional tensor H(q) = [h(q)_ijk], where:
▪ For the first layer, H(1) represents the input image.
3. Convolution Process:
o The output of the convolution for the (q+1)th layer is calculated using the dot product between:
▪ The filter W(p,q) and the corresponding region in H(q).
▪ Mathematically:
h(q+1)_ijp = Σ_{r=1..Fq} Σ_{s=1..Fq} Σ_{k=1..dq} w(p,q)_rsk · h(q)_{i+r−1, j+s−1, k}
▪ Here, Fq is the filter size, dq is the depth of the qth layer, and the indices (r, s, k) match filter positions with input data.
4. Output Size:
o The spatial size of the output is determined by the number of valid filter placements: an Lq × Bq input yields an output of size (Lq − Fq + 1) × (Bq − Fq + 1).
o Each position where the filter overlaps the input produces a value, creating the next layer's feature map.
5. Depth of Output:
o The depth of the output depends on the number of filters (dq+1) applied, with each filter
generating one feature map.
This method ensures that the CNN extracts meaningful features like edges or patterns while maintaining spatial
relationships.
The convolution operation in CNNs may seem complex, but it essentially involves performing a simple dot
product between the filter and the input data over all valid positions in the input grid. Let’s break it down step by
step:
1. Filter Application:
o The filter is a small 3D array (e.g., 5×5×3), and we align it with the input layer (e.g., 32×32×3).
o At each valid position, we compute the dot product between the filter’s values and the
corresponding values from the input grid.
2. Number of Valid Positions:
o For an input of size 32×32 and a filter of size 5×5, there are only (32−5+1) × (32−5+1) = 28×28
valid positions for the filter to fully overlap with the input.
o This means the filter can slide across 28×28 spatial positions to generate outputs.
3. Output Calculation:
o Each valid position produces one output value. For a 5×5×3 filter, the dot product involves 75
values (5×5×3) from both the filter and the input region it overlaps with. This result becomes a
single value in the output feature map.
4. Example with Depth of 1:
o If the input is a 7×7×1 grid and the filter is 3×3×1, applying the filter at every valid position with a
stride of 1 produces a 5×5 output. Each filter generates one feature map.
5. Hierarchical Feature Detection:
o Filters in earlier layers detect simple patterns (like edges), while later layers combine these to
recognize more complex shapes. This hierarchical approach is key to CNNs' effectiveness.
6. Equivariance to Translation:
o Convolution is translation-equivariant, meaning if the input shifts, the corresponding feature map
values shift in the same way. This happens because the filter’s parameters are shared across all
input positions.
➢ Parameter Sharing: Filters share parameters across the entire input, ensuring shapes are detected the
same way regardless of their position in the image.
➢ An Example of Convolution:
• Input: A 7×7×1 layer (e.g., a grayscale image).
• Filter: A 3×3×1 filter (depth 1 for simplicity).
• The filter slides across the input, performing dot products to generate feature maps.
o Example calculations:
▪ 5×1 + 8×1 + 1×1 + 1×2 = 16
▪ 4×1 + 4×1 + 4×1 + 7×2 = 26
o The results (16, 26, etc.) form the next layer's feature map.
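A minimal sketch of this sliding dot product with NumPy (random values rather than the figure's numbers, for illustration):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over every fully overlapping position (stride 1)."""
    f = kernel.shape[0]
    out_h, out_w = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # One dot product between the kernel and the region it covers.
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.random.rand(7, 7)     # 7x7x1 input (depth 1)
kernel = np.random.rand(3, 3)    # 3x3x1 filter
print(conv2d_valid(image, kernel).shape)   # (5, 5)
```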
➢ Receptive Field:
• Each feature in the next layer captures a larger area (receptive field) of the input.
• With successive 3×3 filter convolutions, the receptive fields grow:
o First layer: 3×3 region.
o Second layer: 5×5 region.
o Third layer: 7×7 region.
➢ Layer Depth:
• The depth of the next layer depends on the number of filters in the current layer, not the input depth.
• Example: If layer 1 has five filters, layer 2 will have a depth of 5. However, layer 3 requires filters of depth
5 to match layer 2's depth.
PADDING
1. Purpose of Padding:
o Convolution operations can reduce the size of the output layers compared to the input,
potentially losing important edge information.
o Padding solves this problem by adding extra “pixels” (set to zero) around the borders of the
feature map to maintain or expand the spatial size.
2. Half-Padding:
o With half-padding, (Fq − 1)/2 zeros are added on each side of the feature map, so the output keeps the same spatial size as the input.
o An example of the padding of a single feature map is shown in Figure 8.3, where two zeros are padded on all sides of the image (or feature map).
Padding ensures key information is preserved and helps maintain consistent spatial dimensions across
layers.
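A minimal sketch of the effect of padding, assuming PyTorch; a padding of 2 is the half-padding for a 5x5 filter:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
half = nn.Conv2d(3, 5, kernel_size=5, padding=2)   # half-padding: size preserved
none = nn.Conv2d(3, 5, kernel_size=5, padding=0)   # no padding: size shrinks
print(half(x).shape)   # torch.Size([1, 5, 32, 32])
print(none(x).shape)   # torch.Size([1, 5, 28, 28])
```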
Strides
1. Strides:
o Strides determine how far the filter moves across the input during convolution.
o A stride of 1 means the filter slides one position at a time, covering every spatial position. Larger
strides (e.g., stride 2) skip positions, reducing computation and spatial size.
2. Effect of Strides:
o Using a stride Sq results in fewer spatial positions being processed:
▪ Output height: (Lq − Fq)/Sq + 1
▪ Output width: (Bq − Fq)/Sq + 1
o Larger strides reduce spatial size faster and shrink the area by a factor of approximately Sq².
3. Common Usage:
o Strides of 1 are typical. Stride 2 is sometimes used for efficiency or to limit overfitting. Strides
larger than 2 are rare.
4. Advantages of Larger Strides:
o Increase the receptive field (the region a feature captures in the input).
o Help manage memory in resource-constrained settings.
o Reduce overfitting by lowering resolution when it's unnecessarily high.
5. Alternatives to Max-Pooling:
o Historically, max-pooling was used to increase receptive fields and down sample layers.
Recently, strides have been used instead to achieve similar effects.
In essence, strides offer a balance between reducing spatial footprint and enhancing the ability to capture
complex patterns.
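A quick check of the stride effect, assuming PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
print(nn.Conv2d(3, 5, kernel_size=5, stride=1)(x).shape)  # torch.Size([1, 5, 28, 28])
print(nn.Conv2d(3, 5, kernel_size=5, stride=2)(x).shape)  # torch.Size([1, 5, 14, 14])
```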
Typical Settings
Typical Settings:
1. Stride Sizes:
o Strides of 1 are most common. Small strides like 2 are occasionally used for efficiency.
o Square images (where Lq = Bq) are preferred. Non-square images are pre-processed (e.g., cropped into square patches).
2. Filter Size:
o Typical sizes are small (e.g., 3 or 5). Smaller filters work well, as they lead to deeper networks and
better results.
o For example, the VGG network used a filter size of 3 for all layers, achieving great success.
3. Number of Filters:
o Commonly set as powers of 2 (e.g., 32, 64) to optimize processing and align with hidden layer
depths.
Use of Bias:
1. Adding Bias:
o Each filter has its own bias b(p,q), added to the convolution's output.
o The bias adds only one parameter per filter and is learned during backpropagation.
2. How Bias Works:
o Bias acts like a connection weight applied to a constant input of +1.
o This is equivalent to adding a "special pixel" with a fixed value of 1 to the input.
Therefore, the number of input features in the qth layer is Lq × Bq × dq + 1. This is a standard feature-engineering trick used for handling bias in all forms of machine learning.
In essence, these settings optimize efficiency and performance while maintaining simplicity.
The ReLU Layer
1. ReLU Activation:
o The ReLU (Rectified Linear Unit) function is applied to all values in a layer. It replaces negative
values with zero while leaving positive values unchanged.
o This operation doesn’t alter the layer’s dimensions since it maps each input value to a single
output value.
2. Position in the Network:
o ReLU follows convolution operations (similar to activation functions following linear
transformations in traditional neural networks).
o While essential, ReLU layers are often omitted from illustrations for simplicity.
3. Why ReLU?:
o Earlier activation functions like sigmoid and tanh were slower and less accurate.
o ReLU offers faster training and better results, allowing deeper networks and longer training times.
ReLU has become the standard activation function in modern convolutional neural networks due to its
simplicity and performance.
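A minimal sketch of ReLU with NumPy:

```python
import numpy as np

def relu(x):
    # Negative values become zero; positive values pass through unchanged.
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0.  0.  0.  1.5 3. ]
```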
Pooling
Pooling is a fundamental step in convolutional neural networks, typically forming the third stage of a
convolutional layer. A typical convolutional layer operates as follows:
1. Convolution: The first stage applies multiple convolutions in parallel, generating a set of linear
activations.
2. Nonlinear Activation: In the second stage, these linear activations are transformed using a nonlinear
function, such as the rectified linear unit (ReLU). This step, often called the detector stage, introduces
nonlinearity into the network.
3. Pooling: The third stage modifies the outputs further through a pooling function.
➢ What is Pooling?
• Pooling reduces the spatial size (height and width) of feature maps while keeping their depth
unchanged.
• It operates independently on small regions (e.g., 2×2 or 3×3) in each feature map.
• The most common type is max-pooling, which selects the maximum value from each region.
➢ Types of Pooling:
Pooling simplifies and condenses the feature representation by summarizing nearby outputs into a single
value. For instance:
• Max Pooling: Selects the maximum value in a defined rectangular region.
• Average Pooling: Calculates the average value of the region.
• L2 Norm Pooling: Computes the square root of the sum of squares of the values.
• Weighted Average Pooling: Averages values based on their distance from a central point.
For a feature map of dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer are
((nh − f)/s + 1) x ((nw − f)/s + 1) x nc
where:
nh → height of the feature map
nw → width of the feature map
nc → number of channels in the feature map
f → size of the pooling filter
s → stride length
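A minimal sketch of this rule, assuming PyTorch (2x2 max-pooling with stride 2, matching the example further below):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 4, 4)                   # 16 feature maps of size 4x4
pool = nn.MaxPool2d(kernel_size=2, stride=2)   # f = 2, s = 2
print(pool(x).shape)   # torch.Size([1, 16, 2, 2]); the depth n_c is unchanged
```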
MAX POOLING
AVERAGE POOLING
➢ Advantages
• Dimensionality Reduction
• Translation Invariance
• Feature Selection
➢ Disadvantages
• Information Loss
• Over-smoothing
• Hyperparameter tuning
➢ Strides in Pooling:
• The stride determines how far the pooling window moves.
o For stride = 1: Slightly overlapping regions, smaller reduction in size.
o For stride = 2: Non-overlapping regions, more significant size reduction.
➢ Translation Invariance:
• Pooling adds translation invariance, meaning it classifies objects similarly even if their positions in
the image shift slightly (e.g., a bird is still a bird, no matter where it is in the image).
➢ Receptive Field:
• Pooling increases the receptive field, allowing layers to capture larger regions of the input. This helps
identify complex features in deeper layers.
➢ Max-Pooling vs. Other Methods:
• Max-pooling is more common than average-pooling because it retains key information effectively.
• Example: A 2×2 max-pooling with stride 2 reduces a layer size from 4×4 to 2×2.
➢ Emerging Trends:
• Some recent designs replace pooling with convolutional layers using larger strides for reducing size.
However, max-pooling remains popular for its unique advantages like nonlinearity and translation
invariance.
Fully Connected Layers
➢ Fully Connected Layers:
• Features from the final spatial layer are fully connected to the next layer, like in traditional feed-
forward networks.
• Multiple fully connected layers may be used to enhance computational power.
• These layers have the majority of the network's parameters, as they are densely connected (e.g., two
layers with 4096 hidden units would involve over 16 million weights).
➢ Parameter and Memory Trade-Off:
• Convolutional layers have more activations (higher memory usage), but fully connected layers have
more parameters.
• The design of these layers can vary based on the task (e.g., classification vs. segmentation) and
resource constraints (e.g., memory or data availability).
➢ Output Layer:
• For classification tasks, the output layer is fully connected and uses an activation function like
softmax, logistic, or linear, depending on the application.
➢ Alternative to Fully Connected Layers:
• Average Pooling: Aggregates values across the entire spatial area of the final activation maps.
o Example: If the final maps are 7×7×256, average pooling creates 256 features by averaging 49
values per feature.
• This reduces the parameter count and improves generalization. This method was used in GoogLeNet (see the sketch after this list).
➢ Fully Convolutional Networks (FCNs):
• For tasks like image segmentation, fully connected layers are replaced with 1×1 convolutions to
create an output spatial map where each pixel corresponds to a class label.
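A small sketch of the global average pooling alternative described above, assuming PyTorch:

```python
import torch
import torch.nn as nn

maps = torch.randn(1, 256, 7, 7)       # final activation maps: 7x7x256
gap = nn.AdaptiveAvgPool2d(1)          # average the 49 values of each map
features = gap(maps).flatten(1)
print(features.shape)                  # torch.Size([1, 256])
```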
The Interleaving Between Layers
➢ Layer Interleaving:
• Convolutional layers (C) are typically followed by ReLU layers (R) to apply non-linearity.
• After two or three convolution-ReLU pairs, a pooling layer (P) is added to reduce spatial size.
➢ Common Patterns:
• Example layer sequences:
o CRCRP: Two convolution-ReLU pairs followed by pooling.
o CRCRCRP: Three convolution-ReLU pairs followed by pooling.
• Repeating these patterns creates deeper networks.
➢ Fully Connected Layers:
• After multiple convolution-ReLU-pooling sequences, fully connected layers (F) are added for final
computation.
Example network: CRCRP → CRCRP → CRCRPF
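A minimal sketch of one CRCRP block, assuming PyTorch; the channel counts and the 32x32 RGB input are illustrative choices, not prescribed by the text:

```python
import torch
import torch.nn as nn

crcrp = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),   # C R
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # C R
    nn.MaxPool2d(2),                                          # P
)
print(crcrp(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 32, 16, 16])
```

Stacking several such blocks and ending with fully connected layers yields the deeper patterns described above.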
➢ Pooling's Role:
• Pooling reduces the spatial footprint of activation maps using larger strides.
• Strided convolutions can also reduce spatial size and may replace pooling in some cases.
➢ Deep Networks:
• Modern convolutional networks often have more than 15 layers.
• Skip connections are used to link layers, improving performance in very deep architectures.
LeNet-5
1. Overview:
o LeNet-5 is one of the earliest neural networks, designed for grayscale images with one colour
channel.
o Commonly used for character recognition (e.g., checks), it assumes ten-character classes for
classification.
2. Architecture:
o Contains two convolution layers, two pooling layers, and three fully connected layers.
o Later layers have multiple feature maps due to multiple filters.
3. Details:
o C5 Layer: The first fully connected layer (C5) was labelled as a convolution layer in its original
design because it could handle spatial features, but it is effectively a fully connected layer.
o Subsampling (or pooling) layers averaged values over 2×2 regions with stride 2, unlike modern
max-pooling.
o Sigmoid activations followed subsampling; however, modern networks favour ReLU activations
after convolutions.
4. Evolution:
o Average pooling and RBF (Radial Basis Function) units in the final layer were used. Today, these
are replaced by max-pooling and softmax layers with log-likelihood loss.
5. Significance:
o LeNet-5 was shallow compared to modern architectures but introduced core concepts of
convolutional networks.
o The RBF approach used a prototype vector to compute distances for classification, which is now
considered outdated.
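A modernized LeNet-5-style sketch, assuming PyTorch; ReLU and max-pooling stand in for the original sigmoid/average subsampling and the RBF output, and the classic filter and unit counts (6, 16, 120, 84) are used for illustration:

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),   # C1: 32x32x1 -> 28x28x6
    nn.MaxPool2d(2),                             # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),  # C3: -> 10x10x16
    nn.MaxPool2d(2),                             # S4: -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),       # C5: effectively fully connected
    nn.Linear(120, 84), nn.ReLU(),               # F6
    nn.Linear(84, 10),                           # ten character classes
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```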
Local Response Normalization
➢ What is LRN?
• LRN is applied immediately after the ReLU layer to improve generalization by fostering competition
among filters.
• Inspired by biological systems, it normalizes the activation values of neurons.
➢ Normalization Formula:
• For an activation ai, the normalized value bi is calculated as:
bi = ai / (k + α · Σ_{j = i−n/2 .. i+n/2} aj²)^β
In the above formula, any value of i − n/2 that is less than 0 is set to 0, and any value of i + n/2 that is greater than N is set to N (N being the total number of filters). The use of this type of normalization is now obsolete, and its discussion has been included here for historical reasons.
➢ Practical Computation:
• Instead of normalizing over all filters, it’s done over small groups (e.g., 5 adjacent filters).
• This approach reduces computational complexity while retaining effectiveness.
➢ Relevance Today:
• LRN is now mostly obsolete and included for historical context. Modern architectures favor simpler
approaches for normalization.
In the provided example, a 7x7 image is convolved with a 3x3 filter using a stride of 2.
This means the filter moves two pixels at a time, both horizontally and vertically. The calculation is as
follows:
• The 3x3 filter is applied to the top-left 3x3 portion of the 7x7 input, resulting in the value 91.
• The filter then shifts two steps to the right, and the process is repeated, resulting in 110.
• This continues across the row with a stride of two, and then moves down two rows and repeats,
resulting in a 3x3 output.
The formula to calculate the output size of a convolutional layer is
((n + 2p - f) / s) + 1,
where:
n is the input size
p is the padding
f is the filter size
s is the stride
In this case, with n = 7, p = 0, f = 3, and s = 2, the output size is ((7 + 0 - 3) / 2) + 1 = 3.
Thus, the output is a 3x3 matrix.
Strided convolutions can reduce the spatial dimensions of feature maps, which can be useful for
capturing more global features and reducing computational cost.
Tiled Convolution
Tiled Convolutional Neural Networks (CNNs) are an advanced extension of traditional Neural
Networks. They incorporate multiple convolution kernels (k kernels) within the same layer, applied over
every kth unit, a concept known as "tiling." Research has demonstrated that even a small value of k, like
2, can yield effective results.
Tiled convolution is a technique introduced to strike a balance between two common types of
layers: convolutional layers and locally connected layers. It enhances feature learning capabilities
while maintaining efficiency in terms of memory and computation.
Here’s a breakdown of the concept and its details:
What is Tiled Convolution?
In a traditional convolutional layer:
• The same set of weights (or kernel) is applied across all spatial locations of the input data. This is why
it achieves translational invariance (e.g., recognizing a feature regardless of where it appears in the
input).
In a locally connected layer:
• Different weights are learned for every spatial location, offering flexibility to capture highly localized
variations. However, this requires a large number of parameters, leading to higher memory
requirements.
Tiled Convolution, on the other hand:
• Uses a compromise approach. Instead of having a single set of weights (as in convolutional layers) or
unique weights for each spatial location (as in locally connected layers), it cycles through a set of
kernels (filters) as it moves across the input space.
How It Works
• The tiling mechanism divides the input space into smaller tiles, and each tile is cyclically processed
by a different kernel.
• For example:
o If there are t kernels, the first kernel is applied to the first tile, the second kernel to the next
tile, and so on. This pattern repeats cyclically.
o This allows neighboring locations to be processed by different filters, adding diversity in
feature extraction.
Mathematical Definition
Tiled convolution can be expressed algebraically as
Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} · K_{i, l, m, n, j%t+1, k%t+1}
Where:
• Z_{i,j,k} represents the output at a specific position in the feature map, V is the input, K is the set of kernels, t is the number of kernels in the set, and % denotes the modulo operation (so neighbouring output positions cycle through different kernels).
If the stride is s and the padding is p, the stride of the transposed convolutional layer determines the step size over the input indices, and the padding determines the number of pixels to add to the edges of the input before performing the convolution. For an n × n input and a k × k kernel, the output of the transposed convolutional layer then has spatial size s·(n − 1) + k − 2p.
Transposed convolutional layers are often used in conjunction with other types of layers, such as pooling layers and fully connected layers, to build deep convolutional networks for various tasks.
Dilated Convolution
It is a technique that expands the kernel (input) by inserting holes between its consecutive elements.
In simpler terms, it is the same as convolution but it involves pixel skipping, so as to cover a larger area of
the input.
Dilated convolution, also known as atrous convolution, is a type of convolution operation used in
convolutional neural networks (CNNs) that enables the network to have a larger receptive field without
increasing the number of parameters.
In a regular convolution operation, a filter of a fixed size slides over the input feature map, and the
values in the filter are multiplied with the corresponding values in the input feature map to produce a
single output value. The receptive field of a neuron in the output feature map is defined as the area in the
input feature map that the filter can “see”. The size of the receptive field is determined by the size of the
filter and the stride of the convolution.
In contrast, in a dilated convolution operation, the filter is “dilated” by inserting gaps between the
filter values. The dilation rate determines the size of the gaps, and it is a hyperparameter that can be
adjusted. When the dilation rate is 1, the dilated convolution reduces to a regular convolution.
The dilation rate effectively increases the receptive field of the filter without increasing the number
of parameters, because the filter is still the same size, but with gaps between the values. This can be
useful in situations where a larger receptive field is needed, but increasing the size of the filter would lead
to an increase in the number of parameters and computational complexity.
Dilated convolutions have been used successfully in various applications, such as semantic
segmentation, where a larger context is needed to classify each pixel, and audio processing, where the
network needs to learn patterns with longer time dependencies.
Some advantages of dilated convolutions are:
1. Increased receptive field without increasing parameters
2. Can capture features at multiple scales
3. Reduced spatial resolution loss compared to regular convolutions with larger filters
Some disadvantages of dilated convolutions are:
1. Reduced spatial resolution in the output feature map compared to the input feature map
2. Increased computational cost compared to regular convolutions with the same filter size and stride
An additional parameter l (dilation factor) tells how much the input is expanded. In other words,
based on the value of this parameter, (l-1) pixels are skipped in the kernel. Fig 1 depicts the difference
between normal vs dilated convolution. In essence, normal convolution is just a 1-dilated convolution.
(F *l k)(p) = Σ_{s + l·t = p} F(s) · k(t)
Where,
F(s) = Input
k(t) = Applied filter
*l = l-dilated convolution
(F *l k)(p) = Output
Advantages of Dilated Convolution:
Using this method rather than normal convolution is better as:
1. Larger receptive field (i.e. no loss of coverage)
2. Computationally efficient (as it provides a larger coverage on the same computation cost)
3. Lesser Memory consumption (as it skips the pooling step) implementation
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. Structure of this convolution helps in maintaining the order of the data.
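A minimal sketch of dilation, assuming PyTorch; with dilation 2, a 3x3 kernel covers a 5x5 area while still having only nine weights:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
normal = nn.Conv2d(1, 1, kernel_size=3, dilation=1)   # receptive field 3x3
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # receptive field 5x5, same 9 weights
print(normal(x).shape)    # torch.Size([1, 1, 30, 30])
print(dilated(x).shape)   # torch.Size([1, 1, 28, 28])
```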
Types of Convolutions
Comparing Unshared, Tiled and Traditional Convolutions
Unshared (Locally Connected) Convolution
Properties:
1. No parameter sharing.
2. Each output unit performs a linear operation on its neighbourhood, but parameters are not shared across output units.
3. Captures local connectivity while allowing different features to be computed at different spatial locations.
Advantages:
1. Reducing memory consumption.
2. Increasing statistical efficiency.
3. Reducing the amount of computation needed to perform forward and back-propagation.
Disadvantages:
1. Requires many more parameters than the convolution operation.
A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labelling each connection with a unique letter.
Tiled Convolution
Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has
edges labelled “a” and “b,” while the other has edges labelled “c” and “d.” Each time we move one pixel to the
right in the output, we move on to using a different kernel. This means that, like the locally connected layer,
neighbouring units in the output have different parameters. Unlike the locally connected layer, after we have
gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a
multiple of t steps, then they share parameters.
Traditional Convolution
Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied
everywhere, as indicated in the diagram by using the kernel with weights labelled “a” and “b” everywhere.
(Table: zero-padding types, their properties, and examples.)
Variable Description
X — Input image tensor
Y — Probability distribution over labels for each pixel
H — Hidden representation
U — Tensor of convolution kernels
V — Tensor of kernels used to produce an estimation of the labels
W — Kernel tensor used to convolve over Y to provide input to H
Data Types
The data used with a convolutional network usually consists of several channels, each channel
being the observation of a different quantity at some point in space or time.
➢ Cell Contribution:
• Each cell in layer i contributes to multiple cells in layer i+1, depending on filter size and stride.
• During backpropagation, gradients from all cells that a given cell contributes to are aggregated backward.
➢ Gradient Calculation Pseudocode:
• Find all cells in layer i+1 (the set Sc) that depend on a specific cell c in layer i.
• For each dependent cell r in Sc:
o Multiply the loss gradient of r by the filter weight connecting c to r.
• Sum these products to compute the gradient for cell c.
➢ Weight Gradients:
• Multiply the hidden activation value of a cell in layer i by the loss gradient of the cell it feeds in layer i+1 to compute the gradient of the connecting weight.
• As filter weights are shared, sum the gradients over all copies of a shared weight.
The method described above follows traditional backpropagation by accumulating gradients linearly. However,
extra care is needed to track which cells in one layer influence the next. Backpropagation can also be
implemented using tensor operations, which can further be simplified into matrix multiplications. These
techniques provide useful insights into how feed-forward networks can be generalized to convolutional networks
and will be explained in the following sections.
Backpropagation as Convolution with Inverted/Transposed Filter
➢ Backpropagation in Convolutional Networks:
• Similar to traditional neural networks, where gradients are propagated backward by multiplying with the
transposed weight matrix.
• In convolutional neural networks (CNNs), gradients are associated with spatial positions, and the
concept extends to inverted (or transposed) filters.
➢ Gradient Propagation for 2D Convolutions:
• Suppose layer q (input) has depth = 1 and layer q+1(output) has depth = 1, with a stride of 1.
• Backpropagation involves convolving the gradients at layer q+1 with the inverted filter from the forward pass.
• Filter Inversion:
o The convolution filter is flipped both horizontally and vertically for backpropagation.
o This is because the relative movement of the gradients during backpropagation is opposite to that
during forward convolution.
• Padding Relationship:
o For a stride of 1, the paddings used in the forward and backward convolutions sum to Fq − 1.
• When the depths dq and dq+1 are greater than 1, additional tensor transpositions are required.
o Note that i and j refer to spatial positions, whereas k refers to the depth-centric position of the weight.
▪ The backward filters U(k,q) = [u(k,q)_ijp] are obtained from the forward filters W(p,q) = [w(p,q)_ijk] by flipping the spatial indices and swapping the filter-identifier and depth indices:
▪ u(k,q)_ijp = w(p,q)_{Fq−i+1, Fq−j+1, k}
▪ The index of the filter identifier and the depth within a filter have been interchanged between W and U in the above equation.
➢ Practical Example:
• Suppose there are 20 filters applied to a 3-channel RGB input to produce an output of depth 20.
• During backpropagation:
o For the red channel, extract the corresponding filters, invert them, and aggregate them into a
depth-20 gradient.
o Repeat this for green and blue channels.
• The transposition and inversion described above ensure proper gradient computation.
Summary:
• Backpropagation in CNNs involves using inverted (flipped) filters.
• With multiple input/output depths, tensor transpositions ensure that gradients are mapped correctly.
Data Augmentation
➢ Purpose:
• Data augmentation is used to reduce overfitting by generating new training examples from existing data
through transformations.
➢ Why It Works for Images:
• In image processing, transformations like translation, rotation, patch extraction, and reflection do not
change the essence of objects in the images.
• These transformations help the model generalize better by training it to recognize objects in different
orientations or conditions.
➢ Common Methods:
• Simple Transformations: Mirror images, reflections, or varied color intensities can be applied during
training.
• Patch Extraction: Extracting smaller patches (e.g., 224×224×3 patches used in AlexNet) is common to
train models on fixed input sizes.
➢ Principal Component Analysis (PCA):
• PCA can adjust color intensity by applying Gaussian noise to principal components of pixel values. While
effective, this method can be computationally expensive.
➢ Caution:
• Data augmentation must suit the dataset and task:
o For example, rotation or reflection of MNIST handwritten digits can produce invalid data (e.g.,
rotating a '6' makes it a '9').
➢ Impact:
• Augmentation techniques can significantly improve model performance, reducing error rates, as shown
in studies.
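A minimal sketch of such augmentations, assuming torchvision; the specific transforms and the 224x224 patch size are illustrative, not prescribed here:

```python
from torchvision import transforms

# Applied on the fly during training (to images larger than 224x224), so each
# epoch sees slightly different versions of every image, reducing overfitting.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),        # mirror image / reflection
    transforms.RandomCrop(224),               # patch extraction (224x224)
    transforms.ColorJitter(brightness=0.2),   # vary colour intensity
    transforms.ToTensor(),
])
```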
The picture illustrates variations of Gabor functions based on changes in their parameters. Here's what it explains:
1. Left Section:
o Shows how the Gabor function shifts and rotates based on the parameters that define the coordinate system (x0, y0, and τ).
o Each Gabor filter in the grid is centered at a specific position (x0, y0), and its sensitivity aligns with a direction radiating outward from the grid's center.
2. Center Section:
o Highlights the impact of the Gaussian scale parameters (βx and βy), which control the width and height of the Gabor functions.
o As you move left to right, the filters become wider (decreasing βx). Moving top to bottom