
CONVOLUTIONAL NEURAL NETWORK

Introduction
Convolutional neural networks (CNNs) are designed to work with data that has a grid-like structure, such
as images. Images have pixels that are closely related to each other, and CNNs take advantage of this. They're
especially good at recognizing image patterns, like shapes and colors.
While images are the most common use case for CNNs, they can also be used for other data types, like text,
sound, and time-series data. One key property of images is translation invariance: an object produces the
same local pattern of pixel values no matter where it appears in the image. CNNs exploit this property, which
makes it easier for them to recognize patterns and features.
A key feature of convolutional neural networks (CNNs) is the "convolution" operation. This operation
involves multiplying weights with input data from nearby areas. It's particularly useful for data with spatial
patterns, like images. A CNN is defined as a network that uses this convolution operation in at least one layer,
often in multiple layers.
History of Convolutional Neural Networks (CNNs)
CNNs were one of the first successful deep learning architectures. They were inspired by the structure of
the visual cortex in animals. In the 1950s and 60s, scientists Hubel and Wiesel discovered that the visual cortex
has cells that respond to specific regions and shapes in the visual field.
This discovery inspired an early neural model of the visual system, the neocognitron. Later, the LeNet-5
model was developed, which was used to recognize handwritten numbers on checks.
Over time, CNNs have evolved to use more layers, stable activation functions, and powerful hardware. The
annual ImageNet competition has also driven innovation in CNNs.
Today, CNNs are a key technology in computer vision, achieving human-level performance in image
recognition tasks.
The Basic Structure of a Convolutional Network
A CNN consists of multiple layers, each with a 3D structure (height, width, and depth). The input layer
represents the image data, with each pixel having a set of values (e.g., RGB colors). The "depth" of a layer refers
to the number of feature channels (e.g., colors in an image like red, green, blue) and should not be confused with
the number of layers in the network.
Key points:
1. CNNs preserve spatial relationships between grid cells across layers, as this is essential for operations
like convolution.
2. Layers in CNNs include:
o Input Layer: This layer holds raw input images or a sequence of images to provide to our model.
o Convolution layers: Extract features like shapes or patterns.
o ReLU layers: Apply activation to make the network non-linear.
o Pooling layers: Reduce dimensions while keeping key information.
o Flattening layers: transition the data from the convolutional and pooling layers (which detect
features like edges, textures, and shapes) to the fully connected layers (which handle the actual
classification or regression task).
o Fully connected layers: connect the final layers to outputs such as classifications.
o Output layer: converts the final layer's outputs into a probability score for each class.
3. Inputs are 2D grids (like images), with depth added for features (e.g., RGB colors). For example, a 32x32
image with three RGB channels would have dimensions 32x32x3.
4. Each layer refines features extracted from earlier layers, transitioning from simple properties (like colors)
in the first layer to complex shapes in hidden layers.
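As a rough illustration of how these layer types fit together, here is a minimal sketch (assuming PyTorch is available; the channel counts and sizes are purely illustrative, not prescribed by the text):

```python
import torch
import torch.nn as nn

# Toy classifier assembling the layer types listed above (illustrative sizes).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: extracts feature maps
    nn.ReLU(),                                   # activation: adds non-linearity
    nn.MaxPool2d(2),                             # pooling: 32x32 -> 16x16
    nn.Flatten(),                                # flattening: prepares for dense layers
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer
    nn.Softmax(dim=1),                           # output layer: class probabilities
)

x = torch.randn(1, 3, 32, 32)                    # one 32x32 RGB input
print(model(x).shape)                            # torch.Size([1, 10])
```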
Motivation
Sparse Interactions
Each output unit is connected to (affected by) only a subset of the input units.
(Figure: sparse connectivity (upper) vs. full connectivity (lower); the grey shaded nodes in the input show the
receptive field of the node in the first layer.)
If there are m input units and n output units, a fully connected layer would require mn parameters (one
per connection) and correspondingly the number of operations would scale as O(mn). On the other hand, if each
output unit is sparsely connected to k input units, the layer requires kn parameters and O(kn) computations.
In general, for a convolutional layer, the number of output units is a function of the kernel size, stride, and
padding. This makes n a function of m. Keeping this in mind, O(mn) ~ O(m²) while O(kn) ~ O(km). By
keeping k several orders of magnitude smaller than m, we see that the computational saving from sparse
connections is huge.
As a practical example, consider a 3x3 kernel operating on a black-and-white image of
dimensions 224x224 (this is a very standard setting of kernel size and image size, and can be seen in the first
layer of the VGGNet). For 'same' padding and a stride of 1 (discussed in detail later), the output size will also
be 224x224. If this first layer were a fully connected layer, the number of parameters would be ~2.5
billion (= 224² x 224²).
On the other hand, using a sparse layer with each output connected to 9 (= 3x3) inputs, the number of
parameters is ~451 thousand (= 224² x 9). In fact, a convolutional layer also incorporates parameter sharing (see
below), and this number decreases further.
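The figures above can be checked with a few lines of arithmetic; this sketch simply reproduces the counts quoted in the text:

```python
# Parameter counts for a 224x224 single-channel input with a 3x3 kernel,
# 'same' padding and stride 1, so the output is also 224x224.
m = 224 * 224                   # number of input units
n = 224 * 224                   # number of output units
k = 3 * 3                       # inputs feeding each output unit

fully_connected = m * n         # one weight per connection   ~ 2.5 billion
locally_connected = n * k       # unshared 3x3 weights        ~ 451 thousand
shared_convolution = k          # one 3x3 kernel reused everywhere = 9

print(f"{fully_connected:,}  {locally_connected:,}  {shared_convolution}")
```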
Parameter Sharing
In the previous section, we saw that the output units are only connected to a small number of input units. In a
convolutional layer, each kernel weight is used at every input position (except maybe at boundaries where
different padding rules apply as discussed below), i.e. parameters used to compute different output units
are tied together. By tied together, we mean that their values are the same at all times. This means that even
during training they are updated by the same amount, obtained by collecting the gradients from all output units.
Parameter sharing allows models to capture local connectivity while simultaneously computing the same
features at different spatial locations. We will see the use of this property soon.
Here we make a short detour to discuss locally connected layers and tiled convolution.
• Locally connected layer/unshared convolution: The connectivity graph of convolution operation and
locally connected layer is the same. The only difference is that parameter sharing is not performed, i.e.
each output unit performs a linear operation on its neighbourhood but the parameters are not shared
across output units. This allows models to capture local connectivity while allowing different features to
be computed at different spatial locations. This however requires much more parameters than the
convolution operation.
• Tiled convolution is a sort of middle step between locally connected layer and traditional convolution. It
uses a set of kernels that are cycled through. This reduces the number of parameters in the model while
allowing for some freedom provided by unshared convolution.

(Figure: comparison of connectivity and parameters of locally connected (top), tiled (middle), and standard
convolution (bottom).)
The parameter complexity and computational complexity can be summarized as follows. Note that:
• m = number of input units
• n = number of output units
• k = kernel size
• l = number of kernels in the set (for tiled convolution)
The number of parameters is k·n for a locally connected layer, k·l for tiled convolution, and k for traditional
convolution; in all three cases the computation scales as O(k·n).
You can see now that the quantity of ~451 thousand parameters corresponds to the locally connected layer. If
we use a set of 200 kernels, the number of parameters for tiled convolution is 1.8 thousand. For a traditional
convolution operation, this number is just 9 parameters.
Equivariance
A function f is said to be equivariant to a function g if
f(g(x)) = g(f(x))
i.e. if input changes, the output changes in the same way.
Parameter sharing in a convolutional network provides equivariance to translation. What this means is that
translation of the image results in corresponding translation in the output map (except maybe for boundary
pixels). The reason for this is very intuitive: the same feature is being computed at all input points.
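This property is easy to check numerically. The sketch below (plain NumPy, with a hand-rolled 1-D cross-correlation standing in for a convolution layer) shows that shifting the input shifts the output by the same amount, apart from boundary effects:

```python
import numpy as np

def correlate1d(x, w):
    """1-D 'valid' cross-correlation: y[i] = sum_j x[i + j] * w[j]."""
    return np.array([np.dot(x[i:i + len(w)], w)
                     for i in range(len(x) - len(w) + 1)])

x = np.random.randn(10)
w = np.random.randn(3)

y = correlate1d(x, w)
y_shifted = correlate1d(np.roll(x, 1), w)   # translate the input by one step

# Interior outputs are translated by exactly the same amount.
print(np.allclose(y_shifted[1:], y[:-1]))   # True
```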
Key Components of a CNN Layer
1. Filters/Kernels: Small, 3D structures that slide over the input data, performing a dot product to generate
feature maps.
2. Feature Maps: The output of a filter, representing the presence of specific features in the input data.
3. Depth: The number of feature maps in a layer, controlled by the number of filters used.
Convolution Operation
The filter slides over the input data, performing a dot product at each position to generate a feature map. The
number of possible positions defines the spatial dimensions of the next layer.
Example
Input layer: 32x32x3 (RGB image)
Filter: 5x5x3
Output layer: 28x28x5 (5 feature maps)
The number of filters controls the capacity of the model and the number of feature maps in the next layer.
Different layers can have varying numbers of feature maps, depending on the number of filters used.
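The 32×32×3 → 28×28×5 example can be reproduced directly; a minimal check (assuming PyTorch, with randomly initialized filters) is:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)      # one 32x32 RGB image (channels-first layout)
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)  # five 5x5x3 filters

print(conv(x).shape)               # torch.Size([1, 5, 28, 28]) -> 28x28x5 feature maps
```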
In Convolutional Neural Networks (CNNs), filters (or kernels) are small 3D structures used to extract
features (like patterns or edges) from input data. These filters slide across the input grid, performing a
mathematical operation called a dot product. Here's the core idea:
1. Feature Maps: Hidden layers generate "feature maps" that represent patterns or activations (like edges
or textures) detected in the data. The more filters used, the more feature maps created, leading to greater
depth in the next layer.
2. Filter Application: Filters are smaller than the input layer. Their depth matches the input depth. For
example, a filter of size 5×5×3 works with an input depth of 3 (e.g., RGB image channels). Sliding a filter
across the input produces an output of smaller dimensions because portions of the filter at the edges
don’t fully overlap the input.
Example:
o Input: 32×32×3
o Filter: 5×5×3
o Output: 28×28×depth (where depth = number of filters, e.g., 5 filters result in depth = 5).
3. Key Idea: Filters specialize in detecting different patterns (e.g., horizontal or vertical edges). Multiple
filters enable the model to understand a broader variety of patterns and combine them into meaningful
representations.
4. Depth and Parameters: Layers closer to the input handle simple patterns, while deeper layers focus on
complex combinations. Later layers are typically smaller in width/height but have greater depth (more
feature maps), enabling richer feature extraction.
In convolutional neural networks (CNNs), the convolution operation defines how filters (3D tensors) interact
with input data to extract features. Here's the breakdown:
1. Filter Representation:
o Each pth filter in the qth layer is represented by a 3D tensor W^(p,q) = [w^(p,q)_{ijk}], where:
▪ i, j, k index the height, width, and depth positions of the filter.
o Filters have a specific size (e.g., 5×5×3) and slide across the input grid.
2. Input and Feature Maps:
o The feature maps in the qth layer are represented by the 3-dimensional tensor H^(q) = [h^(q)_{ijk}],
where:
▪ For the first layer (q = 1), H^(1) represents the input image.
3. Convolution Process:
o The output of the convolution for the (q+1)th layer is calculated using the dot product between:
▪ The filter W^(p,q) and the corresponding region in H^(q).
▪ Mathematically:
h^(q+1)_{ijp} = Σ_{r=1..Fq} Σ_{s=1..Fq} Σ_{k=1..dq} w^(p,q)_{rsk} · h^(q)_{i+r−1, j+s−1, k}
▪ Here, Fq is the filter size, dq is the depth of the qth layer, and the indices (r, s, k) match filter
positions with the input data.
4. Output Size:
o The spatial size of the output is determined by the number of valid filter placements: an input of
spatial size L × B with an Fq × Fq filter yields an output of size (L − Fq + 1) × (B − Fq + 1).
o Each position where the filter overlaps the input produces a value, creating the next layer's
feature map.
5. Depth of Output:
o The depth of the output depends on the number of filters (dq+1) applied, with each filter
generating one feature map.
This method ensures that the CNN extracts meaningful features like edges or patterns while maintaining spatial
relationships.

The convolution operation in CNNs may seem complex, but it essentially involves performing a simple dot
product between the filter and the input data over all valid positions in the input grid. Let’s break it down step by
step:
1. Filter Application:
o The filter is a small 3D array (e.g., 5×5×3), and we align it with the input layer (e.g., 32×32×3).
o At each valid position, we compute the dot product between the filter’s values and the
corresponding values from the input grid.
2. Number of Valid Positions:
o For an input of size 32×32 and a filter of size 5×5, there are only (32−5+1) × (32−5+1) = 28×28
valid positions for the filter to fully overlap with the input.
o This means the filter can slide across 28×28 spatial positions to generate outputs.
3. Output Calculation:
o Each valid position produces one output value. For a 5×5×3 filter, the dot product involves 75
values (5×5×3) from both the filter and the input region it overlaps with. This result becomes a
single value in the output feature map.
4. Example with Depth of 1:
o If the input is a 7×7×1 grid and the filter is 3×3×1, applying the filter at every valid position with a
stride of 1 produces a 5×5 output. Each filter generates one feature map.
5. Hierarchical Feature Detection:
o Filters in earlier layers detect simple patterns (like edges), while later layers combine these to
recognize more complex shapes. This hierarchical approach is key to CNNs' effectiveness.
6. Equivariance to Translation:
o Convolution is translation-equivariant, meaning if the input shifts, the corresponding feature map
values shift in the same way. This happens because the filter’s parameters are shared across all
input positions.
➢ Parameter Sharing: Filters share parameters across the entire input, ensuring shapes are detected the
same way regardless of their position in the image.
➢ An Example of Convolution:
• Input: A 7×7×1 layer (e.g., a grayscale image).
• Filter: A 3×3×1 filter (depth 1 for simplicity).
• The filter slides across the input, performing dot products to generate feature maps.
o Example calculations (using values from the figure):
▪ 5×1 + 8×1 + 1×1 + 1×2 = 16
▪ 4×1 + 4×1 + 4×1 + 7×2 = 26
o The results (16, 26, etc.) form the next layer's feature map.
➢ Receptive Field:
• Each feature in the next layer captures a larger area (receptive field) of the input.
• With successive 3×3 filter convolutions, the receptive fields grow:
o First layer: 3×3 region.
o Second layer: 5×5 region.
o Third layer: 7×7 region.
➢ Layer Depth:
• The depth of the next layer depends on the number of filters in the current layer, not the input depth.
• Example: If layer 1 has five filters, layer 2 will have a depth of 5. However, layer 3 requires filters of depth
5 to match layer 2's depth.
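The sliding dot product described above can be written out explicitly. The sketch below (NumPy, depth 1 for simplicity, values chosen arbitrarily) reproduces the 7×7 input / 3×3 filter / 5×5 output case:

```python
import numpy as np

def valid_conv2d(image, kernel):
    """Dot product of the kernel with every position where it fully overlaps the input."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.random.randint(0, 10, size=(7, 7)).astype(float)  # 7x7 input (depth 1)
kernel = np.ones((3, 3))                                     # 3x3 filter
print(valid_conv2d(image, kernel).shape)                     # (5, 5)
```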
PADDING
1. Purpose of Padding:
o Convolution operations can reduce the size of the output layers compared to the input,
potentially losing important edge information.
o Padding solves this problem by adding extra “pixels” (set to zero) around the borders of the
feature map to maintain or expand the spatial size.
2. Half-Padding:
o Adds (F − 1)/2 zero pixels on all sides of the input or hidden layers, where F is the filter size.
o Maintains the spatial size of the input after convolution (e.g., a 32×32 input grows to 36×36 with
padding for a 5×5 filter, then returns to 32×32 after the convolution).
o Ensures border information is not underrepresented.
3. Valid Padding:
o No padding is applied, reducing the spatial size of the output layer (e.g., a 32×32 input becomes
28×28 with a 5×5 filter).
o This often underrepresents border pixels and is generally less effective in practice.
4. Full-Padding:
o Expands the input size even further, allowing the filter to extend beyond the input borders.
o Adds (F − 1) zero pixels on all sides, increasing each spatial dimension by 2(F − 1).
o Useful for specific applications like convolutional autoencoders.
5. Reverse Convolution:
o A full-padded output can undergo reverse convolution to restore the original input dimensions,
often used in backpropagation and autoencoder algorithms.

An example of the padding of a single feature map is shown in Figure 8.3, where two zeros are padded on
all sides of the image (or feature map)
Padding ensures key information is preserved and helps maintain consistent spatial dimensions across
layers.
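A quick way to see the three behaviours (assuming PyTorch; the padding amounts follow the (F − 1)/2 and F − 1 rules above for a 5×5 filter):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

valid = nn.Conv2d(3, 8, kernel_size=5, padding=0)   # valid: no padding
half  = nn.Conv2d(3, 8, kernel_size=5, padding=2)   # half/'same': pad (5-1)/2 = 2
full  = nn.Conv2d(3, 8, kernel_size=5, padding=4)   # full: pad 5-1 = 4

print(valid(x).shape)   # torch.Size([1, 8, 28, 28])
print(half(x).shape)    # torch.Size([1, 8, 32, 32])
print(full(x).shape)    # torch.Size([1, 8, 36, 36])
```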
Strides
1. Strides:
o Strides determine how far the filter moves across the input during convolution.
o A stride of 1 means the filter slides one position at a time, covering every spatial position. Larger
strides (e.g., stride 2) skip positions, reducing computation and spatial size.
2. Effect of Strides:
o Using a stride Sq results in fewer spatial positions being processed:
▪ Output height: (input height − Fq)/Sq + 1
▪ Output width: (input width − Fq)/Sq + 1
o Larger strides reduce spatial size faster and can shrink the area by a factor of approximately Sq².
3. Common Usage:
o Strides of 1 are typical. Stride 2 is sometimes used for efficiency or to limit overfitting. Strides
larger than 2 are rare.
4. Advantages of Larger Strides:
o Increase the receptive field (the region a feature captures in the input).
o Help manage memory in resource-constrained settings.
o Reduce overfitting by lowering resolution when it's unnecessarily high.
5. Alternatives to Max-Pooling:
o Historically, max-pooling was used to increase receptive fields and downsample layers. More
recently, larger strides have been used instead to achieve similar effects.
In essence, strides offer a balance between reducing spatial footprint and enhancing the ability to capture
complex patterns.
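The effect on spatial size is easy to verify (PyTorch sketch with arbitrary channel counts):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

stride1 = nn.Conv2d(3, 8, kernel_size=3, stride=1)
stride2 = nn.Conv2d(3, 8, kernel_size=3, stride=2)

print(stride1(x).shape)   # torch.Size([1, 8, 30, 30])
print(stride2(x).shape)   # torch.Size([1, 8, 15, 15])  roughly half per dimension
```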
Typical Settings
Typical Settings:
1. Stride Sizes:
o Strides of 1 are most common. Small strides like 2 are occasionally used for efficiency.
o Square images (where the height equals the width) are preferred. Non-square images are
pre-processed (e.g., cropped into square patches).
2. Filter Size:
o Typical sizes are small (e.g., 3 or 5). Smaller filters work well, as they lead to deeper networks and
better results.
o For example, the VGG network used a filter size of 3 for all layers, achieving great success.
3. Number of Filters:
o Commonly set as powers of 2 (e.g., 32, 64) to optimize processing and align with hidden layer
depths.
Use of Bias:
1. Adding Bias:
o Each filter has its own bias b^(p,q), which is added to the convolution's output.
o The bias adds only one parameter per filter and is learned during backpropagation.
2. How Bias Works:
o Bias acts like a connection weight applied to a constant input of +1.
o This is equivalent to adding a "special pixel" with a fixed value of 1 to the input.
Therefore, the number of input features in the qth layer is (height × width × depth) + 1. This is a standard
feature-engineering trick that is used for handling bias in all forms of machine learning.
In essence, these settings optimize efficiency and performance while maintaining simplicity.
The ReLU Layer
1. ReLU Activation:
o The ReLU (Rectified Linear Unit) function is applied to all values in a layer. It replaces negative
values with zero while leaving positive values unchanged.
o This operation doesn’t alter the layer’s dimensions since it maps each input value to a single
output value.
2. Position in the Network:
o ReLU follows convolution operations (similar to activation functions following linear
transformations in traditional neural networks).
o While essential, ReLU layers are often omitted from illustrations for simplicity.
3. Why ReLU?:
o Earlier activation functions like sigmoid and tanh were slower and less accurate.
o ReLU offers faster training and better results, allowing deeper networks and longer training times.
ReLU has become the standard activation function in modern convolutional neural networks due to its
simplicity and performance.
Pooling
Pooling is a fundamental step in convolutional neural networks, typically forming the third stage of a
convolutional layer. A typical convolutional layer operates as follows:
1. Convolution: The first stage applies multiple convolutions in parallel, generating a set of linear
activations.
2. Nonlinear Activation: In the second stage, these linear activations are transformed using a nonlinear
function, such as the rectified linear unit (ReLU). This step, often called the detector stage, introduces
nonlinearity into the network.
3. Pooling: The third stage modifies the outputs further through a pooling function.
➢ What is Pooling?
• Pooling reduces the spatial size (height and width) of feature maps while keeping their depth
unchanged.
• It operates independently on small regions (e.g., 2×2 or 3×3) in each feature map.
• The most common type is max-pooling, which selects the maximum value from each region.
➢ Types of Pooling:
Pooling simplifies and condenses the feature representation by summarizing nearby outputs into a single
value. For instance:
• Max Pooling: Selects the maximum value in a defined rectangular region.
• Average Pooling: Calculates the average value of the region.
• L2 Norm Pooling: Computes the square root of the sum of squares of the values.
• Weighted Average Pooling: Averages values based on their distance from a central point.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling
layer are
((nh – f)/s + 1) x ((nw – f)/s + 1) x nc
nh → height of the feature map
nw → width of the feature map
nc → number of channels in the feature map
f → size of the pooling filter
s → stride length
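A small check of these dimensions (PyTorch; a 2×2 window with stride 2, as in the 4×4 → 2×2 example mentioned later):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 5, 4, 4)                 # five 4x4 feature maps

maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)

print(maxpool(x).shape)                     # torch.Size([1, 5, 2, 2])
print(avgpool(x).shape)                     # torch.Size([1, 5, 2, 2])  depth unchanged
```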
(Figures: worked examples of MAX POOLING and AVERAGE POOLING omitted.)
➢ Advantages
• Dimensionality Reduction
• Translation Invariance
• Feature Selection
➢ Disadvantages
• Information Loss
• Over-smoothing
• Hyperparameter tuning

➢ Strides in Pooling:
• The stride determines how far the pooling window moves.
o For stride = 1: Slightly overlapping regions, smaller reduction in size.
o For stride = 2: Non-overlapping regions, more significant size reduction.
➢ Translation Invariance:
• Pooling adds translation invariance, meaning it classifies objects similarly even if their positions in
the image shift slightly (e.g., a bird is still a bird, no matter where it is in the image).
➢ Receptive Field:
• Pooling increases the receptive field, allowing layers to capture larger regions of the input. This helps
identify complex features in deeper layers.
➢ Max-Pooling vs. Other Methods:
• Max-pooling is more common than average-pooling because it retains key information effectively.
• Example: A 2×2 max-pooling with stride 2 reduces a layer size from 4×4 to 2×2.
➢ Emerging Trends:
• Some recent designs replace pooling with convolutional layers using larger strides for reducing size.
However, max-pooling remains popular for its unique advantages like nonlinearity and translation
invariance.
Fully Connected Layers
➢ Fully Connected Layers:
• Features from the final spatial layer are fully connected to the next layer, like in traditional feed-
forward networks.
• Multiple fully connected layers may be used to enhance computational power.
• These layers have the majority of the network's parameters, as they are densely connected (e.g., two
layers with 4096 hidden units would involve over 16 million weights).
➢ Parameter and Memory Trade-Off:
• Convolutional layers have more activations (higher memory usage), but fully connected layers have
more parameters.
• The design of these layers can vary based on the task (e.g., classification vs. segmentation) and
resource constraints (e.g., memory or data availability).
➢ Output Layer:
• For classification tasks, the output layer is fully connected and uses an activation function like
softmax, logistic, or linear, depending on the application.
➢ Alternative to Fully Connected Layers:
• Average Pooling: Aggregates values across the entire spatial area of the final activation maps.
o Example: If the final maps are 7×7×256, average pooling creates 256 features by averaging 49
values per feature.
• This reduces the parameter count and improves generalization. This method was used in
GoogLeNet.
➢ Fully Convolutional Networks (FCNs):
• For tasks like image segmentation, fully connected layers are replaced with 1×1 convolutions to
create an output spatial map where each pixel corresponds to a class label.
The Interleaving Between Layers
➢ Layer Interleaving:
• Convolutional layers (C) are typically followed by ReLU layers (R) to apply non-linearity.
• After two or three convolution-ReLU pairs, a pooling layer (P) is added to reduce spatial size.
➢ Common Patterns:
• Example layer sequences:
o CRCRP: Two convolution-ReLU pairs followed by pooling.
o CRCRCRP: Three convolution-ReLU pairs followed by pooling.
• Repeating these patterns creates deeper networks.
➢ Fully Connected Layers:
• After multiple convolution-ReLU-pooling sequences, fully connected layers (F) are added for final
computation.
Example network: CRCRP → CRCRP → CRCRPF
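A minimal sketch of such a CRCRP → CRCRP → CRCRP → F stack (PyTorch; the channel counts and the 32×32 input size are illustrative assumptions):

```python
import torch.nn as nn

def crcrp(in_ch, out_ch):
    """One CRCRP block: (Conv-ReLU) x 2 followed by 2x2 max-pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

network = nn.Sequential(
    crcrp(3, 32),                  # CRCRP   32x32 -> 16x16
    crcrp(32, 64),                 # CRCRP   16x16 -> 8x8
    crcrp(64, 128),                # CRCRP   8x8   -> 4x4
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 10),    # F: fully connected output
)
```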
➢ Pooling's Role:
• Pooling reduces the spatial footprint of activation maps using larger strides.
• Strided convolutions can also reduce spatial size and may replace pooling in some cases.
➢ Deep Networks:
• Modern convolutional networks often have more than 15 layers.
• Skip connections are used to link layers, improving performance in very deep architectures.
LeNet-5
1. Overview:
o LeNet-5 is one of the earliest neural networks, designed for grayscale images with one colour
channel.
o Commonly used for character recognition (e.g., on checks), it assumes ten character classes (the
digits 0–9) for classification.
2. Architecture:
o Contains two convolution layers, two pooling layers, and three fully connected layers.
o Later layers have multiple feature maps due to multiple filters.
3. Details:
o C5 Layer: The first fully connected layer (C5) was labelled as a convolution layer in the original
design because, for larger inputs, it would behave as a convolution; for the intended input size it is
effectively a fully connected layer.
o Subsampling (or pooling) layers averaged values over 2×2 regions with stride 2, unlike modern
max-pooling.
o Sigmoid activations followed subsampling; however, modern networks favour ReLU activations
after convolutions.
4. Evolution:
o Average pooling and RBF (Radial Basis Function) units in the final layer were used. Today, these
are replaced by max-pooling and softmax layers with log-likelihood loss.
5. Significance:
o LeNet-5 was shallow compared to modern architectures but introduced core concepts of
convolutional networks.
o The RBF approach used a prototype vector to compute distances for classification, which is now
considered outdated.
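A loose, modernized sketch of this architecture (PyTorch; ReLU and max-pooling stand in for the original sigmoid activations, average pooling, and RBF output, as discussed above; a 32×32 input is assumed):

```python
import torch.nn as nn

lenet5_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),    # C1: 32x32x1 -> 28x28x6
    nn.MaxPool2d(2),                              # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),   # C3: -> 10x10x16
    nn.MaxPool2d(2),                              # S4: -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),        # C5: effectively fully connected
    nn.Linear(120, 84), nn.ReLU(),                # F6
    nn.Linear(84, 10),                            # ten character classes
)
```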
Local Response Normalization
➢ What is LRN?
• LRN is applied immediately after the ReLU layer to improve generalization by fostering competition
among filters.
• Inspired by biological systems, it normalizes the activation values of neurons.
➢ Normalization Formula:
• For an activation ai, the normalized value bi is calculated by dividing ai by a term that grows with the
squared activations of nearby filters:
bi = ai / (k + α · Σj aj²)^β
where the sum runs over a window of n filters adjacent to filter i.
Key parameters used are:
o k = 2, α = 10⁻⁴, and β = 0.75.
The value of n used is 5. Therefore, we have the following formula:
bi = ai / (k + α · Σ_{j = i−n/2}^{i+n/2} aj²)^β
In the above formula, any value of i − n/2 that is less than 0 is set to 0, and any value of i + n/2 that is greater
than N is set to N. The use of this type of normalization is now obsolete, and its discussion has been included
here for historical reasons.
➢ Practical Computation:
• Instead of normalizing over all filters, it’s done over small groups (e.g., 5 adjacent filters).
• This approach reduces computational complexity while retaining effectiveness.
➢ Relevance Today:
• LRN is now mostly obsolete and included for historical context. Modern architectures favor simpler
approaches for normalization.
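For reference, a sketch using PyTorch's built-in layer with the parameters quoted above (note that PyTorch's alpha convention may differ slightly from the formula as written, so this is an approximation rather than an exact reproduction):

```python
import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)  # n=5, k=2, beta=0.75

x = torch.relu(torch.randn(1, 20, 14, 14))   # activations after a ReLU layer
print(lrn(x).shape)                          # shape unchanged: torch.Size([1, 20, 14, 14])
```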

Hierarchical Feature Engineering


1. Layers and Features:
o Early layers in a convolutional network detect low-level features like edges (horizontal, vertical,
etc.).
o Mid-level layers combine these edges into shapes (e.g., hexagons).
o Higher layers combine the shapes into complex objects (e.g., honeycombs).
2. How Low-Level Filters Work:
o Filters detect edges by capturing differences in neighbouring pixel values, which change along
edges (like horizontal or vertical transitions).
o This process mimics biological observations, such as how neurons in a cat’s visual cortex detect
specific edges.
3. Building Complexity:
o Each layer processes the features learned by previous layers to detect larger and more complex
patterns.
o For example, detecting arcs to form circles, and then combining circles with other features to
form car wheels.
4. Layer Limitations:
o The first convolution layer can only learn features within its filter size (e.g., 3×3 or 5×5 pixels).
o Later layers combine patches into larger regions for greater visual understanding.
5. Depth Matters:
o Deeper networks are better at recognizing complex image components because they learn
hierarchical regularities.
o Shallow networks fail to capture these relationships effectively.
6. Dataset Sensitivity:
o The features learned depend on the dataset. For example, a network trained to recognize trucks
learns features specific to trucks.
o Diverse datasets like ImageNet train models with more general-purpose features useful across
applications.
In essence, convolutional neural networks excel by building features layer by layer, from simple patterns to
complex objects, enabling them to classify and interpret images.
Variants of the Basic Convolution Function
• In general, a convolution layer applies several different kernels to the input, since convolution with a
single kernel can extract only one kind of feature.
• The input is generally not a grid of single real values but a grid of vector-valued observations (multiple
channels per position).
• Multi-channel convolutions are commutative if and only if the number of output and input channels is
the same.
• The goal of a CNN is to transform the input image into concise abstract representations of the original
input.
• The individual convolutional layers try to find more complex patterns from the previous layer's
observations.
• The logic is that, for instance, ten curved lines might form two ellipses, which together would make an eye.
• To do this, each layer uses a kernel, usually a 2x2 or 3x3 matrix, that slides across the previous
layer's output to generate a new output. The word convolve, from which convolution derives, means to roll
or slide.
• The variants of convolution operations are as follows:
Effect of Strides
• Stride is the number of pixels by which the filter shifts over the input matrix.
• To allow features to be calculated at a coarser level, strided convolutions can be used.
• The effect of a strided convolution is the same as that of a convolution followed by a downsampling
stage.
• Strides can be used to reduce the representation size.
STRIDED CONVOLUTION, used in Convolutional Neural Networks (CNNs), modifies the standard
convolution operation by introducing a stride, which dictates how many pixels the filter shifts at a time. In
standard convolution, the filter typically moves one pixel at a time. With a stride greater than 1, the filter
"jumps" over pixels, leading to a smaller output size.
Below is an example representing 2-D Convolution, with (3 * 3) Kernel and Stride of 2 units.

In the provided example, a 7x7 image is convolved with a 3x3 filter using a stride of 2.
This means the filter moves two pixels at a time, both horizontally and vertically. The calculation is as
follows:
• The 3x3 filter is applied to the top-left 3x3 portion of the 7x7 input, resulting in the value 91.
• The filter then shifts two steps to the right, and the process is repeated, resulting in 110.
• This continues across the row with a stride of two, and then moves down two rows and repeats,
resulting in a 3x3 output.
The formula to calculate the output size of a convolutional layer is
((n + 2p - f) / s) + 1,
where:
n is the input size
p is the padding
f is the filter size
s is the stride
In this case, with n = 7, p = 0, f = 3, and s = 2, the output size is ((7 + 0 - 3) / 2) + 1 = 3.
Thus, the output is a 3x3 matrix.
Strided convolutions can reduce the spatial dimensions of feature maps, which can be useful for
capturing more global features and reducing computational cost.
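The output-size formula can be wrapped in a tiny helper to reproduce the 7 → 3 example (plain Python; the floor accounts for positions where the filter would not fully fit):

```python
import math

def conv_output_size(n, p, f, s):
    """Spatial output size: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(n=7, p=0, f=3, s=2))   # 3, i.e. a 3x3 output
```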
Tiled Convolution
Tiled Convolutional Neural Networks (Tiled CNNs) are an extension of traditional convolutional neural
networks. They incorporate multiple convolution kernels (k kernels) within the same layer, applied over
every kth unit, a concept known as "tiling." Research has demonstrated that even a small value of k, like
2, can yield effective results.
Tiled convolution is a technique introduced to strike a balance between two common types of
layers: convolutional layers and locally connected layers. It enhances feature learning capabilities
while maintaining efficiency in terms of memory and computation.
Here’s a breakdown of the concept and its details:
What is Tiled Convolution?
In a traditional convolutional layer:
• The same set of weights (or kernel) is applied across all spatial locations of the input data. This is why
it achieves translational invariance (e.g., recognizing a feature regardless of where it appears in the
input).
In a locally connected layer:
• Different weights are learned for every spatial location, offering flexibility to capture highly localized
variations. However, this requires a large number of parameters, leading to higher memory
requirements.
Tiled Convolution, on the other hand:
• Uses a compromise approach. Instead of having a single set of weights (as in convolutional layers) or
unique weights for each spatial location (as in locally connected layers), it cycles through a set of
kernels (filters) as it moves across the input space.
How It Works
• The tiling mechanism divides the input space into smaller tiles, and each tile is cyclically processed
by a different kernel.
• For example:
o If there are t kernels, the first kernel is applied to the first tile, the second kernel to the next
tile, and so on. This pattern repeats cyclically.
o This allows neighboring locations to be processed by different filters, adding diversity in
feature extraction.
Mathematical Definition
Tiled convolution can be expressed algebraically as:
Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} · K_{i, l, m, n, (j % t)+1, (k % t)+1}
Where:
• Z_{i,j,k} represents the output at spatial position (j, k) of the ith feature map.
• V_{l, j+m−1, k+n−1} is the input at the corresponding position and channel l.
• K represents the kernel weights; its last two indices select which kernel of the set of t kernels is used.
• The modulo operation (%) ensures the cycling of kernels across tiles in both spatial dimensions.
Benefits of Tiled Convolution
1. Balancing Parameter Efficiency and Flexibility:
o Tiled convolution uses fewer parameters than locally connected layers but still
enables diversity in feature extraction by applying different kernels to different
regions.
o Memory requirements increase only by a factor of the number of kernels (t), not the
entire feature map size.
2. Capturing Invariances:
o By rotating through multiple kernels, tiled convolution captures invariances to
transformations such as rotation and scaling, in addition to the translation
invariance provided by standard convolutional layers.
3. Pooling Interaction:
o Max pooling and other pooling methods combine the outputs of tiled convolutional
layers, further enhancing invariance. For instance, if different kernels detect
transformed versions of the same feature, pooling ensures the network learns
invariance to these transformations.
Comparison with Other Layers
• Standard Convolutional Layers:
o Apply one set of weights everywhere, leading to strong translation invariance but
limited flexibility.
• Locally Connected Layers:
o Learn completely different weights at every location, which is flexible but
computationally expensive.
• Tiled Convolution:
o Balances the above two approaches by learning a few distinct kernels that are reused
across the spatial locations, providing diversity without excessive parameter growth.
Applications
Tiled convolution is particularly effective in:
• Scenarios requiring rotation or scale invariance, such as image recognition tasks where objects
may appear in different orientations or sizes.
• Environments with limited computational resources, as it avoids the excessive parameter
requirements of locally connected layers.
This concept provides a structured and efficient way to enhance feature learning while
maintaining computational practicality.
Key Features of Tiled CNNs:
1. Pooling and Invariances: Through the pooling operation—where the outputs of convolutional layers are
down sampled by methods like max, average, or stochastic pooling—the tiled layers achieve rotational
and scale invariance in addition to the translational invariance inherent to standard CNNs. This
capability makes Tiled CNNs highly adaptable to diverse input variations.
2. Learned Features: Each convolution operation learns a unique feature (or map), effectively capturing
representations of the input data. Despite their extended functionality, tiled layers maintain a relatively
small number of learned parameters, ensuring computational efficiency.
3. Weight Tying: Tiled CNNs are parameterized by the tile size (k), tying weights between units that are k
steps apart within the same map. This localized weight tying creates a spectrum of models:
o At one end, with k=1, the network resembles traditional CNNs with fully tied weights.
o At the other, with fully untied weights, the model becomes more flexible but more complex.
4. Multiple Maps: The use of multiple "maps," each learning distinct features, enables the network to
construct diverse and comprehensive representations of the data. Tiled CNNs balance the ability to
model complex invariances while maintaining a compact number of parameters.
Advantages:
Tiled CNNs combine the benefits of learning rich, complex features and capturing various invariances
(rotational, scale, and translational). This is achieved without significantly increasing the computational
or memory overhead, making them a powerful tool for tasks that demand diverse feature extraction.
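A minimal NumPy sketch of the cycling idea in one dimension (not an optimized implementation; the kernel values are arbitrary): output position i uses kernel i % t, so neighbouring outputs use different kernels but the set repeats every t steps.

```python
import numpy as np

def tiled_conv1d(x, kernels):
    """1-D 'valid' tiled convolution: output position i uses kernel i % t."""
    t, k = kernels.shape                  # t kernels, each of width k
    out = np.empty(len(x) - k + 1)
    for i in range(len(out)):
        w = kernels[i % t]                # cycle through the kernel set
        out[i] = np.dot(w, x[i:i + k])
    return out

x = np.arange(8, dtype=float)
kernels = np.array([[1.0, -1.0],          # kernel 0
                    [0.5,  0.5]])         # kernel 1  (t = 2)
print(tiled_conv1d(x, kernels))
```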
TRANSPOSED CONVOLUTIONAL
A transposed convolutional layer is an upsampling layer that generates an output feature map larger than
the input feature map. It is similar to, but not the same as, a deconvolutional layer: a deconvolutional layer
reverses a standard convolutional layer, so if the output of a standard convolution layer is deconvolved with
the deconvolutional layer, the result is the same as the original values. A transposed convolution does not
recover the original values, but it does restore the original spatial dimensions.
Transposed convolutional layers are used in a variety of tasks, including image generation, image super-
resolution, and image segmentation. They are particularly useful for tasks that involve upsampling the
input data, such as converting a low-resolution image to a high-resolution one or generating an image
from a set of noise vectors.
The operation of a transposed convolutional layer is similar to that of a normal convolutional
layer, except that it performs the convolution operation in the opposite direction. Instead of sliding the
kernel over the input and performing element-wise multiplication and summation, a transposed
convolutional layer slides the input over the kernel and performs element-wise multiplication and
summation. This results in an output that is larger than the input, and the size of the output can be
controlled by the stride and padding parameters of the layer.

(Figure: transposed convolution with stride 2.)
In a transposed convolutional layer, the input is a feature map of size H x W, where H and W are the
height and width of the input, and the kernel size is kh x kw, where kh and kw are the height and
width of the kernel.

If the stride is s and the padding is p, the stride of the transposed convolutional layer determines the step
size with which the input values are spread out, and the padding determines how many pixels are trimmed
from the edges of the result. The output of the transposed convolutional layer will then be

H' = (H − 1) × s − 2p + kh
W' = (W − 1) × s − 2p + kw

where H' and W' are the height and width of the output.

Example 1:
Suppose we have a grayscale image of size 2 x 2, and we want to upsample it using a transposed
convolutional layer with a kernel size of 2 x 2, a stride of 1, and a padding of 0 (no padding). The input
image and the kernel for the transposed convolutional layer would be as shown in the accompanying figure.
The output shape can be calculated as:
H' = (2 − 1) × 1 − 2 × 0 + 2 = 3
W' = (2 − 1) × 1 − 2 × 0 + 2 = 3
so the output is a 3 x 3 feature map (its specific values depend on the input and kernel values).
Transposed convolutional layers are often used in conjunction with other types of layers, such as pooling
layers and fully connected layers, to build deep convolutional networks for various tasks.
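A shape-level sketch of the 2×2 example above (PyTorch, random kernel weights; only the output sizes are meaningful here):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)                        # a 2x2 single-channel input

# kernel 2x2, stride 1, padding 0: output (2-1)*1 - 0 + 2 = 3  ->  3x3
up1 = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=1, padding=0)
print(up1(x).shape)                                # torch.Size([1, 1, 3, 3])

# stride 2 spreads the inputs further apart before the kernel is applied -> 4x4
up2 = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, padding=0)
print(up2(x).shape)                                # torch.Size([1, 1, 4, 4])
```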
Dilated Convolution
Dilated convolution is a technique that expands the kernel by inserting holes (gaps) between its consecutive
elements. In simpler terms, it is the same as convolution, but it involves pixel skipping so as to cover a larger
area of the input.
Dilated convolution, also known as atrous convolution, is a type of convolution operation used in
convolutional neural networks (CNNs) that enables the network to have a larger receptive field without
increasing the number of parameters.
In a regular convolution operation, a filter of a fixed size slides over the input feature map, and the
values in the filter are multiplied with the corresponding values in the input feature map to produce a
single output value. The receptive field of a neuron in the output feature map is defined as the area in the
input feature map that the filter can “see”. The size of the receptive field is determined by the size of the
filter and the stride of the convolution.
In contrast, in a dilated convolution operation, the filter is “dilated” by inserting gaps between the
filter values. The dilation rate determines the size of the gaps, and it is a hyperparameter that can be
adjusted. When the dilation rate is 1, the dilated convolution reduces to a regular convolution.
The dilation rate effectively increases the receptive field of the filter without increasing the number
of parameters, because the filter is still the same size, but with gaps between the values. This can be
useful in situations where a larger receptive field is needed, but increasing the size of the filter would lead
to an increase in the number of parameters and computational complexity.
Dilated convolutions have been used successfully in various applications, such as semantic
segmentation, where a larger context is needed to classify each pixel, and audio processing, where the
network needs to learn patterns with longer time dependencies.
Some advantages of dilated convolutions are:
1. Increased receptive field without increasing parameters
2. Can capture features at multiple scales
3. Reduced spatial resolution loss compared to regular convolutions with larger filters
Some disadvantages of dilated convolutions are:
1. Reduced spatial resolution in the output feature map compared to the input feature map
2. Increased computational cost compared to regular convolutions with the same filter size and stride
An additional parameter l (dilation factor) tells how much the input is expanded. In other words,
based on the value of this parameter, (l-1) pixels are skipped in the kernel. Fig 1 depicts the difference
between normal vs dilated convolution. In essence, normal convolution is just a 1-dilated convolution.

Fig 1: Normal Convolution vs Dilated Convolution


Intuition:
Dilated convolution helps expand the area of the input image covered without pooling. The objective
is to cover more information from the output obtained with every convolution operation. This method
offers a wider field of view at the same computational cost. We determine the value of the dilation
factor (l) by seeing how much information is obtained with each convolution on varying values of l.
By using this method, we are able to obtain more information without increasing the number of
kernel parameters. In Fig 1, the image on the left depicts dilated convolution. On keeping the value of l =
2, we skip 1 pixel (l – 1 pixel) while mapping the filter onto the input, thus covering more information in
each step.
Formula Involved:
(F *l k)(p) = Σ_{s + l·t = p} F(s) · k(t)
Where,
F(s) = Input
k(t) = Applied Filter
*l = l-dilated convolution
(F *l k)(p) = Output
Advantages of Dilated Convolution:
Using this method rather than normal convolution is better as:
1. Larger receptive field (i.e. no loss of coverage)
2. Computationally efficient (as it provides a larger coverage on the same computation cost)
3. Lesser Memory consumption (as it skips the pooling step) implementation
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. Structure of this convolution helps in maintaining the order of the data.
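A quick comparison of a normal and a dilated 3×3 filter on a 7×7 input (PyTorch; single channel for simplicity):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)

normal  = nn.Conv2d(1, 1, kernel_size=3, dilation=1)   # effective field 3x3
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)   # same 9 weights, field 5x5

print(normal(x).shape)    # torch.Size([1, 1, 5, 5])
print(dilated(x).shape)   # torch.Size([1, 1, 3, 3])   wider view per output value
```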
Types of Convolutions
Comparing Unshared, Tiled and Traditional Convolutions

Unshared Convolution
• Properties:
1. No parameter sharing.
2. Each output unit performs a linear operation on its neighbourhood, but parameters are not shared
across output units.
3. Captures local connectivity while allowing different features to be computed at different spatial
locations.
• Advantages: reduces memory consumption, increases statistical efficiency, and reduces the amount of
computation needed to perform forward and back-propagation.
• Disadvantages: requires many more parameters than the convolution operation.

Tiled Convolution
• Properties:
1. Offers a compromise between unshared and traditional convolution.
2. Learns a set of kernels and cycles/rotates them through space.
3. Makes use of parameter sharing.
• Advantages: reduces the number of parameters in the model.

Traditional Convolution
• Properties:
1. Equivalent to tiled convolution with t = 1.
2. Has the same connectivity as unshared convolution.
Examples of Unshared, Tiled and Traditional Convolutions
Unshared Convolution

A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labelling
each connection with a unique letter.

Tiled Convolution
Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has
edges labelled “a” and “b,” while the other has edges labelled “c” and “d.” Each time we move one pixel to the
right in the output, we move on to using a different kernel. This means that, like the locally connected layer,
neighbouring units in the output have different parameters. Unlike the locally connected layer, after we have
gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a
multiple of t steps, then they share parameters.

Traditional Convolution

Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied
everywhere, as indicated in the diagram by using the kernel with weights labelled “a” and “b” everywhere.

Effect of Zero Padding


• Convolutional networks can implicitly zero-pad the input V to make it wider.
• Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at
each layer.
• Zero-padding the input allows us to control the kernel width and the size of the output independently.
Zero Padding Strategies
• Three common zero-padding strategies are:

Valid Zero-Padding
1. No zero padding is used.
2. The output is computed only at places where the entire kernel lies inside the input.
3. Shrinkage > 0.
4. Limits the number of convolution layers that can be used in the network.
5. Input's width = m, kernel's width = k, width of output = m − k + 1.

Same Zero-Padding
1. Just enough zero padding is added to keep size(output) = size(input).
2. The input is padded by (k − 1) zeros in total.
3. Since the number of output units connected to border pixels is less than that for centre pixels, it may
under-represent border pixels.
4. Can add as many convolution layers as the hardware can support.
5. Input's width = m, kernel's width = k, width of output = m.

Strong (Full) Zero-Padding
1. The input is padded by enough zeros such that each input pixel is connected to the same number of
output units.
2. Allows us to make an arbitrarily deep network.
3. Can add as many convolution layers as the hardware can support.
4. Input's width = m, kernel's width = k, width of output = m + k − 1.
Comparing Computation Times
(Figure comparing computation times omitted.)
Structured Outputs
Convolutional networks are incredibly versatile and can go beyond simple classification tasks (assigning
a label to an entire image) or regression tasks (predicting a numerical value).
They can also generate high-dimensional structured outputs, like segmenting an image into distinct
regions or labelling each pixel individually.
Pixel-Wise Labelling
• Instead of producing a single class label for an image, convolutional networks can create a tensor (a
multi-dimensional array) where each element, say Y_{i,j,k}, represents the probability that the pixel at
coordinates (j, k) belongs to a particular class i. For instance:
• In an image of animals, the network can label each pixel as belonging to "dog," "cat," "background," etc.
• This allows the model to create highly accurate masks that follow the outlines of objects.
The Challenge of Size Mismatch
• A common issue arises because the output plane (the spatial dimensions of the output tensor) may
be smaller than the input plane (the size of the input image).
• Pooling Layers: When using pooling layers with a large stride, the resolution of the output decreases
significantly.
Possible Solutions:
o Avoid Pooling: Eliminate pooling layers altogether, as suggested by some researchers (e.g., Jain et
al., 2007), to retain the size of the output tensor.
o Lower-Resolution Labels: Output a grid of labels with a lower resolution than the input, which
simplifies the architecture but reduces detail (e.g., Pinheiro and Collobert, 2014, 2015).
o Unit Stride Pooling: Use pooling with a stride of 1 to minimize spatial reduction while keeping the
benefits of pooling.
Refining Pixel-Wise Predictions
o When labeling each pixel:
o The model may start with an initial prediction for the image labels.
o It then refines these predictions by considering relationships between neighboring pixels.
o By repeating this refinement process multiple times, the network becomes more accurate in its
predictions. This refinement can involve weight sharing between the layers, where the same
convolutional weights are reused at each refinement step.
This setup essentially turns the network into a specialized form of a recurrent neural network (RNN),
as it applies the same operations iteratively to improve its predictions over time (e.g., Pinheiro and
Collobert, 2014, 2015).
Image Segmentation
After labeling individual pixels, further processing methods can segment the image into regions:
• Assumption: Contiguous groups of pixels with the same label are likely to belong to the same
region (e.g., all pixels labeled "dog" form a single segment).
• Methods:
o Graphical Models: These describe probabilistic relationships between neighboring pixels
to smooth and refine segmentation results.
o Training Objectives: The network can be trained to approximate graphical model
objectives directly, allowing it to generate better segmentations (e.g., Ning et al., 2005;
Thompson et al., 2014).
Final Output
The result is an image segmented into distinct regions, where each pixel has been assigned a class label.
This is crucial for applications like:
o Autonomous driving (distinguishing pedestrians, vehicles, and roads).
o Medical imaging (highlighting tumors or abnormalities).
o Object detection and recognition (separating objects within an image).
(Figure: an example of a recurrent convolutional network for pixel labelling, using the variables below.)
Variable — Description
X — input image tensor
Y — probability distribution over labels for each pixel
H — hidden representation
U — tensor of convolution kernels
V — tensor of kernels used to produce an estimate of the labels
W — kernel tensor convolved over Y to provide input to H
Data Types
The data used with a convolutional network usually consists of several channels, each channel
being the observation of a different quantity at some point in space or time.

1-D
• Single channel: Audio waveform — the axis we convolve over corresponds to time. We discretize time
and measure the amplitude of the waveform once per time step.
• Multichannel: Skeleton animation data — animations of 3-D computer-rendered characters are generated
by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described
by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data
we feed to the convolutional model represents the angle about one axis of one joint.
2-D
• Single channel: Audio data that has been pre-processed with a Fourier transform — we can transform the
audio waveform into a 2-D tensor with different rows corresponding to different frequencies and different
columns corresponding to different points in time. Using convolution along the time axis makes the model
equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to
frequency, so that the same melody played in a different octave produces the same representation but at
a different height in the network's output.
• Multichannel: Color image data — one channel contains the red pixels, one the green pixels, and one the
blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image,
conferring translation equivariance in both directions.
3-D
• Single channel: Volumetric data — a common source of this kind of data is medical imaging technology,
such as CT scans.
• Multichannel: Color video data — one axis corresponds to time, one to the height of the video frame, and
one to the width of the video frame.
Table: Examples of different formats of data that can be used with convolutional networks.

Training a Convolutional Network


➢ Backpropagation Algorithm:
• Training a convolutional neural network relies on backpropagation to adjust weights and minimize the
loss.
➢ Relu Layer:
• Backpropagation through the ReLU activation is straightforward, as it's similar to traditional neural
networks.
➢ Max-Pooling Layer:
• Non-Overlapping Pools:
o Identify the maximum value in each pooling region.
o The loss gradient flows back only to the maximum value, while other values get a gradient of 0.
• Overlapping Pools:
o A unit h may belong to multiple overlapping pools.
o The gradient contributions from all overlapping pools are summed up for h.
➢ Summary:
• ReLU and max-pooling backpropagation are similar to traditional methods, with specific rules for pooling
regions.
Backpropagating Through Convolutions
➢ Relation to Feed-Forward Networks:
• Backpropagation in convolutional networks is similar to that in feed-forward networks, where errors are
propagated backward using matrix multiplications.
• It can also be viewed as a transposed convolution.
➢ Loss Gradients:
• The loss derivative for each cell in layer (i + 1) (the next layer) is assumed to have already been computed.
• Cells in layer (i + 1) are created by aggregating contributions from cells in layer i through filters of the
given spatial size and depth.
➢ Cell Contribution:
• Each cell in layer i contributes to multiple cells in layer i+1, depending on filter size and stride.
• During backpropagation, gradients from all cells that a given cell contributes to are aggregated backward.
➢ Gradient Calculation Pseudocode:
• Find all cells in layer (i + 1) (the set Sc) that depend on a specific cell c in layer i.
• For each dependent cell r in Sc:
o Multiply the loss gradient of r by the filter weight connecting c to r.
• Sum these products to compute the gradient for cell c.
➢ Weight Gradients:
• Multiply the hidden activation values in layer i by the corresponding loss gradients in layer (i + 1) to
compute the weight gradients.
• As filter weights are shared, sum the gradients over all copies of a shared weight.
The method described above follows traditional backpropagation by accumulating gradients linearly. However,
extra care is needed to track which cells in one layer influence the next. Backpropagation can also be
implemented using tensor operations, which can further be simplified into matrix multiplications. These
techniques provide useful insights into how feed-forward networks can be generalized to convolutional networks
and will be explained in the following sections.
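As a concrete illustration of the aggregation described above, the following sketch (Python/NumPy; single input channel, one filter, stride 1, no padding; the "convolution" is written as cross-correlation, as is common in deep learning libraries, and all names are illustrative) computes both the input gradient, by scattering each output gradient back over the window that produced it, and the weight gradient, by summing over every position at which the shared filter was applied.

    import numpy as np

    def conv_forward(x, w):
        # 'Convolution' written as cross-correlation, as in most deep learning libraries.
        F = w.shape[0]
        H, W = x.shape
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i+F, j:j+F] * w)
        return out

    def conv_backward(dout, x, w):
        F = w.shape[0]
        dx = np.zeros_like(x)   # gradient w.r.t. the layer-i activations
        dw = np.zeros_like(w)   # gradient w.r.t. the shared filter weights
        for i in range(dout.shape[0]):
            for j in range(dout.shape[1]):
                # Every input cell in this window contributed to output (i, j),
                # so the output gradient is scattered back over the window.
                dx[i:i+F, j:j+F] += dout[i, j] * w
                # Shared weights accumulate gradients over all positions.
                dw += dout[i, j] * x[i:i+F, j:j+F]
        return dx, dw

    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 6))
    w = rng.standard_normal((3, 3))
    out = conv_forward(x, w)
    dx, dw = conv_backward(np.ones_like(out), x, w)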
Backpropagation as Convolution with Inverted/Transposed Filter
➢ Backpropagation in Convolutional Networks:
• Similar to traditional neural networks, where gradients are propagated backward by multiplying with the
transposed weight matrix.
• In convolutional neural networks (CNNs), gradients are associated with spatial positions, and the
concept extends to inverted (or transposed) filters.
➢ Gradient Propagation for 2D Convolutions:

• Suppose layer q (input) has depth 1 and layer q+1 (output) has depth 1, with a stride of 1.
• Backpropagation involves convolving the loss gradients at layer q+1 with the inverted filter from the forward pass (a small numerical check appears at the end of this subsection).
• Filter Inversion:
o The convolution filter is flipped both horizontally and vertically for backpropagation.
o This is because the relative movement of the gradients during backpropagation is opposite to that during forward convolution.
• Padding Relationship:
o For a stride of 1, the paddings of the forward and backward convolutions are related by $p_{forward} + p_{backward} = F - 1$, where $F$ is the spatial filter size.


➢ Gradient Propagation for Arbitrary Depths:
• When the depths $d_q$ (input) and $d_{q+1}$ (output) are greater than 1, additional tensor transpositions are required.
o Denote the weight in the (i, j, k)-th position of the p-th filter of layer q by $w^{(p,q)}_{ijk}$.
o Note that i and j refer to spatial positions, whereas k refers to the depth-centric position of the weight.
• For backpropagation, we define a backward filter tensor $U^{(k,q)} = [u^{(k,q)}_{rsp}]$, where the relationship is:
$$u^{(k,q)}_{r,s,p} = w^{(p,q)}_{F-r+1,\;F-s+1,\;k}$$
o Here:
▪ $r = F - i + 1$ and $s = F - j + 1$, where F is the spatial filter size (i.e., the filter is inverted along both spatial axes),
▪ The index of the filter identifier and the depth within a filter have been interchanged between w and u in the above equation.
➢ Practical Example:
• Suppose there are 20 filters applied to a 3-channel RGB input to produce an output of depth 20.
• During backpropagation:
o For the red channel of the input, extract the red-channel slice of each of the 20 filters, invert each slice spatially, and stack them into a single backward filter of depth 20 that is applied to the depth-20 output gradient.
o Repeat this for the green and blue channels.
• The transposition and inversion described above ensure proper gradient computation.

Summary:
• Backpropagation in CNNs involves using inverted (flipped) filters.
• With multiple input/output depths, tensor transpositions ensure that gradients are mapped correctly.

• The relationship $u^{(k,q)}_{r,s,p} = w^{(p,q)}_{F-r+1,\;F-s+1,\;k}$ defined above formalizes this process.
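The following small numerical check (Python/NumPy; stride 1, depth 1, arbitrary sizes) illustrates the claim: the input gradient obtained by explicit aggregation equals a convolution of the zero-padded output gradient with the spatially flipped filter, with the paddings satisfying $p_{forward} + p_{backward} = F - 1$ (here the forward pass uses no padding, so the gradient is padded by F - 1 = 2).

    import numpy as np

    def conv_valid(x, w):
        F = w.shape[0]
        out = np.zeros((x.shape[0] - F + 1, x.shape[1] - F + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i+F, j:j+F] * w)
        return out

    rng = np.random.default_rng(1)
    x = rng.standard_normal((6, 6))
    w = rng.standard_normal((3, 3))           # F = 3, forward pass uses no padding
    dout = rng.standard_normal((4, 4))        # upstream gradient, same size as the output

    # Input gradient by explicit aggregation (as in the earlier sketch).
    dx = np.zeros_like(x)
    for i in range(4):
        for j in range(4):
            dx[i:i+3, j:j+3] += dout[i, j] * w

    # Input gradient as a convolution with the spatially flipped filter.
    # Forward padding was 0, so backward padding is F - 1 = 2.
    dx_flip = conv_valid(np.pad(dout, 2), w[::-1, ::-1])

    assert np.allclose(dx, dx_flip)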


Convolution/Backpropagation as Matrix Multiplications
➢ Viewing Convolution as Matrix Multiplication:
• Convolution can be represented as matrix multiplication, which helps in understanding concepts like
transposed convolution and deconvolution.
• Real-world implementations often rely on this approach for efficiency.
➢ Forward and Backward Operations:
• In traditional neural networks, the weight matrix is transposed during backpropagation.
• Similarly, convolution can be seen as a type of matrix multiplication with a spatial structure that can be
“flattened” and later reshaped.
➢ Example (Simplified Case):

• Consider a layer with an input of spatial size $L \times L$ (depth 1 for simplicity) and a filter of size $F \times F$.
• After convolution with stride 1 and zero padding (i.e., no padding):
o Input area: $A_I = L \times L$ (for a square input).
o Output area: $A_O = (L - F + 1) \times (L - F + 1)$.
• Flatten the input matrix into an $A_I$-dimensional vector f.
➢ Sparse Matrix Representation:

• Define a sparse matrix C of size $A_O \times A_I$:


o Each row corresponds to one convolution operation, representing how the filter interacts with
input pixels.
o Most entries in C are 0, except for those aligned with the filter, where they hold the filter values.
• Multiply C with f to get an $A_O$-dimensional output vector, which can be reshaped back into a spatial matrix of size $(L - F + 1) \times (L - F + 1)$.
➢ Depth > 1:

• For deeper layers (depth $d_q$ > 1):


o The same process is applied to each depth slice.
o Combine the results across slices, resulting in tensor multiplication (a generalization of matrix
multiplication).
➢ Multiple Filters:
• For multiple filters:
o Each filter k becomes its own sparse matrix $C_k$.
o The k-th output feature map is computed as $C_k f$ (see the sketch below).
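A minimal sketch of this matrix view (Python/NumPy; tiny sizes, depth 1, stride 1, no padding; the construction of C is illustrative rather than an optimized implementation): each row of C places the filter values at the input positions covered by one placement of the filter, so that C f reproduces the direct convolution, and backpropagation multiplies by the transpose.

    import numpy as np

    L, F = 4, 3                        # input is L x L, filter is F x F
    O = L - F + 1                      # output is O x O
    rng = np.random.default_rng(2)
    x = rng.standard_normal((L, L))
    w = rng.standard_normal((F, F))

    f = x.reshape(-1)                  # flatten the input into an A_I = L*L vector
    C = np.zeros((O * O, L * L))       # A_O x A_I matrix, one row per output cell
    for i in range(O):
        for j in range(O):
            row = i * O + j
            for a in range(F):
                for b in range(F):
                    C[row, (i + a) * L + (j + b)] = w[a, b]

    out_matmul = (C @ f).reshape(O, O)

    # Direct convolution (cross-correlation form) for comparison.
    out_direct = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            out_direct[i, j] = np.sum(x[i:i+F, j:j+F] * w)

    assert np.allclose(out_matmul, out_direct)
    # Backpropagation through this layer multiplies by the transpose: dx = C.T @ dout_flat.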

Data Augmentation
➢ Purpose:
• Data augmentation is used to reduce overfitting by generating new training examples from existing data
through transformations.
➢ Why It Works for Images:
• In image processing, transformations like translation, rotation, patch extraction, and reflection do not
change the essence of objects in the images.
• These transformations help the model generalize better by training it to recognize objects in different
orientations or conditions.
➢ Common Methods:
• Simple Transformations: Mirror images (reflections), small translations, or varied color intensities can be applied during training (a short sketch follows at the end of this section).
• Patch Extraction: Extracting smaller patches (e.g., 224×224×3 patches used in AlexNet) is common to
train models on fixed input sizes.
➢ Principal Component Analysis (PCA):
• PCA-based color augmentation perturbs pixel colors along the principal components of the RGB pixel values, with magnitudes scaled by Gaussian noise. While effective, this method can be computationally expensive.
➢ Caution:
• Data augmentation must suit the dataset and task:
o For example, rotation or reflection of MNIST handwritten digits can produce invalid data (e.g.,
rotating a '6' makes it a '9').
➢ Impact:
• Augmentation techniques can significantly improve model performance, reducing error rates, as shown
in studies.
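A minimal sketch (Python/NumPy; the image is assumed to be an H x W x 3 array and the function name is illustrative) of two of the transformations listed above, horizontal reflection and random patch extraction:

    import numpy as np

    def random_augment(image, patch_size, rng):
        H, W, _ = image.shape
        # Horizontal reflection with probability 0.5 (fine for natural images,
        # but not for digits such as MNIST, as cautioned above).
        if rng.random() < 0.5:
            image = image[:, ::-1, :]
        # Random patch extraction (e.g., AlexNet trained on 224x224x3 patches).
        top = rng.integers(0, H - patch_size + 1)
        left = rng.integers(0, W - patch_size + 1)
        return image[top:top + patch_size, left:left + patch_size, :]

    rng = np.random.default_rng(3)
    img = rng.random((256, 256, 3))
    patch = random_augment(img, patch_size=224, rng=rng)   # shape (224, 224, 3)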

Efficient Convolution Algorithms


1. Exploiting the Frequency Domain (Fourier Transform):
o Convolution operations can be accelerated by converting both the input and the kernel into the
frequency domain using the Fourier Transform. In the frequency domain, convolution simplifies to
point-wise multiplication of signals, followed by conversion back to the original (time or spatial) domain using the Inverse Fourier Transform.
o This approach can be advantageous for certain input sizes as it reduces computational
complexity compared to the naive discrete convolution method.
2. Separable Kernels:
o If a kernel can be expressed as the outer product of vectors for each dimension, it is considered
separable. Such kernels enable efficiency improvements.
o Instead of performing a computationally intensive multi-dimensional convolution, the operation
can be broken down into a sequence of one-dimensional convolutions with each vector. This
reduces computational requirements and parameter storage.
o For instance, with a kernel of width w in d dimensions:
▪ Naive multidimensional convolution: $O(w^d)$ runtime and parameter storage.
▪ Separable convolution: $O(w \cdot d)$ runtime and parameter storage, which is significantly more efficient.
3. Efficiency Improvements in Deployment:
o Innovations to speed up convolution without compromising accuracy are an active research area.
o Even optimizations tailored solely for forward propagation can be highly valuable because
commercial applications often prioritize deploying efficient models over optimizing training
processes.
In summary, using frequency-domain transformations, leveraging separable kernels, and developing new algorithms can significantly enhance convolution efficiency, benefiting both training and real-world deployment scenarios; a short sketch of the first two ideas follows.
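The sketch below (Python/NumPy; signal lengths and kernels are arbitrary) illustrates both ideas: first, linear convolution computed in the frequency domain as a point-wise product of Fourier transforms; second, a separable 2-D kernel applied as two 1-D convolutions and checked against the naive 2-D convolution.

    import numpy as np

    rng = np.random.default_rng(4)

    # (1) Frequency-domain convolution of two 1-D signals.
    a = rng.standard_normal(64)
    k = rng.standard_normal(9)
    n = len(a) + len(k) - 1                       # length of the full linear convolution
    freq = np.fft.fft(a, n) * np.fft.fft(k, n)    # convolution becomes a pointwise product
    via_fft = np.fft.ifft(freq).real
    assert np.allclose(via_fft, np.convolve(a, k))

    # (2) Separable 2-D convolution: K = outer(v, h) applied as two 1-D passes.
    image = rng.standard_normal((32, 32))
    v = np.array([1.0, 4.0, 6.0, 4.0, 1.0])       # vertical factor (width w_v = 5)
    h = np.array([1.0, 2.0, 1.0])                 # horizontal factor (width w_h = 3)
    K = np.outer(v, h)

    def conv2d_valid(x, kernel):
        kh, kw = kernel.shape
        flipped = kernel[::-1, ::-1]              # flip for true convolution
        out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i+kh, j:j+kw] * flipped)
        return out

    naive = conv2d_valid(image, K)                 # O(w_v * w_h) work per output value
    cols = np.apply_along_axis(lambda c: np.convolve(c, v, mode="valid"), 0, image)
    sep = np.apply_along_axis(lambda r: np.convolve(r, h, mode="valid"), 1, cols)
    assert np.allclose(naive, sep)                 # O(w_v + w_h) work per output value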
Random and Unsupervised Features
1. Random Initialization for CNN Layers:
o Layers consisting of convolution followed by pooling can inherently become frequency-selective
and translation-invariant even with randomly initialized weights. This is a useful property for
creating feature extractors without intensive pre-training.
2. Efficient Model Selection:
o One approach involves randomly initializing several CNN architectures, training only the final classification layer, and evaluating their performance. Once the best-performing model (the "winner") is identified, it can be further fine-tuned using more computationally expensive methods, such as full supervised training. This approach helps reduce training costs during the early stages of model selection (a minimal sketch of this random-feature pipeline follows this list).
3. Hand-Designed Kernels:
o In some cases, using manually designed kernels (e.g., edge detectors) can improve efficiency.
These kernels can serve as pre-trained filters, saving computational resources and simplifying the
convolution process.
4. Unsupervised Pre-Training:
o Training CNNs with unsupervised methods can have a regularization effect, improving
generalization. Additionally, unsupervised pre-training can reduce computational costs, making it
possible to train larger CNNs effectively.
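A minimal sketch of the random-feature pipeline from items 1 and 2 above (Python with NumPy and scikit-learn; the sizes, the placeholder labels, and the use of logistic regression as the final classifier are illustrative assumptions): random, untrained filters followed by ReLU and max-pooling produce features, and only the final linear classifier is trained.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    n, H, W, F, n_filters = 200, 16, 16, 5, 8
    images = rng.standard_normal((n, H, W))
    labels = rng.integers(0, 2, size=n)              # placeholder labels for illustration
    filters = rng.standard_normal((n_filters, F, F)) # random, never trained

    def random_conv_features(img):
        feats = []
        for w in filters:
            out = np.zeros((H - F + 1, W - F + 1))
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    out[i, j] = np.sum(img[i:i+F, j:j+F] * w)
            out = np.maximum(out, 0.0)               # ReLU
            feats.append(out.max())                  # global max-pool per filter
        return np.array(feats)

    X = np.stack([random_conv_features(img) for img in images])
    clf = LogisticRegression().fit(X, labels)        # only this final layer is trained
    print("training accuracy:", clf.score(X, labels))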
Greedy Layer-wise Pre-training
1. Layer-by-Layer Training:
• Instead of training all the layers of a CNN at once, this method trains each convolutional layer individually
and sequentially.
• Starting with the first layer, features are learned and stored. These features are then used as inputs for
training the next layer. This step-by-step approach reduces the complexity involved in training deep
networks.
2. Managing Computational Costs:
• While this method simplifies training, it can be computationally expensive during inference, as the entire
network is used for predictions. However, it allows for the training of very large models that would
otherwise be difficult to optimize.
3. Applications in Convolution:
• This method could complement efficient algorithms like using separable kernels or frequency domain
techniques by ensuring that training proceeds in a structured, less resource-intensive manner. It
highlights the importance of balancing efficiency during training and inference.
Neuroscientific Basis
Convolutional networks (ConvNets) represent one of the most significant advancements in artificial
intelligence, drawing substantial inspiration from neuroscience. While many disciplines have shaped ConvNets,
their fundamental principles stem from the structure and function of the human brain's visual system.
The origins of ConvNets can be traced back to the pioneering work of neuroscientists David Hubel and
Torsten Wiesel. Through extensive experiments, these Nobel Prize-winning researchers investigated how the
brain processes visual information. They discovered that certain neurons, particularly in the primary visual cortex
(V1), are highly sensitive to specific patterns, such as edges at particular orientations, while remaining largely
unresponsive to others. This foundational research has profoundly influenced the development of artificial
intelligence models.
In biological systems, visual processing begins with light entering the eye and stimulating the retina.
Signals are then transmitted through the optic nerve to the lateral geniculate nucleus and finally to V1, where
advanced processing occurs. ConvNets emulate this process by employing successive layers to detect
increasingly complex features, mirroring the hierarchical structure of the visual system.
By mimicking how the brain works, convolutional networks show how biology can inspire amazing
breakthroughs in artificial intelligence.
How Convolutional Networks Mimic the Brain
A convolutional network layer is designed to copy three key features of V1, the primary visual cortex:
1. Spatial Mapping: V1 has a two-dimensional layout, just like the retina of the eye. This means light hitting
a specific area of the retina affects the matching area in V1. Convolutional networks reflect this by
organizing their features into two-dimensional maps.
2. Simple Cells: V1 has neurons called simple cells that react to small, specific parts of an image using
straightforward rules. Convolutional networks use detector units to mimic the way these simple cells
work.
3. Complex Cells: V1 also has complex cells, which recognize patterns like simple cells do, but they don’t
get confused if the pattern shifts slightly or if the lighting changes. This inspired the pooling units in
convolutional networks, which make them resistant to minor changes in position or lighting.
As we move deeper into the brain’s visual system, the process of detection and pooling is repeated.
Eventually, some neurons, known as “grandmother cells,” respond to very specific concepts, like recognizing a
person’s grandmother no matter how she appears—left or right, up close or far away, in bright light or shadows.
These concept-specific neurons have been found in the human brain's medial temporal lobe. For example,
one neuron was nicknamed the “Halle Berry neuron” because it fired when people saw images, drawings, or
even the name of Halle Berry. However, this doesn’t mean these neurons are as specific as they seem; other
neurons responded to concepts like Bill Clinton or Jennifer Aniston.
The last layer of a convolutional network resembles the brain’s inferotemporal cortex (IT), where advanced
object recognition happens. Within 100 milliseconds of seeing an object, the brain processes visual data
from the retina, through the lateral geniculate nucleus (LGN), to V1, and onward to other areas like V2, V4,
and IT. Convolutional networks are similar to this process and can even match human performance in object
recognition when given limited time.
Key Differences Between ConvNets and the Human Visual System
Convolutional networks (ConvNets) and the mammalian vision system have key differences, many of which
we understand, while others remain mysteries. Here are some notable ones:
1. Eye vs. Input: The human eye is mostly low resolution except for a small area called the fovea, which
captures fine details in a tiny portion of a scene (about the size of a thumbnail at arm’s length). Our brain
creates the illusion of high-resolution vision by combining these small glimpses. ConvNets, on the other
hand, usually process entire high-resolution images in one go. Humans rely on quick eye movements,
called saccades, to focus on important parts of a scene. Adding such attention mechanisms to deep
learning models is an area of research, but it’s still not widely adopted for vision tasks.
2. Integration with Other Senses: Our vision system works closely with other senses, like hearing, and is
influenced by factors like mood and thoughts. ConvNets, however, focus only on visual data.
3. Complex Understanding: Humans don’t just recognize objects—we interpret entire scenes, understand
relationships between objects, and process 3D geometry for interacting with the world. ConvNets are
starting to address these tasks, but progress is limited.
4. Feedback: Even simple areas in the brain, like V1 (the primary visual cortex), are shaped by feedback
from higher-level brain areas. Neural networks have explored feedback mechanisms, but they haven’t yet
shown clear advantages.
5. Different Calculations: The brain likely uses different methods to process information than ConvNets
do. For example, neurons may rely on combinations of quadratic filters rather than simple linear filters.
Moreover, the distinction between “simple” and “complex” cells in the brain might not be as clear-cut as
we thought; they could represent a spectrum of behaviors.
6. Training: The methods used to train ConvNets (like back-propagation) weren’t inspired by neuroscience
and might not resemble how the brain learns. Early models like the neocognitron had similar structures
but used simpler training methods. Modern ConvNets, which emerged in the late 1980s, gained their
power from applying back-propagation to 2D image data.
While neuroscience has inspired ConvNets in many ways, there’s still a lot to learn about how the brain works—
and how those insights can further improve artificial intelligence.
How Simple and Complex Cells Work
1. Simple Cells:
o Simple cells are neurons in the brain’s visual system that respond to specific features like edges
or patterns in an image.
o These cells work in a roughly linear way, meaning their responses can be predicted based on how
the image overlaps with their “weights” or preferences.
o In artificial neural networks, we can visualize what a simple cell responds to by looking at the
convolution kernel (a small filter). In biology, we can estimate what a real neuron responds to by
using an approach called reverse correlation. This involves showing the neuron random patterns
(like white noise) and seeing how it reacts. From this data, we approximate the weights of the
neuron.
o Most simple cells in the visual system have weights that follow a mathematical pattern called a
Gabor function. This function combines two main parts:
▪ A Gaussian factor, which ensures the cell responds most strongly to the center of its
receptive field (the small area of the image it focuses on).
▪ A cosine factor, which controls how the cell reacts to changes in brightness across the
field.
o The Gabor function is expressed as: $$w(x, y) = \exp\left(-\frac{x^2}{\sigma_x^2} - \frac{y^2}{\sigma_y^2}\right)\cos(fx + \phi)$$ Where:
▪ The Gaussian factor depends on x and y (spatial coordinates) and controls how quickly the response fades as we move away from the center.
▪ The cosine factor controls how the cell responds to light or dark bands, where $f$ is the frequency of the pattern and $\phi$ adjusts the phase (alignment) of the wave.
o Simple cells are most excited when the image perfectly matches their weights (bright where weights are positive, dark where weights are negative).
2. Complex Cells:
o Complex cells respond to similar patterns as simple cells but are more flexible. They remain
unaffected by small shifts in the pattern’s position or whether the pattern is reversed (e.g., black
turns to white).
o A complex cell’s response is computed as: $$c(I) = \sqrt{s_0(I)^2 + s_1(I)^2}$$
▪ Here, $s_0(I)$ and $s_1(I)$ are the outputs of two simple cells, one slightly out of phase with the other. Together, they form a “quadrature pair,” allowing the complex cell to handle small changes in position or brightness (a small numerical sketch follows this list).
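A small numerical sketch (Python/NumPy; the parameter values are arbitrary) of the Gabor weight function above and of a quadrature pair feeding a complex-cell response:

    import numpy as np

    def gabor(size=15, sigma_x=3.0, sigma_y=3.0, f=0.5, phi=0.0):
        # Evaluate w(x, y) = exp(-x^2/sigma_x^2 - y^2/sigma_y^2) * cos(f*x + phi) on a grid.
        coords = np.arange(size) - size // 2
        x, y = np.meshgrid(coords, coords, indexing="xy")
        gaussian = np.exp(-(x**2) / sigma_x**2 - (y**2) / sigma_y**2)
        return gaussian * np.cos(f * x + phi)

    # Quadrature pair: two simple-cell filters 90 degrees out of phase.
    w0 = gabor(phi=0.0)
    w1 = gabor(phi=np.pi / 2)

    patch = np.random.default_rng(5).standard_normal((15, 15))
    s0 = np.sum(w0 * patch)            # simple-cell responses (roughly linear)
    s1 = np.sum(w1 * patch)
    c = np.sqrt(s0**2 + s1**2)         # complex-cell response: tolerant of small phase shifts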
Neuroscience Meets Machine Learning
• Early layers of deep learning models often learn features similar to those of simple cells, such as edge
detectors. These detectors are critical because edges are a fundamental feature in natural images.
• The connection between biology and machine learning was highlighted by researchers like Olshausen
and Field, who showed that even basic learning algorithms (like sparse coding) can uncover features
similar to those in the brain.
• Many different learning methods, including deep learning, naturally discover these edge-detecting
features in their first layer when applied to real-world images. This highlights how important edges and
Gabor-like patterns are for understanding visual data.
By alternating between layers of simple-like selectivity and complex-like invariance, artificial neural networks
and biological vision systems can build up more advanced representations, eventually recognizing objects or
scenes. This process is at the heart of both convolutional networks and how our visual system works.

The picture illustrates variations of Gabor functions based on changes in their parameters. Here's what it explains:
1. Left Section:
o Shows how the Gabor function shifts and rotates based on the parameters that define its coordinate system (the filter’s center position and orientation).
o Each Gabor filter in the grid is centered at a specific position, and its sensitivity aligns with directions radiating outward from the grid’s center.
2. Center Section:
o Highlights the impact of the Gaussian scale parameters $\sigma_x$ and $\sigma_y$, which control the width and height of the Gabor functions.
o Moving left to right, the filters become wider (larger $\sigma_x$); moving top to bottom, they become taller (larger $\sigma_y$).
3. Right Section:
o Focuses on the sinusoidal parameters ($f$ and $\phi$).
o Moving left to right, the frequency $f$ of the Gabor function increases; moving top to bottom alters the phase $\phi$.
In summary, the picture visually demonstrates how Gabor filters (used in edge detection and feature extraction)
change depending on parameters like position, scale, frequency, and phase. It helps understand how these
filters detect different features in an image by modeling visual patterns.
Applications: Computer Vision
Computer Vision refers to enabling machines to interpret and process visual data, much like humans do. This
field utilizes deep learning, particularly convolutional neural networks (CNNs), to analyze and extract meaningful
insights from images and videos. Let’s dive into its applications and relevance across different industries:
1. Object Detection
One of the most common applications of Computer Vision is object detection. The goal here is to identify and
localize objects within an image or video. For example:
• In autonomous vehicles, object detection is crucial for identifying pedestrians, traffic signs, and other
vehicles.
• In retail, cameras powered by object detection algorithms can help track customer activity and inventory.
• In wildlife monitoring, systems can detect animals in their natural habitats to study behavior or prevent poaching.
Object detection models like YOLO (You Only Look Once) and Faster R-CNN are widely used for real-time detection.
2. Image Classification
In this application, an image is analyzed and assigned a label that defines its content:
• Medical Image Classification: AI-powered systems classify X-rays or MRIs as "normal" or "abnormal."
• Social media: Algorithms classify and tag images uploaded by users (e.g., recognizing food, animals, or
locations).
• Quality Control: In manufacturing, image classification can identify defects in products.
3. Facial Recognition
Facial recognition systems identify or verify individuals by analyzing unique facial features. Applications include:
• Security: Unlocking smartphones or enabling access control in sensitive areas.
• Law Enforcement: Identifying suspects through CCTV footage.
• Retail: Offering personalized shopping experiences to customers.
4. Medical Imaging
Medical applications of computer vision involve analyzing radiological images to assist doctors. For example:
• Detecting anomalies like tumors, fractures, or blockages.
• Segmenting organs and tissues for treatment planning.
• Automating the analysis of huge datasets in genomics research.
5. Autonomous Driving
Computer Vision forms the core of autonomous vehicle navigation systems:
• Analyzing traffic signs, lane markers, and road conditions.
• Detecting and reacting to pedestrians or cyclists in real time.
• Mapping environments using visual SLAM (Simultaneous Localization and Mapping).
6. Challenges and Future Trends
The accuracy of Computer Vision systems depends on diverse and high-quality datasets, which can be
challenging to collect. Ethical issues, such as privacy concerns in facial recognition, are also a growing area of
discussion. Future trends include:
• Enhanced processing with quantum computing.
• Expanding applications in augmented reality (AR) and mixed reality (MR).
Applications: Image Generation
Image generation involves creating or modifying images using AI. Generative models such as GANs (Generative
Adversarial Networks) and VAEs (Variational Autoencoders) have led to impressive progress. Let’s explore its
applications:
1. Artistic Creation
AI-generated art has become a movement of its own:
• Platforms like DeepArt and ArtBreeder allow users to create stylized digital art.
• Museums have collaborated with AI to recreate lost or damaged artwork.
• AI also helps in design industries, providing tools for architects and graphic designers.
2. Augmented and Virtual Reality (AR/VR)
In AR/VR environments, AI-generated images contribute to creating realistic virtual spaces. Examples include:
• Games: Generating lifelike characters, terrains, and dynamic environments.
• Training Simulations: Crafting virtual scenarios for military, healthcare, or educational purposes.
3. Synthetic Data Creation
For applications where real-world data is scarce or expensive to collect, AI can generate synthetic datasets:
• In healthcare, creating synthetic medical images to train diagnostic models.
• For autonomous driving, generating virtual roads and traffic scenarios for testing.
4. Image-to-Image Translation
AI can transform one type of image into another while preserving structure:
• Transforming sketches into photorealistic images (e.g., Nvidia’s GauGAN).
• Altering images’ styles, like turning photos into “paintings” in the style of Van Gogh or Picasso.
• Applications in fashion, where users can visualize clothing in different colors and styles.
5. Realistic Face Generation
GANs can create realistic-looking human faces that don’t belong to real people. This technology has been
applied in:
• Video games and movies, for creating avatars or characters.
• Testing AI systems for tasks like facial recognition.
6. Challenges
Despite advancements, there are challenges:
• High computational requirements for training generative models.
• Ethical concerns, such as the misuse of AI to create "deepfakes."
Applications: Image Compression
Image compression aims to reduce the size of image files while maintaining as much quality as possible. This is
essential for optimizing storage and transmission in a data-driven world.
1. Traditional Methods
Historically, formats like JPEG, PNG, and GIF have been widely used. While JPEG employs lossy compression to
significantly reduce file size, PNG uses lossless compression to preserve all details.
2. Modern Deep Learning Approaches
CNNs and other neural networks have revolutionized image compression:
• Autoencoders: By learning compact representations of images, autoencoders compress images into lower-dimensional forms, which are reconstructed later with minimal loss (a minimal sketch follows this list).
• End-to-End Trainable Systems: These systems learn both compression and decompression processes
simultaneously, enabling optimization of quality and size.
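A minimal sketch (PyTorch; the layer sizes and the 64 x 64 input are illustrative assumptions) of a convolutional autoencoder of the kind described above: the encoder maps an image to a compact code, the decoder reconstructs the image, and the whole system is trained end-to-end to minimize reconstruction error.

    import torch
    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder: 3 x 64 x 64 image -> 32 x 8 x 8 latent code (the "compressed" form).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),   # 64 -> 32
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # 32 -> 16
                nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=4, stride=2, padding=1),  # 16 -> 8
            )
            # Decoder: latent code -> reconstructed 3 x 64 x 64 image.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1),  # 8 -> 16
                nn.ReLU(),
                nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 16 -> 32
                nn.ReLU(),
                nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 32 -> 64
                nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)          # compressed representation
            return self.decoder(z), z

    model = ConvAutoencoder()
    images = torch.rand(8, 3, 64, 64)
    recon, code = model(images)
    loss = nn.functional.mse_loss(recon, images)   # trained end-to-end on reconstruction error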
3. Applications
• Web Optimization: For faster loading of web pages with minimal bandwidth usage, compressed images
are essential.
• Mobile Apps: Compressing photos and videos reduces storage costs and makes sharing faster.
• Media Streaming: Platforms like YouTube and Netflix rely heavily on compression to deliver high-quality
content without excessive bandwidth.
4. Industry-Specific Use Cases
• Medical Imaging: Compression enables the storage of large datasets, such as MRIs or CT scans, for
archival purposes or remote sharing.
• Cloud Storage: With the growing reliance on cloud platforms, efficient compression reduces operational
costs.
• Photography: In cameras and photo editing software, compression helps balance storage with photo
quality.
5. Lossy vs. Lossless Compression
In some use cases, lossy methods are acceptable as minor quality loss is tolerable. For instance, streaming
videos prioritize speed. Lossless compression, however, is critical in areas like medicine or forensic science,
where accuracy cannot be compromised.
6. Challenges
While AI-driven methods have led to better compression rates, challenges persist:
• Balancing computational efficiency with compression quality.
• Handling diverse data types (e.g., text, images, and video) within the same systems.
