Unit2 3 Notes
A feed forward neural network is an artificial neural network in which the connections between nodes do not form loops. This
type of neural network is also known as a multi-layer neural network, and all information is only passed
forward.
During data flow, input nodes receive the data, which travels through the hidden layers and exits through the output nodes.
The network has no links that send information back from the output nodes.
• The feed forward model defines a mapping y = f(x; θ) and learns the value of the parameters θ that gives the closest
approximation of the target function.
Feed forward neural networks serve as the basis for object detection in photos, as shown in the
Google Photos app.
In its simplest form, a feed forward neural network is a single-layer perceptron.
This model multiplies each input by a weight as it enters the layer and then adds the weighted
inputs together. If the sum rises above a threshold, usually set at zero, the output is
typically 1; if it falls below the threshold, the output is typically -1.
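The perceptron rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the example values are made up for the sketch, and a sum exactly equal to the threshold is assumed to map to -1, which the notes do not specify.

```python
def perceptron_output(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: a sum exactly at the threshold is treated as "below".
    return 1 if weighted_sum > threshold else -1


# Hypothetical inputs and weights chosen only for illustration.
print(perceptron_output([1.0, 2.0], [0.5, -0.1]))   # positive weighted sum
print(perceptron_output([1.0, 2.0], [-0.5, 0.1]))   # negative weighted sum
```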
As a feed forward model, the single-layer perceptron is often used for
classification, and it can be trained. During training, the network adjusts its weights
using the delta rule, which compares its outputs with the intended values and changes
each weight in proportion to the error.
This training is a form of gradient descent. Multi-layered perceptrons update their weights
in a similar way, but the process is known as back-propagation. In that case, the weights of the
network's hidden layers are adjusted according to the error in the output values produced by the final layer.
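As a rough sketch of the delta rule, the snippet below trains a single linear unit o = w * x + b on a toy dataset. The function name, learning rate, epoch count, and data are assumptions made for the example; the update itself is the standard delta rule, weight change = learning rate × (target - output) × input.

```python
def delta_rule_train(samples, learning_rate=0.1, epochs=200):
    """Fit a single linear unit o = w * x + b with the delta rule."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = w * x + b
            error = target - output            # compare the output with the intended value
            w += learning_rate * error * x     # delta rule update for the weight
            b += learning_rate * error         # same rule, treating the bias input as 1
    return w, b


# Toy data generated from target = 2 * x + 1 (illustrative values only).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = delta_rule_train(samples)
print(w, b)  # both values should end up close to 2 and 1
```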
• Input layer:
The neurons of this layer receive the input and pass it on to the other layers of the network. The number of
neurons in the input layer must match the number of features (attributes) in the dataset.
• Output layer:
This layer represents the predicted feature, and its form depends on the type of model being built.
• Hidden layer:
Hidden layers separate the input and output layers. Depending on the type of model, there may
be several hidden layers.
Each hidden layer contains several neurons that transform the input before passing it to
the next layer. The weights of the network are updated continually during training to improve its predictions.
• Neuron weights:
Each connection between two neurons carries a weight, which measures the strength or magnitude of that
connection. Input weights can be compared with one another in the same way as linear regression coefficients.
• Neurons:
Feed forward networks are built from artificial neurons, which are adapted from biological neurons.
Neurons function in two steps: first, they compute a weighted sum of their inputs, and second, they pass
that sum through an activation function to normalize it.
Activation functions can be either linear or nonlinear. Each neuron has a weight for each of its inputs,
and the network learns these weights during the learning phase.
• Activation Function:
The activation function determines whether a neuron's output is a linear or nonlinear function of its
weighted input. Because a signal passes through many layers, a bounded activation function also keeps
neuron outputs from growing in a cascading fashion.
Activation functions fall into three major categories: sigmoid, Tanh, and Rectified
Linear Unit (ReLU).
• Sigmoid:
Maps input values to the range 0 to 1.
• Tanh:
Maps input values to the range -1 to 1.
• Rectified Linear Unit (ReLU):
Only positive values are allowed to flow through this function. Negative values get mapped to 0.
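For reference, the three activation functions can be written directly from their standard definitions. This is a small sketch using only Python's math module; the function names are chosen for the example.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash z into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Pass positive values through unchanged and map negative values to 0."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```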
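Cost function
In a feed forward neural network, the cost function plays an important role. Minor adjustments to
weights and biases have little effect on how the data points are categorized, so the number of
correctly categorized points is a poor guide for tuning them.
Thus, a smooth cost function is used to determine how to adjust the weights and biases to
improve performance. A common choice is the mean squared error (quadratic) cost:
$$ C(w, b) = \frac{1}{2n} \sum_x \lVert y(x) - a \rVert^2 $$
where
w = weights
b = biases
n = number of training inputs
y(x) = desired output for input x
a = output vectors
x = input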
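Loss function
The loss function of a neural network measures the error of its predictions and so determines what
adjustment needs to be made in the learning process.
For classification, the number of neurons in the output layer equals the number of classes, and the loss
shows the difference between the predicted and actual probability distributions. The cross-entropy loss
for binary classification, with true label y (0 or 1) and predicted probability p, is:
$$ L = -\left[ y \log p + (1 - y) \log (1 - p) \right] $$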
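A minimal sketch of the binary cross-entropy loss, averaged over a list of examples, is shown below. The function name and the small clipping constant eps (used to avoid log(0)) are assumptions of this example.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithms finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident, correct predictions give a small loss; confident, wrong ones give a large loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```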
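Gradient descent computes the gradient of the loss with respect to each parameter. To decrease the
function, it subtracts the gradient value from the parameter (to increase it, it would add). This
procedure can be written as:
$$ \theta \leftarrow \theta - \eta \nabla_\theta L(\theta) $$
The parameter η, the learning rate, scales the gradient and so determines the step size. The learning
rate significantly affects performance in machine learning.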
Output units
In the output layer, output units are those units that provide the desired output or prediction,
thereby fulfilling the task that the neural network needs to complete.
There is a close relationship between the choice of output units and the cost function. Any unit that
can serve as a hidden unit can also serve as an output unit in a neural network.
• The simplified architecture of feed forward neural networks can be an advantage in machine
learning.
• Multi-layer neural networks can handle and process nonlinear data easily, which is otherwise
difficult for single perceptrons and sigmoid neurons.
• Depending on the data, the neural network architecture can vary. For example,
convolutional neural networks (CNNs) perform exceptionally well in image processing,
whereas recurrent neural networks (RNNs) perform well in text and voice processing.
• Neural networks need graphics processing units (GPUs) to handle large datasets, which demand
massive computational and hardware performance. GPUs are widely available through
platforms such as Kaggle Notebooks and Google Colab Notebooks.
In physiology, feed forward control can be identified in the way the central autonomic
nervous system regulates the heartbeat before exercise begins.
In gene regulation, a feed forward motif acts as a feed forward system for detecting non-temporary
changes in the environment. This pattern appears in the majority of well-known networks.
In control systems, parallel feed forward compensation is a technique that converts the open-loop
transfer function of a non-minimum phase system into a minimum phase system.
Neural networks (NNs) are the typical deep learning algorithms. Their popularity results from
their unique layered structure, which gives them a 'deep' understanding of data.
Furthermore, NNs are flexible in terms of complexity and structure. They may work better with
advanced additions, but they cannot work without the basic elements, and the
underlying structure remains the same.
Let's begin. NNs get constructed similarly to our biological neurons, and they resemble the
following:
Neurons are hexagons in this image. In neural networks, neurons get arranged into layers: input is the
first layer, and output is the last with the hidden layer in the middle.
NN consists of two main elements that compute mathematical operations. Neurons calculate
weighted sums using input data and synaptic weights since neural networks are just mathematical
computations based on synaptic links.
In the third step, a vector of ones gets multiplied by the output of our hidden layer:
Using the output value, we can calculate the result. Understanding these fundamental concepts will
make building NN much easier, and you will be amazed at how quickly you can do it. Every layer's
output becomes the following layer's input.
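The idea that every layer's output becomes the following layer's input can be sketched as a simple loop. The layer sizes, weights, and the choice of a sigmoid activation below are illustrative assumptions, not values from these notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer: weights[i][k] connects input k to neuron i."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)  # output feeds the next layer
    return activation


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 2.0], layers)
print(output)
```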
The architecture of a network refers to the number of hidden layers and the number of units in
each layer.
According to the Universal Approximation Theorem, a feed forward network with a linear output layer
and at least one hidden layer with a "squashing" activation function can approximate any Borel
measurable function from one finite-dimensional space to another with any desired non-zero amount
of error, provided the network has enough hidden units.
Put simply, a multi-layer perceptron (MLP) can always represent the function we are trying to learn,
whatever that function is.
Thus, we know that some MLP will always be able to solve our problem, but there is no specific method
for finding it.
Nor is it possible to say in advance whether N layers with M hidden units will be enough to solve a
given problem.
Research is still ongoing, and for now, the only way to determine this configuration is by
experimenting with it.
Finding the appropriate architecture is therefore challenging: we need to try many configurations before
finding one that can represent the target function.
Even when an MLP is able to represent the function, learning can fail for two reasons. Firstly, the
optimization algorithm may not find the correct parameters, and secondly, the training algorithm may
choose the wrong function because of overfitting.
Backpropagation is a technique based on gradient descent. Each step of gradient descent moves the
parameters of a function in the opposite direction of its gradient (the slope).
When training a neural network, the goal is to reduce the cost function given the training data.
The cost function depends on the weights and biases of all neurons in each layer.
Backpropagation is used to calculate the gradient of the cost function. The weights and biases are then
updated in the opposite direction of the gradient to reduce the cost, and the process repeats iteratively.
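To state the backpropagation formulas, we first define the error of the i-th neuron in the l-th layer of
the network for the j-th training example as follows (in which $z_i^{[l]}(j)$
represents the weighted input to the neuron, and L represents the loss):
$$ \delta_i^{[l]}(j) = \frac{\partial L}{\partial z_i^{[l]}(j)} $$
With the error defined as above, the backpropagation formulas are:
$$ (1) \quad \delta^{[L]}(j) = \nabla_{a^{[L]}} L \odot g'\left(z^{[L]}(j)\right) $$
$$ (2) \quad \delta^{[l]}(j) = \left( W^{[l+1]T} \delta^{[l+1]}(j) \right) \odot g'\left(z^{[l]}(j)\right) $$
$$ (3) \quad \frac{\partial L}{\partial b_i^{[l]}} = \delta_i^{[l]}(j) $$
$$ (4) \quad \frac{\partial L}{\partial w_{ik}^{[l]}} = \delta_i^{[l]}(j) \, a_k^{[l-1]}(j) $$
In these formulas, a superscript L stands for the output layer, g for the activation function, ∇ for the
gradient, ⊙ for element-wise multiplication, and $W^{[l]T}$ for the weights of layer l transposed.
The weighted input $z_i^{[l]}(j)$ of neuron i at layer l depends on $b_i^{[l]}$, the bias of neuron i at layer l;
$w_{ik}^{[l]}$, the weight from neuron k at layer l-1 to neuron i at layer l; and $a_k^{[l-1]}(j)$, the activation
of neuron k at layer l-1 for training example j.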
The first equation shows how to calculate the error at the output layer for sample j. Following that,
we can use the second equation to calculate the error in the layer just before the output layer.
More generally, the second equation calculates the error in any layer from the error values of the next layer.
Because this algorithm calculates errors backward, it is known as backpropagation.
For sample j, the third and fourth equations give the gradient of the loss function with respect to
the biases and the weights.
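To make the backward calculation concrete, here is a small sketch for a network with one hidden layer, sigmoid activations, and a squared-error loss for a single sample. All names, sizes, and numbers are assumptions made for the example. The errors are computed at the output layer first and then propagated back to the hidden layer, and one entry of the result is compared against a numerical (finite-difference) gradient as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Return hidden activations a1 and output activations a2."""
    a1 = [sigmoid(sum(w * xk for w, xk in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a2 = [sigmoid(sum(w * ak for w, ak in zip(row, a1)) + b) for row, b in zip(W2, b2)]
    return a1, a2


def loss(a2, y):
    """Squared-error loss for one sample (an assumed choice for this sketch)."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a2, y))


def backprop(x, y, W1, b1, W2, b2):
    a1, a2 = forward(x, W1, b1, W2, b2)
    # Output-layer error: dL/da times g'(z), where g'(z) = a * (1 - a) for the sigmoid.
    delta2 = [(a - t) * a * (1 - a) for a, t in zip(a2, y)]
    # Hidden-layer error: push delta2 back through W2 transposed, then times g'(z).
    delta1 = [sum(W2[i][k] * delta2[i] for i in range(len(delta2))) * a1[k] * (1 - a1[k])
              for k in range(len(a1))]
    # Bias gradients equal the errors; weight gradients are error times incoming activation.
    grad_W2 = [[d * a for a in a1] for d in delta2]
    grad_W1 = [[d * xk for xk in x] for d in delta1]
    return grad_W1, delta1, grad_W2, delta2


# Hypothetical sample and parameters.
x, y = [0.5, -1.0], [1.0]
W1, b1 = [[0.1, 0.2], [-0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, y, W1, b1, W2, b2)


def numeric_gradient_W1(i, k, eps=1e-6):
    """Central finite-difference estimate of dL/dW1[i][k]."""
    W_plus = [row[:] for row in W1]
    W_minus = [row[:] for row in W1]
    W_plus[i][k] += eps
    W_minus[i][k] -= eps
    loss_plus = loss(forward(x, W_plus, b1, W2, b2)[1], y)
    loss_minus = loss(forward(x, W_minus, b1, W2, b2)[1], y)
    return (loss_plus - loss_minus) / (2 * eps)


print(grad_W1[0][1], numeric_gradient_W1(0, 1))  # the two values should agree closely
```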
We can update the biases and weights using the gradients of the loss function with respect to them,
averaged over all samples.
This process is known as batch gradient descent. With too many samples, each update takes a long
time.
Alternatively, we can update the biases and weights after computing the gradient for each individual
sample. This process is known as stochastic gradient descent.
Even though this algorithm is faster than batch gradient descent, a gradient calculated from a single
sample is not a good estimate of the true gradient.
It is also possible to update the biases and weights based on the average gradients of small batches of
samples. This is referred to as mini-batch gradient descent and is usually preferred over the other two.
What is Gradient Descent?
Gradient descent is an optimization algorithm used in machine learning to minimize the cost
function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find
the optimal set of parameters.
The cost function represents the discrepancy between the predicted output of the model and the
actual output. Gradient descent aims to find the parameters that minimize this discrepancy and
improve the model’s performance.
The algorithm operates by calculating the gradient of the cost function, which indicates the
direction and magnitude of the steepest ascent. However, since the objective is to minimize the
cost function, gradient descent moves in the opposite direction of the gradient, known as the
negative gradient direction.
By iteratively updating the model’s parameters in the negative gradient direction, gradient descent
gradually converges towards the optimal set of parameters that yields the lowest cost. The learning
rate, a hyperparameter, determines the step size taken in each iteration, influencing the speed and
stability of convergence.
Gradient descent can be applied to various machine learning algorithms, including linear
regression, logistic regression, neural networks, and support vector machines. It provides a
general framework for optimizing models by iteratively refining their parameters based on the cost
function.
Let’s say you are playing a game in which the players start at the top of a mountain and are asked to
reach a lake at the lowest point of the mountain. Additionally, they are blindfolded. So, what approach do
you think would make you reach the lake?
The best way is to feel the ground and find where the land descends. From that position, step in
the descending direction and repeat this process until you reach the lowest point.
Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.
To find the local minimum of a function using gradient descent, we must take steps proportional to
the negative of the gradient (move away from the gradient) of the function at the current point. If we
take steps proportional to the positive of the gradient (moving towards the gradient), we will
approach a local maximum of the function, and the procedure is called Gradient Ascent.
Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest
descent.
The goal of the gradient descent algorithm is to minimize the given function (say, cost function). To
achieve this goal, it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at that point
2. Make a step (move) in the direction opposite to the gradient, that is, move from the
current point by alpha times the gradient at that point, against the direction in which the slope increases
Alpha is called Learning rate – a tuning parameter in the optimization process. It decides the length
of the steps.
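The two steps can be sketched as a short loop. The function below is a minimal illustration that assumes the caller supplies the gradient as a function; the example cost f(x) = (x - 3)^2, the starting point, alpha, and the number of steps are all chosen for the example.

```python
def gradient_descent(grad, start, alpha=0.1, steps=100):
    """Repeat: compute the gradient, then step against it by alpha times the gradient."""
    x = start
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x


# Example cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # should be very close to 3
```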
The choice of gradient descent algorithm depends on the problem at hand and the size of the
dataset. Batch gradient descent is suitable for small datasets, while stochastic gradient descent
algorithm is more suitable for large datasets. Mini-batch is a good compromise between the two
and is often used in practice.
Batch gradient descent updates the model’s parameters using the gradient of the entire training set.
It calculates the average gradient of the cost function over all the training examples and updates the
parameters in the opposite direction. For a convex cost function and a suitably small learning rate, batch
gradient descent converges to the global minimum, but it can be computationally expensive and slow for
large datasets.
Stochastic gradient descent updates the model’s parameters using the gradient of one training
example at a time. It randomly selects a training example, computes the gradient of the
cost function for that example, and updates the parameters in the opposite direction. Stochastic
gradient descent is computationally efficient and can converge faster than batch gradient descent.
However, it can be noisy and may not converge to the global minimum.
Mini-batch gradient descent updates the model’s parameters using the gradient of a small subset
of the training dataset, known as a mini-batch. It calculates the average gradient of the cost
function for the mini-batch and updates the parameters in the opposite direction. The mini-batch
gradient descent algorithm combines the advantages of batch and stochastic gradient descent. It is
the most commonly used method in practice. It is computationally efficient and less noisy than
stochastic gradient descent while still being able to converge to a good solution.
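The three variants differ only in how many examples contribute to each update, so a single sketch with a batch_size parameter can cover all of them: the full dataset gives batch gradient descent, 1 gives stochastic gradient descent, and anything in between gives mini-batch gradient descent. The model (y = w * x), the data, and the hyperparameters below are assumptions for the example.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """Fit y = w * x by minimizing the mean squared error over shuffled batches."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # visit the examples in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y)^2 with respect to w over the batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                         # toy data generated from y = 2 * x

w_batch = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd = minibatch_gradient_descent(xs, ys, batch_size=1)          # stochastic gradient descent
w_mini = minibatch_gradient_descent(xs, ys, batch_size=2)         # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)  # each should be close to 2
```

Because the toy data is noise-free, all three variants reach the same weight here; on real data the stochastic and mini-batch estimates would fluctuate around the minimum.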
When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and
theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and
the two parameters (thetas) along the other two axes.
It can also be visualized using contours. A contour plot shows the 3-D surface in two dimensions, with
the parameters along the axes and the response (cost) drawn as contour rings. The value of the response
is the same everywhere along a ring and increases away from the center, so the response grows with
the distance of a point from the center (along a given direction).
We have the direction we want to move in. Now, we must decide the size of the step we must take.
The learning rate must be chosen carefully to end up at a local minimum.
• If the learning rate is too high, we might overshoot the minimum and keep bouncing
without reaching it
• If the learning rate is too small, the training might take too long
Four cases can be distinguished, where η is the learning rate and C is the curvature (second derivative) of
the cost function:
• The learning rate is optimal, and the model converges to the minimum.
• The learning rate is too small. It takes more time but converges to the minimum.
• The learning rate is higher than the optimal value. It overshoots but converges (1/C < η < 2/C).
• The learning rate is very large. It overshoots and diverges, moving away from the minimum, and
performance decreases as learning continues.
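The four cases can be checked on a one-dimensional quadratic cost f(x) = (C / 2) * x^2, assuming C is the curvature (second derivative) of the cost. The gradient is C * x, so each update multiplies x by (1 - η * C). The value C = 2 and the learning rates below are chosen only for illustration.

```python
def run(eta, C=2.0, start=1.0, steps=20):
    """Gradient descent on f(x) = (C / 2) * x**2, returning every visited point."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - eta * C * x      # gradient of f at x is C * x
        path.append(x)
    return path


too_small = run(eta=0.1)    # eta < 1/C: slow, steady approach to the minimum at 0
optimal = run(eta=0.5)      # eta = 1/C: reaches the minimum in one step
overshoot = run(eta=0.75)   # 1/C < eta < 2/C: jumps past the minimum but still converges
diverge = run(eta=1.1)      # eta > 2/C: every step lands farther from the minimum

for name, path in [("too small", too_small), ("optimal", optimal),
                   ("overshoot", overshoot), ("diverge", diverge)]:
    print(name, path[:3], "...", path[-1])
```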
Local Minima
The cost function may have many minimum points. Depending on the initial point (i.e., the initial
parameters theta) and the learning rate, gradient descent may settle in any of these minima. Therefore, the
optimization may converge to different points for different starting points and learning rates.
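A small experiment illustrates the dependence on the starting point. The cost f(x) = x^4 - 3x^2 + x used below is a made-up example with two minima, one on each side of zero; the learning rate and step count are also assumptions.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1


def descend(start, learning_rate=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


left = descend(start=-2.0)    # settles in the minimum on the negative side
right = descend(start=2.0)    # settles in the minimum on the positive side
print(left, f(left))
print(right, f(right))        # a different minimum with a higher cost
```

Starting from x = 2, gradient descent ends in the shallower minimum even though a lower one exists, which is exactly the local-minimum behaviour described above.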
Advantages and Disadvantages
Advantages
• Easy to use: It’s like rolling the marble yourself – no fancy tools needed, you just gotta push
it in the right direction.
• Fast updates: Each push (iteration) is quick, you don’t have to spend a lot of time figuring
out how hard to push.
• Memory efficient: You don’t need a big backpack to carry around extra information, just the
marble and your knowledge of the hill.
• Usually finds a good spot: Most of the time, the marble will end up in a pretty flat area, even
if it’s not the absolute flattest (global minimum).
Disadvantages
• Slow for giant hills (large datasets): If the hill is enormous, pushing the marble all the way
down each time can be super slow. There are better ways to roll for these giants.
• Can get stuck in shallow dips (local minima): The hill might have many dips, and the marble
could get stuck in one that isn’t the absolute lowest. It depends on where you start pushing
it from.
• Finding the perfect push (learning rate): You need to figure out how hard to push the marble
(learning rate). If you push too weakly, it’ll take forever to get anywhere. Push too hard, and it
might roll right past the flat spot.
While gradient descent is a powerful optimization algorithm, it can also present some challenges
affecting its performance. Some of these challenges include:
1. Local Optima: Gradient descent can converge to local optima instead of the global
optimum, especially if the cost function has multiple peaks and valleys.
2. Learning Rate Selection: The choice of learning rate can significantly impact the
performance of gradient descent. If the learning rate is too high, the algorithm may
overshoot the minimum, and if it is too low, the algorithm may take too long to converge.
3. Overfitting: Gradient descent can overfit the training data if the model is too complex or the
learning rate is too high. This can lead to poor generalization performance on new data.
4. Convergence Rate: The convergence rate of gradient descent can be slow for large datasets
or high-dimensional spaces, making the algorithm computationally expensive.
5. Saddle Points: In high-dimensional spaces, saddle points can make the gradient of the cost
function nearly zero, so gradient descent gets stuck on a plateau instead of converging to a
minimum.
What Is Regularization in Machine Learning?
1. Purpose: The primary goal of regularization is to reduce the model's complexity to make it more
generalizable to new data, thus improving its performance on unseen datasets.
2. Methods: There are several types of regularization techniques commonly used:
• L1 Regularization (Lasso): This adds a penalty equal to the absolute value of the magnitude
of coefficients. This can lead to some coefficients being zero, which means the model ignores
the corresponding features. It is useful for feature selection.
• L2 Regularization (Ridge): Adds a penalty equal to the square of the magnitude of
coefficients. All coefficients are shrunk toward zero, but none are eliminated, unlike in L1.
• Elastic Net: This combination of L1 and L2 regularization controls the model by adding
penalties from both L1 and L2, which can be a useful middle ground.
3. Impact on Loss Function: Regularization modifies the loss function by adding a regularization
term.
4. Choice of Regularization Parameter: The choice of λ (also known as the regularization parameter)
is crucial. It is typically chosen via cross-validation to balance fitting the training data well and
keeping the model simple enough to perform well on new data.
Regularization adds a penalty term to the standard loss function that a machine learning model
minimizes during training. This penalty encourages the model to keep its parameters (like weights in
neural networks or coefficients in regression models) small, which can help prevent overfitting.
Here’s a step-by-step breakdown of how regularization functions:
1. Modifying the Loss Function
The regularization process starts by modifying the loss function. The updated loss function
combines the initial loss, which assesses how well the model fits the training data, with a
regularization term that discourages excessive parameter magnitudes. The general form of the
regularized loss function is:
$$ L_{reg}(\theta) = L(\theta) + \lambda R(\theta) $$
Here, L(θ) is the initial loss, R(θ) is the regularization term, and λ (lambda) is the regularization
strength, which controls the trade-off between fitting the data well and keeping the model parameters small.
2. Types of Penalties
• L1 Regularization (Lasso): The penalty is the sum of the absolute values of the parameters.
This can lead to a sparse model where some parameter values are exactly zero, effectively
removing those features from the model.
• L2 Regularization (Ridge): The penalty is the sum of the squares of the parameters. This evenly
distributes the penalty among all parameters, shrinking them towards zero but not exactly
zeroing any.
• Elastic Net: A mix of L1 and L2 penalties. It is useful when there are correlations among
features or when you want to combine the feature selection properties of L1 with the
shrinkage properties of L2.
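The three penalties are simple to compute. In the sketch below, the function names and the l1_ratio mixing parameter for Elastic Net are assumptions of the example; the regularized loss is the data loss plus λ times the chosen penalty.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the parameters (Lasso)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the parameters (Ridge)."""
    return sum(w * w for w in weights)


def elastic_net_penalty(weights, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties."""
    return l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)


def regularized_loss(data_loss, weights, lam, penalty):
    """Data loss plus the regularization strength times the penalty."""
    return data_loss + lam * penalty(weights)


weights = [1.0, -2.0, 0.0]   # illustrative parameter values
print(l1_penalty(weights), l2_penalty(weights), elastic_net_penalty(weights))
print(regularized_loss(1.0, weights, lam=0.1, penalty=l2_penalty))
```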
3. Effect on Training
During training, the regularization term influences the updates made to the model parameters:
• Minimizing a larger penalty term (due to larger values of λ) emphasizes smaller model
parameters, leading to simpler models that might generalize better but could underfit the
training data.
• Minimizing a smaller penalty term (lower values of λ) allows the model to fit the training data
more closely, possibly at the expense of increased complexity and overfitting.
4. Choosing the Regularization Strength
• Too high a value of λ can make the model too simple and fail to capture important patterns in the
data (underfitting).
• Too low a value of λ might not sufficiently penalize large coefficients, leading to a model that
captures too much noise from the training data (overfitting).
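The effect of λ is easy to see for a one-parameter ridge regression, y ≈ w * x. Minimizing the sum of squared errors plus λ * w^2 has the closed-form solution w = Σ(x * y) / (Σ(x^2) + λ), so a larger λ pulls w toward zero. The data and the function name below are assumptions for the example.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((w * x - y)^2) + lam * w^2 for a single weight w."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # toy data generated from y = 2 * x

for lam in (0.0, 1.0, 14.0, 1000.0):
    print(lam, ridge_weight(xs, ys, lam))
```

With λ = 0 the fit recovers the true slope; with a very large λ the weight is pushed close to zero and the model underfits.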
5. Practical Implementation
In practice, the optimal value of λ and the type of regularization (L1, L2, or Elastic Net) are often
selected through cross-validation, where multiple models are trained with different values of λ and
possibly different types of regularization. The model that performs best on a validation set or through
a cross-validation process is then chosen.
Roles of Regularization
Regularization plays several crucial roles in developing and performing machine learning models. Its
main purposes revolve around managing model complexity, improving generalization to new data,
and addressing specific issues like multicollinearity and feature selection. Here are the primary roles
of regularization in machine learning:
2. Improving Model Generalization: Regularization helps ensure the model performs well on both the
training data and new, unseen data by constraining its complexity. A well-regularized model is more
likely to capture the data's underlying trends than the training set's specific details and
noise.
5. Improving Robustness to Noise: Regularization makes the model less sensitive to the
idiosyncrasies of the training data. This includes noise and outliers, as the penalty
discourages fitting them too closely. Consequently, the model focuses more on the robust
features that are more generally applicable, enhancing its robustness.
6. Trading Bias for Variance: Regularization introduces bias into the model (assuming that
smaller weights are preferable). However, it reduces variance by preventing the model from
fitting too closely to the training data. This trade-off is beneficial when the unconstrained
model is highly complex and prone to overfitting.
7. Enabling the Use of More Complex Models: Regularization sometimes allows practitioners to
use more complex models than they otherwise could. For example, regularization
techniques like dropout can be used in neural networks to train deep networks without
overfitting, as they help prevent neuron co-adaptation.
8. Aiding in Convergence: For models trained using iterative optimization techniques (like
gradient descent), regularization can help ensure smoother and more reliable convergence.
This is especially true for problems that are ill-posed or poorly conditioned without
regularization.
Techniques of Regularization
3. Elastic Net: This is useful when there are correlations among features or to balance feature
selection with coefficient shrinkage.
4. Dropout: Results in a network that is robust and less likely to overfit, as it has to learn more
robust features from the data that aren't reliant on any small set of neurons.
5. Early Stopping: Prevents overfitting by not allowing the training to continue too long. It is a
straightforward and often very effective form of regularization.
6. Batch Normalization: Reduces the need for other forms of regularization and can sometimes
eliminate the need for dropout.
7. Weight Constraint: This constraint ensures that the weights do not grow too large, which can
help prevent overfitting and improve the model's generalization.
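As an illustration of one technique from the list, dropout can be sketched as a random mask applied to a layer's activations during training. The version below is the common "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; the function name, drop probability, and seed are assumptions of the example.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)                      # fixed seed so the run is repeatable
dropped = dropout([1.0] * 1000, drop_prob=0.5, rng=rng)
print(dropped[:10])
print(sum(dropped) / len(dropped))           # stays near the original mean of 1.0
```

At prediction time dropout is switched off, so the full network is used.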
Benefits of Regularization
1. Reduces Overfitting: Regularization helps prevent models from learning noise and irrelevant
details in the training data.
4. Enables Feature Selection: L1 regularization can zero out some coefficients, effectively
selecting more relevant features.
6. Encourages Simplicity: Promotes simpler models that are easier to interpret and less likely
to overfit.
7. Controls Model Complexity: Provides a mechanism to balance the complexity of the model
with its performance on the training and test data.
8. Facilitates Robustness: Makes models less sensitive to individual peculiarities in the training
set.
9. Improves Convergence: Helps optimization algorithms converge more quickly and reliably by
smoothing the error landscape.
10. Adjustable Complexity: The strength of regularization can be tuned to fit the data's specific
needs and desired model complexity.
Data Augmentation
Data augmentation is a technique used to artificially increase the size of a training dataset by
applying various transformations to the existing data. This technique is commonly used in machine
learning and deep learning tasks, especially in computer vision, to improve the generalization and
robustness of the trained models.
1. Rotation: Rotate the image by a certain angle (e.g., 90 degrees, 180 degrees).
The recent advances in deep learning technology have been driven by the advancement of deep
network architectures, powerful computation, and access to big data. Deep convolutional neural
networks (CNNs) have achieved great success in many computer vision tasks such as image
classification, object detection, and image segmentation.
One of the most difficult challenges is the generalizability of deep learning models. Generalizability
describes the performance difference of a model when evaluated on previously seen data (training data)
versus data it has never seen before (testing data). Models with poor generalizability have overfitted the
training data (overfitting).
To build useful deep learning models, Data Augmentation is a very powerful method to reduce
overfitting by providing a more comprehensive set of possible data points to minimize the distance
between the training and testing sets.
Data Augmentation approaches overfitting from the root of the problem, the training dataset. The
underlying idea is that more information can be gained from the original image dataset through the
creation of augmentations.
These augmentations artificially inflate the training dataset size by data warping or oversampling.
• Data warping augmentations transform existing images while preserving their label (annotated
information). This includes augmentations such as geometric and color transformations,
random erasing, adversarial training, and neural style transfer.
• Oversampling augmentations create synthetic data instances and add them to the training set.
This includes mixing images, feature space augmentations, and generative adversarial networks
(GANs).
• Combined approaches: Those methods can be applied in combination, for example, GAN
samples can be stacked with random cropping to further inflate the dataset.
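A few geometric augmentations can be sketched on an image stored as a list of rows. This is a toy illustration in plain Python; the function names and the tiny example image are assumptions, and real pipelines would normally use an image library instead.

```python
import random


def rotate_90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, height, width, rng):
    """Cut out a randomly positioned window of the given size."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(0)
augmented = [rotate_90(image), flip_horizontal(image), random_crop(image, 2, 2, rng)]
for variant in augmented:
    print(variant)
```

Each variant keeps the label of the original image, which is what makes these data warping augmentations.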
Bigger Datasets Are Better
In general, bigger datasets result in better deep learning model performance. However, assembling
very large datasets can be very difficult, and requires an enormous manual effort to collect and label
image data.
The challenge of small, limited datasets with few data points is especially common in real-life
applications, for example, in medical image analysis in healthcare or industrial manufacturing. With
big data, convolutional networks have been shown to be very powerful for medical image analysis tasks
such as brain scan analysis or skin lesion classification.
However, data collection for computer vision training is expensive and labor-intensive. It’s especially
challenging to build big image datasets due to the rarity of events, privacy, requirements of industry
experts for labeling, and the expense and manual effort needed to record visual data. These
obstacles are the reason why image data augmentation has become an important research field.
What is Convolution?
Image processing uses convolution, a mathematical operation in which a small matrix (the kernel) traverses
the image and computes a dot product with the region it overlaps at each position.
The primary goal of using convolution in image processing is to extract important features from the
image and discard the rest. This results in a condensed representation of an image.
A Convolutional Neural Network (CNN) is a deep learning architecture that combines multiple
convolutional layers with multiple ordinary neural network layers.
Each layer applies different filters (kernels) and captures various aspects of the image. As the number
of layers increases, the extracted features become more complex. The initial layers extract edges and texture,
and the final layers extract parts of an image, for example, a head, eyes, or a tail.
• Layers: Lower layers capture basic features, while deeper layers identify more complex
patterns like parts of objects or entire objects.
• Learning Process: CNNs learn the filters during training. The network adjusts the filters to
minimize the loss between the predicted and actual outcomes, thus optimizing the feature
extraction process.
• Pooling Layers: After the convolution operations, pooling takes place, which reduces the
spatial size of the representation. A pooling layer in CNN downsamples the spatial
dimensions of the input feature maps and reduces their size while preserving important
information.
• Activation Functions: CNNs apply activation functions, like ReLU (Rectified Linear
Unit), to the output of a convolution to introduce non-linearities. This helps the model learn more
complex patterns.
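The pooling step from the list above can be sketched as max pooling over non-overlapping 2×2 windows. The window size, the choice of the maximum as the pooling function, and the example feature map are assumptions of this sketch.

```python
def max_pool_2x2(feature_map):
    """Downsample by keeping the largest value in each non-overlapping 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 1, 8]]
pooled = max_pool_2x2(feature_map)
print(pooled)  # the 4x4 map shrinks to 2x2
```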
To apply the convolution:
• Overlay the Kernel on the Image: Start from the top-left corner of the image and place the
kernel so that its center aligns with the current image pixel.
• Element-wise Multiplication: Multiply each element of the kernel with the corresponding
element of the image it covers.
• Summation: Sum up all the products obtained from the element-wise multiplication. This
sum forms a single pixel in the output feature map.
• Continue the Process: Slide the kernel over to the next pixel and repeat the process across
the entire image.
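The steps above can be sketched in NumPy as follows. This is an illustrative implementation under
stated assumptions, not a reference one: the function name conv2d and its parameters are
hypothetical, the kernel is not flipped (as in most deep learning libraries, so it is strictly a
cross-correlation), and the kernel is positioned by its top-left corner. For an odd-sized kernel,
padding of (kernel size - 1) / 2 makes this equivalent to centering the kernel on each pixel.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if padding > 0:
        # Surround the image with a border of zeros.
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]   # region under the kernel
            out[i, j] = np.sum(patch * kernel)  # one output pixel
    return out

# Hypothetical 4x4 image and a 2x2 kernel that subtracts the lower-right
# neighbour from each pixel.
image = np.arange(16).reshape(4, 4)
kernel = np.array([[1, 0],
                   [0, -1]])
feature_map = conv2d(image, kernel)
print(feature_map)  # 3x3 map; every entry is -5 for this particular image
```

The explicit loops are kept for readability; practical libraries vectorize this computation.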
Key Terms in Convolution Operation
• Kernel Size: The convolution operation uses a filter, also known as a kernel, which is typically
a square matrix. Common kernel sizes are 3×3, 5×5, or even larger. Larger kernels analyze
more context within an image but come at the cost of reduced spatial resolution and
increased computational demands.
• Stride: Stride is the number of pixels by which the kernel moves as it slides over the image. A
stride of 1 means the kernel moves one pixel at a time, leading to a high-resolution output of
the convolution. Increasing the stride reduces the output dimensions, which can help
decrease computational cost and control overfitting but at the loss of some image detail.
• Padding: Padding adds rows and columns (typically of zeros) around the borders of the input
image. This lets the kernel be applied at the borders and allows the output to keep the same
size as the input, which matters in deep networks: without padding, every layer would shrink
the feature map and limit how many layers can be stacked.
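The effect of kernel size, stride, and padding on the output can be summarized by the standard
size formula, output = floor((n + 2p - k) / s) + 1, where n is the input size along one axis, k the
kernel size, s the stride, and p the padding. The helper below is a small sketch of that formula;
the name conv_output_size is an assumption made for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1

# Hypothetical 28-pixel-wide input with a 3x3 kernel.
no_padding = conv_output_size(28, 3)                       # 26: output shrinks
same_size = conv_output_size(28, 3, padding=1)             # 28: size preserved
strided = conv_output_size(28, 3, stride=2, padding=1)     # 14: roughly halved
print(no_padding, same_size, strided)
```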
1D Convolution
In 1D convolution, a kernel or filter slides along the input data, performing element-wise
multiplication followed by a sum, just as in 2D, but here the data and kernel are vectors instead of
matrices.
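A minimal sketch of 1D convolution, assuming no padding and an unflipped kernel (cross-correlation,
as is conventional in deep learning). The function name conv1d and the sample signal are
hypothetical.

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Slide a 1D kernel along a 1D signal, taking a dot product at each step."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out_len = (len(signal) - k) // stride + 1
    return np.array([
        np.dot(signal[i * stride:i * stride + k], kernel)
        for i in range(out_len)
    ])

# A [-1, 1] kernel measures the change between neighbouring samples.
signal = [1, 2, 4, 7, 11]
diff_kernel = [-1, 1]
result = conv1d(signal, diff_kernel)
print(result)  # differences between consecutive samples: 1, 2, 3, 4
```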
Applications:
1D convolution can extract features from various kinds of sequential data, and is especially prevalent
in:
• Audio Processing: For tasks such as speech recognition, sound classification, and music
analysis, where it can help identify specific features of audio like pitch or tempo.
• Natural Language Processing (NLP): 1D convolutions can help in tasks such as sentiment
analysis, topic classification, and even in generating text.
• Financial Time Series: For analyzing trends and patterns in financial markets, helping
predict future movements based on past data.
• Sensor Data Analysis: Useful in analyzing sequences of sensor data in IoT applications, for
anomaly detection or predictive maintenance.
3D Convolution
3D convolution extends the concept of 2D convolution by adding a third dimension, which is useful
for analyzing volumetric data.
Like 2D convolution, a three-dimensional kernel moves across the data, but it now simultaneously
processes three axes (height, width, and depth).
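The same idea can be sketched for three axes. The code below assumes arrays ordered as (depth,
height, width), a stride of 1, and no padding; the function name conv3d is hypothetical, and the
kernel is again left unflipped.

```python
import numpy as np

def conv3d(volume, kernel):
    """Slide a 3D kernel through a 3D volume (stride 1, no padding)."""
    volume = np.asarray(volume, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kd, kh, kw = kernel.shape
    out_d = volume.shape[0] - kd + 1
    out_h = volume.shape[1] - kh + 1
    out_w = volume.shape[2] - kw + 1
    out = np.zeros((out_d, out_h, out_w))
    for d in range(out_d):
        for i in range(out_h):
            for j in range(out_w):
                block = volume[d:d + kd, i:i + kh, j:j + kw]
                out[d, i, j] = np.sum(block * kernel)
    return out

# Hypothetical volume: 4 slices (or frames) of 5x5 pixels, all ones.
volume = np.ones((4, 5, 5))
kernel = np.ones((2, 3, 3))
out = conv3d(volume, kernel)
print(out.shape)  # (3, 3, 3); each value is 18, the number of kernel elements
```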
Applications:
• AI Video Analytics: Processing video as volumetric data (width, height, time), where the
temporal dimension (frames over time) can be treated similarly to spatial dimensions in
images. 3D CNNs are widely used for video tasks such as action recognition.
• Medical Imaging: Analyzing 3D scans, such as MRI or CT scans, where the additional
dimension represents depth, providing more contextual information.
Pooling layers are one of the building blocks of Convolutional Neural Networks. Where convolutional
layers extract features from images, pooling layers consolidate the features learned by CNNs. Their
purpose is to gradually shrink the spatial dimensions of the representation, reducing the number of
parameters and computations in the network.
The feature map produced by the filters of a convolutional layer is location-dependent: it records
the precise positions of features in the input. For example, if an object in an image shifts
slightly, the feature map changes and the object might no longer be recognized. Pooling layers
provide “translational invariance”: even if the input to the CNN is translated, the CNN can still
recognize the features in it.
In all cases, pooling helps make the representation approximately invariant to small translations of
the input. Invariance to translation means that if we translate the input by a small amount, the
values of most of the pooled outputs do not change.
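The small example below illustrates this approximate invariance with non-overlapping max pooling on
a 1D feature row. It is a sketch with made-up values; the helper name max_pool1d is an assumption.
A shift that stays inside the same pooling window leaves the pooled output unchanged, while a
larger shift does not, which is why the invariance is only approximate.

```python
import numpy as np

def max_pool1d(x, size=2):
    """Non-overlapping max pooling; len(x) must be a multiple of `size`."""
    return np.asarray(x).reshape(-1, size).max(axis=1)

# Hypothetical feature row with one strong response at index 2.
features = np.array([0, 0, 9, 0, 0, 0, 0, 0])
shifted_by_1 = np.roll(features, 1)  # response moves to index 3 (same window)
shifted_by_2 = np.roll(features, 2)  # response moves to index 4 (next window)

pooled = max_pool1d(features)
pooled_1 = max_pool1d(shifted_by_1)
pooled_2 = max_pool1d(shifted_by_2)

print(np.array_equal(pooled, pooled_1))  # small shift: pooled output unchanged
print(np.array_equal(pooled, pooled_2))  # larger shift: pooled output changes
```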
1. Max pooling: This works by selecting the maximum value from every pool. Max pooling
retains the most prominent features of the feature map, so the pooled output emphasizes the
strongest responses.
2. Average pooling: This works by taking the average of every pool. Average pooling retains the
average values of the features in the feature map. It smooths the output while keeping the
essence of the features in an image.
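Both pooling types can be sketched with one small NumPy function. The name pool2d, its parameters,
and the sample feature map are illustrative assumptions; a 2×2 window with a stride of 2 is used
because it is a common choice, not because the notes prescribe it.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    fm = np.asarray(feature_map, dtype=float)
    reduce = np.max if mode == "max" else np.mean
    out_h = (fm.shape[0] - size) // stride + 1
    out_w = (fm.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(fm[r:r + size, c:c + size])
    return out

# Hypothetical 4x4 feature map, pooled down to 2x2.
fm = np.array([[1, 3, 2, 4],
               [5, 7, 6, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]])
max_pooled = pool2d(fm, mode="max")  # [[7, 8], [9, 6]]
avg_pooled = pool2d(fm, mode="avg")  # [[4, 5], [4.5, 3]]
print(max_pooled)
print(avg_pooled)
```

Max pooling keeps only the largest value in each window, while average pooling blends all four
values, which is why the first preserves strong responses and the second smooths them.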