Deep Learning for Computer Vision

Deep Neural Network (DNN)
• A deep neural network (DNN) is an ANN with multiple hidden layers
between the input and output layers.
Deep Learning
• The study of Deep Neural Networks is called Deep Learning

• It consists of layers of interconnected nodes, or neurons, that
collaborate to process input data. In a fully connected deep neural
network, data flows through multiple layers, where each neuron
performs nonlinear transformations, allowing the model to learn
intricate representations of the data.
• In a deep neural network, the input layer receives data, which passes
through hidden layers that transform the data using nonlinear
functions. The final output layer generates the model’s prediction.
Deep Learning in Machine Learning Paradigms
• Supervised Learning: Neural networks learn from labeled data to
predict or classify, using algorithms like CNNs and RNNs for tasks such
as image recognition and language translation.
• Unsupervised Learning: Neural networks identify patterns in
unlabeled data, using techniques like Autoencoders and Generative
Models for tasks like clustering and anomaly detection.
• Reinforcement Learning: An agent learns to make decisions by
maximizing rewards, with algorithms like Deep Q-Network (DQN) and
Deep Deterministic Policy Gradient (DDPG) applied in areas like
robotics and game playing.
Difference between Machine Learning and Deep Learning

• Machine Learning: applies statistical algorithms to learn the hidden patterns and
relationships in the dataset. Deep Learning: uses artificial neural network architectures to
learn the hidden patterns and relationships in the dataset.
• Machine Learning: can work with a smaller amount of data. Deep Learning: requires a
larger volume of data than machine learning.
• Machine Learning: better suited to simpler, low-complexity tasks. Deep Learning: better for
complex tasks like image processing, natural language processing, etc.
• Machine Learning: takes less time to train the model. Deep Learning: takes more time to
train the model.
• Machine Learning: a model is created from relevant features that are manually extracted
from images to detect an object in the image. Deep Learning: relevant features are
automatically extracted from images; it is an end-to-end learning process.
• Machine Learning: less complex, and the results are easy to interpret. Deep Learning: more
complex; it works like a black box, and interpretations of the results are not easy.
• Machine Learning: can work on a CPU, or requires less computing power compared to deep
learning. Deep Learning: requires a high-performance computer with a GPU.
Deep Learning Applications

1. Computer vision
In computer vision, deep learning models enable machines to identify
and understand visual data. Some of the main applications of deep
learning in computer vision include:
• Object detection and recognition: Deep learning models are used to identify
and locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications
such as medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used for image
segmentation into different regions, making it possible to identify specific
features within images.
Deep Learning Applications
2. Natural language processing (NLP): In NLP, deep learning models enable
machines to understand and generate human language. Some of the main
applications of deep learning in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of text, and
new text such as summaries and essays can be generated automatically using these
trained models.
• Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different
linguistic backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral.
• Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion,
voice search, and voice-controlled devices.
Deep Learning Applications
• 3. Reinforcement learning
In reinforcement learning, deep learning is used to train agents to take
actions in an environment so as to maximize a reward. Some of the main
applications of deep learning in reinforcement learning include:
• Game playing: Deep reinforcement learning models have been able to beat
human experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
• Deep learning has made significant advancements in various fields, but
there are still some challenges that need to be addressed. Here are some
of the main challenges in deep learning:
• Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough training data is a major concern.
• Computational resources: Training deep learning models is computationally
expensive and typically requires specialized hardware such as GPUs and TPUs.
• Time-consuming: Depending on the computational resources available, training on
sequential data can take a very long time, sometimes days or even months.
• Interpretability: Deep learning models are complex and work like a black box, so it is
very difficult to interpret their results.
• Overfitting: When a model is trained repeatedly, it can become too specialized
to the training data, leading to overfitting and poor performance on new data.
Disadvantages of Deep Learning
• High computational requirements: Deep Learning AI models require
large amounts of data and computational resources to train and
optimize.
• Requires large amounts of labeled data: Deep Learning models often
require a large amount of labeled data for training, which can be
expensive and time-consuming to acquire.
• Interpretability: Deep Learning models can be challenging to interpret,
making it difficult to understand how they make decisions.
• Overfitting: Deep Learning models can sometimes overfit to the training
data, resulting in poor performance on new and unseen data.
• Black-box nature: Deep Learning models are often treated as black
boxes, making it difficult to understand how they work and how they
arrived at their predictions.
Feedforward Neural Network
• Feedforward Neural Network (FNN) is a type of artificial neural
network in which information flows in a single direction from the input
layer through hidden layers to the output layer without loops or
feedback. It is mainly used for pattern recognition tasks like image and
speech classification.
Structure of a Feedforward Neural Network
• Input Layer: The input layer consists of neurons that receive the input
data. Each neuron in the input layer represents a feature of the input
data.
• Hidden Layers: One or more hidden layers are placed between the
input and output layers. These layers are responsible for learning the
complex patterns in the data. Each neuron in a hidden layer applies a
weighted sum of inputs followed by a non-linear activation function.
• Output Layer: The output layer provides the final output of the
network. The number of neurons in this layer corresponds to the
number of classes in a classification problem or the number of outputs
in a regression problem.
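
To make the structure concrete, here is a minimal NumPy sketch of a single forward pass through such a network. The layer sizes (4 inputs, one hidden layer of 3 neurons, 2 outputs) and the random weights are purely illustrative assumptions.

import numpy as np

# A toy feedforward network: 4 input features -> 3 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # hidden -> output

def relu(z):
    return np.maximum(0, z)

def forward(x):
    h = relu(W1 @ x + b1)        # hidden layer: weighted sum + non-linearity
    return W2 @ h + b2           # output layer: raw scores (logits)

x = rng.normal(size=4)           # one example with 4 features
print(forward(x))                # 2 output scores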
Activation Functions

• An activation function is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to learn and represent
complex patterns in the data. Without this non-linearity feature, a neural network would
behave like a linear regression model, no matter how many layers it has.

Why is Non-Linearity Important in Neural Networks?

Neural networks consist of neurons that operate using weights, biases, and activation
functions.
In the learning process, these weights and biases are updated based on the error produced
at the output, a process known as backpropagation. Activation functions enable
backpropagation by providing gradients that are essential for updating the weights and
biases.
Without non-linearity, even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions empower neural networks to model highly
complex data distributions and solve advanced deep learning tasks. Adding non-linear
activation functions introduces flexibility and enables the network to learn more complex and
abstract patterns from data.
Need of Non-Linearity in Neural Networks

• Let's consider a network with two input nodes (i1 and i2), a single hidden layer
containing neurons h1 and h2, and an output neuron (out).
• We will use w1, w2 as weights connecting the inputs to the hidden neurons,
and w5 as the weight connecting the hidden layer to the output. We'll also
include biases (b1 for the hidden neurons and b2 for the output neuron) to
complete the model.
• Input Layer: two inputs i1 and i2.
• Hidden Layer: two neurons h1 and h2.
• Output Layer: one output neuron.

Need of Non-Linearity in Neural Networks

• The inputs to the hidden neurons h1 and h2 are calculated as a weighted sum
of the inputs plus a bias, e.g. h1 = w1·i1 + w2·i2 + b1 (and similarly for h2).
Need of Non-Linearity in Neural Networks

• The output neuron is then a weighted sum of the hidden neurons' outputs plus a
bias, e.g. out = w5·h1 + w6·h2 + b2.

Here, h1 and h2 are linear expressions.

In order to add non-linearity, we will use the sigmoid activation function in the output layer:
sigmoid(x) = 1 / (1 + e^(-x))
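
The point is easy to verify numerically: with no activation function, stacking layers collapses into a single linear map. A small NumPy sketch (random illustrative weights, biases omitted for brevity):

import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)     # a "deep" network with no activation
one_linear_layer  = (W2 @ W1) @ x     # a single equivalent linear map

print(np.allclose(two_linear_layers, one_linear_layer))  # True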
Types of Activation Functions in Deep Learning
Linear Activation Function
• The Linear Activation Function resembles a straight line, defined by y = mx + c. No
matter how many layers the neural network contains, if they all use linear
activation functions, the output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• The linear activation function is used in just one place: the output layer.
• Using linear activation across all layers limits the network's ability to learn
complex patterns.
• Linear activation functions are useful for specific tasks but must be
combined with non-linear functions to enhance the neural network's
learning and predictive capabilities.
Linear Activation Function
2. Non-Linear Activation Functions
Sigmoid Function
• The Sigmoid Activation Function is characterized by its 'S' shape. It is
mathematically defined as A(x) = 1 / (1 + e^(-x)). This formula ensures a smooth and
continuous output that is essential for gradient-based optimization methods.
• It allows neural networks to handle and model complex patterns that linear
equations cannot.
• The output ranges between 0 and 1, hence useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2.
This sensitivity means that small changes in input x can cause significant
changes in output y, which is critical during the training process.
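
A minimal NumPy sketch of the sigmoid, assuming a few sample inputs:

import numpy as np

def sigmoid(x):
    # A(x) = 1 / (1 + e^(-x)); output is always in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
print(sigmoid(x))   # steepest change near x = 0, saturating toward 0 and 1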
Sigmoid Function
Tanh Activation Function
• The tanh function (hyperbolic tangent function) is a shifted version of the
sigmoid, allowing it to stretch across the y-axis. It is defined as:

f(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Dividing the numerator and denominator by e^x gives:

f(x) = (1 − e^(−2x)) / (1 + e^(−2x))
     = 2 / (1 + e^(−2x)) − 1

Alternatively, it can be expressed using the sigmoid function:

tanh(x) = 2·sigmoid(2x) − 1
Tanh Activation Function
• Value Range: Outputs values from -1 to +1.
• Non-linear: Enables modeling of complex data patterns.
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-
centered output, facilitating easier learning for subsequent layers.
ReLU (Rectified Linear Unit) Function
• ReLU activation is defined by A(x) = max(0, x); this means that if the input x
is positive, ReLU returns x, and if the input is negative, it returns 0.
• Value Range: [0, ∞), meaning the function only outputs non-negative
values.
• Nature: It is a non-linear activation function, allowing neural networks to
learn complex patterns and making backpropagation more efficient.
• Advantage over other activations: ReLU is less computationally expensive
than tanh and sigmoid because it involves simpler mathematical
operations. At any time only a few neurons are activated, making the
network sparse and therefore efficient and easy to compute.
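
A short NumPy sketch comparing tanh and ReLU on sample inputs, and checking the tanh-sigmoid identity derived earlier:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)   # max(0, x): cheap, zeroes out negatives

x = np.linspace(-3, 3, 7)
print(np.tanh(x))               # zero-centered, range (-1, 1)
print(relu(x))                  # range [0, inf)

# tanh as a shifted/stretched sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True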
Softmax Function
Softmax function is designed to handle multi-class classification problems.
It transforms raw output scores from a neural network into probabilities. It
works by squashing the output values of each class into the range of 0 to 1,
while ensuring that the sum of all probabilities equals 1.
• Softmax is a non-linear activation function.
• The Softmax function ensures that each class is assigned a probability,
helping to identify which class the input belongs to.
softmax(zi) = e^(zi) / Σ (j=1..K) e^(zj)
Softmax Function

• Where:
• zi is the logit (the output of the previous layer in the network) for the
i-th class.
• K is the number of classes.
• e^(zi) represents the exponential of the logit.
• Σ (j=1..K) e^(zj) is the sum of exponentials across all classes.

Why Use Softmax?
1. Multi-Class Classification: Softmax is ideal for problems involving more than two classes, where the goal is
to predict a single class out of many. The function's ability to generate a probability distribution over classes
makes it particularly useful in classification models.
2. Probabilistic Interpretation: Since Softmax converts logits into probabilities, the output is easily
interpretable. You can not only determine the most likely class but also gauge the confidence of the model in
that prediction.
3. Handling Multiple Classes Simultaneously: The sum-to-one property of Softmax ensures that all possible
classes are considered together. This holistic approach ensures that each prediction takes into account all
classes rather than just focusing on one or two.
How Softmax Works?
• Step 1: Raw Logits (Pre-Softmax Outputs)
• Consider the output from the last layer of the neural network, which
consists of logits. These logits are unbounded real numbers and represent
the raw predictions for each class.
• Let’s assume we are working on a classification task with K classes. The
neural network provides an output vector z=[z1​,z2​,…,zK​], where each zi​​is the
logit corresponding to the ith class.
• Step 2: Applying the Softmax Function
• The Softmax function transforms these logits into probabilities. The
formula for Softmax for each class i is:
softmax(zi) = e^(zi) / Σ (j=1..K) e^(zj)
How Softmax Works?

• This function ensures that:
• The output values are positive.
• The probabilities for all classes sum up to 1, i.e.,

Σ (i=1..K) softmax(zi) = 1

• Step 3: Exponential Scaling
The exponential function e^(zi) applied to each logit zi plays a crucial role. It emphasizes the
difference between logits: even a slight increase in a logit value leads to a larger probability,
while small logits result in near-zero probabilities.

Suppose we have three logits: [z1 = 1.5, z2 = 0.5, z3 = 0.1]

Applying the exponential function to these logits results in:

e^1.5 ≈ 4.48, e^0.5 ≈ 1.65, e^0.1 ≈ 1.11

How Softmax Works?

• Step 4: Normalization
• The sum of the exponentials is used to normalize the values into
probabilities. The normalization step ensures that all the probabilities
add up to 1:

4.48 + 1.65 + 1.11 ≈ 7.24

Each exponential is then divided by the sum of exponentials to get the final probabilities:

4.48 / 7.24 ≈ 0.62, 1.65 / 7.24 ≈ 0.23, 1.11 / 7.24 ≈ 0.15

So, the final probability distribution is: [0.62, 0.23, 0.15]
How Softmax Works?

• Step 5: Interpretation of the Output

The result of applying the Softmax function to the logits is a probability distribution.
Each element represents the probability that the input data belongs to a particular class.

In this case:
There is a 62% probability that the input belongs to class 1,
A 23% probability for class 2, and
A 15% probability for class 3.
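
The whole computation fits in a few lines of NumPy; this sketch reproduces the worked example above (subtracting the max logit is a common numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.5, 0.5, 0.1])
print(softmax(logits))          # ~[0.62, 0.23, 0.15], sums to 1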
Softmax and Cross-Entropy Loss

• The Softmax is used in conjunction with the Cross-Entropy Loss.


• The cross-entropy loss compares the predicted probability distribution (from
Softmax) with the true label (which is represented as a one-hot encoded
vector) and penalizes the network if the predicted probability for the correct
class is low.
• The formula for cross-entropy loss is:

L = −Σ (i=1..K) yi · log(ŷi)

where:
yi is the true label (1 for the correct class, 0 for others),
ŷi is the predicted probability for class i from the Softmax function.
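
A minimal NumPy sketch of this loss, reusing the Softmax output from the earlier example and assuming class 1 is the true class:

import numpy as np

def cross_entropy(y_true, y_pred):
    # y_true: one-hot vector; y_pred: Softmax probabilities
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])        # correct class is class 1
y_pred = np.array([0.62, 0.23, 0.15])     # Softmax output from above
print(cross_entropy(y_true, y_pred))      # -log(0.62) ≈ 0.478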
Feedback Neural Networks
• These networks have connections that loop back, allowing information to
be fed back into the network. This structure enables them to handle
sequential data and temporal dependencies, making them suitable for
tasks like time series prediction and language modeling.
Structure of Feedback Neural Networks
• Feedback neural networks, or RNNs, are characterized by their ability to
maintain a state that captures information about previous inputs. This is
achieved through recurrent connections that loop back from the output to
the input of the same layer or previous layers. The key components of an
RNN include:
• Input Layer: Receives the input data.
• Hidden Layers: Contain neurons with recurrent connections that maintain a state
over time.
• Output Layer: Produces the final output based on the processed information.

• The recurrent connections allow RNNs to maintain a memory of previous


inputs, which is crucial for tasks involving sequential data.
Mechanisms of Feedback in Neural Networks
• Backpropagation: Backpropagation is a method of feedback that involves
the computation of the error gradient at each layer of the network. The
error gradient is then used to update the network's parameters.
Backpropagation is widely used in deep neural networks due to its
efficiency and accuracy.
• Recurrent Connections: Recurrent connections involve the feedback of
information from a later stage of the network to an earlier stage. This type
of feedback is used in recurrent neural networks (RNNs), which are
designed to handle sequential data.
• Lateral Connections: Lateral connections involve the feedback of
information between neurons in the same layer. This type of feedback is
used in applications such as image processing, where the goal is to capture
spatial relationships between pixels.
Learning in Feedback Networks: Embracing Backpropagation Through Time (BPTT)

• Training feedback networks presents a unique challenge compared to feed-forward
networks. The traditional backpropagation algorithm cannot be directly applied due to
the presence of loops. Here, backpropagation through time (BPTT) comes into play.
• BPTT unfolds the recurrent network over time, essentially creating a temporary feed-
forward architecture for each sequence element. The error signal is then propagated
backward through this unfolded network, allowing the network to adjust its weights
and learn from the feedback. However, BPTT can become computationally expensive
for long sequences, necessitating the development of more efficient training
algorithms. The steps involved in BPTT are:
• Forward Pass: Compute the output of the network for each time step.
• Backward Pass: Compute the gradients of the loss function with respect to the weights
by propagating the error backward through time.
• Weight Update: Adjust the weights using the computed gradients to minimize the loss.
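
As a sketch of the underlying chain rule, consider a scalar RNN h_t = tanh(w·h_{t−1} + u·x_t) with loss L = 0.5·(h_T − y)². The code below accumulates dh_t/dw forward in time, which computes the same temporal dependency chain that BPTT traverses backward; all values are illustrative.

import numpy as np

def forward_and_grad(w, u, xs, y, h0=0.0):
    # Scalar RNN h_t = tanh(w*h_{t-1} + u*x_t); loss L = 0.5*(h_T - y)^2.
    h, dh_dw = h0, 0.0
    for x in xs:
        new_h = np.tanh(w * h + u * x)
        # chain rule through time: h_t depends on w directly and via h_{t-1}
        dh_dw = (1 - new_h**2) * (h + w * dh_dw)
        h = new_h
    loss = 0.5 * (h - y) ** 2
    return loss, (h - y) * dh_dw   # dL/dw

xs, y, w, u = [0.5, -1.0, 0.8], 0.3, 0.7, 1.2
loss, grad = forward_and_grad(w, u, xs, y)

# sanity check against a finite difference
eps = 1e-6
loss_plus, _ = forward_and_grad(w + eps, u, xs, y)
print(grad, (loss_plus - loss) / eps)   # the two values should agree closely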
Optimizers in deep learning
Deep learning models aim to generalize data and make predictions on
unseen data. To optimize these models, various algorithms, known as
optimizers, are employed. Optimizers adjust model parameters iteratively
during training to minimize a loss function, enabling neural networks to learn
from data. This section covers different optimizers used in deep learning,
discussing their advantages, drawbacks, and factors influencing the selection
of one optimizer over another for specific applications. Common optimizers
include Stochastic Gradient Descent (SGD), Adam, and RMSprop, each
employing specific update rules, learning rates, and momentum for refining
model parameters. Optimizers play a pivotal role in enhancing accuracy and
speeding up the training process, shaping the overall performance of deep
learning models.
Gradient Descent
Challenges of Gradient Descent
• If the random initialization starts the algorithm on the left, then it will converge
to a local minimum, which is not as good as the global minimum.
• If it starts on the right, then it will take a very long time to cross the saddle
point (plateau), and if you stop too early you will never reach the global
minimum.
• The MSE cost function for a Linear Regression model happens to be a convex
function, which means that if you pick any two points on the curve, the line
segment joining them never crosses the curve. This implies that there are no
local minima. It is also a continuous function with a slope that never changes
abruptly.
• These two facts have a great consequence: Gradient Descent is guaranteed to
approach arbitrarily close to the global minimum (if you wait long enough and
if the learning rate is not too high). In fact, the cost function has the shape of a
bowl, but it can be an elongated bowl if the features have very different scales
(without feature scaling)
As you can see, on the left the Gradient Descent algorithm goes straight toward the minimum, thereby reaching it quickly,
whereas on the right it first goes in a direction almost orthogonal to the direction of the global minimum, and it ends with a
long march down an almost flat valley. It will eventually reach the minimum, but it will take a longer time.
So when using Gradient Descent, you should ensure that all features have a similar scale, or else it will take much longer to
converge.
Batch Gradient Descent
• In this method, the algorithm computes the gradient (or slope) of the loss
function by considering all the training examples in the dataset at once. It
then uses this gradient to update the model parameters (like weights) in
the direction that reduces the error.
• Compute the Gradient: Calculate the derivative of the loss function with
respect to the model parameters, using the entire dataset.
• Update the Parameters: Adjust the parameters by taking a step in the
opposite direction of the gradient. The size of this step is controlled by
the learning rate.
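
A minimal NumPy sketch of Batch Gradient Descent for linear regression with MSE loss; the toy data and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # 100 examples, 2 features
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
lr = 0.1
for _ in range(200):
    grad = 2.0 / len(X) * X.T @ (X @ theta - y)    # uses the ENTIRE dataset
    theta -= lr * grad                             # step against the gradient
print(theta)   # should approach [3, -2]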
Stochastic Gradient Descent
• Unlike Batch Gradient Descent, which uses the entire dataset, SGD
updates the model parameters using just one training example at a
time. It takes a small step toward minimizing the loss function after
processing each data point.
• Choose a Data Point: Randomly select one training example.
• Compute the Gradient: Calculate the derivative of the loss function with
respect to the model parameters for this specific example.
• Update the Parameters: Adjust the parameters by moving in the
direction of the negative gradient, scaled by the learning rate.
Mini-batch Gradient Descent
• Mini-batch Gradient Descent is a hybrid optimization algorithm that
balances the computational efficiency of Stochastic Gradient Descent (SGD)
with the stability of Batch Gradient Descent.
• In Mini-batch Gradient Descent, the algorithm processes a small, random
subset (called a "mini-batch") of the training dataset in each iteration to
compute the gradient and update the model parameters.
• Split Dataset: Divide the training dataset into small random subsets or
"mini-batches," each containing a fixed number of examples (e.g., 32, 64, or
128).
• Compute Gradient: Calculate the derivative of the loss function with respect
to the model parameters for each mini-batch.
• Update Parameters: Adjust the model parameters using the averaged
gradient of the mini-batch and the learning rate.
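
A minimal NumPy sketch of the mini-batch loop on the same kind of toy linear-regression problem; setting batch_size = 1 recovers plain SGD, and using the full dataset in one batch recovers Batch Gradient Descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=1000)

theta, lr, batch_size = np.zeros(2), 0.05, 32    # batch_size=1 gives plain SGD
for epoch in range(20):
    idx = rng.permutation(len(X))                # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]        # one random mini-batch
        grad = 2.0 / len(b) * X[b].T @ (X[b] @ theta - y[b])
        theta -= lr * grad
print(theta)   # should approach [3, -2]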
Adaptive Optimizers (RMSprop and Adam)
• E[g²]t is the exponentially weighted average of squared gradients at time t.
• The hyperparameter β1 is generally kept around 0.9 while β2 is kept at 0.99; ε is
generally chosen to be 1e-10.
Convolutional Neural Networks(CNN)

Convolutional neural networks are distinguished from other neural networks
by their superior performance with image, speech or audio signal inputs.
They have three main types of layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While
convolutional layers can be followed by additional convolutional layers or
pooling layers, the fully-connected layer is the final layer. With each layer, the
CNN increases in its complexity, identifying greater portions of the image.
Earlier layers focus on simple features, such as colors and edges. As the image
data progresses through the layers of the CNN, it starts to recognize larger
elements or shapes of the object until it finally identifies the intended object.
CNN
Convolutional Layer

The convolutional layer is the core building block of a CNN, and it is where the
majority of computation occurs. It requires a few components, which are input
data, a filter and a feature map. Let’s assume that the input will be a color image,
which is made up of a matrix of pixels in 3D. This means that the input will have
three dimensions—a height, width and depth—which correspond to RGB in an
image. We also have a feature detector, also known as a kernel or a filter, which
will move across the receptive fields of the image, checking if the feature is
present. This process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which
represents part of the image. While they can vary in size, the filter size is typically
a 3x3 matrix; this also determines the size of the receptive field. The filter is then
applied to an area of the image, and a dot product is calculated between the
input pixels and the filter. This dot product is then fed into an output array.
Afterwards, the filter shifts by a stride, repeating the process until the kernel has
swept across the entire image. The final output from the series of dot products
from the input and the filter is known as a feature map, activation map or a
convolved feature.
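
A minimal NumPy sketch of this sliding-dot-product operation (valid padding, configurable stride). Strictly speaking, the loop below computes cross-correlation, which is what most deep learning libraries implement under the name "convolution"; the example kernel is illustrative:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Valid convolution (no padding): slide the filter, take dot products.
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product with the filter
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1., 0., -1.]] * 3)      # a simple vertical-edge kernel
print(convolve2d(image, edge_filter))            # 3x3 feature map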
Convolutional Layer
Note that the weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters such as the weight values, adjust during training through the
process of backpropagation and gradient descent. However, there are three hyperparameters which affect
the volume size of the output that need to be set before the training of the neural network begins. These
include:
• 1. The number of filters affects the depth of the output. For example, three distinct filters would yield
three different feature maps, creating a depth of three.
• 2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride
values of two or greater are rare, a larger stride yields a smaller output.
• 3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall
outside of the input matrix to zero, producing a larger or equally sized output. There are three types of
padding:
• Valid padding: This is also known as no padding. In this case, the last convolution is dropped if dimensions
do not align.
• Same padding: This padding ensures that the output layer has the same size as the input layer.
• Full padding: This type of padding increases the size of the output by adding zeros to the border of the
input.
• After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU)
transformation to the feature map, introducing nonlinearity into the model.
Convolution Operation
Pooling layer

• Pooling layers, also known as downsampling, conduct dimensionality
reduction, reducing the number of parameters in the input. Similar to the
convolutional layer, the pooling operation sweeps a filter across the entire
input, but the difference is that this filter does not have any weights. Instead,
the kernel applies an aggregation function to the values within the receptive
field, populating the output array. There are two main types of pooling:
• Max pooling: As the filter moves across the input, it selects the pixel with
the maximum value to send to the output array. As an aside, this approach
tends to be used more often compared to average pooling.
• Average pooling: As the filter moves across the input, it calculates the
average value within the receptive field to send to the output array.
• While a lot of information is lost in the pooling layer, it also has a number of
benefits to the CNN. They help to reduce complexity, improve efficiency, and
limit risk of overfitting.
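
A minimal NumPy sketch of both pooling types, using a weight-free aggregation over each receptive field; the 4x4 feature map is illustrative:

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    agg = np.max if mode == "max" else np.mean   # the weight-free aggregation
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = agg(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 8.],
                 [4., 1., 3., 5.]])
print(pool2d(fmap, mode="max"))   # [[6, 4], [7, 9]]
print(pool2d(fmap, mode="avg"))   # [[3.75, 2.25], [3.5, 6.25]]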
Fully-connected layer

• The name of the fully-connected layer aptly describes itself. As mentioned
earlier, the pixel values of the input image are not directly connected to the
output layer in partially connected layers. However, in the fully-connected
layer, each node in the output layer connects directly to a node in the
previous layer.
• This layer performs the task of
classification based on the features
extracted through the previous layers
and their different filters. While
convolutional and pooling layers tend
to use ReLu functions, FC layers usually
leverage a softmax activation function
to classify inputs appropriately,
producing a probability from 0 to 1
Additional convolutional layer

• As we mentioned earlier, another convolution layer can follow the initial convolution
layer. When this happens, the structure of the CNN can become hierarchical as the
later layers can see the pixels within the receptive fields of prior layers. As an
example, let’s assume that we’re trying to determine if an image contains a bicycle.
You can think of the bicycle as a sum of parts. It is composed of a frame, handlebars,
wheels, pedals, and so on. Each individual part of the bicycle makes up a lower-level
pattern in the neural net, and the combination of its parts represents a higher-level
pattern, creating a feature hierarchy within the CNN. Ultimately, the convolutional
layer converts the image into numerical values, allowing the neural network to
interpret and extract relevant patterns.
3D CNN

• 3D Convolutional Neural Network (3D CNN) is a type of deep learning


model used for image segmentation in three-dimensional data, such as
medical volumetric images (e.g., CT scans, MRI scans) or video sequences.
Unlike 2D CNNs that operate on two-dimensional data (e.g., images), 3D
CNNs process volumetric data and are designed to capture spatial and
temporal dependencies in. 3D images.
3D CNN

We have a RGB image above by the left, a filter in


the middle and the result of the filter on the RGB
image by the right.
Since there are 3 channels in the image and not 1 (i.e RGB
image and not black and white), the filter will correspondingly
have 3 channels as we can see above, one filter for the Red 2D
array of the image, another for the Green 2D array of the
image, and finally for the Blue 2D array of the image.

Pooling(either Max or Average Pooling takes place, then every


other required steps of forward propagation takes place).
3D CNNs can be used on Videos( a clip of images ,let’s
say x number of images stacked together.)

Typically the shape of an image takes this format — (height,


width, number of channels = 3), However the shape of the clip
of images will be (x ,height, width, number of channels) where
x = number of images in clip and number of channels = 3.
3D CNN

For every image in the stack of images, there is a corresponding set of n filters,
each having 3 channels (for RGB). 3D convolution essentially performs 2D
convolution simultaneously on every image in a clip.
• Methods:
• 3D Convolution: 3D CNNs use 3D convolutional layers to extract features from
volumetric data. These layers slide a 3D kernel (a cube) over the input volume
to detect patterns in all three dimensions.
• Pooling and Striding: 3D CNNs use 3D max-pooling layers and strides to
downsample the spatial dimensions of the data, reducing the computational
load.
• Skip Connections: Skip connections, similar to those used in 2D U-Net
architectures, can be applied to 3D CNNs to improve segmentation accuracy.
• Fully Connected Layers: At the end of the network, fully connected layers are
often used for classification or regression, depending on the segmentation task.
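
A minimal NumPy sketch of the core 3D convolution (valid padding, stride 1, single channel); the volume shape and averaging kernel are illustrative assumptions:

import numpy as np

def conv3d(volume, kernel):
    # Slide a 3D kernel (a cube) over a volume; valid padding, stride 1.
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(volume[z:z+d, y:y+h, x:x+w] * kernel)
    return out

volume = np.random.default_rng(0).normal(size=(8, 32, 32))  # e.g. 8 CT slices
kernel = np.ones((3, 3, 3)) / 27                            # 3x3x3 averaging cube
print(conv3d(volume, kernel).shape)   # (6, 30, 30)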
3D CNN Advantages
• High Segmentation Accuracy: 3D CNNs excel at capturing complex spatial
patterns, making them well-suited for medical image segmentation and
video analysis.
• Improved Generalization: 3D models can generalize better than 2D models
when dealing with volumetric data.
• Robust to Noise: They are often robust to noise and variations in 3D data,
which is crucial in medical imaging.
3D CNN Applications
• Applications of 3D Convolution
• Action Recognition in Videos
• Magnetic Resonance Imaging.
• Computerized Tomography Scan.
• Self Driving Cars
3D CNN challenges
• Computational Complexity: 3D CNNs require significantly more
computational resources than 2D CNNs, making them slower to train and
evaluate.
• Data Volume: Volumetric data is larger and may require substantial
storage and memory resources.
• Overfitting: Due to their depth and complexity, 3D CNNs can be prone to
overfitting, requiring larger datasets and effective regularization
techniques.
• Data Annotation: Creating accurate 3D ground truth annotations for
segmentation tasks can be time-consuming and labor-intensive,
especially in medical imaging.
• Model Selection: Choosing the right architecture and hyperparameters
for 3D CNNs can be challenging and may require extensive
experimentation.
Sequence Learning
• Sequence learning is the study of machine learning algorithms designed for
sequential data.
• Language modeling is one of the most interesting topics that uses sequence
labeling.
1. Language Translation
  1. Understand the meaning of each word, and the relationship between
words.
  2. Input: one sentence in Hindi, e.g. input = "machine learning kyaa he"
  3. Output: one sentence in English, e.g. output = "What is machine learning"
Sequence Learning
• To make it easier to understand why we need RNNs, let's think about a
simple speaking case.
• 1. We are given a hidden state that encodes all the
information in the sentence we want to speak.
• 2. We want to generate a list of words (sentence) in an one-by-one
fashion.
1. At each time step, we can only choose a single word.
2. The hidden state is affected by the words chosen (so we can remember what we
just said and complete the sentence).
Sequence Learning
• Plain CNNs are not naturally suited to length-varying input and output.
1. It is difficult to define the input and output.
  1. Remember that:
    1. The input image is a 3D tensor (width, length, color channels).
    2. The output is a distribution over a fixed number of classes.
  2. A sequence could be:
    1. "I know that you know that I know that you know that I know that you know that I know that
you know that I know that you know that I know that you know that I don't know"
    2. "I don't know"
2. Input and output are strongly correlated within the sequence.
3. Still, people figured out ways to use CNNs on sequence learning.
Ways to Deal with Sequence Labeling

1. Autoregressive models
  1. Predict the next term in a sequence from a fixed number of previous terms
using delay taps.
2. Feed-forward neural nets
  1. These generalize autoregressive models by using one or more
layers of non-linear hidden units.

Memoryless models: limited word-memory window; hidden state cannot be used efficiently.
Ways to Deal with Sequence Labeling

1. Linear Dynamical Systems
  1. These are generative models. They have a real-valued hidden state
that cannot be observed directly.
2. Hidden Markov Models
  1. Have a discrete one-of-N hidden state. Transitions between states
are stochastic and controlled by a transition matrix. The outputs produced
by a state are stochastic.

Memoryful models: time-cost to infer the hidden state distribution.
Recurrent Neural Network
Recurrent Neural Networks (RNNs) work a bit differently from regular neural
networks. In a feedforward neural network, information flows in one direction,
from input to output. In an RNN, however, information is fed back into the
system after each step. Think of it like reading a sentence: when you're trying to
predict the next word you don't just look at the current word, but also
need to remember the words that came before to make an accurate guess.
RNNs allow the network to "remember" past information by feeding the
output from one step into the next step. This helps the network understand
the context of what has already happened and make better predictions
based on that. For example, when predicting the next word in a sentence,
the RNN uses the previous words to help decide what word is most likely
to come next.
Basic architecture of RNN and the feedback loop mechanism where the output is
passed back as input for the next time step
How RNN Differs from Feedforward Neural Networks?
• Feedforward Neural Networks (FNNs) process data in one direction
from input to output without retaining information from previous
inputs. This makes them suitable for tasks with independent inputs
like image classification. However, FNNs struggle with sequential data
since they lack memory.
Key Components of RNNs
• Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold
a hidden state that maintains information about previous inputs in a sequence.
Recurrent units can “remember” information from prior steps by feeding back
their hidden state, allowing them to capture dependencies across time.
RNN Unfolding

• RNN unfolding or unrolling is the process of expanding the recurrent
structure over time steps. During unfolding, each step of the sequence
is represented as a separate layer in a series, illustrating how
information flows across each time step.
• This unrolling enables backpropagation through time (BPTT), a learning
process where errors are propagated across time steps to adjust the
network's weights, enhancing the RNN's ability to learn dependencies
within sequential data.
Recurrent Neural Network Architecture

• RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from
input to output. Unlike traditional deep neural networks, where each dense
layer has distinct weight matrices, RNNs use shared weights across time
steps, allowing them to remember information over sequences.
• In RNNs, the hidden state Hi is calculated for every input Xi to retain
sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
ht = σ(U·Xt + W·ht−1 + B)
Here, ht represents the current hidden state, U and W are weight matrices, and B is the bias.

2. Output Calculation:
Y = O(V·h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent
weights and bias.
Recurrent Neural Network Architecture

• Overall Function:
Y = f(X, h, W, U, V, B, C)
This function defines the entire RNN operation, where the state matrix S holds each element si
representing the network's state at each time step i.

At each time step, RNNs process units with a fixed activation function. These units have an
internal hidden state that acts as memory, retaining information from previous time steps. This
memory allows the network to store past knowledge and adapt based on new inputs.
Updating the Hidden State in RNNs
• The current hidden state ht depends on the previous state ht−1 and the
current input xt, and is calculated using the following relations:
1. State Update:
ht = f(ht−1, xt)
• where:
• ht is the current state
• ht−1 is the previous state
• xt is the input at the current time step
Updating the Hidden State in RNNs

2. Activation Function Application:
ht = tanh(Whh·ht−1 + Wxh·xt)
Here, Whh is the weight matrix for the recurrent neuron, and Wxh is the
weight matrix for the input neuron.
3. Output Calculation:
yt = Why·ht
where yt is the output and Why is the weight at the output layer.
These parameters are updated using backpropagation. However, since
RNNs work on sequential data, we use an updated version of backpropagation
known as backpropagation through time.
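
A minimal NumPy sketch of this recurrence, assuming toy dimensions (4-dim inputs, 3-dim hidden state, 2-dim outputs) and random illustrative weights:

import numpy as np

rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
W_hy = rng.normal(size=(2, 3))

def rnn_forward(xs, h=np.zeros(3)):
    ys = []
    for x in xs:                                # one step per sequence element
        h = np.tanh(W_hh @ h + W_xh @ x)        # ht = tanh(Whh*h(t-1) + Wxh*xt)
        ys.append(W_hy @ h)                     # yt = Why*ht
    return ys, h                                # outputs and final hidden state

sequence = [rng.normal(size=4) for _ in range(5)]
ys, h_final = rnn_forward(sequence)
print(len(ys), h_final.shape)                   # 5 outputs, hidden state (3,)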
Backpropagation Through Time (BPTT) in RNNs

• Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to
update the network's parameters. The loss function L(θ) depends on the final hidden
state h3, and each hidden state relies on preceding ones, forming a sequential
dependency chain:
• h3 depends on h2, h2 depends on h1, ..., h1 depends on h0.

In BPTT, gradients are backpropagated through each time
step. This is essential for updating network parameters
based on temporal dependencies.
Backpropagation Through Time (BPTT) in RNNs

Simplified Gradient Calculation:

∂L(θ)/∂W = ∂L(θ)/∂h3 · ∂h3/∂W

Handling Dependencies in Layers:
Each hidden state is updated based on its dependencies:
h3 = σ(W·h2 + b)
The gradient is then calculated for each state, considering dependencies from previous hidden states.

Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into an explicit
part and an implicit part, summing up the indirect paths from each hidden state to the weights:

∂h3/∂W = ∂h3/∂W (explicit) + (∂h3/∂h2)·(∂h2/∂W) (implicit)
Backpropagation Through Time (BPTT) in RNNs

• Final Gradient Expression:
The final derivative of the loss function with respect to the weight matrix
W is computed as:

∂L(θ)/∂W = ∂L(θ)/∂h3 · Σ (k=1..3) (∂h3/∂hk)·(∂hk/∂W)

This iterative process is the essence of backpropagation through time.
Types of Recurrent Neural Networks

1. One-to-One RNN
• This is the simplest type of neural network architecture, where there is a single
input and a single output. It is used for straightforward classification tasks such
as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce
multiple outputs over time. This is useful in tasks where one input triggers a
sequence of predictions (outputs). For example in image captioning a single
image can be used as input to generate a sequence of words as a caption.
3. Many-to-One RNN
• The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is
needed to make one prediction. In sentiment analysis the model receives a
sequence of words (like a sentence) and produces a single output like positive,
negative or neutral.
4. Many-to-Many RNN
• The Many-to-Many RNN type processes a sequence of inputs and
generates a sequence of outputs. In language translation task a sequence
of words in one language is given as input, and a corresponding sequence
in another language is generated as output.
