Deep Learning For Computer Vision
Deep Neural Network (DNN)
• A deep neural network (DNN) is an artificial neural network (ANN) with
multiple hidden layers between the input and output layers.
Deep Learning
• The study of deep neural networks is called deep learning.
1. Computer vision
In computer vision, deep learning models enable machines to identify
and understand visual data. Some of the main applications of deep
learning in computer vision include:
• Object detection and recognition: Deep learning models are used to identify
and locate objects within images and videos, which supports applications
such as self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications
such as medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can partition an image into
different regions, making it possible to identify specific features within
images.
Deep Learning Applications
2. Natural language processing (NLP)
In NLP, deep learning models enable machines to understand and generate
human language. Some of the main applications of deep learning in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of
text, and the trained models can then automatically generate new text such as
summaries and essays.
• Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different
linguistic backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral.
• Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion,
voice search, and voice-controlled devices.
Deep Learning Applications
3. Reinforcement learning
In reinforcement learning, deep learning is used to train agents that take
actions in an environment to maximize a reward. Some of the main
applications of deep learning in reinforcement learning include:
• Game playing: Deep reinforcement learning models have been able to beat
human experts at games such as Go, chess, and Atari video games.
• Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
• Deep learning has made significant advances in various fields, but some
challenges still need to be addressed. The main challenges in deep learning
are:
• Data availability: Deep learning requires large amounts of data to learn from,
so gathering enough training data is a major concern.
• Computational resources: Training a deep learning model is computationally
expensive and often relies on specialized hardware such as GPUs and TPUs.
• Time-consuming: Depending on the computational resources available, training
(especially on sequential data) can take days or even months.
• Interpretability: Deep learning models are complex and work like a black box,
so it is very difficult to interpret their results.
• Overfitting: When a model is trained for too long on the same data, it becomes
too specialized to the training data, leading to poor performance on new data.
Disadvantages of Deep Learning
• High computational requirements: Deep learning models require
large amounts of data and computational resources to train and
optimize.
• Requires large amounts of labeled data: Deep learning models often
require a large amount of labeled data for training, which can be
expensive and time-consuming to acquire.
• Interpretability: Deep learning models can be challenging to interpret,
making it difficult to understand how they make decisions.
• Overfitting: Deep learning models can sometimes overfit to the training
data, resulting in poor performance on new and unseen data.
• Black-box nature: Deep Learning models are often treated as black
boxes, making it difficult to understand how they work and how they
arrived at their predictions.
Feedforward Neural Network
• Feedforward Neural Network (FNN) is a type of artificial neural
network in which information flows in a single direction from the input
layer through hidden layers to the output layer without loops or
feedback. It is mainly used for pattern recognition tasks like image and
speech classification.
Structure of a Feedforward Neural Network
• Input Layer: The input layer consists of neurons that receive the input
data. Each neuron in the input layer represents a feature of the input
data.
• Hidden Layers: One or more hidden layers are placed between the
input and output layers. These layers are responsible for learning the
complex patterns in the data. Each neuron in a hidden layer applies a
weighted sum of inputs followed by a non-linear activation function.
• Output Layer: The output layer provides the final output of the
network. The number of neurons in this layer corresponds to the
number of classes in a classification problem or the number of outputs
in a regression problem.
Activation Functions
• The inputs to the hidden neurons h1 and h2 are each calculated as a weighted
sum of the inputs plus a bias.
Need of Non-Linearity in Neural Networks
• Let’s consider a network with two input nodes (i1 and i2), a single hidden layer
containing neurons h1 and h2, and an output neuron (out).
• We will use w1, w2 as weights connecting the inputs to the hidden neuron,
and w5 as the weight connecting the hidden neuron to the output. We’ll also
include biases (b1 for the hidden neuron and b2 for the output neuron) to
complete the model.
• Input Layer: Two inputs i1 and i2.
• Hidden Layer: Two neurons h1 and h2.
• Output Layer: One output neuron.
• The output neuron is then a weighted sum of the hidden neurons’ outputs plus a
bias.
• To add non-linearity, we apply the sigmoid activation function at the output layer.
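The forward computation described above can be sketched as follows. This is a minimal illustration, not the original slide's equations: the weight matrix `W_hidden`, the vector `w_out`, and all numeric values are assumptions chosen for the example, and a single bias `b1` is shared by both hidden neurons as the text suggests.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, W_hidden, b1, w_out, b2):
    # Hidden layer: weighted sum of the inputs plus a bias (kept linear here).
    h = W_hidden @ inputs + b1
    # Output neuron: weighted sum of the hidden outputs plus a bias.
    z = w_out @ h + b2
    # The sigmoid at the output layer is the only source of non-linearity.
    return sigmoid(z)

# Hypothetical example values (not taken from the slides).
i = np.array([0.5, -1.0])             # inputs i1, i2
W_hidden = np.array([[0.1, 0.2],      # weights into h1
                     [0.3, 0.4]])     # weights into h2
w_out = np.array([0.5, -0.5])         # weights from h1, h2 to the output
b1, b2 = 0.0, 0.1

out = forward(i, W_hidden, b1, w_out, b2)
print(out)
```

Because the hidden layer is linear, everything before the sigmoid is still a linear function of the inputs; only the final activation bends the output into the range (0, 1).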
Types of Activation Functions in Deep Learning
Linear Activation Function
• A linear activation function resembles a straight line defined by y=mx+c. No
matter how many layers the neural network contains, if they all use linear
activation functions, the output is a linear combination of the input.
• The output range spans (−∞, +∞).
• A linear activation function is typically used in only one place: the output layer.
• Using linear activation across all layers limits the network’s ability to learn
complex patterns.
• Linear activation functions are useful for specific tasks but must be
combined with non-linear functions to enhance the neural network’s
learning and predictive capabilities.
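The claim that stacked linear layers remain linear can be checked numerically. The sketch below uses randomly generated weights (an assumption made purely for illustration) and shows that two linear layers give the same result as one equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with linear (identity) activation; shapes are arbitrary.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Output of the two-layer linear network.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layers, one_layer))
```

Adding more linear layers only changes `W_eq` and `b_eq`; the network can never represent anything beyond a single affine map.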
2. Non-Linear Activation Functions
Sigmoid Function
• The sigmoid activation function is characterized by its ‘S’ shape. It is
mathematically defined as $A = \frac{1}{1 + e^{-x}}$. This formula ensures a
smooth and continuous output.
• It allows neural networks to handle and model complex patterns that linear
equations cannot.
• The output ranges between 0 and 1, hence it is useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2.
This sensitivity means that small changes in input x can cause significant
changes in output y, which is critical during the training process.
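A short sketch of the sigmoid and its gradient illustrates the points above. The derivative formula s(x)(1 − s(x)) is the standard one for the sigmoid; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
for x, y, g in zip(xs, sigmoid(xs), sigmoid_grad(xs)):
    print(f"x={x:+.1f}  sigmoid={y:.4f}  gradient={g:.4f}")
```

The gradient peaks at x = 0 and shrinks quickly outside roughly [-2, 2], which is why inputs in that band produce the largest changes in output.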
Tanh Activation Function
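• The tanh function (hyperbolic tangent function) is a rescaled and shifted
version of the sigmoid, stretching its output range along the y-axis. It is
defined as:
$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Dividing the numerator and denominator by $e^{x}$ and rearranging:
$$f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{1 + 1 - (e^{-2x} + 1)}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1$$
$$\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$$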
• Value Range: Outputs values from -1 to +1.
• Non-linear: Enables modeling of complex data patterns.
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-
centered output, facilitating easier learning for subsequent layers.
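The relation between tanh and the sigmoid, tanh(x) = 2·sigmoid(2x) − 1, can be verified numerically. This is a quick check over an arbitrary grid of points, not a proof.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 11)

lhs = np.tanh(xs)                    # library tanh
rhs = 2.0 * sigmoid(2.0 * xs) - 1.0  # sigmoid-based form

print(np.allclose(lhs, rhs))         # the two forms agree
print(lhs.min(), lhs.max())          # values stay inside (-1, 1)
```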
ReLU (Rectified Linear Unit) Function
• ReLU activation is defined by A(x)=max(0,x): if the input x is positive,
ReLU returns x; otherwise it returns 0.
• Value Range: [0,∞), meaning the function only outputs non-negative
values.
• Nature: It is a non-linear activation function, allowing neural networks to
learn complex patterns, and its simple gradient makes backpropagation
efficient.
• Advantage over other activations: ReLU is less computationally expensive
than tanh and sigmoid because it involves simpler mathematical
operations. At any given time only some of the neurons are activated, which
makes the network sparse and cheaper to compute.
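The definition and the sparsity effect can be illustrated with a few lines of NumPy. The random pre-activations are an assumption for the example; in a real network the fraction of inactive neurons depends on the weights and data.

```python
import numpy as np

def relu(x):
    """ReLU activation: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))   # negative inputs map to 0

# Hypothetical pre-activations drawn from a standard normal distribution.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)
activations = relu(pre_activations)

# Fraction of neurons that output exactly zero (inactive neurons).
sparsity = np.mean(activations == 0.0)
print(f"inactive fraction: {sparsity:.2f}")
```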
Softmax Function
Softmax function is designed to handle multi-class classification problems.
It transforms raw output scores from a neural network into probabilities. It
works by squashing the output values of each class into the range of 0 to 1,
while ensuring that the sum of all probabilities equals 1.
• Softmax is a non-linear activation function.
• The Softmax function ensures that each class is assigned a probability,
helping to identify which class the input belongs to.
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
• Where:
• $z_i$ is the logit (the output of the previous layer in the network) for the
i-th class.
• $K$ is the number of classes.
How Softmax Works?
• Step 4: Normalization
• The sum of the exponentials is used to normalize the values into
probabilities. The normalization step ensures that all the probabilities
add up to 1:
$$\sum_{j=1}^{K} e^{z_j}$$
Each exponential is then divided by the sum of exponentials to get the final probabilities:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
The exponential function $e^{z_i}$ applied to each logit $z_i$ plays a crucial role. It emphasizes the
difference between logits: even a slight increase in a logit value leads to a noticeably larger
probability, while small logits result in near-zero probabilities.
The result of applying the Softmax function to the logits is a probability distribution.
Each element represents the probability that the input data belongs to a particular class.
In this case:
There is a 62% probability that the input belongs to class 1,
A 23% probability for class 2, and
A 15% probability for class 3.
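The softmax steps can be sketched in a few lines. The logits used to produce the probabilities above are not given in the text, so the values `[1.5, 0.5, 0.1]` below are an assumption, chosen because they yield approximately the same 62% / 23% / 15% split. Subtracting the maximum logit is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    z = np.asarray(z, dtype=float)
    exps = np.exp(z - z.max())   # exponentiate (shifted for stability)
    return exps / exps.sum()     # normalize so the values sum to 1

logits = [1.5, 0.5, 0.1]         # assumed logits, not from the source
probs = softmax(logits)

print(np.round(probs, 2))        # roughly 0.62, 0.23, 0.15
print(probs.sum())
```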
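Softmax and Cross-Entropy Loss
Softmax is usually paired with the cross-entropy loss:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
where
$y_i$ is the true label (1 for the correct class, 0 for others),
$\hat{y}_i$ is the predicted probability for class i from the Softmax function.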
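A minimal sketch of the loss, assuming the standard cross-entropy form L = −Σ y_i log(ŷ_i) with a one-hot true label. The predicted probabilities reuse the 62% / 23% / 15% example; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)       # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])           # the correct class is class 1
y_pred = np.array([0.62, 0.23, 0.15])        # Softmax output

loss = cross_entropy(y_true, y_pred)
print(loss)                                  # equals -log(0.62)
```

Only the probability assigned to the correct class contributes to the loss, so the loss falls toward 0 as that probability approaches 1.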
Feedback Neural Network
• These networks have connections that loop back, allowing information to
be fed back into the network. This structure enables them to handle
sequential data and temporal dependencies, making them suitable for
tasks like time series prediction and language modeling.
Structure of Feedback Neural Networks
• Feedback neural networks, or RNNs, are characterized by their ability to
maintain a state that captures information about previous inputs. This is
achieved through recurrent connections that loop back from the output to
the input of the same layer or previous layers. The key components of an
RNN include:
• Input Layer: Receives the input data.
• Hidden Layers: Contain neurons with recurrent connections that maintain a state
over time.
• Output Layer: Produces the final output based on the processed information.
The convolutional layer is the core building block of a CNN, and it is where the
majority of computation occurs. It requires a few components, which are input
data, a filter and a feature map. Let’s assume that the input will be a color image,
which is made up of a matrix of pixels in 3D. This means that the input will have
three dimensions—a height, width and depth—which correspond to RGB in an
image. We also have a feature detector, also known as a kernel or a filter, which
will move across the receptive fields of the image, checking if the feature is
present. This process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which
represents part of the image. While they can vary in size, the filter size is typically
a 3x3 matrix; this also determines the size of the receptive field. The filter is then
applied to an area of the image, and a dot product is calculated between the
input pixels and the filter. This dot product is then fed into an output array.
Afterwards, the filter shifts by a stride, repeating the process until the kernel has
swept across the entire image. The final output from the series of dot products
from the input and the filter is known as a feature map, activation map or a
convolved feature.
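The sliding dot product described above can be sketched as a naive loop. To keep the example short it assumes a single-channel (grayscale) input rather than an RGB volume, uses no padding, and, like most deep learning libraries, does not flip the kernel. The image and kernel values are illustrative only.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a 2-D kernel over a 2-D image and return the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # Receptive field: the image patch currently under the kernel.
            patch = image[r * stride:r * stride + kh,
                          c * stride:c * stride + kw]
            # Dot product between the patch and the filter weights.
            feature_map[r, c] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)   # simple vertical-edge filter

print(conv2d(image, kernel))             # 3x3 feature map
print(conv2d(image, kernel, stride=2))   # larger stride, smaller output
```

The same `kernel` array is reused at every position, which is the parameter sharing the slides refer to.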
Convolutional Layer
Note that the weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters such as the weight values, adjust during training through the
process of backpropagation and gradient descent. However, there are three hyperparameters which affect
the volume size of the output that need to be set before the training of the neural network begins. These
include:
• 1. The number of filters affects the depth of the output. For example, three distinct filters would yield
three different feature maps, creating a depth of three.
• 2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride
values of two or greater are rare, a larger stride yields a smaller output.
• 3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall
outside of the input matrix to zero, producing a larger or equally sized output. There are three types of
padding:
• Valid padding: This is also known as no padding. In this case, the last convolution is dropped if dimensions
do not align.
• Same padding: This padding ensures that the output layer has the same size as the input layer.
• Full padding: This type of padding increases the size of the output by adding zeros to the border of the
input.
• After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the
feature map, introducing non-linearity into the model.
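The effect of stride and zero-padding on output size follows the usual formula, output = floor((input + 2·padding − filter) / stride) + 1. The helper below is a small sketch of that formula along one spatial dimension; the function name and the example sizes are illustrative.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one dimension of a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# 5-pixel input with a 3-pixel filter:
print(conv_output_size(5, 3))              # valid padding (none): 3
print(conv_output_size(5, 3, padding=1))   # same padding: 5
print(conv_output_size(5, 3, padding=2))   # full padding: 7

# Stride 2 on a 6-pixel input: the last position does not fit and is dropped.
print(conv_output_size(6, 3, stride=2))    # 2
```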
Convolution Operation
Additional Convolutional Layers
• As we mentioned earlier, another convolution layer can follow the initial convolution
layer. When this happens, the structure of the CNN can become hierarchical as the
later layers can see the pixels within the receptive fields of prior layers. As an
example, let’s assume that we’re trying to determine if an image contains a bicycle.
You can think of the bicycle as a sum of parts. It is comprised of a frame, handlebars,
wheels, pedals, and so on. Each individual part of the bicycle makes up a lower-level
pattern in the neural net, and the combination of its parts represents a higher-level
pattern, creating a feature hierarchy within the CNN. Ultimately, the convolutional
layer converts the image into numerical values, allowing the neural network to
interpret and extract relevant patterns.
Fully-Connected Layer
The name of the fully-connected layer aptly describes itself. As mentioned
earlier, the pixel values of the input image are not directly connected to the
output layer in partially connected layers. However, in the fully-connected
layer, each node in the output layer connects directly to a node in the
previous layer.
This layer performs the task of classification based on the features extracted
through the previous layers and their different filters. While convolutional
and pooling layers tend to use ReLU functions, FC layers usually leverage a
softmax activation function to classify inputs appropriately, producing a
probability from 0 to 1.
3D CNN
For every image in the stack of images, there is a corresponding set of n filters,
each having 3 channels (for RGB). 3D convolution extends 2D convolution along
the depth (or time) axis: the kernel also spans neighbouring images in a clip,
so features are extracted across images as well as within them.
• Methods:
• 3D Convolution: 3D CNNs use 3D convolutional layers to extract features from
volumetric data. These layers slide a 3D kernel (a cube) over the input volume
to detect patterns in all three dimensions.
• Pooling and Striding: 3D CNNs use 3D max-pooling layers and strides to
downsample the spatial dimensions of the data, reducing the computational
load.
• Skip Connections: Skip connections, similar to those used in 2D U-Net
architectures, can be applied to 3D CNNs to improve segmentation accuracy.
• Fully Connected Layers: At the end of the network, fully connected layers are
often used for classification or regression, depending on the segmentation task.
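The 3D convolution step in the list above can be sketched as a naive triple loop. This is an illustration only: it assumes a single-channel volume of shape (depth, height, width), stride 1, and no padding, and the averaging kernel is an arbitrary choice.

```python
import numpy as np

def conv3d(volume, kernel):
    """Slide a 3-D kernel (a cube) over a 3-D volume."""
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The cube of voxels currently under the kernel.
                cube = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(cube * kernel)
    return out

clip = np.ones((4, 6, 6))            # e.g. 4 frames of 6x6 pixels
kernel = np.ones((3, 3, 3)) / 27.0   # 3x3x3 averaging kernel

features = conv3d(clip, kernel)
print(features.shape)                # (2, 4, 4)
```

Unlike a 2D convolution applied frame by frame, each output value here mixes information from three consecutive frames.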
3D CNN Solution
• High Segmentation Accuracy: 3D CNNs excel at capturing complex spatial
patterns, making them well-suited for medical image segmentation and
video analysis.
• Improved Generalization: 3D models can generalize better than 2D models
when dealing with volumetric data.
• Robust to Noise: They are often robust to noise and variations in 3D data,
which is crucial in medical imaging.
3D CNN Applications
• Applications of 3D Convolution
• Action Recognition in Videos
• Magnetic Resonance Imaging.
• Computerized Tomography Scan.
• Self Driving Cars
3D CNN challenges
• Computational Complexity: 3D CNNs require significantly more
computational resources than 2D CNNs, making them slower to train and
evaluate.
• Data Volume: Volumetric data is larger and may require substantial
storage and memory resources.
• Overfitting: Due to their depth and complexity, 3D CNNs can be prone to
overfitting, requiring larger datasets and effective regularization
techniques.
• Data Annotation: Creating accurate 3D ground truth annotations for
segmentation tasks can be time-consuming and labor-intensive,
especially in medical imaging.
• Model Selection: Choosing the right architecture and hyperparameters
for 3D CNNs can be challenging and may require extensive
experimentation.
Sequence Learning
• Sequence learning is the study of machine learning algorithms designed for
sequential data.
• Language modeling is one of the most interesting topics that use sequence
labeling.
1. Language Translation
1. Understand the meaning of each word, and the relationship between
words.
2. Input: one sentence in Hindi, input = “machine learning kyaa he”
3. Output: one sentence in English, output = “What is machine learning”
Sequence Learning
• To make it easier to understand why we need RNNs, let's think about a
simple speaking case.
• 1. We are given a hidden state (think of it as our mind) that encodes all the
information in the sentence we want to speak.
• 2. We want to generate a list of words (a sentence) one by one.
1. At each time step, we can only choose a single word.
2. The hidden state is affected by the words chosen, so we can remember what we
just said and complete the sentence.
Sequence Learning
• Plain CNNs are not naturally suited to variable-length input and output.
• Memoryless models: limited word-memory window; hidden state cannot be used efficiently.
Ways to Deal with Sequence Labeling
1. Linear Dynamical Systems
• These are generative models. They have a real-valued hidden state
that cannot be observed directly.
2. Hidden Markov Models
• These have a discrete one-of-N hidden state. Transitions between states
are stochastic and controlled by a transition matrix. The outputs produced
by a state are stochastic.
• RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from
input to output. Unlike traditional deep neural networks, where each dense
layer has distinct weight matrices, RNNs use shared weights across time
steps, allowing them to remember information over sequences.
• In RNNs, the hidden state H_i is calculated for every input X_i to retain
sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h = σ(U⋅X + W⋅h_{t−1} + B)
Here, h represents the current hidden state, U and W are weight matrices, and B is the bias.
2. Output Calculation:
Y = O(V⋅h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent
the weights and bias.
Recurrent Neural Network Architecture
• Overall Function:
Y=f(X,h,W,U,V,B,C)
This function defines the entire RNN operation, where the state matrix S holds each element si
representing the network’s state at each time step i.
At each time step RNNs process units with a fixed activation function. These units have an
internal hidden state that acts as memory that retains information from previous time steps. This
memory allows the network to store past knowledge and adapt based on new inputs.
Updating the Hidden State in RNNs
• The current hidden state h_t depends on the previous state h_{t−1} and the
current input x_t, and is calculated using the following relations:
1. State Update
h_t = f(h_{t−1}, x_t)
• where:
• h_t is the current state
• h_{t−1} is the previous state
• x_t is the input at the current time step
2. Activation Function Application:
h_t = tanh(W_hh⋅h_{t−1} + W_xh⋅x_t)
Here, W_hh is the weight matrix for the recurrent neuron, and W_xh is the
weight matrix for the input neuron.
3. Output Calculation:
y_t = W_hy⋅h_t
where y_t is the output and W_hy is the weight at the output layer.
These parameters are updated using backpropagation. However, since an
RNN works on sequential data, we use a modified form of backpropagation
known as backpropagation through time.
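The state update and output relations can be sketched as a simple loop over time steps. The layer sizes, the random weights, and the zero initial state are assumptions made for the example; bias terms are omitted to match the relations as written.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN over a sequence and collect states and outputs."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in xs:
        # h_t = tanh(W_hh . h_{t-1} + W_xh . x_t)
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hidden_states.append(h)
        # y_t = W_hy . h_t
        outputs.append(W_hy @ h)
    return np.array(hidden_states), np.array(outputs)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, steps = 3, 4, 2, 5

# Hypothetical weights; the same matrices are reused at every time step.
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))

xs = rng.normal(size=(steps, input_size))   # a sequence of 5 input vectors
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(hidden_size))

print(hs.shape, ys.shape)                   # (5, 4) (5, 2)
```

Reusing `W_xh`, `W_hh`, and `W_hy` at every step is what lets the network handle sequences of any length with a fixed number of parameters.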
Backpropagation Through Time (BPTT) in RNNs
• Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to
update the network’s parameters. The loss function L(θ) depends on the final hidden
state h3, and each hidden state relies on the preceding ones, forming a sequential
dependency chain:
• h3 depends on h2, h2 depends on h1, and h1 depends on h0.
In BPTT, gradients are backpropagated through each time
step. This is essential for updating network parameters
based on temporal dependencies.
Simplified Gradient Calculation:
$$\frac{\partial L(\theta)}{\partial W} = \frac{\partial L(\theta)}{\partial h_3} \cdot \frac{\partial h_3}{\partial W}$$
Handling Dependencies in Layers:
Each hidden state is updated based on its dependencies:
h3=σ(W⋅h2+b)
The gradient is then calculated for each state, considering dependencies from previous hidden states.
Gradient Calculation with Explicit and Implicit Parts: The gradient is broken down into explicit and implicit parts summing
up the indirect paths from each hidden state to the weights.
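The explicit and implicit parts of the gradient can be made concrete with a scalar sketch. It assumes a one-dimensional recurrence h_t = σ(w·h_{t−1} + b) over three steps and a squared-error loss on h3; the loss choice and all numeric values are assumptions for illustration. The BPTT result is compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w, b, h0, steps=3):
    """Return [h0, h1, h2, h3] for the recurrence h_t = sigmoid(w*h_{t-1} + b)."""
    hs = [h0]
    for _ in range(steps):
        hs.append(sigmoid(w * hs[-1] + b))
    return hs

def loss(w, b, h0, target):
    return 0.5 * (forward(w, b, h0)[-1] - target) ** 2

def bptt_grad_w(w, b, h0, target):
    hs = forward(w, b, h0)
    delta = hs[-1] - target                  # dL/dh3
    grad = 0.0
    for t in range(len(hs) - 1, 0, -1):      # t = 3, 2, 1
        local = hs[t] * (1.0 - hs[t])        # sigmoid derivative at step t
        grad += delta * local * hs[t - 1]    # explicit part: h_t uses w directly
        delta = delta * local * w            # implicit part: pass back to h_{t-1}
    return grad

w, b, h0, target = 0.8, -0.2, 0.5, 1.0
analytic = bptt_grad_w(w, b, h0, target)

eps = 1e-6                                   # central finite difference
numeric = (loss(w + eps, b, h0, target) - loss(w - eps, b, h0, target)) / (2 * eps)

print(analytic, numeric)
```

Each loop iteration adds the direct contribution of `w` at one time step, then multiplies `delta` by another derivative factor. Repeated multiplication by such factors is the mechanism behind vanishing and exploding gradients in long sequences.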
1. One-to-One RNN
• This is the simplest type of neural network architecture, where there is a single
input and a single output. It is used for straightforward classification tasks such
as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce
multiple outputs over time. This is useful in tasks where one input triggers a
sequence of predictions (outputs). For example in image captioning a single
image can be used as input to generate a sequence of words as a caption.
3. Many-to-One RNN
• The Many-to-One RNN receives a sequence of inputs and generates a single
output. This type is useful when the overall context of the input sequence is
needed to make one prediction. In sentiment analysis the model receives a
sequence of words (like a sentence) and produces a single output like positive,
negative or neutral.
4. Many-to-Many RNN
• The Many-to-Many RNN type processes a sequence of inputs and
generates a sequence of outputs. In a language translation task, a sequence
of words in one language is given as input, and a corresponding sequence
in another language is generated as output.