
Unit V: Deep learning and Neural Networks for computer vision (6 Hours)

Key components and basic architecture of deep neural network, Convolution neural network,
Object detection using R-CNN, Segmentation using image-to-image neural network,
Temporal processing and recurrent neural network.

Deep Neural Network (DNN) - Key Components & Architecture


A Deep Neural Network (DNN) is an advanced form of an artificial neural network (ANN)
that consists of multiple layers of interconnected neurons. It is designed to learn hierarchical
representations of data, making it highly effective for tasks such as image recognition, natural
language processing (NLP), and time series forecasting.

1. Key Components of a Deep Neural Network


A DNN consists of several fundamental components that work together to process and learn
from data:
1.1. Input Layer: The first layer of a neural network that receives raw data. The number of
neurons in this layer corresponds to the number of features in the dataset (e.g., an image of
28×28 pixels would have 784 input neurons). It only passes the input values to the next layer
without performing any computation.
1.2. Hidden Layers: Multiple layers between the input and output layer where the real
computation happens. Each neuron in a hidden layer is connected to neurons in the previous
and next layers. Hidden layers allow the network to learn complex patterns in the data. The
depth (number of hidden layers) defines how deep the network is.
1.3. Neurons (Perceptrons): The fundamental unit of a DNN, performing weighted
summation of inputs followed by an activation function.
1.4. Weights & Biases
 Weights determine the strength of the connections between neurons.
 Bias is an additional parameter that shifts the weighted sum, so the neuron's output is not
forced to pass through the origin.
 The goal of training is to find the optimal set of weights and biases.
1.5. Activation Functions: Introduce non-linearity to the model, allowing it to learn complex
relationships.
1.6. Output Layer: The final layer that provides the network’s predictions.
 The number of neurons depends on the problem:
o Binary Classification: One neuron with a Sigmoid activation.
o Multi-class Classification: Multiple neurons with Softmax activation.
o Regression: One neuron with a linear activation function.
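As a concrete illustration of the components above, the following is a minimal sketch of a small feedforward network in Keras (this assumes TensorFlow/Keras is available; the layer sizes and the 784-feature input are illustrative only):

# Minimal DNN sketch: input layer, two hidden layers, softmax output (illustrative sizes)
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),               # Input layer: 784 features (e.g., a flattened 28x28 image)
    layers.Dense(128, activation='relu'),     # Hidden layer 1: weights, biases, ReLU activation
    layers.Dense(64, activation='relu'),      # Hidden layer 2
    layers.Dense(10, activation='softmax')    # Output layer: 10 neurons with Softmax for multi-class
])
model.summary()                               # Lists the layers and their trainable weights/biases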
2. Basic Architecture of a Deep Neural Network
A deep neural network follows a structured architecture that enables efficient learning:
2.1. Feedforward Propagation: Data flows from the input layer through the hidden layers to
the output layer. Each neuron receives inputs, applies a weighted sum, passes it through an
activation function, and sends the output to the next layer.
2.2. Loss Function: Measures the difference between the predicted output and the actual
target. Common loss functions: Mean Squared Error (MSE) and Cross-Entropy Loss
2.3. Backpropagation: A technique used to update the weights and biases to minimize the
loss. Steps involved:
 Compute the gradient (derivative) of the loss function with respect to each weight using the
chain rule.
 Propagate the error backward from the output layer to the input layer.
 Adjust the weights using an optimization algorithm.
2.4. Optimization Algorithms: Used to update weights and biases to minimize loss.
 Popular optimizers:
o Gradient Descent: Updates weights in the direction that reduces loss.
o Adam (Adaptive Moment Estimation): Combines momentum and adaptive
learning rates for efficient optimization.
2.5. Regularization Techniques
To prevent overfitting, several techniques are used:
 Dropout: Randomly disables neurons during training to force generalization.
 L1 & L2 Regularization: Penalizes large weights to prevent overfitting.
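The training pieces described above (loss function, backpropagation, optimizer, and regularization) can be sketched as follows; this is an illustrative Keras configuration under assumed hyperparameters, not a prescription, and x_train/y_train are placeholders:

# Illustrative training setup: cross-entropy loss, Adam optimizer, Dropout and L2 regularization
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalizes large weights
    layers.Dropout(0.5),                                     # Dropout randomly disables neurons during training
    layers.Dense(10, activation='softmax')
])

# Backpropagation and weight updates are performed internally by fit()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10, batch_size=32)      # x_train / y_train are placeholders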
3. Convolutional Neural Network (CNN)
CNNs are designed to extract spatial features from images using convolutional operations.
Key Components of CNN
1. Convolution Layer
o Uses filters (kernels) to extract spatial features.
o A sliding window (kernel) moves across the image, computing dot products.
o Output is a feature map that highlights edges, textures, and patterns.
2. Activation Function (ReLU)
o Introduces non-linearity.
o f(x) = max(0, x) ensures only positive values are forwarded.
3. Pooling Layer
o Max Pooling: Retains only the maximum value in each region, reducing
dimensionality.
o Average Pooling: Computes the average of pixel values.
4. Fully Connected (FC) Layer
o Flattens feature maps into a vector and classifies the image.
5. Softmax or Sigmoid Output
o Softmax: For multi-class classification.
o Sigmoid: For binary classification.
CNNs are widely used in image recognition, medical diagnosis, and self-driving cars.
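The five components listed above map directly onto a layer stack. A minimal sketch in Keras (illustrative filter counts and input size, assuming TensorFlow/Keras):

# Minimal CNN sketch matching the components above (illustrative sizes)
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(64, 64, 3)),                                # Small RGB image (illustrative)
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),   # Convolution layer + ReLU
    layers.MaxPooling2D((2, 2)),                                     # Max pooling layer
    layers.Flatten(),                                                # Flatten feature maps into a vector
    layers.Dense(64, activation='relu'),                             # Fully connected layer
    layers.Dense(10, activation='softmax')                           # Softmax output for multi-class
])
cnn.summary()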

Algorithm for a CNN to Detect Image Features


The Convolutional Neural Network (CNN) is a widely used deep learning architecture for
detecting and learning features from images. A CNN automatically learns spatial hierarchies
of features, making it ideal for tasks like image classification, object detection, and
segmentation.
Step 1: Input Layer (Image as Input)
Input: A 2D image (e.g., RGB image of size 224x224x3 where 3 represents the color
channels: red, green, and blue). Preprocessing: Normalize the pixel values (e.g., from 0-255
to 0-1 or -1 to 1), and resize the image to a fixed dimension if necessary.
Step 2: Convolutional Layer (Feature Extraction)
• Apply a set of convolution filters (kernels) to the input image.
• Filter size: Typically 3x3 or 5x5 kernels.
• Stride: Defines the step size of the filter while sliding over the image.
• Padding: Zero-padding may be applied to maintain the spatial dimensions after convolution
Each filter extracts a specific feature (e.g., edges, textures) by computing the dot product
between the filter and the input.
• Mathematical operation: For an input image I and a filter K, the feature map is computed as
S(x, y) = Σᵢ Σⱼ I(x + i, y + j) · K(i, j), with the sums running over the m×n entries of the filter
(a small sketch of this computation is given after this step).
• Where m×n is the size of the filter, and (x, y) is the position in the feature map. The output of
this step is a feature map (also called an activation map) that highlights certain characteristics of
the image (e.g., edges or textures).
• Output: A set of feature maps representing different features extracted by the convolution
filters.
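The following NumPy code is a small sketch of the convolution formula itself: it slides a kernel over a single-channel image and computes the dot product at each position ("valid" convolution, no padding, stride 1). In practice this is done by optimized library routines; the 5×5 image and the Sobel-style filter are illustrative.

# Sketch of the convolution formula for a single-channel image (no padding, stride 1)
import numpy as np

def convolve2d(image, kernel):
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Dot product between the kernel and the image patch at position (x, y)
            out[x, y] = np.sum(image[x:x + m, y:y + n] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)                         # Toy 5x5 "image"
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)    # Edge-detecting filter
feature_map = convolve2d(image, sobel_x)                                  # 3x3 feature map of edge responses
print(feature_map)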

Step 3: Activation Function (Non-linearity)


• Activation function (commonly ReLU, Rectified Linear Unit) is applied to introduce non-
linearity into the model. ReLU replaces all negative values in the feature map with zero:
f(x) = max(0, x).
• This allows the CNN to learn more complex patterns and interactions.
Output: Non-linear feature maps with enhanced important features while eliminating
irrelevant ones.

Step 4: Pooling Layer (Downsampling)


• Pooling (subsampling) reduces the dimensionality of the feature maps, making the network
more computationally efficient and reducing the risk of overfitting.
• Max pooling is the most common type of pooling, which takes the maximum value from
each window in the feature map.
• Pooling size: Typically 2x2 windows with a stride of 2, which halves the height and width of
the feature map (see the sketch below).
• Output: Downsampled feature maps that preserve the most important features.
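A small sketch of 2×2 max pooling with stride 2 in NumPy (the 4×4 feature map is illustrative, and the code assumes the height and width are even):

# Sketch of 2x2 max pooling with stride 2 (assumes even height and width)
import numpy as np

def max_pool_2x2(feature_map):
    H, W = feature_map.shape
    # Group the map into non-overlapping 2x2 blocks and keep the maximum of each block
    return feature_map.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 2, 8]], dtype=float)
print(max_pool_2x2(fmap))   # [[6. 4.] [7. 9.]] -- each value is the max of one 2x2 window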

Step 5: Additional Convolution and Pooling Layers (Deeper Feature Extraction)


• The previous convolutional and pooling steps are repeated multiple times, allowing the
network to learn more abstract and complex features at different levels of abstraction.
• Shallow layers learn low-level features (edges, textures).
• Deeper layers learn high-level features (object parts, shapes).
• Output: A set of hierarchical feature maps that represent various levels of abstraction.

Step 6: Flatten Layer (Vectorization)


The final set of feature maps is flattened into a 1D feature vector, converting the 2D spatial
information into a vector format. This vector is the high-level representation of the image
features and serves as the input to the fully connected layers.
• Output: A flattened feature vector representing the entire image.

Step 7: Fully Connected Layer (Classification)


The flattened feature vector is passed to one or more fully connected (dense) layers. These
layers perform classification based on the learned features from the convolutional layers.
A typical fully connected layer computes y = σ(Wx + b).
• Where W is the weight matrix, x is the input feature vector, b is the bias, and σ is the
activation function (e.g., ReLU or Softmax); a small numerical sketch is given below.
• Output: A high-dimensional feature vector used for classification.
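The following is a minimal NumPy sketch of the fully connected computation y = σ(Wx + b), using Softmax (described in the next step) as the activation; the vector lengths and random weights are illustrative:

# Sketch of a fully connected layer y = softmax(Wx + b) with illustrative sizes
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # Subtract the max for numerical stability
    return e / e.sum()

x = np.random.rand(128)                 # Flattened feature vector (illustrative length)
W = np.random.randn(10, 128) * 0.01     # Weight matrix: 10 classes x 128 features
b = np.zeros(10)                        # Bias vector

y = softmax(W @ x + b)                  # Class probabilities, summing to 1
print(y.round(3), y.sum())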

Step 8: Output Layer (Prediction)


The final fully connected layer produces an output corresponding to the number of classes in
the classification task.
For a multi-class classification problem, a Softmax activation is applied to the output of the
last fully connected layer to predict the probabilities for each class.

Output: A probability distribution over the possible object classes.


# CNN Pseudocode for Image Feature Detection
def cnn_feature_detection(image):
    # Step 1: Convolution Layer
    conv1 = convolution(image, filters=32, kernel_size=(3, 3), stride=1, padding='same')
    # Step 2: Activation Function (ReLU)
    relu1 = relu_activation(conv1)
    # Step 3: Pooling Layer
    pool1 = max_pooling(relu1, pool_size=(2, 2), stride=2)
    # Repeat steps 1-3 (Convolution, ReLU, and Pooling) for deeper layers
    conv2 = convolution(pool1, filters=64, kernel_size=(3, 3), stride=1, padding='same')
    relu2 = relu_activation(conv2)
    pool2 = max_pooling(relu2, pool_size=(2, 2), stride=2)
    # Step 4: Flatten the pooled feature maps
    flattened = flatten(pool2)
    # Step 5: Fully Connected Layer for Classification
    fc1 = fully_connected(flattened, units=128)
    relu_fc1 = relu_activation(fc1)
    # Step 6: Output Layer (Softmax for multi-class classification)
    output = softmax(fully_connected(relu_fc1, units=num_classes))
    return output  # Probabilities of each class
# Output: A probability vector indicating the predicted class for the input image.

Algorithm for a CNN to Detect Image Features Given a 256×256 Image


If an image of size 256x256 is given, we can design a Convolutional Neural Network (CNN)
to detect image features by defining the convolutional layers, activation functions, pooling
layers, and the fully connected layers. The dimensions of the image will progressively reduce
through the network as we extract features using convolutional and pooling layers.
Step 1: Input Layer
Input: A 256x256 RGB image (shape: 256x256x3, where 3 is the number of color channels).
Preprocessing: Normalize the pixel values to a range between [0, 1] (or [-1, 1]). Ensure the
input image size is consistent across the dataset.

Step 2: Convolutional Layer 1


Apply 32 filters with a 3x3 kernel (filter size), a stride of 1, and padding='same'.
This layer will compute 32 feature maps by sliding 32 different 3x3 filters over the input
image. Output size: The image size remains 256x256x32 (because of padding).

Step 3: Activation Function (ReLU)


Apply ReLU activation to introduce non-linearity: This helps the network capture complex
patterns and features.
Step 4: Pooling Layer 1 (Max Pooling)
• Apply 2x2 max pooling with a stride of 2.
• This will reduce the spatial dimensions of the feature maps by half.
• Output size: The image is now 128x128x32.

Step 5: Convolutional Layer 2


Apply 64 filters with a 3x3 kernel, a stride of 1, and padding='same'.
• Extract more complex features.
• Output size: The image size remains 128x128x64.

Step 6: Activation Function (ReLU)


• Apply ReLU activation to introduce non-linearity to the convolution output.

Step 7: Pooling Layer 2 (Max Pooling)


• Apply 2x2 max pooling with a stride of 2.
• Output size: The image is now 64x64x64.

Step 8: Convolutional Layer 3


• Apply 128 filters with a 3x3 kernel, a stride of 1, and padding='same'.
• Output size: The image size remains 64x64x128.

Step 9: Activation Function (ReLU)


• Apply ReLU activation to introduce non-linearity.

Step 10: Pooling Layer 3 (Max Pooling)


• Apply 2x2 max pooling with a stride of 2.
• Output size: The image is now 32x32x128.
Step 11: Convolutional Layer 4
• Apply 256 filters with a 3x3 kernel, a stride of 1, and padding='same'.
• Output size: The image size remains 32x32x256.

Step 12: Activation Function (ReLU): Apply ReLU activation.

Step 13: Pooling Layer 4 (Max Pooling)


• Apply 2x2 max pooling with a stride of 2.
• Output size: The image is now 16x16x256.

Step 14: Flatten Layer


• Flatten the 3D tensor into a 1D vector.
• Output size: The flattened vector has 16 × 16 × 256 = 65,536 elements (a 1-D vector).

Step 15: Fully Connected Layer 1


• Apply a fully connected layer with 1024 neurons.
• This layer processes the high-level features learned by the convolutional layers.
• Output size: 1024 neurons.

Step 16: Activation Function (ReLU)


• Apply ReLU activation to introduce non-linearity.

Step 17: Fully Connected Layer 2 (Classification/Output)


• Apply a fully connected layer with neurons equal to the number of output classes.
• For multi-class classification, use a Softmax activation to generate probabilities for each
class.
• Output size: The size of this layer depends on the number of output classes.
Pseudocode for CNN to Detect Features in a 256x256 Image
def cnn_feature_detection(image):
    # Input: 256x256x3 image
    # Step 2: Convolutional Layer 1
    conv1 = convolution(image, filters=32, kernel_size=(3, 3), stride=1, padding='same')
    # Step 3: ReLU Activation
    relu1 = relu_activation(conv1)
    # Step 4: Max Pooling Layer 1
    pool1 = max_pooling(relu1, pool_size=(2, 2), stride=2)
    # Step 5: Convolutional Layer 2
    conv2 = convolution(pool1, filters=64, kernel_size=(3, 3), stride=1, padding='same')
    # Step 6: ReLU Activation
    relu2 = relu_activation(conv2)
    # Step 7: Max Pooling Layer 2
    pool2 = max_pooling(relu2, pool_size=(2, 2), stride=2)
    # Step 8: Convolutional Layer 3
    conv3 = convolution(pool2, filters=128, kernel_size=(3, 3), stride=1, padding='same')
    # Step 9: ReLU Activation
    relu3 = relu_activation(conv3)
    # Step 10: Max Pooling Layer 3
    pool3 = max_pooling(relu3, pool_size=(2, 2), stride=2)
    # Step 11: Convolutional Layer 4
    conv4 = convolution(pool3, filters=256, kernel_size=(3, 3), stride=1, padding='same')
    # Step 12: ReLU Activation
    relu4 = relu_activation(conv4)
    # Step 13: Max Pooling Layer 4
    pool4 = max_pooling(relu4, pool_size=(2, 2), stride=2)
    # Step 14: Flatten Layer
    flattened = flatten(pool4)  # Flatten the tensor into a 1D vector
    # Step 15: Fully Connected Layer 1
    fc1 = fully_connected(flattened, units=1024)
    # Step 16: ReLU Activation
    relu_fc1 = relu_activation(fc1)
    # Step 17: Fully Connected Layer 2 (Output layer, Softmax for multi-class)
    output = softmax(fully_connected(relu_fc1, units=num_classes))
    return output  # Return probabilities for each class
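For reference, the same 256×256 architecture can also be written as a short runnable sketch in Keras (this assumes TensorFlow/Keras is available and is illustrative, not part of the algorithm above); num_classes is a placeholder, and model.summary() confirms the 256 → 128 → 64 → 32 → 16 shape progression described in Steps 2-17.

# Executable Keras sketch of the 256x256 architecture (num_classes is a placeholder)
from tensorflow.keras import layers, models

num_classes = 10  # Placeholder: set to the actual number of output classes

model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),    # -> 256x256x32
    layers.MaxPooling2D((2, 2)),                                      # -> 128x128x32
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),     # -> 128x128x64
    layers.MaxPooling2D((2, 2)),                                      # -> 64x64x64
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),    # -> 64x64x128
    layers.MaxPooling2D((2, 2)),                                      # -> 32x32x128
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),    # -> 32x32x256
    layers.MaxPooling2D((2, 2)),                                      # -> 16x16x256
    layers.Flatten(),                                                 # -> 65,536-element vector
    layers.Dense(1024, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])
model.summary()  # Verifies the shape progression listed in the steps above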
4. Object Detection using R-CNN (Region-Based CNN)
Object detection identifies and classifies multiple objects in an image, unlike classification,
which assigns a single label to an entire image.
4.1. R-CNN (Region-based CNN) Approach
1. Selective Search
o Generates region proposals where objects might be present.
2. Feature Extraction
o A CNN extracts features from each region.
3. Classification & Regression
o A classifier (like SVM) classifies the object.
o A regressor adjusts the bounding box position.
4.2. Faster R-CNN (Improved version)
 Uses a Region Proposal Network (RPN) instead of Selective Search, making
detection much faster.
 Outputs bounding boxes and class labels in near real time.
4.3. YOLO (You Only Look Once) vs. Faster R-CNN
 YOLO: Processes an image in one pass, making it real-time.
 Faster R-CNN: More accurate but slower.

Algorithm for object detection using R-CNN


The R-CNN (Region-based Convolutional Neural Network) algorithm is one of the
pioneering deep learning approaches for object detection. The main idea is to
first generate region proposals and then classify each proposal into object
categories using a CNN. Here's a step-by-step breakdown of the R-CNN
algorithm:
Input: An image (to be processed for detecting objects).
Step 1: Generate Region Proposals (Selective Search)
• Selective Search Algorithm is used to generate a set of possible object locations (region
proposals) in the input image. The goal is to propose candidate regions that might contain
objects.
• The image is segmented into many small regions using color, texture, size, and shape
similarity.
• These small regions are then merged hierarchically to form larger regions, and
approximately 2000 region proposals are extracted.
• Output: A set of region proposals, each with a bounding box, usually numbering around
2000.

Step 2: Feature Extraction (CNN)


• For each region proposal (bounding box), extract features using a pre-trained Convolutional
Neural Network (CNN) (e.g., AlexNet or VGG16).
• The CNN takes the proposed region, resizes it to a fixed size (e.g., 227x227 pixels), and
then passes it through the network to extract a fixed-length feature vector.
• Note: Each region proposal is processed independently through the CNN.

Step 3: Classify Each Region Proposal (SVM)


• After feature extraction, the feature vector is passed through a Support Vector Machine
(SVM) classifier to predict the object class (e.g., dog, car, person) or determine if the region
contains no object (background).
• A separate SVM is trained for each object class to classify the extracted feature vector.
• Output: Each region is classified into an object category (or as background).

Step 4: Bounding Box Regression


• For more accurate localization, a bounding box regression model is applied. This model
adjusts the predicted bounding box of the object to better fit the actual object in the image.
• The regression model learns to minimize the error between the predicted bounding box
coordinates and the ground truth coordinates.

Step 5: Non-Maximum Suppression (NMS)


• After classifying and adjusting the bounding boxes, there may be multiple overlapping
detections for the same object.
• Non-Maximum Suppression (NMS) is applied to remove redundant boxes. NMS keeps the
detection with the highest confidence score and suppresses (removes) others that overlap
significantly (based on IoU threshold).
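As a sketch of how NMS uses the IoU threshold mentioned above, the following Python code implements IoU and a greedy NMS over detections shaped like those in the pseudocode below; the [x1, y1, x2, y2] box format and the 0.5 threshold are illustrative assumptions.

# Sketch of IoU and greedy Non-Maximum Suppression (boxes as [x1, y1, x2, y2])
def iou(a, b):
    # Intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def non_maximum_suppression(detections, iou_threshold=0.5):
    # detections: list of dicts with 'bounding_box' and 'confidence' keys (as in the pseudocode below)
    kept = []
    remaining = sorted(detections, key=lambda d: d['confidence'], reverse=True)
    while remaining:
        best = remaining.pop(0)                  # Keep the highest-confidence detection
        kept.append(best)
        remaining = [d for d in remaining        # Suppress detections that overlap it too much
                     if iou(best['bounding_box'], d['bounding_box']) < iou_threshold]
    return kept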

Output: A set of final detected objects with their corresponding bounding boxes, class labels,
and confidence scores
Pseudocode for Object Detection Using R-CNN:
# R-CNN Algorithm for Object Detection
def rcnn_object_detection(image):
    # Step 1: Generate region proposals using Selective Search
    region_proposals = selective_search(image)
    # List to store detected objects
    detected_objects = []
    # Step 2: For each region proposal
    for region in region_proposals:
        # Extract the bounding box for the region
        bounding_box = get_bounding_box(region)
        # Step 3: Extract features from the region using CNN
        resized_region = resize_region(region)  # Resize to the CNN input size
        features = cnn_extract_features(resized_region)
        # Step 4: Classify the region using a trained SVM
        object_class = svm_classify(features)
        # Step 5: Apply bounding box regression to refine the bounding box
        refined_bounding_box = bounding_box_regression(bounding_box, features)
        # Step 6: Store the result if it's a valid object class (not background)
        if object_class != "background":
            detected_objects.append({
                'class': object_class,
                'bounding_box': refined_bounding_box,
                'confidence': svm_confidence_score(object_class)
            })
    # Step 7: Apply Non-Maximum Suppression (NMS) to remove redundant detections
    final_detections = non_maximum_suppression(detected_objects)
    return final_detections
# Output: List of detected objects with bounding boxes, class labels, and confidence scores.
5. Image Segmentation using Image-to-Image Neural Networks
Segmentation is the process of classifying every pixel in an image into a category.
5.1. Types of Segmentation
1. Semantic Segmentation
o Labels each pixel with a class (e.g., “sky,” “road,” “car”).
o Example: U-Net (used in medical imaging).
2. Instance Segmentation
o Separates different instances of the same class (e.g., multiple people).
o Example: Mask R-CNN (used in self-driving cars).
5.2. U-Net (Image-to-Image Network)
 Developed for medical image segmentation.
 Uses encoder-decoder architecture:
o Encoder (CNN) extracts features.
o Decoder reconstructs pixel-wise classification.
5.3. Mask R-CNN
 Extends Faster R-CNN for pixel-wise object detection.
 Adds a segmentation mask prediction branch.
 Used in autonomous vehicles, robotics, and AR/VR.

Segmentation using image-to-image neural networks is a crucial task in computer vision,
especially in domains like medical imaging, autonomous driving, and object detection. Image
segmentation refers to the process of dividing an image into multiple segments or regions,
typically to isolate objects or regions of interest from the background. Here’s an overview of
how segmentation using image-to-image neural networks works.

1. Types of Segmentation
• Semantic Segmentation: This classifies each pixel in an image into a particular class, but
does not distinguish between objects of the same class (e.g., all cars in an image are labeled
as "car").
• Instance Segmentation: Similar to semantic segmentation, but it distinguishes different
instances of the same object class.
• Panoptic Segmentation: A combination of semantic and instance segmentation, labeling
both things (objects) and stuff (background).
2. Neural Network Architectures for Segmentation
• Fully Convolutional Networks (FCN): Traditional CNNs for image classification have fully
connected layers at the end. However, for segmentation, FCNs replace these with
convolutional layers that output a pixel-wise classification.
• U-Net: A popular architecture originally designed for biomedical image segmentation. It
consists of an encoder-decoder structure with skip connections. The encoder extracts features,
and the decoder reconstructs the segmented image. Skip connections help recover spatial
information lost during downsampling.
• SegNet: Similar to U-Net, SegNet also uses an encoder-decoder structure, but it memorizes
the max-pooling indices in the encoder and uses them in the decoder to ensure better spatial
resolution.
• Mask R-CNN: Extends Faster R-CNN (a region proposal network for object detection) to
also generate segmentation masks for each detected object. It’s commonly used for instance
segmentation.
• DeepLab: Uses dilated/atrous convolutions to capture multi-scale context information and
improves segmentation, especially for smaller objects. Variants like DeepLabV3 and
DeepLabV3+ are widely used.

3. Loss Functions for Segmentation


• Cross-Entropy Loss: Commonly used for pixel-wise classification in semantic
segmentation.
• Dice Coefficient: Measures overlap between predicted and ground truth segments,
especially useful for imbalanced data where the object covers a small portion of the image.
• IoU (Intersection over Union): Similar to Dice, used to measure the accuracy of predicted
segments.
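A small sketch of how the Dice coefficient and IoU are computed for binary masks (NumPy arrays of 0s and 1s); the smoothing constant eps is an assumption added to avoid division by zero, and the tiny masks are illustrative:

# Sketch of Dice and IoU for binary segmentation masks
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def iou_score(pred, target, eps=1e-7):
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return (intersection + eps) / (union + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
target = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
print(dice_coefficient(pred, target), iou_score(pred, target))  # ~0.667 and 0.5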

4. Training Image-to-Image Segmentation Models


• Data Preparation: Requires a large, labeled dataset where each pixel of the image has a
corresponding label (ground truth). Augmentation techniques like rotation, flipping, cropping,
and scaling are often used to improve generalization.
• Metrics: Commonly used metrics for evaluating segmentation include accuracy, Dice score,
IoU, and precision-recall metrics.

5. Applications
• Medical Imaging: Segmenting organs or tumors from MRI, CT, or ultrasound scans.
• Autonomous Driving: Segmentation of roads, vehicles, and pedestrians for scene
understanding.
• Satellite Imagery: Segmenting land use areas, forests, or water bodies.
• Object Detection: Combined with object detection for pixel-accurate instance detection.

Example Workflow with U-Net


1. Input Image: Take an input image, such as an MRI scan.
2. Encoding Stage: Apply a series of convolutional layers and downsampling
to extract feature maps.
3. Decoding Stage: Use upsampling and transposed convolution to restore
the image size and apply segmentation labels to each pixel.
4. Skip Connections: Reintroduce high-resolution features from the
encoding stage into the decoding stage to refine segmentation boundaries.
5. Output: The network outputs a segmented image where each pixel belongs
to a specific class (e.g., background, tumor).
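The workflow above can be sketched as a very small U-Net-style model in Keras, with one downsampling stage, one upsampling stage, and a single skip connection; a real U-Net has several stages and many more filters, so the sizes here are assumptions for illustration only.

# Tiny U-Net-style sketch: one encoder stage, one decoder stage, one skip connection
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(128, 128, 1))                               # e.g., a grayscale MRI slice

# Encoding stage: convolution + downsampling
e1 = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(inputs)
p1 = layers.MaxPooling2D((2, 2))(e1)                                     # 128x128 -> 64x64

# Bottleneck
b = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(p1)

# Decoding stage: upsampling via transposed convolution
u1 = layers.Conv2DTranspose(16, (2, 2), strides=2, padding='same')(b)    # 64x64 -> 128x128
u1 = layers.Concatenate()([u1, e1])                                      # Skip connection from the encoder
d1 = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(u1)

# Pixel-wise prediction: one probability per pixel (binary mask, e.g., tumor vs. background)
outputs = layers.Conv2D(1, (1, 1), activation='sigmoid')(d1)

unet = models.Model(inputs, outputs)
unet.summary()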
6. Temporal Processing & Recurrent Neural Network (RNN)
Unlike CNNs, RNNs handle sequential and time-dependent data (e.g., text, speech,
videos).
6.1. Key Concepts of RNNs
 Loops in RNNs allow information to persist.
 The hidden state updates recursively: h_t = f(W h_{t-1} + U x_t), where:
o h_t is the current hidden state,
o x_t is the current input, and
o W, U are weight matrices.
6.2. Long Short-Term Memory (LSTM)
 Overcomes the vanishing gradient problem in RNNs.
 Uses gates to selectively store or forget information.
 Used in speech recognition, stock price prediction, and video analysis.
6.3. Gated Recurrent Unit (GRU)
 A simpler alternative to LSTM with fewer parameters.
 Retains long-term dependencies efficiently.
6.4. Temporal CNNs
 Use 1D convolutions for time-series data.
 Faster than RNNs for certain tasks.
Temporal processing involves handling data where the order and timing of observations are
crucial, such as time-series data, speech, and video sequences. This type of data requires
specialized models that can capture temporal dependencies and patterns over time. Recurrent
Neural Networks (RNNs) are a class of neural networks designed specifically for such tasks.
Recurrent Neural Networks (RNNs) are a type of artificial neural network designed for
processing sequential data. Unlike traditional feedforward neural networks, RNNs maintain a
hidden state that captures information from previous time steps, allowing them to remember
previous inputs in the sequence.
Structure: An RNN consists of
• Input Layer: Receives the input sequence.
• Recurrent Layer(s): Contains neurons that connect back to themselves, creating cycles. This
structure enables the network to maintain a state over time.
• Output Layer: Produces the final output, which can be a prediction for the next time step or
a classification based on the entire sequence.
Mathematics of RNNs
The forward pass of an RNN can be described as follows: at each time step t, the hidden state
is updated as h_t = f(W h_{t-1} + U x_t), and the output is computed as y_t = g(V h_t), where f and g
are activation functions (e.g., tanh for the hidden state and Softmax for the output) and V is the
hidden-to-output weight matrix. A small NumPy sketch of this forward pass is given below.
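The following is a minimal illustrative sketch of this forward pass in NumPy; the sequence length, layer sizes, and random inputs are assumptions for demonstration only.

# Sketch of the RNN forward pass h_t = tanh(W h_{t-1} + U x_t), y_t = softmax(V h_t)
import numpy as np

T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2        # Sequence length and layer sizes (illustrative)
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, input_dim))                     # Toy input sequence

W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # Hidden-to-hidden weights
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # Input-to-hidden weights
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))   # Hidden-to-output weights

h = np.zeros(hidden_dim)                                 # Initial hidden state
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t])                       # Hidden state carries information forward in time
    logits = V @ h
    y = np.exp(logits) / np.sum(np.exp(logits))          # Softmax output at each time step
    print(f"t={t}, y={y}")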
