HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF ELECTRICAL AND ELECTRONICS ENGINEERING

PROJECT I
AN OVERVIEW OF IMAGE SEGMENTATION
ANALYSIS
VU QUACH TUAN MINH
[email protected]

ADVANCED PROGRAM - CONTROL AND AUTOMATION

Instructor: Dr. Pham Van Truong


Instructor’s Signature

Faculty: School of Electrical and Electronics Engineering

Hanoi, January 2025


Table of Contents

CHAPTER 1: INTRODUCTION AND KEY CONCEPTS
1.1 Definition of Image Segmentation
1.2 Applications
1.3 Pixel Classification
1.4 Mask

CHAPTER 2: TYPES OF IMAGE SEGMENTATION
2.1 Semantic Segmentation
2.2 Instance Segmentation
2.3 Panoptic Segmentation

CHAPTER 3: CONVOLUTIONAL NEURAL NETWORKS (CNNs)
3.1 Overview of CNNs
3.2 Layers in CNNs
3.3 Filters/Kernels

CHAPTER 4: KEY TECHNIQUES IN IMAGE SEGMENTATION
4.1 Convolution Operation
4.2 Padding
4.3 Strides
4.4 Pooling

CHAPTER 5: POPULAR ARCHITECTURES
5.1 U-Net
5.2 VGG-19
5.3 DoubleU-Net

CHAPTER 1: INTRODUCTION AND KEY CONCEPTS

1.1 Definition of Image Segmentation


Image segmentation is a computer vision technique that involves breaking down
an image into multiple segments or regions. The goal is to simplify the image and
make it more meaningful and easier to analyze by identifying and isolating objects or
areas of interest within the image.
1.2 Applications
Image segmentation is widely used in various fields, including medical imaging
(identifying tumors), autonomous vehicles (detecting road signs and obstacles), and
facial recognition (detecting facial features).
1.3 Pixel Classification
Each pixel in the image is classified into a specific category or class. This
classification can be binary (e.g., object vs. background) or multi-class (e.g., different
objects).
1.4 Mask
A mask is created where each pixel is labeled according to its class. This mask
highlights the segmented regions of the image.
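
As a concrete sketch of these two ideas (added for illustration; the pixel values
are invented), a binary mask can be produced by classifying every pixel of a tiny
grayscale image against a threshold:

    import numpy as np

    image = np.array([[ 12,  40, 200],
                      [ 30, 220, 210],
                      [ 25,  35, 190]])    # a tiny grayscale image

    # Pixel classification: object (1) vs. background (0), by thresholding.
    mask = (image > 128).astype(np.uint8)
    print(mask)
    # [[0 0 1]
    #  [0 1 1]
    #  [0 0 1]]  -> the mask labels each pixel and highlights the segmented region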

CHAPTER 2: TYPES OF IMAGE SEGMENTATION

2.1 Semantic Segmentation


Classifies each pixel into a class, without distinguishing between different instances
of the same class.
Semantic segmentation is like teaching a computer to understand and label every
single pixel in an image according to its category. Imagine having a photo of a busy
street, semantic segmentation would help the computer recognize that certain groups
of pixels belong to cars, others to pedestrians, some to buildings, and so forth. The
result is a new image where every pixel is assigned a label corresponding to what it
represents. This technique is particularly powerful because it enables detailed scene
understanding, allowing applications like autonomous driving to identify and react to
different elements in the environment accurately. Think of it as giving the computer
a detailed map of everything in the image, making it easier to analyze and interpret
complex scenes.
2.2 Instance Segmentation
Differentiates between individual instances of objects within the same class.
Instance segmentation is like giving the computer the ability to not only identify
and label different categories of objects in an image but also distinguish between
individual instances of the same category. Imagine you have a photo with several
people and cars. Instance segmentation will help the computer understand that there
are multiple people and cars, and it will label each person and car separately, even
if they belong to the same category. This is incredibly useful for applications like
autonomous vehicles, where it is essential to differentiate between multiple objects
of the same type, such as different cars on the road. Think of instance segmentation
as providing a detailed and precise breakdown of every individual object within each
category in the image.
2.3 Panoptic Segmentation
Combines semantic and instance segmentation, providing a complete scene understanding.
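
To make the three outputs concrete, the sketch below (an invented example, not
from the report) labels the same 3x4 scene under each scheme, using 0 for background
and 1 for the "car" class:

    import numpy as np

    # Semantic segmentation: every pixel gets a class ID; both cars share class 1.
    semantic = np.array([[1, 1, 0, 1],
                         [1, 1, 0, 1],
                         [0, 0, 0, 0]])

    # Instance segmentation: pixels of the same class get separate object IDs 1 and 2.
    instance = np.array([[1, 1, 0, 2],
                         [1, 1, 0, 2],
                         [0, 0, 0, 0]])

    # Panoptic segmentation: a (class ID, instance ID) pair per pixel combines both.
    panoptic = np.stack([semantic, instance], axis=-1)
    print(panoptic.shape)   # (3, 4, 2)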

CHAPTER 3: CONVOLUTIONAL NEURAL NETWORKS (CNNs)

3.1 Overview of CNNs


A Convolutional Neural Network (CNN) is a type of deep learning model particularly
well-suited for analyzing visual data like images. At its core, a CNN consists of
multiple layers that work together to automatically and adaptively learn spatial hierarchies
of features from the input images. The fundamental building blocks of a CNN include
convolutional layers, which apply filters (kernels) to the input images to extract features
such as edges, textures, and patterns; activation layers like ReLU (Rectified Linear
Unit) that introduce non-linearity; pooling layers that downsample the feature maps
to reduce computational load and control overfitting; and fully connected layers that
integrate the learned features for final classification or regression tasks. This layered
structure allows CNNs to efficiently process and understand complex visual data,
making them essential tools in various applications, from image classification and
object detection to image segmentation and beyond.
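
The four building blocks listed above can be lined up in a few lines of code. The
following is a minimal sketch, assuming PyTorch; it only illustrates the layer order
(convolution, ReLU, pooling, fully connected), not a tuned architecture:

    import torch
    import torch.nn as nn

    # A minimal CNN: convolution + ReLU extract features, pooling downsamples,
    # and a fully connected layer maps the features to class scores.
    model = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),        # 28x28 -> 14x14
        nn.Flatten(),
        nn.Linear(8 * 14 * 14, 10),         # 10 hypothetical output classes
    )

    scores = model(torch.randn(1, 1, 28, 28))   # one 28x28 grayscale image
    print(scores.shape)                         # torch.Size([1, 10])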
3.2 Layers in CNNs
In the context of Convolutional Neural Networks (CNNs), "layers" refer to the
individual stages through which an input image passes as it is processed by the network.
Each layer performs specific operations that transform the input data, allowing the
network to learn and extract various features.
3.3 Filters/Kernels
In Convolutional Neural Networks (CNNs), filters (also known as kernels) are
small matrices of weights that play a crucial role in feature extraction from input
images. Filters are applied to the input data through the convolution operation, where
they slide across the image, performing element-wise multiplication and summation
at each position. Each filter is designed to detect specific features, such as edges,
textures, or patterns, within the image. During the training process, the values of these
filters are learned and adjusted to capture relevant features effectively. The resulting
output of the convolution operation is known as a feature map, which highlights
the presence of the detected features. By stacking multiple convolution layers with
different filters, CNNs can learn hierarchical representations of the image, from low-
level features in the initial layers to high-level, complex patterns in the deeper layers.
This ability to automatically and adaptively extract features makes filters essential
components in the powerful functionality of CNNs.
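
To see what a single fixed filter does before any training, the sketch below
(illustrative only, using NumPy) slides a classic vertical-edge kernel over a small
image; a CNN learns kernels like this automatically rather than being given them:

    import numpy as np

    # An image that is dark (0) on the left and bright (9) on the right.
    img = np.array([[0, 0, 0, 9, 9]] * 5)

    # A hand-crafted vertical-edge kernel: responds where brightness rises left to right.
    kernel = np.array([[-1, 0, 1],
                       [-1, 0, 1],
                       [-1, 0, 1]])

    rows, cols = img.shape[0] - 2, img.shape[1] - 2
    fmap = np.array([[np.sum(img[i:i+3, j:j+3] * kernel) for j in range(cols)]
                     for i in range(rows)])
    print(fmap)   # nonzero entries mark windows straddling the dark-to-bright edge
    # [[ 0 27 27]
    #  [ 0 27 27]
    #  [ 0 27 27]]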

CHAPTER 4: KEY TECHNIQUES IN IMAGE SEGMENTATION

4.1 Convolution Operation


Convolution is a mathematical operation used to extract features from an input
image. It involves applying a filter (kernel) to the image by sliding it across the image
with a specified stride and performing element-wise multiplication and summation at
each position.
For example, we have an "input image" which is illustrated by this matrix A:

    A = [ 1  2  0  1  3 ]
        [ 4  5  1  6  0 ]
        [ 7  8  1  3  2 ]
        [ 4  5  2  1  0 ]
        [ 0  2  1  3  4 ]

We also have this filter (or kernel) matrix B:

    B = [ 0   1  0 ]
        [ 1  −1  0 ]
        [ 0   1  0 ]

If we perform element-wise multiplication of matrix B with the top-left 3x3 portion
of matrix A, we have

    [ 1 ∗ 0   2 ∗ 1    0 ∗ 0 ]
    [ 4 ∗ 1   5 ∗ −1   1 ∗ 0 ]
    [ 7 ∗ 0   8 ∗ 1    1 ∗ 0 ]

Now we take the sum of all the products, that is

    (1 ∗ 0) + (2 ∗ 1) + (0 ∗ 0) + (4 ∗ 1) + (5 ∗ −1) + (1 ∗ 0) + (7 ∗ 0) + (8 ∗ 1) + (1 ∗ 0)
    = 0 + 2 + 0 + 4 + (−5) + 0 + 0 + 8 + 0
    = 9

And if we do the same with the other 8 possible positions on matrix A, we have
the following "feature map":

    [ 9   5  −1 ]
    [ 9  10   5 ]
    [ 9   5   7 ]
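
The same sliding-window computation can be reproduced in a few lines of NumPy.
This is an illustrative sketch, not part of the original report; note that, like the
worked example, it computes a plain multiply-and-sum (cross-correlation) without
flipping the kernel:

    import numpy as np

    A = np.array([[1, 2, 0, 1, 3],
                  [4, 5, 1, 6, 0],
                  [7, 8, 1, 3, 2],
                  [4, 5, 2, 1, 0],
                  [0, 2, 1, 3, 4]])

    B = np.array([[0,  1, 0],
                  [1, -1, 0],
                  [0,  1, 0]])

    def conv2d_valid(image, kernel, stride=1):
        """Slide `kernel` over `image` with no padding; multiply element-wise and sum."""
        k = kernel.shape[0]
        out = (image.shape[0] - k) // stride + 1
        result = np.zeros((out, out), dtype=int)
        for i in range(out):
            for j in range(out):
                window = image[i*stride:i*stride + k, j*stride:j*stride + k]
                result[i, j] = np.sum(window * kernel)
        return result

    print(conv2d_valid(A, B))
    # [[ 9  5 -1]
    #  [ 9 10  5]
    #  [ 9  5  7]]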

4.2 Padding
Padding in the context of Convolutional Neural Networks (CNNs) is a technique
used to add extra pixels around the borders of an input image or feature map. These
additional pixels, often set to zero (zero padding), allow the convolution filters to
process the border regions of the image more effectively. Padding helps in preserving
the spatial dimensions of the input during the convolution operation, preventing the
reduction in size that typically occurs. This is particularly important for maintaining
the alignment of feature maps across different layers of the network and ensuring
that important features near the edges of the image are not lost. By using padding,
CNNs can generate output feature maps that retain the same height and width as the
original input, facilitating the design of deeper and more complex neural network
architectures.
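
A short NumPy sketch (illustrative, not from the report) shows zero padding in
action: one ring of zeros around a 5x5 feature map makes it 7x7, so a 3x3 filter
produces a 5x5 output and the original spatial size is preserved:

    import numpy as np

    x = np.arange(25).reshape(5, 5)       # a 5x5 feature map
    x_padded = np.pad(x, pad_width=1)     # zero padding: one ring of zeros per side
    print(x.shape, "->", x_padded.shape)  # (5, 5) -> (7, 7)
    # A 3x3 filter over the 7x7 padded input yields a 5x5 output: "same" padding.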
4.3 Strides
In the context of Convolutional Neural Networks (CNNs), stride refers to the
step size with which a convolution filter or pooling window moves across the input
image or feature map. Stride determines how much the filter shifts at each step, both
horizontally and vertically. A stride of 1 means the filter moves one pixel at a time,
resulting in overlapping regions and larger output dimensions. A stride greater than 1
(e.g., 2 or 3) means the filter moves more than one pixel at a time, reducing the output
dimensions and making the computation more efficient. Stride plays a crucial role in
controlling the spatial dimensions of the output feature map and helps in balancing
the trade-off between spatial resolution and computational efficiency.

The example in Part 4.1 uses a stride of 1.
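
The interplay of input size, kernel size, padding, and stride is captured by the
standard output-size formula, sketched here as a small helper (the function and
variable names are mine, for illustration):

    def output_size(n, k, p=0, s=1):
        """Output size of a convolution or pooling window:
        n = input size, k = kernel size, p = padding per side, s = stride."""
        return (n + 2 * p - k) // s + 1

    print(output_size(n=5, k=3, p=0, s=1))  # 3 -> the 3x3 feature map of Part 4.1
    print(output_size(n=5, k=3, p=1, s=1))  # 5 -> "same" padding preserves the size
    print(output_size(n=5, k=3, p=0, s=2))  # 2 -> a larger stride shrinks the output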

4.4 Pooling
Pooling is a down-sampling operation used in Convolutional Neural Networks
(CNNs) to reduce the spatial dimensions (height and width) of feature maps while
retaining the most important information. There are two main types of pooling: max
pooling and average pooling.
In max pooling, the maximum value within a specified window matrix (e.g., 2x2)
is selected, whereas in average pooling, the average value within the window matrix
is calculated. Pooling helps in reducing the computational load and the number of
parameters in the network, which in turn helps prevent overfitting. Additionally,
pooling introduces a degree of invariance to small translations and distortions in
the input image, making the network more robust to variations in the input data.
By effectively summarizing the presence of features, pooling layers enable CNNs
to capture the essential characteristics of the input image efficiently.
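
Both pooling variants fit in a short NumPy sketch (illustrative; the reshape trick
assumes the window size divides the input size evenly):

    import numpy as np

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 2],
                  [7, 2, 9, 0],
                  [4, 8, 3, 1]])

    def pool2d(x, size=2, reduce=np.max):
        """Non-overlapping pooling: apply `reduce` over each size x size window."""
        h, w = x.shape
        windows = x.reshape(h // size, size, w // size, size)
        return reduce(windows, axis=(1, 3))

    print(pool2d(x, reduce=np.max))   # max pooling:     [[6 4] [8 9]]
    print(pool2d(x, reduce=np.mean))  # average pooling: [[3.75 2.25] [5.25 3.25]]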

CHAPTER 5: POPULAR ARCHITECTURES

5.1 U-Net
The U-Net architecture, introduced by Olaf Ronneberger, Philipp Fischer, and
Thomas Brox in 2015, is a specialized type of Convolutional Neural Network (CNN)
designed specifically for image segmentation tasks, particularly in biomedical applications.
The key innovation of U-Net lies in its unique architecture, which is shaped like the
letter "U." This design allows the network to capture both context and fine details of
the input images, making it highly effective for precise segmentation.

The U-Net architecture consists of two main parts: the contracting path and
the expanding path. The contracting path, also known as the encoder, follows the
typical structure of a CNN, with repeated application of convolution and pooling
layers to progressively reduce the spatial dimensions and capture high-level features.
The expanding path, or decoder, mirrors the contracting path but uses up-sampling
and convolution layers to gradually restore the spatial dimensions and reconstruct the
segmented output. A crucial aspect of U-Net is the presence of skip connections
that directly link corresponding layers in the contracting and expanding paths. These
connections ensure that fine-grained details lost during down-sampling are preserved
and integrated during up-sampling, resulting in highly accurate segmentation maps.
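
A deliberately tiny sketch of this encoder-decoder idea is given below, assuming
PyTorch. It has a single down/up step and one skip connection, whereas the published
U-Net stacks four such levels; the class name and sizes are invented for illustration:

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        """A one-level U-Net sketch: encoder, decoder, and one skip connection."""
        def __init__(self, in_ch=1, n_classes=2):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool2d(2)                        # contracting path
            self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
            self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, n_classes, 1))

        def forward(self, x):
            e = self.enc(x)               # fine detail at full resolution
            m = self.mid(self.down(e))    # high-level context at half resolution
            u = self.up(m)                # up-sample back to full resolution
            u = torch.cat([u, e], dim=1)  # skip connection: re-inject encoder detail
            return self.dec(u)            # per-pixel class scores

    logits = TinyUNet()(torch.randn(1, 1, 64, 64))
    print(logits.shape)                   # torch.Size([1, 2, 64, 64])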
5.2 VGG-19
The VGG-19 architecture is a deep convolutional neural network introduced by
the Visual Geometry Group (VGG) at the University of Oxford in 2014. VGG-19
is part of the VGG family of models, known for their simplicity and effectiveness in
image classification tasks. The architecture is characterized by its use of small 3x3
convolution filters stacked on top of each other in multiple layers. VGG-19 consists
of 19 layers, including 16 convolution layers and 3 fully connected layers.

The design of VGG-19 emphasizes depth, with the network having a total of 19
weight layers, which enables it to capture hierarchical features from the input images.
The use of small filters allows the network to learn intricate patterns while keeping
the number of parameters manageable. VGG-19 has been widely adopted in various
computer vision tasks due to its strong performance and relatively straightforward
architecture. It has also served as a foundational model for many subsequent advancements
in deep learning, including the development of more complex architectures.
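
VGG-19 is available off the shelf; the snippet below assumes torchvision is
installed (the weights enum follows recent torchvision versions) and simply confirms
the 16 + 3 = 19 weight layers mentioned above:

    import torch.nn as nn
    from torchvision.models import vgg19, VGG19_Weights

    model = vgg19(weights=VGG19_Weights.IMAGENET1K_V1)   # ImageNet-pretrained VGG-19

    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    fcs   = [m for m in model.modules() if isinstance(m, nn.Linear)]
    print(len(convs), len(fcs))   # 16 3 -> 16 convolution + 3 fully connected layers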
5.3 DoubleU-Net
The DoubleU-Net architecture is an extension of the popular U-Net model,
designed to improve performance in image segmentation tasks, particularly in biomedical
imaging. The key innovation of DoubleU-Net is the incorporation of two U-Net
structures connected in a cascade. The first U-Net performs initial segmentation,
and its output is passed as input to the second U-Net, which refines the segmentation
results.

This dual U-Net structure allows DoubleU-Net to leverage the strengths of the
original U-Net while addressing some of its limitations. The first U-Net focuses
on capturing coarse-level features and providing an initial segmentation map. The
second U-Net, using the output of the first U-Net, performs fine-level segmentation,
enhancing the accuracy and detail of the segmented regions. Skip connections within
each U-Net and between the two U-Nets ensure that important features and contextual
information are preserved throughout the network.

DoubleU-Net has shown significant improvements in segmentation performance,
especially in tasks where high precision and detail are crucial. Its ability to refine
segmentation results makes it a valuable tool in various applications, including medical
diagnostics and remote sensing.
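
Following only the cascade described in this chapter, and reusing the hypothetical
TinyUNet sketch from Section 5.1, the data flow can be summarized as below; the
published DoubleU-Net adds further details (such as a VGG-19 encoder in the first
network) that this sketch deliberately omits:

    import torch

    net1 = TinyUNet(in_ch=1, n_classes=1)   # first U-Net: initial segmentation
    net2 = TinyUNet(in_ch=1, n_classes=1)   # second U-Net: refinement

    x = torch.randn(1, 1, 64, 64)           # input image
    coarse = torch.sigmoid(net1(x))         # initial segmentation map
    refined = net2(coarse)                  # refined segmentation from the cascade
    print(refined.shape)                    # torch.Size([1, 1, 64, 64])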
