Image Segmentation Basics
PROJECT I
AN OVERVIEW OF IMAGE SEGMENTATION ANALYSIS
VU QUACH TUAN MINH
[email protected]
CHAPTER 2: TYPES OF IMAGE SEGMENTATION
CHAPTER 3: CONVOLUTIONAL NEURAL NETWORKS (CNNs)
CHAPTER 4: KEY TECHNIQUES IN IMAGE SEGMENTATION
4.1 Convolution
Convolution slides a small kernel across the input matrix; at each position, the
overlapping values are multiplied element-wise and summed to produce one entry of
the output. As an example, consider the 3x3 kernel

0  1  0
1 -1  1
0  1  0

placed over the top-left 3x3 window of a 5x5 input "Matrix A", whose window values are

1 2 0
4 5 1
7 8 1

Multiplying element-wise and summing gives
(1 ∗ 0) + (2 ∗ 1) + (0 ∗ 0) + (4 ∗ 1) + (5 ∗ −1) + (1 ∗ 1) + (7 ∗ 0) + (8 ∗ 1) + (1 ∗ 0)
= 0 + 2 + 0 + 4 + (−5) + 1 + 0 + 8 + 0
= 2+4−5+1+8
= 10
Doing the same at the other 8 possible positions on Matrix A gives the following
"Feature Map":
10 11 10
12 11 9
10 9 12
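To make the arithmetic concrete, here is a minimal sketch in Python (using NumPy,
which this report does not otherwise assume) that reproduces the single dot-product
step for the top-left window:

import numpy as np

patch = np.array([[1, 2, 0],      # top-left 3x3 window of Matrix A,
                  [4, 5, 1],      # as read off from the products above
                  [7, 8, 1]])
kernel = np.array([[0, 1, 0],
                   [1, -1, 1],
                   [0, 1, 0]])

# one convolution step: element-wise multiply, then sum
value = int((patch * kernel).sum())
print(value)  # 10, the top-left entry of the feature map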
4.2 Padding
Padding in the context of Convolutional Neural Networks (CNNs) is a technique
used to add extra pixels around the borders of an input image or feature map. These
additional pixels, often set to zero (zero padding), allow the convolution filters to
process the border regions of the image more effectively. Padding helps in preserving
the spatial dimensions of the input during the convolution operation, preventing the
reduction in size that typically occurs. This is particularly important for maintaining
the alignment of feature maps across different layers of the network and ensuring
that important features near the edges of the image are not lost. By using padding,
CNNs can generate output feature maps that retain the same height and width as the
original input, facilitating the design of deeper and more complex neural network
architectures.
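As a small illustration (a sketch using NumPy, not part of the original text): zero
padding a 5x5 input by one pixel on each side yields a 7x7 array, so a subsequent
3x3 convolution returns an output of the original 5x5 size.

import numpy as np

x = np.arange(25).reshape(5, 5)                   # a 5x5 input
padded = np.pad(x, pad_width=1, mode="constant")  # zero padding, 1 pixel per side
print(x.shape, padded.shape)  # (5, 5) (7, 7)
# a 3x3 convolution over the 7x7 padded input produces a 5x5 output,
# matching the original spatial dimensions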
4.3 Strides
In the context of Convolutional Neural Networks (CNNs), stride refers to the
step size with which a convolution filter or pooling window moves across the input
image or feature map. Stride determines how much the filter shifts at each step, both
horizontally and vertically. A stride of 1 means the filter moves one pixel at a time,
resulting in overlapping regions and larger output dimensions. A stride greater than 1
(e.g., 2 or 3) means the filter moves more than one pixel at a time, reducing the output
dimensions and making the computation more efficient. Stride plays a crucial role in
controlling the spatial dimensions of the output feature map and helps in balancing
the trade-off between spatial resolution and computational efficiency.
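The interaction of input size n, kernel size k, padding p, and stride s is captured
by the standard output-size formula: output size = floor((n + 2p − k) / s) + 1. A
quick sketch (illustrative; the helper name is invented for this example):

def conv_output_size(n, k, p=0, s=1):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

print(conv_output_size(7, 3, p=0, s=1))  # 5: stride 1 keeps more resolution
print(conv_output_size(7, 3, p=0, s=2))  # 3: stride 2 roughly halves each dimension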
4.4 Pooling
Pooling is a down-sampling operation used in Convolutional Neural Networks
(CNNs) to reduce the spatial dimensions (height and width) of feature maps while
retaining the most important information. There are two main types of pooling: max
pooling and average pooling.
In max pooling, the maximum value within a specified window (e.g., 2x2) is
selected, whereas in average pooling, the average value within the window is
calculated. Pooling helps in reducing the computational load and the number of
parameters in the network, which in turn helps prevent overfitting. Additionally,
pooling introduces a degree of invariance to small translations and distortions in
the input image, making the network more robust to variations in the input data.
By effectively summarizing the presence of features, pooling layers enable CNNs
to capture the essential characteristics of the input image efficiently.
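A minimal NumPy sketch of non-overlapping 2x2 max pooling (an illustration; the
helper name and the sample matrix are invented for this example):

import numpy as np

def max_pool_2x2(x):
    # non-overlapping 2x2 max pooling (stride 2); assumes even height and width
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 4, 3, 8]])
print(max_pool_2x2(x))
# [[6 4]
#  [7 9]]
# average pooling would instead use .mean(axis=(1, 3))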
CHAPTER 5: POPULAR ARCHITECTURES
5.1 U-Net
The U-Net architecture, introduced by Olaf Ronneberger, Philipp Fischer, and
Thomas Brox in 2015, is a specialized type of Convolutional Neural Network (CNN)
designed specifically for image segmentation tasks, particularly in biomedical applications.
The key innovation of U-Net lies in its unique architecture, which is shaped like the
letter "U." This design allows the network to capture both context and fine details of
the input images, making it highly effective for precise segmentation.
The U-Net architecture consists of two main parts: the contracting path and
the expanding path. The contracting path, also known as the encoder, follows the
typical structure of a CNN, with repeated application of convolution and pooling
layers to progressively reduce the spatial dimensions and capture high-level features.
The expanding path, or decoder, mirrors the contracting path but uses up-sampling
and convolution layers to gradually restore the spatial dimensions and reconstruct the
segmented output. A crucial aspect of U-Net is the presence of skip connections
that directly link corresponding layers in the contracting and expanding paths. These
connections ensure that fine-grained details lost during down-sampling are preserved
and integrated during up-sampling, resulting in highly accurate segmentation maps.
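The encoder/decoder/skip-connection idea can be sketched in a few lines. The
following is a deliberately tiny, one-level PyTorch sketch (PyTorch and all layer
sizes here are assumptions for illustration; the original U-Net has four resolution
levels and many more channels):

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions with ReLU; padding=1 preserves spatial size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 32)                  # contracting path
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(32, 64)              # bottom of the "U"
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # expanding path
        self.dec = double_conv(64, 32)        # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.up(b)
        d = self.dec(torch.cat([d, e], dim=1))  # skip connection
        return self.head(d)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])

The torch.cat call is the skip connection described above: encoder features are
concatenated with the upsampled decoder features, so fine-grained detail lost during
down-sampling is reintroduced on the way back up.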
5.2 VGG-19
The VGG-19 architecture is a deep convolutional neural network introduced by
the Visual Geometry Group (VGG) at the University of Oxford in 2014. VGG-19
is part of the VGG family of models, known for their simplicity and effectiveness in
image classification tasks. The architecture is characterized by its use of small 3x3
convolution filters stacked on top of each other in multiple layers. VGG-19 consists
of 19 layers: 16 convolutional layers and 3 fully connected layers.
The design of VGG-19 emphasizes depth, with the network having a total of 19
weight layers, which enables it to capture hierarchical features from the input images.
The use of small filters allows the network to learn intricate patterns while keeping
the number of parameters manageable. VGG-19 has been widely adopted in various
computer vision tasks due to its strong performance and relatively straightforward
architecture. It has also served as a foundational model for many subsequent advancements
in deep learning, including the development of more complex architectures.
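One way to verify the 16 + 3 layer count is to instantiate the torchvision
implementation and count its weight layers (using torchvision is an assumption here;
any VGG-19 implementation would do):

import torch.nn as nn
from torchvision import models

vgg19 = models.vgg19(weights=None)  # architecture only, no pretrained weights
n_conv = sum(isinstance(m, nn.Conv2d) for m in vgg19.modules())
n_fc = sum(isinstance(m, nn.Linear) for m in vgg19.modules())
print(n_conv, n_fc)  # 16 3 -> the 19 weight layers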
5.3 DoubleU-Net
The DoubleU-Net architecture is an extension of the popular U-Net model,
designed to improve performance in image segmentation tasks, particularly in biomedical
imaging. The key innovation of DoubleU-Net is the incorporation of two U-Net
structures connected in a cascade. The first U-Net performs initial segmentation,
and its output is passed as input to the second U-Net, which refines the segmentation
results.
This dual U-Net structure allows DoubleU-Net to leverage the strengths of the
original U-Net while addressing some of its limitations. The first U-Net focuses
7
on capturing coarse-level features and providing an initial segmentation map. The
second U-Net, using the output of the first U-Net, performs fine-level segmentation,
enhancing the accuracy and detail of the segmented regions. Skip connections within
each U-Net and between the two U-Nets ensure that important features and contextual
information are preserved throughout the network.
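The cascade can be sketched schematically. The following reuses the TinyUNet class
from the sketch in Section 5.1 and illustrates only the "second network refines the
first" idea; the published DoubleU-Net additionally uses a pretrained VGG-19 encoder
in its first network and ASPP blocks, none of which are shown here.

import torch
import torch.nn as nn

class CascadedUNet(nn.Module):
    # schematic DoubleU-Net-style cascade: network 2 refines network 1's mask
    def __init__(self, net1, net2):
        super().__init__()
        self.net1, self.net2 = net1, net2

    def forward(self, x):
        mask1 = torch.sigmoid(self.net1(x))  # initial segmentation
        mask2 = self.net2(x * mask1)         # input gated by the first mask
        return mask1, mask2                  # both outputs can be supervised

net = CascadedUNet(TinyUNet(in_ch=1, num_classes=1), TinyUNet(in_ch=1, num_classes=1))
m1, m2 = net(torch.randn(1, 1, 64, 64))
print(m1.shape, m2.shape)  # torch.Size([1, 1, 64, 64]) for both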