UNSUPERVISED IMAGE SEGMENTATION IN SATELLITE IMAGERY USING DEEP LEARNING
Suraj Regmi
A THESIS
May 2023
Approved by:
Dr. Huaming Zhang, Research Advisor/Committee Chair
Dr. Manil Maskey, Committee Member
Dr. Chaity Banerjee Mukherjee, Committee Member
Dr. Letha Etzkorn, Department Chair
Dr. Rainer Steinwandt, College Dean
Dr. Jon Hakkila, Graduate Dean
Abstract
Suraj Regmi
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Acknowledgements
Table of Contents

Abstract
Acknowledgements
List of Tables
List of Algorithms
Chapter 1. Introduction
1.1 Motivation
1.7 Approach
1.8 Thesis Organization
2.2 Pre-training
2.6 Self-Training
Chapter 3. Methodology
3.3 Self-Training
3.5 Benchmark
3.6.2 FreeSOLO Pre-trained Weights
4.1.3 Comparison
5.1 Limitations
5.2 Applications
6.1 Conclusion
References

List of Figures

List of Tables

List of Algorithms
1 Self-Training
2 Free Mask
3 FreeSOLO Self-Training
Chapter 1. Introduction
1.1 Motivation
1. Data: Data is the fuel that propels machine learning and deep learning algorithms; models cannot be developed without it. A model learns patterns from data and uses those patterns to perform prediction. Unsupervised learning, however, does not require labeled data: it can be performed with untagged or unlabeled images. The amount of unlabeled data is five orders of magnitude higher than the amount of labeled data, so from the perspective of data availability, unsupervised learning has an advantage over supervised learning.
2. Cost and time: Preparing labeled data is costly and time consuming. ImageNet[9] is an example of a database of labeled images. It has 14 million images and 22 thousand visual categories, and it took roughly 22 human-years to build. The database is suited to global prediction tasks like image classification; preparing labeled datasets for dense prediction tasks like object detection and image segmentation is even more costly and time consuming.
It is even harder to get labeled data for a specific domain like earth science because of the domain knowledge required, the time needed to label the data, and the cost associated with labeling. For example, to assign earth science specific labels such as normal cloud, fire-induced cloud, or contrail, the labeler must be able to differentiate them, which is much harder than classifying pets as cats and dogs. In addition, there might be labels that cannot be visually identified. For example, crop health status might not be distinguishable by just looking at satellite images; bands other than RGB might be needed. So, additional capabilities might be required just to label earth science domain data.
However, unlabeled data is available at large scale in both the spatial and temporal dimensions. PlanetScope[48] data is offered by Planet Labs, which collects medium-resolution satellite images (about 3 m per pixel) of Earth approximately daily. That amounts to coverage of 200 million square kilometers per day. Such a large amount of data (15 TB per day) is ripe for unsupervised learning.
Representation learning helps to get the appropriate features from raw data to perform a particular task. Data can have many different representations, and some representations can be better than others for the end task. This fact has been exploited historically to hand-engineer important features in several domains, not limited to computer vision and digital signal processing. The performance of classification systems, for instance, is largely dependent on the representation learned from the raw data [1].
The supervised form of representation learning takes both data and labels and performs the target task, such as classification or regression; the features learned by the intermediate layers can be thought of as representations. The unsupervised form of representation learning does not use label information at all. Examples of such representation learning are principal component analysis and matrix factorization. Self-supervised learning is the paradigm of machine learning that does not have labeled data but creates labels out of unlabeled data; it can be thought of as supervised learning using unlabeled data. In this form of representation learning, the features themselves act as labels, either in their pure form or in a modified form. Learning representations using various forms of autoencoders is an example of self-supervised representation learning. Word2Vec, which represents words as numerical embeddings, is another example of representation learning in natural language processing.
The unsupervised or self-supervised form of representation learning has become more important than ever because of the exponential increase of unlabeled data. The first merit is being able to solve complex problems, such as image segmentation, in an unsupervised way. Another important merit is the capability to do representation learning without the use of labels: unsupervised or self-supervised learning techniques help learn better representations of the data, which can improve the performance of downstream tasks. The third merit is improving semi-supervised learning-based systems using unsupervised/self-supervised learning components or the embeddings they return.
1.5.3 Panoptic Segmentation
The second research problem was to establish a benchmark for it. Benchmarking was considered essential because researchers could perform a comparative analysis of their methods in the future. The third research problem was the comparison of the representation learned by the FreeSOLO-based encoder with supervised learning-based encoders on dense prediction downstream tasks. If a small number of labeled images are available for some downstream task, can the FreeSOLO-based representations be used for it? How do those representations compare against supervised learning-based representations? These are the questions the research aims to answer.
1.7 Approach
The results of the initial free mask[53] approach are also compared with the final FreeSOLO[53] model on the satellite images.
The iSAID[56], CrowdAI[37], and PASTIS[44] datasets are used to establish the satellite imagery-based, domain-agnostic unsupervised instance segmentation baseline. The FreeSOLO model is run on these three datasets and the baseline is established.
To assess and answer the question about the effectiveness of the unsu-
pervised learning-based pre-trained weights, their performance is compared with
supervised learning-based pre-trained weights in downstream semantic segmenta-
tion tasks with respect to different segmentation architectures.
The thesis is organized into six chapters. The first chapter introduces readers to the thesis topic and research problem; it also gives the motivation behind the research problem and the approach to solving it. The second chapter gives more background on the research topic and provides related works: a brief literature review on unsupervised image segmentation and unsupervised techniques used in satellite images, as well as the various unsupervised learning methodologies in the existing literature for segmenting satellite images. The third chapter explains the methodology used for this research work in detail. This research work is an application of computer science knowledge (self-supervised image segmentation) in the earth science domain (satellite images), and the computer science side is explained in great detail in that chapter. The fourth chapter presents the results of the experiments. First, the self-supervised image segmentation method is used on a labeled dataset to compare the results with ground truths, and the model is benchmarked on three remote sensing instance segmentation datasets. Then, it is used on the University of Alabama in Huntsville periphery for visual inspection using PlanetScope data, and on the MBRSC Dubai dataset to see the instance segmentation performance. Following these experiments, the self-supervised learning-based pre-trained weights are used in a semantic segmentation downstream task, and their performance is compared with that of supervised learning-based pre-trained weights. The fifth chapter discusses the limitations of this method and where those limitations originate. The sixth chapter provides the conclusion of the work and its possible future directions.
Chapter 2. Background and Related Works
2.2 Pre-training
A model pre-trained on a large dataset can be adapted to a similar problem with less data. The last few layers of the pre-trained model can be cut off and custom layers added based on the use case; the model can then be fine-tuned using custom data. Pre-training is very common in computer vision. Convolutional neural networks are seen to learn hierarchical features[26], and low-level features such as edges and shapes are common to almost all computer vision tasks. So, the layers act as feature extractors and help obtain a better representation of the raw pixel data.
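As a concrete illustration of this recipe, the sketch below freezes an ImageNet pre-trained torchvision backbone, replaces its classification head, and fine-tunes only the new head. The backbone choice, class count, and learning rate are illustrative placeholders, not values used in this thesis.

import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained ResNet; its convolutional layers will act
# purely as a feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the pre-trained layers

# Cut off the final classification layer and add a custom head for the
# new task (10 classes is an arbitrary placeholder).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new head's parameters are fine-tuned on the custom data.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)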
2.4 Self-supervised Pre-training
A positive pair has different views of the same image, whereas a negative pair has views of different images. The views are different instances of the same image formed by applying some transformations to the image. In this type of representation learning, the features of an image are pushed apart from the features of different images.

While doing pre-training, there are two parts: a backbone and a projection head. The backbone is taken from some standard architecture such as ResNet[17], and the projection head is a stack of layers placed on top of the backbone. Two views of the same or different images are passed through the backbone and projection head, and the result is a global feature vector: one is the query feature vector and another is the key feature vector. For each query feature vector q, there is a key feature vector k+ that matches the query; such a query and key form a positive pair. The InfoNCE loss[41] is used as the objective function such that positive pairs are pulled together and negative pairs are pushed apart:
$$\mathcal{L}_q = -\log \frac{e^{q \cdot k_+ / \tau}}{e^{q \cdot k_+ / \tau} + \sum_{k_-} e^{q \cdot k_- / \tau}} \tag{2.1}$$
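For concreteness, a minimal PyTorch sketch of equation 2.1 follows. The tensor shapes and temperature are illustrative; in practice the keys come from a momentum encoder and a queue, as in MoCo[14].

import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """q: (D,) query; k_pos: (D,) positive key; k_negs: (N, D) negative keys."""
    l_pos = (q @ k_pos).unsqueeze(0) / tau   # similarity to the positive key
    l_neg = (k_negs @ q) / tau               # similarities to the negative keys
    logits = torch.cat([l_pos, l_neg])       # positive key sits at index 0
    # -log(e^{l_pos} / sum(e^{logits})) is cross entropy with target class 0.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))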
2.4.3 Self-supervised Learning for Dense Prediction
DenseCL[54] extends this contrastive objective to a dense, per-pixel form. With r_i the local feature vector at position i of an H_f x W_f feature map, and t_i+ and t_i- its positive and negative keys, the dense loss is

$$\mathcal{L}_r = -\frac{1}{H_f W_f} \sum_i \log \frac{e^{r_i \cdot t_{i+} / \tau}}{e^{r_i \cdot t_{i+} / \tau} + \sum_{t_{i-}} e^{r_i \cdot t_{i-} / \tau}} \tag{2.2}$$
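A per-pixel sketch of equation 2.2 follows. The pairing of each local feature r_i with its positive key t_i+ (in DenseCL, the most similar local feature of the other view) is assumed to be precomputed; shapes are illustrative.

import torch
import torch.nn.functional as F

def dense_contrastive_loss(r, t_pos, t_neg, tau=0.07):
    """r: (HW, D) local features; t_pos: (HW, D) matched positive keys;
    t_neg: (HW, N, D) negative keys for each position."""
    l_pos = (r * t_pos).sum(dim=1, keepdim=True) / tau       # (HW, 1)
    l_neg = torch.einsum('pd,pnd->pn', r, t_neg) / tau       # (HW, N)
    logits = torch.cat([l_pos, l_neg], dim=1)                # positives at column 0
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    # Cross entropy averages the per-pixel terms, i.e. the 1/(Hf*Wf) sum.
    return F.cross_entropy(logits, targets)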
2.5 Weakly Supervised Learning
image segmentation masks. Bounding boxes, on the other hand, act as coarse ground truth for image segmentation tasks.
2.6 Self-Training
The labeled data is denoted as D_l, which is a set of features and labels, i.e., {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}. Similarly, the unlabeled data is denoted by D_u, which is a set of features, i.e., {x_{l+1}, ..., x_N}. N denotes the total number of samples across labeled and unlabeled data. The classic self-training algorithm is presented below as Algorithm 1.
Algorithm 1 Self-Training
1: Train a pseudo-labeler network, N_pl(.), using the labeled data, D_l.
2: repeat
3: Use N_pl to pseudo-label the unlabeled data, D_u.
4: Subset the pseudo-labeled data into S such that x ∈ D_u and (x, N_pl(x)) ∈ S.
5: Remove S from D_u.
6: Train the predictive network, N_pl, using data D = D_l ∪ S.
7: until convergence or D_u = ∅ or the iterations are over.
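A minimal sketch of Algorithm 1 follows, using a confidence threshold as the subset rule for S; the classifier interface (scikit-learn-style fit/predict_proba) and the threshold value are illustrative assumptions, since the algorithm does not prescribe how S is chosen.

import numpy as np

def self_train(model, X_l, y_l, X_u, confidence=0.9, max_iters=10):
    model.fit(X_l, y_l)                       # step 1: train the pseudo-labeler
    for _ in range(max_iters):
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)      # step 3: pseudo-label D_u
        conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= confidence             # step 4: confident subset S
        if not keep.any():
            break
        X_l = np.concatenate([X_l, X_u[keep]])     # step 5: remove S from D_u
        y_l = np.concatenate([y_l, pseudo[keep]])  # and add it to the labels
        X_u = X_u[~keep]
        model.fit(X_l, y_l)                   # step 6: retrain on D = D_l ∪ S
    return model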
Spectral clustering can also be used to perform image segmentation and has been widely used[46][7][61]. It is a method for clustering data based on the eigenvalues and eigenvectors of a matrix derived from the data. First, a similarity matrix is constructed using the pairwise distances between the data points. Then, the Laplacian matrix, which encodes the connectivity of the data points, is constructed, typically from the similarity matrix. After that, the eigenvectors of the Laplacian matrix are calculated and the data points are clustered in the low-dimensional spectral space. Finally, the clusters are mapped back to the original data space, giving different clusters corresponding to the different segments of the image.
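As an illustration, the following sketch performs these steps with scikit-learn on a small grayscale image; the Gaussian similarity function and the number of segments are illustrative choices.

import numpy as np
from sklearn.feature_extraction.image import img_to_graph
from sklearn.cluster import spectral_clustering

def spectral_segment(gray_image, n_segments=4, beta=5.0):
    # Pixel-adjacency graph whose edge weights decay with intensity
    # difference; this plays the role of the similarity matrix.
    graph = img_to_graph(gray_image)
    graph.data = np.exp(-beta * graph.data / graph.data.std())
    # Eigenvectors of the graph Laplacian are computed internally and the
    # pixels are clustered in that low-dimensional spectral space.
    labels = spectral_clustering(graph, n_clusters=n_segments,
                                 eigen_solver='arpack')
    # Map cluster labels back to the image grid.
    return labels.reshape(gray_image.shape)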
Deep learning has also been used for unsupervised image segmentation.
W-net[58] uses the encoder-decoder architecture of the convolutional neural net-
works to segment images in an unsupervised way. Other methods involving con-
volutional neural networks and backpropagation have also been studied showing
promising results[24].
Melas-Kyriazi et al.[35] propose a deep spectral approach for unsupervised semantic segmentation and localization. Specifically, the authors construct a Laplacian matrix from a combination of color information and deep features obtained in an unsupervised manner. The proposed method demonstrates superior performance compared to the state of the art for unsupervised image segmentation and object localization.

Inspired by autoregressive generative models, [42] proposes an unsupervised image segmentation method based on maximizing the mutual information between different views of the same image.
2.8 Unsupervised Learning in Satellite Images
Chapter 3. Methodology
Figure 3.1: Picture taken from the FreeSOLO paper, Figure 2, Wang et al.[53]
where cosim is the cosine similarity, Q_q ∈ R^E is the qth query, and K_{i,j} ∈ R^E is the key at position (i, j). The cosine similarity of two vectors x and y is given by the dot product of x and y divided by the product of their L2-norms:

$$\text{cosim}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|}$$
Then, the score maps are passed through min-max normalization to form soft masks, whose values lie in the range [0, 1]. A threshold, τ, is then applied to the soft masks to form binary masks. Every soft mask has a maskness score, a non-parametric score used to rank the coarse masks. The maskness score of a soft mask is given by

$$\text{maskness} = \frac{1}{N_f} \sum_i^{N_f} p_i \tag{3.2}$$

where N_f is the number of pixels whose normalized value is greater than the threshold τ, and p_i is the normalized value at the ith pixel.

Then, the masks are sorted by their maskness scores, and non-maximum suppression (NMS) is used to remove redundant masks. After redundant masks are removed, the coarse masks are obtained. The architecture diagram for this component is shown in figure 3.2.
The algorithm can be written as given in Algorithm 2.
It has already been mentioned in the free mask section how FreeSOLO[53] successfully uses the key-query mechanism in the self-supervised pre-trained model to extract coarse masks. The coarse masks are the weak labels. These masks can be incomplete supervision, as they might not contain coarse masks for all the objects. They can also be inexact supervision, because they might not segment the exact shape of the objects. And they can be inaccurate supervision, as they may extract inaccurate masks of the objects.
Algorithm 2 Free Mask
1: Use the DenseCL pre-trained backbone to generate feature maps of shape H x W x E from images of shape H x W x C.
2: Duplicate the feature maps to produce keys of shape H x W x E.
3: Perform bilinear downsampling to produce queries of smaller shape H' x W' x E.
4: Calculate the cosine similarity between keys and queries to produce score maps of shape H x W x H' x W'.
5: Use min-max normalization to normalize the score maps to the range [0, 1]; these are the soft masks. Each soft mask has a maskness score.
6: Apply a threshold to produce binary masks from the soft masks.
7: Sort the binary masks by maskness score.
8: Use NMS to further filter the binary masks, obtaining the final coarse masks.
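The following sketch walks through steps 1-6 of Algorithm 2 on a single feature map and includes the maskness computation of equation 3.2. The tensor layout, downsampling factor, and threshold are illustrative; the sorting and NMS of steps 7-8 are omitted.

import torch.nn.functional as F

def free_mask_scores(features, scale=0.5, tau=0.5):
    """features: (E, H, W) map from the DenseCL pre-trained backbone."""
    E, H, W = features.shape
    keys = features
    queries = F.interpolate(features[None], scale_factor=scale,
                            mode='bilinear', align_corners=False)[0]
    # Cosine similarity between every query and every key gives the score maps.
    k = F.normalize(keys.reshape(E, -1), dim=0)          # (E, H*W)
    q = F.normalize(queries.reshape(E, -1), dim=0)       # (E, H'*W')
    scores = (q.T @ k).reshape(-1, H, W)                 # one map per query
    # Min-max normalize each score map into a soft mask in [0, 1].
    mins = scores.amin(dim=(1, 2), keepdim=True)
    maxs = scores.amax(dim=(1, 2), keepdim=True)
    soft = (scores - mins) / (maxs - mins + 1e-6)
    binary = soft > tau                                  # threshold tau
    # Maskness (equation 3.2): mean soft value over the foreground pixels.
    maskness = (soft * binary).sum(dim=(1, 2)) / binary.sum(dim=(1, 2)).clamp(min=1)
    return binary, maskness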
For the max and average projection losses,

$$\mathcal{L}_{\text{max proj}} = L(\max_x(m), \max_x(m^*)) + L(\max_y(m), \max_y(m^*)) \tag{3.3}$$

$$\mathcal{L}_{\text{avg proj}} = L(\text{avg}_x(m), \text{avg}_x(m^*)) + L(\text{avg}_y(m), \text{avg}_y(m^*)) \tag{3.4}$$

where L(., .) represents the Dice loss[36], max_x and avg_x represent the maximum and average projections along the x-axis, and max_y and avg_y represent the maximum and average projections along the y-axis.
They also have another loss component called the pairwise affinity loss[52], which tends to put neighboring, similarly valued raw pixels in the same instance. The total mask loss combines the projection losses with this pairwise affinity loss.
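A minimal sketch of the projection losses of equations 3.3 and 3.4 follows, with the Dice loss written out explicitly; the axis convention is illustrative and the pairwise affinity term is omitted.

import torch

def dice_loss(p, t, eps=1e-6):
    # Dice loss between two 1-D projection profiles.
    inter = (p * t).sum()
    return 1 - (2 * inter + eps) / ((p * p).sum() + (t * t).sum() + eps)

def projection_losses(m, m_star):
    """m: (H, W) predicted soft mask; m_star: (H, W) coarse mask, both in [0, 1].
    The coarse mask supervises the prediction only through its projections."""
    l_max = (dice_loss(m.amax(dim=0), m_star.amax(dim=0)) +   # profile over x
             dice_loss(m.amax(dim=1), m_star.amax(dim=1)))    # profile over y
    l_avg = (dice_loss(m.mean(dim=0), m_star.mean(dim=0)) +
             dice_loss(m.mean(dim=1), m_star.mean(dim=1)))
    return l_max, l_avg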
3.3 Self-Training
The masks predicted by the SOLO-based segmenter are better (both qualitatively and quantitatively) than the free masks. As part of self-training, the top predicted masks (predicted by the SOLO-based segmenter, not the free mask extractor) are sent to the weakly supervised algorithm again as weak labels, and the SOLO-based segmenter is retrained. The performance of the SOLO-based segmenter increases with this approach. In this way, FreeSOLO makes use of self-training with the SOLO-based architecture.

The free mask extractor is denoted as M_c, and it is used in the first step to extract coarse masks from an unlabeled image. Using the coarse masks and the SOLO-based architecture, the pseudo-labeler network (N_pl) is trained with weakly supervised learning. The masks output by the pseudo-labeler network are symbolized as M_i, and the subset of top instance masks as M_i^t. The algorithm for FreeSOLO self-training is presented below as Algorithm 3. It repeats until the performance stops increasing (i.e., convergence), no top instance masks remain, or a fixed number of iterations is reached.
Algorithm 3 FreeSOLO Self-Training
1: Extract coarse masks (M_c) using the free mask extractor.
2: Train the pseudo-labeler network (N_pl), a SOLO-based segmenter, on the coarse masks using weakly supervised learning.
3: repeat
4: Use N_pl to extract instance masks, M_i.
5: Subset the top instance masks based on their confidence scores.
6: Train N_pl in a weakly supervised fashion using the top instance masks, M_i^t ⊆ M_i.
7: until convergence or M_i^t = ∅ or the iterations are over.
The architecture is different from that of SOLO because this architecture has two components in the category head. The first component is the standard Focal loss[29]. The second component is a categorical semantic loss, which learns to match the embedding output by the free mask extractor. If the embedding given by the free mask is q* and the predicted embedding is q, the negative cosine similarity loss is given by:
$$\mathcal{L}_{\text{sem}} = 1 - \frac{q}{\|q\|_2} \cdot \frac{q^*}{\|q^*\|_2} \tag{3.6}$$
This loss is added to the Focal loss to give the total category loss. The total loss of the whole architecture then becomes the sum of the losses from the category branch and the mask branch; the mask branch loss is given by equation 2.7 whereas the category branch loss is given by equation 2.9.
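Equation 3.6 is a one-liner; the sketch below assumes 1-D embedding vectors and notes in a comment how it combines with the Focal loss into the total category loss.

import torch.nn.functional as F

def semantic_category_loss(q, q_star):
    # Negative cosine similarity between the predicted embedding q and the
    # free-mask embedding q_star (equation 3.6).
    return 1 - F.cosine_similarity(q.flatten(), q_star.flatten(), dim=0)

# Total category loss = focal_loss(cls_logits, cls_targets)
#                       + semantic_category_loss(q, q_star)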
3.5 Benchmark
In this study, three distinct satellite imagery datasets are utilized, namely iSAID[56], CrowdAI[37], and PASTIS[44], to benchmark the efficacy of the FreeSOLO[53] method for class-agnostic unsupervised instance segmentation in satellite images. Due to the significant size of the images in the iSAID dataset, the images are partitioned into 256 pixel by 256 pixel tiles and the experiments are conducted on them. Both the smaller images and the corresponding smaller annotations had to be created from the dataset, still aligning with the COCO format, and the benchmark was established based on the outcomes obtained from the 256 x 256-sized images and annotations. The PASTIS dataset, on the other hand, was not in COCO format, so data preprocessing was done to bring its images and annotations to the COCO format.
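The iSAID tiling step can be sketched as follows; non-overlapping tiles and offset bookkeeping are illustrative choices, and the conversion of the clipped annotations to COCO format is a separate step not shown.

def tile_image(image, tile=256):
    """Cut a large (H, W, C) array into non-overlapping tile x tile patches,
    recording each patch's offset so that predicted masks can later be
    merged back into a full-size mask."""
    patches = []
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patches.append(((y, x), image[y:y + tile, x:x + tile]))
    return patches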
The present study utilizes pre-trained weights sourced from the ResNet-101 backbone of the FreeSOLO model, which were trained through the FreeSOLO method of self-supervised learning. These weights serve as an encoder, which is integrated with a semantic segmentation architecture. To investigate the efficacy of this approach, experiments were conducted using both the feature pyramid network and the U-Net architecture with the FreeSOLO ResNet-101 backbone. The encoder weights were kept frozen, while the decoder network was trained in a supervised manner using the same training data. The resulting transfer learning methodology employed the self-supervised learning-based pre-trained weights of the backbone for the downstream task of semantic segmentation. Finally, the model was evaluated on held-out test data and compared against supervised learning-based embeddings.
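A sketch of this frozen-encoder transfer setup follows. The segmentation_models_pytorch library is an assumed implementation choice (the thesis does not name one), and the checkpoint path, class count, and learning rate are hypothetical placeholders; checkpoint keys may need remapping to the library's encoder naming.

import torch
import segmentation_models_pytorch as smp

# U-Net with a ResNet-101 encoder; encoder_weights=None because the weights
# come from the FreeSOLO-trained backbone instead of ImageNet supervision.
model = smp.Unet(encoder_name="resnet101", encoder_weights=None,
                 in_channels=3, classes=6)

# Hypothetical FreeSOLO backbone checkpoint.
state = torch.load("freesolo_resnet101_backbone.pth", map_location="cpu")
model.encoder.load_state_dict(state, strict=False)

# Freeze the encoder; only the decoder is trained in a supervised manner.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)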
Figure 3.2: Free mask[53] architecture
Chapter 4. Results and Visualizations
4.1 MBRSC Dubai Aerial Imagery Dataset
All the satellite images of the MBRSC Dubai Aerial Imagery Dataset are run through the free mask extractor, and sets of free masks are produced. The free masks give coarse masks of the images corresponding to different instances. It is to be noted that different instances of objects are segmented here, not different semantic categories. The visualization presented below might give the misconception that the segmenter is segmenting semantic classes; it is not. It is segmenting different instances of objects, agnostic to the semantic classes. However, the visualization is prepared such that similar semantic instances are given the same color for the convenience of the reader and better interpretability.
Figure 4.1 shows the qualitative results of free mask[53] on the MBRSC
Dubai Aerial Imagery dataset. As can be seen, the water bodies (lakes and
rivers) and settlement areas are segmented as different instances. From the view of satellites, regions of stuff such as lakes and settlements appear like instances of everyday objects. So, the technique of instance segmentation can be extended to segmenting regions of stuff in satellite images.

1 https://www.kaggle.com/datasets/humansintheloop/semantic-segmentation-of-aerial-imagery
2 https://www.kaggle.com/
3 https://www.mbrsc.ae/
Figure 4.2 shows more visualizations of the free mask output. It is fascinating to see how this approach can segment complex shapes such as rivers too (as demonstrated in the second row of Figure 4.2). One thing worth noticing is that this approach does not segment coarse masks as well for big regions of stuff as it does for small regions of stuff. This makes some sense, as the DenseCL[54] pre-trained model was trained on everyday objects, which are things rather than regions of stuff. Even if everyday-object images contain some regions of stuff, they are not as frequent as in satellite images.
The satellite images of the MBRSC Dubai Aerial Imagery Dataset are then run through the FreeSOLO[53] model. The model performs better instance segmentation than free mask, as seen by comparison with figure 4.1 and figure 4.2. The same images used to visualize the free mask[53] results are used to visualize the FreeSOLO[53] model results for visual comparative analysis.
4.1.3 Comparison
Figure 4.1: Coarse masks extracted using free mask[53] on MBRSC dataset
2. Shape: The shape of the instances is better preserved by the FreeSOLO model than by free mask. This makes sense, as the purpose of free mask is to generate coarse masks for weakly supervised learning. This is validated by the visualization results too.
As shown in figure 4.5, several meaningful objects were segmented. The model performed best in segmenting the UAH university lake, which it segmented in full. As is also the case with the MBRSC Dubai dataset, the model seems to perform better at segmenting water bodies like lakes and rivers. The lake segmentation produced by the model is given a blue mask.
In addition to the lake, the model also performs well in segmenting visually identifiable buildings, especially ones with rectangular shapes. The buildings are given a light blue mask color. The following buildings were segmented by the model.
2. 4800 Bradford Dr NW
3. 408 Allen St NW
The third class of objects that the model was able to segment was housing areas, which are given a cyan mask color. It can be seen that the model segments the fraternity row, the Southeast Campus Housing, and the housing areas around the back entrance of UAH. It is also interesting to see the model segment some portion of the I-565 highway; roads and highways are given a yellow mask color. In addition, the model segments the parking lots of Teledyne Brown Engineering and Charger Park, and two green areas covered with trees. The parking lots are given a reddish mask color and the treed areas a green mask color.
agricultural land and a void label for areas that are primarily located outside the
patch.
Figures 4.6 and 4.7 show visualizations of the instance segmentation on the iSAID dataset. First, the iSAID satellite images are split into 256 pixel by 256 pixel smaller images, and then the FreeSOLO model is applied to the smaller images. After the instance segmentation results are obtained, the predicted masks are merged together to form a mask of the shape of the original image.
Figures 4.9 and 4.10 show visualizations of the instance segmentation on two samples of the AICrowd dataset. Here, inference is carried out on the images as they are; they are not split further because the images are medium resolution and do not cover a large area, unlike the iSAID dataset. In this dataset too, the model segments other instances such as roads, non-building structures, road curbs, and trees.
The three visualizations in figure 4.11, figure 4.12, and figure 4.13 show the results of FreeSOLO instance segmentation on the PASTIS dataset. Each image in this dataset has a dimension of 128 pixels by 128 pixels, which is comparatively smaller than the other datasets; for this reason, the images are not divided further. The visualizations reveal the segmentation of agricultural parcels, but it is worth noting that a single segment may contain multiple parcels. The model appears to have difficulty distinguishing between separate parcels, unlike with everyday objects.
4.3.2 Quantitative Results
Table 4.1 shows the instance segmentation results in terms of average precision and average recall metrics. The FreeSOLO model is a class-agnostic unsupervised instance segmentation model, so it segments instances of all the objects found in the image. However, the datasets have instances belonging to a limited number of classes. So, even when the FreeSOLO model correctly segments instances of a category not contained in the annotations, the evaluation flags them as false positives. As a result, the average precision scores are limited to single-digit figures. However, it is also to be noted that the state-of-the-art class-agnostic unsupervised instance segmentation technique, i.e., FreeSOLO on the COCO dataset, has average precision scores in the single digits as well. The relatively higher values of recall validate the hypothesis that precision suffers because of incomplete instance annotations. Figure 4.8 shows the visualization of the instance segmentation on one of the iSAID images. As can be seen, the model segments big trucks well, but it also segments a large portion of land that is not present in the annotations; the precision metric suffers somewhat because of this too. The instances of PASTIS data did not include large objects (according to the COCO definition) as the images themselves were of small dimension, i.e., 128 pixels by 128 pixels. So, the average precision and average recall for large objects are given an n/a value.
Table 4.1: Class-agnostic instance segmentation for iSAID and CrowdAI dataset
Table 4.2: SL vs SSL based encodings on semantic segmentation downstream task
Similarly, the pre-trained weights of the ResNet-101 architecture trained on ImageNet[9] through FreeSOLO[53] training are taken and used as a feature extractor. This feature extractor was trained in a self-supervised fashion, so it did not require any labels. Then, the same set of architectures, i.e., feature pyramid network and U-Net, is used as above to carry out semantic segmentation on the satellite images. Comparable performance was achieved with respect to the IOU score and Dice coefficient on the held-out test dataset (shown in table 4.2). It is impressive to see self-supervised learning-based pre-trained weights perform comparably to supervised learning-based pre-trained weights. This is advantageous because most of the time labels are not available, as obtaining labels is costly and time-consuming. So, a self-supervised learning framework can be used to learn embeddings for satellite images, and, as shown above, comparable performance can be achieved in downstream tasks. The result produced by the SSL-based method is given in figure 4.15.
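For reference, the two metrics of table 4.2 can be computed on binary masks as follows; the epsilon guard is an illustrative numerical-stability choice.

import numpy as np

def iou_and_dice(pred, target, eps=1e-6):
    """IOU score and Dice coefficient between two binary (H, W) masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    return iou, dice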
Figure 4.2: Additional coarse masks extracted using free mask[53] on MBRSC dataset
Figure 4.3: Predicted masks using FreeSOLO[53] model on MBRSC dataset
Figure 4.4: Additional predicted masks using FreeSOLO[53] model on MBRSC
dataset
Figure 4.5: FreeSOLO[53] predicted mask in PlanetScope UAH image
Figure 4.7: Additional instance segmentation in iSAID satellite image
Figure 4.9: Class-agnostic instance segmentation in a sample AICrowd satellite image
Figure 4.11: Class-agnostic instance segmentation in a sample PASTIS satellite image
Figure 4.13: Additional class-agnostic instance segmentation in a sample PASTIS
satellite image
Figure 4.15: ImageNet[9] self-supervised learning based pre-trained model on down-
stream segmentation task
Chapter 5. Limitations and Applications
5.1 Limitations
5.2 Applications
features or objects within an image, scene, or spatial boundary. Additionally, this method can be particularly useful for tracking changes over time, especially when using medium-resolution, high-frequency data such as PlanetScope. By analyzing such changes over time, FreeSOLO combined with image classification can help researchers gain valuable insights into environmental and land-use changes.
Chapter 6. Conclusion and Future Research
6.1 Conclusion
FreeSOLO's performance on the COCO dataset is an AP50 of 9.8%, so the results are explainable given the limited number of categories in the annotations and the domain gap.
Comparable performance of the pre-trained weights trained using self-supervised learning, i.e., FreeSOLO[53], was also shown with respect to pre-trained weights trained using supervised learning on a downstream semantic segmentation task across different semantic segmentation architectures. The Dubai[20] dataset was used to quantitatively compare the pre-trained weights using the Dice coefficient and IOU score metrics. The self-supervised learning-based pre-trained weights gave an IOU score of 0.50 and a Dice coefficient of 0.62, compared to the supervised learning-based pre-trained weights (best IOU score of 0.48 and Dice coefficient of 0.63). This is advantageous because labels are not available for the massively available PlanetScope[48] satellite data. The embeddings can be pre-trained in a self-supervised learning fashion as described in DenseCL[54] and FreeSOLO[53], and later used in downstream tasks.
Currently, the coarse masks are extracted through the DenseCL[54] model, which is pre-trained on the ImageNet[9] database. The database contains images of everyday objects, yet these methods are being applied and tested on satellite images. So, in the future, a model similar to DenseCL[54] can be pre-trained on just satellite images. This is possible because of its unsupervised nature: it does not need any labels to do self-supervised learning. The first line of future research, then, is learning a DenseCL[54]-based pre-trained model on just satellite images. When extracting the free masks, the score maps are currently computed by taking the cosine similarity of all the queries with all the keys, which requires the free mask approach to filter the masks using the NMS method. This methodology might be improved by using the Swin Transformer[32] architecture, because the Swin Transformer makes use of shifted-window attention, which might naturally provide non-redundant masks since the attention mechanism is applied within each window.
The second line of research can be training a weakly supervised learning-based model on top of free masks extracted by a satellite-image pre-trained model. This would be similar to FreeSOLO[53] but specific to satellite images, with a satellite-image-focused architecture. Better segmentation models could be obtained by this method, making it interesting and impactful research for the future.
In this research, transfer learning experiments are done by extracting features from the pre-trained ResNet-101 backbone, and all the weights of the decoder part are learned for semantic image segmentation. In the future, the SOLOv2[55] architecture can be used to perform semantic image segmentation, with its decoder network reused as well.
References
[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learn-
ing: A review and new perspectives. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 35(8):1798–1828, 2013.
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-
time instance segmentation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV), October 2019.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang
Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid
task cascade for instance segmentation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 4974–4983,
2019.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A
simple framework for contrastive learning of visual representations. In Inter-
national conference on machine learning, pages 1597–1607. PMLR, 2020.
[6] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved base-
lines with momentum contrastive learning. arXiv preprint arXiv:2003.04297,
2020.
[7] Wang Chongjun, Ding Lin, Tian Juan, Chen Shifu, et al. Image segmentation
using spectral clustering. In 17th IEEE International Conference on Tools
with Artificial Intelligence (ICTAI’05), pages 2–pp. IEEE, 2005.
[8] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic in-
stance segmentation with a discriminative loss function. arXiv preprint
arXiv:1708.02551, 2017.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Im-
agenet: A large-scale hierarchical image database. In 2009 IEEE conference
on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[11] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual
representation learning by context prediction. In Proceedings of the IEEE
international conference on computer vision, pages 1422–1430, 2015.
[12] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and
Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid.
In Proceedings of the IEEE/CVF International Conference on Computer Vi-
sion, pages 642–651, 2019.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative ad-
versarial networks. Communications of the ACM, 63(11):139–144, 2020.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momen-
tum contrast for unsupervised visual representation learning. In Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition,
pages 9729–9738, 2020.
[15] Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking imagenet pre-
training. In 2019 IEEE/CVF International Conference on Computer Vision
(ICCV), pages 4917–4926, 2019.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn.
In Proceedings of the IEEE International Conference on Computer Vision
(ICCV), Oct 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition,
pages 7132–7141, 2018.
[19] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages 4700–4708, 2017.
[21] Neal Jean, Sherrie Wang, Anshul Samar, George Azzari, David Lobell, and
Stefano Ermon. Tile2vec: Unsupervised representation learning for spatially
distributed data. In Proceedings of the AAAI Conference on Artificial Intel-
ligence, volume 33, pages 3967–3974, 2019.
[22] Alan Jose, S. Ravi, and M. Sambath. Brain tumor segmentation using k-means clustering and fuzzy c-means algorithms and its area calculation. International Journal of Innovative Research in Computer and Communication Engineering, 2:3496–3501, 2014.
[23] Rohini Paul Joseph, C Senthil Singh, and M Manikandan. Brain tumor mri
image segmentation and detection in image processing. International Journal
of Research in Engineering and Technology, 3(1):1–5, 2014.
[25] Shiyi Lan, Zhiding Yu, Christopher Choy, Subhashree Radhakrishnan, Guilin
Liu, Yuke Zhu, Larry S Davis, and Anima Anandkumar. Discobox: Weakly
supervised instance segmentation and semantic correspondence from box su-
pervision. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3406–3416, 2021.
[26] Yann LeCun, Koray Kavukcuoglu, and Clement Farabet. Convolutional net-
works and applications in vision. In Proceedings of 2010 IEEE International
Symposium on Circuits and Systems, pages 253–256, 2010.
[27] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolu-
tional instance-aware semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2359–2367,
2017.
[28] Ming Lin, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin.
Neural architecture design for gpu-efficient networks. arXiv preprint
arXiv:2006.14090, 2020.
[29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
Focal loss for dense object detection. In Proceedings of the IEEE international
conference on computer vision, pages 2980–2988, 2017.
[31] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation
network for instance segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 8759–8768, 2018.
[32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen
Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer
using shifted windows. In Proceedings of the IEEE/CVF international con-
ference on computer vision, pages 10012–10022, 2021.
[33] Oscar Manas, Alexandre Lacoste, Xavier Giró-i Nieto, David Vazquez, and
Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncu-
rated remote sensing data. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 9414–9423, 2021.
[34] Manil Maskey, Alfreda Hall, Kevin Murphy, Compton Tucker, Will McCarty,
and Aaron Kaulfus. Commercial smallsat data acquisition: Program up-
date. In 2021 IEEE International Geoscience and Remote Sensing Sympo-
sium IGARSS, pages 600–603, 2021.
[35] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi.
Deep spectral methods: A surprisingly strong baseline for unsupervised se-
mantic segmentation and localization. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages 8364–8375, 2022.
[36] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully
convolutional neural networks for volumetric medical image segmentation.
In 2016 fourth international conference on 3D vision (3DV), pages 565–571.
IEEE, 2016.
[38] Anupurba Nandi. Detection of human brain tumour using mri image segmen-
tation and morphological operators. In 2015 IEEE International Conference
on Computer Graphics, Vision and Information Security (CGVIS), pages
55–60, 2015.
[39] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-
to-end learning for joint detection and grouping. Advances in neural infor-
mation processing systems, 30, 2017.
[40] H.P. Ng, S.H. Ong, K.W.C. Foong, P.S. Goh, and W.L. Nowinski. Med-
ical image segmentation using k-means clustering and improved watershed
algorithm. In 2006 IEEE Southwest Symposium on Image Analysis and In-
terpretation, pages 61–65, 2006.
[41] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning
with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[42] Yassine Ouali, Céline Hudelot, and Myriam Tami. Autoregressive unsuper-
vised image segmentation. In Andrea Vedaldi, Horst Bischof, Thomas Brox,
and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 142–
158, Cham, 2020. Springer International Publishing.
[43] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional
networks for biomedical image segmentation. In Medical Image Comput-
ing and Computer-Assisted Intervention–MICCAI 2015: 18th International
Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18,
pages 234–241. Springer, 2015.
[44] Vivien Sainte Fare Garnot and Loic Landrieu. Panoptic segmentation of
satellite image time series with convolutional temporal attention networks.
ICCV, 2021.
[46] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation.
IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–
905, 2000.
[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[48] Planet Team. Planet application program interface: In space for life on earth,
2017–.
[49] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview
coding. In European conference on computer vision, pages 776–794. Springer,
2020.
[50] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for in-
stance segmentation. In European conference on computer vision, pages 282–
298. Springer, 2020.
[51] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-
performance instance segmentation with box annotations. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 5443–5452, June 2021.
[52] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-
performance instance segmentation with box annotations. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 5443–5452, 2021.
[53] Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandku-
mar, Chunhua Shen, and Jose M. Alvarez. Freesolo: Learning to segment
objects without annotations. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages 14176–14186,
June 2022.
[54] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense
contrastive learning for self-supervised visual pre-training. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 3024–3033, 2021.
[55] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Solo: A
simple framework for instance segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2021.
[56] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei
Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang
Bai. isaid: A large-scale dataset for instance segmentation in aerial images.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 28–37, 2019.
[57] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised
feature learning via non-parametric instance discrimination. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages
3733–3742, 2018.
[58] Xide Xia and Brian Kulis. W-net: A deep model for fully unsupervised image
segmentation. arXiv preprint arXiv:1711.08506, 2017.
[59] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpaint-
ing with deep neural networks. Advances in neural information processing
systems, 25, 2012.
[60] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He.
Aggregated residual transformations for deep neural networks. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages
1492–1500, 2017.
[61] Shan Zeng, Rui Huang, Zhen Kang, and Nong Sang. Image segmentation us-
ing spectral clustering of gaussian mixture models. Neurocomputing, 144:346–
356, 2014.