Urban Land Cover Classification Using Deep Learning

Prof. Rushali Patil, Priyanshu Rawat, Abhishek Kumar, Surender Singh, Anupam Yadav
Computer Department, Army Institute of Technology, Pune, India

Abstract—Accurate urban land-cover mapping is crucial for environmental monitoring, resource management, and planning. Rapid improvements in high-resolution imaging have produced a variety of deep learning algorithms that aim to overcome the limitations of traditional categorization, notably the complex spatial and contextual relationships prevalent in metropolitan areas. In surveying the main methodologies for land-cover categorization in cities, the focus is predominantly on multiscale context-aware models that pair CNNs with attention mechanisms and feature fusion strategies to boost precision. One dataset we discuss in detail is WH-MAVS, which specifically caters to high-resolution segmentation, detection, and LULC classification tasks and allows deep learning models to be assessed under diverse LULC settings in urban environments. We also examine EuroSAT, a multispectral dataset of Sentinel-2 imagery, notably its application to identifying distinct land types with very high spectral sensitivity. In addition, multi-sensor fusion approaches are presented that combine optical and radar data, as in the 2018 IEEE GRSS Data Fusion Contest; these methods address data limitations by improving classification accuracy through the complementary capabilities of diverse sensors, and are consequently best suited for applications that require robust identification of complex urban features. We analyze the various approaches and datasets and describe each model's relevance, scalability, and performance over a range of urban datasets. This research opens opportunities for establishing scalable, high-precision urban land-cover categorization algorithms with prospective applications in urban planning, disaster resilience, and sustainable development.

Index Terms—CNN, Swin Transformer, Land-Cover Classification, Semantic Segmentation, Remote Sensing

I. INTRODUCTION

Urban land-cover classification underpins much of urban planning and environmental management, particularly against the backdrop of rapid urban growth. High-resolution satellite imagery has become an important resource for tracking urban growth, monitoring infrastructure, and assessing environmental impact. This visual data supports a granular understanding of varied land-cover types, from residential areas and roads to vegetation and water bodies, which helps in making informed decisions in urban development, disaster prevention, and sustainable planning.

The structural complexity of urban environments makes land-cover classification difficult, especially given interclass similarity between different categories and intraclass variation within the same category. Methods such as SVM and RF rely on manual feature extraction and are therefore less adaptable to diverse and complex urban landscapes. Recent developments in deep learning have transformed the field: convolutional neural networks extract features automatically, end to end, and deliver high classification accuracy. The Multiscale Context-Aware Feature Fusion Network [1], for example, handles cross-scale feature correlation well, which suits high-resolution segmentation of densely built urban imagery and mitigates the problems caused by both intraclass and interclass variation.

Significant contributions also arise from unsupervised approaches, notably "Unsupervised Land Cover Classification of Hybrid and Dual-Polarized Images" [23], in which a CNN extracts features from SAR imagery for classification without any training data, offering flexibility and adaptability across differing environments. "Vegetation Land Use/Land Cover Extraction" [2] likewise employs a focus-perception module and tackles the pixel-level problems this task presents with adaptive context inference for refining the vegetation classification process. Work has progressed further toward fusing multisource image time series, as in "Semantic Segmentation of Land Cover in Urban Areas by Fusing Multisource Satellite Image Time Series" [3].
This technique improves temporal resolution, giving insight into seasonal and other temporal changes, which is important for capturing land-cover change.

High-resolution SAR images are also applied to urban classification, as in "Land Cover Semantic Annotation Derived from High Resolution SAR Images" [24]. Detailed classification is possible because SAR can image urban structures regardless of weather conditions, providing good data continuity and model robustness across different urban scenarios.

This paper discusses these key methods and the challenges they overcome, especially in urban land-cover classification using advanced deep learning architectures. Through an evaluation of the effectiveness of these models across various datasets and identification of current limitations, we hope to provide insights into future directions for robust urban land-cover classification.

II. LITERATURE SURVEY

Recent advances in urban land cover classification have emphasized the necessity of multi-scale feature fusion, attention mechanisms, and context-aware learning to counter the challenges of high-resolution remote sensing imagery. Abubakar Siddique et al. [1] present MCAFNet, a model specially designed to enhance classification accuracy in cluttered urban environments. Their paper investigates three major challenges in high-resolution image segmentation: interclass similarity, wherein visually similar classes like roads and buildings are hard to separate; intraclass variation, which covers variation within a class, such as differing vegetation types causing inconsistent labels; and artifacts, like checkerboard patterns and blurred edges, which undermine segmentation accuracy.

To solve these problems, MCAFNet [1] combines three core modules that serve the overall goals of multi-scale feature integration and edge-aware learning. The Multiscale Feature Enhancement (MFE) Module employs convolutional layers with a range of kernel sizes to extract fine and coarse details at various spatial scales, and incorporates an attention mechanism to strengthen relevant feature representations and suppress background noise. The Multilayer Feature Fusion (MLF) Module is an efficient skip-connection module that combines features at different hierarchical levels to learn interdependencies and improve segmentation accuracy. The final Pixel-Shuffle Decoder (PSD) employs an adaptive upsampling mechanism to reduce checkerboard artifacts and blurred edges, enhancing spatial resolution without introducing additional parameters. These design choices yield strong segmentation performance on several benchmark datasets, with an Overall Accuracy (OA) of 93.51% on the Potsdam dataset and 90.18% on Vaihingen, and a Mean Intersection over Union (mIoU) of 73.73% on the DeepGlobe dataset, outperforming several state-of-the-art models.
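To make the multiscale idea concrete, a parallel multi-kernel block in the spirit of the MFE module can be sketched as follows (a minimal PyTorch sketch under assumed kernel sizes and channel counts; the actual MCAFNet layout, including its attention stage, is more elaborate):

```python
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Parallel convolutions with different kernel sizes capture fine and
    coarse detail at several spatial scales before channel concatenation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5)
        ])

    def forward(self, x):                                 # x: (B, in_ch, H, W)
        # Each branch sees the same input at a different receptive field.
        return torch.cat([b(x) for b in self.branches], dim=1)
```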
The necessity of context-aware segmentation is highlighted by Zhan et al. [2], who introduce the Adaptive Context Inference (ACI) model with a Focus Perception (FP) module to improve boundary definition in adverse conditions. The ACI model uses spatial context to improve segmentation accuracy by modeling co-occurrences among neighboring pixels. This yields clear object boundaries and reduces segmentation errors in high-variability regions such as vegetation cover. The FP module improves the detection of vegetation-related features, supporting the idea of improved edge awareness in segmentation models. Performance measurements on datasets like ZY-3 and the Gaofen Image Dataset (GID) show that ACI outperforms models such as DeepLab v3+, confirming the need for models that explicitly account for edge information to improve accuracy in land-cover mapping.

To broaden the scope of feature representation and fusion, Jining Yan et al. [3] propose the Multi-source Temporal Attention Fusion-based Temporal-Spatial Transformer (MTAF-TST), a model intended to fuse optical and Synthetic Aperture Radar (SAR) satellite image time series (SITS). Their contribution tackles the diversity of the urban environment, incompleteness in temporal and spatial data, and the drawbacks of conventional feature fusion mechanisms, which tend not to exploit the synergistic nature of multi-source imagery. The Temporal-Spatial Transformer (TST) module plays a crucial role in capturing long-distance spatial and temporal relationships through self-attention, allowing important patterns across sensor modalities to be identified correctly. In addition, rather than simple feature concatenation, which can introduce extraneous noise, the MTAF module dynamically adjusts the relative weight of radar and optical information, generating a stronger feature representation. The model achieves a 2.42% boost in Overall Accuracy (OA), a 2.26% improvement in F1-score, and a 3.09% improvement in mean Intersection over Union (mIoU) compared to established benchmarks, showing particular superiority in complicated urban classes such as vegetation, aquatic bodies, and built areas. These results further confirm the effectiveness of attention-based multi-source fusion in improving segmentation accuracy.
On the Wuhan dataset as well, MTAF-TST led in accuracy, particularly for complex land covers like vegetation, water bodies, and built-up areas, which makes it a valuable tool for near-real-time monitoring of seasonal and environmental change within urban areas.

Chunping Qiu et al. [4] proposed a method for enhanced urban land cover mapping from Sentinel-2 imagery using a Multibranch Residual Convolutional Neural Network (MB-RCNN). Four ResNet branches process each seasonal image individually, followed by decision-level fusion. The method addresses critical challenges in urban satellite imagery, such as cloud coverage, seasonal variation, and class imbalance, which are typical of such images.

The multi-branch structure allows features to be learned at various scales, improving the extraction of global seasonal patterns alongside local fine-grained textures. The paper demonstrates that classification accuracy improves to 86.7% on average over European urban regions with the use of multi-scale feature fusion. The multi-scale feature fusion concept in this study is foundational to the proposed architecture, where feature maps from different layers are combined to boost spatial resolution and semantic consistency.

In a bid to enhance classification performance further, Bing Liu et al. [5] presented TrmGLU-Net, a network that integrates Transformers and Convolutional Neural Networks (CNNs) to encode global context and local spatial information. Window-Based Multi-Head Self-Attention (W-MSA) is employed to learn long-range relationships in hyperspectral images efficiently, while convolutional layers preserve fine-grained spatial information.

The hybrid architecture improves segmentation performance in low-data situations by merging global and local features. The study attained improved performance on the University of Pavia and Salinas datasets, supporting the model's robustness for urban land cover mapping. This combined feature extraction strategy is well established in the proposed model, which uses VGG16 for local feature extraction and a Swin Transformer for global dependency extraction to enable a better understanding of the urban context.

Leo Thomas Ramos et al. [6] describe how Multispectral Semantic Segmentation (MSSS) is utilized for urban land cover classification based on multispectral imaging (MSI) in bands such as Near-Infrared (NIR) and Short-Wave Infrared (SWIR). UNet and SegNet models were employed to classify each pixel by its spectral signature, accurately distinguishing urban vegetation, water bodies, and built-up areas.

To improve precision, the research utilized an adaptive feature enhancement module that selectively enhances discriminative spectral features and suppresses background noise, improving the classification of visually similar land-cover classes. The adaptive feature refinement method applied there maps directly onto the gated feature selection module of the proposed model, providing better segmentation and noise elimination with jointly fine-tuned feature maps.

Wang et al. [7] offer a new approach to urban land-cover classification with the Dynamic Convolution Self-Attention Network (DCSA-Net), designed to address feature redundancy, computational inefficiency, and multi-scale spatial feature extraction challenges in remote sensing imagery. Traditional convolution-based networks tend to be computationally intensive because their fixed-size convolutional kernels lead to redundant feature extraction. To overcome this limitation, DCSA-Net integrates a Lightweight Dynamic Convolution Module (LDCM) that dynamically adapts convolutional kernels to the input features. This adaptation allows the model to suppress irrelevant information and thereby improve efficiency while maintaining high classification performance. This dynamic convolution idea can be applied directly to our architecture, as we also aim to reduce computational overhead by fine-tuning feature selection at different stages of our hybrid CNN-Transformer model.

One of the key contributions of that paper is the Context Information Aggregation Module (CIAM), which employs self-attention to expand the receptive field of convolutional layers and capture long-range dependencies. This addresses one of the biggest challenges in urban classification: fine details such as roads and foliage must be properly distinguished from nearby constructions. The self-attention mechanism enhances the model's capacity to retain spatial and contextual relationships, enabling more effective segmentation. In our architecture, we employ Swin Transformer layers toward a similar goal, using multi-head self-attention to perform global feature extraction with spatial coherence and thereby strengthening the combination of CNNs with Transformer-based methods for urban scene segmentation.

DCSA-Net also offers a ladder-structured hierarchical feature fusion mechanism that merges low- and high-resolution feature maps to better handle scale variation. This mechanism ensures that small features, such as individual trees, and large urban features, such as buildings, are classified with equal precision. Our method likewise uses a multi-scale feature fusion approach in which local spatial features from CNNs are combined with global embeddings from a Transformer, facilitating better segmentation performance.
Experimental results on the Potsdam and Vaihingen datasets confirm that DCSA-Net outperforms state-of-the-art techniques such as DeepLab v3 and MANet, with a 93.51% overall accuracy on Potsdam and a large gain in mean intersection-over-union (mIoU) scores. These results validate the effectiveness of fusing self-attention with hierarchical feature fusion, making that work highly relevant to our research. Following these principles, our model enhances feature refinement and global awareness, aiming for better performance in urban land-cover classification.

Yu et al. [8] introduce the CMAAC (Combining Multi-Attention and Asymmetric Convolution) framework to enhance hyperspectral image classification. Traditional convolutional neural networks are limited in processing high-dimensional spectral features, spectral mixing, and sample imbalance, which makes it difficult to classify land-cover types effectively. To address these limitations, CMAAC combines three basic modules: the Channel-Convolutional Long Short-Term Mechanism (CLMS), the Pyramid-Enhanced Attention Mechanism (PEAM), and the Asymmetric Convolution Structure (ACS). The CLMS module enhances spectral-spatial feature learning by modeling spectral dependencies and multi-order spatial interactions, a critical aspect of hyperspectral data. The PEAM module forms adaptive receptive fields at multiple scales, enabling the model to capture fine spatial details and remove background noise. The ACS module focuses on refining boundary features, especially edge regions, using asymmetric convolutional filters that concentrate on object boundaries and improve segmentation performance.

This paper relates directly to our architecture in its focus on multi-attention mechanisms and feature aggregation. Our proposed model includes comparable attention-based feature selection, utilizing self-attention in the Swin Transformer to improve contextual understanding at various spatial levels. Additionally, the pyramid-enhanced attention in CMAAC fits our multi-scale feature combination approach, which merges fine and coarse spatial features to improve classification robustness. The authors also offer a joint loss function to address class imbalance, further improving classification performance across a range of land-cover types. This can be extended in our model by adjusting weighting mechanisms in the loss function to improve segmentation performance on underrepresented urban features.
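The weighting idea can be illustrated with a class-weighted cross-entropy loss (a minimal sketch, not the CMAAC joint loss; the class set and weight values here are hypothetical and would normally be derived from the training-label histogram):

```python
import torch
import torch.nn as nn

# Hypothetical inverse-frequency weights: rarer classes get larger weights
# so mistakes on underrepresented urban features cost more during training.
class_weights = torch.tensor([0.8, 1.0, 2.5, 3.0])  # e.g. building, road, tree, water

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(2, 4, 128, 128)          # (batch, classes, H, W) raw scores
targets = torch.randint(0, 4, (2, 128, 128))  # per-pixel ground-truth labels
loss = criterion(logits, targets)             # scalar training loss
```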
land cover, a dataset was generated. The authors draw attention
CMAAC was evaluated on the Pavia University and Indian Pines datasets and achieved state-of-the-art classification performance by effectively managing spectral-spatial relationships, further highlighting the value of hybrid feature learning. Our contribution advances this by integrating CNN-based local feature extraction with Transformer-driven global relationships, offering a holistic feature representation tailored to urban land-cover classification. The findings of Yu et al. directly corroborate our architectural design choices, particularly asymmetric convolution for boundary accuracy, multi-scale combination of attention, and handling of high-dimensional remote sensing data.

Yan et al. [9] present HyFormer, a hybrid deep learning model that integrates CNNs and Transformers for pixel-level land-cover classification in multispectral satellite imagery. While traditional CNNs are efficient at extracting spatial features, they struggle to capture long-range dependencies within spectral data. Transformer-based models, on the other hand, excel at capturing global contextual information but often fail to retain fine spatial details. To address this, HyFormer incorporates a Transformer encoder for spectral feature modeling alongside a CNN-based feature extractor that preserves spatial structure. By converting multispectral sequences into 3D feature representations, the model provides a more comprehensive view of spectral variations. Through multi-head self-attention, HyFormer effectively distinguishes between similar land-cover types, improving classification accuracy for both urban and natural environments.

This study is particularly relevant to our model as it confirms the effectiveness of hybrid CNN-Transformer architectures for urban land-cover classification. Our approach follows a similar strategy, utilizing a CNN (VGG16) to extract fine-scale spatial features while incorporating Swin Transformer layers to capture long-range contextual dependencies. Additionally, HyFormer's method for processing multispectral data aligns with our focus on integrating multi-scale feature representations, enabling more accurate classification of complex urban landscapes. The model was evaluated on Sentinel-2 imagery from various regions of China, achieving an impressive overall accuracy of 95.4% and significantly outperforming conventional CNN-based and Transformer-only models. These results further validate our decision to refine global features through attention-based mechanisms while preserving local spatial details. By drawing insights from HyFormer, we enhance our model's capability to classify diverse urban land-cover types, ultimately improving segmentation accuracy in real-world applications.

WH-MAVS, a novel dataset created to address the difficulties of multiple land use and land cover (LULC) applications, is introduced in [10]. The dataset was generated to make it easier to evaluate deep learning models on LULC tasks such as segmentation, detection, and classification. The authors draw attention to the growing demand for trustworthy datasets in remote sensing, especially for urban regions where a variety of LULC categories must be accurately identified and classified. In addition to providing the dataset, the work includes benchmark deep learning models for several LULC tasks, which helps standardize performance evaluation across techniques.
The publication [10] also presents the technology employed, spanning a variety of state-of-the-art methods and tools for classifying land use and land cover. Deep learning models: a number of deep learning models, among them convolutional neural networks (CNNs) optimized for remote sensing data, are used as benchmarks on the WH-MAVS dataset to evaluate performance. Dataset construction: with its vast collection of images, WH-MAVS encompasses a broad range of urban land use and land cover types and can be used to test generalization across various settings and regions. Because it is designed to provide high-quality annotations, the dataset can serve many applications, including segmentation and classification. Benchmarking framework: using the dataset, the research establishes a deep learning benchmarking framework and examines different models, giving future research a consistent assessment procedure against which to benchmark findings.

The authors of [10] used the following datasets. WH-MAVS dataset: labelled images of land cover and use with an emphasis on urban areas; classifications on which urban planning and environmental monitoring depend, such as vegetation, buildings, water bodies, and transportation infrastructure, are all included in the collection. Multiple LULC categories: to accurately reflect the complexity and variety observed in real-world circumstances, the dataset encompasses a wide range of LULC categories.

The EuroSAT dataset, created for classifying land use and land cover from Sentinel-2 satellite images, is presented in [11]. The dataset is intended to aid the creation of machine learning models for remote sensing applications. Sentinel-2 images are openly available as part of the Copernicus Earth Observation initiative. The EuroSAT dataset includes 27,000 labelled and geo-referenced images across ten classes, with coverage of 13 spectral bands.

The paper's goal is to produce a dataset that may serve as a standard for a range of land use and land cover classification tasks, promoting the creation and evaluation of deep learning models for remote sensing applications. The following technologies are used in [11]. Deep learning techniques: convolutional neural networks (CNNs) serve as a baseline for categorizing land cover and use on the dataset. Multispectral data: thirteen spectral bands are included, which improves the model's capacity to distinguish between land-cover classes such as forests, urban areas, and water bodies. Using state-of-the-art deep learning models, the authors provide benchmarks that form a standard for assessing model performance on EuroSAT.

Datasets: the authors of [11] use the EuroSAT dataset, which covers land use and land cover categories such as forests, urban areas, water bodies, and agricultural areas, created from Sentinel-2 satellite imagery. The 27,000 labelled images in the collection are intended for a variety of remote sensing applications, such as detection and classification. Sentinel-2 satellite images: this collection provides data points covering various land use types and geographic regions, using multispectral images from the Sentinel-2 mission.

By merging data from several sensors, Yonghao Xu et al. [12] investigate the application of AI techniques, specifically machine learning models, to categorizing urban land use and cover. Their approach makes use of the 2018 IEEE GRSS Data Fusion Contest dataset to enhance classification accuracy by combining optical and radar images. The results emphasize that multi-sensor fusion provides a more thorough understanding of urban environments by effectively recognizing various urban land-cover types, including buildings, roads, and vegetation.

In [12], machine learning algorithms (Random Forest, Support Vector Machines (SVM), and Convolutional Neural Networks (CNNs)) are the primary technologies discussed. These models, which incorporate both spectral and spatial information, were used to classify the multi-sensor data. The research emphasizes that, compared with single-source data, the fusion of optical and radar data yields improved classification accuracy; precise numerical accuracy criteria, however, are not stressed. Rather, the authors highlight the enhancement of visual clarity and classification consistency, especially in differentiating intricate urban elements that challenge conventional classification techniques. The dataset used in [12] is an extensive set of multi-sensor data, including both radar and high-resolution optical imagery, processed to successfully train the machine learning models. The authors conclude that urban land use classification is greatly improved by the integration of multi-sensor data, implying real-world value for environmental monitoring and urban planning. All in all, combining data from several sensors proves quite successful, enabling the algorithms to identify distinctive urban characteristics more precisely. This study shows how artificial intelligence (AI) approaches can automate the classification of urban land cover, with real-world applications in disaster relief, urban management, and sustainable development. Though the authors point out that large-scale implementations may face difficulties due to the increased processing complexity of collecting multi-sensor data, the fusion approach opens new opportunities for producing more accurate and detailed land-cover maps.
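As a concrete illustration of the fusion principle, co-registered optical and radar inputs can be merged at the input level by channel stacking (a minimal baseline sketch with assumed band counts, not the method of any particular contest entry):

```python
import torch
import torch.nn as nn

# Co-registered patches from two sensors over the same urban scene.
optical = torch.randn(1, 4, 256, 256)   # e.g. R, G, B, NIR bands
radar = torch.randn(1, 2, 256, 256)     # e.g. two SAR polarization channels

# Early fusion: stack the channels so the network sees both sensors jointly.
fused = torch.cat([optical, radar], dim=1)        # (1, 6, 256, 256)
stem = nn.Conv2d(6, 64, kernel_size=3, padding=1) # first layer of a fusion CNN
features = stem(fused)
```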
III. PERFORMANCE COMPARISON

Evaluated on two high-resolution aerial image datasets, Potsdam and Vaihingen, the table below offers a comparative study of several deep learning models for urban land-cover classification. These datasets are commonly used in urban scene classification studies. Each model is compared on its parameter count (in millions) and overall accuracy (OA) on the two datasets.
Model           | Parameters (M) | Potsdam OA (%) | Vaihingen OA (%)
MCN [1]         | 19.56          | 90.03          | 85.47
SCAttNet [13]   | 24.62          | 91.99          | 90.18
U-Net [14]      | 32.52          | 88.49          | 85.47
ColNet [15]     | 28.57          | 87.97          | 90.18
MANet [16]      | 35.86          | 88.46          | 86.99
ABCNet [17]     | 45.65          | 85.30          | 86.32
LiANet [18]     | 24.59          | 93.51          | 86.32
EMRT [19]       | 54.85          | 88.12          | 88.23
PCJNet [20]     | 42.67          | 90.24          | 86.99
WiCoNet [21]    | 38.24          | 91.11          | 89.90
MAResU-Net [22] | 102.00         | 90.61          | 86.97

TABLE I: Comparison of Various Machine Learning Models

MCN [1] demonstrates an efficient balance between performance and complexity, with the lowest parameter count (19.56 million) and competitive accuracy scores of 90.03% on Potsdam and 85.47% on Vaihingen. SCAttNet [13], with 24.62 million parameters, achieves strong accuracy on both datasets: 91.99% on Potsdam and 90.18% on Vaihingen. U-Net [14] and ColNet [15], with 32.52 million and 28.57 million parameters respectively, also deliver good results; U-Net achieves 88.49% on Potsdam and 85.47% on Vaihingen, while ColNet performs better on Vaihingen (90.18%) but somewhat lower on Potsdam (87.97%).

With 35.86 million parameters, MANet [16] attains competitive accuracy, scoring 88.46% on Potsdam and 86.99% on Vaihingen. ABCNet [17], at 45.65 million parameters, scores 85.30% on Potsdam and 86.32% on Vaihingen, while LiANet [18] achieves a remarkable 93.51% on Potsdam with fewer parameters (24.59 million), though its performance on Vaihingen is significantly lower at 86.32%. EMRT [19], with 54.85 million parameters, achieves an OA of 88.12% on Potsdam and a similar 88.23% on Vaihingen. MAResU-Net [22], although it has the highest parameter count (102 million), performs well, with accuracy scores of 90.61% on Potsdam and 86.97% on Vaihingen. PCJNet [20], with 42.67 million parameters, surpasses ABCNet [17] in accuracy, scoring 90.24% on Potsdam and 86.99% on Vaihingen. Lastly, WiCoNet [21], with a relatively modest parameter count of 38.24 million, achieves high accuracy (91.11% on Potsdam and 89.90% on Vaihingen), demonstrating an excellent balance of performance and complexity. SCAttNet [13] and WiCoNet [21] prove effective for large-scale tasks due to their high accuracy and manageable parameter sizes; MCN [1] offers a strong performance-complexity trade-off, while LiANet [18] excels on the Potsdam dataset.

IV. PROPOSED METHODOLOGY

The proposed architecture is designed to leverage the power of both Transformers and Convolutional Neural Networks (CNNs) to yield efficient and accurate image segmentation. The model combines VGG16 as the backbone CNN with a Swin Transformer that provides global contextual features. It is further supported by multi-scale feature fusion, gated feature selection, and edge-aware convolutions, making the architecture robust for global and local feature extraction while maintaining fine spatial information.

1. Hybrid Feature Extractor

The hybrid feature extractor, employed to extract the global and local characteristics of the input image, forms the pillar of the suggested architecture. VGG16 is used to extract local spatial patterns, and the Swin Transformer is used to extract long-range dependencies.

VGG16 is a popular CNN structure renowned for being simple yet high-performing in image processing. It comprises consecutive convolution layers and pooling layers that progressively shrink the spatial dimensions of the input while increasing feature depth. For an input image I of size H × W × C, the VGG16 network applies a series of convolutional operations and ReLU activation functions:

F_l(I) = ReLU(W_l * I + b_l)

where F_l(I) is the feature map at layer l, W_l is the layer-l convolution kernel, b_l is the bias term, and * is the convolution operator.

Through this process, VGG16 captures fine-grained spatial information such as edges, corners, and textures. Low-level features of this type are required in order to segment boundary details and small objects.
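A minimal sketch of this CNN branch, assuming the torchvision VGG16 (pretrained ImageNet weights and the 224 × 224 input size are assumptions for illustration):

```python
import torch
from torchvision.models import vgg16

# The convolutional stack of VGG16 computes the stacked F_l(I) maps
# (Conv + ReLU + max-pooling), halving resolution while deepening features.
backbone = vgg16(weights="IMAGENET1K_V1").features

image = torch.randn(1, 3, 224, 224)   # input I of size H x W x C
local_features = backbone(image)      # (1, 512, 7, 7): fine-grained local features
```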
One advantage of VGG16 is its architectural simplicity, which lets the model obtain hierarchical representations at an economical cost. CNNs, however, have restricted receptive fields and are hence weaker at capturing long-range dependencies. This is corrected by incorporating the Swin Transformer.

The Swin Transformer supplements the VGG16 backbone by capturing global contextual information. Transformers are widely known for extracting long-range dependencies using self-attention. The Swin Transformer builds on this with shifted window attention, in which the image is processed within non-overlapping windows, with information shared between successive windows. For a given input feature map X, the Swin Transformer calculates self-attention over local windows:

Q = X W_Q,  K = X W_K,  V = X W_V
Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V

where Q, K, and V are the query, key, and value projections of the input features; W_Q, W_K, and W_V are the corresponding learnable projection matrices; and d is the feature dimension.

The shifted window strategy lets the model preserve both local and global dependencies while being computationally more efficient than standard self-attention, without sacrificing long-range information. With the addition of the Swin Transformer, the model is capable of learning the complex spatial relations that are very important for segmenting large objects and regions.
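The window attention above can be sketched as a single-head module operating on pre-partitioned windows (a minimal sketch; the full Swin block adds window shifting, relative position bias, and multiple heads):

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V within each local window."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.w_q = nn.Linear(dim, dim, bias=False)  # projection W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # projection W_K
        self.w_v = nn.Linear(dim, dim, bias=False)  # projection W_V

    def forward(self, x):          # x: (num_windows, tokens_per_window, dim)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / self.dim ** 0.5  # scaled dot product
        return torch.softmax(scores, dim=-1) @ v            # attention-weighted values
```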
2. Multi-Scale Feature Fusion

After the VGG16 and Swin Transformer feature maps have been obtained, multi-scale feature fusion takes place. Multi-scale feature fusion combines feature maps of different scales to provide a richer image representation.

Multi-scale fusion is needed because objects in the image can have very different sizes: small objects are described well by low-level features, while large objects are described well by high-level semantic features. Fusion combines the different scales to achieve better segmentation quality. The VGG16 and Swin Transformer feature maps are combined by a weighted summation:

F_fusion = α F_VGG + (1 − α) F_Swin

where F_VGG and F_Swin denote the feature maps of VGG16 and the Swin Transformer, respectively, and α is a trainable weight that scales the contribution of each network. This combination guarantees that the resulting representation retains both global context and fine-grained detail; the weighted summation lets the model dynamically emphasize global or local features depending on the image content.

The significance of multi-scale feature fusion is that it bridges the gap between different feature hierarchies, making the model more resilient to fluctuations in object size and shape.
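A minimal sketch of the weighted summation, assuming the two feature maps have already been projected to a common shape (squashing the trainable weight α into (0, 1) with a sigmoid is our assumption; the text does not state how α is constrained):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """F_fusion = alpha * F_VGG + (1 - alpha) * F_Swin with a learnable alpha."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # unconstrained, trained by backprop

    def forward(self, f_vgg, f_swin):
        alpha = torch.sigmoid(self.logit)          # keep alpha in (0, 1)
        return alpha * f_vgg + (1.0 - alpha) * f_swin
```

Because α is shared across the whole map, this variant learns a single global trade-off between local and global evidence; a spatially varying gate is an obvious alternative.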
3. Adaptive Feature Refinement (Gated Selection)

After fusion, the fused features are refined by the adaptive feature refinement module, which employs a gated selection mechanism. The module selectively discards redundant or weak activations and improves the quality of the feature representation.

The gated selection mechanism applies a learnable gating function to each feature channel:

F_refined = G(F_fusion) ⊙ F_fusion

where G(F_fusion) is the gating function, implemented as a sigmoid activation, and ⊙ denotes element-wise multiplication.

The gating mechanism learns which feature channels are most discriminative for the segmentation task and gates out the less discriminative ones. Selective refinement allows the model to focus on meaningful features and remove spurious noise and redundancy. Adaptive feature refinement is especially valuable for segmentation in the presence of heavy noise or clutter; suppressing weak activations makes the network more confident in its predictions.
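A minimal sketch of the gate (the squeeze-then-1×1-convolution layout is an assumption; the text only specifies a learnable sigmoid gate applied per feature channel):

```python
import torch
import torch.nn as nn

class GatedRefinement(nn.Module):
    """F_refined = G(F_fusion) ⊙ F_fusion, gating each channel in (0, 1)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # per-channel summary statistic
            nn.Conv2d(channels, channels, 1),  # learnable gating function G
            nn.Sigmoid(),                      # gate values in (0, 1)
        )

    def forward(self, f_fusion):               # f_fusion: (B, C, H, W)
        return self.gate(f_fusion) * f_fusion  # element-wise ⊙, broadcast over H, W
```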
4. Detail Retention Pathway (Skip Connections)

To keep high-resolution spatial details, skip connections between encoder and decoder stages are employed in the model. The skip connections pass low-level features directly from the VGG16 blocks to the decoder, bypassing the intermediate processing steps. Skip connections are applied with simple concatenation operations:

F_skip = F_encoder ⊕ F_decoder

These connections enable the model to preserve boundary information and fine structural details, which are often lost in deep networks. By reintroducing low-level features at later stages, the architecture achieves sharper, more accurate segmentation masks.
5. Progressive Upsampling Module

The decoder stage employs transposed convolutions to progressively upscale feature maps back to the original image resolution. This multi-step upsampling approach gradually reconstructs the high-resolution output, refining the segmentation mask step by step. Each upsampling operation is followed by a convolution layer to smooth the feature maps and promote spatial coherence:

F_up = ConvTranspose(F_input)

This gradual approach helps prevent artifacts and produces smoother segmentation masks than direct upsampling methods.
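One decoder step combining the previous two subsections can be sketched as follows (channel sizes and the 2× stride are assumptions): the transposed convolution doubles the resolution, the encoder skip feature is concatenated, and a 3×3 convolution smooths the result.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """F_up = ConvTranspose(F_input), then F_skip = F_encoder ⊕ F_decoder,
    then a smoothing convolution for spatial coherence."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.smooth = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):            # skip comes straight from a VGG16 block
        x = self.up(x)                     # double the spatial resolution
        x = torch.cat([x, skip], dim=1)    # detail-retention pathway
        return self.smooth(x)              # suppress upsampling artifacts
```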
6. Boundary Enhancement Block (Edge-Aware Convolutions)

To further refine object boundaries, edge-aware convolutions explicitly model edge features. This module applies Sobel filters to detect edges and integrates the resulting edge maps into the final segmentation prediction:

F_edge = Sobel(F_input)
F_final = Concat(F_up, F_edge)

By directly incorporating edge information, this method minimizes boundary blur and improves segmentation accuracy near object edges.
as Google Earth Engine, ArcGIS, and QGIS, city planners
and researchers can analyze urban sprawl more effectively,
optimize land-use patterns, and monitor illegal construction
Fig. 1: Architecture
activities. This will make data-driven choices in sustainable
urban planning possible.

7. Segmentation Output. The final segmentation map is All things considered, the aforementioned developments
obtained by applying a softmax activation function to the may improve the accuracy and value of urban land cover
refined feature maps, generating pixel-wise class probabilities. classification, and hence it can be an effective tool for en-
Each pixel is assigned a class label based on the model’s vironmental monitoring, smart city planning, and sustainable
predictions. The combination of hybrid feature extraction, urban development. Further research and development in these
multi-scale fusion, and boundary enhancement ensures that fields can go a long way in creating more livable, efficient, and
the segmentation output is both visually coherent and highly resilient cities.
accurate.
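The output stage can be sketched as a 1×1 classification convolution followed by softmax and a per-pixel argmax (channel and class counts are assumptions):

```python
import torch
import torch.nn as nn

head = nn.Conv2d(64, 6, kernel_size=1)        # 64 refined channels -> 6 LULC classes

features = torch.randn(1, 64, 256, 256)       # refined decoder feature map
probs = torch.softmax(head(features), dim=1)  # pixel-wise class probabilities
labels = probs.argmax(dim=1)                  # (1, 256, 256) predicted class per pixel
```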

This approach leverages the combined strengths of CNNs and Transformers through hybrid feature extraction, multi-scale fusion, and adaptive feature refinement. The integration of skip connections, progressive upsampling, and edge-aware convolutions further enhances segmentation performance by preserving spatial details and refining boundary accuracy. The architecture offers a robust and efficient solution for image segmentation tasks, with the potential to outperform existing methods across a wide range of datasets.

V. FUTURE SCOPE

The general objective of this research is to enhance urban land cover mapping through deep learning, making urban mapping more efficient and scalable. Several areas must be developed further to make the system more flexible, efficient, and generalizable to real-world situations.

An interesting area of innovation would be integration with real-time city monitoring systems. Deploying this classification model in smart city infrastructure, traffic control centers, and disaster management systems would provide real-time information on changes in land use patterns, road conditions, and natural calamities. Such integration would support urban planning, policy guidance, and emergency response, driving sustainability and resilience in cities.

The model can also be extended to multi-sensor fusion by adding LiDAR, SAR, and high-resolution drone imagery. Combining multiple data sources can significantly enhance classification performance, particularly for 3D urban mapping, infrastructure inspection, and vegetation monitoring, making the system more comprehensive and consistent across geographic regions and weather conditions.

Moreover, incorporating the classification system into GIS and urban planning software holds huge potential. Integrated into widely used tools such as Google Earth Engine, ArcGIS, and QGIS, it would let city planners and researchers analyze urban sprawl more effectively, optimize land-use patterns, and monitor illegal construction activity, enabling data-driven choices in sustainable urban planning.

All things considered, these developments may improve the accuracy and value of urban land cover classification, making it an effective tool for environmental monitoring, smart city planning, and sustainable urban development. Further research and development in these fields can go a long way toward creating more livable, efficient, and resilient cities.

VI. CONCLUSION

This survey highlights recent developments and new techniques in urban land-cover classification, such as multiscale context-aware models, attention mechanisms, and multi-sensor fusion. Traditional classification methods fail to grasp detailed spatial and contextual
relationships as complexity in urban landscapes continues to increase.

Specialized datasets such as WH-MAVS and EuroSAT have established new benchmarks for evaluating model performance, promoting deep learning's advancement in high-resolution image classification. By exploiting the inherent benefits of CNNs, attention layers, and optical-radar data fusion, the state of the art has greatly improved in accuracy, resilience, and flexibility across the range of urban terrain types. Nevertheless, scalability, computational expense, and the generalizability of models to varying environmental and urban characteristics remain challenges. Continued research is therefore imperative: toward optimized model architectures, toward unsupervised and semi-supervised learning that minimizes reliance on labeled data, and toward refined data fusion techniques suited to real-world use. Future innovations in urban land cover classification stand to transform a variety of urban planning, conservation, and disaster management practices, providing precise, efficient, and sustainable methods for managing complex urban ecosystems.

REFERENCES

[1] Abubakar Siddique, Zhengzhou Li, Abdullah Azeem, Yuting Zhang, Bitong Xu: Multiscale Context-Aware Feature Fusion Network for Land-Cover Classification of Urban Scene Imagery. 2023. 10.1109/JSTARS.2023.3310160
[2] Zongqian Zhan, Xiaomeng Zhang, Yi Liu, Xiao Sun, Chao Pang, Chenbo Zhao: Vegetation Land Use/Land Cover Extraction From High-Resolution Satellite Images Based on Adaptive Context Inference. 2020. 10.1109/ACCESS.2020.2969812
[3] Jining Yan et al.: Semantic Segmentation of Land Cover in Urban Areas by Fusing Multisource Satellite Image Time Series. 2023. 10.1109/TGRS.2023.3329709
[4] Chunping Qiu, Lichao Mou, Michael Schmitt, Xiao Xiang Zhu: Fusing Multiseasonal Sentinel-2 Imagery for Urban Land Cover Classification With Multibranch Residual Convolutional Neural Networks. 2020. 10.1109/LGRS.2019.2953497
[5] Bing Liu, Yifan Sun, Ruirui Wang, Anzhu Yu, Zhixiang Xue, Yusong Wang: TrmGLU-Net: Transformer-Augmented Global-Local U-Net for Hyperspectral Image Classification With Limited Training Samples. 2023. 10.1080/22797254.2023.2227993
[6] Leo Thomas Ramos, Angel D. Sappa: Multispectral Semantic Segmentation for Land Cover Classification: An Overview. 2024. 10.1109/JSTARS.2024.3438620
[7] Xuan Wang, Yue Zhang, Tao Lei, Yingbo Wang, Yujie Zhai, Asoke K. Nandi: Dynamic Convolution Self-Attention Network for Land-Cover Classification in VHR Remote-Sensing Images. 2022. 10.3390/rs14194941
[8] Lili Yu, Xubing Zhang, Kai Wang: CMAAC: Combining Multiattention and Asymmetric Convolution Global Learning Framework for Hyperspectral Image Classification. 2024. 10.1109/TGRS.2024.3361555
[9] Chuan Yan, Xiangsuo Fan, Jinlong Fan, Ling Yu, Nayi Wang, Lin Chen, Xuyang Li: HyFormer: Hybrid Transformer and CNN for Pixel-Level Multispectral Image Land Cover Classification. 2023. 10.3390/ijerph20043059
[10] Jingwen Yuan, Lixiang Ru, Shugen Wang, Chen Wu: WH-MAVS: A Novel Dataset and Deep Learning Benchmark for Multiple Land Use and Land Cover Applications. 2022. 10.1109/JSTARS.2022.3142898
[11] Patrick Helber, Benjamin Bischke, Andreas Dengel, Damian Borth: EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. 2019. 10.1109/JSTARS.2019.2918242
[12] Yonghao Xu, Bo Du, Liangpei Zhang, Daniele Cerra, Miguel Pato, Emiliano Carmona: Advanced Multi-Sensor Optical Remote Sensing for Urban Land Use and Land Cover Classification: Outcome of the 2018 IEEE GRSS Data Fusion Contest. 2019. 10.1109/JSTARS.2019.2911113
[13] Haifeng Li, Kaijian Qiu, Li Chen: SCAttNet: Semantic Segmentation Network With Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. 2021. 10.1109/LGRS.2020.2988294
[14] Olaf Ronneberger, Philipp Fischer, Thomas Brox: U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015.
[15] Qian Zhang, Guang Yang, Guixu Zhang: Collaborative Network for Super-Resolution and Semantic Segmentation of Remote Sensing Images. 2021. 10.1109/TGRS.2021.3099300
[16] Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Jianlin Su, Libo Wang: Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. 2021. 10.1109/TGRS.2021.3093977
[17] Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Libo Wang, Peter M. Atkinson: ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. 2021. 10.1016/j.isprsjprs.2021.09.005
[18] Renchu Guan, Mingming Wang, Lorenzo Bruzzone, Haishi Zhao, Chen Yang: Lightweight Attention Network for Very High-Resolution Image Semantic Segmentation. 2023. 10.1109/TGRS.2023.3272614
[19] Tao Xiao, Yikun Liu, Yuwen Huang, Mingsong Li, Gongping Yang: Enhancing Multiscale Representations With Transformer for Remote Sensing Image Semantic Segmentation. 2023. 10.1109/TGRS.2023.3256064
[20] Bo Liu, Jinwu Hu, Xiuli Bi, Weisheng Li, Xinbo Gao: PGNet: Positioning Guidance Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Images. 2022. 10.3390/rs14174219
[21] Lei Ding, Dong Lin, Shaofu Lin, Jing Zhang, Xiaojie Cui, Yuebin Wang: Looking Outside the Window: Wide-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images. 2022. 10.1109/TGRS.2022.3168697
[22] Rui Li, Shunyi Zheng, Chenxi Duan, Jianlin Su, Ce Zhang: Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. 2021. 10.1109/LGRS.2021.3063381
[23] Ankita Chatterjee, Jayasree Saha, Jayanta Mukherjee, Subhas Aikat, Arundhati Misra: Unsupervised Land Cover Classification of Hybrid and Dual-Polarized Images Using Deep Convolutional Neural Network. 2021. 10.1109/LGRS.2020.2993095
[24] Corneliu Octavian Dumitru, Gottfried Schwarz, Mihai Datcu: Land Cover Semantic Annotation Derived from High-Resolution SAR Images. 2016. 10.1109/JSTARS.2016.2549557
