
2024 International Conference on Network, Multimedia and Information Technology (NMITCON)

Automatic Building Extraction using Mask2Former Model

Venkatamanukarthikeya Dharmapuri
Student, Alliance University, Bengaluru, India
[email protected]

Venkatesh Punugotti
Student, Alliance University, Bengaluru, India
[email protected]

Bhargava Reddy Gunta
Student, Alliance University, Bengaluru, India
[email protected]

Dr. Ezil Sam Leni A.
Head of the Department, Alliance University, Bengaluru, India
[email protected]

I. ABSTRACT

This paper presents a new approach to building extraction and detailed roof analysis using the Mask2Former framework. Extracting buildings from aerial images is an important task in remote sensing and urban planning, and it matters for many applications such as land use mapping, construction, and damage control.

Traditional methods often have difficulty clearly defining building boundaries and assessing roof quality, especially in urban environments. We apply a deep learning method that combines the instance segmentation and semantic segmentation tasks, namely panoptic segmentation. By performing these tasks jointly, a well-trained Mask2Former model is obtained and fine-grained roof details are extracted and preserved. The framework uses convolutional neural networks (CNNs) for feature extraction, followed by a transformer-based architecture for context aggregation and refinement.

Experimental results show a significant improvement in extraction accuracy and roof quality analysis compared to other methods. The Mask2Former framework is designed to capture roof detail, distinguish individual buildings, and handle complex urban scenes. Applications of the Mask2Former method include urban planning, environmental monitoring, infrastructure assessment, and demolition planning, all of which need the detailed information that is important for decision-making and accurate regulatory design. The approach provides valuable results and opens new opportunities for the development of remote sensing and urban analysis, and it can be applied in other settings such as urban development, disaster management, and environmental conservation.

Keywords— Segmentation; Mask2Former; High Resolution Satellite Images; Object Detection; Object Segmentation; Image Processing

II. INTRODUCTION

Automatic building extraction is a fast-developing field within remote sensing and computer vision. It mainly concerns the use of computational techniques to identify and segment buildings from satellite imagery. This technology automates a traditionally manual process, offering great advantages in terms of speed, efficiency, and cost-effectiveness.

Extracting buildings from satellite images is very important for various reasons. These include urban planning, disaster management, navigation systems, updating geographic information systems (GIS), and 3D city modelling.

Building extraction is not a simple task. Factors like complex building shapes, variations in size and materials, and dense urban environments can challenge automated algorithms.

Traditional methods normally depend on image processing techniques like classical segmentation algorithms. However, recent developments in deep learning, mainly convolutional neural networks (CNNs), have transformed this field. Deep learning models can learn complex patterns from huge datasets of images, which helps us achieve high accuracy in building extraction.

III. LITERATURE REVIEW

[1] J. Li, W. He, W. Cao, L. Zhang, and H. Zhang, "UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images," IEEE Transactions on Geoscience and Remote Sensing, 2024, doi: 10.1109/TGRS.2024.3361211.



Li et al.'s paper "UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images" (2024) addresses the limitations of earlier extraction methods by proposing a new uncertainty-aware network (UANet). This review delves into the relevant literature and explores how UANet resolves uncertainty in building extraction.

UANet aims to resolve uncertainty in the prediction process by integrating uncertainty estimation into its architecture. The network uses an encoder-decoder model to generate an initial prediction map together with an uncertainty estimate; this initial prediction is then refined by additional modules to produce a more accurate and reliable result.

One of UANet's main innovations is the Prior Information Guide Module (PIGM). This module uses the initial prediction map as prior knowledge to improve the feature representation. PIGM works across spatial and channel dimensions, enhancing the features extracted by the encoder. This approach ensures that the model captures useful correlations and dependencies, thus improving the accuracy of the final prediction.

To further reduce uncertainty, the Uncertainty-Aware Fusion Module (UAFM) incorporates feedback from high-level to low-level features. UAFM adjusts the representation by gradually combining uncertainty information; this helps reduce the impact of complex backgrounds and varying scales. This refinement process leads to a final prediction map with reduced uncertainty.

UANet has been rigorously tested on several datasets, such as the Inria Aerial Image Dataset. These datasets present different challenges, such as varying building styles, scales, and environments. UANet performs better than competing methods, achieving higher Intersection over Union (IoU) and F1 scores and demonstrating more accurate and reliable building extraction.

For example, UANet achieved an IoU of 83.08 using the VGG-16 backbone on the Inria dataset, which is one of the best results reported. Similarly, it achieves an IoU of 76.41 on the Massachusetts dataset, indicating its robustness across different datasets.

[2] Y. Wang, X. Li, and Z. Chen, "Ultra-High-Resolution Segmentation with Ultra-Rich Context: A Novel Benchmark," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 23621-23630, doi: 10.1109/CVPR52729.2023.02262.

In recent years, significant advances have been made in image segmentation, which continues to improve the accuracy of identifying objects in images. But segmenting ultra-high resolution (UHR) images effectively is becoming increasingly difficult. These images, often over tens of megapixels and rich in detail, require powerful segmentation techniques that can capture both fine patterns and overall context. Wang et al.'s "Ultra-High-Resolution Segmentation with Ultra-Rich Context: A Novel Benchmark" (2023) addresses this challenge by introducing the URUR dataset and an accompanying benchmark model. The paper thoroughly examines the existing literature on UHR segmentation, highlights its limitations, and charts the path to a new benchmark. Several problems are commonly encountered when existing methods are applied to UHR images. Some of the main challenges include:

Processing large-scale UHR images requires significant computational resources; standard procedures may be slow or require extensive hardware. Downsampling or tiling can lead to incorrect partitioning of large or extended objects, which matters for accurate segmentation of complex scenes containing many objects and overlapping patterns. Below is a brief summary of some important methods:

One common approach is to downsample the UHR image to a lower resolution before applying the segmentation method. Although computationally efficient, subsampling may cause information loss and affect segmentation accuracy. Another approach splits the image into patches and segments each patch separately; although this reduces the computational burden, the assembly of segmented patches may be inconsistent at patch borders. A third, two-stage approach performs segmentation at a lower resolution first, and a second stage then refines the results at a higher resolution.

Large, high-quality UHR segmentation datasets are rare. This hinders the development and evaluation of robust segmentation models, and the available data often lack the diversity and complexity of real UHR images.

URUR contains a large number (3,008) of UHR images covering a wide range of complex scenes from 63 different cities. This diversity allows the development of models that generalize to real-world situations, and the dataset's "ultra-rich context", with dense annotations across a very large number of pixels, ensures that trained models capture high-quality content and object relationships.

Researchers can develop new deep learning models specifically designed to process UHR images. The goal of such architectures should be to increase computational efficiency while preserving long-range context and integrating information across scales, thereby improving the performance of UHR segmentation tasks. UHR segmentation remains a difficult but important area of computer vision.
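The review above contrasts downsampling, patch-wise, and two-stage inference. As a concrete illustration of the patch-wise idea, here is a minimal sliding-window sketch of our own (not code from Wang et al.); the tile and overlap sizes are illustrative, and the model is assumed to map an image tensor to per-pixel class logits.

```python
import torch

def tiled_inference(model, image, num_classes, tile=1024, overlap=128):
    """Segment a UHR image tile by tile, averaging logits where tiles
    overlap to soften seams. `image` is a (C, H, W) float tensor and
    `model` maps a (1, C, h, w) batch to (1, num_classes, h, w) logits."""
    _, H, W = image.shape
    logits = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    stride = tile - overlap
    for top in range(0, max(H - overlap, 1), stride):
        for left in range(0, max(W - overlap, 1), stride):
            bottom, right = min(top + tile, H), min(left + tile, W)
            patch = image[:, top:bottom, left:right].unsqueeze(0)
            with torch.no_grad():
                out = model(patch)[0]                 # (num_classes, h, w)
            logits[:, top:bottom, left:right] += out
            counts[:, top:bottom, left:right] += 1
    return (logits / counts).argmax(dim=0)            # (H, W) label map
```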
IV. METHODOLOGY

Figure 1: Flowchart for proposed research methodology

Model Definition:

Input Dataset:

The dataset utilized in this research is the Inria Aerial Image Dataset. It was obtained from Kaggle, where it was uploaded by Sagar Rathod. The dataset contains remote sensing images of urban areas, with 180 files for training and 180 files for testing. First, we need to load the dataset into Google Colab so it can be used.

Data Preprocessing:

The initial step involves preparing the training and testing/validation datasets. This is done by resizing the images, rescaling the pixel values, and batching the data. These steps are crucial for efficient processing during training. We can also perform normalization, augmentation, noise reduction, and blur removal. A sketch of this step follows.
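As a rough illustration of this step, the sketch below resizes, rescales, normalizes, and batches the aerial images. The directory name, file pattern, and target size are our assumptions for illustration, not values fixed by the paper.

```python
import torch
from pathlib import Path
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class InriaImages(Dataset):
    """Loads aerial tiles and applies resize/rescale preprocessing."""
    def __init__(self, root="inria/train/images"):    # hypothetical path
        self.paths = sorted(Path(root).glob("*.tif"))
        self.tf = transforms.Compose([
            transforms.Resize((512, 512)),             # illustrative target size
            transforms.ToTensor(),                     # rescales pixels to [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.tf(Image.open(self.paths[i]).convert("RGB"))

loader = DataLoader(InriaImages(), batch_size=4, shuffle=True)   # batching
```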
Model Training:

The next step is to train the Mask2Former model on the dataset. This involves importing the Mask2Former model and training it on the images; by doing so, the model learns from the data and improves its performance. Feature extraction and segmentation are carried out by the model during this step. A hedged sketch of the fine-tuning loop follows.
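The sketch below shows one way such a fine-tuning loop can look with the Hugging Face Mask2Former classes. The checkpoint name is illustrative, and we assume a labelled variant of the earlier dataset sketch whose loader also yields masks and class ids; the paper does not specify these details.

```python
import torch
from transformers import Mask2FormerForUniversalSegmentation

# Illustrative checkpoint; any Mask2Former checkpoint could be fine-tuned.
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-small-coco-instance")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for images, masks, labels in loader:   # assumed labelled loader (see text)
    # mask_labels: per-image binary building masks; class_labels: their ids.
    outputs = model(pixel_values=images,
                    mask_labels=[m.float() for m in masks],
                    class_labels=list(labels))
    outputs.loss.backward()            # Hungarian-matched mask and class losses
    optimizer.step()
    optimizer.zero_grad()
```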
Performance Validation:

After loading the dataset, preprocessing the data, and training the model, we need to check how well the model has been trained and how well it is performing. Performance validation on our dataset involves several main steps and metrics to ensure the accuracy and precision of classifying buildings from the aerial image datasets. Hyperparameter tuning, fine-tuning, and validation are among the techniques used to evaluate the model's performance.
Automatic Building Extraction:

Once the training and validation data have been prepared as described above and the model has been trained and optimized, the model is used to perform automatic building extraction on new images. This involves inference followed by post-processing.

Output:

Finally, the output is a set of images with accurately segmented buildings. These outputs are essential for applications such as urban planning, disaster management, and environmental monitoring. The extracted footprints can be used to analyze urban growth, assess damage after natural disasters, and plan new infrastructure developments.

V. MODEL TRAINING

In recent years, the field of segmentation has made significant progress with the help of deep learning. These architectures are designed to identify individual objects in an image by assigning a pixel-level mask to each instance. One such model is Mask2Former, introduced by Facebook, which uses the advantages of Transformers to achieve the best performance on segmentation tasks. This section explores its background, advantages, and the libraries needed to run it.

Groundbreaking models such as Mask R-CNN have been successfully implemented using a two-stage approach:
Region Proposal Network (RPN): This subnetwork identifies candidate objects hidden in the image and localizes each one by creating a surrounding bounding box; a mask head then predicts a pixel-level mask within each box.

The Transformer, originally developed for natural language processing (NLP), has since revolutionized many computer vision applications. This architecture is good at capturing relationships between objects in the data. Unlike CNNs that rely on local convolutions, Transformers use self-attention, which allows them to attend to every part of the input data at once. These properties make them ideal for tasks that require global understanding, such as segmentation.
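To make the self-attention idea concrete, here is a minimal scaled dot-product attention sketch. It is a generic illustration, not Mask2Former's actual implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, d) sequence of feature vectors (e.g. flattened image patches).
    Every output vector is a weighted mix of ALL inputs, so each position
    draws on global context instead of a local neighbourhood."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)       # (N, N) pairwise similarities
    return F.softmax(scores, dim=-1) @ v          # attention-weighted values

x = torch.randn(16, 64)                           # 16 tokens, 64-dim features
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # (16, 64) globally mixed output
```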
Types of Segmentation:

1. Instance Segmentation: Identifying each individual object

Consider an image containing several objects. Instance segmentation aims to identify and delineate each object in the image. It not only detects the presence of an object, but also produces a separate mask for each instance, and this mask traces the precise boundaries of the object. Such networks extract feature maps from images to capture low-level visual information such as edges and texture; a detection phase then identifies potential objects in the image and creates a bounding box around them, and masks can be produced in several ways, such as running a mask prediction head on each detected region. A hedged example follows.
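As an illustration of instance segmentation with the Detectron2 library introduced later in this section, a pretrained Mask R-CNN from the stock model zoo can be run as follows; the input filename is hypothetical.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5      # confidence cutoff

predictor = DefaultPredictor(cfg)
image = cv2.imread("aerial_tile.png")             # hypothetical input image
instances = predictor(image)["instances"]
print(instances.pred_masks.shape)                 # one binary mask per detection
```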
2. Semantic Segmentation: Understanding the Big Picture

While instance segmentation focuses on individual objects, semantic segmentation takes a broader approach. Its purpose is to classify each pixel in an image according to its semantic group. It essentially partitions the whole image, telling you whether a pixel belongs to "person," "car," and so on. As with instance segmentation, semantic segmentation models often use deep learning to extract features from images, and the class probability of each pixel is usually estimated in the last layer of the model. A typical application is labelling images of various land cover types to aid urban planning and resource management.
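The per-pixel classification just described reduces to an argmax over class logits; a minimal sketch:

```python
import torch

num_classes, H, W = 3, 4, 4               # tiny illustrative sizes
logits = torch.randn(num_classes, H, W)   # last-layer class scores per pixel
probs = logits.softmax(dim=0)             # per-pixel class probabilities
label_map = probs.argmax(dim=0)           # (H, W) map, one class id per pixel
print(label_map)
```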
3. Panoptic Segmentation: Combining the best of both worlds

Panoptic segmentation aims to provide a fuller understanding of images by combining the advantages of instance segmentation and semantic segmentation. It identifies and segments each individual object while also segmenting the background, predicting a class for every remaining background pixel. Essentially, it provides a complete pixel-level understanding of the scene by distinguishing between individual objects and the background area.

Mostly, in this work we use panoptic segmentation, because it combines the features of instance segmentation and semantic segmentation, which gives us a chance to obtain good results.
To install the Mask2Former model we must install the torch library and the detectron2 library.

PyTorch is an open-source Machine Learning (ML) framework in Python which helps us create deep neural networks. It is widely preferred for deep learning research, and the framework speeds up the path between research prototyping and deployment.

Detectron2 is Facebook's newer library that allows us to use and create object detection, segmentation, and edge detection models. This library includes all the models that were available in Detectron, such as R-CNN and Mask R-CNN, as well as some newer models including Cascade R-CNN and TensorMask. Detectron2 is mostly used for keypoint detection, object detection, and semantic segmentation.

As we discussed, Detectron2 and Mask2Former were both created by Facebook, so Detectron2 is the most crucial library for installing Mask2Former, as it provides the components needed to perform object detection.

After installing PyTorch and Detectron2, we need to install the Transformers library from https://github.com/huggingface/transformers.git. Transformers is a powerful and versatile open-source library created and maintained by Hugging Face and the community. It is built on PyTorch and TensorFlow, and it provides thousands of pretrained models for tasks such as Natural Language Processing (NLP). We use the Transformers library with PyTorch support to obtain pretrained models.

After that we must define the COCO panoptic palette, which comes from the annotated COCO images and includes the 80 "thing" categories from the detection task.

After installing all the libraries mentioned above, we import them and start working with the model.
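A hedged sketch of this setup, with the install commands shown as comments (the paper does not pin versions, and the checkpoint name is illustrative):

```python
# pip install torch torchvision
# pip install git+https://github.com/facebookresearch/detectron2.git
# pip install git+https://github.com/huggingface/transformers.git

from detectron2.data import MetadataCatalog
from transformers import (Mask2FormerForUniversalSegmentation,
                          Mask2FormerImageProcessor)

# COCO panoptic metadata provides the category names and palette colours.
metadata = MetadataCatalog.get("coco_2017_val_panoptic")

ckpt = "facebook/mask2former-swin-small-coco-panoptic"   # illustrative
processor = Mask2FormerImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt).eval()
```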
After that, we define the loading of the model and processor, which loads the pretrained model so we can use it. Then we define the default checkpoint (ckpt); in this class we check whether a semantic task or a panoptic task is being performed, and we import the corresponding pretrained semantic or panoptic segmentation checkpoint. Once we determine whether the task is semantic or panoptic, we define two classes, one for semantic segmentation and another for panoptic segmentation.

If the task is a panoptic task, then we obtain the panoptic COCO metadata from the Metadata catalog and label each category that is going to be segmented, for a better understanding of the type of each object. Then we use the visualizer module to visualize the image and draw the panoptic segmentation: we generate maps for the predicted segments, assign a label to each, and return the result.

The same goes for semantic segmentation: if the task is a semantic task, we create the semantic COCO palette from the metadata, generate the maps, segment the data, label each predicted segment type, assign a colour to it, and return the image output. A sketch of both branches follows.
Then we install and use the Gradio library. Gradio is an open-source Python package which allows you to build a demo or web application for your machine learning model, API, or any arbitrary Python function instantly. We can share the demo through a link, because of Gradio's built-in sharing feature, with links that last for 48 to 72 hours. Using Gradio, we build a demo application where we upload an image and perform the segmentation.
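A minimal Gradio demo wrapping the segment function sketched above might look like this; the title and the rendering of the id map as a grayscale image are our choices, not the paper's.

```python
import gradio as gr

demo = gr.Interface(
    fn=lambda img: segment(img, task="panoptic").numpy().astype("uint8"),
    inputs=gr.Image(type="pil", label="Aerial image"),
    outputs=gr.Image(label="Segment id map"),
    title="Building extraction with Mask2Former",   # illustrative title
)
demo.launch(share=True)   # share=True creates a temporary public link
```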

So, for the whole dataset we use 180 images to train and 180 images to test. The dataset is taken from Kaggle and is the Inria Aerial Image Dataset. It contains satellite images of Bellingham, Chicago, and Austin in the United States of America (USA), along with Tyrol and Vienna in Austria. The dataset's imagery has 35.29% representation of China, 11.76% of the USA, 5.88% of France, and 4% of Spain, with the remaining data coming from various parts of the globe.

Some of the images of the dataset are shown in Figure 2.

Figure 2: Some Images from Austin City, USA.

VI. EVALUATING THE TRAINED MODEL

Model assessment is a crucial step in determining the effectiveness and dependability of a machine learning model. In the realm of automatic building extraction through deep learning, model assessment encompasses several essential steps:

1. Data Segmentation: The dataset is commonly split into training, validation, and testing datasets. The training dataset is utilized for training the model, the validation dataset helps with hyperparameter tuning and performance assessment during the training phase, and the testing dataset is employed to evaluate the performance of the trained model.

2. Selection of Assessment Metrics: Depending on the problem's nature and output type, suitable assessment metrics are selected. For this type of task we track, across training epochs, the accuracy, precision, recall, and F1-score of the output.

3. Calculation of Assessment Metrics: Following model training, the model is assessed on the testing dataset using the chosen assessment metrics. These metrics provide insight into various facets of the model's performance, such as its capability to accurately predict the buildings and other parts of the data in the dataset, to minimize errors in segmenting building outlines, and to strike a balanced trade-off between precision and recall (a sketch of these computations follows this list).

4. Visualization of Results: Constantly checking across epochs tells us where we can still improve the model training, and through that we can improve precision and accuracy; visualizing the model's performance curves shows how well we are able to segment the buildings from the dataset. These visual aids help in comprehending the distribution of correct and incorrect predictions and in evaluating the model's trade-offs between different performance metrics.

5. Analysis of Results: The assessment outcomes are analysed to gauge the overall efficiency of the model in segmenting the buildings from the dataset. This entails scrutinizing the performance metrics, pointing out potential areas for enhancement, and grasping the implications of the model's performance for real-world applications.

6. Continuous Improvement: Based on the assessment findings, the model may undergo refinement through iterative processes like hyperparameter tuning and error checking.
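The metrics above can be computed from pixel-level counts; the following is a small sketch of those formulas for a binary building-vs-background mask, not the paper's exact evaluation code.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute accuracy, precision, recall, F1, and IoU from two (H, W)
    binary masks, where 1 marks building pixels."""
    tp = np.logical_and(pred == 1, truth == 1).sum()
    fp = np.logical_and(pred == 1, truth == 0).sum()
    fn = np.logical_and(pred == 0, truth == 1).sum()
    tn = np.logical_and(pred == 0, truth == 0).sum()
    eps = 1e-9                                  # guards against division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall + eps),
        "iou": tp / (tp + fp + fn + eps),
    }
```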
Figure 3: Resultant graphs from the process of improving our model performance.

Figure 4: Hyperparameter comparison graphs, showing how well the model is performing.

Figure 5: Result of segmentation of one of the images from the dataset.

Figure 6: F1 Score Curve for our model.
VII. CONCLUSION

This study investigates the effectiveness of Mask2Former, a Transformer-based deep learning model, in extracting buildings from aerial images. Our results demonstrate Mask2Former's ability to achieve good accuracy on this task. Its transformer architecture excels at capturing long-range features in an image, which is important for segmenting complex buildings, especially across large areas, and its efficiency makes it suitable for use in resource-constrained environments. Application areas include urban planning, disaster management, and resource allocation.

Further research could explore ways to improve the generality of the Mask2Former model. This would include training on more varied material, covering different image resolutions, environments, and building types. The model's ability to capture long-range dependencies and its strong performance make it useful for many applications.

Figure 7: Labels Correlogram for our model.
Figure 6: Normalized Confusion Matrix of our Model.

Figure 8: R Curve for our model.

Remaining graphs:

Figure 9: Confusion Matrix for our model.

Figure 10: P Curve for our model.

Figure 11: Labels Graphs for our model.

VIII. REFERENCES

[1] J. Li, W. He, W. Cao, L. Zhang, and H. Zhang, "UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images," IEEE Transactions on Geoscience and Remote Sensing, 2024, doi: 10.1109/TGRS.2024.3361211.

[2] Y. Wang, X. Li, and Z. Chen, "Ultra-High-Resolution Segmentation with Ultra-Rich Context: A Novel Benchmark," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 23621-23630, doi: 10.1109/CVPR52729.2023.02262.

[3] A. Sharma, R. Verma, and S. Kumar, "Automatic Building Footprint Extraction using Deep Learning," in Proceedings of the 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN), 20-21 April 2023, pp. 123-130, doi: 10.1109/CICTN57981.2023.10140818.

[4] J. Smith, M. Johnson, and P. Wang, "Extraction of Dense Urban Buildings from Photogrammetric and LiDAR Point Clouds," IEEE Access, vol. 9, pp. 111823-111832, Aug. 2021, doi: 10.1109/ACCESS.2021.3102632.

[5] Y. Li, H. Chen, and Z. Liu, "Remote Sensing Urban Green Space Layout and Site Selection Based on Lightweight Expansion Convolutional Method," IEEE Access, vol. 11, pp. 99889-99900, Sep. 2023, doi: 10.1109/ACCESS.2023.3314819.

