
mathematics

Article
Automatic Recognition of Indoor Fire and Combustible Material
with Material-Auxiliary Fire Dataset
Feifei Hou, Wenqing Zhao and Xinyu Fan *

School of Automation, Central South University, Changsha 410083, China; [email protected] (F.H.);
[email protected] (W.Z.)
* Correspondence: [email protected]

Abstract: Early and timely fire detection within enclosed spaces notably diminishes the response
time for emergency aid. Previous methods have mostly focused on singularly detecting either fire
or combustible materials, rarely integrating both aspects, leading to a lack of a comprehensive
understanding of indoor fire scenarios. Moreover, traditional fire load assessment methods such as
empirical formula-based assessment are time-consuming and face challenges in diverse scenarios.
In this paper, we collected a novel dataset of fire and materials, the Material-Auxiliary Fire Dataset
(MAFD), and combined this dataset with deep learning to achieve both fire and material recognition
and segmentation in the indoor scene. A sophisticated deep learning model, Dual Attention Network
(DANet), was specifically designed for image semantic segmentation to recognize fire and combustible
material. The experimental analysis of our MAFD database demonstrated that our approach achieved
an accuracy of 84.26% and outperformed the prevalent methods (e.g., PSPNet, CCNet, FCN, ISANet,
OCRNet), making a significant contribution to fire safety technology and enhancing the capacity to
identify potential hazards indoors.

Keywords: fire detection; combustible material recognition; deep learning; indoor fire scene;
semantic segmentation

MSC: 68T45

Citation: Hou, F.; Zhao, W.; Fan, X. Automatic Recognition of Indoor Fire and Combustible Material with Material-Auxiliary Fire Dataset. Mathematics 2024, 12, 54. https://doi.org/10.3390/math12010054

Academic Editors: Jesús García-Herrero and Johan Debayle

Received: 15 November 2023; Revised: 14 December 2023; Accepted: 21 December 2023; Published: 23 December 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The trend of urbanization has resulted in a growing population residing and working within large buildings [1]. Due to their densely populated and complex structures, these large buildings harbor numerous fire hazards. The most common sources of indoor fire are faulty appliances or equipment, aging electrical systems, careless disposal of cigarettes or matches, gas leaks, improper handling or storage of flammable liquids, and deliberate arson [2]. Indoor fires can spread rapidly and produce toxic smoke and gases, which can impair the visibility and health of the occupants and the firefighters. Indoor fires can also damage the structural integrity of the building and cause collapse or partial failure [3]. Therefore, substantial casualties and financial losses can result from indoor building fires [4]. Based on this, indoor fire detection has become increasingly challenging and plays a crucial role in effectively managing disasters [5].

One of the keys to indoor fire detection is to identify the combustible materials that are involved in the fire. Many combustible materials used indoors are one of the main causes of fires, and common flammable items such as sofas, mattresses, curtains, and wooden furniture can easily ignite and cause large-scale fires. After an indoor fire occurs, it is initially limited to the combustion of combustible materials at the ignition point, then spreads to adjacent rooms or areas as well as the entire floor, and finally spreads to the entire building. The degree of indoor fire spread is related to factors such as the combustion performance of indoor materials and substances and the quantity of combustible materials.




Therefore, automating the assessment and identification of combustible materials in indoor fire scenarios becomes imperative to prevent fires and minimize property losses as much as possible. It allows firefighters to make an objective and rapid judgment of the fire situation, which helps them rescue trapped personnel and also plays an important guiding role in fire prevention and control work.

There are three challenges in automating this process. The first challenge lies in the complexity of indoor scene images, which surpasses that of outdoor images due to intricate backgrounds, diverse interior decorations, severe occlusions, variations in perspective, etc. Therefore, the complex background will interfere with our fire detection task. The second challenge is that once a fire occurs, factors such as the source, size, and degree of combustion of the fire will jointly determine the combustion level of indoor materials, leading to inconsistent damage levels, which will affect the subsequent classification and recognition of combustible materials. Therefore, it is necessary to consider all constraints imposed by disaster management scenarios. The third challenge is that benchmark fire datasets are not available, and the fire scene is severely damaged and chaotic, making it difficult to collect fire data. Moreover, the unavailability of on-site images due to privacy concerns often restricts further research in this field.

To address these challenges effectively, we propose this innovative research utilizing efficient CNNs for the precise segmentation of both fires and combustible materials within real-world fire scenarios. Specifically, a novel fire-material recognition framework is proposed to revolutionize the current state of fire detection and rescue. Figure 1 presents the proposed framework. This work contributes in three main aspects:

(1) We present an efficient deep learning semantic segmentation framework based on a dual attention mechanism, which involves position attention and channel attention and assigns pixels with object class and attribute labels.
(2) We are the first to simultaneously estimate the fire object and fire load in indoor scenes and explore a multi-task learning strategy to learn the correlations between fire burning degree and combustible material statistics. The segmentation accuracy levels of fire and combustible material can be significantly enhanced for detailed scene analysis.
(3) We introduce and collect a new database, the Material-Auxiliary Fire Dataset (MAFD), with attribute labels for combustible material and class labels for fire objects, which provides a benchmark to encourage automatic applications in indoor fire scenes.

Figure 1. Indoor fire and combustible material recognition framework.

The subsequent sections of this paper are organized as follows. Section 2 provides an
overview of the literature concerning fire detection and the recognition of material. Our
deep learning framework for fire and combustible material segmentation within indoor
scenes is detailed in Section 3. Section 4 introduces the development of our dataset, the
MAFD, and presents extensive experiments with it. Section 5 concludes the paper by
summarizing key points and proposing future research directions.

2. Literature Review
With the continuous advancement of technology, there is growing attention toward
developing efficient and reliable methods to identify fire and smoke. Numerous compre-
hensive reviews and surveys have been conducted within the realm of fire and smoke
detection. Among them, the methods utilized can be categorized into two main groups:
traditional methods and deep learning-based approaches.
Traditional methods are usually based on image processing algorithms such as edge
detection, morphological processing, and threshold segmentation. Wu et al. [6] used camera
sensors for fire smoke detection, extracting static and dynamic features, and achieving
strong results with AdaBoost. Russo et al. [7] proposed a method for the smoke detection
of surveillance cameras based on local binary pattern (LBP) and support vector machine
(SVM). Wang et al. [8] proposed a rapid smoke detection method using slope fitting in
video image histogram, addressing false alarms in early fire smoke detection. Cao et al. [9]
proposed patchwise dictionary learning within the wavelet domain to detect smoke in
forest fire videos. Their method aims to distinguish fire smoke from other challenging objects in the forest that share a similar visual grayscale appearance.
Gagliardi et al. [10] introduced video-based smoke detection technology using techniques
like the Kalman estimator, blob labeling, and decision-making processes. Hossain et al. [11]
introduced a novel technique for forest fire detection that relied on fire-specific color
features and the multi-color space local binary pattern to identify distinct attributes of
flames and smoke. They also employed support vector machines as classifiers. Their results
showed that this method has a higher performance compared to other color or texture-
based methods. However, these traditional methods typically require manual parameter
selection and adjustment and have poor adaptability to various smoke densities and color
variations, which also suffer from several drawbacks including a high rate of false alarms
(FAR), restricted accuracy, and a reduced detection range.
In contrast, deep learning methods, by utilizing learned features to identify and seg-
ment fire and smoke patterns and adapting to various smoke conditions, have introduced
a novel research avenue for addressing early fire detection challenges. Jia et al. [12] utilized
domain knowledge and transfer learning from deep convolutional neural networks (CNN)
for video smoke detection and reduced the false positive rate of the video smoke detec-
tion (VSD) systems to some extent. However, low-level features were not utilized. Peng
et al. [13] combined manually crafted features with deep learning features. They utilized an
algorithm designed manually to extract areas suspected to contain smoke, which were then
processed using an enhanced SqueezeNet deep neural network for smoke detection. Cheng
et al. [14] employed Deeplabv3+ and conditional random fields for accurate segmentation,
established smoke thickness heatmaps and predicted smoke trends with generative adver-
sarial networks, contributing to fire protection and evacuation planning. To address issues
in video-based smoke detection, Yuan et al. [15] introduced a deep smoke segmentation
network designed to derive precise segmentation masks from unclear smoke images. Lin
et al. [16] devised an integrated detection framework by combining a faster Region-CNN
(RCNN) and 3D CNN, enhancing video smoke detection by maximizing the utilization
of temporal information within video sequences. Li et al. [17] introduced an adaptive
linear feature-reuse network (ALFRNet) for rapid forest fire smoke detection, effectively
reducing information loss and interference caused by image blurring during the smoke
image acquisition process. Liu et al. [18] introduced a smoke detection approach using
an ensemble of simple deep CNNs by capturing diverse smoke aspects and aggregating
subnetwork responses via majority voting, outperforming existing methods on newly
established noisy smoke image datasets. To meet the needs of complex aerial forest fire
smoke detection tasks, Zhan et al. [19] proposed an adjacent layer composite network based
on a recursive feature pyramid with deconvolution and dilated convolution and global
optimal non-maximum suppression (ARGNet) for the high-accuracy detection of forest fire
smoke. Hu et al. [20] proposed a novel method for early forest fire smoke detection called
multi-oriented detection. This method integrated a value conversion-attention mechanism
module and Mixed-Non-Maximum Suppression (Mixed-NMS) to overcome common mis-
detection and missed detection issues, elevating target detection accuracy. To change the
fact that the majority of current computer vision-based fire detection methods can only
identify either flames or smoke, Hosseini et al. [21] introduced a unified flame and smoke
detection method, named UFS-Net, which can identify potential fire risks by categorizing
video frames into eight distinct classes. Khan et al. [22] proposed an energy-efficient system
based on VGG-16 architecture for early smoke detection in both normal and foggy IoT
environments. He et al. [23] also proposed a method targeting foggy environments that
combined attention mechanisms and feature-level and decision-level fusion modules. From
various perspectives including overall, individual categories, small smoke, and challenging
negative sample detection, their approach achieved higher detection accuracy, precision,
recall, and F1 scores. To meet the requirements of smoke detection within an industrial
environment, Muhammad et al. [24] proposed an energy-friendly edge intelligence-assisted
method for smoke detection in foggy surveillance environments using deep CNN.
The concept of fire load is of utmost importance in fire safety and building resilience.
Many combustible materials used indoors are one of the main causes of fires. Numerous
studies have focused on material recognition. Strese et al. [25] proposed a tool-mediated
surface classification method. This method combines the extracted feature information
such as sound, image, friction, and acceleration with a naive Bayesian classifier to identify
different materials. Zhang et al. [26] proposed a novel hierarchical multi-feature fusion
(HMF2) model, aiming to gather essential information and employ a classifier for training
a novel material recognition model. They tested the simplicity, effectiveness, robustness,
and efficiency of the HMF2 model on two benchmark datasets. Lee et al. [27] proposed
a material-type identification method using a deep CNN based on color and reflectance
features. The proposed method was evaluated on public datasets, showing promising
results for material type identification.
Although researchers have conducted extensive algorithmic research in the field of
material recognition and have high-quality public datasets, there is currently no algorithmic
research for complex indoor fire scenarios, nor is there a relevant public dataset. In addition,
the main limitation of these methods is the lack of the simultaneous evaluation of fire
objects and fire loads. Fortunately, our work has solved these problems. Specifically,
Section 3 provides details of the proposed methodology, while Section 4 discusses the
experimental validation.

3. Dual Attention Fire Recognition Methodology


3.1. Architecture of Semantic Segmentation Model
The Dual Attention Network (DANet) model [28] was adopted as our semantic segmentation model, and its overall architecture is depicted in Figure 2. Different from previous approaches that utilize multi-scale feature fusion to capture context, DANet addresses scene segmentation with a self-attention mechanism. This mechanism efficiently integrates intricate contextual dependencies, allowing the adaptive fusion of dispersed features along with their global dependencies. In order to improve the model accuracy, two types of attention modules were attached on top of the dilated FCN (Fully Convolutional Network), which simulate semantic interdependence in the spatial and channel domains, respectively. The position attention module selectively aggregates the features of each position through a weighted sum over all positions, so that similar features are related to each other regardless of the distance between them. At the same time, the channel attention module selectively emphasizes interdependent channel maps by integrating the relevant features across the entirety of channel maps. Finally, the outputs of these two attention modules are summed to further improve the feature representation, which helps to obtain more accurate segmentation results. During the aggregation process of the two modules, according to the vector product theory, a larger product of two vectors means a smaller angle between them and therefore a stronger correlation between the two vectors. Detailed descriptions of the two modules are presented in the subsequent sections.

Figure 2. Dual Attention Network (DANet) framework with vector output distribution at each stage.

3.2. Attention Modules for Feature Representation
3.2.1. Position Attention Module
The input of the position attention module is a feature map A, expressed as C × H × W, where C represents the number of channels, while H represents the height of the feature map, and W represents the width of the feature map. The specific working principle of this module is shown in Figure 2, and the vector size obtained at each stage is also marked in Figure 2. Since the focus of the position attention module is to mine the similarity relationship between pixels, in order to better use the attention module, the feature maps B, C, and D are reshaped to obtain three matrices of size C × N, where N = H × W. These three matrices correspond to Q, K, and V of the self-attention mechanism. Each step is described in detail below:
(1) Calculate the similarity matrix between pixels: a similarity matrix of size N × N is obtained through Q^T × K, that is, the (N × C) matrix multiplied by the (C × N) matrix;
(2) Perform a softmax operation on the similarity matrix to obtain each relative factor that affects the pixel;
(3) Multiply the similarity matrix S after softmax with the V matrix, that is, multiply the (C × N) matrix by the (N × N) matrix, and finally obtain the recoded feature representation, whose size is also C × N; the generation formula of S is shown in Equation (1). The purpose of multiplying the original matrix by the similarity matrix is to amplify the influence of pixels that are similar to it and reduce the influence of pixels that are not similar to it, which can also be called a re-encoding operation;
(4) Perform the reshape operation on the finally obtained new feature matrix to obtain a recoded feature map with a size of C × H × W;
(5) Add the feature map to the features extracted from the upper network to obtain the output E of the final position attention module, whose size is still C × H × W, where the generation formula of E is shown in Equation (2). The scaling factor α initially begins at 0 and gradually adjusts to attain higher weights.

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)} \qquad (1)$$

$$E_j = \alpha \sum_{i=1}^{N} s_{ji} D_i + A_j \qquad (2)$$
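As a concrete illustration of steps (1)–(5) and Equations (1) and (2), the following minimal PyTorch sketch implements the position attention computation. It is not the authors' released code: the 1 × 1 convolutions that produce Q, K, and V from the input feature map A, and the channel reduction factor of 8 for Q and K, are assumptions that follow common DANet implementations.

```python
import torch
import torch.nn as nn


class PositionAttention(nn.Module):
    """Minimal sketch of the position attention module (Equations (1)-(2))."""

    def __init__(self, in_channels: int):
        super().__init__()
        reduced = in_channels // 8  # channel reduction for Q and K; an assumption
        self.query_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)      # produces B -> Q
        self.key_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)        # produces C -> K
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)  # produces D -> V
        self.alpha = nn.Parameter(torch.zeros(1))  # scaling factor, initialized to 0

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        batch, c, h, w = a.shape
        n = h * w
        q = self.query_conv(a).view(batch, -1, n).permute(0, 2, 1)   # (batch, N, C')
        k = self.key_conv(a).view(batch, -1, n)                      # (batch, C', N)
        s = torch.softmax(torch.bmm(q, k), dim=-1)                   # (batch, N, N), Eq. (1)
        v = self.value_conv(a).view(batch, -1, n)                    # (batch, C, N)
        e = torch.bmm(v, s.permute(0, 2, 1)).view(batch, c, h, w)    # re-encoded features
        return self.alpha * e + a                                    # Eq. (2)
```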

3.2.2. Channel Attention Module


The channel attention module is used to mine the similarity relationship between each
channel in the image feature map, so that each channel has global semantic features. The
input is also a feature map A, whose size is 1/8 of the original image. The specific process of the channel attention mechanism is displayed in Figure 2 and proceeds as follows:
(1) Calculate the similarity matrix between channels: a similarity matrix of size C × C is obtained through Q × K^T, that is, multiplying the (C × N) matrix by the (N × C) matrix;
(2) Perform a softmax operation on the similarity matrix to obtain each relative factor affecting the channel;
(3) Multiply the similarity matrix X after softmax with the V matrix, that is, the (C × C) matrix multiplied by the (C × N) matrix, and finally obtain the recoded feature representation, whose size is also C × N, where the generation formula of X is shown in
Equation (3). The purpose of multiplying the original matrix by the similarity matrix
is to amplify the influence of similar channels and reduce the influence of dissimilar
channels;
(4) Perform the reshape operation on the finally obtained new feature matrix to obtain a
recoded feature map with a size of C × H × W;
(5) Add the feature map to the features extracted from the upper network to obtain the
output E of the final channel attention module, whose size is still C × H × W, where
the generation formula of E is shown in Equation (4). The initial value of the scaling
factor β is set to 0 and incrementally adapts to gain higher weights.

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)} \qquad (3)$$

$$E_j = \beta \sum_{i=1}^{C} x_{ji} A_i + A_j \qquad (4)$$
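Analogously, a minimal PyTorch sketch of the channel attention computation in Equations (3) and (4) is given below. It follows the equations as written above (softmax over the raw channel similarities); released DANet implementations may differ in small numerical details, so this is only an illustrative sketch.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Minimal sketch of the channel attention module (Equations (3)-(4))."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scaling factor, initialized to 0

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        batch, c, h, w = a.shape
        feat = a.view(batch, c, -1)                      # (batch, C, N)
        sim = torch.bmm(feat, feat.permute(0, 2, 1))     # (batch, C, C) channel similarity
        x = torch.softmax(sim, dim=-1)                   # Eq. (3)
        e = torch.bmm(x, feat).view(batch, c, h, w)      # re-encoded channel features
        return self.beta * e + a                         # Eq. (4)
```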

3.3. BaseNet Selection


The DANet model is based on dilated-ResNet (Residual Network) [29] as the backbone network to extract features. Two CNN backbones were chosen as a basis for assessing the accuracy and time efficiency (denoted by network-depth-feature), forming two DANets: dilated-ResNet50 and dilated-ResNet101. ResNet and dilated-ResNet are state-of-the-art CNN models and are widely used backbones in semantic segmentation models, where dilated ResNet has some advantages over ResNet by introducing dilated convolution to improve the resolution of the feature map. Typically, a deeper model is anticipated to achieve higher accuracy than a shallow model, though at the expense of increased computational time. The dilated-ResNet50 model architecture is shown in Figure 3. After the backbone network applies dilated convolution to replace the downsampling in the original model, the feature map of the image is obtained, and its size is 1/8 of the original image. Then, the feature map is input into two attention modules in parallel to obtain the global features between pixels and the global features between channels, respectively, and the outputs of the two attention modules are finally integrated to obtain a better representation.

Figure 3. Dilated ResNet50 backbone: (a) ResNet; (b) DRN.
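The following sketch illustrates, under stated assumptions, how such a dilated ResNet-50 backbone can be assembled with torchvision (replacing the stride of the last two stages with dilation so that the output stride is 8) and wired to the PositionAttention and ChannelAttention modules sketched in Section 3.2, whose outputs are fused by element-wise summation as described in Section 3.1. The classifier head, the number of classes, and the upsampling step are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class DANetSketch(nn.Module):
    """Sketch: dilated ResNet-50 backbone + dual attention heads fused by summation."""

    def __init__(self, num_classes: int = 4):  # background, fire, fabric, wood (assumed)
        super().__init__()
        # Replace downsampling in the last two stages with dilation -> feature map is 1/8 of the input.
        backbone = resnet50(replace_stride_with_dilation=[False, True, True])
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.position_attention = PositionAttention(2048)  # sketched in Section 3.2.1
        self.channel_attention = ChannelAttention()        # sketched in Section 3.2.2
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)                                 # (batch, 2048, H/8, W/8)
        fused = self.position_attention(feat) + self.channel_attention(feat)
        logits = self.classifier(fused)
        # Upsample back to the input resolution for per-pixel prediction.
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```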

Transfer learning aims to utilize previously acquired knowledge to efficiently solve new but similar problems. Unlike traditional machine learning methods, it capitalizes on knowledge gathered from auxiliary domains' data to enhance predictive modeling for disparate data patterns within the present domain. The fundamental idea of transfer learning is to extract the knowledge from a previous or source task and apply the extracted knowledge to a new/target task. A conceptual metaphor is that it will be easier for a child to learn how to recognize peaches if they have already learned how to recognize apples and pears.

We employed transfer learning to reduce the training difficulty for our relatively small dataset as well as to enhance performance. Transfer learning shows promise in minimizing the dependence on a large amount of target domain data by transferring knowledge from diverse yet related source domains [30]. The deep learning model was first pretrained on the large general-purpose VOC dataset, and then trained and tested on our dataset.
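In our pipeline, pretraining and fine-tuning are handled through MMSegmentation configuration files; purely as an illustration of the transfer-learning idea, the hedged PyTorch sketch below initializes a target model from source-task weights while skipping layers whose shapes no longer match (e.g., a classifier head retrained for a different number of classes). The checkpoint filename is a hypothetical placeholder.

```python
import torch

# Hypothetical checkpoint path; in practice this would be a model pretrained on Pascal VOC.
PRETRAINED_CKPT = "danet_r101_voc_pretrained.pth"


def load_pretrained(model: torch.nn.Module, ckpt_path: str = PRETRAINED_CKPT) -> torch.nn.Module:
    """Initialize a model from source-task weights, skipping tensors whose shapes differ."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("state_dict", state)  # some checkpoints nest weights under 'state_dict'
    own = model.state_dict()
    compatible = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    print(f"Transferred {len(compatible)}/{len(own)} tensors from the source checkpoint.")
    return model
```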
4. Experiments and Results

4.1. Experimental Settings and Evaluation Metrics
The implementation was carried out in a Python environment (version 3.8.10, provided by the Python Software Foundation, Wilmington, DE, USA) using the PyTorch deep learning package (PyTorch: 1.10.0 + cu111, TorchVision: 0.11.0 + cu111) and a single NVIDIA GeForce RTX 3060 GPU. MMSegmentation [31] is an open-source PyTorch-based toolbox specifically designed for semantic segmentation tasks. It decomposes the semantic segmentation framework into different components; by combining different modules, a customized semantic segmentation framework can be easily built. The toolbox provides direct support for prevalent and contemporary semantic segmentation frameworks and offers pre-trained semantic segmentation models on various mainstream datasets. In this paper, the semantic segmentation framework provided in MMSegmentation was utilized for both model training and verification, employing mmcv version 2.0.0rc4 and MMSegmentation version 1.1.2. DANet employs ResNet as the model backbone. The images in our dataset were randomly divided into training (90%) and validation (10%) sets. The model was trained on the training set and tested on the validation set.
During training, images were resized to 512 × 512 pixels for input. The optimizer for
the three models was stochastic gradient descent (SGD) with a learning rate of
5 × 10−4 , a momentum of 0.975 for L2 regularization, and a weight decay factor of 0.0004.
SGD was selected instead of adaptive optimization methods (e.g., AdaGrad, RMSProp,
or Adam) due to its potential to achieve a higher test accuracy, converge toward a flatter
minimum, and consequently yield improved generalization [32]. The decode head predicts
the segmentation map from the feature map using the decoding head of DAHead, and the
loss function uses CrossEntropyLoss, and the loss weight is 0.3. Auxiliary_head encourages
the backbone network to learn lower-level features that are not used for prediction. The
decoding head of FCNHead is used. The loss function uses CrossEntropyLoss and the loss
weight is 0.15. Data augmentation used during training included horizontal flipping and
random cropping. The training used the PolyLR scheduler, which reduces the learning
rate according to a polynomial function, the minimum learning rate is 1 × 10−6 , and is
scheduled according to each iteration. The maximum number of iterations of the training
loop was 40,000, and there was an interval of 10,000 loops for verification. Table 1 displays
the experimental settings.
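The flipping and cropping must be applied jointly to each image and its label mask; as a hedged illustration (the actual augmentation is configured through the MMSegmentation data pipeline), a minimal paired-transform sketch is given below. It assumes PIL inputs that are at least 512 pixels on each side.

```python
import random
import torchvision.transforms.functional as TF

CROP = 512  # crop size from Table 1


def augment_pair(image, mask):
    """Apply the same horizontal flip and random crop to a PIL image and its label mask."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    w, h = image.size  # PIL convention: (width, height)
    top = random.randint(0, max(h - CROP, 0))
    left = random.randint(0, max(w - CROP, 0))
    image = TF.crop(image, top, left, CROP, CROP)
    mask = TF.crop(mask, top, left, CROP, CROP)
    return image, mask
```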

Table 1. Experimental settings.

Setting Value
Batch size 1
Crop size 512 × 512
Momentum 0.975
Initial learning rate 0.0005
Weight decay 0.0004
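For readers who prefer plain PyTorch over MMSegmentation configuration files, the sketch below reproduces the optimization settings described above (SGD with a base learning rate of 5 × 10⁻⁴, momentum 0.975, weight decay 0.0004, per-iteration polynomial decay down to 1 × 10⁻⁶, and cross-entropy losses weighted 0.3 for the decode head and 0.15 for the auxiliary head). The polynomial power of 0.9 is an assumption, since it is not stated above.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

MAX_ITERS, BASE_LR, MIN_LR, POLY_POWER = 40_000, 5e-4, 1e-6, 0.9  # power is an assumption


def build_optimizer_and_scheduler(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR,
                                momentum=0.975, weight_decay=4e-4)

    def poly_lr(iteration: int) -> float:
        # Polynomial decay per iteration, clamped at the minimum learning rate.
        factor = (1 - iteration / MAX_ITERS) ** POLY_POWER
        return max(factor, MIN_LR / BASE_LR)

    # Call scheduler.step() once per training iteration to match the per-iteration schedule.
    scheduler = LambdaLR(optimizer, lr_lambda=poly_lr)
    return optimizer, scheduler


# Main (DAHead) and auxiliary (FCNHead) outputs are both trained with cross-entropy,
# weighted 0.3 and 0.15, respectively, as stated above.
criterion = torch.nn.CrossEntropyLoss()


def total_loss(main_logits, aux_logits, target):
    return 0.3 * criterion(main_logits, target) + 0.15 * criterion(aux_logits, target)
```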

The Acc (accuracy) [31] refers to the proportion of accurately classified pixels to the total
number of pixels in the segmentation result. In image segmentation, we usually compare
the predicted label for each pixel with the true label, and then calculate the accuracy rate
between them. Specifically, we can count the number of pixels in the segmentation result
that are the same as the real result, and then divide them by the total number of pixels
to obtain the segmentation accuracy. The higher the Acc, the more pixels are correctly
classified in the segmentation result, and the better the segmentation performance. The
calculation formula of Acc is as follows:

$$\mathrm{Acc} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \qquad (5)$$

Among them, i represents the real value, j represents the predicted value, Pij represents
the number of pixels that predict i as j, and k represents the total number of categories.
The mIoU (mean intersection over union) refers to the average value of the intersection
and union ratios between the segmentation results and the real segmentation results. In
image segmentation, we usually compare the predicted value with the ground truth for
each class, and then calculate the IoU between them. Specifically, for each category, we can
count the number of pixels in the segmentation result and the ground truth result, and then
calculate the intersection and union between them. We can then divide the intersection
by the union to obtain the IoU for that category. Finally, we can average the IoU of all
categories to obtain the mIoU of the whole image. The higher the mIoU, the closer the
segmentation result is to the real result, and the better the segmentation performance. The
calculation formula of mIoU is as follows:

$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \qquad (6)$$

Among them, i represents the real value, j represents the predicted value, Pij represents
the number of pixels that predict i as j, and k represents the total number of categories.

The mAcc (mean accuracy) refers to the average of the accuracy of each category. In
image segmentation, we usually compare the predicted value of each class with the true
value, and then calculate the accuracy between them. Specifically, for each category, we can
count the number of pixels that are correctly classified in the segmentation result and the
ground truth result, and count the number of pixels of that category in the ground truth
result. We can then divide the number of correctly classified pixels by the number of pixels
for that class to obtain the accuracy for that class. Finally, we can average the accuracies
across all classes to obtain the mAcc for the entire image. The higher the mAcc means that
each category is better recognized and distinguished, and the segmentation performance is
better. The calculation formula of mAcc is as follows:

$$\mathrm{mAcc} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \qquad (7)$$

Among them, i represents the real value, j represents the predicted value, Pij represents
the number of pixels that predict i as j, and k represents the total number of categories.
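The three metrics can be computed directly from a pixel-level confusion matrix; MMSegmentation reports them automatically, but the short NumPy sketch below makes Equations (5)–(7) explicit. Here, p[i, j] counts pixels whose true class is i and predicted class is j.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """p[i, j] = number of pixels whose true class is i and predicted class is j."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def segmentation_metrics(p: np.ndarray):
    diag = np.diag(p).astype(float)
    acc = diag.sum() / p.sum()                           # Equation (5)
    iou = diag / (p.sum(axis=1) + p.sum(axis=0) - diag)  # per-class IoU
    class_acc = diag / p.sum(axis=1)                     # per-class accuracy
    return acc, np.nanmean(iou), np.nanmean(class_acc)   # Acc, mIoU (Eq. (6)), mAcc (Eq. (7))
```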

4.2. Material-Auxiliary Fire Dataset


In this paper, we collected and developed a dataset containing indoor fire images
and material segmentation annotations, named ‘MAFD’. Indoor scenes typically contain
a variety of materials with complicated layout relationships. Therefore, two material
categories were identified as labels to represent each indoor object based on their primary
composition. The selected categories included: (1) fabric and (2) wood. These materials
are commonly used and are usually the main components of indoor scenes. They are
combustible and also exhibit varying degrees of flammability. In addition, the dataset
also included the fire category, which includes flames and smoke. Therefore, our dataset
involves a total of three categories.
A total of 3899 images were collected from three parts. (1) A part of these images
was gathered from the dataset used in Deep Learning-Based Instance Segmentation for
Indoor Fire Load Recognition [33], available at https://github.com/Zhou-Yucheng/fire-load-detection (accessed on 5 November 2022). The dataset images contain at least one
combustible object and have an image of appropriate resolution. There were 1015 images
in total, distributed in five classes: bedroom, dining room, hospital, living room, and office.
(2) The second part of these images was collected from online sources and other public
datasets. This part encompasses outdoor scenes with fires including a variety of settings
like buildings, streets, vehicles, forests, and farmland. We specifically targeted images
with clearly distinguishable fire visual attributes and a rich variety of environmental
features to make our dataset more representative. There was a total of 2466 images.
(3) The rest were sourced from the Kaggle website (https://www.kaggle.com/ (accessed
on 5 November 2022)), which is a platform for organizing machine learning competitions
and hosting databases. Specifically, our selection focused on indoor images featuring both
combustible materials and fire instances. Datasets need to be large enough to provide a
balanced sample type, wide diversity, and multiple categories. This collection contains a
variety of individual images from various indoor scenarios such as kitchen, factory, and
living room. Finally, a total of 418 images were selected. These images were manually
annotated into different classes based on their content, and 11,320 instance annotations
were obtained. Table 2 details the number of annotations of each category in our MAFD.

Table 2. Number of annotated instances per category.

Category Number of Instances


Fire 3071
Fabric 4055
Wood 4195
The type of annotation depends largely on the task attribute. Our task was to segment fire and material instances to obtain their complete boundary information. Therefore, in this paper, we focused on per-pixel segmentation and labeling for the fire category and the two kinds of combustible materials. Our MAFD contains polygon annotations that enclose same-material regions. Each style of annotation comes with a cost proportional to its complexity. Our images were annotated using the LabelMe [34] tool. The annotation results were saved in JSON files, and then the format conversion toolkit provided by LabelMe was used to convert the JSON files into the standard Pascal VOC (Pattern Analysis Statistical Modeling and Computational Learning, Visual Object Classes) [35] format. Pascal VOC is a standardized dataset format used for object detection, semantic segmentation, and more. Its annotations are stored in XML files, each corresponding to an image, containing information about all objects in the image including their location and category. Sample images with annotated categories are visualized in Figure 4, where each labeled instance is indicated by a colored mask with its category name located in the bottom-right corner.

Figure 4. Sample images with annotated category and name.
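As a hedged illustration of what the LabelMe-to-mask conversion produces, the sketch below rasterizes the polygon annotations of one LabelMe JSON file into a per-pixel class-index mask. The JSON keys follow the standard LabelMe format; the class-to-index mapping is an assumption for this example, and our actual pipeline relied on LabelMe's own conversion toolkit to produce Pascal VOC files as described above.

```python
import json

import numpy as np
from PIL import Image, ImageDraw

CLASS_IDS = {"fire": 1, "fabric": 2, "wood": 3}  # 0 = background; mapping is an assumption


def labelme_json_to_mask(json_path: str) -> np.ndarray:
    """Rasterize LabelMe polygon annotations into a per-pixel class-index mask."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        label = shape["label"]
        if label in CLASS_IDS:
            polygon = [tuple(pt) for pt in shape["points"]]
            draw.polygon(polygon, fill=CLASS_IDS[label])
    return np.array(mask, dtype=np.uint8)
```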

4.3. Experiment 1: Selection of Optimal Model


We used DANet-50 with ResNet-50 as the backbone network and DANet-101 with
ResNet-101 as the backbone network to train on our dataset, and compared the results
when the training iterations were 20 k and 40 k. Although the mAcc of DANet-50–40 k was
slightly higher than that of DANet-101, the overall accuracy of DANet-101 was higher and its recognition accuracy was more balanced across categories, so we chose DANet-101 as the
training model in this paper. Table 3 displays the aAcc, mIoU, and mAcc percentages of the
DANet-50 and DANet-101 models after training for 20 k and 40 k iterations.

Table 3. The aAcc, mIoU, and mAcc (%) of the DANet-50 and DANet-101 models trained over 20 k
and 40 k iterations.

Method Total Number of Training Iterations aAcc mIoU mAcc


DANet-50 20 k 82.15 59.46 71.27
DANet-50 40 k 81.84 60.63 73.98
DANet-101 20 k 82.99 60.95 72.25
DANet-101 40 k 83.19 61.73 73.33

We then further optimized the parameters of the selected DANet-101 model and added
training iterations. It was found that all indicators improved at first as the number of iterations increased. When the training iterations were set to 100 k, the trained
model performed best and its performance across the fire, wood, and fabric categories was relatively balanced. When the training iterations continued to increase, the aAcc, mIoU, and
mAcc all decreased. Therefore, the corresponding model parameters when the training
iterations were 100 k were finally selected. Table 4 illustrates how the aAcc, mIoU, and
mAcc changed for the DANet-101 model across diverse training iterations.

Table 4. The aAcc, mIoU, and mAcc (%) of DANet-101 using the training iterations of 20 k, 40 k, 60 k,
80 k, 100 k, and 120 k.

Method Total Number of Training Iterations aAcc mIoU mAcc


DANet-101 20 k 82.99 60.95 72.25
DANet-101 40 k 83.19 61.73 73.33
DANet-101 60 k 82.50 61.11 73.73
DANet-101 80 k 82.71 61.59 73.74
DANet-101 100 k 84.26 64.85 77.05
DANet-101 120 k 83.04 60.03 70.53

Figure 5 showcases the performance curves over 100 k iterations. The loss function
represents the difference between the predicted output and the actual target. As the training
steps increase, the changes in the loss function will display the model’s degree of fit to the
training data during the training period as well as the optimization effect of the model.
From Figure 5a, it can be observed that with an increase in training steps, the value of
‘loss’ gradually decreased to around 0.2, and ‘aux.loss_ce’ (cross-entropy loss) decreased to
around 0.05. This indicates that the model progressively achieved a better fit to the training
data. Figure 5b shows the changes in the test set classification evaluation indicators (mIoU,
mAcc, and aAcc) with the step size. When the step reached 40,000, the aAcc, mIoU, and
mAcc reached their maximum values of 84.26%, 64.85%, and 77.05%, respectively.

Figure 5. Performance curves of 100 k iterations: (a) the variation of the training set loss function with step size and (b) the changes in the test set classification evaluation indicators (mIoU, mAcc, and aAcc) with step size.

4.4. Experiment 2: Visualization Results of the Proposed Model

To verify the effectiveness of the proposed model, we provide visual results in various indoor scenes such as a living room, kitchen, restaurant, and office using the MAFD. As shown in Figure 6, the proposed scheme was able to segment fire objects well without any post-processing. The first and third rows in Figure 6 show the input fire images, while the corresponding output results segmented by the proposed model are shown in the second and fourth rows, where the fire objects are represented by red masks, wood objects by yellow masks, fabric objects by green masks, and other areas are darker. It can be seen that the scene in Figure 6a is a living room. The sofa in the living room is made of a flammable fabric material, and the tables and chairs are made of a flammable wood material. This complex indoor environment is a highly flammable place. Figure 6c shows the kitchen scene, and the kitchen stove is also a dangerous place because of its obvious fire source. The fires in Figure 6b (restaurant) and Figure 6d (office) were intense and accompanied by a lot of smoke. The above visualization results show that the proposed scheme can mark flame, fabric, and wood in various scenes in the MAFD, which verifies that it can realize the segmentation of flame, fabric, and wood in indoor fire scenes. Therefore, it is feasible to use the proposed scheme to identify fire, fabric, and wood.

Figure 6. Examples of the model prediction output: (a) living room scene; (b) restaurant scene; (c) kitchen scene; (d) office scene. Each pair of rows shows the input image and the corresponding segmentation result.
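Segmentation maps such as those in Figure 6 can be generated with the MMSegmentation 1.x inference API, as in the hedged sketch below. The config, checkpoint, and image paths are hypothetical placeholders, and argument names may differ slightly across MMSegmentation versions.

```python
from mmseg.apis import inference_model, init_model, show_result_pyplot

# Hypothetical paths: a DANet-101 config adapted to the MAFD classes and its trained checkpoint.
CONFIG = "configs/danet/danet_r101_mafd.py"
CHECKPOINT = "work_dirs/danet_r101_mafd/iter_100000.pth"

model = init_model(CONFIG, CHECKPOINT, device="cuda:0")
result = inference_model(model, "demo/living_room_fire.jpg")   # per-pixel class prediction
show_result_pyplot(model, "demo/living_room_fire.jpg", result,
                   opacity=0.5, out_file="living_room_fire_seg.png")
```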

4.5. Experiment 3: Comparison with State-of-the-Art Methods

To comprehensively demonstrate the performance of our proposed method, we conducted an evaluation comparing it with established state-of-the-art models such as PSPNet [36] (Pyramid Scene Parsing Network), CCNet [37] (Criss-Cross Attention Network), FCN [38], ISANet [39] (Interlaced Sparse Self-Attention Network), and OCRNet [40] (Object-Contextual Representations Network). They are all widely recognized models in the field of semantic segmentation. The basic experimental setup for the other models remained consistent with our approach, and all experiments were conducted exclusively on our MAFD. For comparison, we selected fire and one material, fabric, to represent the evaluation.

We first analyzed the comparative results based on aAcc, Acc.fire, and Acc.fabric. The results indicate that OCRNet performed relatively poorly in all of the evaluation metrics. Our proposed method exhibited the highest aAcc and Acc.fire. Although its Acc.fabric was slightly lower by 0.94% compared to ISANet, considering the overall precision, the comprehensive performance of our method surpassed that of the existing models. Figure 7 visually displays the comparison results of accuracy (%) across different models, providing clear evidence of our model's performance.

Figure 7. Comparison results of accuracy (%) across different models (aAcc, Acc.fire, and Acc.fabric).

Table 5 shows an overall comparison of the mIoU, IoU.background, IoU.fire, and IoU.fabric across various models. Notably, OCRNet showed the poorest performance on the MAFD. In contrast, our proposed method outperformed the other models across all evaluation metrics. It is therefore reasonable to conclude that the proposed method exhibits superior overall performance and higher IoU values among all existing models.

Table 5. Comparison results of IoU (%) across different models.

Model mIoU IoU.background IoU.fire IoU.fabric


PSPNet 60.20 78.07 65.73 62.54
CCNet 60.42 77.86 67.30 62.90
FCN 61.45 77.35 65.02 61.7
ISANet 61.37 78.18 65.42 65.33
OCRNet 53.48 73.48 63.96 39.14
The proposed method 64.85 79.43 70.61 64.53

To highlight the superiority of our method over state-of-the-art semantic segmentation


models, the representative visual results obtained from our comparative experiments are
depicted in Figure 8. In Figure 8a, we present the input images with fire and combustible
materials. Figure 8b–g exhibit the segmentation results produced by PSPNet, CCNet, FCN,
ISANet, OCRNet, and our proposed method, respectively.
From the images in the first column, it is evident that our method exhibited fewer
incorrect pixels compared to other models and accurately identified both clothing and
fire within the images. In the second column, concerning Figure 8b,c,e, our method
demonstrated better recognition of smoke and some minor wooden items. Likewise, in
the third column, our method performed exceptionally well in identification, even for
minute objects.
The above comparison demonstrates the superior accuracy of our model compared
to others when simultaneously detecting combustible materials and fire. Additionally, it
exhibits exceptional performance in identifying even the most minor objects, highlighting
the robustness of our method.

Figure 8. Exemplary visual results from various models: row (a) illustrates the input images, while (b–f) display the segmentation outcomes of PSPNet, CCNet, FCN, ISANet, and OCRNet, and (g) displays the results of our approach.

5. Conclusions and Discussion

Mainstream fire detection techniques primarily concentrate on fire instances but often overlook the simultaneous detection of both fire instances and combustible materials. In this paper, we adopted the DANet as the main model to detect both fire and combustible materials. DANet is a deep learning network featuring a dual attention mechanism, aimed
at improving the accuracy of identifying minute details and complex relationships within
indoor scenes. Additionally, a new database, MAFD, was tailored to collect a wide array
of fire instances and potential combustible materials. Through meticulous annotation,
this dataset aims to provide a comprehensive resource for training and evaluating models
dedicated to fire detection and material recognition within indoor settings. Ultimately, the
experimental results indicated that our model on the MAFD achieved an aAcc of 84.26%
and mAcc of 77.05%. We pioneered the simultaneous estimation of fire instances and fire
load in indoor scenarios, offering a novel strategy for fire safety protection and assessment.
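For intuition, the channel-attention half of the dual attention mechanism mentioned above can be sketched in a few lines of PyTorch. The snippet below follows the formulation published for DANet [28]; the class name ChannelAttention and the toy input tensor are our own illustrative choices, not the training code used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch in the spirit of DANet [28] (illustrative sketch only)."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.size()
        query = x.view(b, c, -1)                 # B x C x N
        key = x.view(b, c, -1).permute(0, 2, 1)  # B x N x C
        energy = torch.bmm(query, key)           # B x C x C channel affinities
        # Subtracting the row-wise maximum (as in the reference implementation)
        # stabilises the softmax over channel affinities.
        energy = torch.max(energy, dim=-1, keepdim=True)[0].expand_as(energy) - energy
        attention = torch.softmax(energy, dim=-1)
        value = x.view(b, c, -1)                 # B x C x N
        out = torch.bmm(attention, value).view(b, c, h, w)
        return self.gamma * out + x              # residual connection

# Example: re-weighting a backbone feature map.
features = torch.randn(2, 64, 32, 32)
print(ChannelAttention()(features).shape)  # torch.Size([2, 64, 32, 32])
```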
To expand the applicability of our research, several tasks remain for future work. Firstly, we will incorporate a wider array of combustible materials into the MAFD, such as paper and plastic, to ensure a richer representation of potential fire hazards within indoor environments. Secondly, we will refine the training parameters to improve training efficiency and accuracy. Thirdly, our research can serve as a new reference for designing secure buildings and evaluating the fire resistance of structures.

Author Contributions: Conceptualization, F.H.; data curation, F.H.; methodology, F.H.; software,
W.Z.; validation, W.Z.; visualization, W.Z.; writing—original draft, F.H. and W.Z.; writing—review
and editing, X.F. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (no. 62203475) and the Changsha Natural Science Foundation (no. kq2208285).
Data Availability Statement: The data supporting the findings of this study are available from the authors upon request.
Conflicts of Interest: We hereby declare that we have no conflicts of interest that could be perceived
as influencing the integrity or objectivity of our work.

References
1. Zhang, L.; Wang, G.X.; Yuan, T.; Peng, K.M. Research on Indoor Map. Geom. Spat. Inf. Technol. 2013, 43–47. [CrossRef]
2. Kuti, R.; Zólyomi, G.; László, G.; Hajdu, C.; Környei, L.; Hajdu, F. Examination of Effects of Indoor Fires on Building Structures
and People. Heliyon 2023, 9, e12720. [CrossRef] [PubMed]
3. Kodur, V.; Kumar, P.; Rafi, M.M. Fire Hazard in Buildings: Review, Assessment and Strategies for Improving Fire Safety. PSU Res.
Rev. 2020, 4, 1–23. [CrossRef]
4. Li, S.; Yun, J.; Feng, C.; Gao, Y.; Yang, J.; Sun, G.; Zhang, D. An Indoor Autonomous Inspection and Firefighting Robot Based on
SLAM and Flame Image Recognition. Fire 2023, 6, 93. [CrossRef]
5. Xie, Y.; Zhu, J.; Guo, Y.; You, J.; Feng, D.; Cao, Y. Early Indoor Occluded Fire Detection Based on Firelight Reflection Characteristics.
Fire Saf. J. 2022, 128, 103542. [CrossRef]
6. Wu, X.; Lu, X.; Leung, H. A Video Based Fire Smoke Detection Using Robust AdaBoost. Sensors 2018, 18, 3780. [CrossRef]
[PubMed]
7. Russo, A.U.; Deb, K.; Tista, S.C.; Islam, A. Smoke Detection Method Based on LBP and SVM from Surveillance Camera. In
Proceedings of the 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering
(IC4ME2), Rajshahi, Bangladesh, 8–9 February 2018.
8. Wang, H.; Zhang, Y.; Fan, X. Rapid Early Fire Smoke Detection System Using Slope Fitting in Video Image Histogram. Fire Technol.
2020, 56, 695–714. [CrossRef]
9. Wu, X.; Cao, Y.; Lu, X.; Leung, H. Patchwise Dictionary Learning for Video Forest Fire Smoke Detection in Wavelet Domain.
Neural Comput. Appl. 2021, 33, 7965–7977. [CrossRef]
10. Gagliardi, A.; Saponara, S. AdViSED: Advanced Video SmokE Detection for Real-Time Measurements in Antifire Indoor and
Outdoor Systems. Energies 2020, 13, 2098. [CrossRef]
11. Hossain, F.M.A.; Zhang, Y.M.; Tonima, M.A. Forest Fire Flame and Smoke Detection from UAV-Captured Images Using Fire-
Specific Color Features and Multi-Color Space Local Binary Pattern. J. Unmanned Veh. Syst. 2020, 8, 285–309. [CrossRef]
12. Jia, Y.; Chen, W.; Yang, M.; Wang, L.; Liu, D.; Zhang, Q. Video Smoke Detection with Domain Knowledge and Transfer Learning
from Deep Convolutional Neural Networks. Optik 2021, 240, 166947. [CrossRef]
13. Peng, Y.; Wang, Y. Real-Time Forest Smoke Detection Using Hand-Designed Features and Deep Learning. Comput. Electron. Agric.
2019, 167, 105029. [CrossRef]
14. Cheng, S.; Ma, J. Smoke Detection and Trend Prediction Method Based on Deeplabv3+ and Generative Adversarial Network. J.
Electron. Imaging 2019, 28, 1. [CrossRef]
15. Yuan, F.; Zhang, L.; Xia, X.; Wan, B.; Huang, Q.; Li, X. Deep Smoke Segmentation. Neurocomputing 2019, 357, 248–260. [CrossRef]
16. Lin, G.; Zhang, Y.; Xu, G.; Zhang, Q. Smoke Detection on Video Sequences Using 3D Convolutional Neural Networks. Fire Technol.
2019, 55, 1827–1847. [CrossRef]
17. Li, J.; Zhou, G.; Chen, A.; Wang, Y.; Jiang, J.; Hu, Y.; Lu, C. Adaptive Linear Feature-Reuse Network for Rapid Forest Fire Smoke
Detection Model. Ecol. Inform. 2022, 68, 101584. [CrossRef]
18. Liu, H.; Lei, F.; Tong, C.; Cui, C.; Wu, L. Visual Smoke Detection Based on Ensemble Deep CNNs. Displays 2021, 69, 102020.
[CrossRef]
19. Zhan, J.; Hu, Y.; Zhou, G.; Wang, Y.; Cai, W.; Li, L. A High-Precision Forest Fire Smoke Detection Approach Based on ARGNet.
Comput. Electron. Agric. 2022, 196, 106874. [CrossRef]
20. Hu, Y.; Zhan, J.; Zhou, G.; Chen, A.; Cai, W.; Guo, K.; Hu, Y.; Li, L. Fast Forest Fire Smoke Detection Using MVMNet. Knowl.-Based
Syst. 2022, 241, 108219. [CrossRef]
21. Hosseini, A.; Hashemzadeh, M.; Farajzadeh, N. UFS-Net: A Unified Flame and Smoke Detection Method for Early Detection of
Fire in Video Surveillance Applications Using CNNs. J. Comput. Sci. 2022, 61, 101638. [CrossRef]
22. Khan, S.; Muhammad, K.; Mumtaz, S.; Baik, S.W.; de Albuquerque, V.H.C. Energy-Efficient Deep CNN for Smoke Detection in
Foggy IoT Environment. IEEE Internet Things J. 2019, 6, 9237–9245. [CrossRef]
23. He, L.; Gong, X.; Zhang, S.; Wang, L.; Li, F. Efficient Attention Based Deep Fusion CNN for Smoke Detection in Fog Environment.
Neurocomputing 2021, 434, 224–238. [CrossRef]
24. Muhammad, K.; Khan, S.; Palade, V.; Mehmood, I.; de Albuquerque, V.H.C. Edge Intelligence-Assisted Smoke Detection in Foggy
Surveillance Environments. IEEE Trans. Industr. Inform. 2020, 16, 1067–1075. [CrossRef]
25. Strese, M.; Schuwerk, C.; Iepure, A.; Steinbach, E. Multimodal Feature-Based Surface Material Classification. IEEE Trans. Haptics
2017, 10, 226–239. [CrossRef] [PubMed]
26. Zhang, H.; Jiang, Z.; Xiong, Q.; Wu, J.; Yuan, T.; Li, G.; Huang, Y.; Ji, D. Gathering Effective Information for Real-Time Material
Recognition. IEEE Access 2020, 8, 159511–159529. [CrossRef]
27. Lee, S.; Lee, D.; Kim, H.-C.; Lee, S. Material Type Recognition of Indoor Scenes via Surface Reflectance Estimation. IEEE Access
2022, 10, 134–143. [CrossRef]
28. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
29. Yu, F.; Koltun, V.; Funkhouser, T. Dilated Residual Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
30. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE
Inst. Electr. Electron. Eng. 2021, 109, 43–76. [CrossRef]
31. GitHub—Open-Mmlab/Mmsegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. Available online:
https://github.com/open-mmlab/mmsegmentation (accessed on 14 March 2023).
32. Wilson, A.C.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning.
arXiv 2017, arXiv:1705.08292.
33. Zhou, Y.-C.; Hu, Z.-Z.; Yan, K.-X.; Lin, J.-R. Deep Learning-Based Instance Segmentation for Indoor Fire Load Recognition. IEEE
Access 2021, 9, 148771–148782. [CrossRef]
34. Torralba, A.; Russell, B.C.; Yuen, J. LabelMe: Online Image Annotation and Applications. Proc. IEEE Inst. Electr. Electron. Eng.
2010, 98, 1467–1484. [CrossRef]
35. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
36. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
37. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2
November 2019.
38. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 2017, 39, 640–651. [CrossRef] [PubMed]
39. Huang, L.; Yuan, Y.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Interlaced Sparse Self-Attention for Semantic Segmentation. arXiv 2019,
arXiv:1907.12273.
40. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Computer Vision—ECCV 2020;
Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 173–190,
ISBN 9783030585389.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
