
sensors

Article
Face Mask Wearing Detection Algorithm Based on
Improved YOLO-v4
Jimin Yu 1,2 and Wei Zhang 1,2, *

1 College of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
[email protected]
2 Key Lab of Industrial Wireless Networks and Networked Control of the Ministry of Education,
Chongqing 400065, China
* Correspondence: [email protected]

Abstract: To solve the problems of low accuracy, poor real-time performance, weak robustness and others caused by the complex environment, this paper proposes a face mask recognition and standard-wearing detection algorithm based on the improved YOLO-v4. Firstly, an improved CSPDarkNet53 is introduced into the trunk feature extraction network, which reduces the computing cost of the network and improves the learning ability of the model. Secondly, an adaptive image scaling algorithm is used to reduce computation and redundancy effectively. Thirdly, an improved PANet structure is introduced so that the network has more semantic information in the feature layer. At last, a face mask detection data set is made according to the standard wearing of masks. Based on the object detection algorithm of deep learning, a variety of evaluation indexes are compared to evaluate the effectiveness of the model. The results of the comparisons show that the mAP of face mask recognition can reach 98.3% and the frame rate reaches 54.57 FPS, which is more accurate than the existing algorithms.

Keywords: adaptive image scaling; CSPDarkNet53; face mask recognition; PANet; YOLO-v4


Citation: Yu, J.; Zhang, W. Face Mask Wearing Detection Algorithm Based on Improved YOLO-v4. Sensors 2021, 21, 3263. https://doi.org/10.3390/s21093263

Academic Editor: Stefano Berretti

Received: 7 April 2021; Accepted: 6 May 2021; Published: 8 May 2021

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Things such as respiratory infection viruses, toxic and harmful gases and dust suspended in the air can enter the lungs of humans as they breathe and then cause pneumonia, nerve damage and toxic reactions. In particular, the new coronavirus (COVID-19) has spread globally since the end of 2019, which has had a great impact on the safety of the lives and property of all human beings. When people are exposed to toxic or harmful gases, wearing masks can effectively protect them from being endangered, thereby reducing unnecessary losses [1]. Therefore, it is of great practical significance to realize a mask wearing detection algorithm.

At present, in places where masks need to be worn (such as communities, campuses, supermarkets, hospitals, factories, stations, etc.), the wearing of masks is usually checked manually. However, this method wastes human resources, has low efficiency and, most importantly, suffers from missed and false detections. Object detection technology enables us to use a camera and a computer in an integrated way to realize face mask wearing detection, so that the purpose of non-contact automatic detection is achieved.

In ref. [2], the authors proposed the LeNet-5 network architecture, which is a classic work of the convolutional neural network and provided great help for the development of computer vision. However, due to the limited computing power and the lack of data sets at that time, the neural network model in ref. [2] was surpassed by SVM [3] under certain computing power conditions. It was not until 2012 that Hinton [4] first proposed the AlexNet convolutional neural network model based on deep learning, which is regarded as a solid foundation for the development of object detection algorithms. In that work, the authors used the ReLU activation function [5] to speed up the training of the network during the gradient descent process of the model and introduced a Dropout layer to suppress over-fitting [6], so that the network can extract object features more effectively. In 2014, He [7] extracted object features in areas with any aspect ratio by using the Spatial Pyramid Pooling Network (SPPNet) method, which provided ideas for YOLO-v3 [8], YOLO-v4 [9] and other detection algorithms to extract features at any scale. In the next year, the residual block structure in ResNet was introduced to improve the feature expression ability of the model in [10]. Based on Feature Pyramid Networks (FPN) [11], Liu [12] put forward the Path Aggregation Network (PANet) in 2018 to prove the importance of the underlying information of the feature layer, thus realizing the circulation of object feature information.

At present, object detection algorithms based on deep learning are usually divided into two categories. The first is the Two-Stage algorithm based on the R-CNN [13–15] and TridentNet [16], etc. The existing problems of such Two-Stage algorithms are poor real-time performance, large model scale and a poor small object detection effect. The second is the One-Stage algorithm based on the SSD [17–20] and YOLO [21,22], which has high real-time performance in multi-scale object detection. However, the detection accuracy needs to be improved.

Combined with the advantages of the YOLO series of object detection algorithms, some improved methods of CSPDarkNet53 and PANet are introduced into YOLO-v4 in the present paper, and a model which can handle the mask detection task and achieve optimal performance is developed. Similarly to papers [23–25], we build the network model based on deep learning and computer-aided diagnosis. The method used in this article is described in the flow chart in Figure 1.

Figure 1. Flow chart of the proposed approach (input, data preprocessing and data partitioning into training, validation and testing sets; parameter setting, model creation and loading of pre-training weights; training and testing of the model for face mask wearing detection).

The contributions of this paper are as follows:
• Aiming at the problem of training time, this paper introduces the improved CSPDarkNet53 into the backbone to realize the rapid convergence of the model and reduce the time cost of training.
• An adaptive image scaling algorithm is introduced to reduce the use of redundant information in the model.
• To strengthen the fusion of multi-scale semantic information, the improved PANet is added into the Neck module.
• The Hard-Swish activation function introduced in this paper can not only strengthen the nonlinear feature extraction ability of the network, but also make the detection results of the model more accurate.
To sum up, in the face mask detection task, the algorithm proposed in this paper has higher detection accuracy than other typical algorithms and is therefore more suitable for this task. At the same time, the algorithm is practical to deploy in public places to urge people to wear masks regularly in order to reduce the risk of cross-infection.

2. Related Works
2.1. Problems Exist in Object Detection
There are two key points of face mask wearing detection. One is to locate the position
of the face in the image; the other is to identify whether the face given by the data set is
wearing a mask and if the mask is worn correctly. Problems of the present object detection
algorithm can be attributed to face occlusion, variable face scale, uneven illumination,
density, etc., and these problems seriously affect the performance of the algorithm. Furthermore, the traditional object detection algorithm adopts the selective search method [26]
in feature extraction, leading to problems such as poor generalization ability, redundant
information, low accuracy and poor real-time performance.

2.2. Existing Work


Some researchers have used the extraction of RGB color information to perform face
mask recognition [27]. However, the article does not consider the case of non-standard
wearing of masks, so the adaptability of the algorithm needs to be further improved.
Combining YOLO-v2 and ResNet50, the authors in [28] realized face mask recognition
whose backbone network is DarkNet-19. However, DarkNet-19 has been optimized by
CSPDarkNet53. The ablation experiment in our paper shows that the CSP1_X module
produces better results than CSPDarkNet53. In [29], the authors pointed out that the
combination of ResNet50 and SVM can realize face mask detection and its accuracy can
reach up to 99.64%. However, the algorithm takes a lot of computational costs. Furthermore,
the combination of SSD and MobileNetV2 for mask detection was proposed in paper [30],
but its model structure is too complex and its performance is inferior to YOLO-v4.
Only two categories are used in the papers mentioned in the above paragraph and
the authors did not consider the influence of wearing masks irregularly on the algorithm.
Therefore, the feature extraction ability and model practicability of these algorithms need
to be improved. In this paper, based on the improved YOLO-v4, face mask recognition is considered and three categories, face_mask, face and WMI, are included. In addition, the feature extraction ability of the model is improved by CSP1_X, and CSP2_X impels PANet to speed up the circulation of semantic features and strengthen feature fusion, thus improving the robustness of the model.

3. The Model Structure of YOLO-v4 Network


YOLO-v4 is a high-precision and real-time One-Stage object detection algorithm based
on regression proposed in 2020, which integrated the characteristics of YOLO-v1, YOLO-v2,
YOLO-v3, etc., and achieved the current optimum in terms of detection speed and trade-off
of detection accuracy. The model structure is shown in Figure 2, which consists of three
parts: Backbone, Neck, and Prediction.
Combined with the characteristics of the ResNet structure, YOLO-v3 integrated the
residual module into itself and then obtained Darknet53. Based on this, taking the superior
learning ability of Cross-Stage Partial Network (CSPNet) [31] into account, YOLO-v4
constructed the CSPDarkNet53. In the residual module, the feature layer is input and
the higher-level feature information is output. This means the learning goal of the model
in the ResNet module becomes the difference between the output and the input, thus
realizing residual learning while reducing the parameters of the model and strengthening
feature learning. The Neck can be composed of the SPPNet and PANet. In SPPNet, firstly,
the feature layer is convolved three times, and then the input feature layer is maximally
pooled by using the maximum pooling cores of different sizes. The pooled results are
concatenated firstly and then convolved three times, thus improving the network receptive
field. PANet convolves the feature layers after the operation of Backbone and SPPNet
and then up-samples them, that is, making the original feature layers double in height
and width, and then concatenates the feature layers after convolution and up-sampling
with the feature layers obtained by CSPDarkNet53 to realize feature fusion, and then
down-sampling, compressing the height and width, and finally stacking with the previous
feature layers to realize more feature fusion (five times). The Prediction module can make predictions by using the features extracted from the network. Taking a 13 × 13 grid as an example, it is equivalent to dividing the input picture into 13 × 13 grids, and each grid will be preset with three prior frames. The prediction results of the network will adjust the positions of the three prior frames, and finally, they will be filtered by the non-maximum suppression (NMS) [32] algorithm to obtain the final prediction frame.

Figure 2. YOLO-v4 network structure (Backbone built from CBM and CSP modules, Neck built from SPP and PANet, and three prediction heads of 52 × 52 × 24, 26 × 26 × 24 and 13 × 13 × 24; CBM = Conv + BN + Mish, CBL = Conv + BN + Leaky-ReLU).

YOLO-v4 proposed a new mosaic data augmentation method to expand the data set and introduced CIOU as the positioning loss function [33], which made the network more inclined to optimize in the direction of increasing overlapping areas, thus effectively improving the accuracy. In the actual complex environment, due to external interference such as occlusion and multi-scale objects, there are still some shortcomings in face mask detection directly using YOLO-v4. The main ones are as follows:
There are still problems such as insufficient shallow feature extraction for multi-scale objects.
In the reasoning stage, the model adds gray bars at both ends of the image to prevent the image from distorting, but too many gray bars increase the redundant information of the model.
At the same time, the model has problems such as long training time, high calculation cost and overfull parameters.
To solve these problems, this paper optimizes and improves the model based on YOLO-v4.

4. Improved YOLO-v4 Network Model

With the increasing number of layers of the convolutional neural network, the depth of the network is deepening, and the deeper network structure is beneficial for the extraction of object features. Thereupon, the semantic information of small objects is increased. The main improvements presented in this paper based on YOLO-v4 are as follows: the CSPDarkNet53 is improved into CSP1_X and CSP2_X, which reduces the network modules and the parameters of feature extraction in the network model; using the CSP2_X module in the Neck can increase information fusion; and the adaptive image scaling method is used to replace the image scaling method in YOLO-v4.

4.1. Backbone Feature Extraction Network

The residual module is introduced into YOLO-v4 to enhance the learning ability of the network and reduce the number of parameters. The operation process of the residual module (Res-unit) can be summed up as follows: firstly, perform a 1 × 1 convolution; then a 3 × 3 convolution; and weight the two outputs of the module at last. The purpose of weighting is to increase the information of the feature layer without changing its dimension information. In CSPDarkNet53, the set of feature layers of the image is input, and then convolution down-sampling is performed continuously to gain higher semantic information. Therefore, the last three layers of Backbone have the highest semantic information, and these three feature layers are selected as the input of SPPNet and PANet. The network structure of CSPDarkNet53 is shown in Figure 3.

Figure 3. CSPDarkNet53 module structure (a trunk of CBM blocks and X residual units concatenated with a CBM shortcut branch and fused by a final CBM; CBM = Conv + BN + Mish).

Although YOLO-v4 uses the residual network to reduce the computing power requirement of the model, its memory requirement still needs to be improved. Therefore, in this paper, the CSPDarkNet53 structure of YOLO-v4 is improved to the CSP1_X module, as shown in Figure 4.

Figure 4. CSP1_X module structure (trunk: CBH, X residual units and a convolution; shortcut: a convolution; the two branches are concatenated and then pass through BN, Leaky-ReLU and CBH; CBH = Conv + BN + H-swish).

Compared with CSPDarkNet53 in Figure 2, the improved network uses the H-swish activation function [34], as shown in Equation (1):

H-swish(x) = x · ReLU6(x + 3)/6 (1)

As the Swish function [35] contains the Sigmoid function, the calculation cost of the Swish function is higher than that of the ReLU function, but the Swish function is more effective than the ReLU one. Howard used the H-swish function on mobile devices [36] to reduce the number of accesses to memory by the model, which further reduced the time cost. Therefore, in this paper, the advantages of the H-swish function are used to reduce the running time requirements of the model on condition of ensuring no gradient explosion, disappearance or other problems. At the same time, the detection accuracy of the model is advanced.

In CSP1_X, the input feature layer of the residual block is divided into two branches. One is used as the residual edge for a convolution operation. The other plays the role of the trunk part: it performs a 1 × 1 convolution operation at first, then performs a 1 × 1 convolution to adjust the channels after entering the residual block, and then performs the 3 × 3 convolution operation to enhance the feature extraction. At last, the two branches are concatenated, thus merging the channels to obtain more feature layer information. In this paper, three CSP1_X modules are used in the improved Backbone, where X represents the number of residual weighting operations in the residual structure. Finally, after stacking, a 1 × 1 convolution is used to integrate the channels. Experiments show that using this residual structure can make the network structure easier to optimize.
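To make the module structure concrete, the following is a minimal PyTorch sketch of the CSP1_X idea described in this subsection (a trunk of X residual units plus a convolutional shortcut edge, concatenated and fused, with the H-swish activation). The channel sizes, kernel choices and the exact composition of the CBH block are illustrative assumptions; they are not taken from the authors' implementation.

import torch
import torch.nn as nn

class CBH(nn.Module):
    # Conv + BatchNorm + Hard-Swish, the basic block of the improved backbone
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()          # H-swish(x) = x * ReLU6(x + 3) / 6

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    # 1x1 then 3x3 convolution with an additive shortcut (the Res-unit)
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBH(c, c, 1), CBH(c, c, 3))

    def forward(self, x):
        return x + self.block(x)           # weight the two outputs of the module

class CSP1_X(nn.Module):
    # trunk branch (1x1 conv -> X residual units -> conv) concatenated with a
    # convolutional shortcut edge, then BN + Leaky-ReLU + CBH to fuse channels
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c_half = c_out // 2
        self.trunk = nn.Sequential(
            CBH(c_in, c_half, 1),
            *[ResUnit(c_half) for _ in range(n)],
            nn.Conv2d(c_half, c_half, 1, bias=False),
        )
        self.shortcut = nn.Conv2d(c_in, c_half, 1, bias=False)   # residual edge
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)
        self.fuse = CBH(c_out, c_out, 1)

    def forward(self, x):
        y = torch.cat([self.trunk(x), self.shortcut(x)], dim=1)
        return self.fuse(self.act(self.bn(y)))

if __name__ == "__main__":
    feat = torch.randn(1, 64, 52, 52)
    print(CSP1_X(64, 128, n=3)(feat).shape)   # torch.Size([1, 128, 52, 52])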

4.2. Neck Network

The convolutional neural network requires the input image to have a fixed size. In past convolutional neural networks, the fixed input was obtained by cutting and warping operations, but these methods easily bring about problems such as object missing or deformation. To eliminate such problems, researchers proposed SPPNet to remove the requirement of a fixed input size. To gain multi-scale local features, YOLO-v4 introduced the SPPNet structure based on YOLO-v3. In order to further fuse the multi-scale local feature information with the global feature information, we add the CSP2_X module to the PANet structure of YOLO-v4 to enhance the feature extraction, which helps to speed up the flow of feature information and enhance the accuracy of the model. CSP2_X is shown in Figure 5.

Figure 5. CSP2_X module structure (trunk: 2 × X CBH modules and a convolution; shortcut: a convolution; the two branches are concatenated and then pass through BN, Leaky-ReLU and CBH).

The common convolution operation is adopted in the Neck network in YOLO-v4, while the CSPNet has the advantages of superior learning ability, a reduced computing bottleneck and lower memory cost. Adding the improved CSPNet network module based on YOLO-v4 can further enhance the ability of network feature fusion. This combined operation can realize the top-down transmission of deeper semantic features in PANet, and at the same time fuse the bottom-up deep positioning features from the SPPNet network, thus realizing feature fusion between different backbone layers and different detection layers in the Neck network and providing more useful features for the Prediction network.
4.3. Adaptive Image Scaling

In the object detection network, the image data received by the input port have a uniform standard size. For example, the standard size of each image in the handwritten numeral recognition data set MNIST is 28 × 28. However, different data sets have different image sizes, and ResNet fixes the input image to 224 × 224. Two standard sizes of 416 × 416 and 608 × 608 are provided at the input port of the YOLO-v4 detection network. Traditional methods for obtaining a standard size mainly include cutting, twisting, stretching, scaling, etc., but these methods easily cause problems such as missing objects, loss of resolution and a reduction in accuracy. In previous convolutional neural networks, it was necessary to unify the image data to the standard size manually in advance, while YOLO-v4 standardizes the image size directly by using the data generator and then inputs the image into the network to realize the end-to-end learning process. In the training and testing stages of YOLO-v4, the sizes of input images are 416 × 416 and 608 × 608. When standardizing images of different sizes, the original images are scaled first; then gray images with sizes of 416 × 416 or 608 × 608 are generated; and finally, the scaled image is overlaid on the gray image to obtain image data of standard size.

Taking the unregulated mask image as an example, the image processed by the scaling algorithm is shown in Figure 6. In Figure 6, A is the original input image. After calculating the scaling ratio of the original input image, image B is obtained by the BICUBIC interpolation algorithm [37], and C is a gray image with the standardized size. Finally, the original image can be pasted onto the gray image, and the standardized input image D can be obtained. The image scaling algorithm reduces the resolution without distortion, but in practical applications, most images have different aspect ratios, so after scaling and filling by this algorithm, the gray image sizes at both ends of the image are not the same. If the filled gray image size is too large, there is information redundancy which increases the reasoning time of the model.

Figure 6. Image scaling in YOLO-v4 (A: 447 × 343 × 3 original image; B: 416 × 319 × 3 scaled image; C: 416 × 416 × 3 gray image; D: 416 × 416 × 3 standard input).

Therefore, this paper introduces an adaptive image scaling algorithm to adaptively add the least red edge to the original image. The steps of this algorithm are shown in Algorithm 1.

Algorithm 1 Adaptive image scaling.
Input: W and H are the width and height of the input image.
       TW and TH are the width and height of the object image of standard size.
Begin
    scaling_ratio ← min{TW/W, TH/H}
    new_w ← W × scaling_ratio
    new_h ← H × scaling_ratio
    dw ← TW − new_w
    dh ← TH − new_h
    d ← mod(max(dw, dh), 64)
    padding ← d/2
    if (W, H) ≠ (new_w, new_h):
        image ← resize(input_image, (new_w, new_h))
        new_image ← add_border(image, (padding, padding))
End
Output: new_image

The process of this algorithm can be understood as follows. In the first step, TW and TH of the standard-size object image are divided by W and H of the input image, respectively, and then their minimum value scaling_ratio is treated as the scaling factor. In the second step, multiply the scaling factor by the original image's W and H, respectively, and take new_w and new_h as the scaled dimensions of the original image. In the third step, TW and TH of the object image are subtracted by new_w and new_h, respectively, to obtain dw and dh. In the fourth step, obtain the maximum value of dw and dh, and then calculate the remainder of this maximum value and 64. As the network model will carry out five down-sampling operations, and each down-sampling operation compresses the height and width of the previous feature graphs to 1/2 of the original, the size of the feature map obtained after five down-sampling operations is 1/32 of the original image, so the length and width must be multiples of 32. In this paper, 64 is also required, and the remainder is assigned as d. The fifth step is to calculate the padding of the red edge on both sides of the image. For the sixth step, if (W, H) and (new_w, new_h) are not the same, scale the original image to new_w and new_h. The last step is to fill the two sides of the image after scaling to obtain a new image.

Similarly, the image of wearing a mask irregularly, for instance, after being processed by the adaptive scaling algorithm, is shown in Figure 7.

Figure 7. Adaptive image scaling (A: 447 × 343 × 3; B: 416 × 319 × 3; C: 416 × 352 × 3).

Comparing Figure 6 with Figure 7, it can be observed that after adaptive scaling the original image adds the least red edge at both ends of the image, thus reducing redundant information. When the model is used for reasoning, the calculation will be reduced, and the reasoning speed of object detection will be promoted.
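For readers who want to try Algorithm 1 directly, the following is a runnable Python sketch of the adaptive image scaling step using Pillow. The gray fill colour, the helper name and the paste-based padding are illustrative assumptions; only the arithmetic (scaling ratio, mod-64 remainder, half-padding on each side) follows the algorithm above.

from PIL import Image

def adaptive_scale(img, target_w=416, target_h=416, stride_multiple=64, fill=(128, 128, 128)):
    w, h = img.size
    ratio = min(target_w / w, target_h / h)              # scaling_ratio = min(TW/W, TH/H)
    new_w, new_h = round(w * ratio), round(h * ratio)    # scaled size, aspect ratio preserved
    dw, dh = target_w - new_w, target_h - new_h
    d = max(dw, dh) % stride_multiple                    # pad only up to a multiple of the stride
    first = d // 2                                       # border on the first side (rest goes on the other)
    if (w, h) != (new_w, new_h):
        img = img.resize((new_w, new_h), Image.BICUBIC)  # BICUBIC interpolation, as in the paper
    if dh >= dw:                                         # height is the short side: pad top and bottom
        canvas = Image.new("RGB", (new_w, new_h + d), fill)
        canvas.paste(img, (0, first))
    else:                                                # width is the short side: pad left and right
        canvas = Image.new("RGB", (new_w + d, new_h), fill)
        canvas.paste(img, (first, 0))
    return canvas

if __name__ == "__main__":
    example = Image.new("RGB", (447, 343))               # same size as image A in Figure 7
    print(adaptive_scale(example).size)                  # (416, 352), as for image C in Figure 7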
4.4. Improved Network Model Structure

The improved network model is shown in Figure 8, in which three CSP1_X modules are used in the Backbone of the backbone feature extraction network, and each CSP1_X module has X residual units. In this paper, considering the calculation cost, the residual modules are connected in series into the combination of X residual units. This operation can replace the two 3 × 3 convolution operations with a combination of a 1 × 1 + 3 × 3 + 1 × 1 convolution module. The first 1 × 1 convolution layer can compress the number of channels to half of the original one and reduce the number of parameters at the same time. The 3 × 3 convolution layer can enhance feature extraction and restore the number of channels. The last 1 × 1 convolution operation restores the output of the 3 × 3 convolution layer, so the alternate convolution operation is helpful for feature extraction, ensures accuracy and reduces the amount of computation.

Figure 8. Network model of mask detection (Backbone built from CBH and CSP1_X modules, Neck built from SPP and CSP2_X modules, and three prediction heads of 52 × 52 × 24, 26 × 26 × 24 and 13 × 13 × 24; CBH = Conv + BN + H-swish).

The Neck network is mainly composed of the SPPNet and improved PANet. In this paper, the SPPNet module enlarges the acceptance range of backbone features effectively, and thus significantly separates the most important contextual features. The high computational cost of model reasoning is mainly caused by the repeated appearances of gradient
information in the process of network optimization. Therefore, from the point of view of
network model design, this paper introduces the CSP2_X module into PANet to divide the
basic feature layer from Backbone into two parts and then reduces the use of repeated gra-
dient information through cross-stage operation. In the same way, the CSP2_X module uses
the combination of 1 × 1 + 3 × 3 + 1 × 1 convolution module to reduce the computation
cost and ensure accuracy.
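As an illustration of the SPP part of the Neck described above, the following is a minimal PyTorch sketch: the same feature map is max-pooled with several kernel sizes (padding keeps the spatial size), the results are concatenated with the input, and a convolution fuses the channels. The pooling sizes (5, 9, 13) follow the common YOLO-v4 setting and, like the surrounding layer layout, are an assumption made for illustration.

import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, c_in, c_out, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        self.fuse = nn.Conv2d(c_in * (len(pool_sizes) + 1), c_out, kernel_size=1)

    def forward(self, x):
        # multi-scale pooling enlarges the receptive field without changing H x W
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

print(SPP(512, 512)(torch.randn(1, 512, 13, 13)).shape)   # torch.Size([1, 512, 13, 13])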
The Prediction module uses the features extracted from the model to predict. In this
paper, the Prediction network is divided into three effective feature layers: 13 × 13 × 24,
26 × 26 × 24 and 52 × 52 × 24, which correspond to big object, medium object and small
object, respectively. Here, 24 can be understood as the product of 3 and 8, and 8 can be
divided into the sum of 4, 1 and 3, where 4 represents the four position parameters of the
prediction box, 1 is used to judge whether the prior box contains objects, and 3 represents
that there are three categories of mask detection tasks.
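The 24-channel layout of each prediction head described above can be made explicit with a small, hedged PyTorch snippet; the channel ordering is an assumption made only for illustration.

import torch

head = torch.randn(1, 24, 13, 13)                     # batch, channels, grid_h, grid_w
B, C, H, W = head.shape
head = head.view(B, 3, 4 + 1 + 3, H, W)               # 24 = 3 prior boxes x (4 + 1 + 3)
box_offsets  = head[:, :, 0:4]                        # t_x, t_y, t_w, t_h per prior box
objectness   = head[:, :, 4:5]                        # does this prior box contain an object?
class_scores = head[:, :, 5:8]                        # face, face_mask, WMI
print(box_offsets.shape, objectness.shape, class_scores.shape)
# torch.Size([1, 3, 4, 13, 13]) torch.Size([1, 3, 1, 13, 13]) torch.Size([1, 3, 3, 13, 13])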

4.5. Object Location and Prediction Process


The YOLO-v3, YOLO-v4 and the models used in this paper are all predicted by the
Prediction module after extracting three feature layers. For the 13 × 13 × 24 effective
feature layer, it is equivalent to divide the input picture into 13 × 13 grids, and each grid
will be responsible for object detection in the area corresponding to this grid. When the
center of an object falls in this area, it is necessary to use this grid to take charge of the object
detection. Each grid will preset three prior boxes, and the prediction results of the network
will adjust the position parameters of the three prior boxes to obtain the final prediction
results. Similarly, the prediction process of effective feature layers of 26 × 26 × 24 and
52 × 52 × 24 is the same as that of feature layers of 13 × 13 × 24.
In Figure 9, the feature layer is divided into 13 × 13 grids to illustrate the process
of object location and prediction. Figure 9a represents the original input image of three-
channel color with a size of 416 × 416 × 3. Figure 9b is obtained from the feature extraction
of the input image through the network, which represents the effective feature layer with
the size of 13 × 13 × 24 in the Prediction module. The feature layer is divided into
13 × 13 grids and each grid has three prior boxes which are represented by green boxes.
Their center points are c x and cy , width and height are pw and ph , respectively. The final
prediction box is a blue box with center points t x and ty , width and height bw and bh ,
respectively. Figure 9c is an input image mapped by Figure 9b, which means that the size
of the prior box, grid point, prediction box, height and width in Figure 9c is 32 times that
of Figure 9b. Therefore, when the center of the face wearing a mask irregularly falls within
the orange box, this grid is responsible for face detection. The prediction results of the
network will adjust the positions of the three prior boxes, and then the final prediction box
will be screened out by ranking the confidence level and NMS to obtain Figure 9d as the
detection result of the network.
YOLO-v3 is an improved version based on YOLO-v2, which solves the multi-scale
problem of objects and improves the detection effect of the network on small-scale objects.
At the same time, YOLO-v3 uses binary cross-entropy as the loss function, so that the
network can realize multi-category prediction with one boundary box. YOLO-v3 and
YOLO-v4 prediction methods are adopted in the prediction process in the present paper, as
shown in Figure 9b, and t x , ty , tw and th are the four parameters that the network needs to
learn, which are:
bx = σ(tx) + cx (2)
by = σ(ty) + cy (3)
bw = pw · e^tw (4)
bh = ph · e^th (5)
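As a quick worked example of how Equations (2)–(5), together with an objectness and a class score, turn one grid cell's raw outputs into a prediction box, the following hedged Python sketch can be used. The variable names, the class ordering and the mapping back to input-image coordinates through the 32-pixel stride of the 13 × 13 head are illustrative assumptions, not the authors' implementation.

import math

def decode_cell(t_x, t_y, t_w, t_h, obj_logit, cls_logits, c_x, c_y, p_w, p_h, stride=32):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    # Equations (2)-(5): offset within the grid cell, then scale the prior box
    b_x = (sigmoid(t_x) + c_x) * stride
    b_y = (sigmoid(t_y) + c_y) * stride
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    # objectness times the best class probability gives the reported confidence
    cls_probs = [sigmoid(v) for v in cls_logits]       # face, face_mask, WMI
    best = max(range(len(cls_probs)), key=cls_probs.__getitem__)
    confidence = sigmoid(obj_logit) * cls_probs[best]
    return (b_x, b_y, b_w, b_h), best, confidence

box, cls_id, conf = decode_cell(0.2, -0.1, 0.05, 0.1, 2.0, [-3.0, -1.0, 1.5],
                                c_x=6, c_y=7, p_w=234, p_h=229)
print(box, cls_id, round(conf, 2))   # class 2 (WMI) with confidence about 0.72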
In the training process, the network constantly learns the four parameters tx, ty, tw and th, thus constantly adjusting the position of the prior box to approach the position of the prediction box, and finally obtaining the final prediction result. Here, σ(tx) and σ(ty) indicate that tx and ty are constrained by the Sigmoid function to ensure that the center of the prediction box falls within the grid.

Figure 9. The process of object positioning and prediction ((a) the input image; (b) the 13 × 13 × 24 feature layer with a prior box of size pw × ph centered at (cx, cy) and offsets σ(tx), σ(ty); (c) adjustment of the prior box; (d) the result of prediction, e.g., WMI: 0.88).

The confidence score reflects the accuracy of the model predicting that an object is a certain category, as shown in Equation (6).

Confidence = Pr(Classi | Object) × Pr(Object) × IoU_pred^truth (6)

In Equation (6), Pr(Classi | Object) means the probability of what kind of object it is when it is known to be an object. Pr(Object) represents the probability of whether the prediction box contains an object. If an object is included, Pr(Object) = 1, otherwise it equals 0. IoU_pred^truth tells us the overlap ratio between the predicted box and the true box [38].

4.6. The Size Design of Prior Box

For the mask detection data set in this paper, it is necessary to set appropriate prior box sizes to obtain accurate prediction results. The size of the prior box obtained by the k-means clustering algorithm is shown in Table 1.

Table 1. The size of the prior box.

Feature Map | Receptive Field | Prior Box Sizes
13 × 13 | large object | (221 × 245), (234 × 229), (245 × 251)
26 × 26 | medium object | (165 × 175), (213 × 222), (217 × 195)
52 × 52 | small object | (46 × 51), (82 × 100), (106 × 201)
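For illustration, prior-box sizes of this kind can be obtained by clustering the (width, height) pairs of the ground-truth boxes. The following minimal sketch uses plain Euclidean-distance k-means on synthetic data; the distance choice, the helper name and the demo data are assumptions made for this example only (anchor clustering in YOLO-style detectors often uses an IoU-based distance instead).

import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    # cluster (width, height) pairs of ground-truth boxes into k prior-box sizes
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                     # assign each box to its nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)   # move the center to the cluster mean
    return centers[np.argsort(centers.prod(axis=1))]  # sort small -> large

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    wh = np.abs(rng.normal(150, 60, size=(500, 2)))   # fake (w, h) pairs, for demonstration only
    print(np.round(kmeans_anchors(wh, k=9)))          # 9 prior-box sizes, 3 per detection scale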

5. Experimental Data Set


5.1. Data Set
At present, the published mask data sets are few, and there are problems such as
poor content, poor quality and single background which cannot be directly applied to
the face mask detection task in a complex environment. Under such context, this paper
adopts the method of using my own photos and screening from the published RMFD [39]
and MaskedFace-Net [40] data sets to manufacture a data set of 10,855 images, of which
7826 are selected for training, 868 for verification and 2161 for testing. When creating the
data set, we fully consider the mask type, manufacturer, color and other factors to meet the
richness of the data set. Therefore, the model algorithm has stronger generalization ability
and detection ability in practical use. People’s behavior of covering their faces with objects
that are not masks easily leads to the false detection of objects in the algorithm, hence, we
treat this kind of behavior as “face”. In the whole data set, there are 3615 images without
masks, 3620 images with masks regularly and 3620 images with masks irregularly. The face
in each picture corresponds to a label, and each label corresponds to a serial number. In this
paper, the detection tasks are divided into three categories: serial number 0 corresponds
to the “face”, indicating that no mask is worn; serial number 1 is equal to “face_mask”,
showing that the face wears a mask regularly; and serial number 2 is equivalent to “WMI”,
which means wearing masks irregularly. The sample distribution of different categories
in the data set is shown in Table 2, where images represent the number of categories and
objects represent the number of instances [41].

Table 2. Distribution of different types of samples in the data set.

Sort | Training Set (Images / Objects) | Validation Set (Images / Objects) | Testing Set (Images / Objects)
face | 2556 / 2670 | 338 / 350 | 721 / 753
face_mask | 2685 / 2740 | 219 / 228 | 716 / 730
WMI | 2585 / 2604 | 311 / 311 | 724 / 730
total | 7826 / 8014 | 868 / 889 | 2161 / 2213

5.2. Region Division of Real Box


Whether a face is standard for wearing a mask can be judged by the exposure of the
nose, mouth and chin in the face. We randomly selected 100 original images from the data
set as the research object, and the area between eyebrows and chin in the images as the
research area. We can conclude that the nose is located at the height of 28.5–55% of the
image. The mouth is distributed in 55–81% of the image. The chin is located at 81–98%
of the image, as shown in Figure 10. Therefore, based on this conclusion, we use the
LabelImg tool to label every face in each picture in the data set and determine its category
and coordinate information to obtain the real box.
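As an illustration only, the empirical bands above could be turned into a labeling rule of the following form. The function, the assumed mask coordinates and the coverage thresholds are our own illustrative assumptions; they are not the authors' annotation tool.

NOSE_BAND = (0.285, 0.55)    # fractions of the eyebrow-to-chin height
MOUTH_BAND = (0.55, 0.81)
CHIN_BAND = (0.81, 0.98)

def annotation_class(mask_top, mask_bottom):
    # mask_top / mask_bottom are fractions of the eyebrow-to-chin height
    # (0.0 = eyebrows, 1.0 = chin); None means no mask is present
    if mask_top is None or mask_bottom is None:
        return "face"            # serial number 0: no mask worn
    covers_nose = mask_top <= NOSE_BAND[0]
    covers_mouth = mask_top <= MOUTH_BAND[0] and mask_bottom >= MOUTH_BAND[1]
    covers_chin = mask_bottom >= CHIN_BAND[1]
    if covers_nose and covers_mouth and covers_chin:
        return "face_mask"       # serial number 1: mask worn regularly
    return "WMI"                 # serial number 2: mask worn irregularly

print(annotation_class(0.20, 1.0))   # face_mask
print(annotation_class(0.60, 1.0))   # WMI (nose exposed)
print(annotation_class(None, None))  # face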
Figure 10. Division of key parts (the vertical extents Ymin/Ymax of the nose, mouth and chin within the face region of the given width and height).
Generally speaking, the face is completely exposed if no mask is worn. The non-standard
wearing of masks can be attributed to four situations: the nose is exposed; the nose and
mouth are exposed; the mouth, nose and chin are all exposed; and only the chin is exposed.
To wear a mask in a standard way, one needs to ensure that the front and back of the mask
are correctly distinguished and that the upper and lower sides are oriented properly, press
the metal strips on both sides of the nose bridge with both hands so that the upper end of
the mask fits closely to the bridge of the nose and the left and right ends fit closely to the
cheeks, and then stretch the mask downwards so that it leaves no wrinkles and better covers
the nose, mouth and chin. Figure 11 shows the standard and non-standard ways of wearing
a mask.

Figure 11. Sample diagram from the data set: (a) face; (b) face_mask (right); (c) face_mask (positive); (d) face_mask (left); (e) WMI (only the nose exposed); (f) WMI (nose and mouth exposed); (g) WMI (nose, mouth and chin exposed); (h) WMI (only the chin exposed).
6. Experimental Results and Analysis

To verify the advantages of the improved model compared with other detection models,
a large number of experiments are carried out to illustrate the validity of the model's
performance.
6.1. Experimental Platform and Parameters

The configuration parameters of the software and hardware platform used to implement
the algorithm in this paper are shown in Table 3.

Table 3. Configuration parameters.

Device                Configuration
Operating system      Windows 10
Processor             Intel(R) i7-9700K
GPU accelerator       CUDA 10.1, cuDNN 7.6
GPU                   RTX 2070 Super, 8 GB
Frameworks            PyTorch, Keras, TensorFlow
Compilers             PyCharm, Anaconda
Scripting language    Python 3.7
Camera                A4tech USB 2.0 Camera

Before the model in this article is trained, its hyperparameters need to be initialized. The
model then continuously optimizes these parameters during the training process, which
speeds up the convergence of the network and prevents overfitting. All experiments in this
paper are performed with 50 epochs, a batch size of 8 and an input image size of
416 × 416 × 3. The parameter adjustment process is shown in Table 4.

Table 4. The hyperparameters of the model.

Hyperparameters Before Initialization After Initialization


initial learning rate 0.01000 0.00320
optimizer weight decay 0.00050 0.00036
momentum 0.93700 0.84300
classification coefficient 0.50000 0.24300
object coefficient 1.00000 0.30100
hue 0.01500 0.01380
saturation 0.70000 0.66400
value 0.40000 0.46400
scale 0.50000 0.89800
shear 0.00000 0.60200
mosaic 1.00000 1.00000
mix-up 0.00000 0.24300
flip up-down 0.00000 0.00856

6.2. The Performance of Different Models in Training


In the training process, the model updates its parameters from the training set to
achieve better performance. To verify the effect of CSP1_X and CSP2_X modules on the
improved model, this paper compares the training performance with other object detection
models, as shown in Table 5.

Table 5. Comparison of different models in parameters, model size, and training time.

Model Parameters Model Size Training Time


Proposed work 45.2 MB 91.0 MB 2.834 h
YOLO-v4 61.1 MB 245 MB 9.730 h
YOLO-v3 58.7 MB 235 MB 8.050 h
SSD 22.9 MB 91.7 MB 3.350 h
Faster R-CNN 27.1 MB 109 MB 45.830 h

It can be seen that the parameters of this model are reduced by 15.9 MB compared with
YOLO-v4 and by 13.5 MB compared with YOLO-v3. At the same time, the model size
is 0.371 and 0.387 times that of YOLO-v4 and YOLO-v3, respectively. Under the same
conditions, the training time of this model is 2.834 h, which is the lowest of all the models
compared in the experiment.
In Faster R-CNN, the authors used the Region Proposal Network (RPN) to generate
W × H × K candidate regions, which increases the operation cost. Meanwhile, Faster
R-CNN not only retained the ROI-Pooling layer in it but also used the full connection layer
in the ROI-Pooling layer, which brought the network many repeated operations and then
reduced the training speed of the model.

6.3. Comparison of Reasoning Time and Real-Time Performance


In this paper, video is used to verify the real-time performance of the algorithm.
FPS (Frames Per Second) is often used to characterize the real-time performance of a
model: the larger the FPS, the better the real-time performance. In the meantime, the
adaptive image scaling method is used to verify the reliability of the algorithm in the
reasoning stage, as shown in Table 6.

Table 6. Comparison of different models in test time, reasoning time, FPS.

Model One Image Test Time All Reasoning Time FPS


Proposed work 0.022 s 144.7 s 54.57
YOLO-v4 0.042 s 151.1 s 23.83
YOLO-v3 0.047 s 153.1 s 21.39
SSD 0.029 s 97.0 s 34.69
Faster R-CNN 0.410 s 1620.7 s 2.44

In the present work, we use the same picture to calculate the test time and compare
the total reasoning time on the test set. It can be seen from the table that the adaptive image
scaling algorithm can effectively reduce the size of red edges at both ends of the image, and
the detection time consumed in the reasoning process is 144.7 s, which is 6.4 s less than that
of YOLO-v4. However, thanks to its model structure, SSD consumes the shortest reasoning
time. Faster R-CNN consumes the most time in reasoning, which is a common feature of
the Two-Stage algorithm. Meanwhile, the FPS of our algorithm can reach 54.57 FPS, which
is the highest among all comparison algorithms, while Faster R-CNN reaches the lowest.
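The following is a minimal sketch of the general idea behind minimal-padding (adaptive) image scaling as evaluated above: the image is resized by the limiting ratio and padded only up to the next multiple of the network stride rather than to a full square, so fewer redundant border pixels enter inference. The function name, parameter names and the stride value of 32 are assumptions for illustration, not the exact implementation used in this paper.

```python
import cv2
import numpy as np

def adaptive_resize(img: np.ndarray, target: int = 416, stride: int = 32,
                    pad_value: int = 128) -> np.ndarray:
    """Resize with the limiting ratio and pad only to a stride multiple."""
    h, w = img.shape[:2]
    r = min(target / h, target / w)             # limiting scale ratio
    new_h, new_w = round(h * r), round(w * r)   # size after scaling
    pad_h = (stride - new_h % stride) % stride  # remaining padding (height)
    pad_w = (stride - new_w % stride) % stride  # remaining padding (width)
    resized = cv2.resize(img, (new_w, new_h))
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT,
                              value=(pad_value, pad_value, pad_value))
```

For example, under these assumptions a 1280 × 720 frame scaled towards 416 becomes 416 × 234 and is padded only to a height of 256 rather than 416, which is where the saving in reasoning time comes from.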

6.4. The Parameter Distribution of Different Network Layers


In general, the spatial complexity of the model can be reflected by the total number
of parameters. We analyze the distribution of parameters from various network parts of
different models in this paper, thus verifying the effectiveness of the improved backbone
feature extraction network and PANet, as shown in Table 7.

Table 7. The parameter distribution of different modules in different models.

Module Faster R-CNN SSD YOLO-v3 YOLO-v4 Proposed Work


Backbone - - 40,620,740 30,730,448 9,840,832
Neck - - 14,722,972 27,041,012 37,514,988
Prediction - - 6,243,400 6,657,945 43,080
All parameters 28,362,685 24,013,232 61,587,112 64,014,760 47,398,900
All CSPx - - - 26,816,384 -
All CSP1_X - - - - 8,288,896
All CSP2_X - - - - 18,687,744
All layers 185 69 256 370 335

In YOLO-v4, the parameters are mainly distributed in the backbone feature extraction
network, and different numbers of residual modules are used to extract deeper information;
however, as the network gets deeper, the number of parameters grows and the model
becomes more complex. It can be seen from Table 7 that the algorithm in this paper has fewer parameters
in the backbone network, which is due to the use of the shallower CSP1_X module, and
it effectively reduces the size of the model. Furthermore, five CSP2_X modules are used
in the Neck module to gather more parameters, which is more helpful to enhance feature
fusion. At last, our model has 335 layers in total, 35 less than YOLO-v4.

6.5. Model Testing


After the model training is completed, the trained weights are used to test the model,
and the model is evaluated from many aspects. For our face mask data set, the test results
can be classified into three categories: TP (true positive) means that the detected category
is the same as the ground-truth category; FP (false positive) means that the detected object
category is inconsistent with the real object category; and FN (false negative) indicates that
a real sample is detected as a wrong category or is not detected at all. The number of all
cases judged positive by the model is (TP + FP), so the proportion of true positives (TP)
among them is called the precision rate, which represents the proportion of correctly
detected samples among all samples detected by the model, as shown in Equation (7).

Precision = TP / (TP + FP)    (7)

For all positive examples in the test set, the number is ( TP + FN ). Therefore, the recall
rate is used to measure the ability of the model to detect the real cases in the test set, as
shown in Equation (8).
Recall = TP / (TP + FN)    (8)
To characterize the precision of the model, this article introduces AP (Average Preci-
sion) and mAP (mean Average Precision) indicators to evaluate the accuracy of the model,
as shown in Equations (9) and (10).
AP = ∫₀¹ P(R) dR    (9)

mAP = (∑ APᵢ) / N,  i = 1, 2, …, N    (10)
Among them, P, R and N represent the precision rate, the recall rate and the total number
of categories, respectively.
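A minimal numerical sketch of Equations (9) and (10) is given below. Real evaluation code additionally sorts detections by confidence and matches them to ground truth at a given IOU threshold; that part is omitted here, and the interpolation scheme is an assumption.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Approximate AP = ∫ P(R) dR with a monotone precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    return float(np.sum(np.diff(r) * p[1:]))   # area under the P-R curve

def mean_average_precision(ap_per_class: dict) -> float:
    """mAP as the mean of the per-class AP values (Equation (10))."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# With the per-class AP values at IOU = 0.5 reported later in Table 9:
print(mean_average_precision({"face": 0.979, "face_mask": 0.995, "WMI": 0.973}))
# -> 0.9823..., i.e., the 98.3% mAP quoted for the proposed model
```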
From Equations (7) and (8), it can be found that there is a trade-off between the precision
rate and the recall rate. Therefore, the comprehensive evaluation index F-Measure, which is
used to evaluate the detection ability of the model, can be written as:

Fα = ((α² + 1) × P × R) / (α² × (P + R))    (11)

When α = 1, F1 represents the harmonic average of precision rate and recall rate, as
shown in Equation (12):
F1 = (2 × P × R) / (P + R)    (12)
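As a small worked check of Equations (7), (8) and (12), the counts reported for the proposed model's "Total" row in Table 8 below (TP = 2174, FP = 112, FN = 39) reproduce the listed P, R and F1 values:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    p = tp / (tp + fp)          # Equation (7)
    r = tp / (tp + fn)          # Equation (8)
    f1 = 2 * p * r / (p + r)    # Equation (12)
    return p, r, f1

print(precision_recall_f1(2174, 112, 39))
# -> approximately (0.951, 0.982, 0.967)
```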
The higher F1 is, the more effective the detection of the model is. We use 2161 images with a
total of 2213 objects as the test set. The test results of the models with IOU = 0.5 are shown
in Table 8.
It can be seen from Table 8 that the model in this paper reaches the maximum value in
TP and the minimum value in FN, and this means that the model itself has good detection
ability for samples. At the same time, the model reaches the optimal value in the F1 index
compared with other models.
To further compare the detection effect of our model with YOLO-v4 and YOLO-v3 on
each category in the test set, the AP value comparison experiments of several models are
carried out under the same experimental environment, as shown in Table 9.

Table 8. Sample detection results of different models on the test set.

Models          Sort        Size        Object   TP     FP    FN    P      R      F1
Proposed work   face        416 × 416   753      737    50    16    0.936  0.979  0.957
Proposed work   face_mask   416 × 416   730      725    23    5     0.969  0.993  0.980
Proposed work   WMI         416 × 416   730      712    39    18    0.948  0.975  0.961
Proposed work   Total       416 × 416   2213     2174   112   39    0.951  0.982  0.967
YOLO-v4         face        416 × 416   753      666    42    87    0.941  0.885  0.910
YOLO-v4         face_mask   416 × 416   730      705    199   25    0.780  0.966  0.860
YOLO-v4         WMI         416 × 416   730      670    195   60    0.775  0.918  0.840
YOLO-v4         Total       416 × 416   2213     2041   436   172   0.832  0.923  0.870
YOLO-v3         face        416 × 416   753      640    53    113   0.924  0.850  0.890
YOLO-v3         face_mask   416 × 416   730      686    23    44    0.968  0.940  0.950
YOLO-v3         WMI         416 × 416   730      623    26    107   0.960  0.853  0.900
YOLO-v3         Total       416 × 416   2213     1949   102   264   0.950  0.881  0.913

Table 9. The comparative experiments of AP of different models in three categories.

Model           Size        IOU           Face    Face_Mask   WMI
Proposed work   416 × 416   [email protected]        0.979   0.995       0.973
Proposed work   416 × 416   [email protected]       0.978   0.995       0.983
Proposed work   416 × 416   [email protected]:.95    0.767   0.939       0.834
YOLO-v4         416 × 416   [email protected]        0.943   0.969       0.944
YOLO-v4         416 × 416   [email protected]       0.680   0.899       0.800
YOLO-v4         416 × 416   [email protected]:.95    0.541   0.740       0.670
YOLO-v3         416 × 416   [email protected]        0.921   0.981       0.941
YOLO-v3         416 × 416   [email protected]       0.617   0.888       0.835
YOLO-v3         416 × 416   [email protected]:.95    0.559   0.789       0.724
SSD             300 × 300   [email protected]        0.941   0.986       0.988
SSD             300 × 300   [email protected]       0.503   0.920       0.926
SSD             300 × 300   [email protected]:.95    0.518   0.789       0.790
Faster R-CNN    600 × 600   [email protected]        0.943   0.974       0.950
Faster R-CNN    600 × 600   [email protected]       0.700   0.927       0.866
Faster R-CNN    600 × 600   [email protected]:.95    0.612   0.824       0.769

It can be seen from Table 9 that the AP values of our model are higher than those of
YOLO-v4, YOLO-v3, SSD and Faster R-CNN under different IOUs, which effectively verifies
the average precision of our model. The only exception is SSD at [email protected], whose AP in the
“WMI” category reaches the highest value.
In this paper, mAP is introduced to measure the detection ability of the model for all
categories, and the model is tested on IOU = 0.5, IOU = 0.75 and IOU = 0.5:0.05:0.95 to
further evaluate the comprehensive detection ability of the model, as shown in Table 10.

Table 10. The mAP comparison experiments of different models in all categories.

Model [email protected] [email protected] [email protected]:95


Proposed work 0.983 0.985 0.847
YOLO-v4 0.952 0.793 0.680
YOLO-v3 0.948 0.780 0.689
SSD 0.972 0.783 0.691
Faster R-CNN 0.956 0.831 0.735

It can be seen that when IOU = 0.5, the mAP of this model is 3.1% higher than that
of YOLO-v4 and 1.1% higher than that of SSD. Under the condition of IOU = 0.5:0.95, a
more rigorous test is carried out, and the experiment shows that [email protected]:95 is 16.7% and
15.6% higher than that of YOLO-v4 and SSD, respectively. This fully shows that the model
is superior to YOLO-v4 and SSD in comprehensive performance. It is worth pointing out
that the mAP of Faster R-CNN is higher than that of YOLO-v4 and YOLO-v3, but its FPS is
the lowest, which also reflects the common characteristics of Two-Stage detection
algorithms: high detection accuracy and low real-time performance. At the same time, we
illustrate the test performance of different models in a visual way, as shown in Figure 12.

Figure 12. Visualization of different models in performance testing: (a) proposed work; (b) YOLO-v4; (c) YOLO-v3; (d) SSD; (e) Faster R-CNN.

The pictures used in the comparative experiment in Figure 12 are from the test set of
this paper. Each experiment is conducted in the same environment. Meanwhile, visual
analysis is carried out with a confidence threshold of 0.5. In the figure, the number of faces
in the image increases from left to right, so the distribution of faces becomes denser, and the
problems of occlusion, multiple scales and density in a complex environment are fully
considered, which makes it convenient to fully prove the robustness and generalization
ability of the model. From the analysis of the figure, it can be found that the performance of
the model used in this paper is better than that of the other four in the test results, but all the
models have poor detection results for severe occlusion and half faces. We consider that the
cause of this problem is the lack of images with seriously missing face features in the data
set, which leads to less learning of these features and a poor generalization ability of the
model. Therefore, one of the main tasks in the future is to expand the data set and enrich the
diversity of features.
6.6. Influence of Different Activation Functions

We use the Mish activation function [42] to highlight the effect of the H-swish activation
function on the results of this paper, as shown in Table 11.

Table 11. Influence of different activation functions.

Function Train Time Face Face_Mask WMI [email protected]


H-swish 2.834 h 0.979 0.995 0.973 0.983
Mish 3.902 h 0.971 0.995 0.973 0.980
L-ReLU 2.812 h 0.975 0.985 0.974 0.978
ReLU 3.056 h 0.970 0.972 0.969 0.970
Sigmoid 2.985 h 0.966 0.968 0.963 0.966

It can be seen from Table 11 that in the same situation, using H-swish as the activation
function can obtain better detection results. Therefore, the mask detection model has
stronger nonlinear feature learning ability under the action of the H-swish activation
function. At this time, the model has the highest detection accuracy in the comparison
experiment of activation functions.
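For reference, a minimal sketch of the two activation functions compared in Table 11 is given below (PyTorch is assumed here only because it is listed among the frameworks in Table 3). H-swish replaces the smooth gate of swish/Mish with a piecewise-linear ReLU6 term, which is the usual explanation for its lower computational cost and the shorter training time observed above.

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # x * ReLU6(x + 3) / 6: piecewise-linear approximation of the swish gate
    return x * F.relu6(x + 3.0) / 6.0

def mish(x: torch.Tensor) -> torch.Tensor:
    # x * tanh(softplus(x)): smooth, non-monotonic activation
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4.0, 4.0, steps=9)
print(h_swish(x))
print(mish(x))
```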

6.7. Analysis of Ablation Experiment


We use ablation experiments to analyze the influence of the improved method on the
performance of the model. The experiments are divided into five groups as comparisons.
The first group is YOLO-v4. In the second group, the CSP1_X module is introduced into the
backbone feature extraction network module of YOLO-v4. The third group is the CSP2_X
module introduced into the Neck module of YOLO-v4. In the fourth group, both CSP1_X
and CSP2_X modules are added into YOLO-v4 at the same time. The last set of experiments
is the result of the model in this article. The experimental results are shown in Table 12.

Table 12. Ablation experiments.

CSP1_X   CSP2_X   H-Swish   Face    Face_Mask   WMI     [email protected]   FPS
×        ×        ×         0.943   0.969       0.944   0.952     23.83
√        ×        ×         0.982   0.984       0.972   0.979     43.47
×        √        ×         0.969   0.993       0.962   0.975     45.45
√        √        ×         0.971   0.993       0.967   0.977     47.65
√        √        √         0.979   0.995       0.973   0.983     54.57

It can be seen from the analysis in the table that the use of the CSP1_X module in
the backbone feature extraction network enhances the AP values of the three categories,
and at the same time, the mAP and FPS are increased by 2.7% and 19.64 FPS, respectively, thus
demonstrating the effectiveness of CSP1_X. Different from YOLO-v4, this paper takes
advantage of the CSPDarkNet53 module and introduces the CSP2_X module into Neck to
further enhance the learning ability of the network in semantic information. Experiments
show that CSP2_X also improves the AP values of the three categories, and mAP and
FPS are increased by 2.3% and 21.62 FPS, respectively, compared with YOLO-v4. From
the comparative experiments of the fourth and the fifth groups, we find that the H-swish
activation function significantly ameliorates the detection accuracy of the model.
In summary, the improved strategies proposed in this paper based on YOLO-v4
are meaningful for promoting the recognition and detection of face masks in complex
environments.

7. Conclusions
In this paper, an improved algorithm based on YOLO-v4 is proposed to solve the
problem of mask wearing recognition. Meanwhile, the effectiveness and robustness of this
model are verified by the comparative study of two kinds of object detection algorithms.
This article can be summarized as follows:
• Firstly, the CSP1_X module is introduced into the backbone feature extraction network
to enhance feature extraction.

• Secondly, the CSP2_X module is used in the Neck module to ensure that the model
can learn deeper semantic information in the process of feature fusion.
• Thirdly, the Hard-Swish activation function is used to improve the nonlinear feature
learning ability of the model.
• Finally, the proposed adaptive image scaling algorithm can reduce the model’s rea-
soning time.
The experimental results show that the algorithm proposed in this paper has the highest
detection accuracy compared with the others for strict mask detection tasks. Meanwhile, the
phenomena of false and missing detection have been reduced. Moreover, the algorithm in
this paper effectively decreases the training cost and model complexity, which enables the
model not only to be deployed on medium-performance devices but also to be extended to
other object detection tasks, such as mask wearing detection for students, passengers,
patients and other staff.
However, in the present work, there are still some problems, such as insufficient feature
extraction for difficult samples and even missing or false detections. In addition, the case of
wearing a mask under insufficient lighting is also not considered. Therefore, the next step
should be to expand the data set based on the standard mask wearing criteria, to further
improve the model in the present work, and then to extend it to more object detection tasks.

Author Contributions: Conceptualization, W.Z., J.Y.; methodology, W.Z., J.Y.; software, W.Z.; valida-
tion, W.Z.; formal analysis, J.Y.; investigation, W.Z.; resources, W.Z.; data curation, W.Z.; writing—
original draft preparation, W.Z.; writing—review and editing, W.Z., J.Y.; visualization, W.Z.; supervi-
sion, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant
number 61673079. The research was supported by the Key Lab of Industrial Wireless Networks and
Networked Control of the Ministry of Education.
Institutional Review Board Statement: The study was conducted according to the guidelines of the
Declaration of Helsinki, and approved by the Ethics Committee of College of Automation, Chongqing
University of Posts and Telecommunications.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Acknowledgments: The work was carried out at the Key Lab of Industrial Wireless Networks and
Networked Control of the Ministry of Education and the authors would like to thank medical staff
Huan Ye for the data support.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Alberca, G.G.F.; Fernandes, I.G.; Sato, M.N.; Alberca, R.W. What Is COVID-19? Front. Young Minds 2020, 8, 74. [CrossRef]
2. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
3. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Advances in Kernel Methods.
Support Vector Learn. 1998, 208.
4. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Neural Inf. Process.
Syst. 2012, 25. [CrossRef]
5. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. J. Mach. Learn. Res. 2011, 15, 315–323.
6. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing
co-adaptation of feature detectors. Comput. ENCE 2012, 3, 212–223.
7. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [CrossRef] [PubMed]
8. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.

10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
11. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
14. Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago,
Chile, 11–18 December 2015; pp. 1440–1448.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
16. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-Aware Trident Networks for Object Detection. In Proceedings of the 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6053–6062. [CrossRef]
17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland,
2016; pp. 21–37.
18. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659.
19. Li, Z.; Zhou, F. FSSD: Feature Fusion Single Shot Multibox Detector. arXiv 2017, arXiv:1712.00960.
20. Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 779–788. [CrossRef]
22. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef]
23. Shin, H.C.; Roth, H.R.; Gao, M.; Le, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks
for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE Trans. Med Imaging 2016, 35,
1285–1298. [CrossRef] [PubMed]
24. Giger, M.L.; Suzuki, K. Computer-aided diagnosis. In Biomedical Information Technology; Academic Press: Cambridge, MA, USA,
2008; pp. 359–374.
25. Khan, M.A.; Kim, Y. Cardiac Arrhythmia Disease Classification Using LSTM Deep Learning Approach. Computers. Mater. Contin.
2021, 67, 427–443.
26. Uijlings, J.R.R.; Sande, K.E.A.v.d.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013,
104, 154–171. [CrossRef]
27. Buciu, I. Color quotient based mask detection. In Proceedings of the 2020 International Symposium on Electronics and Telecom-
munications (ISETC), Timisoara, Romania, 5–6 November 2020; pp. 1–4.
28. Loey, M.; Manogaran, G.; Taha, M.; Khalifa, N.E. Fighting against COVID-19: A novel deep learning model based on YOLO-v2
with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2020, 65, 102600. [CrossRef] [PubMed]
29. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 2020, 167, 108288.
30. Nagrath, P.; Jain, R.; Madan, A.; Arora, R.; Kataria, P.; Hemanth, J.D. SSDMNV2: A real time DNN-based face mask detection
system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 2020, 66, 102692. [CrossRef] [PubMed]
31. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle,
WA, USA, 14–19 June 2020.
32. Neubeck, A.; Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern
Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
33. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for
Object Detection and Instance Segmentation. arXiv 2020, arXiv:2005.03572.
34. Avenash, R.; Viswanath, P. Semantic Segmentation of Satellite Images using a Modified CNN with Hard-Swish Activation
Function. VISIGRAPP 2019. [CrossRef]
35. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2018, arXiv:1710.05941.
36. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching
for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea,
27 October–2 November 2019; pp. 1314–1324.
37. Keys, R. Cubic convolution interpolation for digital image processing. In IEEE Transactions on Acoustics, Speech, and Signal
Pro-Cessing; IEEE: Piscataway, NJ, USA, 1981; Volume 29, pp. 1153–1160. [CrossRef]

38. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM
International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016.
39. Wang, Z.-Y.; Wang, G.; Huang, B.; Xiong, Z.; Hong, Q.; Wu, H.; Yi, P.; Jiang, K.; Wang, N.; Pei, Y.; et al. Masked Face Recognition
Dataset and Application. arXiv 2020, arXiv:2003.09093.
40. Cabani, A.; Hammoudi, K.; Benhabiles, H.; Melkemi, M. MaskedFace-Net-A Dataset of Correctly/Incorrectly Masked Face
Images in the Context of COVID-19. arXiv 2020, arXiv:2008.08016.
41. Zhang, H.; Li, D.; Ji, Y.; Zhou, H.; Wu, W.; Liu, K. Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending
Machines. IEEE Trans. Ind. Inform. 2020, 16, 7722–7731. [CrossRef]
42. Misra, D. Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv 2019, arXiv:1908.08681.
