Article
Face Mask Wearing Detection Algorithm Based on
Improved YOLO-v4
Jimin Yu 1,2 and Wei Zhang 1,2, *
1 College of Automation, Chongqing University of Post and Telecommunications, Chongqing 400065, China;
[email protected]
2 Key Lab of Industrial Wireless Networks and Networked Control of the Ministry of Education,
Chongqing 400065, China
* Correspondence: [email protected]
Abstract: To solve the problems of low accuracy, poor real-time performance and weak robustness caused by complex environments, this paper proposes a face mask recognition and standard-wearing detection algorithm based on the improved YOLO-v4. Firstly, an improved CSPDarkNet53 is introduced into the trunk feature extraction network, which reduces the computing cost of the network and improves the learning ability of the model. Secondly, an adaptive image scaling algorithm is introduced to reduce computation and redundancy effectively. Thirdly, the improved PANet structure is introduced so that the network has more semantic information in the feature layer. At last, a face mask detection data set is made according to the standard wearing of masks. Based on deep learning object detection algorithms, a variety of evaluation indexes are compared to evaluate the effectiveness of the model. The results of the comparisons show that the mAP of face mask recognition reaches 98.3% and the frame rate reaches 54.57 FPS, which is more accurate than existing algorithms.
Keywords: adaptive image scaling; CSPDarkNet53; face mask recognition; PANet; YOLO-v4
Figure 1. Flow chart of the proposed approach.
The contributions of this paper are as follows:
• Aiming at the problem of training time, this paper introduces the improved CSPDarkNet53 into the backbone to realize the rapid convergence of the model and reduce the time cost in training.
• An adaptive image scaling algorithm is introduced to reduce the use of redundant information in the model.
• To strengthen the fusion of multi-scale semantic information, the improved PANet is added into the Neck module.
• The Hard-Swish activation function introduced in this paper can not only strengthen the nonlinear feature extraction ability of the network, but also enable the detection results of the model to be more accurate.
To sum up, in the face mask detection task, the algorithm proposed in this paper has higher detection accuracy than other typical algorithms, which means the algorithm is
more suitable for the mask detection task. At the same time, the algorithm is more practical
to deploy in public places to urge people to wear masks regularly in order to reduce the
risk of cross-infection.
2. Related Works
2.1. Problems Exist in Object Detection
There are two key points of face mask wearing detection. One is to locate the position
of the face in the image; the other is to identify whether the face given by the data set is
wearing a mask and if the mask is worn correctly. Problems of the present object detection
algorithm can be attributed to face occlusion, variable face scale, uneven illumination,
density, etc., and these problems seriously affect the performance of the algorithm. Further-
more, the traditional object detection algorithm adopts the selective search method [26]
in feature extraction, leading to problems such as poor generalization ability, redundant
information, low accuracy and poor real-time performance.
Figure 2. YOLO-v4 network structure.
YOLO-v4 proposed a new mosaic data augmentation method to expand the data set and introduced CIOU as the positioning loss function [33], which made the network more inclined to optimize in the direction of increasing overlapping areas, thus effectively improving the accuracy. In the actual complex environment, due to external interference such as occlusion and multi-scale objects, there are still some shortcomings in face mask detection directly using YOLO-v4. The main ones are as follows:
There are still problems such as insufficient shallow feature extraction for multi-scale objects.
In the reasoning stage, the model adds gray bars at both ends of the image to prevent the image from distorting, but too many gray bars increase the redundant information of the model.
At the same time, the model has problems such as long training time, high calculation cost and overfull parameters.
To solve these problems, this paper optimizes and improves the model based on YOLO-v4.
4. Improved YOLO-v4 Network Model
With the increasing number of convolutional layers, the network becomes deeper, and the deeper network structure is beneficial for the extraction of object features. Thereupon, the semantic information of small objects is increased. The main improvements presented in this paper based on YOLO-v4 are as follows: the CSPDarkNet53 is improved into CSP1_X and CSP2_X, thereby reducing network modules and the parameters of feature extraction in the network model; the CSP2_X module is used in the Neck to increase information fusion; and the adaptive image scaling method is used to replace the image scaling method in YOLO-v4.
Figure 3. CSPDarkNet53 module structure.
Although YOLO-v4 uses the residual network to reduce the computing power requirement of the model, its memory requirement still needs to be improved. Therefore, in this paper, the network structure of CSPDarkNet53 of YOLO-v4 is improved to the CSP1_X module, as shown in Figure 4.
Figure 4. CSP1_X module structure.
Compared with the CSPDarkNet53 in Figure 2, the improved network uses the H-swish activation function [34], as shown in Equation (1):

H-swish(x) = x × ReLU6(x + 3)/6 (1)
As the Swish function [35] contains the Sigmoid function, the calculation cost of the Swish function is higher than that of the ReLU function, but the Swish function is more effective than the ReLU one. Howard used the H-swish function on mobile devices [36] to reduce the number of accesses to memory by the model, which further reduced the time cost.
Therefore, in this paper, the advantages of the H-swish function are used to reduce the running time requirements of the model on condition of ensuring no gradient explosion, disappearance or other problems. At the same time, the detection accuracy of the model is improved.
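As a reference, the following minimal PyTorch sketch implements Equation (1); the class name HSwish and the use of torch.nn.functional.relu6 and the built-in nn.Hardswish are our own illustrative choices, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    """Hard-Swish activation: x * ReLU6(x + 3) / 6, as in Equation (1)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * F.relu6(x + 3.0) / 6.0

# Quick check against PyTorch's built-in Hardswish
x = torch.randn(4)
print(HSwish()(x), nn.Hardswish()(x))
```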
In CSP1_X, the input feature layer of the residual block is divided into two branches. One is used as the residual edge for the convolution operation. The other plays the role of the trunk part, performs a 1 × 1 convolution operation at first, then performs a 1 × 1 convolution to adjust the channel after entering the residual block, and then performs the 3 × 3 convolution operation to enhance the feature extraction. At last, the two branches are concatenated, thus merging the channels to obtain more feature layer information. In this paper, three CSP1_X modules are used in the improved Backbone, where X represents the number of residual weighting operations in the residual structure. Finally, after stacking, a 1 × 1 convolution is used to integrate the channels. Experiments show that using this residual structure can make the network structure easier to optimize.
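To make the structure concrete, here is a minimal PyTorch sketch of a CSP1_X-style block as we read Figure 4: a residual-edge 1 × 1 convolution branch, a trunk branch of X residual units, concatenation, BN + Leaky-ReLU, and a final fusing CBH. The channel sizes, helper names and exact normalization placement are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class CBH(nn.Module):
    """Conv + BatchNorm + Hard-swish, the basic unit of the improved backbone."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualUnit(nn.Module):
    """1x1 channel-adjust conv followed by a 3x3 conv, with a skip connection."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBH(c, c // 2, 1), CBH(c // 2, c, 3))
    def forward(self, x):
        return x + self.block(x)

class CSP1_X(nn.Module):
    """Cross-stage block: a trunk of X residual units and a shortcut conv branch, then concat + fuse."""
    def __init__(self, c_in, c_out, x=3):
        super().__init__()
        c_mid = c_out // 2
        self.trunk_in = CBH(c_in, c_mid, 1)
        self.trunk = nn.Sequential(*[ResidualUnit(c_mid) for _ in range(x)])
        self.trunk_out = nn.Conv2d(c_mid, c_mid, 1, bias=False)
        self.shortcut = nn.Conv2d(c_in, c_mid, 1, bias=False)   # residual edge
        self.post = nn.Sequential(nn.BatchNorm2d(2 * c_mid), nn.LeakyReLU(0.1), CBH(2 * c_mid, c_out, 1))
    def forward(self, x):
        y = torch.cat([self.trunk_out(self.trunk(self.trunk_in(x))), self.shortcut(x)], dim=1)
        return self.post(y)

print(CSP1_X(64, 128, x=3)(torch.randn(1, 64, 52, 52)).shape)  # torch.Size([1, 128, 52, 52])
```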
4.2. Neck Network
The convolutional neural network requires the input image to have a fixed size. In past convolutional neural networks, the fixed input was obtained by cutting and warping operations, but these methods easily bring about problems such as object missing or deformation. To eliminate such problems, researchers proposed SPPNet to remove the requirement of fixed input size. To gain multi-scale local features, YOLO-v4 introduced the SPPNet structure based on YOLO-v3. In order to further fuse the multi-scale local feature information with the global feature information, we add the CSP2_X module to the PANet structure of YOLO-v4 to enhance the feature extraction, which helps to speed up the flow of feature information and enhance the accuracy of the model. CSP2_X is shown in Figure 5.
Figure 5. CSP2_X module structure.
The common convolution operation is adopted in the Neck network in YOLO-v4, while the CSPNet has the advantages of superior learning ability, reduced computing bottleneck and memory cost. Adding the improved CSPNet network module based on YOLO-v4 can further enhance the ability of network feature fusion. This combined operation can realize the top-down transmission of deeper semantic features in PANet, and at the same time fuse the bottom-up deep positioning features from the SPPNet network, thus realizing feature fusion between different backbone layers and different detection layers in the Neck network and providing more useful features for the Prediction network.
4.3. Adaptive Image Scaling
In the object detection network, the image data received by the input port have a uniform standard size. For example, the standard size of each image in the handwritten numeral recognition data set MNIST is 28 × 28. However, different data sets have different image sizes, and ResNet fixes the input image to 224 × 224. Two standard sizes of 416 × 416 and 608 × 608 are provided at the input port of the YOLO-v4 detection network. Traditional methods for obtaining a standard size mainly include cutting, twisting, stretching, scaling, etc., but these methods easily cause problems such as missing objects, loss of resolution and a reduction in accuracy. In previous convolutional neural networks, it was necessary to unify the image data to the standard size manually in advance, while YOLO-v4 standardizes the image size directly by using the data generator and then inputs the image into the network to realize the end-to-end learning process. In the training and testing stages of YOLO-v4, the sizes of input images are 416 × 416 and 608 × 608. When standardizing images of different sizes, the original images are scaled firstly; then gray images with sizes of 416 × 416 or 608 × 608 are generated; and finally, the scaled image is overlaid on the gray image to obtain image data of standard size.

Taking the unregulated mask image as an example, the image processed by the scaling algorithm is shown in Figure 6. In Figure 6, A is the original input image. After calculating the scaling ratio of the original input image, image B is obtained by the BICUBIC interpolation algorithm [37], and C is a gray image with the standardized size. Finally, the original image can be pasted onto the gray image, and the standard input image D can be obtained. The image scaling algorithm reduces the resolution without distortion, but in practical applications, most images have different aspect ratios, so after scaling and filling by this algorithm, the gray image sizes at both ends of the image are not the same. If the filled gray image size is too large, there is information redundancy, which increases the reasoning time of the model.
Figure 6. Image scaling in YOLO-v4 (A: original image, 447 × 343 × 3; B: scaled image, 416 × 319 × 3; C: gray image, 416 × 416 × 3; D: standard input, 416 × 416 × 3).
Therefore, this paper introduces an adaptive image scaling algorithm to adaptively add the least red edge to the original image. The steps of this algorithm are shown in Algorithm 1.
The process of this algorithm can be understood as follows: In the first step, TW and TH in the standard size of the object image are divided by W and H of the input image, respectively, and then their minimum value scaling_ratio is treated as the scaling factor. In the second step, multiply the scaling_factor by the original image's W and H, respectively, and take new_w and new_h as the scaled dimensions of the original image. In the third step, the TW and TH of the object image are subtracted by new_w and new_h, respectively, to obtain dw and dh. In the fourth step, obtain the maximum value of dw and dh, and then calculate the remainder of this maximum value and 64, which is assigned as d. As the network model carries out five down-sampling operations, and each down-sampling operation compresses the height and width of the feature graphs of the last time to 1/2 of the original, the size of the feature map obtained after five down-sampling operations is 1/32 of the original image, so the length and width must be multiples of 32; in this paper, 64 is also required. The fifth step is to calculate the padding of the red edge on both sides of the image as padding = d/2. For the sixth step, if (W, H) and (new_w, new_h) are not the same, scale the original image with resize(input image, (new_w, new_h)). The last step is to fill the two sides of the scaled image with add_border(image, (padding, padding)) and output the new image new_image.
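For readers who want to reproduce the behaviour, the following Python sketch follows the steps described above; it uses OpenCV bicubic resizing and a fixed fill colour as stand-ins, so the function name, the target size default and the colour value are illustrative assumptions rather than the paper's exact script.

```python
import cv2
import numpy as np

def adaptive_scale(image: np.ndarray, target=(416, 416), fill=(0, 0, 255)):
    """Resize with unchanged aspect ratio, then pad only the minimal border
    (a multiple of the network stride) on the shorter dimension (Algorithm 1 sketch)."""
    h, w = image.shape[:2]
    tw, th = target
    ratio = min(tw / w, th / h)                                   # step 1: scaling factor
    new_w, new_h = int(round(w * ratio)), int(round(h * ratio))   # step 2: scaled size
    dw, dh = tw - new_w, th - new_h                               # step 3: size gaps
    d = max(dw, dh) % 64                                          # step 4: keep a multiple of 64
    pad = d // 2                                                  # step 5: border per side
    if (w, h) != (new_w, new_h):                                  # step 6: rescale if needed
        image = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
    if dw >= dh:                                                  # step 7: fill the two short sides
        image = cv2.copyMakeBorder(image, 0, 0, pad, d - pad, cv2.BORDER_CONSTANT, value=fill)
    else:
        image = cv2.copyMakeBorder(image, pad, d - pad, 0, 0, cv2.BORDER_CONSTANT, value=fill)
    return image

print(adaptive_scale(np.zeros((343, 447, 3), dtype=np.uint8)).shape)  # (352, 416, 3)
```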
Similarly, the image of wearing a mask irregularly, for instance, after being processed by the adaptive scaling algorithm, is shown in Figure 7.
Figure 7. Adaptive image scaling.
Comparing Figure 6 with Figure 7, it can be observed that after adaptive scaling the least red edge is added at both ends of the original image, thus reducing redundant information. When the model is used for reasoning, the calculation will be reduced, and the reasoning speed of object detection will be improved.
4.4. Improved Network Model Structure
The improved network model is shown in Figure 8, in which three CSP1_X modules are used in the Backbone of the backbone feature extraction network, and each CSP1_X module has X residual units. In this paper, considering the calculation cost, the residual modules are connected in series into the combination of X residual units. This operation can replace two 3 × 3 convolution operations with a combination of 1 × 1 + 3 × 3 + 1 × 1 convolution modules. The first 1 × 1 convolution layer can compress the number of channels to half of the original one and reduce the number of parameters at the same time. The 3 × 3 convolution layer can enhance feature extraction and restore the number of channels. The last 1 × 1 convolution operation restores the output of the 3 × 3 convolution layer, so the alternate convolution operation is helpful for feature extraction, ensures accuracy and reduces the amount of computation.
Figure 8. Network model of mask detection (input 416 × 416 × 3; CBH = Conv + BN + H-swish; SPP = CBH, three parallel max-pooling branches, Concat, CBH; prediction outputs 13 × 13 × 24, 26 × 26 × 24 and 52 × 52 × 24).
The Neck network is mainly composed of the SPPNet and improved PANet. In this paper, the SPPNet module enlarges the acceptance range of backbone features effectively, and thus significantly separates the most important contextual features. The high computational cost of model reasoning is mainly caused by the repeated appearances of gradient
information in the process of network optimization. Therefore, from the point of view of
network model design, this paper introduces the CSP2_X module into PANet to divide the
basic feature layer from Backbone into two parts and then reduces the use of repeated gra-
dient information through cross-stage operation. In the same way, the CSP2_X module uses
the combination of 1 × 1 + 3 × 3 + 1 × 1 convolution module to reduce the computation
cost and ensure accuracy.
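As an illustration of how SPP enlarges the receptive range of backbone features before PANet fusion, the sketch below follows the SPP layout indicated in Figure 8 (a CBH, three parallel max-pooling branches, concatenation with the identity, and a fusing CBH); the pooling kernel sizes 5, 9 and 13 are the common YOLO-v4 choice and are assumed here rather than quoted from the paper.

```python
import torch
import torch.nn as nn

def cbh(c_in, c_out, k=1):
    # Conv + BN + Hard-swish, as in the CBH blocks of Figure 8
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.Hardswish())

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pools enlarge the receptive field without
    changing the spatial size, then the maps are concatenated and fused."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):   # kernel sizes assumed
        super().__init__()
        c_mid = c_in // 2
        self.reduce = cbh(c_in, c_mid, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.fuse = cbh(c_mid * (len(kernels) + 1), c_out, 1)
    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

print(SPP(1024, 512)(torch.randn(1, 1024, 13, 13)).shape)  # torch.Size([1, 512, 13, 13])
```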
The Prediction module uses the features extracted from the model to predict. In this
paper, the Prediction network is divided into three effective feature layers: 13 × 13 × 24,
26 × 26 × 24 and 52 × 52 × 24, which correspond to big object, medium object and small
object, respectively. Here, 24 can be understood as the product of 3 and 8, and 8 can be
divided into the sum of 4, 1 and 3, where 4 represents the four position parameters of the
prediction box, 1 is used to judge whether the prior box contains objects, and 3 represents
that there are three categories of mask detection tasks.
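The 24-channel arithmetic can be written out explicitly; the snippet below is only a worked restatement of the paragraph above (3 prior boxes × (4 box parameters + 1 objectness score + 3 mask classes)).

```python
# Channels of each prediction feature layer: anchors * (box params + objectness + classes)
num_anchors, num_box_params, num_obj, num_classes = 3, 4, 1, 3
channels = num_anchors * (num_box_params + num_obj + num_classes)
print(channels)  # 24, matching the 13x13x24, 26x26x24 and 52x52x24 heads
```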
4.5. Object Location and Prediction Process
YOLO-v3, YOLO-v4 and the model used in this paper all make predictions through the Prediction module after extracting three feature layers. For the 13 × 13 × 24 effective feature layer, it is equivalent to dividing the input picture into 13 × 13 grids, and each grid will be responsible for object detection in the area corresponding to this grid. When the center of an object falls in this area, it is necessary to use this grid to take charge of the object detection. Each grid will preset three prior boxes, and the prediction results of the network will adjust the position parameters of the three prior boxes to obtain the final prediction result. Then, the final prediction box will be screened out by ranking the confidence level and NMS to obtain the detection result of the network, as shown in Figure 9.

Figure 9. The process of object positioning and prediction.

YOLO-v3 is an improved version based on YOLO-v2, which solves the multi-scale problem of objects and improves the detection effect of the network on small-scale objects. At the same time, YOLO-v3 uses binary cross-entropy as the loss function, so that the network can realize multi-category prediction with one boundary box. The YOLO-v3 and YOLO-v4 prediction methods are adopted in the prediction process in the present paper, as shown in Figure 9b, and tx, ty, tw and th are the four parameters that the network needs to learn, which are:

bx = σ(tx) + cx (2)
by = σ(ty) + cy (3)
bw = pw × e^(tw) (4)
bh = ph × e^(th) (5)

Here, (cx, cy) is the offset of the grid and (pw, ph) is the size of the prior box; σ(tx) and σ(ty) indicate that tx and ty are constrained by the Sigmoid function to ensure that the center of the prediction box falls within the grid.

The confidence score reflects the accuracy of the model predicting that an object is a certain category, as shown in Equation (6).

Confidence = Pr(Classi|Object) × Pr(Object) × IoU_pred^truth (6)

In Equation (6), Pr(Classi|Object) means the probability of what kind of object it is when it is known to be an object. Pr(Object) represents the probability of whether the prediction box contains an object. If an object is included, Pr(Object) = 1, otherwise it equals 0. IoU_pred^truth tells us the overlap ratio between the predicted box and the true box [38].

4.6. The Size Design of Prior Box
For the mask detection data set in this paper, it is necessary to set appropriate prior box sizes to obtain accurate prediction results. The size of the prior box obtained by the k-means clustering algorithm is shown in Table 1.
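A common way to obtain such priors is k-means clustering over the labelled box sizes with an IoU-based distance; the following sketch is a generic implementation of that idea under our own choices (9 clusters, distance 1 − IoU, synthetic box sizes) and is not the authors' exact script.

```python
import numpy as np

def iou_wh(boxes, centers):
    # IoU between boxes and cluster centers, comparing widths/heights only
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU to obtain prior box sizes."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centers), axis=1)      # nearest = largest IoU
        new_centers = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]    # sort by area

wh = np.abs(np.random.default_rng(1).normal(100, 40, size=(500, 2)))  # stand-in box sizes
print(kmeans_priors(wh, k=9).round(1))
```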
Figure 10. Division of key parts (width, height, and the Ymin/Ymax coordinates of the nose, mouth and chin regions).
Generally speaking, the face is completely exposed if not wearing a mask. The non-standard wearing of masks can be attributed to four situations: the nose is exposed; the nose and mouth are exposed; the mouth, nose and chin are all exposed; and only the chin is exposed. To wear a mask in a standard way, you need to ensure that the front and back as well as the upper and lower sides of the mask are correctly distinguished, press the metal strip on both sides of the nose bridge with both hands so that the upper end of the mask fits closely to the bridge of the nose and the left and right ends of the mask fit closely to the cheeks, and then stretch the mask downwards so that it does not leave wrinkles and better covers the nose, mouth and chin. Figure 11 shows the standard and non-standard ways of wearing a mask.
Figure 11 (panels e–h): (e) WMI (only exposed nose); (f) WMI (exposed nose and mouth); (g) WMI (exposed nose, mouth and chin); (h) WMI (only exposed chin).
Device | Configuration
Operating system | Windows 10
Processor | Intel(R) i7-9700k
GPU accelerator | CUDA 10.1, Cudnn 7.6
GPU | RTX 2070Super, 8G
Frameworks | Pytorch, Keras, Tensorflow
Compilers | Pycharm, Anaconda
Scripting language | Python 3.7
Camera | A4tech USB2.0 Camera
Before the model in this article is trained, its hyperparameters need to be initialized. The model continuously optimizes these parameters during the training process, which speeds up the convergence of the network and prevents overfitting. All
experiments in this paper are performed under the epoch of 50, batch size of 8 and the
input image size of 416 × 416 × 3. The parameter adjustment process is shown in Table 4.
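For reference, these settings can be collected into a single configuration object; the sketch below simply restates the values given in this paragraph (optimizer and learning-rate fields are deliberately omitted because they are not listed here).

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values taken from the experimental setup described above
    epochs: int = 50
    batch_size: int = 8
    input_shape: tuple = (416, 416, 3)

print(TrainConfig())
```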
Table 5. Comparison of different models in parameters, model size, and training time.
It can be seen that the parameters of this model are reduced by 15.9 MB compared with YOLO-v4 and by 13.5 MB compared with YOLO-v3. At the same time, the model size
is 0.371 and 0.387 times that of YOLO-v4 and YOLO-v3, respectively. Under the same
conditions, the training time of this model is 2.834 h, which is the lowest of all the models
compared in the experiment.
In Faster R-CNN, the authors used the Region Proposal Network (RPN) to generate
W × H × K candidate regions, which increases the operation cost. Meanwhile, Faster
R-CNN not only retained the ROI-Pooling layer in it but also used the full connection layer
in the ROI-Pooling layer, which brought the network many repeated operations and then
reduced the training speed of the model.
In the present work, we use the same picture to calculate the test time and compare
the total reasoning time on the test set. It can be seen from the table that the adaptive image
scaling algorithm can effectively reduce the size of red edges at both ends of the image, and
the detection time consumed in the reasoning process is 144.7 s, which is 6.4 s less than that
of YOLO-v4. However, thanks to its model structure, SSD consumes the shortest reasoning
time. Faster R-CNN consumes the most time in reasoning, which is a common feature of
the Two-Stage algorithm. Meanwhile, the FPS of our algorithm can reach 54.57 FPS, which
is the highest among all comparison algorithms, while Faster R-CNN reaches the lowest.
In YOLO-v4, its parameters are mainly distributed in the backbone feature extraction
network, and a different number of residual modules is used to extract deeper information,
but as the network gets deeper, the number of parameters grows and this complicates
the model. It can be seen from Table 7 that the algorithm in this paper has fewer parameters
in the backbone network, which is due to the use of the shallower CSP1_X module, and
it effectively reduces the size of the model. Furthermore, five CSP2_X modules are used
in the Neck module to gather more parameters, which is more helpful to enhance feature
fusion. At last, our model has 335 layers in total, 35 less than YOLO-v4.
Precision = TP/(TP + FP) (7)
For all positive examples in the test set, the number is ( TP + FN ). Therefore, the recall
rate is used to measure the ability of the model to detect the real cases in the test set, as
shown in Equation (8).
Recall = TP/(TP + FN) (8)
To characterize the precision of the model, this article introduces AP (Average Preci-
sion) and mAP (mean Average Precision) indicators to evaluate the accuracy of the model,
as shown in Equations (9) and (10).
AP = ∫_0^1 P(R) dR (9)

mAP = (∑_{i=1}^{N} AP_i)/N (10)
Among them, P, R, N, respectively, represent precision, recall rate and the total number
of objects in all categories.
Through Equations (7) and (8), it can be found that there is a contradiction between
precision rate and recall rate. Therefore, the comprehensive evaluation index F-Measure
used to evaluate the detection ability of the model can be shown as:
Fα = ((α^2 + 1) × P × R)/(α^2 × (P + R)) (11)
When α = 1, F1 represents the harmonic average of precision rate and recall rate, as
shown in Equation (12):
F1 = (2 × P × R)/(P + R) (12)
If F1 is higher, the test of the model will be more effective. We use 2161 images with a
total of 2213 objects as the test set. The test results of the model with IOU = 0.5 are shown
in Table 8.
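Equations (7), (8) and (12) can be computed directly from the confusion counts; the helper below is a straightforward restatement of those formulas (the example counts are made up purely for illustration).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, Recall and F1 as defined in Equations (7), (8) and (12)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=2100, fp=40, fn=113))  # illustrative counts only
```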
It can be seen from Table 8 that the model in this paper reaches the maximum value in
TP and the minimum value in FN, and this means that the model itself has good detection
ability for samples. At the same time, the model reaches the optimal value in the F1 index
compared with other models.
To further compare the detection effect of our model with YOLO-v4 and YOLO-v3 on
each category in the test set, the AP value comparison experiments of several models are
carried out under the same experimental environment, as shown in Table 9.
It can be seen from Table 9 that the performance of the AP value of our model is higher
than YOLO-v4, YOLO-v3, SSD and Faster R-CNN under different IOUs, thus the average
precision of our model is effectively verified. However, in the case of SSD, its AP in the category of "WMI" reaches the highest value at one of the IOU thresholds.
In this paper, mAP is introduced to measure the detection ability of the model for all
categories, and the model is tested on IOU = 0.5, IOU = 0.75 and IOU = 0.5:0.05:0.95 to
further evaluate the comprehensive detection ability of the model, as shown in Table 10.
Table 10. The mAP comparison experiments of different models in all categories.
It can be seen that when IOU = 0.5, the mAP of this model is 3.1% higher than that
of YOLO-v4 and 1.1% higher than SSD. Under the condition of IOU = 0.5:0.95, a more
rigorous test is carried out, and the experiment shows that mAP@0.5:95 is 16.7% and 15.6%
higher than YOLO-v4 and SSD, respectively. This fully shows that the model is superior
to YOLO-v4 and SSD in comprehensive performance. It is worth pointing out that the
mAP of Faster R-CNN is higher than YOLO-v4 and YOLO-v3, but the FPS is the lowest,
which also implies the common characteristics of the Two-Stage detection algorithm: high
detection accuracy and low real-time performance. At the same time, we illustrate the performance of different models in terms of test performance in a visual way, as shown in Figure 12.
Figure 12. Visualization of different models in performance testing: (b) YOLO-v4; (c) YOLO-v3; (d) SSD; (e) Faster-RCNN.
The pictures used in the comparative experiment in Figure 12 are from the test set of this paper. Each experiment is conducted in the same environment. Meanwhile, visual analysis is carried out on condition of the confidence level 0.5. In the figure, the number of faces in the image from left to right is constantly increasing, so the distribution of faces is denser, and the problems of occlusion, multi-scale and density in a complex environment are fully considered, which offers convenience to fully prove the robustness and generalization ability of the model. From the analysis of the figure, it can be found that the performance of the model used in this paper is better than the other four in test results, but all the models have poor detection results for severe occlusion and half faces. We consider that the cause of this problem is the lack of images with seriously missing face features in the data set, which leads to less learning of these features and the poor generalization ability of the model. Therefore, one of the main tasks in the future is to expand the data set and enrich the diversity of features.
6.6. Influence of Different Activation Functions
We use the Mish activation function [42] to highlight the effect of the H-swish activation function on the results of this paper, as shown in Table 11.
It can be seen from Table 11 that in the same situation, using H-swish as the activation
function can obtain better detection results. Therefore, the mask detection model has
stronger nonlinear feature learning ability under the action of the H-swish activation
function. At this time, the model has the highest detection accuracy in the comparison
experiment of activation functions.
It can be seen from the analysis in the table that the use of the CSP1_X module in
the backbone feature extraction network enhances the AP values of the three categories,
and at the same time, the mAP and FPS are increased by 2.7% and 19.64 FPS, respectively, thus
demonstrating the effectiveness of CSP1_X. Different from YOLO-v4, this paper takes
advantage of the CSPDarkNet53 module and introduces the CSP2_X module into Neck to
further enhance the learning ability of the network in semantic information. Experiments
show that CSP2_X also improves the AP values of the three categories, and mAP and
FPS are increased by 2.3% and 21.62 FPS, respectively, compared with YOLO-v4. From
the comparative experiments of the fourth and the fifth groups, we find that the H-swish
activation function significantly ameliorates the detection accuracy of the model.
In summary, the improved strategies proposed in this paper based on YOLO-v4
are meaningful for promoting the recognition and detection of face masks in complex
environments.
7. Conclusions
In this paper, an improved algorithm based on YOLO-v4 is proposed to solve the
problem of mask wearing recognition. Meanwhile, the effectiveness and robustness of this
model are verified by the comparative study of two kinds of object detection algorithms.
This article can be summarized as follows:
• Firstly, the CSP1_X module is introduced into the backbone feature extraction network
to enhance feature extraction.
• Secondly, the CSP2_X module is used in the Neck module to ensure that the model
can learn deeper semantic information in the process of feature fusion.
• Thirdly, the Hard-Swish activation function is used to improve the nonlinear feature
learning ability of the model.
• Finally, the proposed adaptive image scaling algorithm can reduce the model’s rea-
soning time.
The experimental results show that the algorithm proposed in this paper has the high-
est detection accuracy compared with others for strict mask detection tasks. Meanwhile, the phenomena of false and missed detection have been reduced. Moreover, the algorithm in
this paper effectively decreases the requirements of the model on training cost and model
complexity, which enables the model to not only be deployed on medium devices but
also be extended to other object detection tasks, such as mask wearing detection tasks of
students, passengers, patients, and other staff.
However, in the present work, there are still some problems of insufficient feature
extraction for difficult detection samples or even missing and false detection cases. In
addition, the case of wearing a mask when the light is insufficient is also not considered.
Therefore, the next step should be expanding the data set based on the standard mask
wearing criteria and obtaining further improvements for the model in the present work,
and so extending it to more object detection tasks.
Author Contributions: Conceptualization, W.Z., J.Y.; methodology, W.Z., J.Y.; software, W.Z.; valida-
tion, W.Z.; formal analysis, J.Y.; investigation, W.Z.; resources, W.Z.; data curation, W.Z.; writing—
original draft preparation, W.Z.; writing—review and editing, W.Z., J.Y.; visualization, W.Z.; supervi-
sion, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant
number 61673079. The research was supported by the Key Lab of Industrial Wireless Networks and
Networked Control of the Ministry of Education.
Institutional Review Board Statement: The study was conducted according to the guidelines of the
Declaration of Helsinki, and approved by the Ethics Committee of College of Automation, Chongqing
University of Posts and Telecommunications.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Acknowledgments: The work was carried out at the Key Lab of Industrial Wireless Networks and
Networked Control of the Ministry of Education and the authors would like to thank medical staff
Huan Ye for the data support.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Alberca, G.G.F.; Fernandes, I.G.; Sato, M.N.; Alberca, R.W. What Is COVID-19? Front. Young Minds 2020, 8, 74. [CrossRef]
2. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
3. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Advances in Kernel Methods.
Support Vector Learn. 1998, 208.
4. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Neural Inf. Process.
Syst. 2012, 25. [CrossRef]
5. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. J. Mach. Learn. Res. 2011, 15, 315–323.
6. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing
co-adaptation of feature detectors. Comput. ENCE 2012, 3, 212–223.
7. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [CrossRef] [PubMed]
8. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
11. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
14. Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago,
Chile, 11–18 December 2015; pp. 1440–1448.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
16. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-Aware Trident Networks for Object Detection. In Proceedings of the 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6053–6062. [CrossRef]
17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland,
2016; pp. 21–37.
18. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659.
19. Li, Z.; Zhou, F. FSSD: Feature Fusion Single Shot Multibox Detector. arXiv 2017, arXiv:1712.00960.
20. Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 779–788. [CrossRef]
22. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef]
23. Shin, H.C.; Roth, H.R.; Gao, M.; Le, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks
for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE Trans. Med Imaging 2016, 35,
1285–1298. [CrossRef] [PubMed]
24. Giger, M.L.; Suzuki, K. Computer-aided diagnosis. In Biomedical Information Technology; Academic Press: Cambridge, MA, USA,
2008; pp. 359–374.
25. Khan, M.A.; Kim, Y. Cardiac Arrhythmia Disease Classification Using LSTM Deep Learning Approach. Computers. Mater. Contin.
2021, 67, 427–443.
26. Uijlings, J.R.R.; Sande, K.E.A.v.d.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013,
104, 154–171. [CrossRef]
27. Buciu, I. Color quotient based mask detection. In Proceedings of the 2020 International Symposium on Electronics and Telecom-
munications (ISETC), Timisoara, Romania, 5–6 November 2020; pp. 1–4.
28. Loey, M.; Manogaran, G.; Taha, M.; Khalifa, N.E. Fighting against COVID-19: A novel deep learning model based on YOLO-v2
with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 2020, 65, 102600. [CrossRef] [PubMed]
29. Ml, A.; Gmb, C.; Mhnt, D.; Nemk, D. A hybrid deep transfer learning model with machine learning methods for face mask
detection in the era of the covid-19 pandemic. Measurement 2020, 167, 108288.
30. Nagrath, P.; Jain, R.; Madan, A.; Arora, R.; Kataria, P.; Hemanth, J.D. SSDMNV2: A real time DNN-based face mask detection
system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 2020, 66, 102692. [CrossRef] [PubMed]
31. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle,
WA, USA, 14–19 June 2020.
32. Neubeck, A.; Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern
Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
33. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for
Object Detection and Instance Segmentation. arXiv 2020, arXiv:2005.03572.
34. Avenash, R.; Viswanath, P. Semantic Segmentation of Satellite Images using a Modified CNN with Hard-Swish Activation
Function. VISIGRAPP 2019. [CrossRef]
35. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2018, arXiv:1710.05941.
36. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching
for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea,
27 October–2 November 2019; pp. 1314–1324.
37. Keys, R. Cubic convolution interpolation for digital image processing. In IEEE Transactions on Acoustics, Speech, and Signal
Pro-Cessing; IEEE: Piscataway, NJ, USA, 1981; Volume 29, pp. 1153–1160. [CrossRef]
38. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM
International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016.
39. Wang, Z.-Y.; Wang, G.; Huang, B.; Xiong, Z.; Hong, Q.; Wu, H.; Yi, P.; Jiang, K.; Wang, N.; Pei, Y.; et al. Masked Face Recognition
Dataset and Application. arXiv 2020, arXiv:2003.09093.
40. Cabani, A.; Hammoudi, K.; Benhabiles, H.; Melkemi, M. MaskedFace-Net-A Dataset of Correctly/Incorrectly Masked Face
Images in the Context of COVID-19. arXiv 2020, arXiv:2008.08016.
41. Zhang, H.; Li, D.; Ji, Y.; Zhou, H.; Wu, W.; Liu, K. Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending
Machines. IEEE Trans. Ind. Inform. 2020, 16, 7722–7731. [CrossRef]
42. Misra, D. Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv 2019, arXiv:1908.08681.