Deep Learning Framework To Detect Face Masks From Video Footage


12th International Conference on Computational Intelligence and Communication Networks

Deep Learning Framework to Detect Face Masks from Video Footage
Aniruddha Srinivas Joshi∗ , Shreyas Srinivas Joshi† ,
Goutham Kanahasabai‡ , Rudraksh Kapil§ , Savyasachi Gupta¶
B.Tech., Department of Computer Science and Engineering
National Institute of Technology, Warangal, Telangana, India - 506004

Abstract—The use of facial masks in public spaces has become a social obligation since the wake of the COVID-19 global pandemic and the identification of facial masks can be imperative to ensure public safety. Detection of facial masks in video footage is a challenging task, primarily because the masks themselves behave as occlusions to face detection algorithms due to the absence of facial landmarks in the masked regions. In this work, we propose an approach for detecting facial masks in videos using deep learning. The proposed framework capitalizes on the MTCNN face detection model to identify the faces and their corresponding facial landmarks present in the video frame. These facial images and cues are then processed by a neoteric classifier that utilises the MobileNetV2 architecture as an object detector for identifying masked regions. The proposed framework was tested on a dataset which is a collection of videos capturing the movement of people in public spaces while complying with COVID-19 safety protocols. The proposed methodology demonstrated its effectiveness in detecting facial masks by achieving high precision, recall, and accuracy.
Index Terms—Face mask detection, Deep Learning, Computer Vision

© IEEE 2020. This article is free to access and download, along with rights for full text and data mining, re-use and analysis
DOI: 10.1109/CICN.2020.78
Authorized licensed use limited to: IEEE Xplore. Downloaded on March 18,2021 at 05:09:51 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION
With the ever swift development of machine learning algorithms and methodologies in recent times, the task of face detection has been addressed to a large extent. For instance, the face detection model proposed in [1] achieves a precision of 93% even when detecting multiple faces. Due to the advancement of facial detectors, numerous applications such as real-time face recognition systems [2], security surveillance systems [3], etc. have been developed.
Despite the success of such existing techniques, there is an increasing demand for the development of robust and more efficient face detection models. In particular, the detection of masked faces proves to be a challenging and arduous task for existing face detection models due to several reasons. Firstly, traditional face detection algorithms are based on the extraction of handcrafted features. The Viola-Jones face detector [4] uses Haar features with the integral image technique to extract facial features. Other feature extraction techniques include the Histogram of Oriented Gradients (HOG) [5], Fast Fourier Transform (FFT) and Local Binary Patterns (LBP) [6]. With advancements in the field of deep learning, neural networks can now learn features without utilising prior knowledge for forming feature extractors, as in the You Only Look Once (YOLO) algorithm [7].
The pressing concern with the aforementioned approaches when it comes to face mask detection is that the face masks, with their visual diversity and various orientations, behave as occlusions and variable noise to the models. This leads to a lack of local facial features, resulting in the failure of even state-of-the-art face detection models. Moreover, there is a lack of large datasets with labeled images of masked faces, which are required in order to analyse the vital characteristics common to masked faces, thus accounting for the low accuracy of existing models. These factors together explain the challenging nature of masked face detection in the field of image processing.
During the COVID-19 pandemic, everyone is advised to wear face masks in public [8]. According to the World Health Organization (WHO), masks can be used for source control (worn by an infected individual to inhibit further transmission) or for the protection of healthy people. At the time of writing, the global pandemic has infected over 11 million people worldwide and has led to over half a million casualties [9]. The wide-scale usage of face masks poses a challenge to public face detection based security systems such as those present in airports, which are unable to detect facial masks. Since the improper removal of masks can lead to contracting the virus, it has become essential to improve facial detectors that rely on facial cues, so that detection can be performed accurately even with inadequately exposed faces.

II. RELATED WORKS AND LITERATURE
In this section, we review some similar works done in this domain. As elucidated in Section I, although research on face detection has been going on for decades and has achieved great success, algorithms and methodologies that are earmarked for face mask detection are limited.
Ge et al. [10] developed a deep learning methodology to detect masked faces using LLE-CNNs, which outperforms state-of-the-art detectors by at least 15%. In the given work, the authors introduced a new dataset called MAsked FAces (MAFA), containing 35,806 masked faces with different orientations and occlusion degrees. The proposed LLE-CNNs consist of three modules: a proposal module, an embedding module and a verification module. The proposal module first combines two CNNs to extract candidate facial regions from the input image and represents them with high dimensional descriptors. After that, the embedding module turns these descriptors into similarity based descriptors using Locally Linear Embedding algorithms and dictionaries trained on a set of faces comprising masked and unmasked images. Finally, the verification module is used to identify candidate facial regions and refine their positions with the help of classification and regression tasks.
Nair et al. [11] utilised the Viola Jones object detection
framework to detect masked faces in surveillance videos.
The authors argued that detecting cosmetic components such
as face masks takes a significantly longer period than face
detection. The framework uses the Viola Jones face detection
algorithm to detect the eyes and face of subjects. If eyes are
recognised and later the face is recognised as well, it signifies
that no face mask was used. However, if eyes are recognised
but the face is not, it signifies that a face mask was worn by
the person in consideration.
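The decision rule described above can be sketched as a small function. This is an illustrative reading of the rule, not code from [11]: the function name and the `None` result for the case where no eyes are found are assumptions, since the text only covers the two cases in which eyes are detected.

```python
from typing import Optional


def infer_mask_state(eyes_detected: bool, face_detected: bool) -> Optional[str]:
    """Infer mask usage from the outcomes of an eye detector and a face detector.

    Follows the rule summarised in the text: eyes plus face means no mask,
    eyes without a face means the lower face is assumed to be covered.
    """
    if eyes_detected and face_detected:
        return "No Mask"
    if eyes_detected and not face_detected:
        return "Mask"
    # Assumption: the text gives no rule when eyes are not found,
    # so this sketch reports the case as undecided.
    return None


if __name__ == "__main__":
    for eyes, face in [(True, True), (True, False), (False, False)]:
        print(eyes, face, infer_mask_state(eyes, face))
```

A practical system would apply such a rule per detected subject and per frame; how detections are matched across frames is outside the scope of this sketch.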
Bu et al. [12] built a CNN-based cascaded face detector framework, consisting of three convolutional neural networks. The first CNN, Mask-1, is a very shallow fully convolutional network with 5 layers that gives a probability of being a masked face for each detection window, followed by non-maximum suppression (NMS) to merge overlapping candidates. Mask-2 is a deeper CNN with 7 layers, which resizes the candidate windows from the previous CNN and also sets a detection threshold. Mask-3 is also a 7-layer CNN which resizes the input windows it receives and gives a likelihood of whether each belongs to a masked face based on a preset threshold. After NMS, the remaining detection windows are the predicted detection results.
Coming to more recent methodologies, Jiang et al. [13] developed RetinaFaceMask, which is a novel framework for accurately and efficiently detecting face masks. The proposed framework is a one-stage detector which consists of a feature pyramid network to combine high-level semantic data with numerous feature maps. The authors propose a novel context attention module for the detection of face masks in addition to a cross-class object removal algorithm that discards predictions with low confidence values. The authors state that their model achieves 2.3% and 1.5% higher precision than the baseline for face and mask detection respectively, and 11.0% and 5.9% higher recall.

III. PROPOSED APPROACH
In this section, we elucidate our proposed framework, which is illustrated in Figure 1. The proposed framework aims to detect whether people in the video footage of a public area are wearing face masks or not. In order to do so, we first detect the face of the person and then determine if a facial mask is present on the face. It is to be noted that the terms ‘face mask’ and ‘facial mask’ are used interchangeably throughout this work.

Fig. 1: Workflow of proposed framework

A. Face Detection
For the task of face detection, we utilized the Multi-Task Cascaded Convolutional Neural Network (MTCNN) [14] as the baseline model. The model is a cascaded structure comprising three stages of deep convolutional networks that detect faces and predict their facial landmarks.
The input image is initially resized to different scales in order to build an image pyramid, which behaves as input to the three-staged network elucidated below:
• Stage 1 consists of a Fully Convolutional Network (FCN) called Proposal Network (P-Net) [14], which is used to obtain the potential candidate windows in the input image pyramid and their bounding box regression vectors. In other words, P-Net is responsible for proposing candidate facial regions from the input image. These estimated bounding box regression vectors are used to calibrate the candidate windows obtained, after which non-maximum suppression (NMS) is used to combine largely overlapping candidates.
• Stage 2 consists of a CNN called Refine Network (R-Net) [14] to which all the candidate windows obtained from the previous stage are fed. R-Net mainly works to filter these candidate windows. This network rejects a large number of false candidates and utilises bounding box regression to calibrate the candidates obtained. For each candidate window, the offset between itself and the nearest ground truth is predicted; the corresponding loss is denoted by $L_i^{box}$. The learning task is a regression problem and Euclidean loss

is applied for each sample $x_i$ as:

$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 \qquad (1)$$

where $\hat{y}_i^{box}$ is the regression target predicted by the network and $y_i^{box}$ is the ground-truth coordinate.
• Stage 3 comprises a CNN called O-Net [14], which is responsible for proposing facial landmarks from the candidate facial regions obtained from the previous stage. O-Net outputs facial landmark locations, namely the eyes, nose, and mouth regions of the face. Similar to the task of bounding box regression, the detection of facial landmarks is a regression problem and the following Euclidean loss is minimised:

$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 \qquad (2)$$

where $\hat{y}_i^{landmark}$ is the facial landmark coordinate predicted by the network and $y_i^{landmark}$ is the ground-truth coordinate.
For the task of face classification, the learning target can be formulated as a binary classification problem. For each sample $x_i$, the cross-entropy loss used is:

$$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right) \qquad (3)$$

where $p_i$ is the probability produced by the network that the sample is a face and $y_i^{det} \in \{0, 1\}$ is the ground-truth label.
The output of this stage is the spatial coordinates of the bounding boxes enclosing the facial regions of the subjects in the frame.

B. Facial Mask Prediction
For the task of identifying faces which are covered by a facial mask, we utilised the MobileNetV2 architecture [15], which is an effective feature extractor for object detection and segmentation. MobileNetV2 was chosen because it can be deployed easily on edge devices.
MobileNetV2 uses depth-wise separable convolutions much like its predecessor, but its main building block has some key alterations from its predecessor [16]. The new residual block in MobileNetV2, known as the bottleneck residual block, is illustrated in Figure 2. There are a total of 3 convolutional layers in a block, where the latter two are a depth-wise convolution that filters the input and a 1×1 point-wise convolution. However, this 1×1 convolution behaves quite differently from its counterpart in the predecessor: it acts as a projection layer that maps input data with a higher number of dimensions (channels) into a tensor with a much lower number of dimensions. As this layer suppresses the amount of data that flows through the network and the output of each block is a bottleneck, the block is known as a bottleneck residual block. Hence, the input and output of the block are low-dimensional tensors whereas the filtering that takes place inside the block is on high-dimensional tensors. The other key aspect of MobileNetV2 is the residual connection. This primarily aids the flow of gradients through the network during backpropagation. Each layer has batch normalisation and the activation function used is ReLU6. However, an activation function is not applied to the output of the projection layer. Since this layer outputs low-dimensional data, following it with a non-linearity could destroy valuable information.

Fig. 2: Bottleneck Residual block

The full MobileNetV2 architecture, as illustrated in Figure 3, comprises 17 bottleneck residual blocks in a row. This is followed by a regular 1×1 convolution. We utilise this base model of the MobileNetV2 architecture as a feature extractor for facial mask detection. We create a facial mask classifier using 4 layers, succeeding the earlier mentioned architecture. We downsample each 2×2 feature map using an average pooling layer and flatten the result to produce a single long feature vector for classification. After passing through a ReLU activation function, we use a softmax function as illustrated in Figure 3 to get the probability distribution over the predicted classifications. This is how the facial mask classifier is able to predict whether a subject in a given frame is wearing a facial mask or not.

Fig. 3: Facial mask classifier constructed using MobileNetV2 architecture

The facial regions obtained from the face detection model discussed in Face Detection (Section III-A) are passed as input to the aforementioned facial mask classifier and the output is a bounding box over each face region, with the label ‘Mask’ indicating the presence of a face mask or ‘No Mask’ when no face mask is worn by the subject in consideration. This output is illustrated in Figure 6.

Fig. 4: Visualisation of the results obtained by the proposed approach

IV. EXPERIMENTAL EVALUATION
In this section, we discuss the dataset used for conducting this study and the results obtained by the proposed approach. The experiments were conducted on Google Colab [17] with Intel(R) Xeon(R) 2.00 GHz CPU, NVIDIA Tesla T4 GPU, 16 GB GDDR6 VRAM and 13 GB RAM. All programs were written in Python 3.6 and utilised OpenCV 4.2.0, Keras 2.3.0 and TensorFlow 2.2.0.
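The two-stage flow described in Section III (detect faces, then classify each face crop as ‘Mask’ or ‘No Mask’) can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this work: `detect_faces` and `classify_face` are hypothetical stand-ins for the MTCNN detector and the MobileNetV2-based classifier, the frame is modelled as a nested list of pixel values, and the class order in `LABELS` is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x, y, width, height) in pixels
Frame = List[List[float]]         # toy frame: rows of pixel values
LABELS = ("Mask", "No Mask")      # assumed order of the classifier outputs


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the region described by box out of the frame."""
    x, y, w, h = box
    return [row[x:x + w] for row in frame[y:y + h]]


def annotate_frame(
    frame: Frame,
    detect_faces: Callable[[Frame], List[Box]],
    classify_face: Callable[[Frame], Sequence[float]],
) -> List[Tuple[Box, str, float]]:
    """Run detection, then classification, and label every detected face.

    classify_face must return one raw score per entry in LABELS.
    """
    results = []
    for box in detect_faces(frame):
        probs = softmax(classify_face(crop(frame, box)))
        best = max(range(len(LABELS)), key=probs.__getitem__)
        results.append((box, LABELS[best], probs[best]))
    return results


if __name__ == "__main__":
    # Stub detector and classifier, used only to exercise the control flow.
    toy_frame = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
    stub_detector = lambda f: [(0, 0, 2, 2), (3, 1, 2, 2)]
    stub_classifier = lambda face: [2.0, 0.0] if sum(map(sum, face)) > 0 else [0.0, 2.0]
    for box, label, confidence in annotate_frame(toy_frame, stub_detector, stub_classifier):
        print(box, label, round(confidence, 3))
```

Keeping the detector and classifier as injected callables mirrors the dependency noted later in the analysis: any face the detector misses never reaches the classifier, so the classifier's effective recall is bounded by the detector's.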
A. Dataset Used
The dataset used in this work is a collection of video footage of public places from multiple geographical locations, compiled from YouTube. There are a total of 15 video samples in the dataset, with an average duration of 1 minute. The videos capture the movement of people in public areas after the imposition of various safety rules and regulations in the wake of the COVID-19 pandemic. The videos showcase people from multiple ethnicities and also capture different types of face masks worn by the public. Our dataset contains videos captured using cameras of different specifications and has a multitude of camera angles, varying illumination conditions, noise, and an average frame rate of 30 frames per second (FPS). Figure 5 illustrates a few sample videos present in this dataset.

Fig. 5: Some samples from the video dataset used in this work

B. Experimental Results and Statistics
The proposed approach has been evaluated by measuring the precision, recall, and accuracy of both the face detection model and the facial mask classifier:

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \qquad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \times 100\% \qquad (5)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (6)$$

where TP, TN, FP and FN denote the true positive, true

negative, false positive, and false negative observations respectively.
1) Face Detection: The face detection model mentioned in Section III-A achieved a precision of 94.50%, recall of 86.38%, and accuracy of 81.84% on the chosen dataset.
2) Facial Mask Prediction: The facial mask classifier mentioned in Section III-B achieved a precision of 84.39%, recall of 80.92%, and accuracy of 81.74% on the chosen dataset.

TABLE I: Comparison of proposed framework with Cascaded framework for mask detection [12]
Approach                                 Accuracy   Recall
Proposed Framework                       81.74%     80.92%
Cascaded framework for mask detection    86.6%      87.8%

TABLE II: Comparison of proposed framework with RetinaFaceMask [13]
Approach                         Face Precision   Face Recall   Mask Precision   Mask Recall
Proposed Framework               94.50%           86.38%        84.39%           80.92%
RetinaFaceMask with MobileNet    83.0%            95.6%         82.3%            89.1%

Table I compares our proposed framework to the cascaded framework used in [12]. The higher accuracy of the cascaded framework can be attributed to the fact that it was designed to work on images rather than videos. Also, the “MASKED FACE” dataset [12] which was used to test the cascaded framework comprises people wearing head gear. On the other hand, the dataset used to evaluate our proposed framework captures the various types of face masks worn by the public as a precautionary measure for disease control.
Table II compares our proposed framework to RetinaFaceMask [13]. It can be observed that our proposed framework achieves a higher precision value in detecting masks and faces as compared to RetinaFaceMask. However, RetinaFaceMask achieves a higher recall, as the dataset it was evaluated on comprises close-up images of people's faces, which accounts for its better recall figures in detecting masks and faces. Also, the authors of RetinaFaceMask do not mention the effectiveness of their model in detecting multiple faces at once, while our model works well in detecting multiple faces, as illustrated in Figure 6.
Finally, our proposed framework has also been tested on a video dataset, unlike the aforesaid approaches which deal with image datasets. The video dataset used to evaluate the proposed framework contains videos taken using cameras of different specifications and has a multitude of camera angles, varying illumination conditions and noise. Thus, the proposed approach is expected to perform well on real-world camera captures.

Fig. 6: Some instances of the results obtained by the proposed approach

C. Analysis of the proposed approach
From the earlier discussion, it can be observed that the effectiveness of the facial mask classifier depends on the effectiveness of the face detection model. If the face detection

model fails to detect a face or incorrectly identifies an object as a face, the performance of the facial mask classifier is affected.
The following key observations were made about the effectiveness of the proposed approach:
1) It is able to detect facial masks on subjects present at a considerable distance from the camera.
2) It performed well even in scenarios where the public areas captured were crowded.
3) It satisfactorily detected the presence of facial masks on subjects not directly facing the camera (i.e. only a side profile of the face was visible) in most cases.
4) It was able to identify subjects who were incorrectly wearing a facial mask (i.e. the mask was not covering their mouth and nose) and labeled them as ‘No Mask’.
These observations are illustrated in Figure 6.

V. CONCLUSIONS AND FUTURE WORK
In this work, a new approach for detecting face masks from videos is proposed. A highly effective face detection model is used for obtaining facial images and cues. A distinct facial classifier is built using deep learning for the task of determining the presence of a face mask in the facial images detected. The resulting approach is robust and is evaluated on a custom dataset obtained for this work. The proposed approach was found to be effective as it achieved high precision, recall, and accuracy values on the chosen dataset, which contained videos with varying occlusions and facial angles. The effectiveness of the facial mask classifier relies largely on the ability of the face detection algorithm to accurately identify faces in the video frames. Improving this ability could be the subject of future research in this direction.

VI. ACKNOWLEDGEMENT
We thank Google Colaboratory for providing access to computational resources used for this study and YouTube for making available the videos used in our dataset. We also thank our institute, the National Institute of Technology, Warangal, for its constant support and encouragement to undertake research.

REFERENCES
[1] L. Zhang and Y. Liang, “A fast method of face detection in video images,” in Proc. of International Conference on Advanced Computer Control, vol. 4, 2010, pp. 490–494.
[2] N. R. Borkar and S. Kuwelkar, “Real-time implementation of face recognition system,” in Proc. of International Conference on Computing Methodologies and Communication (ICCMC), 2017, pp. 249–255.
[3] Z. Jian and S. Wan-juan, “Face detection for security surveillance system,” in Proc. of International Conference on Computer Science Education, 2010, pp. 1735–1738.
[4] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-511–I-518.
[5] M. Murugappan and S. Murugappan, “Human emotion recognition through short time electroencephalogram (eeg) signals using fast fourier transform (fft),” in Proc. of IEEE International Colloquium on Signal Processing and its Applications, 2013, pp. 289–294.
[6] F. A. Alomar, G. Muhammad, H. Aboalsamh, M. Hussain, A. M. Mirza, and G. Bebis, “Gender recognition from faces using bandlet and local binary patterns,” in Proc. of International Conference on Systems, Signals and Image Processing (IWSSIP), 2013, pp. 59–62.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
[8] “Coronavirus disease (covid-19) advice for the public: When and how to use masks,” Apr 2020. [Online]. Available: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public/when-and-how-to-use-masks
[9] “Coronavirus cases count.” [Online]. Available: https://www.worldometers.info/coronavirus/
[10] S. Ge, J. Li, Q. Ye, and Z. Luo, “Detecting masked faces in the wild with lle-cnns,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 426–434.
[11] A. Nair and A. Potgantwar, “Masked face detection using the viola jones algorithm: A progressive approach for less time consumption,” International Journal of Recent Contributions from Engineering, Science & IT (iJES), vol. 6, pp. 4–14, 12 2018.
[12] W. Bu, J. Xiao, C. Zhou, M. Yang, and C. Peng, “A cascade framework for masked face detection,” in Proc. of IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), 2017, pp. 458–462.
[13] M. Jiang, X. Fan, and H. Yan, “Retinamask: A face mask detector,” 2020.
[14] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[16] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” 04 2017.
[17] “Google colab.” [Online]. Available: https://colab.research.google.com/
