Deep Learning Framework To Detect Face Masks From Video Footage
Abstract—The use of facial masks in public spaces has become a social obligation in the wake of the COVID-19 global pandemic, and the identification of facial masks can be imperative to ensure public safety. Detection of facial masks in video footage is a challenging task, primarily because the masks themselves behave as occlusions to face detection algorithms due to the absence of facial landmarks in the masked regions. In this work, we propose an approach for detecting facial masks in videos using deep learning. The proposed framework capitalizes on the MTCNN face detection model to identify the faces and their corresponding facial landmarks present in the video frame. These facial images and cues are then processed by a neoteric classifier that utilises the MobileNetV2 architecture as an object detector for identifying masked regions. The proposed framework was tested on a dataset which is a collection of videos capturing the movement of people in public spaces while complying with COVID-19 safety protocols. The proposed methodology demonstrated its effectiveness in detecting facial masks by achieving high precision, recall, and accuracy.

Index Terms—Face mask detection, Deep Learning, Computer Vision

I. INTRODUCTION

With the swift development of machine learning algorithms and methodologies in recent times, the task of face detection has been addressed to a large extent. For instance, the face detection model proposed in [1] achieves a precision of 93% even when detecting multiple faces. Due to the advancement of facial detectors, numerous applications such as real-time face recognition systems [2], security surveillance systems [3], etc. have been developed.

Despite the success of such existing techniques, there is an increasing demand for the development of robust and more efficient face detection models. In particular, the detection of masked faces proves to be a challenging and arduous task for existing face detection models due to several reasons. Firstly, traditional face detection algorithms are based on the extraction of handcrafted features. The Viola Jones face detector [4] uses Haar features with the integral images technique to extract facial features. Other feature extraction techniques include the utilisation of the Histogram of Gradients (HOG) [5], the Fast Fourier Transform (FFT) and Local Binary Patterns (LBP) [6]. With advancements in the field of deep learning, neural networks can now learn features without utilising prior knowledge for forming feature extractors, as in the You Only Look Once (YOLO) algorithm [7].

The pressing concern with the aforementioned approaches when it comes to face mask detection is that the face masks, with their visual diversity and various orientations, behave as occlusions and variable noise to the models. This leads to a lack of local facial features, resulting in the failure of even state-of-the-art face detection models. Moreover, there is a lack of large datasets with labeled images of masked faces, which are required in order to analyse the vital characteristics common to masked faces, thus accounting for the low accuracy of existing models. These factors together justify the challenging nature of masked face detection in the field of image processing.

During the COVID-19 pandemic, everyone is advised to wear face masks in public [8]. According to the World Health Organization (WHO), masks can be used for source control (worn by an infected individual to inhibit further transmission) or for the protection of healthy people. At the time of writing, the global pandemic has infected over 11 million people worldwide and has led to over half a million casualties [9]. The wide-scale usage of face masks poses a challenge to public face-detection-based security systems such as those present in airports, which are unable to detect facial masks. Since the improper removal of masks can lead to contracting the virus, it has become essential to improve facial detectors that rely on facial cues, so that detection can be performed accurately even with inadequately exposed faces.

II. RELATED WORKS AND LITERATURE

In this section, we review some similar works done in this domain. As elucidated in Section I, although research on face detection has been going on for decades and has achieved great success, algorithms and methodologies that are earmarked for face mask detection are limited.

Ge et al. [10] developed a deep learning methodology to detect masked faces using LLE-CNNs, which outperforms state-of-the-art detectors by at least 15%. In the given work, the authors introduced a new dataset called MAsked FAces (MAFA), containing 35,806 images of masked faces with different orientations and occlusion degrees. The proposed LLE-CNNs consist of three modules: a proposal module, an embedding module and a verification module. The proposal
module first combines two CNNs to extract candidate facial
regions from the input image and represents them with high
dimensional descriptors. After that, the embedding module
turns these descriptors into similarity-based descriptors
using Locally Linear Embedding algorithms and dictionaries
trained on a set of faces, comprised of masked and unmasked
images. Finally, the verification module is used to identify
candidate facial regions and refine their positions with the help
of classification and regression tasks.
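At the heart of the embedding module is the Locally Linear Embedding (LLE) step, which maps high-dimensional facial descriptors to a lower-dimensional space in which neighbourhood similarity is preserved. The snippet below is only a minimal illustration of that primitive using scikit-learn on randomly generated stand-in descriptors; it is not the LLE-CNN implementation of [10], and the descriptor and embedding sizes are arbitrary choices.

# Minimal illustration of the Locally Linear Embedding primitive behind the
# embedding module of LLE-CNNs [10]; the descriptors are random stand-ins.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 4096))  # hypothetical high-dimensional face descriptors

# Map each descriptor to an 8-dimensional code that preserves local neighbourhoods.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=8)
embedded = lle.fit_transform(descriptors)
print(embedded.shape)  # (500, 8)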
Nair et al. [11] utilised the Viola Jones object detection
framework to detect masked faces in surveillance videos.
The authors argued that detecting cosmetic components such
as face masks takes a significantly longer period than face
detection. The framework uses the Viola Jones face detection
algorithm to detect the eyes and face of subjects. If eyes are
recognised and later the face is recognised as well, it signifies
that no face mask was used. However, if eyes are recognised
but the face is not, it signifies that a face mask was worn by
the person in consideration.
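The decision rule described by Nair et al. can be sketched with the Haar cascades that ship with OpenCV; the snippet below is a rough illustration of the eyes-found-but-face-missing heuristic using those stock cascades, not the authors' implementation, and the input frame path is a placeholder.

# Rough sketch of the Viola Jones based rule from [11]: eyes detected without a
# face detection suggests a mask. Uses OpenCV's stock Haar cascades.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def mask_heuristic(gray_frame):
    """Classify a grayscale frame with the eyes-versus-face heuristic."""
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    eyes = eye_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) > 0 and len(faces) == 0:
        return "mask worn"       # eyes visible but the full face pattern is occluded
    if len(faces) > 0:
        return "no mask"         # an unoccluded face was detected
    return "no face found"

frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder frame path
if frame is not None:
    print(mask_heuristic(frame))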
Bu et al. [12] built a CNN-based cascaded face detector framework, consisting of three convolutional neural networks. The first CNN, Mask-1, is a very shallow fully convolutional network with 5 layers that gives a probability of being a masked face for each detection window, followed by Non-Maximum Suppression (NMS) to merge overlapping candidates. Mask-2 is a deeper CNN with 7 layers, which resizes the candidate windows and also sets a detection threshold from the previous CNN. Mask-3 is also a 7-layer CNN, which resizes the input windows it receives and gives a likelihood of whether each belongs to a masked face based on a preset threshold. After NMS, the remaining detection windows are the predicted detection results.

Coming to more recent methodologies, Jiang et al. [13] developed RetinaFaceMask, a novel framework for accurately and efficiently detecting face masks. The proposed framework is a one-stage detector which consists of a feature pyramid network to combine high-level semantic data with numerous feature maps. The authors propose a novel context attention module for the detection of face masks, in addition to a cross-class object removal algorithm that discards predictions with low confidence values. The authors state that their model performs 2.3% and 1.5% higher than the baseline in face and mask detection precision respectively, and 11.0% and 5.9% higher than the baseline in recall.

III. PROPOSED APPROACH

In this section, we elucidate our proposed framework, which is illustrated in Figure 1. The proposed framework aims to detect whether people in the video footage of a public area are wearing face masks or not. In order to do so, we first detect the face of the person and then determine if a facial mask is present on the face. It is to be noted that the terms 'face mask' and 'facial mask' are used interchangeably throughout this work.

Fig. 1: Workflow of proposed framework

A. Face Detection

For the task of face detection, we utilized the Multi-Task Cascaded Convolutional Neural Network (MTCNN) [14] as the baseline model. The model is a cascaded structure comprising three stages of deep convolutional networks that predict the facial landmarks.

The input image is initially resized to different scales in order to build an image pyramid, which behaves as the input to the three-staged network elucidated below:

• Stage 1 consists of a Fully Convolutional Network (FCN) called the Proposal Network (P-Net) [14], which is used to obtain the potential candidate windows in the input image pyramid and their bounding box regression vectors. In other words, P-Net is responsible for proposing candidate facial regions from the input image. These estimated bounding box regression vectors are used to calibrate the candidate windows obtained, after which non-maximum suppression (NMS) is used to combine largely overlapping candidates.

• Stage 2 consists of a CNN called the Refine Network (R-Net) [14], to which all the candidate windows obtained from the previous stage are fed. R-Net mainly works to filter these candidate windows. This network rejects a large number of false candidates and utilises bounding box regression to calibrate the candidates obtained. For each candidate window, the offset between itself and the nearest ground truth is predicted, denoted by $L_i^{box}$. The learning task is a regression problem and the Euclidean loss is applied for each sample $x_i$ as:

$L_i^{box} = \left\lVert \hat{y}_i^{box} - y_i^{box} \right\rVert_2^2$   (1)

where $\hat{y}_i^{box}$ is the target obtained from the network and $y_i^{box}$ is the ground-truth coordinate.

• Stage 3 comprises a CNN called O-Net [14], which is responsible for proposing facial landmarks from the candidate facial regions obtained from the previous stage. O-Net outputs facial landmark locations, namely the eyes, nose, and mouth regions of the face. Similar to the task of bounding box regression, the detection of facial landmarks is a regression problem and the following Euclidean loss is minimised:

$L_i^{landmark} = \left\lVert \hat{y}_i^{landmark} - y_i^{landmark} \right\rVert_2^2$   (2)

B. Facial Mask Classification

Each layer has batch normalisation, and the activation function used is ReLU6. However, an activation function is not applied to the output of the projection layer. Since this layer outputs low-dimensional data, succeeding it with a non-linearity could destroy valuable information.
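To make the overall workflow of Figure 1 concrete, the sketch below chains an MTCNN face detector and a MobileNetV2-based mask classifier for a single video frame. It relies on the open-source mtcnn Python package and tf.keras; the 224x224 input size, the sigmoid classification head, the decision threshold, and the mask_classifier.h5 weights file are illustrative assumptions, since the full classifier configuration and training details are not reproduced in this excerpt.

# Per-frame sketch of the proposed pipeline: MTCNN face detection followed by a
# MobileNetV2-based mask/no-mask classifier. The head, input size, threshold and
# weights file below are assumptions for illustration only.
import cv2
import numpy as np
import tensorflow as tf
from mtcnn import MTCNN

detector = MTCNN()  # cascaded P-Net / R-Net / O-Net detector described above

def build_classifier(input_size=224):
    """MobileNetV2 backbone with a small binary (mask vs. no mask) head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3),
        include_top=False, weights="imagenet")
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dropout(0.3)(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(base.input, out)

classifier = build_classifier()
# classifier.load_weights("mask_classifier.h5")  # hypothetical trained weights

def process_frame(frame_bgr, threshold=0.5):
    """Return (bounding box, label) pairs for one BGR video frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    outputs = []
    for det in detector.detect_faces(rgb):
        x, y, w, h = det["box"]
        x, y = max(x, 0), max(y, 0)
        face = rgb[y:y + h, x:x + w]
        if face.size == 0:
            continue
        face = cv2.resize(face, (224, 224)).astype("float32")
        face = tf.keras.applications.mobilenet_v2.preprocess_input(face)
        score = float(classifier.predict(face[np.newaxis], verbose=0)[0, 0])
        # Assumes the positive class corresponds to a correctly worn mask.
        label = "Mask" if score >= threshold else "No Mask"
        outputs.append(((x, y, w, h), label))
    return outputs

cap = cv2.VideoCapture("public_space.mp4")  # placeholder input video
ok, frame = cap.read()
if ok:
    print(process_frame(frame))
cap.release()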
Fig. 4: Visualisation of the results obtained by the proposed approach
Fig. 6: Some instances of the results obtained by the proposed approach
negative, false positive, and false negative observations respectively.

1) Face Detection: The face detection model mentioned in Section III-A achieved a precision of 94.50%, a recall of 86.38%, and an accuracy of 81.84% on the chosen dataset.

2) Facial Mask Prediction: The facial mask classifier mentioned in Section III-B achieved a precision of 84.39%, a recall of 80.92%, and an accuracy of 81.74% on the chosen dataset.

TABLE I: Comparison of the proposed framework with the cascaded framework for mask detection [12]

Approach                                   Accuracy   Recall
Proposed Framework                         81.74%     80.92%
Cascaded framework for mask detection      86.6%      87.8%

TABLE II: Comparison of the proposed framework with RetinaFaceMask [13]

Approach                         Face Precision   Face Recall   Mask Precision   Mask Recall
Proposed Framework               94.50%           86.38%        84.39%           80.92%
RetinaFaceMask with MobileNet    83.0%            95.6%         82.3%            89.1%

Table I compares our proposed framework with the cascaded framework used in [12]. The higher accuracy of the cascaded framework is due to the fact that it was designed to work on images rather than videos. Also, the "MASKED FACE" dataset [12] which was used to test the cascaded framework comprises people wearing headgear. On the other hand, the dataset used to evaluate our proposed framework captures the various types of face masks worn by the public as a precautionary measure for disease control.

Table II compares our proposed framework with RetinaFaceMask [13]. It can be observed that our proposed framework achieves higher precision in detecting masks and faces than RetinaFaceMask. However, RetinaFaceMask achieves a higher recall, as the dataset it was evaluated on comprises close-up images of people's faces, which accounts for its better recall figures in detecting masks and faces. Also, the authors of RetinaFaceMask do not mention the effectiveness of their model in detecting multiple faces at once, while our model works well in detecting multiple faces, as illustrated in Figure 6.

Finally, our proposed framework has also been tested on a video dataset, unlike the aforesaid approaches, which deal with image datasets. The video dataset used to evaluate the proposed framework contains videos taken with cameras of different specifications and has a multitude of camera angles, varying illumination conditions and noise. Thus, the proposed approach can be expected to perform well on real-world camera captures.
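For reference, the precision, recall, and accuracy values reported above follow the usual definitions in terms of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts; the helper below merely restates those formulas, and the example counts are placeholders rather than the confusion matrix behind Tables I and II.

# Standard definitions of the metrics reported in Tables I and II.
# The counts passed in below are placeholders, not the paper's actual results.
def evaluation_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

p, r, a = evaluation_metrics(tp=840, tn=120, fp=60, fn=90)
print(f"precision={p:.3f}, recall={r:.3f}, accuracy={a:.3f}")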
C. Analysis of the proposed approach

From the earlier discussion, it can be observed that the effectiveness of the facial mask classifier depends on the effectiveness of the face detection model. If the face detection model fails to detect a face or incorrectly identifies an object as a face, the performance of the facial mask classifier is affected.

The following key observations were made about the effectiveness of the proposed approach:
1) It is able to detect facial masks on subjects present at a considerable distance from the camera.
2) It performed well even in scenarios where the public areas captured were crowded.
3) It satisfactorily detected the presence of facial masks on subjects not directly facing the camera (i.e. only a side profile of the face was visible) in most cases.
4) It was able to identify subjects who were incorrectly wearing a facial mask (i.e. the mask was not covering their mouth and nose) and labeled them as 'No Mask'.
These observations are illustrated in Figure 6.

V. CONCLUSIONS AND FUTURE WORK

In this work, a new approach for detecting face masks from videos is proposed. A highly effective face detection model is used for obtaining facial images and cues. A distinct facial classifier is built using deep learning for the task of determining the presence of a face mask in the detected facial images. The resulting approach is robust and is evaluated on a custom dataset obtained for this work. The proposed approach was found to be effective, as it yielded high precision, recall, and accuracy values on the chosen dataset, which contained videos with varying occlusions and facial angles. The effectiveness of the facial mask classifier largely depends on the ability of the face detection algorithm to accurately identify faces in the video frames. This could be the subject of future research in this direction.

VI. ACKNOWLEDGEMENT

We thank Google Colaboratory for providing access to the computational resources used for this study and YouTube for making available the videos used in our dataset. We also thank our institute, the National Institute of Technology Warangal, for its constant support and encouragement to undertake research.

REFERENCES

[1] L. Zhang and Y. Liang, "A fast method of face detection in video images," in Proc. of International Conference on Advanced Computer Control, vol. 4, 2010, pp. 490–494.
[2] N. R. Borkar and S. Kuwelkar, "Real-time implementation of face recognition system," in Proc. of International Conference on Computing Methodologies and Communication (ICCMC), 2017, pp. 249–255.
[3] Z. Jian and S. Wan-juan, "Face detection for security surveillance system," in Proc. of International Conference on Computer Science Education, 2010, pp. 1735–1738.
[4] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-511–I-518.
[5] M. Murugappan and S. Murugappan, "Human emotion recognition through short time electroencephalogram (EEG) signals using fast Fourier transform (FFT)," in Proc. of IEEE International Colloquium on Signal Processing and its Applications, 2013, pp. 289–294.
[6] F. A. Alomar, G. Muhammad, H. Aboalsamh, M. Hussain, A. M. Mirza, and G. Bebis, "Gender recognition from faces using bandlet and local binary patterns," in Proc. of International Conference on Systems, Signals and Image Processing (IWSSIP), 2013, pp. 59–62.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
[8] "Coronavirus disease (COVID-19) advice for the public: When and how to use masks," Apr. 2020. [Online]. Available: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public/when-and-how-to-use-masks
[9] "Coronavirus cases count." [Online]. Available: https://www.worldometers.info/coronavirus/
[10] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 426–434.
[11] A. Nair and A. Potgantwar, "Masked face detection using the Viola Jones algorithm: A progressive approach for less time consumption," International Journal of Recent Contributions from Engineering, Science & IT (iJES), vol. 6, pp. 4–14, Dec. 2018.
[12] W. Bu, J. Xiao, C. Zhou, M. Yang, and C. Peng, "A cascade framework for masked face detection," in Proc. of IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), 2017, pp. 458–462.
[13] M. Jiang, X. Fan, and H. Yan, "RetinaMask: A face mask detector," 2020.
[14] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[16] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," Apr. 2017.
[17] "Google Colab." [Online]. Available: https://colab.research.google.com/