DualFaceNet: Augmentation Consistency For Optimal Facial Landmark Detection and Face Mask Classification
Corresponding Author:
Somporn Ruang-on
Department of Creative Innovation in Science and Technology, Faculty of Science and Technology
Nakhon Si Thammarat Rajabhat University
1 Tambon Tha Ngio, Mueang Nakhon Si Thammarat, Nakhon Si Thammarat 80280, Thailand
Email: [email protected]
1. INTRODUCTION
Facial recognition, an esteemed pillar of computer vision, consistently positions itself at the
vanguard of technological progression. The assimilation of advanced deep learning methodologies over
recent years has facilitated its metamorphosis from basic image-matching paradigms [1] to complex feature
extraction models [2], thereby rendering traditional manual engineering methodologies increasingly
peripheral [3]. Recent studies [4], [5] have further highlighted the practical applications of face recognition in
the context of smart city security and human emotion recognition, respectively. While these advanced
systems exhibit remarkable proficiency in controlled settings, transitioning to real-world scenarios unveils a
myriad of challenges. Factors such as inconsistent lighting, diverse ethnic backgrounds, age-related
variations, and notable occlusions, especially face masks due to prevailing health concerns, accentuate the
inherent imperfections of prevailing facial recognition frameworks [6].
In the expansive domain of facial recognition, face landmark detection crystallizes as a crucial
preprocessing step, serving as a linchpin for a variety of applications [7], [8]. This foundational
sub-discipline catalyzes the dynamism in emerging realms such as real-time facial expression recognition [9],
immersive augmented reality ecosystems [10], and extends its significance to the security-centric domain of
foolproof authentication mechanisms [11]. Recent trailblazing efforts encompass the work of Zhu et al. [12]
who proposed occlusion-adaptive deep networks to fortify facial landmark detection, Chandran et al. [13]
who introduced attention-driven cropping for high-resolution facial landmark detection, and Li et al. [14]
who pushed the boundaries with cascaded transformers for enhanced accuracy. These contributions not only
signify the rapid advancements but also accentuate the evolving nature of this sub-discipline, showcasing a
promising trajectory as it intersects with the broader domain of facial recognition, hinting at more
sophisticated applications in the foreseeable future.
The ubiquitous use of face masks during the recent pandemic highlighted a significant gap: the
absence of datasets tailored for landmark detection on masked faces. Such a deficiency undermines the
performance of current models, emphasizing the urgency for methodologies that can adapt to these new
challenges. A successful approach would merge the intricacies of facial landmark detection with face mask
identification, leveraging the subtle nuances of facial contours and strategic landmark placement, even when
partially obscured. While Gupta et al. [15] have made strides in mask detection, Ullah et al. [16] introduced
the innovative DeepMaskNet model, bridging the gap between face mask detection and masked facial
recognition. Abdulmunem et al. developed two deep learning models, leveraging MobileNetV2 and a novel deep
convolutional neural network (DCNN), to efficiently categorize mask usage as correctly worn, incorrectly
worn, or not worn, using a Kaggle dataset for validation [17]. Additionally, Hdioud and Tirari [18]
showcased the potential of deep learning for facial expression recognition of masked faces. Other notable
works in the domain of mask detection include those by [19]–[21]. Altogether, these advances underscore the
need for continuous evolution in face landmark detection techniques, which are pivotal in addressing the
challenges presented by widespread mask usage and ensuring robust facial recognition in masked scenarios.
With face masks now entrenched in global societal norms, the fusion of these intertwined domains is
essential for the subsequent phase of facial recognition advancements. Guided by these intricate challenges
and the innovation potential, our research adopts a rigorous technical approach. We propose the use of
semi-supervised learning techniques by jointly training a DCNN on both face landmark detection and face
mask classification datasets. Drawing on the idea that knowledge from one domain can provide auxiliary
information to another, our methodology leverages the shared feature space between face landmarks and
mask classification. Preliminary observations suggest that this joint training not only enhances the granularity
with which landmarks are detected on masked faces but also refines the accuracy and robustness of mask
classifications. By coupling these tasks, we are essentially allowing our model to harness the mutual
information between them, promoting a more generalized and effective learning process. Our initiative seeks
to bridge the current gaps in the field by pioneering a method that optimally utilizes available data for
enhanced performance on both tasks in real-world scenarios.
2. METHOD
In order to refine dual facial recognition, we amalgamate semi-supervised learning, utilizing both
labeled and unlabeled datasets, to tackle the challenges posed by data paucity. This amalgamation dovetails
with multi-task learning (MTL), where our innovative DualFaceNet (DFN) concurrently processes a gamut of
facial attributes. We posit that shared feature spaces across these tasks markedly bolster task-specific
performance. To further fortify our model, we infuse augmentation consistency loss, a mechanism that
underpins model resilience to input fluctuations by mandating consistent outputs across diverse data
augmentations. This synthesis establishes a rigorous foundation for our advanced facial recognition system,
details of which will follow.
Simultaneously, the second output pathway delves into the task of face mask classification. Distilling
the learned features through its own set of fully connected layers, it culminates in producing a probability score.
Leveraging the sigmoid activation function, this score offers a concise verdict on mask presence: scores veering
toward 1 signify a mask worn correctly, while those approaching 0 denote otherwise. By amalgamating these
dual outputs in a single architecture, as visually illustrated in Figure 1, our DFN crystallizes the essence of
MTL, harmonizing two intertwined facial recognition tasks with seamless precision.
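As a rough illustration of this dual-output design, the sketch below shows a single shared feature vector feeding a landmark regression head and a sigmoid mask head. The layer sizes, the 68-landmark count, and the linear heads are illustrative assumptions, not the actual DFN architecture:

```python
import numpy as np

def sigmoid(x):
    """Squash a logit into (0, 1); scores toward 1 indicate a correctly worn mask."""
    return 1.0 / (1.0 + np.exp(-x))

def dual_heads(features, W_lm, W_mask):
    """Two output pathways on one shared feature vector:
    landmark coordinates (regression) and a mask probability (sigmoid)."""
    landmarks = features @ W_lm             # (2L,) flattened (x, y) coordinates
    mask_prob = sigmoid(features @ W_mask)  # scalar probability in (0, 1)
    return landmarks, mask_prob

# Hypothetical dimensions: a 64-d shared feature, 68 landmarks -> 136 coordinates
rng = np.random.default_rng(0)
feats = rng.normal(size=64)
W_lm = rng.normal(size=(64, 136))
W_mask = rng.normal(size=64)
lm, p = dual_heads(feats, W_lm, W_mask)
print(lm.shape, 0.0 < p < 1.0)  # (136,) True
```

In a real network the two heads would sit on top of a shared convolutional backbone; the point here is only that both tasks read from the same feature space.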
where 𝑁 is the number of data points, 𝐿 is the number of total landmarks in each facial image, 𝑙𝑖𝑗 are the
ground-truth landmark locations, and 𝑙̂𝑖𝑗 are the landmarks predicted by the model.
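The supervised landmark term defined by these symbols can be sketched as a mean squared Euclidean distance over all 𝑁 images and 𝐿 landmarks. This is a minimal NumPy version; the exact normalization used in the paper is our assumption:

```python
import numpy as np

def landmark_loss(l_true, l_pred):
    """Mean squared Euclidean distance between ground-truth and predicted
    landmarks. Shapes: (N, L, 2) for N images with L (x, y) landmarks."""
    sq_dist = np.sum((l_true - l_pred) ** 2, axis=-1)  # (N, L) per-landmark errors
    return sq_dist.mean()

# Every predicted landmark off by (1, 1): squared distance 2 per landmark
print(landmark_loss(np.zeros((2, 3, 2)), np.ones((2, 3, 2))))  # 2.0
```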
where 𝑁 is the number of data points, 𝐿 is the number of total landmarks in each facial image, 𝑙̂𝑖𝑗^orig are the
landmarks predicted from the original images by the model, 𝑙̂𝑖𝑗^aug are the landmarks predicted from the
augmented images, and 𝑖𝑛𝑣() is an inverse image transformation of the applied augmentations.
where 𝑁 is the number of data points, 𝑦̂𝑖^orig are the mask predictions for the original images, and 𝑦̂𝑖^aug are the
mask predictions for the augmented images from the model.
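A minimal sketch of the two consistency terms follows, using a horizontal flip as the example augmentation. The flip, its inverse, and the squared-error form of both penalties are illustrative assumptions standing in for the paper's actual augmentation set:

```python
import numpy as np

def inv_hflip(landmarks, width):
    """Inverse of a horizontal flip: map flipped x back to the original frame."""
    out = landmarks.copy()
    out[..., 0] = width - out[..., 0]
    return out

def landmark_consistency(pred_orig, pred_aug, width):
    """Penalize disagreement between predictions on the original image and
    inverse-transformed predictions on the augmented image."""
    return np.mean(np.sum((pred_orig - inv_hflip(pred_aug, width)) ** 2, axis=-1))

def mask_consistency(y_orig, y_aug):
    """Mask probabilities should not change under augmentation."""
    return np.mean((y_orig - y_aug) ** 2)

pred = np.array([[[10.0, 20.0], [30.0, 40.0]]])    # (1, 2, 2) landmark predictions
flipped = pred.copy()
flipped[..., 0] = 100 - flipped[..., 0]            # perfectly flip-consistent predictions
print(landmark_consistency(pred, flipped, width=100))  # 0.0
print(mask_consistency(np.array([0.9]), np.array([0.7])))
```

Because neither term needs ground-truth labels, both can be computed on unlabeled images, which is what lets the semi-supervised setup exploit data annotated for only one of the two tasks.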
DualFaceNet: augmentation consistency for optimal facial landmark detection and … (Kritaphat Songsri-in)
3232 ISSN: 2252-8938
These augmentations exposed the model to a diverse array of facial scenarios, bolstering its generalization
capabilities. This rigorous augmentation approach was pivotal in achieving the performance metrics recorded
in our evaluations.
3.1. Datasets
Our approach to facial landmark detection and face mask classification sought to capitalize on the
precision of available annotations while eliminating the need for extensive joint labeling. By harnessing
datasets labeled independently for each task, we could focus on the nuances and specificities inherent to each
domain. This approach not only streamlined our model training and evaluation processes but also highlighted
the potential of MTL when tasks can be addressed without the complexities and overheads of concurrent
annotations. This strategic utilization of pre-existing, task-specific datasets underscores the efficiency and
adaptability of our methodology.
3.2. Metrics
In the domain of facial recognition and landmark detection, a rigorous and precise evaluation of
model performance is indispensable. This evaluative process, underpinned by quantifiable metrics, not only
substantiates the integrity of the research but also elucidates potential avenues for enhancement. Among the
plethora of evaluation metrics, two have emerged as particularly salient in this context: accuracy and the
interocular normalized mean error (INME).
3.2.1. Accuracy
Accuracy is a cornerstone metric in machine learning and classification endeavors. It quantifies the
proportion of instances correctly identified by a model in relation to the entire dataset. Its simplicity and
directness render it a fundamental tool in the assessment repertoire. However, it is imperative to approach
this metric with circumspection, particularly when dealing with datasets that exhibit class imbalances. The
mathematical representation of accuracy is given in (6).
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = Number of correct predictions / Number of total predictions (6)
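Equation (6) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correct predictions over all predictions, as in (6)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```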
Figure 3. Sample facial images and their annotation from the (a) 300W dataset and (b) face mask
classification dataset
𝐼𝑁𝑀𝐸 = (1/𝑁) ∑_{i=1}^{N} √(∑_{j=1}^{L} (𝑙𝑖𝑗 − 𝑙̂𝑖𝑗)²) / 𝐷𝑖 (7)
where 𝑁 is the number of data points, 𝐿 is the number of total landmarks in each facial image, 𝑙𝑖𝑗 are the
ground-truth landmark locations, 𝑙̂𝑖𝑗 are the landmarks predicted by the model, and 𝐷𝑖 is the distance between
the outer eye corners of each image.
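Equation (7) can be sketched directly from its definition; the NumPy version below assumes landmarks arrive as (𝑁, 𝐿, 2) arrays and interocular distances as an (𝑁,) array:

```python
import numpy as np

def inme(l_true, l_pred, eye_dist):
    """Interocular normalized mean error, as in (7).
    l_true, l_pred: (N, L, 2) landmark arrays; eye_dist: (N,) outer-eye-corner
    distances D_i used for per-image normalization."""
    # Root of the summed squared coordinate errors per image, as in (7)
    per_image = np.sqrt(np.sum((l_true - l_pred) ** 2, axis=(1, 2)))
    return float(np.mean(per_image / eye_dist))

l_true = np.zeros((1, 4, 2))
l_pred = np.ones((1, 4, 2)) * 3.0    # every coordinate off by 3
D = np.array([6.0])
print(inme(l_true, l_pred, D))       # sqrt(8 * 9) / 6 = sqrt(2) ~ 1.414
```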
3.3. Methods
In this subsection, we delve into the methodologies employed in our experiments, specifically: the
landmark baseline, face mask baseline, MTL, and DFN. For each of these methods, we provide an in-depth
analysis of their training and validation performances. This comprehensive examination offers a clear
perspective on the efficacy and nuances of each approach in the context of our study.
accuracy. This approach acts as a face mask classification baseline for our proposed method. The training and
validation face mask classification accuracies over the 200 training epochs of the face mask baseline are shown
in Figure 4(b). From the figure, a clear disparity is evident between the training accuracy and the validation
accuracy for face mask classification. Throughout the initial epochs, both accuracies demonstrate a sharp
upward trend. However, beyond a certain point, while the training accuracy continues its ascent, reaching an
impressive 99.99%, the validation accuracy appears to plateau, settling at 89.35%. This divergence underscores
the challenges of generalization and the nuances of the validation set compared to the training data.
3.3.4. DualFaceNet
Our DFN is an innovative technique that harmoniously fuses multiple sources of information for
enhanced performance by integrating insights from both facial landmark detection and face mask classification.
While the architectural foundation of DFN parallels that of MTL, DFN introduces consistency losses to ensure
robustness against variations, especially in augmented scenarios. The performance metrics for face landmark
detection and face mask classification using DFN are illustrated in Figures 4(e) and 4(f), respectively.
In Figure 4(e), which presents the results for face landmark detection, there is a marked
distinction compared to prior models. The training INME starts with a swift decline, indicative of DFN's
rapid learning capability, and eventually plateaus at 2.49, underlining the network’s precision in detecting
facial landmarks. The validation INME, while charting a similar course, stabilizes at a slightly elevated 5.42.
Switching our attention to Figure 4(f), which illustrates the face mask classification results, the patterns are
reminiscent of those in Figure 4(e) but with their unique characteristics. The training accuracy accelerates
sharply, reaching a near-perfect 100%. In contrast, the validation accuracy, although beginning on a
promising note, finds its equilibrium at 92.59%.
Figure 4. Training and validation of face landmark INME and face mask accuracy of different models:
(a) landmark INME from baseline, (b) face mask accuracy from baseline, (c) landmark INME from MTL,
(d) face mask accuracy from MTL, (e) landmark INME from DFN (ours), and (f) face mask accuracy from DFN (ours)
Table 1. Method comparison for facial landmark detection and face mask classification
Methods INME↓ Accuracy (%)
Landmark baseline 5.61 -
Face mask baseline - 89.35
MTL 5.90 91.98
DFN (our) 5.42 92.59
Figure 5(a) provides a visual narrative of the validation INME trends across the training epochs for
the different models, painting a vivid picture that complements the tabulated results. The curve for the face
landmark baseline serves as the foundational benchmark, tracing a path indicative of its inherent strengths in
facial landmark detection. This behavior, in harmony with its reported INME of 5.61, establishes the
performance standard against which the other models are evaluated. Transitioning to the MTL curve, we
observe an intriguing pattern. Although one might expect gains from simultaneous training on multiple tasks,
the curve reveals a slightly higher plateau, corresponding to its tabulated INME of 5.90. This visual
representation underscores the notion that MTL, in this context, does not always lead to enhanced
performance, even faltering slightly compared to the specialized baseline. Lastly, the trajectory of the DFN
emerges as a beacon of promise. With its rapid descent and subsequent stabilization, the curve visually
echoes its superior tabulated INME of 5.42. This affirms DFN’s proficiency in facial landmark detection,
particularly when enhanced with augmented consistency loss.
Figure 5(b) maps the validation accuracy for face mask classification across distinct models and
training epochs. Starting with the face mask baseline, its trajectory serves as a foundational reference. The
steady climb it portrays resonates with its tabulated accuracy of 89.35%, establishing a baseline metric that
more complex models aim to surpass. Progressing to the MTL curve, we witness a heartening surge.
Contrasting the baseline, MTL’s curve showcases a more robust ascent, settling at a plateau that mirrors its
reported accuracy of 91.98%. This ascent underscores the advantages of simultaneous training on intertwined
tasks, as MTL successfully bridges the gap between specialized singular models and more intricate multi-task
frameworks. However, the zenith of performance is captured by the DFN trajectory. Beginning in tandem
with MTL, a pivotal moment transpires just after epoch 100 where DFN’s trajectory begins its overtaking
maneuver. This surge, culminating in a pinnacle reflective of its superior tabulated accuracy of 92.59%,
confirms DFN’s supremacy in face mask classification. The integration of consistency loss offers DFN this
edge, allowing it not only to surpass the baseline but also to outpace MTL.
Figure 5. The comparison of validation face landmark INME and face mask accuracy of different methods:
(a) validation face landmark INME, and (b) validation face mask accuracy
4. CONCLUSION
The dynamic realm of facial recognition is at a pivotal juncture, with real-world challenges
necessitating adaptive methodologies. Our research ventured into this domain, introducing DFN, a
groundbreaking approach synergistically merging facial landmark detection and face mask classification. By
capitalizing on MTL and consistency loss, DFN transcends traditional single-task models in performance.
Comprehensive evaluations, encompassing diverse datasets and intricate metrics, attest to DFN’s prowess,
particularly in navigating occlusions such as masks. As face masks solidify their presence in global society,
DFN’s fusion of landmark detection and mask classification becomes increasingly vital for future facial
recognition advancements. Anticipating the future, we envision integrating real-time video analysis with
DFN to enhance surveillance and security mechanisms. Further enrichments could arise from adding tasks to
DFN, such as emotion detection or age estimation. Also, testing DFN on larger and more varied datasets will
be pivotal to gauging its scalability and robustness. By relentlessly pushing these frontiers, we aim to sculpt
new benchmarks in the ever-evolving world of facial recognition.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the use of service and facilities of the Faculty of Science and
Technology, Nakhon Si Thammarat Rajabhat University. This study receives funding from the Coordinating
Center for Thai Government Science and Technology Scholarship Students (CSTS) and National Science and
Technology Development Agency (NSTDA), under the Research Grant Scheme JRA-CO-2565-17792-TH.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,”
Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017, doi: 10.1145/3065386.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 770–778, 2016, doi: 10.1109/CVPR.2016.90.
[3] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015, doi: 10.1038/nature14539.
[4] G. B. Praveen and J. Dakala, “Face recognition: challenges and issues in smart city/environments,” 2020 International
Conference on COMmunication Systems and NETworkS, COMSNETS 2020. IEEE, pp. 791–793, 2020, doi:
10.1109/COMSNETS48256.2020.9027290.
[5] R. Amimi, A. Radgui, and H. I. E. H. El, “A Survey of smart classroom: concept, technologies and facial emotions recognition
application,” Lecture Notes in Networks and Systems, vol. 544. Springer International Publishing, pp. 326–338, 2023, doi:
10.1007/978-3-031-16075-2_23.
[6] C. Libby and J. Ehrenfeld, “Facial recognition technology in 2021: masks, bias, and the future of healthcare,” Journal of Medical
Systems, vol. 45, no. 4, Feb. 2021, doi: 10.1007/s10916-021-01723-w.
[7] K. Songsri-In, G. Trigeorgis, and S. Zafeiriou, “Deep and deformable: convolutional mixtures of deformable part-based models,”
Proceedings - 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018. IEEE, pp. 218–225,
2018, doi: 10.1109/FG.2018.00040.
[8] Y. Wu and Q. Ji, “Facial landmark detection: a literature survey,” International Journal of Computer Vision, vol. 127, no. 2, pp.
115–142, 2019, doi: 10.1007/s11263-018-1097-z.
[9] L. Zhang, B. Verma, D. Tjondronegoro, and V. Chandran, “Facial expression analysis under partial occlusion: a survey,” ACM
Computing Surveys, vol. 51, no. 2, pp. 1–49, 2018, doi: 10.1145/3158369.
[10] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” 6th
International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018.
[11] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: a unified embedding for face recognition and clustering,” Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015, doi:
10.1109/CVPR.2015.7298682.
[12] M. Zhu, D. Shi, M. Zheng, and M. Sadiq, “Robust facial landmark detection via occlusion-adaptive deep networks,” Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3481–3491, 2019, doi:
10.1109/CVPR.2019.00360.
[13] P. Chandran, D. Bradley, M. Gross, and T. Beeler, “Attention-driven cropping for very high resolution facial landmark detection,”
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 5860–5869,
2020, doi: 10.1109/CVPR42600.2020.00590.
[14] H. Li, Z. Guo, S. M. Rhee, S. Han, and J. J. Han, “Towards accurate facial landmark detection via cascaded transformers,”
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp. 4166–4175,
2022, doi: 10.1109/CVPR52688.2022.00414.
[15] P. Gupta, V. Sharma, and S. Varma, “A novel algorithm for mask detection and recognizing actions of human,” Expert Systems
with Applications, vol. 198, Jul. 2022, doi: 10.1016/j.eswa.2022.116823.
[16] N. Ullah, A. Javed, M. A. Ghazanfar, A. Alsufyani, and S. Bourouis, “A novel DeepMaskNet model for face mask detection and
masked facial recognition,” Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 10, pp. 9905–
9914, Nov. 2022, doi: 10.1016/j.jksuci.2021.12.017.
[17] A. A. Abdulmunem, N. D. Al-Shakarchy, and M. S. Safoq, “Deep learning based masked face recognition in the era of the
COVID-19 pandemic,” International Journal of Electrical and Computer Engineering, vol. 13, no. 2, pp. 1550–1559, 2023, doi:
10.11591/ijece.v13i2.pp1550-1559.
[18] B. Hdioud and M. E. H. Tirari, “Facial expression recognition of masked faces using deep learning,” IAES International Journal
of Artificial Intelligence, vol. 12, no. 2, pp. 921–930, 2023, doi: 10.11591/ijai.v12.i2.pp921-930.
[19] C. X. Ge, M. A. As’ari, and N. A. J. Sufri, “Multiple face mask wearer detection based on YOLOv3 approach,” IAES
International Journal of Artificial Intelligence, vol. 12, no. 1, pp. 384–393, 2023, doi: 10.11591/ijai.v12.i1.pp384-393.
[20] B. U. H. Sheikh and A. Zafar, “RRFMDS: rapid real-time face mask detection system for effective COVID-19 monitoring,” SN
Computer Science, vol. 4, no. 3, p. 288, 2023, doi: 10.1007/s42979-023-01738-9.
[21] S. Susanto, F. A. Putra, R. Analia, and I. K. L. N. Suciningtyas, “The face mask detection for preventing the spread of COVID-19
at politeknik negeri batam,” Proceedings of ICAE 2020 - 3rd International Conference on Applied Engineering. IEEE, 2020, doi:
10.1109/ICAE50557.2020.9350556.
[22] D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” 3rd International Conference on Learning
Representations, ICLR 2015 - Conference Track Proceedings, 2015.
[23] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: the first facial landmark localization
challenge,” The IEEE International Conference on Computer Vision. IEEE, pp. 397–403, 2013, doi: 10.1109/ICCVW.2013.59.
[24] X. Su, M. Gao, J. Ren, Y. Li, M. Dong, and X. Liu, “Face mask detection and classification via deep transfer learning,”
Multimedia Tools and Applications, vol. 81, no. 3, pp. 4475–4494, 2022, doi: 10.1007/s11042-021-11772-5.
[25] Z. Wang, B. Huang, G. Wang, P. Yi, and K. Jiang, “Masked face recognition dataset and application,” IEEE Transactions on
Biometrics, Behavior, and Identity Science, vol. 5, no. 2, pp. 298–304, 2023, doi: 10.1109/TBIOM.2023.3242085.
[26] S. Ge, J. Li, Q. Ye, and Z. Luo, “Detecting masked faces in the wild with LLE-CNNs,” Proceedings - 30th IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2017, IEEE, pp. 426–434, 2017, doi: 10.1109/CVPR.2017.53.
BIOGRAPHIES OF AUTHORS
Munlika Rattaphun received the B.S. degree in computer science from Thaksin
University, Songkhla, Thailand, in 2009, the M.S. degree in computer science from Prince of
Songkla University, Songkhla, Thailand, in 2011, and the Ph.D. degree in computer science
and information engineering from National Chiayi University, Chaiyi, Taiwan, in 2022. She is
currently a lecturer at the Department of Computer Science, Faculty of Science and
Technology, Nakhon Si Thammarat Rajabhat University, Nakhon Si Thammarat, Thailand.
Her current research interests include machine learning, nearest-neighbor search, and
recommender systems. She can be contacted at email: [email protected].
Sopee Kaewchada received the B.Sc. degree in computer science from Rajabhat
Phetchaburi Institute, Thailand, in 1997 the M.S. degree in management of information
technology from Walailak University, Thailand, in 2003, and the Ph.D. degree in Creative
Innovation in Science and Technology, Nakhon Si Thammarat Rajabhat University, Thailand,
in 2023. Currently, she is an Assistant Professor at the Faculty of Science and Technology,
Nakhon Si Thammarat Rajabhat University, Thailand. She can be contacted at email:
[email protected].