Masked Face Recognition Using Deep Learning

ABSTRACT
Facial recognition is a widely used technology. However, it fails to perform well when the
user's face is partially covered. This has become an imminent problem to solve as wearing
masks has become the new normal since the outbreak of COVID-19. In this paper, we propose
an access control system with the capability to recognize users' identities even when they are
wearing masks. For the proposed system, we create our own dataset by automatically adding
masks to existing datasets of unmasked facial images. A deep learning architecture is developed
by matching different current models and loss functions and training them on this dataset. The
trained model is applied to our system, which consists of a desktop camera application and a
backend server, to simulate an access control system.
TABLE OF CONTENTS
Acknowledgement
Abstract
1. INTRODUCTION
1.1 Overview
1.2 Objective
2. LITERATURE SURVEY
2.1 DCNN Pipeline
2.2 Loss Function
3. PREAMBLE
3.1 Existing System
3.2 Proposed System
3.3 Methodology
4. REQUIREMENT SPECIFICATION
4.1 Hardware
4.2 Software
5. DESIGN
5.1 Masked Face Recognition Pipeline
5.2 Dataset Creation
6. IMPLEMENTATION
6.1 Masked Face Recognition Model
6.2 Masked Face Recognition Process
7. RESULT
7.1 Masked Face Recognition Model
7.2 InceptionResNetV1 with ArcFace Loss
7.3 InceptionResNetV1 with Triplet Loss
7.4 System
8. CONCLUSION
REFERENCES
Appendix
LIST OF TABLES
2.1 Literature Survey Summary
5.4 The raw datasets used
7.1 Test result of models with different configurations
7.4.1 Test result with different threshold values with method 1
7.4.3 Test result with Euclidean distance and cosine similarity
LIST OF FIGURES
3.3.1 Simplified recognition flow chart
5.1 DCNN Pipeline Design
5.2 Flow chart for recognition steps
5.3 Sample images after applying MaskTheFace to LFW dataset
7.1.1 t-SNE analysis with 10 random people, each with 3 images
7.2 InceptionResNetV1: the test accuracy of ArcFace is slightly better than the result of triplet loss with the same model architecture
7.2.1 Validation graphs of pre-trained InceptionResNetV1 with ArcFace in epochs 1, 5, 6, 10, 11, 15
7.2.2 Evaluation matrix of the test result of InceptionResNetV1 with ArcFace and confusion matrix
7.3.1 Training with semi-hard triplet pairs for 1 epoch and 5 epochs
7.3.2 Training with both semi-hard and hard triplet for 5 epochs with 1, 5 and 10 weighting set for hard triplet
7.4.2 Visualization of validation result for our best model (ArcFace) and conditional probability against distance difference with a fitted customized sigmoid function
7.4.4 Loss, metrics, false positives and negatives during training of the prototypical network
7.4.5 Metrics against threshold values for Euclidean distance and for cosine similarity
8.1 Face recognized, mask not detected
8.2 Masked face recognized
8.3 Mask detected, face not recognized
8.4 Mask not detected, face not recognized
Chapter 1
INTRODUCTION
1.1 Overview
Face recognition is one of the most important applications of machine learning. It has been
widely used in products like access control systems. However, the technology still cannot offer
satisfactory performance in particular conditions, including when the face is half covered.
Some people wear masks due to health problems or privacy concerns, but there were not many
studies on masked face recognition because such cases were relatively rare until 2020. With
the outbreak of COVID-19 worldwide, wearing masks in public areas has become the new
normal, or even a regulation in some places, including Bengaluru, making recognition of faces
with masks an imminent need for the current technology.
The main goal of this project is to find a feasible solution to automatically recognize people’s
faces even when they are wearing masks, in order to compensate for the inadequate accuracy
offered by the conventional face recognition technology when the users’ faces are partially
covered. As deep learning is the most popular method to approach conventional face
recognition problems, we also decided to use deep learning to train robust models in this
project. We will try to match various robust models and loss functions that are commonly used
in deep learning and face recognition, and feed them with the dataset we create ourselves in
the project.
To test our model in a real-world situation, we choose to build an access control system which
supports masked face recognition. A server is set up to simulate a practical situation that allows
users to register and upload images themselves and compares the incoming data from the
camera application against the user database to calculate the recognition result.
With the inadequate accuracy offered by the existing conventional face recognition to match
faces with masks, access control systems have to be either temporarily abandoned or require
the users to remove their masks beforehand. This will not only make the process inconvenient
for users but also increase the risk of infection of COVID-19. We believe masked face
recognition can greatly improve the existing system, allowing users to remain contactless,
without removing their masks, in an automatic access control system.
1.2 Objective
The goal of this project is to train a face recognition model which is capable of identifying
people even when they are wearing masks and integrate the model into an access control
system.
Chapter 2
LITERATURE SURVEY
Sinno Jialin Pan et al. [1] have given a comprehensive overview of transfer learning for
classification, regression and clustering as developed in the machine learning and data mining
areas; there has also been a large amount of work on transfer learning for reinforcement
learning in the machine learning literature. The main benefits of transfer learning include the
saving of resources and improved efficiency when training new models. The need for transfer
learning may also arise when data can easily become outdated. F. Schroff et al. [3] present
FaceNet, a unified embedding for face verification, recognition and clustering.
M. Tan and Q. V. Le [10] have shown that EfficientNet-B7 achieves state-of-the-art 84.3%
top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster on inference than the best
existing ConvNet. They studied model scaling and identified that carefully balancing network
depth, width and resolution can lead to better performance.
Wadii Boulila et al. [13] describe an off-line step aiming to create a DL model that is able to
detect and locate facemasks in order to monitor social distancing. The proposed training model
for mask detection is based on the Single-Shot Multibox Detector (SSD) and You Only Look
Once (YOLO) version 2. The testing of this model is performed on complex images including
face turning, wearing glasses, beard faces and scarf images, and the testing accuracy attains
93.4%. The resulting model is a light-weight DL network suited to edge devices and provides
excellent results for object detection.
Table 2.1: Literature Survey Summary (entries 9-13)

9. (2019) "ArcFace: Additive angular margin loss for deep face recognition" — J. Deng, J. Guo, N. Xue, and S. Zafeiriou
Proposed work: An Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition.
Advantages: The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere.
Limitations: Massive data storage burden; the ML technology used in face detection requires powerful data storage that may not be available to all users.

10. (2020) "EfficientNet: Rethinking model scaling for convolutional neural networks" — M. Tan and Q. V. Le
Proposed work: A model scaling study identifying that carefully balancing network depth, width and resolution can lead to better performance.
Advantages: EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
Limitations: A missing piece prevents better accuracy and efficiency.

11. (2020) "Masked Face Recognition Dataset and Application" — Zhongyuan Wang, Guangcheng Wang, Baojin Huang
Proposed work: A face-eye-based multi-granularity model that achieves 95% recognition accuracy, about 5% higher than various current models.
Advantages: This work proposes three types of masked face datasets: the Masked Face Detection Dataset, the Real-world Masked Face Recognition Dataset and the Simulated Masked Face Recognition Dataset.
Limitations: Another related task is face mask recognition, that is, identifying whether a person is wearing a mask as required or not.

12. (2020) "Masked Face Recognition for Secure Authentication" — Aqeel Anwar, Arijit Raychowdhury
Proposed work: An open-source tool, MaskTheFace, which can be used to mask faces. This results in the creation of a large dataset of masked faces; the dataset generated with this tool can then be used towards training an effective facial recognition system with target accuracy for masked faces.
Advantages: Masked faces can be recognized with low false-positive rates and high overall accuracy, without requiring the user dataset to be recreated by taking new pictures for authentication.
Limitations: The largest face recognition dataset used has 24,771 images, but the dataset faces are not consistent or aligned, making it a little harder to use.

13. (2021) "A Deep Learning-based Approach for Real-time Facemask Detection" — Wadii Boulila, Ayyub Alzahem
Proposed work: A light-weight DL model suited to edge devices; it provides excellent results for object detection.
Advantages: An off-line step aiming to create a DL model that is able to detect and locate facemasks.
Limitations: The testing of this model is performed on complex images including face turning, wearing glasses, beard faces and scarf images; the testing accuracy is 93.4%.
Chapter 3
PREAMBLE
3.1 Existing System
When face recognition systems are presented with a masked face, the system fails to identify
the person, rendering the system unusable. The need for a face recognition system which can
recognize masked faces has become evident in the wake of the ongoing situation.
3.2 Proposed System
In this proposed system, we train a face recognition model which is capable of identifying
people even when they are wearing masks and integrate the model into an access control
system.
• Cost effective: The system should be affordable, using cost-effective components in its
design.
• Fast: The algorithm should run in real time and be on par with, or faster than, existing
ones.
• Accuracy: The algorithm should identify masked faces with high accuracy.
3.3 Methodology
The methodology follows the simplified recognition flow shown in Figure 3.3.1: a face is
detected in the camera image, mask detection selects either the normal or the masked face
recognition model, an embedding is extracted and matched against the registered users on the
backend server, and access is granted or denied accordingly.
Chapter 4
REQUIREMENT SPECIFICATION
The study of the requirement specification focuses on the functioning of the system. It allows
the developer or analyst to understand the functions the system must carry out, the performance
levels to be attained, and the corresponding interfaces to be established.
4.1 Hardware
1. GPU for model training
2. Camera for image capture
3. Storage Units
4.2 Software
1. Python
2. PyCharm
3. Anaconda
4. TensorFlow
5. Keras
6. OpenCV
Python is a high-level, interpreted, general-purpose programming language. Its design
philosophy emphasizes code readability with the use of significant indentation. Python is
dynamically-typed and garbage-collected. It supports multiple programming paradigms,
including structured (particularly procedural), object-oriented and functional programming. It
is often described as a "batteries included" language due to its comprehensive standard library.
The core principles of Python are summarized as follows:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.
PyCharm is a dedicated Python Integrated Development Environment (IDE) providing a wide
range of essential tools for Python developers, tightly integrated to create a convenient
environment for productive Python, web, and data science development. We have used the
Community Edition of PyCharm, which is free and open source. It is used for smart and
intelligent Python development, including code assistance, refactoring, visual debugging, and
version control integration.
Chapter 5
DESIGN
5.1 Masked Face Recognition Pipeline
A supervised DCNN pipeline was designed to solve the facial recognition problem in this
project. We choose InceptionResNetV1 as the deep convolutional layers and ArcFace as
the loss function because they achieved the highest performance in our experiments. All facial
images will be resized to 128x128x3 first. If the image size is too large, more time and GPU
memory are needed during training. If the size is too small, many important features cannot be
extracted by the DCNN. Then, normalization and data augmentation will be done before fitting
the data into the InceptionResNetV1 classifier. The classifier will output a 512-dimensional
image embedding. Finally, ArcFace is used to compute the cost and do optimization.
• Normalization and Data Augmentation: Normalizing the data before training a deep
network improves the data distribution so that gradients do not vanish. Data
augmentation includes random horizontal or vertical flipping to increase the amount of
data and reduce overfitting.
• Deep Convolutional Layers: There are many convolutional layers and pooling layers
here. Convolutional layers can extract the important features from the images. Pooling
layers can reduce the size of the images so that the model is relatively small. The last
layer in the convolutional layer is a global average pooling or generalized mean pooling
which can further reduce the size of the images.
• Fully-connected Layer: A layer to resize the output from convolutional layer to 512
dimensions. It is a common dimension size for embedding representation in deep
learning. A 512-dimensional embedding is enough to represent most of the important
features of an image, and it is also efficient in computation.
• Loss Function: A function to calculate the cost between the predictions and the real
labels so that optimization can be done through back-propagation (a PyTorch sketch of
the full pipeline follows this list).
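The sketch below assembles the pipeline in PyTorch. The backbone is the pre-trained InceptionResNetV1 from the facenet-pytorch package used later in the implementation; the ArcFaceHead class is a simplified re-implementation for illustration only, and the scale s, margin m and identity count (10575, the number of CASIA-WebFace identities) are assumed typical values rather than the project's exact settings.

# Sketch: preprocessing -> InceptionResNetV1 -> 512-d embedding -> ArcFace loss
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms
from facenet_pytorch import InceptionResnetV1

# resize to 128x128x3, random flip for augmentation, then normalize
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

class ArcFaceHead(nn.Module):
    """Additive angular margin loss: the target-class logit uses cos(theta + m)."""
    def __init__(self, embedding_dim=512, num_classes=10575, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class centre
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, num_classes=self.weight.shape[0]).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)  # margin on the true class only
        return F.cross_entropy(logits, labels)

backbone = InceptionResnetV1(pretrained='vggface2')   # deep convolutional layers + 512-d output
arcface = ArcFaceHead()

images = torch.randn(8, 3, 128, 128)                  # stands in for a preprocessed batch
labels = torch.randint(0, 10575, (8,))
loss = arcface(backbone(images), labels)              # cost minimized by back-propagation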
The overall flow of the recognition process in our system is shown in Figure 5.2. Face detection
is always in use, and the whole recognition process will be initiated only when a face is detected.
The image is then passed to mask detection, which determines whether the normal face
recognition model or the masked face recognition model is used for feature extraction,
depending on whether the user is wearing a mask. The extracted embedding is then sent to the
backend server where the user data is stored, and embedding matching is done on the server to
decide whether the user is valid. The result is logged and then sent back to the camera
application. If the user is valid, access is granted; otherwise it is denied. The result is shown in
the camera application with the name of the valid user, or a warning if the user is invalid. If the
user is valid, a blue bounding box is shown; otherwise the box is red.
5.2 Dataset Creation
A deep learning model always requires a large amount of diverse data in order to become
robust, so the dataset is essential in our project. However, as there is no masked face image
dataset readily available online, we have to create our own dataset in the project.
MaskTheFace is a GitHub package that can add different kinds of masks to normal face images.
The output masked face images of the package are of satisfactory quality, without problematic
images such as masks being added to wrong positions on the faces; a few sample images are
shown in Figure 5.3.
This package enables us to make use of the large number of datasets with normal face images
currently available. Thus, the CASIA, VGGFace2 and LFW datasets are used as the raw data
in our project, with the detailed information illustrated in Table 5.4.
Chapter 6
IMPLEMENTATION
6.1 Masked Face Recognition Model
6.1.1 InceptionResNetV1
We wanted to use transfer learning to train the facial recognition model in order to let the loss
converge faster and save training time. Since our masked face dataset has more than 2 million
facial images, the training time would be very long without transfer learning. A Python package,
facenet-pytorch, provides a pre-trained InceptionResNetV1, which was trained on VGGFace2,
a large facial image dataset with more than 3,000,000 images. This pre-trained model achieves
99.65% accuracy on the LFW facial image dataset. We simply loaded the pre-trained
InceptionResNetV1 as the classifier. Then, we used PyTorch, a deep learning framework, to
keep training this pre-trained model with our masked face dataset. We did not need to modify
the output size of the one-layer fully connected layer because the output size of the last linear
layer of InceptionResNetV1 is already 512. After the loss and accuracy converge, the model
can generate a representative 512-dimensional embedding, which contains the important
features of the facial image.
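A hedged sketch of this transfer-learning step is shown below. The dataset folder layout, batch size and the reuse of the ArcFaceHead from the earlier pipeline sketch are illustrative assumptions; only the pre-trained InceptionResNetV1 loading and the multistep learning-rate decay (0.1 down to 0.00001 at epochs 5, 10, 15 and 20, described in Chapter 7) follow what is reported in this project.

# Sketch: continue training the pre-trained InceptionResNetV1 on our masked face dataset
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from facenet_pytorch import InceptionResnetV1

transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

# hypothetical layout: masked_faces/<person_id>/<image>.jpg, one folder per identity
train_set = datasets.ImageFolder('masked_faces', transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = InceptionResnetV1(pretrained='vggface2')              # already outputs 512-d embeddings
head = ArcFaceHead(embedding_dim=512, num_classes=len(train_set.classes))  # from the earlier sketch
optimizer = torch.optim.SGD(list(model.parameters()) + list(head.parameters()),
                            lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10, 15, 20], gamma=0.1)

model.train()
for epoch in range(20):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = head(model(images), labels)                    # ArcFace cost on the embeddings
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # multistep learning-rate decay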
6.1.2 SE-ResNeXt-101
We could not find a SE-ResNeXt-101 model, which was trained on facial image datasets. Most
of the pre-trained SE-ResNeXt-101 models were trained on ImageNet, a large object
recognition dataset. A pre-trained model of object recognition is not suitable for doing transfer
learning in facial recognition. The feature extraction in the model can only extract object’s
features. Using this model is the same as training a model from scratch with weight
initialization or even worse than weight initialization. Our method is that training a SE-
ResNeXt-101 from scratch without transfer learning. The Python package, timm, provides
many pre-implemented image models. SE-ResNeXt-101 is one of the pre-implemented models
in timm. We loaded SE-ResNeXt-101 from timm and trained it with the same method and
hyper-parameters as InceptionResNetV1.
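As a sketch, loading an SE-ResNeXt-101 from timm without pre-trained weights could look like the following; the exact model-name string should be checked with timm.list_models(), and the extra 512-dimensional linear layer is our assumption so that the output matches the rest of the pipeline.

# Sketch: create SE-ResNeXt-101 from timm and attach a 512-d embedding layer
import timm
import torch
import torch.nn as nn

print(timm.list_models('*seresnext101*'))           # confirm the exact model name available

backbone = timm.create_model('seresnext101_32x4d', pretrained=False, num_classes=0)
# num_classes=0 removes the classifier so the model returns pooled features
embedder = nn.Sequential(backbone, nn.Linear(backbone.num_features, 512))

x = torch.randn(2, 3, 128, 128)
print(embedder(x).shape)                             # expected: torch.Size([2, 512])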
6.2 System
Our system consists of three parts. The camera application captures images from the user and
stores them; the system returns the result, which is reflected in the camera application; and the
system also allows users to register and upload images themselves.
A camera application is built with OpenCV from skeleton code available online and handles
the major part of the recognition process. Multiprocessing is used in the application: one
process grabs images from the camera and puts them into a queue, while another gets the
images from the queue and displays them with an image widget in the application.
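A minimal sketch of this two-process design with OpenCV and Python multiprocessing is given below; the window name, queue size and frame-dropping policy are our own illustrative choices rather than the skeleton code actually used.

# Sketch: one process grabs frames into a queue, another displays them
import cv2
from multiprocessing import Process, Queue

def grab(queue, camera_source=0):
    cap = cv2.VideoCapture(camera_source)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if not queue.full():              # drop frames instead of blocking when display lags
            queue.put(frame)
    cap.release()

def show(queue):
    while True:
        frame = queue.get()               # blocks until a frame is available
        cv2.imshow('camera', frame)       # stand-in for the image widget in the application
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cv2.destroyAllWindows()

if __name__ == '__main__':
    q = Queue(maxsize=4)
    Process(target=grab, args=(q,), daemon=True).start()
    show(q)                               # display loop runs in the main process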
There are multiple steps, as shown in the following flow chart, to process the incoming image
and do embedding matching on the server, with the results shown in the camera application.
Face Detection
We use MTCNN, a famous face detection package, in our face detection module. It returns the
coordinates of the bounding box where the face appears in the original image. As users will be
authenticated one by one in an access control system, only the first face detected by MTCNN
in the image captured by the camera will be processed. The cropped facial image with the
bounding box, instead of the whole image, will then be passed to the following steps.
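A hedged sketch of this step with the MTCNN implementation from facenet-pytorch (the same package that provides our recognition backbone) is shown below; the test image path and parameter choices are illustrative only.

# Sketch: detect faces with MTCNN and crop the first detected face
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True)                       # return every bounding box found

frame = cv2.imread('sample.jpg')                   # hypothetical test image
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)       # MTCNN expects RGB input
boxes, probs = mtcnn.detect(Image.fromarray(rgb))

if boxes is not None:
    x1, y1, x2, y2 = [int(v) for v in boxes[0]]    # only the first face is processed
    face = frame[y1:y2, x1:x2]                     # cropped facial image for the next steps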
Embedding Extraction
Since the conventional face recognition model still performs better than our model when more
facial features are available, we will use the pre-trained InceptionResNetV1 to extract the
embedding when the user is not wearing a mask. Otherwise, our masked face recognition model
will be used to extract the embedding.
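A tiny sketch of this model selection is shown below; normal_model and masked_model are placeholders for the pre-trained InceptionResNetV1 and our fine-tuned masked face model, which are loaded elsewhere.

# Sketch: pick the embedding model according to the mask-detection result
import torch

def extract_embedding(face_tensor, wearing_mask, normal_model, masked_model):
    model = masked_model if wearing_mask else normal_model
    model.eval()
    with torch.no_grad():
        return model(face_tensor.unsqueeze(0)).squeeze(0)   # 512-d embedding vector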
Pseudo Code
import os
import cv2
import numpy as np

img_format = {'png', 'PNG', 'jpg', 'JPG', 'JPEG', 'bmp', 'BMP'}

def video_init(camera_source=0, resolution="1080", to_write=False, save_dir=None):
    '''
    :param camera_source: camera index or video file path
    :param resolution: '480', '720' or '1080'. Set None for videos.
    :param to_write: whether to record the stream
    :param save_dir: the folder to save the recording
    :return: cap, height, width, writer
    '''
    writer = None
    resolution_dict = {"480": [480, 640], "720": [720, 1280], "1080": [1080, 1920]}

    # ----camera source connection
    cap = cv2.VideoCapture(camera_source)

    # ----resolution decision
    if resolution_dict.get(resolution) is not None:
        width = resolution_dict[resolution][1]
        height = resolution_dict[resolution][0]
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    else:
        height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)  # default 480
        width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)    # default 640
        print("video size is auto set")

    # FourCC is a 4-byte code used to specify the video codec (see
    # https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html).
    # It is platform dependent: XVID works well on Linux, DIVX on Windows,
    # and MJPG/DIVX/X264 on macOS; XVID is preferred here.
    if to_write:
        fourcc = cv2.VideoWriter_fourcc(*'XVID')
        save_path = 'demo.avi'
        if save_dir is not None:
            save_path = os.path.join(save_dir, save_path)
        writer = cv2.VideoWriter(save_path, fourcc, 30, (int(width), int(height)))

    return cap, height, width, writer

def stream(pb_path, node_dict, ref_dir, camera_source=0, resolution="480",
           to_write=False, save_dir=None):
    frame_count = 0
    FPS = "loading"
    face_mask_model_path = r'face_mask_detection.pb'
    margin = 40
    id2class = {0: 'Mask', 1: 'NoMask'}
    batch_size = 32
    threshold = 0.8
    display_mode = 0
    label_type = 0

    cap, height, width, writer = video_init(camera_source=camera_source,
                                            resolution=resolution,
                                            to_write=to_write, save_dir=save_dir)

    # ----face detection init
    # (model loading and the reference image list are omitted in this excerpt;
    #  ites, len_ref_path and model_shape are set in the omitted part)

    # ----compute embeddings of the registered reference images in batches
    for i in range(ites):
        num_start = i * batch_size
        num_end = np.minimum(num_start + batch_size, len_ref_path)
        batch_data_dim = [num_end - num_start]
        batch_data_dim.extend(model_shape[1:])
        batch_data = np.zeros(batch_data_dim, dtype=np.float32)
Chapter 7
RESULT
7.1 Masked Face Recognition Model
This section is mainly about our experiments on different models and loss functions. We
compare their results through the training, validation and testing processes. The best model is
obtained with InceptionResNetV1 and ArcFace, which is also used as the final model in the
access control system. The following table shows the best result we achieve with different
combinations of models and loss functions. Our model's capability of extracting similar
embeddings for images of the same person is also illustrated by the t-SNE analysis in
Figure 7.1.1, where images from the same person form obvious clusters.
Figure 7.1.1: t-SNE Analysis With 10 Random People, Each With 3 Images
Figure 7.1.2: Distribution of Two Validation Sets
The two validation sets and the test set prepared with the method described above were used
to evaluate the performance of our model, with a sample visualization of the validation result
shown in Figure. For simplicity, validation set 1, which contains pairs of images from the same
person, will be called the same-class pairs, and validation set 2, which contains pairs of images
from different people, will be called the different-class pairs.
7.2 InceptionResNetV1 with ArcFace Loss
Besides using triplet loss for optimization, we also try ArcFace to optimize the pre-trained
InceptionResNetV1.
Figure 7.2: The test accuracy of ArcFace is slightly better than the result of triplet loss with
the same model architecture, InceptionResNetV1.
The figure shows the training loss reduction. We choose a multistep decay scheduler, an
equally spaced five-step learning rate with values 0.1, 0.01, 0.001, 0.0001 and 0.00001 (the
learning rate is reduced at the end of epochs 5, 10, 15 and 20). The training loss drops
noticeably after every learning rate decay. However, we cannot draw any conclusion from the
training result alone, because over-fitting may occur; we need to analyse it together with the
validation and test results. The IOU curve, one of the validation results, shows that the model
obtains the best result in epoch 11. After epoch 11, the validation result does not improve
anymore, even though the training loss continues to drop. This means that the model is best at
epoch 11 and over-fitting occurs afterwards.
Figure 7.2.1 shows the validation graphs in epochs 0, 5, 6, 10, 11, 15, 16, 20 and 21. We observe
that the validation result is similar to the training result: after the decay of the learning rate at
the end of epochs 5 and 10, the validation result also improves considerably each time. However,
some over-fitting still occurs after training for too many epochs. We do not observe an obvious
improvement in the validation graphs for the last two decays of the learning rate, even though
the training loss drops.
Figure 7.2.2: Evaluation matrix of the test result of InceptionResNetV1 with ArcFace and confusion matrix
After getting the model with the lowest IOU value, we use this model for testing with the test
set. The test accuracy of this model (95.85%) is the best among the models in our experiments.
7.3 InceptionResNetV1 with Triplet Loss
Although FaceNet suggests that semi-hard triplet pairs are preferred for training a face
recognition model, we actually observe that the model trained with semi-hard triplet pairs fails
to learn to recognize images of different people throughout the training. Figure 7.3.1 shows
that the model eventually gets a large distance difference for many same-class pairs and a small
distance difference for even more different-class pairs. This situation is alleviated after we
begin to make use of both semi-hard and hard triplets, or even add a larger weight to the loss
calculated from the hard triplets, as shown in Figure 7.3.2. In contrast, the training result when
using only hard triplets is more normal, as shown in Figure 7.3.3.
Figure 7.3.1: Training with Semi-Hard Triplet Pairs for 1 Epoch And 5 Epochs
Figure 7.3.2: Training with Both Semi-Hard and Hard Triplet for 5 Epochs With 1, 5 And 10
Weighting Set for Hard Triplet
Figure 7.3.3: Training with Hard Triplet Pairs for 1 Epoch And 5 Epochs
The best model trained with hard triplets achieves an accuracy of 94.28% on our test set.
Visualization of the changes in the validation set during the training process is shown in Figure
7.3.4. Figure 7.3.5 shows the decrease in training loss and IOU during training, together with
the final test result. We can see that the actual test result is roughly the same as the IOU we
choose as the validation metric, with the lowest IOU achieved at 0.054 and the test accuracy
94.28% (≈ 100% − 5.4%). There is also a validation loss computed from a manually
downloaded real masked face image dataset with only around 100 images. Since that dataset
is too small, which leads to instability in the loss, and since the IOU value can already
effectively reflect the performance of our model, the dataset is abandoned.
Figure 7.3.4: Best model (triplet) with triplet loss after 5, 12, 20 and 28 epochs
Figure 7.3.5: Training Loss and IOU of the Best Model (Triplet) During Training and Test Result of
The Best Model (Triplet)
7.4 System
7.4.1 Face Detection
MTCNN works very well during our testing in the real-world situation as well as on the
downloaded images we prepare for system testing. However, since a GPU is not available on
our local machine, it is too computationally expensive to loop over the image too many times
in a real-time application. Nevertheless, because we still want real-time face detection so that
the bounding box can be shown correctly in the camera application, MTCNN has to be run on
every frame. To make the streaming more fluent, we decide to set a minimum face size that
MTCNN can detect. After some experiments, the minimum face size that the camera can detect
is set to 1/64 of the displayed screen size. This does not affect the effectiveness of our system,
as it is reasonable for the user to walk close to the camera for the face recognition process, but
it greatly improves the fluency of the streaming.
The methods introduced earlier to calculate the threshold and implement embedding matching
are used. As there are 129 × 128 = 16512 pairs available, the ⌊16512 × 0.05%⌋ = 8th smallest
value is used as the threshold to keep a roughly 99.95% precision on comparisons of pairs of
images. With Euclidean distance and cosine similarity, the thresholds calculated are 0.7879859
and 0.6895391 respectively. With the first method, which compares all 3 embeddings available
for each user, the image threshold can be set to 1, 2 or 3: only if 1, 2 or 3 of the images from a
single user have an embedding difference smaller than the threshold value is the user
considered valid. The detailed comparison results are shown in Table 7.4.1.
Table 7.4.1: Test Result with Different Threshold Values with Method 1
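A small sketch of method 1 (comparing the incoming embedding against all three stored embeddings of each user) is given below; the helper names and the sample data are assumptions, while the threshold 0.7879859 is the Euclidean value computed above.

# Sketch of embedding matching (method 1): count how many of a user's 3 stored
# embeddings fall within the distance threshold, then apply the image threshold.
import numpy as np

EUCLIDEAN_THRESHOLD = 0.7879859    # 8th smallest different-person distance (~99.95% precision)

def match_user(query, user_embeddings, image_threshold=2):
    """query: (512,) incoming embedding; user_embeddings: (3, 512) registered images."""
    dists = np.linalg.norm(user_embeddings - query, axis=1)     # Euclidean distances
    hits = int(np.sum(dists < EUCLIDEAN_THRESHOLD))             # how many stored images match
    return hits >= image_threshold                              # valid if enough images match

# hypothetical data: one registered user with 3 embeddings and one incoming query
rng = np.random.default_rng(0)
stored = rng.normal(size=(3, 512))
stored /= np.linalg.norm(stored, axis=1, keepdims=True)
query = stored[0] + rng.normal(scale=0.01, size=512)
query /= np.linalg.norm(query)
print(match_user(query, stored, image_threshold=1))             # expected: True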
Another metric proposed by us, conditional probability, is calculated based on the validation
set (visualization shown in Figure 7.4.2) from our masked face recognition model mentioned
earlier. The raw conditional probability computed directly from the validation pairs is not
smooth, which significantly affects the result. Therefore, we fit a customized sigmoid function
to the conditional probability calculated from the original values; the fitted curve is very
smooth and is used by us to estimate the conditional probability. The problematic original
conditional probability results and the smooth fitted sigmoid function are shown in Figure
7.4.2. We can also easily set the conditional probability threshold to 0.9995 to keep a high
precision and be consistent with how we obtain the thresholds for Euclidean distance and
cosine similarity. Conditional probability seems to perform the best, as shown in Table 7.4.1,
with a precision comparable to Euclidean distance and cosine similarity but with much higher
recall, and thereby higher accuracy, for the different image thresholds.
Figure 7.4.2: Visualization of validation result for our best model (ArcFace) and conditional
probability against distance difference with a fitted customized sigmoid function
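A sketch of how such a customized sigmoid can be fitted with SciPy is shown below; the parameterization (scale a, slope k, midpoint x0) and the synthetic data are assumptions for illustration, not the exact function or data used here.

# Sketch: fit a smooth sigmoid to noisy conditional-probability estimates
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, a, k, x0):
    # probability of "same person" as a function of embedding distance x
    return a / (1.0 + np.exp(k * (x - x0)))

distances = np.linspace(0.2, 1.4, 25)                 # hypothetical pair distances
raw_prob = np.clip(1.0 - (distances - 0.2) / 1.2
                   + np.random.normal(0, 0.05, 25), 0, 1)

params, _ = curve_fit(sigmoid, distances, raw_prob, p0=[1.0, 10.0, 0.8], maxfev=10000)
smooth = sigmoid(distances, *params)                  # smooth estimate of P(same | distance)
print("fitted parameters (a, k, x0):", params)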
The second method to implement embedding matching is to utilize a prototypical network. The
mean value of the 3 embeddings for each user is used as the prototype for that user. This is
extremely useful when the database is very large, as only one third of the calculations and
comparisons of method 1 need to be done. The experiment results shown in Table 7.4.3 also
reflect its excellent performance, with precision, recall and accuracy all much better than the
results we get from method 1 with the image threshold set to 1, as shown in Table 7.4.1. To
keep a high precision with comparable accuracy and recall, the result with Euclidean distance
as the metric seems to be the best. Thus, we will use Euclidean distance with threshold
0.7879859 in our system.
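A short sketch of the prototype-based matching (method 2) follows; the function names are illustrative, and the threshold is the Euclidean value quoted above.

# Sketch: one mean embedding (prototype) per user, a single comparison decides validity
import numpy as np

EUCLIDEAN_THRESHOLD = 0.7879859

def build_prototypes(database):
    """database: dict of user_id -> (3, 512) array of registered embeddings."""
    return {uid: emb.mean(axis=0) for uid, emb in database.items()}

def identify(query, prototypes):
    """Return the closest user id if within the threshold, otherwise None (access denied)."""
    best_uid, best_dist = None, np.inf
    for uid, proto in prototypes.items():
        d = np.linalg.norm(query - proto)
        if d < best_dist:
            best_uid, best_dist = uid, d
    return best_uid if best_dist < EUCLIDEAN_THRESHOLD else None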
We also study the relationship between accuracy, precision and recall with different threshold
values in the case of using prototypes. We can see that the threshold values we retrieved with
the previously mentioned method, 0.7879859 for Euclidean distance and 0.6895391 for cosine
similarity, almost achieve the highest accuracy while at the same time keeping a high precision.
This means that the method we propose to decide the threshold value is robust and applicable.
The relationship between the values of the different metrics and the threshold value is
illustrated in the following plots.
Figure 7.4.5: Metrics Against Threshold Values for Euclidean Distance and Metrics Against
Threshold Values for Cosine Similarity
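A sketch of how such a threshold sweep can be computed is shown below; the pair distances are synthetic data generated only to make the example runnable.

# Sketch: sweep threshold values and report precision, recall and accuracy
import numpy as np

def sweep(distances, is_same, thresholds):
    rows = []
    for t in thresholds:
        pred = distances < t                          # predicted "same person"
        tp = np.sum(pred & is_same)
        fp = np.sum(pred & ~is_same)
        fn = np.sum(~pred & is_same)
        tn = np.sum(~pred & ~is_same)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / len(is_same)
        rows.append((t, precision, recall, accuracy))
    return rows

rng = np.random.default_rng(1)
distances = np.concatenate([rng.normal(0.6, 0.10, 100),   # same-person pairs
                            rng.normal(1.2, 0.15, 100)])  # different-person pairs
is_same = np.concatenate([np.ones(100, bool), np.zeros(100, bool)])
for t, p, r, a in sweep(distances, is_same, np.arange(0.5, 1.3, 0.1)):
    print(f"threshold={t:.1f} precision={p:.3f} recall={r:.3f} accuracy={a:.3f}")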
A further experiment is done to evaluate the possibility of expanding the database size. We
first reduce the database size to 50, the 50 people who appear in the valid testing data. The
missing people from the original database are then added back one by one while keeping the
valid and invalid testing data unchanged. The relationship between the metrics and the
increasing database size is shown in Figure 7.4.6. We can see that the performance is very
consistent and is only affected by very few individual users (in our case, the precision and
accuracy drop a little only once, due to the addition of one user). Therefore, we believe our
system will remain as performant as the current one after a moderate expansion of the database.
Figure 7.4.6: Metrics During Expanding Database Size (From 50 To 100 People)
A very important thing to notice is that the above experiments are done on individual static
images, whereas our actual system streams the incoming images of the users taken by the
camera, which means many incoming images will be used for embedding matching. Therefore,
the performance of our system in the real-world situation is not limited by the 81.3% accuracy
attained above.
Chapter 8
CONCLUSION
To summarize this project of a masked face recognition model for an access control system,
we have addressed the problems of masked face recognition. In the access control system, we
implemented a camera application for taking in the video stream and a database for storing
users' information. Users store their details on the storage unit and can then be recognized and
authenticated in our system.
In our primary focus, the masked face recognition model, we explored different deep learning
methods, especially deep convolutional neural networks with transfer learning for
representative embedding learning. We created a new masked face image dataset for training
the DCNN model. We also implemented our DCNN pipeline for embedding learning and found
that the DCNN architecture, InceptionResNetV1, along with ArcFace loss, could achieve the
highest 95.85% accuracy in our experiments. With the embedding learned from the model, we
used a prototypical network to do embedding matching to recognize users’ identities. By
integrating different steps like face detection and mask detection, the model was finally applied
to our access control system.
The whole access control system was fully functional with our masked face recognition model
in the real-world situation. In the camera application, face detection, mask detection and
embedding extraction were used to capture a face and convert it to an embedding to be matched
on the storage.
In conclusion, we have finished our goal of building an access control system with face
recognition that works with and without masks. It achieved high accuracy in identity
verification. Our future work should tackle the questions “how to further improve the accuracy
of the masked face recognition model”, “how to improve data security in the system”, and
“how to prevent spoofing attacks with 3D modelling”.
REFERENCES
1. S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
2. D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” 2014.
3. F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.
4. Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” 2015.
5. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2018.
6. F. Radenović, G. Tolias, and O. Chum, “Fine-tuning CNN image retrieval with no human annotation,” 2018.
7. Y. Nirkin, “face_segmentation,” https://github.com/YuvalNirkin/face_segmentation#deep-face-segmentation-in-extremely-hard-conditions, 2018.
8. Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” 2018.
9. J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” 2019.
10. M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” 2020.
11. Z. Wang, G. Wang, and B. Huang, “Masked face recognition dataset and application,” 2020.
12. A. Anwar and A. Raychowdhury, “Masked face recognition for secure authentication,” 2020.
13. W. Boulila and A. Alzahem, “A deep learning-based approach for real-time facemask detection,” 2021.