MasterThesis V0
THESIS
To obtain the diploma of Master
Field: Computer Science
Specialty: Information System and Web development
”Systèmes d’Information et Web (SIW)”
Theme
Presented by:
AGGOUN LINA
LAOUEDJ SARAH
Supervised by:
BENSLIMANE Sidi Mohammed
BOUSMAHA Rabab
Academic Year : 2022/2023
State of the Art
Definition:
Face recognition is a popular research task in the fields of image processing and
computer vision, owing to its potentially enormous range of applications as well as its
theoretical value. Such systems are widely deployed in many real-world applications such as
security, surveillance, homeland security, access control, image search, human-machine
interaction, and entertainment. However, these applications pose different challenges, such as
varying lighting conditions and facial expressions.
The characteristics that make a face recognition system useful are the following: the ability to
work with both videos and images, to process in real time, to be robust to different lighting
conditions, to be independent of the person (regardless of hair, ethnicity, or gender), and to
work with faces viewed from different angles. Different types of sensors, including RGB,
depth, EEG, thermal, and wearable inertial sensors, are used to obtain data. These sensors
may provide extra information and help face recognition systems to identify faces
in both static images and video sequences.
Three basic steps are used to develop a robust face recognition system: face detection,
feature extraction, and face recognition. The face detection step detects and locates the
human face in the image obtained by the system. The feature extraction step extracts a
feature vector for every human face located in the first step. Finally, the face recognition
step compares the features extracted from the face against all the template faces stored in a
database in order to decide the identity of the face.
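As a concrete illustration of these three steps, the following is a minimal sketch in Python using the open-source face_recognition library; the image file names are placeholders and the library is only one of many possible choices.

# Minimal sketch of the three-step pipeline (detection -> feature extraction ->
# recognition) using the open-source face_recognition library.
import face_recognition

# 1. Face detection: locate faces in the query image.
image = face_recognition.load_image_file("query.jpg")              # hypothetical path
face_locations = face_recognition.face_locations(image)

# 2. Feature extraction: compute a 128-d embedding ("signature") per detected face.
face_encodings = face_recognition.face_encodings(image, face_locations)

# 3. Face recognition: compare each embedding against a known template.
known_image = face_recognition.load_image_file("employee_01.jpg")  # hypothetical path
known_encoding = face_recognition.face_encodings(known_image)[0]

for encoding in face_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding, tolerance=0.6)
    print("Match with employee_01:", match[0])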
1. Face detection:
Face Detection is a Computer Vision task in which a computer program can detect
the presence of human faces and also find their location in an image or a video
stream. This is the first and most crucial step for most computer vision applications
involving a face.
1.2.1 Occlusion
Occlusion greatly affects the ability of any system to detect a face: when only part of the face is
visible, it is hard to say with confidence whether there is a face in the frame at all.
1.2.2 Lighting
Any change in the subject's lighting conditions poses an issue for face detection, since the method
may not have been designed or trained to handle such variation in lighting.
In addition, a particular skin tone might appear differently under various lighting conditions than
another, which adds a further challenge for the detection system.
1.2.4 Pose
The pose or orientation of a face in the image frame affects the performance of the Face detector, as
some methods can only detect frontal faces and fail when the face is sideways or turned slightly to
one side.
1.3.2 Precision
Precision measures the proportion of predicted positives that are correct, i.e., the True Positives out
of all detections. Mathematically, it is defined as follows:
Precision = TP / (TP + FP)
1.3.3 Recall
Recall measures the proportion of actual positives that were predicted correctly, i.e., the True
Positives out of all Ground Truths. Mathematically, it is defined as follows:
Recall = TP / (TP + FN)
The ROC curve essentially plots Recall (the true positive rate) against the false positive rate (FPR) for
various threshold values. The area under the curve (AUC) is used to summarize the performance of a
model into a single measure, which is important when comparing the performance of different
models. A model with a high AUC can occasionally score worse in a specific region than another
model with a lower AUC, but in practice the AUC performs well as a general measure of predictive
accuracy.
1.3.7 mAP – Mean Average Precision
As the name suggests, Mean Average Precision, or mAP, is the average of AP over all detected classes
in multiclass object detection. To arrive at the mAP, the Average Precision (AP) is calculated for each
class separately while evaluating a model, and the results are then averaged.
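The following is a small sketch of how these metrics can be computed, assuming a toy list of detection scores and ground-truth labels (the numbers are invented for illustration) and using scikit-learn for the Average Precision.

# Toy example of Precision, Recall, and AP; values are illustrative only.
from sklearn.metrics import average_precision_score

# 1 = ground-truth face, 0 = background; scores are detector confidences.
y_true = [1, 1, 0, 1, 0, 1]
y_scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.20]

# Precision and Recall at a fixed confidence threshold of 0.5.
tp = sum(1 for t, s in zip(y_true, y_scores) if s >= 0.5 and t == 1)
fp = sum(1 for t, s in zip(y_true, y_scores) if s >= 0.5 and t == 0)
fn = sum(1 for t, s in zip(y_true, y_scores) if s < 0.5 and t == 1)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
ap = average_precision_score(y_true, y_scores)   # area under the precision-recall curve
print(precision, recall, ap)
# mAP for a multi-class detector is simply the mean of the per-class AP values.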
Here's a high-level overview of how the HOG method works for face detection:
1. Pre-processing: The input image is first pre-processed to adjust the brightness and contrast.
2. Gradient Computation: The gradient information of the image is computed to capture the
edge information of the image.
3. Cell-level Histogram: The gradient information is divided into small cells and a histogram of
gradient orientations is computed for each cell.
4. Block-level Histogram: The histograms from several cells are combined to form a block-level
histogram, which is used to capture the structural information of the object.
5. Normalization: The block-level histograms are normalized to reduce the effect of illumination
and contrast changes.
6. SVM Classifier: The normalized histograms are used to train a Support Vector Machine (SVM)
classifier to distinguish between faces and non-faces.
7. Sliding Window Detection: A sliding window approach is used to scan the image, where the
classifier is applied on each window to detect the presence of a face.
Overall, HOG-based face detection is computationally efficient and has shown good performance in
many applications.
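As an illustration, dlib ships a frontal face detector that implements this HOG + SVM pipeline; a minimal sketch (the image path is a placeholder):

# HOG + linear SVM face detection with dlib's built-in frontal face detector.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()          # HOG features + linear SVM
image = cv2.imread("group_photo.jpg")                 # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# The second argument upsamples the image once to help find smaller faces.
faces = detector(gray, 1)
for rect in faces:
    print("Face at", rect.left(), rect.top(), rect.right(), rect.bottom())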
A classical face detection technique might fail to detect a face in a few frames, which may
lead to the application not performing as desired or cause complications in the system.
Even if the faces are detected in every frame, the process might take too long, which slows
down the application and, at times, defeats its purpose.
This is why newer, state-of-the-art face detectors are needed: they provide high accuracy
(so that no face goes undetected) at very high speeds and can also run on processors with
low computing power.
The Backbone model: The backbone model is a typical pre-trained image classification network that
works as the feature map extractor. Here, the final image classification layers of the model are
removed so that only the extracted feature maps are kept.
The SSD head: SSD head is made up of a couple of convolutional layers stacked together, and it is
added to the top of the backbone model. This gives us the output as the bounding boxes over the
objects. These convolutional layers detect the various objects in the image.
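As a practical sketch, OpenCV's DNN module ships a ResNet-10 SSD face detector that follows this backbone-plus-head design; the model files are assumed to have been downloaded beforehand, and the image path is a placeholder:

# SSD-style face detection with OpenCV's DNN module and the ResNet-10 SSD model.
import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")
image = cv2.imread("meeting_room.jpg")                 # hypothetical input image
h, w = image.shape[:2]

# The backbone + SSD head expect a fixed 300x300 input with mean subtraction.
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0, (300, 300),
                             (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()                             # shape: (1, 1, N, 7)

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        print("Face:", box.astype(int), "confidence:", confidence)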
The proposed MTCNN architecture consists of three stages of CNNs. In the first stage, P-Net
(Proposal Network) quickly produces candidate windows through a shallow CNN. Then, the R-Net
(Refine Network) stage refines the windows by rejecting many non-face bounding boxes
through a more complex CNN. Finally, the O-Net (Output Network) stage uses a more powerful CNN
to refine the result again and output the positions of five facial landmarks.
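A minimal sketch using the open-source mtcnn Python package, which implements this three-stage cascade (the image path is a placeholder):

# MTCNN face detection via the open-source `mtcnn` package.
import cv2
from mtcnn import MTCNN

detector = MTCNN()
image = cv2.cvtColor(cv2.imread("meeting_room.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical image

for face in detector.detect_faces(image):
    print("box:", face["box"])              # [x, y, width, height]
    print("confidence:", face["confidence"])
    print("landmarks:", face["keypoints"])  # eyes, nose, mouth corners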
Dual Shot Face Detector (DSFD) is a novel face detection approach that addresses the following three
major aspects of facial detection:
1. Feature learning – DSFD involves a Feature Enhance Module (FEM) that enhances the originally
received feature maps, thus extending the single shot detector to a dual shot detector. This module
combines the current layer's information with the feature maps of the previous layers and maintains
a context relationship between the anchors, which helps obtain more discriminative and robust
features.
2. Progressive loss design – Loss functions such as Focal Loss and Hierarchical Loss
address the class-imbalance problem and consider original and enhanced features,
respectively. However, they are not equipped to progressively learn the feature maps at
different levels and shots. DSFD involves a Progressive Anchor Loss (PAL) computed with two
sets of anchors: it assigns smaller anchor sizes in the first shot and larger ones in the second,
which helps the network learn the features more effectively.
3. Anchor assign-based data augmentation – Anchors are generated for each
feature map. Some research involves strategies to increase positive anchors. Such a strategy
ignores the random sampling in data augmentation, resulting in an imbalance between
positive and negative anchors. DSFD uses Improved Anchor Matching (IAM), which involves
anchor-based data augmentation. This provides a better match between the anchors and
ground truth and leads to better initialization for the face-box regressor.
All the above-mentioned aspects are complementary and can work together to improve
performance. As these techniques all relate to a two-stream design, the method has been
named the Dual Shot Face Detector. It has the ability to remain robust even under variations in
illumination, pose, scale, occlusion, etc.
Figure: RetinaFace detects 900 faces (at a confidence threshold of 0.5) out of 1,151 people.
RetinaFace takes pixel-wise face localization to the next level. It cleverly takes advantage of extra-
supervised and self-supervised multi-task learning to perform face localization on faces of various
scales, as seen in the figure above.
Many recent state-of-the-art methods focus on single-stage face detection techniques, which
densely sample face locations and scales on feature pyramids. Such a technique provides better
performance at a faster speed compared to two-stage methods.
RetinaFace improves this single-stage framework by:
- exploiting multi-task losses coming from strongly supervised and self-supervised signals;
- employing a multi-task learning strategy to simultaneously predict the face score, face box,
five facial landmarks, and the 3D position and correspondence of each face pixel.
The multi-task loss combines four components:
1. Face classification loss is a softmax loss for binary classes (face/not face).
2. Face box regression loss – The target bounding boxes are normalized and are in the format
(x_center, y_center, width, height).
3. Facial landmark regression loss – This regression technique also normalizes the target.
4. Dense regression loss – Supervised signals increase the significance of better face box and
landmark locations.
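A minimal sketch using the open-source retina-face Python package, one of several reimplementations of RetinaFace (the image path is a placeholder):

# RetinaFace detection via the open-source `retina-face` package.
from retinaface import RetinaFace

faces = RetinaFace.detect_faces("meeting_room.jpg")   # hypothetical image path
for key, face in faces.items():
    print(key, face["facial_area"])       # bounding box [x1, y1, x2, y2]
    print(face["landmarks"])              # eyes, nose and mouth corner positions
    print(face["score"])                  # face classification confidence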
It can be used to rapidly prototype perception pipelines with reusable components and in
production-ready Machine Learning applications.
It uses a lightweight feature extractor inspired by the MobileNet model and a GPU-friendly
anchor scheme modified from Single Shot Multibox Detector (SSD).
It provides a JavaScript API to implement Facial Detection on the web and an API to include it on
Android, iOS, and Desktop applications.
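The detector described here corresponds to MediaPipe's face detection solution; assuming that, a minimal sketch with the mediapipe Python package (the image path is a placeholder):

# Face detection with MediaPipe's Python solutions API.
import cv2
import mediapipe as mp

mp_face = mp.solutions.face_detection
image = cv2.cvtColor(cv2.imread("meeting_room.jpg"), cv2.COLOR_BGR2RGB)

# model_selection=0 targets short-range (selfie-like) images, 1 targets full-range scenes.
with mp_face.FaceDetection(model_selection=0, min_detection_confidence=0.5) as detector:
    results = detector.process(image)
    for detection in results.detections or []:
        box = detection.location_data.relative_bounding_box
        print("Relative box:", box.xmin, box.ymin, box.width, box.height)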
YuNet is a CNN-based face detector developed by Chengrui Wang and Yuantao Feng. It is a very
lightweight and fast model: with a model size of less than a megabyte, it can be loaded on almost any
device. It adopts MobileNet as its backbone and contains about 85,000 parameters in total.
It achieves a respectable score on the validation set of the WIDER Face dataset for such a lightweight
model.
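A minimal sketch of running YuNet through OpenCV's FaceDetectorYN API (available from OpenCV 4.5.4); the ONNX model file name is a placeholder and is assumed to have been downloaded from the OpenCV model zoo:

# YuNet face detection through OpenCV's FaceDetectorYN API.
import cv2

image = cv2.imread("meeting_room.jpg")                  # hypothetical input image
h, w = image.shape[:2]

# Arguments: model path, config (empty), input size, score threshold.
detector = cv2.FaceDetectorYN.create("face_detection_yunet.onnx", "", (w, h), 0.6)
_, faces = detector.detect(image)
for face in (faces if faces is not None else []):
    x, y, bw, bh = face[:4].astype(int)                 # remaining values: landmarks + score
    print("Face box:", x, y, bw, bh)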
In previous methods, much of the work involved hand-selecting filters so that as much detail as
possible could be extracted from the image. With a deeper understanding and greater computational
power, this work can now be automated. CNNs are so called because the original image data is
convolved with a series of filters. The parameters to be selected are the number of filters to apply
and the filter dimension; the step with which the filter slides over the image is known as the stride,
with typical values ranging from 2 to 5. In this particular case, the output of the CNN is a binary
classification that takes the value 1 when a face exists and 0 otherwise. The approach of the
Max-Margin Object Detection (MMOD) paper is also implemented for improved performance. This
model works with various facial orientations, is robust to occlusion, and its training procedure is fast.
However, inference is very slow, and it cannot detect smaller faces, since the detector is trained for
faces of about 80 by 80 pixels. You must therefore ensure that the faces in the input image are larger
than that; for smaller faces, you should train your own face detector.
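A minimal sketch of dlib's CNN (MMOD) face detector described above; the mmod_human_face_detector.dat model file is assumed to have been downloaded separately, and the image path is a placeholder:

# CNN (MMOD) face detection with dlib; slow on CPU, best run on GPU.
import dlib
import cv2

cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
image = cv2.cvtColor(cv2.imread("group_photo.jpg"), cv2.COLOR_BGR2RGB)

detections = cnn_detector(image, 1)     # upsample once so faces stay above ~80x80 px
for d in detections:
    r = d.rect
    print("Face:", r.left(), r.top(), r.right(), r.bottom(), "confidence:", d.confidence)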
RetinaFace-Resnet50, YuNet, and DSFD work perfectly and are not affected, while the other models
fail in multiple cases, with Haar Cascades and Dlib-HOG performing the worst, as they rely on hand-
crafted features.
1.7.5 Pose
DSFD and RetinaFace-Resnet50 win the race for detecting faces in different poses, with YuNet
performing respectably.
Remember that it will be very slow and won’t make sense for real-time inference.
2. Feature extraction:
The feature extraction step is employed to extract the feature vectors for any human face located in
the first step (face detection). It represents a face with a feature vector called a "signature" that
describes the prominent features of the face image, such as the mouth, nose, and eyes, together with
their geometric distribution.
The goal of feature extraction is to convert high-dimensional, complex data into a set of compact and
informative features that can be easily analysed and used to build predictive models.
Several techniques extract the shape of the mouth, eyes, or nose and use their sizes and distances to
identify the face. HOG, Eigenfaces, independent component analysis (ICA), linear discriminant
analysis (LDA), scale-invariant feature transform (SIFT), Gabor filters, local phase quantization (LPQ),
Haar wavelets, Fourier transforms, and local binary pattern (LBP) techniques are widely used to
extract face features.
Eigenfaces is a specific application of PCA to face images. In this method, a set of face images is
treated as a matrix, and PCA is applied to obtain the principal components, which are used to
represent the faces in a lower-dimensional space. The eigenfaces correspond to the eigenvectors of
the covariance matrix and capture the most important variations among the faces, such as those due
to facial expressions and lighting conditions.
In both PCA and eigenfaces, the goal is to convert high-dimensional, complex data into a set of
compact and informative features that can be easily analyzed and used for tasks like classification
and recognition.
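A minimal eigenfaces sketch with scikit-learn, using the LFW faces bundled with the library (downloaded on first use); the number of components is an arbitrary choice:

# Eigenfaces: PCA applied to face images; each component reshaped to image size
# is one "eigenface".
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
X = lfw.data                                  # each row is a flattened face image

pca = PCA(n_components=100, whiten=True).fit(X)
eigenfaces = pca.components_.reshape((100, lfw.images.shape[1], lfw.images.shape[2]))

# Project faces into the low-dimensional eigenface space (compact signatures).
signatures = pca.transform(X)
print(X.shape, "->", signatures.shape)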
3 assignment of orientation
2.1.5 Gabor filter: Gabor filters are designed to be similar to the receptive fields of simple
cells in the primary visual cortex of the brain. They provide a way to analyze images by representing
texture, orientation, and frequency information.
By varying the Gabor parameters, new filters are generated, and their combined responses describe
the facial texture more completely.
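A small sketch of a Gabor filter bank with OpenCV; the parameter values are arbitrary illustrative choices and the image path is a placeholder:

# A small Gabor filter bank: varying the orientation parameter generates new filters.
import cv2
import numpy as np

image = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical face crop

responses = []
for theta in np.arange(0, np.pi, np.pi / 8):            # 8 orientations
    # Arguments: kernel size, sigma, theta, wavelength, aspect ratio, phase offset.
    kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0)
    responses.append(cv2.filter2D(image, cv2.CV_32F, kernel))

# Concatenating simple statistics of the responses gives a basic Gabor feature vector.
features = np.array([(r.mean(), r.std()) for r in responses]).flatten()
print(features.shape)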
2.1.6 Local binary pattern (LBP): LBP is a texture-based method used in computer
vision and image processing for feature extraction and representation. The idea behind LBP is
to create a unique binary pattern for each pixel in an image by comparing the value of the
center pixel to the values of its surrounding pixels. The resulting binary patterns capture the
spatial relationships between the pixels in the image, and can be used to identify and classify
different textures. LBP has been extended to include rotation invariant, uniform, and multi-scale
variations, which can improve its performance in certain applications.
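A minimal LBP sketch with scikit-image; the image path is a placeholder:

# LBP: compare each pixel with its 8 neighbours on a circle of radius 1, then use
# the histogram of the resulting patterns as a texture descriptor.
import numpy as np
from skimage import io
from skimage.feature import local_binary_pattern

face = io.imread("face.jpg", as_gray=True)              # hypothetical face crop
lbp = local_binary_pattern(face, P=8, R=1, method="uniform")

# Histogram of uniform patterns = compact feature vector (10 bins for P=8).
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, 11), density=True)
print(hist)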
3. Face recognition:
This step takes the features extracted from the face during the feature extraction step
and compares them with the known faces stored in a specific database. There are two general
applications of face recognition: one is called identification and the other is called verification.
During identification, a test face is compared with a set of faces with the aim of finding the most
likely match. During verification, a test face is compared with a known face in the database in order
to make an acceptance or rejection decision. Convolutional neural networks (CNN), k-nearest
neighbours (K-NN), DeepFace, VGG-Face, FaceNet and Siamese neural networks are known to
effectively address this task.
3.1 Techniques used for face recognition
3.1.1 Convolutional Neural Networks (CNNs): CNNs are a popular deep learning architecture for
image classification and object detection tasks, including face recognition.
In face recognition, a CNN takes an input face image and applies multiple convolutional and pooling
layers to extract meaningful features from the image. These features are then fed into fully
connected layers to produce a prediction of the identity of the person in the face image.
One common approach for face recognition using CNNs is to train a network to predict a compact
representation, called an embedding, for each face image. The embeddings are learned such that
similar faces have similar embeddings. At test time, the embedding for a query face image can be
compared to the embeddings for a set of reference face images to find the closest match.
Another approach is to use a CNN to directly predict the identity of the person in a face image, by
training the network to predict a probability distribution over a set of identities. At test time, the
network outputs a prediction for the identity of the person in the query face image.
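A hedged sketch of the second approach (direct identity classification) in PyTorch; the ResNet-18 backbone, the number of identities, and the dummy input batch are illustrative assumptions, not part of any specific published system:

# A CNN that directly predicts a probability distribution over a fixed set of identities.
import torch
import torch.nn as nn
from torchvision import models

NUM_IDENTITIES = 50                         # e.g. number of employees (assumption)

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_IDENTITIES)   # replace the classifier head

faces = torch.randn(8, 3, 224, 224)         # a batch of pre-cropped face images (dummy data)
logits = model(faces)
probabilities = torch.softmax(logits, dim=1)
predicted_identity = probabilities.argmax(dim=1)
print(predicted_identity)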
3.1.2 K-Nearest Neighbors (KNN): KNN is a simple machine learning algorithm that can be used for
face recognition tasks. In KNN-based face recognition, each face image is represented as a feature
vector, which captures the important information about the face.
At test time, the feature vector for a query face image is compared to the feature vectors of a set of
reference face images. The closest K reference face images to the query face image are selected as
the "nearest neighbors", and the identity of the person in the query face image is predicted based on
the majority class of the K nearest neighbors.
KNN is simple to implement and can be used for face recognition tasks with limited computational
resources. However, KNN can be sensitive to the choice of K and the feature representation used,
and may not perform as well as more advanced machine learning algorithms such as Convolutional
Neural Networks (CNNs) in terms of accuracy and robustness to variations in lighting, pose, and
expression.
In recent years, KNN has mainly been used as a baseline comparison method for evaluating the
performance of more advanced face recognition algorithms. However, KNN can still be a useful tool
for face recognition in certain scenarios, such as real-time face recognition on small datasets with
limited computational resources.
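A minimal sketch of KNN-based recognition with scikit-learn, operating on precomputed face embeddings; the embeddings and identity labels are dummy values used only for illustration:

# KNN face recognition on precomputed embeddings (dummy vectors stand in for real ones).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
reference_embeddings = rng.normal(size=(30, 128))            # 30 reference faces, 128-d each
reference_labels = np.repeat(["alice", "bob", "carol"], 10)  # hypothetical identities

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(reference_embeddings, reference_labels)

query_embedding = rng.normal(size=(1, 128))                  # embedding of the query face
print(knn.predict(query_embedding))                          # majority vote of the 3 nearest neighbours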
3.1.3 DeepFace: DeepFace is a face recognition system developed by Facebook in 2014, based on
deep learning algorithms. It was one of the first deep learning-based face recognition systems to
achieve human-level accuracy on standard benchmarks.
DeepFace uses a Convolutional Neural Network (CNN) to extract features from a face image and
produce a compact representation, called an embedding, for each face. The embeddings are learned
such that similar faces have similar embeddings.
At test time, the embedding for a query face image can be compared to the embeddings for a set of
reference face images to find the closest match. The embeddings can also be used for other face
recognition tasks, such as face verification (determining if two face images belong to the same
person) and face clustering (grouping similar face images together).
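A short sketch using the open-source deepface Python library (a community reimplementation, not Facebook's original system); the image paths are placeholders:

# Face verification with the `deepface` library.
from deepface import DeepFace

result = DeepFace.verify(img1_path="employee_01.jpg",     # hypothetical image paths
                         img2_path="query.jpg",
                         model_name="Facenet")            # other options include "VGG-Face"
print(result["verified"], result["distance"])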
3.1.4 VGG-Face: VGG Face is a deep convolutional neural network architecture for face recognition
that was developed by researchers at the Visual Geometry Group (VGG) at the University of Oxford.
The architecture is based on the VGGNet architecture, which was developed for the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) and achieved state-of-the-art performance in image
classification tasks.
In VGG Face, the network takes an input face image and applies multiple convolutional and pooling
layers to extract meaningful features from the image. These features are then fed into fully
connected layers to produce a compact representation, called an embedding, for each face. The
embeddings are learned such that similar faces have similar embeddings.
As with DeepFace, at test time the embedding for a query face image is compared to the embeddings
of a set of reference face images to find the closest match, and the embeddings can also be reused for
face verification and face clustering.
3.1.5 FaceNet: FaceNet is a deep learning-based face recognition system developed by researchers at
Google. It was introduced in 2015 and achieved state-of-the-art performance on standard
benchmark datasets for face recognition at the time.
FaceNet uses a Convolutional Neural Network (CNN) to extract features from a face image and
produce a compact representation, called an embedding, for each face. The embeddings are learned
such that similar faces have similar embeddings. This is achieved by training the network to minimize
the Euclidean distance between the embeddings of similar faces and maximize the distance between
the embeddings of dissimilar faces.
As with the previous models, at test time the embedding for a query face image is compared to the
embeddings of a set of reference face images to find the closest match, and the embeddings can be
reused for other tasks such as face verification and face clustering.
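A hedged sketch using the facenet-pytorch package, which provides a FaceNet-style Inception-ResNet model pretrained on VGGFace2; the input tensors are dummy stand-ins for real, pre-cropped face images:

# FaceNet-style embeddings with facenet-pytorch; small distances indicate the same person.
import torch
from facenet_pytorch import InceptionResnetV1

model = InceptionResnetV1(pretrained="vggface2").eval()

face_a = torch.randn(1, 3, 160, 160)        # dummy tensors standing in for real face crops
face_b = torch.randn(1, 3, 160, 160)

with torch.no_grad():
    emb_a = model(face_a)                    # 512-d embedding
    emb_b = model(face_b)

distance = torch.dist(emb_a, emb_b).item()   # small distance => likely the same person
print(distance)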
3.1.6 Siamese Neural Networks (SNNs): SNNs are a type of deep learning architecture that can be
used for face recognition tasks. SNNs are designed to compare the similarity of two inputs and
determine if they are the same or not.
In the context of face recognition, a Siamese Network can be trained to compare the similarity of two
face images and determine if they belong to the same person. During training, the network is
presented with pairs of images, one of which contains a face of a known individual, and the other
contains a face of a different individual or no face at all. The network then learns to compare the two
images and determine if they are similar or not.
At test time, the SNN can be used to compare a query face image to a set of reference face images.
The network computes a similarity score between the query and each reference image, and the
reference image with the highest similarity score is considered the match.
Siamese Networks have been found to be effective for face recognition tasks, as they can learn to
compare the unique features of faces and determine their similarity, even if the faces are presented
at different scales, rotations, and poses.
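A minimal Siamese network sketch in PyTorch; the encoder architecture and input sizes are arbitrary illustrative choices:

# A Siamese network: two face images pass through the same shared encoder, and the
# distance between their embeddings measures similarity.
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embedding_dim),
        )

    def forward(self, img_a, img_b):
        # The same weights are applied to both inputs ("shared branches").
        emb_a = self.encoder(img_a)
        emb_b = self.encoder(img_b)
        return nn.functional.pairwise_distance(emb_a, emb_b)

net = SiameseNetwork()
pair = torch.randn(1, 3, 112, 112), torch.randn(1, 3, 112, 112)  # dummy face pair
print(net(*pair))   # training would push this distance down for matching pairs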
In summary, the choice of face recognition technique depends on the specific requirements and
constraints of the task at hand, which here is recognizing employees in a meeting room, as well as on
the needs and demands of the company. However, deep learning-based approaches such as CNNs
and Siamese networks have become increasingly popular in recent years due to their ability to learn
high-level features directly from face images and to achieve high accuracy on a wide range of face
recognition benchmarks.
4. Databases Used:
Many databases containing information that enables the evaluation of face recognition systems are
available on the market. However, these databases are generally adapted to the needs of some
specific recognition algorithms, each of which has been constructed with various image acquisition
conditions (changes in illumination, pose, facial expressions) as well as the number of sessions for
each individual. These databases range in size, scope and purpose.
LFW: Labeled Faces in the Wild dataset is a benchmark database of face photographs designed for
studying the problem of face recognition. The LFW dataset consists of more than 13,000 images of
faces collected from the internet, each labeled with the name of the person in the image. It is often
used to train and test face recognition systems, as well as to develop and evaluate new face
recognition techniques.
4.1.1 CASIA-WebFace: a dataset widely used for face recognition tasks. It contains 494,414 face
images of 10,575 real identities collected from the web. It was automatically collected by the CASIA
group and then manually refined. As is common for sets that are collected by looking at celebrities or
famous people, this set presents a long-tail distribution in terms of the images associated with a
subject: a few frequent and usually more famous subjects comprise most of the images, while others
are described by only a few images.
4.1.2 UMDFaces: UMDFaces used a mix of human annotators via Amazon Mechanical Turk (AMT) and
already-trained deep-learning-based face analysis tools to build a medium-sized set that is much
tougher than the previously available sets. Another UMDFaces peculiarity is the fact that, unlike
CASIA and VGGFace, the set contains both still images (usually of high quality) and video frames
(often affected by motion blur). The set provides annotations of facial keypoints, face pose angles,
and gender information. It consists of 367,888 face annotations in still images for 8,277 subjects, as
well as 3.7 million annotated video frames from about 22K videos of 3,100 subjects. Although the
UMDFaces numbers are smaller than those of the other sets, it presents a wider pose distribution
than CASIA and VGGFace.
4.1.3 VGGFace2: is an improved version of VGGFace created in order to mitigate the deficiencies of its
predecessor. VGGFace2 contains 3.31 million images of 9,131 subjects collected among celebrities,
but also other famous people such as professors or politicians. It is designed to cover a large range of
pose, age and ethnicity, and to reduce label noise as much as possible; the label noise was reduced
through a combination of automatic and manual filtering.
IMDb-Face: is a dataset of face images collected from the Internet Movie Database (IMDb) website.
It was created for the purpose of face detection and recognition research. The dataset contains over
80,000 face images of more than 5,000 individuals, making it one of the largest publicly available face
image datasets. The images were annotated with facial landmarks and attributes, such as gender,
age, and facial expression. This new set claims to be the largest noise-controlled face collection.
YTF (YouTube Faces): This dataset contains over 3,000 videos of faces, providing a large and diverse
benchmark for evaluating face recognition in unconstrained videos.
4.1.4 IJB-A (IARPA Janus Benchmark A): The IARPA Janus Benchmark A (IJB-A) database was developed
with the aim of adding more challenges to the face recognition task by collecting facial images with a
wide variation in pose, illumination, expression, resolution and occlusion. IJB-A is constructed by
collecting 5,712 images and 2,085 videos from 500 identities, with an average of 11.4 images and 4.2
videos per subject.
4.1.5 WebFace260M: WebFace260M is a million-scale face benchmark constructed to help the
research community close the data gap behind industry. It consists of:
- a noisy set of 4M identities and 260M faces;
- a high-quality training set of 42M images of 2M identities obtained by automatic cleaning;
- a test set with rich attributes and a time-constrained evaluation protocol.
The MS-Celeb-1M dataset is a large-scale face recognition dataset consisting of 100K identities, each
with about 100 facial images. The original identity labels are obtained automatically from webpages.
4.1.6 MegaFace: is a large-scale face recognition evaluation dataset created by the University of
Washington. It contains over a million images of over 6,000 individuals, making it one of the largest
publicly available face recognition datasets. The dataset was created to evaluate the performance of
face recognition algorithms at a very large (million-image) scale.
MegaFace consists of two parts: a gallery set and a probe set. The gallery set contains images of
individuals that are used as reference templates for recognition, while the probe set contains images
of the same individuals that are used to test the recognition algorithms. The probe set also includes
images of imposters (individuals who are not in the gallery set) to evaluate the ability of the
algorithms to reject people who are not enrolled in the gallery.
The MegaFace dataset has been used in several face recognition benchmarks and has been
instrumental in advancing the state of the art in face recognition technology.
For this project, we are not going to use any of these databases. Instead, we will collect our own
database using a Google Chrome extension to gather images of certain people from Google Image
search.