Real Time Face Recognition System at The Edge
Emre Ozen, Fikret Alim, Sefa Burak Okcu, Enes Kavakli, and Cevahir Cigla
Aselsan Inc., Yenimahalle, Ankara, Türkiye
ABSTRACT
Face recognition (FR) technology has gained widespread popularity due to its diverse utility and broad range
of applications. It is extensively used in various domains, including information security, access control, and
surveillance. Achieving good real-time face detection (FD) performance is challenging, especially when
running multiple algorithms that require both high accuracy and swift execution (high frame rate) on embedded
Systems on Chips (SoCs). In this study, a comprehensive methodology and system implementation are proposed
for concurrent face detection, landmark extraction, quality assessment, and face recognition directly at the
edge, without relying on external resources. The approach integrates cutting-edge techniques, including the
utilization of the Extended YOLO model for face detection and the ArcFace model for feature extraction,
optimized for deployment on embedded devices. By leveraging these models alongside a dedicated recognition
database and efficient software architecture, the system achieves remarkable accuracy and real-time processing
capabilities. Critical aspects of the methodology involve tailoring model optimization for SoC environments,
specifically focusing on the YOLO face detection model and the ArcFace feature extraction model. These
optimizations aim to enhance computational efficiency while preserving accuracy. Furthermore, efficient software
architecture plays a crucial role, allowing for the seamless integration of multiple components on embedded
devices. Optimization techniques are employed to minimize overhead and maximize performance, ensuring
real-time processing capabilities. By offering a detailed framework and implementation strategy, this research
contributes significantly to the development of a high-performance, highly accurate real-time face recognition
system optimized for embedded devices.
Keywords: Face Detection, Face Landmark Detection, Face Extraction, Face Recognition, YOLO, ArcFace,
Real-Time Edge Processing, Cameras, Surveillance, Embedded
1. INTRODUCTION
Various biometric techniques exist for human identification, such as iris recognition, fingerprint recognition,
and face recognition. Despite its slightly lower accuracy compared to iris and fingerprint recognition, face
recognition is widely adopted due to its non-contact process.1 Because no physical contact is required to
identify individuals, face recognition stands out as one of the most convenient ways to establish people's
identities, offering users a practical and comfortable method of identity verification.
The primary benefit of face recognition lies in its capacity for mass identification, facilitating its use in densely
populated areas like airports, shopping malls, and other public venues without individuals necessarily realizing
they are being scanned by the system.2 Ensuring quality standards is crucial in face recognition systems due to
the considerable variability present in face images. Elements such as lighting conditions, face expressions, angles,
and environmental noise during image capture can significantly influence the effectiveness of face recognition
systems.2 However, proficient face recognition algorithms along with suitable preprocessing of images can offset
the effects of noise and slight variations in orientation, scale, and illumination.3
Further author information:
Emre Ozen: E-mail: [email protected],
Fikret Alim: E-mail: [email protected],
Sefa Burak Okcu: E-mail: [email protected],
Enes Kavakli: E-mail: [email protected]
Cevahir Cigla: E-mail: [email protected]
The field of face recognition is a multidisciplinary domain that attracts researchers with diverse backgrounds,
spanning psychology, neural networks, and computer vision. These experts contribute to the development of
various methods aimed at addressing the complexities of face recognition.4 Commonly employed techniques
include holistic matching methods, which treat the entire face as a single entity for comparison and recognition;4
feature-based (structural) methods, which extract specific face features or structures such as the eyes, nose,
and mouth and analyze their spatial relationships for recognition;5 and hybrid methods, which blend elements
from both holistic matching and feature-based approaches to enhance accuracy and robustness. A significant
obstacle encountered by feature-based methods is the issue of feature "restoration": the system must recover
features that are obscured or invisible due to significant variations, such as head pose discrepancies when
matching a frontal image with a profile image.5 In hybrid methods, three-dimensional (3D) images are commonly
employed. These images capture the face structure in 3D, enabling the system to detect subtle curves and
contours, particularly around the eye sockets.4
The majority of face recognition systems are typically composed of two main modules: feature extraction
and classification. A variety of combinations of feature extraction and classifier algorithms have been utilized
in the design of these systems. For instance, some systems employ the Histogram of Oriented Gradients (HOG)
with a Support Vector Machine (SVM) classifier,6 while others use HOG with a Relevance Vector Machine
(RVM) classifier,7
or Principal Component Analysis (PCA) with SVM.8 Convolutional Neural Network (CNN), a deep learning
algorithm, is often recommended for applications involving images because it integrates both feature extraction
and classification tasks.9
In this study, we present a novel application that integrates face detection (FD) and face recognition (FR)
functionalities, capable of operating collaboratively within a cost-effective system. This combined system is
designed to function seamlessly with security cameras, offering real-time performance and high accuracy. The
combination of FD and FR algorithms proves particularly advantageous for surveillance and security applications,
notably in access control scenarios. After FD, the post-processing steps aim to enhance the recognition accuracy
of detected faces while filtering out those that cannot be recognized prior to FR analysis. Additionally, these
filters and enhancements help in mitigating noise and aligning face features, thereby optimizing the input data
for the FR process. Consequently, only the refined and improved face images are forwarded to the FR stage,
streamlining the recognition process and improving overall system performance.
In this paper, we have developed and implemented a sophisticated face detection algorithm based on the
You Only Look Once (YOLO)v310 model. YOLO frames detection as a single regression problem rather than
relying on region proposals as in Region-based Convolutional Neural Networks (R-CNN),10 and it is widely
recognized in the current literature for its exceptional balance between processing speed and detection accuracy.
Furthermore, our approach extends beyond mere face detection: we have incorporated the Additive Angular
Margin Loss (ArcFace)11 model for precise face feature extraction. This
advanced technique ensures that the extracted face features are accurately represented, enhancing the overall
robustness of our identification system. Moreover, our methodology involves the training of an extended version
of the YOLOv3 model using additional wild images. By incorporating a diverse range of images, including those
captured in unconventional or challenging conditions, we aim to bolster the efficacy of our system, particularly in
surveillance scenarios where environmental variables can significantly impact performance. We demonstrate the
feasibility of identifying individuals checking in for access control using a single camera, provided it is properly
installed.
This study emphasizes the optimization of software and algorithms to enable the simultaneous execution
of FD and FR applications on an embedded platform. By leveraging the platform’s features and employing
optimized high-performance machine learning (ML) models, the two algorithmic steps can now run concurrently
at a speed of 12.5 frames per second. This provides sufficient speed for real-time face recognition tasks and is
an important step in dealing with resource constraints in embedded systems. With its optimized performance,
this study highlights the practical use of embedded platforms for tasks requiring fast and accurate face analysis,
contributing to the advancement of computer vision systems in real-world applications.
The paper is structured to systematically explore various aspects of face detection and recognition on embed-
ded platforms. In Sec. 2, an overview of related works in this domain is provided, offering insights into existing
research and developments. Sec. 3 articulates the motivation driving this study, emphasizing the significance of
optimizing algorithms for real-time execution. Moving forward, Sec. 4 delineates the proposed implementation,
detailing the integration of face detection and recognition subsystems, as well as the System-on-Chip (SoC)
implementation. Experimental results, including accuracy assessments and timing measurements, are presented
in Sec. 5, alongside an explanation of evaluation metrics and methodologies used. Finally, Sec. 6 encapsulates
the study’s conclusions, summarizing key findings and outlining avenues for future research in this field. This
structured approach ensures a coherent presentation of the research, facilitating comprehension and guiding
further exploration.
2. RELATED WORK
In contemporary studies, significant attention has been drawn to both FD and FR algorithms, highlighting
their crucial importance and broad appeal. Although classical methodologies have traditionally shaped the
development of these algorithms, the emergence of machine learning has heralded a transformative era, with
learning-based techniques coming to the forefront and dominating the field.
The study by Kumar et al.12 extensively examines various techniques researched for face detection in digital
images. Within this comprehensive review, Kumar et al. systematically categorize FD methodologies into two
primary groups: feature-based and image-based approaches. Feature-based techniques, as elucidated, encap-
sulate methodologies such as feature analysis, shape model, and low-level methods. Conversely, image-based
methodologies encompass a diverse array of strategies, including neural networks, linear subspaces, and statisti-
cal approaches.
In an additional investigation, W. Yang et al.13 present a real-time FD system built upon the YOLO
architecture. In recent years, deep learning algorithms for object detection have advanced rapidly, falling into
two main categories: two-stage detectors such as Faster R-CNN and one-stage detectors such as YOLO. While
YOLO and its variants offer a significant speed advantage over two-stage detectors, they may lag behind in
accuracy. Their method incorporates anchor boxes and a regression loss function specifically tailored for face
detection, and the improved detector significantly enhances accuracy while maintaining fast detection speed.
S.B. Okcu et al.14 delineate an efficient approach for Face Quality Assessment (FQA) through the incorpora-
tion of face quality score computation alongside a face landmark detection network. As expounded in the paper
and exemplified in Fig. 1, distortions such as blurring, occlusion, and variations in head orientation manifest in
surveillance scenarios, significantly impacting the quality of captured face images and consequently, the perfor-
mance of face recognition systems. The paper establishes a vital link between face quality assessment and face
recognition scores, particularly in surveillance contexts, which aligns closely with the objectives of our research.
Notably, the FD network proposed in our study is an extended variant of the one-shot methodology elucidated
in S.B. Okcu et al.’s work. In a single iteration, our network offers comprehensive insights, encompassing face
region detection, landmark localization, and evaluation of face quality metrics, as depicted in Fig. 1.
In the realm of face recognition, researchers have explored two main categories of methods: shallow and deep
approaches. Shallow methods rely on handcrafted local image descriptors such as Local Binary Patterns (LBP),16
Histograms of Oriented Gradients (HOG),17 and the Scale Invariant Feature Transform (SIFT),18 which capture
specific patterns within small face regions. Deep methods, in contrast, are characterized by their utilization of a
Convolutional Neural Network (CNN) feature extractor, a trainable function crafted by combining various linear
and non-linear operations. A prominent example within this category is DeepFace.19 DeepFace employs a CNN
trained on a vast dataset comprising 4 million face images belonging to 4,000 distinct individuals for face
classification tasks. Additionally, it adopts a siamese network architecture, where the same CNN is employed
to process pairs of faces, yielding descriptors that are subsequently compared using the Euclidean distance
metric. To enhance network generalization, the authors introduced a bootstrapping technique for selecting
identities during training, and they demonstrated that tuning the dimensionality of the fully connected layer
can enhance network performance. The DeepFace architecture is shown in Fig. 2.
The Deep Convolutional Neural Network (DCNN) framework is widely employed for extracting features
from images. DCNN comprises multiple layers that facilitate accurate feature learning. In this framework, pre-
learned features are utilized as filters, convolving through input images to generate features.20 These features
are subsequently utilized by other layers within the network, as elucidated by Krizhevsky et al.21
Figure 1. One-shot face landmark detection and face quality assessment can be performed at the edge. Processed image is taken from Ref. 15.
Techniques rooted in convolutional neural networks have been introduced for face expression recognition, such as the model
proposed by Kahou et al.22 However, their model necessitates extensive training with additional face datasets.
Ouellet23 utilized the Deep Convolutional Activation Feature for Generic Visual Recognition (DeCAF)24 for
face feature extraction, which obviates the need for extensive training. Nonetheless, DeCAF’s computational
efficiency is hindered, rendering it impractical for training even on modest image datasets, as it lacks GPU
support.
Jiankang Deng et al.11 propose an innovative technique to enhance the discriminative power of face recognition
models and ensure training stability. As illustrated in their methodology, the dot product between the features
extracted by the DCNN and the weights of the final fully connected layer is equivalent to the cosine distance after
normalization. They employ the arc-cosine function to compute the angle between the current feature and the
target weight. Subsequently, an additive angular margin is introduced to the target angle, and the target logit
is recalculated using the cosine function. This process is followed by rescaling all logits by a fixed feature norm,
resembling the subsequent steps outlined in the softmax loss function.11 In their ArcFace paper,11 the authors
showcase a direct optimization strategy for the geodesic distance margin, leveraging the precise correlation noted
between the angle and arc within the normalized hypersphere. They provide intuitive insights into this process by
analyzing the angle statistics between features and weights in the 512-dimensional space. Their ArcFace usage
with DCNN is shown in Fig. 3.
The thesis by Hintikka25 provides a comprehensive overview of the face recognition pipeline, outlining
four main stages. Initially, faces are detected in the video stream through methods like Viola-Jones, SSD,
or YOLO. Subsequently, face image preprocessing ensues, involving the alignment of face images and extraction
of face landmarks crucial for further processing. Feature extraction from face images is conducted using Convolu-
tional Neural Networks (CNNs), with advanced loss functions like triplet loss or ArcFace11 employed to enhance
discriminative power. Following this, the embeddings (feature vectors) are classified through nearest neighbor
search, ensuring that embeddings for faces with similar semantic characteristics are proximate in some metric space.
Figure 3. Training a DCNN for face recognition supervised by the ArcFace loss. Image is taken from Ref. 11.
The paper emphasizes the significance of the ArcFace loss function in augmenting the discriminative
capabilities of embeddings, which ensures close proximity of embeddings for faces with similar semantics, thus
significantly improving classification performance. Furthermore, the system's real-time implementation is high-
lighted, optimized to run efficiently on embedded hardware platforms such as the NVIDIA Jetson TX2, enabling
seamless and efficient face recognition processes by integrating face detection, alignment, feature extraction, and
classification.
3. MOTIVATION
As elaborated in Sec. 2, existing research on face recognition systems predominantly focuses on algorithmic
advancements without prioritizing real-time performance considerations, particularly at edge devices. Although
some studies have addressed the real-time implementation of either FD or FR algorithms separately, the in-
tegration of these two functionalities into a single, real-time solution for edge devices with high-performance
requirements remains largely unexplored. This challenge stems from the inherent computational demands of
both FD and FR algorithms, making simultaneous execution on low-cost edge devices with real-time constraints
quite complex. It’s worth noting that edge devices typically feature low-power and cost-effective processors,
which necessitates efficient utilization of available resources.
In our study, our primary objective was to develop a solution capable of executing both FD and FR algorithms
on low-cost edge devices while meeting stringent real-time, accuracy, and performance criteria. To accomplish
this, we employed optimized machine learning models tailored for each algorithm and devised a sophisticated
software architecture. This architecture was designed to maximize the utilization of hardware resources while
ensuring efficient execution of the algorithms. We also explored the potential integration of more powerful
processing units such as Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs) to
augment computational capabilities, thereby facilitating more robust real-time performance.
Furthermore, we implemented various post-processing techniques to mitigate false alarms and bolster over-
all accuracy. Comprehensive performance evaluations, encompassing accuracy and timing measurements, were
meticulously conducted to validate the viability of our proposed system for edge deployments. Through ex-
haustive experimentation and use-case analyses, we provide nuanced insights into the practical benefits and
implications of integrating FD and FR algorithms, underscoring the transformative potential of this amalga-
mated approach across diverse real-world applications.
4. PROPOSED IMPLEMENTATION
Hardware support available through the SoC implementation is utilized to expedite the FD and FR processes.
In Sec. 4.1 and Sec. 4.2, we delve into the algorithmic investigations, encompassing the machine learning models
and post-processing methodologies. In Sec. 4.3, we elucidate the software implementation used to craft this
hybrid application on the SoC platform.
4.2 Face Recognition
Performing face recognition directly at the edge offers several advantages. Firstly, processing and storing data
on the device preserves user privacy: the need to send data to central servers is eliminated, reducing potential
privacy vulnerabilities. Additionally, conducting face recognition at the edge provides faster response times for
real-time applications. Processing data on the device also reduces the dependency on an internet connection,
providing independence and accessibility. Lastly, edge face recognition can operate independently of central
servers, making systems more reliable and secure. These advantages explain the preference for edge face
recognition and its increasing popularity.
Driven by this motivation, we have trained a face recognition DCNN model supervised with the ArcFace
loss. To construct our machine learning model, we took inspiration from the research conducted by Jiankang
Deng et al.11 In ArcFace, the cosine of the angle (cos θ) between a feature and each class weight is calculated
by normalizing the features and the fully connected (FC) layer weights and taking their inner product. The
arc-cosine is applied to obtain the angle, an additive angular margin of m is introduced only for the correct
label, and the logit is recomputed with the cosine function before the Softmax loss is applied. This prevents the
FC layer weights from becoming overly dependent on the input dataset, promoting a more robust model. This
calculation is illustrated in Fig. 3. Additionally, through network optimization techniques, including the
reduction of unnecessary feature dimensions, we decreased the feature dimensionality from 512 to 256. This
reduction yields faster computational times while preserving high-accuracy performance.
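For completeness, the ArcFace loss described above can be written in its standard form,11 where N is the batch size, n is the number of classes, θj is the angle between the normalized feature of sample i and the normalized FC weight of class j, yi is the ground-truth label, m is the additive angular margin, and s is the rescaling factor:

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{e^{s\cos(\theta_{y_i}+m)}}
             {e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}
```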
To begin with, the proposed ArcFace model is employed to establish a comprehensive face database. This
database is structured to include a distinctive identification (ID) for each face entry, coupled with its corre-
sponding face feature vector. This database comprises data from 10,000 distinct individuals. The ArcFace model
operates optimally on face images resized to dimensions of 112 x 112 pixels. Following the face detection pro-
cess, detected faces are meticulously aligned and subsequently cropped to fit the requisite 112 x 112 dimensions,
facilitating seamless integration into the face recognition pipeline.
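As an illustration of the alignment step, the five detected landmark points can be mapped onto a canonical template with a similarity transform. The sketch below uses OpenCV; the template coordinates are the commonly used five-point reference from the InsightFace project and are an assumption here, not values stated in this work:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Align a detected face to the 112 x 112 ArcFace input by mapping the five
// detected landmarks onto a canonical template with a similarity transform.
// The template values are the widely used InsightFace reference points and
// are an assumption here, not values taken from this paper.
cv::Mat alignFace(const cv::Mat& frame, const std::vector<cv::Point2f>& landmarks)
{
    static const std::vector<cv::Point2f> kTemplate = {
        {38.2946f, 51.6963f},   // left eye
        {73.5318f, 51.5014f},   // right eye
        {56.0252f, 71.7366f},   // nose tip
        {41.5493f, 92.3655f},   // left mouth corner
        {70.7299f, 92.2041f}    // right mouth corner
    };

    // 2 x 3 similarity transform (rotation + uniform scale + translation).
    cv::Mat warp = cv::estimateAffinePartial2D(landmarks, kTemplate);

    cv::Mat aligned;
    cv::warpAffine(frame, aligned, warp, cv::Size(112, 112));
    return aligned;
}
```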
During this process, the ArcFace model extracts a total of 256 feature descriptors per face, which are
preserved and stored as a feature vector. Using the acquired face feature vectors, the system establishes a
quantifiable measure of similarity between a newly encountered face and those stored within the database by
computing the dot product of the feature vector of the detected face with each normalized feature vector present
within the database. The resultant scores reflect the degree of resemblance, with values ranging from 0 to 1.
Subsequently, the system compares these similarity scores against a predefined threshold determined by the
user. Faces surpassing this threshold are deemed sufficiently similar, and the system identifies the face exhibiting
the highest similarity score, indicative of the closest match within the database. Upon successfully identifying
a matching face, the system retrieves the pertinent information associated with the identified individual from
the database, encompassing details such as name, personal identification number, and any additional metadata
linked to the identified individual.
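A minimal sketch of this database search, assuming L2-normalized 256-dimensional float features stored one per record (the DbEntry structure and function names are illustrative):

```cpp
#include <opencv2/core.hpp>
#include <string>
#include <vector>

struct DbEntry {           // illustrative database record
    std::string id;        // unique person identifier
    cv::Mat     feature;   // 1 x 256 L2-normalized CV_32F feature vector
};

// Return the index of the best match above `threshold`, or -1 if none.
int findBestMatch(const cv::Mat& query,              // 1 x 256, L2-normalized
                  const std::vector<DbEntry>& db,
                  double threshold)
{
    int bestIdx = -1;
    double bestScore = threshold;
    for (size_t i = 0; i < db.size(); ++i) {
        // Dot product of normalized vectors equals the cosine similarity.
        double score = query.dot(db[i].feature);
        if (score > bestScore) {
            bestScore = score;
            bestIdx = static_cast<int>(i);
        }
    }
    return bestIdx;
}
```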
By orchestrating this process, the system integrates face recognition technology with robust database
management principles, facilitating efficient and accurate face identification and verification across a myriad
of applications and scenarios.
4.3 SoC Implementation
The SoC implementation primarily serves as a smart camera application tailored for surveillance purposes. It
efficiently handles tasks such as capturing raw video data from a sensor block, applying video analytics for real-
time processing, encoding media, and creating an RTSP27 stream for seamless streaming. The end product is a
sophisticated surveillance camera equipped with a motorized zoom lens for enhanced functionality.
Internally, the camera’s SoC features a Dual-core ARM Cortex-A7 CPU, an Image Signal Processor (ISP),
an intelligent video engine (IVE), and a neural network interface engine (NNIE). These components work syn-
ergistically to provide processing power of up to 1.0 TOPS (Tera Operations Per Second), ensuring smooth and
efficient operation of the camera’s functionalities.
In this application, the sensor’s native resolution is set at 4K (3840 x 2160), which serves as the primary
streaming channel. Additionally, we generate down-scaled channels with a resolution of 512 x 288 specifically for
FD purposes. To achieve this, we utilize hardware resources for scaling operations, ensuring minimal timing loss.
Both the main and sub-streaming channels are directed to the encoder module directly to minimize video latency
and prevent frame rate drops. It’s important to note that video analytics applications may operate at lower frame
rates compared to the encoder’s frame rate capabilities. Hence, it’s essential to maintain separate pipelines for
the encoder and video analytics processes. The analytics results are then integrated into the metadata stream
as they are generated by the algorithms. This approach allows for flexibility in frame rates between the video
analytics and streaming processes, ensuring efficient utilization of resources while maintaining the desired level
of performance and functionality.
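One minimal way to realize this decoupling is a latest-frame queue between the two pipelines, so the analytics thread silently drops frames when it falls behind while the encoder path never blocks; the sketch below is illustrative and not the exact mechanism used on the SoC:

```cpp
#include <condition_variable>
#include <mutex>
#include <opencv2/core.hpp>

// Single-slot "latest frame" queue between the streaming and analytics
// pipelines: the producer overwrites any unconsumed frame, so the encoder
// path never blocks and the slower analytics thread simply skips frames.
class LatestFrameQueue {
public:
    void push(const cv::Mat& frame) {
        std::lock_guard<std::mutex> lock(mtx_);
        latest_ = frame.clone();      // replace the pending frame, if any
        hasFrame_ = true;
        cv_.notify_one();
    }
    cv::Mat pop() {                   // blocks until a frame is available
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return hasFrame_; });
        hasFrame_ = false;
        return latest_;
    }
private:
    std::mutex mtx_;
    std::condition_variable cv_;
    cv::Mat latest_;
    bool hasFrame_ = false;
};
```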
The application employs frames from the sensor block at various scales, directing them to both the encoder
and FD pipelines. Since the sensor output is in YUV format while the networks require RGB images, an initial
color space conversion is conducted to transform it into the necessary RGB format. The FD network operates
on the down-scaled RGB images to produce crucial outputs including face coordinates, face quality assessments,
and landmark points. Following this, the post-processing steps outlined in Sec. 4.1 are applied to determine
the final faces intended for transmission to the FR process. Faces deemed appropriate for recognition undergo
cropping from the full-resolution image and are then transformed into JPEG28 format, utilizing the capabilities
of the OpenCV26 library. Moreover, the extracted face coordinates are dispatched to the metadata generator
for integration into the RTSP stream. This inclusion ensures that information about detected faces accompanies
the video stream, enhancing contextual comprehension and enabling further downstream processing or analysis.
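The per-frame preprocessing described above might look as follows in OpenCV; NV12 is assumed for the sensor's YUV layout (the actual layout is SoC-specific), and in the deployed system the scaling itself is offloaded to the hardware scaler rather than performed with cv::resize:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Prepare the detector input and encode a detected face crop as JPEG.
// `yuvFull` is the raw 4K sensor frame; an NV12 layout is assumed here.
void preprocessAndEncode(const cv::Mat& yuvFull,
                         const cv::Rect& faceRoi,        // face box in 4K coordinates
                         cv::Mat& detectorInput,
                         std::vector<uchar>& jpegBuffer)
{
    // Color space conversion: the networks expect RGB while the sensor emits YUV.
    cv::Mat rgbFull;
    cv::cvtColor(yuvFull, rgbFull, cv::COLOR_YUV2RGB_NV12);

    // Down-scaled 512 x 288 channel for the face detection network
    // (performed by the hardware scaler in the deployed system).
    cv::resize(rgbFull, detectorInput, cv::Size(512, 288));

    // Crop the face from the full-resolution image and encode it as JPEG.
    cv::imencode(".jpg", rgbFull(faceRoi), jpegBuffer);
}
```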
Using the ArcFace model, a face database is constructed with unique identities and corresponding feature
vectors. During the FD process, detected faces are aligned and subsequently cropped to create OpenCV Mat
vectors tailored for recognition. The steps described in Sec. 4.2 are then applied to determine whether a detected
face is recognized: the processed face information is used for similarity computation, and upon identifying faces
exceeding a specified threshold in the database, the system selects the face with the highest similarity score
among the candidates. Information associated with the identified face is retrieved from the database and
transmitted to users as metadata. Matching faces are transferred to an FTP29 server in JPEG format. By comparing incoming
photos from the FTP server with metadata, scenarios suitable for recognition applications are identified, thereby
enabling effective face recognition across various applications.
In our approach, we prioritize optimization techniques to enhance the efficiency and speed of our face recog-
nition system. Leveraging the hardware resources of the SoC, we ensure that computational tasks are executed
with maximum efficiency, leveraging the parallel processing capabilities and specialized instructions available
within the SoC architecture.
Additionally, we capitalize on the optimization capabilities provided by OpenCV, a powerful computer vision
library widely used in our system. OpenCV is specifically compiled with ARM Neon support, a technology that
enables Single Instruction, Multiple Data (SIMD) operations on ARM processors. This integration allows us to
exploit the full potential of SIMD instructions provided by ARM Neon, which significantly accelerates critical
image processing functions such as scaling, color space conversion, and similarity calculations.
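Whether a given OpenCV build can actually exploit NEON can be verified at runtime, for example:

```cpp
#include <iostream>
#include <opencv2/core.hpp>

int main()
{
    // Reports whether this OpenCV build detects NEON on the host CPU and
    // whether its optimized (SIMD) code paths are enabled.
    std::cout << "NEON available: "
              << (cv::checkHardwareSupport(CV_CPU_NEON) ? "yes" : "no") << '\n'
              << "Optimized code paths: "
              << (cv::useOptimized() ? "on" : "off") << std::endl;
    return 0;
}
```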
By harnessing both hardware and software optimizations, we achieve significant improvements in computational
efficiency and speed. This ensures that our face recognition system operates with optimal performance, even
when dealing with large datasets or real-time processing requirements.
Figure 6. Proposed system-on-chip application pipeline.
Overall, these optimization strategies play a crucial role in enabling our system to deliver accurate and efficient face recognition across diverse
applications and scenarios.
5. EXPERIMENTAL RESULTS
The experimental setup involved a surveillance camera environment, with images from various
datasets. These images were processed through a script to measure the accuracy and speed of the FD and FR
models. The evaluation focused on comparing the performance of several state-of-the-art FD models, including
MTCNN,30 RetinaFace,31 yolov5Face,32 and a proposed model, on the WiderFace dataset. The assessment
included measuring their average precision (AP) scores across easy, medium, and hard subsets. Furthermore,
the evaluation extended to FR models, labeled as ”Proposed Medium” and ”Proposed Low,” distinguished by
their feature sizes. The accuracy of these models was scrutinized across diverse datasets, considering their
effectiveness in recognizing faces in different scenarios. This comprehensive analysis aimed to provide insights
into the effectiveness of various FD and FR models in surveillance applications, considering both accuracy and
speed metrics across different datasets and scenarios.
In this study, the models were trained on the Glint360K33 dataset to ensure fairness and
robustness in the experimental setup. The performance of the face recognition models is then evaluated across
a diverse range of datasets. The evaluation is conducted on several benchmark datasets, including AgeDB-30,34
CFP-FF,35 CFP-FP,35 CALFW,36 CPLFW,37 and LFW.38 These datasets are widely recognized in the field of
face recognition and are commonly utilized to benchmark the performance of state-of-the-art models.
The AgeDB-30 dataset consists of 16,488 face images of celebrities from diverse backgrounds, with each image
tagged with identity and gender information. This dataset is commonly utilized in face recognition research.
The CFP dataset comprises 10 folders, each housing 350 pairs of same-person and 350 pairs of different-person
images for both frontal-frontal (CFP-FF) and frontal-profile (CFP-FP) experiments. The Labeled Faces in
the Wild (LFW) dataset encompasses 13,233 web photographs of 5,749 individuals. These
images are further categorized into 6,000 pairs of faces distributed across 10 separate splits. The Cross-Pose
LFW (CPLFW) dataset is an updated version of the LFW dataset with a specific focus on cross-pose face
recognition. Designed to be more challenging than its predecessor, CPLFW introduces a variety of poses to
test face recognition algorithms under diverse conditions. The Cross-Age LFW (CALFW) dataset is a refined
iteration of the LFW dataset, emphasizing cross-age face recognition. Tailored to present greater challenges
compared to its predecessor, CALFW introduces pairs of faces with significant age differences to assess face
recognition algorithms under diverse conditions.
The experimental results in Table 1 exhibit the AP scores of various FD models evaluated on the WiderFace
dataset across different difficulty levels. Among the tested methods, MTCNN, RetinaFace, yolov5Face,32 and
the proposed model are compared in terms of their AP performance on easy, medium, and hard subsets of the
dataset. MTCNN achieves a respectable AP score of 0.785 on the easy subset, whereas RetinaFace demonstrates
improved performance across all difficulty levels with AP scores of 0.899, 0.876, and 0.711 on easy, medium,
and hard subsets, respectively. yolov5Face surpasses both MTCNN and RetinaFace, particularly excelling on
the easy subset with an impressive AP score of 0.937. However, the most notable observation arises from the
performance of the proposed model. Despite being competitive with yolov5Face on the easy subset, the proposed
model outperforms all other methods on the medium and hard subsets, achieving AP scores of 0.908 and 0.765,
respectively. This significant improvement over existing state-of-the-art methods underscores the efficacy and
importance of the proposed model in addressing the challenges posed by more complex scenarios within the
WiderFace dataset. Furthermore, the consistent performance of the proposed model across varying difficulty
levels suggests its robustness and adaptability in real-world face detection applications. These findings position
the proposed model as a promising advancement in the field of face detection, offering enhanced accuracy and
reliability, particularly in challenging environments where existing methods may falter.
Table 2 provides a detailed analysis of the experimental findings derived from evaluating the face recognition
models across various datasets. The study scrutinizes two distinct models, "Proposed Medium" and
"Proposed Low," characterized by feature sizes of 512 and 256, respectively. Both models exhibit
robust performance across the datasets, showcasing high accuracy rates, particularly on datasets such as
CFP-FF and LFW. Notably, the performance of "Proposed Medium" and "Proposed Low" in this study
surpasses that of the reference model reported in the "ArcFace: Additive angular margin loss for deep face
recognition" paper by Jiankang Deng et al.11 In
addition to achieving superior accuracy, the ”Proposed Low” model also demonstrates impressive computational
efficiency, further highlighting its practical utility. The comparative advantage demonstrated by the ”Proposed
Low” model emphasizes the effectiveness of this study’s incorporation of dimensionality reduction techniques to
optimize feature representation within face recognition systems. By leveraging such techniques, this study not
only enhances accuracy but also improves processing speed, thereby advancing the usability and performance
of face recognition technologies. This holistic approach underscores the significance of this research endeavor in
contributing to the advancement of face recognition systems.
Table 2. Verification accuracy of the FR models across the benchmark datasets.

Model Name           Feature Size   AgeDB-30   CFP-FF   CFP-FP   CALFW   CPLFW   LFW
ResNet-50 (Ref. 11)  512            0.981      0.999    0.988    0.960   0.943   0.997
Proposed Medium      512            0.984      0.998    0.987    0.960   0.944   0.998
Proposed Low         256            0.982      0.998    0.988    0.962   0.938   0.997
Table 3 presents a comprehensive comparison of the face recognition (FR) models, including the "Proposed
Medium" and "Proposed Low" models alongside the reference model.11 This comparison
aims to evaluate various aspects such as feature size, input image size, memory requirements, and extraction
time, providing insights into the performance of each model on the system. When considering the reference paper
model,11 it’s evident that it offers a feature size of 512, an input image size of 112 x 112 pixels, and requires 44.0
MB of memory. The extraction time for this model is 32.2 ms, with a similarity comparison time of 0.016 ms. In
contrast, the "Proposed Medium" model shares the same feature size and input image size as the reference
model but demands more memory (89.8 MB) while offering a slightly faster extraction time (28.8 ms); the
similarity comparison time remains the same at 0.016 ms. The standout performer among the models is the
”Proposed Low” model, which offers a reduced feature size of 256 while maintaining the same input image size.
This model significantly reduces memory requirements to 51.8 MB and achieves the fastest extraction time of
19.2 ms, along with a similarity comparison time of 0.009 ms. Considering these results, the ”Proposed Low”
model emerges as the preferred choice due to its efficient memory utilization and faster processing times. In
scenarios where devices face memory constraints or require real-time FR capabilities, the ”Proposed Low” model
offers a compelling solution. Its reduced memory footprint conserves resources, enabling smoother multitasking
and improved system performance. Moreover, the quicker extraction times enhance responsiveness, which is
crucial for applications requiring rapid face recognition. Therefore, the ”Proposed Low” model stands out as an
optimal choice for systems with memory and performance constraints, offering a balance between efficiency and
effectiveness in FR applications.
Table 4 delineates the average execution times of the various stages within the distinct pipelines deployed
on the SoC for FD and FR. In the case of FD, the computational
time is primarily attributed to detection, consuming 35 milliseconds (ms), while the subsequent post-processing
phase requires an additional 4.8 ms, culminating in a total execution time of 39.8 ms. Conversely, within the FR
pipeline, the most time-intensive operation is feature extraction, demanding an average of 19.2 ms, followed by
a minimal 1 millisecond for post-processing and a subsequent 20 ms for database search, aggregating to a total
of 40.2 ms. The combined FD + FR pipeline, integrating both detection and recognition functionalities, demon-
strates a heightened computational load compared to individual pipelines. Detection and feature extraction
collectively consume 54.2 ms, while post-processing and database search retain their respective time allocations
of 5.8 ms and 20 ms, resulting in a total execution time of 80 ms. In addition to the provided information, it
is noteworthy that through harnessing the parallel processing capabilities of our hardware, we have achieved a
significant reduction in processing time from 100 ms to 20 ms for a face similarity check function with a similarity
threshold of 0.01 across a dataset of 10,000 images. This accomplishment underscores the profound impact of
leveraging hardware parallel processing, enabling simultaneous processing of the face similarity check function
across multiple images, thus yielding a remarkable enhancement in overall performance.
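As an illustration of such parallelization at the software level, the database scan can be distributed across cores with OpenCV's parallel_for_; this sketch is indicative only, since the deployed system relies on the SoC's hardware capabilities:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Score the query feature against every enrolled feature in parallel.
// `features` holds one 1 x 256 L2-normalized CV_32F row per identity.
std::vector<double> parallelSimilarity(const cv::Mat& query,
                                       const std::vector<cv::Mat>& features)
{
    std::vector<double> scores(features.size());
    cv::parallel_for_(cv::Range(0, static_cast<int>(features.size())),
                      [&](const cv::Range& range) {
                          for (int i = range.start; i < range.end; ++i)
                              scores[i] = query.dot(features[i]);  // cosine similarity
                      });
    return scores;
}
```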
6. CONCLUSION
In this research, we successfully developed a unified FD and FR application, which was seamlessly integrated into
a 4K surveillance camera system. Our implementation also featured user-defined database support, enabling real-
time edge analysis capabilities. Through this endeavor, we aimed to enhance the functionality and efficiency of
surveillance systems, particularly in scenarios where quick and accurate identification of individuals is paramount.
Furthermore, the integration of user-defined database support introduced a level of customization, allowing users
to adapt the system to suit their individual requirements and preferences. This feature facilitated real-time edge
analysis, enabling swift identification and response to security threats or other pertinent events.
The proposed system exhibits real-time functionality with superior performance and accuracy in both FD and
FR. While each component operates independently at 25 frames per second (fps), the combined execution rate
decreases to 12.5 fps. Nevertheless, this speed remains adequate for real-time applications in practical
settings. Within the system, the face detection model employs the YOLOv3 model to extract face boundaries,
five key landmark points, and quality assessments for face recognition, all in a single pass. Additional steps are
implemented beyond detection networks to track and post-process data, eliminating redundant information and
enhancing FR accuracy. Additionally, it should be noted that in the FR application, the ArcFace model is used
to extract 256 features. This model provides feature vectors used for face recognition. These vectors represent
unique features among faces and are utilized to perform the recognition process. After feature extraction, each
vector is compared against a database containing 10,000 human records in this application.
This study offers a comprehensive comparative analysis of face detection and face recognition models across
diverse datasets and scenarios. Particularly noteworthy is the superior performance of the proposed model in
both speed and accuracy. In face detection tasks, the proposed model exhibits faster processing times compared
to existing models, without compromising on accuracy. Additionally, in face recognition, the proposed model
achieves high accuracy rates while maintaining efficient processing speeds. These findings underscore the signif-
icance of the proposed model in advancing the fields of computer vision and pattern recognition, offering both
speed and accuracy enhancements for real-world applications.
Deploying the FD and FR system with its dedicated database at the edge presents various advantages,
including improved speed, privacy, security, scalability, and cost efficiency. This approach minimizes reliance on
external resources, leading to faster response times and greater adaptability across different deployment scenarios,
thereby establishing it as a highly effective solution for surveillance applications.
REFERENCES
[1] Chen, S. and Chang, Y., in [International Conference on Artificial Intelligence and Software Engineering
(AISE2014)], p. 21, DEStech Publications, Inc. (2014). ISBN 9781605951508.
[2] Thakkar, D., "Top five biometrics: face, fingerprint, iris, palm and voice," Bayometric (2018).
[3] Jafri, R. and Arabnia, H. R., "A survey of face recognition techniques," Journal of Information Processing
Systems 5(2), 41–68 (2009).
[4] Parmar, D. N. and Mehta, B. B., “Face recognition methods & applications,” arXiv preprint arXiv:1403.0485
(2014).
[5] Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A., “Face recognition: A literature survey,” ACM
computing surveys (CSUR) 35(4), 399–458 (2003).
[6] Cherifi, D., Kaddari, R., Hamza, Z., and Amine, N.-A., "Infrared face recognition using neural net-
works and hog-svm," in [2019 3rd International Conference on Bio-engineering for Smart Technologies
(BioSMART)], 1–5, IEEE (2019).
[7] Karthik, H. and Manikandan, J., “Evaluation of relevance vector machine classifier for a real-time face
recognition system,” in [2017 IEEE international conference on consumer electronics-Asia (ICCE-Asia) ],
26–30, IEEE (2017).
[8] Faruqe, M. O. and Hasan, M. A. M., “Face recognition using pca and svm,” in [2009 3rd international
conference on anti-counterfeiting, security, and identification in communication ], 97–101, IEEE (2009).
[9] Lawrence, S., Giles, C. L., Tsoi, A. C., and Back, A. D., “Face recognition: A convolutional neural-network
approach,” IEEE transactions on neural networks 8(1), 98–113 (1997).
[10] Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., “You only look once: Unified, real-time object
detection,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 779–788
(2016).
[11] Deng, J., Guo, J., Xue, N., and Zafeiriou, S., “Arcface: Additive angular margin loss for deep face recogni-
tion,” in [Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ], 4690–4699
(2019).
[12] Kumar, A., Kaur, A., and Kumar, M., "Face detection techniques: a review," Artificial Intelligence
Review 52, 927–948 (2019).
[13] Yang, W. and Jiachun, Z., “Real-time face detection based on yolo,” in [2018 1st IEEE International
Conference on Knowledge Innovation and Invention (ICKII) ], 221–224 (2018).
[14] Okcu, S. B., Özkalaycı, B. O., and Çığla, C., “An efficient method for face quality assessment on the
edge,” in [Computer Vision – ECCV 2020 Workshops ], Bartoli, A. and Fusiello, A., eds., 54–70, Springer
International Publishing, Cham (2020).
[15] Wong, Y., Chen, S., Mau, S., Sanderson, C., and Lovell, B. C., “Patch-based probabilistic image quality
assessment for face selection and improved video-based face recognition,” in [IEEE Biometrics Workshop,
Computer Vision and Pattern Recognition (CVPR) Workshops ], 81–88, IEEE (June 2011).
[16] Tan, X. and Triggs, B., “Fusing gabor and lbp feature sets for kernel-based face recognition,” in [Interna-
tional workshop on analysis and modeling of faces and gestures ], 235–249, Springer (2007).
[17] Dadi, H. S. and Pillutla, G. M., “Improved face recognition rate using hog features and svm classifier,”
IOSR Journal of Electronics and Communication Engineering 11(4), 34–44 (2016).
[18] Cinbis, R. G., Verbeek, J., and Schmid, C., “Unsupervised metric learning for face identification in tv
video,” in [2011 International Conference on Computer Vision ], 1559–1566, IEEE (2011).
[19] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L., “Deepface: Closing the gap to human-level performance
in face verification,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ],
1701–1708 (2014).
[20] Mayya, V., Pai, R. M., and Pai, M. M., “Automatic facial expression recognition using dcnn,” Procedia
Computer Science 93, 453–461 (2016).
[21] Krizhevsky, A., Sutskever, I., and Hinton, G. E., “Imagenet classification with deep convolutional neural
networks,” Advances in neural information processing systems 25 (2012).
[22] Kahou, S. E., Froumenty, P., and Pal, C., “Facial expression analysis based on high dimensional binary
features,” in [European Conference on Computer Vision ], 135–147, Springer (2014).
[23] Ouellet, S., “Real-time emotion recognition for gaming using deep convolutional network features,” arXiv
preprint arXiv:1408.3750 (2014).
[24] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T., “Decaf: A deep
convolutional activation feature for generic visual recognition,” in [International conference on machine
learning], 647–655, PMLR (2014).
[25] Hintikka, T., Real-time Single-shot Face Recognition using Machine Learning, Master’s thesis (2019).
[26] Bradski, G., “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools (2000).
[27] Schulzrinne, H., Rao, A., and Lanphier, R., "RFC 2326: Real time streaming protocol (RTSP)," RFC
Editor, United States (1998).
[28] Wallace, G., “The jpeg still picture compression standard,” IEEE Transactions on Consumer Electron-
ics 38(1), xviii–xxxiv (1992).
[29] Postel, J. and Reynolds, J., “File transfer protocol,” tech. rep. (1985).
[30] Zhang, K., Zhang, Z., Li, Z., and Qiao, Y., “Joint face detection and alignment using multitask cascaded
convolutional networks,” IEEE signal processing letters 23(10), 1499–1503 (2016).
[31] Deng, J., Guo, J., Ververas, E., Kotsia, I., and Zafeiriou, S., “Retinaface: Single-shot multi-level face
localisation in the wild,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition], 5203–5212 (2020).
[32] Qi, D., Tan, W., Yao, Q., and Liu, J., “Yolo5face: Why reinventing a face detector,” CoRR abs/2105.12931
(2021).
[33] An, X., Deng, J., Guo, J., Feng, Z., Zhu, X., Jing, Y., and Liu, T., "Killing two birds with one stone:
Efficient and robust training of face recognition cnns by partial fc," in [Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition], (2022).
[34] Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., and Zafeiriou, S., “Agedb: The first
manually collected, in-the-wild age database,” in [Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) Workshops], (July 2017).
[35] Sengupta, S., Chen, J.-C., Castillo, C., Patel, V. M., Chellappa, R., and Jacobs, D. W., “Frontal to profile
face verification in the wild,” in [2016 IEEE winter conference on applications of computer vision (WACV)],
1–9, IEEE (2016).
[36] Zheng, T., Deng, W., and Hu, J., “Cross-age lfw: A database for studying cross-age face recognition in
unconstrained environments,” arXiv preprint arXiv:1708.08197 (2017).
[37] Zheng, T. and Deng, W., “Cross-pose lfw: A database for studying cross-pose face recognition in uncon-
strained environments,” Beijing University of Posts and Telecommunications, Tech. Rep 5(7), 5 (2018).
[38] Huang, G. B., Mattar, M., Berg, T., and Learned-Miller, E., "Labeled faces in the wild: A database
for studying face recognition in unconstrained environments," in [Workshop on Faces in 'Real-Life' Images:
Detection, Alignment, and Recognition], (2008).