
A Performance Study on Unconstrained Very Low Resolution Face Recognition

Luis S. Luevano, Leonardo Chang, Miguel Gonzalez-Mendoza


Tecnologico de Monterrey
Av Lago de Guadalupe KM 3.5, Margarita Maza de Juárez, 52926 Cd López Mateos, Méx
{luis.s.luevano, lchang, mgonza}@tec.mx

Heydi Méndez-Vázquez, Yoanna Martínez-Díaz


Advanced Technologies Application Center (CENATAV)
7A 21406 Siboney, Playa, P.C. 12200, Havana, Cuba
{ymartinez, hmendez}@cenatav.co.cu

Abstract

In the past decade, tremendous advancements have been made in the face recognition area, particularly in face recognition in uncontrolled scenarios (face recognition in the wild). This has been partly due to the massive popularity and effectiveness of deep convolutional neural networks as well as the availability of bigger unconstrained datasets. However, challenges for face recognition still remain in the context of very low resolution homogeneous (same domain) and heterogeneous (different domain) face recognition. In this survey, we study the seminal and novel methods that tackle the very low resolution face recognition problem and provide a deep analysis of their design, effectiveness and efficiency for a real-time surveillance application. We also analyze the advantage of employing deep learning convolutional neural networks and present future research directions for efficient lightweight CNN design in this context.

1. Introduction

Automated face recognition has gathered the attention of the scientific community for the past six decades. Applications for facial recognition most importantly include aiding in law enforcement, forensics, surveillance tracking and biometrics authentication [47]. Computing power and storage nowadays allow us to process millions of images in a single computer in order to propose and implement more robust and accurate models. Today, face recognition algorithms can run in real time on a smartphone, which is useful for several applications. However, there are still challenges in this area, which include dealing with uncontrolled environments, image resolution, image artifacts, and the overall robustness, reliability and run time of face recognition algorithms.

The face recognition task in general consists of four steps: face detection, alignment, feature extraction and identity matching or model training. A classic algorithm for the face detection step is the Boosting approach proposed by Viola and Jones [75]. The Viola-Jones face detector uses classifiers in a cascade fashion, making it very robust and efficient at inference time. Face alignment consists of taking face landmarks and warping them to a frontal position. This process is also known as frontalization. Face alignment methods can opt to map the new face position in a 2D space or in a 3D space [8]. More recent approaches based on convolutional neural networks include Multi-task Cascaded Convolutional Networks [97] and the RetinaFace detector [18], where both approaches couple the face detection and alignment steps and show an impressive benefit of training them together. For the feature extraction step, traditional approaches include Eigenfaces [73], Fisherfaces [3], the extraction of Local Binary Pattern Histograms [21], Gabor [15], SIFT [49] and SURF [2] features; and at the start of this decade, works with learned descriptors such as [10] started to emerge. However, these methods struggle with capturing the non-linearity (deformations, pose and lighting conditions) of face appearances in unconstrained scenarios. The face recognition evaluation task has two variants: identification and verification. Identification refers to a one-to-many probe and gallery matching, while verification refers to a one-to-one probe and gallery verification. At the same time, the evaluation can be open-set, where probes can either appear or not in the gallery, or closed-set, where the probes always appear in the gallery.
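To make the two evaluation variants concrete, the following is a minimal sketch (ours, not from any cited method) of verification and closed-set identification over pre-computed embeddings; the 128-dimensional random vectors and the 0.5 threshold are illustrative assumptions.

# A minimal sketch of verification vs. identification with cosine similarity
# over hypothetical, pre-computed face embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe, reference, threshold=0.5):
    # One-to-one: accept or reject a claimed identity.
    return cosine_similarity(probe, reference) >= threshold

def identify(probe, gallery):
    # One-to-many: rank every gallery identity by similarity (closed-set).
    scores = {name: cosine_similarity(probe, emb) for name, emb in gallery.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
gallery = {f"id_{i}": rng.normal(size=128) for i in range(5)}
probe = gallery["id_3"] + 0.1 * rng.normal(size=128)   # noisy probe of id_3
print(verify(probe, gallery["id_3"]))
print(identify(probe, gallery)[0])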
Particularly, automated face recognition in uncontrolled scenarios has made remarkable progress in the last decade, mainly due to the popularization of methods based on neural networks. Face recognition in uncontrolled scenarios, also called face recognition in the wild, refers to identifying or verifying a person from face images taken from scenes with variations in illumination, scale, viewpoint, aging, partial occlusion and image quality. Popular benchmark databases such as Labeled Faces in the Wild (LFW) [34], YouTube Faces (YTF) [83] and CelebA [48], among others, have gathered the most attention from researchers in the face recognition field, where today it is possible to obtain very high accuracy, of more than 96%, even under real-time constraints (more than 30 frames per second) [53].

The resolution of these popular face recognition datasets goes as low as 100 × 100 pixels for the face area [13]. However, face recognition accuracy rapidly declines once we reach face regions with an area of 32 × 32 pixels or below [7]. There is no academic consensus on the terms low resolution and very low resolution. We are going to refer to resolutions of 32 × 32 pixel area and below as very low resolution (VLR) in this paper.

At the start of this decade, research on face recognition in controlled scenarios, using datasets such as CMU-PIE [25] and FERET [58], achieved more than 90% accuracy using traditional coupled mapping techniques [98]. However, these datasets contain images taken from controlled scenarios and do not represent the challenges of a real-world application, such as illumination variations, geometric distortions, noise (camera artifacts) and low resolutions. Moreover, most of the models are trained and evaluated using synthetic datasets, meaning that face images with resolutions higher than 32 × 32 pixels are manually resized to match VLR settings. This poses a significant domain gap between the native images taken from cameras at VLR and the synthetic, artificially resized, ones. Recent efforts for more accurate real-world datasets for face recognition have emerged, which include the SCface dataset [24], UCCS [64], IJB-S [41], TinyFace [13], and Survface [14].

The face recognition problem at very low resolution has two variants: homogeneous and heterogeneous. In homogeneous face recognition, the probe image is in VLR as well as the rest of the dataset. In heterogeneous face recognition, a domain gap exists between the VLR probe image taken from the surveillance camera and the high resolution gallery (reference) image (100 × 100 pixels or more) taken in a controlled environment. The heterogeneous variant of the problem is a harder one due to the domain disparity between the probe and camera resolutions in a varying range of conditions. Figure 1 summarizes the taxonomy of the state-of-the-art solutions for the very low resolution face recognition problem, which is divided into Projection methods, Synthesis methods and Homogeneous Feature Extraction methods. We propose to further divide each approach into traditional and deep learning methods, since they have an important performance gap.

Figure 1: Taxonomy of the very low resolution face recognition state-of-the-art approaches. We propose to divide every approach into traditional and deep learning methods due to the increasing effectiveness and popularity of deep learning methods in the past decade. Deep learning methods tend to have better generalization and can take advantage of graphical acceleration (GPU) hardware to close the gap for a real-time application. Approaches for bringing the application to reasonable inference times on CPU include the traditional Multidimensional Scaling Coupled Mappings and the Lightweight Convolutional Neural Networks homogeneous feature extraction and matching methods.

In contrast to previous studies of very low resolution face recognition [46], [45], and [78], we explore and analyze the efficacy and efficiency of the state-of-the-art methods on challenging datasets for this problem. We focus on the aspects each method uses to solve the very low resolution problem, as well as the elements it uses to be efficient, while also pointing out their limitations. This survey also covers more recent methods from the state of the art which are focused primarily on efficient run-time computation, and we provide new research directions stemming from advancements in lightweight convolutional neural networks and capsule networks.

In this study we make the following contributions:

• Provide an efficiency and efficacy analysis of current state-of-the-art methods for very low resolution face recognition.

• Provide an in-depth study of the advantages and limitations of current approaches to the very low resolution face recognition problems from an application perspective.

• Provide an overview of alternative modern lightweight architectures and their performance on very low resolution face datasets.

• Provide insights for future research directions.

2. Research challenges affecting performance on unconstrained very low resolution face recognition scenarios

In this section we discuss and analyze the specific research challenges at very low resolution settings, such as dataset availability for real-world examples, lack of discriminative features, domain discrepancy and the current landscape of efficient solutions.

2.1. Lack of dataset availability for real-world native examples at very low resolutions

One of the hardest challenges for training and evaluating at native VLR settings is that only a small number of native, real-world very low resolution datasets are publicly available. In the case of homogeneous face recognition, datasets such as TinyFace [13] and QMUL-Survface [14] are publicly available. In the case of the heterogeneous variant, datasets such as SCface [24], Point and Shoot (PaSC) [4] and UCCSface [64] are publicly available. Another dataset, IJB-S [41], is not available to the public at the time of writing. All of these datasets were assembled in recent years to represent a more grounded application of face identification in surveillance scenarios.

Due to the limited availability of real-world datasets, most of the evaluations at very low resolutions were made with datasets such as LFW [34], CASIA-WebFace [90], YouTube Faces [83] and Celeb-A [48], using downsampling methods (such as bilinear interpolation) in an attempt to evaluate performance. However, when comparing the performance of these same models on very challenging datasets such as SCface [24], these models performed very poorly in terms of accuracy. The analogous effect happened when testing models trained on datasets such as CMU-PIE [25] and FERET [58], which consist of data captured in controlled environments: they performed poorly when tested on the SCface dataset.

In order to improve the VLR robustness of methods using deep face recognition, several augmentation strategies have been implemented. For example, the authors in [23] rely on bilinear interpolation to synthesize the dataset for very low resolution; however, this method by itself is not effective in unconstrained settings. Lu et al. [50], on the other hand, resized the images into three different resolutions, in order for the network to learn more mappings due to the varying low resolutions present in the real-world application. The study performed by Méndez-Vázquez et al. [?] concluded that synthesizing datasets and pre-training neural networks using bicubic interpolation and resampling using pixel area relationships are very effective at adding robustness and improving recognition performance in the heterogeneous setting of the problem.
Due to the domain gap between synthetic and native datasets, it is harder for super resolution methods to generalize and become effective in true unconstrained scenarios. Super resolution methods are supervised methods by nature, where normally a ground truth higher resolution image is needed in order to learn a mapping from synthetic low resolution to high resolution, in an effort to match the target HR domain. Attempting to remedy this domain gap, the authors of [13] have opted to use synthetic datasets in combination with native images, where they map the super resolution relationship using the synthetic dataset while hallucinating a high resolution counterpart of the native low resolution face images by sharing weights between these two super resolution sub-networks and keeping the classification sub-networks with their own weights.

2.2. Lack of discriminative information and pose variations

Another challenge in this context is the lack of discriminative information at very low resolutions. Due to the lighting conditions, pose and blurriness, it becomes harder to extract useful features for classification. Current low resolution face recognition methods do not use face alignment for optimizing performance; rather, it is used as a pre-processing step for training any network. Most of the methods for face recognition in this context usually refine the loss function and optimization steps of the network, along with designing multiple-branch architectures. Where previous methods used feature extraction techniques based on SIFT [49], LBP [21], or Gabor [15], now most of the time they are based on residual network architectures or on architectures that already work for general-purpose recognition problems, such as AttentionNets. A study conducted by Sandler et al. [62] explains the importance of the internal network resolution and the need to design networks for the resolution of the problem at hand. In their paper, they introduce the concept of isometric networks, which retain their size throughout the network, yielding better performance regardless of the input resolution. However, they note that at extreme resolutions (14 × 14 and below, for example) input resolution does matter, as it affects the rest of the network's internal layer sizes.
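The effect noted by Sandler et al. can be illustrated with a toy calculation (ours, not from [62]): a stack of stride-2 stages that still leaves a 112 × 112 input with usable 7 × 7 feature maps collapses a 14 × 14 input to 1 × 1 almost immediately.

# How strided stages shrink the internal feature maps of a CNN.
def feature_map_sizes(input_size, strides):
    sizes = [input_size]
    for s in strides:
        sizes.append(max(1, sizes[-1] // s))
    return sizes

print(feature_map_sizes(112, [2, 2, 2, 2, 2]))  # [112, 56, 28, 14, 7, 3]
print(feature_map_sizes(14,  [2, 2, 2, 2, 2]))  # [14, 7, 3, 1, 1, 1]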
2.3. Domain gap between high resolution and low resolution: matching features from different spaces

In the heterogeneous variant of the problem, for datasets such as [4], [64] and [24], a domain gap exists between images taken in high resolution controlled environments and images taken from surveillance cameras. The images from surveillance cameras can present varying lighting conditions, blurriness, heavy pose variations and varying resolutions. Extracting features from varying resolutions in particular becomes a problem because the native space is different in dimension from one image to the next and from the gallery reference images. Firstly, in order for an algorithm to be able to extract features from an image, the input image must be standardized to one or more predefined sizes. It becomes a decision as to which dimensional domain the images get scaled to, where super resolution methods aim to scale them to the dimension of the high resolution gallery image. Secondly, features extracted from the high resolution gallery image have richer and clearer information [51] than the native very low resolution images captured in unconstrained lighting conditions, and as such, the extracted features from a given algorithm greatly vary from one another. Coupled mapping and homogeneous feature extraction methods are the two strategies meant to produce a more robust abstraction of the extracted features in order to match them under the same conditions. This domain gap, however, does not exist in the homogeneous variant of the problem, where all the images and recognition come from the same domain, as in datasets such as [13] [14].

2.4. Efficiency challenges

Classic coupled mapping methods for very low resolution face recognition can run with a small inference time using only a CPU [94]. The aforementioned methods perform very accurately in constrained scenarios. However, when testing them in unconstrained low resolution face recognition scenarios, their accuracy rapidly decays. Newer, more accurate deep learning methods for high resolution unconstrained face recognition [65] [71] are not suitable for run-time applications due to their increased computation requirements for inference. Furthermore, these methods do not perform as accurately in very low resolution unconstrained face recognition scenarios as other state-of-the-art alternatives [17] [50] [?]. In order to achieve state-of-the-art accuracy, deep learning methods such as [50] and [13] use solutions based on multi-architecture networks, which still have heavy computation requirements. Due to the general efficiency requirements of deep learning applications, mainly on mobile and embedded systems, areas such as lightweight face recognition and quantization have emerged as well [60] [38], which explore deep learning techniques using alternative data representations, such as binary and lower-precision ones, with little accuracy penalty. However, no solution aiming to bridge the domain gap using lightweight network design principles has been developed from the ground up, other than the approach proposed by Ge et al. [23] using knowledge distillation techniques.

Furthermore, the vast majority of the methods utilized for low resolution face recognition report results based on accuracy and very few on efficiency, and those that report runtime performance sometimes omit their hardware information, which makes it more challenging for researchers to take efficiency assessments into account.
3. Datasets for very low resolution face recognition

Currently, very low resolution face recognition under surveillance scenarios is a very niche research area with limited datasets available. Recent efforts to expand this type of study have come from the groups behind the datasets described in Table 1. SCface [24], Point and Shoot [4], IJB-S [41], UCCSface [64], QMUL-Survface [14] and QMUL-TinyFace [13] are the available benchmark datasets for unconstrained very low resolution face recognition. Extensive studies across all the previously mentioned benchmark datasets have not been done yet either.

Figure 2 shows samples of all the aforementioned datasets, where we can appreciate that the datasets suited for heterogeneous face recognition benchmarking are SCface, Point and Shoot, UCCSface and IJB-S, because they contain both high resolution and native low resolution imagery. On the other hand, the QMUL-Survface and QMUL-TinyFace datasets are suitable for deep learning homogeneous face recognition due to their large number of images and identities. These are very different problems on their own; however, in a multi-network solution, subnetworks could be pretrained using the VLR homogeneous datasets only, for instance.

Table 1 summarizes the datasets available for the very low resolution face recognition task. The source column indicates from which scenario the data was obtained; the quality column indicates the type of the available images, where HR corresponds to high resolution imagery, LR to very low resolution imagery and blur to blurred low resolution images; the static image/video column indicates whether the dataset contains static images only or also video data; and the remaining columns indicate the total number of identities and number of images (including images from videos).
Figure 2: Examples of subjects in the different datasets available for very low resolution face recognition. SCface [24], Point and Shoot [4], IJB-S [41] and UCCSface [64] are the most suitable for heterogeneous face recognition because they supply both HR and VLR images. QMUL-Survface [14] and QMUL-TinyFace [13] are suitable for the homogeneous face recognition type of problem because only VLR images are available.

Database name | Source | Quality | Static image/video | # subjects | # images
Point and Shoot [4] | Manually collected | HR + blur | static + video | 558 | 12,178
SCface [24] | Surveillance | HR + LR | static | 130 | 4,160
QMUL-Survface [14] | Surveillance | LR | static + video | 15,573 | 463,507
QMUL-TinyFace [13] | Web | LR | static | 5,139 | 169,403
UCCSface [64] | Surveillance | HR + blur | static | 308 | 6,337
IJB-S [41] | Surveillance | HR + LR | static + video | 202 | 3 million+

Table 1: Summary of unconstrained datasets for very low resolution face recognition. Complemented from [45]. Most of the available datasets come from real-world surveillance imagery and contain high resolution and low resolution pairs, except QMUL-Survface and QMUL-TinyFace. However, these two datasets contain the largest number of images and subjects, making them suitable for training in a deep learning solution.

4. Heterogeneous approaches for very low resolution face recognition

This section goes into depth on the state-of-the-art methods that attempt to match images from two different domains: high resolution and very low resolution in native surveillance scenarios. These methods are classified into two variants: Projection methods, known as Coupled Mappings, and Synthesis methods, known as Super Resolution.

4.1. Projection methods: Coupled Mappings

Coupled mapping methods fall into the category of projection methods. This type of method aims to find an adequate representation of data from different domains by projecting all the data to a single unified space. Once the data is projected, similarity metrics can be computed for classification and posterior optimization operations, as illustrated in Figure 3. Table 2 shows a summary of all the projection methods discussed in this subsection.

4.1.1 Classical approaches for coupled mappings

Classical methods for coupled mappings include Coupled Marginal Discriminant Mappings [98], Multidimensional Scaling (MDS) [6] and Pose-Robust MDS [5] by Biswas et al., and others, such as Simultaneous Discriminant Analysis [11] and Coupled Marginal Fisher Analysis [67], which are based on feature extraction, projection and a variant of Linear Discriminant Analysis for classification and matching. The optimization method of choice for MDS problems is the iterative majorization algorithm [81], which is a common optimizer for other types of optimization problems.

The MDS approach by Biswas [6] proposed to optimize the distance between the transformed feature vectors using a combined transformation matrix with three different regularizing terms. The goal of the optimization is to approximate the distance between the transformed feature vectors in the projected space to the distance of the samples in the default high resolution space, where any distance measurement can be utilized. The regularizing terms use an independent parameter to control the rate of approximation to the samples' distance in the high resolution space and the class separability.
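The following is a simplified sketch of the coupled-mapping idea: learn projections W_l and W_h that bring paired LR and HR features together in a common subspace. Plain gradient descent on paired distances stands in for the full MDS objective and its iterative majorization optimizer; the dimensions and learning rate are arbitrary assumptions.

# A toy coupled-mapping optimization: pull paired LR/HR projections together.
import numpy as np

rng = np.random.default_rng(1)
X_l = rng.normal(size=(100, 30))    # hypothetical LR features
X_h = rng.normal(size=(100, 120))   # hypothetical paired HR features
W_l = rng.normal(scale=0.1, size=(30, 40))
W_h = rng.normal(scale=0.1, size=(120, 40))

lr = 1e-3
for _ in range(200):
    # Gradient of 0.5 * ||X_l W_l - X_h W_h||^2 / N with respect to W_l, W_h.
    diff = X_l @ W_l - X_h @ W_h
    W_l -= lr * X_l.T @ diff / len(X_l)
    W_h += lr * X_h.T @ diff / len(X_h)

print(np.linalg.norm(X_l @ W_l - X_h @ W_h))  # residual shrinks over iterations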

Later, Biswas et al. introduced Pose-Robust MDS [5], which added a pose estimation step after the MDS computation. The pose estimation process firstly projects the probe image on the tensor basis extracted from the training low resolution set, then estimates pose using the fiducial locations in relation to the median fiducial locations from the training set. It involves modeling mode matrices for the subjects in the training set, spaces of viewpoint, illumination, and eigenimage vectors [73]. These components are then utilized in a TensorFace component, which is then used to compute the coefficient vectors for the face normalization. The authors also introduced robustness by slightly perturbing the fiducial locations. The MDS and pose components are trained independently, which helps at not heavily taxing the total training time with the pose estimation component and its increased number of fiducial locations due to augmentations, when optimizing the MDS component.

The approach proposed by Zhang et al. [94], called Large Margin Coupled Mappings, constructs inter-class and intra-class graphs, then learns the projection matrices by maximizing the class margins using the weights from the graph distances of the inter-class graph, and vice-versa for the weights of the intra-class graph; later, it introduces the intra-class scatter as a regularizing term. The method uses a scatter matrix approach for computing the inter-class and intra-class distance, where a class centroid is used as reference, the same as in LDA approaches. Figure 4 illustrates the margin enforced by the class graphs in the common subspace. The reported inference runtime is 8.5 milliseconds per image on an Intel Core i5-4200U laptop CPU, which is equivalent to processing 117.65 face images per second.
Method | Approach | Reported metrics
Simultaneous Discriminant Analysis (SDA) (2011) [11] | Learn projection matrices using LDA-based scatter matrices for images from the LR and HR domains and their matching combinations. | Multi-PIE [29] mean accuracy: 96.46% on LR probe. No efficiency metrics reported.
Coupled Marginal Fisher Analysis (2012) [67] | Marginal Fisher Discriminant Analysis-based optimization. The objective function minimizes the inter-class and intra-class ratio of the sum of the distances between the features projected to the unified space. | Multi-PIE [29] mean accuracy: 96.80% on LR probe. No efficiency metrics reported.
Coupled Marginal Discriminant Mappings (CMDM) (2015) [98] | Model and optimize the ratio of intra-class and inter-class similarity (scatter) matrices, solved as an eigen-decomposition problem. Similar to CMFA. | FERET mean accuracy: 88.5%. No efficiency metrics reported.
Multidimensional Scaling (MDS) (2012) [6] | Optimize projected LR and HR feature distances and approximate them to HR feature distances from the source domain. | SCface mean accuracy: 60%. No efficiency metrics reported.
Pose-Robust MDS (2013) [5] | Based on MDS [6]. Model and estimate mode and median matrices from viewpoint, illumination and eigenimage info from the training set. | Multi-PIE: over 80% recognition rate. SCface: outperforms SIFT+PCA, SIFT+LDA, SURF+PCA and LBP on the rank-1 accuracy CMC curve by more than a 10% margin. No efficiency metrics reported.
Discriminative Multidimensional Scaling (DMDS) (2018) [86] | Inspired by MDS [6]. Adds inter-class and intra-class constraints to better project the features pertaining to each class in the latent subspace. | SCface mean accuracy: 79.92%. No efficiency metrics reported.
Local-Consistency Preserved DMDS (LDMDS) (2018) [86] | Complementary to DMDS. Only optimizes the sample distance within the same domains, not across. | SCface mean accuracy: 81.54%. No efficiency metrics reported.
Large Margin Coupled Mappings (LMCM) (2016) [94] | Based on LDA. Maximizes class margins using weights from constructed class graphs. Classes are centroid-based. | SCface mean accuracy: 60.4%. Inference: 8.5 milliseconds per image (117.65 face images per second) on an i5-4200U CPU.
Local Geometry to Global Structure CM (2015) [66] | Finds the mappings that minimize the distance of LR and HR neighbors from the same class. Subsequently, combines these mappings to generate the global projection matrix. | SCface: 43.2% mean accuracy. No efficiency metrics reported.
Deep Coupled ResNet (2018) [50] | Deep learning-based method. Uses a trunk-branch structure: feature extraction using a ResNet-style subnetwork and branch FC subnetworks for LR and HR images. Rescales images as a data augmentation strategy. Optimization using softmax and center loss functions. | SCface: 88.2% mean accuracy. No efficiency metrics reported.
GenLR-Net (2018) [55] | Deep learning-based method. Uses the VGG face architecture as a base. Uses multiple classification losses at different points in the network to model the LR-HR relationship and a final contrastive loss function to learn an LR projection closer to the HR projected features. | LFW: 90.00% mean verification rate. CFP: 77.28% verification rate. No efficiency metrics reported.

Table 2: Summary of coupled mapping techniques. Most of the approaches in the last years are based on Multidimensional Scaling [6] and improved upon it using locality constraints and enforcing inter-class margins. Traditional approaches are considered efficient to run; however, only one of them reports inference time on a laptop CPU. Efficient deep learning approaches have not been researched in the context of projection methods.

The Discriminative Multidimensional Scaling (DMDS) approach proposed by Yang et al. [86] aims to find the projection matrices such that the distance between intra-class samples is effectively minimized in the latent common subspace. This work is directly inspired by the previously mentioned MDS approach [6] and features two additional inter-class and intra-class constraints to add discriminative potential in the latent subspace. These constraints contribute to the projection matrix optimization by using the projected sample distance in the latent subspace according to the class each sample belongs to. Figure 5 graphically illustrates the discriminative space encouraged by the method's constraints. The authors also proposed a variant of the method called Local-Consistency Preserved DMDS (LDMDS), where they changed the optimization of the sample distance to only optimize the distance of the samples from the same domain space, in addition to the MDS optimization function.

The Local Geometry to Global Structure CM approach proposed by Shi et al. [66] aims to minimize the distance between projected features as well; however, this approach uses a k-neighbor approach to influence the distance optimization in both intra-class and inter-class projected groups. In the inter-class constraint, they include an independent term which heavily penalizes the nearest neighbors and maximizes the margin across classes. After this projection, the global structure, concatenating the resulting LR and HR feature vectors, is built and utilized to optimize the projection.
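A small sketch (our notation, not the papers' exact formulation) of the class-graph weights behind these constraints: intra-class pairs receive attracting weights while inter-class pairs receive an independent penalty term that enforces the margin.

# Toy intra-class / inter-class pairwise weight matrices.
import numpy as np

def class_graph_weights(labels, penalty=4.0):
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    W_intra = same.astype(float)    # pull samples of the same class together
    W_inter = penalty * (~same)     # penalize/push apart different classes
    np.fill_diagonal(W_intra, 0.0)  # no self-pairs
    return W_intra, W_inter

W_intra, W_inter = class_graph_weights([0, 0, 1, 1, 2])
print(W_intra)
print(W_inter)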

Figure 3: Basic idea of coupled mapping methods, taken from [98]. Both domains get projected into a single common subspace, where the elements from the same classes get projected closer together to perform classification tasks afterwards.

Figure 4: Overview of the LMCM method, taken from [94]. The proposed inter-class and intra-class margins in the common subspace lead to better recognition rates in unconstrained scenarios as opposed to earlier techniques.

Figure 5: Simplified overview of the DMDS method [86] in contrast to the seminal MDS method [6]. The introduced inter-class and intra-class constraints promote a larger, more discriminative latent feature space.

4.1.2 Deep Learning approaches for Coupled Mappings
Few methods using Deep Learning with a coupled mapping strategy have been proposed for the very low resolution face recognition problem, such as [46] and [30]. However, they are not necessarily trying to model the unified space for cross-resolution face recognition; they rather aim to extract robust features from LR and HR faces. In contrast, the Deep Coupled ResNet method in [50] consists of one residual trunk network, which specializes in feature extraction across different resolutions, and two branch networks, which minimize the distance between intra-class samples; the same loss function is shared across both of these branch networks. This architecture is illustrated in Figure 6. The strength of this method relies on the robustness of the feature extraction with the trunk network while also using modern face recognition heuristics such as a PReLU [72] activation function.

This method achieves the best recognition rate on the SCface dataset: for one of the cameras, at 4.2 meters, it yields an accuracy of 73.3%, which corresponds to more than a 10% advantage over the previously described methods. No other metrics are reported besides accuracy at different camera distances. Even though this method is more effective than the previous ones, it makes assumptions regarding the resolution factor by using bilinear interpolation in various steps, mirroring a data augmentation strategy where we identify that there is room for improvement; a more intelligent method, such as super resolution for face hallucination, could be used to improve the image reconstruction process, or a different network design for synthesizing images of various resolutions.
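Since DCR's optimization combines softmax with the center loss [82], a minimal sketch of the center loss update may help; the feature dimension, the number of classes and the update rate alpha are illustrative assumptions.

# A hedged sketch of a center loss step: pull features toward class centroids.
import numpy as np

def center_loss_step(features, labels, centers, alpha=0.5):
    loss = 0.0
    for f, y in zip(features, labels):
        loss += 0.5 * np.sum((f - centers[y]) ** 2)
        centers[y] += alpha * (f - centers[y])  # move the centroid toward f
    return loss / len(features), centers

centers = np.zeros((10, 128))
feats = np.random.default_rng(2).normal(size=(32, 128))
labels = np.random.default_rng(3).integers(0, 10, size=32)
loss, centers = center_loss_step(feats, labels, centers)
print(loss)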
imals, objects, etc. which is do not translate to the same
Another method using a projection approach is GenLR-Net [55], detailed in Figure 7, which uses the VGG face network as a base. This is mostly a projection method, even though it includes a small super resolution component, which marginally improves performance. The network features two subnetworks: one for low resolution images and the other one for high resolution images. The method uses two loss functions: an inter-intra classification loss before the final convolution and pooling layers and, posteriorly, a contrastive loss after the fully connected layers. The contrastive loss gradient gets propagated only to the low resolution network and not to the high resolution network, since the intuition is to project the features closer to their high resolution counterparts, a similar intuition to [69]. This method was not tested on the SCface dataset.
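A minimal sketch of the contrastive loss in its standard form, which matches the intuition described above (genuine pairs pulled together, impostor pairs pushed beyond a margin); in GenLR-Net its gradient would flow only to the LR branch, a training detail omitted here.

# Standard pairwise contrastive loss over two embeddings.
import numpy as np

def contrastive_loss(f_lr, f_hr, same_identity, margin=1.0):
    d = np.linalg.norm(f_lr - f_hr)
    if same_identity:
        return 0.5 * d ** 2                    # pull genuine pairs together
    return 0.5 * max(0.0, margin - d) ** 2     # push impostor pairs apart

rng = np.random.default_rng(4)
f_hr = rng.normal(size=64)
print(contrastive_loss(f_hr + 0.05 * rng.normal(size=64), f_hr, True))
print(contrastive_loss(rng.normal(size=64), f_hr, False))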
4.2. Super Resolution-based approaches

Super Resolution methods aim to upscale a low resolution image by a factor, usually of 4×. In the context of very low resolution face recognition, there are two types of methods: those that first upscale the very low resolution image and perform face recognition or training afterwards, and those that embed a face recognition constraint in the super resolution network. Figure 8 illustrates the basic idea of super resolution methods.

Having a recognition constraint in the same super resolution process yields better accuracy than running super resolution and recognition separately; running the two tasks separately can actually hinder performance in some datasets, as stated in [13]. An example of this phenomenon is the approach described in [96], which uses a face recognition loss to train the overall super resolution network as well, outperforming state-of-the-art methods not tailored for face recognition purposes, such as the Laplacian Super Resolution Network [42]. Table 3 shows the most successful super resolution methods specifically made for and tested on very low resolution face datasets.

Super Resolution is an active research area; however, most works are focused only on increasing the most popular metrics for this task, the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). The values of these metrics are dataset-dependent and are not a real indicator of discriminative power for a task such as face recognition. Furthermore, for supervised Super Resolution training, the most common approach is to take a high resolution image dataset and use bilinear interpolation to downscale the images in order to be able to train the models. This presents another problem due to the domain disparity between native very low resolution images from the surveillance feed and the synthetically downscaled images. Moreover, the datasets for this task are not restrained to face images; most contain different things such as buildings, animals, objects, etc., which does not translate to the same performance for face hallucination. Most of the modern methods in this area are now neural-network based. To mitigate the synthetic versus native dataset problems, approaches using hybrid datasets (synthetic and native) have been proposed, such as [13].
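The recognition-constrained super resolution idea can be summarized as a combined objective: a pixel reconstruction term plus a weighted identity term. The sketch below is schematic (the weighting lambda and toy tensors are assumptions), not the exact loss of [96] or [13].

# Schematic joint loss: pixel reconstruction + identity cross-entropy.
import numpy as np

def joint_sr_loss(sr_image, hr_image, logits, label, lam=0.1):
    pixel_loss = np.mean((sr_image - hr_image) ** 2)      # reconstruction term
    log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
    identity_loss = -log_probs[label]                     # cross-entropy term
    return pixel_loss + lam * identity_loss

rng = np.random.default_rng(5)
hr = rng.random((64, 64))
sr = hr + 0.01 * rng.normal(size=(64, 64))
print(joint_sr_loss(sr, hr, rng.normal(size=10), label=3))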
4.2.1 Classical Methods for Super Resolution-based Low Resolution Face Recognition

The S2R2 matching [29] proposed by Hennings-Yeomans et al. features simultaneous super resolution and recognition. This method performs the synthesis by first super-resolving an image with a SR matrix, then using another LR matrix to downsample the images and compare them in the low resolution space. The second component of the optimization measures the smoothness of the super-resolved image, and the third component measures the difference between the features extracted from the HR ground truth and those of the super-resolved image.

Other methods for sparse representations include Yang et al. [88] and Zeyde et al. [92], which inspired the later work of Uiboupin et al. [74]. The authors proposed a sparse representation method which uses two different dictionaries: one with natural images and face images, and another with face images only. The recognition part is a 7-state Hidden Markov Model, for seven facial components, with SVD coefficients for feature extraction and recognition, based on [54]. The LR images were modeled as a linear combination of a blurring kernel and a downsampling operator, as such optimizing the reconstruction from the ground-truth high resolution images.

Methods based on Canonical Correlation Analysis (CCA) have also been proposed, such as Coherent Local Linear Reconstruction SR (CLLR-SR) [35] by Huang et al. and 2D CCA Face Image SR [1] by An et al. In CLLR-SR the authors reconstruct particular facial details and the whole face by using CCA to model the relationships between neighboring images across resolutions. The objective is to project the face image features into a coherent subspace from PCA vectors, then make this transformation of the LR features more correlated to the HR features using the base vectors obtained in the CCA step. The later approach proposed by An et al. used 2D CCA [43] instead of the previously proposed 1D CCA. The 2D CCA approach vastly outperformed the former in recognition performance on the CAS-PEAL-R1 and CUHK datasets. 2D CCA does not reshape the image data into 1D vectors, utilizing 2D PCA [39] as a base.
Figure 6: Overview of the Deep Coupled ResNet architecture [50]. This architecture features a single ResNet-style network for feature extraction, where the Coupled Mapping loss fits the generated features using images from both the HR and VLR domains. The network weights are also updated by the center loss [82] function for more accurate face recognition at the classification step.

Figure 7: GenLR-Net [55] structure. The weights from the VGG [57] backbone for the high resolution sub-network are
pre-trained while the low resolution sub-network is fully trained.

The projection coefficient is divided into left and right projections, where a total of two left and right projection matrices for each dimension are optimized. The authors reported an average time for super-resolving one face image of 1.38 seconds on a 2.4 GHz CPU. This presents an improvement in super resolution performance over the spatial representation method [40], which had the closest face recognition performance in the reported datasets.

4.2.2 Deep Learning Methods for Super Resolution-based Low Resolution Face Recognition

The Resolution-Invariant Deep Network (RIDN) [17] by Zheng et al. bridges the domain gap by using bicubic interpolation to upscale the low resolution image to the high resolution space. After that, they use a neural network based on [90] for feature extraction and matching. This network keeps the intermediate representations in low dimensional spaces. The loss function employed in this feature extraction method is the cross-entropy loss. In this method, the network is finding the unifying representation appropriate for the upscaled and gallery images rather than finding a super-resolved image representation useful for recognition.

In a study conducted by Wang et al. [78], they proposed and evaluated various image recognition models. In their Single Network with Super Resolution Pre-training model, they pre-trained an unsupervised super resolution network and posteriorly fine-tuned it with the supervised recognition component (two fully connected layers and a softmax classifier on top). The model that yielded the best recognition performance was the Robust Partially Coupled Network model, shown in Figure 9.
Method | Approach | Reported metric
S2R2 matching (2008) [29] | Compute and optimize a super resolution matrix by measuring the similarity of a reconstruction in the low resolution feature space, while adding smoothness to the super-resolved image. | Multi-PIE identification accuracy: 84.1% for 12 × 12 probes, 62.8% for 6 × 6 probes. Outperforms downsampling the HR image to probe resolution and bilinear interpolation matching on FERET. No efficiency metrics reported.
Uiboupin et al. Sparse Representation (2016) [74] | Based on Hidden Markov Model + SVD components [54]. Uses a downsampling operator, blurring kernel, and dictionary with face images and natural imagery to improve reconstruction. | FERET mean accuracy: 21.60%. No efficiency metrics reported.
Coherent Local Linear Reconstruction (2010) [35] | Use CCA to model neighboring images across resolutions and project to the HR space using PCA vectors. | PSNR on CAS-PEAL face database [22]: 31.18 dB, outperforms PCA reconstruction. No efficiency metrics reported.
2D CCA Face Image SR (2014) [1] | Iteratively solves the eigenvalue problem for CCA in directions X and Y for two left and right projection matrices which are used for upscaling. | CUHK dataset [77] recognition accuracy: 99.31%. Inference time: 1.38 seconds per image on a desktop 2.4 GHz processor.
Resolution Invariant Deep Network (RIDN) (2016) [17] | Uses bicubic interpolation for upscaling the VLR image and a more modern deep network architecture [90] for face representation and matching. | SCface mean recognition accuracy: 74.00%. No efficiency metrics reported.
Single Network with SR (2016) [78] | Pre-trains a super resolution sub-network in an unsupervised fashion, then a supervised fine-tuning step is performed for face recognition. | UCCS dataset rank-1 recognition accuracy: 53.69%. No efficiency metrics reported.
Robust Partially Coupled Network (2016) [78] | Fully coupled super resolution network with downsampled-HR to ground-truth-HR image reconstruction and LR to HR image reconstruction. The Huber loss [36] was used for improving VLR face recognition performance. | UCCS dataset rank-1 recognition accuracy: 59.03%. No efficiency metrics reported.
Complement Super Resolution and Identity (CSRI) (2018) [13] | Two-branch network with shared network weights between the synthetic LR face image branch with supervised super resolution training and the native LR face images without ground truth. Updates both subnetworks with the classification loss. | TinyFace rank-1 recognition performance: 44.8%. No efficiency metrics reported.
Dual Directed Capsule Network (2019) [69] | Extracts image features using a convolution and projects the VLR image features closer to the HR image feature centroid. Super-resolves the projected images using a capsule network [61] architecture with a reconstruction FC module. | UCCS recognition accuracy: 95.81%. No efficiency metrics reported.

Table 3: Summary of super resolution approaches discussed in this section. Traditional approaches have been mostly tested with constrained face recognition datasets, while deep learning approaches have been tested primarily with unconstrained face recognition datasets. Most of these deep learning approaches have only been tested on the UCCS dataset and have made a tremendous leap in recognition performance, with different architecture proposals for training and transferring the knowledge from the recognition component to the super resolution components and across native and synthetic images.

They noted that pre-training using Super Resolution methods was insufficient for recognition purposes and that data augmentation and/or data adaptation were needed for improving recognition performance. In this model, they built on top of the previous one and added a HR-to-HR reconstruction such that the filter information gets transferred to the LR-to-HR reconstruction subnetwork, making this a partially-coupled super resolution network. After this, they used the Huber loss [36] instead of the MSE loss to further improve face recognition performance, due to its lower sensitivity to outliers.
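For reference, the Huber loss in its standard form with the usual delta = 1.0: quadratic near zero like MSE, linear in the tails like MAE, which is the outlier robustness the authors exploit.

# Standard Huber loss, element-wise over a residual array.
import numpy as np

def huber(residual, delta=1.0):
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))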
The Complement Super-Resolution and Identity (CSRI) [13] method, detailed in Figure 10, uses two subnetworks with two parallel branches: one processing the synthetic low resolution images and their high resolution ground truth counterparts, and one processing the native low resolution images for recognition only. The key to this architecture is the shared parameters between both branches. The reported tests were done using their proposed TinyFace benchmark dataset, where their complement super resolution learning strategy (shared weights) yields an 8% Rank-1 accuracy increase over an independent weights strategy. They report a 10.1% Rank-1 accuracy increase when training the recognition network jointly with the super resolution network, as opposed to performing these tasks separately.

Another recently proposed method is the Dual Directed Capsule Network for Very Low Resolution Image Recognition [69], depicted in Figure 11. Firstly, this method utilizes native low resolution images and upscales them using bilinear interpolation for training; it downscales the native high resolution images to low resolution using this approach as well. The authors utilize a feature extraction strategy that takes any high resolution information as an "anchor" to guide the feature extraction of native low resolution images. This is done by introducing a novel "High Resolution-anchor" loss function and also by propagating the recognition gradient in the feature extraction and posterior super resolution stages. The anchor value is learned using every high resolution sample of its class and is then used to modify the extracted low resolution feature such that it gets closer to the anchor value. An important note is that the low resolution information is not used to learn the anchor at any time. Even though this method uses projection methods akin to coupled mapping techniques, the network is trained with a reconstruction loss, with the image reconstruction section at the end of the network. This approach was tested on various recognition datasets, one of them being the UCCS dataset, which is also representative of the problem at hand, achieving a Rank-1 accuracy of 95.81%.
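A hedged sketch of the HR-anchor intuition: the anchor is computed from the HR features of a class only, never from LR features, and extracted LR features are shifted toward it; the variable names and pull factor are ours, not the paper's.

# Toy HR-anchor: class centroid from HR features guides LR features.
import numpy as np

def hr_anchor(hr_features_of_class):
    return hr_features_of_class.mean(axis=0)   # learned from HR samples only

def pull_toward_anchor(lr_feature, anchor, strength=0.5):
    return lr_feature + strength * (anchor - lr_feature)

rng = np.random.default_rng(6)
anchor = hr_anchor(rng.normal(loc=1.0, size=(20, 128)))
lr_feat = rng.normal(size=128)
print(np.linalg.norm(anchor - lr_feat),
      np.linalg.norm(anchor - pull_toward_anchor(lr_feat, anchor)))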

5. Efficient modern approaches to very low resolution face recognition using feature-based matching
Figure 8: Basic idea of super resolution methods for face recognition. In this type of approach, the synthesis is made from the low resolution probe image space to the gallery high resolution space only; the high resolution image remains the same. After this step, a similarity score is computed against the gallery images.

In the past few years, methods for achieving real-time performance in other computer vision tasks on embedded devices have emerged, such as SqueezeNet [37], ShuffleNet [101], ShuffleNetV2 [52], MobileNet [33], MobileNetV2 [63], MobileNetV3 [32], and VarGNet [99]. This set of methods has been proposed for efficiently solving general computer vision tasks such as image recognition, object detection and others. These methods are commonly called Lightweight Convolutional Neural Networks. Face recognition variants have also been proposed most recently, which we discuss in depth in subsection 5.1.
Figure 12 shows a multiply-addition operations (MAdds) benchmark against accuracy performance for these computer vision tasks, to give a general idea of where they stand relative to each other. MobileNetV3 currently claims the best efficiency-accuracy trade-off for general-purpose computer vision tasks as per its authors' evaluation [32]. This metric gives us a general idea of relative performance; however, the accuracy is generally dataset and problem specific.

Figure 9: Overview of the Robust Partially Coupled Network [78].

5.1. Lightweight Convolutional Neural Networks for Face Recognition
Lightweight neural networks specifically tailored for the face recognition task have been reported in the literature, such as MobileFaceNet [12], ShuffleFaceNet [53] and VarGFaceNet [85].

In general, these techniques are based on lightweight general-purpose CNN architectures. In order to have a better efficiency-accuracy trade-off, these lightweight networks employ the following techniques: grouped convolutions with channel shuffling of the outputs, to reduce the number of operations and share information across different input and output channels; variable groups of grouped convolutions, to balance information retention and complexity; point-wise 1 × 1 convolutions, to reduce depth channels and computational complexity while filtering; low-dimension embeddings before the fully connected layers; strides instead of max pooling operations, to reduce complexity and retain more information directly from the data; and inverted bottleneck structures, to first expand the internal channels and then compact them again to match the input channels.
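The following sketch combines several of the listed techniques (point-wise 1 × 1 expansion, a depthwise/grouped 3 × 3 convolution, PReLU activations and an inverted bottleneck with a residual connection) into one generic PyTorch block; it is not the exact block of any of the cited networks.

# A generic inverted-bottleneck block using depthwise and point-wise convs.
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),             # point-wise expand
            nn.BatchNorm2d(hidden), nn.PReLU(hidden),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,  # depthwise (grouped)
                      bias=False),
            nn.BatchNorm2d(hidden), nn.PReLU(hidden),
            nn.Conv2d(hidden, channels, 1, bias=False),             # point-wise project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

x = torch.randn(1, 64, 14, 14)
print(InvertedBottleneck(64)(x).shape)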
Figure 10: Overview of the Complement Super-Resolution and Identity Network [13]. The network is comprised of two networks: one for synthetic LR images with supervised training and one for native LR face images without supervision for super resolution. Both of these networks share their parameters; as such, both networks learn from the classification recognition performance and the super resolution accuracy.

Figure 11: Overview of the Dual Directed Capsule Networks for Very Low Resolution Image Recognition [69]. The authors proposed to project the extracted VLR image features closer to the HR image feature centroid using an HR-Anchor loss as a first step. Then, they used these projected features to feed the capsule network component, which performs the super resolution reconstruction with 3 fully connected layers.

Table 4 shows an overview of the most recent face recognition networks based on mobile CNN architectures, the techniques they use and their computational footprint in FLOPs. These face recognition-specific approaches follow several guidelines for optimized face recognition performance, such as using Global DepthWise convolutions instead of global average pooling [12], using additive angular margin-based loss functions [19], and general design principles such as trying to retain as much fine-grained information as possible by avoiding the use of max pooling [53], which is beneficial for face recognition performance but may not be for general computer vision tasks.
Method | Based on | Efficiency optimizations | Complexity remarks
MobileFaceNet [12] | MobileNetV2 [63] | Global DepthWise convolution instead of Global Average Pooling layer, stride=1 after conv1, 1280-D feature vector | 933.3M FLOPs
ShuffleFaceNet [53] | ShuffleNetV2 [52] | Global DepthWise convolution after Conv5, PReLU activation, added strides and eliminated pooling at conv1, compact 128-D feature vector | 577.5M FLOPs
VarGFaceNet [85] | VarGNet [99] | Teacher-student network architecture, constant channel number within a group, variable number of groups, PReLU activation, point-wise conv before FC layer, knowledge distillation, 512-D feature vector | Teacher model: 24 GFLOPs; student model: 1022M FLOPs

Table 4: Summary of some of the most recent efficient architectures for face recognition.

Approaches based on teacher-student networks for training and efficient inference, named knowledge distillation methods, have also emerged in recent years. One such approach for face recognition in the context of lightweight real-time performance has been proposed in VarGFaceNet [85]. The authors managed to reduce run-time on CPU and GPU to 31 fps on the method variant that they tested on the SCface dataset, where they reported a rank-1 accuracy of 43.5%. It does not achieve the same performance as other, heavier Deep Learning methods, but it does outperform the classical hand-crafted methods that appeared at the start of this decade and earlier deep learning-based methods that were starting to emerge.

An attempt at efficient very low resolution face recognition using knowledge distillation was proposed in [23]. Their method uses a teacher model based on VGGFace and a student network with a simpler design for inference. The authors also used manual resizing to train the student network. The output of the teacher network uses a graph-based approach to accept and reject persons that do not appear in training. The approach performs better than other methods from the state of the art such as [70] and [56], but not than the ones mentioned in the previous Coupled Mappings section, even classical hand-crafted approaches. We consider that the biggest drawbacks of this approach are the fact that there is no re-projection and that the images available for training in the low resolution space are simply resized from the gallery images. When testing on a very challenging database such as SCface, it does not achieve desirable performance.
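A standard knowledge-distillation loss sketch, generic rather than the exact VarGFaceNet or Ge et al. recipe: teacher and student logits are softened with a temperature and their KL divergence is minimized.

# Generic softened-logit distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # KL(teacher || student), rescaled by T^2 as is conventional.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

s = torch.randn(8, 100)  # student logits
t = torch.randn(8, 100)  # teacher logits
print(distillation_loss(s, t))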
Another approach to reduce inference complexity has been quantization, such as the works from [87], [60], and [38]. These approaches focus on reducing the parameter number representation (limiting the number of representation bits, thus reducing the complexity) and changing activation functions to binary operators.

Figure 12: Accuracy against MAdds benchmark for lightweight general-purpose networks for computer vision, taken from [32]. Currently MobileNetV3 shows the best trade-off between multiply-addition operations and accuracy, followed closely by MnasNet and MobileNetV2.
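A toy sketch of the idea of limiting representation bits: uniform 8-bit quantization of a weight tensor with a single scale. Real schemes (per-channel scales, binary activations, quantization-aware training) are considerably more involved.

# Uniform symmetric int8 quantization of a weight tensor.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(7).normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # worst-case rounding error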
6. Discussion on state-of-the-art face recognition performance on unconstrained surveillance scenarios

In this section we discuss and analyze the performance of the aforementioned methods in the context of heterogeneous face recognition, for the SCface dataset in particular, as it is very representative of the heterogeneous VLRFR problem with the research challenges mentioned in Section 2.

6.1. Discussion on state-of-the-art recognition accuracy performance

We compiled the mean recognition accuracy results for the SCface dataset in Table 5 to better illustrate the state-of-the-art panorama. We discuss three key areas of opportunity regarding the state of the art concerning accuracy performance: training methodologies, generalization capabilities and type of approach.
Accuracy performance is heavily affected by the experimental methodology of every study. Even though we can see that the best mean accuracy results are achieved using lightweight CNN approaches, there are still areas of opportunity present within their training and testing methodologies. In order to achieve the best accuracy on this dataset, the authors of [?] pre-trained the networks on the MS-Celeb-1M dataset and then fine-tuned them using the SCface dataset with different up-sampling strategies. Other methods, such as Deep Coupled ResNet [50], up-sampled the images from the SCface dataset to three different resolutions in order to add robustness to the learned representation, effectively synthesizing a dataset three times its original size with no previous pre-training. Another method, DMDS [86], randomly selects 50 subjects from the SCface dataset and uses them to train, without upscaling the VLR images. This generates a gap in evaluating the real limitations of the methods.

We recognize that one of the limitations of deep learning methods is the need for very large datasets in order to train the networks effectively, which is why we support the idea of pre-training the networks on other face recognition datasets such as LFW [34] or MS-Celeb-1M [26] and other benchmark datasets. This creates another area of opportunity in analyzing whether a particular pre-training dataset yields better results for VLR face recognition before fine-tuning to a specific benchmark dataset. As per the results in Table 5, we can see that VGG-Face [57] is unable to learn effective representations for the SCface dataset even after fine-tuning, whereas newer and efficient feature extraction methods are able to learn these representations very well. This means that using VGG-Face-like architectures as a base for VLRFR is not an effective strategy, and that not all of the seminal face recognition CNN architectures perform well in this context, even after fine-tuning.

Furthermore, in order to bring these methods to a real-world application scenario, their generalization capabilities need to be tested as completely as possible. This means that researchers need to perform cross-dataset evaluations when fine-tuning to a certain dataset, effectively avoiding reporting results based on CNN dataset memorization. This would allow researchers to better analyze which methods effectively mitigate the challenges presented in Section 2 and to compare them under the same generalization conditions. Due to the extremely challenging nature of the problem at hand, it is absolutely necessary that the learned representations are as robust as possible for an open-set identification scenario.
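As an illustration of the cross-dataset protocol we advocate, the sketch below computes closed-set rank-1 identification accuracy from embeddings extracted on a dataset that was never used for fine-tuning. The embedding arrays stand in for the output of the fine-tuned network and are hypothetical.

import numpy as np

def rank1_accuracy(gallery, g_ids, probes, p_ids):
    """Rank-1 identification with cosine similarity: every probe is
    matched against a gallery of enrolled identities."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    p = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    sims = p @ g.T                  # probes x gallery similarity matrix
    best = np.argmax(sims, axis=1)  # most similar gallery entry per probe
    return np.mean(g_ids[best] == p_ids)

# fine-tune on dataset A, but report rank-1 on embeddings from dataset B
# (random arrays below stand in for real network outputs)
gallery, g_ids = np.random.randn(100, 128), np.arange(100)
probes, p_ids = np.random.randn(300, 128), np.random.choice(100, 300)
print(rank1_accuracy(gallery, g_ids, probes, p_ids))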
We can also appreciate from Table 5 that the top 10 solutions in average accuracy performance for the SCface dataset are centered on CNN approaches, particularly robust homogeneous feature extraction methods using regular and lightweight CNNs. These methods tend to generalize very well at distances d2 and d3, which are the less challenging distances in comparison to d1, and this is why they have a higher average accuracy. Traditional methods tend to be more consistent across the three distance settings, with LDMDS [86] standing out as the best of them for coupled mappings. This method has a consistent accuracy performance across camera distances, reaching a 62.7% accuracy for the most challenging scenario. It effectively demonstrates that enforcing large margins between classes is an effective strategy, a notion similar to the additive margin loss functions [76] and [19] used in face recognition CNNs. Super resolution methods tend not to be present in this table because their effectiveness is tested on other datasets and their approaches focus on improving the SSIM and PSNR metrics rather than recognition performance. However, a clear standout is the FAN architecture, which demonstrates the effectiveness of using a GAN-like approach by super-resolving images after performing the disentangled feature learning steps.
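A minimal sketch of the additive margin idea just referenced (the AM-Softmax form of [76]; ArcFace [19] instead applies the margin to the angle itself): the target-class cosine logit is penalized by a margin m before scaling, which forces larger inter-class separation. The scale and margin values below are illustrative.

import torch
import torch.nn.functional as F

def am_softmax_logits(embeddings, weights, labels, s=30.0, m=0.35):
    """AM-Softmax-style logits: cos(theta) - m on the target class,
    then a global scale s, fed to the usual cross-entropy loss."""
    e = F.normalize(embeddings)   # unit-length features
    w = F.normalize(weights)      # unit-length class centers
    cos = e @ w.t()               # cosine similarity logits
    onehot = F.one_hot(labels, w.size(0)).float()
    return s * (cos - m * onehot)

emb = torch.randn(8, 128)
cls_w = torch.randn(1000, 128)
labels = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(am_softmax_logits(emb, cls_w, labels), labels)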
Method | d1 (4.2 m) | d2 (2.6 m) | d3 (1.0 m) | Mean accuracy
SCface [24] | 1.82% | 6.18% | 6.18% | 4.73%
CLPM [44] | 3.46% | 4.32% | 3.08% | 3.62%
CSCDN [79] | 6.99% | 13.58% | 18.97% | 13.18%
SSR [89] | 7.04% | 13.2% | 18.09% | 12.78%
L2softmax [59] | 9.20% | 18.80% | 16.80% | 14.93%
CCA [80] | 9.79% | 14.85% | 20.69% | 15.11%
DCA [27] | 12.19% | 18.44% | 25.53% | 18.72%
LM Softmax [95] | 14.00% | 16.00% | 18.00% | 16.00%
AM Softmax [76] | 14.80% | 20.8% | 18.4% | 18.00%
C-RSDA (2017) [16] | 15.77% | 18.08% | 18.46% | 17.44%
RIDN [17] | 23.0% | 66.0% | 74.0% | 24.96%
LDMDS [86] | 62.7% | 70.7% | 65.5% | 66.30%
VGG-Face* [57] | 41.3% | 75.5% | 88.8% | 68.53%
LightCNN* [84] | 35.8% | 79.0% | 93.8% | 69.53%
Center Loss* [82] | 36.5% | 81.8% | 94.3% | 70.87%
VGG-Face-FT [57] | 46.3% | 78.5% | 91.5% | 72.10%
ResNet50-ArcFace* [19] | 48.0% | 92.0% | 99.3% | 79.77%
LightCNN-FT [84] | 49.0% | 83.8% | 93.5% | 75.43%
Center Loss-FT [82] | 54.8% | 86.3% | 95.8% | 78.97%
FAN* [91] | 62.0% | 90.0% | 94.8% | 82.27%
ShuffleFaceNet* [53] | 55.5% | 95.3% | 99.3% | 83.37%
MobileFaceNetV1* [33] | 57.0% | 95.3% | 99.8% | 84.03%
ResNet50-ArcFace-FT [19] | 67.3% | 93.5% | 98.0% | 86.27%
MobileFaceNetV2* [12] | 68.3% | 97.0% | 99.8% | 88.37%
DCR-FT [50] | 73.3% | 93.5% | 98.0% | 88.27%
TCN-ResNet-FT [93] | 74.6% | 94.9% | 98.6% | 89.37%
FAN-FT [91] | 77.5% | 95.0% | 98.3% | 90.27%
ShuffleFaceNet-FT [53] | 86.0% | 99.5% | 99.8% | 95.10%
MobileFaceNetV2-FT [12] | 95.3% | 100.0% | 100.0% | 98.43%

Table 5: Summary of methods tested on the SCface dataset, compiled from [46], [?] and [?]. Each column marks the horizontal distance of the camera to the subject, where d1 at 4.2 meters is the farthest from the subject and, as such, the most challenging one for face recognition purposes. The networks marked with "FT" have been fine-tuned on the SCface dataset. The models marked with an asterisk (*) were not trained or fine-tuned on the SCface dataset but rather on the MS-Celeb-1M dataset [26].

6.2. Discussion on efficiency performance
From the methods in the previous average accuracy performance Table 5 for the SCface dataset, we report the run-time for the efficient lightweight convolutional neural networks in Table 6. This efficiency table gives a general panorama of where lightweight neural networks stand in terms of real-world inference time performance.

When comparing efficiency performance, it is very important to run tests standardized to hardware architectures. Hardware-agnostic metrics such as FLOPs and the number of parameters of a network are often used as indicators to compare network efficiency between proposed architectures. This poses a problem, since these metrics do not translate linearly to real run-time performance on any specific hardware configuration. Other considerations, such as the number of times any given architecture has to access memory, GPU/CPU memory size and memory bandwidth, are important bottlenecks which do affect real-time performance. Furthermore, some authors from earlier literature do not report the specific hardware configuration used when reporting time performance in seconds. Even though it is hard to compare performance between different hardware, even across different CPU generations from the same vendor, reporting the configuration still provides a better idea of which hardware implementation is able to successfully run the proposed methods in any given scenario.

Focusing on real-time hardware performance, Table 6 shows specific hardware configurations for GPUs and one laptop CPU. Using the same hardware configurations as the authors of [?], we included run-time evaluations for VarGFaceNet, which was featured at the LFR Challenge [20], and its base network VarGNet [99]. For affordable hardware, it is possible to achieve real-time recognition performance using low-power GPUs such as a laptop GTX 1050Ti for almost all the networks described in the table, except for ShuffleNet with a 2.0 depth multiplier, which has a larger memory footprint. In the case of CPU performance, MobileFaceNetV1, MobileFaceNetV2 and ShuffleFaceNet are the best candidates for running a recognition application in real-time on affordable laptop CPU hardware. Approaches such as MobileFaceNetV1 and ShuffleFaceNet are able to service two cameras at the same time at around 10 frames per second each, where the accuracy performance detailed in Table 5 favors ShuffleFaceNet. The compact 128-D face descriptor of ShuffleFaceNet favors a substantial increase in performance while preserving the identity information, favoring a fine-tuned scenario. MobileFaceNetV2 is able to generalize better, as per Table 5, and is also able to achieve real-time performance while servicing one camera on the laptop CPU described. This means that MobileFaceNetV2 is a better choice in terms of accuracy-efficiency trade-off if we use the SCface dataset for fine-tuning and compare face descriptors for other identities foreign to that dataset.

Studying the structures of these networks, we can observe that the key elements to balancing the accuracy-efficiency trade-off are focusing the recognition on the centered part of the aligned face image, using the fast downsampling strategy of ShuffleFaceNet, and using a lower-dimensional face descriptor.

In general, traditional Coupled Mapping methods are more efficient than Super Resolution methods. It is true that deep learning-based approaches have surpassed the accuracy of traditional methods; however, an interesting case is LMCM [94]. The authors report an inference time of 8.5 microseconds on a CPU from a previous generation of the CPU used in our test, which is more efficient than any of the deep learning approaches shown in Table 6, while holding a greater accuracy performance at the d1 scenario than the lightweight CNNs without fine-tuning, except for MobileFaceNetV2. In the case of both traditional and deep learning-based Super Resolution methods, efficiency performance is worse due to the need for more complex networks to up-scale the image to HR resolution and then perform the identification and/or verification tasks. The FAN [91] approach, which is a Super Resolution method, has an inference time of 0.016 s on a single GPU much more powerful than the ones used in our tests, the Nvidia Titan X. This inference time is comparable to the inference time of ShuffleNet with a 1.5 depth multiplier on a laptop GTX 1050Ti GPU, which is far less powerful than the Titan X.
Network | # Params. (Millions) | 2× GTX 1080Ti | GTX 1080Ti | GTX 1660Ti | Quadro P2000 | Laptop GTX 1050Ti | Laptop Intel i7 7700HQ
Light CNN-4 [84] | 6.8 M | 5.49 ms | 12.67 ms | 14.09 ms | 41.00 ms | 55.50 ms | 2,653.76 ms
Light CNN-9 [84] | 8.1 M | 6.22 ms | 14.36 ms | 15.88 ms | 40.96 ms | 56.17 ms | 2,106.72 ms
VGG [68] | 144.9 M | 3.39 ms | 7.83 ms | 108.97 ms | 25.17 ms | 35.29 ms | 1,523.40 ms
VGGFace [57] | 41.1 M | 3.99 ms | 9.22 ms | 14.24 ms | 19.64 ms | 21.61 ms | 433.39 ms
ResNet100 [28] | 65.2 M | 2.76 ms | 5.26 ms | 12.96 ms | 48.14 ms | 59.61 ms | 285.64 ms
Light CNN-29 [84] | 31.0 M | 2.01 ms | 4.63 ms | 2.38 ms | 7.95 ms | 7.93 ms | 126.71 ms
VarGFaceNet [85] | 4.9 M | 0.85 ms | 1.48 ms | 3.48 ms | 5.10 ms | 27.09 ms | 126.59 ms
MobileNetv2 [63] | 1.8 M | 0.80 ms | 1.46 ms | 4.50 ms | 11.08 ms | 14.76 ms | 103.69 ms
MobileNetv1 [33] | 3.2 M | 0.69 ms | 1.21 ms | 1.88 ms | 5.56 ms | 20.06 ms | 98.99 ms
VarGNet [99] | 4.2 M | 0.68 ms | 1.18 ms | 2.16 ms | 4.21 ms | 5.42 ms | 70.40 ms
MobileFaceNetV2 [12] | 2.0 M | 0.88 ms | 1.48 ms | 3.27 ms | 5.55 ms | 7.28 ms | 62.45 ms
MobileFaceNetV1 [?] | 3.3 M | 0.74 ms | 1.31 ms | 1.61 ms | 4.91 ms | 8.23 ms | 53.49 ms
ShuffleNet-2.0 [52] | 5.3 M | 1.10 ms | 1.96 ms | 18.79 ms | 25.05 ms | N/A | 42.92 ms
ShuffleFaceNet-2.0 [53] | 4.5 M | 1.00 ms | 1.77 ms | 2.41 ms | 6.36 ms | 6.95 ms | 37.46 ms
ShuffleNet-1.5 [101] | 2.5 M | 0.77 ms | 1.33 ms | 2.77 ms | 5.29 ms | 12.25 ms | 32.98 ms
ShuffleFaceNet-1.5 [53] | 2.6 M | 0.77 ms | 1.34 ms | 1.86 ms | 4.75 ms | 4.68 ms | 29.08 ms

Table 6: Inference time for lightweight convolutional neural networks, as reported in [?], sorted by descending CPU inference time in milliseconds. The bottom part shows the methods suitable for real-time performance of at least 10 frames per second for inference on a laptop CPU. At this time, ShuffleFaceNet with a 1.5 depth multiplier [53] provides the best mean accuracy-efficiency trade-off for real-time CPU performance, as per the efficiency results in this table and the accuracy results in Table 5 for the SCface dataset.
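The measurement caveats above can be made concrete. The sketch below shows the kind of timing protocol behind numbers such as those in Table 6, with warm-up iterations and explicit GPU synchronization so that asynchronous kernel launches are not under-counted; it is a generic recipe under our own assumptions, not the exact script used for the table.

import time
import torch

def mean_inference_ms(model, input_shape=(1, 3, 112, 112),
                      device="cuda", warmup=20, iters=200):
    """Average forward-pass latency in milliseconds on one device."""
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):       # let cuDNN pick kernels, warm caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for all queued kernels
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000.0 / iters

# pass device="cpu" to reproduce a laptop-CPU-style measurement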

7. Discussion on future research directions

In this section we discuss the areas of opportunity of current state-of-the-art approaches and detail how they can make a significant contribution to the VLR heterogeneous face recognition area.

7.1. Efficiency opportunity areas for Capsule Networks

The Capsule Network [31] concept proposed by Hinton et al. and later refined by Sabour et al. [61] is a niche research technique with a lot of potential. Capsule networks have the potential for extracting focused information from different components. These extractors, called capsules, later concatenate their output with the other capsules, effectively building a more robust descriptor. Some of the guidelines for Capsule Networks, or CapsNets, include rejecting the idea of max pooling, due to its loss of information, and instead modeling activity vectors, which for any given class yield a vector of values that can later be used as a representation for classification. However, they represent a challenge since, for mobile and embedded systems, it is very costly to run classification on high-dimensional feature vectors. Efficient CapsNet research is not present in the state of the art and could add robustness and more accurate performance for face recognition in embedded systems. Early research on efficient Capsule Networks [100] has suggested that the major bottleneck is the dynamic routing procedure (which determines which capsule a vector is going to connect to), which is very intensive on calls to memory. The authors have proposed a novel low-level framework and CapsNet architecture design to remedy this problem; however, more research is needed in this area.
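The activity vector idea can be written compactly: each capsule outputs a vector whose orientation encodes pose-like attributes and whose length, compressed with the squashing non-linearity of [61], acts as the probability that the entity is present. The sketch below shows the squash function and a length-based classification readout; the routing procedure itself (the memory bottleneck discussed above) is deliberately omitted, and the capsule dimensions are illustrative.

import torch

def squash(v, dim=-1, eps=1e-8):
    """Squashing non-linearity from [61]: preserves orientation,
    maps the vector length into [0, 1)."""
    sq = (v * v).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * v / torch.sqrt(sq + eps)

# 10 class capsules with 16-D activity vectors for a batch of 4
caps = squash(torch.randn(4, 10, 16))
presence = caps.norm(dim=-1)   # vector length = class presence score
pred = presence.argmax(dim=1)  # classify by the longest activity vector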
7.2. Leveraging CNN multi-branch architectures and loss functions tailored for VLRFR

The domain gap relationship and its evaluation can be modeled at the CNN architecture level and at the loss function level, for a more effective and more generalized feature extraction. Methods such as FAN [91] and CSRI [13], among others, have highlighted the importance and effectiveness of taking inputs from both VLR and HR imaging, synthetic or native, and even of sharing parameters for extracting features across different domain imagery. This needs to be extended to the very robust and efficient feature extraction capabilities of the more recent lightweight convolutional neural networks for face recognition. As of now, we can only utilize lightweight CNNs by up-scaling the very low resolution images, effectively creating a synthetic image that is not representative of the real domain. This presents a very important area of opportunity for these already robust feature extractors. We are confident that introducing and processing native VLR and HR images separately, and using the efficient design principles of lightweight CNNs, can aid generalization considerably.

The most common losses used for the rest of the deep learning methods are Center loss [82], Cross-Entropy Softmax [9] and ArcFace [19]. These loss functions encourage inter-class discriminative ability and a homogenized feature extraction process. However, they do not model or consider the native domain space of the heterogeneous source images in the way some Coupled Mappings and Super Resolution methods do. The only CNN methods using tailored loss functions for VLR and HR imagery are Deep Coupled ResNet [50], CSRI [13] and Dual Directed Capsule Networks [69]. For instance, the coupled mapping loss of DCR consists of a combination of softmax loss and center loss for the HR and LR feature sets independently, and a euclidean loss pulling all the extracted feature vectors to the center. These kinds of loss functions effectively force the network to learn feature representations based on data from the same domain only, making the networks robust for feature extraction in those domains.
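Under our reading of DCR [50], that combination can be sketched as follows: a softmax term per branch plus a Euclidean pull of both HR and LR features toward shared per-class centers. Treat this as an illustrative approximation rather than the authors' exact formulation; the weighting and the learnable centers are assumptions.

import torch
import torch.nn.functional as F

def coupled_mapping_loss(f_hr, f_lr, logits_hr, logits_lr, labels,
                         centers, lam=0.01):
    """Sketch of a DCR-style loss: per-domain softmax + a Euclidean
    pull of HR and LR features toward shared per-class centers."""
    ce = F.cross_entropy(logits_hr, labels) + F.cross_entropy(logits_lr, labels)
    c = centers[labels]  # one shared center per identity
    center_term = ((f_hr - c) ** 2).sum(1).mean() + ((f_lr - c) ** 2).sum(1).mean()
    return ce + lam * center_term

f_hr, f_lr = torch.randn(8, 128), torch.randn(8, 128)
logits_hr, logits_lr = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
centers = torch.randn(100, 128)  # learnable parameters in practice
loss = coupled_mapping_loss(f_hr, f_lr, logits_hr, logits_lr, labels, centers)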
7.3. Untapped potential of Knowledge Distillation approaches

Knowledge Distillation approaches are able to transfer knowledge from a more complex neural network architecture to a simpler one, which is the only one used at test time. We have already discussed that methods such as DCR, FAN, Dual Directed CapsNet, etc., are not suitable for real-time applications; however, they do provide generalization capabilities not present in lightweight convolutional neural networks. We consider that Knowledge Distillation approaches can provide competent generalization using the heavier networks, while leveraging the robust and efficient feature extraction capabilities of the most successful lightweight convolutional neural networks. This would allow the heavier networks to act as teacher networks that transfer the domain knowledge needed for lighter networks to be able to extract features more robust to resolution and unconstrained condition changes.
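A minimal sketch of that teacher-student transfer, assuming the distillation target is the embedding space (one common choice; specific methods differ): the frozen heavy teacher sees HR images, the lightweight student sees the corresponding VLR images, and the student is trained to reproduce the teacher's descriptors.

import torch
import torch.nn.functional as F

def embedding_distillation_loss(student, teacher, hr_batch, lr_batch):
    """Student (lightweight, VLR input) mimics the frozen teacher
    (heavy network, HR input) in a shared embedding space."""
    with torch.no_grad():
        target = F.normalize(teacher(hr_batch))   # teacher descriptors
    pred = F.normalize(student(lr_batch))
    return (1.0 - (pred * target).sum(dim=1)).mean()  # cosine distance

# teacher: e.g. a ResNet-100-sized model; student: a MobileFaceNet-sized
# model; both are assumed to output embeddings of the same dimension.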
8. Conclusions

In this paper we have reviewed the most successful approaches for highly accurate unconstrained very low resolution face recognition, while discussing the limitations and advantages of each approach type in the state-of-the-art. We discussed the factors affecting accuracy and inference time performance, as well as the caveats of using different training methodologies. We analyzed the impact of bridging the domain gap at the architecture level and through loss function design. With this in mind, we have also discussed the most important tendencies in deep learning convolutional neural networks as a whole, with approaches such as Capsule Networks and Knowledge Distillation.

References
[1] L. An and B. Bhanu. Face image super-resolution using 2d cca. Signal Processing, 103:184–194, 2014. Image Restoration and Enhancement: Recent Advances and Applications.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346–359, June 2008.
[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[4] J. R. Beveridge, P. J. Phillips, D. S. Bolme, B. A. Draper, G. H. Givens, Y. M. Lui, M. N. Teli, H. Zhang, W. T. Scruggs, K. W. Bowyer, P. J. Flynn, and S. Cheng. The challenge of face recognition from digital point-and-shoot cameras. In 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, Sep. 2013.
[5] S. Biswas, G. Aggarwal, P. J. Flynn, and K. W. Bowyer. Pose-robust recognition of low-resolution face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3037–3049, Dec 2013.
[6] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):2019–2030, Oct 2012.
[7] B. J. Boom, G. M. Beumer, L. J. Spreeuwers, and R. N. J. Veldhuis. The effect of image resolution on the performance of a face recognition system. In 2006 9th International Conference on Control, Automation, Robotics and Vision, pages 1–6, Dec 2006.
[8] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, 2017.
[9] J. Cao, Z. Su, L. Yu, D. Chang, X. Li, and Z. Ma. Softmax cross entropy loss with unbiased decision boundary for image classification. In 2018 Chinese Automation Congress (CAC), pages 2028–2032, 2018.
[10] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2707–2714, June 2010.
[11] C. Zhou, Z. Zhang, D. Yi, Z. Lei, and S. Z. Li. Low-resolution face recognition via simultaneous discriminant analysis. In 2011 International Joint Conference on Biometrics (IJCB), pages 1–6, 2011.
[12] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. Lecture Notes in Computer Science, pages 428–438, 2018.
[13] Z. Cheng, X. Zhu, and S. Gong. Low-Resolution Face Recognition. arXiv preprint arXiv:1811.08965, pages 1–16, Nov 2018.
[14] Z. Cheng, X. Zhu, and S. Gong. Surveillance face recognition challenge. arXiv preprint arXiv:1804.09691, 2018.
[15] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, 11(4):467–476, April 2002.
[16] Y. Chu, T. Ahmad, G. Bebis, and L. Zhao. Low-resolution face recognition with single sample per person. Signal Processing, 141:144–157, 2017.
[17] D. Zeng, H. Chen, and Q. Zhao. Towards resolution invariant face recognition in uncontrolled scenarios. In 2016 International Conference on Biometrics (ICB), pages 1–8, 2016.
[18] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[19] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[20] J. Deng, J. Guo, D. Zhang, Y. Deng, X. Lu, and S. Shi. Lightweight face recognition challenge. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[21] D.-C. He and L. Wang. Texture unit, texture spectrum, and texture analysis. IEEE Transactions on Geoscience and Remote Sensing, 28(4):509–512, July 1990.
[22] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, and D. Zhao. The cas-peal large-scale chinese face database and baseline evaluations. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1):149–161, 2008.
[23] S. Ge, S. Zhao, C. Li, and J. Li. Low-resolution face recognition in the wild via selective knowledge distillation. IEEE Transactions on Image Processing, 28(4):2051–2062, Apr 2019.
[24] M. Grgic, K. Delac, and S. Grgic. Scface - surveillance cameras face database. Multimedia Tools Appl., 51:863–879, 2011.
[25] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image Vision Comput., 28(5):807–813, May 2010.
[26] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[27] M. Haghighat and M. Abdel-Mottaleb. Low resolution face recognition in surveillance systems using discriminant correlation analysis. In 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pages 912–917, May 2017.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015.
[29] P. H. Hennings-Yeomans, S. Baker, and B. V. K. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[30] C. Herrmann, D. Willersinn, and J. Beyerer. Low-resolution convolutional neural networks for video face recognition. In 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 221–227, Aug 2016.
[31] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Proceedings of the 21st International Conference on Artificial Neural Networks - Volume Part I, ICANN'11, pages 44–51, Berlin, Heidelberg, 2011. Springer-Verlag.
[32] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam. Searching for mobilenetv3. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.
[34] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[35] H. Huang, H. He, X. Fan, and J. Zhang. Super-resolution of human face image using canonical correlation analysis. Pattern Recognition, 43(7):2532–2543, 2010.
[36] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[37] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. ArXiv, abs/1602.07360, 2017.
[38] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
[39] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang. Two-dimensional pca: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131–137, 2004.
[40] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[41] N. D. Kalka, B. Maze, J. A. Duncan, K. O'Connor, S. Elliott, K. Hebert, J. Bryan, and A. K. Jain. Ijb-s: Iarpa janus surveillance video benchmark. 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–9, 2018.
[42] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2599–2613, Nov 2019.
[43] S. H. Lee and S. Choi. Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 14(10):735–738, 2007.
[44] B. Li, H. Chang, S. Shan, and X. Chen. Low-resolution face recognition via coupled locality preserving mappings. IEEE Signal Processing Letters, 17(1):20–23, Jan 2010.
[45] P. Li, L. Prieto, D. Mery, and P. Flynn. Face Recognition in Low Quality Images: A Survey. arXiv preprint arXiv:1805.11519, 1(1), May 2018.
[46] P. Li, L. Prieto, D. Mery, and P. J. Flynn. On Low-Resolution Face Recognition in the Wild: Comparisons and New Techniques. IEEE Transactions on Information Forensics and Security, PP(c):1–1, 2019.
[47] S. Z. Li and A. K. Jain. Handbook of Face Recognition. Springer Publishing Company, Incorporated, 2nd edition, 2011.
[48] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[49] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
[50] Z. Lu, X. Jiang, and A. Kot. Deep Coupled ResNet for Low-Resolution Face Recognition. IEEE Signal Processing Letters, 25(4):526–530, 2018.
[51] Y. M. Lui, D. Bolme, B. A. Draper, J. R. Beveridge, G. Givens, and P. J. Phillips. A meta-analysis of face recognition covariates. In 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, pages 1–8, 2009.
[52] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 122–138, Cham, 2018. Springer International Publishing.
[53] Y. Martinez-Diaz, L. S. Luevano, H. Mendez-Vazquez, M. Nicolas-Diaz, L. Chang, and M. Gonzalez-Mendoza. Shufflefacenet: A lightweight face architecture for efficient and highly-accurate face recognition. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[54] H. Miarnaeimi and P. Davari. A new fast and efficient hmm-based face recognition system using a 7-state hmm along with svd coefficients. 2008.
[55] S. P. Mudunuri, S. Sanyal, and S. Biswas. Genlr-net: Deep framework for very low resolution face and object recognition with generalization to unseen categories. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 602–60209, 2018.
[56] S. P. Mudunuri, S. Venkataramanan, and S. Biswas. Dictionary alignment with re-ranking for low-resolution nir-vis face recognition. IEEE Transactions on Information Forensics and Security, 14(4):886–896, April 2019.
[57] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In X. Xie, M. W. Jones, and G. K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1–41.12. BMVA Press, September 2015.
[58] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The feret evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 22(10):1090–1104, Oct. 2000.
[59] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. CoRR, abs/1703.09507, 2017.
[60] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. Lecture Notes in Computer Science, pages 525–542, 2016.
[61] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3859–3869, USA, 2017. Curran Associates Inc.
[62] M. Sandler, J. Baccash, A. Zhmoginov, and A. Howard. Non-discriminative data or weak model? on the relative importance of data and model resolution. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1036–1044, 2019.
[63] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, June 2018.
[64] A. Sapkota and T. E. Boult. Large scale unconstrained open set face database. In 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, Sep. 2013.
[65] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823. IEEE Computer Society, 2015.
[66] J. Shi and C. Qi. From local geometry to global structure: Learning latent subspace for low-resolution face image recognition. IEEE Signal Processing Letters, 22(5):554–558, May 2015.
[67] S. Siena, V. N. Boddeti, and B. V. K. Vijaya Kumar. Coupled marginal fisher analysis for low-resolution face recognition. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision – ECCV 2012. Workshops and Demonstrations, pages 240–249, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[68] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[69] M. Singh, S. Nagpal, R. Singh, and M. Vatsa. Dual directed capsule network for very low resolution image recognition. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[70] M. Singh, S. Nagpal, M. Vatsa, R. Singh, and A. Majumdar. Identity aware synthesis for cross resolution face recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 592–59209, June 2018.
[71] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[72] L. Trottier, P. Giguere, and B. Chaib-draa. Parametric exponential linear unit for deep convolutional neural networks. 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec 2017.
[73] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586–587, 1991.
[74] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel. Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring. In 2016 24th Signal Processing and Communication Application Conference (SIU), pages 437–440, 2016.
[75] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I, Dec 2001.
[76] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[77] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
[78] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang. Studying Very Low Resolution Recognition Using Deep Networks. arXiv preprint arXiv:1601.04153, pages 4792–4800, 2016.
[79] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution with sparse prior. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 370–378, 2015.
[80] Z. Wang, W. Yang, and X. Ben. Low-resolution degradation face recognition over long distance based on cca. Neural Computing and Applications, 26, 2015.
[81] A. R. Webb. Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28(5):753–759, 1995.
[82] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 499–515, Cham, 2016. Springer International Publishing.
[83] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, pages 529–534, June 2011.
[84] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels, 2015.
[85] M. Yan, M. Zhao, Z. Xu, Q. Zhang, G. Wang, and Z. Su. Vargfacenet: An efficient variable group convolutional neural network for lightweight face recognition. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[86] F. Yang, W. Yang, R. Gao, and Q. Liao. Discriminative Multidimensional Scaling for Low-Resolution Face Recognition. IEEE Signal Processing Letters, 25(3):388–392, Mar 2018.
[87] H. Yang, M. Fritzsche, C. Bartz, and C. Meinel. Bmxnet: An open-source binary neural network implementation based on mxnet. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pages 1209–1212, New York, NY, USA, 2017. ACM.
[88] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[89] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[90] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. ArXiv, abs/1411.7923, 2014.
[91] X. Yin, Y. Tai, Y. Huang, and X. Liu. Fan: Feature adaptation network for surveillance face recognition and normalization, 2019.
[92] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In J.-D. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M.-L. Mazure, and L. Schumaker, editors, Curves and Surfaces, pages 711–730, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[93] J. Zha and H. Chao. Tcn: Transferable coupled network for cross-resolution face recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3302–3306, 2019.
[94] J. Zhang, Z. Guo, X. Li, and Y. Chen. Large Margin Coupled Mapping for Low Resolution Face Recognition. In C. Zhang, H. W. Guesgen, and W.-K. Yeap, editors, PRICAI 2016: Trends in Artificial Intelligence, volume 3157 of Lecture Notes in Computer Science, pages 661–672, Berlin, Heidelberg, 2016. Springer Berlin Heidelberg.
[95] K. Zhang, S. Gu, R. Timofte, Z. Hui, X. Wang, X. Gao, D. Xiong, S. Liu, R. Gang, N. Nan, C. Li, X. Zou, N. Kang, Z. Wang, H. Xu, C. Wang, Z. Li, L. Wang, J. Shi, W. Sun, Z. Lang, J. Nie, W. Wei, L. Zhang, Y. Niu, P. Zhuo, X. Kong, L. Sun, and W. Wang. AIM 2019 Challenge on Constrained Super-Resolution: Methods and Results. arXiv preprint arXiv:1911.01249, Nov 2019.
[96] K. Zhang, Z. Zhang, C.-W. Cheng, W. H. Hsu, Y. Qiao, W. Liu, and T. Zhang. Super-identity convolutional neural network for face hallucination. Lecture Notes in Computer Science, pages 196–211, 2018.
[97] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
[98] P. Zhang, X. Ben, W. Jiang, R. Yan, and Y. Zhang. Coupled marginal discriminant mappings for low-resolution face recognition. Optik, 126(23):4352–4357, 2015.
[99] Q. Zhang, J. Li, M. Yao, L. Song, H. Zhou, Z. Li, W. Meng, X. Zhang, and G. Wang. Vargnet: Variable group convolutional neural network for efficient embedded computing. ArXiv, abs/1907.05653, 2019.
[100] X. Zhang, S. L. Song, C. Xie, J. Wang, W. Zhang, and X. Fu. Enabling highly efficient capsule networks processing through a PIM-based architecture design. Proceedings - 2020 IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, pages 542–555, 2020.
[101] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
