Image and Vision Computing
Review article
A R T I C L E  I N F O

Article history:
Received 21 July 2015
Received in revised form 11 June 2016
Accepted 19 September 2016
Available online 26 September 2016

Keywords:
Cross-modality face recognition
Heterogeneous face recognition
Sketch-based face recognition
Visual–infrared matching
2D–3D matching
High–low resolution matching

A B S T R A C T

Heterogeneous face recognition (HFR) refers to matching face imagery across different domains. It has received much interest from the research community as a result of its profound implications in law enforcement. A wide variety of new invariant features, cross-modality matching models and heterogeneous datasets have been established in recent years. This survey provides a comprehensive review of established techniques and recent developments in HFR. Moreover, we offer a detailed account of datasets and benchmarks commonly used for evaluation. We finish by assessing the state of the field and discussing promising directions for future research.

© 2016 Elsevier B.V. All rights reserved.
Contents
1. Introduction  29
2. Outline of a HFR system  31
   2.1. Representation  31
   2.2. Cross-modal bridge strategies  32
   2.3. Matching strategy  32
   2.4. Formalizations  32
   2.5. Summary and conclusions  33
3. Matching facial sketches to images  33
   3.1. Categorization of facial sketches  33
   3.2. Facial sketch datasets  34
   3.3. Viewed sketch face recognition  34
      3.3.1. Synthesis-based approaches  35
      3.3.2. Projection based approaches  36
      3.3.3. Feature based approaches  36
   3.4. Forensic sketch face recognition  36
   3.5. Composite sketch based face recognition  37
   3.6. Caricature based face recognition  38
   3.7. Summary and conclusions  38
      3.7.1. Methodologies  38
      3.7.2. Challenges and datasets  38
http://dx.doi.org/10.1016/j.imavis.2016.09.001
HFR research divides most obviously by the pairs of imagery considered. We consider four cross-modality applications: sketch-based, infra-red based, 3D-based and high–low resolution matching. More specifically they are:

• Sketch: Sketch-based queries are drawn or created by humans rather than captured by an automatic imaging device. The major example application is facial sketches made by law enforcement personnel based on eye-witness descriptions. The task can be further categorized into four variants based on the level of sketch abstraction, as shown in the left of Fig. 1.
• Near infrared: Near infrared (NIR) images are captured by infrared rather than visual-light devices. NIR capture may be used to establish controlled lighting conditions in environments where visual light is not controllable. The HFR challenge comes in matching NIR probe images against visual light images. A major HFR application is access control, where enrollment images may use visual light, but access gates may use infra-red.
• 3D: Another common access control scenario relies on an enrollment gallery of 3D images and 2D probe images. As the gallery images contain more information than the probe images, this can potentially outperform vanilla 2D–2D matching, if the heterogeneity problem can be solved.
• Low-resolution: Matching low-resolution against high-resolution images is a topical challenge under contemporary security considerations. A typical scenario is that a high-resolution 'watch list' gallery is provided, and low-resolution facial images taken at standoff distance by surveillance cameras are used as probes.

Fig. 1 offers an illustrative summary of the categories of HFR literature covered in this survey. Table 1 further summarizes the studies reviewed, broken down by the modalities and methodological focus.

Related areas not covered by this review include (homogeneous) 3D [58] and infra-red [59] matching. View [60] and illumination [61] invariant recognition are also related, in that there exists a strong covariate shift between probe and gallery images; however, we do not include these as good surveys already exist [61,62]. Fusing modalities in multi-modal face recognition [58,59,63–66] is also relevant in that multiple modalities are involved. However, the key difference to HFR is that multi-modal recognition assumes both enrolment and testing images are available in all modalities, and focuses on how to fuse the cues from each, while HFR addresses matching across modalities, with the probe and enrolment images in heterogeneous modalities. Finally, a good survey about face synthesis [67] is complementary to this work; however, we consider the broader problem of cross-domain matching.

Most HFR studies focus their contribution on improved methodology to bridge the cross-modal gap, thus allowing conventional face recognition strategies to be used for matching. Even across the wide variety of application domains considered above, these methods can be broadly categorized into three groups of approaches: (i) those that synthesize one modality from another, thus allowing them to be directly compared; (ii) those that engineer or learn feature representations that are variant to person identity while being more invariant to imaging modality than raw pixels; and (iii) those that project both views into a common space where they are more directly comparable. We will discuss these in more detail in later sections.

The main contributions of this paper are summarized as follows:

1. We perform an up-to-date survey of HFR literature.
2. We summarize all common public HFR datasets introduced thus far.
3. We extract some cross-cutting themes in face recognition with a cross-modal gap.
4. We draw some conclusions about the field, and offer some recommendations about future work on HFR.
Fig. 1. Example imagery for the HFR scenarios covered in this survey: facial sketches, near infrared images, 3D images and low-resolution images, matched against visible light 2D mugshots and high-resolution images.
Table 1
Overview of heterogeneous face recognition steps and typical strategies for each.
The rest of this paper is organized as follows: In Section 2, we provide an overview of a HFR system pipeline, and highlight some cross-cutting design considerations. In Section 3, we provide a detailed review of methods for matching facial sketches to photos and a systematic introduction of the most widely used facial sketch datasets. In Section 4, we describe approaches for matching near-infrared to visible light face images in detail. In Section 5, we focus on matching 2D probe images against a 3D enrollment gallery. Section 6 discusses methods for matching low-resolution face images to high-resolution face images. We conclude with a discussion of current issues and recommendations about future work on HFR.

2. Outline of a HFR system

In this section, we present an abstract overview of a HFR pipeline, outlining the key steps and the main types of strategies available at each stage. A HFR system can be broken into three major components, each corresponding to an important design decision: representation, cross-modal bridge strategy and matching strategy (Fig. 2). Of these components, the first and third have analogues in homogeneous face recognition, while the cross-modal bridge strategy is unique to HFR. Accompanying Fig. 2, Table 1 breaks down the papers reviewed in this survey by their choices about these design decisions.

Fig. 2. Overview of an abstract HFR pipeline: representation (analytic, holistic patch, holistic global, or facial component), cross-modal bridge (feature-based, synthesis, projection, or feature selection), and matching strategy (multi-class or nearest neighbor).

2.1. Representation

Analytic representations detect facial fiducial points, allowing the face to be modeled geometrically, e.g., using point distribution models [4,5]. This representation has the advantage that if a model can be fit to a face in each modality, then the analytic/geometric representation is relatively invariant to modality, and to precise alignment of the facial images. However, it is not robust to errors in face model fitting and may require manual intervention to avoid this [5]. Moreover, geometry is not robust to facial expression [6], and does not exploit texture information by default.

Component-based representations detect face parts (e.g., eyes and mouth), and represent the appearance of each individually [34,35]. This allows the informativeness of each component in matching to be measured separately [35]; and if components can be correctly detected and matched, it also provides some robustness to both linear and non-linear misalignment across modalities [34]. However, a component-fusion scheme is then required to produce an overall match score between two face images.

Global holistic representations represent the whole face image in each modality with a single vector [7,10,40]. Compared to analytic and component-based approaches, this has the advantage of encoding all available appearance information. However, it is sensitive to alignment and expression/pose variation, and may provide a high-dimensional feature vector that risks over-fitting [68].

Patch-based holistic representations encode the appearance of each image in patches, with a feature vector per patch [25,26,29,30,32]. Subsequent strategies for using the patches vary, including for example concatenation into a very large feature vector [31] (making it in effect a holistic representation), or learning a mapping/classifier per patch [39]. The latter strategy can provide some robustness if the true mapping is not constant over the whole face, but does require a patch fusion scheme.
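As a concrete illustration of the patch-based option, the sketch below tiles a face image and describes each tile with a uniform-LBP histogram. This is a minimal example using scikit-image, not code from any surveyed paper; the patch size and LBP parameters are illustrative choices:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def patch_lbp_representation(face, patch=16, P=8, R=1.0):
    """Encode a grayscale face image as one LBP histogram per patch."""
    lbp = local_binary_pattern(face, P, R, method="uniform")
    n_bins = P + 2                      # uniform LBP codes run 0..P+1
    feats = []
    for y in range(0, face.shape[0] - patch + 1, patch):
        for x in range(0, face.shape[1] - patch + 1, patch):
            block = lbp[y:y + patch, x:x + patch]
            hist, _ = np.histogram(block, bins=n_bins,
                                   range=(0, n_bins), density=True)
            feats.append(hist)
    return np.array(feats)              # shape: (num_patches, n_bins)
```

Concatenating the rows yields the "in effect holistic" variant mentioned above; keeping them separate supports per-patch mappings and per-patch fusion.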
2.2. Cross-modal bridge strategies

The key HFR challenge of cross-modality heterogeneity typically necessitates an explicit strategy to deal with the cross-modal gap. This component uniquely distinguishes HFR systems from conventional within-modality face recognition. Most HFR studies focus their effort on developing improved strategies for this step. Common strategies broadly fall into three categories: feature design, cross-modal synthesis and subspace projection. These strategies are not exclusive, and many studies employ or contribute to more than one [25,31].

Feature design strategies [29–32] focus on engineering or learning features that are invariant to the modalities in question, while simultaneously being discriminative for person identity. Typical strategies include variants on SIFT [31] and LBP [32].

Synthesis approaches focus on synthesizing one modality based on the other [7,25]. Typical methods include eigentransforms [7,8], MRFs [25], and LLE [26]. The synthesized image can then be used directly for homogeneous matching. Of course, matching performance is critically dependent on the fidelity and robustness of the synthesis method.

Projection approaches aim to project both modalities of face images to a common subspace in which they are more comparable than in the original representations [10,31,40]. Typical methods include linear discriminant analysis (LDA) [25], canonical correlation analysis (CCA) [10,28], partial least squares (PLS) and common basis [31] encoding.

A noteworthy special case of projection-based strategies is approaches that perform feature selection. Rather than mapping all input dimensions to a subspace, these approaches simply discover which subset of input dimensions is the most useful (modality invariant) to compare across domains, and ignore the others [11,35], for example using Adaboost.

2.3. Matching strategy

A first factor in the choice of matching strategy is data sufficiency. In many HFR applications there is only one cross-modal face pair per person. Thus classification strategies have one instance per class (person), and risk over-fitting when training a model-based recogniser. In contrast, by transforming the problem into a binary one, all true pairs of faces form the positive class and all false pairs form the negative class, resulting in a much larger training set, and hence a stronger and more robust classifier.

In conventional face recognition, matching strategies are often adopted according to how the proposed system is to be used at test time. If the task is to recognize a face as one of a pre-defined set of people, multi-class classifiers are a natural matching strategy. If the task is to check whether a face image matches someone on a given watch-list or not, then model-based binary-verifiers are a natural choice. However, it is worth noting that multi-class classification can be performed by exhaustive verification, so many HFR systems are realized by verification, whether the final aim is verification or recognition. A second reason for the use of verification in HFR studies is that the classic forensic sketch application scenario for HFR is an open-world verification scenario (the sketch may or may not correspond to a person in the mug-shot database). For simplicity, in this paper we use the term 'recognition' loosely to cover both scenarios, and disambiguate where necessary.

We note that some methodologies can be interpreted as either cross-domain mappings or matching strategies. For example, some papers [25] present LDA as a recognition mechanism. However, as it finds a projection that maps images of one class (person identity) closer together, it also has a role in bridging the cross-modal gap when those images are heterogeneous. Therefore for consistency, we categorize LDA and the like as cross-domain methods.
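The classification-to-verification transformation is simple to state in code. Below is a minimal sketch (function and variable names are ours, not from any surveyed system): given one photo feature and one sketch feature per person, it materializes all cross-modal pairs, labelling same-person pairs positive:

```python
import numpy as np

def make_verification_pairs(photo_feats, sketch_feats):
    """Turn N one-pair-per-person samples into N*N binary training pairs.

    photo_feats, sketch_feats: (N, D) arrays where row i = person i.
    Returns pair features (absolute difference here) and 0/1 labels.
    """
    X, y = [], []
    n = len(photo_feats)
    for i in range(n):
        for j in range(n):
            X.append(np.abs(photo_feats[i] - sketch_feats[j]))
            y.append(1 if i == j else 0)   # N positives, N*(N-1) negatives
    return np.array(X), np.array(y)
```

With N identities this yields N positive and N(N-1) negative examples, which is the enlarged training set the text refers to; in practice the negatives are usually subsampled to balance the classes.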
2.4. Formalizations

Many HFR methods can be seen as special cases of the general formalization given in Eq. (1). Images in two modalities x^a and x^b are input; non-linear feature extraction F may be performed; and some matching function M then compares the extracted features, possibly after taking linear transforms W^a and W^b of each feature:

M( W^a F(x_i^a), W^b F(x_j^b) ).    (1)
Different instantiations of Eq. (1) differ in how the transforms are obtained. For example, CCA [10,28] strategies search for W^a and W^b such that ||W^a F(x_i^a) − W^b F(x_i^b)|| is minimized for cross-modal pairs of the same person i, while LDA [25] strategies search for a single projection W such that ||W F(x_i^a) − W F(x_j^b)|| is minimized when i = j and maximized when i ≠ j.
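Spelled out as objectives, the two cases look as follows. This is a schematic summary in the notation of Eq. (1); per-paper constraints, normalizations and regularizers are omitted:

```latex
% CCA-style: two projections, aligned on cross-modal pairs of the same person
(W^a, W^b) = \arg\min_{W^a, W^b} \sum_i \big\| W^a F(x_i^a) - W^b F(x_i^b) \big\|^2

% LDA-style: a single projection that pulls same-person cross-modal pairs
% together while pushing different-person pairs apart
W = \arg\min_{W} \frac{\sum_{i} \| W F(x_i^a) - W F(x_i^b) \|^2}
                      {\sum_{i \neq j} \| W F(x_i^a) - W F(x_j^b) \|^2}
```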
2.5. Summary and conclusions

HFR methods explicitly or implicitly make design decisions about three stages of representation, cross-domain mapping and matching (Fig. 2). An important factor in the strengths and weaknesses of each approach arises from the use of supervised training in either or both of the latter two stages (Table 1).

Use of training data. An important property of HFR systems is whether annotated cross-modal training data is required/exploited. This has practical consequences for whether an approach can be applied in a particular application, and for its expected performance. Since a large dataset of annotated cross-modal pairs may not be available, methods that require no training data (most feature-engineering and NN matching approaches [29,30,33,34]) are advantageous.

On the other hand, exploiting available annotation provides a critical advantage for learning better cross-domain mappings and discriminative matching approaches. Methods differ in how strongly they exploit available supervision. For example, CCA tries to find the subspace where cross-modal pairs are most similar [10,28]. In contrast, LDA simultaneously finds a space where cross-modal pairs are similar and also where different identities are well separated [25], which exploits the labeled training data more thoroughly. It is worth noting that since HFR is concerned with addressing the cross-modal gap, most approaches using training data make use of cross-domain matching pairs as annotated training data, rather than the person identity annotations that are more common in conventional (within-domain) face recognition.

Heterogeneous feature spaces. A second important model-dependent property is whether the model can deal with heterogeneous data dimensions. In some cross-modal contexts (photo–sketch, VIS–NIR), while the data distribution is heterogeneous, the data dimensions can be the same; while in 2D–3D or low–high resolution matching, the data dimensionality may be fundamentally different. In the latter case, approaches that require homogeneous dimensions such as LDA may not be applicable, while others such as CCA and PLS can still apply.
3. Matching facial sketches to images

The problem of matching facial sketches to photos is commonly known as sketch-based face recognition (SBFR). It typically involves a gallery dataset of visible light images and a probe dataset of facial sketches. An important application of SBFR is assisting law enforcement to identify suspects by retrieving their photos automatically from existing police databases. Over the past decades, it has been accepted as an effective tool in law enforcement. In most cases, actual photos of suspects are not available, only sketch drawings based on the recollection of eyewitnesses. The ability to match forensic sketches to mug shots not only has the obvious benefit of identifying suspects, but moreover allows the witness and artist to interactively refine the sketches based on similar photos retrieved [25].

SBFR can be categorized based on how the sketches are generated, as shown in Fig. 3: (i) viewed sketches, where artists are given mugshots as reference; (ii) forensic sketches, which are hand-drawn by professional artists based on the recollections of witnesses; (iii) composite sketches, which rather than being hand-drawn are produced using specific software; and (iv) caricature sketches, where facial features are exaggerated.

The majority of existing SBFR studies have focused on recognizing viewed hand drawn sketches. This is not a realistic use case: a sketch would not be required if a photo of a suspect were readily available. Yet studying them is a middle ground toward understanding forensic sketches, since viewed sketch performance should reflect forensic sketch performance in the ideal case when all details are remembered and communicated correctly. Research can then focus on making good viewed sketch methods robust to lower-quality forensic sketches.

Fig. 3. Facial sketches and corresponding mugshots: viewed sketch, forensic hand drawn sketch, forensic composite sketch, caricature sketch and their corresponding facial images.

3.1. Categorization of facial sketches

Facial sketches can be created either by an artist or by software, and are referred to as hand-drawn and composite respectively. Meanwhile, depending on whether the artist observes the actual face before sketching, they can also be categorized as viewed and forensic (unviewed). Based on these factors, we identify four typically studied categories of facial sketches:

• Forensic hand drawn sketches: These are produced by a forensic artist based on the description of a witness [71], as illustrated in the second column of Fig. 3. They have been used by police since the 19th century; however, they have been less well studied by the recognition community.
• Forensic composite sketches: These are created by computer software (Fig. 4) with which a trained operator selects various facial components based on the description provided by a witness. An example of a resulting composite sketch is shown in the third column of Fig. 3. It is reported that 80% of law enforcement agencies use some form of software to create facial sketches of suspects [72]. The most widely used software packages for generating facial composite sketches are IdentiKit [70], Photo-Fit [73], FACES [69], Mac-a-Mug [73], and EvoFIT [74]. It is worth noting that due to the limitations of such software packages, less facial detail can be presented in composite sketches compared with hand-drawn sketches.
• Viewed hand drawn sketches: In contrast to forensic sketches, which are unviewed, these are sketches drawn by artists while looking at a corresponding photo, as illustrated in the first column of Fig. 3. As such, they are the most similar to the actual photo.
• Caricature: In contrast to the previous three categories, where the goal is to render the face as accurately as possible, caricature sketches are purposefully dramatically exaggerated. This adds a layer of abstractness that makes their recognition by conventional systems much more difficult. See the fourth column of Fig. 3 for an example. However, they are interesting to study because they allow the robustness of SBFR systems to be rigorously tested, and because there is evidence that humans remember faces in a caricatured form, and can recognize them even better than accurate sketches [4,75,76].
3.2. Facial sketch datasets

There are five commonly used datasets for benchmarking SBFR systems. Each contains pairs of sketches and photos. They differ in size, in whether sketches are viewed, and in whether they are drawn by an artist or composited by software. Table 2 summarizes each dataset in terms of these attributes.

The CUHK Face Sketch dataset (CUFS) [25] is widely used in SBFR. It includes 188 subjects from the Chinese University of Hong Kong (CUHK) student dataset, 123 faces from the AR dataset [79], and 295 faces from the XM2VTS dataset [80]. There are 606 faces in total. For each subject, a sketch and a photo are provided. The photo is taken of each subject with frontal pose and neutral expression under normal lighting conditions. The sketch is then drawn by an artist based on the photo.

The CUHK Face Sketch FERET Dataset (CUFSF) [25,39] is also commonly used to benchmark SBFR algorithms. There are 1194 subjects from the FERET dataset [81]. For each subject, a sketch and a photo are also provided. However, compared to CUFS, instead of normal lighting conditions, the photos in CUFSF are taken with lighting variation. Meanwhile, the sketches are drawn with shape exaggeration based on the corresponding photos. Hence, CUFSF is more challenging and closer to practical scenarios [39].

The IIIT-D Sketch Dataset [32] is another well known facial sketch dataset. Unlike CUFS and CUFSF, it contains not only viewed sketches but also semi-forensic sketches and forensic sketches, and can therefore be regarded as three separate datasets, each containing a particular type of sketch: IIIT-D viewed, IIIT-D semi-forensic and IIIT-D forensic. The IIIT-D viewed sketch dataset comprises a total of 238 sketch–image pairs. The sketches are drawn by a professional sketch artist based on photos collected from various sources: 67 sketch–image pairs from the FG-NET aging dataset (downloadable at http://www-prima.inrialpes.fr/FGnet/html/home.html), 99 sketch–digital image pairs from the Labeled Faces in the Wild (LFW) dataset [82], and 72 sketch–digital image pairs from the IIIT-D student & staff dataset [82]. In the IIIT-D semi-forensic dataset, sketches are drawn based on an artist's memory instead of directly based on the photos or the description of an eye-witness. These sketches are termed semi-forensic sketches. The semi-forensic dataset is based on 140 digital images from the viewed sketch dataset. In the IIIT-D forensic dataset there are 190 forensic sketches and face photos. It contains 92 and 37 forensic sketch–photo pairs from [83] and [84] respectively, as well as 61 pairs from various sources on the internet.

The Pattern Recognition and Image Processing (PRIP) Viewed Software-Generated Composite (PRIP-VSGC) database [34] contains 123 subjects from the AR database. For each photograph, three composites were created: two using FACES [69] and the third using Identi-Kit [70].

The Pattern Recognition and Image Processing (PRIP) Hand-Drawn Composite (PRIP-HDC) database [77] includes 265 hand-drawn and composite facial sketches, together with corresponding mugshots. The facial sketches were drawn based on the verbal description of an eyewitness or victim. Among them, 73 were drawn by Lois Gibson, 43 were provided by Karen Taylor, 56 were provided by the Pinellas County Sheriff's Office (PCSO), 46 were provided by Michigan State Police, and 47 were downloaded from the Internet. So far, only the 47 facial sketches collected from the Internet are publicly available.

All sketches collected by previous attempts are coarsely grouped as either viewed or unviewed, without tracking the time-delay between viewing and forensic sketching, a factor that has critical impact on the fidelity of human facial memory [85]. To address this, Ouyang et al. [78] introduced the first Memory Gap Database (MGDB), which not only includes viewed and unviewed sketches, but uniquely sketches rendered at different time-delays between viewing and sketching. MGDB [78] includes 100 real subjects (mugshots sampled from mugshot.com). Each subject has a frontal face photo and four facial sketches drawn at various time-delays: viewed sketch, 1 h sketch, 24 h sketch and unviewed sketch. In total, 400 hand-drawn sketches are provided by the MGDB. This database is aimed at helping modelers disentangle modality, memory, and communication factors in forensic sketch HFR.

It is worth noting that the accessibility of these datasets varies, with some not being publicly available. Klare et al. [31] created a forensic dataset from sketches cropped from two books (also contained in IIIT-D forensic), which is thus limited by copyright. Klare et al. also conducted experiments querying against a real police database of 10,000 mugshots, but this is not publicly available.

3.3. Viewed sketch face recognition

Viewed sketch recognition is the most studied sub-problem of SBFR. Although a hypothetical problem (in practice a photo would be used directly if available, rather than a viewed sketch), it provides an important step toward ultimately improving forensic sketch accuracy. It is hypothesized that based on an ideal eyewitness description, unviewed sketches would be equivalent to viewed ones. Thus performance on viewed sketches should be an upper bound on expected performance on forensic sketches.

Viewed sketch-based face recognition studies can be classified into synthesis, projection and feature-based methods according to their main contribution to bridging the cross-modal gap.
Table 2
Existing facial sketch benchmark datasets.
3.3.1. Synthesis-based approaches

The key strategy in synthesis-based approaches is to synthesize a photo from the corresponding sketch (or vice-versa), after which traditional homogeneous recognition methods can be applied (see Fig. 5). To convert a photo into a sketch, Wang and Tang [7] propose an eigensketch transformation, wherein a new sketch is constructed using a linear combination of training sketch samples, with linear coefficients obtained from corresponding photos via eigen decomposition. Classification is then accomplished using the obtained eigensketch features. To exploit the strong correlation that exists among face images, the Karhunen–Loeve Transform (KLT) is applied to represent and recognize faces. The eigensketch transformation algorithm reduces the discrepancies between photo and sketch. The resulting rank-10 accuracy is reasonable; however, the work is limited by the small size of the dataset (188 pairs) used and by weak rank-1 accuracy.

It was soon discovered that synthesizing facial sketches holistically via linear processes might not be sufficient, in that synthesized sketches lack detail, which in turn negatively impacts final matching accuracy. Liu et al. [26] proposed a Local Linear Embedding (LLE) inspired method to convert photos into sketches based on image patches, rather than holistic photos. For each image patch to be converted, it finds the nearest neighbors in the training set. Reconstruction weights of neighboring patches are then computed, and used to generate the final synthesized patch. Wang and Tang [25] further improved [26] by synthesizing local face structures at different scales using Markov Random Fields (MRF), as shown in Fig. 5(a). By modeling the relationship between local patches through a compatibility function, the multi-scale MRF jointly reasons about the selection of the sketch patch corresponding to each photo patch during photo–sketch conversion. In each case, photo/sketch conversion reduces the modality gap, allowing the two domains to be matched effectively. In both [7] and [25], after photos/sketches are synthesized, many standard methods like PCA [76], Bayesianface [86], Fisherface [87], null-space LDA [88], dual-space LDA [89] and Random Sampling LDA (RS-LDA) [90,91] are straightforwardly applied for homogeneous face recognition.
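The patch-based recipe of [26] can be sketched compactly. The following is a minimal single-patch illustration in our own simplified form; overlap blending, multi-scale handling and search windows are omitted, and all names are invented:

```python
import numpy as np

def lle_synthesize_patch(photo_patch, photo_dict, sketch_dict, k=5, reg=1e-3):
    """Synthesize one sketch patch from a photo patch, LLE-style.

    photo_dict, sketch_dict: (N, D) arrays of corresponding training
    photo/sketch patches (row i of each comes from the same face location).
    """
    # 1. Find the k nearest training photo patches.
    dists = np.linalg.norm(photo_dict - photo_patch, axis=1)
    nn = np.argsort(dists)[:k]

    # 2. Solve for reconstruction weights w minimizing
    #    ||photo_patch - sum_j w_j * photo_dict[nn_j]||^2 with sum(w) = 1.
    Z = photo_dict[nn] - photo_patch        # shift neighbours to the origin
    G = Z @ Z.T                             # local Gram matrix (k x k)
    G += reg * np.trace(G) * np.eye(k)      # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                            # enforce the sum-to-one constraint

    # 3. Apply the same weights to the corresponding sketch patches.
    return w @ sketch_dict[nn]
```

The weights computed in photo space are applied unchanged in sketch space, which is exactly the locally-linear assumption the method rests on.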
The embedded hidden Markov model (E-HMM) is applied by Zhong et al. [92] to transform a photo into a sketch. The nonlinear relationship between a photo/sketch pair is modeled by the E-HMM. Then, learned models are used to generate a set of pseudo-sketches. Those pseudo-sketches are used to synthesize a finer face pseudo-sketch based on a selective ensemble strategy. E-HMMs are also used by Gao et al. [93,94] to synthesize sketches from photos. Conversely, Xiao et al. [95] proposed an E-HMM based method to synthesize photos from sketches. Liu et al. [96] proposed a synthesis method based on Bayesian Tensor Inference. This method can be used to synthesize both sketches from photos and photos from sketches.

A common problem shared by most sketch synthesis methods is that they cannot handle non-facial factors such as hair style, hairpins and glasses well. To tackle this problem, Zhang et al. [97] combined sparse representation and Bayesian inference in synthesizing facial sketches. Sparse representation is used to model photo patches, where nearest neighbor search with learned prior knowledge is applied to compute similarity scores across patches. After selecting candidate sketch patches using these similarity scores, an MRF is employed to reconstruct the final sketch by calculating the probability between photo patches and candidate sketch patches.

Fig. 5. Examples of sketch synthesis: (left) photo to sketch by synthesized sketches; (right) sketch to photo by synthesized photos.

Most sketch synthesis methods rely on many training pairs to work, which naturally makes them deficient in modeling subtle non-facial features. Zhang et al. [98] recognized this and proposed a method that is capable of handling non-facial factors using only a single photo–sketch pair. Sparse representation based greedy search
is used to select candidate patches, and Bayesian inference is then used for final sketch synthesis. A cascaded image synthesis strategy is further applied to improve the quality of the synthesized sketch.

All the aforementioned methods synthesize facial sketches using pixel intensities alone. Peng et al. [99] explored a multi-representation approach to face sketch modeling. Filters such as DoG, and features like SURF and LBP, are employed to generate different representations, and a Markov network is deployed to exploit the mutual relationship among neighboring patches. They conduct forensic sketch recognition experiments using sketches from the CUHK and AR datasets as probes, and 10,000 face photo images from the LFW-a dataset as gallery.
3.3.2. Projection based approaches

Rather than trying to completely reconstruct one modality from the other as in synthesis-based approaches, projection-based approaches attempt to find a lower-dimensional sub-space in which the two modalities are directly comparable (and ideally, in which identities are highly differentiated).

Lin and Tang [40] proposed a linear transformation which can be used between different modalities (sketch/photo, NIR/VIS), called common discriminant feature extraction (CDFE). In this method, images from two modalities are projected into a common feature space in which matching can be effectively performed.

Sharma et al. [9] use Partial Least Squares (PLS) to linearly map images of different modalities (e.g., sketch, photo and different poses, resolutions) to a common subspace where mutual covariance is maximized. This is shown to generalize better than CCA. Within this subspace, final matching is performed with simple NN.

In [46], a unified sparse coding-based model for coupled dictionary and feature space learning is proposed to simultaneously achieve synthesis and recognition in a common subspace. The learned common feature space is used to perform cross-modal face recognition with NN.

In [26], a kernel-based nonlinear discriminant analysis (KNDA) classifier is adopted by Liu et al. for sketch–photo recognition. The central contribution is to use the nonlinear kernel trick to map input data into an implicit feature space. Subsequently, LDA is used to extract features in that space, which are non-linear discriminative features of the input data.
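To make the projection idea concrete, here is a minimal sketch using scikit-learn's CCA as a stand-in for the various learned mappings discussed above (PLS can be substituted via sklearn's PLSCanonical). Nothing here reproduces a specific paper's pipeline; the function name and dimensionality are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_rank_gallery(photo_train, sketch_train, gallery_photos,
                     probe_sketches, dim=32):
    """Project photos and sketches into a CCA subspace and rank the gallery.

    photo_train/sketch_train: (N, Dp)/(N, Ds) aligned pairs, row i = person i.
    Returns an index array: row p lists gallery rows sorted best-first.
    """
    cca = CCA(n_components=dim)
    cca.fit(photo_train, sketch_train)

    # transform() projects the X (photo) view; the Y (sketch) view is only
    # returned alongside an X array, so pass a dummy X of the right shape.
    gallery_c = cca.transform(gallery_photos)
    dummy = np.zeros((len(probe_sketches), photo_train.shape[1]))
    _, probes_c = cca.transform(dummy, probe_sketches)

    # Nearest-neighbour matching in the shared space.
    dists = np.linalg.norm(probes_c[:, None, :] - gallery_c[None, :, :],
                           axis=2)
    return np.argsort(dists, axis=1)        # column 0 = best match per probe
```

Note that CCA accepts different feature dimensionalities for the two views, which is what makes this family of methods applicable even when the modalities are not dimension-matched (cf. Section 2.5).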
3.3.3. Feature based approaches

Rather than mapping photos into sketches, or both into a common subspace, feature-based approaches focus on designing a feature descriptor for each image that is intrinsically invariant to the modality, while being variant to the identity of the person. The most widely used image feature descriptors are the Scale-Invariant Feature Transform (SIFT), the Gabor transform, the Histogram of Averaged Oriented Gradients (HAOG) and the Local Binary Pattern (LBP). Once sketch and photo images are encoded using these descriptors, they may be matched directly, or after a subsequent projection-based step as in the previous section.

Klare et al. [31] proposed the first direct sketch/photo matching method based on invariant SIFT features [100]. SIFT features provide a compact vector representation of an image patch based on the magnitude, orientation, and spatial distribution of the image gradients [31]. SIFT feature vectors are first sampled uniformly from the face images and concatenated together separately for sketch and photo images. Then, Euclidean distances are computed between the concatenated SIFT feature vectors of sketch and photo images for NN matching.

Later on, Bhatt et al. [101] proposed a method which used extended uniform circular local binary pattern descriptors to tackle sketch/photo matching. Those descriptors are based on discriminating facial patterns formed by high frequency information in facial images. To obtain the high frequency cues, sketches and photos are decomposed into multi-resolution pyramids. After the extended uniform circular local binary pattern based descriptors are computed, a Genetic Algorithm (GA) [102] based weight optimization technique is used to find optimum weights for each facial patch. Finally, NN matching is performed using a weighted Chi square distance measure.

Khan et al. [33] proposed a self-similarity descriptor. Features are extracted independently from local regions of sketches and photos. Self-similarity features are then obtained by correlating a small image patch within its larger neighborhood. Self-similarity remains relatively invariant to the photo/sketch modality variation, and therefore reduces the modality gap before NN matching.

A new face descriptor, the Local Radon Binary Pattern (LRBP), was proposed by Galoogahi et al. [103] to directly match face photos and sketches. In the LRBP framework, face images are first transformed into Radon space; the transformed face images are then encoded by Local Binary Pattern (LBP). Finally, LRBP is computed by concatenating histograms of local LBPs. Matching is performed by a distance measurement based on the Pyramid Match Kernel (PMK) [104]. LRBP benefits from low computational complexity and the fact that there is no critical parameter to be tuned [103].

Zhang et al. [105] introduced another face descriptor based on coupled information-theoretic encoding, which uniquely captures discriminative local facial structures. Through maximizing mutual information between photos and sketches in the quantized feature spaces, they obtained a coupled encoding using an information-theoretic projection tree. The method was evaluated with 1194 faces sampled from the FERET database.

Galoogahi et al. subsequently proposed another two face descriptors: Gabor Shape [30], a variant of Gabor features, and the Histogram of Averaged Oriented Gradients (HAOG) [29], a variant of HOG, for direct sketch/photo matching; the latter achieves perfect 100% accuracy on the CUFS dataset.

Klare et al. [31] further exploited their SIFT descriptor by combining it with a 'common representation space' projection-based strategy. The assumption is that even if sketches and photos are not directly comparable, the distribution of inter-face similarities will be similar within the sketch and photo domains. That is, the (dis)similarity between a pair of sketches will be roughly the same as the (dis)similarity between the corresponding pair of photos. Thus each sketch and photo is re-encoded as a vector of its Euclidean distances to the training set of sketches and photos respectively. This common representation should now be invariant to modality, and sketches/photos can be compared directly. To further improve the results, direct matching and common representation matching scores are fused to generate the final match [31]. The advantage of this approach over mappings like CCA and PLS is that it does not require the sketch–photo domain mapping to be linear. The common representation strategy has also been used to achieve cross-view person recognition [107], where it was shown to be dependent on sufficient training data.

In contrast to the previous methods, which are appearance centric in their representation, Pramanik et al. [6] evaluate an analytic geometry feature based recognition system. Here, a set of facial components such as eyes, nose, eyebrows and lips are extracted, their aspect ratios are encoded as feature vectors, and K-NN is used as the classifier. Overall, because viewed sketches and photos in the CUFS database are very well-aligned and exaggeration between photo and sketch is minimal, appropriate feature engineering, projection or synthesis approaches can all deliver near-perfect results, as shown in Table 3.
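Since most of the entries in Table 3 below report nearest-neighbour matching under a chi-square style distance, a compact reference implementation of that evaluation loop may be useful. This is our own illustrative code, not from any cited paper; the rank-1 accuracy column of the table is exactly what cmc[0] computes here:

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two (stacked) histogram descriptors."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def cmc_curve(probe_feats, gallery_feats):
    """Cumulative match characteristic; probe i's true match is gallery i."""
    n = len(probe_feats)
    ranks = np.zeros(n, dtype=int)
    for i, p in enumerate(probe_feats):
        d = np.array([chi_square(p, g) for g in gallery_feats])
        # position of the true gallery entry in the sorted candidate list
        ranks[i] = int(np.where(np.argsort(d) == i)[0][0])
    return np.array([(ranks <= r).mean() for r in range(n)])  # cmc[0] = rank-1
```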
Table 3
Sketch–photo matching methods: performance on benchmark datasets.

| Category | Ref. | Matching method | Dataset | Feature/contribution | Train:Test | Rank-1 accuracy |
|---|---|---|---|---|---|---|
| Synthesis based | [7] | KLT | CUHK | Eigen-sketch features | 88:100 | About 60% |
| Synthesis based | [26] | KNDA | CUFS | – | 306:300 | 88% |
| Synthesis based | [25] | RS-LDA | CUFS | Multiscale MRF | 306:300 | 96% |
| Synthesis based | [92] | – | CUFS | E-HMM | – | 95% |
| Synthesis based | [96] | – | CUFS | E-HMM + selective ensemble | – | 100% |
| Synthesis based | [97] | – | CUHK/XM2VTS | Sparse representations | – | – |
| Synthesis based | [98] | – | CUHK/XM2VTS | Single sketch–photo pair | – | – |
| Synthesis based | [99] | Fisherface | CUFS/IIIT-D | Multiple representation + Markov network | 88:100 | 98.3% |
| Projection based | [31] | Common representation | CUFS | SIFT | 100:300 | 96% |
| Projection based | [9] | PLS | CUHK | – | 88:100 | 93% |
| Projection based | [106] | PLS regression | CUFS, CUFSF | Gabor and CCS-POP | 0:1800 | 99% |
| Feature based | [31] | NN | CUFS | SIFT | 100:300 | 98% |
| Feature based | [33] | NN | CUFS | Self similarity | 161:150 | 99% |
| Feature based | [105] | PCA + LDA | CUFSF | CITE | 500:694 | 99% |
| Feature based | [30] | NN, Chi-square | CUFS | Gabor Shape | 306:300 | 99% |
| Feature based | [101] | Weighted Chi-square | CUFS | EUCLBP | 78:233 | 94% |
| Feature based | [29] | NN, Chi-square | CUFS | HAOG | 306:300 | 100% |
| Feature based | [30] | NN, Chi-square | CUFSF | Gabor Shape | 500:694 | 96% |
| Feature based | [103] | NN, PMK, Chi-square | CUFSF | LRBP | – | 91% |
| Feature based | [6] | K-NN | CUHK | Geometric features | 108:80 | 80% |
| Feature based | [101] | Weighted Chi-square | IIIT-D | EUCLBP | 58:173 | 79% |

3.4. Forensic sketch face recognition
Forensic sketches pose a greater challenge than viewed sketch recognition because, beyond the modality shift, they contain incomplete or inaccurate information, due to the subjectivity of the description and the imperfection of the witness' memory [85].

Due to this greater challenge, and the lesser availability of forensic sketch datasets, research in this area has been less extensive than for viewed sketches. Uhl et al. [108] proposed the first system for automatically matching police artist sketches to photographs. In their method, facial features are first extracted from sketches and photos. Then, the sketch and photo are geometrically standardized to facilitate comparison. Finally, eigen-analysis is employed for matching. Only 7 probe sketches were used in experimental validation, and their method is antiquated with respect to modern methods. Nonetheless, Uhl and Lobo's study highlighted the complexity and difficulty of forensic sketch based face recognition and drew other researchers toward the problem.

Klare et al. [109] performed the first large scale study in 2011, with an approach combining feature-based and projection-based contributions. SIFT and MLBP features were extracted, followed by training an LFDA projection to minimize the distance between corresponding sketches and photos while maximizing the distance between distinct identities. They analyzed a dataset of 159 pairs of forensic hand drawn sketches and mugshot photos, in which the subjects had been identified by law enforcement agencies. They also included 10,159 mugshot images provided by Michigan State Police to better simulate a realistic police search against a large gallery. In this realistic scenario, they achieved about a 15% success rate.

To improve recognition performance, Bhatt et al. [32] proposed an algorithm that also combines feature and projection-based contributions. They use a multi-scale circular Weber's Local Descriptor to encode structural information in local facial regions. Memetic optimization was then applied to every local facial region as a metric learner to find the optimal weights for Chi squared NN matching [32]. The result outperforms [109] using only the forensic set as gallery.

Different to previous studies that tackle forensic sketch matching using a single model, Ouyang et al. [78] developed a database and methodology to decouple the multiple distinct challenges underlying forensic matching: the modality change, the eyewitness–artist description, and the memory loss of the eyewitness. Their MGDB database has 400 forensic sketches created under different conditions such as memory time-delays. Using this MGDB, they applied multi-task Gaussian process regression to synthesize facial sketches accounting for each of these factors. They evaluated this model on IIIT-D forensic sketch and a large (10,030 image) mugshot database similar to that used in [109], and achieved state-of-the-art results.

3.5. Composite sketch based face recognition

Several studies have now considered face recognition using composite sketches. The earliest, proposed by Yuen et al. [5], used both local and global features to represent sketches. This method also investigated user input in the form of relevance feedback in the recognition phase. Subsequent studies have focused on holistic [24,110], component based [34,56,57,111] and hybrid [5,77] representations respectively.

The holistic method [110] uses similarities between local features computed on uniform patches across the entire face image. After tessellating a facial sketch/mugshot into 154 uniform patches, SIFT [100] and multi-scale local binary pattern (MLBP) [112] invariant features are extracted from each patch. With this feature encoding, an improved version of the common representation intuition from [31] is applied, followed by RS-LDA [91] to generate a discriminative subspace for NN matching with cosine distance. The scores generated by each feature and patch are fused for final recognition.

In contrast, the component based method [34] uses similarities between individual facial components to compute an overall sketch to mugshot match score. Facial landmarks in composite sketches and photos are automatically detected by an active shape model (ASM) [113]. Multiscale local binary patterns (MLBPs) are then applied to extract features of each facial component, and similarity is calculated for each component: using histogram intersection distance for the component's appearance and cosine distance for its shape. The similarity scores of each facial component are normalized and fused to obtain the overall sketch–photo similarity.

Mittal et al. [56] also used features extracted according to facial landmarks. Daisy descriptors were extracted from patches centred on facial landmarks. The cross-modal Chi square distances of these descriptors at each landmark are then used as the input feature to train a binary verifier based on GentleBoost. This was improved by a subsequent study [57], which enhanced the representation by using Self Similarity Descriptors (SSD) as features, followed by encoding them in terms of distance to a dictionary of faces (analogously to the common representation in [31]), before applying GentleBoost verification again.
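Component-based pipelines like those above all end with a normalize-then-fuse step. The sketch below shows the common min–max/sum-rule pattern; it is illustrative only, as the cited papers each use their own weighting schemes, and the function names are ours:

```python
import numpy as np

def min_max_normalize(scores):
    """Map a vector of raw component scores onto [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def fuse_component_scores(score_lists, weights=None):
    """Sum-rule fusion of per-component similarity scores.

    score_lists: list of 1-D arrays, one per facial component, each holding
    that component's similarity to every gallery subject.
    """
    normed = [min_max_normalize(s) for s in score_lists]
    if weights is None:
        weights = np.ones(len(normed)) / len(normed)
    return sum(w * s for w, s in zip(weights, normed))
```

The normalization step matters because raw distances from different descriptors (e.g., histogram intersection vs. cosine) live on incomparable scales.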
In contrast to engineered descriptors, deep learning can provide an effective way to learn discriminative and robust representations. However, these methods tend to require large data volumes relative to the size of available HFR datasets. To address this, Mittal et al. [24] use deep auto-encoders and deep belief networks to learn an effective face representation based on a large photo database. This is then fine-tuned on a smaller heterogeneous database to adapt it to the HFR task. Binary verification is then performed using SVM and NN classifiers.

Finally, Klum et al. [77] focus on building a practically accurate, efficient and deployable sketch-based interaction system by improving and fusing the holistic and component-based algorithms in [110] and [34] respectively. The implication of different sources of training data is also investigated.

3.6. Caricature based face recognition

The human visual system's ability to recognize a person from a caricature is remarkable, as conventional face recognition approaches fail in this setting of extreme intra-class variability (Fig. 6). The caricature generation process can be conceptualized as follows: assume a face space in which each face lies; then, drawing a line connecting the mean face to each face, the corresponding caricature will lie beyond that face along the line. That is to say, a caricature is an exaggeration of a face away from the mean [114].

Fig. 6. Caricatures and corresponding mugshots.
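In face-space terms this can be written compactly (a schematic rendering of the idea attributed to [114], not a formula quoted from that paper):

```latex
c = \bar{x} + \alpha\,(x - \bar{x}), \qquad \alpha > 1,
```

where x is the face, \bar{x} the mean face, and \alpha the degree of exaggeration; \alpha = 1 recovers the veridical face.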
Studies have suggested that people may encode faces in a caricatured manner [115]. Moreover, they may be more capable of recognizing a familiar person through a caricature than through an accurate rendition [116,117]. The effectiveness of a caricature is due to its emphasis of deviations from average faces [54]. Developing efficient approaches to caricature based face recognition could help drive more robust and reliable face and heterogeneous face recognition systems.

Klare et al. [54] proposed a semi-automatic system to match caricatures to photographs. In this system, they defined a set of qualitative facial attributes that describe the appearance of a face independently of whether it is a caricature or photograph. These mid-level facial features were manually annotated for each image, and used together with automatically extracted LBP [112] features. These two feature types were combined with an ensemble of matching methods including NN and discriminatively trained logistic regression, SVM, MKL and LDA. The results showed that caricatures can be recognized slightly better with high-level qualitative features than with low-level LBP features, and that the two are synergistic, in that combining them can almost double the performance, up to 22.7% rank-1 accuracy. A key insight here is that, in strong contrast to viewed sketches that are perfectly aligned, the performance of holistic feature based approaches is limited because the exaggerated nature of caricature sketches means that detailed alignment is impossible.

A limitation of the above work is that the facial attributes must be provided, requiring manual intervention at run-time. Ouyang et al. [118] provided a fully automated procedure that uses a classifier ensemble to robustly estimate facial attributes separately in the photo and caricature domains. These estimated facial attributes are then combined with low-level features using CCA to generate a robust domain invariant representation that can be matched directly. This study also contributed facial attribute annotation datasets that can be used to support this line of research going forward.

3.7. Summary and conclusions

Table 3 summarizes the results of the major synthesis, projection and feature-based studies in terms of distance metric, dataset, feature representation, train to test ratio, and rank-1 accuracy (note that some results on the same dataset are not directly comparable because of differing test set sizes). As viewed sketch datasets exhibit near perfect alignment and detail correspondence between sketches and photos, well designed approaches of any type achieve near perfect accuracies. Forensic sketch recognition, in contrast, is an open problem, but the fewer and less comparable studies here also make it hard to identify the most promising techniques. What seems clear is that representations assuming simple perfect correspondence, such as dense HOG, and simple linear projections are unlikely to be the answer, and that purely image-processing approaches may be significantly improved by understanding the human factors involved [78].

3.7.1. Methodologies

All three categories of approaches (synthesis, projection and discriminative features) have been well studied for SBFR. Interestingly, while synthesis approaches have been one of the more popular categories of methods, they have only been demonstrated to work in viewed-sketch situations, where the sketch–photo transformation is very simple and alignment is perfect. It seems unlikely that they can generalize effectively to forensic sketches, where the uncertainty introduced by the forensic process (eyewitness subjective memory) significantly complicates the matching process.

An interesting related issue that has not been systematically explored by the field is the dependence on the sketching artists. Al Nizami et al. [119] demonstrated significant intra-personal variation in sketches drawn by different artists. This may challenge systems that rely on learning a single cross-modal mapping. This issue will become more significant in the forensic sketch case, where there is more artist discretion, than in viewed sketches, which are more like copying exercises.
3.7.3. Training data source

Many effective SBFR studies have leveraged annotated training data to learn projections and/or classifiers [31]. As interest has shifted to forensic sketches, standard practice has been to train such models on viewed-sketch datasets and test on forensic datasets [109]. An interesting question going forward is whether this is the best strategy. The first study explicitly addressing this issue concluded that it may not be [77]. Since viewed sketches under-represent sketch–photo heterogeneity, this means that learning methods are learning a model that is not matched to the data (forensic sketches) on which they will be tested. This poses an additional challenge of domain shift [120] (photo/viewed → photo/unviewed) to be solved. This issue also further motivates the creation of larger forensic-sketch datasets for training, which will be necessary to thoroughly investigate the best training strategy.

3.7.4. Automated matching versus human recognition

Finally, we notice that the vision and biometrics communities have largely focused on automated cross-domain matching, while an important outstanding question in forensic sketch for law enforcement has been left largely un-studied [85]. Rather than cross-domain mapping for HFR matching of a sketch against a photo database, police are often interested in generating a sketch/photo which can be best recognized by a person who might be familiar with the suspect, rather than generating a photo that can be matched to a mugshot database by a machine. From a cross-domain synthesis perspective, rather than simply generating the most accurate photo, the task here is to generate a more human-recognisable image, which has a different set of requirements [121] than conventional metrics.

4. Matching NIR to visible light images

NIR face recognition has attracted increasing attention recently because of its much desired attribute of (visible-light) illumination invariance, and the decreasing cost of NIR acquisition devices. It encompasses matching near infrared (NIR) to visible light (VIS) face images. In this case, the VIS enrollment samples are images taken under the visible light spectrum (wavelength range 0.4 μm–0.7 μm), while query images are captured under near infrared (NIR) conditions (just beyond the visible light range, wavelengths between 0.7 μm and 1.4 μm) [41]. NIR images are close enough to the visible light spectrum to capture the structure of the face, while simultaneously being far enough to be invariant to visible light illumination changes. Fig. 7 illustrates the differences between NIR and VIS images. Matching NIR to VIS face images is of interest because it offers the potential for face recognition where controlling the visible environment light is difficult or impossible, such as in night-time surveillance or automated gate control.

Fig. 7. VIS and NIR face images.

In NIR based face recognition, similar to sketch based recognition, most studies can be categorized into synthesis, projection and discriminant feature based approaches, according to their contribution to bridging the cross-modal gap.

4.1. Datasets

There are five main heterogeneous datasets covering the NIR–VIS condition. The CASIA HFB dataset [122], composed of visual (VIS), near infrared (NIR) and 3D faces, is widely used. In total, it includes 100 subjects: 57 males and 43 females. For each subject, there are 4 VIS and 4 NIR face images. Meanwhile, there are also 3D images for each subject (92 subjects with 2 each; 8 subjects with 1 each). In total, there are 800 images for the NIR–VIS setting and 200 images for 3D studies.

CASIA NIR–VIS 2.0 [123] is another widely used NIR dataset. 725 subjects are included, with 50 images (22 VIS and 28 NIR) per subject, for a total of 36,250 images.

The Cross Spectral Dataset [124] is proposed by Goswami et al. It consists of 430 subjects from various ethnic backgrounds (more than 20% of non-European origin). At least one set of 3 poses (−10°/0°/10°) is captured for each subject. In total, there are 2103 NIR images and 2086 VIS images.

The PolyU NIR face dataset [125] is proposed by the biometric research center at Hong Kong Polytechnic University. This dataset includes 33,500 images from 335 subjects. Besides frontal face images and faces with expression, pose variations are also included. It was created with an active light source in the NIR spectrum between 780 nm and 1100 nm.

The main NIR–VIS datasets are summarized in Table 4. The columns categorize the datasets by wavelength of NIR light, number of subjects, number of images, and whether they include 3D images, pose variations and expression variations, respectively.
Table 4
Summary of existing NIR–VIS benchmark datasets.

Dataset                        Wavelength     No. of subjects   No. of images   3D   Pose variations   Expression variations
CASIA HFB [122]                850 nm         100               992             √    ×                 ×
CASIA NIR–VIS 2.0 [123]        850 nm         725               17,580          √    √                 √
Cross Spectral Dataset [124]   800–1000 nm    430               4189            ×    √                 √
PolyU [125]                    780–1100 nm    335               33,500          √    √                 √
4.2. Synthesis based approaches

Wang et al. [12] proposed an analysis-by-synthesis framework that transforms face images from NIR to VIS. To achieve the conversion, facial textures are extracted from both modalities. NIR–VIS texture patterns extracted at corresponding regions of different face pairs collectively compose a training set of matched pairs. After illumination normalization [126], VIS images can be synthesized patch-by-patch by finding the best matching patch for each patch of the input NIR image.

Chen et al. [27] also synthesize VIS from NIR images, using the similar idea of learning a cross-domain dictionary of corresponding VIS and NIR patch pairs. To match patches more reliably, illumination-invariant LBP features are used to represent them. Synthesis of the VIS image is further improved compared to [12] by using locally-linear embedding (LLE) inspired patch synthesis rather than simple nearest-neighbor lookup. Finally, homogeneous VIS matching is performed with a NN classifier on the LBP representations of the synthesized images.

Xiong et al. [127] developed a probabilistic statistical model of the mapping between the two modalities of facial appearance, introducing a hidden variable to represent the transform to be inferred. To eliminate the influence of facial structure variations, a 3D model is used to perform pose rectification and pixel-wise alignment. A difference of Gaussians (DoG) filter is further used to normalize image intensities.

Recently, Xu et al. [128] introduced a dictionary learning approach for VIS–NIR face recognition. It first learns a cross-modal mapping function between the two domains following a cross-spectral joint l0 minimization approach. Facial images can then be reliably reconstructed by applying the mapping in either direction. Experiments conducted on the CASIA NIR–VIS 2.0 database show state-of-the-art performance.
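To make the patch-based synthesis idea concrete, the following is a minimal sketch of nearest-neighbor patch replacement in the spirit of [12] ([27] instead combines neighbors with LLE weights). It assumes geometrically aligned grayscale NIR/VIS training pairs as NumPy arrays; the patch size, stride and helper names are illustrative assumptions, not settings from the original papers.

import numpy as np

def extract_patches(img, size=8, step=4):
    """Collect ((y, x), flattened patch) pairs on a regular grid."""
    H, W = img.shape
    out = []
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            out.append(((y, x), img[y:y+size, x:x+size].ravel()))
    return out

def synthesize_vis(nir_img, train_nir, train_vis, size=8, step=4):
    """Replace each NIR patch of the input by the VIS patch whose paired
    training NIR patch is closest in L2 distance; average overlaps."""
    # Build the cross-modal patch dictionary from aligned image pairs.
    dict_nir, dict_vis = [], []
    for n_img, v_img in zip(train_nir, train_vis):
        dict_nir += [p for _, p in extract_patches(n_img, size, step)]
        dict_vis += [p for _, p in extract_patches(v_img, size, step)]
    dict_nir = np.stack(dict_nir)
    dict_vis = np.stack(dict_vis)

    acc = np.zeros_like(nir_img, dtype=float)
    cnt = np.zeros_like(nir_img, dtype=float)
    for (y, x), patch in extract_patches(nir_img, size, step):
        idx = np.argmin(((dict_nir - patch) ** 2).sum(axis=1))
        acc[y:y+size, x:x+size] += dict_vis[idx].reshape(size, size)
        cnt[y:y+size, x:x+size] += 1
    return acc / np.maximum(cnt, 1)

Once a VIS image has been synthesized this way, any homogeneous VIS matcher can be applied downstream, which is the appeal of the synthesis family.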
4.3. Projection based approaches

Lin et al. [40] proposed a matching method based on Common Discriminant Feature Extraction (CDFE), where two linear mappings are learned to project the samples from the NIR and VIS modalities into a common feature space. The optimization criterion aims to minimize the intra-class scatter while maximizing the inter-class scatter. They further extended the algorithm to deal with more challenging situations where the sample distribution is non-Gaussian, by kernelization, and where the transform is multi-modal.

After analyzing the properties of NIR and VIS images, Yi et al. [10] proposed a learning-based approach for cross-modality matching. In this approach, linear discriminant analysis (LDA) is used to extract features and reduce the dimension of the feature vectors. Then, a canonical correlation analysis (CCA) [129] based mechanism is learned to project the feature vectors from both modalities into a common CCA subspace. Finally, nearest-neighbor matching with the cosine distance is used to compute the matching score.

Both of the methods proposed by Lin and Yi tend to overfit to the training data. To overcome this, Liao et al. [11] presented an algorithm based on learned intrinsic local image structures. In the training phase, Difference-of-Gaussians filtering is used to normalize the appearance of the heterogeneous face images in the training set. Then, Multi-scale Block LBP (MB-LBP) [130] is applied to represent features called Local Structure of Normalized Appearance (LSNA). The resulting representation is high-dimensional, so Adaboost is used for feature selection to discover a subset of informative features. R-LDA is then applied on the whole training set to construct a discriminative subspace. Finally, matching is performed with a verification-based strategy, where the cosine distance between the projected vectors is compared with a threshold to decide a match.

Klare et al. [41] build on [11], but improve it in a few ways. They add HOG to the previous LBP descriptors to better represent patches, and use an ensemble of random LDA subspaces [41] to learn a shared projection with reduced overfitting. Finally, matching is performed with NN and sparse-representation based classifiers.

Lei et al. [19] presented a method to match NIR and VIS face images called Coupled Spectral Regression (CSR). Similar to other projection-based methods, they use two mappings to project the heterogeneous data into a common subspace. In order to further improve the performance of the algorithm (efficiency and generalisation), they use solutions derived from the view of graph embedding [131] and spectral regression [132], combined with regularization techniques. They later improved the same framework [20] to better exploit the cross-modality supervision and sample locality.

Huang et al. [133] proposed a discriminative spectral regression (DSR) method that maps NIR/VIS face images into a common discriminative subspace in which robust classification can be achieved. They transform the subspace learning problem into a least squares problem, in which images from the same subject should be mapped close to each other, while those from different subjects should be as separated as possible. To reflect category relationships in the data, they also developed two novel regularization terms.

Yi et al. [47] applied Restricted Boltzmann Machines (RBMs) to address the non-linearity of the NIR–VIS projection. After extracting Gabor features at localized facial points, RBMs are used to learn a shared representation at each facial point. These locally learned representations are stacked and processed by PCA to yield a final holistic representation.
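As a concrete illustration of the common-subspace strategy, the sketch below follows the spirit of the CCA-based pipeline of Yi et al. [10]: per-modality features are projected into a shared CCA subspace, and probes are matched to the gallery by cosine similarity. The feature matrices, dimensionalities and toy data are placeholders, not the original experimental settings.

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Placeholder training features: row i of each matrix comes from the
# same subject (e.g., LDA-reduced descriptors of a NIR/VIS image pair).
X_nir = rng.normal(size=(200, 64))
X_vis = X_nir + 0.3 * rng.normal(size=(200, 64))  # toy correlated views

cca = CCA(n_components=16)
cca.fit(X_nir, X_vis)  # learns one projection per modality

# Project NIR probes with the NIR-side projection; to project the VIS
# gallery, transform() maps its second argument with the VIS-side
# projection (the first argument is a shape-matching dummy).
probe_nir = rng.normal(size=(5, 64))
gallery_vis = rng.normal(size=(50, 64))
probe_z = cca.transform(probe_nir)
_, gallery_z = cca.transform(np.zeros_like(gallery_vis), gallery_vis)

# Rank-1 identification by cosine similarity in the shared subspace.
rank1 = cosine_similarity(probe_z, gallery_z).argmax(axis=1)

The discriminative variants reviewed above (CDFE, CSR, DSR) differ mainly in replacing the pure correlation objective with one that also separates identities in the shared space.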
4.4. Feature based approaches

Zhu et al. [21] interpret the VIS–NIR problem as a highly illumination-variant task. They address it by designing an effective illumination invariant descriptor, the logarithm gradient histogram (LGH). This outperforms the LBP and SIFT descriptors used by [11] and [41] respectively. As a purely feature-based approach, no training data is required.

Huang et al. [134], in contrast to most approaches, perform feature extraction after CCA projection. CCA is used to maximize the correlations between NIR and VIS image pairs. Based on the low-dimensional representations obtained by CCA, they extract three different modality-invariant features, namely the quantized distance vector (QDV), sparse coefficients (SC), and least square coefficients (LSC). These features are then represented with a sparse coding framework, and the sparse coding coefficients are used as the encoding for matching.

Goswami et al. [124] introduced a new dataset for NIR/VIS face recognition. To establish baselines for the new dataset, they compared a series of photometric normalization techniques, followed by LBP-based encoding and LDA to find an invariant subspace. They compared classification with Chi-squared and cosine distances, as well as establishing a logistic-regression based verification model that obtained the best performance by fusing the weights from each of the model variants.
Gong and Zheng [135] proposed a learned feature descriptor that adapts its parameters to maximize the correlation of the encoded face images between the two modalities. With this descriptor, within-class variations can be reduced at the feature extraction stage, therefore offering better recognition performance. This descriptor outperforms classic HOG, LBP and MLBP; however, unlike the others, it requires training.

To tackle cross-spectral face recognition, Dhamecha et al. [37] evaluated the effectiveness of a variety of HOG variants. They concluded that DSIFT with subspace LDA outperforms other features and algorithms.

Finally, Zhu et al. [38] presented a new logarithmic Difference of Gaussians (Log-DoG) feature, derived from a mathematical rather than merely empirical analysis of the properties of various features for recognition. Beyond this, they also present a framework for projecting to a non-linear discriminative subspace for recognition. In addition to aligning the modalities, and regularization with a manifold, their projection strategy uniquely exploits the unlabelled test data transductively.
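Photometric normalization followed by a local texture descriptor is the recurring recipe behind several of the feature-based methods above (e.g., the DoG filtering of [11] combined with LBP-style encodings). The sketch below shows one plausible instantiation using SciPy and scikit-image; the filter scales and grid size are illustrative assumptions rather than values from any of the cited papers.

import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import local_binary_pattern

def dog_normalize(img, s1=1.0, s2=2.0):
    """Difference-of-Gaussians photometric normalization: suppresses
    low-frequency illumination while keeping facial structure."""
    img = np.log1p(img.astype(float))  # compress dynamic range
    dog = gaussian_filter(img, s1) - gaussian_filter(img, s2)
    return (dog - dog.mean()) / (dog.std() + 1e-8)

def lbp_histogram(img, P=8, R=1, grid=(4, 4)):
    """Concatenated histograms of uniform LBP codes over a spatial grid."""
    codes = local_binary_pattern(img, P, R, method="uniform")
    n_bins = P + 2
    H, W = codes.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = codes[gy*H//grid[0]:(gy+1)*H//grid[0],
                         gx*W//grid[1]:(gx+1)*W//grid[1]]
            h, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)

# descriptor = lbp_histogram(dog_normalize(nir_face))

Descriptors produced this way can be compared directly (e.g., Chi-squared or cosine distance) or fed into the learned projections of Section 4.3.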
Table 5
NIR–VIS matching methods: performance on benchmark datasets.

Category          Publication  Recognition approach                       Dataset             Feature           Train:Test        Rank-1 accuracy
Synthesis based   [12]         Analysis-by-synthesis framework            Self-collected      Texture patterns  200:200           About 90%
                  [27]         LLE                                        Self-collected      LBP               250:250           94%
                  [127]        Probabilistic statistical model            CASIA HFB           –                 200:200           40%
                  [128]        Cross-spectral joint dictionary learning   CASIA NIR–VIS 2.0   –                 8600:6358         79%
Projection based  [40]         CDFE                                       Self-collected      –                 800:64            68%
                  [11]         LSNA                                       CASIA NIR–VIS 2.0   –                 3464:1633         68%
                  [41]         Random LDA subspace                        CASIA NIR–VIS 2.0   HoG + LBP         2548:2548         93%
                  [19]         Coupled Spectral Regression                CASIA NIR–VIS 2.0   LBP               2549:2548         97%
                  [133]        Discriminative Spectral Regression         CASIA NIR–VIS 2.0   LBP               2549:2548         95%
                  [47]         Restricted Boltzmann Machines              CASIA HFB           Gabor             About 2500:2500   99%
                                                                          CASIA NIR–VIS 2.0   Gabor             –                 87%
Feature based     [21]         NN                                         CASIA HFB           LGH               400:400           46%
                  [38]         THFM                                       CASIA HFB           Log-DoG           400:400           99%

4.5. Summary and conclusions

Given their decreasing cost, NIR acquisition devices are gradually becoming an integrated component of everyday surveillance cameras. Combined with the potential to match people in a (visible-light) illumination independent way, this has generated increasing interest in NIR–VIS face recognition.

Table 5 summarizes the results of the major cross-spectral studies in terms of recognition approach, dataset, feature representation, train-to-test ratio, and rank-1 accuracy. Results are promising, but the lack of standardization in benchmarking prevents direct quantitative comparison across methods.

As with all the HFR scenarios reviewed here, NIR–VIS studies have addressed bridging the cross-modal gap with a variety of synthesis, projection and feature-based techniques. One notable unique aspect of NIR–VIS is that it is the change in illumination type that is the root of the cross-modal challenge. For this reason, image-processing or physics-based photometric normalization methods (e.g., gamma correction, contrast equalization, DoG filtering) are often able to play a greater role. This is because it is to some extent possible to model the cross-modal lighting change more analytically and explicitly than in other HFR scenarios, which must rely entirely on machine learning or invariant feature extraction methods.

5. Matching 2D to 3D

The majority of prior HFR systems work with 2D images, whether the face is photographed, sketched or composited. Owing to the 2D projection nature of these faces, such systems often exhibit high sensitivity to illumination and pose. Thus 3D–3D face matching has been of interest for some time [58]. However, 3D–3D matching is hampered in practice by the complication and cost of 3D acquisition compared to 2D equipment. An interesting variant is thus the cross-modal middle ground of using 3D images for enrollment and 2D images for probes. This is useful, for example, in access control, where enrollment is centralized (and 3D images are easy to obtain) but the access gate can be deployed with simpler and cheaper 2D equipment. In this case, 2D probe images can potentially be matched more reliably against the 3D enrollment model than against a 2D enrollment image, if the cross-domain matching problem can be solved effectively.

A second motivation for 2D–3D HFR arises indirectly in the situation where pose-invariant 2D–2D matching is desired [14,15,136]. In this case the faces can be dramatically out of correspondence, so it may be beneficial to project one face to 3D in order to better reason about alignment, or to synthesize a better aligned or lit image for better matching.

5.1. Datasets

The Face Recognition Grand Challenge (FRGC) V2.0 dataset (downloadable at http://www.nist.gov/itl/iad/ig/frgc.cfm) is widely used for 2D–3D face recognition. It consists of a total of 50,000 recordings spread evenly across 6,250 subjects. For each subject, there are 4 images taken in controlled light, 2 images taken under uncontrolled light and 1 3D image. The controlled images were taken in a studio setting, while uncontrolled images were taken in changing illumination conditions. The 3D images were taken by a Minolta Vivid 900/910 series sensor, including both range and texture cues. An example from the FRGC V2.0 dataset is shown in Fig. 8.

UHDB11 [137] is another popular dataset in 2D–3D face recognition. It consists of samples from 23 individuals, for each of which it has 2D high-resolution images spanning six illumination conditions and 12 head-pose variations (72 variations in total), and a textured 3D facial mesh model. Each capture consists of both 2D images captured using a Canon DSLR camera and a 3D mesh captured by a 3dMD 2-pod optical 3D system. UHDB12 [138] is an incremental update to UHDB11 [137]: 3D data were captured using a 3dMD 2-pod optical scanner, while 2D images were collected using a commercial Canon DSLR camera. The 2D acquisition setup has six diffuse lights that vary the lighting conditions. For each subject, a single 3D scan and 6 2D images under different lighting conditions were captured. Overall, there are 26 subjects with a total of 26 3D scans and 800 2D images. The most recent UHDB31 [14] dataset includes 3D models and facial images from 21 viewpoints for each of the 77 subjects used. All data were captured using 21 3dMD high-resolution cameras and a high-resolution SLR.
Table 6
3D–2D matching methods: performance on benchmark datasets.

6.1. Datasets

In the SCface dataset [141], the different surveillance cameras result in LR images from 144 × 108 to 224 × 168 pixels in size. Some simple PCA baselines for cross-resolution recognition are also provided.

6.2. Synthesis based approaches

Synthesis can also be performed in the low dimensional eigenface domain [51]; this is more robust to noise and registration than general pixel-based super-resolution.

6.3. Projection-based approaches

Approaches in this category project both LR and HR face images into a common discriminative space (a minimal sketch is given at the end of this subsection).

Representation learning and metric learning were combined and optimized jointly by Moutafis and Kakadiaris [53]. Matching is finally performed using NN with the learned metric.

Finally, Bhatt et al. [55] addressed LR–HR matching while simultaneously addressing the sparsity of annotated data by combining the ideas of co-training and transfer learning. They pose learning HFR as a transfer learning problem of adapting an (easier to train) HR–HR matching model to a (HFR) HR–LR matching task. The base model is a binary-verification SVM based on LPQ and SIFT features. To address the sparsity of annotated cross-domain training data, they perform co-training, which exploits a large but un-annotated pool of cross-domain data to improve the matching model.
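The following is a minimal sketch of the common-space idea for LR–HR matching: learn a ridge-regression mapping from LR features into the HR feature space (treating the HR space itself as the shared space), then match by cosine similarity. This is a simplified stand-in for the coupled-mapping and metric-learning formulations of [53,55], not a reimplementation of them.

import numpy as np

def learn_coupled_mapping(X_lr, X_hr, lam=1e-2):
    """Ridge regression from LR features to the HR feature space:
    W = argmin ||X_lr W - X_hr||^2 + lam ||W||^2, solved in closed form."""
    d = X_lr.shape[1]
    return np.linalg.solve(X_lr.T @ X_lr + lam * np.eye(d), X_lr.T @ X_hr)

def rank1_match(probe_lr, gallery_hr, W):
    """Project an LR probe and return the index of the nearest HR
    gallery entry under cosine similarity."""
    z = probe_lr @ W
    g = gallery_hr / np.linalg.norm(gallery_hr, axis=1, keepdims=True)
    return int(np.argmax(g @ (z / np.linalg.norm(z))))

Replacing the plain least-squares objective with one that also pulls same-identity pairs together is what gives the discriminative variants reviewed above their edge.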
6.4. Summary and conclusions

Both high-resolution synthesis and subspace projection methods have been successfully applied to LR–HR recognition. In both cases, the key insight for improving performance has been to use discriminative information in the reconstruction/projection, so that the new representation is both accurate and discriminative for identity. Interestingly, while this discriminative cue has been used relatively less frequently in SBFR, NIR and 3D matching, it has been used almost throughout in HR–LR matching. Table 7 summarizes the results of the major LR–HR matching studies, although again a lack of consistency in experimental settings prevents direct quantitative comparison.

Table 7
LR–HR matching methods: performance on benchmark datasets.

Method   Publications   Recognition approach   Dataset   Feature   Train:Test   High resolution   Low resolution   Accuracy

7.1. Common themes

7.1.1. Model types
Although the set of modality pairs considered has been extremely diverse (Sketch–Photo, VIS–NIR, HR–LR, 2D–3D), it is interesting that a few common themes emerge about how to tackle modality heterogeneity. Synthesis and subspace-projection have been applied in each case. Moreover, integrating the learned projection with a discriminative constraint, that different identities should be separable, has been effectively exploited in a variety of ways. On the other hand, feature engineering approaches, while often highly effective, have been largely limited to situations where the input representation itself is not intrinsically heterogeneous (Sketch–Photo and VIS–NIR).

7.1.2. Learning-based or engineered
An important property differentiating cross-domain recognition systems is whether they require training data or not (and if so, how much). Most feature-engineering based approaches have the advantage of requiring no training data, and thus not requiring a (possibly hard to obtain) dataset of annotated image pairs to be obtained before training for any particular application. On the other hand, synthesis and projection approaches (and some learning-based feature approaches), along with discriminatively trained matching strategies, can potentially perform better at the cost of requiring such a dataset. A third, less-explored alternative is approaches that can perform effective unsupervised representation learning, such as auto-encoders and RBMs [47].
diverse datasets [149]. Current HFR datasets, notably in sketch, are also small and likely insufficiently diverse. As new, larger and more diverse datasets are established, it will become clear whether existing methods do indeed generalize, and whether the current top performers continue to be the most effective.

7.2. Issues and directions for future research

7.2.1. Training data volume
An issue for learning-based approaches is how much training data is required. Simple mappings to low-dimensional subspaces may require less data than more sophisticated non-linear mappings across modalities, although the latter are in principle more powerful. Current heterogeneous face datasets, for example in sketch [25,32,39], are much smaller than those used in homogeneous face recognition [82] and broader computer vision [150] problems. As larger heterogeneous datasets are collected in future, more sophisticated non-linear models may gain the edge. This is even more critical for future research into HFR with deep-learning based methodologies, which have proven especially powerful in conventional face recognition but require thousands to millions of annotated images [3].

7.2.2. Alignment
Unlike homogeneous face recognition, which has moved on to recognition 'in the wild' [82], heterogeneous recognition generally relies on accurately and manually aligned facial images. As a result, it is unclear how existing approaches will generalize to practical applications with inaccurate automatic alignment. Future work should address HFR methods that are robust enough to deal with residual alignment errors, or that integrate alignment into the recognition process (a typical landmark-based alignment step is sketched below).
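Since most reviewed methods assume carefully aligned inputs, a minimal landmark-based alignment step looks like the following scikit-image sketch; the canonical landmark coordinates and crop size are assumptions for illustration only.

import numpy as np
from skimage import transform

# Assumed canonical (x, y) positions of left eye, right eye and mouth
# center in a 64 x 64 crop.
REF = np.array([[20.0, 24.0], [44.0, 24.0], [32.0, 48.0]])

def align_face(img, landmarks):
    """Map detected landmarks ((x, y) rows for left eye, right eye,
    mouth center) onto the canonical positions with a similarity
    transform, then resample the image."""
    tform = transform.SimilarityTransform()
    # Estimate the transform from canonical (output) coordinates to the
    # detected (input) coordinates; warp() uses it as the inverse map.
    tform.estimate(REF, np.asarray(landmarks, dtype=float))
    return transform.warp(img, tform, output_shape=(64, 64))

Residual errors from automatic landmark detectors propagate directly into such crops, which is why robustness to misalignment matters for practical deployment.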
7.2.3. Side information and soft biometrics
Side information and soft biometrics have been used in a few studies [109] to prune the search space and improve matching performance. The most obvious examples of this are filtering by gender or ethnicity. Where this information is provided as metadata, filtering to reduce the matching space is trivial (see the sketch below). Alternatively, such soft-biometric properties can be estimated directly from the data, and the estimates then used to refine the search space. However, better biometric estimation and appropriate fusion methods then need to be developed to balance the contribution of the biometric cue versus the face-matching cue.
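The metadata-based filtering mentioned above can be as simple as the following sketch; the Enrollment fields and the match_score function are hypothetical placeholders, not part of any cited system.

from dataclasses import dataclass

@dataclass
class Enrollment:
    features: list   # pre-extracted face descriptor
    gender: str      # soft-biometric metadata, e.g. "F" / "M"
    ethnicity: str

def filtered_rank(probe_feat, probe_meta, gallery, match_score):
    # Keep only gallery entries consistent with the probe's metadata...
    candidates = [g for g in gallery
                  if g.gender == probe_meta["gender"]
                  and g.ethnicity == probe_meta["ethnicity"]]
    # ...then rank the (much smaller) candidate set by match score.
    return sorted(candidates,
                  key=lambda g: match_score(probe_feat, g.features),
                  reverse=True)

When the soft biometrics are estimated rather than given, the hard filter above should be replaced by score fusion, so that an erroneous estimate cannot eliminate the true match outright.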
7.2.4. Facial attributes
Related to soft biometrics is the concept of facial attributes. Attribute-centric modeling has made a huge impact on broader computer vision problems [151], and attributes have successfully been applied to cross-domain modeling for person (rather than face) recognition [152]. Early analysis using manually annotated attributes highlighted their potential to help bridge the cross-modal gap by representing faces at a higher level of abstraction [54]. Recent studies [118] have begun to address fully automating the attribute extraction task for cross-domain recognition, as well as releasing facial attribute annotation datasets (for both caricature and forensic sketch) to support research in this area. In combination with improving facial attribute recognition techniques [153], this is a promising avenue to bridge the cross-modal gap.

7.2.5. Computation time
For automated surveillance, or search against realistically large mugshot datasets, we may need to recognize faces in milliseconds. Test-time computation is thus important, which may be an implication for models with sophisticated non-linear mappings across modalities, or, in the LR–HR case, synthesis (super-resolution) methods that are often expensive. Deep learning techniques may help here: while they are costly to train, they can provide strong non-linear mappings with modest run-time cost.

7.2.6. Technical methodologies
CCA, PLS, sparse coding, MRFs, metric learning and various generalizations thereof have been used extensively in the studies reviewed here. Going forward, there are other promising methodologies that are currently under-exploited in HFR, notably transfer learning and deep learning.
7.2.7. Deep learning
Deep learning has transformed many problems in computer vision by learning significantly more effective feature representations [154]. These representations can be unsupervised or discriminatively trained, and have been used to good effect in conventional face recognition [3,155]. They have also been effectively applied to many HFR-related problems, including face recognition across pose [156], facial attribute recognition [153] (which provides a more abstract domain/modality invariant representation), and super-resolution [157] (which could potentially be used to address the HR–LR variant of HFR). Preliminary studies found that conventional photo face recognition DNNs do not provide an excellent out-of-the-box representation for HFR [78], suggesting that they need to be trained and/or designed specifically for HFR.

Only a few studies have begun to consider the application of Deep Neural Networks (DNNs) to HFR [24], thus there is significant scope for deep learning to make an impact in future. In terms of our abstract HFR pipeline, deep learning approaches to HFR would combine both feature-based and synthesis or projection approaches by learning a deep hierarchy of features that together bridge the cross-modal gap. Both cross-modal HFR synthesis and matching would be possible with deep learning: e.g., by fully-convolutional networks such as those used in super-resolution [157], image-to-image encoders such as those used for cross-pose matching [156], or multi-branch verification/ranking networks such as those used in other matching problems [158,159] (a minimal two-branch sketch follows). To fully exploit DNNs for the HFR problem, a key challenge is HFR datasets, which likely need to grow to support the training data requirements of DNNs, or else methods for training DNNs with sparse data [160] must be developed. Nevertheless, if this can be solved, DNNs are expected to provide improved feature and cross-modal projection learning compared to existing approaches. Like CCA-style projections, but unlike many other reviewed methods, they can match across heterogeneous dimensionality, e.g., as required for 2D–3D matching. Moreover, they provide the opportunity to integrate a number of other promising strategies discussed earlier, including multi-task learning for integrating attribute/biometric information with matching [161], jointly reasoning about alignment and matching [159], and fast yet non-linear matching.
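As a concrete example of the multi-branch direction, the following PyTorch sketch trains a two-branch verification network with a cosine embedding loss. The architecture, loss margin and input sizes are illustrative assumptions, not any published HFR design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchHFR(nn.Module):
    """One encoder per modality maps (e.g.) NIR probes and VIS gallery
    images into a shared space where genuine cross-modal pairs are
    pulled together and impostor pairs pushed apart."""
    def __init__(self, dim=128):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(64 * 16, dim))
        self.enc_a = encoder()  # modality A (e.g., NIR)
        self.enc_b = encoder()  # modality B (e.g., VIS)

    def forward(self, xa, xb):
        # L2-normalized embeddings: cosine similarity is a dot product.
        return F.normalize(self.enc_a(xa)), F.normalize(self.enc_b(xb))

# One verification-style training step on (image_a, image_b, same_id).
model = TwoBranchHFR()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
xa = torch.randn(8, 1, 64, 64)          # toy batch, modality A
xb = torch.randn(8, 1, 64, 64)          # toy batch, modality B
y = torch.randint(0, 2, (8,)) * 2 - 1   # +1 genuine pair, -1 impostor
za, zb = model(xa, xb)
loss = F.cosine_embedding_loss(za, zb, y.float(), margin=0.3)
opt.zero_grad()
loss.backward()
opt.step()

At test time only the two encoders are needed, so matching reduces to a nearest-neighbor search over embeddings, keeping run-time cost modest despite the non-linear mapping.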
7.2.8. Transfer learning
Transfer learning (including Domain Adaptation (DA)) [120] is also growing in importance in other areas of computer vision [162], and has begun to influence, e.g., view and lighting invariant face recognition [163]. This research area addresses adapting models to a different-but-related task or domain from the one on which they were trained [120,162]. There exist both approaches that adapt specific models [24,120] and model-agnostic approaches that adapt low-level features [120,163]. Some require annotated target-domain training data [24], while others do not [163]. A straightforward application of DA to HFR would be adapting a within-domain model (e.g., HR–HR) to another within-domain setting (e.g., LR–LR). The outstanding research question for HFR is how to use these ideas to support cross-modal matching, which is just beginning to be addressed [24,55]. Finally, we note that TL is potentially synergistic with deep learning, in potentially allowing a strong DNN trained from large conventional recognition datasets to be adapted to HFR tasks.
7.3. Conclusion

In this survey we have reviewed the state-of-the-art methodology and datasets in heterogeneous face recognition across multiple modalities, including Photo–Sketch, VIS–NIR, 2D–3D and HR–LR. We provided a common framework to break down and understand the individual components of an HFR pipeline, and a typology of approaches that can be used to relate methods both within and across these diverse HFR settings. Based on this analysis we extracted common themes, drawing connections across the somewhat distinct communities of HFR research, as well as identifying challenges for the field and directions for future research.

References

[1] W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, Face recognition: a literature survey, J. ACM Comput. Surv. (CSUR) (2003) 399–458.
[2] Frontex, BIOPASS II: automated biometric border crossing systems based on electronic passports and facial recognition: RAPID and SmartGate, 2010.
[3] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, NIPS, 2014.
[4] H. Nejati, T. Sim, A study on recognizing non-artistic face sketches, IEEE Workshop on Applications of Computer Vision (WACV), 2011. pp. 240–247.
[5] P. Yuen, C.H. Man, Human face image searching system using sketches, IEEE Trans. Syst. Man Cybern. Part A Syst. Humans (2007) 493–504.
[6] S. Pramanik, D. Bhattacharjee, Geometric feature based face-sketch recognition, Pattern Recognition, Informatics and Medical Engineering (PRIME), 2012. pp. 409–415.
[7] X. Tang, X. Wang, Face photo recognition using sketch, ICIP, 2002. pp. 257–260.
[8] X. Wang, X. Tang, Face sketch synthesis and recognition, ICCV, 2003. pp. 687–694.
[9] A. Sharma, D. Jacobs, Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch, CVPR, 2011. pp. 593–600.
[10] D. Yi, R. Liu, R. Chu, Z. Lei, S. Li, Face matching between near infrared and visible light images, Advances in Biometrics, Springer, 2007, pp. 523–530.
[11] S. Liao, D. Yi, Z. Lei, R. Qin, S.Z. Li, Heterogeneous face recognition from local structures of normalized appearance, International Conference on Advances in Biometrics, 2009. pp. 209–218.
[12] R. Wang, J. Yang, D. Yi, S. Li, An analysis-by-synthesis method for heterogeneous face biometrics, Advances in Biometrics, Springer, 2009, pp. 319–326.
[13] G. Toderici, G. Passalis, S. Zafeiriou, G. Tzimiropoulos, M. Petrou, T. Theoharis, I. Kakadiaris, Bidirectional relighting for 3D-aided 2D face recognition, CVPR, 2010. pp. 2721–2728.
[14] Y. Wu, S.K. Shah, I.A. Kakadiaris, Rendering or normalization? An analysis of the 3D-aided pose-invariant face recognition, ISBA, 2016. pp. 1–8.
[15] A. Moeini, H. Moeini, K. Faez, Unrestricted pose-invariant face recognition by sparse dictionary matrix, IVC, 2015.
[16] C. Zhou, Z. Zhang, D. Yi, Z. Lei, S. Li, Low-resolution face recognition via simultaneous discriminant analysis, The International Joint Conference on Biometrics (IJCB), 2011. pp. 1–6.
[17] Z. Wang, Z. Miao, Y. Wan, Z. Tang, Kernel coupled cross-regression for low-resolution face recognition, Math. Probl. Eng. (2013) 1–20.
[18] P. Hennings-Yeomans, S. Baker, B. Kumar, Simultaneous super-resolution and feature extraction for recognition of low-resolution faces, CVPR, 2008. pp. 1–8.
[19] Z. Lei, S. Li, Coupled spectral regression for matching heterogeneous faces, CVPR, 2009. pp. 1123–1128.
[20] Z. Lei, C. Zhou, D. Yi, A.K. Jain, S.Z. Li, An improved coupled spectral regression for heterogeneous face recognition, The IAPR International Conference on Biometrics (ICB), 2012. pp. 7–12.
[21] J.Y. Zhu, W.S. Zheng, J.-H. Lai, Logarithm gradient histogram: a general illumination invariant descriptor for face recognition, FG, 2013. pp. 1–8.
[22] S. Biswas, K.W. Bowyer, P.J. Flynn, Multidimensional scaling for matching low-resolution face images, TPAMI, 2012. pp. 2019–2030.
[23] C.X. Ren, D.Q. Dai, H. Yan, Coupled kernel embedding for low-resolution face image recognition, TIP, 2012. pp. 3770–3783.
[24] R. Singh, P. Mittal, M. Vatsa, Composite sketch recognition via deep network: a transfer learning approach, IAPR International Conference on Biometrics, 2015. pp. 251–256.
[25] X. Wang, X. Tang, Face photo–sketch synthesis and recognition, TPAMI, 2009. pp. 1955–1967.
[26] Q. Liu, X. Tang, H. Jin, H. Lu, S. Ma, A nonlinear approach for face sketch synthesis and recognition, CVPR, 2005. pp. 1005–1010.
[27] J. Chen, D. Yi, J. Yang, G. Zhao, S. Li, M. Pietikainen, Learning mappings for face synthesis from near infrared to visual light images, CVPR, 2009. pp. 156–163.
[28] W. Yang, D. Yi, Z. Lei, J. Sang, S. Li, 2D–3D face matching using CCA, Automatic Face Gesture Recognition, 2008. pp. 1–6.
[29] H. Galoogahi, T. Sim, Inter-modality face sketch recognition, ICME, 2012. pp. 224–229.
[30] H. Kiani Galoogahi, T. Sim, Face photo retrieval by sketch example, ACM MM, 2012. pp. 1–4.
[31] B. Klare, A.K. Jain, Sketch-to-photo matching: a feature-based approach, Biometric Technology for Human Identification VII, SPIE, 2010. pp. 1–10.
[32] H. Bhatt, S. Bharadwaj, R. Singh, M. Vatsa, Memetically optimized MCWLD for matching sketches with digital face images, TIFS (2012) 1522–1535.
[33] Z. Khan, Y. Hu, A. Mian, Facial self similarity for sketch to photo matching, Digital Image Computing: Techniques and Applications (DICTA), 2012. pp. 1–7.
[34] H. Han, B. Klare, K. Bonnen, A. Jain, Matching composite sketches to face photos: a component-based approach, IEEE Transactions on Information Forensics and Security, 2013. pp. 191–204.
[35] S. Liu, D. Yi, Z. Lei, S. Li, Heterogeneous face image matching using multi-scale features, The IAPR International Conference on Biometrics (ICB), 2012. pp. 79–84.
[36] D. Huang, M. Ardabilian, Y. Wang, L. Chen, Oriented gradient maps based automatic asymmetric 3D–2D face recognition, The IAPR International Conference on Biometrics (ICB), 2012. pp. 125–131.
[37] T.I. Dhamecha, P. Sharma, R. Singh, M. Vatsa, On effectiveness of histogram of oriented gradient features for visible to near infrared face matching, International Conference on Pattern Recognition (ICPR), 2014. pp. 1788–1793.
[38] J.Y. Zhu, W.S. Zheng, J.-H. Lai, S. Li, Matching NIR face to VIS face using transduction, TIFS, 2014. pp. 501–514.
[39] W. Zhang, X. Wang, X. Tang, Coupled information-theoretic encoding for face photo–sketch recognition, CVPR, 2011. pp. 513–520.
[40] D. Lin, X. Tang, Inter-modality face recognition, ECCV, 2006. pp. 13–26.
[41] B. Klare, A. Jain, Heterogeneous face recognition: matching NIR to visible light images, International Conference on Pattern Recognition (ICPR), 2010. pp. 1513–1516.
[42] D. Huang, M. Ardabilian, Y. Wang, L. Chen, Asymmetric 3D/2D face recognition based on LBP facial representation and canonical correlation analysis, ICIP, 2009. pp. 3325–3328.
[43] S. Shekhar, V. Patel, R. Chellappa, Synthesis-based recognition of low resolution faces, The 2011 International Joint Conference on Biometrics (IJCB), 2011. pp. 1–6.
[44] M. Ardabilian, D. Huang, Y. Wang, L. Chen, Automatic asymmetric 3D–2D face recognition, International Conference on Pattern Recognition (ICPR), 2010. pp. 1225–1228.
[45] S. Siena, V. Boddeti, B. Kumar, Maximum-margin coupled mappings for cross-domain matching, Biometrics: Theory, Applications and Systems (BTAS), 2013. pp. 1–8.
[46] D.A. Huang, Y.C.F. Wang, Coupled dictionary and feature space learning with applications to cross-domain image synthesis and recognition, ICCV, 2013. pp. 2496–2503.
[47] D. Yi, Z. Lei, S.Z. Li, Shared representation learning for heterogeneous face recognition, FG, 2015. pp. 1–15.
[48] W. Zou, P. Yuen, Very low resolution face recognition problem, TIP, 2012. pp. 327–340.
[49] J. Jiang, R. Hu, Z. Han, K. Huang, T. Lu, Graph discriminant analysis on multi-manifold: a novel super-resolution method for face recognition, ICIP, 2012. pp. 1465–1468.
[50] H. Huang, H. He, Super-resolution method for face recognition using nonlinear mappings on coherent features, IEEE Transactions on Neural Networks (2011) 121–130.
[51] B. Gunturk, A. Batur, Y. Altunbasak, M. Hayes, R. Mersereau, Eigenface-domain super-resolution for face recognition, TIP, 2003. pp. 597–606.
[52] B. Li, H. Chang, S. Shan, X. Chen, Coupled metric learning for face recognition with degraded images, Advances in Machine Learning, Asian Conference on Machine Learning (ACML), 2009, pp. 220–233.
[53] P. Moutafis, I. Kakadiaris, Semi-coupled basis and distance metric learning for cross-domain matching: application to low-resolution face recognition, IEEE International Joint Conference on Biometrics (IJCB), 2014. pp. 1–8.
[54] B. Klare, S. Bucak, A. Jain, T. Akgul, Towards automated caricature recognition, The IAPR International Conference on Biometrics (ICB), 2012. pp. 139–146.
[55] H. Bhatt, R. Singh, M. Vatsa, N. Ratha, Improving cross-resolution face matching using ensemble-based co-transfer learning, IEEE Trans. Image Process. (2014) 5654–5669.
[56] P. Mittal, A. Jain, R. Singh, M. Vatsa, Boosting local descriptors for matching composite and digital face images, ICIP, 2013. pp. 2797–2801.
[57] P. Mittal, A. Jain, G. Goswami, R. Singh, M. Vatsa, Recognizing composite sketches with digital face images via SSD dictionary, IEEE International Joint Conference on Biometrics (IJCB), 2014. pp. 1–6.
[58] K.W. Bowyer, K. Chang, P. Flynn, A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition, CVIU, 2006. pp. 1–15.
[59] S.G. Kong, J. Heo, B.R. Abidi, J. Paik, M.A. Abidi, Recent advances in visual and infrared face recognition - a review, CVIU (2005) 103–135.
[60] X. Zhang, Y. Gao, Face recognition across pose: a review, PR (2009) 2876–2896.
[61] X. Zou, J. Kittler, K. Messer, Illumination invariant face recognition: a survey, Biometrics: Theory, Applications, and Systems, 2007. pp. 1–8.
[62] X. Zhang, Y. Gao, Face recognition across pose: a review, PR (2009) 2876–2896.
[63] Y. Wang, C.S. Chua, Face recognition from 2D and 3D images using 3D Gabor filters, IVC (2005) 1018–1028.
[64] C.S. Chua, Y. Wang, Robust face recognition from 2D and 3D images using structural Hausdorff distance, IVC (2006) 176–185.
[65] G.P. Kusuma, C.S. Chua, PCA-based image recombination for multimodal 2D and 3D face recognition.
[141] M. Grgic, K. Delac, S. Grgic, SCface - surveillance cameras face database, Multimedia Tools Appl. (2011) 863–879.
[142] P.H.H. Yeomans, B.V. Kumar, S. Baker, Robust low-resolution face identification and verification using high-resolution features, ICIP, 2009. pp. 33–36.
[143] W.W. Zou, P.C. Yuen, Learning the relationship between high and low resolution images in kernel space for face super resolution, International Conference on Pattern Recognition (ICPR), 2010. pp. 1152–1155.
[144] K. Jia, S. Gong, Multi-modal tensor face for simultaneous super-resolution and recognition, ICCV, 2005. pp. 1683–1690.
[145] B. Li, H. Chang, S. Shan, X. Chen, Low-resolution face recognition via coupled locality preserving mappings, IEEE Signal Process. Lett. (2010) 20–23.
[146] Z.-X. Deng, D.-Q. Dai, X.-X. Li, Low-resolution face recognition via color information and regularized coupled mappings, Chinese Conference on Pattern Recognition (CCPR), 2010. pp. 1–5.
[147] S. Cho, Y. Matsushita, S. Lee, Removing non-uniform motion blur from images, ICCV, 2007. pp. 1–8.
[148] A. Levin, Y. Weiss, F. Durand, W.T. Freeman, Understanding blind deconvolution algorithms, TPAMI (2011) 2354–2367.
[149] A. Torralba, A.A. Efros, Unbiased look at dataset bias, CVPR, 2011. pp. 1521–1528.
[150] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, CVPR, 2009. pp. 248–255.
[151] C.H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, CVPR, 2009. pp. 951–958.
[152] R. Layne, T.M. Hospedales, S. Gong, Person re-identification by attributes, BMVC, 2012. pp. 1–8.
[153] P. Luo, X. Wang, X. Tang, A deep sum-product architecture for robust facial attributes analysis, ICCV, 2013. pp. 2864–2871.
[154] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012. pp. 1–9.
[155] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S.Z. Li, T. Hospedales, When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition, 2015. arXiv preprint arXiv:1504.02351.
[156] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity-preserving face space, ICCV, 2013. pp. 113–120.
[157] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, ECCV, 2014. pp. 1–16.
[158] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T.M. Hospedales, C.C. Loy, Sketch me that shoe, CVPR, 2016. pp. 1–8.
[159] W. Li, R. Zhao, T. Xiao, X. Wang, DeepReID: deep filter pairing neural network for person re-identification, CVPR, 2014. pp. 152–159.
[160] G. Hu, X. Peng, Y. Yang, T.M. Hospedales, J. Verbeek, Frankenstein: learning deep face representations using small data, 2016. arXiv preprint arXiv:1603.06470.
[161] Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, ECCV, 2014. pp. 1–15.
[162] V. Patel, R. Gopalan, R. Li, R. Chellappa, Visual domain adaptation: a survey of recent advances, IEEE Signal Process. Mag. (2015).
[163] H.T. Ho, R. Gopalan, Model-driven domain adaptation on product manifolds for unconstrained face recognition, IJCV (2014).