
2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS)

Face Recognition - A One-Shot Learning Perspective
Sukalpa Chanda∗ ˆ, Asish Chakrapani GV† ˆ, Anders Brun‡ , Anders Hast‡ ,
Umapada Pal† and David Doermann§
∗ Department of Information Technology, Østfold University College, Norway
[email protected]/[email protected]
† Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, India
[email protected], [email protected]
‡ Centre for Image Analysis, Uppsala University, Sweden
{anders.brun, anders.hast}@it.uu.se
§ Computer Science and Engineering, University at Buffalo, USA
[email protected]

ˆ Equal contribution by the authors

Abstract—The ability to learn from a single instance is something unique to the human species, and One-Shot learning algorithms try to mimic this special capability. On the other hand, despite the fantastic performance of Deep Learning-based methods on various image classification problems, performance often depends on having a huge number of annotated training samples per class. This fact is certainly a hindrance in deploying deep neural network-based systems in many real-life applications like face recognition. Furthermore, adding a new class to the system requires re-training the whole system from scratch. Nevertheless, the prowess of deep learned features cannot be ignored. This research aims to combine the best of deep learned features with a traditional One-Shot learning framework. Results obtained on 2 publicly available datasets are very encouraging, achieving over 90% accuracy on 5-way One-Shot tasks and 84% on 50-way One-Shot problems.

Keywords-One-Shot Learning, Face recognition, Siamese Networks, Image Classification.

I. INTRODUCTION

Face recognition has been extensively explored over the last several decades. Its value as a non-contact biometric authentication method, and in a wide variety of other digital applications like security, digital entertainment systems, video analytics for marketing, and video indexing from a streaming video, cannot be ignored. Like any other image analysis problem, face recognition in its early days relied mainly on hand-crafted features like SIFT, SURF, Local Binary Pattern, Histogram of Gradient, and Fisher vectors, but with the advent of deep-learning methodologies there is a clear shift towards deep-learned features. During those early days, research was focused on improving the pre-processing stage and on the introduction of local descriptors and feature transformation, but such techniques failed to counter the challenges of unconstrained face recognition. Hand-crafted feature-based methods were used to address changes in lighting, pose, and expression but failed in real life due to their inability to address more general pose challenges. This has changed as the deep-learning methods have evolved.

Deep learning methods learn multiple levels of representations and abstractions by using a cascade of processing units for feature extraction and transformation. This leads to a hierarchy of abstraction/representation and addresses changes in face pose, illumination, and expression. Even though deep-learning-based methods can tackle changes in lighting, pose, and expression while performing face recognition, one disadvantage is their demand for a huge amount of annotated data to train the system and the requirement of re-training when a new class is added. While transfer learning techniques can help mitigate such problems by freezing the first few layers and tuning pre-trained weights from the last few layers on the new data, they do not completely eradicate the problem.

One-shot algorithms, on the other hand, use a completely different philosophy for classification. One-shot algorithms are meant to perform classification after seeing only a handful of the training samples. Thus a clever amalgamation of those two techniques could combine the best of both, providing a rich feature representation using deep learning techniques and feeding those features to a One-Shot learning framework for classification. A widely used strategy to implement One-Shot learning algorithms is to use a Siamese Neural Network with a triplet loss function. Our work takes a Siamese Neural Network-based approach to perform One-Shot learning and consequent classification. Deep Neural Network-based features from the “DLIB-ml machine learning toolkit” [1] are used for feature representation for all face images.

The primary contribution of this research is a novel hybrid method that combines a Siamese Neural Network with Res-Net encoded features for the One-Shot face recognition task. We also intend to publish our dataset with unconstrained face images procured from the “Indian Movie Faces Database” in the near future for One-Shot recognition task performance evaluation and benchmarking.

II. RELATED WORK

Face detection and recognition methods have had significant importance as an image analysis research problem for almost

978-1-7281-5686-6/19/$31.00 ©2019 IEEE 113


DOI 10.1109/SITIS.2019.00029

Authorized licensed use limited to: Auckland University of Technology. Downloaded on May 28,2020 at 23:47:46 UTC from IEEE Xplore. Restrictions apply.
3 decades. One of the seminal articles in the early nineties is [2], where the authors represent faces using a small set of 2-D eigenvectors. Face recognition methods can be broadly divided into hand-crafted feature-based approaches and, later, deep-learned feature-based approaches. The hand-crafted approaches focused mainly on high-dimensional artificial feature extraction and the reduction of features. The representative dimension reduction methods are subspace learning methods like Principal Component Analysis [3] and Linear Discriminant Analysis [4], and manifold learning methods like Locality Preserving Projection [5]. With the advance of deep learning, the representative method became learning the discriminative face representations directly from the original image space. For example, Hu et al. [6] evaluated the convolutional neural network applied to face recognition, analysing the advantages and disadvantages of this method and outlining a roadmap for future development. This work is further explored and state-of-the-art results are obtained in [7], [8], [9], [10]. Despite the exceptional performance of CNNs in some applications, such algorithms struggle to deal with many real-world applications that require learning or drawing inferences from small amounts of data, handling class imbalance, and adjusting to a constant inflow of new class information. The problem of developing an efficient, robust face recognition system at scale is no exception in this context.

In the past few years, there have been several works that address this problem. To address the data imbalance problem, Guo et al. [11] proposed a novel underrepresented-classes promotion loss term which aligned the norms of the weight vectors of underrepresented classes and normal classes, thus giving the one-shot classes an equal weightage. Work by Wang et al. [12] proposes a framework based on CNN, which deals with the deficient training data by using a balancing regularizer and shifting the center regeneration to regulate the norms of the weight vectors into the same scale and adjust the clustering center. Insufficient training data and data imbalance, however, cause the network to perform poorly. Ding et al. [13] proposed an approach to solve the underrepresented class problem in one-shot learning by focusing on building generative models to build extra examples. It proposed a generative model to synthesize data for one-shot classes by adapting the data variances and augmenting features from other normal classes. Another work by Jadhav et al. [14] proposed the method of deep attribute representation of faces for one-shot face recognition. They used specific attributes of human faces such as the shape of the face, hair, and gender to fine-tune a deep CNN for face recognition. Their experimental results on standard datasets showed that deep attribute representations performed better in the case of two one-shot face recognition techniques, namely an exemplar SVM and a one-shot similarity kernel. Wu et al. [15] proposed a framework with hybrid classifiers using a CNN and the nearest neighbor (NN) model. Hong et al. [16] propose a domain adaptation network to solve the One-shot task; the authors generated images in various poses using a 3D face model to train the deep model. Zhao et al. [17] proposed an enforced softmax that contains optimal dropout, selective attenuation, L2 normalization and model-level optimization, which boosted the standard softmax function to produce a better representation for low-shot learning.

The concept of Siamese Networks was initially introduced by Bromley et al. [18] for the signature verification problem, and the use of deep convolutional Siamese networks for one-shot tasks with significant accuracy has been showcased in [19]. Face recognition usually consists of face detection, feature extraction, and recognition. We use the dlib-ml [1] toolkit, which leverages image-driven neural networks to detect and extract the faces in a given image, and then use a ResNet-based architecture to generate a feature vector to represent each face. In this paper, we propose a method which integrates the concept of deep convolutional Siamese networks and a transfer learning strategy to produce a robust face recognition system which leverages the deep learned feature attributes.

III. METHODOLOGY

One-shot learning can be achieved in several ways. In this research we have explored two approaches: (a) a Siamese Neural Network-based approach; (b) a deep-feature encoding approach followed by nearest neighbor classification of those encoded features. We settled on a method that combines the two approaches. This combined method uses the encoded features generated by a ResNet CNN architecture as input to the Siamese network, and the Siamese network is trained to discriminate between two encoded feature vectors. In this combined approach a pre-trained deep convolutional neural network (ResNet) acts as a feature extractor for a pair of input images, and then an energy function Θ is used which ties the twin networks together to compute the similarity index. When the two encoded feature vectors for the input face images are obtained, the Siamese Network learns to score the similarity of those two encoded feature vectors in the range 0 to 1, where 1 is assigned if both input images are of the same class.

A. Siamese Network

Siamese networks are a subset of deep neural network architectures that contain two identical sub-networks working in cohesion; the sub-networks use the same weights while taking two distinct input vectors and are joined by a comparative function. Such networks are used to determine the similarity between two distinct inputs. For the network to be called 'Siamese', it is important not only that the architectures of the sub-networks are identical, but also that the weights are shared among them. In this study, the convolutional Siamese network is designed to learn features of the input images regardless of prior domain knowledge, with very few samples from a given distribution. This model was also adopted because the twin networks share weights, resulting in fewer parameters to train.
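The weight-sharing idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single dense layer, the layer sizes, the absolute-difference comparison and all names (`embed`, `siamese_score`, `W_shared`, `w_out`) are assumptions chosen only to show that both inputs pass through the same parameters before a comparative function maps them to a similarity score between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 128-d inputs (like the encoded face features), 64-d twin output.
IN_DIM, HID_DIM = 128, 64

# One set of parameters, used by BOTH twins (this is what makes the network "Siamese").
W_shared = rng.normal(scale=0.1, size=(IN_DIM, HID_DIM))
b_shared = np.zeros(HID_DIM)

# Parameters of the comparative function that joins the twins (assumed form).
w_out = rng.normal(scale=0.1, size=HID_DIM)
b_out = 0.0


def embed(x):
    """Twin sub-network: a single ReLU layer that uses the shared weights."""
    return np.maximum(0.0, x @ W_shared + b_shared)


def siamese_score(x1, x2):
    """Similarity in (0, 1): sigmoid of a weighted L1 distance between twin outputs."""
    diff = np.abs(embed(x1) - embed(x2))
    return 1.0 / (1.0 + np.exp(-(diff @ w_out + b_out)))


x1 = rng.normal(size=IN_DIM)
x2 = rng.normal(size=IN_DIM)
score_12 = siamese_score(x1, x2)
score_21 = siamese_score(x2, x1)
```

Because both inputs go through the same `embed` function, swapping the inputs leaves the score unchanged, and only one copy of the twin's weights has to be trained.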

Weight sharing also lowers the tendency to over-fit. For experiments, a small labeled support set consisting of train-validation classes and test classes was used. During training, the network takes a pair of images as the input and learns to discriminate between the two input images based on their class labels and features. The task is achieved by generating probability scores which indicate whether the two images belong to the same class or to different classes. For evaluation of n-way one-shot tasks, the network is provided with pairs of images consisting of a reference image and one sample image from each of the n unseen classes at each instance. The label from the pair with the highest probability is then given to the reference image. A pictorial diagram of our Siamese network is shown in Fig. 1.

Fig. 1: One sibling of the twin Siamese network architecture used in the experiment (twin network not depicted).

1) Learning Details: The same learning rate η_j is used for all the layers, with a step-based decay schedule that reduces it at a uniform rate of 1% every 500 iterations. The validation accuracy metric is calculated after every 1000 iterations and the model with the best accuracy is saved during training. The model is trained for a maximum of 100,000 iterations. An early stopping condition was included in case the validation accuracy does not show improvement over 10,000 iterations. The momentum for each layer is initialized with a value of 0.5 and evolves with a predefined linear slope until it attains a final value of 0.9. The model is trained with a batch size of 8, along with a linearly evolving layer-wise momentum μ_j for the jth layer and L2 regularization weights λ_j, for each iteration N. So the weight update rule for iteration N is:

W_{kj}^{N}(x_1^i, x_2^i) = W_{kj}^{N} + \Delta W_{kj}^{N}(x_1^i, x_2^i) + 2\lambda_j |W_{kj}|
\Delta W_{kj}^{N}(x_1^i, x_2^i) = -\eta_j \nabla W_{kj}^{N} + \mu_j \Delta W_{kj}^{N-1}    (1)

where ∇W_{kj} is the partial derivative with respect to the weight between the jth neuron in a given layer and the kth neuron in the next layer.

2) Weights: The weight initialization in all the layers in the network is done using the Glorot uniform initializer. The initializer draws samples from the uniform distribution over [−g, g], where g is given by the equation

g = \sqrt{\frac{6}{fan_{in} + fan_{out}}}    (2)

Here, fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor [20]. The biases were initialized using the default setting of zeros in all the layers.

3) Loss function: The model error for the Siamese network during training is computed using a regularized cross-entropy loss function. The cross-entropy function is as follows:

L(x_1^i, x_2^i) = y(x_1^i, x_2^i) \log P(x_1^i, x_2^i) + (1 - y(x_1^i, x_2^i)) \log(1 - P(x_1^i, x_2^i)) + \lambda^{N} |W|^2    (3)

Here i denotes the ith index of the current batch, and y(x_1^i, x_2^i) is a vector of length M consisting of labels. It is assumed that it equals 1 in the case of the same class and 0 in the case of different classes for iteration N.

B. ResNet

The ResNet architecture was developed to address some issues observed in its predecessor, the VGG-Net. One shortcoming of VGG-Net is that it tends to lose generalization capability as the network depth increases. The other problem that ResNet deals with is the “vanishing gradient” issue, which is often a problem with deeper networks. This is because gradients from the outermost layer easily shrink to zero after several applications of the chain rule, hence no weight updates are performed in the network. ResNet introduced the “skip connection” concept, by virtue of which gradients can flow directly backward from deeper layers to the initial filters, skipping intermediate layers. The ResNet used here is a pruned version of ResNet-34 [21].

In a pre-processing step, a CNN generates the bounding box information of a face along with a set of 68 face landmark points [1] from an input image. The ResNet is fed with the bounding box information of the face and that set of 68 activation points inside the face region. In order to save time

and computational resources, we have used pre-trained weights from the initial layers of this network. Those weights were obtained while this network was trained from scratch on a dataset of about 3 million faces. At that time the training dataset covered 7845 individuals, with face images procured from multiple sources such as the “face scrub” dataset, the VGG dataset and a large number of images scraped from the internet. This network generates a 128-dimensional encoded feature for an input face image in its 29th layer, and that 128-dimensional encoded feature is later used for classification. This network learns the weights using a loss function called “Triplet Loss”. The pruned network architecture is shown in Fig. 2.

Fig. 2: Pruned ResNet Architecture used in the experiment.

1) Loss function: In this study, the ResNet architecture uses a “Triplet Loss” function, governed by the following equation:

L = \max(D(a, p) - D(a, n) + margin, 0)    (4)

The objective behind training this pruned ResNet is to generate optimal weights such that the 128-dimensional feature embeddings of an anchor image and a positive image are similar, while the feature embeddings of the anchor image and a negative image are much further apart. While using the “Triplet Loss” function to train the network, the 128-dimensional feature embedding from an anchor image is compared with the 128-dimensional feature embeddings of both a positive sample and a negative sample. The objective here is to decrease dissimilarity between the anchor image and the positive image and increase dissimilarity between the anchor image and the negative image. Here 'a' denotes an anchor image, 'p' denotes a positive image and 'n' denotes a negative image. Another hyperparameter called margin is added to the loss equation; it defines how far apart the dissimilarities should be. For example, if the margin = 0.4 and D(a,p) = 0.3, then D(a,n) should be at least 0.7.

C. A Combined Hybrid Approach

The proposed combined approach is depicted in Fig. 3. The Siamese network takes as input the deep-learned encoded features that were generated by the pruned Res-Net CNN and learns its own set of weights with the aim of lowering its cross-entropy loss function. To optimize the weights for our datasets, the weights of the initial convolutional layers were kept constant and the weight updates were carried out on the final few layers of the network with our training samples. Note that the Res-Net CNN has its own set of weights and its corresponding loss function as well.

IV. EXPERIMENTAL PROTOCOL

We used an N-way one-shot task performed on 'N' “support classes” in a disjoint set each time for evaluating the performance on the evaluation set. For our experiments, we use 4 values of N, namely 5, 10, 20 and 50. The efficacy of such algorithms is measured based on their performance on N-way tasks. During testing, for a query sample image, a support class set S is provided consisting of 'n' examples each from 'N' different unseen classes. The algorithm then has to determine which of the support set classes the query sample belongs to. Two draws producing n samples each are taken, and each one of the samples produced in the first draw is taken as a test image and compared against all samples of the second draw. This process was done twice for each evaluation set of n classes. We therefore perform 2N different one-shot tasks. We also observe the individual set accuracy, and a mean global accuracy for the model is reported.

A. Dataset

The experiments were conducted on two publicly available large-scale datasets: “Labeled Faces in the Wild” (LFW) [22] and “Indian Movie Face database” (IMFDB) [23]. Another popular dataset, the “MS-Celeb Low-Shot dataset”, has not been included in the experiment for two reasons. First, some of the image samples of the dataset have been used to train the Res-Net based face recognition system, hence it would be unfair to use that database while evaluating the proposed system. Second, the dataset is unfortunately no longer publicly available. The reason for choosing “LFW” is that it is the most common dataset used for performance benchmarking of a face recognition system, and it is a curated dataset with proper alignment and proper annotation. The “IMFDB” consists of unconstrained images with much greater variability in terms of pose, illumination, and color. This variability is the reason for using the IMFDB dataset in our experiments. The two datasets are complementary to each

other in that sense. The details of the respective datasets are given below.

Fig. 3: Combined hybrid architecture used in the experiment.

1) LFW: This database consists of 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. In this dataset, 1680 people have two or more distinct photos. To maintain consistency and to ensure robustness we have various images for different facial positions. Further, to accommodate slight variations such as facial hair and obstructions such as headgear and eyewear, we include a few samples of such images as well. Finally, for each class, we end up taking 15 images in total and due to this constraint, we remove all the classes which have 15 images or less. After this, we are left with a total of 96 classes. We use a deep funneling method to align the faces [24].

2) IMFDB: This is a large unconstrained face database consisting of 34512 images of 100 Indian actors collected from more than 100 videos [23]. All the images are manually selected and cropped from the video frames, resulting in a high degree of variability in terms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup. Videos collected from the last two decades contain large diversity in age variations compared to the images collected from the Internet through a search query. IMFDB is the first face database that provides detailed annotation of every image in terms of age, pose, gender, expression and type of occlusion that may help other face-related applications. This dataset exhibits a huge degree of intra-class variability as well (Fig. 4).

Fig. 4: Example of intra class variability in IMFDB dataset.

To maintain the variability and to ensure robustness we have various images with different facial positions. Further, to accommodate slight variations such as facial hair and obstructions such as headgear and eyewear, we include a few samples of such images as well. For each class, we considered 20 images in total. We have a total of 100 classes. To keep the train and test set completely disjoint and to exclude any overlap in the classes, we removed 6 classes which were common to IMFDB and the dataset used to train the ResNet.

V. RESULTS & DISCUSSIONS

While conducting experiments with three different approaches, the input test and train sets for each fold were the same for all three experiments. This was done purposely to compare the efficacy of the three approaches fairly.

For our experiments, the subset of the LFW database consisting of 96 face classes with 15 samples in each class was used. Those 96 classes were selected since the rest of the classes have fewer than 15 samples. For the evaluation in face recognition we perform 3 different one-shot tasks, i.e. 5, 10 and 20 way tasks, so the new dataset was split into either 91-5, 81-10 or 71-20 train-validation & evaluation classes, where the train set was further split according to an 80-20% split, resulting in 72, 64 or 56 classes for training and 19, 17 or 15 classes for validation.

The set of IMFDB consisting of 94 face classes with 20 samples in each class was used. For the evaluation in face recognition we perform 5, 10 and 20 way tasks, so the new dataset was split into either 89-5, 84-10 or 74-20 train-validation & evaluation classes, where the train set was further split according to an 80-20% split, resulting in 71, 67 or 59 classes for training and 18, 17 or 15 classes for validation. The evaluation was conducted using the same n-way one-shot tests on the n classes from the evaluation set.

Both the datasets contain around 95 classes, and for training and evaluation we use a fold-wise method. The total number of folds for “n-way” is obtained as the total number of classes divided by “n”, the number of classes for testing, with minimal re-sampling. Therefore, in the case of 5-way we get 19 folds, for 10-way we get 9 folds and for 20-way we get 4 folds. Note that to frame a 50-way one-shot task, given the number of classes in each of those two datasets, we could perform only two folds of train-test evaluation runs, where a few of the classes might be re-sampled from the previous folds. By “fold” we mean a unique “train-validation-test” evaluation set.
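The fold counts quoted above follow from integer division of the class count by n. The short sketch below reproduces that arithmetic; it assumes a round figure of 95 classes (the text only says "around 95"), and the helper name `num_folds` is introduced here purely for illustration.

```python
def num_folds(total_classes: int, n_way: int) -> int:
    """Number of disjoint n-class evaluation folds that fit into the class pool."""
    return total_classes // n_way


TOTAL_CLASSES = 95  # assumption: the "around 95 classes" quoted in the text

# Fold count for each task size used in the experiments.
folds = {n: num_folds(TOTAL_CLASSES, n) for n in (5, 10, 20, 50)}
for n, k in folds.items():
    print(f"{n}-way: {k} disjoint fold(s)")
```

Under this assumption only one disjoint fold fits the 50-way task, which is consistent with the remark that a second 50-way fold needs some classes to be re-sampled.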

The accuracy metric used here is the true recognition rate for each fold in a given dataset.

A. Siamese Network-based Results

Out of the three approaches, during our initial experiments, the Siamese Network-based approach performed the worst. Even while dealing with a 5-way One-Shot recognition task, it could only deliver a highest accuracy of ≈ 32.50% for both datasets. To give an idea, results obtained on n-way One-Shot tasks on both datasets on 4 different folds are shown in Tables I and II. Since the results are not encouraging, we are not providing results for all folds with respect to different n-way tasks.

TABLE I: Accuracy of One-Shot Tasks on LFW dataset using Siamese Network with own feature extractor

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         32.50%        28.20%         23.40%
Fold 2         27.50%        26.70%         22.60%
Fold 3         30.00%        30.20%         25.60%
Fold 4         24.60%        24.80%         22.60%

TABLE II: Accuracy of One-Shot Tasks on IMFDB using Siamese Network with own feature extractor

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         32.80%        30.80%         24.20%
Fold 2         30.50%        27.50%         20.60%
Fold 3         28.50%        28.60%         22.80%
Fold 4         27.60%        27.60%         26.02%

B. ResNet-Based Face Recognizer Results

The ResNet architecture for face recognition from “DLIB” has been used in our experiment. To save time and resources, a transfer learning strategy was adopted. Here a pre-trained model of the Res-Net, which was obtained by training on 3 million face images, was initially considered in this experiment. The weights of the initial convolutional layers of this model were kept constant during training on samples from “LFW” and “IMFDB”, and the weights associated with all fully connected layers were updated. The 128-dimensional feature encoding obtained from the 29th layer for an input test image is compared with the 128-dimensional feature encoding vectors of all support set samples, and the input image is then assigned to the class of its nearest neighbor amongst the support set samples. Results on 5-way, 10-way and 20-way One-Shot learning tasks on the LFW and IMFDB datasets are shown in Table IV and Table III, respectively. Note that here also we are reporting on the same 4 folds of data that we have reported for the Siamese Network. It can be noted that with the use of ResNet feature encoding there is a striking improvement in the results compared to the results obtained with the Siamese Network-only approach. The accuracy is as high as 87.00% with the 20-way One-Shot tasks on the LFW dataset, whereas the highest accuracy on the same dataset with the Siamese Network is ≈ 26.00%. A similar trend can be observed in the case of the IMFDB dataset as well.

TABLE III: Accuracy of One-Shot Tasks on IMFDB using Dlib-ResNet-29 network

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         80.80%        78.60%         80.00%
Fold 2         82.40%        80.40%         76.50%
Fold 3         81.00%        79.30%         78.20%
Fold 4         83.60%        82.00%         75.40%

TABLE IV: Accuracy of One-Shot Tasks on LFW dataset using Dlib-ResNet-29 network

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         88.20%        86.00%         85.30%
Fold 2         90.00%        84.60%         87.00%
Fold 3         89.00%        90.00%         82.00%
Fold 4         90.20%        89.00%         81.40%

C. Results obtained from Combined Hybrid Approach

The classification technique that we used to perform One-Shot learning on the encoded features from Res-Net was a naive Nearest Neighbour classification. Despite the simple classification, such high accuracies from the ResNet-based approach confirm that the encoded features generated by the ResNet were very discriminative. This motivated us to couple the discriminative feature extractor with the sophisticated discriminator function of the Siamese network architecture. In this setup, the ResNet-generated encoded features were fed to the Siamese network, which learns its own set of weights and hence gives much higher accuracy, in the range of 80.00%-84.20% even for the 50-way one-shot task. We experimented exhaustively with this approach with all possible folds of data. The 20-way one-shot results are shown in Table V, and Table VI shows the typical results obtained by this method on 50-way one-shot learning for the two datasets.

TABLE V: Accuracy of 20-way One-Shot Tasks on LFW Dataset & IMFDB using combined approach

Fold Number    LFW       IMFDB
Fold 1         92.50%    70.00%
Fold 2         95.50%    72.50%
Fold 3         82.50%    72.50%
Fold 4         87.50%    80.50%

In our experiments, for the 5-way one-shot task we obtained an average accuracy of 92.44% across 19 folds on the entire subset of the LFW dataset. Further, we obtained accuracy as high as 97.00% on a few occasions. The mean accuracy yielded by the 10-way one-shot tasks over a 9 fold cross-validation set

was 90.55%, with the best accuracy reaching 97.50% in one of the folds.

Similar to the experiments conducted on the LFW dataset, we also performed 5-way and 10-way tasks on the IMFDB dataset. The mean accuracy of the 5-way one-shot task for 19 folds was observed to be 82.63%, whereas for the 10-way one-shot task the mean accuracy across the 9 fold set was observed to be 79.05%. The best accuracies of the 5-way and 10-way tasks were observed to be 92.50% and 87.50%, respectively.

TABLE VI: Accuracy of 50-way One-Shot Tasks on LFW Dataset & IMFDB using combined approach

Fold Number    LFW       IMFDB
Fold 1         80.00%    80.50%
Fold 2         82.50%    84.20%

D. Comparison with other techniques

Though there are a large number of published results on face recognition, very few works like [13], [16], [14] focus on the One-Shot face recognition task. Unfortunately, we could compare the performance of our system with only [14], as the others have used the “MS-Celeb Low Shot” dataset meant for the One-Shot recognition task and that dataset is not available from any legitimate source. In [14], the authors did experiments for One-Shot recognition using the “LFW” dataset and we have compared our results with theirs in Table VII. Note that our method has outperformed the method proposed in [14] especially in the case of 10-way and 20-way

[3] W. Zhao, R. Chellappa, and A. Krishnaswamy, “Discriminant analysis of principal components for face recognition,” Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 336–341, 1998.
[4] L.-F. Chen, H.-Y. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “New LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognition, vol. 33, pp. 1713–1726, Oct. 2000.
[5] Y. C. Tan, Y. Zhao, and X. Ma, “Contourlet-based feature extraction with LPP for face recognition,” 2011 International Conference on Multimedia and Signal Processing, vol. 1, pp. 122–125, 2011.
[6] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Li, and T. Hospedales, “When face recognition meets with deep learning: An evaluation of convolutional neural networks for face recognition,” Dec. 2015, pp. 384–392.
[7] C. Lu and X. Tang, “Surpassing human-level face verification performance on LFW with GaussianFace,” CoRR, vol. abs/1404.3840, 2014.
[8] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” CoRR, vol. abs/1406.4773, 2014.
[9] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are sparse, selective, and robust,” CoRR, vol. abs/1412.1265, 2014.
[10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” Sep. 2014.
[11] Y. Guo and L. Zhang, “One-shot face recognition by promoting underrepresented classes,” CoRR, vol. abs/1707.05574, 2017.
[12] L. Wang, Y. Li, and S. Wang, “Feature learning for one-shot face recognition,” 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2386–2390, 2018.
[13] Z. Ding, Y. Guo, L. Zhang, and Y. Fu, “One-shot face recognition via generative learning,” May 2018, pp. 1–7.
[14] A. Jadhav, V. P. Namboodiri, and K. S. Venkatesh, “Deep attributes for one-shot face recognition,” in ECCV Workshops, 2016.
[15] Y. Wu, H. Liu, and Y. Fu, “Low-shot face recognition with hybrid classifiers,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct. 2017.
[16] S. Hong, W. Im, J. Ryu, and H. S. Yang, “SSPP-DAN: deep domain adaptation network for face recognition with single sample per person,” CoRR, vol. abs/1702.04069, 2017.
[17] J. Zhao, Y. Cheng, Z. Wang, Y. Xu, J. Karlekar, S. Shen, and J. Feng, “Know you at one glance: A compact vector representation for low-shot
learning,” 09 2017.
one-shot tasks. We plan to preserve and publish the train and [18] J. Bromley, I. Guyon, Y. LeCun et al., “Signature Verification using
test split of images that we have used for our experiments from a ”Siamese” Time Delay Neural Network,” International Journal of
the other dataset “IMFDB”, for benchmarking performance Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, p. 669688,
1993.
evaluation of One-Shot face Recognition task. [19] G. Koch, R. Zemel, and R. Salakhudtdinov, “Siamese Neural Networks
for One-shot Image Recognition,” in Proceedings of the 32 nd Inter-
TABLE VII: Accuracy comparison of One shot Tasks on national Conference on Machine Learning, vol. 37, Lille, France, Jul.
2015.
LFW [20] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of the Thirteenth Inter-
Method 5 Way 10 Way 20 Way national Conference on Artificial Intelligence and Statistics, AISTATS
Deep attribute, Jadhav at al. [14] 94.00% 93.75% 88.87% 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, 2010, pp.
249–256.
Dlib-Siamese Net , Proposed Method 97.00% 97.50% 95.50%
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” CoRR, vol. abs/1512.03385, 2015.
[22] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled
VI. C ONCLUSIONS & F UTURE W ORK faces in the wild: A database for studying face recognition in uncon-
strained environments,” University of Massachusetts, Amherst, Tech.
This article proposes a new hybrid approach of fusing Res- Rep. 07-49, October 2007.
Net features along with a Siamese-Network classifier to handle [23] S. Setty, M. Husain, P. Beham, J. Gudavalli, M. Kandasamy, R. Vaddi,
V. Hemadri, J. C. Karure, R. Raju, B. Rajan, V. Kumar, and C. V. Jawa-
face recognition task in a One-Shot learning framework. The har, “Indian Movie Face Database: A Benchmark for Face Recognition
proposed hybrid network shows impressive performance even Under Wide Variations,” in National Conference on Computer Vision,
while dealing with 50-way One-Shot recognition tasks on two Pattern Recognition, Image Processing and Graphics (NCVPRIPG), Dec
2013.
publicly available datasets. Future research plan is to use more [24] G. B. Huang, M. A. Mattar, H. Lee, and E. Learned-Miller, “Learning to
sophisticated discriminator function to combat 100-way One- align from scratch,” in Proceedings of the 25th International Conference
Shot recognition task. on Neural Information Processing Systems - Volume 1, ser. NIPS’12,
USA, 2012, pp. 764–772.
R EFERENCES
[1] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine
Learning Research, vol. 10, pp. 1755–1758, 07 2009.
[2] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of
Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
