Chanda 2019
Abstract—The ability to learn from a single instance is something unique to the human species, and One-Shot learning algorithms try to mimic this special capability. On the other hand, despite the fantastic performance of Deep Learning-based methods on various image classification problems, performance often depends on having a huge number of annotated training samples per class. This fact is certainly a hindrance in deploying deep neural network-based systems in many real-life applications like face recognition. Furthermore, the addition of a new class to the system requires re-training the whole system from scratch. Nevertheless, the prowess of deep learned features cannot be ignored. This research aims to combine the best of deep learned features with a traditional One-Shot learning framework. Results obtained on 2 publicly available datasets are very encouraging, achieving over 90% accuracy on 5-way One-Shot tasks and 84% on 50-way One-Shot problems.

Keywords-One-Shot Learning, Face recognition, Siamese Networks, Image Classification.

I. INTRODUCTION

Face recognition has been extensively explored over the last several decades. Its value as a non-contact biometric authentication method, and in a wide variety of other digital applications like security, digital entertainment systems, video analytics for marketing, and video indexing from streaming video, cannot be ignored. Like any other image analysis problem, face recognition in its early days relied mainly on hand-crafted features like SIFT, SURF, Local Binary Pattern, Histogram of Oriented Gradients, and Fisher vectors, but with the advent of deep-learning methodologies, there is a clear shift towards deep-learned features. During those early days, research was focused on improving the pre-processing stage and on the introduction of local descriptors and feature transformation, but such techniques failed to counter the challenges of unconstrained face recognition. Hand-crafted feature-based methods were used to address changes in lighting, pose, and expression but failed in real life due to their inability to address more general pose challenges. This has changed as the deep-learning methods have evolved.

Deep learning methods learn multiple levels of representations and abstractions by using a cascade of processing units for feature extraction and transformation. This leads to forming a hierarchy of abstraction/representation, and addresses changes in face pose, illumination, and expression. Even though deep-learning-based methods can tackle changes in lighting, pose, and expression while performing face recognition, their disadvantages are the demand for a huge amount of annotated data to train the system and the requirement of re-training when a new class is added. While transfer learning techniques can help mitigate such problems by freezing the first few layers and tuning pre-trained weights from the last few layers on the new data, they do not completely eradicate the problem.

One-shot algorithms, on the other hand, use a completely different philosophy for classification. One-shot algorithms are meant to perform classification after seeing only a handful of training samples. Thus a clever amalgamation of those two techniques could combine the best of both, providing a rich feature representation using deep learning techniques and feeding those features to a One-Shot learning framework for classification. A widespread strategy to implement One-Shot learning algorithms is to use a Siamese Neural Network with a triplet loss function. Our work takes a Siamese Neural Network-based approach to perform One-Shot learning and consequent classification. Deep Neural Network-based features from the “DLIB-ml machine learning toolkit” [1] are used for feature representation for all face images.

The primary contribution of this research is a novel hybrid method that combines a Siamese Neural Network with Res-Net encoded features for the One-Shot face recognition task. We also intend to publish our dataset with unconstrained face images procured from the “Indian Movie Faces Database” in the near future for One-Shot recognition task performance evaluation and benchmarking.

^ Equal contribution by the authors

II. RELATED WORK

Face detection and recognition methods have had significant importance as an image analysis research problem for almost
Authorized licensed use limited to: Auckland University of Technology. Downloaded on May 28,2020 at 23:47:46 UTC from IEEE Xplore. Restrictions apply.
3 decades. One of the seminal articles in the early nineties is [2], where the authors represent faces using a small set of 2-D eigenvectors. Face recognition methods can be broadly divided into hand-crafted feature-based approaches and, with later deep-learning technologies, deep-learned feature-based approaches. The hand-crafted approaches focused mainly on high-dimensional artificial feature extraction and the reduction of features. The representative dimension reduction methods are the subspace learning methods like Principal Component Analysis [3] and Linear Discriminant Analysis [4], and manifold learning methods like Locality Preserving Projection [5]. With the advance of deep learning, the representative method was to learn the discriminative face representations directly from the original image space. For example, Hu et al. [6] applied the convolutional neural network to face recognition; their work analyses the advantages and disadvantages of this method and outlines a roadmap for future development. This work is further explored and state-of-the-art results are obtained in [7], [8], [9], [10]. Despite the exceptional performance of CNNs in some applications, such algorithms struggle to deal with many real-world applications that require learning or drawing inferences from small amounts of data, handling class imbalance, and adjusting to a constant inflow of new class information. The problem of developing an efficient, robust face recognition system at scale is no exception in this context.

In the past few years, there have been several works that address this problem. To address the data imbalance problem, Guo et al. [11] proposed a novel underrepresented classes promotion loss term which aligned the norms of weight vectors of underrepresented classes and normal classes, thus giving the one-shot classes an equal weightage. Work by Wang et al. [12] proposes a framework based on CNN, which deals with the deficient training data by using a balancing regularizer and shifting the center regeneration to regulate norms of weight vectors into the same scale and adjust the clustering center. Insufficient training data and data imbalance, however, cause the network to perform poorly. Ding et al. [13] proposed an approach to solve the underrepresented class problem in one-shot learning by building generative models that produce extra examples. Their generative model synthesizes data for one-shot classes by adapting the data variances and augmenting features from other normal classes. Another work by Jadhav et al. [14] proposed the method of deep attribute representation of faces for one-shot face recognition. They used specific attributes of human faces such as the shape of the face, hair, and gender to fine-tune a deep CNN for face recognition. Their experimental results on standard datasets showed that deep attribute representations performed better with two one-shot face recognition techniques, an exemplar SVM and a one-shot similarity kernel. Wu et al. [15] proposed a framework with hybrid classifiers using a CNN and the nearest neighbor (NN) model. The work by Hong et al. [16] proposes a domain adaptation network to solve the one-shot task; the authors generated images in various poses using a 3D face model to train the deep model. Zhao et al. [17] proposed an enforced softmax that contains optimal dropout, selective attenuation, L2 normalization and model-level optimization, which boosted the standard softmax function to produce a better representation for low-shot learning.

The concept of Siamese Networks was initially introduced by Bromley et al. [18] for the signature verification problem, and the use of deep convolutional Siamese networks for one-shot tasks with significant accuracy has been showcased in [19]. Face recognition usually consists of face detection, feature extraction, and recognition. We use the dlib-ml [1] toolkit, which leverages image-driven neural networks to detect and extract the faces in a given image, and then use a ResNet-based architecture to generate a feature vector to represent each face. In this paper, we propose a method which integrates the concept of deep convolutional Siamese networks and a transfer learning strategy to produce a robust face recognition system which leverages the deep learned feature attributes.

III. METHODOLOGY

One-shot learning can be achieved in several ways. In this research we have explored two approaches: (a) a Siamese Neural Network-based approach; (b) a deep-feature encoding approach followed by nearest neighbor classification of those encoded features. We settled on a method that combines the two approaches. This combined method uses the encoded features generated by a ResNet CNN architecture as input to the Siamese network, and the Siamese network is trained to discriminate between two encoded feature vectors. In this combined approach a pre-trained deep convolutional neural network (ResNet) acts as a feature extractor for a pair of input images, and an energy function Θ, which ties the twin networks together, is then used to compute the similarity index. When the two encoded feature vectors for the input face images are obtained, the Siamese Network learns to score the similarity of those two encoded feature vectors in a range of 0-1, where 1 is assigned if both the input images are of the same class.

A. Siamese Network

Siamese networks are a subset of deep neural network architectures that contain two identical sub-networks working in cohesion; the sub-networks use the same weights while taking two distinct input vectors and are joined by a comparative function. Such networks are used to determine the similarity between two distinct inputs. For the network to be called 'Siamese', it is important not only that the architectures of the sub-networks are identical, but also that the weights are shared among them. In this current study, the convolutional Siamese network is designed to learn features of the input images regardless of prior domain knowledge with very few samples from a given distribution. This model was also adopted because the twin networks share weights resulting in fewer parameters to train
Fig. 1: One sibling of the twin Siamese Network architecture used in the experiment (twin network not depicted).
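To make the weight sharing concrete, the following is a minimal sketch rather than the architecture in Fig. 1. It assumes a single dense ReLU layer as the sibling and, following Koch et al. [19], a weighted L1 distance passed through a sigmoid as the joining function. The names `make_weights`, `sibling`, `similarity`, `alpha` and `bias` are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def make_weights(in_dim, out_dim, seed=0):
    """Random weights for one dense layer (illustrative only, not trained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def sibling(x, weights):
    """One sibling: a dense layer followed by ReLU (assumed stand-in sub-network)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def similarity(x1, x2, weights, alpha, bias=0.0):
    """Score a pair in (0, 1); both inputs pass through the SAME weights."""
    h1 = sibling(x1, weights)
    h2 = sibling(x2, weights)
    # Assumed joining function: weighted L1 distance squashed by a sigmoid.
    z = bias + sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2))
    return 1.0 / (1.0 + math.exp(-z))


weights = make_weights(in_dim=4, out_dim=3)
alpha = [-1.0, -1.0, -1.0]  # negative weights: larger distance gives a lower score
x_a = [0.2, 0.4, 0.1, 0.9]
x_b = [0.8, 0.1, 0.5, 0.3]
print(similarity(x_a, x_b, weights, alpha))
```

Because both inputs go through one set of weights, the score is symmetric in its arguments. In a trained network, `alpha` and `bias` would be learned so that same-class pairs score close to 1.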
and increase dissimilarity between the anchor image and the negative image. Here 'a' denotes an anchor image, 'p' denotes a positive image and 'n' denotes a negative image. Another hyperparameter, called the margin, is added to the loss equation; it defines how much farther the negative must be from the anchor than the positive. For example, if the margin = 0.4 and d(a,p) = 0.3, then d(a,n) should be at least 0.7.
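The margin example can be checked with a short sketch. It assumes the commonly used hinge form of the triplet loss, max(d(a,p) - d(a,n) + margin, 0), since the loss equation itself is not reproduced in this text. The function names are illustrative, and `euclidean` is one possible choice for d.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings (an assumed choice for d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(d_ap, d_an, margin):
    """Hinge-style triplet loss (assumed standard form).

    d_ap: distance between anchor and positive
    d_an: distance between anchor and negative
    """
    return max(d_ap - d_an + margin, 0.0)


# With margin = 0.4 and d(a,p) = 0.3, the loss vanishes (up to rounding)
# once d(a,n) reaches 0.7, and is positive for any smaller d(a,n).
print(triplet_loss(0.3, 0.7, 0.4))
print(triplet_loss(0.3, 0.5, 0.4))
print(triplet_loss(0.3, 1.0, 0.4))
```

A negative that is already farther away than d(a,p) + margin contributes no loss, so training effort concentrates on triplets that still violate the margin.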
Fig. 3: Combined hybrid architecture used in the experiment.
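As a rough illustration of how a combined pipeline of this kind could make an n-way one-shot decision, the sketch below compares a query embedding against one stored embedding per class and picks the class with the highest similarity. The toy 3-D vectors stand in for ResNet-encoded features, and `stand_in_score` is a hypothetical placeholder for the trained Siamese scorer. Neither is taken from the paper.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def stand_in_score(u, v):
    """Hypothetical similarity in (0, 1]; a trained Siamese network would replace this."""
    return 1.0 / (1.0 + euclidean(u, v))


def one_shot_predict(query, support, score_fn=stand_in_score):
    """Return the label whose single support embedding is most similar to the query."""
    return max(support, key=lambda label: score_fn(query, support[label]))


def recognition_rate(queries, support, score_fn=stand_in_score):
    """Fraction of queries assigned to their true class."""
    correct = sum(
        1 for emb, label in queries
        if one_shot_predict(emb, support, score_fn) == label
    )
    return correct / len(queries)


# A toy 3-way task: exactly one support embedding per class (one shot).
support = {
    "actor_a": [0.0, 0.0, 1.0],
    "actor_b": [1.0, 0.0, 0.0],
    "actor_c": [0.0, 1.0, 0.0],
}
queries = [
    ([0.9, 0.1, 0.0], "actor_b"),
    ([0.1, 0.0, 0.8], "actor_a"),
    ([0.2, 0.7, 0.1], "actor_c"),
]
print(recognition_rate(queries, support))
```

Adding a new class in this scheme only requires adding one entry to `support`. No re-training of the scorer is implied by the sketch.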
other in that sense. The details of the respective datasets are given below.

1) LFW: This database consists of 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. In this dataset, 1680 people have two or more distinct photos. To maintain consistency and to ensure robustness we have various images for different facial positions. Further, to accommodate slight variations such as facial hair and obstructions such as headgear and eyewear, we include a few samples of such images as well. Finally, for each class, we end up taking 15 images in total and due to this constraint, we remove all the classes which have 15 images or less. After this, we are left with a total of 96 classes. We use a deep funneling method to align the faces [24].

2) IMFDB: This is a large unconstrained face database consisting of 34512 images of 100 Indian actors collected from more than 100 videos [23]. All the images are manually selected and cropped from the video frames, resulting in a high degree of variability in terms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup. Videos collected from the last two decades contain large diversity in age variations compared to the images collected from the Internet through a search query. IMFDB is the first face database that provides detailed annotation of every image in terms of age, pose, gender, expression and type of occlusion that may help other face-related applications. This dataset exhibits a huge degree of intra-class variability as well (Fig. 4).

Fig. 4: Example of intra-class variability in the IMFDB dataset.

To maintain the variability and to ensure robustness we have various images with different facial positions. Further, to accommodate slight variations such as facial hair and obstructions such as headgear and eyewear, we include a few samples of such images as well. For each class, we considered 20 images in total. We have a total of 100 classes. To keep the train and test sets completely disjoint and to exclude any overlap in the classes, we removed 6 classes which were common to IMFDB and the dataset used to train the ResNet.

V. RESULTS & DISCUSSIONS

While conducting experiments with three different approaches, the input test and train sets for each fold were the same for all three experiments. This was done purposely to compare the efficacy of the three approaches fairly.

For our experiments, the subset of the LFW database consisting of 96 face classes with 15 samples in each class was used. Those 96 classes were selected since the rest of the classes have fewer than 15 samples. For the evaluation in face recognition we perform 3 different one-shot tasks, i.e. 5, 10 and 20-way tasks, so the new dataset was split into either 91-5, 81-10 or 71-20 train-validation & evaluation classes, where the train set was further split according to an 80-20% split, resulting in 72, 64 or 56 classes for training and 19, 17 or 15 classes for validation.

The subset of IMFDB consisting of 94 face classes with 20 samples in each class was used. For the evaluation in face recognition we perform 5, 10 and 20-way tasks, so the new dataset was split into either 89-5, 84-10 or 74-20 train-validation & evaluation classes, where the train set was further split according to an 80-20% split, resulting in 71, 67 or 59 classes for training and 18, 17 or 15 classes for validation. The evaluation was conducted using the same n-way one-shot tests on the n classes from the evaluation set.

Both the datasets contain around 95 classes, and for training and evaluation we use a fold-wise method. The total number of folds for “n-way” is obtained as the total number of classes divided by “n”, the number of classes for testing, with minimal re-sampling. Therefore, in the case of 5-way we get 19 folds, for 10-way we get 9 folds and for 20-way we get 4 folds. Note that to frame a 50-way one-shot task, given the number of classes in each of those two datasets, we could perform only two folds of train-test evaluation runs, where a few of the classes might be re-sampled from the previous folds. By “fold” we mean a unique “train-validation-test” evaluation
set. The accuracy metric used here is the true recognition rate for each fold in a given dataset.

A. Siamese Network-based Results

Out of the three approaches, during our initial experiments, the Siamese Network-based approach performed the worst. Even while dealing with a 5-way One-Shot recognition task, it could only deliver a highest accuracy of ≈ 32.50% for both datasets. To give an idea, results obtained on n-way One-Shot tasks on both datasets on 4 different folds are shown in Tables I and II. Since the results are not encouraging, we are not providing results for all folds with respect to different n-way tasks.

TABLE I: Accuracy of One shot Tasks on LFW dataset using Siamese Network with own feature extractor

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         32.50%        28.20%         23.40%
Fold 2         27.50%        26.70%         22.60%
Fold 3         30.00%        30.20%         25.60%
Fold 4         24.60%        24.80%         22.60%

whereas the highest accuracy on the same dataset with Siamese Network is ≈ 26.00%. A similar trend can be observed in the case of the IMFDB dataset as well.

TABLE III: Accuracy of One shot Tasks on IMFDB using Dlib-ResNet-29 network

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         80.80%        78.60%         80.00%
Fold 2         82.40%        80.40%         76.50%
Fold 3         81.00%        79.30%         78.20%
Fold 4         83.60%        82.00%         75.40%

TABLE IV: Accuracy of One shot Tasks on LFW dataset using Dlib-ResNet-29 network

Fold Number    5-Way Task    10-Way Task    20-Way Task
Fold 1         88.20%        86.00%         85.30%
Fold 2         90.00%        84.60%         87.00%
Fold 3         89.00%        90.00%         82.00%
Fold 4         90.20%        89.00%         81.40%
TABLE VI: Accuracy of 50-way One shot Tasks on LFW Dataset & IMFDB using combined approach

Fold Number    LFW       IMFDB
Fold 1         80.00%    80.50%
Fold 2         82.50%    84.20%

was 90.55% with the best accuracy reaching as high as 97.50% in one of the folds.

Similar to the experiments conducted on the LFW dataset, we also performed 5-way and 10-way tasks on the IMFDB dataset. The mean accuracy of the 5-way one-shot task across 19 folds was observed to be 82.63%, whereas for the 10-way one-shot task the mean accuracy across 9 folds was observed to be 79.05%. The best accuracy of the 5-way and 10-way tasks was observed to be 92.50% and 87.50% respectively.

D. Comparison with other techniques

Though there are a large number of published results on face recognition, very few works, like [13], [16], [14], focus on the One-Shot face recognition task. Unfortunately, we could compare the performance of our system with only [14], as the others have used the “MS-Celeb Low Shot” dataset meant for the One-Shot recognition task and that dataset is not available from any legitimate source. In [14], the authors did experiments for One-Shot recognition using the “LFW” dataset and we have compared our results with theirs in Table VII. Note that our method has outperformed the method proposed in [14], especially in the case of 10-way and 20-way one-shot tasks. We plan to preserve and publish the train and test split of images that we have used for our experiments from the other dataset, “IMFDB”, for benchmarking performance evaluation of the One-Shot face recognition task.

TABLE VII: Accuracy comparison of One shot Tasks on LFW

Method                                5 Way     10 Way    20 Way
Deep attribute, Jadhav et al. [14]    94.00%    93.75%    88.87%
Dlib-Siamese Net, Proposed Method     97.00%    97.50%    95.50%

VI. CONCLUSIONS & FUTURE WORK

This article proposes a new hybrid approach of fusing Res-Net features with a Siamese-Network classifier to handle the face recognition task in a One-Shot learning framework. The proposed hybrid network shows impressive performance even while dealing with 50-way One-Shot recognition tasks on two publicly available datasets. The future research plan is to use a more sophisticated discriminator function to tackle the 100-way One-Shot recognition task.

REFERENCES

[1] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 07 2009.
[2] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[3] W. Zhao, R. Chellappa, and A. Krishnaswamy, “Discriminant analysis of principal components for face recognition,” Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 336–341, 1998.
[4] L.-F. Chen, H.-y. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “New lda-based face recognition system which can solve the small sample size problem,” Pattern Recognition, vol. 33, pp. 1713–1726, 10 2000.
[5] Y. C. Tan, Y. Zhao, and X. Ma, “Contourlet-based feature extraction with lpp for face recognition,” 2011 International Conference on Multimedia and Signal Processing, vol. 1, pp. 122–125, 2011.
[6] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Li, and T. Hospedales, “When face recognition meets with deep learning: An evaluation of convolutional neural networks for face recognition,” 12 2015, pp. 384–392.
[7] C. Lu and X. Tang, “Surpassing human-level face verification performance on LFW with gaussianface,” CoRR, vol. abs/1404.3840, 2014.
[8] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” CoRR, vol. abs/1406.4773, 2014.
[9] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are sparse, selective, and robust,” CoRR, vol. abs/1412.1265, 2014.
[10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” 09 2014.
[11] Y. Guo and L. Zhang, “One-shot face recognition by promoting underrepresented classes,” CoRR, vol. abs/1707.05574, 2017.
[12] L. Wang, Y. Li, and S. Wang, “Feature learning for one-shot face recognition,” 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2386–2390, 2018.
[13] Z. Ding, Y. Guo, L. Zhang, and Y. Fu, “One-shot face recognition via generative learning,” 05 2018, pp. 1–7.
[14] A. Jadhav, V. P. Namboodiri, and K. S. Venkatesh, “Deep attributes for one-shot face recognition,” in ECCV Workshops, 2016.
[15] Y. Wu, H. Liu, and Y. Fu, “Low-shot face recognition with hybrid classifiers,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[16] S. Hong, W. Im, J. Ryu, and H. S. Yang, “SSPP-DAN: deep domain adaptation network for face recognition with single sample per person,” CoRR, vol. abs/1702.04069, 2017.
[17] J. Zhao, Y. Cheng, Z. Wang, Y. Xu, J. Karlekar, S. Shen, and J. Feng, “Know you at one glance: A compact vector representation for low-shot learning,” 09 2017.
[18] J. Bromley, I. Guyon, Y. LeCun et al., “Signature Verification using a “Siamese” Time Delay Neural Network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, pp. 669–688, 1993.
[19] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese Neural Networks for One-shot Image Recognition,” in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, Lille, France, Jul. 2015.
[20] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pp. 249–256.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[22] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
[23] S. Setty, M. Husain, P. Beham, J. Gudavalli, M. Kandasamy, R. Vaddi, V. Hemadri, J. C. Karure, R. Raju, B. Rajan, V. Kumar, and C. V. Jawahar, “Indian Movie Face Database: A Benchmark for Face Recognition Under Wide Variations,” in National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), Dec 2013.
[24] G. B. Huang, M. A. Mattar, H. Lee, and E. Learned-Miller, “Learning to align from scratch,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12, USA, 2012, pp. 764–772.