Abstract—This paper presents a comparative study on recognizing faces from a customized dataset of 10 celebrity identities using Convolutional Neural Network based models: AlexNet, VGG16, VGG19 and MobileNet. These models, pre-trained on the ImageNet dataset, are used with Transfer Learning and Fine Tuning. For our experiment we used the Keras API, written in Python, with a TensorFlow backend. The performance analysis includes training, validation, and testing on different images created from the original dataset. The VGG19 model achieved better validation accuracy than the other three, but the MobileNet model showed better test accuracy.

Index Terms—Deep Learning, Neural Network, Convolutional Neural Network, Transfer Learning, Face Recognition

I. INTRODUCTION
Recognition of faces has been a trending topic in the area of computer vision. A fundamental aspect of face recognition is its broad interdisciplinary nature, spanning machine vision, biometrics and security, multimedia processing, psychology and neuroscience, etc. [1]. For this reason it has a wide field of research. Researchers have been trying tirelessly to achieve more and more accurate results in this field over the years.

In certain circumstances, face recognition has several important advantages over other biometric modalities [2]. It is well accepted, familiar and easily understood by people. As a result, it has a large range of applications such as identification of criminals, unlocking smartphones and laptops, home access and security, finding missing persons, helping blind people, identifying people on social media, disease diagnosis, real-time monitoring and management systems, etc.

II. RELATED WORKS
However, recognition of faces has been a challenging task since the very beginning. The Convolutional Neural Network (CNN) is a recently established and capable image recognition method which uses local receptive fields, like neurons in the brain, together with weight sharing and linking information, and greatly reduces the training constraints in comparison with other neural networks [3]. CNNs became more popular in computer vision when AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4]. To achieve higher accuracy in CNN image classification, the development and use of deeper and more complex CNNs has become a trend in the research area [5] [6] [7].

Recently, face recognition and clustering were performed using FaceNet in [8]. In [9], video-based emotion was recognised using CNN-RNN and C3D hybrid networks. In [10], two efficient approximations to standard convolutional neural networks, Binary-Weight-Networks and XNOR-Networks, were proposed. In [11], face recognition was significantly advanced by the emergence of deep learning with VGGNet and GoogLeNet. In [12], a class of efficient models called MobileNets was presented for mobile and embedded vision applications.

Face recognition has been a very popular research topic in the field of deep neural networks. The size of the MobileNet model is only about 17 megabytes, so it can easily be deployed in low-end embedded systems. We wanted to find out how well this small neural network performs in recognizing faces, which may be applied in embedded security systems. That is what motivated us to study MobileNet and compare it with other, larger networks for this specific application of face recognition. In this work we have contributed a new dataset which can be used for further research. We have also shown how deep neural network models perform well on a small dataset with data augmentation.

In this paper we compare four convolutional neural network based models, AlexNet, VGG16, VGG19 and MobileNet, for face recognition on a customized dataset of faces, using their training and validation results. Section II describes some related works in this field and Section III describes the dataset collection. Section IV then gives an overview of the models and how we fine-tuned them for our purpose. Section V covers the experiment and Section VI shows the results on our dataset. Finally, Section VII concludes with a summary.

A. Image Collection
Collection of a good dataset has always been a hard task in the field of computer vision. Thus for this paper a customized
dataset has been made of face images of 10 celebrities from Google image search, consisting of 130 images for each identity. For each class the images are split into 100 images for training, 20 images for validation and 10 images for testing. Fig. 1 shows some example images from our dataset. Our dataset is publicly available at this link: FaceDataset for further research in the future.

Fig. 1: Example images from our dataset for four identities

B. Data Augmentation
The training and validation images were augmented before being fed to the deep neural network. Deep neural networks need a large amount of data, but our dataset is not large enough; hence we performed data augmentation to avoid over-fitting. Augmented data are obtained from the original images by applying simple geometric transformations such as translation, rotation, change in scale, horizontal flip, etc. The augmented data replace the original training and validation data before going through the neural network model in each epoch, and that is how the neural network model gets different versions of the images of the same class in each epoch.

IV. MODELS OVERVIEW

A. AlexNet
The architecture of the AlexNet model contains eight learned layers: five convolutional and three fully-connected. The output of the final fully-connected layer is fed to a 1000-way softmax, which in our case is reduced to 10. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

The VGG16 architecture consists of 13 convolutional layers, some of which are followed by max-pooling layers, then two fully-connected layers and finally a 1000-way softmax classifier. In this paper we removed the two fully connected layers and the classification layer at the output end and added a single fully connected layer with ReLU activation, followed by a classification layer of 10 neurons with softmax activation. We did not update the weights learned from the ImageNet dataset for the first 14 layers of the VGG16 model. The model therefore does not train its first 14 layers; this is how we fine-tuned the VGG16 model for our purpose and applied Transfer Learning. Classification for 10 classes reduces the total number of trainable parameters from 138,357,544 in the original model to 13,504,778 in the fine-tuned model. The Adam optimizer is also used for the VGG16 model, with the same learning rate as MobileNet, which is 0.0001. Similarly, VGG19 is applied with 14 pre-trained (frozen) layers and a reduced number of weights.

C. MobileNet
MobileNets are based on a streamlined architecture which builds lightweight deep neural networks by using depthwise separable convolutions. Two simple global hyperparameters that efficiently trade off between latency and accuracy are introduced [13]. The MobileNet model is based on depthwise separable convolutions, a form of factorized convolution which factorizes a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution. For MobileNets the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. The first MobileNet model has 28 layers. In this paper we removed the last fully connected layer, the classification layer which was built to classify the 1000 classes of the ImageNet dataset, and added a fully connected layer of 10 classes to classify our data. That reduces the number of trainable parameters from 4,231,976 in the original model to 3,217,226, while the number of non-trainable parameters remains the same: 21,888. We trained all the layers from scratch. Adam [14] optimization, an adaptive learning rate optimization algorithm designed specifically for training deep neural networks, is applied with a learning rate of 0.0001.
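The MobileNet figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative and is not the authors' code. It assumes that the classifier sees a 1024-dimensional feature vector (as in the standard MobileNet), that parameters are stored as 4-byte floats, and, for the convolution comparison, a hypothetical 3×3 layer with 512 input and 512 output channels.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def standard_conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Weights of a k x k depthwise convolution followed by a
    1 x 1 pointwise convolution (biases ignored)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Parameter counts reported in the text for the original MobileNet.
TRAINABLE_1000_CLASS = 4_231_976
NON_TRAINABLE = 21_888

# Swap the 1000-way classifier for a 10-way one.
# Assumption: the classifier input has 1024 features.
mobilenet_10_class = (TRAINABLE_1000_CLASS
                      - dense_params(1024, 1000)
                      + dense_params(1024, 10))
print(mobilenet_10_class)   # 3217226, the value reported in the text

# Approximate model size, assuming 4 bytes per parameter.
size_mb = (TRAINABLE_1000_CLASS + NON_TRAINABLE) * 4 / 1e6
print(round(size_mb, 1))    # 17.0

# Hypothetical 3 x 3 layer with 512 input and 512 output channels.
standard = standard_conv_weights(3, 512, 512)
separable = separable_conv_weights(3, 512, 512)
print(standard, separable)  # 2359296 266752
print(standard / separable) # reduction factor for this layer
```

Under these assumptions the head swap reproduces the reported 3,217,226 trainable parameters, and the parameter count is consistent with the model size of about 17 megabytes mentioned earlier in the paper.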
For our experiment we used Keras [15] with a TensorFlow backend. Keras is a high-level neural networks API, written in Python [...] by providing easy functions to build a customized neural network model as well as enabling the user to apply and customize [...]
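The VGG16 fine-tuning described in Section IV could be expressed in Keras roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the 224×224×3 input size, the 256-unit width of the added fully connected layer, and the function and parameter names are all assumptions. The 256-unit width is chosen only because, together with freezing the first 14 Keras layers, it is consistent with the trainable-parameter count reported for the fine-tuned model.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_finetuned_vgg16(num_classes=10, hidden_units=256,
                          frozen_layers=14, weights="imagenet"):
    """Sketch of a VGG16 transfer-learning model with a new 10-way head."""
    # Convolutional base without the original fully connected layers.
    # Assumption: 224 x 224 RGB inputs.
    base = keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=(224, 224, 3))

    # Keep the pre-trained weights of the first layers fixed.
    for layer in base.layers[:frozen_layers]:
        layer.trainable = False

    # New head: one ReLU fully connected layer, then a softmax classifier.
    # Assumption: hidden_units=256 is not stated in the paper.
    x = layers.Flatten()(base.output)
    x = layers.Dense(hidden_units, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def count_trainable(model):
    """Number of scalar parameters that training will update."""
    return sum(int(np.prod(tuple(w.shape))) for w in model.trainable_weights)
```

Calling `build_finetuned_vgg16(weights=None)` builds the same graph with random weights and avoids the ImageNet download; `count_trainable(model)` can then be compared with the 13,504,778 trainable parameters reported for the fine-tuned VGG16.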
[...] Fig. 2 and Fig. 3. The simplicity of the AlexNet architecture might be the reason behind it. VGG19 shows the best validation accuracy, the performance of which is shown in Fig. 6 and Fig. 7. We tested the models with 100 images from 10 classes, with 10 images for each class. All of these images are different from the training and validation images. Fig. 13 shows that MobileNet predicted 84 images correctly, which is the highest among the models.

Fig. 2: AlexNet Training and Validation Accuracy
Fig. 4: VGG19 Training and Validation Accuracy
Fig. 5: VGG19 Training and Validation Loss
Fig. 6: VGG16 Training and Validation Accuracy
Fig. 9: MobileNet Training and Validation Loss
(Each plot shows Training and Validation curves against Epochs, 0 to 100.)
Fig. 7: VGG16 Training and Validation Loss

Confusion Matrix (rows: true label; columns: predicted label, in the same class order as the rows)

True label    1   2   3   4   5   6   7   8   9  10
(class 1)     7   0   1   0   0   0   0   0   1   1
Daniel        1   3   0   3   1   0   0   2   0   0
(class 3)     1   0   8   0   0   0   0   1   0   0
Ema           1   0   0   7   0   2   0   0   0   0
Emilia        0   0   1   1   5   2   0   0   0   1
Maisie        0   0   1   1   3   5   0   0   0   0
(class 7)     0   1   0   0   0   0   5   1   2   1
Tom           0   2   1   2   0   0   1   4   0   0
Trump         0   0   0   0   0   0   0   0  10   0
Zuckerberg    0   0   4   0   2   0   0   1   0   3

Fig. 10: Confusion Matrix for AlexNet on Test Dataset
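Overall test accuracy and per-class recall can be read directly off a confusion matrix such as the one in Fig. 10. The following is a minimal sketch using the AlexNet counts from Fig. 10, with rows taken as true labels and columns as predicted labels; the function names are illustrative and not from the authors' code.

```python
# AlexNet confusion matrix from Fig. 10 (rows: true label, columns: predicted).
ALEXNET_CM = [
    [7, 0, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 3, 0, 3, 1, 0, 0, 2, 0, 0],
    [1, 0, 8, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 7, 0, 2, 0, 0, 0, 0],
    [0, 0, 1, 1, 5, 2, 0, 0, 0, 1],
    [0, 0, 1, 1, 3, 5, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 5, 1, 2, 1],
    [0, 2, 1, 2, 0, 0, 1, 4, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 10, 0],
    [0, 0, 4, 0, 2, 0, 0, 1, 0, 3],
]


def correct_predictions(cm):
    """Number of test images on the diagonal (predicted == true)."""
    return sum(cm[i][i] for i in range(len(cm)))


def accuracy(cm):
    """Fraction of all test images that were classified correctly."""
    total = sum(sum(row) for row in cm)
    return correct_predictions(cm) / total


def per_class_recall(cm):
    """For each true class, the fraction of its images predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


print(correct_predictions(ALEXNET_CM))  # 57
print(accuracy(ALEXNET_CM))             # 0.57
print(per_class_recall(ALEXNET_CM))
```

Each row sums to 10, matching the 10 test images per class, so the per-class recall is simply the diagonal entry divided by 10.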
Fig. 8: MobileNet Training and Validation Accuracy

Confusion Matrix (rows: true label; columns: predicted label, in the same class order as the rows)

True label    1   2   3   4   5   6   7   8   9  10
(class 1)     7   0   0   0   0   1   0   2   0   0
Daniel        0   9   0   0   0   1   0   0   0   0
(class 3)     0   1   4   0   0   0   0   3   1   1
Ema           0   0   0   8   0   2   0   0   0   0
Emilia        0   0   0   3   1   4   0   0   1   1
Maisie        0   2   0   1   0   7   0   0   0   0
(class 7)     0   0   0   0   0   0   8   0   1   1
Tom           0   0   1   0   0   0   1   7   0   1
Trump         0   0   0   0   0   0   1   0   8   1
Zuckerberg    0   0   0   0   0   0   0   0   0  10

Fig. 11: Confusion Matrix for VGG16 on Test Dataset
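A confusion matrix also shows which identities a model mixes up most often. The sketch below lists the largest off-diagonal entries of the VGG16 matrix in Fig. 11. It is illustrative only: the helper name is hypothetical, and "class 1", "class 3" and "class 7" are placeholders for the three class names that are not legible in the figure.

```python
# Class order of the confusion matrices; three names are placeholders
# because they are not legible in the figures.
LABELS = ["class 1", "Daniel", "class 3", "Ema", "Emilia",
          "Maisie", "class 7", "Tom", "Trump", "Zuckerberg"]

# VGG16 confusion matrix from Fig. 11 (rows: true label, columns: predicted).
VGG16_CM = [
    [7, 0, 0, 0, 0, 1, 0, 2, 0, 0],
    [0, 9, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 4, 0, 0, 0, 0, 3, 1, 1],
    [0, 0, 0, 8, 0, 2, 0, 0, 0, 0],
    [0, 0, 0, 3, 1, 4, 0, 0, 1, 1],
    [0, 2, 0, 1, 0, 7, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 8, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 7, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 8, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
]


def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal entries as
    (count, true label, predicted label) tuples."""
    pairs = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    # Stable sort: ties keep their row-by-row order.
    pairs.sort(key=lambda p: -p[0])
    return pairs[:k]


for count, true_label, predicted in top_confusions(VGG16_CM, LABELS):
    print(f"{true_label} predicted as {predicted}: {count} images")
```

For this matrix the most frequent error is Emilia being predicted as Maisie, which happens for 4 of Emilia's 10 test images.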
Confusion Matrix (rows: true label; columns: predicted label, in the same class order as the rows)

True label    1   2   3   4   5   6   7   8   9  10
(class 1)     6   0   0   0   0   0   0   4   0   0
Daniel        2   5   0   0   0   1   0   2   0   0
(class 3)     3   0   2   0   0   0   1   3   0   1
Ema           1   0   0   8   0   1   0   0   0   0
Emilia        2   0   0   1   2   3   0   1   0   1
Maisie        0   1   0   2   1   4   0   2   0   0
(class 7)     0   0   0   0   0   0   7   2   1   0
Tom           0   0   0   0   0   0   0  10   0   0
Trump         0   0   0   0   0   0   1   3   6   0
Zuckerberg    0   0   0   0   0   0   0   4   0   6

Fig. 12: Confusion Matrix for VGG19 on Test Dataset

[...] accurate performance, but compared with the other three, VGG19 showed the best performance in validation accuracy. Our dataset is made publicly available so that anyone can do further research with it. Other versions of these models, such as MobileNet v2, or other convolutional models can also be applied. Other deep convolutional neural network based models, such as ResNet and Inception in their different versions, can also be applied in further research.

Abbreviations and Acronyms
CNN   Convolutional Neural Network
ReLU  Rectified Linear Unit
GPU   Graphics Processing Unit
RNN   Recurrent Neural Network
API   Application Programming Interface
