IV. Methodology

This section describes the approach by which we implemented the face recognition system. The simplest approach to face recognition is to compare an unknown face directly against every known face, but that brute-force comparison does not scale. We believe the most reliable process is to use a deep convolutional neural network (CNN). Applying this CNN, we obtain 128 measurements of a facial image, also called an embedding, which are used to train the system.

A. Face Detection

Historically, the main challenge of face recognition was detecting the human face at all. After around 2000, this became easier with the face detection framework proposed by Paul Viola and Michael J. Jones [2]. Here, however, we use a gradient-based face detection technique [3], which is more reliable for our purposes.

1) What is a gradient?: Only a grayscale image is needed to detect a face, since black-and-white images are easier to process than color images. Each pixel is taken one at a time, and we look at the pixels directly surrounding it to figure out how dark the current pixel is compared with its neighbors, then draw an arrow pointing toward the darker side. Repeating this for every pixel, the resulting arrows show the flow from light to dark across the entire image. These arrows are called gradients.

Fig. 2: Gradients
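To make the per-pixel procedure concrete, here is a minimal sketch (not the paper's code) of computing gradients with OpenCV; the image path is a placeholder.

```python
import cv2

# Minimal sketch: per-pixel gradients of a grayscale image.
# Sobel filters approximate the horizontal (gx) and vertical (gy)
# intensity changes; together they give each pixel's "arrow".
image = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

gx = cv2.Sobel(image, cv2.CV_32F, 1, 0)  # change along x
gy = cv2.Sobel(image, cv2.CV_32F, 0, 1)  # change along y

# Magnitude says how strong the light-to-dark transition is,
# orientation says which way the arrow points.
magnitude, orientation = cv2.cartToPolar(gx, gy, angleInDegrees=True)
```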
2) How do gradients impact detection?:
(a) The gradients are saved, and a basic pattern is found by dividing the image into small squares of 16x16 pixels each.
(b) By counting the major gradient directions in each square, we replace that square of the image with the arrow directions that are strongest.
(c) In the end, a basic structure of the face emerges, called a HOG (Histogram of Oriented Gradients) image or pattern, which is used to detect the face; a usage sketch of such a HOG-based detector follows.

Fig. 4: HOG representation by gradients.
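The steps above are what HOG-based detectors implement internally. As an illustration, the dlib toolkit [17] ships a pre-trained HOG face detector; the snippet below is a hedged usage sketch, not the authors' exact code, and the image path is a placeholder.

```python
import cv2
import dlib

# dlib's default frontal face detector is HOG-based.
detector = dlib.get_frontal_face_detector()

image = cv2.imread("group_photo.jpg")           # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # detection works on grayscale

# The second argument upsamples the image once to help find small faces.
for rect in detector(gray, 1):
    print(rect.left(), rect.top(), rect.right(), rect.bottom())
```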
B. Face Alignment

Figure 6 shows two images of the same person facing in different directions. A human can easily recognize both images, but to a computer they look entirely different: the feature extractor extracts different features for the same person, these conflicting features confuse the classifier, and the performance of the face recognition system drops.

Fig. 6: Two different faces of the same person

To solve this problem, we can warp each picture of the face so that the eyes and lips always sit in the same place in the image. The processed image makes it easy for the feature extractor to extract the same features for the same person. The algorithm used for this purpose is called face landmark estimation [16].

1) Face landmark estimation: The idea of face landmark estimation is to find 68 specific points (called landmarks) that exist on every face. Figure 7 shows an image with the 68 landmark points marked. To find these landmarks in an image, we used a trained machine learning model from the dlib toolkit [17].

2) Alignment: In a perfectly centered image (as shown in Figure 7), the landmarks mark the positions of the eyes, nose and mouth. Using those positions, we simply rotate and scale a non-centered facial image so that the eyes and mouth are centered as well as possible. For this purpose we again used the dlib toolkit, which applies basic image transformations such as rotation and scaling; a sketch is given below.
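Below is a minimal sketch of landmark estimation and alignment with dlib [16], [17]. The 68-point model file is the one dlib distributes separately; the image path, the assumption of exactly one detected face, and the 150-pixel output size are illustrative choices, not the paper's settings.

```python
import dlib

detector = dlib.get_frontal_face_detector()
# Pre-trained 68-point landmark model distributed by dlib.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = dlib.load_rgb_image("face.jpg")  # placeholder path
rect = detector(image, 1)[0]             # assume one face was found
landmarks = predictor(image, rect)       # the 68 landmark points

# Rotate and scale the face so eyes and mouth land at fixed positions.
aligned = dlib.get_face_chip(image, landmarks, size=150)
```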
5) Training with a Convolutional Neural Network: A convolutional neural network is treated as a black box: once the model is specified, there is nothing more to do with its core. According to FaceNet [12], the most important part lies in the end-to-end learning of the whole system, using a triplet loss function that directly reflects performance on face recognition, verification and clustering.

Fig. 9: The FaceNet network architecture [12].

A triplet is a combination of three face images in which two images come from one person and one from another. The first and second images, which are from the same person, are called the anchor and the positive respectively; the last one, from the other person, is called the negative. $f(x) \in \mathbb{R}^d$ represents the embedding of an image $x$: the function embeds $x$ into a $d$-dimensional Euclidean space. After an embedding has been generated for each image of a triplet, Eq. (1) compares the anchor-to-positive distance with the anchor-to-negative distance. The goal of this comparison is to ensure that the anchor $x_i^a$ of a particular person is closer to all other images $x_i^p$ (positive) of the same individual than it is to any image $x_i^n$ (negative) of any other individual. From the comparison, the loss $L$ that is minimized is given by Eq. (3). A new triplet is then drawn from the training dataset and the whole process is repeated until the distance between anchor and positive is minimal. Figure 12 shows a visualization of the training result.
$$\lVert f(x_i^a) - f(x_i^p) \rVert_2^2 + \alpha < \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 \qquad (1)$$

$$\forall \, \left( f(x_i^a), f(x_i^p), f(x_i^n) \right) \in T \qquad (2)$$

$$L = \sum_i^N \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+ \qquad (3)$$

In Eq. (1) and Eq. (3), $\alpha$ is the margin that is enforced between positive and negative pairs. Eq. (2) ranges over all triplets, where $T$ is the set of all possible triplets in the training dataset.

Fig. 12: The triplet loss function minimizes the distance between an anchor and a positive and maximizes the distance between the anchor and a negative. Here the positive and the anchor have the same identity, while the anchor and the negative have different identities.

When the distance between the anchor and the positive is minimal and the distance between the anchor and the negative is maximal, the convolutional neural network is ready to generate embeddings for new examples. Figure 13 shows an input face image with its embedding. A minimal sketch of the loss follows.
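For reference, the following NumPy sketch evaluates Eq. (3) for a batch of triplets. It illustrates the loss only and is not the FaceNet training code; the default margin 0.2 follows the FaceNet paper [12], and the embeddings are assumed to be the network's d-dimensional outputs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Eq. (3): sum over triplets of [d(a,p) - d(a,n) + alpha]_+ .

    anchor, positive, negative: (N, d) arrays holding the embeddings
    f(x_i^a), f(x_i^p), f(x_i^n) for N triplets.
    alpha: the margin enforced between positive and negative pairs.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # ||f(a) - f(n)||^2
    return np.sum(np.maximum(d_pos - d_neg + alpha, 0.0))
```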
D. Create Classifier with the Features Using SVM

After generating the embeddings, any machine learning algorithm can be used to create a classifier for face recognition. A linear support vector machine is particularly easy to use here [20], so this system uses the support vector machine algorithm. We are not strongly committed to SVMs over other machine learning methods for classifying the embeddings; according to our results, a plain linear SVM simply takes very little time for both training and prediction.
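A minimal sketch of this step with scikit-learn's linear SVM is shown below; the use of scikit-learn and the .npy file names are our illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# embeddings: (n_samples, 128) array of facial measurements;
# labels: the person's identity for each row. Placeholder files.
embeddings = np.load("train_embeddings.npy")
labels = np.load("train_labels.npy")

clf = LinearSVC()            # a plain linear SVM suffices here [20]
clf.fit(embeddings, labels)

# Predict the identity of a new face from its 128-d embedding.
query = np.load("query_embedding.npy").reshape(1, -1)
print(clf.predict(query))
```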
E. Full System Architecture

1) Training Architecture: For training, the system is given images from the external dataset. By running the steps described above (face detection, alignment, embedding generation, etc.), the system creates a CSV file that is used for the classifier.

Fig. 14: Dataset training architecture

2) Prediction Architecture: The system predicts the person with the help of the CSV file created at training time, as sketched below.
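To tie the pieces together, here is a rough sketch of the prediction path. compute_embedding is a hypothetical stand-in for the trained CNN of Section IV, and clf is the SVM classifier trained above; both names are assumptions for illustration only.

```python
import dlib

def predict_person(image_path, detector, predictor, compute_embedding, clf):
    """Detect, align, embed and classify one face image."""
    image = dlib.load_rgb_image(image_path)
    rect = detector(image, 1)[0]                              # face detection
    chip = dlib.get_face_chip(image, predictor(image, rect))  # alignment
    embedding = compute_embedding(chip)                       # hypothetical CNN call
    return clf.predict(embedding.reshape(1, -1))[0]
```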
V. Results
The original dataset is divided into two parts: one for training the system and one for testing. Each part contains facial images of 200 persons, with 30 facial images per person. System performance is measured by accuracy, prediction time and training time, and the measurement was carried out step by step.

First, the system is trained with 20 people's images (30 images per person), and a testing dataset containing the same 20 people's other images (30 images per person) is given to the system for prediction. The prediction results are stored in a text file. Then 20 more people's images are added to the training and testing datasets, and after training and prediction a new text file is created. This process was repeated until all 200 people were included.

Fig. 16: Normalization time

Because the images were not preprocessed or normalized in advance, the system spends a small amount of time normalizing them, during which unusable images can be discarded. The normalization time graph shows only slight up-and-down changes in normalization time as the number of training images increases.

Fig. 18: SVM training time

Embedding generation and classifier training, however, change massively as the training images increase, which becomes a problem if we want to retrain as soon as a new person's facial data is captured by the system. Prediction time poses no such problem; it is small enough to be safely ignored.

Fig. 19: Average prediction time
In general, it is very hard for any machine learning algorithm to reach 100% accuracy, so we aimed for a face recognition system with at least 95% accuracy. According to the accuracy graph, this neural network approach achieves more than 95% accuracy, although all testing images were taken at the same time as the training images.

Fig. 20: Accuracy

Among the 200 people, the six whose images are shown in Figure 21 were predicted falsely. We believe the reason for the false predictions is the alignment of those faces: since we drew random images from our datasets for testing, we ended up with six images that were not suitable for prediction, and they were therefore misclassified.

Fig. 21: Images that received false predictions

VI. Conclusions

Our main mission is to identify humans individually by visual recognition. As a very first step, we recognize human facial images using training data supplied to the system externally. In the data collection section we described the collected data used for training and testing the system. We reported the system's accuracy above, using a deep-learning-based algorithm, the convolutional neural network. The system is now ready to recognize the students individually. Although some false predictions occur, they can be reduced by increasing the number of training images.

References

[1] Kanade, Takeo. "Picture processing system by computer complex and recognition of human faces." Doctoral dissertation, Kyoto University 3952 (1973): 83-97.
[2] Viola, Paul, and Michael J. Jones. "Robust real-time face detection." International Journal of Computer Vision 57.2 (2004): 137-154.
[3] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.
[4] Kazemi, Vahid, and Josephine Sullivan. "One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[5] Jafri, Rabia, and Hamid R. Arabnia. "A survey of face recognition techniques." JIPS 5.2 (2009): 41-68.
[6] Hotelling, Harold. "Analysis of a complex of statistical variables into principal components." Journal of Educational Psychology 24.6 (1933): 417.
[7] Sirovich, Lawrence, and Michael Kirby. "Low-dimensional procedure for the characterization of human faces." JOSA A 4.3 (1987): 519-524.
[8] Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
[9] Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
[10] Lawrence, Steve, et al. "Face recognition: A convolutional neural-network approach." IEEE Transactions on Neural Networks 8.1 (1997): 98-113.
[11] Taigman, Yaniv, et al. "DeepFace: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[12] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[13] Amos, Brandon, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. "OpenFace: A general-purpose face recognition library with mobile applications." Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.
[14] Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." BMVC. Vol. 1. No. 3. 2015.
[15] Wu, Xiang, Ran He, and Zhenan Sun. "A lightened CNN for deep face representation." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[16] Kazemi, Vahid, and Josephine Sullivan. "One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[17] King, Davis E. "Dlib-ml: A machine learning toolkit." Journal of Machine Learning Research 10 (2009): 1755-1758.
[18] Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of Physiology 195.1 (1968): 215-243.
[19] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[20] Lee, Yuh-Jye, and Olvi L. Mangasarian. "SSVM: A smooth support vector machine for classification." Computational Optimization and Applications 20.1 (2001): 5-22.