
2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December, 2017

Convolutional Neural Network Approach for Vision Based Student Recognition System

Nusrat Mubin Ara¹, Nishikanto Sarkar Simul¹, Md. Saiful Islam
Dept. of CSE, SUST, Sylhet, Bangladesh
[email protected], [email protected], [email protected]

¹ Contributions of the first and second authors to this work are equal.
Abstract—Computers are now smart enough to interact with humans in many different ways, and this interaction is more natural for both human and computer when it is based on recognition. In this article, the authors' concern is to integrate existing algorithms into a working student recognition system. Among the various face recognition methods, a deep learning based method is used here: a Convolutional Neural Network (CNN) generates a low-dimensional representation called an embedding, and these embeddings are then used to classify a person's facial image. On top of this system, applications such as student attendance and building security can be developed. The performance of the resulting system is also reported in this article.

Keywords—Face recognition, Convolutional Neural Networks (CNN), Embeddings, Support Vector Machine (SVM).

I. INTRODUCTION

At present, face recognition is one of the most prevalent topics in computer vision and image processing. Although the algorithms have matured, real-time implementations are still not good enough. The authors' main purpose is therefore to implement face recognition in a system named "Student Recognition System". With such a system, many class activities, such as taking attendance and proctoring exam halls, can be handled in a technical way. The main challenges of this work were to collect a large amount of facial image data, to integrate a Convolutional Neural Network (CNN) into the system, and finally to evaluate whether the system works properly.

II. RELATED WORKS

A. Face Detection and Normalization

Face detection entered the mainstream in the early 2000s when Paul Viola and Michael J. Jones [2] invented a way to detect human faces that was fast enough for face detection work on both low- and high-end digital devices. However, much more reliable solutions exist now. Here the authors use a method invented in 2005 called Histogram of Oriented Gradients [3], or HOG for short, which suffers less from low-light conditions than other face detection systems. After detection, the images need some preprocessing called normalization; face landmark estimation [4] is used for this step.

B. Face Recognition

A comprehensive survey by Jafri and Arabnia [5] shows that all face recognition techniques (up to 2009) can be characterized as either feature-based or holistic. The earliest methods were feature based: they produce a low-dimensional face representation from ratios of distances, areas, and angles [1]. In practice this explicitly defined representation is not accurate. Holistic approaches instead use statistics and Artificial Intelligence (AI) techniques that learn from, and perform well on, a dataset of facial images. For example, Principal Component Analysis (PCA) [6] is a statistical approach that represents faces as a combination of eigenvectors [7]; landmark PCA-based techniques are Eigenfaces [8] and Fisherfaces [9]. In recent years AI-based techniques have become more popular; one such technique, presented by Lawrence et al. [10], uses a convolutional neural network to classify an image of a face.

C. Face Recognition with Neural Networks

Face recognition with deep or convolutional neural networks is a very popular technique today, and it gives higher accuracy and precision than the previous methods. It is used in Facebook's DeepFace [11] and Google's FaceNet [12]. There is also an open source project called OpenFace [13], which provides a Python-based API for face recognition. The idea of generating embeddings with a convolutional neural network was introduced in FaceNet in 2015 [12], and there is much other research on producing embeddings this way, such as the Visual Geometry Group (VGG) Face Descriptor [14] and Lightened Convolutional Neural Networks (CNNs) [15], each with its own implemented system. Training a neural network to generate embeddings requires a large amount of face data; Facebook and Google have their own datasets, and OpenFace also releases networks trained on open facial datasets.

III. DATA COLLECTION

A. Collection Process

Our dataset consists of a modest number of facial images, taken from students of the departments of CSE, EEE, and Software Engineering of Shahjalal University of Science and Technology. We also collected images from some school-going students and some members of the general public. In total we assembled facial images of around two hundred people, with 400 images per person. The size of each image is 640x480. We tried to take all images under the same lighting conditions, with sunlight as the light source.

Fig. 1: Data collection flow chart

IV. METHODOLOGY

This section discusses the approach by which we implemented face recognition in our system. The simplest approach is to compare an unknown face directly against every known face, but that is prohibitively expensive. We believe the most reliable process is to use a deep convolutional neural network (CNN): applying the CNN to a facial image yields 128 measurements, also called an embedding, which are then used to train the recognition system.

A. Face Detection

Earlier, the main challenge of face recognition was detecting the human face at all. Around 2000 this became easier with the face detection framework proposed by Paul Viola and Michael J. Jones [2]. Here, however, we used a gradient based face detection technique [3], which is more reliable for our purposes.

1) What is a gradient?: Only a gray scale image is needed to detect a face, since black-and-white images are easier to process than color images. Each pixel is examined together with the pixels directly surrounding it to figure out how dark the current pixel is relative to its neighbors, and an arrow is drawn pointing toward the darker direction. Repeating this for every pixel replaces the image with arrows that show the flow from light to dark across the entire image. These arrows are called gradients.

2) How do gradients impact detection?:
(a) The gradients are saved and a basic pattern is found by dividing the image into small squares of 16x16 pixels each.
(b) By counting the major gradient directions in each square, that square is replaced with the arrow directions that are strongest.
(c) At the end, a basic structure of the face emerges, called the HOG (Histogram of Oriented Gradients) image or pattern, which is used for detecting the face.

Fig. 2: Gradients

Fig. 3: Darkness comparison

Fig. 4: HOG representation by gradients

Fig. 5: Face detected
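The paper itself gives no code for this step. As a minimal sketch, assuming the dlib toolkit [17] (whose default frontal face detector is HOG based, in the spirit of [3]) and a placeholder image file name, detection can look like this:

```python
# Minimal sketch of HOG-based face detection via the dlib toolkit [17].
# "student.jpg" is a placeholder, not a file from the authors' dataset.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()     # HOG + linear SVM detector

image = cv2.imread("student.jpg")               # hypothetical 640x480 input
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # HOG needs only gray scale

# The second argument upsamples the image once to help find smaller faces.
faces = detector(gray, 1)
for rect in faces:
    print("face at", rect.left(), rect.top(), rect.right(), rect.bottom())
```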


B. Face Alignment

Figure 6 shows two images of the same person facing in different directions. A human can recognize both images easily, but they look entirely different to a computer: the feature extractor extracts different features for the same person, those features confuse the classifier, and the performance of the face recognition system goes down.

Fig. 6: Two different face directions of the same person

To solve this problem, we can warp each picture of the face so that the eyes and lips are always in the same place in the image. This processed image makes it easy for the feature extractor to extract the same features for the same person. The algorithm used for this purpose is called face landmark estimation [16].

1) Face landmark estimation: The idea of face landmark estimation is to find 68 specific points (called landmarks) that exist on every face. Figure 7 shows an image with the 68 landmark points marked. To find these landmarks we used a trained machine learning model from the dlib toolkit [17].

Fig. 7: Face landmarks

2) Alignment: In a perfectly centered image (as in figure 7) these landmarks give the positions of the eyes, nose and mouth. According to those positions, we simply rotate and scale a non-centered facial image so that the eyes and mouth are centered as well as possible. For this purpose we again used the dlib toolkit, applying basic image transformations such as rotation and scaling. As long as a face has a 60-70% frontal view, this process can center the eyes and mouth roughly like the reference centered images, which makes the output of the next step much more accurate.

Fig. 8: Full process of face alignment
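As a hedged sketch of this step, the fragment below estimates the 68 landmarks with dlib's published shape predictor model and then applies a simple rotate-and-scale alignment. The model file name, the 96x96 crop size, and the eye-leveling recipe are illustrative assumptions rather than the authors' exact procedure; OpenFace's own alignment utility performs a similar transform.

```python
# Sketch: 68-point landmark estimation [16][17] plus rotate-and-scale
# alignment. Assumes dlib's published 68-landmark model file.
import cv2
import dlib
import numpy as np

predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(gray, rect, size=96):
    # Estimate the 68 landmark points inside the detected face rectangle.
    shape = predictor(gray, rect)
    pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

    # Landmarks 36-41 and 42-47 outline the left and right eyes.
    left_eye = pts[36:42].mean(axis=0)
    right_eye = pts[42:48].mean(axis=0)

    # Rotate about the midpoint of the eyes so the eye line is horizontal.
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    center = (float((left_eye[0] + right_eye[0]) / 2),
              float((left_eye[1] + right_eye[1]) / 2))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))

    # Crop roughly by the detection box and scale to a fixed size
    # (a sketch only; a production aligner handles the borders carefully).
    x0, y0 = max(rect.left(), 0), max(rect.top(), 0)
    crop = rotated[y0:rect.bottom(), x0:rect.right()]
    return cv2.resize(crop, (size, size))
```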
C. Feature Extraction & Embedding Generation

Feature extraction means deriving informative, non-redundant values from the data; in image processing, these data are frontal facial images. Extraction reduces the problem of analyzing a large amount of data: since large data volumes require large amounts of memory and computational power, it is important to represent each face with a small set of numeric values. These values are the basic measurements of a face that let us distinguish it from others; they might correspond to the size of each ear, the spacing between the eyes, the length of the nose, and so on. This representation is also called an embedding.

1) Feature extraction using a pretrained CNN: There are many techniques to extract features from images, but the most reliable is a deep convolutional neural network (CNN), whose performance is much better than that of the other processes.

A convolutional neural network is an artificial neural network that is a biologically inspired variant of the multilayer perceptron [18]. It consists of subsampling and convolutional layers, usually followed by fully connected layers. An input image has dimensions m × m × r, where m is the height and width and r is the number of channels; an RGB image, for example, has 3 channels. Each convolutional layer has kernels of dimension n × n × q, smaller than the image, which produce local feature maps of size m − n + 1. By applying convolution, pooling, inception modules, etc. to these features, the weights of the full network are updated many times to train it.

The FaceNet [12] architecture developed by Google researchers is used to train our system. As FaceNet is ongoing research, this system uses the nn4.small2 model released in January 2016 [13]. Figure 9 shows the structure of the FaceNet architecture.

Fig. 9: The FaceNet network architecture [12]
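A minimal sketch of embedding generation, assuming the OpenFace [13] Python API and its released nn4.small2 model file; the path and variable names are placeholders:

```python
# Sketch: mapping a 96x96 aligned RGB face to a 128-dimensional embedding
# with the pretrained nn4.small2 network, via the OpenFace [13] Python API.
# The model file path is assumed from the OpenFace release.
import openface

net = openface.TorchNeuralNet("nn4.small2.v1.t7", imgDim=96)

def embed(aligned_rgb):
    """Return the 128 measurements (the embedding) of one aligned face."""
    return net.forward(aligned_rgb)  # numpy array of length 128
```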
2) Convolutions: Convolution is the first step in generating embeddings from the input images. It preserves the contiguous relationships between pixels while learning image features. Every image is a matrix of pixel values; consider a 5x5 image whose pixel values are 0 and 1 (only the gray scale case is considered here) and a 3x3 kernel. The kernel is multiplied element-wise against the image matrix, part by part, and the resulting matrix is saved as the convolved feature matrix.

The size of the convolved feature is controlled by three parameters: depth, stride and zero-padding. Depth is the number of kernels used for the convolution operation; for an RGB image this depth is normally three. Stride is the number of pixels by which the kernel moves at a time: if the stride is 3, the kernel jumps 3 pixels at a time; if it is 4, it jumps 4 pixels, and so on. A large stride therefore produces small feature maps. Finally, zero-padding inserts zeros around the borders of the input image.

Fig. 10: Convolution
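The following NumPy sketch illustrates the operation just described: a 3x3 kernel slides over a 5x5 binary image with a configurable stride, and each windowed product-sum becomes one entry of the convolved feature map. The example matrices are illustrative, not taken from the paper.

```python
# Sketch: naive 2D convolution of a 5x5 binary image with a 3x3 kernel.
import numpy as np

def convolve2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1   # output height
    ow = (image.shape[1] - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # element-wise product, sum
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))  # 3x3 convolved feature map (stride 1)
```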

3) Pooling: Pooling reduces the dimensionality of an image by taking the maximum, average, or sum of neighboring pixel values. Here maximum pooling is used because it works better than averaging or summation: in max pooling, the largest element is taken from each neighborhood (for example, each 2x2 block).

Fig. 11: Pooling
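A minimal NumPy sketch of the max pooling just described, reducing each non-overlapping 2x2 neighborhood to its largest element; the input values are illustrative:

```python
# Sketch: 2x2 max pooling, halving both spatial dimensions.
import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop any ragged border
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))               # largest element per block

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [0, 4, 3, 8]])
print(max_pool(fm))  # [[6 4]
                     #  [7 9]]
```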
4) Inception model: Inception is a newer concept in convolutional neural networks, first used in GoogLeNet [19]. An ordinary layer of a convolutional neural network performs a single operation, such as pooling, 1x1 convolution, 3x3 convolution or 5x5 convolution. In an inception model, one layer is a composition of several such operations in parallel: for example, max pooling followed by 1x1 convolution, a plain 1x1 convolution, and 1x1 convolutions followed by 3x3 or 5x5 convolutions. At the top, the outputs of all branches are concatenated. This convolutional model performs better than a simple convolutional model.
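As an illustration of the inception idea only (not the authors' actual Torch-based nn4.small2 model), such a block might be sketched in Keras as follows; the filter counts are arbitrary assumptions:

```python
# Sketch of one inception-style block: parallel 1x1, 1x1->3x3, 1x1->5x5
# and pooling->1x1 branches, concatenated along the channel axis.
from tensorflow.keras.layers import Conv2D, Input, MaxPooling2D, concatenate
from tensorflow.keras.models import Model

inputs = Input(shape=(96, 96, 3))

b1 = Conv2D(64, (1, 1), padding="same", activation="relu")(inputs)

b2 = Conv2D(32, (1, 1), padding="same", activation="relu")(inputs)
b2 = Conv2D(64, (3, 3), padding="same", activation="relu")(b2)

b3 = Conv2D(16, (1, 1), padding="same", activation="relu")(inputs)
b3 = Conv2D(32, (5, 5), padding="same", activation="relu")(b3)

b4 = MaxPooling2D((3, 3), strides=(1, 1), padding="same")(inputs)
b4 = Conv2D(32, (1, 1), padding="same", activation="relu")(b4)

outputs = concatenate([b1, b2, b3, b4])  # stack all branch outputs
model = Model(inputs, outputs)
```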
5) Training with a convolutional neural network: The convolutional neural network is treated as a black box: after specifying the model, there is nothing to tune inside its core. According to FaceNet [12], the most important part lies in the end-to-end learning of the whole system, using a triplet loss function that directly reflects performance on face recognition, verification and clustering.

A triplet is a combination of three face images, two from one person and one from another. The first and second images, which come from the same person, are called the anchor and the positive respectively; the third, from the other person, is called the negative. $f(x) \in \mathbb{R}^d$ denotes the embedding of an image $x$: the function embeds $x$ into a $d$-dimensional Euclidean space. After generating the embedding for each image of a triplet, Eq. (1) compares the distance from the anchor to the positive with the distance from the anchor to the negative. The goal is to ensure that the anchor $x_i^a$ of a particular person is closer to all other images $x_i^p$ (positives) of the same individual than to any image $x_i^n$ (negative) of any other individual. The loss $L$ that is minimized according to this comparison is given by Eq. (3). A new triplet is then drawn from the training dataset and the whole process is repeated until the anchor-positive distance is minimal. Figure 12 visualizes the training result.

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2 \qquad (1)$$

$$\forall \left( f(x_i^a),\, f(x_i^p),\, f(x_i^n) \right) \in \mathcal{T} \qquad (2)$$

$$L = \sum_i^N \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \qquad (3)$$

In Eq. (1) and Eq. (3), $\alpha$ is the margin enforced between positive and negative pairs. Eq. (2) ranges over all triplets, where $\mathcal{T}$ is the set of all possible triplets in the training dataset.

When the distance between anchor and positive is minimal and the distance between anchor and negative is maximal, the convolutional neural network is ready to generate embeddings for new examples. Figure 13 shows an input face image with its embedding.

Fig. 12: The triplet loss function minimizes the distance between an anchor and a positive and maximizes the distance between the anchor and a negative. Positive and anchor have the same identity, while anchor and negative have different identities.

Fig. 13: Generated embeddings of a facial image by the convolutional neural network
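A minimal NumPy sketch of the triplet loss of Eq. (3); the margin value α = 0.2 follows the FaceNet paper [12] and is an assumption here, not a value stated in this article:

```python
# Sketch: triplet loss of Eq. (3) over a batch of N triplets.
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """anchor, positive, negative: (N, 128) arrays of embeddings."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # ||f(x_a)-f(x_p)||^2
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # ||f(x_a)-f(x_n)||^2
    return np.sum(np.maximum(d_ap - d_an + alpha, 0.0))  # hinge [.]_+
```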

D. Create Classifier with the Features using SVM

After generating embeddings, any machine learning algorithm can be used to build a classifier for face recognition. A linear support vector machine is particularly easy to use here [20], so this system uses the support vector machine algorithm. We are not strongly committed to SVM over other machine learning methods for classifying the embeddings; according to our results, a plain linear SVM takes very little time for both training and prediction.
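A minimal scikit-learn sketch of this classification step, assuming the 128-dimensional embeddings and identity labels have been saved to hypothetical files during training:

```python
# Sketch: a linear SVM trained on 128-dimensional face embeddings.
# File names and the probe example are placeholders.
import numpy as np
from sklearn.svm import SVC

embeddings = np.load("train_embeddings.npy")  # shape (num_images, 128)
labels = np.load("train_labels.npy")          # one identity per image

clf = SVC(kernel="linear", probability=True)
clf.fit(embeddings, labels)

# Predict the identity of a new aligned face from its embedding.
probe = np.load("probe_embedding.npy").reshape(1, -1)
print(clf.predict(probe)[0])
```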

E. Full System Architecture

1) Training architecture: For training, the system is fed images from the external dataset. Through the steps described above (face detection, alignment, embedding generation, and so on), the system creates a CSV file that is used by the classifier.

Fig. 14: Dataset training architecture

2) Prediction architecture: The system predicts the person with the help of the CSV file created at training time.

Fig. 15: Face prediction or recognition architecture

V. RESULTS

The original dataset is divided into two parts, one for training the system and one for testing. Each part contains facial images of 200 persons, with 30 facial images per person. System performance is measured by accuracy, prediction time and training time, and the performance calculation was a step-by-step process.

First, the system was trained with images of 20 persons (30 images per person), and a testing dataset containing different images of the same 20 people (30 images per person) was given to the system for prediction; the prediction results were stored in a text file. Then 20 more people's images were added to the training and testing datasets, and after training and prediction a new text file was created. This process continued until the system had handled all 200 persons' images.

Fig. 16: Normalization time

Since the images were not preprocessed or normalized beforehand, normalization takes a small amount of time, during which obviously false images can be discarded. The normalization time graph shows only slight fluctuations in normalization time as the number of training images increases.

Fig. 17: Embeddings generation time

Fig. 18: SVM training time

However, embedding generation time and classifier training time change substantially as the number of training images grows, which will be a problem if we want to retrain as soon as a new person's facial data enters the system. Prediction time poses no such problem; its small value is easily ignorable.
Fig. 19: Average prediction time

Fig. 20: Accuracy

In general, it is very hard to reach 100% accuracy with any machine learning algorithm, so we expected a face recognition system with at least 95% accuracy. According to the accuracy graph, this neural network approach achieves greater than 95% accuracy, although all testing images were taken in the same session as the training images.

Among the 200 persons, the six people's images shown in figure 21 were predicted falsely. We believe the reason is the alignment of those faces: since test images were drawn at random from our datasets, these six images happened to be unsuitable for prediction, and so were predicted falsely.

Fig. 21: Images with false predictions

VI. CONCLUSIONS

Our main mission is to recognize humans individually by visual recognition. As a first step, we recognize human facial images using training data supplied to the system externally. The data collection section describes the data used for training and testing the system, and the system's accuracy is shown above, using a deep learning algorithm, the convolutional neural network. The system is now ready to recognize students individually. Although some false predictions were found, they can be reduced by increasing the number of training images.

REFERENCES

[1] Kanade, Takeo. "Picture processing system by computer complex and recognition of human faces." Doctoral dissertation, Kyoto University 3952 (1973): 83-97.
[2] Viola, Paul, and Michael J. Jones. "Robust real-time face detection." International Journal of Computer Vision 57.2 (2004): 137-154.
[3] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.
[4] Kazemi, Vahid, and Josephine Sullivan. "One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[5] Jafri, Rabia, and Hamid R. Arabnia. "A survey of face recognition techniques." JIPS 5.2 (2009): 41-68.
[6] Hotelling, Harold. "Analysis of a complex of statistical variables into principal components." Journal of Educational Psychology 24.6 (1933): 417.
[7] Sirovich, Lawrence, and Michael Kirby. "Low-dimensional procedure for the characterization of human faces." JOSA A 4.3 (1987): 519-524.
[8] Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
[9] Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
[10] Lawrence, Steve, et al. "Face recognition: A convolutional neural-network approach." IEEE Transactions on Neural Networks 8.1 (1997): 98-113.
[11] Taigman, Yaniv, et al. "DeepFace: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[12] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[13] Amos, Brandon, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. "OpenFace: A general-purpose face recognition library with mobile applications." Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.
[14] Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." BMVC. Vol. 1. No. 3. 2015.
[15] Wu, Xiang, Ran He, and Zhenan Sun. "A lightened CNN for deep face representation." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[16] Kazemi, Vahid, and Josephine Sullivan. "One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[17] King, Davis E. "Dlib-ml: A machine learning toolkit." Journal of Machine Learning Research 10 (2009): 1755-1758.
[18] Hubel, David H., and Torsten N. Wiesel. "Receptive fields and functional architecture of monkey striate cortex." The Journal of Physiology 195.1 (1968): 215-243.
[19] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[20] Lee, Yuh-Jye, and Olvi L. Mangasarian. "SSVM: A smooth support vector machine for classification." Computational Optimization and Applications 20.1 (2001): 5-22.
