Theoretical and Practical Analysis of CNN, MTCNN and Caps-Net Based Face Recognition and Detection
A Project Report
Submitted by:
Darshan Shah
of
BACHELOR OF TECHNOLOGY
IN
at
Ahmedabad, Gujarat
05/2019
Acknowledgement
I would like to express my sincere appreciation to all those who provided me with the opportunity to complete this project. Special gratitude goes to our final year project supervisor, Ruchir Brahmbhatt, whose stimulating suggestions and encouragement helped me to coordinate my work. I would also like to acknowledge with much appreciation the crucial role of the staff of Ecosmob Technologies, who helped build the large-scale face dataset and gave permission to use all required equipment and material needed to complete the project. A special thank you goes to my teammate, Shivani Shah, who helped me assemble the parts and shared ideas about the tasks.
Abstract
As one of the most successful applications of image analysis and understanding, face recognition has recently received significant attention, especially during the past few years. Facial recognition (FR) has emerged as an attractive solution to address many contemporary needs for identification and for the verification of identity claims. It combines the promise of other biometric systems, which attempt to tie identity to individually distinctive features of the body, with the more familiar functionality of visual surveillance systems. This report develops a socio-technical analysis that connects the technical and social-scientific literature on FR and addresses the unique challenges and concerns that attend its development, evaluation, and specific operational uses, contexts, and goals. It highlights the potential and limitations of the technology, noting those tasks for which it seems ready for deployment, those areas where performance obstacles may be overcome by future technological developments or sound operating procedures, and still other issues which appear intractable. Its concern with effectiveness extends to ethical considerations.
In this project, I implemented a face recognition system with face anti-spoofing for use in the company's project security system. There are many ways to build face recognition systems, such as the eigenface method, CNN-based classification and Caps-net-based classification; in this project I use CNN-based and Caps-net-based recognition. For face recognition the face must first be detected, and for that I use the MTCNN face detection method. Because this project is part of the company's security system, it is important that no spoofed image is accepted during recognition, which is why I also built a face spoof detector. Only the face detection method (MTCNN) is used directly; for recognition and spoof detection I built my own models based on CNNs and machine learning. I also created a new large-scale dataset for recognition and spoof detection: the recognition dataset contains 12,000 images of 90 classes, and for spoof detection I collected 23,000 real face images and 21,000 fake images for training.
Timeline / Gantt Chart
Index
Title Page
Declaration of the Student
Certificate of the Guide
Abstract
Acknowledgement
Timeline / Gantt Chart
1. INTRODUCTION
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS & DESIGN
4. RESULTS / OUTPUTS
5. CONCLUSIONS / RECOMMENDATIONS
6. REFERENCES
1. INTRODUCTION
Facial recognition is a problem that has been worked on around the world by many people; it has emerged in multiple fields and sciences, particularly in computer science, and other fields that are very interested in this area include mechatronics and robotics. The human face is not static: there are numerous factors that cause variation in the appearance of the face. In addition, a secure face recognition system is very hard to build at large scale, because it can be spoofed by showing a face photo at recognition time.
● Python3
Python was developed by Guido van Rossum in the early 1990s and its current version is 3.7.1, which we can simply call Python3. Python 3.0 was released in 2008. It is an interpreted language, i.e. it is not compiled; the interpreter checks the code line by line.
● Tensorflow
TensorFlow is a Python library for fast numerical computing created and released by Google. It is a foundation library that can be used to create deep learning models directly, or through wrapper libraries built on top of TensorFlow that simplify the process. TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state of the art in ML and developers easily build and deploy ML-powered applications.
● Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that allows for easy and fast prototyping (through user friendliness, modularity, and extensibility), supports both convolutional networks and recurrent networks, as well as combinations of the two, and runs seamlessly on CPU and GPU.
● IP Webcam
This is an Android app that makes it possible to use a mobile phone camera from Python for testing. It is used for testing face spoofing detection with mobile cameras.
● Scikit-learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms through a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use. The library is built upon the SciPy (Scientific Python) stack, which must be installed before you can use scikit-learn.
● Sublime
Sublime Text is a cross-platform source code editor with a Python application programming interface. It natively supports many programming and markup languages, and functionality can be added by users with plugins, typically community-built and maintained under free-software licenses.
2. LITERATURE SURVEY
1. Using Haar Cascades:
This method is based on the Haar wavelet technique, which analyses the pixels of the image in square regions using a function. It uses machine learning techniques to obtain a high degree of accuracy from what is called "training data", and it uses the "integral image" concept to compute the detected "features" efficiently. Haar cascades use the AdaBoost learning algorithm, which selects a small number of important features from a large set to produce an efficient cascade of classifiers. Haar cascades rely on machine learning in which a classifier is trained from many positive and negative images. This step of the algorithm is feature extraction.
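As a concrete illustration, the snippet below is a minimal sketch of Haar-cascade face detection with OpenCV; the cascade file shipped with opencv-python is used and the image path is a placeholder, not the project's actual configuration.

import cv2

# Load the pretrained frontal-face cascade bundled with OpenCV.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("sample.jpg")                      # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # Haar features are computed on grayscale

# detectMultiScale scans the integral image at several scales and returns (x, y, w, h) boxes.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)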
2. Using MTCNN:
The MTCNN model consists of three separate networks: the P-net, the R-net, and the O-net.
After the third convolution layer, the P-net splits into two branches: the activations from the third layer are passed to two separate convolution layers, and a softmax layer follows one of those convolution layers. Convolution 4-1 outputs the probability of a face being in each bounding box, and convolution 4-2 outputs the coordinates of the bounding boxes.
R-net has a similar structure, but with even more layers. It takes the P-net bounding boxes as its inputs and refines their coordinates. Like the P-net, R-net splits into two branches at the end, giving two outputs: the coordinates of the new bounding boxes and the network's confidence in each bounding box. Finally, O-net takes the R-net bounding boxes as inputs and also marks the coordinates of facial landmarks.
O-net splits into three branches at the end, giving three outputs: the probability of a face being in the box, the coordinates of the bounding box, and the coordinates of the facial landmarks.
Algorithm
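The listing below is a minimal sketch of running this cascade with the open-source `mtcnn` Python package (not the company's retrained model); "sample.jpg" is a placeholder path.

import cv2
from mtcnn import MTCNN

detector = MTCNN()
img = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)   # MTCNN expects RGB

# detect_faces runs the P-net / R-net / O-net cascade and returns, per face,
# a bounding box, a confidence score, and five facial landmarks.
for face in detector.detect_faces(img):
    x, y, w, h = face["box"]
    print("box:", (x, y, w, h), "confidence:", face["confidence"])
    print("landmarks:", face["keypoints"])   # left_eye, right_eye, nose, mouth_left, mouth_right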
3. CNN based face classification:
Nowadays, several CNN models have already been released publicly. These models have very deep layers and were trained using computers with high specifications (most notably their GPU and RAM). One of the models I studied in more detail is VGG16. VGG16 can classify an image into one of 1000 possible classes.
The input to the conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers, where the filters use a very small receptive field: 3×3 (the smallest size able to capture the notion of left/right, up/down, centre). In one of the configurations it also uses 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of the conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.
Three fully-connected (FC) layers follow the stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) classification and therefore contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalisation (LRN); such normalisation does not improve performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
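As an illustration of the architecture described above, the following is a minimal sketch of loading the pretrained VGG16 shipped with Keras and performing the 1000-way ILSVRC classification; the image path is a placeholder.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")            # 13 conv layers + 3 FC layers, soft-max output

img = image.load_img("sample.jpg", target_size=(224, 224))    # fixed 224x224 RGB input
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)                     # shape (1, 1000), one score per ILSVRC class
print(decode_predictions(preds, top=3)[0])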
4. Caps-net
Convolutional layer:
This layer's job is to detect basic features in the image. In the original Caps-net it is a single convolutional layer with 256 kernels of size 9x9 and stride 1, followed by a ReLU non-linearity, producing the 20x20x256 volume that feeds the primary capsules.
PrimaryCaps layer:
This layer has 32 primary capsules whose job is to take the basic features detected by the convolutional layer and produce combinations of those features. The 32 "primary capsules" are very similar to a convolutional layer in nature. Each capsule applies eight 9x9x256 convolutional kernels (with stride 2) to the 20x20x256 input volume and therefore produces a 6x6x8 output tensor. Since there are 32 such capsules, the output volume has the shape 6x6x8x32. Doing a calculation similar to the one for the preceding layer, we get 5,308,672 trainable parameters in this layer.
DigitCaps layer:
This layer has 10 digit capsules, one for each digit. Every capsule takes as input a 6x6x8x32 tensor. We can think of it as 6x6x32 eight-dimensional vectors, which is 1152 input vectors in total. As per the inner workings of a capsule, each of those input vectors gets its own 8x16 weight matrix that maps the 8-dimensional input space to the 16-dimensional capsule output space. So there are 1152 matrices for each capsule, and also 1152 c coefficients and 1152 b coefficients used in the dynamic routing. Multiplying: 1152 x 8 x 16 + 1152 + 1152, we get 149,760 trainable parameters per capsule; multiplying by 10 gives the final number of parameters for this layer, 1,497,600 in total.
At the end, the loss of each output vector is computed and the class with the minimum loss is taken as the predicted output.
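To make the capsule mechanics above concrete, the small sketch below shows the "squash" non-linearity applied to capsule outputs (following Sabour et al., reference 1) and verifies the parameter counts quoted above; it is illustrative, not the trained model.

import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Shrinks short vectors toward 0 and long vectors toward unit length,
    # so the vector length can be read as the probability that the entity exists.
    sq_norm = np.sum(np.square(v), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

# PrimaryCaps: a 9x9 convolution over 256 input channels producing 32 * 8 = 256 maps (weights + biases).
assert 9 * 9 * 256 * 256 + 256 == 5308672
# DigitCaps: 1152 input vectors, each with an 8x16 weight matrix, plus the 1152 coupling (c)
# and 1152 routing-logit (b) coefficients, multiplied by 10 digit capsules.
assert (1152 * 8 * 16 + 1152 + 1152) * 10 == 1497600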
It is clear that face recognition systems based on 2D and 3D images can be exposed to spoofing attacks. Researchers have analysed these attacks in terms of descriptors and classifiers. Descriptors are categorised as texture, motion, frequency, colour, shape or reflectance, and classifiers are organised as discriminant, regression, distance metric or heuristic.
Overview of Descriptors
● Texture: Textural features are extracted from face images under the assumption that reproduced faces produce certain texture patterns that do not exist in genuine ones. Ex: Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG), Deep Neural Networks (DNN).
● Motion: Two main techniques are described in the research: detecting and describing intra-face variations, such as eye blinking, facial expressions, mouth movement, head rotation and so forth, and evaluating the consistency of the user's interaction with the environment.
● Frequency: Takes advantage of certain image artefacts that arise in spoofing attacks.
● Colour: Although colours do not stay constant under lighting variations, certain dominant attributes provide strong cues to separate impostors from genuine faces.
● Shape: Shape information can be very useful to cope with printed-photo attacks, since facial geometry cannot be reproduced on a planar surface.
● Reflectance: Based on the idea that genuine and impostor faces behave differently under the same illumination conditions, it is possible to use information from the light reflected by the face to distinguish them.
Overview of Classifiers
● Discriminant: Here, different classes are distinguished by minimising intra-class variation while maximising between-class variation. The following are some common classifiers in spoofing detection. Ex: Support Vector Machines (SVM), Bayesian Networks (BN), Linear Discriminant Analysis (LDA), Multilayer Perceptron (MLP).
● Regression: Regression-based classification maps input descriptors directly to their class labels, using a predictive model learned from known pairs of descriptors and labels. Ex: Kernel Discriminant Analysis (KDA), Linear Logistic Regression (LLR).
● Heuristic: Specific heuristics are used to decide whether a face is real or fake. One drawback is that heuristics may cause over-fitting when self-collected data is used. Ex: a weighted sum of motion measurements, number of eye blinks, average pixel ratio thresholding, motion measurement thresholding.
2.2 Proposed System
I built the company security system as a project which uses face detection, face recognition and face anti-spoofing. I use MTCNN for face detection, trained on our company's face dataset; the face detection dataset contains 8,000 face images for training. I built a CNN model for face recognition based on the VGG16 base model, modified with some additional layers, and it gives good results on our face recognition dataset. I am not using any online dataset for training and testing because of the company's security requirements and agreement conditions; all training and testing data was created within the company with the help of the company's employees. For face anti-spoofing detection I used CNN and LBP features in different colour channels; CNN and LBP give texture-based features that help to classify fake and real faces in real time. I collected 23,000 real images and 21,000 fake images for training the face spoofing model and extracting features from those images.
The first step is to capture the image and normalise it to 512x512, because the computational complexity of MTCNN is high. There is not much difference between MTCNN and the Haar cascade in terms of accuracy; the Haar cascade is faster than MTCNN, but I use MTCNN and train it on our data. After detecting the face, I pass the detected, normalised face to the anti-spoofing detection model to check whether the detected face is fake or real.
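The following is a high-level sketch of this pipeline under the stated assumptions; `is_real_face` and `recognize` are hypothetical helpers standing in for the anti-spoofing and recognition models described below.

import cv2
from mtcnn import MTCNN

detector = MTCNN()

def process_frame(frame, is_real_face, recognize):
    # Normalise the frame to 512x512 to keep MTCNN's computational cost bounded.
    frame = cv2.resize(frame, (512, 512))
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    for face in detector.detect_faces(rgb):
        x, y, w, h = face["box"]
        crop = rgb[max(y, 0):y + h, max(x, 0):x + w]
        if is_real_face(crop):               # anti-spoofing check comes first
            print("recognised:", recognize(crop))
        else:
            print("spoof attempt rejected")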
Since I am building a security system, the face spoofing part is the most important part of this project, and the face anti-spoofing detection system should be highly accurate at separating fake from real faces. For face anti-spoofing detection I use the HSV and YCrCb colour channels. I remove the H component from HSV because the hue component causes brightness issues. These two colour spaces give better results for feature extraction than the RGB colour channels.
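A small sketch of this colour-space preparation is shown below; the exact preprocessing used in the project may differ.

import cv2
import numpy as np

def colour_channels(face_bgr):
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2YCrCb)
    sv = hsv[:, :, 1:]                                  # keep only S and V, drop H
    return np.concatenate([sv, ycrcb], axis=-1)         # 5-channel representation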
I use two methods to detect face spoofing: one is CNN based and the second is machine learning based. In the CNN-based method I build a CNN model and train it on our dataset; it is a binary classification problem, and the model details are in the flowchart. The second method finds LBP (Local Binary Pattern) features of every image in the YCrCb and HSV colour channels and then builds histograms of those features. Those features are taken as input to an SVM classifier to classify fake or real images, as in the sketch below. If both methods say the face is real, the face is considered real and sent for recognition.
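The sketch below illustrates this second method with uniform LBP histograms from scikit-image and an SVM from scikit-learn; the LBP parameters and the kernel choice are assumptions, not the project's exact settings.

import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(channel, P=8, R=1):
    # Uniform LBP produces P + 2 distinct codes; the normalised histogram is the feature.
    lbp = local_binary_pattern(channel, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def spoof_features(channels):
    # channels: H x W x C array (e.g. the S, V, Y, Cr, Cb planes from the sketch above)
    return np.concatenate([lbp_histogram(channels[:, :, c]) for c in range(channels.shape[-1])])

# X: feature vectors for real (label 1) and spoofed (label 0) training faces.
# clf = SVC(kernel="rbf").fit(X, y); clf.predict([spoof_features(test_channels)])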
The third and final part is face recognition. For face recognition I built two models: one is based on Caps-net and the second is CNN based, a modified VGG16 model which gives good results on our face dataset. If we want to fix the number of classes (faces), then we can use Caps-net, which gives good results; the problem with Caps-net is that it does not produce an embedding, so we cannot get features directly and need to retrain it each time a new class (face) is added. The CNN used in the second approach gives embeddings, so we can use any classifier to classify the image, as in the sketch below. In the end we get the person's name, i.e. the class.
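The following sketch shows the embedding idea using the stock Keras VGG16 (output of its "fc2" layer) and an SVM; the project's actual modified model and embedding layer are different and are not shared here.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from sklearn.svm import SVC

base = VGG16(weights="imagenet")
embedder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)   # 4096-d embedding

def embed(face_batch):
    # face_batch: N x 224 x 224 x 3 RGB faces
    return embedder.predict(preprocess_input(face_batch.astype("float32")))

# With embeddings, adding a new person only requires retraining the classifier:
# clf = SVC(kernel="linear").fit(embed(train_faces), train_labels)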
3. SYSTEM ANALYSIS & DESIGN
3.2 Flowcharts
Face Detection:
Face Spoofing:
● CNN Model
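As a placeholder for the flowchart, the sketch below shows a generic binary real/spoof CNN in Keras; the layer sizes are illustrative assumptions, not the project's model.

from tensorflow.keras import layers, models

def build_spoof_cnn(input_shape=(64, 64, 5)):       # e.g. the 5-channel S/V/YCrCb input
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),       # P(real face)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model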
● Machine Learning Model:
Image 10: Image-feature-based model for face spoofing detection
Face Recognition:
Caps-net :
In the Caps-net model we need to fix the number of classes (the total number of people) first, and only then can we train the model and recognise a person. So we use transfer learning on the VGG16 model and modify it by adding more layers at the end of the VGG16 layers. I cannot share the full details of that model because of the confidentiality of the method. We got 95.6% accuracy on our own dataset.
3.3 Design and Test Steps
When I got the problem, I first researched these topics, then looked for the best-performing and most accurate models, modified them and checked whether the accuracy improved or not. First I started with the detection part, for which I found the LBP cascade, the Haar cascade and MTCNN. Second I moved to the recognition part, for which I found CNN and Caps-net based architectures. After that I worked on the spoofing part, which is the most important part of this project because it is a security project, so I researched anti-spoofing algorithms in more depth and built models for them.
All of these algorithms fail without good training data, so I made a dataset for all the algorithms with the help of the company's employees.
I tested in all lighting conditions and found that the system does not work in bad lighting conditions and some other scenarios, because it does not use an IR camera for detection and is not trained on IR face images.
This model works well in white lighting conditions and natural light; in different lighting conditions you may not get good accuracy. In low light the system will always detect a fake face, so in that case the system will not report a false class (person), but changing the lighting conditions will affect the accuracy.
4. Results / Outputs
● These are the face spoofing results. In image 13.A we can see that multiple faces can be detected and the system outputs real or fake for each. In image 13.B one person has a frontal face and another person has a side face; the side face is detected as fake because the training data contains frontal faces only. In image 13.C the model can detect both conditions in parallel. In image 13.D I gave a random image (of Modiji) and it is also classified as fake.
● This image shows the training accuracy and testing accuracy of the anti-spoofing model.
Image 14: Accuracy of the CNN model for face spoofing detection
Image 16: Face recognition accuracy and loss graph
● This graph shows the training and testing accuracy for the 90 classes trained with the CNN model.
● For face recognition, I made an application which gives a clearer understanding in the video, but here I am showing some screenshots.
5. CONCLUSIONS / RECOMMENDATIONS
Face recognition is still a challenging problem in the field of computer vision. It has received a great deal of attention over the past years because of its many applications in various domains. Although there is a strong research effort in this area, face recognition systems are far from ideal and do not yet perform adequately in all real-world situations.
In this project I achieved good accuracy in daylight; if we also need the system to work in all lighting conditions, then we should go for an IR camera and an IR face dataset together with a 3D face recognition system.
6. References
1. Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton. Dynamic Routing Between Capsules. arXiv:1710.09829 [cs.CV], 2017.
2. Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Yu Qiao. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. arXiv:1604.02878 [cs.CV], 2016.
3. K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
7. Face Spoofing Detection Using Colour Texture Analysis. IEEE Transactions on Information Forensics and Security, August 2016.
8. Shun-Yi Wang, Shih-Hung Yang, Yon-Ping Chen, Jyun-We Huang. Face Liveness Detection Based on Skin Blood Flow Analysis. Symmetry 9, no. 12: 305, 2017.