
Face Recognition Based on MTCNN and FaceNet

Rongrong Jin, Hao Li, Jing Pan, Wenxi Ma, and Jingyu Lin

Abstract

Face recognition performance has improved rapidly with the recent development of deep learning techniques and the accumulation of large underlying training datasets. However, face images in the wild undergo large intra-personal variations, such as poses, illuminations, occlusions, and low resolutions, which pose great challenges to face-related applications. This paper addresses this challenge by proposing a deep learning framework based on MTCNN and FaceNet, which can recover the canonical view of face images. In our project, we build our own face recognition system, which achieves high accuracy on the LFW benchmark. We use the inherent correlation between detection and calibration to improve their performance under a multi-task framework of deep cascading. In particular, we use a three-tiered architecture combined with a carefully designed convolutional neural network algorithm to detect faces and roughly locate key points. The FaceNet method directly learns a mapping from a face image to a compact Euclidean space, where distance directly corresponds to a measure of facial similarity. Once this space is generated, face recognition, validation and clustering can be easily implemented using the standard FaceNet embedding as the feature vector. This approach dramatically reduces intra-person variance while maintaining inter-person discriminativeness. Our experiments are not perfect in every respect, but we summarize them and present some of the challenges lying ahead in recent face recognition.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1. Introduction

With the rapid development of artificial intelligence in recent years, facial recognition has gained more and more attention. Compared with traditional card recognition, fingerprint recognition and iris recognition, face recognition has many advantages, including but not limited to being non-contact, highly concurrent, and user friendly. It has high potential to be used in government, public facilities, security, e-commerce, retailing, education and many other fields.

Traditional face recognition methods use feature operators to model faces, which is simple and easy to implement. However, further research has shown that while these algorithms are effective at finding linear structures, they often achieve unsatisfactory recognition results when facing potentially nonlinear structures.

With the development of deep learning and the introduction of deep convolutional neural networks, the accuracy and speed of face recognition have made great strides. However, the results from different networks and models vary widely. Previous face recognition approaches based on deep networks use a classification layer (Taigman et al. 2014; Tang 2015); they regard face recognition as a classification task, where the number of softmax outputs equals the number of face tags. Therefore, every time a new sample comes in, the whole model needs to be retrained. FaceNet, in contrast, directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN (Schroff, Kalenichenko, and Philbin 2015). The triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed. The benefit of this approach is much greater representational efficiency: it achieves state-of-the-art face recognition performance using only 128 bytes per face. So we use FaceNet's 128-dimensional vector to represent a face, and then recognize faces by calculating vector distances.

In order to achieve better performance, we first use MTCNN (Zhang et al. 2016) for face detection, and then use the result of MTCNN as the input to FaceNet to perform face recognition. MTCNN is a mainstream detection network that combines high detection accuracy with a lightweight, real-time design.

Our face recognition process is therefore divided into two steps: face detection and face recognition. Firstly, MTCNN is used for face detection to get accurate face coordinates. Based on the results of this step, FaceNet is used for face recognition. The processing flow of MTCNN is as follows: first, the test image is repeatedly resized to obtain an image pyramid. The image pyramid is then input into P-Net to get a large number of candidates. The candidates screened by P-Net are fine-tuned by R-Net. After many candidates are removed by R-Net, the remaining images are input to O-Net, which finally outputs the accurate bounding box coordinates. Compared with DeepFace, FaceNet retains face alignment, abandons the separate feature extraction step, and directly uses a CNN to train end-to-end after face alignment.
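To make this two-step pipeline concrete, the following is a minimal sketch built on the open-source facenet-pytorch package. The package choice, the pretrained weights, and the decision threshold are illustrative assumptions rather than the exact setup used in this paper; note also that this package outputs 512-D embeddings instead of the 128-D embedding described above.

# Minimal two-stage pipeline: MTCNN for detection/alignment, FaceNet for embedding.
from PIL import Image
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # stage 1: detect and crop the face
resnet = InceptionResnetV1(pretrained='vggface2').eval()  # stage 2: embed the face

def embed(path):
    """Detect the face in an image file and return its embedding vector."""
    face = mtcnn(Image.open(path).convert('RGB'))  # aligned 3x160x160 tensor, or None
    if face is None:
        raise ValueError(f'no face found in {path}')
    with torch.no_grad():
        return resnet(face.unsqueeze(0)).squeeze(0)  # 512-D in this package

# Recognition by vector distance: same person if the L2 distance is small.
d = (embed('a.jpg') - embed('b.jpg')).norm().item()
print('same person' if d < 1.0 else 'different people')  # threshold is illustrative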
2. Related Work
Face detection
Face detection is essential to many face applications, such
as face recognition and facial expression analysis. However,
the large visual variations of faces, such as occlusions, large
pose variations and extreme lightings, impose great chal-
lenges for these tasks in real world applications.
The cascade face detector proposed by Viola and
Jones (Viola and Jones 2004) utilizes Haar-like features and
AdaBoost to train cascaded classifiers, which achieves good
performance with real-time efficiency. However, quite a few
works (Yang et al. 2014; Pham et al. 2010) indicate that this
kind of detector may degrade significantly in real-world ap-
plications with larger visual variations of human faces even
with more advanced features and classifiers. Besides the cascade structure, (Zhu and Ramanan 2012) introduce deformable part models (DPM) for face detection and achieve
remarkable performance. However, they are computation-
ally expensive and may usually require expensive annota-
tion in the training stage. Recently, convolutional neural networks (CNNs) have achieved remarkable progress in a variety of computer vision tasks, such as image classification and face recognition (Sun, Wang, and Tang 2014). Inspired by the significant successes of deep learning methods in computer vision tasks, several studies utilize deep CNNs for face detection. Yang et al. (Yang et al. 2016) train deep convolutional neural networks for facial attribute recognition to obtain high responses in face regions, which further yield candidate windows of faces. However, due to its complex CNN structure, this approach is time-costly in practice. Li et al. (Li et al. 2015) use cascaded CNNs for face detection, but their method requires bounding box calibration from face detection with extra computational expense and ignores the inherent correlation between facial landmark localization and bounding box regression.

Figure 1: Pipeline of the MTCNN cascaded framework, which includes three-stage multi-task deep convolutional networks. Firstly, candidate windows are produced through a fast Proposal Network (P-Net). After that, we refine these candidates in the next stage through a Refinement Network (R-Net). In the third stage, the Output Network (O-Net) produces the final bounding box.

Face recognition
Using deep neural networks to learn effective feature representations has become popular in face recognition (Sun, Wang, and Tang 2013). With better deep network architectures and supervisory methods, face recognition accuracy has been boosted rapidly in recent years. Previous face recognition approaches based on deep networks use a classification layer (Taigman et al. 2014) trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces, and by using a bottleneck layer the representation size per face is usually very large (thousands of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

3. Method
For accurate face recognition, we train two networks, MTCNN and FaceNet. MTCNN is used to detect the face and obtain its exact coordinates. Based on the results of face detection, face recognition is performed using FaceNet.

FaceNet directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

3.1. MTCNN
MTCNN is a deep cascaded multi-task framework which exploits the inherent correlation between detection and alignment to boost their performance. The framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. In addition, a new online hard sample mining strategy further improves the performance in practice.

3.1.1. Overall Framework
The overall pipeline of MTCNN is shown in Figure 1. Given an image, we initially resize it to different scales to build an image pyramid, which is the input of the following three-stage cascaded framework:
Stage 1: We exploit a fully convolutional network, called Proposal Network (P-Net), to obtain the candidate facial windows and their bounding box regression vectors. The candidates are then calibrated based on the estimated bounding box regression vectors. After that, we employ non-maximum suppression (NMS) to merge highly overlapped candidates.
Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.
Stage 3: This stage is similar to the second stage, but here we aim to identify face regions with more supervision. In particular, the network outputs the positions of five facial landmarks.

Figure 2: The architecture of P-Net, R-Net, and O-Net, where "MP" means max pooling and "Conv" means convolution. The step sizes in convolution and pooling are 1 and 2, respectively.
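Every stage above prunes overlapping candidates with non-maximum suppression. For reference, a minimal NumPy sketch of standard greedy IoU-based NMS follows; it is a generic implementation, not necessarily the exact routine used inside MTCNN.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that survive."""
    order = scores.argsort()[::-1]  # visit highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop the boxes that overlap the kept box too strongly.
        order = order[1:][iou < iou_threshold]
    return keep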
3.1.2. CNN Architectures
We use 3×3 filters rather than 5×5 filters to reduce computation while increasing depth to get better performance. With these improvements, compared to the previous architecture in (Li et al. 2015), we can get better performance with less runtime. The CNN architectures are shown in Figure 2. We apply PReLU (He et al. 2015) as the nonlinear activation function after the convolutional and fully connected layers (except the output layers).

3.1.3. Training
We leverage three tasks to train our CNN detectors: face/non-face classification, bounding box regression, and facial landmark localization (a code sketch of the combined objective follows Eq. (4) below).
1) Face classification: The learning objective is formulated as a two-class classification problem. For each sample x_i, we use the cross-entropy loss:

    L_i^{det} = -(y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i))    (1)

where p_i is the probability produced by the network that sample x_i is a face, and y_i^{det} \in \{0, 1\} denotes the ground-truth label.
2) Bounding box regression: For each candidate window, we predict the offset between it and the nearest ground truth. The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample x_i:

    L_i^{box} = \|\hat{y}_i^{box} - y_i^{box}\|_2^2    (2)

where \hat{y}_i^{box} is the regression target obtained from the network and y_i^{box} is the ground-truth coordinate.
3) Facial landmark localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

    L_i^{landmark} = \|\hat{y}_i^{landmark} - y_i^{landmark}\|_2^2    (3)

where \hat{y}_i^{landmark} is the facial landmark coordinate obtained from the network and y_i^{landmark} is the ground-truth coordinate for the i-th sample.
4) Multi-source training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, non-face, and partially aligned face. In this case, some of the loss functions (i.e., Eq. (1)-(3)) are not used. The overall learning target can be formulated as:

    \min \sum_{i=1}^{N} \sum_{j \in U} \alpha_j \beta_i^j L_i^j    (4)

where U = {det, box, landmark}, N is the number of training samples, \alpha_j denotes the task importance, and \beta_i^j \in \{0, 1\} indicates whether task j applies to sample i.
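The combined objective of Eq. (1)-(4) can be written compactly in PyTorch. The sketch below is a plausible organization, not the original implementation: the tensor shapes and the dictionary-based task masks are our assumptions about how a training batch might be laid out.

import torch
import torch.nn.functional as F

def mtcnn_loss(p, y_det, box_pred, box_gt, lm_pred, lm_gt, beta, alpha):
    """Multi-task loss of Eq. (1)-(4).
    p: (N,) face probabilities; y_det: (N,) 0/1 labels.
    box_*: (N, 4) box offsets; lm_*: (N, 10) landmark coordinates.
    beta: dict of (N,) 0/1 masks marking which task applies to each sample.
    alpha: dict of task weights, e.g. {'det': 1.0, 'box': 0.5, 'landmark': 0.5}."""
    l_det = F.binary_cross_entropy(p, y_det.float(), reduction='none')  # Eq. (1)
    l_box = ((box_pred - box_gt) ** 2).sum(dim=1)                       # Eq. (2)
    l_lm = ((lm_pred - lm_gt) ** 2).sum(dim=1)                          # Eq. (3)
    # Eq. (4): weight each task and mask out samples it does not apply to.
    total = (alpha['det'] * beta['det'] * l_det
             + alpha['box'] * beta['box'] * l_box
             + alpha['landmark'] * beta['landmark'] * l_lm)
    return total.sum()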
Figure 3: FaceNet model structure.

3.2. FaceNet
FaceNet is adopted in our face recognition step. FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed. The method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

3.2.1 End-to-end learning
Instead of using the traditional softmax method for classification learning, FaceNet extracts a certain layer as a feature and learns an encoding from the image into Euclidean space, then performs face recognition, face verification and face clustering based on this encoding. Given the model details, and treating it as a black box (see Figure 3), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss, which directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces of the same identity, independent of imaging conditions, is small, whereas the squared distance between a pair of face images from different identities is large.

3.2.2 Triplet Loss
The triplet loss is well suited to face verification. The motivation is that the loss from (Sun, Wang, and Tang 2014) encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person and all other faces. This allows the faces of one identity to live on a manifold, while still enforcing the distance, and thus discriminability, to other identities.

The embedding is represented by f(x) \in R^d. It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. \|f(x)\|_2 = 1. This loss is motivated in (Weinberger 2009) in the context of nearest-neighbor classification. Here we want to ensure that an image x_i^a (anchor) of a specific person is closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. This is visualized in Figure 4.

Figure 4: The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Thus we want

    \|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2    (5)

    \forall (f(x_i^a), f(x_i^p), f(x_i^n)) \in T    (6)

where \alpha is a margin that is enforced between positive and negative pairs, and T is the set of all possible triplets in the training set, with cardinality N. The loss being minimized is then

    L = \sum_i^N [\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha]_+    (7)

Generating all possible triplets would result in many triplets that are easily satisfied. These triplets would not contribute to the training and would result in slower convergence, as they would still be passed through the network.

3.2.3 Triplet Selection
In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (5). This means that, given x_i^a, we want to select the x_i^p (hard positive) given by \mathrm{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2 and similarly the x_i^n (hard negative) given by \mathrm{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2.

It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:
• Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
• Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We do not have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training. Selecting the hardest negatives can in practice lead to bad local minima early in training; specifically, it can result in a collapsed model (i.e. f(x) = 0). To mitigate this, it helps to select x_i^n such that

    \|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2    (8)

To sum up, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches, as these tend to improve convergence during Stochastic Gradient Descent (SGD). On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The sketch below illustrates this in-batch selection.
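The following sketch combines the triplet loss of Eq. (7) with the online, in-batch selection of Eq. (8): all anchor-positive pairs are used, and for each pair the closest negative that still satisfies Eq. (8) (a "semi-hard" negative) is chosen. The batch layout (precomputed embeddings plus integer identity labels) and the margin value are assumptions.

import torch
import torch.nn.functional as F

def batch_triplet_loss(emb, labels, margin=0.2):
    """Triplet loss (Eq. 7) with online semi-hard negatives (Eq. 8).
    emb: (B, d) embeddings; labels: (B,) integer identity labels."""
    emb = F.normalize(emb, dim=1)   # keep f(x) on the unit hypersphere
    d = torch.cdist(emb, emb) ** 2  # (B, B) squared L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    losses = []
    for a in range(emb.size(0)):
        for p in torch.nonzero(same[a]).flatten():
            if p == a:
                continue            # skip the anchor itself
            d_ap = d[a, p]
            # Negatives farther away than the positive (Eq. 8) ...
            neg = d[a][(~same[a]) & (d[a] > d_ap)]
            if neg.numel() == 0:
                continue
            d_an = neg.min()        # ... choosing the closest such one
            losses.append(F.relu(d_ap - d_an + margin))
    return torch.stack(losses).mean() if losses else emb.sum() * 0.0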
Figure 5: Enhanced images and the original image. The first row is enhanced by four contrast changes. The second row shows images enhanced with random operations together with the original image.
3.3. Data Augmentation
Large-scale datasets are a prerequisite for the successful application of deep neural networks. Image augmentation applies a series of random changes to the training images to generate similar but distinct training samples, thereby expanding the size of the training dataset.

Another way to view image augmentation is that randomly changing the training samples can reduce the model's dependence on particular attributes and improve its generalization ability. For example, we can crop the image in different ways to make the objects of interest appear in different positions, thereby reducing the model's dependence on the position of the object. We can also adjust factors such as contrast to reduce the model's sensitivity to brightness.

In order to enhance the robustness of the model at prediction time, we apply image augmentation with random operations during training. The methods we use are: random fixed-ratio cropping, mirror flipping, rotating 45° to the left, rotating 45° to the right, etc. Some samples are shown in Figure 5.
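Assuming a torchvision-based training loop, the augmentation policy described above might look like the sketch below; the crop scale, flip probability, and contrast strength are illustrative values that the paper does not specify.

from torchvision import transforms

# Random augmentations applied at training time only.
train_augment = transforms.Compose([
    transforms.RandomResizedCrop(160, scale=(0.8, 1.0)),  # random fixed-ratio cropping
    transforms.RandomHorizontalFlip(p=0.5),               # mirror flipping
    transforms.RandomRotation(45),                        # up to 45 degrees left or right
    transforms.ColorJitter(contrast=0.4),                 # contrast changes
    transforms.ToTensor(),
])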
4. Experiments

4.1. MTCNN Backbone Networks
Before the experiment, we noticed that the performance of multiple CNNs might be limited by the following facts:
(1) Some filters in the convolution layers lack diversity, which may limit their discriminative ability.
(2) Face detection is a challenging binary classification task, so it may need fewer filters per layer.
To this end, we reduce the number of filters and change the 5×5 filters to 3×3 filters to reduce computation while increasing depth to get better performance. With these improvements, compared to the previous architecture, we get better performance with less runtime, as shown in Table 1.

Table 1: COMPARISON OF SPEED AND VALIDATION ACCURACY OF OUR CNNs AND PREVIOUS CNNs

Group     CNN      300× Forward Propagation   Validation Accuracy
Group 1   12-Net   0.043 s                    93.10%
          P-Net    0.040 s                    93.70%
Group 2   24-Net   0.738 s                    93.80%
          R-Net    0.466 s                    94.50%
Group 3   48-Net   3.601 s                    92.10%
          O-Net    1.411 s                    93.50%

So in the MTCNN part, with the cascade structure, our method can achieve high speed in joint face detection and alignment. We compare our method with some classic techniques on GPU, and the results are shown in Table 2.

Table 2: SPEED COMPARISON OF OUR METHOD AND OTHER METHODS

Method        GPU                  Speed
Ours          NVIDIA Titan Black   93 FPS
Cascade CNN   NVIDIA Titan Black   100 FPS
Faceness      NVIDIA Titan Black   20 FPS
DP2MFD        NVIDIA Tesla K20     0.285 FPS

4.2. FaceNet Deep Architecture
We use three backbone networks: ZeilerFergus with 1×1 convolutions and normalization, ResNet50, and ResNet101. Below we refer to them as NN1, NN2, and NN3 respectively, as shown in Table 3. Among them, NN1 uses a model pre-trained on ImageNet, while the other two backbone networks do not use pre-trained models. In NN1, we add 1×1×d convolutional layers between the standard convolutional layers of the ZeilerFergus architecture, resulting in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image. We retain the original architecture for the other two backbone networks.

4.3. Performance on LFW
We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the L2-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except the eighth split (1.256). We achieve classification accuracies of 89.52%±0.18, 90.16%±0.15 and 92.86%±0.12 when using NN1, NN2 and NN3 respectively.
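The verification rule behind these numbers is a single threshold comparison on embedding distance. A small sketch follows, using the 1.242 threshold reported above; the pair-batch layout is an assumption.

import torch

def verify_pairs(emb_a, emb_b, labels, threshold=1.242):
    """LFW-style verification accuracy over a split of face pairs.
    emb_a, emb_b: (N, d) embeddings of the two faces in each pair.
    labels: (N,) with 1 for same identity, 0 for different."""
    dist = (emb_a - emb_b).norm(dim=1)  # L2 distance per pair
    pred = (dist < threshold).long()    # same person if below threshold
    return (pred == labels).float().mean().item()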
layer    size-in       size-out      kernel        param   FLOPS
conv1    220×220×3     110×110×64    7×7×3, 2      9K      115M
pool1    110×110×64    55×55×64      3×3×64, 2     0
rnorm1   55×55×64      55×55×64                    0
conv2a   55×55×64      55×55×64      1×1×64, 1     4K      13M
conv2    55×55×64      55×55×192     3×3×64, 1     111K    135M
rnorm2   55×55×192     55×55×192                   0
pool2    55×55×192     28×28×192     3×3×192, 2    0
conv3a   28×28×192     28×28×192     1×1×192, 1    37K     29M
conv3    28×28×192     28×28×384     3×3×192, 1    664K    521M
pool3    28×28×384     14×14×384     3×3×384, 2    0
conv4a   14×14×384     14×14×384     1×1×384, 1    148K    29M
conv4    14×14×384     14×14×256     3×3×384, 1    885K    173M
conv5a   14×14×256     14×14×256     1×1×256, 1    66K     13M
conv5    14×14×256     14×14×256     3×3×256, 1    590K    116M
conv6a   14×14×256     14×14×256     1×1×256, 1    66K     13M
conv6    14×14×256     14×14×256     3×3×256, 1    590K    116M
pool4    14×14×256     7×7×256       3×3×256, 2    0
concat   7×7×256       7×7×256                     0
fc1      7×7×256       1×32×128      maxout p=2    103M    103M
fc2      1×32×128      1×32×128      maxout p=2    34M     34M
fc7128   1×32×128      1×1×128                     524K    0.5M
L2       1×1×128       1×1×128                     0
total                                              140M    1.6B

Table 3: FaceNet Deep Architectures. This table compares the performance of the different backbones we used on the LFW dataset. NN1 is ZeilerFergus with 1×1 convolution and norm, NN2 is ResNet50, and NN3 is ResNet101. Reported are the mean validation rates VAL at 10E-3 false accept rate. The input image size is set to 160×160. The layer listing above details NN1.

5. Conclusion
In this paper, we have proposed a multi-task cascaded CNN-based framework combined with a unified embedding for face detection and recognition. Experimental results demonstrate that the method performs close to state-of-the-art methods on the challenging AFLW benchmark for face alignment. The three main contributions to the performance improvement are the carefully designed cascaded CNN architecture, the online hard sample mining strategy, and joint face alignment learning.

Furthermore, we use a unified embedding method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods that use a CNN bottleneck layer or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.

Three recognition techniques, chosen for their strong results, shaped our work: end-to-end learning, triplet loss, and data augmentation, all combined through convolutional neural networks that return the main features of detected faces.

Future work will focus on better understanding of the error cases, further improving the model, and reducing model size and CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.

References
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Li, H.; Lin, Z.; Shen, X.; Brandt, J.; and Hua, G. 2015. A Convolutional Neural Network Cascade for Face Detection. In Computer Vision and Pattern Recognition.
Pham, M. T.; Gao, Y.; Hoang, V. D. D.; and Cham, T. J. 2010. Fast Polygonal Integration and Its Application in Extending Haar-like Features to Improve Object Detection. In Computer Vision and Pattern Recognition.
Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Sun, Y.; Wang, X.; and Tang, X. 2013. Hybrid Deep Learning for Face Verification. In IEEE International Conference on Computer Vision.
Sun, Y.; Wang, X.; and Tang, X. 2014. Deep Learning Face Representation by Joint Identification-Verification. Advances in Neural Information Processing Systems 27.
Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tang, Y. S. W. 2015. Deeply Learned Face Representations Are Sparse, Selective, and Robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Viola, P.; and Jones, M. J. 2004. Robust Real-Time Face Detection. International Journal of Computer Vision 57(2): 137–154.
Weinberger, K. Q. 2009. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR 10.
Yang, B.; Yan, J.; Lei, Z.; and Li, S. Z. 2014. Aggregate Channel Features for Multi-View Face Detection.
Yang, S.; Luo, P.; Loy, C. C.; and Tang, X. 2016. From Facial Parts Responses to Face Detection: A Deep Learning Approach. In IEEE International Conference on Computer Vision.
Zhang, K.; Zhang, Z.; Li, Z.; and Qiao, Y. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23(10): 1499–1503. doi:10.1109/LSP.2016.2603342.
Zhu, X.; and Ramanan, D. 2012. Face Detection, Pose Estimation, and Landmark Localization in the Wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.
