OpenFace: A General-Purpose Face Recognition Library with Mobile Applications
Abstract
Cameras are becoming ubiquitous in the Internet of Things (IoT) and can use face recognition tech-
nology to improve context. There is a large accuracy gap between today’s publicly available face
recognition systems and the state-of-the-art private face recognition systems. This paper presents
our OpenFace face recognition library that bridges this accuracy gap. We show that OpenFace pro-
vides near-human accuracy on the LFW benchmark and present a new classification benchmark
for mobile scenarios. This paper is intended for non-experts interested in using OpenFace and
provides a light introduction to the deep neural network techniques we use.
We released OpenFace in October 2015 as an open source library under the Apache 2.0 license.
It is available at: http://cmusatyalab.github.io/openface/
Keywords: face recognition, deep learning, machine learning, computer vision, neural net-
works, mobile computing
1 Introduction
Video cameras are extremely cheap and easily integrated into today’s mobile and static devices
such as surveillance cameras, auto dashcams, police body cameras, laptops, smartphones, GoPro,
and Google Glass. Video cameras can be used to improve context in mobile scenarios. The identity
of a person is a large part of context in humans and modulates what people say and how they act.
Likewise, recognizing people is a primitive operation in mobile computing that adds context to
applications such as cognitive assistance, social events, speaker annotation in meetings, and person
of interest identification from wearable devices.
State-of-the-art face recognition is dominated by industry- and government-scale datasets. Ex-
ample applications in this space include person of interest identification from mounted cameras
and tagging a user’s friends in pictures. Training is often an offline, batch operation and produces
a model that can predict in hundreds of milliseconds. The time to train new classification models
in these scenarios isn’t a major focus because the set of people to classify doesn’t change often.
Mobile scenarios span a different problem space where a mobile user may have a device per-
forming real-time face recognition. The context of the mobile user and people around them pro-
vide information about who they are likely to see. If the user attends a meetup, the system should
quickly learn to recognize the other attendees. Many people in the system are transient and the
user only needs to recognize them for a short period of time. The time to train new classification
models now becomes important as the user’s context changes and people are added and removed
from the system.
To explore transient and mobile face recognition, we created OpenFace, a general-purpose
face recognition library. Our experiments show that OpenFace offers higher
accuracy than prior open source projects and is well-suited for mobile scenarios. This paper dis-
cusses OpenFace’s design, implementation, and evaluation and presents empirical results relevant
to transient mobile applications.
Figure 1: Training flow for a feed-forward neural network.
Early face recognition research sought to explicitly define a low-dimensional face representation based on ratios of distances, areas, and
angles [Kan73]. An explicitly defined face representation is desirable for an intuitive feature space
and technique. However, in practice, explicitly defined representations are not accurate. Later
work sought to use holistic approaches stemming from statistics and Artificial Intelligence (AI)
that learn from and perform well on a dataset of face images. Statistical techniques such as Princi-
pal Component Analysis (PCA) [Hot33] represent faces as a combination of eigenvectors [SK87].
Eigenfaces [TP91] and fisherfaces [BHK97] are landmark techniques in PCA-based face recogni-
tion. Lawrence et al. [LGTB97] present an AI technique that uses convolutional neural networks
to classify an image of a face.
Today’s top-performing face recognition techniques are based on convolutional neural net-
works. Facebook’s DeepFace [TYRW14] and Google’s FaceNet [SKP15] systems yield the highest
accuracy. However, these deep neural network-based techniques are trained with private datasets
containing millions of social media images that are orders of magnitude larger than available
datasets for research.
A deep neural network is composed of layers that perform operations such as:
• Spatial convolutions that slide a kernel over the input feature maps,
• Linear or fully connected layers that take a weighted sum of all the input units, and
• Pooling operations that take the max, average, or Euclidean norm over spatial regions.
These operations are often followed by a nonlinear activation function, such as Rectified Linear
Units (ReLUs), which are defined by f(x) = max{0, x}. Neural network training is a (nonconvex)
optimization problem that finds parameters θ that minimize a loss function L. With differentiable
layers, the gradients ∂L/∂θi can be computed with backpropagation. The optimization problem is
then solved with a first-order method that iteratively progresses towards the optimal value based
on ∂L/∂θi. See
[BGC15] for a more thorough introduction to modern deep neural networks.
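To make this concrete, the following is a minimal sketch in Python with NumPy of one backpropagation and gradient-descent step for a tiny feed-forward network with a ReLU activation. This is illustrative only, not OpenFace's training code; the network shape, data, and learning rate are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))          # batch of 32 inputs with 10 features
y = rng.integers(0, 2, size=32)        # binary labels

W1, b1 = rng.normal(size=(10, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)
lr = 0.1                               # learning rate for the first-order update

# Forward pass: linear -> ReLU -> linear -> sigmoid.
h_pre = X @ W1 + b1
h = np.maximum(0.0, h_pre)             # ReLU: f(x) = max{0, x}
logits = (h @ W2 + b2).ravel()
p = 1.0 / (1.0 + np.exp(-logits))      # predicted probabilities

# Cross-entropy loss L and its gradients dL/dtheta via backpropagation.
L = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
d_logits = (p - y) / len(y)
dW2 = h.T @ d_logits[:, None]
db2 = d_logits.sum(keepdims=True)
dh = d_logits[:, None] @ W2.T
dh_pre = dh * (h_pre > 0)              # gradient of the ReLU is a 0/1 mask
dW1, db1 = X.T @ dh_pre, dh_pre.sum(axis=0)

# One gradient-descent step on the parameters theta.
for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
    param -= lr * grad

Frameworks such as Torch [CKF11] automate exactly this gradient computation and update loop for much larger networks.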
Figure 2 shows the logic flow for face recognition with neural networks. There are many face
detection methods to choose from, as face detection is another active research topic in computer
vision. Once a face is detected, the system preprocesses each face in the image to create a
normalized and fixed-size input to the neural network. The preprocessed images are too
high-dimensional for a classifier to take directly as input, so the neural network is used as a
feature extractor that produces a low-dimensional representation characterizing a person's face. A
low-dimensional representation is key so that it can be used efficiently in classifiers or
clustering techniques.
Figure 2: Logic flow for face recognition with a neural network.

DeepFace first preprocesses a face by using 3D face modeling to normalize the input image so
that it appears as a frontal face even if the image was taken from a different angle. DeepFace
then defines classification as a fully-connected neural network layer with a softmax function,
which makes the network's output a normalized probability distribution over identities. The
neural network predicts some probability distribution p̂, and the loss function L measures how
well p̂ predicts the person's actual identity i; formally, this is the cross-entropy loss
L(p̂, i) = −log p̂i, where p̂i is the ith element of p̂. DeepFace's innovation comes from three
distinct factors: (a) the 3D alignment, (b) a neural network structure with 120 million
parameters, and (c) training with 4.4 million labeled faces. Once the neural network is trained
on this large set of faces, the final classification layer is removed and the output of the
preceding fully connected layer is used as a low-dimensional face representation.
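As a small worked example, the softmax and cross-entropy computation just described can be sketched in a few lines of Python; the logits below are made-up values, not the output of DeepFace's network.

import numpy as np

logits = np.array([2.0, 0.5, -1.0])              # network outputs for three identities
p_hat = np.exp(logits) / np.sum(np.exp(logits))  # softmax: a normalized distribution
i = 0                                            # the person's actual identity
loss = -np.log(p_hat[i])                         # cross-entropy: L(p̂, i) = −log p̂i
print(p_hat, loss)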
Face recognition applications often seek a low-dimensional representation that generalizes well
to new faces the neural network wasn't trained on. DeepFace's approach works, but its
representation is a byproduct of training a network for high-accuracy classification on its
training data. The drawback is that the representation can be difficult to use because faces of
the same person aren't necessarily clustered, a property that classifiers and clustering
techniques rely on. FaceNet's triplet loss function, in contrast, is defined directly on the
representation.
Figure 3 illustrates how FaceNet's training procedure learns to cluster face representations of
the same person. The unit hypersphere is a high-dimensional sphere such that every point has
distance 1 from the origin. Constraining the embedding to the unit hypersphere provides structure
to a space that is otherwise unbounded.

Figure 3: Illustration of FaceNet's triplet-loss training procedure.

FaceNet's innovation comes from four distinct factors: (a) the
triplet loss, (b) their triplet selection procedure, (c) training with 100 million to 200 million labeled
images, and (d) (not discussed here) large-scale experimentation to find a network architecture.
For reference, we formally define FaceNet’s triplet loss in Appendix A.
Figure 4: OpenFace’s project structure.
Figure 5: OpenFace’s affine transformation. The transformation is based on the large blue land-
marks and the final image is cropped to the boundaries and resized to 96 × 96 pixels.
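As a usage sketch, the alignment and representation steps look roughly like this with OpenFace's Python bindings. The model and image paths are placeholders that depend on your installation, and the exact API may vary across OpenFace versions.

import cv2
import openface

align = openface.AlignDlib("models/dlib/shape_predictor_68_face_landmarks.dat")
net = openface.TorchNeuralNet("models/openface/nn4.small2.v1.t7", imgDim=96)

bgr = cv2.imread("person.jpg")                  # placeholder input image
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

bb = align.getLargestFaceBoundingBox(rgb)       # dlib face detection; None if no face
aligned = align.align(96, rgb, bb,              # affine-align and crop to 96x96 pixels
                      landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
rep = net.forward(aligned)                      # 128-dimensional face representation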
Figure 6: OpenFace’s end-to-end network training flow.
Technique                              Accuracy
Human-level (cropped) [KBBN09]         0.9753
Eigenfaces (no outside data) [TP91]    0.6002 ± 0.0079
FaceNet [SKP15]                        0.9964 ± 0.009
DeepFace-ensemble [TYRW14]             0.9735 ± 0.0025
OpenFace (ours)                        0.9292 ± 0.0134
4 Evaluation
Our evaluation studies OpenFace’s accuracy and performance in comparison to other face recog-
nition techniques. The LFW dataset [HRBLM07] is a standard benchmark in face recognition
research and Section 4.1 presents OpenFace’s accuracy on the LFW verification experiment. Sec-
tion 4.2 presents a new classification benchmark using the LFW dataset for transient mobile sce-
narios.
All experiments in this section use the nn4.small2.v1 OpenFace model described in Appendix C.
The identities in our neural network training data do not overlap with the LFW identities.
Figure 8: ROC curve on the LFW benchmark with area under the curve (AUC) values.
is from their LFW script and the others are from the LFW results page. Kumar et al. [KBBN09]
provide human-level performance results. The cropped version crops LFW images around the faces
to reduce contextual information. The FaceNet curve has not been released. These results show
that OpenFace's accuracy is close to the accuracy of state-of-the-art deep learning techniques.
Figure 9: Overview of the LFW classification accuracy and performance benchmark.
A classifier is trained on the training set, and the accuracy is obtained by predicting the
identities in the testing set.
Measuring the runtime performance of training classifiers and of predicting the identity in new
images is an important consideration for mobility. On wearable devices, the time to predict who
a face belongs to matters so that there isn't a noticeable lag when a user looks at another
person. Face detection isn't included in the prediction time in our analysis because it is the
same across all techniques. The reported prediction time includes both preprocessing and
classification.
Users may also want to add or remove identities from their recognition system in transient
scenarios. This benchmark studies how long it takes to re-train a classifier. We assume the face
and identity data have already been collected and preprocessed, and that the user can choose a
subset of identities to classify. The training time reflects only the time to train the
classifier.
All of OpenCV's techniques have the same interface, and their preprocessing converts the image
to grayscale. OpenFace's preprocessing in this context means the affine transformation for alignment
followed by the neural network forward pass that produces the 128-dimensional representation.
OpenFace's classification uses a linear SVM with a regularization weight of 1, which consistently
performs as well as or better than other regularization weights and RBF kernels.
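As an illustration, here is a minimal scikit-learn [PVG+ 11] sketch of this classification step, not our benchmark harness: embeddings stands in for precomputed 128-dimensional OpenFace representations (random placeholders below), and the timing mirrors the training-time and prediction-time measurements described above.

import time
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))   # placeholder 128-d representations
labels = rng.integers(0, 10, size=500)     # placeholder identities

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

clf = SVC(C=1, kernel="linear")            # linear SVM, regularization weight 1

start = time.perf_counter()
clf.fit(X_train, y_train)                  # classifier (re-)training time
train_time = time.perf_counter() - start

start = time.perf_counter()
predictions = clf.predict(X_test)          # prediction time for the test batch
predict_time = time.perf_counter() - start

accuracy = np.mean(predictions == y_test)
print(f"train: {train_time:.3f}s  predict: {predict_time:.3f}s  "
      f"accuracy: {accuracy:.3f}")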
Figure 10 presents our experimental results from classifying between 10 and 100 people on
an 8-core Intel Xeon E5-1630 v3 @ 3.70GHz CPU with OpenBLAS and a NVIDIA Tesla K40
GPU. Figure 10a shows that adding more people decreases the accuracy and that OpenFace always
has the highest accuracy by a large margin. Figure 10b shows that adding more people increases
the training time. OpenFace’s SVM consistently has the fastest training time. Only one result
for OpenFace is shown here because the numbers do not include the neural network representa-
tion preprocessing time and the SVM library only uses the CPU. Figure 10c shows the per-image
prediction times. The execution times of eigenfaces, fisherfaces, and LBPH increase slightly as
more faces are added, while OpenFace's prediction time remains constant. OpenFace's predic-
tion involves a neural network forward pass that takes substantially more time than the PCA and
histogram-based techniques. Executing on a GPU instead of a CPU offers slight performance
improvements.
5 Conclusion
This paper presents OpenFace, a face recognition library. OpenFace is open sourced under the
Apache 2.0 license and can be obtained from http://cmusatyalab.github.io/openface/.
We trained a network on the largest datasets available for research, which are one order of
magnitude smaller than the private dataset used by DeepFace [TYRW14] and two orders of magnitude
smaller than the one used by FaceNet [SKP15], the state-of-the-art systems. We show competitive accuracy and
performance results on the LFW verification benchmark despite our smaller training dataset. We
introduce an LFW classification benchmark and show competitive performance results on it.
We intend to maintain OpenFace as a library that stays updated with the latest deep neural
network architectures and technologies for face recognition.
Figure 10: Accuracy and performance comparisons between OpenFace and prior non-proprietary
face recognition implementations (from OpenCV): (a) classification accuracy, (b) classifier
training time, and (c) per-image prediction time.
Acknowledgments
We are grateful for the insightful discussions and strong contributions that have made OpenFace possible.
Hervé Bredin helped us remove a redundant face detection after the affine transformation for alignment.
The Torch ecosystem and community have provided many of the neural network components, including
Alfredo Canziani's implementation5 of FaceNet's loss function. Nicholas Léonard quickly merged our pull
requests to dpnn6 that modified the inception layer for FaceNet’s structure. Francisco Massa and Andrej
Karpathy quickly released Torch’s nn.Normalize layer after we expressed interest in using it. Early in
the project Soumith Chintala provided helpful Torch advice and Davis King helped with dlib usage. We
also thank Zhuo Chen, Kiryong Ha, Khalid Elgazzar, Jan Harkes, Wenlu Hu, J. Zico Kolter, Padmanabhan
Pillai, Wolfgang Richter, Daniel Siewiorek, Rahul Sukthankar, and Junjue Wang for insightful discussions
and feedback.
This research was supported by the National Science Foundation (NSF) under grant number CNS-
1518865. Additional support was provided by Crown Castle, the Conklin Kistler family fund, Google,
the Intel Corporation, and Vodafone. NVIDIA’s academic hardware grant provided the Tesla K40 GPU
used in all of our experiments. Any opinions, findings, conclusions or recommendations expressed in this
material are those of the authors and should not be attributed to their employers or funding sources.
5 https://github.com/Atcold/torch-TripletEmbedding
6 https://github.com/Element-Research/dpnn
References
[AHP04] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face recognition with local binary
patterns. In Computer Vision - ECCV 2004, pages 469–481. Springer, 2004.
[B+ 00] Gary Bradski et al. The OpenCV library. Dr. Dobb's Journal, 25(11):120–126, 2000.
[BBB+ 93] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Ed-
uard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural
network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–
688, 1993.
[BGC15] Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville. Deep learning. Book in preparation
for MIT Press, 2015.
[BHK97] Peter N Belhumeur, João P Hespanha, and David J Kriegman. Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 19(7):711–720, 1997.
[CKF11] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A MATLAB-like environ-
ment for machine learning. In BigLearn, NIPS Workshop, 2011.
[H+ 07] John D Hunter et al. Matplotlib: A 2D graphics environment. Computing in Science and
Engineering, 9(3):90–95, 2007.
[HC15] Hwai-Jung Hsu and Kuan-Ta Chen. Face recognition on drones: Issues and limitations. In
Proceedings of the First Workshop on Micro Aerial Vehicle Networks, Systems, and Applica-
tions for Civilian Use, DroNet ’15, pages 39–44, New York, NY, USA, 2015. ACM.
[Hot33] Harold Hotelling. Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology, 24(6):417, 1933.
[HRBLM07] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. Technical Report
07-49, University of Massachusetts, Amherst, 2007.
[IDFCF96] Roberto Ierusalimschy, Luiz Henrique De Figueiredo, and Waldemar Celes Filho. Lua: an
extensible extension language. Software: Practice and Experience, 26(6):635–652, 1996.
[JA09] Rabia Jafri and Hamid R Arabnia. A survey of face recognition techniques. JIPS, 5(2):41–68,
2009.
[Jeb95] Tony S Jebara. 3D pose estimation and normalization for face recognition. PhD thesis,
McGill University, 1995.
[Kan73] Takeo Kanade. Picture processing system by computer complex and recognition of human
faces. Doctoral dissertation, Kyoto University, 1973.
[KBBN09] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and
simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International
Conference on, pages 365–372. IEEE, 2009.
[Kin09] Davis E King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning
Research, 10:1755–1758, 2009.
[KS14] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of
regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1867–1874, 2014.
[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analy-
sis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International
Symposium on, pages 75–86. IEEE, 2004.
[LGTB97] Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face recognition: A
convolutional neural-network approach. Neural Networks, IEEE Transactions on, 8(1):98–
113, 1997.
[NW14] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets.
In IEEE International Conference on Image Processing (ICIP), 2014.
[Oli06] Travis E Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.
[Pal08] Mike Pall. The LuaJIT project. http://luajit.org, 2008.
[PVG+ 11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,
Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al.
Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research,
12:2825–2830, 2011.
[PVZ15] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In Proceed-
ings of the British Machine Vision Conference (BMVC), 2015.
[SK87] Lawrence Sirovich and Michael Kirby. Low-dimensional procedure for the characterization
of human faces. JOSA A, 4(3):519–524, 1987.
[SKP15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for
face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 815–823, 2015.
[SLJ+ 15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolu-
tions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–9, 2015.
[SMF+ 12] Tolga Soyata, Rajani Muraleedharan, Colin Funai, Minseok Kwon, and Wendi Heinzelman.
Cloud-vision: Real-time face recognition using a mobile-cloudlet-cloud acceleration architec-
ture. In Computers and Communications (ISCC), 2012 IEEE Symposium on, pages 000059–
000066. IEEE, 2012.
[Sta89] Richard M Stallman. Using and Porting the GNU Compiler Collection. Free Software Founda-
tion, 1989.
[TP91] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.
[TYRW14] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the
gap to human-level performance in face verification. In Computer Vision and Pattern Recog-
nition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.
[VRDJ95] Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde
en Informatica Amsterdam, 1995.
[WHS15] Xiang Wu, Ran He, and Zhenan Sun. A lightened CNN for deep face representation. arXiv
preprint arXiv:1511.02683, 2015.
[YLLL14] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch.
arXiv preprint arXiv:1411.7923, 2014.
A FaceNet’s Triplet Loss Formulation
This section presents the FaceNet [SKP15] triplet loss formulation. Let fΘ (I) be a neural network parame-
terized by Θ that maps an image I onto a unit hypersphere of dimension m. A triplet consists of an anchor
image a, a positive image of the same person p, and a negative image of a different person n. To achieve
the clustering illustrated in Figure 3, the distance between the anchor and positive should be less than the
distance between the anchor and negative. Adding a threshold α, all triplets should satisfy

||fΘ(a) − fΘ(p)||₂² + α < ||fΘ(a) − fΘ(n)||₂²,

where ||z||₂² = Σi zi² is the squared Euclidean norm of z. The loss function to make triplets
meet this condition is

L(a, p, n) = [ ||fΘ(a) − fΘ(p)||₂² + α − ||fΘ(a) − fΘ(n)||₂² ]₊,
where [z]+ = max{0, z}. In this paper we use α = 0.2 and m = 128 as suggested in the FaceNet paper.
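As a sanity check, here is a minimal NumPy sketch of this loss for a single triplet; the embeddings below are random unit vectors standing in for network outputs already mapped onto the unit hypersphere.

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # [ ||f_a - f_p||^2 + alpha - ||f_a - f_n||^2 ]_+
    pos = np.sum((f_a - f_p) ** 2)
    neg = np.sum((f_a - f_n) ** 2)
    return max(0.0, pos + alpha - neg)

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)        # project onto the unit hypersphere

a, p, n = (unit(rng.normal(size=128)) for _ in range(3))
print(triplet_loss(a, p, n))            # 0 when the triplet satisfies the margin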
Figure 11: Original (suboptimal) OpenFace triplet loss training technique.
Table 1: The OpenFace nn4.small2 network definition.
type                output size    #1×1  #3×3 reduce  #3×3   #5×5 reduce  #5×5   pool proj
conv1 (7×7×3, 2)    48×48×64
max pool + norm     24×24×64                                                     m 3×3, 2
inception (2)       24×24×192            64           192
norm + max pool     12×12×192                                                    m 3×3, 2
inception (3a)      12×12×256      64    96           128    16           32     m, 32p
inception (3b)      12×12×320      64    96           128    32           64     ℓ2, 64p
inception (3c)      6×6×640              128          256,2  32           64,2   m 3×3, 2
inception (4a)      6×6×640        256   96           192    32           64     ℓ2, 128p
inception (4e)      3×3×1024             160          256,2  64           128,2  m 3×3, 2
inception (5a)      3×3×736        256   96           384                        ℓ2, 96p
inception (5b)      3×3×736        256   96           384                        m, 96p
avg pool            736
linear              128
ℓ2 normalization    128