A Deep Learning Approach For Face Detection Using YOLO
Dweepna Garg
Computer Engineering Department
Devang Patel Institute of Advance Technology and Research, CHARUSAT
Changa, Anand, India
[email protected]

Parth Goel
Computer Engineering Department
Devang Patel Institute of Advance Technology and Research, CHARUSAT
Changa, Anand, India
[email protected]

Sharnil Pandya
Computer Science & Engineering Department
Navrachana University
Vadodara, India
[email protected]

Amit Ganatra
Devang Patel Institute of Advance Technology and Research
Charotar University of Science and Technology
Changa, Anand, India
[email protected]

Ketan Kotecha
Symbiosis Institute of Technology
Symbiosis International University
Pune, India
[email protected]
Abstract—Deep learning is nowadays a buzzword and is also considered a new era of machine learning, which trains computers to find patterns in massive amounts of data. It mainly describes learning at multiple levels of representation, which helps to make sense of data consisting of text, sound and images. Many organizations are using a type of deep learning known as a convolutional neural network to deal with the objects in a video sequence. Deep Convolutional Neural Networks (CNNs) have proved to be impressive in terms of performance for object detection, image classification and semantic segmentation. Object detection is defined as a combination of classification and localization. Face detection is one of the most challenging problems of pattern recognition. Various face-related applications such as face verification, facial recognition and face clustering are part of face detection. Effective training needs to be carried out for detection and recognition. Face detection using the traditional approach did not yield good accuracy. This paper focuses on improving the accuracy of face detection using a deep learning model. YOLO (You Only Look Once), a popular deep learning framework, is used to implement the proposed work. The paper compares the accuracy of face detection with that of the traditional approach. The proposed model uses a convolutional neural network as the deep learning approach for detecting faces in videos. The FDDB dataset is used for training and testing of our model. The model is fine-tuned on various performance parameters and the most suitable values are taken into consideration. Training time and model performance are also compared on two different GPUs.

Keywords—Face Detection, YOLO, Neural Network, object detection, Convolutional Neural Network

... has the capability to classify, detect and label the object with high accuracy. Region-based CNN (R-CNN) [3], Fast R-CNN [4], Faster R-CNN [5], and YOLO [6] are popular object detection networks in recent years.

Face detection has a plethora of applications. It plays a crucial role in face recognition algorithms. Face recognition has several applications such as person identification in surveillance and authentication for a security system. It also helps in emotion recognition, and the detected emotion can be analyzed further for emotion-based applications. Hence, it is considered to be a way to deliver rich information about an individual, such as age, emotion, gender and more. Other applications of face detection are to automatically focus on human faces in a camera, to tag faces and to identify different parts of faces. Automated face detection has gained attention in computer vision and pattern recognition. Earlier face detection systems could handle only simple cases, but with deep learning algorithms they now perform well in a wide variety of situations. Due to the large variation caused by occlusions, illumination and viewpoints, face detection remains a challenging problem in the area of computer vision. So accuracy, training time and processing time for detecting faces in real-time videos are still research issues.

In this paper, section two presents related work on face detection algorithms. Section three describes the working of the YOLO framework for detecting objects. The proposed work is explained in section four. The experimental setup and dataset information are discussed in section five. Results are analyzed in section six. Finally, the conclusion and future work are described in section seven.
Authorized licensed use limited to: Dalarna University College. Downloaded on September 16,2023 at 12:14:40 UTC from IEEE Xplore. Restrictions apply.
... estimation and detection of the face was proposed by Osadchy [10]. Wilson et al. presented Haar cascades for facial feature detection [11]. However, [10, 11] are limited when the face is exposed to varying illumination, poses and expressions.

In recent years, face detection has been carried out using deep learning models. One of the most popular models for it is the CNN (convolutional neural network) [12]. Faster R-CNN is also achieving remarkable results for object detection. This paper proposes a convolutional neural network architecture to detect faces using the YOLO framework.

Our architecture does not rely on hand-crafted features. Faces are detected by the CNN, which extracts features by itself. Training and testing of the model are carried out on two GPUs, and it detects faces at a faster rate in real time.

III. OVERVIEW OF YOLO

YOLO is a state-of-the-art deep learning framework for real-time object detection. It is an improved model compared to region-based detectors and outperforms them on standard detection datasets such as PASCAL VOC [13] and COCO [14]. Real-time object detection is comparatively faster than with other detection networks. The model can run at different resolutions, thereby giving good speed and accuracy. To improve scale invariance, the images can be resized to a random scale. The detector should be capable of learning features for a wide range of image sizes.

Object detection should be fast and accurate so that a variety of objects can be recognized [15]. With the help of neural networks, the YOLO frameworks are becoming increasingly fast and accurate for detection. Still, detection remains constrained to a small set of objects. Presently, object detection datasets are limited compared to those for classification and tagging. Object detection datasets consist of thousands of images with tags, which are the object coordinates in the image. Classification datasets consist of millions of images with categories. Tagging an image with object coordinates for detection is more expensive than assigning a label for classification.

Region-based CNN generates bounding boxes in an image and then runs the classifier on these boxes. The bounding boxes are then refined using post-processing such as non-maximum suppression to eliminate duplicate detections. A single CNN can predict multiple bounding boxes and class probabilities of objects. YOLO optimizes the performance as it is fast in detection. In YOLO (You Only Look Once), a single neural network is applied to the entire image during training and testing time. It encodes information about the appearance and the classes.

In our work, the bounding box is predicted based on features from the image. The bounding boxes across an image are predicted in parallel. Hence, it can be said that the network scans the full image as well as the objects in the image. With the help of YOLO, end-to-end training is applied along with real-time speed. This makes it possible to maintain high average precision.

The working of YOLO is as follows. The input image is divided into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. Each cell of the grid predicts the bounding box and the confidence score for that box. The confidence score reflects how accurately the object is detected in the bounding box. If no object is found in the cell, the confidence score is zero; otherwise it is calculated using the intersection over union (IOU) between the predicted box and the ground truth. Each bounding box has 5 predictions: x, y, w, h and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The height and width are predicted relative to the whole image. Each cell also predicts the conditional class probabilities. Multiplying the conditional class probabilities by the individual box confidence prediction gives the confidence score for each box. This score reflects how accurately the predicted box fits the object.

There are various versions of YOLO. YOLOv1 suffers from localization errors and has low recall compared to region-based detection methods. The network classifier of the original YOLO is trained at 224 x 224, and for detection the resolution is increased to 448 x 448. In YOLOv2, the network classifier is trained at 448 x 448 resolution on the ImageNet dataset. The convolutional layers of YOLO downsample the image by a factor of 32, hence an input image of size 416 x 416 yields an output feature map of 13 x 13.

IV. PROPOSED ARCHITECTURE

Our proposed network takes as input a colour image of size 448 x 448. The architecture consists of 7 convolutional layers followed by a max pooling layer of size 2 x 2. Then three fully connected layers are attached, and the output layer follows the last fully connected layer. The convolutional layers extract features from the images, progressing from simple to complex, and the fully connected layers predict the coordinates and probabilities. Finally, the output layer predicts both the class probabilities and the coordinates of the bounding box using the NMS (Non-Maximum Suppression) technique.

V. EXPERIMENTAL SETUP & DATASET INFORMATION

The experiment is performed on two machines. The first experiment is performed on a Core i5 processor with 8 GB RAM and a 2 GB GeForce 820M GPU. The second experiment is performed on a Core i7 processor with 16 GB RAM and a 4 GB NVIDIA GTX 1050 Ti GPU. The proposed convolutional neural network architecture is trained and tested for face detection on the FDDB (Face Detection Dataset and Benchmark) dataset [16].

The FDDB dataset is used in our work to train the proposed architecture. It consists of 5171 faces in a set of 2845 images. The dataset consists of face regions designed for studying the problem of detection. This work uses 2667 images, and the total size of the dataset (in our study) is 52.2 MB. The dataset was divided into 70% for training and 30% for testing.

VI. RESULT ANALYSIS

The model was trained for 25 epochs with the gradient descent optimizer algorithm. It was observed that accuracy
remained nearly constant at 92.2% after 20 epochs. The best learning rate, chosen after trying different values, is 0.0001, as shown in Fig. 1. The same number of epochs and the same learning rate are used for the comparison of experimental analysis on CPU and GPU.

Fig. 2 shows the 92.2% accuracy achieved on the test dataset for 20 epochs. The network was also trained with different batch sizes: 1, 8, 16 and 32. It was observed that when the batch size was 16 or 32, the network could not be trained on the 2 GB 820M graphics card. This happened because the GPU memory was too small to accommodate the increased batch size. The same is depicted in Fig. 3.

... which affects the FPS (frames per second) rate. It was observed that FPS increased as the resolution was decreased. A low-resolution image has fewer pixels, so the GPU processes it faster because fewer computations are needed.

The accuracy of the proposed model was compared with other face detection algorithms after fine-tuning all parameters and hyperparameters of the proposed model. The proposed model achieved higher accuracy than the Haar cascade algorithm and the R-CNN based face detection model, as depicted in Fig. 5.
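The grid-cell assignment, confidence score and non-maximum suppression described in the overview of YOLO can be illustrated with a small Python sketch. It is not the authors' implementation: the corner box format (x_min, y_min, x_max, y_max), the function names iou, responsible_cell, class_confidence and nms, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(cx: float, cy: float, s: int) -> Tuple[int, int]:
    """(row, col) of the S x S grid cell containing an object centre.

    cx and cy are assumed to be normalised to [0, 1] relative to the image.
    """
    row = min(int(cy * s), s - 1)
    col = min(int(cx * s), s - 1)
    return row, col


def class_confidence(class_prob: float, box_confidence: float) -> float:
    """Conditional class probability multiplied by the box confidence."""
    return class_prob * box_confidence


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression.

    Returns the indices of the kept boxes, highest score first. A box is
    dropped when it overlaps an already kept box by more than the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

For example, nms([(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)], [0.9, 0.8, 0.7]) keeps indices [0, 2], because the first two boxes overlap with an IoU of about 0.85 and only the higher-scoring one survives.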
plays a very important role. It is concluded that the resolution of the image is inversely proportional to the frames per second. In future work, the proposed model can be further optimized for detecting very small faces, faces under different viewpoint variations, and partially visible faces.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
521, no. 7553, pp. 436–444, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet
classification with deep convolutional neural networks,” Proceedings
of the 25th International Conference on Neural Information
Processing Systems - Volume 1. Curran Associates Inc., pp. 1097–
1105, 2012.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature
Hierarchies for Accurate Object Detection and Semantic
Segmentation,” in 2014 IEEE Conference on Computer Vision and
Pattern Recognition, 2014, pp. 580–587.
[4] R. Girshick, “Fast R-CNN,” Proc. IEEE International Conference on
Computer Vision, ICCV 2015, pp. 1440–1448, 2015.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-
time object detection with region proposal networks,” Proceedings of
the 28th International Conference on Neural Information Processing
Systems - Volume 1. MIT Press, pp. 91–99, 2015.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look
Once: Unified, Real-Time Object Detection,” 2015.
[7] R. Vaillant, C. Monrocq, and Y. LeCun, “Original approach for the
localisation of objects in images,” IEEE Proceedings on Vision,
Image, and Signal Processing, vol. 4, 1994.
[8] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face
detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp.
23–38, 1998.
[9] C. Garcia and M. Delakis, “A neural architecture for fast and robust
face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no.
11, pp. 1408–1423, 2004.
[10] M. Osadchy, Y. Le Cun, and M. L. Miller, “Synergistic Face
Detection and Pose Estimation with Energy-Based Models,” Journal
of Machine Learning Research, vol. 8, pp. 1197–1215, 2007.
[11] P. I. Wilson and J. Fernandez, “Facial feature detection using Haar classifiers,” J.
Comput. Sci. Coll., vol. 21, no. 4, pp. 127–133, 2006.
[12] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A Convolutional
Neural Network Cascade for Face Detection,” IEEE International
Conference on Computer Vision and Pattern Recognition, CVPR
2015, pp. 5325–5334, 2015.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A.
Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,”
International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338,
2010.
[14] T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,”
European Conference on Computer Vision, ECCV 2014, Lecture
Notes in Computer Science, vol. 8693. Springer, Cham, pp. 740–755, 2014.
[15] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for
object detection,” Proceedings of the 26th International Conference
on Neural Information Processing Systems - Volume 2. Curran
Associates Inc., pp. 2553–2561, 2013.
[16] V. Jain and E. Learned-Miller, “FDDB: A Benchmark for Face
Detection in Unconstrained Settings,” Technical Report UM-CS-2010-009,
Dept. of Computer Science, University of Massachusetts, Amherst, 2010.