Precision-Based Face Detection Algorithm Implementation On FPGA
Arti Singh
MTech, Electronics and Communication Engineering
Nirma University
Ahmedabad, Gujarat, India
Abstract:- Face detection is a crucial step in implementing a face recognition and tracking system, which is used in security, surveillance, biometrics, artificial intelligence etc. Face detection is a technique in which the face(s) in an image or video, and their location in the image/video, are identified. Face detection can be implemented using different algorithms, depending upon the accuracy required and the processing capabilities of the system on which it is implemented. The accuracy of detection is highly influenced by factors like illumination, head pose, occlusion etc. This paper discusses the implementation of the Viola-Jones algorithm for face detection in an image. This algorithm works on Haar features extracted from a face. The Viola-Jones algorithm is a highly accurate algorithm, but it requires a large number of resources. The complexity level of this algorithm is very high, and it can be used in places where the accuracy of the system is a major concern.

…bottleneck in the system development. Minimum response time of the face detection system is the need of the hour.

The detection of a face in an input image/video is the first step in any face tracking system and face recognition system, which plays an important role in a surveillance system and can be very helpful in many cases, like finding suspects or convicts. An example of this: if a webcam is connected to a display, it can detect any face that walks by in front of the webcam. Once this information is stored, a number of operations can be performed on it in order to detect gender/race/age. Face detection systems also have many applications in the fields of biometrics, robotics, human interfaces and other commercial uses.

A. Factors affecting Face Detection:
Below are some factors which can affect the result of face detection in an input image/video:
Image Orientation—
This factor depends upon the nature of the input image,
which may appear upside down, rotated, inverted, or in the
correct form.
There are four main categories into which face detection techniques can be broadly divided: knowledge-based, feature-based, template-based and statistics-based face detection techniques. Each category is again subdivided into different techniques, which are given in the figure below [3].
Knowledge-based Face Detection Technique:
It is a rule-based face detection technique. In this technique, some rules are defined to detect the faces in the input image. These rules can be extended to detect faces against a complicated background. These rules are nothing but features like 2 ears, 1 nose, 2 eyes, 1 mouth and other facial features. For example, a rule may be that a face usually has two symmetric eyes, and that the area under the eyes is darker than the cheeks. In the input image, the facial features are first extracted, and then the face is detected according to the defined rules [4]. The knowledge-based face detection technique tries to capture knowledge of human faces and encode it into well-defined criteria. When the input image meets the criteria, it is declared a face. The difficult part of this method is building such appropriate rules.

Feature-based Face Detection Technique:
This face detection technique depends on the features of human faces. A human face can be distinguished from other objects by using features like the area under the eye being darker than the cheek area, or the edge of the nose being brighter than the surrounding area. This technique depends on features which are extracted from the human face and which do not undergo changes due to factors like occlusion, illumination, pose etc. Skin colour, nose, eyes, ears, mouth, eyebrows etc. are some features that can be used in face detection techniques.
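As a toy illustration of such a rule, the sketch below checks whether the mean intensity of a candidate eye band is darker than the cheek band in a normalized grayscale window. The file name and region coordinates are hypothetical, chosen only to make the sketch self-contained:

```matlab
% Toy illustration of a knowledge/feature-based rule (hypothetical
% region coordinates): "the area under the eyes is darker than the cheeks".
win = im2double(imresize(rgb2gray(imread('candidate.jpg')), [24 24]));

eyeRegion   = win(6:10, 4:21);    % assumed band covering both eyes
cheekRegion = win(13:18, 4:21);   % assumed band covering the cheeks

% Declare a (very weak) face candidate when the eye band is darker.
isFaceCandidate = mean(eyeRegion(:)) < mean(cheekRegion(:));
```

A real rule-based detector combines many such rules; this single comparison only demonstrates how one rule is encoded.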
The advantages of and challenges in the above-described face detection methods are given in the table below:
IV. VIOLA-JONES ALGORITHM

The Viola-Jones face detection technique is one of the face detection techniques which can detect the presence of frontal face(s) in an input image and determine the location of those faces. This face detection technique can scan the input image rapidly and give a high accuracy rate. Therefore the detection rate of the Viola-Jones technique is high, while its false positive rate is very small. The three main properties of this algorithm are characterized briefly as follows:

The first is the representation of the input image, in which the locations of faces need to be determined, in a new format known as the "Integral Image". This format of the image allows the detector to calculate the "features" rapidly. Along with the original image, this format also helps to calculate the features at different scale values. This format of the image can be obtained with only a few mathematical operations per pixel. With the help of the integral image, we can obtain the "Haar"-like features of an image very quickly and in constant time, irrespective of the location of the pixel.

The second property of this algorithm is the introduction of "Adaptive Boosting (AdaBoost)" to select the critical features of the face out of all the computed features. The AdaBoost algorithm is a learning algorithm, and after learning from different examples of faces and non-faces it gives a classifier which can classify the faces in an image. Out of all the features, the irrelevant features must be rejected by the learning process, so that classification is fast and only the critical features are processed.

The third and important contribution is the application of a cascaded structure of strong classifiers. This cascaded structure enables quick rejection of the background regions and spends most of the computation on the promising face-like regions. Due to this structure, the regions of the image which do not contain a face are rejected quickly.
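The weak/strong classifier arithmetic behind the second property can be sketched in a few lines of MATLAB. All numeric values below are illustrative placeholders, not values from a trained classifier; in practice the thresholds, polarities and weights are learned by AdaBoost:

```matlab
% Strong classifier as a weighted vote of weak classifiers (illustrative
% values; in practice these are learned by AdaBoost during training).
f     = [0.31 -0.12 0.58];   % feature values computed on one subwindow
theta = [0.25  0.10 0.40];   % per-feature thresholds (learned)
p     = [ 1   -1     1  ];   % polarities: direction of each inequality
alpha = [0.8   0.5   1.2];   % weak-classifier weights (learned)

h = (p .* f) < (p .* theta);                   % weak decisions, h_j in {0,1}
isFace = sum(alpha .* h) >= 0.5 * sum(alpha);  % Viola-Jones strong-stage rule
```

In the cascaded structure, one such strong classifier forms each stage, and a subwindow must pass every stage to be accepted as a face.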
Haar-Like Feature:
Our face contains a number of features like the nose, eyes, lips, cheeks, eyebrows etc. In a face detection scheme we prefer to use features of the face rather than the pixels directly. There are a number of reasons to use features instead of pixels for the computation. The first reason is the speed of detection, because the detection speed of a feature-based system is higher than that of a pixel-based system. The second reason is that features can be used to encode ad-hoc domain knowledge that is hard to learn from a finite quantity of training data. The features used in this technique are the same as the Haar basis functions used by Papageorgiou et al. (1998). The figure below shows some rectangular Haar-like features.

Fig 8 Some Haar-Like Features

The two-, three- and four-rectangle Haar-like features are shown in the figure below. A Haar-like feature gives one value after its computation, which can be used to categorize the subsections of an image.

Fig 9 Rectangle Haar-Like Features

The computation of the two-rectangle features (figure 9(A), (B)) is done by taking the difference between the summation of the pixels under the white region and under the black region. The computation of the three-rectangle feature (figure 9(C)) is done by taking the summation of the pixels under the two outside rectangles and subtracting it from the summation of the pixels under the centre rectangle. Similarly, for the four-rectangle feature (figure 9(D)) the value is computed by taking the summation of the pixels under each diagonal region and then taking the difference of the two summed values.

How Haar-Like Features Work:
Haar-like features are nothing but adjacent rectangular regions, and their value is calculated by taking the difference between the sums of the pixel intensities in the white rectangular region and the black rectangular region. These types of features are used in machine learning, where a function is trained on a number of positive images (images which contain human faces) and negative images (images which do not contain any human face) and is then utilized to detect the location of human faces in an image. The input image is scanned and searched for the Haar-like features of the current stage. The size and weight of every feature are computed by a machine learning algorithm such as AdaBoost. A number of features which can be applied to the face are shown in the figure below:

Integral Image:
The "integral image" is an intermediate representation of the input image. The main objective of this representation is to compute the summation of the pixel intensities under a rectangular region quickly and in constant time, irrespective of the location where it is needed. The integral image at pixel location (i,j) is obtained by taking the summation of the pixel intensities above and to the left of location (i,j). Equation (3.1) gives an idea of how to calculate the integral image, which is also shown in the figure below:
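The equation referenced as (3.1) did not survive extraction into this text; the standard definition it describes is ii(i,j) = sum of I(i',j') over all i' <= i and j' <= j. A minimal MATLAB sketch of this computation, and of the constant-time rectangle sum it enables, follows (file name and rectangle coordinates are illustrative):

```matlab
% Integral image: ii(i,j) is the sum of all pixels above and to the
% left of (i,j). cumsum over both dimensions computes it directly.
I  = im2double(rgb2gray(imread('input.jpg')));
ii = cumsum(cumsum(I, 1), 2);

% Zero-pad on the top/left so border rectangles need no special cases.
iiPad = zeros(size(ii) + 1);
iiPad(2:end, 2:end) = ii;

% Sum of any rectangle [r1..r2, c1..c2] from only four array look-ups.
rectSum = @(r1,c1,r2,c2) iiPad(r2+1,c2+1) - iiPad(r1,c2+1) ...
                       - iiPad(r2+1,c1)   + iiPad(r1,c1);

% Two-rectangle Haar feature (as in figure 9(A)): white half minus black half.
featureValue = rectSum(1,1,12,6) - rectSum(1,7,12,12);
```

The four look-ups per rectangle are exactly the "four-pixel value" access pattern exploited by the hardware implementation described later.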
V. MATLAB IMPLEMENTATION AND RESULTS

The Viola-Jones face detection technique is a machine learning technique in which a cascaded function is trained on a number of positive images (containing face/faces) and negative images (not containing any face/faces). This function is then able to classify the face(s) in other images as well. The cascade classification function is obtained by taking a weighted sum of weak classifiers. These weak classifiers are built from the features extracted from the training data.

For the development of the MATLAB implementation of the Viola-Jones algorithm, I have used a pre-trained classifier given by Dr. Rainer Lienhart, professor in the Computer Science department at the University of Augsburg [10]. This is one of the best trained cascaded classifiers based on the Viola-Jones approach, and it is widely used by prominent companies such as Intel, Microsoft, Apple etc. for face detection applications. After the training of the classifiers, the cascade can be used to classify/detect the objects. The flow of the implementation of the Viola-Jones algorithm is shown in the figure below:

The trained data, which contains all the features at multiple stages, is taken in .xml format. This file is converted into a .mat file which stores all the variables containing the values of the required constraints. All the features are then used to build a cascaded structure of Haar features, and every stage has its own threshold. This cascaded structure is used for the detection process. For detection, an input image in RGB/grayscale is given to the algorithm. This input image is scaled to multiple values in order to detect faces of any size. After the scaling process, the image is given to the integral image generation unit to generate the integral image. With the help of this integral image, we can calculate the sum of pixels using four pixel values, irrespective of the number of pixels to be summed. The Haar features in the image are computed through this integral image. Each computed Haar feature is compared with the feature values taken from the trained data. If the computed feature value crosses the threshold value, then the window is passed to the next stage of the cascaded structure; otherwise it is rejected for the face detection process. If a face is found in an image, the algorithm returns the starting coordinate of the face along with its width and height. With the help of the coordinate, width and height we can get the extreme coordinates of the face. Using these coordinates, the face region can be bounded by a box. The bounding box can be drawn by changing the colour of the pixels of the original image to the colour of the bounding box.

Classifiers Details:
The trained classifier used here is taken from OpenCV to detect frontal human faces using the Viola-Jones algorithm. Training of this cascaded classifier is done with frontal faces of size 20x20. The total number of stages used here is 22, the total number of Haar classifiers is 2135, and the total number of features used is 4630. The number of classifiers used in each stage is shown in the table below. As shown in the table, the number of classifiers in each stage increases, and thus the complexity of each stage also increases.
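The paper's route converts the OpenCV .xml cascade into a .mat file and runs a custom detection loop. For reference, a closely related off-the-shelf sketch (not the authors' code) uses the MATLAB Computer Vision Toolbox cascade detector, which ships with a comparable pre-trained frontal-face model:

```matlab
% Off-the-shelf cascade detection (related to, but not, the paper's
% custom .xml-to-.mat pipeline): detect frontal faces and draw boxes.
detector = vision.CascadeObjectDetector();   % pre-trained frontal-face model
img      = imread('test.jpg');               % assumed test image
bboxes   = step(detector, img);              % each row: [x y width height]

% Draw a red bounding box around each detection and display the result.
if ~isempty(bboxes)
    img = insertShape(img, 'Rectangle', bboxes, 'Color', 'red', 'LineWidth', 3);
end
imshow(img);
```

The returned [x y width height] rows correspond directly to the starting coordinate, width and height described above, from which the extreme coordinates of each face follow.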
Performance of MATLAB Implementation:
In order to obtain the accuracy, performance measurement of the implemented code has been done. Two different databases have been used to measure the performance of the implemented Viola-Jones algorithm. The first database contains 100 images of different people and is obtained by collecting images of 320x240 pixels from the internet. This database contains frontal face images of different people in complex backgrounds and different lighting conditions. The table below shows the accuracy of the MATLAB implementation of the Viola-Jones algorithm with this database.

The second database is given by Cambridge University and is named "Pointing'04 ICPR Workshop" [11]. This database contains face-pointing images of 15 different people with different head poses. The angle of the head pose varies from -90° to +90°. For the performance measurement, only the 0° to 45° variation in head pose is taken. Along with the head poses, this database also contains facial images of the 15 people with and without spectacles. The accuracy of the implemented algorithm is also tested for the images with and without spectacles. The image size in this database is 384x288 pixels.

The performance measurement of the implemented Viola-Jones algorithm with the 'Pointing'04 ICPR Workshop' database [11] is tabulated in the table below. This table shows the accuracy of face detection as the face is rotated by 0°, 15°, 30° and 45°. We can observe that as the rotation angle increases, the accuracy of detection decreases, because an increase in the face rotation angle leads to a decrease in the visible facial features. At 45° rotation of the face, the right eye is less than 50% visible compared to the 0° rotation, i.e. the frontal face image. Therefore, with 45° facial rotation the detection accuracy is at its minimum.
Table 4 Accuracy of MATLAB Implementation of Viola-Jones Algorithm with Database of Cambridge University

Person ID     Facial Rotation
              0°        15°       30°       45°
Person-1      T         T         T         F
Person-2      T         T         T         T
Person-3      T         T         F         F
Person-4      T         T         T         T
Person-5      T         T         F         F
Person-6      T         T         T         F
Person-7      T         T         T         T
Person-8      F         F         F         F
Person-9      F         F         F         F
Person-10     T         T         T         F
Person-11     T         T         F         F
Person-12     T         T         T         F
Person-13     T         T         T         F
Person-14     T         T         T         F
Person-15     T         T         F         F
Accuracy      86.67%    86.67%    60%       13.34%
Along with the head-pose images, the 'Pointing'04 ICPR Workshop' database [11] also consists of frontal face images of 15 different people with and without spectacles. The implemented algorithm is also tested with this database. It gives less accuracy compared to the images of people without spectacles, because the features related to the eyes get blocked by the spectacles. This comes under occlusion of facial features. The table below shows the comparison of the implemented algorithm on the database which contains frontal faces with and without spectacles. This table shows that with spectacles the accuracy of the implemented algorithm decreases, as some of the features of the face get blocked.
Table 5 Accuracy of MATLAB Implementation of Viola-Jones Algorithm with and without Spectacles using Database of Cambridge University

Person ID     Frontal Face
              With Spectacles    Without Spectacles
Person-1      T                  T
Person-2      T                  T
Person-3      F                  T
Person-4      F                  T
Person-5      F                  T
Person-6      F                  F
Person-7      T                  T
Person-8      F                  T
Person-9      F                  T
Person-10     T                  T
Person-11     T                  T
Person-12     F                  T
Person-13     T                  T
Person-14     T                  T
Person-15     F                  F
Accuracy      46.67%             86.67%
…rows, each of which contains a fixed number of pixels, called the number of columns of the screen. At each pixel location, the RGB colour information in the video signal is used to control the colour of the pixel. By changing the analog levels of the three RGB signals, all other colours are produced.

FPGA Development Board:
The Zedboard development board was chosen for the development of our project. The Zedboard is an evaluation and development board based on the Xilinx Zynq-7000 All Programmable SoC (AP-SoC). This development kit implements a Xilinx Zynq-7000 AP SoC XC7Z020-CLG484, which has 4.9 Mb of Block RAM, 106,400 flip-flops, 53,200 LUTs (look-up tables) and 85K programmable logic cells. In this project the FPGA kit plays the role of the heart of the entire system: it captures images from the camera, processes the captured image to get the facial features in the image, and displays the faces on the VGA display monitor. The camera OV7670 is interfaced with the Zedboard via GPIO pins on the board, and the VGA display is interfaced with the VGA connector available on board. A number of general-purpose I/Os, switches and LEDs are used for the implementation of some user-controlled activity.

Hardware Setup Used:
The block diagram of the hardware setup and connections of the Viola-Jones face detection system is shown in the figure below. The camera OV7670 and a VGA display are connected to the Zedboard. In this setup, the input image to the system is taken at a resolution of 320x240. Since the display has a resolution of 640x480, only the top-left corner is used to display the input image or video. The VHDL language is used to develop the code; the VHDL code is compiled and synthesized in the Xilinx ISE software and programmed onto the FPGA board. A VGA cable is used to connect the VGA display, and GPIOs are used to connect the OV7670 camera. Four switches are used for user control over the system. [SW0] is used to reset the entire system. [SW1] is used to reconfigure the camera registers. [SW2] is used to define the capture mode, i.e. whether we want the input as an image or video. [SW3] is used to take a snapshot with the camera.
The mapping of the switches on the FPGA board and a detailed description are given in the table below.
The top-level entity controls the camera module, the display module and the processing part of the input image. For processing the input image, the top-level entity takes the source image from the image frame buffer and processes it. The processing of the source image is done by first generating the integral image and then comparing the weak and strong stage thresholds with the pre-trained classifiers. Initially, an integral image of 40x60 pixels is generated. After this, 17 parallel subwindows scan the integral image and evaluate it for faces. In parallel with the subwindow scanning, the integral image for the next subwindow is generated, so that there is no delay in giving the new integral image to the subwindow evaluation. Hence there is no latency between the integral image generation and the subwindow evaluation. This integral image generation and subwindow evaluation process continues until the entire current scaled image is processed. After the evaluation of the last subwindow, the current image is further scaled down and the integral image generation and subwindow evaluation process takes place again. This system evaluates the source image for face candidates at 4 scale values and then creates a red box around each face in the source image. The final processed image is shown on the VGA display. The scanning of the image by the subwindow and the integral image is shown in the figure below:
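A serial software model of this scan order is sketched below. The file name, the 2-pixel step and the 1.25 scale factor are assumed for illustration, and the cascade evaluation is left as a placeholder; the hardware additionally evaluates 17 subwindows in parallel, which this serial loop does not model:

```matlab
% Serial sketch of the hardware scan order: for each of 4 scales,
% slide a 24x24 subwindow over the progressively downscaled image.
img = im2double(rgb2gray(imread('frame.png')));   % e.g. 320x240 source
for s = 1:4
    ii = cumsum(cumsum(img, 1), 2);               % integral image at this scale
    [rows, cols] = size(img);
    for r = 1:2:rows-23                           % subwindow sweep (assumed step)
        for c = 1:2:cols-23
            % Placeholder for the stage-by-stage cascade evaluation
            % described in the following subsections:
            % isFace = evaluateCascade(ii, r, c);
        end
    end
    img = imresize(img, 1/1.25);                  % scale down for the next pass
end
```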
Capturing of Input Image and Saving it into BRAM:
The input to the system is an image, and it is captured by the camera module OV7670. The image is captured using an on-board switch which is connected to one of the pins of the FPGA. The captured image is saved into the Block RAM, which has 320x240 = 76,800 addressable memory locations, and each memory location can store 12 bits of data (4:4:4 = R:G:B). A dual-port BRAM is instantiated here in order to save the original image. Of the two ports, one port is used to read the original image in order to convert it into the integral image and save this integral image in another memory. The other port is used to draw the box over the detected faces in the original image.

Generation of Integral Image:
The next stage of this implementation is the generation of the integral image. The integral image is generated for a portion of the source image in order to use minimum memory resources. The integral image at any location (x,y) is the summation of the grayscale pixels above and to the left of (x,y). The 12-bit RGB input image is taken from the image buffer and converted into a grey image, which is used for the generation of the integral image. The integral image square is also generated, for the purpose of calculating the variance normalization factor. This generator uses accumulators and recursive computation to obtain the resultant integral image. For the current row at location (x,y), an accumulator is used to compute the sum of the grayscale pixel values. If the current row is not the first row of the source image, then this sum must be added to the previous row's (x,y-1) integral image value in order to get the correct integral value at location (x,y). A multiplexer is used here to select the pixel summation for the first row of the image. After the first row of the image, the integral image generator also requires the summation of the pixels of the previous rows, so we read the data back from the memory and add it to the summation of the pixel values of the current row. The block diagram for the generation of the integral image is shown in the figure:
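A software model of this generator (a sketch, with y as the row index in MATLAB) mirrors the row accumulator, the first-row multiplexer, and the read-back of the previous row's integral value:

```matlab
% Model of the hardware integral-image generator: a running row
% accumulator plus read-back of the previous row's integral value.
gray = im2double(rgb2gray(imread('frame.png')));
[rows, cols] = size(gray);
ii  = zeros(rows, cols);      % integral image
iiS = zeros(rows, cols);      % integral image of squared pixels (for VNF)

for y = 1:rows
    rowAcc = 0;  rowAccS = 0;              % accumulators reset every row
    for x = 1:cols
        rowAcc  = rowAcc  + gray(y, x);
        rowAccS = rowAccS + gray(y, x)^2;
        if y == 1                          % "multiplexer": the first row
            ii(y, x)  = rowAcc;            % has no previous row to add
            iiS(y, x) = rowAccS;
        else                               % add the previous row's value
            ii(y, x)  = rowAcc  + ii(y-1, x);    % read back from memory
            iiS(y, x) = rowAccS + iiS(y-1, x);
        end
    end
end
```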
The simulation waveform for the integral image generation is shown in the figure below. In this figure, wrdata_buff_2A[19:0] shows the waveform of the integral image and wrdata_buff_3A[27:0] shows the waveform of the integral image square.

The figure below shows the computation time of an integral image. The high pulse shows that the computation of an integral image is done, and the low period of the pulse shows that the integral image is being computed.

Fig 20 Integral Image Done Signal and Computation Time of Integral Image

On the Zedboard, a GPIO pin is used to bring out this signal, which is displayed on a DSO. The table below shows the computation time of the integral image used.
Processing of Sub-Window to Find Face:
A cascaded classifier is used to remove the non-faces and detect the faces in an image. This cascaded classifier is a trained chain of facial features. Different feature evaluations are done throughout the 22 strong stages in a cascaded manner. The evaluation of features is done in a 24x24 pixel subwindow area. The figure below shows the chosen window area for the subwindow processing. The sums of the grayscale pixel values within these rectangular areas are used to obtain the difference between the dark and light regions in human faces.

After the feature calculations, the accumulated values are compared with the threshold of the strong stage. If this threshold is crossed by the accumulated value, then the currently evaluated subwindow is considered to contain a face element and is passed to the next stage for further processing. If this threshold is not crossed by the accumulated value, then a non-face is detected, the currently evaluated subwindow is rejected, and the processing of the next subwindow starts. If the subwindow crosses the last stage of the cascaded structure without being rejected as a non-face, then the subwindow is determined to contain a face. The figure below shows the sequential processing of a subwindow in a cascaded classifier structure.
The figure below shows the processing of a subwindow, in which the calculation of the features and their comparison with the weak threshold and the strong threshold are shown. This implementation has two data paths: the first is for the variance normalization of the subwindow and the second is for the feature evaluation.
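The stage logic just described can be sketched as follows. The data layout (stages, featIdx, weakThresh, leftVal, rightVal, strongThresh) and every number are invented for illustration; the real values come from the trained classifier data, and featVal stands for the variance-normalized feature values already computed for the current subwindow:

```matlab
% Sketch of one subwindow passing through the cascaded stages.
featVal = randn(1, 6);                              % dummy feature values
stages  = struct( ...
    'featIdx',      {1:2, 3:6}, ...                 % features per stage
    'weakThresh',   {[0.1 -0.2], [0.0 0.3 -0.1 0.2]}, ...
    'leftVal',      {[-1 -1],    [-1 -1 -1 -1]}, ...
    'rightVal',     {[ 1  1],    [ 1  1  1  1]}, ...
    'strongThresh', { 0.0,        0.5});

isFace = true;
for s = 1:numel(stages)
    st  = stages(s);
    acc = 0;
    for k = 1:numel(st.featIdx)
        f = featVal(st.featIdx(k));
        if f < st.weakThresh(k)         % weak threshold picks the left
            acc = acc + st.leftVal(k);  % or the right tree value
        else
            acc = acc + st.rightVal(k);
        end
    end
    if acc < st.strongThresh            % fail the strong threshold:
        isFace = false;                 % reject subwindow, stop early
        break;
    end
end
```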
In order to bring the light level of the subwindow to the light levels of the training images, variance normalization is used. The formula used to calculate the variance normalization factor (VNF) is given by the equation below:
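The equation itself did not survive extraction into this text. A standard form consistent with the surrounding description (a reconstruction, not necessarily the authors' exact notation) is:

$$\mathrm{VNF} = \sigma = \sqrt{\frac{1}{N}\sum p^{2} - m^{2}}, \qquad m = \frac{1}{N}\sum p$$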
Here the mean (m) is obtained from the integral image (s0 : s3), and the sum of the squares of the pixel values (p²) is obtained from the integral image of squared pixels (ss0 : ss3). In one subwindow the total number of pixels is 24 x 24 = 576, but for the value of N we have taken 512 for ease of division, by right-shifting the bits by 9 places.

According to the normalized weak threshold, a left-tree or right-tree value is collected into a register for the strong-threshold comparison at the end of a stage.

Face Box Creation:
The face box creation stage draws a red box around the detected faces in the source image. To draw the box over a face, the actual position (x and y position) and the scale value are required. A red box can be drawn by simply changing the colour of the desired pixels to red. For this, a memory is used in a first-in-first-out manner; it stores the detections from all the 16 subwindows. When the subwindow process is completed for all the scales of the image, the desired pixel values of the source image in the image buffer are changed to the red colour.
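A sketch of this pixel-overwrite approach in MATLAB (the file name, position and scale are assumed; the factor 24 reflects the 24x24 base window scaled up):

```matlab
% Draw a red box by overwriting the border pixels of the detection
% rectangle in the RGB frame buffer (assumed layout: HxWx3, uint8).
frame = imread('frame.png');                  % source image buffer
x = 80; y = 50; scale = 2;                    % one FIFO entry: position + scale
w = round(24 * scale); h = round(24 * scale); % detection window size

red = reshape(uint8([255 0 0]), 1, 1, 3);
frame(y,       x:x+w-1, :) = repmat(red, 1, w);   % top edge
frame(y+h-1,   x:x+w-1, :) = repmat(red, 1, w);   % bottom edge
frame(y:y+h-1, x,       :) = repmat(red, h, 1);   % left edge
frame(y:y+h-1, x+w-1,   :) = repmat(red, h, 1);   % right edge
imshow(frame);
```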
Performance Measurement of the Implemented Face Detection System on FPGA:
The performance of the implemented system on the FPGA is measured as the total time taken to detect the face. To obtain the performance of the FPGA-based face detection system, the FPS is measured. FPS is nothing but the number of frames processed per second. In order to compute the FPS of the system, the time duration between capturing an image and displaying the detected result is measured; more precisely, the time between the start of the frame capture and the end of the face box drawing on the detected faces is measured. On the Zedboard FPGA, a register is used to obtain the FPS. The output of the register is mapped to a GPIO pin of the FPGA. A DSO is used to measure the time duration after which the output of the register becomes high. The waveform of the face detection, taken from the GPIO pin, is shown in the figure below. The low period of the signal shows the processing time of the face detection. When the system detects the face after processing, the signal becomes high.
In the figure shown above, one single square is 60 ms; so the total time for which the signal is low is approximately 130 ms. The time taken to detect the face and the detection frequency are shown in the table below:
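As a consistency check, reading the measured low period as approximately 130 ms gives

$$\mathrm{FPS} = \frac{1}{T_{\text{frame}}} \approx \frac{1}{130\ \text{ms}} \approx 7.69,$$

which matches the 7.69 FPS reported in the conclusion.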
VIII. CONCLUSION

There are several algorithms available for face detection. The selection of the algorithm entirely depends upon the requirements at hand. If we require a system which can detect the faces in an image with high accuracy, then this leads to more computation and thus requires more powerful hardware. If the hardware is not powerful, then we need to compromise on accuracy. In this paper the Viola-Jones algorithm is used for the implementation; it gives high accuracy, but it requires more computation.

The Viola-Jones algorithm is implemented for face detection in MATLAB and then, using VHDL, it is implemented on the Zedboard FPGA. Through the MATLAB implementation and simulations, we verify how accurately the Viola-Jones algorithm can detect faces. The accuracy of the MATLAB implementation of the Viola-Jones algorithm is 86.67%. For the hardware implementation, the algorithm is developed in VHDL and implemented on the Zedboard FPGA. The detection rate given by the hardware implementation is measured in processed frames per second, and its value is obtained as 7.69 FPS.

ACKNOWLEDGMENT

REFERENCES

[5]. Z. Li, L. Xue, and F. Tan, "Face detection in complex background based on skin color features and improved AdaBoost algorithms," in Progress in Informatics and Computing (PIC), 2010 IEEE International Conference on, vol. 2, Dec 2010, pp. 723–727.
[6]. X. Zhao, X. Chai, Z. Niu, C. Heng, and S. Shan, "Context constrained facial landmark localization based on discontinuous Haar-like feature," in Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, March 2011, pp. 673–678.
[7]. W.-C. Hu, C.-Y. Yang, D.-Y. Huang, and C.-H. Huang, "Feature-based face detection against skin-color like backgrounds with varying illumination," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 2, pp. 123–132, 2011.
[8]. D. N. Parmar and B. B. Mehta, "Face recognition methods & applications," arXiv preprint arXiv:1403.0485, 2014.
[9]. I. Marques, "Face recognition algorithms," Proyecto Fin de Carrera (final degree project), Euskal Herriko Unibertsitatea (University of the Basque Country), 2010.
[10]. "Pre-trained classifier for face detection."
[11]. "Pointing'04 ICPR Workshop."