FPGA Based Haar Cascade
Abstract: The authors present a novel approach of using reconfigurable fabric to accelerate a face detection algorithm based on the Haar classifier. With a highly pipelined architecture that utilises the abundant parallel arithmetic units in an FPGA, the authors have achieved real-time face detection with a very high detection rate and low false positives. The 1-classifier and 16-classifier realisations of the accelerator provide 10× and 72× speedups, respectively, over the software counterpart. Moreover, the authors' approach scales with the resources available on the FPGA and will gain more momentum as the Geneseo Initiative is introduced in the market. This work also provides an understanding of how reconfigurable fabric can be used to accelerate non-systolic vision algorithms.
applications can benefit from FPGAs as well. Third, they are erasable, so the target design can be modified and patched anytime, anywhere. Fourth, they are off-the-shelf devices and are therefore much less expensive to use than very large scale integration (VLSI) solutions such as application-specific integrated circuits (ASICs).

Our approach begins by examining the possibility of speeding up functions in Intel's Integrated Performance Primitives (IPP) library [6]. The IPP library is widely used to optimise many computation-intensive applications for multimedia, communication and computer vision. It is therefore beneficial to examine the feasibility of mapping commonly used functions in the IPP library to the reconfigurable fabric. It is hoped that by using reconfigurable fabric to implement these functions, the overall performance of a large number of applications can be improved over the pure software solution.

As an experiment, we chose face detection from among computer vision applications to verify our approach. Face detection is the computational problem of identifying and locating human frontal faces in a photo or video regardless of lighting, orientation, complexion and size. It is useful in many other vision-based applications such as digital cameras. We first examined the performance profile of the face detection program and found that the most time-consuming function (greater than 500 clock ticks for every image pixel) in its algorithm is the Haar classifier. The Haar classifier takes 93% of the total computation time. By implementing the Haar classifier in hardware and porting it onto the reconfigurable fabric, we have achieved up to a 72× speedup over the pure software approach with a high detection rate and low false positives.

2 Related work

Face detection algorithms are crucial parts in solving many face-related problems such as face recognition, expression recognition and face tracking. Face detection is considered as a classification problem of images into face or non-face. Yang et al. [7] summarised different face detection algorithms and reported a comparative analysis of them. Among several face detection algorithms, neural networks (NNs) and the support vector machine (SVM) provide the best performance in terms of detection rate and false alarm rate. NNs [8] are simplified models of neural processing in the brain, which can be used to learn a general decision hypothesis from sample data referred to as training data. The SVM [9, 10] is a linear classifier, where the decision surface is chosen to minimise the classification error. The decision surface is calculated using a small subset of the training vectors called support vectors. Owing to its classification precision and superior performance in generalisation, it is one of the most popular algorithms used by the machine learning community. However, it requires an enormous amount of computation time and it is also memory-intensive. Therefore it is imperative to provide an efficient method to train the SVM, especially for large scale problems. Our FPGA-based Haar classifier is based on SVM. There are several potential approaches to providing the required amount of computation. One example is to use a digital signal processor (DSP) in place of a general purpose CPU. Li et al. [11] provide an example of this approach. However, we do wish to integrate image processing together with general computing such as database applications, and to provide an alternative approach for potential heterogeneous CMP configurations in future processor designs. Here we employ hardware-based acceleration on a general-purpose computer. This paper aims at showing how to accelerate face detection by implementing the Haar classifier in hardware with an FPGA.

Most FPGA implementations of image processing utilise the systolic array structure of image data; thus they resemble streaming data processors [12, 13]. Irick et al. [14] used the streaming architecture to implement face detection algorithms based on NNs and achieved high performance. However, their pixel offset of 10 is unrealistically high and they did not provide a performance comparison against pure microprocessor-based software implementations. Other face or object detection implementations with FPGA either report inferior performance or have a lower detection rate and higher false alarm rate [15–18]. Our approach of implementing Haar classifiers on FPGA provides a higher frontal face detection rate and lower false alarm rate when compared with Intel's IPP and Open Computer Vision Library (OpenCV) [19] software solutions. With a highly pipelined and parallel architecture, our system achieved real-time face detection performance of 37 frames/s. Furthermore, since our design uses a commercially available peripheral component interconnect (PCI) Express (PCIe)-based FPGA card, it is comparatively easy to migrate our design to other object detection, recognition and tracking applications with similar Haar classifier functions.

3 Algorithm

3.1 Haar classifier face detection algorithm

Our face detection algorithm utilises the Haar classifier function adapted from Viola and Jones [20, 21] and Papageorgiou et al. [9, 10]. Lienhart et al. [22, 23] were the first to introduce this algorithm into Intel's IPP, which was later included in OpenCV. Viola and Jones [21] proposed a cascaded, degenerate decision tree for fast software computation while maintaining the same detection rate as other, slower single-stage classifiers. They used the AdaBoost learning algorithm to select a small number of critical visual features from a large set of potential features to train the classifiers. The features are pixel tiles similar to the rectangle regions labelled 'a' and 'b' in Fig. 1. They have different weights for different rectangle regions and describe the likelihood of such rectangle pairs being features of human frontal faces [23]. Our FPGA implementation utilises 40 classifier stages. Each classifier
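To make the cascade structure described above concrete, the following C++ sketch lays out the data involved: weighted rectangle features, weak classifiers with a trained threshold and two return values, and a sequence of stages (40 in this design), each with its own stage threshold. The type and field names are illustrative assumptions of ours, not the authors' hardware description or the OpenCV data structures.

    #include <vector>

    // Illustrative (assumed) data layout for the Haar cascade described above.
    struct Rect { int x, y, w, h; int weight; };      // weighted rectangle region ('a', 'b' in Fig. 1)
    struct Feature { std::vector<Rect> rects; };       // two or three weighted rectangles per feature
    struct Classifier {                                // one weak classifier
        Feature feature;
        int threshold;                                 // trained classifier threshold
        int v1, v2;                                    // values accumulated on pass/fail
    };
    struct Stage {                                     // one of the 40 cascade stages
        std::vector<Classifier> classifiers;
        int stage_threshold;                           // trained stage threshold
    };
    using Cascade = std::vector<Stage>;                // 40 stages in this design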
for all different feature sizes. In terms of the integral image, the feature rectangle intensity summation (RIS) of the bold-lined rectangle in Fig. 3 is calculated as (2)

RIS_bold-lined = II(x3, y3) − II(x1, y1) − II(x2, y2) + II(x0, y0)    (2)

This RIS calculation in constant time is very useful, as proved by the software implementations [21, 23]. We adopted it into our hardware implementation as well. During pre-processing, the pixel variance of the source image is calculated to compensate for lighting differences (refer to Section 2.3 in [22]).

3.1.2 Haar classifier function: The Haar classifier function is a crucial part of the whole face detection algorithm. We calculate features according to (2) and multiply them by the trained feature weights. If the result is greater than the classifier threshold, the stage value is accumulated with a value of V1; otherwise the stage value is accumulated with a value of V2. Different classifiers may have different values of V1 and V2. When each stage is finished, the accumulated stage value is compared with the stage threshold. If it is greater than the stage threshold, the source pixel moves on to the next stage. The same procedure is then repeated until the pixel passes all 40 stages. If the source pixel fails in any stage during this examination, it is discarded as a non-facial pixel. The feature weight, classifier threshold, V1, V2 and stage threshold are all trained classifier values from the Haar classifier training stage. The detailed training process is explained in [21].

3.1.3 Post-processing: After the Haar classifier functions have operated on the original source image and the scaled images, we cluster the detected face pixels across adjacent scaled images to form a final detected face rectangle. Fig. 4 illustrates the detected faces from sample pictures, which were chosen from the CMU frontal face images test set [24].

4 Implementation

4.1 FPGA implementation consideration

To achieve high performance in face detection with the Haar classifier, there are several issues to take into account when utilising reconfigurable fabric. These include how to connect the FPGA accelerator with the host microprocessor, how to partition the computations between these two computing entities, and how to optimise the classifier to best utilise the FPGA.

There are several possible options for interfacing a general purpose processor with a reconfigurable fabric. The most tightly coupled approach is to integrate the fabric with the processor pipeline by implementing special instructions with a Haar engine. This approach provides the best performance in terms of execution time. Nonetheless, it requires modification to the processor core, making it unfeasible as a general solution unless processor providers support the feature.

Another approach is to place the reconfigurable fabric as close to the processor as possible. A practically possible place is on a processor bus. In other words, the reconfigurable fabric is connected with the processor via a processor bus such as the front side bus (FSB) of an Intel processor. Heterogeneous multi-core architectures in the future would take a similar form, even though the multiple homogeneous cores in a Core 2 Duo, for example, share a large L2 cache [25]. In previous research, Suh et al. [26] implemented a communication mechanism between a Pentium-III and an FPGA via the FSB on an Intel-based server system. The communication was achieved by utilising the cache coherence protocol on the FSB. DRC Computer [5] also provides a similar solution with the HyperTransport bus on AMD-based systems. Nevertheless, as the processor's bus frequency continues to grow and the proprietary bus standard continues to evolve, building a bus-based module requires tremendous time and
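As a software reference for (2) and for the classifier rule of Section 3.1.2, the sketch below computes a rectangle intensity summation from an integral image in constant time and shows the V1/V2 accumulation and the stage-threshold test. It is a minimal model of the behaviour described in the text, assuming a row-major integral image and illustrative names; the variance-based lighting compensation is omitted.

    #include <cstdint>
    #include <vector>

    // Integral image: ii[y][x] holds the sum of all source pixels above and to
    // the left of (x, y), so any rectangle sum needs only four lookups, as in (2).
    using IntegralImage = std::vector<std::vector<uint32_t>>;

    // Rectangle intensity summation of the rectangle with corners
    // (x0,y0) top-left, (x1,y1) top-right, (x2,y2) bottom-left, (x3,y3) bottom-right.
    int64_t ris(const IntegralImage& ii, int x0, int y0, int x1, int y1,
                int x2, int y2, int x3, int y3) {
        return (int64_t)ii[y3][x3] - ii[y1][x1] - ii[y2][x2] + ii[y0][x0];  // equation (2)
    }

    // One weak classifier (Section 3.1.2): the weighted feature value is compared
    // with the classifier threshold; V1 is returned on pass, V2 on fail.
    int classify(int64_t weighted_feature_sum, int classifier_threshold, int v1, int v2) {
        return (weighted_feature_sum > classifier_threshold) ? v1 : v2;
    }

    // A stage passes when the accumulated stage value exceeds the stage threshold.
    bool stage_passes(const std::vector<int>& class_values, int stage_threshold) {
        int stage_value = 0;
        for (int v : class_values) stage_value += v;
        return stage_value > stage_threshold;
    }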
effort, due in part to the extensive I/O signalling and the difficulty of accessing proprietary documents. Moreover, as presented in [26], communication through the coherence traffic entails technical complications such as the allocation of a page in memory space and direct cache manipulation for cache line invalidation. Additionally, since a cache-to-cache transfer in coherence traffic involves one cache block transfer, fine-tuning is required to achieve effective communication.

The final and easiest option for communication is to interface the two computing entities via a standard I/O channel such as the PCIe bus. This means that the reconfigurable fabric is located on an expansion card on an I/O bus. Even though this option suffers from the highest communication latency, it provides a straightforward solution in terms of physical interface and device driver development, because I/O buses on a personal computer are industry standards. Moreover, in industry, the Geneseo Initiative [27] is extending PCIe with features that provide power savings and better support for coherent coprocessors. Therefore computer systems will soon be equipped with more coprocessor-friendly features. In our particular application, latency is not a critical factor. Our main goal is to evaluate the effectiveness of using reconfigurable fabric to accelerate algorithms, in particular non-systolic algorithms. Our result with the PCIe-based FPGA implementation can be used to extrapolate designs with other interfaces.

Based on this consideration, we chose a commercial PCIe-based card equipped with an FPGA as our acceleration platform. The PCIe card is an HTG-V5-PCIE board from HiTech Global [4], as shown in Fig. 5. It incorporates a Xilinx Virtex 5 LX110T and supports eight lanes of PCIe Gen1. The Xilinx LX110T includes digital signal processing (DSP) fabric inside the FPGA, so signal-processing-centric applications such as face detection can best utilise the reconfigurable fabric and can be ported to the FPGA in a cost-efficient way.

than 93% of the total software execution time, we extracted and ported the Haar classifier function onto the FPGA board and left the pre-processing and post-processing on the host processor. We also changed the data flow (loop order) from looping over every classifier per pixel to looping over every stage per pixel. Second, in order to reduce the resource requirement and to take advantage of the FPGA's intrinsic parallel assets, we replaced all floating point operations with integer computations. With this transformation, we could utilise the Xilinx Virtex 5's embedded DSP48E blocks and consequently accelerate the multiplications and additions in the Haar classifier function. According to our experiments, the consequently lowered data precision does not affect the final detection accuracy. A single cycle of FPGA operation is equivalent to 100–1000 software clock ticks for the same functionality. Third, we employed extensive pipelining to increase the algorithm-level parallelism and to sustain the operating frequency of the FPGA. Our implementation has as many as 28 pipeline stages for the Haar classifier algorithm (refer to Section 4.3.3). We then strove to match the 17 × 17 pixel sub-window with 16N classifiers for each stage, where N is an integer. For example, the first stage has 16 classifiers, as opposed to three classifiers in the software version. Therefore each pixel requires more computation to pass the first stage in the FPGA implementation. However, the first stage in the FPGA implementation dropped more than 90% of CMU's frontal image set [24] as non-facial pixels, as opposed to a 50% drop rate in the software version. Fourth, we implemented the Haar classifier function with reuse in mind. It consists of intellectual property building blocks such as the classifier (Fig. 9) and the stage engine (Fig. 10) that could be used for other applications with similar time-consuming Haar functions. In addition, the design is easily scalable with various parameters such as the buffer window size, the total number of stages and the number of classifiers in each stage, as detailed in Section 4.3.
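The floating-point-to-integer conversion mentioned above can be pictured as a fixed-point scaling of the trained parameters. The sketch below is only an assumed software analogue (the scale factor and helper names are ours, not taken from the paper): when both sides of the classifier comparison are scaled by the same power of two, the comparison result, and hence the detection decision, is preserved up to rounding in the least significant bits.

    #include <cstdint>
    #include <cmath>

    // Hypothetical fixed-point scale: FRAC_BITS fractional bits (value is an assumption).
    constexpr int FRAC_BITS = 12;

    // Quantise a trained floating-point parameter (feature weight, threshold, V1, V2)
    // to an integer so that DSP48E-style integer multipliers and adders can be used.
    inline int32_t to_fixed(double trained_value) {
        return (int32_t)std::lround(trained_value * (1 << FRAC_BITS));
    }

    // Floating-point classifier test: weight * rectangle_sum > threshold ?
    inline bool classifier_test_float(double weight, double rect_sum, double threshold) {
        return weight * rect_sum > threshold;
    }

    // Integer version: both sides of the comparison carry the same 2^FRAC_BITS scale,
    // so the inequality, and therefore the cascade decision, is unchanged apart from
    // rounding in the least significant bits.
    inline bool classifier_test_fixed(int32_t weight_fx, int64_t rect_sum, int64_t threshold_fx) {
        return (int64_t)weight_fx * rect_sum > threshold_fx;
    }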
personal computer system. A PC system is composed of three main components: the CPU and the north and south bridges. The accelerator is connected to a PCIe slot on the south bridge side. Fig. 6b is a system diagram in terms of PCIe. The backbone of a personal computer system is based on PCIe, and the accelerator is connected to one of the endpoints.

During the face detection processing, the host processor pre-processes images and sends the integral images and image variances to the FPGA accelerator through the PCIe bus. The FPGA stores a 32 × 32 image in a buffer and proceeds with the Haar classifier function. Afterwards it sends the detected faces' coordinates back to the host processor through PCIe.
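The host/accelerator split can be summarised by the host-side skeleton below. Every function and structure name here is a hypothetical placeholder (the paper does not expose its driver interface); the sketch only fixes the order of operations: pre-process, ship the integral image and variance data over PCIe, read back the sparse face-pixel coordinates, then post-process.

    #include <cstdint>
    #include <vector>

    struct GrayImage  { int width, height; std::vector<uint8_t> pixels; };
    struct FaceRect   { int x, y, w, h; };
    struct PixelCoord { int x, y; };

    // Hypothetical driver calls for the PCIe FPGA card (placeholders, not a real API).
    void send_to_fpga(const std::vector<uint32_t>& integral, const std::vector<uint32_t>& variance);
    std::vector<PixelCoord> receive_from_fpga();

    // Host-side responsibilities described in Section 4 (pre- and post-processing).
    std::vector<uint32_t> compute_integral_image(const GrayImage& img);    // pre-processing
    std::vector<uint32_t> compute_window_variance(const GrayImage& img);   // lighting compensation
    std::vector<FaceRect> cluster_face_pixels(const std::vector<PixelCoord>& hits); // post-processing

    std::vector<FaceRect> detect_faces(const GrayImage& frame) {
        auto integral = compute_integral_image(frame);
        auto variance = compute_window_variance(frame);
        send_to_fpga(integral, variance);          // Haar classifier runs on the FPGA
        auto face_pixels = receive_from_fpga();    // sparse list of passing pixel addresses
        return cluster_face_pixels(face_pixels);   // merge hits across scales into rectangles
    }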
Figure 8 Pipelined design of the face detection accelerator in FPGA: srcram is the source integral image BRAM; classram is the classifier features BRAM; normram is the image variance BRAM; classifier is the classifier computational engine; lesscomp is the stage threshold comparison engine
The host processor finishes the face detection algorithm with post-processing. The host PC is equipped with a 2.66 GHz Intel Core 2 Duo and 8 GB of main memory.

4.3 FPGA implementation details

4.3.1 FPGA data flow: In order to detect the frontal faces in the original image, it is necessary to scan and scrutinise each pixel in the image, which requires intense computation. The computational load could be mitigated by jumping to non-neighbouring pixels during the Haar-function operation. However, the quality of the final face detection would be greatly degraded in terms of detection rate and false alarm rate. In our work, we processed every pixel with the Haar function. Nevertheless, the detailed FPGA implementation described below does not prohibit the pixel-skipping scanning method. This implies that the results in Section 5 are conservative and thus trustworthy.

As shown in Fig. 7, the integral image data move through the Integral Image BRAM (Block RAM), 17 × 17 buffer window, 17 × 17 window, 289:12 multiplexer (MUX), classifier, stage comparator, mask BRAM and PCIe TLP (Transaction Layer Packet) in sequence. Finally, the result is delivered to the Xilinx Endpoint PCIe Core to be sent to the host processor. Meanwhile, the feature data in the Feature BRAM are fed into the MUX, classifier engine and stage comparator. The following explains the detailed operation of the data flow path.

Pixel scanning operation: During the experiment, we store part (32 × 32) of the 256 × 192 integral image in BRAM. Each pixel in the integral image has a resolution of 17 bits. Therefore, to store the 32 × 32 source integral image in the Integral Image BRAM, the storage requirement is 32 × 32 × 17 bits. The same resolution (17 bits) also applies to the 17 × 17 buffer window, the 17 × 17 window and the 289:12 MUX. The numbers in the previous paragraph are all pixel counts (not bit capacities). For example, the 289:12 MUX selects 12 pixels from 289 pixels. In reality, there are 12 multiplexers; each multiplexer can mux out one integral image pixel from 289 pixels, with 17 bits per pixel. In other words, 12 × 17 289-bit to 1-bit multiplexers are required.
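A quick check of the buffering figures quoted above, written as a few lines of C++ for convenience (the constants simply mirror the numbers in the text):

    #include <cstdio>

    int main() {
        const int window = 17;                 // 17 x 17 pixel sub-window
        const int bits_per_pixel = 17;         // integral image resolution
        const int buffer_dim = 32;             // 32 x 32 integral image slice kept on chip

        // Storage for the on-chip integral image slice: 32 x 32 x 17 bits.
        const int bram_bits = buffer_dim * buffer_dim * bits_per_pixel;   // 17408 bits

        // One srcram row holds 32 pixels x 17 bits = 544 bits.
        const int row_bits = buffer_dim * bits_per_pixel;                 // 544 bits

        // The 289:12 MUX picks 12 pixels out of the 17 x 17 = 289-pixel window,
        // i.e. 12 multiplexers x 17 bits, each bit selected from 289 candidates.
        const int mux_inputs = window * window;                           // 289
        const int one_bit_muxes = 12 * bits_per_pixel;                    // 204 289-to-1 muxes

        std::printf("BRAM slice: %d bits, row: %d bits, MUX: %d x %d:1\n",
                    bram_bits, row_bits, one_bit_muxes, mux_inputs);
        return 0;
    }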
The reason that we do not store the whole integral image in the FPGA is that the Virtex 5 LX110T is not able to hold the 256 × 192 integral image in its BRAM.

Fig. 8 provides a detailed view of the pixel scanning operation. We stored the 32 × 32 integral image in srcram (the source integral image BRAM) as 32 rows. Each row holds 544 bits (32 × 17) of data. During each clock cycle, one row (544 bits) of the integral image is read out from the srcram. The seventeen 17-bit values in this one-row integral image data are fed into the corresponding registers in the 17 × 17 buffer window. When all the registers in the 17 × 17 buffer window have been updated, the data are transferred to the 17 × 17 window.

Classifier operation: During the pixel scanning operation, the 12 features of the first classifier in the first stage are prepared and stored in the Feature BRAM, and they are supplied to the MUX, classifier and stage comparator. Meanwhile, the pixel's variance is fed into the classifier from the Image Variance BRAM. The MUX chooses the 12 integral data from the 17 × 17 window according to the three features in each classifier and feeds them to the classifier. Note that each feature determines four coordinates in the 17 × 17 window. The classifier calculates the first class value with those input data.

Stage operation: In our Haar classifier model, we have 16 classifiers in the first stage. The number 16 is a carefully and well-chosen number, mostly because about 90% of the pixels of the CMU frontal image set are dropped as non-facial pixels after being examined by 16 classifiers. Moreover, the Xilinx Virtex 5 LX330T FPGA provides the capacity to implement 16 parallel classifiers for a much better performance improvement. After 16 cycles of calculation, with the 16 classifier values accumulated as the first stage value, this stage value is compared with the stage threshold to decide whether the pixel passes the first stage or not. Because more than 90% of the pixels are non-facial pixels, the 17 × 17 window and the classifier are in most cases ready for the first-stage computation of the next pixel after finishing the first-stage computation of the previous pixel. Less than 10% of the pixels will pass the first stage. In this case, the data in the 17 × 17 window remain intact. However, the classifiers (features) from the second stage onwards will continue to feed into the classifier and the stage comparator until any stage value is less than the trained stage threshold.

Figure 10 Stage engine block diagram for the aggressive version of face detection that includes 16 parallel classifiers at the front
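The stage behaviour just described, accumulating the class values of a stage's classifiers and comparing the total against the stage threshold, with a pixel rejected at the first failing stage, can be modelled in software as follows. The loop structure follows the text; representing a weak classifier as a callable is an illustrative assumption.

    #include <functional>
    #include <vector>

    // A weak classifier is modelled abstractly as a callable that returns its
    // class value (V1 or V2) for the current sub-window; how that value is
    // computed is shown in the earlier sketch of equation (2).
    using WeakClassifier = std::function<int()>;

    struct Stage {
        std::vector<WeakClassifier> classifiers;   // 16 in the first stage of this design
        int stage_threshold;
    };

    // Cascade scan for one pixel: every stage must be passed; failing any stage
    // rejects the pixel immediately (stage scan ends and pixel scan resumes).
    bool pixel_passes_cascade(const std::vector<Stage>& cascade) {
        for (const Stage& stage : cascade) {
            int stage_value = 0;
            for (const WeakClassifier& c : stage.classifiers)
                stage_value += c();                        // accumulate V1/V2 values
            if (stage_value <= stage.stage_threshold)      // fail: non-facial pixel
                return false;
        }
        return true;                                       // passed all 40 stages: face pixel
    }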
If any pixel passes the first stage, the pixel scan mode becomes stage scan mode, in which the FPGA retrieves the pixel index and the pixel moves to the next stage to decide whether it can pass all 40 stages. During this operation, the FPGA dedicates its resources to this specific pixel, buffers the following classifier data into the classifier engine and stops buffering the integral sub-window of the next pixel.

Face pixel mask recording: If a pixel passes all 40 stages, the pixel's address (row and column indices) is sent to the Mask BRAM. After all the pixels in the integral image have passed through the data flow in Fig. 7, the addresses of the detected face pixels are transferred to the host processor through the PCIe bus. Because less than 10% of the pixels are face pixels, the sparseness of the face index data greatly reduces the time needed to transfer the pixel information from the FPGA accelerator to the host processor.

4.3.2 Parallel computational components design: Fig. 9 illustrates the internal implementation of the classifier in Fig. 7. In each clock cycle, 12 integral data from the 17 × 17 window in Fig. 7 and the classifier parameters are fed into this classifier engine, which calculates the class output value. The multipliers, subtractors and adders in Fig. 9 are implemented with Xilinx's DSP48E on-chip cores (four DSP48Es for each classifier). It takes six clock cycles for one classifier to compute an output, as depicted in Fig. 9.

Although we attempted to implement more classifiers on the LX110T device, the FPGA could not satisfy the resource requirement for the additional registers and BRAMs needed for the pixel scanning operation. For example, 290% of the LX110T's registers would be required for the 16-classifier design. However, 16 parallel classifiers can be implemented on more advanced FPGAs such as the Virtex 5 LX330T. Thus, we designed, simulated and synthesised another, more aggressive (or resource-hungry) Haar stage engine for the LX330T. Fig. 10 shows the aggressive stage engine with 16 parallel Haar classifiers, where the FPGA computes 16 classifiers simultaneously. In this case, 192 integral image data are fed into the stage engine in one clock cycle. Additionally, the 17 × 17 pixel sub-window needs to be latched in every clock cycle. This requires a higher memory bandwidth and 16 times more MUXes to retrieve those features from the integral image sub-window than the single-classifier design. We successfully synthesised the design for the Virtex 5 LX330T device. We plan to demonstrate this aggressive stage engine design in the future when the hardware platform is obtained. The synthesis result in Section 5 provides a guideline for the hardware resource requirements and trade-off comparisons. It also proves that our design is scalable with hardware resources.

4.3.3 Pipelined design: The concurrent nature of hardware provides a performance boost over the software implementation. Another, more important factor in our design for increasing performance is pipelining. Fig. 8 illustrates the 28 pipeline stages in our design. A pixel therefore needs 28 clock cycles to be processed, from reading data out of the integral image memory (srcram) to the output decision. Theoretically, this 28-stage pipeline can achieve more than a 20× additional speedup over the software version. In Fig. 8, the long vertical bars indicate pipeline stages. All the data in the previous stage are latched into the next stage's registers at each pipeline stage. Fig. 8 does not show all 28 pipeline stages because of space limitations; instead, we drew the key pipeline stages. The notation 'total delay' in Fig. 8 indicates how many clock cycles have elapsed from issuing the pixel's row address to the current stage operation. The numbers are the real pipeline stages. For example, 'total delay = 18' was put on top of the third pipeline stage bar because 18 clock cycles (or pipeline stages) are required to read 17 rows of integral image data from the srcram. In reality, this third pipeline stage bar is the 18th real pipeline stage. Another point of caution is the difference between the pipeline stages and the Haar classifier's stages.

Because of the complexity of the Haar classifier algorithm, the data in these 28 pipeline stages must be carefully aligned. This means that the triggering events for each pipeline stage need to be carefully selected. For example, for the integral image pixels (in srcram) and the classifiers (in classram) to arrive at the classifier at the same time, the triggering events for the classram (the classifier counter control signals and caddr in Fig. 8) have to be 17 clocks away from the triggering events for the srcram (the counter control signals and srcaddr in Fig. 8).

5 Experiment results

Table 2 lists the execution times of the software and hardware (FPGA) implementations of the Haar classifier face detection application. Fig. 11 shows the speedup of the hardware implementation over the software counterpart. The baseline for the performance comparison is the OpenCV (v1.0) version of the Haar-classifier-based face detection software. As mentioned in Section 4.2, the application runs on a workstation with an Intel 2.66 GHz Core 2 Duo CPU and 8 GB of memory, both for the baseline and for the hardware implementation. The OS for this system is Red Hat Enterprise Linux 5. The 1-classifier FPGA implementation is synthesised and populated onto the LX110T device residing on the HiTech Global PCIe card with a 4-lane PCIe interface. The 16-classifier FPGA implementation is only simulated and synthesised, targeting the Virtex 5 LX330T device. We resized all the CMU test images to 256 × 192.

Table 2 Execution times of the Haar function and the overall application for the software implementation, the 1-classifier FPGA version and the 16-classifier FPGA version

                                  Time, s
    software Haar                 18
    software overall              18.9
    1-classifier FPGA Haar        1.8
    1-classifier FPGA overall     2.5
    16-classifier FPGA Haar       0.25
    16-classifier FPGA overall    0.95
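The speedups plotted in Fig. 11 follow directly from Table 2:

    Haar function:      18 s / 1.8 s = 10×  (1-classifier),   18 s / 0.25 s = 72×   (16-classifier)
    Whole application:  18.9 s / 2.5 s ≈ 7.6× (≈ 8×),          18.9 s / 0.95 s ≈ 19.9× (≈ 20×)

With the Haar function accounting for 18 of the 18.9 s of software execution time (f ≈ 0.95), Amdahl's Law bounds the whole-application speedup near 1/(1 − f) ≈ 21×, which is consistent with the ≈20× observed for the 16-classifier version and with the discussion that follows.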
Figure 11 Speedup over the software implementation

As Fig. 11 shows, the 1-classifier implementation provides a 10× performance speedup over the software version. The speedup for the 16-classifier FPGA implementation is 72×. The speedups of the whole face detection application are 8× for the 1-classifier and 20× for the 16-classifier version. The relatively large drop in speedup from 72× to 20× (from classifier-only speedup to whole-application speedup) shows that Amdahl's Law [29] holds. Amdahl's Law is used to find the maximum expected improvement to an overall system when only part of the system is improved. In our case, the speedup is limited by the pre- and post-processing, which form the serial part of the program.

From Table 2, we note that the software version of the face detection application achieves a performance of 5 frames/s, while the 1-classifier FPGA implementation provides 37 frames/s and the 16-classifier FPGA implementation achieves 98 frames/s. Even with the much smaller amount of resources consumed by the 1-classifier FPGA implementation, we are able to achieve real-time performance for the face detection application. Table 3 provides the resource utilisation of the two FPGA implementations.

Table 3 FPGA resource utilisation of the 1-classifier and 16-classifier implementations

6 Conclusion

We have presented a novel approach of using reconfigurable fabric to accelerate the Haar classifier function for face detection applications. Our accelerator was developed on a commercial FPGA board connected to a PCIe slot in a computer system. The 1-classifier and the 16-classifier implementations provide 10× and 72× speedups for the Haar function, respectively, over the software counterpart. Several innovations, such as algorithm adaptation to hardware, pipelined architecture design and high utilisation of parallel arithmetic units, contribute to the speedups of this non-systolic algorithm. We also confirmed that even the 1-classifier implementation, which provides a cost-effective solution, delivers real-time performance of 37 frames/s. Additionally, our FPGA-friendly 40-stage Haar classifier boasts a very high detection rate and low false positives (false alarms). We have also discussed how our approach can be made scalable for reconfigurable fabric with variable resources. This design paves the way for utilising reconfigurable hardware to accelerate other non-systolic applications. Our acceleration approach will gain more momentum as the Geneseo Initiative materialises as products in the market. As the microprocessor industry moves to multi-core architectures, our work can also be referenced to estimate the pros and cons of incorporating reconfigurable fabric in heterogeneous CMPs, given the quantitative information provided on performance benefits and required hardware costs.

7 Acknowledgments

The authors are very grateful to their colleagues at Intel: Yangzhou Du, Yimin Zhang and Tao Wang for assisting with the face-detection software; Nrupal Jani for providing the Linux driver to interface with the HiTech Global card; and Vladimir Dudnik and Alexander Kibkalo for helping with IPP.
8 References

[2] KUMAR R., TULLSEN D.M., JOUPPI N.P.: 'Core architecture optimization for heterogeneous chip multiprocessors'. Proc. 15th Int. Conf. on Parallel Architectures and Compilation Techniques, PACT'06, 2006, pp. 23–32

[3] TI: 'OMAP3525', http://focus.ti.com/docs/prod/folders/print/omap3525.html

[4] HiTech: 'HiTech Global Design & Distribution', http://www.hitechglobal.com/index.htm

[5] DRC Computer: 'DRC Computer', http://www.drccomputer.com

[6] Intel: 'Intel Integrated Performance Primitives 5.3', http://www.intel.com/cd/software/products/asmona/eng/302910.htm, 2008

[7] YANG M.-H., KRIEGMAN D.J., AHUJA N.: 'Detecting faces in images: a survey', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (1), pp. 34–58

[8] MITCHELL T.: 'Machine learning' (McGraw Hill, 1997)

[9] PAPAGEORGIOU C.P., OREN M., POGGIO T.: 'A general framework for object detection'. Sixth Int. Conf. on Computer Vision, 1998, pp. 555–562

[10] OREN M., PAPAGEORGIOU C., SINHA P., OSUNA E., POGGIO T.: 'Pedestrian detection using wavelet templates'. Proc. Computer Vision and Pattern Recognition, 1997, pp. 193–199

[11] LI L., ZHANG Y., TIAN Q.: 'Multi-face location on embedded dsp image processing system'. 2008 Congress on Image and Signal Processing, 2008, vol. 4, pp. 124–128

[12] TRIEU D.B.K., MARUYAMA T.: 'Implementation of a parallel and pipelined watershed algorithm on fpga'. FPL, 2006, pp. 1–6

[13] SALDANA G., ARIAS-ESTRADA M.: 'FPGA-based customizable systolic architecture for image processing applications'. Proc. 2005 IEEE Computer Society Int. Conf. on Reconfigurable Computing and FPGAs (ReConFig'05), 2005, vol. 3

[14] IRICK K., DEBOLE M., NARAYANAN V., SHARMA R., MOON H., MUMMAREDDY S.: 'A unified streaming architecture for real time face detection and gender classification'. Int. Conf. on Field Programmable Logic and Applications, FPL 2007, August 2007, pp. 267–272

[15] MCCURRY P., MORGAN F., KILMARTIN L.: 'Xilinx fpga implementation of an image classifier for object detection applications'. Int. Conf. on Image Processing, 2001, vol. III, pp. 346–349

[16] BING X., CHAROENSAK C.: 'Rapid fpga prototyping of gabor-wavelet transform for applications in motion detection'. Seventh Int. Conf. on Control, Automation, Robotics and Vision, ICARCV 2002, December 2002, vol. 3, pp. 1653–1657

[17] WARING C., LIU X.: 'Face detection using spectral histograms and svms', IEEE Trans. Syst. Man Cybern. B, 2005, 35, pp. 467–476

[18] WALL G., IQBAL F., ISAACS J., LIU X., FOO S.: 'Real time texture classification using field programmable gate arrays'. Proc. 33rd Applied Imagery Pattern Recognition Workshop, AIPR'04, 2004, pp. 130–135

[19] Sourceforge: 'Open Computer Vision Library', http://sourceforge.net/projects/opencvlibrary/, 2008

[20] VIOLA P., JONES M.J.: 'Robust real-time face detection', Int. J. Comput. Vision, 2004, 57, (2), pp. 137–154

[21] VIOLA P., JONES M.: 'Rapid object detection using a boosted cascade of simple features'. Proc. 2001 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, CVPR 2001, 2001, vol. 1, pp. I-511–I-518

[22] LIENHART R., KURANOV A., PISAREVSKY V.: 'Empirical analysis of detection cascades of boosted classifiers for rapid object detection'. Tech. Rep., Microprocessor Research Lab, Intel Labs, December 2002

[23] LIENHART R., MAYDT J.: 'An extended set of haar-like features for rapid object detection'. Proc. 2002 Int. Conf. on Image Processing, 2002, vol. 1, pp. I-900–I-903

[24] CMU: 'Frontal Face Images', http://vasc.ri.cmu.edu/idb/html/face/frontal_images/, 2008

[25] Intel: 'Intel Core 2 Duo Processor', http://www.intel.com/products/processor/core2duo/index.htm

[26] SUH T., LU S.-L.L., LEE H.-H.S.: 'An FPGA approach to quantifying coherence traffic efficiency on multiprocessor systems'. Proc. 17th Int. Conf. on Field Programmable Logic and Applications, August 2007, pp. 47–53

[27] EETimes: 'Geneseo Initiative', http://www.eetimes.com/news/design/showArticle.jhtml?articleID=193006384

[28] HERBORDT M., VANCOURT T., GU Y., ET AL.: 'Achieving high performance with fpga-based computing', Computer, 2007, 40, pp. 50–57

[29] HENNESSY J.L., PATTERSON D.A.: 'Computer organization and design: the hardware/software interface' (Morgan Kaufmann Publishers, Inc., 2009, 4th edn.)