FPGA Based Haar Cascade
Abstract: The authors present a novel approach of using reconfigurable fabric to accelerate a face detection algorithm based on the Haar classifier. With a highly pipelined architecture that utilises the abundant parallel arithmetic units in an FPGA, the authors have achieved real-time face detection with a very high detection rate and low false positives. The 1-classifier and 16-classifier realisations of the accelerator provide 10× and 72× speedups, respectively, over the software counterpart. Moreover, the authors' approach scales with the resources available on the FPGA and will gain more momentum as the Geneseo Initiative is introduced in the market. This work also provides an understanding of how reconfigurable fabric can be used to accelerate non-systolic vision algorithms.
applications can benefit from FPGAs as well. Third, they are erasable, so the target design can be modified and patched anytime, anywhere. Fourth, they are off-the-shelf devices and are therefore much less expensive to use than very large scale integration (VLSI) solutions such as application-specific integrated circuits (ASICs).

Our approach begins by examining the possibility of speeding up functions in Intel's Integrated Performance Primitives (IPP) library [6]. The IPP library is widely used to optimise many computation-intensive applications for multimedia, communication and computer vision. It is therefore beneficial to examine the feasibility of mapping commonly used functions in the IPP library to the reconfigurable fabric. It is hoped that by using reconfigurable fabric to implement these functions, the overall performance of a large number of applications can be improved over the pure software solution.

As an experiment, we chose face detection from among computer vision applications to verify our approach. Face detection is the computational problem of identifying and locating human frontal faces in a photo or video regardless of lighting, orientation, complexion and size. It is useful in many other vision-based applications such as digital cameras. We first examined the performance profile of the face detection program and found that the most time-consuming function (greater than 500 clock ticks for every image pixel) in its algorithm is the Haar classifier. The Haar classifier takes 93% of the total computation time. By implementing the Haar classifier in hardware and porting it onto the reconfigurable fabric, we have achieved up to a 72× speedup over the pure software approach with a high detection rate and low false positives.

2 Related work

Face detection algorithms are crucial parts in solving many face-related problems such as face recognition, expression recognition and face tracking. Face detection is considered as a classification problem of images into face or non-face. Yang et al. [7] summarised different face detection algorithms and reported a comparative analysis of them. Among several face detection algorithms, neural networks (NNs) and the support vector machine (SVM) provide the best performance in terms of detection rate and false alarm rate. NNs [8] are simplified models of neural processing in the brain, which can be used to learn a general decision hypothesis from sample data referred to as training data. The SVM [9, 10] is a linear classifier, where the decision surface is chosen to minimise the classification error. The decision surface is calculated using a small subset of the training vectors called support vectors. Owing to its classification precision and superior performance in generalisation, it is one of the most popular algorithms used by the machine learning community. However, it requires an enormous amount of computation time and it is also memory-intensive. Therefore it is imperative to provide an efficient method to train the SVM, especially for large scale problems. Our FPGA-based Haar classifier is based on SVM. There are several potential approaches to providing the required amount of computation. One example is to use a digital signal processor (DSP) in place of a general purpose CPU. Li et al. [11] provide an example of this approach. However, we do wish to integrate image processing together with general computing such as database applications, and to provide an alternative approach for potential heterogeneous CMP configurations in future processor designs. Here we employ hardware-based acceleration on a general-purpose computer. This paper aims at showing how to accelerate face detection by implementing the Haar classifier in hardware with an FPGA.

Most FPGA implementations of image processing utilise the systolic array structure of image data; thus they resemble streaming data processors [12, 13]. Irick et al. [14] used the streaming architecture to implement face detection algorithms based on NNs and achieved high performance. However, their pixel offset of 10 is unrealistically high and they did not provide a performance comparison against pure microprocessor-based software implementations. Other face or object detection implementations with FPGA either report inferior performance or have a lower detection rate and higher false alarm rate [15–18]. Our approach of implementing Haar classifiers on FPGA provides a higher frontal face detection rate and lower false alarm rate when compared with Intel's IPP and Open Computer Vision Library (OpenCV) [19] software solutions. With a highly pipelined and parallel architecture, our system achieved real-time face detection performance of 37 frames/s. Furthermore, since our design uses a commercially available peripheral component interconnect (PCI) Express (PCIe)-based FPGA card, it is comparatively easy to migrate our design to other object detection, recognition and tracking applications with similar Haar classifier functions.

3 Algorithm

3.1 Haar classifier face detection algorithm

Our face detection algorithm utilises the Haar classifier function adapted from Viola and Jones [20, 21] and Papageorgiou et al. [9, 10]. Lienhart et al. [22, 23] were the first to introduce this algorithm into Intel's IPP, which was later included in OpenCV. Viola and Jones [21] proposed a cascaded, degenerate decision tree for fast software computation while maintaining the same detection rate as other, slower single-stage classifiers. They used the AdaBoost learning algorithm to select a small number of critical visual features from a large set of potential features to train the classifiers. The features are pixel tiles similar to the rectangle regions labelled 'a' and 'b' in Fig. 1. They have different weights for different rectangle regions and describe the likelihood of such rectangle pairs being features of human frontal faces [23]. Our FPGA implementation utilises 40 classifier stages. Each classifier
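To make the cascade structure described above concrete, the following C++ sketch lays out the data involved: weighted rectangle features, weak classifiers with a trained threshold and two return values, and a sequence of stages (40 in this design), each with its own stage threshold. The type and field names are illustrative assumptions of ours, not the authors' hardware description or the OpenCV data structures.

    #include <vector>

    // Illustrative (assumed) data layout for the Haar cascade described above.
    struct Rect { int x, y, w, h; int weight; };      // weighted rectangle region ('a', 'b' in Fig. 1)
    struct Feature { std::vector<Rect> rects; };       // two or three weighted rectangles per feature
    struct Classifier {                                // one weak classifier
        Feature feature;
        int threshold;                                 // trained classifier threshold
        int v1, v2;                                    // values accumulated on pass/fail
    };
    struct Stage {                                     // one of the 40 cascade stages
        std::vector<Classifier> classifiers;
        int stage_threshold;                           // trained stage threshold
    };
    using Cascade = std::vector<Stage>;                // 40 stages in this design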
for all different feature sizes. In terms of the integral image, the feature rectangle intensity summation (RIS) of the bold-lined rectangle in Fig. 3 is calculated as (2)

RIS_bold-lined = II(x3, y3) − II(x1, y1) − II(x2, y2) + II(x0, y0)    (2)

This RIS calculation in constant time is very useful, as proved by the software implementations [21, 23]. We adopted it into our hardware implementation as well. During pre-processing, the pixel variance of the source image is calculated to compensate for lighting differences (refer to Section 2.3 in [22]).

3.1.2 Haar classifier function: The Haar classifier function is a crucial part of the whole face detection algorithm. We calculate features according to (2) and multiply them by the trained feature weights. If the result is greater than the classifier threshold, the stage value is accumulated with a value of V1; otherwise the stage value is accumulated with a value of V2. Different classifiers may have different values of V1 and V2. When each stage is finished, the accumulated stage value is compared with the stage threshold. If it is greater than the stage threshold, the source pixel moves on to the next stage. The same procedure is then repeated until the pixel passes all 40 stages. If the source pixel fails in any stage during this examination, it is discarded as a non-facial pixel. The feature weight, classifier threshold, V1, V2 and stage threshold are all trained classifier values from the Haar classifier training stage. The detailed training process is explained in [21].

3.1.3 Post-processing: After the Haar classifier functions have operated on the original source image and the scaled images, we cluster the detected face pixels across adjacent scaled images to form a final detected face rectangle. Fig. 4 illustrates the detected faces from sample pictures, which were chosen from the CMU frontal face images test set [24].

4 Implementation

4.1 FPGA implementation consideration

To achieve high performance in face detection with the Haar classifier, there are several issues to take into account when utilising reconfigurable fabric. These include how to connect the FPGA accelerator with the host microprocessor, how to partition the computations between these two computing entities, and how to optimise the classifier to best utilise the FPGA.

There are several possible options for interfacing a general purpose processor with a reconfigurable fabric. The most tightly coupled approach is to integrate the fabric with the processor pipeline by implementing special instructions with a Haar engine. This approach provides the best performance in terms of execution time. Nonetheless, it requires modification to the processor core, making it unfeasible as a general solution unless processor providers support the feature.

Another approach is to place the reconfigurable fabric as close to the processor as possible. A practically possible place is on a processor bus. In other words, the reconfigurable fabric is connected with the processor via a processor bus such as the front side bus (FSB) of an Intel processor. Heterogeneous multi-core architectures in the future would take a similar form, even though the multiple homogeneous cores in a Core 2 Duo, for example, share a large L2 cache [25]. In previous research, Suh et al. [26] implemented a communication mechanism between a Pentium-III and an FPGA via the FSB on an Intel-based server system. The communication was achieved by utilising the cache coherence protocol on the FSB. DRC Computer [5] also provides a similar solution with the HyperTransport bus on AMD-based systems. Nevertheless, as the processor's bus frequency continues to grow and the proprietary bus standard continues to evolve, building a bus-based module requires tremendous time and
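As a software reference for (2) and for the classifier rule of Section 3.1.2, the sketch below computes a rectangle intensity summation from an integral image in constant time and shows the V1/V2 accumulation and the stage-threshold test. It is a minimal model of the behaviour described in the text, assuming a row-major integral image and illustrative names; the variance-based lighting compensation is omitted.

    #include <cstdint>
    #include <vector>

    // Integral image: ii[y][x] holds the sum of all source pixels above and to
    // the left of (x, y), so any rectangle sum needs only four lookups, as in (2).
    using IntegralImage = std::vector<std::vector<uint32_t>>;

    // Rectangle intensity summation of the rectangle with corners
    // (x0,y0) top-left, (x1,y1) top-right, (x2,y2) bottom-left, (x3,y3) bottom-right.
    int64_t ris(const IntegralImage& ii, int x0, int y0, int x1, int y1,
                int x2, int y2, int x3, int y3) {
        return (int64_t)ii[y3][x3] - ii[y1][x1] - ii[y2][x2] + ii[y0][x0];  // equation (2)
    }

    // One weak classifier (Section 3.1.2): the weighted feature value is compared
    // with the classifier threshold; V1 is returned on pass, V2 on fail.
    int classify(int64_t weighted_feature_sum, int classifier_threshold, int v1, int v2) {
        return (weighted_feature_sum > classifier_threshold) ? v1 : v2;
    }

    // A stage passes when the accumulated stage value exceeds the stage threshold.
    bool stage_passes(const std::vector<int>& class_values, int stage_threshold) {
        int stage_value = 0;
        for (int v : class_values) stage_value += v;
        return stage_value > stage_threshold;
    }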
effort, due in part to the extensive I/O signalling and the difficulty of accessing proprietary documents. Moreover, as presented in [26], communication through the coherence traffic entails technical complications such as the allocation of a page in memory space and direct cache manipulation for cache line invalidation. Additionally, since a cache-to-cache transfer in coherence traffic involves one cache block transfer, fine-tuning is required to achieve effective communication.

The final and easiest option for communication is to interface the two computing entities via a standard I/O channel such as the PCIe bus. This means that the reconfigurable fabric is located on an expansion card on an I/O bus. Even though this option suffers from the highest communication latency, it provides a straightforward solution in terms of physical interface and device driver development, because I/O buses on a personal computer are industry standards. Moreover, in industry, the Geneseo Initiative [27] is extending PCIe with features that provide power savings and better support for coherent coprocessors. Therefore computer systems will soon be equipped with more coprocessor-friendly features. In our particular application, latency is not a critical factor. Our main goal is to evaluate the effectiveness of using reconfigurable fabric to accelerate algorithms, in particular non-systolic algorithms. Our result with the PCIe-based FPGA implementation can be used to extrapolate designs with other interfaces.

Based on this consideration, we chose a commercial PCIe-based card equipped with an FPGA as our acceleration platform. The PCIe card is an HTG-V5-PCIE board from HiTech Global [4], as shown in Fig. 5. It incorporates a Xilinx Virtex 5 LX110T and supports eight lanes of PCIe Gen1. The Xilinx LX110T includes digital signal processing (DSP) fabric inside the FPGA, so signal-processing-centric applications such as face detection can best utilise the reconfigurable fabric and can be ported to the FPGA in a cost-efficient way.

than 93% of the total software execution time, we extracted and ported the Haar classifier function onto the FPGA board and left the pre-processing and post-processing on the host processor. We also changed the data flow (loop order) from looping over every classifier per pixel to looping over every stage per pixel. Second, in order to reduce the resource requirement and to take advantage of the FPGA's intrinsic parallel assets, we replaced all floating point operations with integer computations. With this transformation, we could utilise the Xilinx Virtex 5's embedded DSP48E blocks and consequently accelerate the multiplications and additions in the Haar classifier function. According to our experiments, the consequently lowered data precision does not affect the final detection accuracy. A single cycle of FPGA operation is equivalent to 100–1000 software clock ticks for the same functionality. Third, we employed extensive pipelining to increase the algorithm-level parallelism and to sustain the operating frequency of the FPGA. Our implementation has as many as 28 pipeline stages for the Haar classifier algorithm (refer to Section 4.3.3). We then strove to match the 17 × 17 pixel sub-window with 16N classifiers for each stage, where N is an integer. For example, the first stage has 16 classifiers, as opposed to three classifiers in the software version. Therefore each pixel requires more computation to pass the first stage in the FPGA implementation. However, the first stage in the FPGA implementation dropped more than 90% of CMU's frontal image set [24] as non-facial pixels, as opposed to a 50% drop rate in the software version. Fourth, we implemented the Haar classifier function with reuse in mind. It consists of intellectual property building blocks such as the classifier (Fig. 9) and the stage engine (Fig. 10) that could be used for other applications with similar time-consuming Haar functions. In addition, the design is easily scalable with various parameters such as the buffer window size, the total number of stages and the number of classifiers in each stage, as detailed in Section 4.3.
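The floating-point-to-integer conversion mentioned above can be pictured as a fixed-point scaling of the trained parameters. The sketch below is only an assumed software analogue (the scale factor and helper names are ours, not taken from the paper): when both sides of the classifier comparison are scaled by the same power of two, the comparison result, and hence the detection decision, is preserved up to rounding in the least significant bits.

    #include <cstdint>
    #include <cmath>

    // Hypothetical fixed-point scale: FRAC_BITS fractional bits (value is an assumption).
    constexpr int FRAC_BITS = 12;

    // Quantise a trained floating-point parameter (feature weight, threshold, V1, V2)
    // to an integer so that DSP48E-style integer multipliers and adders can be used.
    inline int32_t to_fixed(double trained_value) {
        return (int32_t)std::lround(trained_value * (1 << FRAC_BITS));
    }

    // Floating-point classifier test: weight * rectangle_sum > threshold ?
    inline bool classifier_test_float(double weight, double rect_sum, double threshold) {
        return weight * rect_sum > threshold;
    }

    // Integer version: both sides of the comparison carry the same 2^FRAC_BITS scale,
    // so the inequality, and therefore the cascade decision, is unchanged apart from
    // rounding in the least significant bits.
    inline bool classifier_test_fixed(int32_t weight_fx, int64_t rect_sum, int64_t threshold_fx) {
        return (int64_t)weight_fx * rect_sum > threshold_fx;
    }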
personal computer system. A PC system is composed of three main components: the CPU and the north and south bridges. The accelerator is connected to a PCIe slot on the south bridge side. Fig. 6b is a system diagram in terms of PCIe. The backbone of a personal computer system is based on PCIe, and the accelerator is connected to one of the endpoints.

During the face detection processing, the host processor pre-processes images and sends the integral images and image variances to the FPGA accelerator through the PCIe bus. The FPGA stores a 32 × 32 image in a buffer and proceeds with the Haar classifier function. Afterwards it sends the detected faces' coordinates back to the host processor through PCIe.
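The host/accelerator split can be summarised by the host-side skeleton below. Every function and structure name here is a hypothetical placeholder (the paper does not expose its driver interface); the sketch only fixes the order of operations: pre-process, ship the integral image and variance data over PCIe, read back the sparse face-pixel coordinates, then post-process.

    #include <cstdint>
    #include <vector>

    struct GrayImage  { int width, height; std::vector<uint8_t> pixels; };
    struct FaceRect   { int x, y, w, h; };
    struct PixelCoord { int x, y; };

    // Hypothetical driver calls for the PCIe FPGA card (placeholders, not a real API).
    void send_to_fpga(const std::vector<uint32_t>& integral, const std::vector<uint32_t>& variance);
    std::vector<PixelCoord> receive_from_fpga();

    // Host-side responsibilities described in Section 4 (pre- and post-processing).
    std::vector<uint32_t> compute_integral_image(const GrayImage& img);    // pre-processing
    std::vector<uint32_t> compute_window_variance(const GrayImage& img);   // lighting compensation
    std::vector<FaceRect> cluster_face_pixels(const std::vector<PixelCoord>& hits); // post-processing

    std::vector<FaceRect> detect_faces(const GrayImage& frame) {
        auto integral = compute_integral_image(frame);
        auto variance = compute_window_variance(frame);
        send_to_fpga(integral, variance);          // Haar classifier runs on the FPGA
        auto face_pixels = receive_from_fpga();    // sparse list of passing pixel addresses
        return cluster_face_pixels(face_pixels);   // merge hits across scales into rectangles
    }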
Figure 8 Pipelined design of the face detection accelerator in FPGA: srcram is the source integral image BRAM; classram is the classifier features BRAM; normram is the image variance BRAM; classifier is the classifier computational engine; lesscomp is the stage threshold comparison engine
The host processor finishes the face detection algorithm with post-processing. The host PC is equipped with a 2.66 GHz Intel Core 2 Duo and 8 GB of main memory.

4.3 FPGA implementation details

4.3.1 FPGA data flow: In order to detect the frontal faces in the original image, it is necessary to scan and scrutinise each pixel in the image, which requires intense computation. The computational load could be mitigated by jumping to non-neighbouring pixels during the Haar-function operation. However, the quality of the final face detection would be greatly degraded in terms of detection rate and false alarm rate. In our work, we processed every pixel with the Haar function. Nevertheless, the detailed FPGA implementation described below does not prohibit the pixel-skipping scanning method. This implies that the results in Section 5 are conservative and thus trustworthy.

As shown in Fig. 7, the integral image data move through the Integral Image BRAM (Block RAM), 17 × 17 buffer window, 17 × 17 window, 289:12 multiplexer (MUX), classifier, stage comparator, mask BRAM and PCIe TLP (Transaction Layer Packet) in sequence. Finally, the result is delivered to the Xilinx Endpoint PCIe Core to be sent to the host processor. Meanwhile, the feature data in the Feature BRAM are fed into the MUX, classifier engine and stage comparator. The following explains the detailed operation of the data flow path.

Pixel scanning operation: During the experiment, we store part (32 × 32) of the 256 × 192 integral image in BRAM. Each pixel in the integral image has a resolution of 17 bits. Therefore, to store the 32 × 32 source integral image in the Integral Image BRAM, the storage requirement is 32 × 32 × 17 bits. The same resolution (17 bits) also applies to the 17 × 17 buffer window, the 17 × 17 window and the 289:12 MUX. The numbers in the previous paragraph are all pixel counts (not bit capacities). For example, the 289:12 MUX selects 12 pixels from 289 pixels. In reality, there are 12 multiplexers; each multiplexer can mux out one integral image pixel from 289 pixels, with 17 bits per pixel. In other words, 12 × 17 289-bit to 1-bit multiplexers are required.
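A quick check of the buffering figures quoted above, written as a few lines of C++ for convenience (the constants simply mirror the numbers in the text):

    #include <cstdio>

    int main() {
        const int window = 17;                 // 17 x 17 pixel sub-window
        const int bits_per_pixel = 17;         // integral image resolution
        const int buffer_dim = 32;             // 32 x 32 integral image slice kept on chip

        // Storage for the on-chip integral image slice: 32 x 32 x 17 bits.
        const int bram_bits = buffer_dim * buffer_dim * bits_per_pixel;   // 17408 bits

        // One srcram row holds 32 pixels x 17 bits = 544 bits.
        const int row_bits = buffer_dim * bits_per_pixel;                 // 544 bits

        // The 289:12 MUX picks 12 pixels out of the 17 x 17 = 289-pixel window,
        // i.e. 12 multiplexers x 17 bits, each bit selected from 289 candidates.
        const int mux_inputs = window * window;                           // 289
        const int one_bit_muxes = 12 * bits_per_pixel;                    // 204 289-to-1 muxes

        std::printf("BRAM slice: %d bits, row: %d bits, MUX: %d x %d:1\n",
                    bram_bits, row_bits, one_bit_muxes, mux_inputs);
        return 0;
    }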
The reason that we do not store the whole integral image in the FPGA is that the Virtex 5 LX110T is not able to hold the 256 × 192 integral image in its BRAM.

Fig. 8 provides a detailed view of the pixel scanning operation. We stored the 32 × 32 integral image in srcram (the source integral image BRAM) as 32 rows. Each row holds 544 bits (32 × 17) of data. During each clock cycle, one row (544 bits) of the integral image is read out from the srcram. The seventeen 17-bit values in this one-row integral image data are fed into the corresponding registers in the 17 × 17 buffer window. When all the registers in the 17 × 17 buffer window have been updated, the data are transferred to the 17 × 17 window.

Classifier operation: During the pixel scanning operation, the 12 features of the first classifier in the first stage are prepared and stored in the Feature BRAM, and they are supplied to the MUX, classifier and stage comparator. Meanwhile, the pixel's variance is fed into the classifier from the Image Variance BRAM. The MUX chooses the 12 integral data from the 17 × 17 window according to the three features in each classifier and feeds them to the classifier. Note that each feature determines four coordinates in the 17 × 17 window. The classifier calculates the first class value with those input data.

Stage operation: In our Haar classifier model, we have 16 classifiers in the first stage. The number 16 is a carefully and well-chosen number, mostly because about 90% of the pixels of the CMU frontal image set are dropped as non-facial pixels after being examined by 16 classifiers. Moreover, the Xilinx Virtex 5 LX330T FPGA provides the capacity to implement 16 parallel classifiers for a much better performance improvement. After 16 cycles of calculation, with the 16 classifier values accumulated as the first stage value, this stage value is compared with the stage threshold to decide whether the pixel passes the first stage or not. Because more than 90% of the pixels are non-facial pixels, the 17 × 17 window and the classifier are in most cases ready for the first-stage computation of the next pixel after finishing the first-stage computation of the previous pixel. Less than 10% of the pixels will pass the first stage. In this case, the data in the 17 × 17 window remain intact. However, the classifiers (features) from the second stage onwards will continue to feed into the classifier and the stage comparator until any stage value is less than the trained stage threshold.

Figure 10 Stage engine block diagram for the aggressive version of face detection that includes 16 parallel classifiers at the front
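The stage behaviour just described, accumulating the class values of a stage's classifiers and comparing the total against the stage threshold, with a pixel rejected at the first failing stage, can be modelled in software as follows. The loop structure follows the text; representing a weak classifier as a callable is an illustrative assumption.

    #include <functional>
    #include <vector>

    // A weak classifier is modelled abstractly as a callable that returns its
    // class value (V1 or V2) for the current sub-window; how that value is
    // computed is shown in the earlier sketch of equation (2).
    using WeakClassifier = std::function<int()>;

    struct Stage {
        std::vector<WeakClassifier> classifiers;   // 16 in the first stage of this design
        int stage_threshold;
    };

    // Cascade scan for one pixel: every stage must be passed; failing any stage
    // rejects the pixel immediately (stage scan ends and pixel scan resumes).
    bool pixel_passes_cascade(const std::vector<Stage>& cascade) {
        for (const Stage& stage : cascade) {
            int stage_value = 0;
            for (const WeakClassifier& c : stage.classifiers)
                stage_value += c();                        // accumulate V1/V2 values
            if (stage_value <= stage.stage_threshold)      // fail: non-facial pixel
                return false;
        }
        return true;                                       // passed all 40 stages: face pixel
    }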
If any pixel passes the first stage, the pixel scan mode becomes stage scan mode, in which the FPGA retrieves the pixel index and the pixel moves to the next stage to decide whether it can pass all 40 stages. During this operation, the FPGA dedicates its resources to this specific pixel, buffers the following classifier data into the classifier engine and stops buffering the integral sub-window of the next pixel.

Face pixel mask recording: If a pixel passes all 40 stages, the pixel's address (row and column indices) is sent to the Mask BRAM. After all the pixels in the integral image have passed through the data flow in Fig. 7, the addresses of the detected face pixels are transferred to the host processor through the PCIe bus. Because less than 10% of the pixels are face pixels, the sparseness of the face index data greatly reduces the time needed to transfer the pixel information from the FPGA accelerator to the host processor.

4.3.2 Parallel computational components design: Fig. 9 illustrates the internal implementation of the classifier in Fig. 7. In each clock cycle, 12 integral data from the 17 × 17 window in Fig. 7 and the classifier parameters are fed into this classifier engine, which calculates the class output value. The multipliers, subtractors and adders in Fig. 9 are implemented with Xilinx's DSP48E on-chip cores (four DSP48Es for each classifier). It takes six clock cycles for one classifier to compute an output, as depicted in Fig. 9.

Although we attempted to implement more classifiers on the LX110T device, the FPGA could not satisfy the resource requirement for the additional registers and BRAMs needed for the pixel scanning operation. For example, 290% of the LX110T's registers would be required for the 16-classifier design. However, 16 parallel classifiers can be implemented on more advanced FPGAs such as the Virtex 5 LX330T. Thus, we designed, simulated and synthesised another, more aggressive (or resource-hungry) Haar stage engine for the LX330T. Fig. 10 shows the aggressive stage engine with 16 parallel Haar classifiers, where the FPGA computes 16 classifiers simultaneously. In this case, 192 integral image data are fed into the stage engine in one clock cycle. Additionally, the 17 × 17 pixel sub-window needs to be latched in every clock cycle. This requires a higher memory bandwidth and 16 times more MUXes to retrieve those features from the integral image sub-window than the single-classifier design. We successfully synthesised the design for the Virtex 5 LX330T device. We plan to demonstrate this aggressive stage engine design in the future when the hardware platform is obtained. The synthesis result in Section 5 provides a guideline for the hardware resource requirements and trade-off comparisons. It also proves that our design is scalable with hardware resources.

4.3.3 Pipelined design: The concurrent nature of hardware provides a performance boost over the software implementation. Another, more important factor in our design for increasing performance is pipelining. Fig. 8 illustrates the 28 pipeline stages in our design. A pixel therefore needs 28 clock cycles to be processed, from reading data out of the integral image memory (srcram) to the output decision. Theoretically, this 28-stage pipeline can achieve more than a 20× additional speedup over the software version. In Fig. 8, the long vertical bars indicate pipeline stages. All the data in the previous stage are latched into the next stage's registers at each pipeline stage. Fig. 8 does not show all 28 pipeline stages because of space limitations; instead, we drew the key pipeline stages. The notation 'total delay' in Fig. 8 indicates how many clock cycles have elapsed from issuing the pixel's row address to the current stage operation. The numbers are the real pipeline stages. For example, 'total delay = 18' was put on top of the third pipeline stage bar because 18 clock cycles (or pipeline stages) are required to read 17 rows of integral image data from the srcram. In reality, this third pipeline stage bar is the 18th real pipeline stage. Another point of caution is the difference between the pipeline stages and the Haar classifier's stages.

Because of the complexity of the Haar classifier algorithm, the data in these 28 pipeline stages must be carefully aligned. This means that the triggering events for each pipeline stage need to be carefully selected. For example, for the integral image pixels (in srcram) and the classifiers (in classram) to arrive at the classifier at the same time, the triggering events for the classram (the classifier counter control signals and caddr in Fig. 8) have to be 17 clocks away from the triggering events for the srcram (the counter control signals and srcaddr in Fig. 8).

5 Experiment results

Table 2 lists the execution times of the software and hardware (FPGA) implementations of the Haar classifier face detection application. Fig. 11 shows the speedup of the hardware implementation over the software counterpart. The baseline for the performance comparison is the OpenCV (v1.0) version of the Haar-classifier-based face detection software. As mentioned in Section 4.2, the application runs on a workstation with an Intel 2.66 GHz Core 2 Duo CPU and 8 GB of memory, both for the baseline and for the hardware implementation. The OS for this system is Red Hat Enterprise Linux 5. The 1-classifier FPGA implementation is synthesised and populated onto the LX110T device residing on the HiTech Global PCIe card with a 4-lane PCIe interface. The 16-classifier FPGA implementation is only simulated and synthesised, targeting the Virtex 5 LX330T device. We resized all the CMU test images to 256 × 192.

Table 2 Execution times of the Haar function and the overall application for the software implementation, the 1-classifier FPGA version and the 16-classifier FPGA version

                                  Time, s
    software Haar                 18
    software overall              18.9
    1-classifier FPGA Haar        1.8
    1-classifier FPGA overall     2.5
    16-classifier FPGA Haar       0.25
    16-classifier FPGA overall    0.95
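The speedups plotted in Fig. 11 follow directly from Table 2:

    Haar function:      18 s / 1.8 s = 10×  (1-classifier),   18 s / 0.25 s = 72×   (16-classifier)
    Whole application:  18.9 s / 2.5 s ≈ 7.6× (≈ 8×),          18.9 s / 0.95 s ≈ 19.9× (≈ 20×)

With the Haar function accounting for 18 of the 18.9 s of software execution time (f ≈ 0.95), Amdahl's Law bounds the whole-application speedup near 1/(1 − f) ≈ 21×, which is consistent with the ≈20× observed for the 16-classifier version and with the discussion that follows.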
Figure 11 Speedup over the software implementation

As Fig. 11 shows, the 1-classifier implementation provides a 10× performance speedup over the software version. The speedup for the 16-classifier FPGA implementation is 72×. The speedups of the whole face detection application are 8× for the 1-classifier and 20× for the 16-classifier version. The relatively large drop in speedup from 72× to 20× (from classifier-only speedup to whole-application speedup) shows that Amdahl's Law [29] holds. Amdahl's Law is used to find the maximum expected improvement to an overall system when only part of the system is improved. In our case, the speedup is limited by the pre- and post-processing, which form the serial part of the program.

From Table 2, we note that the software version of the face detection application achieves a performance of 5 frames/s, while the 1-classifier FPGA implementation provides 37 frames/s and the 16-classifier FPGA implementation achieves 98 frames/s. Even with the much smaller amount of resources consumed by the 1-classifier FPGA implementation, we are able to achieve real-time performance for the face detection application. Table 3 provides the resource utilisation of the two FPGA implementations.

Table 3 FPGA resource utilisation of the 1-classifier and 16-classifier implementations

6 Conclusion

We have presented a novel approach of using reconfigurable fabric to accelerate the Haar classifier function for face detection applications. Our accelerator was developed on a commercial FPGA board connected to a PCIe slot in a computer system. The 1-classifier and the 16-classifier implementations provide 10× and 72× speedups for the Haar function, respectively, over the software counterpart. Several innovations, such as algorithm adaptation to hardware, pipelined architecture design and high utilisation of parallel arithmetic units, contribute to the speedups of this non-systolic algorithm. We also confirmed that even the 1-classifier implementation, which provides a cost-effective solution, delivers real-time performance of 37 frames/s. Additionally, our FPGA-friendly 40-stage Haar classifier boasts a very high detection rate and low false positives (false alarms). We have also discussed how our approach can be made scalable for reconfigurable fabric with variable resources. This design paves the way for utilising reconfigurable hardware to accelerate other non-systolic applications. Our acceleration approach will gain more momentum as the Geneseo Initiative materialises as products in the market. As the microprocessor industry moves to multi-core architectures, our work can also be referenced to estimate the pros and cons of incorporating reconfigurable fabric in heterogeneous CMPs, given the quantitative information provided on performance benefits and required hardware costs.

7 Acknowledgments

The authors are very grateful to their colleagues at Intel: Yangzhou Du, Yimin Zhang and Tao Wang for assisting with the face-detection software; Nrupal Jani for providing the Linux driver to interface with the HiTech Global card; and Vladimir Dudnik and Alexander Kibkalo for helping with IPP.
8 References

[2] KUMAR R., TULLSEN D.M., JOUPPI N.P.: 'Core architecture optimization for heterogeneous chip multiprocessors'. Proc. 15th Int. Conf. on Parallel Architectures and Compilation Techniques, PACT'06, 2006, pp. 23–32

[3] TI: 'OMAP3525', http://focus.ti.com/docs/prod/folders/print/omap3525.html

[4] HiTech: 'HiTech Global Design & Distribution', http://www.hitechglobal.com/index.htm

[5] DRC Computer: 'DRC Computer', http://www.drccomputer.com

[6] Intel: 'Intel Integrated Performance Primitives 5.3', http://www.intel.com/cd/software/products/asmona/eng/302910.htm, 2008

[7] YANG M.-H., KRIEGMAN D.J., AHUJA N.: 'Detecting faces in images: a survey', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (1), pp. 34–58

[8] MITCHELL T.: 'Machine learning' (McGraw Hill, 1997)

[9] PAPAGEORGIOU C.P., OREN M., POGGIO T.: 'A general framework for object detection'. Sixth Int. Conf. on Computer Vision, 1998, pp. 555–562

[10] OREN M., PAPAGEORGIOU C., SINHA P., OSUNA E., POGGIO T.: 'Pedestrian detection using wavelet templates'. Proc. Computer Vision and Pattern Recognition, 1997, pp. 193–199

[11] LI L., ZHANG Y., TIAN Q.: 'Multi-face location on embedded dsp image processing system'. 2008 Congress on Image and Signal Processing, 2008, vol. 4, pp. 124–128

[12] TRIEU D.B.K., MARUYAMA T.: 'Implementation of a parallel and pipelined watershed algorithm on fpga'. FPL, 2006, pp. 1–6

[13] SALDANA G., ARIAS-ESTRADA M.: 'FPGA-based customizable systolic architecture for image processing applications'. Proc. 2005 IEEE Computer Society Int. Conf. on Reconfigurable Computing and FPGAs (ReConFig'05), 2005, vol. 3

[14] IRICK K., DEBOLE M., NARAYANAN V., SHARMA R., MOON H., MUMMAREDDY S.: 'A unified streaming architecture for real time face detection and gender classification'. Int. Conf. on Field Programmable Logic and Applications, FPL 2007, August 2007, pp. 267–272

[15] MCCURRY P., MORGAN F., KILMARTIN L.: 'Xilinx fpga implementation of an image classifier for object detection applications'. Int. Conf. on Image Processing, 2001, vol. III, pp. 346–349

[16] BING X., CHAROENSAK C.: 'Rapid fpga prototyping of gabor-wavelet transform for applications in motion detection'. Seventh Int. Conf. on Control, Automation, Robotics and Vision, ICARCV 2002, December 2002, vol. 3, pp. 1653–1657

[17] WARING C., LIU X.: 'Face detection using spectral histograms and svms', IEEE Trans. Syst. Man Cybern. B, 2005, 35, pp. 467–476

[18] WALL G., IQBAL F., ISAACS J., LIU X., FOO S.: 'Real time texture classification using field programmable gate arrays'. Proc. 33rd Applied Imagery Pattern Recognition Workshop, AIPR'04, 2004, pp. 130–135

[19] Sourceforge: 'Open Computer Vision Library', http://sourceforge.net/projects/opencvlibrary/, 2008

[20] VIOLA P., JONES M.J.: 'Robust real-time face detection', Int. J. Comput. Vision, 2004, 57, (2), pp. 137–154

[21] VIOLA P., JONES M.: 'Rapid object detection using a boosted cascade of simple features'. Proc. 2001 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, CVPR 2001, 2001, vol. 1, pp. I-511–I-518

[22] LIENHART R., KURANOV A., PISAREVSKY V.: 'Empirical analysis of detection cascades of boosted classifiers for rapid object detection'. Tech. Rep., Microprocessor Research Lab, Intel Labs, December 2002

[23] LIENHART R., MAYDT J.: 'An extended set of haar-like features for rapid object detection'. Proc. 2002 Int. Conf. on Image Processing, 2002, vol. 1, pp. I-900–I-903

[24] CMU: 'Frontal Face Images', http://vasc.ri.cmu.edu/idb/html/face/frontal_images/, 2008

[25] Intel: 'Intel Core 2 Duo Processor', http://www.intel.com/products/processor/core2duo/index.htm

[26] SUH T., LU S.-L.L., LEE H.-H.S.: 'An FPGA approach to quantifying coherence traffic efficiency on multiprocessor systems'. Proc. 17th Int. Conf. on Field Programmable Logic and Applications, August 2007, pp. 47–53

[27] EETimes: 'Geneseo Initiative', http://www.eetimes.com/news/design/showArticle.jhtml?articleID=193006384

[28] HERBORDT M., VANCOURT T., GU Y., ET AL.: 'Achieving high performance with fpga-based computing', Computer, 2007, 40, pp. 50–57

[29] HENNESSY J.L., PATTERSON D.A.: 'Computer organization and design: the hardware/software interface' (Morgan Kaufmann Publishers, Inc., 2009, 4th edn.)