
IEEE TRANSACTIONS ON COMPUTATIONAL IMAGING, VOL. 1, NO. 1, MARCH 2015

FPGA-Based Parallel Hardware Architecture for Real-Time Image Classification

Murad Qasaimeh, Assim Sagahyroon, and Tamer Shanableh

Abstract—This paper proposes a parallel hardware architecture for real-time image classification based on the scale-invariant feature transform (SIFT), bag of features (BoF), and support vector machine (SVM) algorithms. The proposed architecture exploits different forms of parallelism in these algorithms in order to accelerate their execution and achieve real-time performance. Different techniques have been used to parallelize the execution and to reduce the hardware resource utilization of the computationally intensive steps in these algorithms. The architecture takes a 640 × 480 pixel image as input and classifies it based on its content within 33 ms. A prototype of the proposed architecture is implemented on an FPGA platform and evaluated using two benchmark datasets: 1) Caltech-256 and 2) the Belgium Traffic Sign dataset. The architecture is able to detect up to 1270 SIFT features per frame, an increment of 380 features over the best recent implementation. We were able to speed up the feature extraction algorithm by 54× and the classification algorithm by 6× compared to equivalent software implementations, while maintaining the difference in classification accuracy within 3%. The hardware resources utilized by our architecture were also less than those used by other existing solutions.

Index Terms—Field-programmable gate array (FPGA), hardware implementation, image classification, scale-invariant feature transform (SIFT).

Manuscript received November 14, 2014; revised February 26, 2015; accepted March 30, 2015. Date of publication April 23, 2015; date of current version June 26, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Andrew Lumsdaine. The authors are with the Department of Computer Engineering, American University of Sharjah (AUS), Sharjah 26666, United Arab Emirates (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCI.2015.2424077

I. INTRODUCTION

OBJECT detection and classification is the process of finding objects of a certain class, such as faces, cars, and buildings, in an image or a video frame. This task involves classifying the input image according to its visual content into a general class of similar objects. Feature-based object classification is a common image classification method in computer vision. It typically uses one of the feature extraction algorithms to extract the image's important content. Then, it uses one of the classification algorithms to label images into one of the trained categories. Image classification has many potential applications, including autonomous robots, intelligent traffic systems, human–computer interaction, quality control in production lines, and biomedical image analysis.

Many algorithms have been proposed for feature extraction and classification in the last two decades. These algorithms involve a tradeoff between the quality of the extracted features, the classification accuracy, and the computational complexity of the algorithms. As an example, the unsophisticated FAST corner detector [1] executes in 13 ms on a 1024 × 768 image on a desktop machine, but the extracted features are not robust to changes in illumination, scale, or rotation. In contrast, a software implementation of computationally expensive algorithms like SIFT [2] and HOG [3] processes the same image within 1920 ms; however, the extracted features are invariant to changes in illumination, scale, viewpoint, and rotation. The same argument applies to classification algorithms. A simple classification algorithm like naive Bayes takes noticeably less time to execute than a sophisticated algorithm like RBF-SVM, but the classification accuracy achieved by SVM is much higher than that of naive Bayes. Implementing a robust image classification system using complex algorithms like SIFT [2] and SVM [4] is therefore computationally intensive, and it is very hard to reach real-time performance (33 ms).

To address the complexity problem, a number of simplified versions of the SIFT algorithm have been proposed, such as the SURF [5] and fast SIFT [6] algorithms. SURF reduces the complexity of the SIFT algorithm by approximating the Laplacian of Gaussian (LoG) using filters in the orientation assignment and feature description steps. This reduces the computation time from 1036 ms for SIFT to 354 ms for SURF on a standard Linux PC (Pentium IV, 3 GHz) [5]. Many SURF hardware implementations have been proposed to accelerate the algorithm [7]–[9]. Even though SURF reduces the computational time for extracting the local features, it produces fewer reliable features in comparison with the SIFT algorithm. Moreover, the hardware implementations require a considerable amount of internal memory, almost four times the memory requirement of SIFT [10], [11].

Other researchers have tried to accelerate the SIFT algorithm using GPU platforms [12]–[14]. The implementation in [12] achieved a speedup of 4–7× over an optimized CPU version when tested on a dataset of 320 × 280 pixel images. The power consumption of this implementation on the NVIDIA Tegra 250 development board was 3383 mW. In [14], another GPU-based implementation was proposed. It was able to detect SIFT features in images (640 × 480 pixels) within 58 ms, which is around 20 frames/s. GPU-based implementations can accelerate the SIFT algorithm to near real-time performance, but they require an excessive amount of hardware resources and consume too much power compared to other hardware platforms. This makes GPU implementations unsuitable for portable embedded systems with limited power. Other solutions tried to accelerate the SIFT algorithm


using multicore processors [15]. The results showed that the performance achieved by a multicore CPU is almost the same as that of the GPU implementations.

Other attempts have been made to accelerate the algorithms using FPGA hardware architectures [16]–[21]. However, the current implementations either work on small image resolutions (320 × 240 pixels), like the work in [16], [17], [19], [21], or compute a small number of SIFT features within real-time constraints. In [22], the implementation can detect up to 890 SIFT features per frame within 33 ms. Other implementations use large hardware resources [18]–[20], on the order of 35 000–45 000 LUTs. Therefore, there is a strong need for a customized parallel hardware architecture that accelerates the execution of the image classification algorithms (SIFT, BoF, SVM) on VGA images to reach real-time performance with low hardware resource utilization and high classification accuracy.

In this work, new techniques are used to accelerate the computationally intensive parts of the SIFT feature extraction and SVM classification algorithms and to reduce the hardware utilization. The main contributions of this paper are as follows. 1) To the best of the authors' knowledge, this is the first complete hardware architecture that implements the SIFT, BoF, and SVM algorithms on an FPGA. 2) Reducing the hardware utilization in the SIFT's Gaussian scale space (GSS) step by using the multiplierless multiple constant multiplication (MCM) with common subexpression elimination algorithm. 3) Our SIFT module architecture is able to detect and describe up to 1270 SIFT features in 33 ms, an increment of 380 features over the best recent implementation. 4) Introducing a new design for multiported memories to store the SIFT descriptor histogram values by reordering the SIFT's 128 values to utilize the fabricated block RAMs in the FPGA. 5) To the best of our knowledge, no other research has successfully used datasets with as much image variation as Caltech-256 and the KUL Belgium traffic sign classification dataset to verify a hardware image classification system.

This paper is organized as follows. Section II reviews the algorithms used in our architecture. Section III presents the software implementation of these algorithms. Section IV provides a detailed description of the proposed architecture and its various building blocks. Section V discusses the experimental results and architecture performance, and Section VI concludes the paper.

II. REVIEW OF THE ALGORITHMS

In this work, we used the scale-invariant feature transform (SIFT) for feature extraction, the bag of features (BoF) algorithm for image representation, and the support vector machine (SVM) algorithm for classification. This section reviews these three algorithms.

A. Scale-Invariant Feature Transform (SIFT)

SIFT is an algorithm used to extract and describe local features in images. It was proposed by Lowe [2] in 2004. The SIFT algorithm can be divided into two main stages: keypoint detection (KPD) and keypoint description. In the first stage, the image is scanned to search for distinctive and repeatable points called keypoints. This stage consists of three subtasks: GSS generation, local extrema detection, and KPD. Descriptor generation is the second stage. It can be divided into two subtasks: dominant orientation assignment and 128-d descriptor generation.

1) GSS Generation: In the first step, the input image I(x, y) is convolved with a series of Gaussian filters G(x, y, σi) to build a GSS as defined in (1) and (2), where σi is the Gaussian filter scale, L(x, y, σi) is the filtered image, and i is a scale index. The ∗ in (1) is the 2-D convolution operation in x and y, and G(x, y, σi) is the Gaussian kernel. In this work, we used six scales (σ0, kσ0, k²σ0, k³σ0, k⁴σ0, k⁵σ0), where σ0 = 1.6 and k = ∛2. After computing the Gaussian filtered images, the next step is to compute the difference of Gaussians (DoG) by subtracting each two consecutive images as defined in (3)

L(x, y, σi) = G(x, y, σi) ∗ I(x, y)    (1)

G(x, y, σ) = (1/(2πσ²)) e^(−(x² + y²)/(2σ²))    (2)

D(x, y, σi) = L(x, y, kσi) − L(x, y, σi).    (3)

2) Local Extrema Detection: In this step, the DoG images are scanned to find the candidate keypoints. Each pixel in the D(x, y, σi) image at location (x, y) is compared with its 3 × 3 neighbors in the same scale and the adjacent scales. If the pixel is a local maximum or local minimum out of the total 26 neighboring pixels, it is considered a candidate keypoint. This operation is performed for every pixel in the DoG images, and what results is a list of keypoint candidates.
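For illustration, the 26-neighbor test reduces to a few lines of software. The following C sketch is our own illustration (not the paper's hardware code); it assumes a 3 × 3 × 3 window of DoG samples ordered (scale, y, x):

/* Check whether the center sample dog[1][1][1] of a 3x3x3 DoG
 * neighborhood (scale, y, x) is a local extremum, i.e., strictly
 * greater or strictly smaller than all 26 neighbors. */
int is_extremum(const float dog[3][3][3])
{
    float c = dog[1][1][1];
    int is_max = 1, is_min = 1;
    for (int s = 0; s < 3; s++)
        for (int y = 0; y < 3; y++)
            for (int x = 0; x < 3; x++) {
                if (s == 1 && y == 1 && x == 1)
                    continue;              /* skip the center itself */
                if (dog[s][y][x] >= c) is_max = 0;
                if (dog[s][y][x] <= c) is_min = 0;
            }
    return is_max || is_min;
}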
3) Keypoint Detection: The goal of this step is to eliminate the candidate keypoints that have low contrast or are poorly localized along edges. To detect a low-contrast keypoint, the value of the pixel is compared with a predefined threshold. If the value is less than the threshold, the keypoint is rejected. To find a poorly localized peak, a keypoint is tested using the inequality defined in (4), where H is the Hessian matrix computed as defined in (5), Tr(H) is the trace of H, Det(H) is the determinant of H, and r is a constant value

Tr(H)²/Det(H) < (r + 1)²/r    (4)

H = | Δxx  Δxy |
    | Δxy  Δyy |.    (5)

4) Orientation Assignment: The gradient magnitude and orientation are computed for all pixels around the stable keypoints. The gradient is computed in both the horizontal and vertical directions as defined in (6) and (7). The gradient magnitude and gradient orientation are computed from Δx and Δy as given in (8) and (9)

Δx = (L(x + 1, y) − L(x − 1, y))/2    (6)

Δy = (L(x, y + 1) − L(x, y − 1))/2    (7)

m(x, y) = √(Δx² + Δy²)    (8)

θ(x, y) = tan⁻¹(Δy/Δx).    (9)

5) Descriptor Generation: To compute the SIFT descriptors, there are two main tasks: dominant orientation computation and descriptor generation. The dominant orientation is computed by building the gradient orientation histogram around a keypoint. The gradient orientations in a region are mapped into one of 36 bins, where each bin represents 10°. At the end, the bin with the largest count represents the dominant orientation. After computing the dominant orientation, the coordinates of the pixels around the keypoint are rotated relative to the dominant orientation. A SIFT descriptor is computed by dividing the region around the keypoint into 4 × 4 squares and building a gradient histogram over 8 bins, where each bin covers 45°. The gradient magnitudes are weighted by a Gaussian filter prior to building the histogram. Finally, the 16 histograms with 8 bins each are concatenated in the SIFT descriptor. The SIFT descriptor has 4 × 4 × 8 values.

B. Bag of Features (BoF)

BoF is one of the popular image representation methods [23]. It is used to represent images as orderless collections of local features. The idea is to quantize each of the extracted keypoints into one of the visual words (points in the feature space) and then to represent each image by a histogram of the visual words. The process starts by clustering the SIFT descriptors in their feature space (128-dimensional space) into a large number of clusters by using the K-means clustering algorithm. The centroids of these clusters are the BoF visual words. The visual words represent a particular local pattern shared by the keypoints in that cluster. The next step is to assign each descriptor to its nearest cluster center. The normalized histogram of the quantized features is the BoF model representation of that image.
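A minimal software sketch of this quantization step (our own illustration under assumed array shapes and function names, not code from the paper) assigns each 128-d descriptor to the nearest of N centroids and increments the corresponding histogram bin:

#include <float.h>

#define DESC_DIM 128

/* Build a BoF histogram: for each of k descriptors, find the nearest
 * of n cluster centroids (squared Euclidean distance) and increment
 * that visual word's bin. hist must hold n counters, zeroed by caller. */
void bof_histogram(const float *desc, int k,
                   const float *centroids, int n, unsigned *hist)
{
    for (int i = 0; i < k; i++) {
        const float *d = desc + i * DESC_DIM;
        int best = 0;
        float best_dist = FLT_MAX;
        for (int c = 0; c < n; c++) {
            float dist = 0.0f;
            for (int j = 0; j < DESC_DIM; j++) {
                float diff = d[j] - centroids[c * DESC_DIM + j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        hist[best]++;
    }
}

Normalizing hist by the number of descriptors k then yields the BoF vector described above.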
C. Support Vector Machine (SVM)

SVM is a machine-learning algorithm proposed by Vapnik in 1990 [4]. The algorithm's idea is to find the decision hyperplane that defines the decision boundary between two different classes. Given a set of training examples, each labeled as one of two categories, D = {(x, y) | x → data sample, y → class label}, the SVM algorithm finds the optimal hyperplane that splits the training data into the two categories. Any new example can then be classified based on its location relative to that hyperplane. SVMs utilize a technique called the kernel trick that represents the data in a higher dimensional space than the original feature space. This mapping makes it easier to find a separating hyperplane in nonlinearly separable data. The most common kernels are the linear, polynomial, radial basis function (RBF), and sigmoid kernels, as given by

Linear: K(x, z) = x · z    (10)

Polynomial: K(x, z) = ((x · z) + 1)^d,  d > 0    (11)

RBF: K(x, z) = exp(−‖x − z‖²/(2σ²))    (12)

Sigmoid: K(x, z) = tanh(κ(x · z) + Φ).    (13)

SVM was originally designed as a binary classification algorithm, but it is possible to extend it to multiclass classification. The most common multiclass SVM methods are the "one against all" and the "pairwise" classifier methods. In this work, we used one-against-all multiclass SVM with an RBF kernel.

Fig. 1. Five classes from the Caltech-256 dataset.

III. SOFTWARE IMPLEMENTATION

We used a software implementation of the SIFT–BoF–SVM algorithms for three reasons. First, for validation: we compare the results generated by our hardware implementation with the software implementation's results. Second, for performance measurement: we measure the processing time in software and compare it with the processing time of the hardware implementation. Third, to choose the best classification algorithm for our system by conducting a comparison between six different classification algorithms.

In the software implementation, we used an open source SIFT library implemented in C by Hess [24]. For the SVM classification algorithm, we used LibSVM [25], an open source C++ library implemented by Chang et al. The validation and performance measurement parts are reported in the experimental result section. In this section, we elaborate on why we chose RBF-SVM over the other classification algorithms.

Selecting the best classification algorithm for our system was an important step. Therefore, we performed an experiment to find out which classification algorithm achieves the highest accuracy within an acceptable processing time. The algorithms tested are: K-nearest-neighbor (KNN), naïve Bayes, decision tree, Adaboost, linear SVM, and nonlinear SVM. For this experiment, we used the Caltech-256 [26] benchmark dataset.

We used five classes: airplane, human face, motorbike, horse, and watch. Fig. 1 shows example images from each category. To find out which classification algorithm has the best accuracy, we built the learning curve for each classifier. The learning curve shows how the classifier performance is affected by increasing the size of the training set. By training the classifiers on datasets of sizes (50, 100, 200, 300, 400, and 500) and measuring their accuracy, we obtained the results of Fig. 2.

The RBF-SVM classifier outperforms the other classifiers on all training sizes. In terms of accuracy, the RBF SVM is followed by the Adaboost algorithm, which is then followed by the linear SVM. The naïve Bayes and decision tree classifiers have poor classification rates.

To assess how the processing time is affected by increasing the testing set size, we measured the processing time on datasets of sizes 50, 100, 200, 300, 400, and 500. The results are shown in Fig. 3.

Fig. 2. Comparison between six classification algorithms.

Fig. 3. Testing processing time of six classification algorithms.

The KNN classifier has the highest processing time, followed by the RBF SVM classifier and the linear SVM. The Adaboost and naïve Bayes classifiers have lower processing times. In this work, we opted to use the RBF SVM classification algorithm due to its high classification accuracy. We then use techniques to accelerate the processing time of the RBF-SVM algorithm using a hardware implementation to reach real-time performance.

IV. HARDWARE ARCHITECTURE

The proposed pipelined architecture consists of three major modules: 1) a SIFT module; 2) a BoF module; and 3) an SVM module. The SIFT block receives its input as a stream of pixels from the source image and performs the feature extraction operation. The BoF block converts the set of extracted SIFT features into a BoF vector. The SVM block takes the BoF vector and classifies it into one of the trained classes.

A. SIFT Module Implementation

The high level block diagram of the proposed SIFT architecture is shown in Fig. 4. It consists of four main modules, namely, GSS generation, KPD, gradient magnitude and orientation generation (GMO), and the keypoint description module (KDS). The first two modules represent the SIFT feature detector, while the last two represent the SIFT feature descriptor. The primary task of the GSS module is to compute the Gaussian-filtered images and the difference of the Gaussian images. The input to this module is a stream of pixels from the source image, while its output is sent to the KPD module and the GMO module at the same time. The KPD module computes the stable keypoints and stores them in a FIFO buffer, while the GMO module computes the gradient magnitude and orientation from the Gaussian-filtered image. The KDS module reads one stable keypoint at a time and computes the SIFT descriptor based on the gradient values. In the next subsections, the internal architecture of each module is presented.

Fig. 4. Overall architecture of the proposed SIFT hardware.

1) Implementation of GSS Module: The GSS module computes the Gaussian filtered images and the difference of Gaussian images (DoG) from the input image. First, it convolves the input image with a series of Gaussian filters to build the first octave. Then, it reduces the image size by half and repeats the same operations in order to compute the second octave's images, and so on. The final step is to compute the DoG images by subtracting each two adjacent Gaussian filtered images in the same octave. In this work, we used three octaves with six scales in each octave.

There are two approaches in the literature for generating the GSS; each one has its advantages and disadvantages. The first is the cascade filtering approach used in [16], [17]. In this approach, the Gaussian-filtered image G(x, y, σi+1) is generated by convolving G(x, y, σi) with a small kernel. This method reduces the number of multipliers required in each Gaussian filter. However, it needs a large amount of memory to save the intermediate results between each two filters. The second is the direct filtering approach used in [18], [22]. This method generates all Gaussian-filtered images directly from the source image using Gaussian kernels of different sizes. These kernels are large, so this method needs a large number of multipliers, but there is no need for memory to save any intermediate results.

In our GSS implementation, we modified the second approach to take advantage of its low memory requirement. To reduce the hardware utilization caused by the large number

Fig. 5. GSS's first octave module.

of multipliers, we used the multiplierless MCM with common subexpression elimination algorithm [27] to implement the Gaussian filters in the GSS module. Moreover, we used the separability and symmetry properties of the Gaussian filter to reduce hardware resources, by separating each Gaussian filter into two smaller 1-D vertical and horizontal filters.

The high level architecture of the first octave in the GSS module is shown in Fig. 5. The architecture consists of buffer lines to store the input pixels and two 1-D Gaussian filter blocks. The buffer lines consist of 25 FIFO buffer lines (the largest mask size). Each buffer line consists of 640 (image width) elements, with each element represented as 8 bits (grayscale values). The FIFO buffers are implemented using the Xilinx LogiCORE IP FIFO Generator.

The module has two states: loading and computation. In the loading state, the stream of pixels is shifted into the buffer lines one value to the right every clock cycle, as shown in Fig. 5. After 24 × 640 + 1 clock cycles, the loading state ends and the computation state starts. In the computation state, the first block (vertical 1-D Gaussian filter) reads the leftmost 25 pixels from the buffer lines (pixels 0, 1, 2, . . . , 24) and computes the first pixel of each Gaussian filter image (S0_1D, S1_1D, . . . , S5_1D). The results from the first block are buffered in 6 buffer lines with 25 elements each. After computing 25 valid values, the second block (horizontal 1-D Gaussian filter) reads each 25 values and computes the corresponding 1-D horizontal filtered pixels. In the next clock cycle, a new pixel from the source image is shifted into the buffer lines and the next sliding window is computed. This operation is repeated until the whole source image has been shifted into the GSS module.

The architecture of the vertical and horizontal 1-D Gaussian filter block is shown in Fig. 6. The architecture contains 13 blocks to compute the multiplication of each pixel with 6 different constants from the 6 different Gaussian filters. We used the symmetry property of Gaussian filters to reduce the hardware utilization by adding the pixels before multiplying them with the corresponding filter constants.

Fig. 6. 1-D Gaussian filter block.

Fig. 7 shows the constants of the six Gaussian filters (9 × 9, 11 × 11, 13 × 13, 15 × 15, 21 × 21, 25 × 25). All masks are extended to 25 values by appending zeros to the left and right. These constants are generated using MATLAB; the multiplierless MCM with common subexpression elimination algorithm is then applied to find the optimized tree architecture of each block. Figs. 8 and 9 show the tree implementations for block0 and block1, respectively.

The idea of using the multiplierless MCM algorithm with common subexpression elimination is to represent the multiplication operation as a number of shift, addition, and subtraction operations. The common subexpression elimination method is used to extract the common parts among these constants in order to minimize the number of multiplications needed.
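To make the idea concrete, take the constant 0.205 from the filter masks of Fig. 7. In a Q12 fixed-point format, round(0.205 × 4096) = 840 = 2^9 + 2^8 + 2^6 + 2^3, so the product needs only shifted copies of the input and three adders. The C model below is our own illustration of such a block, not the generated hardware:

#include <stdint.h>

/* Multiplierless constant multiplication: y = x * 0.205 in Q12
 * arithmetic. round(0.205 * 4096) = 840 = 512 + 256 + 64 + 8, so the
 * product is formed from four shifted copies of x with no multiplier.
 * Common subexpressions such as (x << 9) + (x << 8) can be shared
 * with other constants that contain the same bit pattern. */
static inline int32_t mul_0205_q12(int32_t x)
{
    int32_t y = (x << 9) + (x << 8) + (x << 6) + (x << 3);
    return y >> 12;    /* rescale back to the input format */
}

In a synthesizable description, each shift is just wiring, so the whole constant multiplication costs only adders.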
Figs. 8 and 9 show the architectures of Block0 and Block1. In Fig. 9, the input X is the summation of two pixels in the 25 buffers labeled with indices 11 and 13. The block's output is the result of multiplying X by the corresponding constant from the Gaussian filter mask in Fig. 7. That is, Y1 = X × 0.205, Y2 = X × 0.175, Y3 = X × 0.145, Y4 = X × 0.119, Y5 = X × 0.096, and Y6 = X × 0.078. The other blocks are implemented in the same way.

The output of the GSS module is the DoG images. The DoG images are generated by subtracting each two adjacent Gaussian filtered images in the same octave. Each pixel in these images is represented by an integer part and a fraction part. To find the optimal number of bits, an analysis was conducted to assess the effect of the number of bits on the accuracy of the resulting DoG image and on the hardware utilization. The DoG pixels are represented by 8 to 15 bits.

Fig. 10 shows how the hardware resources increase almost linearly with the number of bits. On the other hand, the error decreases with a sharp slope between 8 and 11 bits, after which the improvement becomes negligible. These results indicate that the resources utilized by the proposed hardware module are proportional to the number of bits. To reduce the FPGA resource utilization without significantly sacrificing accuracy, we found that the optimal number of fraction bits is 11. Therefore, the output pixels of the GSS module are represented with 20 bits: 11 bits for the fractional part and 9 bits for the integer part.

2) Implementation of the KPD Module: The KPD module receives its input from the GSS module. The input is a stream of DoG image pixels, one pixel every clock cycle. The module shifts and stores the pixels in buffer lines and then detects keypoints by analyzing the DoG images. The module's output is the keypoint x and y positions.

In this module, we parallelized the process of checking for candidate keypoints by using three KP detector blocks working together. Each block analyzes three DoG images and checks if

Fig. 7. 1-D Gaussian filters values.

Fig. 8. Block 0 in 1-D Gaussian filter block.

Fig. 9. Block 1 in 1-D Gaussian filter block.

Fig. 10. Effect of the number of bits on the accuracy of the GSS module.

the current pixel in the second image is a candidate keypoint, as shown in Fig. 11. The module has two states: loading and checking. In the loading state, the module shifts the input pixels into the buffer lines. After (2 × image_width + 3) clock cycles, the buffer lines' content becomes valid and the checking state begins.

Fig. 11. KPD module architecture.

In the checking state, each keypoint detector block compares the pixel in the center with its surrounding neighbors. If the current pixel is the maximum or the minimum out of the 27 pixels, then the pixel is a candidate keypoint. Most of the previous implementations skipped the local minima to reduce the hardware cost and to speed up the process. This simplification could overlook some important keypoints. Therefore, to reach the highest accuracy, we compute both the local maxima and minima in this module.

3) Implementation of Gradient Magnitude and Orientation Module: The gradient generation module is the third module in the SIFT architecture. It is used to compute the gradient magnitude and orientation from the Gaussian filtered images. The module's architecture consists of two buffer lines, three adders, two multipliers, and two CORDIC IP cores. The two buffer lines are connected to three pixel elements as shown in Fig. 12 to generate the values of G(x, y + 1), G(x, y − 1), G(x + 1, y), and G(x − 1, y) after (2 × image_width + 3) clock cycles.
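In floating point, the function implemented by this module is exactly (6)–(9). A reference C version (our own sketch for checking the hardware against software; it assumes a row-major image with stride w) is:

#include <math.h>

/* Reference gradient computation per (6)-(9) for an interior pixel
 * (x, y) of a Gaussian-filtered image L with row stride w. The
 * hardware replaces sqrtf/atan2f with CORDIC IP cores. */
void gradient(const float *L, int w, int x, int y,
              float *mag, float *ori)
{
    float dx = (L[y * w + (x + 1)] - L[y * w + (x - 1)]) / 2.0f;
    float dy = (L[(y + 1) * w + x] - L[(y - 1) * w + x]) / 2.0f;
    *mag = sqrtf(dx * dx + dy * dy);
    *ori = atan2f(dy, dx);    /* range (-pi, pi], like the atan2 core */
}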

Fig. 12. Gradient magnitude and orientation module architecture.

TABLE I. DATA FORMAT IN THE GRADIENT GENERATION BLOCK

In this implementation, the square root and inverse tangent functions are computed using two Xilinx CORDIC IP cores. To make these IP cores work correctly, we modified the input and output data formats to match the IP core inputs and outputs. The input format of the atan2 IP core is a fixed-point number with a 2-bit integer part and a 14-bit fraction part (2.14) for X_atan2 and Y_atan2. The output is an angle with the format 3.13 and range (−3.14 to 3.14). The input of the sqrt IP core is a 20-bit unsigned integer interpreted as a 9.11 fixed-point number, and the output is an unsigned integer with a 20-bit width. Table I summarizes the input and output data formats.
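The formats in Table I follow the usual Qm.n convention. The C helpers below are our own illustration of the conversion used when checking the cores against software (n = 14 for the 2.14 atan2 inputs, n = 11 for the 9.11 sqrt input):

#include <stdint.h>
#include <math.h>

/* Convert between float and a Qm.n fixed-point value with n fraction
 * bits, rounding to nearest. */
static inline int32_t float_to_q(float v, int n)
{
    return (int32_t)lrintf(v * (float)(1 << n));
}

static inline float q_to_float(int32_t q, int n)
{
    return (float)q / (float)(1 << n);
}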
In the atan2 IP, we used the coarse rotation module to extend the range of the CORDIC from the first quadrant (+π/4 to −π/4 rad) to the full circle. We also used the parallel architectural configuration with single-cycle data throughput instead of the word-serial architectural configuration intended for small area. In this implementation, the output becomes valid (2 × image_width + 3) clock cycles after presenting the input to the module. We used a valid output signal to indicate when the output becomes valid. The gradient magnitude and orientation generated by this module go to the dominant orientation generation module.

4) Dominant Orientation Module: The dominant orientation module computes the principal orientation of each stable keypoint. The module's inputs are the keypoint's position (KP_in) and the gradient magnitude and orientation (Gradient_M, Gradient_O). The dominant orientation module reads one keypoint at a time and computes the principal orientation from the (17 × 17) pixels around the keypoint. The module also extracts the gradient magnitude and orientation values in the region around the keypoint and sends them to the keypoint descriptor module. Fig. 13 shows the module's inputs, outputs, and its internal structure.

Fig. 13. Dominant orientation generation module.

The module architecture consists of 2 × 17 MUXs (2 × 1), 17 buffer lines (640 values each), 17 FIFOs (17 elements each), 17 MUXs (17 × 1), two position counters (X, Y), and the Dom_Ori block. The module has four states: 1) initialization; 2) window_shifting; 3) searching; and 4) loop_back, as shown in Fig. 14. Initially, all buffer lines are empty; when the valid gradient magnitude and orientation generated by the GMO module become available, the module starts shifting these values into the buffer lines.

Fig. 14. State machine diagram of dominant orientation module.

After 640 × 17 clock cycles, the buffer lines become full and the window_shifting state starts. In this state, a position counter (X, Y) is used to keep track of the current window position. Every clock cycle, a new pixel is shifted into the buffer lines and the counter X is incremented by 1; after every 640 clock cycles, the counter Y is incremented by 1.

When a new keypoint is found, the searching state begins. First, the Kp_x and Kp_y values are updated with the new keypoint position. Second, every clock cycle the current position counter value (X, Y) is compared to (Kp_x, Kp_y). If they are equal, then the loop_back state starts. In this state, the MUX selectors are changed to (1) and a disable signal is sent to the GSS

module to stop reading from the input image. After 640 clock cycles, the 17 × 17 window around the keypoint will have been shifted into the Dom_Ori module and the buffer line values stay the same.

The Dom_Ori module reads 17 new gradient magnitude values and 17 new gradient orientation values every clock cycle. The data index in our implementation is (−8, 8), which is the location relative to the keypoint position. Each gradient orientation value is connected to a 10° decoder module. The task of the decoder is to find the correct bin out of the 36 bins. Each gradient magnitude is multiplied with the corresponding Gaussian weight using 17 multipliers. The output is connected to a switching circuit that updates the correct histogram values. After 17 clock cycles, the histogram includes data from the whole window. The max block finds the bin with the largest value in the histogram and writes it to the output. When the module finishes the current keypoint, it reads the next keypoint and repeats the whole process.

5) Descriptor Generation Module: The SIFT descriptor generation process involves four tasks: coordinate rotation, Gaussian weight generation, trilinear interpolation, and normalization. The first three tasks are repeated N times, where N is the number of pixels in the window around a keypoint. After N iterations, the normalization step is performed to obtain the 128-element SIFT descriptor vector.

Fig. 15. SIFT descriptor module architecture.

The block diagram of the SIFT descriptor module is shown in Fig. 15. It consists of four main blocks: the rotation module, the Gaussian weight generation module, the trilinear interpolation module, and the normalization block. The module's inputs are the pixel coordinates (X, Y), the gradient orientation (Op), the gradient magnitude (Mg), and the keypoint dominant orientation (Og). The module's output is the SIFT vector with 128 elements. The rotation module rotates the gradients within the region around a keypoint relative to the principal orientation. It takes the pixel's gradient orientation (Op) and dominant orientation (Og) as inputs and generates the rotated coordinates (Xr, Yr) and the rotated pixel gradient orientation (Or) based on (14)–(16), where SBP is a constant equal to three times the keypoint scale (SBP = 9.6) and NBO is the number of orientation bins (NBO = 8)

Xr = (cos(Og) × X + sin(Og) × Y)/SBP    (14)

Yr = (−sin(Og) × X + cos(Og) × Y)/SBP    (15)

Or = NBO × (Og − Op)/(2π).    (16)
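A floating-point reference for (14)–(16) reads as follows. This is our own sketch, not the hardware description: the hardware uses CORDIC sine/cosine and fixed-point arithmetic, and the negative-angle wrap below uses 2π, the standard convention:

#include <math.h>

#define SBP 9.6f       /* three times the keypoint scale */
#define NBO 8.0f       /* number of orientation bins */
#define TWO_PI 6.2831853f

/* Rotate a pixel offset (x, y) and gradient orientation op into the
 * keypoint's frame, given the dominant orientation og; see (14)-(16). */
void rotate_sample(float x, float y, float op, float og,
                   float *xr, float *yr, float *or_bin)
{
    float c = cosf(og), s = sinf(og);
    *xr = (c * x + s * y) / SBP;
    *yr = (-s * x + c * y) / SBP;
    float d = og - op;
    if (d < 0.0f) d += TWO_PI;      /* keep the angle positive */
    *or_bin = NBO * d / TWO_PI;     /* fractional bin index, eq. (16) */
}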

The division operations in (14)–(16) are converted to multiplications to reduce the hardware utilization. The sine and cosine components are computed using the Xilinx LogiCORE IP CORDIC core. The input angles are expressed as fixed-point numbers with (3.13) format. The outputs (sine, cosine) are expressed as a pair of fixed-point 2's complement numbers with format (2.14). The orientation generation block consists of one comparator, one subtractor, and one adder. First, the gradient orientation (Op) is subtracted from (Og). If the result is less than zero, π (3.1416 = 16'b0110010010001000) is added to the result to get its positive equivalent within the range (0–2π). Finally, the output is multiplied by the constant 1/(2π) to generate the rotated angle (Or), as given by (16).

The Gaussian weight generation module takes the rotated coordinates Xr and Yr as input and generates the proper Gaussian weight value. The module consists of a 16-element look-up table (LUT) and one multiplier. The LUT stores the 1-D Gaussian filter coefficients. The final output (Wg) is determined by multiplying the two values obtained from the LUT based on Xr and Yr. The trilinear interpolation module distributes the result of multiplying Wg × Mg into eight adjacent bins in the SIFT descriptor histogram based on the Xr, Yr, and Or values. The eight values with their corresponding eight addresses are then sent to the histogram memory.

Fig. 16. Histogram memory implementation.

Implementing multiported memories in an FPGA is a challenging task, because the block RAMs included in the FPGA fabric typically have only two ports. In the trilinear interpolation module, we had to update eight data elements every clock cycle. Therefore, the memory in the SIFT descriptor module should have eight data inputs with eight address lines. In our architecture, we want to implement a histogram memory that provides multiple ports for input and output.

To solve this problem, we reordered the SIFT's 128 values into 8 block RAMs with 16 elements each, where the elements in each block are never accessed at the same time. Therefore, we can implement each block with one FPGA BRAM. The top 8 blocks in the first line of Fig. 16 represent the normal layout of the SIFT vector elements in memory, while the lower set represents our implementation.

In the upper set, the gray elements (0, 2, 8, 10, 32, 34, 40, 42, 64, 66, 72, 74, 96, 98, 104, 106) will never be accessed at the same time. Therefore, we reordered these elements and put them in one block memory, as shown in the lower set. The same process was applied to the other seven memory blocks. This memory unit is interfaced with the trilinear interpolation module. The output from the trilinear interpolation module is eight address lines and eight data elements. Before updating the elements, we had to translate the input addresses to match our implementation. Therefore, we designed a circuit (address converter) that converts the input address into two parts. The first part represents the block number (0–7) and the second part represents the address inside the block (0–15).
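The address converter can be modeled in C as below. This is a hypothetical illustration: the exact bank function follows the permutation of Fig. 16, which cannot be recovered from the text alone, so a placeholder mapping is shown:

#include <stdint.h>

/* Model of the address-converter circuit: split a 7-bit SIFT
 * histogram address (0-127) into a bank id (0-7) and an offset
 * (0-15). The bank function below is a placeholder; the real design
 * permutes elements as in Fig. 16 so that the eight addresses issued
 * by the trilinear interpolation in one cycle always land in eight
 * different banks. */
typedef struct { uint8_t bank; uint8_t offset; } bram_addr_t;

static inline bram_addr_t convert(uint8_t addr)
{
    bram_addr_t a;
    a.bank   = addr & 0x7;          /* placeholder bank function */
    a.offset = (addr >> 3) & 0xF;   /* address inside the bank   */
    return a;
}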
B. BoF Module Implementation

The BoF module is used to convert the set of SIFT descriptors (k × 128) extracted by the SIFT module into a BoF vector of size N. The first step is to cluster the SIFT descriptors into N clusters using the K-means clustering algorithm and find the cluster centers. The second step is to quantize each SIFT descriptor to the nearest cluster center.

In our implementation, the SIFT features are extracted from a number of images from each class. These descriptors are used to find the best N cluster centroids that minimize the sum of the squared Euclidean distances between the points and their nearest cluster centroids. We used MATLAB code to implement the k-means clustering algorithm. The cluster centroids are then loaded into the BoF module in the hardware using N FIFO buffers with a length of 128 elements. When the module's enable signal is activated, every clock cycle one element of the SIFT descriptor (SIFT[i]) is shifted into the module. Using N subtractors, this value is subtracted from the proper cluster centroid element. The results are accumulated using N accumulators. After 128 clock cycles, the distances between the SIFT descriptor and the cluster centers are available in the accumulators. The min block searches the N accumulators to find the minimum value, which identifies the cluster centroid closest to the input SIFT descriptor. The final step is to increment the corresponding histogram bin by one.

C. SVM Module Implementation

In the SVM architecture, we implement the one-against-all multiclass SVM classifier with an RBF kernel. The architecture exploits the parallelism in computing the RBF kernel in the decision function to reach real-time performance. We implement the decision function of the SVM classifier in hardware; the SVM model itself was generated using a software implementation.

The input to the SVM classifier module is the vector generated by the BoF module. The output is the class label after the classification process is completed. In this implementation, the training phase of the SVM is computed offline. The LibSVM MATLAB code is used to solve the dual Lagrange problem in the SVM training phase. The output of the training phase is a set of support vectors that build the boundary hyperplane and the set of weights (α). The extracted parameters are used to implement the testing phase of the SVM in the hardware.

In the SVM testing phase, a new data vector (x) is classified according to the decision function of (17), where k(·,·) is the kernel function, Nsv is the number of support vectors generated in the training phase, αi is the weight of each SV, and b is a bias constant, while xi and yi represent the SV and its class label yi = {−1, 1}, respectively

f(x) = sign( Σ_{i=1..Nsv} yi × αi × k(xi, x) + b )    (17)

K(x, z) = exp(−‖x − z‖²/(2σ²)).    (18)

In the kernel function module, the input image and the SV values are shifted into the module one value every clock cycle. The output represents the kernel value defined in (18). The RBF kernel module consists of one subtractor, two multipliers, one accumulator, and one module to compute the exponential function.

To compute the norm value in the kernel function, ‖x − z‖², we simplify the computation by using (20) instead of (19). In this case, we do not have to compute the square root. We can compute the summation of (xi − zi)², where each difference is multiplied by itself using one multiplier

‖x − z‖ = √((x1 − z1)² + (x2 − z2)² + · · · + (xn − zn)²)    (19)

‖x − z‖² = (x1 − z1)² + (x2 − z2)² + · · · + (xn − zn)².    (20)

In the kernel module, a new value from x and the SV is shifted into the module each clock cycle; the subtractor computes the difference and the multiplier computes the square of the difference. After n clock cycles, where n is the length of the SV and x, the value of ‖x − z‖² becomes valid, and the exponential block computes the exponential value and sends it to the output port.

The accumulator size in the kernel function was chosen based on (21), which depends on the data width (20 bits) and the size of the input (c). In our case, c equals the SV and x length (c = 500)

Accumulator bit width = log2(c × (2^20 − 1)).    (21)

After n clock cycles, the output of the accumulator is multiplied by −1/(2σ²). For the implementation of the exponential function, we represent it as the summation of the hyperbolic sine and hyperbolic cosine of x, as given in (22). The Xilinx IP Core CORDIC block is used to produce the sinh and cosh of the input. The input range of the CORDIC block is from −π/4 to π/4

exp(x) = sinh(x) + cosh(x).    (22)

The overall architecture of the SVM module is depicted in Fig. 17. We used FIFO buffer lines to store the SVs and the yi × αi values. The values in these buffers come from the training phase carried out offline. In this architecture, we used 20 modules to compute the kernel function between the input vector x and 20 SVs.
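In software form, the decision function (17) with the RBF kernel (18) and the squared-norm shortcut (20) is only a few lines of C. The sketch below is our own illustration (the hardware evaluates 20 kernels in parallel and computes the exponential as sinh + cosh); it assumes the precomputed products yi × αi are stored in coef:

#include <math.h>

/* One-against-all RBF-SVM decision value, per (17), (18), (20).
 * sv:    n_sv support vectors of length dim (row-major)
 * coef:  precomputed y_i * alpha_i for each support vector
 * gamma: 1 / (2 * sigma^2); b: bias; x: input BoF vector */
int svm_decide(const float *sv, const float *coef, int n_sv, int dim,
               float gamma, float b, const float *x)
{
    float sum = b;
    for (int i = 0; i < n_sv; i++) {
        float d2 = 0.0f;                     /* ||x - sv_i||^2, eq. (20) */
        for (int j = 0; j < dim; j++) {
            float diff = x[j] - sv[i * dim + j];
            d2 += diff * diff;
        }
        sum += coef[i] * expf(-gamma * d2);  /* RBF kernel, eq. (18) */
    }
    return sum >= 0.0f ? 1 : -1;             /* sign(.), eq. (17) */
}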

Fig. 17. Overall architecture for SVM.

D. FPGA Implementation

We developed a prototype of our architecture on an FPGA. We used the ML509 evaluation platform, which is equipped with a Virtex-5 LX110T FPGA from Xilinx. The overall system consists of the SIFT processor core, the BoF module, the SVM classifier module, and a MicroBlaze soft-core processor. The hardware modules are implemented using the Verilog hardware description language. Our architecture interfaces with the MicroBlaze processor via two FSL bus systems for I/O purposes. We used the Xilinx ISE Design Suite, Xilinx Platform Studio (XPS), and Xilinx Software Development Kit (SDK) platforms to design and implement the architecture. The overall system architecture is illustrated in Fig. 18.

Fig. 18. Block diagram of the FPGA prototype system.

The MicroBlaze processor handles tasks such as data transfer between I/O peripherals and the hardware modules, as well as controlling the data flow. Initially, the input image frames are stored in the compact flash card (acting as the image acquisition source). The MicroBlaze loads the frame to the external SRAM before the SIFT extraction phase. The processor then reads one pixel at a time from the SRAM and sends it to the SIFT core via the fast simplex link (FSL) bus interface. When the final hardware module's valid signal becomes high, the MicroBlaze processor reads the results and stores them in a predefined array in the SRAM.
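On the MicroBlaze side, such FSL transfers reduce to the blocking putfsl/getfsl macros from Xilinx's mb_interface.h. The loop below is a simplified, hypothetical version of the firmware (function and buffer names are ours, and the real code also handles the handshaking described above):

#include <mb_interface.h>   /* Xilinx putfsl/getfsl FSL macros */

#define WIDTH  640
#define HEIGHT 480

/* Stream one frame from SRAM to the SIFT core over FSL channel 0,
 * then read back the classification result. Simplified sketch. */
unsigned classify_frame(const unsigned char *frame)
{
    unsigned result;
    for (int i = 0; i < WIDTH * HEIGHT; i++)
        putfsl(frame[i], 0);    /* blocking write of one pixel */
    getfsl(result, 0);          /* blocking read of the class label */
    return result;
}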
TABLE II. HARDWARE UTILIZATION FOR THE OBJECT DETECTION ARCHITECTURE

TABLE III. HARDWARE UTILIZATION FOR THE SIFT MODULE

TABLE IV. HARDWARE UTILIZATION FOR THE BOF MODULE

E. Hardware Utilization

This section presents the hardware utilization of each module in the proposed architecture. These modules are SIFT, BoF, and SVM. The main hardware resources in the FPGA are slice registers, slice LUTs, LUT flip-flop pairs, DSP blocks, and memory. The results are reported from the synthesis reports generated by the Xilinx ISE environment. Table II summarizes the hardware resources used to implement the whole object detection architecture.

Our architecture fits in 78% of the LUTs and 25% of the slice registers of the Virtex-5 XC5VLX110T FPGA. It consumed 86% of the block RAM memory because the implemented application required saving many temporary results inside the FPGA chip. In the SIFT module, the GSS and dominant orientation generation modules consumed most of the hardware resources. The GSS used 15 blocks of RAM/FIFO to implement the buffer lines that save the input image. Using the MCM algorithm in the GSS step reduces the hardware utilization by a reduction gain of 2 when compared to previous works. Tables III–V summarize the hardware utilization for each of the major modules of the design.

V. EXPERIMENTAL RESULT

The performance of the proposed hardware architecture can be measured in terms of three metrics: classification accuracy, speedup compared with equivalent software and hardware implementations, and hardware resource utilization. There is a tradeoff between these metrics, where higher accuracy requires

more hardware resources and processing time. In this work, the priority was to achieve the highest accuracy within the real-time constraint, which is processing 30 images/s.

TABLE V. HARDWARE UTILIZATION FOR THE SVM MODULE

A. Modules' Accuracy

In the hardware implementation, we used fixed-point numbers, while the software implementation used floating-point numbers. This can lead to losing some accuracy when implementing the algorithms in hardware. To assess the accuracy achieved by our implementation, we computed the percentage error between the software and hardware values based on (23)

Error = |Software value − Hardware value| / Software value × 100%.    (23)
KUL Belgium Traffic Sign Classification dataset [28]. These
In the GSS hardware module, the input image’s pixels are datasets are considered a challenging datasets in image classi-
represented using 8 bits integer numbers while the pixels of fication. So that, if our architecture achieved well in these two
Gaussian filtered images are represented as (9:11) fixed point datasets that means that it will also do well on other datasets.
numbers, with 9 bits for the integer part and 11 bits for the frac- 1) Experiment 1: Caltech-256: In the first experiment, we
tional part. The average errors in the first octave for its six scales used ten different randomly selected subsets from the Caltech-
are reported using a human face image. The errors in scales 0–5 256 dataset. Each subset consists of five different classes. The
were 1.760%, 2.040%, 2.930%, 3.870%, 4.310%, and 5.670%, classes were: horse, face, motorbike, watch, airplane for the first
respectively. subset. Blimp, bowling-pin, boxing-glove, brain, bulldozer for
In our implementation, the number of keypoints in hardware the second subset. Fig. 19 shows examples from the first and
was the same as that in the software implementation. Hence, second subsets.
there was no over-detection or under-detection in the KPD mod- The classification results in this section are averaged over the
ule. Our implementation achieved an accuracy of 99.3% in term five classes of each subset. The training phase of the SVM was
of the keypoints position. The accuracy is computed by dividing carried out using the LibSVM MATLAB code. The extracted
the number of true matches over the total number of keypoints. parameters (α, SV, and y) were used in the proposed hardware
The true matches are the keypoints with distance less than 5 architecture as an input. In the training phase, for each subset,
pixels from the original location detected in software. we used 100 images, 20 images per class, and in the testing
The gradient magnitude values are represented using 20 bits phase, for each subset, we used another 100 images, 20 images
fixed point numbers with (9:11) integer and fraction bits. The from each class.
gradient orientation value represented by 16 bits with (3:13) Fig. 20 shows the average confusion matrix using the ten
integer and fraction bits. The error in gradient magnitude was subsets. We computed the average confusion matrix by aver-
equal to 3.65% and in gradient orientation was equal to 1.527%. aging the values for all five classes combined. The average
The error in Gaussian weight generation module was approxi- confusion matrix represents the average accuracy for classi-
mately 4.3%. For the SIFT feature generation module, the SIFT fiers obtained through ten different subsets. The experiment is
vector’s element is represented by fixed point numbers with implemented using both software and hardware. The diagonal
20 bits in (9:11) format. The error between the software and elements represent the correctly classified images, while other
hardware values is equal to 3.36%. We can achieved a higher elements represent the misclassified images.
accuracy in these modules by increasing the word lengths of The classification accuracy in software implementation for
related data, but we allow a relatively small errors to save hard- subsets 1–10 was as follows (85%, 69%, 73%, 79%, 75%, 74%,
ware cost. We did an analysis to find the optimal bit width that 87%, 72%, 67%, and 70%) and for hardware implementation
gives us the best accuracy and the lower hardware cost. (84%, 69%, 71%, 77%, 72%, 70%, 87%, 71%, 66%, and 69%).
Authorized licensed use limited to: University of Illinois at Chicago Library. Downloaded on May 24,2025 at 19:37:36 UTC from IEEE Xplore. Restrictions apply.
QASAIMEH et al.: FPGA-BASED PARALLEL HARDWARE ARCHITECTURE FOR REAL-TIME IMAGE CLASSIFICATION 67

Fig. 21. Example images from the KUL Belgium Traffic Sign Dataset.

Fig. 22. Hardware and software confusion matrix.

The highest accuracy achieved was 87%, in subset 7, and the lowest was 66%, in subset 10; the other subsets fall between these two values. For the LibSVM software implementation, the average classification rate over the ten subsets was 75.1% with a standard deviation of 6.7, while our hardware implementation achieved an average classification rate of 73.6% with a standard deviation of 6.9. The difference in classification accuracy is due to the fact that the LibSVM software implementation uses floating-point numbers while our proposed system uses fixed-point numbers, which can reduce the accuracy of the processed data.

The misclassification error may seem high even in the software implementation. This is because the Caltech-256 dataset is considered very challenging: to date, the best classification accuracy achieved on it is 46.6%, using 30 samples from each of the 256 categories [29]. The goal of this experiment is to assess the accuracy difference between the software and our hardware implementation, and a 3% accuracy difference between our architecture and the equivalent software implementation is acceptable, especially for such a challenging dataset.

2) Experiment 2: KUL Belgium Traffic Sign: In the second experiment, we used five classes from the KUL Belgium Traffic Sign Classification dataset; some examples from each class are shown in Fig. 21. In the SVM training phase, we used 100 images, 20 from each class. In the testing phase, we used a different 100 images, again 20 from each class. For the LibSVM software implementation, the classification rate was 80%, while our hardware implementation achieved a classification rate of 78%. Fig. 22 shows the confusion matrices. Again, the difference in classification accuracy is due to the fact that the LibSVM software implementation uses floating-point numbers, while our proposed system uses fixed-point numbers, which can reduce the accuracy. These two experiments show that the accuracy difference between our architecture and the equivalent software implementation is within 3% using the same training and testing sets.

C. Processing Time

The processing time required to classify one frame is divided into two parts: 1) the SIFT feature extraction and description time and 2) the SVM classification time. The processing time of each module is estimated by simulating the implementation in Verilog HDL and deriving the speed from the number of clock cycles required to complete each task and the operational frequency of that module, as given in (24)

Time = (Number of clock cycles to finish the task) / (Operational frequency (Hz)). (24)

1) SIFT Module Processing Time: The maximum operating frequency of the proposed design is 60.36 MHz, as given by the synthesis report. The SIFT feature extraction module takes 640 × 480 clock cycles to scan the input image and detect the keypoints, and a further (640 + 17 × 17 + 128) clock cycles per keypoint to generate the SIFT descriptor. The processing time of the SIFT feature extraction and description module for one 640 × 480 pixel frame is therefore estimated as shown in (25) and (26), where #KP is the number of SIFT descriptors detected in the input image

Time = Detection time + Descriptor generation time (25)

Time = (640 × 480 + (640 + 17 × 17 + 128) × #KP) / (Operational frequency (Hz)). (26)

At an operational frequency of 50 MHz, the processing time of the SIFT detector for a VGA frame (640 × 480) is about 640 × 480/50 MHz = 6.144 ms. The SIFT descriptor's processing time is proportional to the number of detected keypoints: our architecture takes (640 + 17 × 17 + 128) clock cycles to generate one descriptor, which is approximately 25.9 μs per feature.
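To make the timing model concrete, the short sketch below (our illustration, not the RTL; the 1000-keypoint value is a hypothetical example) evaluates (26) at the 50 MHz operational frequency:

```python
# Evaluate the SIFT timing model of (25)-(26) at f = 50 MHz.
# Illustrative sketch; the keypoint count below is a hypothetical example.

F_CLK = 50e6                        # operational frequency (Hz)

def sift_frame_time(num_keypoints: int) -> float:
    """Seconds to detect and describe one 640 x 480 frame, per (26)."""
    detect_cycles = 640 * 480                 # raster scan of the frame
    desc_cycles = 640 + 17 * 17 + 128         # cycles per descriptor
    return (detect_cycles + desc_cycles * num_keypoints) / F_CLK

print(sift_frame_time(0))     # detection only: 0.006144 s = 6.144 ms
print(sift_frame_time(1000))  # hypothetical 1000 keypoints: ~0.0273 s
```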

TABLE VI
COMPARING PROPOSED SIFT RESULTS WITH EXISTING SOLUTIONS

TABLE VII
COMPARING PROPOSED SVM RESULTS WITH EXISTING SOLUTIONS

To compute the maximum number of keypoints that can be extracted from each frame, we computed the maximum number of keypoints that can be processed within 33 ms, as given in (27); this is so because, for real-time performance, we need to achieve a rate of 30 frames/s

33 ms = (640 × 480 + (640 + 17 × 17 + 128) × #KP) / 50 MHz. (27)

The maximum number of keypoints is thus 1270 keypoints per frame. Therefore, our architecture allows a VGA-size image with a keypoint ratio of about 0.4134% to be processed at a speed of 30 frames/s. Due to the parallelization technique proposed in the SIFT descriptor module, we can compute a larger number of SIFT descriptors within the 33 ms than the work proposed in [22], which can compute up to 890 SIFT descriptors within 33 ms, and the work in [16], which can compute up to 412 SIFT descriptors with real-time performance. In our architecture, if the input image has more than 1270 SIFT features, it will take more than 33 ms to compute the SIFT features in the image.

To compute the speedup achieved by our SIFT architecture, we measured the average processing time required to extract the SIFT features from the 209-image (people) collection of the Caltech-256 dataset and compared it to the time reported in [24]. On average, the processing time in [24] is 1.81 s per image, while our architecture extracts the SIFT features within 33 ms per image, a speedup of 54.8, as shown in (28)

Speedup = Software processing time / Hardware processing time = 1.81 s / 33 ms = 54.8. (28)

2) SVM Module Processing Time: The processing time of the SVM classifier is linearly dependent on the number of support vectors and the dimensionality of the feature vectors. As the number of support vectors increases, the processing time increases; likewise, if the number of features in the input vector increases, the processing time increases linearly. In our SVM architecture, the processing time required to classify one image equals floor(N_SV / 20) × SV_Dimensions × #Classes × (1 / Maximum operational frequency). In the Caltech-256 experiment, we used five classes with 20 training images per class, and each image is represented as a vector with 500 elements, so the time required to classify one image equals floor(100 / 20) × 500 × 5 × (1 / 50 MHz) = 2.5 × 10⁻⁴ s.

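The throughput and speedup figures quoted in this subsection can be reproduced with the following sketch (our own cross-check; the constants are taken from the text above, not from the implementation):

```python
# Cross-check of the processing-time arithmetic in this subsection.

F_CLK = 50e6                        # operational frequency (Hz)
FRAME_CYCLES = 640 * 480            # cycles to scan one VGA frame
DESC_CYCLES = 640 + 17 * 17 + 128   # cycles per SIFT descriptor

# Eq. (27): largest #KP whose total frame time still fits in 33 ms.
budget_cycles = 33e-3 * F_CLK
max_kp = int((budget_cycles - FRAME_CYCLES) // DESC_CYCLES)
print(max_kp)                       # -> 1270

# Eq. (28): SIFT extraction speedup over the software baseline [24].
print(1.81 / 33e-3)                 # -> ~54.8

# SVM classification time: floor(N_SV/20) * dims * classes / f_clk.
n_sv, dims, classes = 100, 500, 5
print((n_sv // 20) * dims * classes / F_CLK)   # -> 2.5e-04 s
```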
To compare the processing time between the proposed architecture and the equivalent software implementation, we classified 100 test images into one of the five classes. Fig. 23 summarizes the confusion matrices, the classification accuracy, and the processing time.

Fig. 23. Software implementation results compared to our SVM architecture.

We used the LibSVM library running on an Intel i5 processor with 8 GB RAM; it took 166 ms to classify the 100 images, while our implementation took only 25 ms to classify the same 100 images, a speedup of 5.7, as shown in (29). The classification accuracy of our implementation is 3% lower than that of the software implementation

Speedup = Software processing time / Hardware processing time = 143 ms / 25 ms = 5.72. (29)
times greater than our implementation.
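The figures in (29) can be checked the same way (an illustrative sketch, not the benchmarking harness used in the experiments):

```python
# Check the SVM classification speedup of (29) and the per-image latency.
hw_batch = 25e-3     # hardware time to classify 100 images (s)
sw_batch = 143e-3    # software time used in (29) (s)

print(sw_batch / hw_batch)   # -> 5.72x speedup
print(hw_batch / 100)        # -> 2.5e-4 s (0.25 ms) per image
```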
D. Comparison With Existing Solutions

In this section, a comparison between our implementation and existing solutions is presented. First, we compared our SIFT feature extraction architecture with the works in [16]–[21]; Table VI summarizes the performance and hardware utilization of each implementation. We also compared our SVM architecture with the works in [30]–[33]; Table VII summarizes the performance of each SVM architecture.

In our architecture, it takes 480 × 640 clock cycles to detect all keypoints in the input image and (640 + 17 × 17 + 128) clock cycles to generate one descriptor. Hence, at a 50 MHz operational frequency, the time required to scan the input image and detect all keypoints is 6.14 ms, and each SIFT descriptor takes 25.9 μs to generate. In existing solutions, detecting the SIFT keypoints takes around 10–30 ms, as shown in Table VI; in [16], generating one SIFT feature took 80 μs, and in [17] it took 11.7 ms. In terms of the keypoint ratio that can be processed in 33 ms, our architecture achieves 0.4134%, while the works in [16] and [17] achieve 0.2897% and 0.024%, respectively.

The hardware resources utilized by our implementation are less than those of most existing solutions. Our architecture consumed 16 138 LUTs and 7924 registers, which is less than [17]–[20] and almost the same as [16] and [21]. However, the image resolution in [16] and [21] is 320 × 240, while our architecture works on larger images with a resolution of 640 × 480 pixels.

Table VII summarizes the performance of our SVM architecture against that reported in [30]–[33]. In [31], the SVM architecture achieved an accuracy of 77% at 40 frames/s, but it used large hardware resources to reach that speed. In [32], the authors implement an SVM classifier and assess its performance using a simple dataset called MNIST [34]; they achieved a speedup of 20 compared to a dual Opteron 2.2 GHz CPU. In [33], the dataset used was simple: the images are represented as 8-bit grayscale with a resolution of 32 × 32 pixels, the number of support vectors is limited to 10 per binary classifier, and the architecture has six binary classifiers; the classification speed of the system was 2 ms. In [30], the authors used three classes of Persian handwritten digits to assess their architecture; they achieved a very high accuracy rate and small hardware utilization, but their input vector dimension is limited to only 24. The implementations in [30]–[33] either used a very simple dataset to achieve high accuracy or consume more hardware resources to deal with large images or with a larger number of classes. Our implementation used a challenging dataset (Caltech-256) with 500-dimensional input vectors; the classification accuracy was 82% within a processing time of 0.25 ms for each input image. We also worked on five classes, while other implementations worked on three or four classes only.

In terms of hardware resource utilization, our SVM architecture consumed fewer resources than [31] and [32]: it consumed only 38 179 LUTs and 9646 registers. Our implementation consumed more resources than [30] because the input vector dimension in [30] is only 24, while in our work it is 500; also, the number of classes in [30] is restricted to 3. The architecture in [33] consumed fewer resources, but its processing time is 10 times greater than that of our implementation.

VI. CONCLUSION

This paper proposed a hardware architecture capable of detecting SIFT features from images with a dimension of 640 × 480 pixels within 6.144 ms. It can also compute up to 1270 SIFT features for each image. The classification accuracy achieved by the proposed architecture on benchmark datasets with five different classes was 85% for Caltech-256 and 78% for the KUL Belgium traffic sign dataset. The difference in classification accuracy between the proposed architecture and the software implementation was less than 3%. The speedup achieved in feature extraction was 54×, and in the classification algorithm it was 5.7×, in comparison to the software implementation. The proposed architecture's FPGA resource utilization was moderate compared to existing hardware implementations. Hence, the implemented image classification architecture provides an embedded system solution that can be used for the detection and recognition of objects with real-time performance. As a future direction, and to achieve better performance and optimized power requirements, an ASIC implementation of the work discussed here can be pursued with some modifications to the FPGA design flow, including activities such as conversion of IP cores to Verilog modules, performing postlayout synthesis, and meeting design-for-test requirements.

REFERENCES

[1] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. 4th Alvey Vis. Conf., 1988, pp. 147–151.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comp. Soc. Conf. Comp. Vis. Pattern Recog., 2005, vol. 1, pp. 886–893.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[5] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," Comput. Vis., vol. 3951, pp. 404–417, 2006.
[6] M. Grabner, H. Grabner, and H. Bischof, "Fast approximated SIFT," Comput. Vis., vol. 3851, pp. 918–927, 2006.
[7] M. Pohl, M. Schaeferling, and G. Kiefer, "An efficient FPGA-based hardware framework for natural feature extraction and related computer vision tasks," in Proc. 24th Int. Conf. Field Programmable Logic Appl. (FPL), 2014, pp. 1–8.
[8] A. Serackis and T. Sledevic, "SURF algorithm implementation on FPGA," in Proc. Biennial Baltic Electron. Conf., 2012, pp. 291–294.
[9] J. E. Svab, "FPGA based speeded up robust features," in Proc. IEEE Int. Conf. Technol. Pract. Robot Appl. (TePRA'09), Woburn, MA, USA, 2009, pp. 35–41.
[10] W.-Y. Lee and K.-J. Byun, "A hardware design of optimized ORB algorithm with reduced hardware cost," Adv. Sci. Technol. Lett., pp. 58–62, 2013.
[11] S. Panchal, "A comparison of SIFT and SURF," Int. J. Innovative Res. Comput. Commun. Eng., vol. 1, no. 2, pp. 323–327, 2013.
[12] B. Rister, G. Wang, M. Wu, and J. R. Cavallaro, "A fast and efficient SIFT detector using the mobile GPU," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Vancouver, BC, USA, 2013, pp. 2674–2678.
[13] C. Jiang, Z.-X. Geng, X.-F. Wei, and C. Shen, "SIFT implementation based on GPU," in Proc. Int. Symp. Photoelectron. Detect. Imag., 2013, vol. 8913, p. 891304.
[14] S. Heymann, K. Müller, A. Smolic, and B. Fröhlich, "SIFT implementation and optimization for general-purpose GPU," in Proc. 15th Int. Conf. Cent. Eur. Comput. Graph. Visualization Comput. Vis. (EUROGRAPHICS), 2007, pp. 317–322.
[15] Q. Zhang, Y. Chen, Y. Zhang, and Y. Xu, "SIFT implementation and optimization for multi-core systems," in Proc. IEEE Int. Symp. Workload Characterization, 2008, pp. 14–23.
[16] S. Zhong, J. Wang, and L. Yan, "A real-time embedded architecture for SIFT," J. Syst. Archit., vol. 59, no. 1, pp. 16–29, 2013.
[17] V. Bonato, E. Marques, and G. Constantinides, "A parallel hardware architecture for scale and rotation invariant feature detection," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 12, pp. 1703–1712, Dec. 2008.
[18] W. Feng, D. Zhao, Z. Jiang, Y. Zhu, H. Feng, and L. Yao, "An architecture of optimised SIFT feature detection for an FPGA implementation of an image matcher," in Proc. Int. Conf. Field Programm. Technol. (FPT'09), Sydney, N.S.W., Australia, 2009, pp. 30–37.
[19] H. Borhanifar and V. Naeim, "High speed object recognition based on SIFT algorithm," in Proc. Int. Conf. Image Vis. Comput., 2012.
[20] K. Mizuno et al., "Fast and low-memory-bandwidth architecture of SIFT descriptor generation with scalability on speed and accuracy for VGA video," in Proc. Int. Conf. Field Programm. Logic Appl., Milano, Italy, 2010, pp. 608–611.
[21] E. S. Kim and H.-J. Lee, "A novel hardware design for SIFT generation with reduced memory requirement," J. Semicond. Technol. Sci., vol. 13, no. 2, pp. 157–169, 2013.
[22] F.-C. Huang, S.-Y. Huang, J.-W. Ker, and Y.-C. Chen, "High-performance SIFT hardware accelerator for real-time image feature extraction," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 340–351, Mar. 2012.
[23] C. Bray, G. Csurka, C. Dance, L. Fan, and J. Willamowski, "Visual categorization with bags of keypoints," in Proc. Workshop Stat. Learn. Comput. Vis. (ECCV), 2004, pp. 1–22.
[24] R. Hess, "An open-source SIFT library," in Proc. Int. Conf. Multimedia, 2010, pp. 1493–1496.
[25] C.-J. Lin and C.-C. Chang, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, pp. 1–27, 2011.
[26] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Inst. Technol., Tech. Rep. 7694, 2007.
[27] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, Art. 11, 2007.
[28] R. Timofte, KUL Belgium Traffic Signs and Classification Benchmark Datasets, 2012 [Online]. Available: http://www.vision.ee.ethz.ch/~timofter/. Accessed on 2014.
[29] S. McCann and D. G. Lowe, "Spatially local coding for object recognition," in Computer Vision–ACCV 2012. New York, NY, USA: Springer, 2013.
[30] D. Mahmoodi, A. Soleimani, H. Khosravi, and M. Taghizadeh, "FPGA simulation of linear and nonlinear support vector machine," J. Softw. Eng. Appl., vol. 4, pp. 320–328, 2011.
[31] C. Kyrkou and T. Theocharides, "A parallel hardware architecture for real-time object detection with support vector machines," IEEE Trans. Comput., vol. 61, no. 6, pp. 831–842, Jun. 2012.
[32] M. Papadonikolakis and C.-S. Bouganis, "A novel FPGA-based SVM classifier," in Proc. Int. Conf. Field Programm. Technol. (FPT), 2010, pp. 283–286.
[33] M. Ruiz-Llata, G. Guarnizo, and M. Yébenes-Calvino, "FPGA implementation of a support vector machine for classification and regression," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2010, pp. 1–5.
[34] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998 [Online]. Available: http://yann.lecun.com/exdb/mnist/. Accessed on 2014.

Murad Qasaimeh received the B.S. degree in computer engineering from Jordan University of Science and Technology (JUST), Irbid, Jordan, in 2011, and the M.Sc. degree in computer engineering from the American University of Sharjah (AUS), Sharjah, UAE, in 2014. From 2014 to 2015, he was a Research Associate with the Semiconductor Research Corporation (SRC), Khalifa University (KU), Abu Dhabi, UAE. His research interests include image and video processing, parallel hardware architecture design, hardware accelerator design, FPGA/ASIC design, system-on-chip (SoC) design, and real-time embedded systems design.

Assim Sagahyroon received the M.Sc. degree from Northwestern University, Evanston, IL, USA, and the Ph.D. degree in computer engineering from the University of Arizona, Tucson, AZ, USA. From 1993 to 1999, he was a member of the Department of Computer Science and Engineering, Northern Arizona University, Flagstaff, AZ, USA, and then joined the Department of Math and Computer Science, California State University, Seaside, CA, USA. In 2003, he joined the Department of Computer Science and Engineering, American University of Sharjah, Sharjah, UAE, where he is currently a Professor and Head of the Department. His research interests include power consumption and testing of VLSI designs, hardware description languages, FPGAs, computer architecture, and innovative applications of emerging technology.

Tamer Shanableh received the Ph.D. degree in electronic systems engineering from the University of Essex, Colchester, U.K., in 2002. From 1998 to 2001, he was a Senior Research Officer with the University of Essex, during which he collaborated with BTexact on inventing video transcoders. He joined Motorola U.K. Research Lab, Basingstoke, U.K., in 2001. He joined the American University of Sharjah in 2002 and is currently an Associate Professor of Computer Science. He spent the summers of 2003, 2004, 2006, 2007, and 2008 as a Visiting Professor at Motorola Multimedia Research Labs, Chicago, IL, USA. He spent the spring semester of 2012 as a Visiting Academic at the Multimedia and Computer Vision Laboratory, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, U.K. His research interests include digital video processing and pattern recognition.
