A High-performance FPGA-based Image Feature Detector and Matcher Based on the FAST and BRIEF Algorithms
ARTICLE
1 Poznan University of Technology, Institute of Control and Information Engineering, Poznan, Wielkopolska, Poland
*Corresponding author(s) E-mail: [email protected]
DOI: 10.5772/61434
© 2015 Author(s). Licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
Abstract

Image feature detection and matching is a fundamental operation in image processing. As the detected and matched features are used as input data for high-level computer vision algorithms, the matching accuracy directly influences the quality of the results of the whole computer vision system. Moreover, as the algorithms are frequently used as a part of a real-time processing pipeline, the speed at which the input image data are handled is also a concern. The paper proposes an embedded system architecture for feature detection and matching. The architecture implements the FAST feature detector and the BRIEF feature descriptor, and is capable of establishing key point correspondences in the input image data stream coming from either an external sensor or memory at a speed of hundreds of frames per second, so that it can cope with most demanding applications. Moreover, the proposed design is highly flexible and configurable, and facilitates the trade-off between the processing speed and programmable logic resource utilization. All the designed hardware blocks use standard, widely adopted hardware interfaces based on the AMBA AXI4 interface protocol and are connected using an underlying direct memory access (DMA) architecture, enabling bottleneck-free inter-component data transfers.

Keywords: FPGA, Feature Detection, Feature Matching

1. Introduction

Point correspondences found in sequences of images are the input data for a wide range of computer vision algorithms, including tracking [1, 2], 3D reconstruction [3, 4], image stitching [5], visual odometry [6, 7], video surveillance [8, 9] and simultaneous localization and mapping [10, 11]. As the quality of the input data directly influences the final results produced by the aforementioned algorithms, numerous solutions to the problem of automated image feature extraction and matching have been proposed by the research community. The most important characteristic of a quality feature detector is its repeatability. The feature should be an accurate and stable projection of a 3D point to a 2D image plane regardless of the transformations or distortions introduced by frame-to-frame camera movement. Robustness against varying acquisition parameters, like changes in illumination or noise, is also desirable. Feature matching supplements feature detection by establishing point correspondences across two or more views of the observed scene. The matching is usually […]
Figure 2. Formation of the BRIEF binary descriptor.

Figure 3. Exemplary 256 point pairs for the BRIEF feature descriptor.
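The test-point pattern shown in figure 3 is fixed at design time. As a rough illustration of how such a pattern can be generated offline, the sketch below follows the isotropic Gaussian sampling strategy suggested in the original BRIEF proposal; it is not necessarily the exact pattern of figure 3, and the function and parameter names are illustrative only:

```python
import random

def generate_brief_pattern(n_pairs=256, patch=33, sigma_scale=5.0, seed=42):
    """Generate BRIEF test-point pairs inside a patch x patch window.

    Points are drawn from an isotropic Gaussian centred on the patch and
    clamped to valid coordinates (one of the sampling strategies from the
    original BRIEF paper; the article itself uses the fixed pattern shown
    in its figure 3).  Coordinates are relative to the patch centre.
    """
    rng = random.Random(seed)
    half = patch // 2
    sigma = patch / sigma_scale

    def sample_point():
        x = int(round(rng.gauss(0.0, sigma)))
        y = int(round(rng.gauss(0.0, sigma)))
        # Clamp so every comparison stays inside the sampling window.
        x = max(-half, min(half, x))
        y = max(-half, min(half, y))
        return x, y

    return [(sample_point(), sample_point()) for _ in range(n_pairs)]
```

For a 33×33 window, all generated coordinates fall in the range [-16, 16], matching the window used by the descriptor data path described later.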
5. The Implemented Architecture
5.1 Outline of the Architecture

Image processing algorithms can be divided into different groups based on the type of data they are working on, like pixel or region processing, or methods that analyse the meta-data extracted from images. To facilitate the implementation of such a broad range of algorithms, the processing platform has to be flexible and easy to reprogram as new methods are invented and old ones are refined. The most common and natural approach is the use of standard microprocessors running computer vision software. However, computer vision methods usually require high computational power due to the vast amounts of data to be processed. At the same time, they have to be power-efficient enough to be used in power-constrained applications (e.g., robotics or smart cameras). These contradictory requirements can be eased by using a heterogeneous processing platform, like the Xilinx Zynq. These devices contain FPGA logic alongside a relatively high-performance processor-based subsystem. Such a solution enables the partitioning of image processing tasks between hardware, hand-tailored coprocessors implemented in the FPGA fabric and software-based implementations that run on general-purpose Cortex-A9 cores.

The system described in this paper was implemented in a Xilinx Zynq-7000 All Programmable SoC device. The block diagram of the whole solution is given in figure 4. The processor system houses two ARM Cortex-A9 cores along with an external DRAM memory controller and communication infrastructure for interfacing with the programmable logic part. The processors are used for controlling the flow of data, visualization and communication with the host computer. The programmable logic part contains a dedicated FAST detector, a BRIEF descriptor and matching coprocessors. They are connected to the memory and processor subsystem through the AXI4-Stream and the AXI4-Lite interfaces.

Figure 4. Block diagram of the implemented system.

5.2 Universal Controller for Streaming Processors

The process of converting the software implementation of an image processing algorithm into a hardware coprocessor can be difficult and time-consuming. Integrating the coprocessor with the system (e.g., external memory) and with the other coprocessors is another problem that has to be dealt with. To simplify system integration, the universal controller for streaming processors IP core was created. Its block diagram is given in figure 5.

This is a utility core that allows connecting streaming processors compliant with the AXI4-Stream standard to the DMA engines and controlling them by writing to their internal registers over the AXI4 system bus. It can serve as a simple data feeder that adjusts communication interface data widths, handles different operating frequencies and buffers the data. It also offers some additional, subtle advantages, like the ability to fully flush the streaming core pipeline or to inject generated values into it. Another feature of the core is built-in error handling and signalling (e.g., out-of-defined-range values, buffer over/underflow). The parameter configuration and event handling are done using a register-based interface.

Figure 5. The block diagram of the universal controller for streaming processors.

5.3 Ensuring parallel access to image data for neighbourhood operations

Contemporary programmable logic devices contain a pool of dual-port memory blocks for local data storage and buffering. The memory blocks can be used together with register banks to provide simultaneous access to the pixel neighbourhood in the currently processed image. Such an input block does not use the external RAM memory – which is a major advantage – as communication with external RAM can be a bottleneck in data-intensive applications. A general block diagram of the input block is given in figure 6.

The input block operates under the assumption that the source of the image data must feed the pixels in progressive scan mode, which is common in contemporary image sensors.
The block diagram of the coprocessor performing feature detection using the FAST algorithm, together with feature description and matching using the BRIEF algorithm, is given in figure 7. A circuit allowing for the formation of a 7×7 pixel neighbourhood (as described in section 5.3) is connected directly to the input of the system. In this location, the processing is split into two independent data paths – one for feature detection and one for feature description.

Due to the specific nature of the FPGA hardware architecture, the implemented version of the FAST algorithm, although returning the same results, differs significantly from the original proposal in [22] and [12]. The original solution uses a decision tree to determine whether the pixels on the Bresenham circle fall into the 'dark' or 'bright' category. Such a concept was proposed with sequential data processors in mind. The decision tree algorithm runs relatively efficiently on such hardware and allows us to discard a candidate point as a non-feature (one not satisfying the 'dark' or 'bright' criteria) early on, at the cost of increased memory consumption. As the presented design is based on a different computational platform, its principle of operation is also different.
Michał Fularz, Marek Kraft, Adam Schmidt and Andrzej Kasiński: A High-performance FPGA-based Image Feature Detector and Matcher Based on the FAST and BRIEF Algorithms
Figure 8. Schematic diagram of the bright/dark classifier block.

In place of a decision tree, an exhaustive search is performed for all the image pixels. Because the programmable logic facilitates the implementation of architectures that can perform computations in parallel, such an approach has no negative effect on the overall performance. Despite using a brute force approach for the 'bright' and 'dark' pixel classifications and the corner score computation, the coprocessor consumes only a small portion of the available programmable logic resources.

After the pixels have been organized into a 7×7 neighbourhood, providing access to the central pixel (the centre in figure 7) and the Bresenham circle (pixel_01 to pixel_16 in figure 7) used by FAST, the intensity values of these 17 specific pixels are passed to the inputs of the bright/dark classifier block. A schematic diagram of this functional block of the circuit is shown in figure 8.

The block consists of two independent data paths comprising a set of subtracters, comparators and multiplexers. The two data paths are used to classify the 16 pixels on the Bresenham circle as either 'bright' or 'dark', respectively. The input data are the centre signal and the pixel_XX signals, corresponding to the central pixel intensity value and the 16 Bresenham circle pixel intensity values, and the threshold value corresponding to the FAST detector threshold. The 'bright' data path begins with the computation of the differences between the intensity values of the pixels on the Bresenham circle and the intensity value of the central pixel. This is done using 16 subtracters. In the next step, the threshold value is deducted from the 16 results using another set of 16 subtracters. The 'dark' classification starts with the computation of the differences between the intensity value of the central pixel and the intensity values of the pixels on the Bresenham circle. Again, 16 subtracters are used to perform these operations. In the following step, the threshold value is subtracted from the results obtained in the previous stage. The results produced in both data paths are then fed into comparators. If the results are greater than zero, they are passed to the next processing stages to contribute to the corner score computation process as a sc_part_XX_YY signal, where XX is a number within the range of 1-16 (one for each pixel in the Bresenham circle) and YY is br for the 'bright' pixels or dk for the 'dark' pixels. This is achieved using multiplexers, controlled by the results of the comparisons, to perform all 48 comparison operations in parallel, as shown in figure 8. Furthermore, the results returned by the comparators set the corresponding bit in the is_bright or is_dark vector to confirm the meeting of the 'bright' or 'dark' test criteria, respectively. If the resulting value is negative or equal to zero, its value is set to zero and the corresponding bit in the is_bright or is_dark vector is reset.

The final outputs of the classifier block are, therefore, two sets of 16 8-bit values used for the computation of the corner score according to equation 1, and two 16-bit binary vectors holding the results of the classification of the pixels on the Bresenham circle as either 'dark' or 'bright', used in the next stages to perform the segment test.

These outputs are connected to two other functional blocks. The first block is used for the corner score computation. It consists of two pipelined adder trees. Each adder tree has 16 inputs for the corner score components from either the 'bright' or the 'dark' group of pixels. The adders produce two values – the sums of the components for both of the aforementioned groups – and the greater of those values is passed as the final value of the corner score function as per equation 1. The second module performs the segment test for both the 'dark' and 'bright' pixels, looking for at least nine consecutive logical '1's in the contents of the 16-bit is_bright and is_dark vectors. The block consists of two groups of 16 nine-input AND gates. If either of the two vectors contains a train of at least nine consecutive logical '1's, the result of the segment test is recognized as positive.

The resulting corner score values score and segment test results is_corner are passed to a circuit composed of FIFO memories and a register bank to form a 7×7 neighbourhood, arranging them for the subsequent non-maximum suppression in a single step. The points coincident with the local maximum of the corner score function in a 7×7 neighbourhood which additionally satisfy the segment test are the final resultant features.

The detector part has been enhanced with two additional counters which keep track of the numerical image coordinates of the detected features. Such functionality is necessary, as the image detector data path only puts a mark whenever it detects a feature. The first counter reflects the image column number, while the second stands for the image row number. As the raw data from the counters are out of sync with the output of the feature detector, additional delay lines have been added so that the detected features are in sync with the information on their respective coordinates. The coordinates are passed to the coordinates' intermediate FIFO buffer.
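For reference, the behaviour of the classifier, corner score and segment test blocks described above can be condensed into a short software model. This is illustrative only, not the RTL: the identifiers mirror the signal names in the text, and the corner score is taken, as stated, to be the greater of the two sums of thresholded differences:

```python
def fast_classify(center, circle, threshold):
    """Behavioural model of the bright/dark classifier data paths.

    `circle` holds the 16 Bresenham-circle intensities (pixel_01..16).
    Returns (score, is_corner): the corner score as the greater of the
    two adder-tree sums, and the result of the segment test (at least
    nine consecutive ones in is_bright or is_dark, with wrap-around).
    """
    sc_bright, sc_dark = [], []
    is_bright = is_dark = 0
    for i, p in enumerate(circle):
        b = p - center - threshold    # 'bright' path: two subtracters
        d = center - p - threshold    # 'dark' path
        if b > 0:                     # negative results are zeroed out
            sc_bright.append(b)
            is_bright |= 1 << i
        if d > 0:
            sc_dark.append(d)
            is_dark |= 1 << i

    # Corner score: greater of the two pipelined adder-tree sums.
    score = max(sum(sc_bright), sum(sc_dark))

    def nine_consecutive(vec):
        # Emulate the 16 nine-input AND gates: every starting position
        # on the circle is checked for a contiguous run of nine ones.
        ext = vec | (vec << 16)       # wrap-around along the circle
        mask = (1 << 9) - 1
        return any((ext >> s) & mask == mask for s in range(16))

    is_corner = nine_consecutive(is_bright) or nine_consecutive(is_dark)
    return score, is_corner
```

A pixel whose circle contains only eight consecutive 'bright' neighbours fails the segment test, while one surrounded by a full ring of brighter pixels passes it with the maximum score for that contrast.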
The data path for feature description begins with the averaging filter block. The original implementation of BRIEF uses an averaging filter with a square-shaped mask [47]. The motivation behind it was the fact that the response of such a filter can be computed relatively quickly on a standard PC using integral images [49]. As the hardware platform used to implement the described system is capable of massively parallel operations, a decision was made to use an averaging filter with a circular mask instead. Such a filter has better characteristics than one using a square window, as its response is isotropic. The mask was defined on a 7×7 image patch, and its shape is depicted in figure 9. The input data are taken from the same 7×7 image patch as the input data for the FAST data path.

Figure 9. The shape of the averaging filter mask implemented in the system.

The filtered image data are then passed further to yet another FIFO and register bank circuit, so that a complete 33×33 pixel window is available for instantaneous processing. The sampling pattern used by the described architecture is the same as that given in figure 3. As the target descriptor length is 256, 256 parallel comparators are used to compute the complete descriptor in a single clock cycle. The resulting binary vector is stored in a 256-bit register. The block diagram of the module for the BRIEF descriptor computation is given in figure 10. The output of the register holding the computed descriptor is connected to the descriptor's intermediate FIFO.

Int J Adv Robot Syst, 2015, 12:141, doi: 10.5772/61434
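The descriptor stage therefore reduces to 256 intensity comparisons on the smoothed 33×33 window. A behavioural sketch follows (the hardware performs all 256 comparisons in a single clock cycle; here they are written as a loop, and the fixed sampling pattern is passed in as coordinate pairs relative to the window centre; all names are illustrative):

```python
def brief_descriptor(window, pattern):
    """Compute a binary BRIEF descriptor from a smoothed, square window.

    `window` is a 33x33 list of lists of already averaged intensities;
    `pattern` is a sequence of ((x1, y1), (x2, y2)) test-point pairs
    with coordinates in [-16, 16] relative to the window centre, as in
    figure 3.  Returns the descriptor packed into a Python int, one bit
    per comparison (bit i is set iff the intensity at the first point
    is smaller than at the second).
    """
    c = len(window) // 2      # centre offset (16 for a 33x33 window)
    desc = 0
    for i, ((x1, y1), (x2, y2)) in enumerate(pattern):
        if window[c + y1][c + x1] < window[c + y2][c + x2]:
            desc |= 1 << i
    return desc
```

With a 256-pair pattern the result fits exactly into the 256-bit register mentioned above.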
In table 1, the number of frames per second that can be processed by the detection-description coprocessor clocked at 100 MHz is given. Tests for VGA, HD and FullHD resolution images were performed. As the detection and matching coprocessor is a fully systolic functional block employing a very long pipeline, the computation time does not change with the number of features and is linearly dependent on the number of pixels to process.

The matching procedure is performed in parallel with the detection and description procedure – the features from the previous image are matched while the detection and description are performed on the next incoming image. Such a solution works best when the processing times of […]

In table 2, the relationship between the number of matching cores and the processing speed is given. The same data are also plotted in figure 14. As shown in the figure, increasing the number of matching cores results in a linear increase in the number of frames per second that can be processed, assuming that the number of features remains constant. However, it should be noted that the speed of the matching procedure is affected by the number of features. The relationship is illustrated by table 3 and figure 15. For reference, the processing speed of the software implementation presented in [13] is also given in the table. For both the software and the hardware implementation, the processing […]

Table 2. Number of frames that each matching accelerator can process in one second for a different number of matching cores and 512 features.

Table 3. Number of frames that the description and matching accelerator can process in one second for different numbers of features and 32 matching cores, compared to a pure software implementation (tested on a Core i7).
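The matching cores compare 256-bit BRIEF descriptors using the Hamming distance, which maps naturally onto hardware (an XOR followed by a population count). A software sketch of the brute-force nearest-neighbour search that the accelerator parallelizes across its matching cores is given below (illustrative only; function names are ours):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary descriptors,
    represented as Python ints (XOR, then count the set bits)."""
    return bin(a ^ b).count("1")

def match_features(desc_prev, desc_curr):
    """Brute-force nearest-neighbour matching on Hamming distance.

    For each descriptor i from the previous frame, find the index j of
    the closest descriptor in the current frame.  Returns a list of
    (i, j, distance) tuples.  In hardware, the descriptors from the
    previous frame are split across the matching cores, each of which
    scans the whole current-frame set in a pipeline.
    """
    matches = []
    for i, a in enumerate(desc_prev):
        dists = [hamming(a, b) for b in desc_curr]
        j = min(range(len(dists)), key=dists.__getitem__)
        matches.append((i, j, dists[j]))
    return matches
```

A cross-check step or a distance threshold can be layered on top of this to reject ambiguous correspondences before they reach the higher-level algorithms.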
Figure 14. Achievable frames per second based on the number of matching cores for 512 features.

[…] 3000 (16%) FFs and 38 (27%) BRAMs. In the case of the smallest system, with only four matching cores, it was roughly three times the number of resources used by the feature matching coprocessor (considering LUTs and FFs). It is the other way around in the case of the biggest system with 32 matching cores, which is over twice the size of the rest of the elements of the system.

  whole system (4 matching cores)    13206 (25) LUTs   22488 (21) FFs   46.5 (33) BRAMs
  whole system (8 matching cores)    14723 (28) LUTs   26616 (25) FFs   46.5 (33) BRAMs
  whole system (16 matching cores)   18729 (35) LUTs   34869 (33) FFs   46.5 (33) BRAMs
  whole system (32 matching cores)   26926 (50) LUTs   51381 (48) FFs   46.5 (33) BRAMs

Table 4. Resource usage of each of the parts of the implemented design and the whole system (designations: LUTs - lookup tables, FFs - flip-flops, BRAMs - block RAM memory blocks). The values in parentheses are percentages of the corresponding resources available in the XC7Z020 device.

The performance of the system can be further improved by using the scatter-gather engine in the DMA module. This would allow it to avoid the costly operation of packet ordering for memory transfers. It would be especially beneficial in the case of the smaller systems, with fewer matching cores: to perform frame-to-frame feature matching, they need to perform more memory transfer operations, equal to number_of_vectors / number_of_matching_cores. This, however, would come at the cost of increasing the complexity of the DMA engine and interconnect, which in turn would result in the DMA module consuming twice as many programmable logic resources. It is also possible to use a more sophisticated and larger FPGA, like the Xilinx Zynq Z-7030, Z-7045 or Z-7100. This would allow the creation of an even faster solution: thanks to their increased capacity, these devices could hold more matching cores and also operate at a higher clock frequency.

It should be noted that the presented solution uses standard DDR RAM memory for storing all the images and the results of operations. This enables easy access to the data for other coprocessors or ARM cores performing further processing. Another advantage is the relatively high capacity of the memory (hundreds of megabytes) and its low cost in comparison to other available solutions. In addition, the system is based on commonly used communication interfaces, like AXI4-Stream and AXI4-Lite, which eases the process of integrating the presented solution with other systems.

7. Conclusions and Future Work

This paper presents an architecture for real-time image feature detection, description and matching implemented in programmable hardware. The presented solution is supplemented with an underlying communication infrastructure based on standard interfaces, ensuring bottleneck-free, DMA-based input and output data transfers and freeing the CPU in the SoC to perform other tasks. The architecture can be easily parametrized to meet specific design goals by choosing a trade-off between processing speed and resource utilization. Moreover, while perfectly capable of working as a standalone solution, the design can be easily expanded or integrated as a part of a more complex system. The choice of algorithms results in a compact, high-performance, low-power architecture that compares favourably with the state of the art in terms of processing speed, hardware utilization and operational characteristics.

Future work will focus on integrating the design as a part of a computer vision processing pipeline, performing a complete application starting from acquisition, through feature detection, description and matching, up to high-level processing, all integrated in a single chip. Considering the characteristic properties of the described system, possible candidates are applications in robot navigation and vision-based surveillance.

8. Acknowledgements

This research was financed by the Polish National Science Centre grant funded according to the decision DEC-2011/03/N/ST6/03022, which is gratefully acknowledged.

9. References

[1] Huiyu Zhou, Yuan Yuan, and Chunmei Shi. Object tracking using SIFT features and mean shift. Computer Vision and Image Understanding, 113(3):345-352, 2009. Special Issue on Video Analysis.