1 Introduction
Stereo vision is a research area in which progress has been made for several decades now, and yet emerging algorithms, technologies, and applications continue to drive research to new advancements. The purpose of stereo vision algorithms is to construct an accurate depth map from two or more images of the same scene, taken from slightly different angles/positions. In a set of two images, one image has the role of the reference image while the other is the non-reference one. The
basic problem of finding pairs of pixels, one in the reference image and the other
in the non-reference image that correspond to the same point in space, is known
as the correspondence problem and has been studied for many decades [1]. The
difference in coordinates of the corresponding pixels (or similar features in the
two stereo images) is the disparity. Based on the disparity between correspond-
ing pixels and on stereo camera parameters such as the distance between the
two cameras and their focal length, one can extract the depth of the related
point in space by triangulation. This problem has been widely researched by
the computer vision community and appears not only in stereo vision but in
other image processing topics as well such as optical flow calculation [2]. The
range of applications of 3D stereo vision can hardly be overstated, with new fields of application emerging continuously, such as recent research on the shape reconstruction of space debris [3].
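In the standard rectified two-camera setup this triangulation takes a simple closed form: with baseline B, focal length f (in pixels), and disparity d (in pixels), the depth is Z = f · B / d. For instance, f = 1000 pixels, B = 10 cm, and d = 50 pixels give Z = (1000 × 0.1 m) / 50 = 2 m.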
The class of algorithms which we study falls into the broad category of pro-
ducing dense stereo maps. An extensive taxonomy of dense stereo vision algorithms is available in [4], and a constantly updated online comparison, containing mainly software implementations, can be found in [5]. In general, the algorithm searches for pixel matches in an area around the reference pixel in the non-reference frame. This entails a heavy processing load, as for each pixel the 2D search space must be exhaustively explored. To reduce the search space, the epipolar line constraint can be applied. This constraint reduces the 2D search area to a 1D line by assuming that the two cameras are placed on
the same horizontal axis (much like the human eyes) and that the corresponding
images do not have a vertical displacement, thus the pixels which correspond
to the same image location are only displaced horizontally. The epipolar line
constraint is enforced through a preprocessing step called rectification, which is
applied to the input pair of stereo images. In this work we concentrate on the
stereo correspondence algorithm and not on the rectification step, assuming that
images are rectified prior to processing. We present an FPGA-based implemen-
tation that is scalable and can be adjusted to the application at hand, offering
great speed-up over a software implementation. Essentially, we extend our work
published in [6], by including more results and a detailed analysis on aspects
related to performance and resource utilization. We should note here that stereo matching is embarrassingly parallel, and thus one would reasonably expect high performance gains from a custom hardware implementation. Hence, our contributions go beyond solely achieving high performance results; they are:
– an analysis showing how the use of aggregation alleviates the need to employ
the more computationally demanding Sum of Absolute Differences (SAD)
algorithm while still maintaining good results;
– an analysis on how to dimension the combination of the Absolute Differences
(AD) and the Census algorithms with aggregation in a single hardware im-
plementation;
– the FPGA-based architecture with detailed tradeoff analysis in the use of its
primitive resources (Block RAM, Flip-Flops, logic slices), which justifies the
use of FPGAs in the field of stereo vision;
– a placed-and-routed design allowing real-time processing up to 87 fps for full
HD 1920 × 1200 frames in a medium-size FPGA;
– a modification at the final phase of the design cycle that improved the system performance by 1.6x;
– a detailed cost vs. accuracy analysis and on-FPGA RAM usage for design
optimization.
The chapter is organized as follows: Section 2 discusses previous work, focus-
ing mainly on hardware-related studies, along with a more up-to-date comparison of recent research results vs. those in our previous work [6]. Section 3 describes
the algorithm and its individual steps. Section 4 analyses the benefits of map-
ping the algorithm to an FPGA with emphasis on dimensioning, and especially
on the usefulness of aggregation in addition to AD and Census vs. the SAD
algorithm. An in-depth discussion of our system is given in Section 5, including
the implementation of belief propagation. Section 6 presents the system performance and resource usage, and Section 7 summarizes the chapter.
2 Relevant Research
3 The Algorithm
Our algorithm consists of the cost computation step, implemented by the Absolute Difference (AD)-Census combination matching cost, a simple fixed-window aggregation scheme, a left/right consistency check, and a scan-line belief propagation solution as a post-processing step. Each step of the algorithm will be
explained below, whereas the justification for the choice of this combination of
algorithms will become evident from quantitative data in Section 4.
The AD measure is defined as the absolute difference between two pixels, C_AD = |p1 − p2|, while census [19] is a window-based cost that assigns a bit-string to a pixel and is defined as the Hamming distance between the bit-strings of two pixels. Let Wc be the size of the census window. A pixel's bit-string is of size Wc^2 − 1 and is constructed by assigning 1 if pi > pc and 0 otherwise, for pi ∈ Window, with pc the central pixel of the window. The two costs are fused by a truncated normalized sum:

C(p1, p2) = min( C_AD(p1, p2) + (255 / (Wc^2 − 1)) · C_census(p1, p2), λ_trunc ),

where λ_trunc is the truncation value given as a parameter. This matching cost
encompasses the local light structure of the image (census) as well as information about the light itself (AD), and produces better results than either of its parts alone, as was
shown in [20]. At object borders, the aggregation window necessarily includes
costs belonging to two or more objects in the scene, whereas ideally we would
like to aggregate only costs of one object. For this reason, truncating the costs
to a maximum value helps at least limiting the effect of any outliers in each
aggregation window [4].
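To make the cost construction concrete, the following Python/NumPy sketch (our own restatement for illustration, not the hardware code; it assumes interior pixel coordinates and a valid disparity d, and its parameter defaults mirror the Wc = 9 and λ_trunc = 63 choices discussed in this chapter) computes the census bit-string and the fused, truncated AD-Census cost of a single pixel pair:

```python
import numpy as np

def census_bitstring(img, y, x, Wc=9):
    # Bit-string of length Wc^2 - 1: a 1 wherever a window pixel
    # exceeds the central pixel of the Wc x Wc window around (y, x).
    r = Wc // 2
    window = img[y - r:y + r + 1, x - r:x + r + 1]
    bits = (window > img[y, x]).astype(np.uint8).ravel()
    return np.delete(bits, (Wc * Wc) // 2)  # drop the central pixel itself

def ad_census_cost(ref, nonref, y, x, d, Wc=9, lam_trunc=63):
    # AD part: absolute difference of the two central pixels (0..255).
    c_ad = abs(int(ref[y, x]) - int(nonref[y, x - d]))
    # Census part: Hamming distance of the two bit-strings (0..Wc^2 - 1).
    c_census = int(np.sum(census_bitstring(ref, y, x, Wc)
                          ^ census_bitstring(nonref, y, x - d, Wc)))
    # Normalize census to 0..255, add the AD part, and truncate.
    c_census = (c_census * 255) // (Wc * Wc - 1)
    return min(c_ad + c_census, lam_trunc)
```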
After initializing the DSI (Disparity Space Image) volume with the AD-Census costs, we perform a simple fixed-window aggregation on the W × H slices of the DSI, as illustrated in Figure 2.
This is based on the assumption that neighbouring pixels (i.e. pixels belonging to the same window) most likely share the same disparity (depth) as well. Although this assumption does not hold for object borders and slanted surfaces, it produces good results. On the other hand, one should carefully select the size of the aggregation
window Wa, as large windows tend to cause an edge-fattening effect at object borders, while small aggregation windows lead to loss of accuracy in the inside
area of an object itself, which results in a noisy output.
The algorithm can be mapped onto an FPGA very efficiently due to its intrinsic parallelism. For instance, the census transform requires Wc^2 − 1 comparisons per pixel to compute the bit-string. Aggregation also requires Wa^2 additions per cost.
For each pixel we must evaluate 64 possible matches by selecting the minimum
cost. These operations can be done in parallel. The buffer architecture requires
memories to be placed close to each other, as they shift data between them in
a very regular way. FPGA memory primitives (BRAMs) are located in such a way that they facilitate this operation. Figure 3 shows the critical components for
the different steps of the algorithm. Our system shares the pixel clock of the
cameras and processes the incoming pixels in a streaming fashion as long as the
camera clock does not surpass the system’s maximum frequency. This way we
avoid building a full frame buffer; we instead keep only the part of the image
that the algorithm is currently processing.
[Schematic of Figure 3: the pipeline AD-Census → Aggregation → Left/Right check → Scan-line Belief Propagation, taking 2×8-bit grayscale input pixels and producing 64×6-bit AD-Census costs, then 64×11-bit aggregated costs, and finally a 6-bit disparity with a 1-bit LRC status; the critical components per stage are the BRAM Lines Buffer, Tree Adder, Comparator Tree, and Parallel Load Queue.]
Fig. 3. Algorithm stages and critical components that fit well into FPGA regular structures
It is important to assess the need for flexibility regarding the algorithm pa-
rameters, and the gains of such a setup. First and foremost, we are seeking to
build a system that is frame-agnostic. Stated differently, we aim to support a series of frame sizes within a range of choices; we regard this feature as obligatory. However, a limit on the maximum frame width was imposed for reasons
explained in Section 6. In addition, all the algorithmic parameters are adjustable: the maximum disparity search range Dmax, the census window size Wc, and the
aggregation window size Wa . We chose to structure our system in a modular way
in order to easily add/remove features. Features such as scan-line belief propagation and aggregation can be turned on or off selectively by the user. Figure
4 shows performance results without aggregation and with various aggregation
window sizes.
Fig. 5. Quality results in terms of the good matches for different census and aggregation
window sizes, when using the AD-Census
We analyzed the influence of the algorithm’s parameters on the quality met-
ric of percentage of good matches, over six (6) datasets of Middlebury’s 2005
database [5]. We have settled on a Wc = 9 × 9 sized census window, a Wa = 5 × 5
sized aggregation window and a Dmax = 64 disparity search range; these values
offer a good trade-off between overall quality and computational requirements.
We followed a similar procedure to determine all the other secondary parame-
ters as well, such as the confident neighborhood queue size and the neighborhood
queue size of the scan-line belief propagation module, and the LR check threshold
of the LR consistency check module [10].
There are negligible gains if we choose a larger Wc or Wa. The maximum achievable percentage of good matches was 78.36% for AD-Census (Wc = 7, Wa = 13), therefore there is no actual benefit in choosing a large aggregation window. It is thus our choice to fix the window sizes in our implementation. Our design remains generic in every parameter aspect but is not reconfigurable at run-time. This decision simplifies our hardware design. For purposes of evaluation and experimental verification, we implemented our design on Xilinx FPGAs, namely a Virtex 5 XC5VLX110T as well as a Spartan 3 1000, setting
the parameters accordingly to fit the FPGA device at hand.
Last but not least, we need to consider what happens with occluded pixels
from one or the other camera. It is therefore useful to allow for some resources to
be used for Belief Propagation (BP), as shown in Section 5. Belief propagation (which uses results from the Left/Right consistency check) does not consume significant resources, but it solves the problem of uncertainty due to occluded pixels that would arise if only one camera's image were used as the reference.
The system in Figure 6 receives two 8-bit pixel values per clock period, one for each image in the stereo pair. A window buffer is constructed for each data flow in two steps. The Lines Buffer stores Wc − 1 scan-lines of the image, each in a BRAM, conceptually transforming the single-pixel input of our system into a Wc-sized column vector. The Window Buffer acts as a Wc-sized buffer for this vector, essentially turning it into a Wc × Wc matrix. This matrix is subsequently fed into the Census Bitstring Generator of Figure 6, which performs Wc^2 − 1 comparisons
per clock, producing the census bit-string. The Central pixels/Bit-strings FIFO stores 64 non-reference census bit-strings and window central pixels, which, along with the reference bit-string and central pixel, are driven to 64 Compute Cost modules. Each Compute Cost module performs the XOR/summing required to produce
the Hamming distance for the census part of the cost, along with the absolute
difference for the AD part and the necessary normalization and addition of the
two. The maximum census cost value is 80, as there are 81 pixels in the window and the central pixel is excluded from the calculation. Likewise, the maximum AD cost value is 255, as each pixel is 8 bits wide. As the two have different ranges, we scale the census part from the 0-80 range to the 0-255 range by turning it into an 8-bit value. To produce the final AD-Census cost we add the two parts together, resulting in a 9-bit cost to account for overflow. Truncating this cost to 6 bits produces a slight improvement in quality, as discussed in Section 3, and also reduces the buffering requirements in the aggregation step.
Fig. 6. Datapath of the cost computation (left side) and aggregation (right side)
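For intuition, the two-step window construction described above can be modeled in software as follows (a minimal sketch of ours; the hardware shifts data between BRAM line memories rather than Python lists, and border effects are ignored here):

```python
from collections import deque

class WindowBuffer:
    """Software model of the Lines Buffer + Window Buffer pair for one stream."""
    def __init__(self, width, Wc=9):
        self.Wc, self.width, self.x = Wc, width, 0
        # Wc - 1 scan-lines kept in BRAM-like line memories.
        self.lines = [[0] * width for _ in range(Wc - 1)]
        self.cols = deque(maxlen=Wc)  # the last Wc column vectors

    def push(self, pixel):
        # Read the stored column and append the incoming pixel at the bottom.
        x = self.x
        col = [self.lines[i][x] for i in range(self.Wc - 1)] + [pixel]
        # Shift the column one line up and store the new pixel.
        for i in range(self.Wc - 2):
            self.lines[i][x] = self.lines[i + 1][x]
        self.lines[self.Wc - 2][x] = pixel
        self.x = (x + 1) % self.width
        self.cols.append(col)
        # A full Wc x Wc window (list of columns) once enough data arrived.
        return list(self.cols) if len(self.cols) == self.Wc else None
```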
For the aggregation stage, 22 line buffers (Aggregation Lines Buffer in Figure
6) are used for 64 streams of 6-bit costs, each lines buffer allocated to 3 streams.
BRAM primitives are organized as independent 18K memories, so we maximize memory utilization by packing three costs per BRAM, accepting a maximum depth of 1024 per line. Like the Lines Buffers at the input, they conceptually transform each stream of data into Wa-sized vertical vectors. Each
vector is summed separately in the Vertical Sum components and driven to
delay adders (Horizontal Sum), which output X(t) + X(t − 1) + ... + X(t − 4).
At the end of this procedure we have 64 aggregated costs.
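Stated compactly, the Vertical Sum and Horizontal Sum components compute a separable window sum; the following NumPy sketch of ours operates on a whole H × W cost slice at once rather than on a stream:

```python
import numpy as np

def aggregate(cost_slice, Wa=5):
    # cost_slice: H x W matrix of truncated costs for one disparity.
    H, W = cost_slice.shape
    # Vertical Sum: add Wa vertically adjacent costs in every column.
    v = sum(cost_slice[i:H - Wa + 1 + i, :] for i in range(Wa))
    # Horizontal Sum (delay adders): X(t) + X(t-1) + ... + X(t-Wa+1).
    return sum(v[:, i:W - Wa + 1 + i] for i in range(Wa))
```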
Following the aggregation of costs, the LRC component illustrated in Figure 7 filters out mismatches caused by occlusions; its operation is illustrated in Figure 8. The architecture of the LRC is based on the observation that by computing the right-to-left disparity at reference pixel p(x, y), we have already computed the costs needed to extract the left-to-right disparity at non-reference pixel p′(x, y).
The LRC buffer is a delay in the form of a ladder that outputs the appropriate
left-to-right costs needed to extract the non-reference disparity. The WTA mod-
ules select the match with the best (lowest) cost using comparator trees. The
reference disparity is delayed in order to allow enough time for the non-reference disparities to build up in the NonReference Disparities Buffer, and then it is used to index said buffer. Finally, a threshold on the absolute difference between
DispRL (x, y) and DispLR (x, y) indicates the false matches detected.
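The reuse of costs that Figure 8 illustrates can be restated in software as follows (a sketch under our notation, with border pixels clamped for brevity; here cost[x, d] is the aggregated cost matching reference (Right) pixel x with non-reference (Left) pixel x + d on one scan-line):

```python
import numpy as np

def lr_check(cost, thresh=1):
    W, Dmax = cost.shape
    disp_rl = np.argmin(cost, axis=1)  # WTA, right-to-left disparities
    # The left-to-right candidates for Left pixel xp were already
    # computed: they are cost[xp - d, d] (see Figure 8).
    disp_lr = np.array([min((cost[xp - d, d], d)
                            for d in range(Dmax) if xp - d >= 0)[1]
                        for xp in range(W)])
    # Keep a match only if the two views agree within the threshold.
    valid = np.array([abs(d - disp_lr[min(x + d, W - 1)]) <= thresh
                      for x, d in enumerate(disp_rl)])
    return disp_rl, valid
```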
Fig. 7. Datapaths of the left/right consistency check (left side) and scan-line belief propagation (right side)
The datapath for the scan-line belief propagation algorithm is shown in Fig-
ure 7. The function of this component is based on two queues: the Confident
Neighborhood Queue and the Neighborhood Queue. As implied by its name,
the Confident Neighborhood Queue places quality constraints on its contents,
meaning that only disparities passing the LR consistency check are written in it.
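The net effect of these queues on a scan-line can be sketched loosely as follows (our simplification, not the exact hardware policy: confident disparities are spread along the scan-line to pixels that failed the LR check, as also noted in the verification discussion of Section 6):

```python
def scanline_bp(disp, valid):
    # disp: disparities of one scan-line; valid: per-pixel LRC status.
    out = list(disp)
    confident = None  # most recent disparity that passed the LR check
    for x in range(len(disp)):
        if valid[x]:
            confident = disp[x]   # would enter the Confident Neighborhood Queue
        elif confident is not None:
            out[x] = confident    # overwrite an unreliable match
    return out
```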
Fig. 8. Top row has pixels of the Right image, whereas bottom row has pixels of the
Left image. The grey pixel in the center of the Right image has a search space in the
Left image shown with the broad grey area. To determine the validity of DispRL (x, y),
we need all left-to-right disparities in the broad grey area, thus we need right-to-left
costs up to x + Dmax. The same holds for the diagonally shaded pixel in the center of the Left image.
The system described above can process one pixel pair per clock period, after an initial latency. The most computationally intensive part of the
main stage of the algorithm lies in the XOR/sum module of the AD-Census, which computes the XOR/sum of 64 80-bit strings simultaneously. A similar situation holds for the WTA module, which performs 64 simultaneous 11-bit comparisons.
We cope with both bottlenecks through fully pipelined adder/comparator trees
in order to increase the throughput. After the initial implementation, we added
extra pipeline stages to further enhance performance. Below we present the dif-
ferences between the initial (unoptimized) design, and the second (optimized)
design. Table 2 shows the performance results of the system implemented in a
Xilinx Virtex-5 FPGA. The maximum clock after place and route for the unoptimized design is 131 MHz, while for the optimized design it is 201 MHz. Table 3 shows
the differences in resource utilization between the two designs; it demonstrates
that for a speed improvement of over 50%, the resource utilization penalty is
rather small. Based on data we gathered from the tools, the critical path lies on a control signal driving the FSM of the aggregation line buffers; 16.6% of the delay is attributed to logic while the remaining 83.4% is due to routing.
Table 2. Design clock and processing rates for the optimized vs. unoptimized design
in Virtex XC5VLX110T FPGA for various resolutions
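As a back-of-the-envelope check of the rates in Table 2: since the pipeline consumes one pixel pair per clock period, the frame rate is f_clk / (W × H); at 201 MHz this gives 201 × 10^6 / (1920 × 1200) ≈ 87 fps for 1920 × 1200 frames, matching the figure quoted in the introduction, while the unoptimized 131 MHz clock yields about 57 fps.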
We should point out here that the Wc, Wa and Dmax parameters relate to tasks carried out in parallel, thus they do not affect system performance but only resource utilization. Table 4 shows the distribution of resources along with the percentage breakdown of each resource type for the optimized design.
Table 4. Resource utilization of the optimized design in Virtex XC5VLX110T FPGA
for Dmax = 64, Wc = 9, Wa = 5
In addition, very large frame sizes cause parameter bloating. Specifically, images with 1800 × 1500 resolution require at least Dmax = 180 to achieve satisfactory results in terms of quality (without altering the current camera baseline).
While keeping the other parameters constant (Wc = 9, Wa = 5), such a large
Dmax would require buffering 180 × 5 × 1800 elements in the aggregation stage.
Due to the above we decided to put a limit on the image width. Restricting
the frame width to 1024 pixels allowed us to:
– Pack at least two lines per 18K BRAM using an 18 × 1K BRAM primitive configuration. For each cost line we allocate 9 × 1024 bits.
– Avoid excessive parameter bloating.
Using AD-Census, the costs are 9 bits long, as described earlier. This benefits our design as BRAM primitives can be used optimally in an 18 × 1K configuration. Using pure Census, the cost size is reduced to 7 bits. We can maximize BRAM usage by using 9-bit costs, so we have room to increase the census window size Wc up to 21 × 21, with little additional cost in resource usage.
If the cost size is less than 9 bits or the frame width is less than 1024, we can pack more lines. This aspect of our design is also parametric: depending on the frame size and cost size, each BRAM can fit up to 6 lines in a 36 × 512 BRAM configuration.
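The packing rule reduces to simple arithmetic; a small self-check of ours, using the 18 × 1024 = 18432 data bits of one 18K BRAM:

```python
def lines_per_18k_bram(cost_bits, frame_width):
    # How many cost lines (or packed cost streams) fit in one 18K BRAM.
    return (18 * 1024) // (cost_bits * frame_width)

assert lines_per_18k_bram(9, 1024) == 2  # AD-Census costs, 18 x 1K config
assert lines_per_18k_bram(6, 1024) == 3  # 6-bit truncated costs
assert lines_per_18k_bram(6, 512) == 6   # 36 x 512 configuration
```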
In an effort to reduce BRAM consumption even further, we performed a cost
size-accuracy tradeoff experimental analysis, depicted in Figure 9. AD-Census was redefined as:

C_sat(p1, p2) = min( C(p1, p2), λ_sat ),

where λ_sat is the saturation value. Selecting saturation values that fill a power-of-two range (e.g., 63 for a 6-bit cost) reduces the cost size and thus fits more data into the BRAMs that implement the aggregation buffers. Our analysis shows that there is a slight benefit in doing so: for a saturation value of 63 (cost size reduced to 6 bits), and for the default Wc and Wa values of 9 and 5 respectively, we observe a 0.5% improvement over the cost without saturation.
Fig. 9. Cost size/accuracy analysis. The peak value shifts to the right as the true
maximum cost increases.
This is an important result because it puts our quality almost on par with
a Wc = 11 and Wa = 5 parameter set. This slight improvement is attributed
to the reduction of the influence of outliers within the aggregation window by
truncating the cost. With 6-bit costs, we can pack 3 streams of costs per Aggregation Lines Buffer, thus reducing BRAM consumption even further. Note that all the results presented so far with regard to FPGA resource utilization correspond to designs incorporating the previous optimizations.
Figure 10 shows the effect of optimizations on BRAM utilization for Wc = 9,
Wa = 5, Dmax = 64 and a maximum frame width of 1024 pixels. Operating with
small frame sizes allows for optimal algorithm performance.
We performed extensive verification of our designs. Figure 11 shows the set of images we used to test our prototype. We fed stereo image pairs to both implementations and compared the software and the FPGA outputs against the ground truth. The SW version aimed to support the validation phase; we developed it in Matlab prior to the FPGA design. In terms of the physical setup for the verification, Figure 12 shows the
methodology we followed to validate the FPGA system. The pixel values in the FPGA output were subtracted from those in the SW output, pixel by pixel, so as to create an array holding their differences. We found that SW and HW produced similar results. The error lines are attributed to a slightly different selection policy in the WTA process of the LRC stage. In particular, when comparing two equal cost values, our SW selects one of them randomly, while our HW always selects the first one. This variation occurs early in the algorithmic flow, thus it is not
only propagated but it is also amplified in the belief propagation module where
local estimates of correct disparities are spread to incorrectly matched pixels
along the scan-line. Finally, the errors at the borders that occur in both the SW and HW outputs, as compared with the ground truth, are due to the unavoidable occlusions at the image borders.
Fig. 10. BRAM resource utilization with the optimized aggregation buffer structure
Fig. 11. Top row: the Moebius 400 × 320 input dataset from the Middlebury database and the ideal (ground truth) result. Bottom row: the algorithm's output from the SW (left) and HW (right) implementations.
8 Acknowledgement
This work has been partially supported by the General Secretariat of Research and Technology (G.S.R.T.), Hellas, under the project AFORMI (Allowing for Reconfigurable Hardware to Efficiently Implement Algorithms of Multidisciplinary Importance), funded in the call ARISTEIA of the framework Education and Lifelong Learning (code 2427).
References
1. D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs, NJ, US:
Prentice-Hall, 1982.
2. D. Marr, Vision. San Francisco, CA, US: Freeman, 1982.
3. S. Di Carlo, P. Prinetto, D. Rolfo, N. Sansonne, and P. Trotta, “A novel algorithm
and hardware architecture for fast video-based shape reconstruction of space de-
bris,” EURASIP Journal on Advances in Signal Processing, vol. 2014, no. 1, pp.
1–19, 2014.
4. D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame
Stereo Correspondence Algorithms,” International Journal of Computer Vision,
vol. 47, no. 1-3, pp. 7–42, April-June 2002.
5. http://vision.middlebury.edu/stereo/eval/.
6. S. Thomas, K. Papadimitriou, and A. Dollas, “Architecture and Implementation
of Real-Time 3D Stereo Vision on a Xilinx FPGA,” in IFIP/IEEE International
Conference on Very Large Scale Integration (VLSI-SoC), October 2013, pp. 186–
191.
7. K. Konolige, “Small Vision Systems: Hardware and Implementation,” in Proceed-
ings of the International Symposium on Robotics Research, 1997, pp. 111–116.
8. C. Murphy, D. Lindquist, A. M. Rynning, T. Cecil, S. Leavitt, and M. L. Chang,
“Low-Cost Stereo Vision on an FPGA,” in Proceedings of the IEEE Symposium
on Field-Programmable Custom Computing Machines (FCCM), April 2007, pp.
333–334.
9. S. Hadjitheophanous, C. Ttofis, A. S. Georghiades, and T. Theocharides, “Towards
Hardware Stereoscopic 3D Reconstruction, A Real-Time FPGA Computation of
the Disparity Map,” in Proceedings of the Design, Automation and Test in Europe
Conference and Exhibition (DATE), March 2010, pp. 1743–1748.
10. M. Humenberger, C. Zinner, M. Weber, W. Kubinger, and M. Vincze, “A Fast
Stereo Matching Algorithm Suitable for Embedded Real-Time Systems,” Computer
Vision and Image Understanding, vol. 114, no. 11, pp. 1180–1202, November 2010.
11. D. K. Masrani and W. J. MacLean, “A Real-Time Large Disparity Range Stereo-
System using FPGAs,” in Proceedings of the IEEE International Conference on
Computer Vision Systems, 2006, pp. 42–51.
12. S. Jin, J. U. Cho, X. D. Pham, K. M. Lee, S.-K. Park, M. Kim, and J. W. Jeon,
“FPGA Design and Implementation of a Real-Time Stereo Vision System,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 20, no. 1, pp.
15–26, January 2010.
13. G. Rematska, K. Papadimitriou, and A. Dollas, “A Low-Cost Embedded Real-Time
3D Stereo Matching System for Surveillance Applications,” in IEEE International
Symposium on Monitoring and Surveillance Research (ISMSR), in conjunction
with the IEEE International Conference on Bioinformatics and Bioengineering
(BIBE), November 2013.
14. M. Jin and T. Maruyama, “Fast and Accurate Stereo Vision System on FPGA,”
ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 7,
no. 1, pp. 3:1–3:24, February 2014.
15. http://danstrother.com/2011/01/24/fpga-stereo-vision-project/.
16. C. Rhemann, A. Hosni, M.Bleyer, C. Rother, and M. Gelautz, “Fast Cost-Volume
Filtering for Visual Correspondence and Beyond,” in Proceedings of IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), June 2011, pp.
3017–3024.
17. R. Kalarot and J. Morris, “Comparison of FPGA and GPU Implementations of Real-Time Stereo Vision,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June
2010, pp. 9–15.
18. S. Mattoccia, “Stereo Vision Algorithms for FPGAs,” in IEEE Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 636–
641.
19. R. Zabih and J. Woodfill, “Non-parametric Local Transforms for Computing Visual
Correspondence,” in Proceedings of the European Conference on Computer Vision
(ECCV), 1994, pp. 151–158.
20. X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang, “On Building an
Accurate Stereo Matching System on Graphics Hardware,” in Proceedings of the
IEEE International Conference on Computer Vision Workshops (ICCV), Novem-
ber 2011, pp. 467–474.