High-Speed Architecture Based on FPGA for a Stereo Vision Algorithm
Signal Processing Laboratory, Electronics Department, DICIS, University of Guanajuato, Salamanca, Guanajuato, Mexico
Mechatronics Department, Campus Loma Bonita, University of Papaloapan, Loma Bonita, Oaxaca, Mexico
1. Introduction
Stereo vision is used to reconstruct the 3D (depth) information of a scene from two images, called left and right, acquired by two cameras separated by a previously established distance. Stereo vision is a very popular technique used for applications such as mobile robotics, autoguided vehicles and 3D model acquisition. However, the real-time performance these applications require cannot be achieved by conventional computers, because the processing is computationally expensive. For this reason, other solutions, such as reconfigurable architectures, have been proposed to execute dense computational algorithms. In the last decade, several works have proposed high-performance architectures to solve the stereo vision problem, e.g. digital signal processors (DSP), field programmable gate arrays (FPGA) or application-specific integrated circuits (ASIC). ASIC devices are one of the most complicated and expensive solutions; however, they afford the best conditions for developing a final commercial system (Woodfill et al., 2006). FPGAs, on the other hand, have allowed the creation of hardware designs in standard, high-volume parts, thereby amortizing the cost of mask sets and significantly reducing time-to-market for hardware solutions. However, engineering cost and design time for FPGA-based solutions still remain significantly higher than for software-based solutions. Designers must frequently iterate the design process in order to achieve system performance requirements and simultaneously minimize the required size of the FPGA, and each iteration of this process takes hours or days to complete (Schmit et al., 2000). Even if designing with FPGAs is faster than designing ASICs, an FPGA has a finite resource capacity, which demands clever strategies for adapting versatile real-time systems (Masrani & MacLean, 2006).

In this chapter, we present a high-speed reconfigurable architecture based on the Census Transform algorithm (Zabih & Woodfill, 1994) for calculating the disparity map of a dense stereo vision system. The reuse of operations and the integer/binary nature of these operations were carefully exploited on the FPGA to obtain a final architecture that generates up to 325 dense disparity maps of 640 × 480 pixels per second, even though most vision-based systems do not require such high frame rates. In this context, we propose a stereo vision system that can be adapted to the requirements of the real-time application. An
analysis of the four essential architectural parameters (the window sizes of the arithmetic mean and median filters, the maximal disparity and the window size of the Census Transform) is carried out to obtain the best trade-off between consumed resources and disparity-map accuracy. We vary these parameters and show a graphical representation of the consumed resources versus the desired performance for different extended architectures; from these curves, we can easily select the most appropriate architecture for our application. Furthermore, we develop a practical application of the obtained disparity map to tackle the problem of 3D environment reconstruction using the back-projection technique. Experimental performance results are compared to those of related architectures.
2. The stereo vision algorithm
First of all, the algorithm processes each of the two images (right and left) in parallel and independently. The process begins with the rectification and distortion correction of each image. This step reduces the search for corresponding points, needed to calculate the disparity, to a single dimension. In order to reduce the complexity and size of the required architecture, the algorithm relies on the epipolar constraint: the main axes of the cameras must be aligned in parallel, so that the epipolar lines between the two cameras correspond to the displacement of position between two pixels (one per camera). Under this condition, locating an object in the scene reduces to a horizontal translation: if a pair of pixels is visible in both cameras and is assumed to be the projection of a single point in the scene, then both pixels must lie on the same epipolar line (Ibarra-Manzano, Almanza-Ojeda, Devy, Boizard & Fourniols, 2009).
2.1 Image preprocessing
The Census Transform requires the left and right input images to be pre-processed. During image pre-processing, we use an arithmetic mean filter over a rectangular window of m × n pixels. $S_{uv}$ represents the set of image coordinates inside the rectangular window centered on the point (u, v). The arithmetic mean filter calculates the mean value of the noisy image I(u, v) over the rectangular window defined by $S_{uv}$; the filtered image $\bar{I}$ takes this arithmetic mean value at each point (u, v) (see Equation 1):

$$\bar{I}(u,v) = \frac{1}{mn} \sum_{(i,j) \in S_{uv}} I(i,j) \qquad (1)$$
This filter can be implemented without the scale factor 1/(mn), because the window size is constant during the filtering process. The arithmetic mean filter smooths local variations in the image and, at the same time, notably reduces the noise produced by camera motion.
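As an illustration of Equation 1, the following C sketch applies the filter in software. The window constants and the function name are ours, and the border pixels are simply skipped rather than handled as in the hardware module described later.

```c
#include <stdint.h>

#define M 3 /* window height (assumed 3, as in the hardware module) */
#define N 3 /* window width */

/* Arithmetic mean filter over an m x n window (Equation 1).
 * As noted in the text, the accumulated sum could be kept unscaled
 * (no 1/(m*n) division) since the window size is constant. */
void mean_filter(const uint8_t *in, uint8_t *out, int width, int height)
{
    for (int v = M / 2; v < height - M / 2; v++) {
        for (int u = N / 2; u < width - N / 2; u++) {
            uint32_t sum = 0;
            for (int i = -M / 2; i <= M / 2; i++)
                for (int j = -N / 2; j <= N / 2; j++)
                    sum += in[(v + i) * width + (u + j)];
            out[v * width + u] = (uint8_t)(sum / (M * N));
        }
    }
}
```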
2.2 Census Transform
Once the input images have been filtered, they are used to calculate the Census Transform. This transform is a non-parametric measure used during the matching process for measuring similarities and obtaining the correspondence between points in the left and right images. A neighborhood of pixels is used for establishing the relationships among them (see Equation 2):

$$I_C(u,v) = \bigotimes_{(i,j) \in D_{uv}} \xi\big(I(u,v),\, I(i,j)\big) \qquad (2)$$
where $D_{uv}$ represents the set of coordinates in the square window of n × n pixels (n odd) centered at the point (u, v). The function ξ compares the intensity of the center pixel (u, v) with that of each pixel in $D_{uv}$: it returns 1 if the intensity of the pixel (i, j) is lower than the intensity of the center pixel (u, v), and 0 otherwise. The operator ⊗ denotes the concatenation of the bits calculated by ξ. $I_C(u,v)$, the Census Transform at the point (u, v), is therefore a bit chain.
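A minimal C sketch of the transform follows. The 7 × 7 window, fixed here as a constant, anticipates the size chosen for the hardware in Section 3.2; the function assumes (u, v) lies at least 3 pixels from the image border.

```c
#include <stdint.h>

#define CW 7 /* Census window size n x n (assumed 7, as chosen later) */

/* Census Transform (Equation 2): concatenates, for every neighbor
 * (i, j) of the center pixel (u, v), one bit that is 1 when
 * I(i, j) < I(u, v) and 0 otherwise. With a 7x7 window this yields
 * a 48-bit chain (the center pixel is not compared with itself). */
uint64_t census_transform(const uint8_t *img, int width, int u, int v)
{
    uint64_t chain = 0;
    uint8_t center = img[v * width + u];
    for (int i = -CW / 2; i <= CW / 2; i++) {
        for (int j = -CW / 2; j <= CW / 2; j++) {
            if (i == 0 && j == 0)
                continue;               /* skip the center pixel */
            chain <<= 1;
            if (img[(v + i) * width + (u + j)] < center)
                chain |= 1;             /* xi(I(u,v), I(i,j)) = 1 */
        }
    }
    return chain;
}
```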
2.3 Census correlation
The two bit chains (one per image) obtained from the Census Transform are compared using the Hamming distance. This comparison, called the correlation process, allows us
to obtain a disparity measure. The similarity evaluation is based on the binary comparison between two bit chains given by the Census Transform. The left-to-right disparity measure $D_{H_1}$ at the point (u, v) is calculated by Equation 3, where $I_{C_l}$ and $I_{C_r}$ represent the Census Transforms of the left and right images, respectively, and N is the length of the Census bit chains. This disparity measure comes from maximizing the similarity along the same epipolar line v in the two images. In the same equation, D represents the maximal displacement on the epipolar line of the right image, and the operator ⊙ is the binary XNOR:

$$D_{H_1}(u,v) = \max_{d \in [0,D]} \frac{1}{N} \sum_{i=1}^{N} I_{C_l}(u,v)_i \odot I_{C_r}(u-d,v)_i \qquad (3)$$

The correlation process is carried out twice (left to right, then right to left) with the aim of reducing the disparity error. Equation 4 gives the right-to-left disparity measure, which was added to complement the process. Contrary to the measure in Equation 3, Equation 4 searches over the pixels that follow the current pixel:

$$D_{H_2}(u,v) = \max_{d \in [0,D]} \frac{1}{N} \sum_{i=1}^{N} I_{C_r}(u,v)_i \odot I_{C_l}(u+d,v)_i \qquad (4)$$
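The C sketch below illustrates Equation 3 only; the right-to-left measure of Equation 4 is symmetric. It assumes the 48-bit chains are stored in 64-bit words and uses the GCC/Clang builtin __builtin_popcountll for the bit count, so the popcount mechanism (though not the XNOR-and-maximize logic) is an implementation assumption.

```c
#include <stdint.h>

#define D_MAX 64 /* maximal disparity (as in the architecture) */

/* Number of set bits in a 48-bit chain (GCC/Clang builtin assumed). */
static int popcount48(uint64_t x) { return __builtin_popcountll(x); }

/* Left-to-right Census correlation (Equation 3): for each candidate
 * displacement d, the similarity is the number of matching bits
 * between the two Census chains (XNOR followed by a bit count); the
 * retained disparity is the d maximizing this similarity, which is
 * equivalent to minimizing the Hamming distance. cl and cr hold one
 * epipolar line of Census-transformed left and right pixels. */
int disparity_lr(const uint64_t *cl, const uint64_t *cr, int u)
{
    int best_d = 0, best_sim = -1;
    for (int d = 0; d <= D_MAX && d <= u; d++) {
        int sim = popcount48(~(cl[u] ^ cr[u - d]) & ((1ULL << 48) - 1));
        if (sim > best_sim) { best_sim = sim; best_d = d; }
    }
    return best_d;
}
```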
2.4 Disparity validation
Once both disparity measures have been obtained, the validation task is straightforward. The disparity measure validation (right to left and left to right) consists of comparing both disparity values and taking the absolute difference between them. If this difference is lower than a predefined threshold λ, the disparity value is accepted; otherwise, the disparity value is labeled as undefined (ind). Equation 5 expresses the validation of the disparity measures, $D_H$ being the validated result:

$$D_H = \begin{cases} D_{H_1} & \text{if } |D_{H_1} - D_{H_2}| < \lambda \\ \text{ind} & \text{otherwise} \end{cases} \qquad (5)$$
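In software, the validation of Equation 5 reduces to a few operations. The threshold value and the numeric encoding of the undefined label below are assumptions, chosen to be consistent with the hardware description in Section 3.4.

```c
#define LAMBDA    1        /* validation threshold (assumed value) */
#define UNDEFINED (64 + 1) /* undefined label: max disparity plus one */

/* Disparity validation (Equation 5): the left-to-right value is kept
 * only if it agrees with the right-to-left value within LAMBDA;
 * otherwise the pixel is labeled undefined. */
int validate_disparity(int d_h1, int d_h2)
{
    int diff = d_h1 - d_h2;
    if (diff < 0)
        diff = -diff;
    return (diff < LAMBDA) ? d_h1 : UNDEFINED;
}
```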
2.5 Disparity filtering
A final filtering process is needed in order to improve the quality of the final disparity image. $M_{uv}$ is the set of coordinates in an m × n rectangular window centered on the point (u, v). First, the disparity values $D_H(i,j)$ in the region defined by $M_{uv}$ are ordered. The median filtering process then selects the central value of the ordered list. The same process is carried out over every image pixel in order to obtain the filtered image $\bar{D}_H$. Hence this filtered image, calculated by the median filter and expressed in terms of the central pixel (u, v), is written as in Equation 6:

$$\bar{D}_H(u,v) = \operatorname{median}\big(D_H(i,j),\ (i,j) \in M_{uv}\big) \qquad (6)$$
Whereas an arithmetic mean filter is used for the image preprocessing described above, a median spatial filter is used here, because the median filter selects one value among the existing disparity values to represent the disparity in the search window. This means that no new value is created, as happens with the arithmetic mean filter.
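A direct C rendering of Equation 6 for the 3 × 3 case might look as follows. The insertion sort is for clarity only; the hardware uses the partial-ordering network described in Section 3.5.

```c
#include <string.h>

/* Median filter of the disparity map (Equation 6) over a 3x3 window:
 * the nine disparity values are ordered and the central one is kept,
 * so the output is always one of the existing disparity values. */
int median9(const int w[9])
{
    int s[9];
    memcpy(s, w, sizeof(s));
    for (int i = 1; i < 9; i++) {           /* insertion sort */
        int key = s[i], j = i - 1;
        while (j >= 0 && s[j] > key) { s[j + 1] = s[j]; j--; }
        s[j + 1] = key;
    }
    return s[4]; /* middle value of the ordered list */
}
```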
3. Hardware implementation
We have implemented an FPGA architecture that performs several image-processing tasks with high performance. During the architecture design, we tried to minimize the resources consumed in the FPGA while maximizing system performance. In the previous section, we explained the essential tasks of our algorithm: the image preprocessing, the Census Transform, the Census correlation, the disparity validation and the filtering of the disparity image. In this section, we describe how these tasks are implemented in the FPGA.
3.1 The arithmetic mean filter module
Usually the image acquired from the camera is not very noisy, so a fast smoothing function can be used to obtain a good-quality image. For this reason, we use the arithmetic mean filter, which is fast and easy to implement. Since we are mainly interested in minimizing the required resources, after several tests we chose a window size of 3 × 3 pixels, which is enough to achieve good results. Indeed, this size allows us to save resources in the final architecture.
Fig. 2. Module architecture to calculate the arithmetic mean filter.

The arithmetic mean calculation is carried out in two stages: horizontal and vertical. The block diagram of this architecture, in accordance with the process described in Subsection 2.1, is shown in Figure 2. The three input registers (left side of the diagram) are used for the horizontal addition. These registers are connected to two 8-bit parallel adders, and the result is coded on 10 bits. The result of this operation is stored in a memory that is twice the image line length. To obtain the sum of all the elements in the window, a vertical addition is then carried out, using the current horizontal sum plus the two previous horizontal sums stored in the memory; this is shown on the right side of the diagram. Finally, the arithmetic mean of the nine pixels is coded on 12 bits. The delay of this stage corresponds to one image line plus one value.
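The following behavioral C sketch mimics this dataflow; it is an illustration of Figure 2, not the RTL itself. The function and buffer names are ours, and one call corresponds to one pixel clock.

```c
#include <stdint.h>

#define WIDTH 640 /* image line length (as in the chapter) */

static uint16_t line_mem[2 * WIDTH]; /* two lines of horizontal sums */

/* Two-stage mean-filter step: the horizontal stage adds three 8-bit
 * pixels into a 10-bit sum; the line memory (two image lines long)
 * provides the two previous horizontal sums, which the vertical stage
 * adds into the final 12-bit unscaled 9-pixel sum. */
uint16_t mean_filter_step(uint8_t p0, uint8_t p1, uint8_t p2,
                          int x, int line)
{
    uint16_t hsum  = (uint16_t)p0 + p1 + p2;             /* 10 bits */
    uint16_t prev1 = line_mem[((line + 1) % 2) * WIDTH + x]; /* v-1 */
    uint16_t prev2 = line_mem[(line % 2) * WIDTH + x];       /* v-2 */
    uint16_t vsum  = hsum + prev1 + prev2;               /* 12 bits */
    line_mem[(line % 2) * WIDTH + x] = hsum; /* recycle oldest slot */
    return vsum;
}
```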
3.2 The Census Transform module
The arithmetic means of the left and right images are used as inputs of the Census Transform stage. This transformation codes all the intensity values inside the search window with respect to the central intensity value. The block diagram of the Census Transform architecture is shown in Figure 3. The performance of this module depends on the size of the search window: the window size directly increases the resources and the processing time, so the best trade-off between consumed resources and window size has to be selected. After several tests, the best compromise between processing time and hardware resources was reached for a 7 × 7 pixel window.
Fig. 3. Module architecture to calculate the Census Transform.

This window requires 49 registers. In addition, 6 memory blocks are used in the processing module. The size of these memory blocks is obtained as follows: the image width (usually 640) minus the size of the search window (7 pixels in our case), multiplied by 12; in our case, (640 − 7) × 12 = 7,596 bits per block. The constant 12 appears because the input of the Census Transform has the same width as the output of the arithmetic mean module. Once the size of the search window has been selected, we can describe the Census Transform itself. The central pixel of the search window is compared with its 48 local neighbors. This operation implies connecting all the corresponding registers to parallel comparators, as shown in Figure 3. The result is coded on 48 bits, each bit corresponding to one comparator output. This stage has a delay equal to half the search window height multiplied by the image line length.
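The sizing rule can be checked with a few lines of C; the program below simply evaluates the quoted formula for the chapter's parameters.

```c
#include <stdio.h>

/* Worked example of the line-buffer sizing rule quoted in the text:
 * (image width - Census window size) x 12 bits per memory block,
 * with one block per window line except the current one. */
int main(void)
{
    int width = 640, census_win = 7, word_bits = 12;
    int block_bits = (width - census_win) * word_bits; /* 7,596 bits */
    int blocks     = census_win - 1;                   /* 6 blocks   */
    printf("%d blocks of %d bits = %d bits total\n",
           blocks, block_bits, blocks * block_bits);
    return 0;
}
```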
3.3 The Census correlation module
The correlation process consists of analyzing the left and right images resulting from the Census Transform. Considering that both images contain a common object, the correlation process
has the aim of finding the displacement between two projected pixels belonging to that common object in the images. Since the images are acquired from two different points of view (associated with the left and right camera positions), there is a noticeable difference between the positions of points belonging to the same object; this difference is referred to as the disparity. Usually, the correlation maximizes the similarity between the two images in order to find the disparity. We also use this common method, which consists of two main stages: the calculation of the similarity measure and the search for its maximal value. Figure 4 shows the block diagram of the corresponding architecture.

We are interested in reducing the delay of the correlation process, therefore it is more convenient to compare one point of the left Census Transform image with the maximal number of points in the right one. Furthermore, the correlation process is executed twice in order to minimize the error during the correlation computation. We consider 64 pixels as the maximal disparity value for each of the left and right Census Transform images. The registers represented by the gray blocks on the left of Figure 4 store those pixels. The stored pixels, together with the pixels of the left Census Transform image, enter the binary XNOR operators, which deliver a 48-bit chain at the output. The XNOR function is used to find the maximal and minimal similarity associated with the disparity values at the input: the bits enter the XNOR gates by pairs, and if the compared bits are equal the XNOR output is 1, which indicates maximal similarity; if they differ, the output is 0, which is associated with minimal similarity.

Once the similarity has been calculated, we continue with the search for the highest similarity value among the 64 compared pixels, independently for the left-to-right and right-to-left correlations. This task requires several selector units, each with 4 inputs distributed as follows: 2 for the similarity values to be compared and 2 for the indicators. The indicators are associated with the displacement between pixels in the left and right Census Transforms. That is, if a pixel has the same position in both the right and left Census Transforms, its indicator is zero; otherwise, the indicator represents the number of pixels between the pixel position in the left Census Transform and the pixel in the right one. The block diagram in Figure 4 graphically describes the implementation of the maximization process. Each selector unit receives two similarity measures and two indicators as inputs; its elements are a multiplexer and a comparator. The comparator receives the two similarity values coming from the first stage, and its output acts as the selector input of the multiplexer, so that the multiplexer outputs the higher similarity measure together with its associated pixel index. In order to obtain the pixel with the maximal similarity measure, six levels of selector units are needed, organized in a pyramidal fashion. The lowest level, the first layer, carries out the selector-unit task described above 32 times.
As we ascend the pyramid, each level halves the number of selector units relative to the previous one. The last level delivers the highest similarity value between the corresponding pixels in the left and right Census images. Whereas the right-to-left correlation stores left Census Transform pixels, which are compared with one pixel of the right Census image, in the left-to-right correlation the comparison is relative to one pixel of the left Census image. For this reason, a similar architecture is used to implement both correlation processes. This whole stage (including both the right-to-left and left-to-right processes) has a delay that depends on the number of selector-unit layers, which in turn depends on the maximal disparity value. In our case, we set the maximal disparity value to 64, so the number of layers is 6.
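The maximization pyramid can be modeled in C as follows. The struct and function names are ours, and ties are resolved here toward the smaller index, a choice the chapter does not specify.

```c
/* Behavioral sketch of the pyramidal maximization of Figure 4: at
 * each of the log2(64) = 6 levels, selector units compare pairs of
 * (similarity, index) values and keep the pair with the higher
 * similarity, so the last level delivers the winning disparity. */
typedef struct { int sim; int idx; } cand_t;

int pyramid_argmax(cand_t c[64])
{
    for (int n = 64; n > 1; n /= 2)      /* 6 levels for 64 inputs */
        for (int k = 0; k < n / 2; k++)  /* one selector unit each */
            c[k] = (c[2 * k].sim >= c[2 * k + 1].sim) ? c[2 * k]
                                                      : c[2 * k + 1];
    return c[0].idx; /* disparity index of the best match */
}
```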
3.4 The disparity validation module
This module fuses the two disparity values obtained by the Census correlation processes (right to left and left to right). First, the difference between the two values is calculated; it is then compared with the threshold λ. If the difference is lower than λ, the left-to-right disparity value is the output of the comparison; otherwise, the output is labeled as undefined.
Figure 5 shows the block diagram of the disparity validation architecture. The inputs of this module are the two disparity values obtained by the Census correlation process. The absolute difference of the values is compared with λ. The comparator delivers one bit that controls the multiplexer selector. If the result of the comparison is 1 (that is, the absolute difference is greater than or equal to λ), the multiplexer outputs the undefined label. This label is encoded as the maximal disparity value plus one, referred to as the default value. If the comparison result is 0, the output of the multiplexer is the left-to-right Census correlation value.
3.5 The median filter module
Some errors remain after the disparity validation, due to errors in the correlation process. Most of these errors appear because objects in the image have intensity values similar to those of their surrounding area. This situation produces very similar Census Transform values for different pixels and, consequently, wrong disparity values in certain cases. We reduce these errors by using a median filter. As pointed out before, it is not appropriate to use the same arithmetic mean filter as in the preprocessing stage, because that filter would produce a new value (the average over the filtering window) which is not among the current disparity values. The median filter, on the other hand, works with the true values in the image, so the resulting value is an element of the disparity window. The median filter uses a search window of 3 × 3 pixels, which is enough to notably reduce the error and improve the final disparity map, as shown in Figure 6.
Fig. 6. Median filter: a) left image, b) resulting disparity map without filtering and c) with filtering.

Figure 7 shows the block diagram of the median filter architecture. This filter is based on the works of (Dhanasekaran & Bagan, 2009) and (Vega-Rodriguez et al., 2002). The left side of the diagram shows the nine registers and the two RAM blocks used to generate the sliding window that extracts the nine pixels of the processing window. This architecture works similarly to the processing window of the Census Transform. The nine pixels in the window are processed by a pyramidal architecture, in this case with 9 levels. Each level contains several comparison units that find the higher of two input values A and B. Each comparison unit contains a comparator and a multiplexer. If input A is higher than input B, the comparator output is 1; otherwise it is 0. The comparator output is used as the control signal of the multiplexer: when it is 1, the multiplexer selects A as the higher value and B as the lower one, and conversely otherwise. Each comparator level in the median module orders
the disparity values with respect to their neighbors in the processing window, progressively building a descending organization of the values. However, it is not necessary to order all the disparity values completely, because we are only looking for the middle value at the last level. Therefore, the earlier levels only produce a partial order of the elements, and the comparison unit at the last level delivers the median. The connection structure between the comparison units at each level guarantees a correct median value (Vega-Rodriguez et al., 2002).
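To make the idea concrete, here is a C model of a fixed compare-exchange network for the median of nine values. The wiring below is a classic 19-operation network; it illustrates the partial-ordering principle but is not necessarily the exact connection structure of (Vega-Rodriguez et al., 2002).

```c
/* Compare-exchange unit: leaves *a <= *b, like the comparator plus
 * multiplexer pair described in the text. */
static void cswap(int *a, int *b)
{
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* Median of 9 using a fixed network of compare-exchange units (19
 * operations). Only a partial order is built, just enough to isolate
 * the middle element, mirroring the hardware idea of Figure 7.
 * Note: the input array is modified in place. */
int median9_network(int p[9])
{
    cswap(&p[1], &p[2]); cswap(&p[4], &p[5]); cswap(&p[7], &p[8]);
    cswap(&p[0], &p[1]); cswap(&p[3], &p[4]); cswap(&p[6], &p[7]);
    cswap(&p[1], &p[2]); cswap(&p[4], &p[5]); cswap(&p[7], &p[8]);
    cswap(&p[0], &p[3]); cswap(&p[5], &p[8]); cswap(&p[4], &p[7]);
    cswap(&p[3], &p[6]); cswap(&p[1], &p[4]); cswap(&p[2], &p[5]);
    cswap(&p[4], &p[7]); cswap(&p[4], &p[2]); cswap(&p[6], &p[4]);
    cswap(&p[4], &p[2]);
    return p[4];
}
```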
4. Results

The final architecture was synthesized for an image size of 640 × 480 pixels, a 3 × 3 window for the arithmetic mean filter, a 7 × 7 window for the Census Transform and a maximal disparity of 64 pixels. With these parameters, the architecture calculates from 130 disparity images per second with a 50 MHz clock up to 325 disparity images per second with a 100 MHz clock.
4.1 Architectural exploration through high-level synthesis
High-level synthesis was used to implement the stereo vision architecture based on the Census Transform. The algorithm was developed using GAUT (Coussy & Morawiec, 2008), a high-level synthesis tool based on the C language, and was then synthesized for the Altera Cyclone II (EP2C35F672C6). Each stage of the architecture (filtering, Census Transform and correlation) was developed taking into account both consumed resources and processing speed, and the best trade-off was selected to implement an optimal architecture. Tables 1 to 3 lay out three different designs for each stage, labeled Design 1, 2 and 3, with their most representative figures.

In the following, we describe how the different implementation metrics are related. There is a clear relation between performance, cadence and pipeline depth: if we relax the performance requirement, the cadence (the interval at which new data enter the pipeline) increases, and therefore the number of operators and pipeline stages decreases. The remaining design features are related less directly. The number of logic elements depends directly on the combinational functions used and on the number of dedicated logic registers. The combinational functions are strongly associated with the number of operators and weakly with the number of states in the state machine; as in any state machine, the cadence controls the processing speed. Conversely, the dedicated logic registers depend strongly on the number of states in the state machine and weakly on the number of operators. Finally, the latency follows from the number of operators, the number of pipeline stages and, especially, the cadence established by the architecture design.

The results shown in Tables 1 to 3 were obtained for an image size of 640 × 480 pixels with a 3 × 3 processing window for the arithmetic mean filter, a 7 × 7 window for the Census Transform and a maximal disparity of 64 pixels, with a 100 MHz clock.

Characteristics          Design 1   Design 2   Design 3
Cadency (ns)                20         30         40
Performance (fps)          160        100         80
Logic elements             118        120         73
Comb. functions             86         72         73
Ded. log. registers        115        116         69
# Stages in pipeline         3          2          2
# Operators                  2          2          1
Latency (µs)             25.69      38.52      51.35
Table 1. Comparative table for the arithmetic mean filter.

Taking into account the most common real-time constraints, it is possible to choose Design 3 for the implementation of the arithmetic mean filter, because it represents the best compromise between performance and consumed resources. For the same reason, Design 2 can be chosen for the Census Transform and Design 3 for the Census correlation. The results of the hardware synthesis in the FPGA are summarized as follows: the global architecture needs 6,977 logic elements and 112,025 memory bits. The number of logic elements represents 21% of the total logic resources of the Cyclone II device; furthermore, the memory size represents 23% of the available memory bits.
Characteristics          Design 1   Design 2   Design 3
Cadency (ns)                40         80        200
Performance (fps)           80         40         15
Logic elements           2,623      1,532      1,540
Comb. functions          2,321        837        864
Ded. log. registers      2,343      1,279      1,380
# Stages in pipeline        48         24         10
# Operators                155         79         34
Latency (µs)            154.36     308.00     769.50

Table 2. Comparative table for the Census Transform.

Characteristics          Design 1   Design 2   Design 3
Cadency (ns)                20         40         80
Performance (fps)          160         80         40
Logic elements           1,693      2,079      2,644
Comb. functions          1,661      1,972      2,553
Ded. log. registers      1,369      1,451      1,866
# Stages in pipeline        27         12          8
# Operators                140         76         46
Latency (ns)               290        160        100
Table 3. Comparative table for the Census correlation.

This architecture calculates 40 dense disparity images per second with a 100 MHz clock. This performance is lower than that of the RTL architecture proposed above, but it corresponds to a well-optimized design, since it uses fewer resources. In spite of the lower performance, it is high enough for the majority of real-time vision applications.
4.2 Comparative analysis of the architectures
First, we analyze the system performance of four different solutions for computing the dense disparity image. Two of them are the hardware implementations described above. The third is a solution on a Digital Signal Processor (DSP), an ADSP-21161N from Analog Devices with a 100 MHz clock. The last is a software solution on a PC, a DELL Optiplex 755 with a 2.00 GHz Intel Core 2 Duo processor and 2 GB of RAM. The performance comparison between these solutions is shown in Table 4. The first column indicates the image sizes used during the experimental tests; the second shows the sizes of the search window used in the Census Transform; the remaining columns show the processing times. In the FPGA implementations, parallel processing allows short calculation times. The architecture designed at the RTL level reaches the lowest processing time, but takes more time to implement. On the other hand, designing the architecture with high-level synthesis yields a less complex design with a longer processing time; its advantage is the short implementation time. Unlike the FPGA implementations, the DSP solution is easier and faster to implement; nevertheless, its processing remains sequential, so the computation time is considerably higher. Finally, the PC solution, which affords the easiest implementation of all those discussed,
requires very high processing times compared to the hardware solutions, since its architecture is inappropriate for real-time applications.

Image size (pixels)   Census window (pixels)   FPGA       DSP        PC
192 × 144             3 × 3                    0.69 ms    0.26 s     33.29 s
192 × 144             5 × 5                    0.69 ms    0.69 s     34.87 s
192 × 144             7 × 7                    0.69 ms    1.80 s     36.31 s
384 × 288             3 × 3                    2.77 ms    1.00 s    145.91 s
384 × 288             5 × 5                    2.77 ms    2.75 s    151.39 s
384 × 288             7 × 7                    2.77 ms    7.20 s    158.20 s
640 × 480             3 × 3                    7.68 ms    2.80 s    403.47 s
640 × 480             5 × 5                    7.68 ms    7.70 s    423.63 s
640 × 480             7 × 7                    7.68 ms   20.00 s    439.06 s
Table 4. Performance comparison of the different implementations.

We now present a comparative analysis between our two architectures and four different FPGA implementations found in the literature. The first column of Table 5 lays out the most common characteristics of the architectures. The second and third columns show the constraints, performance and resources consumed by our architectures using the RTL design and high-level synthesis (HLS) (Ibarra-Manzano, Devy, Boizard, Lacroix & Fourniols, 2009), labeled Design 1 and Design 2, respectively. The remaining columns show the corresponding values for four architectures by other authors, labeled Design 3 to 6; see (Naoulou et al., 2006), (Murphy et al., 2007), (Arias-Estrada & Xicotencatl, 2001) and (Miyajima & Maruyama, 2003), respectively, for more technical details. Although all of these are FPGA implementations that calculate dense disparity images from two stereo images, our architecture can be directly compared only with Designs 3 and 4, since they use the Census Transform algorithm for calculating the disparity map. Our Design 1 offers two essential improvements with respect to Design 3: the latency and the memory size. These improvements directly affect the number of logic elements (area), which in our case increases. Design 2, in turn, offers three important improvements: the latency, the area and the memory size; these come at the cost of performance, that is, fewer images processed per second. Although Design 4 performs well with respect to the other designs, its performance is lower than that of our architecture; in addition, it uses a four-times-smaller image, has a lower maximal disparity and consumes a larger quantity of resources (area and memory). Our architecture cannot be directly compared with Designs 5 and 6, since they use the Sum of Absolute Differences (SAD) as the correlation measure. However, an interesting point of comparison is the performance obtained when an architecture uses only logic elements (Design 5) or relies on several accesses to external memories (Design 6). The large quantity of logic elements consumed by Design 5 limits the size of the input images and the maximal disparity value; as a consequence, its performance is lower than that of our Design 1. Design 6 requires a large quantity of external memory, which directly penalizes its performance with respect to our Design 1.
5. Implementation results
We are interested in obtaining the disparity maps for an image sequence acquired from a camera mounted on a moving vehicle or robot. It is important to point out the additional constraint imposed by a vehicle whose velocity varies or is very high.
                    Design 1    Design 2    Design 3    Design 4    Design 5    Design 6
Measure             Census      Census      Census      Census      SAD         SAD
Image size          640 × 480   640 × 480   640 × 480   320 × 240   320 × 240   640 × 480
Window size         7 × 7       7 × 7       7 × 7       13 × 13     7 × 7       7 × 7
Disparity max       64          64          64          20          16          80
Performance (fps)   325         40          130         40          71          18.9
Latency (µs)        115         206         274         —           —           —
Area (logic el.)    12,188      6,977       11,100      26,265      4,210       7,096
Memory size         114 Kb      109 Kb      174 Kb      375 Kb      —           —

Table 5. Comparative table of the different architectures.

In this context, our architecture was tested in different navigation scenes using a stereo vision bank mounted first on a mobile robot and then on a vehicle. In this section, we present three operational environments. Figures 8 (a) and (b) respectively show the left and right images from the stereo vision bank, and Figure 8 (c) depicts the dense disparity image, with disparity values coded as gray levels. By examining this last image, we can see that an object close to the stereo vision bank has a large disparity value, which corresponds to a light gray level; conversely, an object far from the bank has a low disparity value, which corresponds to a dark gray level. Thus the gray level representing the road in the resulting images changes gradually from light to dark. Note the right side of the image, where the different gray levels correspond to the vehicles in the parking lot: since these vehicles are located at different depths from the stereo vision bank, the disparity map detects them and assigns a corresponding gray value.

The second test of the algorithm is shown in Figure 9. In this case, a white vehicle moves straight toward our robot. This vehicle is detected in the disparity image and depicted with different gray levels. Different depth points of the vehicle can be distinguished, since it is closer to our stereo vision bank than the vehicles parked on the right side of the scene. On the other hand, it is important to point out that the disparity validation sometimes fails when the similarity between the left and right images is too close. This problem is more significant when there are shadows close to the visual system (as in this experiment), producing several detection errors in the shadow zones.

In the last test (see Figure 10), the stereo vision bank is mounted on a vehicle driven on a highway. This experimental test is a difficult situation because the vehicle is driven at high speed. The left and right images (Figures 10 (a) and (b), respectively) show a car that overtakes our vehicle. Figure 10 (c) shows the dense disparity map. Note the different depths detected on the overtaking car, and how the gray level of the highway becomes gradually darker down to black, which represents infinite depth.
Fig. 8. Stereo images acquired from a mobile robot during outdoor navigation: a) left image, b) right image and c) the disparity map.
Fig. 9. Stereo images acquired from a mobile robot during outdoor navigation: a) left image, b) right image and c) the disparity map.

Figures 11 (a) and (b) show the left image and the obtained disparity map, respectively, while Figure 11 (c) shows the environment reconstructed using the back-projection technique. Each point in the reconstructed scene was located with respect to a reference frame set on the stereo bank, employing the intrinsic/extrinsic parameters of the cameras and geometrical
Fig. 10. Stereo images acquired from a vehicle on the highway: a) left image, b) right image and c) the disparity map.
Fig. 11. 3D reconstruction of an outdoor environment using the dense disparity map obtained by our architecture: a) left image, b) disparity map and c) 3D reconstruction.
assumptions. By examining Figure 11 (c), we can see that most of the undefined disparity points were removed, so the reconstruction is based on the well-defined depth points. Finally, it is important to point out that the reconstruction of this environment is a difficult task, since the robot carrying the stereo vision bank moves at a considerable velocity of 6 meters per second in an outdoor environment. Therefore, the ideal conditions of controlled illumination and vibration do not hold, and this is reflected in some images, making it more difficult to obtain the disparity map and, consequently, the scene reconstruction.
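For reference, a minimal back-projection sketch is given below. It assumes an ideal rectified pinhole stereo pair; the parameter names are generic placeholders, not the calibration of the actual stereo bank.

```c
/* Back-projection of a validated disparity d at pixel (u, v) into a
 * 3D point in the stereo-bank frame, assuming rectified cameras with
 * focal length f (pixels), baseline B (meters) and principal point
 * (cx, cy). d must be a defined, non-zero disparity. */
typedef struct { double x, y, z; } point3d_t;

point3d_t back_project(int u, int v, int d,
                       double f, double B, double cx, double cy)
{
    point3d_t p;
    p.z = f * B / (double)d;   /* depth from disparity */
    p.x = (u - cx) * p.z / f;
    p.y = (v - cy) * p.z / f;
    return p;
}
```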
8. Acknowledgments
This work was partially funded by CONACyT through the project entitled "Diseño y optimización de una arquitectura para la clasificación de objetos en tiempo real por color y textura basada en FPGA" (Design and optimization of an FPGA-based architecture for real-time object classification by color and texture).
9. References
Arias-Estrada, M. & Xicotencatl, J. M. (2001). Multiple stereo matching using an extended architecture, in G. Brebner & R. Woods (eds), FPL '01: Proceedings of the 11th International Conference on Field-Programmable Logic and Applications, Springer-Verlag, London, UK, pp. 203–212.

Coussy, P. & Morawiec, A. (2008). High-Level Synthesis: from Algorithm to Digital Circuit, 1st edn, Springer.

Dhanasekaran, D. & Bagan, K. B. (2009). High speed pipelined architecture for adaptive median filter, European Journal of Scientific Research 29(4): 454–460.

Ibarra-Manzano, M. (2011). Vision multi-caméra pour la détection d'obstacles sur un robot de service: des algorithmes à un système intégré, PhD thesis, Institut National des Sciences Appliquées de Toulouse, Toulouse, France.

Ibarra-Manzano, M., Almanza-Ojeda, D.-L., Devy, M., Boizard, J.-L. & Fourniols, J.-Y. (2009). Stereo vision algorithm implementation in FPGA using Census Transform for effective resource optimization, 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, pp. 799–805.

Ibarra-Manzano, M., Devy, M., Boizard, J.-L., Lacroix, P. & Fourniols, J.-Y. (2009). An efficient reconfigurable architecture to implement dense stereo vision algorithm using high-level synthesis, 2009 International Conference on Field Programmable Logic and Applications, Prague, Czech Republic, pp. 444–447.

Masrani, D. & MacLean, W. (2006). A real-time large disparity range stereo-system using FPGAs, IEEE International Conference on Computer Vision Systems (ICVS '06), p. 13.

Miyajima, Y. & Maruyama, T. (2003). A real-time stereo vision system with FPGA, in G. Brebner & R. Woods (eds), FPL '03: Proceedings of the 13th International Conference on Field-Programmable Logic and Applications, Springer-Verlag, London, UK, pp. 448–457.

Murphy, C., Lindquist, D., Rynning, A. M., Cecil, T., Leavitt, S. & Chang, M. L. (2007). Low-cost stereo vision on an FPGA, FCCM '07: Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society, Washington, DC, USA, pp. 333–334.

Naoulou, A., Boizard, J.-L., Fourniols, J. Y. & Devy, M. (2006). A 3D real-time vision system based on passive stereovision algorithms: Application to laparoscopic surgical manipulations, Proceedings of the 2nd Conference on Information and Communication Technologies (ICTTA 2006), Vol. 1, IEEE, pp. 1068–1073.

Schmit, H. H., Cadambi, S., Moe, M. & Goldstein, S. C. (2000). Pipeline reconfigurable FPGAs, Journal of VLSI Signal Processing Systems 24(2-3): 129–146.

Vega-Rodriguez, M. A., Sanchez-Perez, J. M. & Gomez-Pulido, J. A. (2002). An FPGA-based implementation for median filter meeting the real-time requirements of automated visual inspection systems, Proceedings of the 10th Mediterranean Conference on Control and Automation, Lisbon, Portugal, pp. 1–7.

Woodfill, J., Gordon, G., Jurasek, D., Brown, T. & Buck, R. (2006). The Tyzx DeepSea G2 vision system: a taskable, embedded stereo camera, Computer Vision and Pattern Recognition Workshop (CVPRW '06), p. 126.

Zabih, R. & Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence, ECCV '94: Proceedings of the Third European Conference on Computer Vision, Vol. II, Springer-Verlag, Secaucus, NJ, USA, pp. 151–158.