Accurate Multiple View 3D Reconstruction Using Patch-Based Stereo For Large-Scale Scenes
Abstract—In this paper, we propose a depth-map merging based multiple view stereo method for large-scale scenes which takes both accuracy and efficiency into account. In the proposed method, an efficient patch-based stereo matching process is used to generate a depth-map at each image with acceptable errors, followed by a depth-map refinement process to enforce consistency over neighboring views. Compared to state-of-the-art methods, the proposed method can reconstruct quite accurate and dense point clouds with high computational efficiency. Besides, the proposed method could be easily parallelized at image level, i.e., each depth-map is computed individually, which makes it suitable for large-scale scene reconstruction with high resolution images. The accuracy and efficiency of the proposed method are evaluated quantitatively on benchmark data and qualitatively on large data sets.

Index Terms—3D reconstruction, depth-map, multiple view stereo (MVS).

I. INTRODUCTION

point growing based methods spread points reconstructed in textured regions to untextured ones, which may leave holes in final results; the depth-map based methods have been proved to be more adapted to large-scale scenes, but their performance is usually lower than that of the others in terms of accuracy and completeness.

In this paper we propose a depth-map merging based MVS method for large-scale scenes which takes both accuracy and efficiency into account. The key of our method is an efficient patch-based stereo matching followed by a depth-map refinement process which enforces consistency over multiple views. Compared to state-of-the-art methods, the proposed method has three main advantages: 1) It can reconstruct quite accurate and dense point clouds since the patch-based stereo is able to produce depth-maps with acceptable errors which can be further refined by the depth-map refinement process. 2) It is a computationally efficient method which is about 10 to 20 times faster than the state-of-the-art method [11] while
The above random initialization process is very likely to have at least one good guess for each scene plane in the image, especially for high resolution images, in which each scene plane contains more pixels and thus more guesses than in low resolution ones. We should note that, once the depth-map for image I_i is computed, we can improve the purely random initialization process when computing the depth-map on I_i's reference image I_j. In this scenario, the depth and patch normal of each pixel in I_i's depth-map could be warped to I_j as an initial estimate when computing I_j's depth-map, and pixels in I_j that do not have mappings between I_i and I_j still use random initializations. In this manner, we can assign each warped pixel in I_j a better initial plane than a random guess, since this plane is consistent for the stereo pair I_i and I_j.
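This warm start can be sketched as follows. It is a minimal illustration under our own assumptions (3D points expressed in the world frame, nearest-pixel rounding, no handling of collisions when several points warp to the same pixel), not the author's implementation:

```python
import numpy as np

def warm_start_planes(X_map_i, n_map_i, K_j, R_j, C_j, shape_j):
    """Warp plane hypotheses {X, n} from I_i's depth-map into I_j.

    X_map_i, n_map_i: (H_i, W_i, 3) 3D points and patch normals per pixel of
    I_i (NaN where I_i has no estimate). Returns (H_j, W_j, 3) arrays for I_j;
    pixels left as NaN keep the purely random initialization.
    """
    H_j, W_j = shape_j
    X_map_j = np.full((H_j, W_j, 3), np.nan)
    n_map_j = np.full((H_j, W_j, 3), np.nan)

    pts = X_map_i.reshape(-1, 3)
    nrm = n_map_i.reshape(-1, 3)
    keep = ~np.isnan(pts[:, 0])
    pts, nrm = pts[keep], nrm[keep]

    # Project the 3D points into I_j: x ~ K_j * R_j * (X - C_j).
    cam = (R_j @ (pts - C_j).T).T
    front = cam[:, 2] > 0                       # keep points in front of I_j
    cam, pts, nrm = cam[front], pts[front], nrm[front]
    pix = (K_j @ cam.T).T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)

    inside = (u >= 0) & (u < W_j) & (v >= 0) & (v < H_j)
    X_map_j[v[inside], u[inside]] = pts[inside]
    n_map_j[v[inside], u[inside]] = nrm[inside]
    return X_map_j, n_map_j
```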
According to [33], given the projection matrices for the two cameras P = [I_3×3 | 0_3] and P' = [R | t], and a plane defined by π^T X = 0 with π = (V^T, 1)^T, the homography H induced by the plane is:

    H = R − t V^T                                                    (4)

Here, I_3×3 is the 3×3 identity matrix and 0_3 is a zero 3-vector, which indicates that the world coordinate is chosen to coincide with camera P. In this paper, the camera parameters of the image pair are {K_i, R_i, C_i} and {K_j, R_j, C_j}, and the plane f_p = {X_i, n_i} is defined in camera C_i's coordinate. Thus, the projection matrices and plane parameters can be translated into standard forms (put the world origin at C_i), as:

    P_i = K_i [I_3×3 | 0_3],   P_j = K_j [R_j R_i^-1 | R_j (C_i − C_j)],   V^T = − n_i^T / (n_i^T X_i)

According to Eq. 4, the homography for the cameras P̄_i = [I_3×3 | 0_3] and P̄_j = [R_j R_i^-1 | R_j (C_i − C_j)] is:

    H̄_ij = R_j R_i^-1 + R_j (C_i − C_j) n_i^T / (n_i^T X_i)

Applying the transformations K_i and K_j to the images we obtain the cameras P_i = K_i P̄_i, P_j = K_j P̄_j, and the resulting induced homography is:

    H_ij = K_j ( R_j R_i^-1 + R_j (C_i − C_j) n_i^T / (n_i^T X_i) ) K_i^-1          (5)
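Eq. 5 is straightforward to evaluate from the two cameras and the plane; a minimal NumPy sketch (the function name and array conventions are ours, not from the paper):

```python
import numpy as np

def plane_homography(K_i, R_i, C_i, K_j, R_j, C_j, X_i, n_i):
    """Homography H_ij of Eq. 5 induced by the plane f_p = {X_i, n_i}.

    X_i, n_i are the 3D point and patch normal of the plane, expressed in
    camera C_i's coordinate frame as in the paper; all inputs are NumPy arrays.
    """
    R_rel = R_j @ np.linalg.inv(R_i)                 # R_j R_i^-1
    t_rel = R_j @ (C_i - C_j)                        # R_j (C_i - C_j)
    H_bar = R_rel + np.outer(t_rel, n_i) / (n_i @ X_i)
    return K_j @ H_bar @ np.linalg.inv(K_i)

# Mapping a pixel q = (u, v) of I_i into I_j:
#   x = H_ij @ [u, v, 1];  H_ij(q) = (x[0] / x[2], x[1] / x[2])
```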
We set a square window B centered on pixel p, where B = w × w (in this paper we set w = 7 pixels). For each pixel q in B we find its corresponding pixel in the reference image I_j using the homography mapping H_ij(q). Then the aggregated matching cost m(p, f_p) for pixel p is computed as one minus the Normalized Cross Correlation (NCC) score between q and H_ij(q), as:

    m(p, f_p) = 1 − Σ_{q∈B} (q − q̄)(H_ij(q) − H̄_ij(q)) / sqrt( Σ_{q∈B} (q − q̄)² · Σ_{q∈B} (H_ij(q) − H̄_ij(q))² )        (6)

where q̄ and H̄_ij(q) denote the mean intensity values over the window B.
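A minimal sketch of Eq. 6 for one pixel and one plane hypothesis (nearest-neighbor sampling and no border handling, purely for illustration; a real implementation would interpolate):

```python
import numpy as np

def matching_cost(img_i, img_j, H_ij, p, w=7):
    """One-minus-NCC cost m(p, f_p) of Eq. 6 over the w x w window B.

    img_i, img_j are single-channel float images; H_ij is the 3x3 homography
    of Eq. 5 for the plane hypothesis at p = (u, v). The window is assumed to
    lie inside img_i.
    """
    r = w // 2
    u, v = p
    us, vs = np.meshgrid(np.arange(u - r, u + r + 1),
                         np.arange(v - r, v + r + 1))
    a = img_i[vs, us]                                    # intensities q in B

    # Map every q in B into the reference image through H_ij.
    q_h = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    x = H_ij @ q_h
    uj = np.clip(np.round(x[0] / x[2]).astype(int), 0, img_j.shape[1] - 1)
    vj = np.clip(np.round(x[1] / x[2]).astype(int), 0, img_j.shape[0] - 1)
    b = img_j[vj, uj].reshape(a.shape)                   # intensities H_ij(q)

    a0, b0 = a - a.mean(), b - b.mean()
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()) + 1e-12
    return 1.0 - (a0 * b0).sum() / denom
```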
Note that some more complex and robust aggregation techniques, like [34]–[36], could be used to generate more reliable results than NCC. However, high resolution images could provide more reliable matches than low resolution ones, and a simple NCC is reliable enough to measure the photometric consistency. Besides, most unreliable pixels generated by NCC could be removed in the depth-map refinement step, which makes the final reconstruction result of NCC almost equivalent to those of other complex aggregation methods. Thus, in this paper we use simple NCC as the aggregated matching cost, which is the same as [11].
After the initialization, each pixel in image I_i is associated with a 3D plane. Then we process pixels in I_i one by one to refine the planes in k2 iterations. At odd iterations, we start from the top-left pixel and traverse in row-wise order until we reach the bottom-right pixel. At even iterations, we reverse the order to visit the pixels from the bottom-right to the top-left pixel, also in row-wise order. In this paper we set the number of plane refinement iterations as k2 = 3.

At each iteration, each pixel has two operations, called spatial propagation and random assignment. Spatial propagation is used to compare and propagate the planes of neighboring pixels to that of the current pixel. In odd iterations, the neighboring pixels are the left, upper, and upper-left neighbors, and in even iterations they are the right, lower, and lower-right neighbors. Let p_N denote the neighborhood of the current pixel p, and f_{p_N} denote p_N's plane; we use the matching cost in Eq. 6 to check the condition m(p, f_{p_N}) < m(p, f_p). If this condition is satisfied, we consider f_{p_N} a better choice for p compared to its current plane f_p, and propagate f_{p_N} to p, i.e., set f_p = f_{p_N}. This spatial propagation process relies on the fact that neighboring pixels are very likely to have similar 3D planes, especially for high resolution images. Theoretically, even a single good guess is enough to propagate this plane on to other pixels of the region after the first and second spatial propagations.
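A compact sketch of the alternating traversal and spatial propagation just described (random assignment, covered next, would be called for each pixel right after the propagation test; names and data layout are our own):

```python
import numpy as np

def propagate_planes(planes, cost, k2=3):
    """Spatial propagation over k2 alternating row-wise passes.

    planes: (H, W) object array of plane hypotheses f_p.
    cost:   callable cost(p, plane) -> float, the matching cost of Eq. 6.
    """
    H, W = planes.shape
    for it in range(1, k2 + 1):
        if it % 2 == 1:   # odd pass: top-left -> bottom-right
            ys, xs, d = range(H), range(W), -1
        else:             # even pass: bottom-right -> top-left
            ys, xs, d = range(H - 1, -1, -1), range(W - 1, -1, -1), 1
        for y in ys:
            for x in xs:
                # Already-visited neighbors: left/upper/upper-left on odd
                # passes, right/lower/lower-right on even passes.
                for dy, dx in ((0, d), (d, 0), (d, d)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and \
                       cost((x, y), planes[ny, nx]) < cost((x, y), planes[y, x]):
                        planes[y, x] = planes[ny, nx]
    return planes
```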
For each pixel p, after spatial propagation, we further refine the plane f_p using random assignment. The purpose of the random assignment is to further reduce the matching cost in Eq. 6 by testing several random plane parameters. Given a range {Δλ, Δθ, Δφ}, we 1) select a random plane parameter {λ', θ', φ'} in the range λ' ∈ [λ − Δλ, λ + Δλ], θ' ∈ [θ − Δθ, θ + Δθ], φ' ∈ [φ − Δφ, φ + Δφ]; 2) compute the new plane f'_p = {X'_i, n'_i} using Eq. 2 and Eq. 3; 3) if m(p, f'_p) < m(p, f_p), we accept f_p = f'_p and set λ = λ', θ = θ', φ = φ'; 4) we halve the range {Δλ, Δθ, Δφ}; 5) go back to step one. The above process is repeated k3 times. In this paper, we set the initial range and repetition number as: Δλ = (λ_max − λ_min)/4, Δθ = 90°, Δφ = 15°, k3 = 6. This random assignment process progressively reduces the search range in order to capture depth and normal details.

The spatial propagation and random assignment idea has already been successfully applied in the PatchMatch stereo method [32] and the hybrid recursive matching (HRM) method [37]. This paper extends the idea of [32] to Multiple View Stereo for large-scale scenes using high resolution images. Thus, the novelty of the proposed approach is to modify the PatchMatch stereo algorithm [32] in a proper way that makes it more powerful and efficient for the large-scale MVS problem. The main difference between the proposed approach and the method in [32] is that the plane is defined in the image coordinate
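Referring back to steps 1)–5) of the random assignment above, a minimal per-pixel sketch (the helper plane_from_params stands in for Eq. 2 and Eq. 3 and is assumed to be supplied by the caller):

```python
import numpy as np

def random_assignment(p, params, cost, plane_from_params, lam_range,
                      k3=6, d_theta=90.0, d_phi=15.0):
    """Random assignment for one pixel p; params = (lam, theta, phi).

    plane_from_params(lam, theta, phi) builds the plane f_p = {X_i, n_i}
    (the role of Eq. 2 and Eq. 3); cost(p, plane) is the cost of Eq. 6.
    """
    lam, theta, phi = params
    d_lam = (lam_range[1] - lam_range[0]) / 4.0          # initial search range
    best = cost(p, plane_from_params(lam, theta, phi))
    for _ in range(k3):
        lam_n = np.random.uniform(lam - d_lam, lam + d_lam)
        theta_n = np.random.uniform(theta - d_theta, theta + d_theta)
        phi_n = np.random.uniform(phi - d_phi, phi + d_phi)
        c = cost(p, plane_from_params(lam_n, theta_n, phi_n))
        if c < best:                                     # keep the better plane
            lam, theta, phi, best = lam_n, theta_n, phi_n, c
        d_lam, d_theta, d_phi = d_lam / 2, d_theta / 2, d_phi / 2
    return lam, theta, phi
```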
TABLE I
PARAMETER SETTINGS OF THE PROPOSED METHOD
Fig. 5. Sample images and their ground truth depth-maps of the benchmark data sets. In both (a) and (b), the left two images are from Fountain-P11 and the right two images are from Herz-Jesu-P8.
matching process is based on DAISY [24] descriptors at each pixel.

These two methods are now considered as state-of-the-art MVS methods, and can be used for large-scale scenes as ours. In the next three subsections, the proposed method, together with the PMVS and DAISY methods, are evaluated quantitatively and qualitatively on different data sets. All the experiments are implemented on an Intel 2.8 GHz Quad Core CPU with 16 GB RAM.

B. Parameter Settings

The proposed method has nine parameters, and we have already discussed their value settings in Section III. Table I is a summary.

The key of the DAISY method is the DAISY descriptor, and we set the DAISY parameters as: R = 8, Q = 2, T = 4, and H = 4. For the details of these parameters one can refer to [22], [24]. The authors of [22] suggest first computing depths at sparse locations of the image in order to constrain the search range for neighboring pixels. Thus, in the experiments we first select control pixels using a sampling step of 10 pixels on the image, and compute their depths. Then we compute depths of other pixels with the depth range constrained by their closest four neighboring control pixels.

The PMVS method may run out of memory for a large number of high resolution images, thus we use a clustered version of PMVS [13] which first decomposes the input images into a set of image clusters of manageable size and then runs PMVS on each cluster separately. The authors of [11], [13] provide the source codes of PMVS and clustered-PMVS, and we set its parameters as: level = 0, csize = 1, threshold = 0.7, wsize = 7, minImageNum = 3. level specifies the level in the image pyramid that is used for the computation, and level = 0 means the original resolution images are used. csize controls the density of reconstructions, and csize = 1 means the software tries to reconstruct a patch in every pixel. threshold = 0.7 is a threshold for the photometric consistency measurement, which is the same as 1 − τ1 in our method. minImageNum = 3 means each 3D point must be visible in at least 3 images to be reconstructed, which is suggested by the authors. These parameter settings ensure that the PMVS method tries to reconstruct a 3D point at every pixel with full resolution images, which is the same as ours. For a full description of these parameters, one can refer to [11], [13], [38] for details.

The proposed method, as well as the DAISY method, could be easily parallelized at image level, i.e., each depth-map is computed individually. Thus, in these two methods we use four cores for parallel computing. For the PMVS method, we set its parameter CPU = 4, which tells the code to use four cores on the Quad Core CPU as ours.
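Since each depth-map depends only on one image and its reference image, the image-level parallelism is simple to express; a minimal sketch using Python's multiprocessing (compute_depth_map is a placeholder for the per-image patch-based stereo and refinement pipeline, not code from the paper):

```python
from multiprocessing import Pool

def compute_depth_map(image_index):
    """Placeholder: patch-based stereo matching followed by depth-map
    refinement for image I_i and its reference image."""
    ...

if __name__ == "__main__":
    n_images = 11                           # e.g., Fountain-P11
    with Pool(processes=4) as pool:         # four cores, as in the experiments
        depth_maps = pool.map(compute_depth_map, range(n_images))
```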
Fig. 6. Final reconstruction results (colorized point cloud rendering) of the proposed method on the benchmark data sets. In both (a) and (b), the results are rendered from three different view points (the right one is seen from the top view).

Fig. 7. Depth-maps generated after k2 = 1, 2, . . . , 5 iterations in the depth-map computation process. In both (a) and (b), the top row are depth-maps generated after one to five iterations, respectively, and the bottom row are absolute depth differences between neighboring iterations.
Fig. 8. Depth-map and back projected 3D points (colored rendering) after each step for the fourth image in Fountain-P11. In (a)–(c), the top image is the depth-map and the bottom is the back projected 3D points.

Fig. 9. Depth-map and back projected 3D points (colored rendering) after each step for the second image in Herz-Jesu-P8. In (a)–(c), the top image is the depth-map and the bottom is the back projected 3D points.

TABLE II
NUMBERS OF CORRECT AND ERROR PIXELS AFTER EACH STEP FOR THE FOURTH IMAGE IN FOUNTAIN-P11 AND THE SECOND IMAGE IN HERZ-JESU-P8

                           Fountain-P11                     Herz-Jesu-P8
Step                       Correct pixels   Error pixels    Correct pixels   Error pixels
depth-map computation      5 174 163        737 211         4 477 824        990 058
depth-map refinement       4 603 032        165 868         3 992 684        176 736
depth-map merging          5 254 683        260 351         4 306 332        266 224

Since the memory consumption is a key problem for large-scale reconstructions, we analyze the memory requirements of the three evaluated methods. The PMVS method needs to load all images (or images in a cluster) simultaneously, which means it requires H × W × 4 × n bytes of memory, where H and W are the height and width of the image respectively, 4 means four bytes since color images are converted to single-precision floating point gray images, and n is the size of the image set. Apparently, the PMVS method may suffer from the scalability problem (out of memory) as the number of images grows. On the contrary, the proposed and the DAISY methods can avoid this scalability problem since each depth-map is computed and refined individually. The DAISY method loads two images and computes descriptors on each one of their pixels for depth-map computation. Each descriptor requires 36 floating point numbers, which means that the DAISY method requires H × W × 36 × 4 × 2 = H × W × 288 bytes of memory. The proposed method also loads two images for depth-map computation and requires H × W × 4 × 2 = H × W × 8 bytes of memory. Obviously, the proposed method has a much lower memory requirement compared with the DAISY and PMVS methods.
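A quick numerical check of these formulas (the 3072 × 2048 resolution used below is only an example; the MiB figures are computed, not quoted from the paper):

```python
def memory_bytes(H, W, n_images):
    """Peak image-buffer memory (bytes) implied by the analysis above."""
    pmvs     = H * W * 4 * n_images   # all n gray float images loaded at once
    daisy    = H * W * 288            # 2 images x 36 floats x 4 bytes per pixel
    proposed = H * W * 8              # 2 gray float images
    return {"PMVS": pmvs, "DAISY": daisy, "proposed": proposed}

for name, b in memory_bytes(3072, 2048, 11).items():
    print(f"{name:9s} {b / 2**20:8.1f} MiB")
```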
Fig. 10. In both (a) and (b), from left to right: depth error maps for the fourth image in Fountain-P11 using the proposed method, the DAISY method, and the PMVS method, respectively. In all the images, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0].

Fig. 11. In both (a) and (b), from left to right: depth error maps for the second image in Herz-Jesu-P8 using the proposed method, the DAISY method, and the PMVS method, respectively. In all the images, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0].
mesh model could be easily generated from a point cloud using some meshing algorithm [41]. However, the three evaluated methods in this section all output 3D point clouds, thus a more direct way is to compare the raw outputs in point form rather than in a refined mesh form. To make this comparison feasible, we first project the ground truth to each image to generate ground truth depth-maps. Since the ground truth model is in 3D triangulated mesh form, the ground truth depth for each pixel is obtained from the 3D triangle mesh by computing the depth of the first triangle intersection with the camera ray going through this pixel. After this process, the ground truth depth-maps are generated. Fig. 5(b) shows some sample ground truth depth-maps.
TABLE III
NUMBERS OF CORRECT AND ERROR PIXELS USING THREE EVALUATED METHODS FOR THE FOURTH IMAGE IN FOUNTAIN-P11 AND THE SECOND IMAGE IN HERZ-JESU-P8

                      Fountain-P11                                    Herz-Jesu-P8
Method                Correct pixels   Error pixels   Error/Correct   Correct pixels   Error pixels   Error/Correct
Proposed method       5 254 683        260 351        4.9%            4 306 332        266 224        6.2%
DAISY                 5 163 432        263 544        5.1%            4 107 572        382 104        9.3%
PMVS                  3 853 304        246 696        6.4%            2 838 744        346 312        12.2%

Fig. 12. Number of correct pixels using three evaluated methods. For each pixel, its depth is considered to be correct if the depth error e is below τe = 0.01. (a) Result for Fountain-P11, which contains 11 images. (b) Result for Herz-Jesu-P8, which contains eight images.

Fig. 13. Total number of correct pixels in all images as a function of the error threshold. (a) Result for Fountain-P11. (b) Result for Herz-Jesu-P8.
Three evaluated MVS methods are used to reconstruct point clouds on the benchmark data with the parameters given in Section IV.B, and the results generated by the proposed method are shown in Fig. 6. Then we project the point cloud computed by different methods to each image for quantitative evaluation with the ground truth depth-map.

For each pixel in the image, we denote the depth computed by the MVS method by d and denote the depth of the ground truth by d_gt; the depth error between the computed depth and the ground truth could be measured as:

    e = |d − d_gt| / d_gt                                            (8)

If the depth error e is below a threshold τe, the depth d is considered as correct (in this paper we set τe = τ2 = 0.01). Eq. 8 is a quantitative measurement of how accurate a reconstructed depth is, and the following evaluations are all based on this measurement.
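The per-image counts reported below (Tables II and III, Fig. 12) follow directly from Eq. 8; a minimal sketch of the measurement (the NaN convention for missing values is our assumption):

```python
import numpy as np

def count_correct(depth, depth_gt, tau_e=0.01):
    """Per-pixel depth error of Eq. 8 and the correct/error pixel counts.

    depth, depth_gt: (H, W) arrays; NaN marks a missing MVS depth or missing
    ground truth, and such pixels are excluded (as in Figs. 10 and 11).
    """
    valid = ~np.isnan(depth) & ~np.isnan(depth_gt)
    e = np.abs(depth[valid] - depth_gt[valid]) / depth_gt[valid]   # Eq. 8
    correct = int((e <= tau_e).sum())
    error = int((e > tau_e).sum())
    return correct, error
```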
The first experiment illustrates the effect of the iteration number k2 for plane refinement in the depth-map computation process, and shows why we choose k2 = 3 as the iteration number. We select the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8, and compute their depth-maps with different values of k2 (k2 = 1, 2, . . . , 5) as shown in Fig. 7. In both Fig. 7(a) and 7(b), the top row are depth-maps generated after one to five iterations respectively, and the bottom row are absolute depth differences between neighboring iterations. The results show that few depth changes can be found after three iterations, thus setting k2 = 3 could provide a good balance between accuracy and efficiency.
The proposed method has three steps: depth-map computation, depth-map refinement, and depth-map merging. To illustrate the effect of each step, we show the depth-maps and back projected 3D points after each step for the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8 in Fig. 8 and Fig. 9 respectively. Besides visual results, the numbers of correct and error pixels after each step are given in Table II based on the measurement given in Eq. 8. The results show that after depth-map computation the patch based stereo can generate acceptable depth-maps, but they still contain certain visible errors. After depth-map refinement, 77% and 82% of the error pixels are removed in Fountain-P11 and Herz-Jesu-P8 respectively, which results in a relatively clean point cloud. Finally, after depth-map merging some holes are filled, like the left part of the fountain's base in Fig. 8.
Fig. 14. Sample images of the large data sets. The left two images are from Hull and the right two images are from Life Science Building.

Fig. 15. Final reconstruction results (colorized point cloud rendering) of three evaluated methods on the Hull data set. In (a)–(c), the results are rendered from three different view points (the right one is seen from the top view).

Fig. 16. Final reconstruction results (colorized point cloud rendering) of three evaluated methods on the Life Science Building data set. In (a)–(c), the results are rendered from three different view points (the right one is seen from the top view).
Fig. 10 and Fig. 11 show the depth error maps for the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8 using three evaluated methods respectively. In these figures, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0]. The results show that our method and the DAISY method can generate much more dense points than the PMVS method. Although the parameters of PMVS have been set to try to reconstruct a 3D point at every pixel, it still leaves lots of pixels without depths. Compared with the DAISY method, the proposed method could achieve more accurate results since its error maps are brighter than those of the DAISY method, and brighter means the depth errors are smaller. Table III gives a quantitative evaluation of the results shown in Fig. 10 and 11 by counting the numbers of correct and error pixels. The results show that compared with the DAISY and PMVS methods the proposed method not only produces more correct pixels but also has lower error/correct ratios.

Besides a single image, we compute the depth errors across all images in the data sets and evaluate the overall quality of the reconstruction results generated by the three evaluated methods. For each image, we count the number of correct pixels whose depth errors are below τe and show the results in Fig. 12. The results show that in all the images the correct pixel numbers of the proposed method and the DAISY method are almost the same, and are approximately 1.5 times larger than that of the PMVS method.

To further evaluate the reconstruction accuracy, we count the total number of correct pixels in all images as a function of the error threshold τe. We set τe to 200 values uniformly
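The curves of Fig. 13 can be produced from the same per-pixel errors; a minimal sketch (the sweep range tau_max below is an assumption, since the exact range is not restated in this excerpt):

```python
import numpy as np

def correct_vs_threshold(errors_per_image, tau_max=0.1, n_steps=200):
    """Total correct pixels over all images as a function of the threshold.

    errors_per_image: list of 1-D arrays of per-pixel errors e from Eq. 8.
    """
    e_all = np.concatenate(errors_per_image)
    taus = np.linspace(0.0, tau_max, n_steps)     # 200 uniformly spaced values
    counts = np.array([(e_all <= t).sum() for t in taus])
    return taus, counts
```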
REFERENCES

[10] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz, "Multi-view stereo for community photo collections," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[11] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1362–1376, Aug. 2010.
[12] T.-P. Wu, S.-K. Yeung, J. Jia, and C.-K. Tang, "Quasi-dense 3D reconstruction using tensor-based multiview stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1482–1489.
[13] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, "Towards internet-scale multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1434–1441.
[14] M. Goesele, B. Curless, and S. M. Seitz, "Multi-view stereo revisited," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Oct. 2006, pp. 2402–2409.
[15] C. Strecha, R. Fransens, and L. V. Gool, "Combined depth and outlier estimation in multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Oct. 2006, pp. 2394–2401.
[16] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, and J.-M. Frahm, "Real-time visibility-based fusion of depth maps," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[17] C. Zach, T. Pock, and H. Bischof, "A globally optimal algorithm for robust TV-L1 range image integration," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[18] D. Bradley, T. Boubekeur, and W. Heidrich, "Accurate multi-view reconstruction using robust binocular stereo and surface meshing," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[19] N. D. Campbell, G. Vogiatzis, C. Hernandez, and R. Cipolla, "Using multiple hypotheses to improve depth-maps for multi-view stereo," in Proc. Eur. Conf. Comput. Vis., Oct. 2008, pp. 766–779.
[20] Y. Liu, X. Cao, Q. Dai, and W. Xu, "Continuous depth estimation for multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2121–2128.
[21] J. Li, E. Li, Y. Chen, L. Xu, and Y. Zhang, "Bundled depth-map merging for multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Aug. 2010, pp. 2769–2776.
[22] E. Tola, C. Strecha, and P. Fua, "Efficient large-scale multi-view stereo for ultra high-resolution image sets," Mach. Vis. Appl., vol. 23, no. 5, pp. 903–920, 2012.
[23] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[24] E. Tola, V. Lepetit, and P. Fua, "Daisy: An efficient dense descriptor applied to wide-baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, May 2010.
[25] D. Gallup, J.-M. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys, "Real-time plane-sweeping stereo with multiple sweeping directions," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.
[26] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. N. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles, "Detailed real-time urban 3D reconstruction from video," Int. J. Comput. Vis., vol. 72, no. 2, pp. 143–167, 2008.
[27] G. Zhang, J. Jia, T.-T. Wong, and H. Bao, "Consistent depth maps recovery from a video sequence," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp. 974–988, Jun. 2009.
[28] D. Gallup, J.-M. Frahm, and M. Pollefeys, "Piecewise planar and non-planar stereo for urban scene reconstruction," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1418–1425.
[29] N. Snavely, S. M. Seitz, and R. Szeliski, "Modeling the world from internet photo collections," Int. J. Comput. Vis., vol. 80, no. 2, pp. 189–210, 2008.
[30] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in Proc. IEEE Int. Conf. Comput. Vis., Sep. 2009, pp. 72–79.
[31] J.-M. Frahm, P. George, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys, "Building Rome on a cloudless day," in Proc. Eur. Conf. Comput. Vis., Sep. 2010, pp. 368–381.
[32] M. Bleyer, C. Rhemann, and C. Rother, "PatchMatch stereo—Stereo matching with slanted support windows," in Proc. Brit. Mach. Vis. Conf., Aug.–Sep. 2011, pp. 14.1–14.11.
[33] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[34] K.-J. Yoon and I.-S. Kweon, "Locally adaptive support-weight approach for visual correspondence search," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 924–931.
[35] A. Hosni, M. Bleyer, M. Gelautz, and C. Rhemann, "Local stereo matching using geodesic support weights," in Proc. IEEE Int. Conf. Image Process., Nov. 2009, pp. 2093–2096.
[36] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister, "Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 492–504, Mar. 2009.
[37] N. Atzpadin, P. Kauff, and O. Schreer, "Stereo analysis by hybrid recursive matching for real-time immersive video conferencing," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, pp. 321–334, Mar. 2004.
[38] Patch-Based Multi-View Stereo Software (PMVS—Version 2). (2010) [Online]. Available: http://grail.cs.washington.edu/software/pmvs/
[39] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen, "On benchmarking camera calibration and multi-view stereo for high resolution imagery," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[40] Multi-View Evaluation. (2008) [Online]. Available: http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html
[41] M. Kazhdan, M. Bolitho, and H. Hoppe, "Poisson surface reconstruction," in Proc. 4th Eurograph. Symp. Geometry Process., Jul. 2006, pp. 61–70.

Shuhan Shen received the B.S. and M.S. degrees from Southwest Jiao Tong University, Chengdu, China, in 2003 and 2006, respectively, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2010.
He is currently an Assistant Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include 3D reconstruction and image-based modeling.