

Accurate Multiple View 3D Reconstruction Using Patch-Based Stereo for Large-Scale Scenes
Shuhan Shen

Abstract—In this paper, we propose a depth-map merging based multiple view stereo method for large-scale scenes which takes both accuracy and efficiency into account. In the proposed method, an efficient patch-based stereo matching process is used to generate a depth-map at each image with acceptable errors, followed by a depth-map refinement process to enforce consistency over neighboring views. Compared to state-of-the-art methods, the proposed method can reconstruct quite accurate and dense point clouds with high computational efficiency. Besides, the proposed method could be easily parallelized at image level, i.e., each depth-map is computed individually, which makes it suitable for large-scale scene reconstruction with high resolution images. The accuracy and efficiency of the proposed method are evaluated quantitatively on benchmark data and qualitatively on large data sets.

Index Terms—3D reconstruction, depth-map, multiple view stereo (MVS).

Manuscript received June 19, 2012; revised December 20, 2012; accepted December 28, 2012. Date of publication January 11, 2013; date of current version March 14, 2013. This work was supported in part by the Natural Science Foundation of China under Grant 61105032, the National 973 Key Basic Research Program of China under Grant 2012CB316302, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA06030300. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bulent Sankur. The author is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2013.2237921

I. INTRODUCTION

With the fast development of modern digital cameras, huge numbers of high resolution images can be easily captured nowadays. There is an urgent need to extract 3D structures from these images for many applications, such as architecture heritage preservation, city-scale modeling, and so on. Multiple View Stereo (MVS) reconstruction is a key step in image-based 3D acquisition and has received increasing interest recently. Although great efforts have been made in MVS and some efficient algorithms have been proposed, especially for small and compact objects, the handling of large-scale scenes using high resolution images (6 Megapixel and above) is still an open problem.

According to [1], MVS algorithms can be divided into four classes: voxel based methods [2]–[4], surface evolution based methods [5]–[8], feature point growing based methods [9]–[13], and depth-map merging based methods [14]–[22]. Among these classes, the voxel based methods are only suited to small compact objects within a tight enclosing box; the surface evolution based methods require a reliable initial guess, which is difficult to obtain for large-scale scenes; the feature point growing based methods spread points reconstructed in textured regions to untextured ones, which may leave holes in the final results; the depth-map merging based methods have been proved to be better adapted to large-scale scenes, but their performance is usually lower than that of the others in terms of accuracy and completeness.

In this paper we propose a depth-map merging based MVS method for large-scale scenes which takes both accuracy and efficiency into account. The key of our method is an efficient patch based stereo matching followed by a depth-map refinement process which enforces consistency over multiple views. Compared to state-of-the-art methods, the proposed method has three main advantages: 1) It can reconstruct quite accurate and dense point clouds, since the patch based stereo is able to produce depth-maps with acceptable errors which can be further refined by the depth-map refinement process. 2) It is a computationally efficient method which is about 10 to 20 times faster than the state-of-the-art method [11] while attaining similar accuracy. 3) It could be easily parallelized at image level, i.e., each depth-map is computed individually, which makes it suitable for large-scale scene reconstruction with high resolution images.

II. PREVIOUS WORKS

According to the taxonomy given in [1], the four classes of MVS, voxel based methods, surface evolution based methods, feature point growing based methods, and depth-map merging based methods, are reviewed in this section.

The voxel based methods compute a cost function on a 3D volume which is a bounding box of the object. Seitz et al. [2] propose a voxel coloring framework that traverses a discrete 3D space in a generalized depth-order to identify voxels that have a unique color, constant across all possible interpretations of the scene. Vogiatzis et al. [3] use graph-cut optimization to compute the minimal surface that encloses the largest possible volume, where surface area is just a surface integral in this photo-consistency field. Since the accuracy of these methods is limited by the resolution of the voxel grid, Sinha et al. [4] present a method that does not require the surface to be lying within a finite band around the visual hull. This method uses photo-consistency to guide the adaptive subdivision of a coarse mesh of the bounding volume, which generates a multi-resolution volumetric mesh that is densely tessellated in the parts likely to contain the unknown surface. However, this method is only suited to compact objects admitting a tight enclosing box, and its computational and memory costs become prohibitive for large-scale scenes.

The surface evolution based methods iteratively evolve an initial guess to improve the photo consistency measurement.

Faugeras et al. [5] implement the level set to solve a set


of PDEs that are used to deform an initial set of surfaces
which then move toward the objects to be detected. Hernandez
et al. [6] propose a method based on texture and silhouette
information, and fuse the silhouette force into the snake
framework. This method evolves an initial surface that is close
enough to the object surface using texture and silhouette driven
forces. Hiep et al. [7] use a minimum s-t cut to generate a
coarse initial mesh, then refine it with a variational approach to
capture small details. A main drawback of such methods is the
requirement for a reliable initial guess which is hard to obtain
for outdoor scenes. To this end, Cremers et al. [8] formulate the
reconstruction problem as a convex functional minimization,
where the exact silhouette consistency is imposed as convex
constraints which restrict the domain of feasible functions.
This method does not depend on initialization and can provide
solutions that lie within an error bound of the optimal solution.
However, this method relies on voxel representation of the
space, thus it cannot be used for large-scale scenes.
The feature point growing based methods first reconstruct
points in textured regions, and then expand these points to
untextured ones. Lhuillier et al. [9] propose a quasi-dense approach to 3D surface model acquisition. This method first initializes sparse correspondence points of interest and then resamples quasi-dense points from the quasi-dense disparity map to densify the feature points and overcome the sparseness of the points of interest. Goesele et al. [10] propose a method to handle Internet photo collections containing obstacles using global and local view selection plus a region growing process from reconstructed SIFT [23] features. Based on these methods, Furukawa et al. [11] present quite an accurate Patch-based MVS (PMVS) approach that starts from a sparse set of matched keypoints, and repeatedly expands these till visibility constraints are invoked to filter away false matches. This method is now considered the state-of-the-art MVS method. Based on PMVS, Wu et al. [12] propose a Tensor-based MVS (TMVS) method for quasi-dense 3D reconstruction which combines the complementary advantages of photo-consistency, visibility and geometric consistency enforcement in MVS under the 3D tensors framework. These feature point growing based methods aim at reconstructing a global 3D model by using all the images available simultaneously, thus they suffer from the scalability problem as the number of images grows. Although this problem can be partially solved by decomposing the input images into clusters that have small overlap [13], the computational complexity remains quite high for large-scale scenes.

The depth-map merging based methods are natural extensions from binocular stereo to multiple views. Such methods first compute depth-maps at each view and then merge them together into a single model by taking visibility into account. Goesele et al. [14] use Normalized Cross Correlation (NCC) based pixel window matching techniques to produce depth-maps, then merge them with volumetric integration. Strecha et al. [15] jointly model depth and visibility as a hidden Markov Random Field, and use the EM-algorithm to optimize the model parameters. Merrell et al. [16] first use a computationally cheap stereo algorithm to generate potentially noisy, overlapping depth-maps, and then fuse these depth-maps to obtain an integrated surface based on visibility relations between points. Zach et al. [17] present a method to globally optimize an energy functional consisting of a total variation regularization force and an L1 data fidelity term. Bradley et al. [18] propose a method which uses robust binocular stereo with scaled matching windows, followed by adaptive point-based filtering of the merged point clouds. Campbell et al. [19] store multiple depth hypotheses for each pixel and use a spatial consistency constraint to extract the true depth in a discrete Markov Random Field framework. Liu et al. [20] produce high quality MVS reconstruction results using continuous depth-maps generated by variational optical flow, but this method requires the visual hull as an initialization. Li et al. [21] generate depth-maps using the DAISY [24] feature, and use two stages of bundle adjustment to optimize the positions and normals of 3D points. Tola et al. [22] also use the DAISY feature to generate depth-maps, and then merge them by consistency checking at neighboring views. This method is similar to our method, but ours uses patch based stereo instead of merely matching DAISY features along epipolar lines, which can produce more accurate depth-maps without diminishing the computational efficiency.

Fig. 1. Framework of the proposed method.

III. MVS USING PATCH BASED STEREO

The proposed method consists of four steps: stereo pair selection, depth-map computation, depth-map refinement, and depth-map merging. The framework of the method is illustrated in Fig. 1. For each image in the input image set, we select a reference image to form a stereo pair for depth-map computation. Since the raw depth-maps generated by stereo vision may contain noises and errors, we refine each of them by consistency checking using its neighboring depth-maps. Finally all the refined depth-maps are merged together

to get a final reconstruction. Next we elaborate on each of the steps.

A. Stereo Pair Selection


For each image in the image set, we need to select a
reference image for it for stereo computation. The selection
of stereo image pair is important not only for the accuracy of
the stereo matching but also for the final MVS result. Stereo
pair selection is a relatively easy task for street-side view
cameras on the vehicle [25]–[28] or cameras in a controlled
environment like the Middlebury benchmark data [1], but
needs to be carefully designed for unordered images. A good candidate reference image should have a viewing direction similar to the target image's, and a suitable baseline: neither too short, which degrades the reconstruction accuracy, nor too long, which reduces the common coverage of the scene.
We use a method similar to [21] to select eligible stereo
pairs. Suppose we have n images; for the i-th one, we compute θij, j = 1, . . . , n, which is the angle between the principal view directions of cameras i and j. If the camera poses are calibrated using structure-from-motion (SfM) algorithms [29]–[31], a set of sparse 3D points and their visibilities are generated as a by-product of SfM; then a better θij could be computed as the average of the angles between the visible points and the camera centers, respectively, for cameras i and j. Besides θij, we compute another parameter dij, j = 1, . . . , n, for each image i, which is the distance between the optical centers of cameras i and j. Then for images that satisfy 5° < θij < 60°, we compute the median d̄ of their dij, and remove images whose dij > 2d̄ or dij < 0.05d̄. After these computations, if the number of remaining images is less than k1, they are all considered neighboring images of image i, denoted as N(i). Otherwise, the remaining images are sorted in ascending order according to θij · dij, and the first k1 images form the neighboring images N(i) (in this paper we set k1 = 10). Finally, the one with minimal θij · dij among N(i) is selected as the i-th image's reference image to form a stereo pair.
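For illustration only, the selection rule above can be sketched as follows (Python/NumPy; the function name and the use of principal view directions instead of the SfM-point refinement are assumptions of this sketch, not part of the paper):

```python
import numpy as np

def select_reference(i, view_dirs, centers, k1=10):
    """Sketch of Sec. III.A: pick image i's reference view and neighbor set N(i).

    view_dirs: (n, 3) unit principal view directions; centers: (n, 3) camera
    optical centers. Returns (reference_index, neighbor_indices).
    """
    n = len(centers)
    cand = []
    for j in range(n):
        if j == i:
            continue
        cosang = np.clip(np.dot(view_dirs[i], view_dirs[j]), -1.0, 1.0)
        theta = np.degrees(np.arccos(cosang))        # theta_ij
        d = np.linalg.norm(centers[i] - centers[j])  # d_ij, baseline length
        if 5.0 < theta < 60.0:
            cand.append((j, theta, d))
    if not cand:
        return None, []
    d_med = np.median([c[2] for c in cand])
    # Discard baselines that are too long or too short relative to the median.
    cand = [c for c in cand if 0.05 * d_med <= c[2] <= 2.0 * d_med]
    cand.sort(key=lambda c: c[1] * c[2])             # ascending theta_ij * d_ij
    neighbors = [j for j, _, _ in cand[:k1]]         # N(i), at most k1 views
    return (neighbors[0] if neighbors else None), neighbors
```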
Fig. 2. For each pixel p in the input image, we estimate its corresponding 3D plane. Ci and Cj are the camera centers of the input and reference images respectively; f1, f2 and f3 are three 3D planes on p's viewing ray. Obviously f2 has the minimal aggregated matching cost.

Fig. 3. The support plane is represented by a 3D point Xi and its normal ni in camera Ci's coordinate, where Ci is the camera center of the i-th input image, and Ci−xyz is the camera's coordinate frame.

B. Depth-Map Computation

For each eligible stereo pair, we follow the idea in [32] to compute the depth-map. The core idea is that, for each pixel in the input image, we try to find a good support plane that has minimal aggregated matching cost with the reference image, as shown in Fig. 2.

The support plane f is essentially a local tangent plane of the scene surface, which is represented by a 3D point Xi and its normal ni in the related camera's coordinate system, as shown in Fig. 3.

For the i-th input image Ii in the image set, given its reference image Ij, and the associated camera parameters {Ki, Ri, Ci} and {Kj, Rj, Cj}, where K is the intrinsic parameters, R is the rotation matrix, and C is the camera center, we first assign each pixel p in Ii to a random 3D plane. Suppose p's homogeneous coordinate is

$$p = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \qquad (1)$$

The 3D point Xi must lie on the viewing ray of p. We select a random depth λ in the depth range λ ∈ [λmin, λmax]; then Xi is computed in Ci's coordinate as

$$X_i = \lambda K_i^{-1} p \qquad (2)$$

Then we assign the normal of the plane randomly in camera Ci's spherical coordinate, as

$$n_i = \begin{bmatrix} \cos\theta \sin\phi \\ \sin\theta \sin\phi \\ \cos\phi \end{bmatrix} \qquad (3)$$

where θ is a random angle in the range [0°, 360°], and φ is a random angle in the range [0°, 60°]. These range settings come from a simple assumption that a patch is visible in image Ii when the angle between the patch normal ni and the z axis of camera Ci's coordinate system is below a certain threshold (in this paper we set this threshold as 60°).
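The random initialization of Eqs. (1)–(3) can be sketched as follows (a minimal sketch, assuming NumPy; the function name is ours):

```python
import numpy as np

def random_plane(u, v, K_i, lam_min, lam_max, rng):
    """Draw a random support plane {X_i, n_i} for pixel (u, v), Eqs. (1)-(3)."""
    p = np.array([u, v, 1.0])                    # homogeneous pixel, Eq. (1)
    lam = rng.uniform(lam_min, lam_max)          # random depth on the viewing ray
    X_i = lam * np.linalg.inv(K_i) @ p           # 3D point in C_i's frame, Eq. (2)
    theta = np.radians(rng.uniform(0.0, 360.0))  # azimuth, [0, 360) degrees
    phi = np.radians(rng.uniform(0.0, 60.0))     # at most 60 deg from the z axis
    n_i = np.array([np.cos(theta) * np.sin(phi), # spherical normal, Eq. (3)
                    np.sin(theta) * np.sin(phi),
                    np.cos(phi)])
    return X_i, n_i

# e.g.: X_i, n_i = random_plane(100, 200, K, 0.5, 10.0, np.random.default_rng(0))
```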

The above random initialization process is very likely to produce at least one good guess for each scene plane in the image, especially for high resolution images, in which each scene plane contains more pixels, and thus more guesses, than in low resolution ones. We should note that, once the depth-map for image Ii is computed, we can improve the purely random initialization process when computing the depth-map on Ii's reference image Ij. In this scenario, the depth and patch normal of each pixel in Ii's depth-map could be warped to Ij as an initial estimate when computing Ij's depth-map, and pixels in Ij that do not have mappings between Ii and Ij still use random initializations. In this manner, we can assign each warped pixel in Ij a better initial plane than a random guess, since this plane is consistent for the stereo pair Ii and Ij.
According to [33], given the projection matrices for the two cameras P = [I3×3 | 03] and P′ = [R | t], and a plane defined by πᵀX = 0 with π = (Vᵀ, 1)ᵀ, the homography H induced by the plane is

$$H = R - tV^T \qquad (4)$$

Here, I3×3 is the 3×3 identity matrix and 03 is a zero 3-vector, which indicates that the world coordinate is chosen to coincide with camera P. In this paper, the camera parameters of the image pair are {Ki, Ri, Ci} and {Kj, Rj, Cj}, and the plane fp = {Xi, ni} is defined in camera Ci's coordinate. Thus, the projection matrices and plane parameters can be translated into standard forms (putting the world origin at Ci), as

$$P_i = K_i [I_{3\times3} \mid 0_3], \quad P_j = K_j [R_j R_i^{-1} \mid R_j(C_i - C_j)], \quad V^T = -\frac{n_i^T}{n_i^T X_i}$$

According to Eq. 4, the homography for the cameras $\tilde{P}_i = [I_{3\times3} \mid 0_3]$ and $\tilde{P}_j = [R_j R_i^{-1} \mid R_j(C_i - C_j)]$ is

$$\tilde{H}_{ij} = R_j R_i^{-1} + \frac{R_j(C_i - C_j)\, n_i^T}{n_i^T X_i}$$

Applying the transformations Ki and Kj to the images, we obtain the cameras $P_i = K_i \tilde{P}_i$, $P_j = K_j \tilde{P}_j$, and the resulting induced homography is

$$H_{ij} = K_j \left( R_j R_i^{-1} + \frac{R_j(C_i - C_j)\, n_i^T}{n_i^T X_i} \right) K_i^{-1} \qquad (5)$$
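In code, Eq. (5) amounts to a few matrix products (a sketch under the same notation; NumPy assumed, function name ours):

```python
import numpy as np

def plane_homography(K_i, R_i, C_i, K_j, R_j, C_j, X_i, n_i):
    """Homography H_ij induced by the support plane {X_i, n_i} of pixel p, Eq. (5).

    X_i and n_i are expressed in camera C_i's coordinate frame; K, R, C are the
    intrinsics, rotation matrices, and camera centers of the stereo pair.
    """
    R_rel = R_j @ np.linalg.inv(R_i)            # relative rotation R_j R_i^-1
    t_rel = R_j @ (C_i - C_j)                   # translation term of Eq. (5)
    H_tilde = R_rel + np.outer(t_rel, n_i) / float(n_i @ X_i)
    return K_j @ H_tilde @ np.linalg.inv(K_i)   # apply K_j and K_i^-1
```

A pixel q of Ii (in homogeneous coordinates) then maps into Ij as the dehomogenized product Hij q.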
We set a square window B centered on pixel p, where B = w × w (in this paper we set w = 7 pixels). For each pixel q in B we find its corresponding pixel in the reference image Ij using the homography mapping Hij(q). Then the aggregated matching cost m(p, fp) for pixel p is computed as one minus the Normalized Cross Correlation (NCC) score between q and Hij(q), as

$$m(p, f_p) = 1 - \frac{\sum_{q \in B} (q - \bar{q})\left(H_{ij}(q) - \overline{H_{ij}(q)}\right)}{\sqrt{\sum_{q \in B} (q - \bar{q})^2 \sum_{q \in B} \left(H_{ij}(q) - \overline{H_{ij}(q)}\right)^2}} \qquad (6)$$

Note that some more complex and robust aggregation techniques, like [34]–[36], could be used to generate more reliable results than NCC. However, high resolution images provide more reliable matches than low resolution ones, and a simple NCC is reliable enough to measure the photometric consistency. Besides, most unreliable pixels generated by NCC can be removed in the depth-map refinement step, which makes the final reconstruction result of NCC almost equivalent to those of other, more complex aggregation methods. Thus, in this paper we use simple NCC as the aggregated matching cost, which is the same as [11].
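A direct (unoptimized) evaluation of Eq. (6) might look as follows; the nearest-neighbor sampling of Ij and the out-of-bounds penalty value are simplifying assumptions of this sketch:

```python
import numpy as np

def matching_cost(I_i, I_j, p, H_ij, w=7):
    """Aggregated matching cost m(p, f_p) of Eq. (6): one minus the NCC between
    the w x w window B around p in I_i and its homography-mapped samples in I_j."""
    r = w // 2
    u0, v0 = p
    a, b = [], []
    for dv in range(-r, r + 1):
        for du in range(-r, r + 1):
            ui, vi = u0 + du, v0 + dv
            if not (0 <= vi < I_i.shape[0] and 0 <= ui < I_i.shape[1]):
                return 2.0                 # window leaves I_i: worst cost
            qj = H_ij @ np.array([ui, vi, 1.0])   # map q into I_j via Eq. (5)
            x = int(round(qj[0] / qj[2]))
            y = int(round(qj[1] / qj[2]))
            if not (0 <= y < I_j.shape[0] and 0 <= x < I_j.shape[1]):
                return 2.0                 # mapped window leaves I_j
            a.append(I_i[vi, ui])
            b.append(I_j[y, x])
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom < 1e-12:
        return 2.0                         # textureless window, unreliable
    return 1.0 - (a @ b) / denom           # in [0, 2]; lower is better
```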
After the initialization, each pixel in image Ii is associated with a 3D plane. Then we process the pixels in Ii one by one to refine the planes over k2 iterations. At odd iterations, we start from the top-left pixel and traverse in row-wise order until we reach the bottom-right pixel. At even iterations, we reverse the order and visit the pixels from the bottom-right to the top-left, also in row-wise order. In this paper we set the number of plane refinement iterations as k2 = 3.

At each iteration, each pixel undergoes two operations, called spatial propagation and random assignment. Spatial propagation is used to compare and propagate the planes of neighboring pixels to the current pixel. In odd iterations, the neighboring pixels are the left, upper, and upper-left neighbors, and in even iterations the right, lower, and lower-right neighbors. Let pN denote a neighbor of the current pixel p, and fpN denote pN's plane; we use the matching cost in Eq. 6 to check the condition m(p, fpN) < m(p, fp). If this condition is satisfied, we consider fpN a better choice for p than its current plane fp, and propagate fpN to p, i.e., set fp = fpN. This spatial propagation process relies on the fact that neighboring pixels are very likely to have similar 3D planes, especially in high resolution images. Theoretically, even a single good guess is enough to propagate this plane to the other pixels of the region after the first and second spatial propagations.

For each pixel p, after spatial propagation, we further refine the plane fp using random assignment. The purpose of the random assignment is to further reduce the matching cost in Eq. 6 by testing several random plane parameters. Given a range {Δλ, Δθ, Δφ}, we 1) select random plane parameters {λ′, θ′, φ′} in the ranges λ′ ∈ [λ − Δλ, λ + Δλ], θ′ ∈ [θ − Δθ, θ + Δθ], φ′ ∈ [φ − Δφ, φ + Δφ]; 2) compute the new plane f′p = {X′i, n′i} using Eq. 2 and Eq. 3; 3) if m(p, f′p) < m(p, fp), accept f′p, i.e., set fp = f′p, λ = λ′, θ = θ′, φ = φ′; 4) halve the range {Δλ, Δθ, Δφ}; and 5) go back to step one. The above process is repeated k3 times. In this paper, we set the initial ranges and the repetition count as Δλ = (λmax − λmin)/4, Δθ = 90°, Δφ = 15°, and k3 = 6. This random assignment process progressively reduces the search range in order to capture depth and normal details.

The spatial propagation and random assignment idea has already been successfully applied in the PatchMatch stereo method [32] and the hybrid recursive matching (HRM) method [37]. This paper extends the idea of [32] to Multiple View Stereo for large-scale scenes using high resolution images. Thus, the novelty of the proposed approach is to modify the PatchMatch stereo algorithm [32] in a proper way that makes it more powerful and efficient for the large-scale MVS problem. The main difference between the proposed approach and the method in [32] is that the plane is defined in the image coordinate in [32] but in the camera's coordinate in our work, because the stereo pair is not rectified in our paper.


Besides, by taking full advantage of multiple high resolution images, we make three simplifications compared with [32] in order to reduce the computational expense. Firstly, the aggregated matching cost m(p, fp) in our work is a simple normalized cross correlation rather than the more complex adaptive support weight version in [32], because the NCC is reliable enough for high resolution images and the remaining errors could be removed in the following depth-map refinement step. Secondly, we only use spatial propagation, compared to the spatial and view propagation in [32], because we only compute the depth-map on Ii, not on both Ii and Ij as in [32]. Thirdly, the method in [32] contains another post-processing step that applies occlusion treatment via left/right consistency checking and fills invalidated pixels
as well. Our method does not have this process because the following depth-map refinement step can achieve similar effects.

After the spatial propagation and random assignment processes, we remove unreliable points in the depth-map whose aggregated matching costs are above a certain threshold τ1 (in this paper we set τ1 = 0.3).
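Putting the pieces of this subsection together, one possible organization of the per-pixel refinement loop is sketched below (the (λ, θ, φ) parametrization of `planes` and the `cost` callback evaluating Eq. (6) are conventions assumed by this sketch):

```python
import numpy as np

def refine_planes(planes, cost, H, W, lam_min, lam_max, k2=3, k3=6):
    """Sketch of the plane refinement of Sec. III.B.

    planes[v][u] = (lam, theta_deg, phi_deg) parametrizes the support plane via
    Eqs. (2)-(3); cost(u, v, plane) evaluates Eq. (6).
    """
    rng = np.random.default_rng()
    for it in range(1, k2 + 1):
        odd = (it % 2 == 1)
        rows = range(H) if odd else range(H - 1, -1, -1)
        cols = range(W) if odd else range(W - 1, -1, -1)
        # Causal neighbors: left/upper/upper-left on odd iterations,
        # right/lower/lower-right on even iterations.
        offs = [(-1, 0), (0, -1), (-1, -1)] if odd else [(1, 0), (0, 1), (1, 1)]
        for v in rows:
            for u in cols:
                best = cost(u, v, planes[v][u])
                # Spatial propagation: adopt a neighbor's plane if it is better.
                for du, dv in offs:
                    uu, vv = u + du, v + dv
                    if 0 <= uu < W and 0 <= vv < H:
                        c = cost(u, v, planes[vv][uu])
                        if c < best:
                            best, planes[v][u] = c, planes[vv][uu]
                # Random assignment: perturb within a range that halves k3 times.
                d_lam, d_th, d_ph = (lam_max - lam_min) / 4.0, 90.0, 15.0
                for _ in range(k3):
                    lam, th, ph = planes[v][u]
                    cand = (np.clip(lam + rng.uniform(-d_lam, d_lam), lam_min, lam_max),
                            th + rng.uniform(-d_th, d_th),
                            np.clip(ph + rng.uniform(-d_ph, d_ph), 0.0, 60.0))
                    c = cost(u, v, cand)
                    if c < best:
                        best, planes[v][u] = c, cand
                    d_lam, d_th, d_ph = d_lam / 2, d_th / 2, d_ph / 2
    # Pixels whose final cost exceeds tau_1 = 0.3 are then discarded.
    return planes
```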
Fig. 4. Illustration of redundancy removal by the depth-map test. For each pixel in camera Ci's depth-map, we back project it to 3D as X using Eq. (7) and reproject X to Ci's neighboring cameras N1 . . . N4. Define d(X, N#) as the depth of X with respect to camera N# and λ(X, N#) as the depth value computed at the projection of X in N# using N#'s depth-map. If d(X, N#) < λ(X, N#), we say the projection of X in N# is occluded and remove it from N#'s depth-map, as in N1 and N2. If |d(X, N#) − λ(X, N#)| / λ(X, N#) < τ2, we say X is duplicated in Ci and N# and remove it from N#'s depth-map, as in N4. Points satisfying neither of the above two conditions are retained in their depth-maps, as in N3.
C. Depth-Map Refinement
Since the raw depth-maps may not completely agree with each other on common areas due to depth errors, a refinement process is carried out to enforce consistency over neighboring views. For each point p in image Ii, we back project it to 3D using its depth λ and the camera parameters, as

$$X = \lambda R_i^T K_i^{-1} p + C_i \qquad (7)$$

where p is the homogeneous coordinate defined in Eq. 1 and X is the 3D point in the world coordinate. Then we project X to Ii's neighboring images N(i), which are generated in the stereo pair selection stage. Suppose Nk is the k-th neighboring image in N(i); we denote by d(X, Nk) the depth of X with respect to camera Nk and by λ(X, Nk) the depth value computed at the projection of X in Nk using Nk's depth-map. If λ(X, Nk) is close enough to d(X, Nk), i.e.,

$$\frac{|d(X, N_k) - \lambda(X, N_k)|}{\lambda(X, N_k)} < \tau_2$$

where τ2 is a threshold (in this paper we set τ2 = 0.01), we say X is consistent in Ii and Nk. If X is consistent for at least k4 neighboring images in N(i) (in this paper we set k4 = 2), it is regarded as a reliable scene point and its corresponding pixel p in Ii's depth-map is retained; otherwise it is removed.

After the above refinement process, most errors could be removed, which results in a relatively clean depth-map at each view.
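A minimal sketch of this consistency test follows (the per-camera `depth_of`/`stored_depth` accessors are hypothetical helpers of this sketch, not an API from the paper):

```python
import numpy as np

def back_project(p_uv, lam, K, R, C):
    """Eq. (7): lift pixel p = (u, v) with depth lam into world coordinates."""
    p = np.array([p_uv[0], p_uv[1], 1.0])
    return lam * R.T @ np.linalg.inv(K) @ p + C

def is_reliable(X, neighbors, tau2=0.01, k4=2):
    """Keep X if its depth agrees with at least k4 neighboring depth-maps.

    Each camera object in `neighbors` is assumed to provide:
      depth_of(X)     -> d(X, Nk), the depth of X in Nk's frame;
      stored_depth(X) -> lam(X, Nk), the depth-map value at X's projection,
                         or None if X projects outside the image.
    """
    consistent = 0
    for cam in neighbors:
        d = cam.depth_of(X)
        lam = cam.stored_depth(X)
        if lam is not None and abs(d - lam) / lam < tau2:
            consistent += 1
    return consistent >= k4
```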
D. Depth-Map Merging

After refinement, all the depth-maps could be merged to represent the scene. However, merging the depth-maps directly may produce lots of redundancies, because different depth-maps may have common coverage of the scene, especially for neighboring images. In order to remove these redundancies, the depth-maps are further reduced by a neighboring depth-map test. As illustrated by Fig. 4, for each pixel in camera Ci's depth-map, we back project it to 3D as X using Eq. 7 and reproject X to Ci's neighboring cameras. If the depth of X with respect to the neighboring camera is smaller than the depth value computed at the projection of X in the neighboring camera's depth-map, like cameras N1 and N2 in Fig. 4, we say the projection of X in this neighboring camera is occluded and remove it from this neighboring camera's depth-map. If these two depth values are close enough, like camera N4 in Fig. 4, we say the projection of X in this neighboring camera represents the same point as X, which is a redundancy, and also remove it from this neighboring camera's depth-map.

Finally, all depth-maps are back projected into 3D and merged into a single point cloud. The final point cloud is usually quite dense, especially when using high resolution images. If we want to make it sparser, we can simply back project points at sparse locations in the depth-maps. For example, using only points at image locations (2n, 2n) in the depth-map will approximately reduce the size of the point cloud to a quarter of the size that uses all points. This gives us a way to control the point cloud size according to memory and storage limitations.
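The neighboring depth-map test of Fig. 4 could then be sketched as follows (same hypothetical camera interface as above, plus a `remove_at` deletion helper; checking the duplicate case before the occlusion case is our reading of the two conditions):

```python
def prune_depth_map(pixels_i, neighbors, tau2=0.01):
    """Sketch of the redundancy-removal test of Sec. III.D / Fig. 4.

    pixels_i yields (pixel, X) pairs back projected from camera Ci's depth-map
    via Eq. (7); each neighbor camera object is assumed to offer depth_of(X),
    stored_depth(X) as above, and remove_at(X), which deletes the depth sample
    at X's projection in that camera's depth-map.
    """
    for _, X in pixels_i:
        for cam in neighbors:
            lam = cam.stored_depth(X)
            if lam is None:
                continue
            d = cam.depth_of(X)
            if abs(d - lam) / lam < tau2:
                cam.remove_at(X)   # duplicate of X (like N4); keep Ci's copy
            elif d < lam:
                cam.remove_at(X)   # stored sample occluded by X (like N1, N2)
```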
IV. EXPERIMENTAL RESULTS

A. Two State-of-the-Art Methods for Comparison

We compared our method with two state-of-the-art methods [11], [22]. The PMVS [11] method is a feature point growing based method which repeatedly expands a sparse set of matched keypoints and uses visibility constraints to filter away false matches. The DAISY based method [22] is a depth-map merging based method for ultra high-resolution image sets.

TABLE I
PARAMETER SETTINGS OF THE PROPOSED METHOD

Parameter | Section         | Description                                     | Value
k1        | III.A           | Maximum number of neighboring images            | 10
w         | III.B           | Size of the matching window, w × w              | 7 pixels
k2        | III.B           | Number of iterations for plane refinement       | 3
Δλ        | III.B           | Depth range for random plane assignment         | (λmax − λmin)/4
Δθ        | III.B           | Angle range for random plane assignment         | 90°
Δφ        | III.B           | Angle range for random plane assignment         | 15°
k3        | III.B           | Repetition time for random assignment           | 6
τ1        | III.B           | Matching cost threshold for reliable points     | 0.3
τ2        | III.C and III.D | Threshold to measure depth closeness            | 0.01
k4        | III.C           | Minimal number of consistent neighboring images | 2

Fig. 5. Sample images and their ground truth depth-maps of the benchmark data sets. In both (a) and (b), the left two images are from Fountain-P11 and the right two images are from Herz-Jesu-P8.

This method is similar to ours, but its stereo matching process is based on DAISY [24] descriptors at each pixel.

These two methods are now considered state-of-the-art MVS methods, and can be used for large-scale scenes as ours can. In the next three subsections, the proposed method, together with the PMVS and DAISY methods, is evaluated quantitatively and qualitatively on different data sets. All the experiments are implemented on an Intel 2.8 GHz Quad Core CPU with 16 GB RAM.

B. Parameter Settings

The proposed method has nine parameters, and we have already discussed their value settings in Section III. Table I is a summary.

The key of the DAISY method is the DAISY descriptor, and we set the DAISY parameters as: R = 8, Q = 2, T = 4, and H = 4. For the details of these parameters one can refer to [22], [24]. The authors of [22] suggest first computing depths at sparse locations of the image in order to constrain the search range for neighboring pixels. Thus, in the experiments we first select control pixels using a sampling step of 10 pixels on the image, and compute their depths. Then we compute the depths of the other pixels with the depth range constrained by their closest four neighboring control pixels.

The PMVS method may run out of memory for a large number of high resolution images, thus we use a clustered version of PMVS [13] which first decomposes the input images into a set of image clusters of manageable size and then runs PMVS on each cluster separately. The authors of [11], [13] provide the source codes of PMVS and clustered-PMVS, and we set its parameters as: level = 0, csize = 1, threshold = 0.7, wsize = 7, minImageNum = 3. level specifies the level in the image pyramid that is used for the computation, and level = 0 means the original resolution images are used. csize controls the density of reconstructions, and csize = 1 means the software tries to reconstruct a patch at every pixel. threshold = 0.7 is a threshold for the photometric consistency measurement, which is the same as 1 − τ1 in our method. minImageNum = 3 means each 3D point must be visible in at least 3 images to be reconstructed, which is suggested by the authors. These parameter settings ensure that the PMVS method tries to reconstruct a 3D point at every pixel with full resolution images, which is the same as ours. For a full description of these parameters, one can refer to [11], [13], [38].

The proposed method, as well as the DAISY method, could be easily parallelized at image level, i.e., each depth-map is computed individually. Thus, in these two methods we use four cores for parallel computing. For the PMVS method, we set its parameter CPU = 4, which instructs the code to use four cores on the Quad Core CPU as ours does.

Since memory consumption is a key problem for large-scale reconstructions, we analyze the memory requirements of the three evaluated methods.

Fig. 6. Final reconstruction results (colorized point cloud rendering) of the proposed method on the benchmark data sets. In both (a) and (b), the results are rendered from three different view points (the right one is seen from the top view).

Fig. 7. Depth-maps generated after k2 = 1, 2, . . . , 5 iterations in the depth-map computation process. In both (a) and (b), the top row are the depth-maps generated after one to five iterations, respectively, and the bottom row are the absolute depth differences between neighboring iterations.

The PMVS method needs to load all images (or all images in a cluster) simultaneously, which means it requires H × W × 4 × n bytes of memory, where H and W are the height and width of the image respectively, 4 bytes is the size of a pixel since color images are converted to single-precision floating point gray images, and n is the size of the image set. Apparently, the PMVS method may suffer from the scalability problem (out of memory) as the number of images grows.

Fig. 8. Depth-map and back projected 3D points (colored rendering) after each step for the fourth image in Fountain-P11. In (a)–(c), the top image is the depth-map and the bottom is the back projected 3D points.

Fig. 9. Depth-map and back projected 3D points (colored rendering) after each step for the second image in Herz-Jesu-P8. In (a)–(c), the top image is the depth-map and the bottom is the back projected 3D points.

On the contrary, the proposed and the DAISY methods can avoid this scalability problem since each depth-map is computed and refined individually. The DAISY method loads two images and computes descriptors at each of their pixels for depth-map computation. Each descriptor requires 36 floating point numbers, which means that the DAISY method requires H × W × 36 × 4 × 2 = H × W × 288 bytes of memory. The proposed method also loads two images for depth-map computation and requires H × W × 4 × 2 = H × W × 8 bytes of memory. Obviously, the proposed method has a much lower memory requirement compared with the DAISY and PMVS methods.
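As a quick check of the figures derived above, the three per-method image-buffer footprints can be tabulated in a few lines (the function name is ours; the estimates are bytes and ignore any overhead beyond the loaded image buffers):

```python
def memory_bytes(H, W, n_images):
    """Peak image-buffer footprint implied by the analysis above, in bytes."""
    return {
        "PMVS": H * W * 4 * n_images,  # all gray float images loaded at once
        "DAISY": H * W * 288,          # two images, 36 floats per pixel descriptor
        "proposed": H * W * 8,         # two gray float images
    }

# For the 3072 x 2048 benchmark images with n = 11:
# PMVS ~ 277 MB, DAISY ~ 1.8 GB, proposed ~ 50 MB.
```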
TABLE II
NUMBERS OF CORRECT AND ERROR PIXELS AFTER EACH STEP FOR THE FOURTH IMAGE IN FOUNTAIN-P11 AND THE SECOND IMAGE IN HERZ-JESU-P8

                        Fountain-P11                    Herz-Jesu-P8
Step                    Correct pixels  Error pixels    Correct pixels  Error pixels
depth-map computation   5 174 163       737 211         4 477 824       990 058
depth-map refinement    4 603 032       165 868         3 992 684       176 736
depth-map merging       5 254 683       260 351         4 306 332       266 224

C. Quantitative Evaluation on Benchmark Data

To quantitatively evaluate our method, two benchmark data sets, Fountain-P11 and Herz-Jesu-P8, provided by Strecha et al. [39], [40], are used. Fountain-P11 and Herz-Jesu-P8 have 11 and 8 images respectively with 3072 × 2048 resolution (6 Megapixel). Fig. 5(a) shows some sample images in the data sets. The ground truth is obtained by laser scanning (LIDAR) and is a single high resolution triangle mesh model. The benchmark data site [40] can quantitatively evaluate MVS results in triangulated mesh form against the ground truth.

Fig. 10. In both (a) and (b), from left to right: depth error maps for the fourth image in Fountain-P11 using the proposed method, the DAISY method, and the PMVS method, respectively. In all the images, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0].

Fig. 11. In both (a) and (b), from left to right: depth error maps for the second image in Herz-Jesu-P8 using the proposed method, the DAISY method, and the PMVS method, respectively. In all the images, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0].

This mesh model could be easily generated from the point cloud using some meshing algorithm [41]. However, the three methods evaluated in this section all output 3D point clouds, thus a more direct way is to compare the raw outputs in point form rather than in a refined mesh form. To make this comparison feasible, we first project the ground truth to each image to generate ground truth depth-maps. Since the ground truth model is in 3D triangulated mesh form, the ground truth depth for each pixel is obtained from the 3D triangle mesh by computing the depth of the first triangle intersection with the camera ray going through this pixel. After this process, the ground truth depth-maps are generated. Fig. 5(b) shows some sample ground truth depth-maps.

The three evaluated MVS methods are used to reconstruct point clouds on the benchmark data with the parameters given in Section IV.B, and the results generated by the proposed method are shown in Fig. 6.

TABLE III
NUMBERS OF CORRECT AND ERROR PIXELS USING THE THREE EVALUATED METHODS FOR THE FOURTH IMAGE IN FOUNTAIN-P11 AND THE SECOND IMAGE IN HERZ-JESU-P8

                   Fountain-P11                                  Herz-Jesu-P8
Method             Correct pixels  Error pixels  Error/Correct   Correct pixels  Error pixels  Error/Correct
Proposed method    5 254 683       260 351       4.9%            4 306 332       266 224       6.2%
DAISY              5 163 432       263 544       5.1%            4 107 572       382 104       9.3%
PMVS               3 853 304       246 696       6.4%            2 838 744       346 312       12.2%

Fig. 12. Number of correct pixels using the three evaluated methods. For each pixel, its depth is considered to be correct if the depth error e is below τe = 0.01. (a) Result for Fountain-P11, which contains 11 images. (b) Result for Herz-Jesu-P8, which contains eight images.

Fig. 13. Total number of correct pixels in all images as a function of the error threshold. (a) Result for Fountain-P11. (b) Result for Herz-Jesu-P8.

Then we project the point cloud computed by each method to each image for quantitative evaluation against the ground truth depth-map.

For each pixel in the image, we denote the depth computed by the MVS method by d and the depth of the ground truth by dgt; the depth error between the computed depth and the ground truth is measured as

$$e = \frac{|d - d_{gt}|}{d_{gt}} \qquad (8)$$

If the depth error e is below a threshold τe, the depth d is considered correct (in this paper we set τe = τ2 = 0.01). Eq. 8 is a quantitative measurement of how accurate a reconstructed depth is, and the following evaluations are all based on this measurement.
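The correct/error pixel counts reported below follow directly from Eq. (8); a small sketch (using NaN as the missing-value marker is an assumption of this sketch):

```python
import numpy as np

def count_correct(d_mvs, d_gt, tau_e=0.01):
    """Count correct and error pixels under the relative depth error of Eq. (8).

    d_mvs, d_gt: 2D depth-maps; pixels with missing MVS depth or missing
    ground truth (marked NaN) are excluded from both counts.
    """
    valid = ~np.isnan(d_mvs) & ~np.isnan(d_gt) & (d_gt > 0)
    e = np.abs(d_mvs[valid] - d_gt[valid]) / d_gt[valid]   # Eq. (8)
    return int((e < tau_e).sum()), int((e >= tau_e).sum())
```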
The first experiment illustrates the effect of the iteration number k2 for plane refinement in the depth-map computation process, and shows why we choose k2 = 3 as the iteration number. We select the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8, and compute their depth-maps with different values of k2 (k2 = 1, 2, . . . , 5), as shown in Fig. 7. In both Fig. 7(a) and 7(b), the top row are the depth-maps generated after one to five iterations respectively, and the bottom row are the absolute depth differences between neighboring iterations. The results show that only a few depth changes can be found after three iterations; thus setting k2 = 3 provides a good balance between accuracy and efficiency.

Fig. 14. Sample images of the large data sets. The left two images are from Hull and the right two images are from Life Science Building.


Fig. 15. Final reconstruction results (colorized point cloud rendering) of the three evaluated methods on the Hull data set. In (a)–(c), the results are rendered from three different view points (the right one is seen from the top view).

The proposed method has three steps: depth-map computation, depth-map refinement, and depth-map merging. To illustrate the effect of each step, we show the depth-maps and back projected 3D points after each step for the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8 in Fig. 8 and Fig. 9 respectively. Besides the visual results, the numbers of correct and error pixels after each step are given in Table II based on the measurement given in Eq. 8. The results show that after depth-map computation the patch based stereo can generate acceptable depth-maps, which nevertheless still contain certain visible errors.


Fig. 16. Final reconstruction results (colorized point cloud rendering) of the three evaluated methods on the Life Science Building data set. In (a)–(c), the results are rendered from three different view points (the right one is seen from the top view).

After depth-map refinement, 77% and 82% of the error pixels are removed in Fountain-P11 and Herz-Jesu-P8 respectively, which results in a relatively clean point cloud. Finally, after depth-map merging some holes are filled, like the left part of the fountain's base in Fig. 8.

Fig. 10 and Fig. 11 show the depth error maps for the fourth image in Fountain-P11 and the second image in Herz-Jesu-P8 using the three evaluated methods respectively. In these figures, blue pixels encode missing depth values by the MVS method, green pixels encode missing ground truth data, red pixels encode an error e larger than τe, and pixels with errors between 0 and τe are encoded in gray [255, 0]. The results show that our method and the DAISY method can generate much denser points than the PMVS method. Although the parameters of PMVS have been set so that it tries to reconstruct a 3D point at every pixel, it still leaves lots of pixels without depths. Compared with the DAISY method, the proposed method achieves more accurate results since its error maps are brighter, and brighter means the depth errors are smaller. Table III gives a quantitative evaluation of the results shown in Fig. 10 and 11 by counting the numbers of correct and error pixels. The results show that, compared with the DAISY and PMVS methods, the proposed method not only produces more correct pixels but also has lower error/correct ratios.

Besides single images, we compute the depth errors across all images in the data sets and evaluate the overall quality of the reconstruction results generated by the three evaluated methods. For each image, we count the number of correct pixels whose depth errors are below τe and show the results in Fig. 12. The results show that in all the images the correct pixel numbers of the proposed method and the DAISY method are almost the same, and are approximately 1.5 times larger than that of the PMVS method.

To further evaluate the reconstruction accuracy, we count the total number of correct pixels in all images as a function of the error threshold τe.

TABLE IV
COMPUTATIONAL TIME OF THE THREE EVALUATED METHODS (MINUTES)

Data set                     Proposed method  DAISY  PMVS
Fountain-P11 (11 images)     9.5              11.9   127
Herz-Jesu-P8 (eight images)  7.1              7.5    106

TABLE V
COMPUTATIONAL TIME OF THE THREE EVALUATED METHODS (MINUTES)

Data set                              Proposed method  DAISY  PMVS
Hull (61 images)                      45.3             49.2   621
Life Science Building (102 images)    81               77.9   1579
We set τe to 200 values uniformly distributed in the range [0, 0.02], and show the results in Fig. 13. The results show that when the error threshold τe is quite small (below 0.002), the proposed method and the PMVS method get more correct pixels than the DAISY method, which indicates that if we are concerned with high-accuracy reconstruction results, the proposed method and PMVS both outperform the DAISY method. This result comes from the natures of the three evaluated methods. The proposed method and the PMVS method rely on refinement of the plane location and normal in a continuous domain, but the DAISY method matches DAISY descriptors at discrete pixel locations along the epipolar line, which results in discrete depths in space, and this discrete nature affects its accuracy.

Finally, we evaluate the speed of the different methods. First we analyze the computational complexity of the proposed method. The complexities of depth-map computation, refinement, and merging are O(HWBD), O(HWk1), and O(HWk1) respectively, where H and W are the height and width of the image respectively, B is the size of the matching window, D is the number of depths to be tested, and k1 is the maximum number of neighboring images. In this paper, B = w × w = 49 and k1 = 10. Obviously, the complexities of depth-map refinement and merging are much lower than that of the depth-map computation. For each pixel in the depth-map computation step, we compute its cost aggregation once at the beginning, followed by three iterations to refine the plane. At each iteration we compute the pixel's cost aggregation three times using its three neighbor pixels' plane parameters for spatial propagation and six times for random plane assignment. This gives the number of depths to be tested for each pixel: D = 1 + (3 + 6) × 3 = 28. Compared to traditional stereo matching approaches that have to test a large number of depth hypotheses, the proposed method significantly reduces the computational complexity. The speeds of the three evaluated methods are shown in Table IV. The results show that the proposed method runs at a similar speed to the DAISY method, and is about 13 to 15 times faster than PMVS.

D. Qualitative Evaluation on Large Data Sets

In this section we test the proposed method on large data sets. Two data sets are used here: one is the Hull data set provided by [38], which includes 61 images with 3008 × 2000 (6 Megapixel) resolution, and the other is the Life Science Building data set captured at Tsinghua University, which includes 102 images with 3456 × 2304 (8 Megapixel) resolution. Some sample images of these two data sets are shown in Fig. 14.

We qualitatively evaluate the three different methods on these large data sets. Some reconstruction results are shown in Fig. 15 and Fig. 16, and the computational time is shown in Table V. The results show that all three methods achieve acceptable reconstruction results, but the proposed method and the PMVS method perform more accurately than the DAISY method, which can be seen most clearly in Fig. 16. Compared to the PMVS method, ours gets more complete results, like the walls in Fig. 15(d) and Fig. 16(d). The proposed method and the DAISY method run about 13 times faster than PMVS on the Hull and 20 times faster on the Life Science Building.

V. CONCLUSION

In this paper we propose a depth-map merging based MVS method for large-scale scenes which takes both accuracy and efficiency into account. The key of the proposed method is an efficient patch based stereo matching plus a depth-map refinement process that enforces consistency over multiple views. Compared to state-of-the-art MVS methods, the proposed method has three main advantages: 1) The reconstructed point cloud is quite accurate and dense, since the patch based stereo is able to produce depth-maps with acceptable errors which can be further refined by the depth-map refinement process. 2) The proposed method is quite efficient, being about 10 to 20 times faster than the PMVS method while attaining similar accuracy. 3) It could be easily parallelized at image level, i.e., each depth-map is computed individually, which makes it suitable for large-scale scene reconstruction with high resolution images.
REFERENCES

[1] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 519–528.
[2] S. M. Seitz and C. R. Dyer, "Photorealistic scene reconstruction by voxel coloring," Int. J. Comput. Vis., vol. 35, no. 2, pp. 151–173, Nov. 1999.
[3] G. Vogiatzis, C. Hernandez, P. H. Torr, and R. Cipolla, "Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2241–2246, Dec. 2007.
[4] S. N. Sinha, P. Mordohai, and M. Pollefeys, "Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[5] O. Faugeras and R. Keriven, "Variational principles, surface evolution, PDE's, level set methods, and the stereo problem," IEEE Trans. Image Process., vol. 7, no. 3, pp. 336–344, Jun. 1998.
[6] C. Hernandez and F. Schmitt, "Silhouette and stereo fusion for 3D object modeling," Comput. Vis. Image Understand., vol. 96, no. 3, pp. 367–392, Dec. 2004.
[7] V. H. Hiep, R. Keriven, P. Labatut, and J.-P. Pons, "Towards high-resolution large-scale multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1430–1437.
[8] D. Cremers and K. Kolev, "Multiview stereo and silhouette consistency via convex functionals over convex domains," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 6, pp. 1161–1174, Jun. 2011.
[9] M. Lhuillier and L. Quan, "A quasi-dense approach to surface reconstruction from uncalibrated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 418–433, Mar. 2005.
[10] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz, "Multi-view stereo for community photo collections," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[11] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1362–1376, Aug. 2010.
[12] T.-P. Wu, S.-K. Yeung, J. Jia, and C.-K. Tang, "Quasi-dense 3D reconstruction using tensor-based multiview stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1482–1489.
[13] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, "Towards internet-scale multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1434–1441.
[14] M. Goesele, B. Curless, and S. M. Seitz, "Multi-view stereo revisited," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Oct. 2006, pp. 2402–2409.
[15] C. Strecha, R. Fransens, and L. V. Gool, "Combined depth and outlier estimation in multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Oct. 2006, pp. 2394–2401.
[16] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, and J.-M. Frahm, "Real-time visibility-based fusion of depth maps," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[17] C. Zach, T. Pock, and H. Bischof, "A globally optimal algorithm for robust TV-L1 range image integration," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[18] D. Bradley, T. Boubekeur, and W. Heidrich, "Accurate multi-view reconstruction using robust binocular stereo and surface meshing," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[19] N. D. Campbell, G. Vogiatzis, C. Hernandez, and R. Cipolla, "Using multiple hypotheses to improve depth-maps for multi-view stereo," in Proc. Eur. Conf. Comput. Vis., Oct. 2008, pp. 766–779.
[20] Y. Liu, X. Cao, Q. Dai, and W. Xu, "Continuous depth estimation for multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2121–2128.
[21] J. Li, E. Li, Y. Chen, L. Xu, and Y. Zhang, "Bundled depth-map merging for multi-view stereo," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Aug. 2010, pp. 2769–2776.
[22] E. Tola, C. Strecha, and P. Fua, "Efficient large-scale multi-view stereo for ultra high-resolution image sets," Mach. Vis. Appl., vol. 23, no. 5, pp. 903–920, 2012.
[23] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[24] E. Tola, V. Lepetit, and P. Fua, "DAISY: An efficient dense descriptor applied to wide-baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, May 2010.
[25] D. Gallup, J.-M. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys, "Real-time plane-sweeping stereo with multiple sweeping directions," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.
[26] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. N. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles, "Detailed real-time urban 3D reconstruction from video," Int. J. Comput. Vis., vol. 72, no. 2, pp. 143–167, 2008.
[27] G. Zhang, J. Jia, T.-T. Wong, and H. Bao, "Consistent depth maps recovery from a video sequence," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp. 974–988, Jun. 2009.
[28] D. Gallup, J.-M. Frahm, and M. Pollefeys, "Piecewise planar and non-planar stereo for urban scene reconstruction," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1418–1425.
[29] N. Snavely, S. M. Seitz, and R. Szeliski, "Modeling the world from internet photo collections," Int. J. Comput. Vis., vol. 80, no. 2, pp. 189–210, 2008.
[30] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in Proc. IEEE Int. Conf. Comput. Vis., Sep. 2009, pp. 72–79.
[31] J.-M. Frahm, P. George, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys, "Building Rome on a cloudless day," in Proc. Eur. Conf. Comput. Vis., Sep. 2010, pp. 368–381.
[32] M. Bleyer, C. Rhemann, and C. Rother, "PatchMatch stereo—Stereo matching with slanted support windows," in Proc. Brit. Mach. Vis. Conf., Aug.–Sep. 2011, pp. 14.1–14.11.
[33] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[34] K.-J. Yoon and I.-S. Kweon, "Locally adaptive support-weight approach for visual correspondence search," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 924–931.
[35] A. Hosni, M. Bleyer, M. Gelautz, and C. Rhemann, "Local stereo matching using geodesic support weights," in Proc. IEEE Int. Conf. Image Process., Nov. 2009, pp. 2093–2096.
[36] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister, "Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 492–504, Mar. 2009.
[37] N. Atzpadin, P. Kauff, and O. Schreer, "Stereo analysis by hybrid recursive matching for real-time immersive video conferencing," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, pp. 321–334, Mar. 2004.
[38] Patch-Based Multi-View Stereo Software (PMVS—Version 2). (2010) [Online]. Available: http://grail.cs.washington.edu/software/pmvs/
[39] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen, "On benchmarking camera calibration and multi-view stereo for high resolution imagery," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[40] Multi-View Evaluation. (2008) [Online]. Available: http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html
[41] M. Kazhdan, M. Bolitho, and H. Hoppe, "Poisson surface reconstruction," in Proc. 4th Eurograph. Symp. Geometry Process., Jul. 2006, pp. 61–70.

Shuhan Shen received the B.S. and M.S. degrees from Southwest Jiao Tong University, Chengdu, China, in 2003 and 2006, respectively, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2010. He is currently an Assistant Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include 3D reconstruction and image-based modeling.
