Instant 3D Photography: Peter Hedman, Johannes Kopf
[Figure 1 panels: dual camera phone; input burst of 34 color-and-depth photos, captured in 34.0 seconds; our 3D panorama (showing color, depth, and a 3D effect), generated in 34.7 seconds.]
Fig. 1. Our work enables practical and casual 3D capture with regular dual camera cell phones. Left: A burst of input color-and-depth image pairs that we
captured with a dual camera cell phone at a rate of one image per second. Right: 3D panorama generated with our algorithm in about the same time it took to
capture. The geometry is highly detailed and enables viewing with binocular and motion parallax in VR, as well as applying 3D effects that interact with the
scene, e.g., through occlusions (right).
In this paper, we present a new algorithm that constructs 3D panoramas from sequences of color-and-depth photos produced from small-baseline stereo dual camera cell phones, such as recent iPhones. We take these sequences with a custom burst capture app while casually moving the phone around at a half-arm's distance. The depth reconstruction is essentially free since it is integrated into native phone OS APIs and highly optimized. Using depth maps from dual cameras makes our algorithm somewhat robust to scene motion, because the synchronous stereo capture enables reconstructing depth even for dynamic objects, though stitching them might still result in visible seams.

Our method is fast and processes approximately one input image per second, about the same time it takes to capture. We stress the importance of this point, since we found that as our system became faster, it made our own behavior with regard to capture more opportunistic: we were suddenly able to capture spontaneously on the go and even "iterate" on scenes to try different viewing angles.

The output of our algorithm is a detailed 3D panorama, i.e., a textured, multi-layered 3D mesh that can be rendered with standard graphics engines. Our 3D panoramas can be viewed with binocular and head-motion parallax in VR, or using a parallax viewer on normal mobile and web displays (see the accompanying video for examples). We can also generate interesting geometric effects using the 3D representation (Figure 1, right).

We faced two challenges when developing this algorithm, which led to our two main technical contributions:

(1) Due to the very small baseline of dual phone cameras, depth estimation is highly uncertain, and it requires strong edge-aware filtering to smooth the resulting noise. However, this leads to low-frequency errors in the depth maps that prevent simple alignment using global transformations. We present a novel optimization method that jointly aligns the depth maps by recovering their camera poses as well as solving for a spatially-varying adjustment field for the depth values. This method is able to bring even severely degraded input depth maps into very good alignment.

(2) Existing image fusion methods using discrete optimization are slow. We utilize a carefully designed data term and the high quality of our depth alignment to remove the need for label smoothness optimization, and replace it with independently optimizing every pixel label after filtering the data term in a depth-guided edge-aware manner. This achieves a speedup of more than two orders of magnitude.

After stitching, we convert the 3D panorama into a multi-layered representation by converting it to a mesh, tearing it at strong depth edges, and extending the back layer into occluded regions while hallucinating new colors and depths. When viewing the panorama away from the default viewpoint, this new content is revealed in disocclusions.

We demonstrate our algorithm on a wide variety of captured scenes, including indoor, outdoor, urban, and natural environments at day and night time. We also applied our algorithm to several datasets where the depth maps were estimated from single images using CNNs. These depth maps are strongly deformed from their ground truth and lack image-to-image coherence, but nevertheless our algorithm is able to produce surprisingly well-aligned and consistent stitched panoramas. A large number of these results is provided in the supplementary material and the accompanying video.

2 PREVIOUS WORK

360° Photo and Video: Using dedicated consumer hardware, such as the Ricoh Theta, it is now easy to capture full 360° × 180° panoramas. Although this is often marketed as capture for VR, it does not make use of the most interesting capabilities of that technology, and the lack of binocular and motion parallax limits the realism.

Stereo Panoramas: Binocular depth perception can be enabled by stitching appropriate pairs of left-eye and right-eye panoramic images. This representation is often called omnidirectional stereo [Anderson et al. 2016; Ishiguro et al. 1990; Peleg et al. 2001; Richardt et al. 2013]. Recent systems enable the capture of stereo panoramic videos using multiple cameras arranged in a ring [Anderson et al. 2016; Facebook 2016], or using two spinning wide-angle cameras [Konrad et al. 2017].

Omnidirectional stereo has a number of drawbacks. In particular, the rendered views are not in a linear perspective projection and exhibit distortions such as curved straight lines and incorrect stereo parallax away from the equator [Hedman et al. 2017]. Even more importantly, the representation does not support motion parallax, i.e., the rendered scene does not change as the user moves their head, which considerably limits depth perception and, hence, immersion.

Parallax-aware Panorama Stitching: Some panorama stitching methods compute warp deformations to compensate for parallax in the input images. While this reduces artifacts, it does not address the fundamental limitation that this representation does not support viewpoint changes at runtime.

Zhang and Liu [2014] stitch image pairs with large parallax by finding a locally consistent alignment sufficient for finding a good seam. Perazzi et al. [2015] extend this work to the multi-image case and compute optimal deformations in overlapping regions to compensate for parallax, extrapolating the deformation field smoothly in the remaining regions. Lin et al. [2016] handle two independently moving cameras whose relative poses change over time.

Recent work [Zhang and Liu 2015] demonstrates that these approaches also extend to omnidirectional stereo. However, this line of work has not yet produced explicit 3D geometry, making it unable to produce head-motion parallax in VR.

Panoramas with Depth: An alternative to generating a left-right pair of panoramic images is to augment a traditional stitched panoramic image with depth information. Im et al. [2016] construct a panorama-with-depth from small-baseline 360° video. However, the fidelity of the depth reconstruction does not seem sufficient for viewpoint changes (and this has not been demonstrated). Lee et al. [2016] use depth information to compute a spatially varying 3D projection surface to compensate for parallax when stitching images captured with a 360° rig. However, similar to the before-mentioned work, the surface is a low-resolution grid mesh. Zheng et al. [2007] create a layered depth panorama using a cylinder-sweep multi-view stereo algorithm. However, their algorithm creates discrete layers at fixed depths and cannot reconstruct sloped surfaces.
Fig. 2. Breakdown of the major algorithm stages and their outputs, which form the inputs to the next respective stage.
Multi-view Stereo: A long line of research in computer vision is concerned with producing depth maps or surface meshes from multiple overlapping images using multi-view stereo (MVS) algorithms [Seitz et al. 2006]. MVS algorithms are used in commercial photogrammetric tools for 3D reconstruction of VR scenes [Realities 2017; Valve 2016]. Huang et al. [2017] use MVS to obtain dense point clouds from video sequences captured with a single 360° camera.

MVS methods work best if the camera baseline is large, which is unfortunately not the case in the panorama capture scenario. In this case, it is difficult for the methods to deal with the triangulation uncertainty, which leads to artifacts such as noisy reconstructions and missing regions. These methods are usually also slow, with runtimes ranging from minutes to hours. Hedman et al. [2017] improve the quality of reconstructed 3D panoramas, but their algorithm requires several hours of processing.

Light Fields: The light field representation [Gortler et al. 1996; Levoy and Hanrahan 1996] can generate highly realistic views of a scene with motion parallax and view-dependent effects. Recent work addresses unstructured acquisition with a hand-held camera [Davis et al. 2012]. The main disadvantage of this representation is that it requires a very large number of input views that need to be retained to sample from at runtime, as well as a custom rendering algorithm.

Bundle Adjustment with Depth: The popularization of consumer depth cameras has inspired research on aligning and fusing multiple depth maps into globally consistent geometry [Izadi et al. 2011]. Dai et al. [2017b] present a system which integrates new depth maps in real time, using bundle adjustment on 3D feature point correspondences to continuously maintain and refine alignment.

There has also been work on non-rigid deformations to refine alignment with active depth sensors. Zhou and Koltun [2014] perform 3D camera calibration during scanning, correcting for non-linear distortion associated with the depth camera. Whelan et al. [2015] show how to correct for drift by non-rigidly deforming the 3D geometry which has already been scanned.

In general, methods designed for depth cameras cannot directly be applied to narrow-baseline stereo data, which is of much lower quality. Unlike the depth maps used in this paper, depth sensors provide absolute scale, maintain frame-to-frame consistency, and can often be rigidly aligned to a high degree of accuracy.

3 OVERVIEW

The goal of our work is to enable easy and rapid capture of 3D panoramas using readily available consumer hardware.

3.1 Dual Lens Depth Capture

Dual lens cameras capture synchronized small-baseline stereo image pairs for the purpose of reconstructing an aligned color-and-depth image using depth-from-stereo algorithms [Szeliski 2010]. The depth reconstruction is typically implemented in system-level APIs and highly optimized, so from a programmer's and a user's perspective, the phone effectively features a "depth camera". Several recent flagship phones feature dual cameras, including the iPhone 7 Plus, 8 Plus, X, and Samsung Note 8. Such devices are already in the hands of tens of millions of consumers.

The small baseline is both a blessing and a curse: the limited search range enables quickly establishing dense image correspondence, but it also makes triangulation less reliable and causes large uncertainty in the estimated depth. For this reason, most algorithms employ aggressive edge-aware filtering [Barron et al. 2015; He et al. 2010], which yields smoother depth maps with color-aligned edges, but large low-frequency error in the absolute depth values. In addition, the dual lenses on current-generation phones constantly move and rotate during capture due to optical image stabilization, changes in focus, and even gravity¹. These effects introduce a non-linear and spatially-varying transformation of disparity that adds to the low-frequency error from noise filtering mentioned above.

¹ See http://developer.apple.com/videos/play/wwdc2017/507 at 17:20-20:50, slides 81-89.
Fig. 3. Estimating depth maps using various algorithms from a stereo image pair. Note relative scale difference and low-frequency deformations between different maps. (a) Small-baseline stereo depth computed by the native iOS algorithm on an iPhone 7+. (b) Single-image CNN depth map (Monodepth [Godard et al. 2017]). (c) Depth from accidental motion (DfUSMC [Ha et al. 2016]; we actually used a short video clip to produce this result).
In Figure 3, you can see depth maps reconstructed using different stereo algorithms on this kind of data. As revealed in the figure, there is a significant amount of low-frequency error in the depth maps. Since our focus is not stereo matching, we use the depth maps from the native iPhone 7 Plus stereo algorithm for all of our experiments.

An important detail to note is that many small-baseline stereo methods (including the one running on the iPhone) do not estimate absolute depth, but instead produce normalized depth maps. So, aligning such depth maps involves estimating scale factors for each of them, or, in fact, sometimes even more complicated transformations.

3.2 Algorithm Overview

Our 3D panorama construction algorithm proceeds in four stages:

Capture (Section 4.1, Figure 2a): The input to our algorithm is a sequence of aligned color-and-depth image pairs, which we capture from a single vantage point on a dual lens camera phone using a custom burst capture app.

Deformable Depth Alignment (Section 4.2, Figure 2b): Due to the small camera baseline and resulting triangulation uncertainty, the input depth maps are not very accurate, and it is not possible to align them well using global transformations. We resolve this problem using a novel optimization method that jointly estimates the camera poses as well as spatially-varying adjustment maps that are applied to deform the depth maps and bring them into good alignment.

Stitching (Section 4.3, Figure 2c): Next, we stitch the aligned color-and-depth photos into a panoramic mosaic. Usually this is formulated as a labeling problem and solved using discrete optimization methods. However, optimizing label smoothness, e.g., using MRF solvers, is very slow, even when the problem is downscaled. We utilize a carefully designed data term and the high quality of our depth alignment to replace label smoothness optimization with independently optimizing every pixel after filtering the data term in a depth-guided edge-aware manner. This achieves visually similar results with more than an order of magnitude speedup.

Multi-layer Mesh Generation (Section 4.4, Figure 2d): In the last stage, we convert the panorama into a multi-layered and textured mesh that can be rendered on any device using standard graphics engines. We tear the mesh at strong depth edges and extend the backside into the occluded regions, hallucinating new color and depth values in occluded areas. Finally, we simplify the mesh and compute a texture atlas.

4 ALGORITHM

4.1 Capture and Preprocessing

We perform all of our captures with an iPhone 7 Plus using a custom-built rudimentary capture app. During a scene capture session, it automatically triggers the capture of color-and-depth photos (using the native iOS stereo algorithm) at 1-second intervals.

The capture motion resembles how people capture panoramas today: the camera is pointed outwards while holding the device at half-arm's length and scanning the scene in an arbitrary up-, down-, or sideways motion. Unfortunately, the field-of-view of the iPhone 7 Plus camera is fairly narrow in depth capture mode (37° vertical), so we need to capture more images than we would with other cameras. A typical scene contains between 20 and 200 images.

The captured color and depth images have 720 × 1280 and 432 × 768 pixels resolution, respectively. We enable the automatic exposure mode to capture more dynamic range of the scene. Along with the color and depth maps, we also record the device orientation estimate provided by the IMU.

Feature extraction and matching: As input for the following alignment algorithm, we compute pairwise feature matches using standard methods. We detect Shi-Tomasi corner features [Shi and Tomasi 1994] in the images, tuned to be separated by at least 1% of the image diagonal. We then compute DAISY descriptors [Tola et al. 2010] at the feature points. We use the IMU orientation estimate to choose overlapping image pairs, and then compute matches using the FLANN library [Muja and Lowe 2009], taking care to discard outliers with a ratio test (threshold = 0.85) and simple geometric filtering, which discards matches whose offset vector deviates too much from the median offset vector (by more than 2% of the image diagonal). All of this functionality is implemented using OpenCV; a minimal sketch of this step is shown below.

4.2 Deformable Depth Alignment

Our first goal is to align the depth maps. Since the images were taken from different viewpoints, we cannot deal with this in 2D image space due to parallax. We need to recover the extrinsic camera poses (orientation and location), so that when we project out the depth maps they align in 3D.
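To make the preprocessing concrete, the following Python/OpenCV sketch chains the building blocks named in Section 4.1 (Shi-Tomasi corners, DAISY descriptors, FLANN matching, ratio test, median-offset filtering). It is an illustration under stated assumptions, not the authors' implementation: it requires the OpenCV contrib modules for DAISY, and all parameter values other than the thresholds quoted in the text (1% corner separation, ratio 0.85, 2% offset deviation) are placeholder choices.

```python
# Sketch of Section 4.1's feature extraction and matching (assumes
# opencv-contrib-python for cv2.xfeatures2d.DAISY; parameter values other
# than the quoted thresholds are illustrative).
import cv2
import numpy as np

def detect_and_describe(gray):
    """Shi-Tomasi corners separated by >= 1% of the image diagonal,
    described with DAISY."""
    diag = np.hypot(*gray.shape[:2])
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=2000,
                                      qualityLevel=0.01,
                                      minDistance=0.01 * diag)
    kps = [cv2.KeyPoint(float(x), float(y), 8.0)
           for x, y in corners.reshape(-1, 2)]
    kps, desc = cv2.xfeatures2d.DAISY_create().compute(gray, kps)
    return kps, desc

def match_pair(kps_a, desc_a, kps_b, desc_b, image_diag):
    """FLANN matching with a ratio test (0.85) and a simple geometric
    filter that discards matches whose offset deviates from the median
    offset by more than 2% of the image diagonal."""
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4),
                                  dict(checks=32))
    knn = flann.knnMatch(desc_a, desc_b, k=2)
    good = [m[0] for m in knn
            if len(m) == 2 and m[0].distance < 0.85 * m[1].distance]
    if not good:
        return []
    offsets = np.array([np.subtract(kps_b[m.trainIdx].pt,
                                    kps_a[m.queryIdx].pt) for m in good])
    median = np.median(offsets, axis=0)
    keep = np.linalg.norm(offsets - median, axis=1) < 0.02 * image_diag
    return [m for m, k in zip(good, keep) if k]
```

In the full system, the IMU orientation estimate is used to restrict which image pairs are matched; that pair-selection step is omitted from this sketch.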
(a) Our global affine alignment (Eq. 4) (b) Global alignment to SFM point cloud (c) Our deformable alignment (Eq. 8)
Fig. 4. Aligning depth maps with low-frequency errors. We show stitches (top) and the coefficient of variation (bottom; see text) for various methods. (a) Our algorithm with a global affine model (Eq. 4); many depth maps got pushed to infinity. (b) Aligning each depth map independently with Eq. 4 to a high-quality SFM reconstruction; the result is better, but there are many visible seams and floaters because the inaccurate depth maps cannot be fit with simple global transformations. (c) Our algorithm with the spatially-varying affine model yields excellent alignment.
4.2.1 Rigid alignment: We achieve this goal by minimizing the distance between reprojected feature point matches. Let $f_A^i$ be a feature point in image $A$ and $M = \{(f_A^i, f_B^i)\}$ the set of all matched pairs. We define a reprojection loss as follows:

$$E_\text{reprojection} = \sum_{(f_A^i, f_B^i) \in M} \rho\!\left( \big\| P_{A \to B}(f_A^i) - f_B^i \big\|_2^2 \right), \tag{1}$$

where $\rho(s) = \log(1 + s)$ is a robust loss function to reduce sensitivity to outlier matches, and $P_{A \to B}(f)$ is a function that projects the 2D point $f$ from image $A$ to image $B$:

$$P_{A \to B}(f) = n\Big( R_B^T \big( \underbrace{R_A \overbrace{\tilde{f}\, d_A(f)}^{\text{3D point in camera }A\text{'s coord.}} + \; t_A}_{\text{3D point in world space}} - \; t_B \big) \Big), \tag{2}$$

where $(R_A, t_A)$ and $(R_B, t_B)$ are the rotation matrices and translation vectors for images $A$ and $B$, respectively, $\tilde{f}$ is the homogeneous-augmented version of $f$, $d_A(f)$ is the value of image $A$'s depth map at location $f$, and $n([x, y, z]^T) = [\tfrac{x}{z}, \tfrac{y}{z}]^T$. Note that this formulation naturally handles the wrap-around in 360° panoramas.

Similar reprojection losses are common in geometric computer vision and have been used with great success in many recent reconstruction systems [Schönberger and Frahm 2016]. However, our formulation has a subtle but important difference: since we have depth maps, we do not need to optimize the 3D location of feature point correspondences. This significantly simplifies the system in several ways: (1) it drastically reduces the number of variables that need to be estimated, to just the camera poses, (2) we do not have to link feature point matches into long tracks, and (3) the depth maps help reduce uncertainty, making our system robust to small baselines and narrow triangulation angles.

Eq. 2 assumes that the camera intrinsics as well as lens deformation characteristics are known and fixed throughout the capture. If this is not the case, extra per-camera variables could be added to this equation to estimate these values during the optimization.

Minimizing Eq. 1 w.r.t. the camera poses is equivalent to optimizing a rigid alignment of the depth maps. However, since most small-baseline depth maps are normalized (including the ones produced by the iPhone), they cannot be aligned rigidly.

4.2.2 Global transformations: We resolve this problem by introducing extra variables that describe a global transformation of each depth map. Our first experiment was trying to estimate a scale factor $s_A$ for each depth map, i.e., by replacing $d_A(f)$ in Eq. 1 with

$$d_A^\text{scale}(f) = s_A \, d_A(f), \tag{3}$$

where $s_A$ is an extra optimization variable per image. However, this did not achieve good results, because, as we learned, many depth maps are normalized using unknown curves. We tried a variety of other classes of global transformations, and achieved the best results with an affine transformation in disparity space (i.e., $1/d$):

$$d_A^\text{affine}(f) = \big( s_A \, d_A^{-1}(f) + o_A \big)^{-1}, \tag{4}$$

where $s_A$ and $o_A$ are per-image scale and offset coefficients, respectively.

Figure 4a shows a typical result of minimizing Eq. 1 with the affine model. Many depth maps are incorrectly pushed to infinity, because the optimizer could not find a good way to align them otherwise. In the bottom row we visualize the coefficient of variation of depth samples per pixel, i.e., the ratio of the standard deviation to the mean. This is a scale-independent way of visualizing the amount of disagreement in the alignment. As a sanity check we also tried to independently align each depth map to a high-quality SFM reconstruction of the scene (computed with COLMAP [Schönberger and Frahm 2016]) that can be considered ground truth (Figure 4b). Even with this "best-possible" result for the model, the stitch is severely degraded by seams and floaters.
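The following NumPy sketch spells out the reprojection of Eq. (2) together with the robust loss of Eq. (1) and the affine disparity adjustment of Eq. (4). It is a hedged illustration rather than the paper's code: the explicit intrinsics matrix K (assumed known and fixed, as stated above), the function names, and the calling convention are our assumptions.

```python
# Illustrative NumPy version of Eqs. (1), (2), and (4); K and all names are
# assumptions made for this sketch.
import numpy as np

def rho(s):
    """Robust loss of Eq. (1): rho(s) = log(1 + s)."""
    return np.log1p(s)

def affine_depth(d, s_A, o_A):
    """Eq. (4): affine transformation in disparity (1/d) space."""
    return 1.0 / (s_A / d + o_A)

def project_A_to_B(f, depth_at_f, K, R_A, t_A, R_B, t_B):
    """Eq. (2): lift pixel f of image A with its depth to a world-space
    3D point, transform it into camera B, and perspective-divide."""
    f_tilde = np.linalg.inv(K) @ np.array([f[0], f[1], 1.0])  # viewing ray
    X_world = R_A @ (f_tilde * depth_at_f) + t_A              # point in world space
    X_B = R_B.T @ (X_world - t_B)                             # point in camera B
    x, y, z = K @ X_B
    return np.array([x / z, y / z])                           # n([x, y, z])

def match_cost(f_A, f_B, d_A_at_f, K, pose_A, pose_B, s_A, o_A):
    """One term of E_reprojection with the per-image affine model applied."""
    (R_A, t_A), (R_B, t_B) = pose_A, pose_B
    d = affine_depth(d_A_at_f, s_A, o_A)
    err = project_A_to_B(f_A, d, K, R_A, t_A, R_B, t_B) - np.asarray(f_B, float)
    return rho(err @ err)
```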
Fig. 5. Comparing stitching using MRF optimization (top row; runtime 3.25 minutes) vs. our algorithm (bottom row; runtime 0.5 seconds); columns show labels, color mosaic, and depth mosaic. While labels change more frequently in our solution, the color and depth mosaics are visually very similar to the MRF result.

Through our experimentation we found that it is ultimately not possible to bring this kind of depth maps into good alignment using simple global transformations, because of the low-frequency error that is characteristic for small-baseline stereo depth maps due to the triangulation uncertainty.

4.2.3 Deformable alignment: Our solution to this problem is to estimate spatially-varying adjustment fields that deform each depth map and can therefore bring them into much better alignment. We modify the affine model in Eq. 4 to replace the global scale and offset coefficients with regular grids of 5×5 values that are bilinearly interpolated across the image:

$$d_A^\text{deform}(f) = \big( s_A(f) \, d_A^{-1}(f) + o_A(f) \big)^{-1}, \tag{5}$$

where $s_A(f) = \sum_i w_i(f)\, \hat{s}_A^i$, $o_A(f) = \sum_i w_i(f)\, \hat{o}_A^i$, and $w_i(f)$ are bilinear interpolation weights at position $f$.

To encourage smoothness in the deformation field we add a cost for differences between neighboring grid values:

$$E_\text{smoothness} = \sum_A \sum_{(i,j) \in N} \Big( \big\| \hat{s}_A^i - \hat{s}_A^j \big\|_2^2 + \big\| \hat{o}_A^i - \hat{o}_A^j \big\|_2^2 \Big). \tag{6}$$

While $E_\text{reprojection}$ is agnostic to scale, $E_\text{smoothness}$ encourages setting the disparity scale functions $\hat{s}_A^i$ very small, which results in extremely large reconstructions. To prevent this, we add a regularization term that keeps the overall scale in the scene constant:

$$E_\text{scale} = \sum_A \sum_i \big( \hat{s}_A^i \big)^{-1}. \tag{7}$$

The combined problem that we solve is:

$$\underset{\{R_I,\, t_I,\, \hat{s}^i,\, \hat{o}^i\}}{\operatorname{argmin}} \; E_\text{reprojection} + \lambda_1 E_\text{smoothness} + \lambda_2 E_\text{scale}, \tag{8}$$

with the balancing coefficients $\lambda_1 = 10^6$ and $\lambda_2 = 10^{-4}$.

Figure 4c shows the improvement achieved by using the deformable model. The ground plane is now nearly perfectly smooth, there are no floaters, and thin structures such as the lamp post are resolved much better.

4.2.4 Optimization Details: Since Eq. 8 is a non-linear optimization problem, we require a good initialization of the variables. We initialize the camera rotations using the IMU orientations, and the locations by pushing them forward onto the unit sphere, i.e., $t_A = R_A \cdot [0, 0, 1]^T$. We found it helpful to initialize the deformation field to enlarge the depth maps, i.e., $\hat{s}_A^i = 0.1$, $\hat{o}_A^i = 0$, because in this way reprojected feature points are visible in their matched images.

We use the Ceres library [Agarwal et al. 2017] to solve this non-linear least-squares minimization problem using the Levenberg-Marquardt algorithm. We represent all rotations using the 3-dimensional axis-angle parameterization and use Rodrigues' formula [Ayache 1989] when they are applied to vectors. The optimization usually converges within 50 iterations, which takes about 10 seconds.

4.2.5 Discussion: Due to the non-rigid transformations in the optimization, the camera poses that we recover are not necessarily accurate anymore. We inspected the recovered poses visually and found that they qualitatively look similar to results obtained by SFM, but we have not performed a careful analysis to verify their degree of accuracy.

We also tried even richer models than Eq. 4. In particular, we tried 3D grids that represent bilateral-space adjustments with depth and luminance as range domains. However, while we found slight improvements in the results, we did not deem them significant enough to warrant the extra complexity.

We also experimented with using the 3D distance between matched feature points' projections into world space in our loss. However, this did not work well, since a trivial solution is to shrink the scene until it vanishes.

Eq. 8 is quite robust and does not depend strongly on the initialization. We tried other initializations, e.g., setting $t_A = [0, 0, 0]^T$, which worked fine as well. We have not encountered any scene in our experiments where the optimization got stuck in a poor local minimum.

4.3 Stitching

Now that we have 3D-aligned depth photos, our next goal is to stitch them into a seamless panoramic mosaic. This enables removing outliers in the depth maps and also makes rendering faster by removing redundant content.

First, we compute a center of projection for the panorama by tracing the camera front vectors backwards and finding the 3D point that minimizes the distance to all of them. Then, we render all the color and depth maps from this central viewpoint into equirectangular panoramas (see the supplementary document for details). The full panoramas are 8192 × 4096 pixels, though for each image, we only keep a crop to the tight bounding box of the pixels actually used.

As in previous work, we formulate the stitching as a discrete labeling problem, where we need to select for every pixel $p$ in the panorama a source image $\alpha_p$ from which to fetch color and depth.

4.3.1 Data Term: A "good" source $\alpha_p$ for the target pixel $p$ should satisfy a number of constraints, which we formulate as penalty terms.

Depth Consensus: If a source has the correct depth, it tends to be consistent with other views. Therefore, we count how many other views $n(p, \alpha_p)$ are at similar depth, i.e., their depth ratio is within $[0.9, 1.1]$, and define a depth consensus penalty, similar to earlier
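For concreteness, here is a NumPy/SciPy sketch of the spatially-varying adjustment of Eq. (5) and of residuals corresponding to the regularizers of Eqs. (6) and (7). The grid size (5 × 5), the weights λ₁ = 10⁶ and λ₂ = 10⁻⁴, and the initialization ŝ = 0.1, ô = 0 follow the text; packing the variables and handing them to scipy.optimize.least_squares is our stand-in for the Ceres/Levenberg-Marquardt setup of Section 4.2.4, not the authors' solver.

```python
# Sketch of Eqs. (5)-(7); SciPy's least_squares is used as a stand-in for the
# Ceres setup described in Section 4.2.4.
import numpy as np
from scipy.optimize import least_squares

GRID = 5  # 5x5 grid of scale/offset values per image

def bilinear(f, grid, width, height):
    """Bilinearly interpolate a GRID x GRID grid of values at pixel f = (x, y)."""
    gx = f[0] / (width - 1) * (GRID - 1)
    gy = f[1] / (height - 1) * (GRID - 1)
    x0 = min(int(gx), GRID - 2)
    y0 = min(int(gy), GRID - 2)
    ax, ay = gx - x0, gy - y0
    return ((1 - ax) * (1 - ay) * grid[y0, x0] + ax * (1 - ay) * grid[y0, x0 + 1] +
            (1 - ax) * ay * grid[y0 + 1, x0] + ax * ay * grid[y0 + 1, x0 + 1])

def deformed_depth(d, f, s_grid, o_grid, width, height):
    """Eq. (5): disparity-space affine model with spatially-varying
    (bilinearly interpolated) scale and offset."""
    s = bilinear(f, s_grid, width, height)
    o = bilinear(f, o_grid, width, height)
    return 1.0 / (s / d + o)

def smoothness_residuals(s_grid, o_grid, lam1=1e6):
    """Eq. (6): squared differences between neighboring grid values."""
    res = []
    for g in (s_grid, o_grid):
        res += [np.diff(g, axis=0).ravel(), np.diff(g, axis=1).ravel()]
    return np.sqrt(lam1) * np.concatenate(res)

def scale_residuals(s_grids, lam2=1e-4):
    """Eq. (7): squared residuals sum to lam2 * sum_i 1 / s_hat_i,
    penalizing disparity scales that collapse towards zero."""
    s = np.concatenate([g.ravel() for g in s_grids])
    return np.sqrt(lam2 / s)

# Conceptually, the full objective of Eq. (8) stacks the per-match
# reprojection residuals (see the previous sketch) with the two regularizers:
# x0 packs axis-angle rotations, translations t_A = R_A @ [0, 0, 1], and the
# per-image grids initialized to s_hat = 0.1, o_hat = 0;
# least_squares(all_residuals, x0, method="lm") then refines them jointly.
```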
Table 1. Breakdown of the algorithm performance per stage for the scene from Figure 1.

Stage                              Desktop timing   Laptop timing
Feature extraction and matching    6.6 s            7.0 s
Deformable alignment               9.8 s            10.2 s
Warping                            6.6 s            5.1 s
Stitching                          2.7 s            2.6 s
Color harmonization                1.6 s            1.8 s
Multi-layer computation            1.0 s            1.1 s
Mesh simplification                3.1 s            3.5 s
Texture atlas generation           3.3 s            3.7 s
Total                              34.7 s           35.0 s

Next, we hallucinate new content in occluded parts by iteratively growing the mesh at its boundaries. In every iteration, each vertex that is missing a connection in one of the 4 cardinal directions grows in this direction and generates a new vertex at the same depth as itself. We connect new vertices with all neighboring boundary vertices whose disparity is within the threshold above. After running this procedure for a fixed number of 30 iterations, we prune the newly generated vertices by removing all but the furthest generated vertex at every pixel location. If the remaining generated vertex is in front of the original stitch, we remove it as well. We synthesize colors for the newly generated mesh parts using diffuse inpainting. The resulting mesh smoothly extends the back side around depth edges into the occluded regions. Instead of stretched triangles or holes, viewpoint changes now reveal smoothly inpainted color and depth content (Figure 6c); a heavily simplified sketch of this growing procedure is given after Section 5.4.

In a supplementary document we provide some implementation details about simplification and texture atlas generation for the final mesh.

5 RESULTS AND EVALUATION

We have captured and processed dozens of scenes with an iPhone 7 Plus. 25 of these are included in this submission; see Figure 7, as well as the accompanying video and the supplementary material. These scenes span a wide range of different environments (indoor and outdoor, urban and natural) and capture conditions (day and night, bright and overcast). The scenes we captured range from about 20 to 200 source images, and their horizontal field-of-view ranges from 60° to 360°.

5.1 Performance

All scenes were processed using a PC with a 3.4 GHz 6-core Intel Xeon E5-2643 CPU, an NVIDIA Titan X GPU, and 64 GB of memory. Our implementation mostly consists of unoptimized CPU code. The GPU is currently only (insignificantly) used in the warping stage. We also ran our system on a slower 14″ Razer Blade laptop with a 3.3 GHz 4-core Intel i7-7700HQ CPU and an NVIDIA GTX 1060 GPU. Interestingly, the warping stage performs faster on the laptop, most likely because CPU computation and CPU/GPU transfers dominate the runtime. Table 1 breaks out the timings for the various algorithm stages on both of these systems for an example scene. While our algorithm already runs fast, we note that there are significant further optimizations on the table. Since the deformable alignment has proven to be quite robust, we could replace the feature point detector and descriptor with faster-to-compute variants, e.g., FAST features and BRIEF descriptors. The alignment optimization could be sped up by implementing a custom solver, tailored to this particular problem. Our current warping algorithm is implemented in a wasteful way; properly rewriting this GPU code would make this operation practically free. The stitching algorithm could be reimplemented on the GPU.

5.2 Alignment

Figure 8 shows a quantitative evaluation of our alignment algorithm. We processed the 25 scenes in Figure 7 using different variants of the algorithm and evaluate the average reprojection error (Eq. 1). We also evaluate the effect of varying the grid size of our deformable model, which shows that the reprojection error remains flat across a wide range of settings around our choice of 5 × 5.

5.3 Single-image CNN Depth Maps

We experimented with other depth map sources. In particular, we were interested in using our algorithm with depth maps estimated from single images using CNNs, since this would enable using our system with regular single-lens cameras (even though the depth map quality produced by current algorithms is low). In Figure 9 we used the Monodepth algorithm [Godard et al. 2017] to generate single-image depth maps for three of our scenes. The original Monodepth algorithm works well for the first street scene, since it was trained on similar images. For the remaining two scenes we retrained the algorithm with explicit depth supervision using the RGBD scans in the ScanNet dataset [Dai et al. 2017a]. Even though the input depth maps are considerably degraded, our method was able to reconstruct surprisingly good results. To better appreciate the result quality, see the video comparisons in the supplementary material.

5.4 SFM and MVS Comparison

We were interested in how standard SFM algorithms would perform on our datasets. When processing our 25 datasets with COLMAP's SFM algorithm, 7 scenes failed entirely, in 7 more not all cameras were registered, and there was one where all cameras registered but the reconstruction was an obvious catastrophic failure. This high failure rate underscores the difficulty of working with small-baseline imagery.

We also compare against end-to-end MVS systems, in particular the commercial Capturing Reality system² and Casual 3D [Hedman et al. 2017]. Capturing Reality's reconstruction speed is impressive for a full MVS algorithm, but due to the small baseline it is only able to reconstruct the foreground. Casual 3D produces results of comparable quality to ours, but at much slower speed. Figure 10 shows an example scene, and the supplementary material contains video comparisons of more scenes.

² https://www.capturingreality.com
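The following heavily simplified, grid-based NumPy sketch illustrates the back-layer growing procedure described just before Section 5: missing pixels repeatedly adopt the furthest depth offered by a 4-neighbor, and after a fixed number of iterations any generated content that ends up in front of the original stitch is discarded. The real system operates on mesh vertices with a disparity threshold and diffuse color inpainting; this is an illustration only, with assumed function and variable names.

```python
# Heavily simplified, per-pixel illustration of the back-layer growing step;
# the actual system grows mesh vertices and uses a disparity threshold.
import numpy as np

def grow_back_layer(front_depth, seed_mask, iterations=30):
    """front_depth: stitched depth per pixel (larger values = further away).
    seed_mask:   pixels on the far side of a depth edge whose depth is to be
                 extended into the occluded region."""
    back = np.where(seed_mask, front_depth, np.nan)
    for _ in range(iterations):
        candidates = [back]
        for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            # note: np.roll wraps around the image border; good enough here
            candidates.append(np.roll(back, (dy, dx), axis=(0, 1)))
        furthest = np.fmax.reduce(candidates)            # furthest available depth
        back = np.where(np.isnan(back), furthest, back)  # only fill missing pixels
    # prune generated content that ended up in front of the original stitch
    generated = ~seed_mask & ~np.isnan(back)
    back[generated & (back < front_depth)] = np.nan
    return back
```

Colors for the newly generated pixels would then be filled by diffuse inpainting, as described in the text.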
Fig. 7. Datasets that we show in the paper, video, and supplementary material. The two bottom rows show 360° panoramas. [Scenes with recoverable labels: Alley (72 images), Angkor Wat Miniature (30), Bushes (26), Footpath (34), Golden Mount (28), River Houses (32), Southbank (78), Embankment (53), Tottenham (91), Ivy (60), Turtle (65), Van Gogh Walk (31), Wood Shed (153), Forest (101), Bloomsbury (147).]
[Figure 8 plot: average reprojection error for the different alignment variants; see Section 5.2.]

maps produced by these algorithms, this would not lead to good results, because the source depth maps are inconsistent and show the scene from different vantage points. Our algorithm resolves these inconsistencies and produces a coherent color-and-depth panorama (Figure 10c and detail crop in 11e).
Fig. 9. Applying our algorithm to single-image CNN depth maps. Left column: example input and its CNN depth. Three middle columns: global affine (Eq. 4), independent affine to SFM, and deformable (Eq. 8) alignment of the CNN depth. Right column: deformable (Eq. 8) result for dual camera depth maps, for comparison. See the supplementary material for videos of the results.
amount of overlap. At the same time the baseline would increase, making the reconstruction problem easier.

Artifacts: Our results exhibit similar artifacts as other 3D reconstruction systems; in particular, floating pieces of geometry, incorrect depth in untextured regions, and artifacts on dynamic objects. Compared to existing systems these problems are reduced (Figure 10), but they are still present. To examine these artifacts carefully we suggest watching the video comparison in the supplementary material.

Multi-layer processing: The hallucination of occluded pixels is rudimentary. In particular, the simple back-layer extension algorithm leaves room for improvement. We plan to improve the inpainting of colors using texture synthesis.

Parameters: Like many end-to-end reconstruction algorithms, we depend on many parameters. We proceeded one stage at a time, examining intermediate results, when tuning the parameters. All results shown anywhere in this submission or accompanying materials use the same parameter settings, provided in the paper.

6 CONCLUSIONS

In this paper we have presented a fast end-to-end algorithm for generating 3D panoramas from a sequence of color-and-depth images.
ACM Transactions on Graphics, Vol. 37, No. 4, Article 101. Publication date: August 2018.
Instant 3D Photography • 101:11
(a) Capturing Reality (2.1 minutes) (b) Casual 3D (1.8 hours) (c) Our result (34.7 seconds)
Fig. 10. Comparison against MVS systems: (a) Capturing Reality processes fast, but the reconstruction breaks down just a few meters away from the vantage
point due to triangulation uncertainty. (b) Casual 3D produced a high quality result, but it is slow. (c) Our result has even better quality, and was computed
over 200× faster.
(a) APAP [Zaragoza et al. 2013] (b) Microsoft ICE (c) APAP detail (d) ICE detail (e) Our detail
Fig. 11. Comparison against monocular panoramas stitched with (a) As-Projective-As-Possible warping and (b) Microsoft ICE. Note that these algorithms do not produce a stitched depth map. (c)-(d) show detail crops from the before-mentioned algorithms. (e) shows the corresponding detail crop from our result in Figure 10c.
Even though these depth maps contain a considerable amount of low-frequency error, our novel deformable alignment optimization is able to align them precisely. This opens up the possibility to replace discrete smoothness optimization in our stitcher with depth-guided edge-aware filtering of the data term and independently optimizing every pixel, achieving a two-orders-of-magnitude speedup.

We are excited about the many avenues for further improvement and research that this work opens up. Considering the performance discussion in Section 5.1, we believe a near-interactive implementation directly on the phone is within reach.

We have already seen how the availability of a fast capture and reconstruction system has changed our own behavior with respect to 3D scene capture. The way we capture scenes has become more opportunistic and impulsive. Almost all of the provided scenes have been captured spontaneously without planning, e.g., while traveling.

We are particularly excited about the promising first results using single-image CNN depth maps. The quality of these depth maps is still quite low. Yet, our alignment algorithm was able to conflate them, and stitching them using the consensus data term reduced artifacts further. Improving these results is an interesting direction for further research and holds the promise of bringing instant 3D photography to billions of regular cell phones with monocular cameras.

ACKNOWLEDGEMENTS

The authors would like to thank Clément Godard for generating the depth maps for the CNN evaluation and Suhib Alsisan for developing the capture app.

REFERENCES

Sameer Agarwal, Keir Mierle, and others. 2017. Ceres Solver. http://ceres-solver.org.
Robert Anderson, David Gallup, Jonathan T. Barron, Janne Kontkanen, Noah Snavely, Carlos Hernandez Esteban, Sameer Agarwal, and Steven M. Seitz. 2016. Jump: Virtual Reality Video. ACM Transactions on Graphics 35, 6 (2016).
Nicholas Ayache. 1989. Vision Stéréoscopique et Perception Multisensorielle: Application à la robotique mobile. Inter-Editions (MASSON). https://hal.inria.fr/inria-00615192
Jonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. 2015. Fast Bilateral-Space Stereo for Synthetic Defocus. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 4466–4474.
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017a. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE (2017).
Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. 2017b. BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Re-integration. ACM Transactions on Graphics (TOG) 36, 3 (2017), Article no. 24.
Abe Davis, Marc Levoy, and Fredo Durand. 2012. Unstructured Light Fields. Computer Graphics Forum (Proc. EUROGRAPHICS 2012) 31, 2pt1 (2012), 305–314.
Facebook. 2016. Facebook Surround 360. https://facebook360.fb.com/facebook-surround-360/. (2016). Accessed: 2016-12-26.
Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. 2017. Unsupervised Monocular Depth Estimation with Left-Right Consistency. CVPR (2017).
Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (1996), 43–54.
Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, and In So Kweon. 2016. High-quality Depth from Uncalibrated Small Motion Clip. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
Kaiming He, Jian Sun, and Xiaoou Tang. 2010. Guided Image Filtering. Proceedings of the 11th European Conference on Computer Vision (ECCV) (2010), 1–14.
Peter Hedman, Suhib Alsisan, Richard Szeliski, and Johannes Kopf. 2017. Casual 3D Photography. ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2017) 36, 6 (2017), Article no. 234.
Asmaa Hosni, Christoph Rhemann, Michael Bleyer, Carsten Rother, and Margrit Gelautz. 2013. Fast Cost-Volume Filtering for Visual Correspondence and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2 (2013), 504–511.
Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 2017. 6-DOF VR Videos with a Single 360-Camera. IEEE VR 2017 (2017).
Sunghoon Im, Hyowon Ha, François Rameau, Hae-Gon Jeon, Gyeongmin Choe, and In So Kweon. 2016. All-Around Depth from Small Motion with a Spherical Panoramic Camera. European Conference on Computer Vision (ECCV '16) (2016), 156–172.
Hiroshi Ishiguro, Masashi Yamamoto, and Saburo Tsuji. 1990. Omni-directional stereo for making global map. Third International Conference on Computer Vision (1990), 540–547.
Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera. Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (2011).