
Instant 3D Photography

PETER HEDMAN, University College London*


JOHANNES KOPF, Facebook

Dual camera phone | Input burst of 34 color-and-depth photos, captured in 34.0 seconds | Our 3D panorama (showing color, depth, and a 3D effect), generated in 34.7 seconds.
Fig. 1. Our work enables practical and casual 3D capture with regular dual camera cell phones. Left: A burst of input color-and-depth image pairs that we captured with a dual camera cell phone at a rate of one image per second. Right: 3D panorama generated with our algorithm in about the same time it took to capture. The geometry is highly detailed and enables viewing with binocular and motion parallax in VR, as well as applying 3D effects that interact with the scene, e.g., through occlusions (right).

We present an algorithm for constructing 3D panoramas from a sequence of aligned color-and-depth image pairs. Such sequences can be conveniently captured using dual lens cell phone cameras that reconstruct depth maps from synchronized stereo image capture. Due to the small baseline and resulting triangulation error the depth maps are considerably degraded and contain low-frequency error, which prevents alignment using simple global transformations. We propose a novel optimization that jointly estimates the camera poses as well as spatially-varying adjustment maps that are applied to deform the depth maps and bring them into good alignment. When fusing the aligned images into a seamless mosaic we utilize a carefully designed data term and the high quality of our depth alignment to achieve two orders of magnitude speedup w.r.t. previous solutions that rely on discrete optimization by removing the need for label smoothness optimization. Our algorithm processes about one input image per second, resulting in an end-to-end runtime of about one minute for mid-sized panoramas. The final 3D panoramas are highly detailed and can be viewed with binocular and head-motion parallax in VR.

CCS Concepts: • Computing methodologies → Image-based rendering; Reconstruction; Computational photography;

Additional Key Words and Phrases: 3D Reconstruction, Image Stitching, Image-based Rendering, Virtual Reality

* This work was done while Peter was working as a contractor for Facebook.

Authors' addresses: Peter Hedman, University College London*, [email protected]; Johannes Kopf, Facebook, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
0730-0301/2018/8-ART101 $15.00
https://doi.org/10.1145/3197517.3201384

1 INTRODUCTION
Virtual reality (VR) is a fascinating emerging technology that creates lifelike experiences in immersive virtual environments, with high-end headsets now widely available. Most content that is consumed in VR today is synthetic and needs to be created by professional artists. There is no practical way for consumers to capture and share their own real-life environments in a form that makes full use of the VR technology.

That is perhaps not surprising when considering the formidability of this problem: we are looking for a method that does not require expensive hardware and is easy to use even for novice users. Yet, it should create high-quality and truly immersive content, i.e., a 3D representation that supports binocular vision and head-motion parallax. Finally, consumers demand very fast processing times, on the order of seconds at most.

Panoramic images and video can be easily captured now with consumer 360° cameras. While the surrounding imagery provides some immersion, the realism is limited due to lack of depth and parallax. Stereo panoramas [Peleg and Ben-Ezra 1999] provide binocular depth cues by delivering different images to the left and right eye, but these images are static and do not provide motion parallax when the user turns or moves their head.

Full immersion can only be achieved using a 3D representation, such as the 3D models generated by multi-view stereo methods. However, when applied to typical panorama datasets, captured "inside-out" from a single vantage point, these methods have to deal with a very small baseline, which causes noisy results and holes in the reconstruction. The methods are also very sensitive to even slight scene motion, such as wind-induced motion in trees. The Casual 3D Photography system [Hedman et al. 2017] achieves some improvement in reconstruction quality, but it is slow with a runtime of several hours per scene.


In this paper, we present a new algorithm that constructs 3D panoramas from sequences of color-and-depth photos produced from small-baseline stereo dual camera cell phones, such as recent iPhones. We take these sequences with a custom burst capture app while casually moving the phone around at a half-arm's distance. The depth reconstruction is essentially free since it is integrated into native phone OS APIs and highly optimized. Using depth maps from dual cameras makes our algorithm somewhat robust to scene motion, because the synchronous stereo capture enables reconstructing depth even for dynamic objects, though stitching them might still result in visible seams.

Our method is fast and processes approximately one input image per second, about the same time it takes to capture. We stress the importance of this point, since we found that as our system became faster, it made our own behavior with regards to capture more opportunistic: we were suddenly able to capture spontaneously on the go and even "iterate" on scenes to try different viewing angles.

The output of our algorithm is a detailed 3D panorama, i.e., a textured, multi-layered 3D mesh that can be rendered with standard graphics engines. Our 3D panoramas can be viewed with binocular and head-motion parallax in VR, or using a parallax viewer on normal mobile and web displays (see the accompanying video for examples). We can also generate interesting geometric effects using the 3D representation (Figure 1, right).

We faced two challenges when developing this algorithm, which lead to our two main technical contributions:
(1) Due to the very small baseline of dual phone cameras, depth estimation is highly uncertain, and it requires strong edge-aware filtering to smooth the resulting noise. However, this leads to low-frequency errors in the depth maps that prevent simple alignment using global transformations. We present a novel optimization method that jointly aligns the depth maps by recovering their camera poses as well as solving for a spatially-varying adjustment field for the depth values. This method is able to bring even severely degraded input depth maps into very good alignment.
(2) Existing image fusion methods using discrete optimization are slow. We utilize a carefully designed data term and the high quality of our depth alignment to remove the need for label smoothness optimization, and replace it with independently optimizing every pixel label after filtering the data term in a depth-guided edge-aware manner. This achieves a speedup of more than two orders of magnitude.

After stitching, we convert the 3D panorama into a multi-layered representation by converting it to a mesh, tearing it at strong depth edges, and extending the back-layer into occluded regions while hallucinating new colors and depths. When viewing the panorama away from the default viewpoint this new content is revealed in disocclusions.

We demonstrate our algorithm on a wide variety of captured scenes, including indoor, outdoor, urban, and natural environments at day and night time. We also applied our algorithm to several datasets where the depth maps were estimated from single images using CNNs. These depth maps are strongly deformed from their ground truth and lack image-to-image coherence, but nevertheless our algorithm is able to produce surprisingly well-aligned and consistent stitched panoramas. A large number of these results is provided in the supplementary material and accompanying video.

2 PREVIOUS WORK
360° Photo and Video: Using dedicated consumer hardware, such as the Ricoh Theta, it is now easy to capture full 360° × 180° panoramas. Although this is often marketed as capture for VR, it does not make use of the most interesting capabilities of that technology, and the lack of binocular and motion parallax limits the realism.

Stereo Panoramas: Binocular depth perception can be enabled by stitching appropriate pairs of left-eye and right-eye panoramic images. This representation is often called omnidirectional stereo [Anderson et al. 2016; Ishiguro et al. 1990; Peleg et al. 2001; Richardt et al. 2013]. Recent systems enable the capture of stereo panoramic videos using multiple cameras arranged in a ring [Anderson et al. 2016; Facebook 2016], or using two spinning wide-angle cameras [Konrad et al. 2017].

Omnidirectional stereo has a number of drawbacks. In particular, the rendered views are not in a linear perspective projection and exhibit distortions such as curved straight lines and incorrect stereo parallax away from the equator [Hedman et al. 2017]. Even more importantly, the representation does not support motion parallax, i.e., the rendered scene does not change as the user moves their head, which considerably limits depth perception, and, hence, immersion.

Parallax-aware Panorama Stitching: Some panorama stitching methods compute warp deformations to compensate for parallax in the input images. While this reduces artifacts, it does not address the fundamental limitation that this representation does not support viewpoint changes at runtime.

Zhang and Liu [2014] stitch image pairs with large parallax by finding a locally consistent alignment sufficient for finding a good seam. Perazzi et al. [2015] extend this work to the multi-image case and compute optimal deformations in overlapping regions to compensate for parallax and extrapolate the deformation field smoothly in the remaining regions. Lin et al. [2016] handle two independently moving cameras whose relative poses change over time.

Recent work [Zhang and Liu 2015] demonstrates that these approaches also extend to omnidirectional stereo. However, this line of work has not yet produced explicit 3D geometry, making them unable to produce head-motion parallax in VR.

Panoramas with Depth: An alternative to generating a left-right pair of panoramic images is to augment a traditional stitched panoramic image with depth information. Im et al. [2016] construct a panorama-with-depth from small-baseline 360° video. However, the fidelity of the depth reconstruction does not seem sufficient for viewpoint changes (and this has not been demonstrated). Lee et al. [2016] use depth information to compute a spatially varying 3D projection surface to compensate for parallax when stitching images captured with a 360° rig. However, similar to the before-mentioned work, the surface is a low-resolution grid mesh. Zheng et al. [2007] create a layered depth panorama using a cylinder-sweep multi-view stereo algorithm. However, their algorithm creates discrete layers at fixed depths and cannot reconstruct sloped surfaces.


Stage: Capture and pre-processing (Section 4.1) → Deformable alignment (Section 4.2) → Stitching (Section 4.3) → Multi-layer processing (Section 4.4)
Stage output: Color-and-depth images, IMU orientations, feature point matches → Camera poses, depth adjustment fields → Color-and-depth panorama → Triangle mesh, texture atlas

Fig. 2. Breakdown of the major algorithm stages and their outputs, which form the inputs to the next respective stage.

Multi-view Stereo: A long line of research in computer vision is concerned with producing depth maps or surface meshes from multiple overlapping images using multi-view stereo (MVS) algorithms [Seitz et al. 2006]. MVS algorithms are used in commercial photogrammetric tools for 3D reconstruction of VR scenes [Realities 2017; Valve 2016]. Huang et al. [2017] use MVS to obtain dense point clouds from video sequences captured with a single 360° camera.

MVS methods work best if the camera baseline is large, which is unfortunately not the case in the panorama capture scenario. In this case, it is difficult for the methods to deal with the triangulation uncertainty, which leads to artifacts, such as noisy reconstructions and missing regions. These methods are usually also slow, with runtimes ranging from minutes to hours. Hedman et al. [2017] improve the quality of reconstructed 3D panoramas, but their algorithm requires several hours of processing.

Light Fields: The light field representation [Gortler et al. 1996; Levoy and Hanrahan 1996] can generate highly realistic views of a scene with motion parallax and view-dependent effects. Recent work addresses unstructured acquisition with a hand-held camera [Davis et al. 2012]. The main disadvantage of this representation is that it requires a very large number of input views that need to be retained to sample from at runtime, and a custom rendering algorithm.

Bundle Adjustment with Depth: The popularisation of consumer depth cameras has inspired research on aligning and fusing multiple depth maps into globally consistent geometry [Izadi et al. 2011]. Dai et al. [2017b] present a system which integrates new depth maps in real-time, using bundle adjustment on 3D feature point correspondences to continuously maintain and refine alignment.

There has also been work on non-rigid deformations to refine alignment with active depth sensors. Zhou and Koltun [2014] perform 3D camera calibration during scanning, correcting for non-linear distortion associated with the depth camera. Whelan et al. [2015] show how to correct for drift by non-rigidly deforming the 3D geometry which has already been scanned.

In general, methods designed for depth cameras cannot directly be applied to narrow baseline stereo data, which is of much lower quality. Unlike the depth maps used in this paper, depth sensors provide absolute scale, maintain frame-to-frame consistency, and can often be rigidly aligned to a high degree of accuracy.

3 OVERVIEW
The goal of our work is to enable easy and rapid capture of 3D panoramas using readily available consumer hardware.

3.1 Dual Lens Depth Capture
Dual lens cameras capture synchronized small-baseline stereo image pairs for the purpose of reconstructing an aligned color-and-depth image using depth-from-stereo algorithms [Szeliski 2010]. The depth reconstruction is typically implemented in system-level APIs and highly optimized, so from a programmer's and a user's perspective, the phone effectively features a "depth camera". Several recent flagship phones feature dual cameras, including the iPhone 7 Plus, 8 Plus, X, and Samsung Note 8. Such devices are already in the hands of tens of millions of consumers.

The small baseline is both a blessing and a curse: the limited search range enables quickly establishing dense image correspondence but also makes triangulation less reliable and causes large uncertainty in the estimated depth. For this reason, most algorithms employ aggressive edge-aware filtering [Barron et al. 2015; He et al. 2010], which yields smoother depth maps with color-aligned edges, but large low-frequency error in the absolute depth values. In addition, the dual lenses on current-generation phones constantly move and rotate during capture due to optical image stabilization, changes in focus, and even gravity¹. These effects introduce a non-linear and spatially-varying transformation of disparity that adds to the low-frequency error from the noise filtering mentioned above.

¹ See https://developer.apple.com/videos/play/wwdc2017/507 at 17:20-20:50, slides 81-89.


Stereo image pair (a) iPhone 7+ depth map (b) Monodepth [Godard et al. 2017] (c) DfUSMC depth map [Ha et al. 2016]
Fig. 3. Estimating depth maps using various algorithms. Note the relative scale differences and low-frequency deformations between the different maps. (a) Small baseline stereo depth computed by the native iOS algorithm on an iPhone 7+. (b) Single-image CNN depth map [Godard et al. 2017]. (c) Depth from accidental motion result [Ha et al. 2016] (we actually used a short video clip to produce this result).

In Figure 3, you can see depth maps reconstructed using different stereo algorithms on this kind of data. As revealed in the figure, there is a significant amount of low-frequency error in the depth maps. Since our focus is not stereo matching, we use the depth maps from the native iPhone 7 Plus stereo algorithm for all of our experiments.

An important detail to note is that many small baseline stereo methods (including the one running on the iPhone) do not estimate absolute depth, but instead produce normalized depth maps. So, aligning such depth maps involves estimating scale factors for each of them, or, in fact, sometimes even more complicated transformations.

3.2 Algorithm Overview
Our 3D panorama construction algorithm proceeds in four stages:

Capture (Section 4.1, Figure 2a): The input to our algorithm is a sequence of aligned color-and-depth image pairs, which we capture from a single vantage point on a dual lens camera phone using a custom burst capture app.

Deformable Depth Alignment (Section 4.2, Figure 2b): Due to the small camera baseline and resulting triangulation uncertainty, the input depth maps are not very accurate, and it is not possible to align them well using global transformations. We resolve this problem using a novel optimization method that jointly estimates the camera poses as well as spatially-varying adjustment maps that are applied to deform the depth maps and bring them into good alignment.

Stitching (Section 4.3, Figure 2c): Next, we stitch the aligned color-and-depth photos into a panoramic mosaic. Usually this is formulated as a labeling problem and solved using discrete optimization methods. However, optimizing label smoothness, e.g., using MRF solvers, is very slow, even when the problem is downscaled. We utilize a carefully designed data term and the high quality of our depth alignment to replace label smoothness optimization with independently optimizing every pixel after filtering the data term in a depth-guided edge-aware manner. This achieves visually similar results with more than an order of magnitude speedup.

Multi-layer Mesh Generation (Section 4.4, Figure 2d): In the last stage, we convert the panorama into a multi-layered and textured mesh that can be rendered on any device using standard graphics engines. We tear the mesh at strong depth edges and extend the backside into the occluded regions, hallucinating new color and depth values in occluded areas. Finally, we simplify the mesh and compute a texture atlas.

4 ALGORITHM

4.1 Capture and Preprocessing
We perform all of our captures with an iPhone 7 Plus using a custom-built rudimentary capture app. During a scene capture session, it automatically triggers the capture of color-and-depth photos (using the native iOS stereo algorithm) at 1 second intervals.

The capture motion resembles how people capture panoramas today: the camera is pointed outwards while holding the device at half-arm's length and scanning the scene in an arbitrary up-, down-, or sideways motion. Unfortunately, the field-of-view of the iPhone 7 Plus camera is fairly narrow in depth capture mode (37° vertical), so we need to capture more images than we would with other cameras. A typical scene contains between 20 and 200 images.

The captured color and depth images have 720 × 1280 pixels and 432 × 768 pixels resolution, respectively. We enable the automatic exposure mode to capture more of the dynamic range of the scene. Along with the color and depth maps, we also record the device orientation estimate provided by the IMU.

Feature extraction and matching: As input for the following alignment algorithm, we compute pairwise feature matches using standard methods. We detect Shi-Tomasi corner features [Shi and Tomasi 1994] in the images, tuned to be separated by at least 1% of the image diagonal. We then compute DAISY descriptors [Tola et al. 2010] at the feature points. We use the IMU orientation estimate to choose overlapping image pairs, and then compute matches using the FLANN library [Muja and Lowe 2009], taking care to discard outliers with a ratio test (threshold = 0.85) and simple geometric filtering, which discards matches whose offset vector deviates too much from the median offset vector (more than 2% of the image diagonal). All this functionality is implemented using OpenCV.
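The feature extraction and matching stage maps closely onto standard OpenCV calls. The sketch below illustrates one possible implementation for a single image pair under the parameters above (corners separated by at least 1% of the image diagonal, DAISY descriptors, FLANN matching with a 0.85 ratio test, and a 2% median-offset filter). The corner budget, FLANN settings, and function names are our own illustrative choices rather than values from the paper, and the DAISY extractor requires the opencv-contrib package.

```python
import cv2
import numpy as np

def match_pair(img_a, img_b, ratio=0.85, max_offset_frac=0.02):
    """Shi-Tomasi corners + DAISY descriptors + FLANN matching, with a ratio
    test and median-offset filtering, roughly following Section 4.1."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    diag = float(np.hypot(*gray_a.shape))
    daisy = cv2.xfeatures2d.DAISY_create()

    kps, descs = [], []
    for gray in (gray_a, gray_b):
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=2000, qualityLevel=0.01,
                                          minDistance=0.01 * diag)  # 1% of the diagonal
        kp = [cv2.KeyPoint(float(x), float(y), 8) for x, y in corners.reshape(-1, 2)]
        kp, desc = daisy.compute(gray, kp)          # DAISY descriptors at the corners
        kps.append(kp)
        descs.append(desc)

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    raw = flann.knnMatch(descs[0], descs[1], k=2)
    good = [m[0] for m in raw if len(m) == 2 and m[0].distance < ratio * m[1].distance]

    # Simple geometric filter: discard matches whose offset vector deviates
    # from the median offset by more than 2% of the image diagonal.
    pts_a = np.float32([kps[0][m.queryIdx].pt for m in good])
    pts_b = np.float32([kps[1][m.trainIdx].pt for m in good])
    offsets = pts_b - pts_a
    keep = np.linalg.norm(offsets - np.median(offsets, axis=0), axis=1) < max_offset_frac * diag
    return pts_a[keep], pts_b[keep]
```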
4.2 Deformable Depth Alignment
Our first goal is to align the depth maps. Since the images were taken from different viewpoints, we cannot deal with this in 2D image space due to parallax. We need to recover the extrinsic camera poses (orientation and location), so that when we project out the depth maps they align in 3D.


(a) Our global affine alignment (Eq. 4) (b) Global alignment to SFM point cloud (c) Our deformable alignment (Eq. 8)
Fig. 4. Aligning depth maps with low-frequency errors. We show stitches and the coefficient of variation (see text) for various methods. (a) Our algorithm with a global affine model (Eq. 4). Many depth maps got pushed to infinity. (b) Aligning each depth map independently with Eq. 4 to a high-quality reconstruction. The result is better, but there are many visible seams and floaters due to the impossibility of fitting the inaccurate depth maps with simple global transformations. (c) Our algorithm with the spatially-varying affine model yields excellent alignment.

4.2.1 Rigid alignment: We achieve this goal by minimizing the distance between reprojected feature point matches. Let f_A^i be a feature point in image A and M = {(f_A^i, f_B^i)} the set of all matched pairs. We define a reprojection loss as follows:

$$E_{\text{reprojection}} = \sum_{(f_A^i, f_B^i) \in M} \rho\left( \left\| P_{A\to B}\!\left(f_A^i\right) - f_B^i \right\|_2^2 \right), \quad (1)$$

where ρ(s) = log(1 + s) is a robust loss function to reduce sensitivity to outlier matches, and P_{A→B}(f) is a function that projects the 2D point f from image A to image B:

$$P_{A\to B}(f) = n\Big( R_B^T \big( \underbrace{R_A\, \overbrace{\tilde{f}\, d_A(f)}^{\text{3D point in camera A's coord.}} + t_A}_{\text{3D point in world space}} - t_B \big) \Big), \quad (2)$$

where (R_A, t_A) and (R_B, t_B) are the rotation matrices and translation vectors for images A and B, respectively, f̃ is the homogeneous-augmented version of f, d_A(f) is the value of image A's depth map at location f, and n([x, y, z]^T) = [x/z, y/z]^T. Note that this formulation naturally handles the wrap-around in 360° panoramas.

Similar reprojection losses are common in geometric computer vision and have been used with great success in many recent reconstruction systems [Schönberger and Frahm 2016]. However, our formulation has a subtle but important difference: since we have depth maps, we do not need to optimize the 3D location of feature point correspondences. This significantly simplifies the system in several ways: (1) it drastically reduces the number of variables that need to be estimated, to just the camera poses, (2) we do not have to link feature point matches into long tracks, and (3) the depth maps help reduce uncertainty, making our system robust to small baselines and narrow triangulation angles.

Eq. 2 assumes that the camera intrinsics as well as lens deformation characteristics are known and fixed throughout the capture. If this is not the case, extra per-camera variables could be added to this equation to estimate these values during the optimization.

Minimizing Eq. 1 w.r.t. the camera poses is equivalent to optimizing a rigid alignment of the depth maps. However, since most small baseline depth maps are normalized (including the ones produced by the iPhone), they cannot be aligned rigidly.
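For concreteness, here is a minimal NumPy sketch of the reprojection loss (Eqs. 1 and 2). It assumes feature points are given in normalized, intrinsics-corrected image coordinates and that a (possibly adjusted) depth can be looked up per feature point; the names are illustrative and the actual implementation may be structured quite differently.

```python
import numpy as np

def project_a_to_b(f, depth_a_at_f, R_A, t_A, R_B, t_B):
    """Eq. 2: reproject a 2D point f from image A into image B using A's depth at f."""
    f_tilde = np.array([f[0], f[1], 1.0])          # homogeneous-augmented point
    X_world = R_A @ f_tilde * depth_a_at_f + t_A   # 3D point in world space
    X_in_b = R_B.T @ (X_world - t_B)               # 3D point in camera B's coordinates
    return X_in_b[:2] / X_in_b[2]                  # n([x, y, z]) = [x/z, y/z]

def reprojection_loss(matches, poses, depth_lookup):
    """Eq. 1: robust sum over all matched feature pairs.
    matches: iterable of (img_A, img_B, f_A, f_B);  poses: {img: (R, t)};
    depth_lookup(img, f): depth map value of `img` at point f."""
    total = 0.0
    for img_a, img_b, f_a, f_b in matches:
        R_A, t_A = poses[img_a]
        R_B, t_B = poses[img_b]
        r = project_a_to_b(f_a, depth_lookup(img_a, f_a), R_A, t_A, R_B, t_B) - np.asarray(f_b)
        total += np.log(1.0 + r @ r)               # robust loss rho(s) = log(1 + s)
    return total
```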
4.2.2 Global transformations: We resolve this problem by introducing extra variables that describe a global transformation of each depth map. Our first experiment was trying to estimate a scale factor s_A for each depth map, i.e., by replacing d_A(f) in Eq. 1 with

$$d_A^{\text{scale}}(f) = s_A\, d_A(f), \quad (3)$$

where s_A is an extra optimization variable per image. However, this did not achieve good results, because, as we learned, many depth maps are normalized using unknown curves. We tried a variety of other classes of global transformations, and achieved the best results with an affine transformation in disparity space (i.e., 1/d):

$$d_A^{\text{affine}}(f) = \left( s_A\, d_A^{-1}(f) + o_A \right)^{-1}, \quad (4)$$

where s_A and o_A are per-image scale and offset coefficients, respectively.

Figure 4a shows a typical result of minimizing Eq. 1 with the affine model. Many depth maps are incorrectly pushed to infinity, because the optimizer could not find a good way to align them otherwise. In the bottom row we visualize the coefficient of variation of depth samples per pixel, i.e., the ratio of the standard deviation to the mean. This is a scale-independent way of visualizing the amount of disagreement in the alignment. As a sanity check we also tried to independently align each depth map to a high quality SFM reconstruction of the scene (computed with COLMAP [Schönberger and Frahm 2016]) that can be considered ground truth (Figure 4b). Even with this "best-possible" result for the model the stitch is severely degraded by seams and floaters.


Through our experimentation we found that it is ultimately not possible to bring this kind of depth map into good alignment using simple global transformations, because of the low-frequency error that is characteristic for small baseline stereo depth maps due to the triangulation uncertainty.

MRF stitch: Labels | Color mosaic | Depth mosaic
Our stitch: Labels | Color mosaic | Depth mosaic
Fig. 5. Comparing stitching using MRF optimization (top row, runtime 3.25 minutes) vs. our algorithm (bottom row, runtime 0.5 seconds). While labels change more frequently in our solution, the color and depth mosaics are visually very similar to the MRF result.

4.2.3 Deformable alignment: Our solution to this problem is to estimate spatially-varying adjustment fields that deform each depth map and can therefore bring them into much better alignment. We modify the affine model in Eq. 4 to replace the global scale and offset coefficients with regular grids of 5×5 values that are bilinearly interpolated across the image:

$$d_A^{\text{deform}}(f) = \left( s_A(f)\, d_A^{-1}(f) + o_A(f) \right)^{-1}, \quad (5)$$

where s_A(f) = Σ_i w_i(f) ŝ_A^i, o_A(f) = Σ_i w_i(f) ô_A^i, and w_i(f) are bilinear interpolation weights at position f.
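A small sketch of Eq. 5, assuming the scale and offset grids are stored as 5×5 NumPy arrays and that the position f is given in normalized coordinates in [0, 1]²; the helper names are ours.

```python
import numpy as np

def bilinear(grid, u, v):
    """Bilinearly interpolate a small grid (e.g., 5x5) at normalized position (u, v)."""
    h, w = grid.shape
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * grid[y0, x0] + fx * (1 - fy) * grid[y0, x1] +
            (1 - fx) * fy * grid[y1, x0] + fx * fy * grid[y1, x1])

def deformed_depth(d, u, v, scale_grid, offset_grid):
    """Eq. 5: spatially-varying affine adjustment in disparity space,
    d_deform = (s(f) / d + o(f))^-1."""
    s = bilinear(scale_grid, u, v)
    o = bilinear(offset_grid, u, v)
    return 1.0 / (s / d + o)
```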
To encourage smoothness in the deformation field we add a cost for differences between neighboring grid values:

$$E_{\text{smoothness}} = \sum_A \sum_{(i,j) \in N} \left\| \hat{s}_A^i - \hat{s}_A^j \right\|_2^2 + \left\| \hat{o}_A^i - \hat{o}_A^j \right\|_2^2 \quad (6)$$

While E_reprojection is agnostic to scale, E_smoothness encourages setting the disparity scale functions ŝ_A^i very small, which results in extremely large reconstructions. To prevent this, we add a regularization term that keeps the overall scale in the scene constant:

$$E_{\text{scale}} = \sum_A \sum_i \left( \hat{s}_A^i \right)^{-1} \quad (7)$$

The combined problem that we solve is:

$$\underset{\{R_I,\, t_I,\, \hat{s}^i,\, \hat{o}^i\}}{\operatorname{argmin}}\ E_{\text{reprojection}} + \lambda_1 E_{\text{smoothness}} + \lambda_2 E_{\text{scale}}, \quad (8)$$

with the balancing coefficients λ_1 = 10^6, λ_2 = 10^-4.

Figure 4c shows the improvement achieved by using the deformable model. The ground plane is now nearly perfectly smooth, there are no floaters, and thin structures such as the lamp post are resolved much better.

4.2.4 Optimization Details: Since Eq. 8 is a non-linear optimization problem, we require a good initialization of the variables. We initialize the camera rotations using the IMU orientations, and the locations by pushing them forward onto the unit sphere, i.e., t_A = R_A · [0, 0, 1]^T. We found it helpful to initialize the deformation field to enlarge the depth maps, i.e., ŝ_A^i = 0.1, ô_A^i = 0, because in this way reprojected feature points are visible in their matched images. We use the Ceres library [Agarwal et al. 2017] to solve this non-linear least-squares minimization problem using the Levenberg-Marquardt algorithm. We represent all rotations using the 3-dimensional axis-angle parameterization and use Rodrigues' formula [Ayache 1989] when they are applied to vectors. The optimization usually converges within 50 iterations, which takes about 10 seconds.
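The paper solves Eq. 8 with Ceres and Levenberg-Marquardt; purely as an illustration, the sketch below shows the described initialization and a SciPy-based stand-in for the solve. SciPy's robust-loss solver uses a trust-region method rather than Levenberg-Marquardt, and its loss='cauchy' (ρ(s) = ln(1 + s)) matches the robust loss of Eq. 1 but is applied to all residuals here, which is a simplification. The residual function is supplied by the caller, e.g., one assembled from Eqs. 1, 6, and 7 using the earlier sketches.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def init_parameters(imu_rotations, grid_size=5):
    """Initialization from Section 4.2.4: rotations from the IMU (as axis-angle),
    locations pushed onto the unit sphere (t_A = R_A [0, 0, 1]^T), and the
    deformation grids set to s = 0.1, o = 0. One parameter block of
    3 + 3 + grid_size^2 + grid_size^2 values per image."""
    n = grid_size * grid_size
    blocks = []
    for R in imu_rotations:                        # one 3x3 rotation matrix per image
        rvec = Rotation.from_matrix(R).as_rotvec()
        t = R @ np.array([0.0, 0.0, 1.0])
        blocks.append(np.concatenate([rvec, t, np.full(n, 0.1), np.zeros(n)]))
    return np.concatenate(blocks)

def solve_alignment(residual_fn, x0):
    """Robust non-linear least squares; loss='cauchy' is rho(s) = ln(1 + s)."""
    return least_squares(residual_fn, x0, method='trf', loss='cauchy',
                         x_scale='jac', max_nfev=200)
```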
satisfy a number of constraints, which we formulate as penalty
4.2.5 Discussion: Due to the non-rigid transformations in the terms.
optimization the camera poses that we recover are not necessarily
accurate anymore. We inspected the recovered poses visually and Depth Consensus: If a source has the correct depth, it tends to be
found that they qualitatively look similar to results obtained by SFM, consistent with other views. Therefore, we count how many other
but we have not performed a careful analysis to verify their degree views n(p, αp) are at similar depth, i.e., their depth ratio is within
of accuracy. [0.9, 1.1], and define a depth consensus penalty, similar to earlier

work [Hedman et al. 2017]:

$$E_{\text{consensus}}(p, \alpha_p) = \max\left( 1 - \frac{n(p, \alpha_p)}{\tau_{\text{consensus}}},\ 0 \right), \quad (9)$$

where τ_consensus = 5 determines how many other depth maps need to agree before the penalty reaches zero and we consider the labeling to be completely reliable.
Image Boundaries: We prefer pixels from the image center, because there the depth maps are more reliable and there is more space for seam-hiding feathering around them. We define an image boundary penalty E_boundary(p, α_p) that is set to 1 if a source pixel is close to the boundary in its original image (i.e., within 5% of the image width), and 0 otherwise.

Saturated Pixels: To maximize detail in the resulting panorama, we define a term that avoids overexposed source pixels:

$$E_{\text{saturated}}(p, \alpha_p) = \begin{cases} 1 & \text{if } l(p, \alpha_p) > \tau_{\text{saturated}} \\ 0 & \text{otherwise,} \end{cases} \quad (10)$$

where l(p, α_p) is the luminance of a source pixel, and τ_saturated = 0.98.

Combined Objective: By putting together the previous objectives we obtain the per-pixel data term:

$$E_{\text{data}} = E_{\text{consensus}} + \lambda_3 E_{\text{boundary}} + \lambda_4 E_{\text{saturated}}, \quad (11)$$

with the balancing coefficients λ_3 = 1, λ_4 = 3.
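A per-pixel sketch of the data term (Eqs. 9-11), assuming the candidate's depth, the depths of the other views reprojected to the same panorama pixel, the candidate's fractional distance to its image boundary, and its luminance are already available; the function signature is ours.

```python
import numpy as np

def data_term(depth_candidate, depths_other_views, boundary_dist_frac, luminance,
              tau_consensus=5, tau_saturated=0.98, lam3=1.0, lam4=3.0):
    """E_data = E_consensus + lam3 * E_boundary + lam4 * E_saturated (Eqs. 9-11)
    for one candidate source at one panorama pixel."""
    ratios = np.asarray(depths_other_views) / depth_candidate
    n = np.count_nonzero((ratios >= 0.9) & (ratios <= 1.1))      # views at similar depth
    e_consensus = max(1.0 - n / tau_consensus, 0.0)              # Eq. 9
    e_boundary = 1.0 if boundary_dist_frac < 0.05 else 0.0       # within 5% of the image width
    e_saturated = 1.0 if luminance > tau_saturated else 0.0      # Eq. 10
    return e_consensus + lam3 * e_boundary + lam4 * e_saturated  # Eq. 11
```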
4.3.2 Optimization: Independently optimizing Eq. 11 for every pixel is fast, but yields noisy results since labels may change very frequently. The canonical way to achieve smoother results is to define a pairwise smoothness term that encourages fewer label changes that are placed in areas where they tend to be least visible. However, this makes the problem considerably harder and requires using slow MRF solvers. For example, Hedman et al. [2017] report runtimes of several minutes for solving a downscaled stitching problem.

We found that we can achieve very similar looking results faster with independent per-pixel optimization, by applying a variant of cost-volume filtering [Hosni et al. 2013] which first filters the data term with a depth-guided edge-aware filter:

$$E_{\text{soft-data}}(p, \alpha_p) = \sum_{\Delta \in W_d^\alpha} w_d^\alpha(p, p + \Delta) \cdot E_{\text{data}}(p + \Delta, \alpha_p). \quad (12)$$

However, instead of using a single global guide, we determine unique filter weights w_d^α for each source image α using a guided filter [He et al. 2010], guided by the normalized disparities in α. In our experiments, we use a filter footprint that spans 2.5% of the image width and set the edge-aware parameter ε = 10^-7.

In Figure 5 we compare our result with an MRF solution using the color and disparity smoothness terms defined by Hedman et al. [2017]. While our stitch exhibits more frequent label changes, the stitched color and depth mosaics are visually very similar.
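The filtering and independent per-pixel selection can be sketched as follows, assuming each source's data-term map and normalized disparity map have already been rendered into panorama space. The guided filter comes from opencv-contrib (cv2.ximgproc); normalizing by a filtered validity mask is our own way of handling pixels a source does not cover and is not described in the paper.

```python
import cv2
import numpy as np

def stitch_labels(cost_maps, disparity_maps, radius, eps=1e-7):
    """Per-pixel label selection after depth-guided cost filtering (Eq. 12).
    cost_maps: list of HxW float32 E_data maps, one per source, np.inf where
               the source does not cover the panorama pixel.
    disparity_maps: list of HxW float32 normalized disparity maps used as guides."""
    filtered = []
    for cost, disp in zip(cost_maps, disparity_maps):
        valid = np.isfinite(cost).astype(np.float32)
        c = np.where(np.isfinite(cost), cost, 0.0).astype(np.float32)
        num = cv2.ximgproc.guidedFilter(disp, c, radius, eps)      # filter cost, guided by disparity
        den = cv2.ximgproc.guidedFilter(disp, valid, radius, eps)  # filter coverage mask
        smoothed = num / np.maximum(den, 1e-3)
        filtered.append(np.where(valid > 0, smoothed, np.inf))
    # Independent per-pixel optimization: choose the cheapest source at every pixel.
    return np.argmin(np.stack(filtered, axis=0), axis=0)
```

With the data term from the previous sketch evaluated per source, `radius` would correspond to 2.5% of the image width and `eps` to the edge-aware parameter ε = 10^-7 mentioned above.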
4.3.3 Color Harmonization: Since we capture images with auto-exposure enabled, we need to align the exposures to create a seamless panorama. Following the insight from Reinhard et al. [2001], we convert the images to the channel-decorrelated CIELAB color space, and then process each channel independently. We solve a linear system to compute global affine color-channel adjustments (i.e., scale and offset) for each source image, such that the adjusted color values in the overlapping regions agree as much as possible. We further reduce visible seams by feathering the label region boundaries with a wide radius of 50 pixels. In a supplementary document we provide more implementation details.
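The exact linear system is specified in the supplementary document; the sketch below shows one plausible least-squares formulation for a single CIELAB channel, where corresponding color samples in overlapping regions are encouraged to agree after a per-image affine adjustment. The small anchor term that fixes the global gauge is our own choice.

```python
import numpy as np

def harmonize_channel(samples, n_images, anchor=1e-3):
    """Solve for per-image (scale, offset) of one color channel.
    samples: list of (i, j, x_i, x_j), the channel values of the same panorama
    pixel observed in overlapping images i and j."""
    rows, rhs = [], []
    for i, j, xi, xj in samples:
        r = np.zeros(2 * n_images)
        r[2 * i], r[2 * i + 1] = xi, 1.0       # s_i * x_i + o_i
        r[2 * j], r[2 * j + 1] = -xj, -1.0     # minus (s_j * x_j + o_j)
        rows.append(r)
        rhs.append(0.0)
    for k in range(n_images):                  # weak prior: s_k ~ 1, o_k ~ 0
        r = np.zeros(2 * n_images); r[2 * k] = anchor
        rows.append(r); rhs.append(anchor)
        r = np.zeros(2 * n_images); r[2 * k + 1] = anchor
        rows.append(r); rhs.append(0.0)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol.reshape(n_images, 2)            # (scale, offset) per image
```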


(a) Naïve mesh (b) Naïve mesh + tears (c) Multi-layer mesh
Fig. 6. (a) Naïvely meshing by connecting all vertices yields stretched triangles at depth edges. (b) Tearing the mesh avoids this, but reveals holes. (c) Our multi-layer meshes extend the back-side at depth edges smoothly into the occluded region and reveal inpainted colors and depths.

4.4 Multi-layer Processing
The final step of our algorithm is to convert the panorama into a triangle mesh that can be rendered on any device using standard graphics engines. Naïvely creating a triangle mesh by connecting all pixels to their 4-neighbors yields stretched triangles at strong depth edges that are revealed when the viewpoint changes (Figure 6a). Our solution somewhat resembles the "two-layer merging" algorithm of Hedman et al. [2017]. However, an important difference is that our stitcher does not produce back-surface stitches, since the baseline is too small to reconstruct significant content in occluded regions (while we use similar camera trajectories, the field of view of our camera is smaller, hence there is less overlap between images). If scenes were captured with a wider baseline and/or a more wide-angle camera, the two-layer stitch-and-merge algorithm mentioned above could be adapted at the expense of slower runtime.

In our algorithm, every mesh vertex corresponds to a pixel position in the panorama that is "pushed out" to a certain depth. Each vertex is connected to at most one neighbor in each of the 4 cardinal directions. Since our goal is to generate multiple layers, there can be multiple vertices at different depths for a single pixel position. We initialize the mesh by creating vertices for every panorama pixel and connecting them to their 4 neighbors.

We start the computation by detecting major depth edges in the mesh. Since the depth edges are soft and spread over multiple pixels, we apply a 9 × 9 median filter to turn them into step edges. Next, we tear the connection between neighboring vertices if their disparity differs by more than 5 · 10^-2 units. Sometimes the median filter produces small isolated "floating islands" in the middle of depth edges. We detect these using connected component analysis and merge them into either foreground or background, by replacing their depth with the median of depths just outside the floater. Figure 6b shows the mesh after tearing it at depth edges.
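A minimal sketch of the depth-edge detection and tearing step, operating on the stitched disparity map; the floating-island merging described above and the back-layer growing described in the next paragraphs are omitted, and the helper names are ours.

```python
import numpy as np
from scipy.ndimage import median_filter

def tear_mesh_edges(disparity, threshold=5e-2, median_size=9):
    """Turn soft depth edges into step edges with a median filter, then tear
    4-neighbor connections whose disparity difference exceeds the threshold.
    Returns boolean maps of the surviving right/down connections."""
    d = median_filter(disparity, size=median_size)
    keep_right = np.abs(d[:, 1:] - d[:, :-1]) <= threshold
    keep_down = np.abs(d[1:, :] - d[:-1, :]) <= threshold
    return keep_right, keep_down
```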
Next, we hallucinate new content in occluded parts by iteratively growing the mesh at its boundaries. In every iteration, each vertex that is missing a connection in one of the 4 cardinal directions grows in this direction and generates a new vertex at the same depth as itself. We connect new vertices with all neighboring boundary vertices whose disparity is within the threshold above. After running this procedure for a fixed number of 30 iterations, we prune the newly generated vertices by removing any but the furthest generated vertex at every pixel location. If the remaining generated vertex is in front of the original stitch we remove it as well. We synthesize colors for the newly generated mesh parts using diffuse inpainting. The resulting mesh smoothly extends the back-side around depth edges into the occluded regions. Instead of stretched triangles or holes, viewpoint changes now reveal smoothly inpainted color and depth content (Figure 6c).

In a supplementary document we provide some implementation details about simplification and texture atlas generation for the final mesh.

5 RESULTS AND EVALUATION
We have captured and processed dozens of scenes with an iPhone 7 Plus. 25 of these are included in this submission, see Figure 7, as well as the accompanying video and the supplementary material. These scenes span a wide range of different environments (indoor and outdoor, urban and natural) and capture conditions (day and night, bright and overcast). The scenes we captured range from about 20 to 200 source images, and their horizontal field-of-view ranges from 60° to 360°.

5.1 Performance
All scenes were processed using a PC with a 3.4 GHz 6-core Intel Xeon E5-2643 CPU, an NVIDIA Titan X GPU, and 64 GB of memory. Our implementation mostly consists of unoptimized CPU code. The GPU is currently only (insignificantly) used in the warping stage. We also ran our system on a slower 14" Razer Blade laptop with a 3.3 GHz 4-core Intel i7-7700HQ CPU and an NVIDIA GTX 1060 GPU. Interestingly, the warping stage performs faster on the laptop, most likely because CPU computation and CPU/GPU transfers dominate the runtime. Table 1 breaks out the timings for the various algorithm stages on both of these systems for an example scene. While our algorithm already runs fast, we note that there are significant further optimizations on the table. Since the deformable alignment has proven to be quite robust, we could replace the feature point detector and descriptor with faster-to-compute variants, e.g., FAST features and BRIEF descriptors. The alignment optimization could be sped up by implementing a custom solver, tailored to this particular problem. Our current warping algorithm is implemented in a wasteful way. Properly rewriting this GPU code would make this operation practically free. The stitching algorithm could be reimplemented on the GPU.

Table 1. Breakdown of the algorithm performance per stage for the scene from Figure 1.

Stage                               Desktop Timing   Laptop Timing
Feature extraction and matching     6.6s             7.0s
Deformable alignment                9.8s             10.2s
Warping                             6.6s             5.1s
Stitching                           2.7s             2.6s
Color harmonization                 1.6s             1.8s
Multi-layer computation             1.0s             1.1s
Mesh simplification                 3.1s             3.5s
Texture atlas generation            3.3s             3.7s
Total                               34.7s            35.0s

5.2 Alignment
Figure 8 shows a quantitative evaluation of our alignment algorithm. We processed the 25 scenes in Figure 7 using different variants of the algorithm and evaluate the average reprojection error (Eq. 1). We also evaluate the effect of varying the grid size of our deformable model, which shows that the reprojection error remains flat across a wide range of settings around our choice of 5 × 5.

5.3 Single-image CNN Depth Maps
We experimented with other depth map sources. In particular, we were interested in using our algorithm with depth maps estimated from single images using CNNs, since this would enable using our system with regular single-lens cameras (even though the depth map quality produced by current algorithms is low). In Figure 9 we used the Monodepth algorithm [Godard et al. 2017] to generate single-image depth maps for three of our scenes. The original Monodepth algorithm works well for the first street scene, since it was trained on similar images. For the remaining two scenes we retrained the algorithm with explicit depth supervision using the RGBD scans in the ScanNet dataset [Dai et al. 2017a]. Even though the input depth maps are considerably degraded, our method was able to reconstruct surprisingly good results. To better appreciate the result quality, see the video comparisons in the supplementary material.

5.4 SFM and MVS Comparison
We were interested in how standard SFM algorithms would perform on our datasets. When processing our 25 datasets with COLMAP's SFM algorithm, 7 scenes failed entirely, in 7 more not all cameras were registered, and there was 1 where all cameras registered but the reconstruction was an obvious catastrophic failure. This high failure rate underscores the difficulty of working with small baseline imagery.

We also compare against end-to-end MVS systems, in particular the commercial Capturing Reality system² and Casual 3D [Hedman et al. 2017]. Capturing Reality's reconstruction speed is impressive for a full MVS algorithm, but due to the small baseline it is only able to reconstruct the foreground. Casual 3D produces results of comparable quality to ours, but at much slower speed. Figure 10 shows an example scene, and the supplementary material contains video comparisons of more scenes.

² https://www.capturingreality.com


Alley (72 images), Angkor Wat Miniature (30 images), Bushes (26 images), Footpath (34 images), Golden Mount (28 images), River Houses (32 images)
Temple (59 images), Temple Yard (57 images), Lisbon (106 images), Bricks (88 images), Plumstead (88 images), Skate Park (78 images), Snowman (134 images)
Southbank (78 images), Embankment (53 images), Tottenham (91 images), Ivy (60 images), Turtle (65 images), Van Gogh Walk (31 images), Wood Shed (153 images)
Wilkins Terrace (164 images), Industrial (203 images), Hanover Gardens (91 images)
Forest (101 images), Bloomsbury (147 images)

Fig. 7. Datasets that we show in the paper, video, and supplementary material. The two bottom rows show 360° panoramas.

Average reprojection error (y-axis) | Global affine | Affine alignment to SFM | Deformable 3×3 | Deformable 5×5 | Deformable 8×8 | Deformable 12×12
Fig. 8. Average reprojection error (Eq. 1, without the robust loss function) for different alignment methods, as well as our deformable alignment with different grid sizes.

5.5 Parallax-aware Stitching
We compared our algorithm with two monocular stitching algorithms that handle parallax in different ways. As-projective-as-possible warping (APAP) [Zaragoza et al. 2013] allows local deviations from otherwise globally projective warps to account for misalignment (Figure 11a). Note that this does not always succeed (see detail crop in Figure 11c). Microsoft ICE³ uses globally projective warps, but leverages carefully engineered seam finding and blending to hide parallax errors (Figure 11b and detail crop in 11d). Neither of these methods produces a depth panorama. While the source depth pixels could be stitched according to the label maps produced by these algorithms, this would not lead to good results, because the source depth maps are inconsistent and show the scene from different vantage points. Our algorithm resolves these inconsistencies and produces a coherent color-and-depth panorama (Figure 10c and detail crop in 11e).

³ https://www.microsoft.com/en-us/research/product/computational-photography-applications/image-composite-editor/

5.6 Capture without Parallax
We evaluated the effect of varying the amount of parallax in the input images by capturing scenes while rotating the camera around the optical center without translating (as much as was possible), and comparing against a normal capture where we move the camera at half-arm's length. The resulting panoramas are visually very similar (see supplementary material). That said, theoretically our method should break down in the complete absence of parallax because the reprojection error in Equation 1 will become invariant to depth in this case. In practice, however, it is very difficult to completely avoid any parallax in the capture, and, fortunately, the natural way to capture panoramas is on an arc with a radius of about a half arm's length anyway.

5.7 Limitations
Our algorithm has a variety of limitations that lead to interesting avenues for future work.

Capture: The iPhone camera has a very narrow field-of-view in depth capture mode, because one of the lenses is wide-angle and the other a telephoto lens. If both lenses were wide-angle we would need to capture considerably fewer images to achieve the same amount of overlap. At the same time the baseline would increase, making the reconstruction problem easier.


Rows: Example input / CNN depth (three scenes). Columns: Global affine (Eq. 4), Independent affine to SFM, Deformable (Eq. 8) with CNN depth, Deformable (Eq. 8) with dual camera depth.
Fig. 9. Applying our algorithm to single-image CNN depth maps (3 middle columns), and comparing to a result for dual camera depth maps (right column). See the supplementary material for videos of the results.

Artifacts: Our results exhibit similar artifacts as other 3D reconstruction systems. In particular, these are floating pieces of geometry, incorrect depth in untextured regions, and artifacts on dynamic objects. Compared to existing systems these problems are reduced (Figure 10), but they are still present. To examine these artifacts carefully we suggest watching the video comparison in the supplementary material.

Multi-layer processing: The hallucination of occluded pixels is rudimentary. In particular the simple back-layer extension algorithm leaves room for improvement. We plan to improve the inpainting of colors using texture synthesis.

Parameters: Like many end-to-end reconstruction algorithms we depend on many parameters. We proceeded one stage at a time, examining intermediate results, when tuning the parameters. All results shown anywhere in this submission or accompanying materials use the same parameter settings, provided in the paper.

6 CONCLUSIONS
In this paper we have presented a fast end-to-end algorithm for generating 3D panoramas from a sequence of color-and-depth images.


(a) Capturing Reality (2.1 minutes) (b) Casual 3D (1.8 hours) (c) Our result (34.7 seconds)

Fig. 10. Comparison against MVS systems: (a) Capturing Reality processes fast, but the reconstruction breaks down just a few meters away from the vantage
point due to triangulation uncertainty. (b) Casual 3D produced a high quality result, but it is slow. (c) Our result has even better quality, and was computed
over 200× faster.

(a) APAP [Zaragoza et al. 2013] (b) Microsoft ICE (c) APAP detail (d) ICE detail (e) Our detail

Fig. 11. Comparison against monocular panoramas stitched with (a) As-Projective-As-Possible warping, and (b) Microsoft ICE. Note that these algorithms do not produce a stitched depth map. (c)-(d) show detail crops from the before-mentioned algorithms. (e) shows the corresponding detail crop from our result in Figure 10c.

Even though these depth maps contain a considerable amount of low-frequency error, our novel deformable alignment optimization is able to align them precisely. This opens up the possibility to replace discrete smoothness optimization in our stitcher with depth-guided edge-aware filtering of the data term and independently optimizing every pixel, achieving two orders of magnitude speedup.

We are excited about the many avenues for further improvement and research that this work opens up. Considering the performance discussion in Section 5.1 we believe a near-interactive implementation directly on the phone is within reach.

We have seen already how the availability of a fast capture and reconstruction system has changed our own behavior with respect to 3D scene capture. The way we capture scenes has become more opportunistic and impulsive. Almost all of the provided scenes have been captured spontaneously without planning, e.g., while traveling.

We are particularly excited about the promising first result using single-image CNN depth maps. The quality of these depth maps is still quite low. Yet, our alignment algorithm was able to conflate them, and stitching them using the consensus data term reduced artifacts further. Improving these results further is an interesting direction for further research and holds the promise of bringing instant 3D photography to billions of regular cell phones with monocular cameras.

ACKNOWLEDGEMENTS
The authors would like to thank Clément Godard for generating the depth maps for the CNN evaluation and Suhib Alsisan for developing the capture app.

REFERENCES
Sameer Agarwal, Keir Mierle, and Others. 2017. Ceres Solver. http://ceres-solver.org. (2017).
Robert Anderson, David Gallup, Jonathan T. Barron, Janne Kontkanen, Noah Snavely, Carlos Hernandez Esteban, Sameer Agarwal, and Steven M. Seitz. 2016. Jump: Virtual Reality Video. ACM Transactions on Graphics 35, 6 (2016).
Nicholas Ayache. 1989. Vision Stéréoscopique et Perception Multisensorielle: Application à la robotique mobile. Inter-Editions (MASSON). https://hal.inria.fr/inria-00615192
Jonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. 2015. Fast Bilateral-Space Stereo for Synthetic Defocus. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 4466–4474.
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017a. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE (2017).
Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. 2017b. BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Re-integration. ACM Transactions on Graphics (TOG) 36, 3 (2017), Article no. 24.
Abe Davis, Marc Levoy, and Fredo Durand. 2012. Unstructured Light Fields. Computer Graphics Forum (Proc. EUROGRAPHICS 2012) 31, 2pt1 (2012), 305–314.
Facebook. 2016. Facebook Surround 360. https://facebook360.fb.com/facebook-surround-360/. (2016). Accessed: 2016-12-26.
Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. 2017. Unsupervised Monocular Depth Estimation with Left-Right Consistency. CVPR (2017).
Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (1996), 43–54.
Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, and In So Kweon. 2016. High-quality Depth from Uncalibrated Small Motion Clip. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
Kaiming He, Jian Sun, and Xiaoou Tang. 2010. Guided Image Filtering. Proceedings of the 11th European Conference on Computer Vision (ECCV) (2010), 1–14.
Peter Hedman, Suhib Alsisan, Richard Szeliski, and Johannes Kopf. 2017. Casual 3D Photography. ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2017) 36, 6 (2017), Article no. 234.
Asmaa Hosni, Christoph Rhemann, Michael Bleyer, Carsten Rother, and Margrit Gelautz. 2013. Fast Cost-Volume Filtering for Visual Correspondence and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2 (2013), 504–511.
Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 2017. 6-DOF VR Videos with a Single 360-Camera. IEEE VR 2017 (2017).
Sunghoon Im, Hyowon Ha, François Rameau, Hae-Gon Jeon, Gyeongmin Choe, and In So Kweon. 2016. All-Around Depth from Small Motion with a Spherical Panoramic Camera. European Conference on Computer Vision (ECCV '16) (2016), 156–172.
Hiroshi Ishiguro, Masashi Yamamoto, and Saburo Tsuji. 1990. Omni-directional stereo for making global map. Third International Conference on Computer Vision (1990), 540–547.
Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera. Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (2011), 559–568.


Robert Konrad, Donald G. Dansereau, Aniq Masood, and Gordon Wetzstein. 2017.
SpinVR: Towards Live-streaming 3D Virtual Reality Video. ACM Transactions on
Graphics (Proc. SIGGRAPH Asia 2017) 36, 6 (2017), article no. 209.
Jungjin Lee, Bumki Kim, Kyehyun Kim, Younghui Kim, and Junyong Noh. 2016. Rich360:
Optimized Spherical Representation from Structured Panoramic Camera Arrays.
ACM Transactions on Graphics 35, 4 (2016), article no. 63.
Marc Levoy and Pat Hanrahan. 1996. Light Field Rendering. Proceedings of the 23rd
Annual Conference on Computer Graphics and Interactive Techniques (1996), 31–42.
Kaimo Lin, Nianjuan Jiang, Loong-Fah Cheong, Minh N. Do, and Jiangbo Lu. 2016.
SEAGULL: Seam-Guided Local Alignment for Parallax-Tolerant Image Stitching.
14th European Conference on Computer Vision (ECCV) (2016), 370–385.
Marius Muja and David G. Lowe. 2009. Fast Approximate Nearest Neighbors with
Automatic Algorithm Configuration. International Conference on Computer Vision
Theory and Applications (VISSAPP '09) (2009), 331–340.
Shmuel Peleg and Moshe Ben-Ezra. 1999. Stereo panorama with a single camera. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR 1999) (1999), 395–401.
Shmuel Peleg, Moshe Ben-Ezra, and Yael Pritch. 2001. Omnistereo: panoramic stereo
imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 3 (2001),
279–290.
F. Perazzi, A. Sorkine-Hornung, H. Zimmer, P. Kaufmann, O. Wang, S. Watson, and
M. Gross. 2015. Panoramic Video from Unstructured Camera Arrays. Computer
Graphics Forum 34, 2 (2015), 57–68.
Realities. 2017. realities.io | Go Places. http://realities.io/. (2017). Accessed: 2017-1-12.
Erik Reinhard, Michael Ashikhmin, Bruce Gooch, and Peter Shirley. 2001. Color Transfer
Between Images. IEEE Comput. Graph. Appl. 21, 5 (2001), 34–41.
Christian Richardt, Yael Pritch, Henning Zimmer, and Alexander Sorkine-Hornung.
2013. Megastereo: Constructing High-Resolution Stereo Panoramas. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR 2013) (2013), 1256–1263.
Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion
Revisited. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2016).
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski.
2006. A comparison and evaluation of multi-view stereo reconstruction algorithms.
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06) 1 (2006), 519–528.
Jianbo Shi and Carlo Tomasi. 1994. Good Features to Track. 1994 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR’94) (1994), 593 – 600.
Richard Szeliski. 2010. Computer Vision: Algorithms and Applications (1st ed.). Springer-
Verlag New York, Inc., New York, NY, USA.
Engin Tola, Vincent Lepetit, and Pascal Fua. 2010. DAISY: An Efficient Dense Descriptor
Applied to Wide-Baseline Stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32, 5 (2010),
815–830.
Valve. 2016. Valve Developer Community: Advanced Outdoors Photogrammetry. https://developer.valvesoftware.com/wiki/Destinations/Advanced_Outdoors_Photogrammetry. (2016). Accessed: 2016-11-3.
Thomas Whelan, Stefan Leutenegger, Renato Salas Moreno, Ben Glocker, and Andrew
Davison. 2015. ElasticFusion: Dense SLAM Without A Pose Graph. Proceedings of
Robotics: Science and Systems (2015).
Julio Zaragoza, Tat-Jun Chin, Michael S. Brown, and David Suter. 2013. As-Projective-As-
Possible Image Stitching with Moving DLT. Proceedings of the 2013 IEEE Conference
on Computer Vision and Pattern Recognition (2013), 2339–2346.
Fan Zhang and Feng Liu. 2014. Parallax-Tolerant Image Stitching. Proceedings of the
2014 IEEE Conference on Computer Vision and Pattern Recognition (2014), 3262–3269.
Fan Zhang and Feng Liu. 2015. Casual Stereoscopic Panorama Stitching. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR ’15) (2015), 2002–2010.
Ke Colin Zheng, Sing Bing Kang, Michael F. Cohen, and Richard Szeliski. 2007. Layered
Depth Panoramas. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2007) (2007), 1–8.
Qian-Yi Zhou and Vladlen Koltun. 2014. Simultaneous Localization and Calibration:
Self-Calibration of Consumer Depth Cameras. 2014 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2014), 454–460.
