Panoramic Video From Unstructured Camera Arrays
Eurographics 2015 / Wimmer (Guest Editors)
Zurich, Disney Research Zurich, Walt Disney Imagineering
Figure 1: Two panoramas created with our system. Top - (00:05 in the accompanying video): a cropped frame from a 160
megapixel panoramic video generated from five input videos. The overlay on the right shows the full panorama, with the
respective individual fields of view of the input cameras highlighted by colored frames. Bottom - (01:22): a crop from a 20
megapixel panorama created from a highly unstructured array consisting of 14 cameras.
Abstract
We describe an algorithm for generating panoramic video from unstructured camera arrays. Artifact-free
panorama stitching is impeded by parallax between input views. Common strategies such as multi-level blending or minimum energy seams produce seamless results on quasi-static input. However, on video input these
approaches introduce noticeable visual artifacts due to lack of global temporal and spatial coherence. In this
paper we extend the basic concept of local warping for parallax removal. Firstly, we introduce an error measure with increased sensitivity to stitching artifacts in regions with pronounced structure. Using this measure, our
method efficiently finds an optimal ordering of pair-wise warps for robust stitching with minimal parallax artifacts. Weighted extrapolation of warps in non-overlap regions ensures temporal stability, while at the same time
avoiding visual discontinuities around transitions between views. Remaining global deformation introduced by the
warps is spread over the entire panorama domain using constrained relaxation, while staying as close as possible
to the original input views. In combination, these contributions form the first system for spatiotemporally stable
panoramic video stitching from unstructured camera array input.
Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/Image Generation - Viewing Algorithms
1. Introduction
The richness and detail of our surrounding visual world is
challenging to capture in a regular photograph or video. The
idea of combining content from multiple cameras into a wide
field of view panorama therefore is essentially as old as photography and film themselves. Popular contemporary implementations are, for instance, dome-based projections as in
planetaria or IMAX cinemas. But while tools for creating
panoramic still images are available in most consumer cameras and software nowadays, capturing panoramic video of
comparable quality remains a difficult challenge.
Figure 2: Two of the camera arrays we constructed for capturing the panoramic videos shown in this paper. Left: 14
machine vision cameras. Right: A small rig built using five
GoPro cameras. Note that our panoramas have been captured
without particularly accurate placement of cameras in order
to demonstrate the flexibility of our approach.
parallax removal. Our procedure remains efficient even for large
numbers of input cameras, where brute-force approaches for
finding an optimal warp order would be infeasible.
Finally, local image warping accumulates globally, leading to significant spatial deformations of the panorama.
Since these deformations are dependent on the per-frame
scene content, they change for every output frame and hence
result in noticeable temporal jitter. We resolve these global
deformations and temporal instability by a weighted warp
extrapolation from overlap to non-overlap image regions,
and a final constrained relaxation step of each full panoramic
output frame to a reference projection.
We demonstrate panoramic video results captured with
different types of cameras on arrays with up to 14 cameras,
allowing the generation of panoramic video in the order of
tens to over a hundred megapixels.
2. Related Work
We review the most related works here, and refer to
Szeliski [Sze06] for an extensive survey.
Parallax-free input. One class of methods focuses on
creating a single panoramic image from a set of overlapping images, under the assumption that all images are captured from the same or similar center of projection and
hence are basically parallax-free. Such a configuration can
be achieved by carefully rotating a camera, e.g. using a tripod [KUDC07]. After estimating the relative camera poses,
a panorama can be generated by projecting all images onto
a common surface. Smaller errors caused, e.g., by imperfect alignment of projection centers, objects moving during image acquisition, or lens distortion can be removed using image blending [BA83, LZPW04], content-adaptive seam
computation [EF01], or efficient, user-assisted combinations
thereof [STP12]. Depending on the field of view, a suitable
projection has to be chosen in order to minimize undesirable
distortion, e.g., of straight lines [KLD 09, HCS13]. While
these methods enable the creation of static panoramas up
to gigapixel resolution [KUDC07], they are not designed to
produce panoramic video and cannot correct for larger parallax errors when camera projection centers do not align.
Handling parallax. When capturing hand-held panoramas or using an array of cameras, parallax has to be accounted for. A common strategy, which is also the basis
for our work, is to warp the images in 2D space [SS00,
KSU04, JT08] using image correspondences computed as
depth maps or optical flow fields in the overlapping image
regions. In a general multi-view scenario with more than
two input views, this requires combining all correspondence
fields [EdDM 08]. We show in Section 3.2 that existing solutions such as averaging of correspondence fields [SS00]
are sensitive to wrong correspondence estimates, which
occur frequently in real-world images. Stereo-based approaches are inherently sensitive to rolling-shutter artifacts
or unsynchronized images. Lin et al. [LLM 11] describe a
method based on smoothly varying affine stitching which
is, however, computationally expensive and not straightforward to extend to high resolution input or video. Zaragoza et
al. [ZCT 14] extend this approach and combine a series of
local homographies to reduce parallax while preserving the
projectivity of the transformation. For specific scenes and
camera motion such as long panoramas of buildings along a
street [AAC 06,KCSC10], parallax errors can be reduced by
using approximate geometric proxies and seam-based stitching. Stereo panorama techniques [PRRAZ00] intentionally
capture parallax to enable stereoscopic viewing, but require
a dense sampling and significant overlap of views. Recently,
multi-perspective scene collages [NZN07, ZMP07] showed
interesting artistic results by aligning and pasting images on
top of each other. Both works propose strategies to find optimal orderings for combining images, but their respective
solutions are computationally too expensive for processing
video and not designed for seamless parallax removal of dynamic content. Inspired by advances in video stabilization [LYTS13], a recent state-of-the-art method utilizes
seam-based homography optimization for parallax tolerant
stitching [ZL14], which we compare to in our results.
Dynamic panoramas. Several methods have been developed that augment static panoramas with dynamic video
elements, e.g., by segmenting dynamic scene parts captured at different times and overlaying them on a static
panorama [IAB 96]. Dynamic video textures [AZP 05] are
infinitely looping video panoramas that show periodic motions (e.g. fire). Dynamosaics [RPLP05] are made by scanning a dynamic scene with a moving video camera, creating
a panorama where all events play simultaneously. Similarly,
Pirk et al. [PCD 12] enhance a static gigapixel panorama
by locally embedding video clips. All above methods work
well for localized, periodic motion, but have not been designed to handle significant motion resulting from a camera
array moving through a scene.
Panoramic video. Our aim is to generate fully dynamic
panoramic video. One possibility would be to perform 3D
reconstruction, e.g., using structure-from-motion and multiview stereo techniques over the whole input sequence, and
then project the reconstructed 3D models to a single virtual
camera. However, robustly and efficiently creating photorealistic, temporally stable geometry reconstructions for entire videos remains a difficult challenge. In particular for
non-static scenes with independently moving objects, the
typically limited amount of overlap between cameras hinders robust multi-view stereo of the complete panorama. One
of the few works that explicitly addresses panoramic video
is [ZC12]. Similar to [KSU04] they compute depth only in
the overlapping regions and then smoothly extrapolate deformation to non-overlapping regions [JT08]. This method
works well for simple motions without too strong occlusions, but heavily relies on accurate segmentation of moving
objects. Practical issues such as rolling shutter or unsynchronized cameras pose additional challenges for stereo-based
methods, which our method can handle to a certain extent due to its general motion field estimation.
Commercial solutions like the Point Grey Ladybug
(www.ptgrey.com), the FlyCam (www.fxpal.com), or the
Panacast camera (www.altiasystems.com) are based on precalibrated, miniaturized camera arrays in order to minimize
parallax. The resolution, optical quality, and flexibility of
such systems are limited in comparison to larger, high quality cameras and lenses. Similar issues arise for systems like
GoPano (www.gopano.com), FullView (www.fullview.com),
or the OmniCam [SFW 13], which rely on catadioptric mirrors [Nay97] to eliminate parallax.
Assessing panorama quality. In order to analyze alignment quality, a standard choice is to measure patch similarity using, e.g., the sum of squared differences (SSD) of pixels in overlapping image regions. This is, however, overly
sensitive to intensity differences in less structured image regions, which can be easily fixed after alignment using multilevel blending. More robust measures were proposed for
hole filling in videos [WSI07] in order to find similar patches
that minimize the SSD of color and gradients. One problem of comparing pixel values in a patch is that it does not
consider structural image properties. To alleviate this issue,
Kopf et al. [KKDK12] restrict the search space for patches to
texture-wise similar patches only, which also improves efficiency. In the context of hole filling, Simakov et al. [SCSI08]
achieve semantically meaningful results by enforcing that
the filled region is close to known content. Our analysis of
parallax error builds on these ideas.
3. Our Algorithm
A common first step for combining views from multiple
cameras is to estimate their mutual poses, i.e., to find a basic static image alignment on a user-defined projection surface (such as a hemisphere). In the following we refer to
this step as the reference projection, and we assume the configuration of cameras to be static over the input sequence.
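To make the reference projection concrete, the sketch below maps panorama pixels on a hemispherical projection surface to the image plane of one camera. The spherical parameterization, the intrinsics K, and the rotation R are our own illustrative assumptions; the paper does not prescribe a particular implementation.

```python
import numpy as np

def reference_projection_map(pano_w, pano_h, K, R, hfov=np.pi, vfov=np.pi / 2):
    """Map each panorama pixel to image coordinates of one camera.

    Hypothetical hemispherical parameterization: panorama columns correspond
    to azimuth, rows to elevation. K is the 3x3 camera intrinsic matrix,
    R the 3x3 rotation from panorama to camera coordinates.
    """
    # Angular coordinates for every panorama pixel.
    theta = (np.arange(pano_w) / pano_w - 0.5) * hfov   # azimuth
    phi = (np.arange(pano_h) / pano_h - 0.5) * vfov     # elevation
    theta, phi = np.meshgrid(theta, phi)                # both (H, W)

    # Unit viewing directions on the hemisphere (panorama frame).
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(phi),
                     np.cos(theta) * np.cos(phi)], axis=-1)   # (H, W, 3)

    # Rotate into the camera frame and project with the pinhole model.
    cam = dirs @ R.T
    pix = cam @ K.T
    valid = pix[..., 2] > 1e-6                                 # in front of camera
    xy = pix[..., :2] / np.maximum(pix[..., 2:3], 1e-6)
    return xy, valid   # sample the camera image at xy, e.g. with cv2.remap
```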
Figure 4: Impact of different warp ordering strategies on data from our array with 14 cameras - further demonstrated in the
accompanying video at (00:31). The bottom row shows the output of our error measure. Darker pixels correspond to higher
structural error. The respective close-ups focus on two different types of artifacts, deformation (left) and discontinuities (right),
caused by wrong correspondence fields. (a) Static alignment with multi-band blending works for well aligned regions, but
creates discontinuities in all areas with parallax. (b) Averaging of all correspondence fields (e.g., as in [SS00]) is computationally
expensive and is sensitive to incorrect optical flow estimates, which can lead to deformations as well as incomplete alignment.
(c) Pairwise warping is computationally efficient, but without the optimal ordering accumulates errors such as deformations,
which cannot be corrected. (d) An optimal pairwise ordering computed using our patch-based metric reduces structural errors.
moves the patch to its corresponding location in $G_j$. Using this homography we compute the distance $d_j$ between $p_{ij}$ and its corresponding patch $p_j = H p_{ij}$ in $G_j$ as
\[
d_j = \left\| G_{ij}[p_{ij}] - G_j[H p_{ij}] \right\|^2 , \tag{1}
\]
where $G[p]$ is the vector obtained by stacking all pixels of $G$ in a patch $p$. For computing the distance $d_i$ to a patch $p_i$ in the unwarped image $G_i$, the homography $H$ is the identity. In the following we denote the index of the respective best patch with a $\star$, i.e., the best patch $p_\star$ has the minimum patch distance $d_\star = \min\{d_i, d_j\}$.
We found that we can drastically increase robustness by
considering all patches containing a pixel for computing its
respective error instead of using the patch distances directly.
In this way we measure the error that a particular pixel introduces into the panorama and ignore the error that other
pixels in its vicinity create. Hence we accumulate the error
in each pixel as the sum of errors from all patches containing
this pixel. Let the per-pixel difference in patch $p_\star$ be
\[
\epsilon_{p_\star}(x) = \left| G_{ij}(x) - G_\star(H_\star x) \right|^2 , \tag{2}
\]
with $H_\star$ as defined above. The combined error with contributions from all patches containing that pixel is computed as
\[
\epsilon(x) = \sum_{p_\star \ni x} \epsilon_{p_\star}(x) , \tag{3}
\]
Summing over all pixels in the overlap region yields the total patch-based error
\[
\sum_{x \in G_{ij}} \epsilon(x) . \tag{4}
\]
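The following sketch illustrates the error measure of Eqs. (1)-(4) in a simplified setting where all candidate images have already been resampled into a common frame, so the per-patch homography is the identity; the patch size and the use of grayscale values in [0, 1] are our assumptions, not prescribed by the text.

```python
import numpy as np

def patch_error(G_ij, candidates, patch=7):
    """Per-pixel structural error in the spirit of Eqs. (1)-(4).

    G_ij:       warped overlap image, float array (H, W), values in [0, 1].
    candidates: list of images resampled into the same frame (e.g. G_i and
                the warped G_j), so H is the identity in this sketch.
    """
    H, W = G_ij.shape
    half = patch // 2
    err = np.zeros_like(G_ij)

    for y in range(half, H - half):
        for x in range(half, W - half):
            sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
            p = G_ij[sl]
            # Eq. (1): patch distance to each candidate, keep the best one.
            dists = [np.sum((p - c[sl]) ** 2) for c in candidates]
            best = candidates[int(np.argmin(dists))]
            # Eqs. (2)-(3): accumulate the per-pixel differences of the best
            # patch into every pixel that the patch covers.
            err[sl] += (p - best[sl]) ** 2
    return err   # Eq. (4): total error is err.sum() over the overlap region
```

This brute-force double loop is for illustration only; a practical implementation would vectorize the patch sums (e.g. with box filters).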
Figure 5: Comparison of a per-pixel SSD error to our patch-based approach. Note how the per-pixel SSD identifies large
errors in regions with no visible warping (left), while underdetecting areas with significant distortion (right).
re-used. Motion fields are computed for each output frame
individually to ensure optimal parallax removal.
We propose a solution to these issues based on two components. A weighted warp extrapolation smoothly extrapolates the motion field ui j into the non-overlap region of I j
during each pairwise warping step (see Figure 7) while remaining close to the reference projection. At the end of the
warping procedure, a global relaxation step, which is constrained by the reference projection, further ensures temporal stability of the panorama shape.
Weighted warp extrapolation. We formulate the extrapolation as an energy minimization problem of the form
\[
E(\tilde{u}_{ij}) = \int_{\bar{\Omega}_j} \left| \nabla \tilde{u}_{ij} \right|^2 \, dx , \tag{6}
\]
where $\bar{\Omega}_j = \Omega_j \setminus \Omega_{ij}$ denotes the non-overlapping image region of $I_j$ for which the extrapolated motion field $\tilde{u}_{ij}$ is to be computed. In order to minimize Eq. (6), we solve the corresponding Euler-Lagrange equation $\Delta \tilde{u}_{ij} = 0$, which is known as the Poisson equation. For the solution we assume Dirichlet boundary conditions $\tilde{u}_{ij} = u_{ij}$ along the boundary $\partial \Omega_{ij}$ of the overlap region $\Omega_{ij}$.
To smoothly attenuate the effects of the extrapolation we augment Eq. (6) with another set of Dirichlet boundary conditions $\tilde{u}_{ij} = 0$ along the level set $L_c(f) = \{x \mid f(x) = c\}$. The function $f(x)$ measures the minimum distance of a pixel $x$ from any point in $\Omega_{ij}$ [FH12]. In our experiments we set $c$ to approximately 10% of
the resolution of the output panorama. After discretization of
the equation, the resulting linear system can be solved by any
standard solver, e.g., conjugate gradient. The attenuation ensures that pixels sufficiently far away from the overlap region
remain close to their position in the reference projection.
This weighted extrapolation is performed whenever we
compute a warp from one image to another, including the
computation of the warp ordering in Section 3.2.
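A minimal sketch of this extrapolation, assuming a regular pixel grid and using a plain Jacobi iteration in place of the conjugate gradient solver mentioned above; the distance function f is approximated with a Euclidean distance transform, which is our substitute for the distance computation of [FH12].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def extrapolate_warp(u, overlap_mask, c, iters=2000):
    """Smoothly extrapolate a motion field into the non-overlap region.

    u:            motion field (H, W, 2), valid inside overlap_mask.
    overlap_mask: boolean (H, W), True inside the overlap region Omega_ij.
    c:            cut-off distance in pixels; beyond it the extrapolated
                  field is clamped to zero (second Dirichlet condition).
    """
    dist = distance_transform_edt(~overlap_mask)   # f(x): distance to Omega_ij
    band = (~overlap_mask) & (dist < c)            # region to solve on
    u_ext = np.where(overlap_mask[..., None], u, 0.0).astype(np.float64)

    for _ in range(iters):
        # Average of the four neighbours = one Jacobi step for the Laplace eq.
        avg = 0.25 * (np.roll(u_ext, 1, 0) + np.roll(u_ext, -1, 0) +
                      np.roll(u_ext, 1, 1) + np.roll(u_ext, -1, 1))
        u_ext[band] = avg[band]                    # update only the band
        # Dirichlet conditions: keep u inside the overlap, zero beyond c.
        u_ext[overlap_mask] = u[overlap_mask]
        u_ext[dist >= c] = 0.0
    return u_ext
```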
Final global relaxation. This step accounts for remaining
warp deformation and brings each computed output frame
Figure 8: A frame (03:16) from a 60 megapixel panorama with a modified configuration of the RED camera array (fields of view of the individual cameras are highlighted). From left to right, the close-ups show comparisons to the following methods:
(a) image averaging to illustrate parallax, (b) seam optimization via graph cuts [RWSG13], (c) Autopano Video (commercial
software for panoramic video generation, www.kolor.com), and (d) our result. Seam-based approaches generally try to hide
parallax artifacts by cutting through well aligned image regions, rather than actively removing parallax between views. While
such strategies work well on still images and specific types of video, for cluttered and highly dynamic scenes such unnoticeable
seams often do not exist, so that warping becomes essential. A further problem of per-frame seam computation is temporal
stability in video. In our result virtually all parallax has been removed.
4. Experiments and Results
In this section we first describe our capture setups and provide some practical tips related to preprocessing panoramic
video. We then discuss a number of results created with our
method. Panoramic video results and high resolution frames
are provided in the accompanying video and supplemental
material. Input data and results are publicly available on our
project webpage to facilitate future research in this area.
4.1. Capture Setups
We used three different camera arrays, each with varying
configurations. See Figure 2 for two exemplary implementations.
The most portable rig was built from 5 GoPro Hero3 cameras, each operating at 1920×1080 resolution with a horizontal field of view of about 94 degrees. This rig is compact and can be easily operated by a single person. Despite its small size, this rig still enabled us to produce panoramic video at resolutions of up to 3000×1500. We also built a rig consisting of 14 Lumenera LT425 machine vision cameras with Kowa 12.5mm C-mount lenses, each recording at a resolution of 2048×2048, and a combined field of view of approximately 180×140 degrees. We show sample panoramas from both of these setups in Figure 10. Finally, the highest quality footage has been captured with a large custom rig consisting of five movie-grade cameras (RED Epic-X) with Nikkor 14 mm lenses, each capturing at a resolution of 5120×2700 pixels. The corresponding videos have been captured with the rig mounted on a helicopter and to the back of a car. We show results with resolutions up to 15339×10665, with an approximate field of view of 160×140 degrees (Figures 1, 8 and 13).
4.2. Preprocessing
Due to details of the various camera hardware and APIs,
it can be a difficult task to perfectly synchronize all cameras using hardware triggers alone. We therefore precompute
Figure 9: Frame-differences between sequential frames visualize the temporal warping artifacts of different approaches (columns: [Mic14], [ZL14], our strategy). Dark regions of static content correspond to temporal instability, especially visible around the border. Images have been enhanced for visibility. Please see the accompanying video (02:10) for more comparisons.
Figure 10 shows results captured with our 14 camera array, and additional results are provided in the video. Compared to the street example, e.g., shown in Figure 1, we reduced the mutual overlap between the fields of view of the individual cameras in order to increase the overall field of view of the array. These datasets posed several considerable challenges. In particular, dynamic scene elements such as objects passing directly in front of the array, combined with camera ego-motion, create significant parallax differences between views. Despite these challenges our method
was able to produce acceptable results for the majority of the
video. We refer to the video for further validation.
A major challenge of the GoPro array (see Figure 10) was
that rolling shutter effects create significant artifacts in each
video. The situation is further complicated by the random
orientation of the cameras (Figure 2), such that the individual scanning directions and, hence, rolling shutter artifacts, created inconsistent distortions, causing some degree
of bouncing in the resulting video. The GoPro footage
therefore represented a significant challenge, but the resulting panoramic output video still looks acceptable. In the
video we compare our result to a state-of-the-art commercial
software package, which clearly struggles with the above
mentioned difficulties. The pairwise optimal selection and
global relaxation can also handle closed cyclic view configurations without modification.
(Figure 12 plot: patch-based error plotted against frame index.)
Figure 12: Error produced by different pairwise warp ordering strategies for the street sequence with 14 input views.
Red: a random warp ordering fixed over the entire sequence
produces high parallax error. Yellow: a warp ordering which
only maximizes pairwise overlap between images does not
account for parallax errors. Blue: optimal per-frame warp
ordering, computed at each frame independently, generates
a low error but causes temporal instability, see Figure 9.
Green: our strategy is to compute an optimal warp ordering
on a single reference frame, and then apply the same order to
the whole frame sequence. The green plot shows the average
error and standard deviation of our strategy over all possible
reference frames. The results demonstrate that the choice of
the reference frame to compute the ordering is not critical
and parallax removal is consistent over the entire sequence.
Magenta: in comparison, warp ordering based on SSD error (see Figure 5) produces inferior results.
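Read together with the text, this suggests a greedy flavor of ordering: on a single reference frame, repeatedly stitch the view whose pairwise warp onto the current result has the lowest patch-based error. The sketch below follows that reading; it is not the paper's exact selection procedure, and warp_pair and patch_error are hypothetical stand-ins for the flow-based pairwise warp and the measure of Eqs. (1)-(4).

```python
import numpy as np

def greedy_warp_order(ref_frames, overlaps, warp_pair, patch_error):
    """Greedy sketch of choosing a pairwise warp order on one reference frame.

    ref_frames:  dict view_id -> image of the reference frame (already in
                 the reference projection).
    overlaps:    set of (i, j) pairs of views with non-empty overlap.
    warp_pair:   hypothetical callable (ref_frames, stitched_ids, j) that
                 returns view j warped onto the already stitched views.
    patch_error: hypothetical callable returning the per-pixel error map of
                 that warp, in the spirit of Eqs. (1)-(4).
    Returns the stitching order, which is then reused for every frame.
    """
    remaining = set(ref_frames)
    first = next(iter(remaining))              # seed view; any choice works here
    order, stitched = [first], {first}
    remaining.remove(first)

    while remaining:
        best_view, best_err = None, np.inf
        for j in remaining:
            # Only consider views that overlap something already stitched.
            if not any((i, j) in overlaps or (j, i) in overlaps for i in stitched):
                continue
            warped_j = warp_pair(ref_frames, stitched, j)
            err = patch_error(ref_frames, stitched, warped_j).sum()
            if err < best_err:
                best_view, best_err = j, err
        if best_view is None:                  # no overlapping view left
            best_view = next(iter(remaining))
        order.append(best_view)
        stitched.add(best_view)
        remaining.remove(best_view)
    return order
```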
We show a corresponding result in Figure 11. We encourage the reader to refer to the accompanying video (02:10) for additional comparisons with
existing state-of-the-art methods and commercial software
such as [Mic14] and Autopano Video.
Timing. In Table 1 we report detailed running times for
each intermediate step of our algorithm at different (horizontal) pixel resolutions. All timings have been measured
on a single core of an Intel Xeon 2.20 GHz processor. As the
smooth relaxation and extrapolation can be computed at a
Figure 10: Results generated with our 14 camera array - top, (00:31) - and using 5 (randomly assembled) GoPro cameras - bottom, (01:30) - shown in Figure 2. Insets show a close-up of the reference panorama with the initial parallax, the result after
standard multiband blending, and the result after applying our method. Note the ghosting, duplication, and discontinuities in
high frequency image content which are removed in our result. Please see the video for full results.
Figure 11: Our method also handles closed 360 degree configurations; here we used 5 GoPros. Left: the reference projection
with averaged images to demonstrate ghosting, in particular around the rail, buildings, and the clouds. Right: our stitched result.
Please zoom in for details and refer to the accompanying video (03:06).
Table 1: Running times of the individual steps of our algorithm at different horizontal panorama resolutions.

Panorama Resolution   Reference Projection   Pairwise Parallax   Extrapolation   Relaxation   Blending
 512                  0.18                   0.834               0.401           1.094        0.598
1024                  0.3                    3.143               2.206           4.705        1.956
2048                  0.96                   7.641               8.546           9.23         7.102
4096                  1.95                   54.03               85.62           78.28        29.59
Figure 13: Additional result computed at 40 megapixel resolution, showing the full field of view of all five cameras as well as
close-ups into different overlap regions, where our algorithm removed parallax between the images (03:45).
It is important to note that our method can be trivially parallelized, as each frame of the output panorama is computed independently from all other frames. We believe that this is a key property for creating high resolution output, as processing an entire video volume rather than individual frames becomes intractable with such large amounts of image data. In the future we would like to investigate faster motion field estimators and more efficient implementations of our patch-based method. For this work we focused on the quality of the output videos rather than optimizing for speed.
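Since each output frame depends only on its own input frames, frame-level parallelism can be as simple as the following sketch; the worker count and the pipeline wrapper stitch_frame are illustrative assumptions, not part of the published system.

```python
from multiprocessing import Pool

NUM_FRAMES = 120   # example sequence length; every output frame is independent

def stitch_frame(frame_index):
    """Run the per-frame pipeline for one output frame: reference projection,
    pairwise warps in the precomputed order, warp extrapolation, global
    relaxation, and blending. The actual body is omitted in this sketch."""
    return frame_index

if __name__ == "__main__":
    with Pool(processes=8) as pool:            # e.g. one worker per core
        pool.map(stitch_frame, range(NUM_FRAMES))
```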
5. Conclusions
We presented an algorithm and processing pipeline for creating panoramic video from camera arrays, which focuses
on the particular challenges arising for high resolution video
input: spatiotemporally coherent output and globally minimized distortion despite considerable scene motion and parallax. We demonstrated that, using our algorithm, it is possible to create panoramic videos even from larger arrays with
challenging unstructured camera configurations and practical issues such as lack of perfect temporal synchronization
or rolling shutter. To the best of our knowledge, these issues
have not been explored in previous work.
We believe that algorithms for jointly processing video
from multiple cameras for applications such as panorama
stitching will become more important in the coming years, as
it is much easier in practice to combine cameras into arrays
than to increase the resolution and quality of a single camera.
Assembling a set of portable GoPro cameras into an array is nowadays feasible even for non-professional users, and camera arrays are currently being integrated even into mobile phones (www.pelicanimaging.com). Our aim with this work was to provide a step towards ubiquitous panoramic video capture, similar to how panoramic image capture has become part of every consumer camera.
Our input data and results are publicly available on our
project webpage.
6. Acknowledgements
We are grateful to Yael Pritch for her substantial contributions during the initial phase of this work. We thank Andreas Baumann and Carlo Coppola for their invaluable support during the construction of the machine vision camera
array, and Feng Liu and Fan Zhang for kindly providing results of their method [ZL14]. Finally, special thanks to Marianne McLean, Max Penner, and the entire production team at
Walt Disney Imagineering and ParadiseFX. This work was
supported by an SNF award 200021_143598.
References

[PCD*12] PIRK S., COHEN M. F., DEUSSEN O., UYTTENDAELE M., KOPF J.: Video enhanced gigapixel panoramas. SIGGRAPH Asia Technical Briefs (2012), 7:1–7:4.

[IAB*96] IRANI M., ANANDAN P., BERGEN J., KUMAR R., HSU S. C.: Efficient representations of video sequences and their applications. Signal Processing: Image Communication 8, 4 (1996), 327–351.

[STP12] SUMMA B., TIERNY J., PASCUCCI V.: Panorama weaving: fast and flexible seam processing. ACM Trans. Graph. 31, 4 (2012), 83.

[JT08] JIA J., TANG C.-K.: Image stitching using structure deformation. IEEE Trans. Pattern Anal. Mach. Intell. 30, 4 (2008), 617–631.

[WSZ*14] WANG O., SCHROERS C., ZIMMER H., GROSS M., SORKINE-HORNUNG A.: VideoSnapping: Interactive synchronization of multiple videos. ACM Trans. Graph. 33, 4 (2014).

[KLD*09] KOPF J., LISCHINSKI D., DEUSSEN O., COHEN-OR D., COHEN M. F.: Locally adapted projections to reduce panorama distortions. Comput. Graph. Forum 28, 4 (2009), 1083–1089.

[KSU04] KANG S. B., SZELISKI R., UYTTENDAELE M.: Seamless stitching using multi-perspective plane sweep. Tech. Rep. MSR-TR-2004-48, Microsoft Research, 2004.

[KUDC07] KOPF J., UYTTENDAELE M., DEUSSEN O., COHEN M. F.: Capturing and viewing gigapixel images. ACM Trans. Graph. 26, 3 (2007).

[LLM*11] LIN W.-Y., LIU S., MATSUSHITA Y., NG T.-T., CHEONG L. F.: Smoothly varying affine stitching. In CVPR (2011).