Roma Pepper
Sameer Agarwal1, Noah Snavely2, Ian Simon1, Steven M. Seitz1, Richard Szeliski3
Abstract
We present a system that can match and reconstruct 3D
scenes from extremely large collections of photographs such
as those found by searching for a given city (e.g., Rome) on
Internet photo sharing sites. Our system uses a collection
of novel parallel distributed matching and reconstruction
algorithms, designed to maximize parallelism at each stage
in the pipeline and minimize serialization bottlenecks. It is
designed to scale gracefully with both the size of the problem
and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage
of the pipeline and report on which ones work best in a
parallel computing environment. Our experimental results
demonstrate that it is now possible to reconstruct cities consisting of 150K images in less than a day on a cluster with
500 compute cores.
1. Introduction
Entering the search term "Rome" on flickr.com returns more than two million photographs. This collection represents an increasingly complete photographic record of the
city, capturing every popular site, facade, interior, fountain,
sculpture, painting, cafe, and so forth. Most of these photographs are captured from hundreds or thousands of viewpoints and illumination conditions; Trevi Fountain alone
has over 50,000 photographs on Flickr. Exciting progress
has been made on reconstructing individual buildings or
plazas from similar collections [16, 17, 8], showing the potential of applying structure from motion (SfM) algorithms
on unstructured photo collections of up to a few thousand
photographs. This paper presents the first system capable of
city-scale reconstruction from unstructured photo collections.
We present models that are one to two orders of magnitude
larger than the next largest results reported in the literature.
Furthermore, our system enables the reconstruction of data
sets of 150,000 images in less than a day.
To whom correspondence should be addressed. Email: sagarwal@cs.washington.edu
City-scale 3D reconstruction has been explored previously in the computer vision literature [12, 2, 6, 21] and is
now widely deployed, e.g., in Google Earth and Microsoft's
Virtual Earth. However, existing large scale structure from
motion systems operate on data that comes from a structured
source, e.g., aerial photographs taken by a survey aircraft
or street side imagery captured by a moving vehicle. These
systems rely on photographs captured using the same calibrated camera(s) at a regular sampling rate and typically
leverage other sensors such as GPS and Inertial Navigation
Units, vastly simplifying the computation.
Images harvested from the web have none of these simplifying characteristics. They are taken from a variety of
different cameras, in varying illumination conditions, have
little to no geographic information associated with them, and
in many cases come with no camera calibration information.
The same variability that makes these photo collections so
hard to work with for the purposes of SfM also makes them
an extremely rich source of information about the world. In
particular, they specifically capture things that people find interesting, i.e., worthy of photographing, and include interiors
and artifacts (sculptures, paintings, etc.) as well as exteriors
[14]. While reconstructions generated from such collections
do not capture a complete covering of scene surfaces, the
coverage improves over time, and can be complemented by
adding aerial or street-side images.
The key design goal of our system is to quickly produce
reconstructions by leveraging massive parallelism. This
choice is motivated by the increasing prevalence of parallel
compute resources both at the CPU level (multi-core) and
the network level (cloud computing). At today's prices, for
example, you can rent 1000 nodes of a cluster for 24 hours
for $10,000 [1].
The cornerstone of our approach is a new system for large-scale distributed computer vision problems, which we will
be releasing to the community. Our pipeline draws largely
from the existing state of the art of large scale matching and
SfM algorithms, including SIFT, vocabulary trees, bundle
adjustment, and other known techniques. For each stage
in our pipeline, we consider several alternatives, as some
2. System Design
Our system runs on a cluster of computers (nodes), with
one node designated as the master node. The master node is
responsible for the various job scheduling decisions.
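The text above does not say how the master node makes its scheduling decisions. Purely as an illustration, the sketch below shows one common policy: sort jobs by estimated cost and hand each one to the currently least-loaded node. The function name `assign_jobs`, the job names, and the cost estimates are all hypothetical and are not taken from the system described here.

```python
import heapq


def assign_jobs(job_costs, num_workers):
    """Greedy longest-job-first assignment to the least-loaded worker.

    job_costs: dict mapping a job name to an estimated cost (hypothetical).
    Returns (assignment, loads), both keyed by worker index.
    """
    # Min-heap of (current load, worker index).
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Largest jobs first; ties broken by job name for determinism.
    ordered = sorted(job_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for job, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(job)
        heapq.heappush(heap, (load + cost, worker))

    loads = {w: sum(job_costs[j] for j in jobs)
             for w, jobs in assignment.items()}
    return assignment, loads


# Toy example: five jobs spread over two worker nodes.
job_costs = {"a": 5.0, "b": 3.0, "c": 3.0, "d": 2.0, "e": 1.0}
assignment, loads = assign_jobs(job_costs, 2)
print(assignment)
print(loads)
```

A greedy policy of this kind needs only a cost estimate per job and keeps the master's bookkeeping small, which is why it is a frequent default for a single-master cluster; whether it matches the scheduler used here is not stated in the text.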
In this section, we describe the detailed design of our
system, which can naturally be broken up into three distinct
phases: (1) pre-processing (Section 2.1), (2) matching (Section 2.2), and (3)
geometric estimation (Section 2.4).
[Figure: system pipeline diagram, panels (a)-(m). Recoverable labels: I0...IN, SIFT, F0...FN, VT, VQ, TF0...TFN, TFIDF0...TFIDFN, Distribute, Verify, Merge, Expand, CC, Rounds 1-3.]
2.2.3
2.2.4 Query Expansion
[Figure 2 panels: Initial Matches, CC Merge, Query Expansion 1, Query Expansion 4, Skeletal Set]
Figure 2. The evolution of the match graph as a function of the rounds of matching, and the skeletal set corresponding to it. Notice how the
second round of matching merges the two components into one, and how rapidly the query expansion increases the density of the within
component connections. The last column shows the skeletal set corresponding to the final match graph. The skeletal sets algorithm can
break up connected components found during the match phase if it determines that a reliable reconstruction is not possible, which is what
happens in this case.
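The two effects described in the caption, a new match merging two components and query expansion densifying connections within a component, can be sketched on a toy match graph. This is a minimal illustration, not the system's implementation: the function names are hypothetical, and the expansion rule used here (propose a pair of images when both already match a common third image) is an assumption about how query expansion generates candidates. In a real pipeline each proposed pair would still need geometric verification before becoming an edge.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(num_images, edges):
    """Return a component label for every image, using union-find."""
    parent = list(range(num_images))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return [find(i) for i in range(num_images)]


def query_expansion_candidates(edges):
    """Propose unmatched pairs that share at least one matched neighbor."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    existing = {frozenset(e) for e in edges}
    candidates = set()
    for adjacent in neighbors.values():
        for a, b in combinations(sorted(adjacent), 2):
            if frozenset((a, b)) not in existing:
                candidates.add((a, b))
    return sorted(candidates)


# Toy match graph with two components: {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (3, 4)]
num_components = len(set(connected_components(5, edges)))

# A single new verified match between the components merges them.
edges.append((2, 3))
merged = len(set(connected_components(5, edges)))

# Candidate pairs that would densify the merged component.
proposals = query_expansion_candidates(edges)
print(num_components, merged, proposals)
```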
2.4.1 Bundle Adjustment
                                                                        Time (hrs)
Data set    Images    Cores  Registered  Pairs verified  Pairs found  Matching  Skeletal sets  Reconstruction
Dubrovnik   57,845    352    11,868      2,658,264       498,982      5         1              16.5
Rome        150,000   496    36,658      8,825,256       2,712,301    13        1              7
Venice      250,000   496    47,925      35,465,029      6,119,207    27        21.5           16.5
Table 1. Matching and reconstruction statistics for the three data sets.
ment trick. The first algorithm has low time complexity per
iteration, but uses more LM iterations, while the second one
converges faster at the cost of more time and memory per
iteration. The resulting code uses significantly less memory
than SBA and runs up to an order of magnitude faster. The
exact runtime and memory savings depend upon the sparsity
structure of the linear system involved.
3. Experiments
We report here the results of running our system on three
city data sets downloaded from flickr.com: Dubrovnik,
Croatia; Rome; and Venice, Italy. Figures 3(a), 3(b) and 3(c)
show reconstructions of the largest connected components in
these data sets. Due to space considerations, only a sample
of the results is shown here; we encourage the reader to
visit http://grail.cs.washington.edu/rome,
where the complete results are posted, and additional results,
data, and code will be posted over time.
The experiments were run on a cluster of 62 nodes with
dual quad core processors each, on a private network with 1
GB/sec Ethernet interfaces. Each node was equipped with
32 GB of RAM and 1 TB of local hard disk space. The
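As a quick consistency check on the cluster description, the short calculation below derives the total core count, aggregate memory, and aggregate local disk from the per-node figures given above. Reading "dual quad core" as two processors with four cores each gives 496 cores, which agrees with the Cores column of Table 1 for Rome and Venice. The variable names are illustrative only.

```python
# Per-node figures as stated in the text.
nodes = 62
processors_per_node = 2   # "dual"
cores_per_processor = 4   # "quad core"
ram_gb_per_node = 32
disk_tb_per_node = 1

# Aggregate cluster resources.
total_cores = nodes * processors_per_node * cores_per_processor
total_ram_gb = nodes * ram_gb_per_node
total_disk_tb = nodes * disk_tb_per_node

print(total_cores, total_ram_gb, total_disk_tb)
```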
4. Discussion
At the time of writing this paper, searching on Flickr.com
for the keywords "Rome" or "Roma" results in over
2,700,000 images. Our aim is to be able to reconstruct as
much of the city as possible from these photographs in 24
hours. Our current system is about an order of magnitude
away from this goal. We believe that our current approach
can be scaled to problems of this size. However, a number
(a) Dubrovnik: Four different views and associated images from the largest connected component. Note that the component captures the entire
old city, with both street-level and roof-top detail. The reconstruction consists of 4,585 images and 2,662,981 3D points with 11,839,682 observed
features.
Data set    CC1     CC2     Skeletal Set  Reconstructed
Dubrovnik   6,076   4,619   977           4,585
Rome        7,518   2,106   254           2,097
Venice      20,542  14,079  1,801         13,699
Table 2. Reconstruction statistics for the largest connected components in the three data sets. CC1 is the size of the largest connected
component after matching, CC2 is the size of the largest component
after skeletal sets. The last column lists the number of images in
the final reconstruction.
Acknowledgement
This work was supported in part by SPAWAR, NSF grant IIS-0811878, the Office of Naval Research, the University of Washington Animation Research Labs, and Microsoft. We also thank
Microsoft Research for generously providing access to their HPC
cluster. The authors would also like to acknowledge discussions
with Steven Gribble, Aaron Kimball, Drew Steedly and David
Nister.
References
[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2.