Interactive Acquisition of Residential Floor Plans
Young Min Kim, Jennifer Dolson, Mike Sokolsky, Vladlen Koltun, Sebastian Thrun
Stanford University
{ymkim, jldolson, mvs, vladlen, thrun}@stanford.edu
Abstract— We present a hand-held system for real-time, interactive acquisition of residential floor plans. The system integrates a commodity range camera, a micro-projector, and a button interface for user input, and allows the user to freely move through a building to capture its important architectural elements. The system uses the Manhattan world assumption, which posits that wall layouts are rectilinear. This assumption allows generating floor plans in real time, enabling the operator to interactively guide the reconstruction process and to resolve structural ambiguities and errors during the acquisition. The interactive component aids users with no architectural training in acquiring wall layouts for their residences. We show a number of residential floor plans reconstructed with the system.

I. INTRODUCTION

Acquiring an accurate floor plan of a home is a challenging task, yet it is a requirement in many situations that involve remodeling or selling a property. Original blueprints are often hard to find, especially for older residences. In practice, contractors and interior designers use point-to-point laser measurement devices to acquire a set of distance measurements. Based on these measurements, an expert creates a floor plan that respects the measurements and represents the layout of the residence.

In this paper, we present a hand-held system for indoor architectural reconstruction. The system eliminates the manual post-processing necessary for reconstructing the layout of walls in a residence. Instead, an operator with no architectural expertise can interactively guide the reconstruction process by moving freely through an interior until all walls have been observed by the system.

Our system is composed of a laptop connected to a commodity range sensor, a lightweight optical projector, and an input button interface (Figure 1, left). The real-time depth sensor is the main input modality. We use the Microsoft Kinect, a lightweight commodity device that outputs VGA-resolution range and color images at video rates. The data is processed in real time to create the floor plan by focusing on flat surfaces and ignoring clutter. The generated floor plan can be used directly for remodeling or real-estate applications, or to produce a 3D model of the interior for applications in virtual environments. In Section V, we demonstrate a number of residential wall layouts reconstructed with our system.

The attached projector is initially calibrated to have an overlapping field of view with the same image center as the depth sensor, and projects the reconstruction status onto the surface being scanned. Under normal lighting, the projector does not provide sophisticated rendering. Rather, projection allows the user to visualize the reconstruction process. The user can then detect reconstruction errors that arise due to deficiencies in the data capture path and can complete missing data in response. The user can also note which walls have been included in the model and easily resolve ambiguities with a simple input device.

II. RELATED WORK

A number of approaches have been proposed for indoor reconstruction in computer graphics, computer vision, and robotics. Real-time indoor reconstruction has recently been explored with either a depth sensor [1] or an optical camera [2]. The key to real-time performance is the fast registration of successive frames. Similar to [1], we fuse color and depth information to register frames. Furthermore, our approach extends real-time acquisition and reconstruction by allowing the operator to visualize the current reconstruction status without consulting a computer screen. By making the feedback loop immediate, the operator can resolve failures and ambiguities while the acquisition session is in progress.

Previous approaches are also limited to a dense 3-D reconstruction (registration of point cloud data) with no higher-level information, which is memory intensive. A few exceptions include [3], which detects high-level features (lines and planes) to reduce complexity and noise. These high-level structures, however, do not necessarily correspond to actual meaningful structure. In contrast, our system identifies and focuses on significant architectural elements using the Manhattan world assumption, which is based on the observation that many indoor scenes are largely rectilinear [4]. This assumption has been widely used for indoor scene reconstruction from images to overcome the inherent limitations of image data [5], [6]. Stereo methods reconstruct only the locations of image feature points, and the Manhattan world assumption successfully fills the area between the sparse feature points during a post-processing step. Similarly, our system differentiates between architectural features and miscellaneous objects in the space, produces a clean architectural floor plan, and simplifies the representation of the environment. Even with the Manhattan world assumption, however, the system still cannot fully resolve ambiguities introduced by large furniture items and irregular features in the space without user input. This interactive capability relies on the system's ability to integrate new input into a global map of the space in real time.
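To make the Manhattan world constraint concrete, the following minimal Python sketch (ours, not the authors' code; the 15-degree tolerance is an assumed value) classifies a detected plane normal as one of the three world axes or rejects it as clutter:

```python
import numpy as np

# The three Manhattan axis directions in world coordinates.
AXES = {"x": np.array([1.0, 0.0, 0.0]),
        "y": np.array([0.0, 1.0, 0.0]),
        "z": np.array([0.0, 0.0, 1.0])}

def manhattan_axis(normal, tol_deg=15.0):
    """Return the axis label ('x', 'y', or 'z') whose direction best
    matches the unit plane normal, or None if no axis is within the
    tolerance (i.e., the surface is not rectilinear)."""
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    best_label, best_angle = None, 180.0
    for label, axis in AXES.items():
        # abs() makes the test sign-invariant: walls facing +x and -x
        # belong to the same axis-parallel family.
        angle = np.degrees(np.arccos(np.clip(abs(normal @ axis), 0.0, 1.0)))
        if angle < best_angle:
            best_label, best_angle = label, angle
    return best_label if best_angle <= tol_deg else None
```

Planes that pass such a test are candidates for the wall layout; everything else is treated as clutter and ignored.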
Fig. 1. Our hand-held system is composed of a projector, a Microsoft Kinect sensor, and an input button (left). The system uses augmented reality
feedback (middle left) to project the status of the current model onto the environment and to enable real-time acquisition of residential wall layouts (middle
right). The floor plan (middle right) and visualization (right) were generated using data captured by our system.
Simplifying the representation also reduces the computational burden of processing the map. Registration of successive point clouds results in an accumulation of errors, especially for a large environment, and requires a global optimization step in order to build a consistent map. This is similar to reconstruction tasks encountered in robotic mapping and is usually solved by bundle adjustment, a costly off-line process [7], [8]. Employing the Manhattan world assumption simplifies the map construction to a one-dimensional, closed-form problem.
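The excerpt does not spell this step out, but the claim can be illustrated: if each pair of parallel planes carries a measured distance with an uncertainty, the offsets along one axis follow from a small weighted least-squares solve, independently per axis. A sketch under those assumptions (function and variable names are ours):

```python
import numpy as np

def adjust_offsets_1d(n_planes, measurements):
    """Weighted least-squares adjustment of plane offsets along one axis.

    `measurements` is a list of (i, j, dist, sigma) tuples meaning that
    offset[j] - offset[i] was measured as `dist` with standard deviation
    `sigma`. The first offset is pinned to 0 to anchor the solution.
    This is an illustrative reading of the one-dimensional problem, not
    the paper's exact formulation.
    """
    A = np.zeros((len(measurements) + 1, n_planes))
    b = np.zeros(len(measurements) + 1)
    for row, (i, j, dist, sigma) in enumerate(measurements):
        w = 1.0 / sigma                 # down-weight uncertain measurements
        A[row, i], A[row, j] = -w, w    # w * (offset_j - offset_i) = w * dist
        b[row] = w * dist
    A[-1, 0] = 1.0                      # gauge constraint: offset_0 = 0
    offsets, *_ = np.linalg.lstsq(A, b, rcond=None)
    return offsets
```

Under the Manhattan assumption the x, y, and z offsets decouple, which is what makes the problem one-dimensional.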
The augmented reality component of our system is inspired by the SixthSense project [9]. Instead of simply augmenting a user's view of the world, however, our projected output serves to guide an interactive reconstruction process. Directing the user in this way is similar to re-photography [10], where a user is guided to capture a photograph from the same viewpoint as in a previous photograph. By using a micro-projector as the output modality, our system allows the operator to focus on interacting with the environment.
III. SYSTEM OVERVIEW AND USAGE

The data acquisition process is initiated by pointing the sensor at a corner, where three mutually orthogonal planes meet. This defines the Manhattan-world coordinate system. The attached projector indicates successful initialization by overlaying blue-colored planes with white edges onto the scene (Figure 2 (a)). After the initialization, the user scans each room individually as he or she loops around it holding the device. If the movement is too fast or if there are not enough features, a red projection guides the user to recover the position of the device (Figure 2 (b)).
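For illustration, here is a minimal sketch (ours; the initialization details are not given in this excerpt) of how the three roughly orthogonal plane normals observed at the corner could define the Manhattan-world coordinate system:

```python
import numpy as np

def manhattan_frame_from_corner(n1, n2, n3):
    """Given the (approximately) mutually orthogonal unit normals of the
    three planes meeting at a corner, build a rotation mapping sensor
    coordinates into the Manhattan-world frame. SVD orthonormalization
    projects the noisy measured normals onto the nearest true rotation.
    This is an illustrative sketch, not the paper's implementation."""
    N = np.column_stack([n1, n2, n3])   # columns: measured axis directions
    U, _, Vt = np.linalg.svd(N)
    R = U @ Vt                          # nearest orthogonal matrix to N
    if np.linalg.det(R) < 0:            # enforce a right-handed frame
        U[:, -1] *= -1
        R = U @ Vt
    # Rows of R.T are the Manhattan x, y, z axes in sensor coordinates;
    # applying R.T maps sensor-frame vectors into the Manhattan frame.
    return R.T
```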
The system extracts flat surfaces that align with the Manhattan coordinate system and creates complete rectilinear polygons, even when connectivity between planes is occluded. Sometimes, the user might not want some of the extracted planes (parts of furniture or open doors) to be included in the model even if they satisfy the Manhattan-world constraint. When the user clicks the input button (left click), the extracted wall toggles between inclusion (indicated in blue) and exclusion (indicated in grey) in the model (Figure 2 (c)). As the user finishes scanning a room, he or she walks toward another room. A new rectilinear polygon is initiated by a right click. Another rectilinear polygon is similarly created by including the selected planes, and the room is correctly positioned in the global coordinate system. The model is updated in real time and stored in either a CAD format or a 3-D mesh format that can be loaded into most 3-D modeling software.

IV. DATA ACQUISITION PROCESS

At each time step t, the sensor produces a new frame of data, F^t = {X^t, I^t, P^t, T^t}. The sensor output is composed of a range image X^t (a 2-D array of depth measurements) and a color image I^t. During the acquisition process, we represent the relationship between the planes in the global map M^t and the measurements in the current frame X^t as P^t, a 2-D array of plane labels, one per pixel. T^t represents the transformation from the frame F^t, which is the measurement relative to the current sensor position, to the global coordinate system, in which the map M^t is defined.

Throughout the data capture session, the system maintains the global map M^t and the two most recent frames, F^{t-1} and F^t. Additionally, the frame with the last observed corner, F^c, is stored to recover the sensor position when it is lost. Instead of storing information from all frames, we keep the total computational and memory requirements minimal by incrementally updating the global map only with components that need to be added to the final model.

The map M^t is composed of loops of axis-parallel planes L^t_r. Each room has its own loop of planes. Each plane has its axis label (x, y, or z) and its offset value (e.g., x = x_0), as well as its left or right plane if the connectivity is observed. A plane can be selected or ignored based on user input. The selected planes are extracted from L^t_r as the loop of the room R^t_r, which can be converted into the floor plan as a 2-D rectilinear polygon. Both L^t_r and R^t_r are constrained to have alternating axis labels (x and y). For the z direction (the vertical direction), we keep only the ceiling and the floor. We also keep the sequences of observed offset values (S^x, S^y, and S^z) for each axis direction, and we store the measured distance and the uncertainty of the measurement between planes.
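To make this bookkeeping concrete, the following Python sketch mirrors the quantities named above; class and field names are ours, and the paper does not prescribe an implementation:

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Frame:
    """One sensor frame F^t = {X^t, I^t, P^t, T^t}."""
    X: np.ndarray                     # range image: 2-D array of depths
    I: np.ndarray                     # color image
    P: np.ndarray                     # per-pixel plane labels into the map
    T: np.ndarray                     # 4x4 sensor-to-global transformation

@dataclass(eq=False)
class Plane:
    """An axis-parallel plane in the global map M^t."""
    axis: str                         # 'x', 'y', or 'z'
    offset: float                     # e.g. x = x_0 for an 'x' plane
    selected: bool = True             # toggled by the user's left click
    left: Optional["Plane"] = None    # neighbors, if connectivity observed
    right: Optional["Plane"] = None

@dataclass
class Room:
    """A loop L^t_r of planes; its selected subset yields the loop R^t_r."""
    loop: list = field(default_factory=list)

@dataclass
class GlobalMap:
    """M^t: one loop per room, plus per-axis observation sequences."""
    rooms: list = field(default_factory=list)
    S: dict = field(default_factory=lambda: {"x": [], "y": [], "z": []})
```

In this reading, a right click appends a fresh Room to the map, and a left click flips the selected flag of the plane under the cursor, matching the interaction described in Section III.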
The overall reconstruction process is summarized in Figure 2. As mentioned in Sec. III, the process is initiated by extracting three mutually orthogonal planes when the user points the system at one of the corners. To detect planes

[Figure 2 flowchart. Node labels: Fetch a new frame; Initialization (corner Exists / New); Pair-wise registration (Success / Failure); Plane extraction; Global adjustment; Map update; User interaction (Left click: Select planes; Right click: Start a new room); Visual feedback: Adjust data path.]
Fig. 2. System overview and usage (Section III). When an acquisition session is initiated by observing a corner, the user is notified by a blue projection
(a). After the initialization, the system updates the camera pose by registering consecutive frames. If a registration failure occurs, the user is notified by
a red projection and is required to adjust the data capture path (b). Otherwise, the updated camera configuration is used to detect planes that satisfy the
Manhattan-world constraint in the environment and to integrate them into the global map. The user interacts with the system by selecting planes in the
space (c). When the acquisition session is completed, the acquired map is used to construct a floor plan consisting of user-selected planes.
consists of simply clicking the input button during scanning when pointing at a plane, as shown in Figure 6. If the user enters a new room, a right click of the button indicates that the user wishes to include this room and to optimize it individually. The system creates a new loop of planes, and any newly observed planes are added to the loop.

Whenever a new plane is added to L^t_r or there is user input specifying the room structure, the map update routine extracts a 2-D rectilinear polygon R^t_r from L^t_r with the help of the user input. We start by adding to R^t_r all selected planes, as well as whichever unselected planes in L^t_r are necessary to maintain alternating axis directions. Planes with observed boundary edges have priority to be added.
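Building on the Plane sketch above, the alternating-axis rule could look as follows (our reconstruction; the priority for planes with observed boundary edges is reduced to a simple look-ahead):

```python
def extract_rectilinear_loop(loop):
    """Build the wall sequence of R^t_r from the loop L^t_r.

    Keeps every user-selected plane and re-inserts an unselected plane
    only where it is needed to keep axis labels alternating
    (x, y, x, y, ...) around the room. A simplified sketch: the paper
    additionally prioritizes planes with observed boundary edges.
    """
    planar = [p for p in loop if p.axis != "z"]  # ceiling/floor stay out
    walls = []
    for k, plane in enumerate(planar):
        if plane.selected:
            walls.append(plane)
            continue
        # The next selected wall after this unselected candidate, if any.
        nxt = next((q for q in planar[k + 1:] if q.selected), None)
        # Re-insert the unselected plane only if skipping it would put
        # two walls with the same axis label next to each other.
        if walls and nxt is not None \
                and walls[-1].axis == nxt.axis and plane.axis != walls[-1].axis:
            walls.append(plane)
    return walls
```

The resulting wall sequence alternates x and y labels, so consecutive walls meet at right angles and the loop closes into a 2-D rectilinear polygon.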
V. EVALUATION

The current practice in architecture and real estate is to use a point-to-point laser device to measure distances between pairs of parallel planes. Making such measurements requires a clear, level line of sight between two planes, which may be time-consuming to find due to furniture, windows, and other obstructions. After making all the distance measurements, a user is required to manually draw a floor plan that respects the measurements. Roughly 10-20 minutes were needed to take the distance measurements in each apartment. Using our system, the data acquisition process took approximately 2-5 minutes per home to initiate, run, and generate the full floor plan. Table I summarizes the timing data for each data set. The average frame rate is 7.5 frames per second running on an Intel 2.50 GHz dual-core laptop.

In Figure 7, we visually compare the reconstructed floor plans. The floor plans in blue are reconstructed using point-to-point laser measurements, and the floor plans in red are reconstructed by our system. For each home, the topology of the reconstructed walls agrees with the manually constructed floor plan. In all cases, the detection and labeling of planar surfaces by our algorithm enabled the user to add or remove these surfaces from the model in real time, allowing the final model to be constructed using only the important architectural elements from the scene.

The overlaid floor plans in Figure 7(c) show that the relative placement of the rooms may be misaligned. This is because our global adjustment routine optimizes rooms individually, so error can accumulate in transitions between rooms. The algorithm could be extended to enforce global constraints on the relative placement of rooms, such as maintaining a certain wall thickness and/or aligning the outermost walls, but such global constraints may induce other errors.

Table I contains a quantitative comparison of the errors. The reported depth resolution of the sensor is 0.01 m at 2 m, and for each model we have an average of 0.075 m error per wall. The relative error stays in the range of 2-5%, which shows that the accumulation of small registration errors continues to increase as more frames are processed.

Fundamentally, the limitations of our method reflect those of the Kinect sensor, the processing power of the laptop, and the assumptions made in our approach. As the accuracy of depth data is worse than that of visual features, our approach exhibits larger errors compared to visual SLAM. Some of the uncertainty can be reduced by adapting approaches from the well-explored visual SLAM literature. Still, we are limited when we cannot detect meaningful features. The Kinect sensor's reported measurement range is between 1.2 and 3.5 m from an object; outside that range, data is noisy or unavailable. As a consequence, data in narrow hallways or large atriums is difficult to collect.

Another source of potential error is a user outpacing the operating rate of approximately 7.5 fps.
[Figure 7: one row per data set, house 1 through house 6; columns (a), (b), (c).]
Fig. 7. (a) Manually constructed floor plans generated from point-to-point laser measurements, (b) floor plans acquired with our system, and (c) overlay.
For house 4, some parts (pillars in a large open space, stairs, and an elevator) are ignored by the user. The system still uses the measurements from those
parts and other objects to correctly understand the relative positions of the rooms.
This frame rate already allows for a reasonable data capture pace, but with more processing power, the pace of the system could always be guaranteed to exceed normal human motion.

VI. CONCLUSION AND FUTURE WORK

We have presented an interactive system that allows a user to capture accurate architectural information and to automatically generate a floor plan. Leveraging the Manhattan world assumption, we create a representation that is tractable in real time while ignoring clutter. The current status of the reconstruction is projected onto the scanned environment to enable the user to provide high-level feedback to the system. This feedback helps overcome ambiguous situations and allows the user to interactively specify the important planes that should be included in the model.

More broadly, our interactive system can be extended to other applications in indoor environments. For example, a
TABLE I
Accuracy comparison between floor plans reconstructed by our system and manually constructed floor plans generated from point-to-point laser measurements.

data set   no. of frames   run time   fps    error (m)   error (%)
1          1465            2m 56s     8.32   0.115       4.14
2          1009            1m 57s     8.66   0.064       1.90
3          2830            5m 19s     8.88   0.053       2.40
4          1129            2m 39s     7.08   0.088       2.34
5          1533            3m 52s     6.59   0.178       3.52
6          2811            7m 4s      6.65   0.096       3.10
ave.       1795            3m 57s     7.54   0.075       2.86
Fig. 8. The system, having detected the planes in the scene, also allows
the user to interact directly with the physical world. Here the user adds a
window to the room by dragging a cursor across the wall (left). This motion
updates the internal model of the world (right).
REFERENCES

[1] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments," in ISER, 2010.
[2] R. A. Newcombe and A. J. Davison, "Live dense reconstruction with a single moving camera," in CVPR, 2010.
[3] A. P. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas, "Discovering higher level structure in visual SLAM," IEEE Transactions on Robotics, vol. 24, pp. 980–990, October 2008.
[4] J. M. Coughlan and A. L. Yuille, "Manhattan world: Compass direction from a single image by Bayesian inference," in ICCV, pp. 941–947, 1999.
[5] Y. Furukawa, B. Curless, S. Seitz, and R. Szeliski, "Reconstructing building interiors from images," in ICCV, pp. 80–87, 2009.
[6] C. A. Vanegas, D. G. Aliaga, and B. Benes, "Building reconstruction using Manhattan-world grammars," in CVPR, pp. 358–365, 2010.
[7] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment - a modern synthesis," in Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, ICCV '99, Springer-Verlag, 2000.
[8] S. Thrun, "Robotic mapping: A survey," in Exploring Artificial Intelligence in the New Millennium (G. Lakemeyer and B. Nebel, eds.), Morgan Kaufmann, 2002.
[9] P. Mistry and P. Maes, "SixthSense: A wearable gestural interface," in SIGGRAPH ASIA Art Gallery & Emerging Technologies, p. 85, 2009.
[10] S. Bae, A. Agarwala, and F. Durand, "Computational rephotography," ACM Trans. Graph., vol. 29, no. 5, 2010.
[11] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381–395, June 1981.
[12] P. Besl and N. McKay, "A method for registration of 3-D shapes," IEEE Trans. PAMI, vol. 14, pp. 239–256, 1992.
[13] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in Proc. 3DIM, 2001.