
Deep Floor Plan Recognition Using a Multi-Task Network

with Room-Boundary-Guided Attention

Zhiliang Zeng Xianzhi Li Ying Kin Yu Chi-Wing Fu


The Chinese University of Hong Kong
{zlzeng,xzli,cwfu}@cse.cuhk.edu.hk [email protected]
arXiv:1908.11025v1 [cs.CV] 29 Aug 2019

Abstract

This paper presents a new approach to recognize elements in floor plan layouts. Besides walls and rooms, we aim to recognize diverse floor plan elements, such as doors, windows and different types of rooms, in the floor layouts. To this end, we model a hierarchy of floor plan elements and design a deep multi-task neural network with two tasks: one to learn to predict room-boundary elements, and the other to predict rooms with types. More importantly, we formulate the room-boundary-guided attention mechanism in our spatial contextual module to carefully take room-boundary features into account to enhance the room-type predictions. Furthermore, we design a cross-and-within-task weighted loss to balance the multi-label tasks and prepare two new datasets for floor plan recognition. Experimental results demonstrate the superiority and effectiveness of our network over the state-of-the-art methods.

Figure 1. Our network is able to recognize walls of nonuniform thickness (see boxes 2, 4, 5), walls that meet at irregular junctions (see boxes 1, 2), curved walls (see box 3), and various room types in the layout; see Figure 2 for the legend of the color labels.

1. Introduction

To recognize floor plan elements in a layout requires the learning of semantic information in the floor plans. It is not merely a general segmentation problem since floor plans present not only the individual floor plan elements, such as walls, doors, windows, and closets, but also how the elements relate to one another, and how they are arranged to make up different types of rooms. While recognizing semantic information in floor plans is generally straightforward for humans, automatically processing floor plans and recognizing layout semantics is a very challenging problem in image understanding and document analysis.

Traditionally, the problem is solved based on low-level image processing methods [14, 2, 7] that exploit heuristics to locate the graphical notations in the floor plans. Clearly, simply relying on hand-crafted features is insufficient, since it lacks generality to handle diverse conditions.

Recent methods [11, 5, 20] for the problem have begun to explore deep learning approaches. Liu et al. [11] designed a convolutional neural network (CNN) to recognize junction points in a floor plan image and connected the junctions to locate walls. The method, however, can only locate walls of uniform thickness along XY-principal directions in the image. Later, Yamasaki et al. [20] adopted a fully convolutional network to label pixels in a floor plan; however, the method simply uses a general segmentation network to recognize pixels of different classes and ignores the spatial relations between floor plan elements and room boundary.

This paper presents a new method for floor plan recognition, with a focus on recognizing diverse floor plan elements, e.g., walls, doors, rooms, closets, etc.; see Figure 1 for two example results and Figure 2 for the legend. These elements are inter-related graphical elements with structural semantics in the floor plans. To approach the problem, we model a hierarchy of labels for the floor plan elements and design a deep multi-task neural network based on the hierarchy. Our network learns shared features from the input floor plan and refines the features to learn to recognize individual elements. Specifically, we design the spatial contextual module to explore the spatial relations between
elements via the room-boundary-guided attention mecha-
nism to avoid feature blurring, and formulate the cross-and-
within-task weighted loss to balance the labels across and
within tasks. Hence, we can effectively explore the spatial
relations between the floor plan elements to maximize the
network learning; see again the example results shown in
Figure 1, which exhibit the capability of our network.
Our contributions are threefold. First, we design a deep
multi-task neural network to learn the spatial relations be-
tween floor plan elements to maximize network learning.
Second, we present the spatial contextual module with the
room-boundary-guided attention mechanism to learn the
spatial semantic information, and formulate the cross-and-
within-task weighted loss to balance the losses for our tasks.
Lastly, we take the datasets from [11] and [10], collect additional floor plans, and prepare two new datasets with labels on various floor plan elements and room types.

Figure 2. Floor plan elements organized in a hierarchy.
2. Related Work

Traditional approaches recognize elements in floor plans based on low-level image processing. Ryall et al. [16] applied a semi-automatic method for room segmentation. Other early methods [1, 6] locate walls, doors, and rooms by detecting graphical shapes in the layout, e.g., lines, arcs, and small loops. Or et al. [15] converted bitmapped floor plans to vector graphics and generated 3D room models. Ahmed et al. [2] separated text from graphics and extracted lines of various thickness, where walls are extracted from the thicker lines and symbols are assumed to have thin lines; then, they applied such information to further locate doors and windows. Gimenez et al. [7] recognized walls and openings using heuristics, and generated 3D building models based on the detected walls and doors.

Using heuristics to recognize low-level elements in floor plans is error-prone. This motivates the development of machine learning methods [4], and very recently, deep learning methods [5, 11, 20] to address the problem. Dodge et al. [5] used a fully convolutional network (FCN) to first detect the wall pixels, and then adopted a Faster R-CNN framework to detect doors, sliding doors, and symbols such as kitchen stoves and bathtubs. Also, they employed a library tool to recognize text to estimate the room size.

Liu et al. [11] trained a deep neural network to first identify junction points in a given floor plan image, and then used integer programming to join the junctions to locate the walls in the floor plan. Due to the Manhattan assumption, the method can only handle walls that align with the two principal axes in the floor plan image. Hence, it can recognize layouts with only rectangular rooms and walls of uniform thickness. Later, Yamasaki et al. [20] trained an FCN to label the pixels in a floor plan with several classes. The classified pixels formed a graph model and were taken to retrieve houses of similar structures. However, their method adopts a general segmentation network, where it simply recognizes pixels of different classes independently, thus ignoring the spatial relations among classes in the inference.

Compared with the recent works, our method has several distinctive improvements. Technically, our method simultaneously considers multiple floor plan elements in the network; particularly, we take their spatial relationships into account and design a multi-task approach to maximize the learning of the floor plan elements in the network. Result-wise, our method is more general and capable of recognizing nonrectangular room layouts and walls of nonuniform thickness, as well as various room types; see Figure 2.

Recently, there are several other works [22, 9, 24, 21, 18] related to room layouts, but they focus on a different problem, i.e., to reconstruct 3D room layouts from photos.

3. Our Method

3.1. Goals and Problem Formulation

The objectives of this work are as follows. First, we aim to recognize various kinds of floor plan elements, which are not limited to walls but also include doors, windows, room regions, etc. Second, we target to handle rooms of nonrectangular shapes and walls of nonuniform thickness. Last, we also aim to recognize the room types in floor plans, e.g., dining room, bedroom, bathroom, etc.

Achieving these goals requires the ability to process the floor plans and find multiple nonoverlapping but spatially-correlated elements in the plans. In our method, we first organize the floor plan elements in a hierarchy (see Figure 2), where pixels in a floor plan can be identified as inside or outside, while the inside pixels can be further identified as room-boundary pixels or room-type pixels. Moreover, the room-boundary pixels can be walls, doors, or windows, whereas room-type pixels can be the living room, bathroom, bedroom, etc.; see the legend in Figure 2.
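This hierarchy maps directly to the two label sets used by the two tasks. Below is a small Python sketch of one possible encoding; the dictionary layout and class names follow Figure 2 but are illustrative assumptions, not the authors' exact label IDs.

```python
# A minimal sketch of the label hierarchy in Figure 2 (names/IDs are
# illustrative assumptions, not the authors' exact label encoding).
FLOOR_PLAN_HIERARCHY = {
    "outside": [],
    "inside": {
        "room-boundary": ["wall", "door", "window"],
        "room-type": ["living room/dining room", "washroom/bathroom",
                      "bedroom", "closet", "hall", "balcony"],
    },
}

# The two tasks predict per-pixel labels drawn from the two branches:
BOUNDARY_LABELS = ["background"] + FLOOR_PLAN_HIERARCHY["inside"]["room-boundary"]
ROOM_LABELS = ["background"] + FLOOR_PLAN_HIERARCHY["inside"]["room-type"]
```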
Figure 3. (a) Schematic diagram illustrating our deep multi-task neural network. We have a VGG encoder to extract features from the input
floor plan image. These features are shared for two subsequent tasks in the network: one for predicting the room-boundary pixels (wall,
door, and windows) and the other for predicting the room-type pixels (dining room, bedroom, etc.). Most importantly, these two tasks
have separate VGG decoders. We design the room-boundary-guided attention mechanism (blue arrows) to make use of the room-boundary
features from the decoder in the upper branch to help the decoder in the lower path to learn the contextual features (red boxes) for predicting
the room-type pixels. (b) Details of the VGG encoder and decoders. The dimensions of the features in the network are shown.

Based on the hierarchy, we design a deep multi-task network with one task to predict room-boundary elements and the other to predict room-type elements. In particular, we formulate the spatial contextual module to explore the spatial relations between elements, i.e., using the features learned for the room boundary to refine the features for learning the room types.

3.2. Network Architecture

Overall network architecture. Figure 3(a) presents the overall network architecture. First, we adopt a shared VGG encoder [17] to extract features from the input floor plan image. Then, we have two main tasks in the network: one for predicting the room-boundary pixels with three labels, i.e., wall, door, and window, and the other for predicting the room-type pixels with eight labels, i.e., dining room, washroom, etc.; see Figure 2 for details. Here, room boundary refers to the floor-plan elements that separate room regions in floor plans; it is not simply low-level edges nor the outermost border that separates the foreground and background.

Specifically, our network first learns the shared feature, common for both tasks, then makes use of two separate VGG decoders (see Figure 3(b) for the connections and feature dimensions) to perform the two tasks. Hence, the network can learn additional features for each task. To maximize the network learning, we further make use of the room-boundary context features to bound and guide the discovery of room regions, as well as their types; here, we design the spatial contextual module to process and pass the room-boundary features from the top decoder (see Figure 3(a)) to the bottom decoder to maximize the feature integration for room-type predictions.

Spatial contextual module. Figure 4 shows the network architecture of the spatial contextual module. It has two branches. The input to the top branch is the room-boundary features from the top VGG decoder (see the blue boxes in Figures 3(a) & 4), while the input to the bottom branch is the room-type features from the bottom VGG decoder (see the green boxes in Figures 3(a) & 4). See again Figure 3(a): there are four levels in the VGG decoders, and the spatial contextual module (see the dashed arrows in Figure 3(a)) is applied four times, once per level, to integrate the room-boundary and room-type features from the same level (i.e., in the same resolution) and generate the spatial contextual features; see the red boxes in Figures 3(a) & 4.
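To make the two-task layout concrete, here is a minimal PyTorch-style sketch of a shared encoder feeding two task-specific decoders. The class name TwoTaskFloorPlanNet, the channel sizes, and the layer counts are simplifying assumptions, not the authors' released implementation; skip connections and the spatial contextual module (detailed in the bullets after Figure 4) are omitted.

```python
# Minimal PyTorch-style sketch: shared VGG-like encoder with two task decoders.
# Channel sizes and layer counts are assumptions, not the paper's exact VGG setup.
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TwoTaskFloorPlanNet(nn.Module):
    def __init__(self, n_boundary=3, n_room=8):
        super().__init__()
        # Shared encoder: 4 downsampling stages.
        self.enc = nn.ModuleList([conv_block(3, 64), conv_block(64, 128),
                                  conv_block(128, 256), conv_block(256, 512)])
        # Two separate decoders, one per task.
        self.dec_boundary = nn.ModuleList([conv_block(512, 256), conv_block(256, 128),
                                           conv_block(128, 64), conv_block(64, 64)])
        self.dec_room = nn.ModuleList([conv_block(512, 256), conv_block(256, 128),
                                       conv_block(128, 64), conv_block(64, 64)])
        self.head_boundary = nn.Conv2d(64, n_boundary, 1)  # wall / door / window
        self.head_room = nn.Conv2d(64, n_room, 1)          # room types

    def forward(self, x):
        for stage in self.enc:                 # shared features for both tasks
            x = F.max_pool2d(stage(x), 2)
        fb = fr = x
        for db, dr in zip(self.dec_boundary, self.dec_room):
            fb = db(F.interpolate(fb, scale_factor=2.0))
            fr = dr(F.interpolate(fr, scale_factor=2.0))
            # Per level, the spatial contextual module would fuse fb into fr here.
        return self.head_boundary(fb), self.head_room(fr)
```

Feeding a 512 × 512 floor plan image yields two per-pixel logit maps, one for room-boundary labels and one for room-type labels, which the weighted loss in Section 3.3 supervises jointly.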
Figure 4. Our spatial contextual module with the room-boundary-guided attention mechanism, which leverages the room-boundary features
to learn the attention weights for room-type prediction. In the lower branch, we use convolutional layers with four different direction-aware
kernels to generate features for integration with the attention weights and produce the spatial contextual features (in red; see also Figure 3).
Here “C” denotes concatenation, while “X” and “+” denote element-wise multiplication and addition, respectively.

• In the top branch, we apply a series of convolutions to the room-boundary feature and reduce it to a 2D feature map as the attention weights, denoted as $a_{m,n}$ at pixel location $(m, n)$. The attention weights are learned through the convolutions rather than being fixed.

• Furthermore, we apply the attention weights to the bottom branch twice; see the "X" operators in Figure 4. The first attention is applied to compress the noisy features before the four convolutional layers with direction-aware kernels, while the second attention is applied to further suppress the blurring features. We call it the room-boundary-guided attention mechanism since the attention weights are learned from the room-boundary features. Let $f_{m,n}$ be the input feature for the first attention weight $a_{m,n}$ and $f'_{m,n}$ be the output; the X operation can then be expressed as

  $f'_{m,n} = a_{m,n} \cdot f_{m,n}$.  (1)

• In the bottom branch, as shown in Figure 4, we first apply a 3 × 3 convolution to the room-type features and then reduce it into a 2D feature map. After that, we apply the first attention to the 2D feature map, followed by four separate direction-aware kernels (horizontal, vertical, diagonal, and flipped diagonal) of k unit size to further process the feature. Taking the horizontal kernel as an example, our equation is as follows:

  $h_{m,n} = \sum_{k} \big( \alpha_{m-k,n} \cdot f'_{m-k,n} + \alpha_{m,n} \cdot f'_{m,n} + \alpha_{m+k,n} \cdot f'_{m+k,n} \big)$,  (2)

where $h_{m,n}$ is the contextual feature along the horizontal direction; $f'_{m,n}$ is the input feature (see Eq. (1)); and $\alpha$ is the weight. In our experiments, we set $\alpha$ to 1.

• In the second attention, we further apply the attention weights ($a_{m,n}$) to integrate the aggregated features:

  $f''_{m,n} = a_{m,n} \cdot \big( h_{m,n} + v_{m,n} + d_{m,n} + d'_{m,n} \big)$,  (3)

where $v_{m,n}$, $d_{m,n}$, and $d'_{m,n}$ denote the contextual features along the vertical, diagonal, and flipped diagonal directions, respectively, after the convolutions with the direction-aware kernels.
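A minimal sketch of Eqs. (1)-(3) is given below, assuming a single-channel reduction, a sigmoid on the attention map, and fixed line-shaped direction-aware kernels with α = 1 (the paper sets α to 1 but learns the reduction convolutions); the class name RoomBoundaryGuidedAttention is hypothetical, and this is not the authors' implementation.

```python
# Sketch of the room-boundary-guided attention in Eqs. (1)-(3), with alpha = 1
# and a single kernel size k (simplified assumptions; applied per decoder level).
import torch
import torch.nn as nn
import torch.nn.functional as F

def direction_kernels(k: int) -> torch.Tensor:
    """Four (2k+1)x(2k+1) line kernels: horizontal, vertical, diagonal, flipped diagonal."""
    size = 2 * k + 1
    ker = torch.zeros(4, 1, size, size)
    ker[0, 0, k, :] = 1.0                                  # horizontal
    ker[1, 0, :, k] = 1.0                                  # vertical
    ker[2, 0] = torch.eye(size)                            # diagonal
    ker[3, 0] = torch.flip(torch.eye(size), dims=[1])      # flipped diagonal
    return ker

class RoomBoundaryGuidedAttention(nn.Module):
    def __init__(self, boundary_ch, room_ch, k=3):
        super().__init__()
        self.to_attention = nn.Sequential(                 # top branch: boundary -> a_{m,n}
            nn.Conv2d(boundary_ch, 1, 3, padding=1), nn.Sigmoid())
        self.to_feature = nn.Conv2d(room_ch, 1, 3, padding=1)  # bottom branch: room -> f
        self.register_buffer("dir_kernels", direction_kernels(k))
        self.k = k

    def forward(self, boundary_feat, room_feat):
        a = self.to_attention(boundary_feat)               # learned attention weights
        f = self.to_feature(room_feat)
        f1 = a * f                                         # Eq. (1): first attention
        # Eq. (2): aggregate along four directions (line kernels approximate the sum over k).
        agg = F.conv2d(f1, self.dir_kernels, padding=self.k)   # channels: h, v, d, d'
        f2 = a * agg.sum(dim=1, keepdim=True)              # Eq. (3): second attention
        return f2                                          # spatial contextual feature
```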
Figure 5. Visual comparison of floor plan recognition results produced by our method (c&d) and by others (e-g) on the R2V dataset; note
that we have to use rectangular floor plans for comparison with Raster-to-Vector [11]. Symbol † indicates the postprocessing step.

Table 1. Comparison with Raster-to-Vector [11] on the R2V dataset. Symbol † indicates our method with postprocessing (see Section 4.1). The first numeric column is the overall accuracy; the remaining columns are per-class accuracies.

Method                | overall accu | Wall | Door & Window | Closet | Bathroom & etc. | Living room & etc. | Bedroom | Hall | Balcony
Raster-to-Vector [11] | 0.84         | 0.53 | 0.58          | 0.78   | 0.83            | 0.72               | 0.89    | 0.64 | 0.71
Ours                  | 0.88         | 0.88 | 0.86          | 0.80   | 0.86            | 0.86               | 0.75    | 0.73 | 0.86
Ours†                 | 0.89         | 0.88 | 0.86          | 0.82   | 0.90            | 0.87               | 0.77    | 0.82 | 0.93

3.3. Network Training

Datasets. As there are no public datasets with pixel-wise labels for floor plan recognition, we prepared two datasets, namely R2V and R3D. Specifically, R2V has 815 images, all from Raster-to-Vector [11], where the floor plans are mostly in rectangular shapes with uniform wall thickness. For R3D, besides the original 214 images from [10], we further added 18 floor plan images of round-shaped layouts to the data. Compared with R2V, most room shapes in R3D are irregular with nonuniform wall thickness. Here, we used Photoshop to manually label the image regions in R2V and R3D for walls, doors, bedrooms, etc. Note that we used the same label for some room regions, e.g., living room and dining room (see Figure 2), since they are usually located right next to one another without walls separating them. Such a situation can be observed in both datasets. Second, we followed the GitHub code in Raster-to-Vector [11] to group room regions, so that we can compare with their results. For the train-test split ratio, we followed the original paper [11] to split R2V into 715 images for training and 100 images for testing. For R3D, we randomly split it into 179 images for training and 53 images for testing.

Cross-and-within-task weighted loss. Each of the two tasks in our network involves multiple labels for various room-boundary and room-type elements. Since the number of pixels varies for different elements, we have to balance their contributions within each task. Also, there are generally more room-type pixels than room-boundary pixels, so we have to further balance the contributions of the two tasks. Therefore, we design a cross-and-within-task weighted loss to balance between the two tasks as well as among the floor plan elements within each task.

• Within-task weighted loss. Here, we define the within-task weighted loss in an entropy style as

  $L_{task} = \sum_{i=1}^{C} -w_i \, y_i \log p_i$,  (4)

where $y_i$ is the label of the i-th floor plan element in the floor plan and C is the number of floor plan elements in the task; $p_i$ is the prediction of the pixels for the i-th element ($p_i \in [0, 1]$); and $w_i$ is defined as follows:

  $w_i = \frac{\hat{N} - \hat{N}_i}{\sum_{j=1}^{C} (\hat{N} - \hat{N}_j)}$,  (5)

where $\hat{N}_i$ is the total number of ground-truth pixels for the i-th floor plan element in the floor plan, and $\hat{N} = \sum_{i=1}^{C} \hat{N}_i$ is the total number of ground-truth pixels over all the C floor plan elements.

• Cross-and-within-task weighted loss. $L_{rb}$ and $L_{rt}$ denote the within-task weighted losses for the room-boundary and room-type prediction tasks computed from Eq. (4), respectively, and $N_{rb}$ and $N_{rt}$ are the total numbers of network output pixels for room boundary and room type, respectively. Then, the overall cross-and-within-task weighted loss L is defined as

  $L = w_{rb} L_{rb} + w_{rt} L_{rt}$,  (6)

where $w_{rb}$ and $w_{rt}$ are weights given by

  $w_{rb} = \frac{N_{rt}}{N_{rb} + N_{rt}}$ and $w_{rt} = \frac{N_{rb}}{N_{rb} + N_{rt}}$.  (7)
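A sketch of this weighted loss under the definitions above is given below; the function names, the one-hot targets, and the convention of counting all output elements for N_rb and N_rt are assumptions, not the released code.

```python
# Sketch of the cross-and-within-task weighted loss (Eqs. (4)-(7)), assuming
# one-hot ground truth and per-pixel softmax predictions of shape (N, C, H, W).
import torch

def within_task_loss(pred, target_onehot, eps=1e-8):
    """Eqs. (4)-(5): weight each class by its ground-truth pixel count."""
    C = pred.shape[1]
    n_i = target_onehot.sum(dim=(0, 2, 3))                   # ground-truth pixels per class
    n_total = n_i.sum()
    w = (n_total - n_i) / ((C - 1) * n_total + eps)           # Eq. (5): denominator = sum_j (N - N_j)
    ce = -(target_onehot * torch.log(pred.clamp_min(eps)))    # per-pixel entropy terms
    per_class = ce.sum(dim=(0, 2, 3))
    return (w * per_class).sum()                              # Eq. (4)

def cross_task_loss(pred_boundary, gt_boundary, pred_room, gt_room):
    """Eqs. (6)-(7): balance the two tasks by their output pixel counts."""
    l_rb = within_task_loss(pred_boundary, gt_boundary)
    l_rt = within_task_loss(pred_room, gt_room)
    n_rb = float(pred_boundary.numel())    # assumed: all output elements of the boundary head
    n_rt = float(pred_room.numel())        # assumed: all output elements of the room-type head
    w_rb = n_rt / (n_rb + n_rt)            # Eq. (7)
    w_rt = n_rb / (n_rb + n_rt)
    return w_rb * l_rb + w_rt * l_rt       # Eq. (6)
```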
Figure 6. Visual comparison of floor plan recognition results produced by our method (c&d) and others (e-f) on the R3D dataset. Symbol
† indicates our method with postprocessing (see Section 4.1).

4. Experiments

4.1. Implementation Details

Network training. We trained our network on an NVIDIA TITAN Xp GPU and ran 40k iterations in total. We employed the Adam optimizer to update the parameters and used a fixed learning rate of 1e-4 to train the network. The resolution of the input floor plans is 512 × 512, for keeping the thin and short lines (such as the walls) in the floor plans. Moreover, we used a batch size of one without batch normalization, since batch normalization requires a batch size of at least 32 [19]. Also, we did not use any other normalization method. For the other existing methods in our comparison, we used the original hyper-parameters reported in their papers to train their networks. To obtain the best recognition results, we further evaluated the result every five training epochs and reported only the best one.

Network testing. Given a test floor plan image, we feed it to our network and obtain its output. However, due to the per-pixel prediction, the output may contain certain noise, so we further find connected regions bounded by the predicted room-boundary pixels to locate room regions, count the number of pixels of each predicted room type in each bounded region, and set the overall predicted type of the region as the type with the largest frequency (see Figure 5(c) & (d)). Our code and datasets are available at: https://github.com/zlzeng/DeepFloorplan.

4.2. Qualitative and Quantitative Comparisons

Comparing with Raster-to-Vector. First, we compared our method with Raster-to-Vector [11], the state-of-the-art method for floor plan recognition. Specifically, we used images from the R2V dataset to train its network and also our network. To run Raster-to-Vector, we used its original labels (which are 2D corner coordinates of rectangular bounding boxes), while for our network, we used per-pixel labels. Considering that the Raster-to-Vector network can only output 2D corner coordinates of bounding boxes, we followed the procedure presented in [11] to convert its bounding-box outputs to per-pixel labels to facilitate comparison with our method; please refer to [11] for the procedural details.

Figure 5 (c-e) shows visual comparisons between our method and Raster-to-Vector. For our method, we provide results both with (denoted with †) and without postprocessing. Raster-to-Vector already contains a simple postprocessing step to connect room regions. Comparing the results with the ground truths in (b), we can see that Raster-to-Vector tends to have poorer performance on room-boundary predictions, e.g., missing even some room regions. Our results are more similar to the ground truths, even without postprocessing. The R3D dataset contains many nonrectangular room shapes, so Raster-to-Vector performed poorly on it with many missing regions, due to its Manhattan assumption; thus, we did not report the comparisons on R3D.

For quantitative evaluation, we adopted two widely-used metrics [13], i.e., the overall pixel accuracy and the per-class pixel accuracy:

  $\text{overall accu} = \frac{\sum_i N_i}{\sum_i \hat{N}_i}$ and $\text{class accu}(i) = \frac{N_i}{\hat{N}_i}$,  (8)

where $\hat{N}_i$ and $N_i$ are the total numbers of the ground-truth pixels and the correctly-predicted pixels for the i-th floor plan element, respectively. Table 1 shows the quantitative comparison results on the R2V dataset. From the results, we can see that our method achieves higher accuracies for most floor plan elements, and the postprocessing could further improve our performance.

Comparing with segmentation networks. To evaluate how general segmentation networks perform for floor plan recognition, we further compare our method with two recent segmentation networks, DeepLabV3+ [3] and PSPNet [23]. For a fair comparison, we trained their networks, as well as our network, on the R2V dataset and also on the R3D dataset, and adjusted their hyper-parameters to obtain the best recognition results. Figures 5 & 6 present visual comparisons with PSPNet and DeepLabV3+ on testing floor plans from R2V and R3D, respectively. Due to space limitations, please see our supplementary material for the results of PSPNet and DeepLabV3+ with postprocessing. From the figures, we can see that their results tend to contain noise, especially for complex room layouts and small elements like doors and windows. Since these elements usually form the room boundaries between room regions, the errors further affect the room-type predictions. Please see the supplementary material for more visual comparison results.

Table 2 reports the quantitative comparison results for the various methods with and without postprocessing, in terms of the overall and per-class accuracy, on both the R2V and R3D datasets. Comparing with DeepLabV3+ and PSPNet, our method performs better for most floor plan elements, even without postprocessing, showing its superiority over these general-purpose segmentation networks. Note that our postprocessing step assumes plausible room-boundary predictions, so it typically fails to enhance results with poor room-boundary predictions; see the results in Figure 6.
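The metrics in Eq. (8), together with the mean IoU reported in Table 2, can be computed as in the following sketch; the function name and the NumPy-based implementation are illustrative (the paper follows the code released with [13]).

```python
# Sketch of the evaluation metrics: overall / per-class pixel accuracy (Eq. (8))
# and mean IoU, from integer label maps of the same shape.
import numpy as np

def pixel_metrics(pred, gt, num_classes):
    overall = float((pred == gt).sum()) / gt.size
    class_acc, ious = [], []
    for c in range(num_classes):
        gt_c, pred_c = (gt == c), (pred == c)
        n_hat = gt_c.sum()                       # ground-truth pixels of class c
        n_correct = (gt_c & pred_c).sum()        # correctly-predicted pixels of class c
        union = (gt_c | pred_c).sum()
        if n_hat > 0:
            class_acc.append(n_correct / n_hat)  # class accu(i) in Eq. (8)
        if union > 0:
            ious.append(n_correct / union)
    return overall, class_acc, float(np.mean(ious))
```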
Table 2. Comparison with DeepLabV3+ and PSPNet. Besides the class accuracy, we further followed the GitHub code of [13] to compute the mean IoU metric; see the last row. The values inside () indicate the performance after postprocessing. Note that the R2V dataset contains floor plans that are mostly rectangular in shape, while the R3D dataset contains a much richer variety of floor plan shapes. Rows from "wall" to "balcony" are per-class accuracies.

Metric             | R3D: Ours   | R3D: DeepLabV3+ [3] | R3D: PSPNet [23] | R2V: Ours   | R2V: DeepLabV3+ [3] | R2V: PSPNet [23]
overall accu       | 0.89 (0.90) | 0.85 (0.83)         | 0.84 (0.81)      | 0.89 (0.90) | 0.88 (0.87)         | 0.88 (0.88)
wall               | 0.98 (0.98) | 0.93 (0.93)         | 0.91 (0.91)      | 0.89 (0.89) | 0.80 (0.80)         | 0.84 (0.84)
door-and-window    | 0.83 (0.83) | 0.60 (0.60)         | 0.54 (0.54)      | 0.89 (0.89) | 0.72 (0.72)         | 0.76 (0.76)
closet             | 0.61 (0.54) | 0.24 (0.048)        | 0.45 (0.086)     | 0.81 (0.92) | 0.78 (0.85)         | 0.80 (0.71)
bathroom & etc.    | 0.81 (0.78) | 0.76 (0.57)         | 0.70 (0.50)      | 0.87 (0.93) | 0.90 (0.90)         | 0.90 (0.84)
living room & etc. | 0.87 (0.93) | 0.76 (0.90)         | 0.76 (0.89)      | 0.88 (0.91) | 0.85 (0.84)         | 0.83 (0.90)
bedroom            | 0.75 (0.79) | 0.56 (0.40)         | 0.55 (0.40)      | 0.83 (0.91) | 0.82 (0.65)         | 0.86 (0.92)
hall               | 0.59 (0.68) | 0.72 (0.44)         | 0.61 (0.23)      | 0.68 (0.84) | 0.55 (0.87)         | 0.78 (0.81)
balcony            | 0.44 (0.49) | 0.08 (0.0027)       | 0.41 (0.11)      | 0.90 (0.92) | 0.87 (0.45)         | 0.87 (0.82)
mean IoU           | 0.63 (0.66) | 0.50 (0.44)         | 0.50 (0.41)      | 0.74 (0.76) | 0.69 (0.67)         | 0.70 (0.69)
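Relatedly, the † postprocessing reported in Tables 1 and 2 (finding regions bounded by the predicted room-boundary pixels and assigning each region its majority predicted room type, as described in Section 4.1) might look like the following sketch; the function name, the SciPy-based connected-component labeling, and the boolean-mask inputs are assumptions, not the authors' code.

```python
# Sketch of the "†" postprocessing: majority-vote room type per bounded region.
import numpy as np
from scipy import ndimage

def postprocess(room_type_pred, boundary_pred):
    """room_type_pred: (H, W) int labels; boundary_pred: (H, W) bool wall/door/window mask."""
    free_space = ~boundary_pred                       # pixels not on a room boundary
    regions, n_regions = ndimage.label(free_space)    # connected regions bounded by boundaries
    refined = room_type_pred.copy()
    for r in range(1, n_regions + 1):
        mask = (regions == r)
        labels, counts = np.unique(room_type_pred[mask], return_counts=True)
        refined[mask] = labels[np.argmax(counts)]     # majority vote within the region
    return refined
```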

Comparing with an edge detection method. To show that room boundaries (i.e., wall, door, and window) are not merely edges in the floor plans but structural elements with semantics, we further compare our method with a state-of-the-art edge detection network [12] (denoted as RCF) on detecting wall elements in floor plans. Here, we re-trained RCF using our wall labels, separately on the R2V and R3D datasets; since RCF outputs a per-pixel probability (∈ [0, 1]) on wall prediction, we need a threshold (denoted as $t_{RCF}$) to locate the wall pixels from its results. In our method, we extract a binary map from our network output for wall pixels; see Figure 2 (bottom) for an example.

To quantitatively compare the binary maps produced by RCF and our method, we employ the F-measure [8], a commonly-used metric, which is expressed as

  $F_\beta = \frac{(1+\beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall}$,  (9)

where Precision and Recall are the ratios of the correctly-predicted wall pixels over all the predicted wall pixels and over all the ground-truth wall pixels, respectively. To account for the fact that we need $t_{RCF}$ to threshold RCF's results, we extend $F_\beta$ into $F_\beta^{max}$ and $F_\beta^{mean}$ in the evaluations:

  $F_\beta^{max} = \frac{1}{M} \sum_{p=1}^{M} \tilde{F}_\beta^{\,p}$ and $F_\beta^{mean} = \frac{1}{MT} \sum_{p=1}^{M} \sum_{t=0}^{T-1} F_\beta^{\,p}\!\left(\frac{t}{T-1}\right)$,

where M is the total number of testing floor plans; $\tilde{F}_\beta^{\,p}$ is the best $F_\beta$ on the p-th test input over T different $t_{RCF}$ values ranged in [0, 1]; and $F_\beta^{\,p}(\frac{t}{T-1})$ is the $F_\beta$ on the p-th test input using $t_{RCF} = \frac{t}{T-1}$. In our implementation, as suggested by previous work [8], we empirically set $\beta^2$ = 0.3 and T = 256. Note that $F_\beta^{max}$ and $F_\beta^{mean}$ are the same for the binary maps produced by our method, since they do not require $t_{RCF}$. Table 3 reports the results, clearly showing that our method outperforms RCF on detecting the walls. Having said that, simply detecting edges in the floor plan images is insufficient for floor plan recognition.

Table 3. Comparison with a state-of-the-art edge detection network (RCF [12]) on detecting the walls in floor plans.

Method   | R2V: Fβmax | R2V: Fβmean | R3D: Fβmax | R3D: Fβmean
RCF [12] | 0.62       | 0.56        | 0.68       | 0.58
Ours     | 0.85       | 0.85        | 0.95       | 0.95

4.3. Architecture Analysis on our Network

Next, we present an architecture analysis on our network by comparing it with the following two baseline networks:

• Baseline #1: two separate single-task networks. The first baseline breaks the problem into two separate single-task networks, one for room-boundary prediction and the other for room-type prediction, with two separate sets of VGG encoders and decoders. Hence, there are no shared features and also no spatial contextual modules compared to our full network.

• Baseline #2: without the spatial contextual module. The second baseline is our full network with the shared features but without the spatial contextual module.

Table 4 shows the comparison results, where we trained and tested each network using the R3D dataset [10]. From the results, we can see that our full network outperforms the two baselines, indicating that the multi-task scheme with the shared features and the spatial contextual module both help improve the floor plan recognition performance.

Table 4. A comparison of our full network with Baseline network #1 and Baseline network #2 using the R3D dataset.

Metric             | Baseline #1 | Baseline #2 | Our full network
overall accu       | 0.82        | 0.85        | 0.89
average class accu | 0.72        | 0.72        | 0.80

4.4. Analysis on the Spatial Contextual Module

An ablation analysis of the spatial contextual module (see Figure 4 for details) is presented here.
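For reference, the F-measure protocol from the edge-detection comparison above (Eq. (9) with β² = 0.3, and the Fβmax / Fβmean variants over T = 256 thresholds) can be sketched as follows; the function names and the list-based looping are illustrative assumptions, not the authors' evaluation script.

```python
# Sketch of F_beta (Eq. (9)) and the F_beta^max / F_beta^mean variants used in Table 3.
import numpy as np

def f_beta(pred_mask, gt_mask, beta2=0.3, eps=1e-8):
    tp = float((pred_mask & gt_mask).sum())
    precision = tp / (pred_mask.sum() + eps)
    recall = tp / (gt_mask.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def f_max_and_mean(prob_maps, gt_masks, T=256, beta2=0.3):
    """prob_maps: list of (H, W) probabilities in [0, 1]; gt_masks: list of (H, W) bool."""
    thresholds = [t / (T - 1) for t in range(T)]
    per_image = [[f_beta(p >= t, g, beta2) for t in thresholds]
                 for p, g in zip(prob_maps, gt_masks)]
    f_max = float(np.mean([max(scores) for scores in per_image]))    # best threshold per image
    f_mean = float(np.mean([np.mean(scores) for scores in per_image]))  # average over thresholds
    return f_max, f_mean
```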
Figure 7. Reconstructed 3D models from our recognition results.

Table 5. Ablation study on the spatial contextual module.

Metric             | No attention | No direction-aware kernels | Our complete version
overall accu       | 0.86         | 0.87                       | 0.89
average class accu | 0.74         | 0.77                       | 0.80

• No attention: the room-boundary-guided attention mechanism (see the top branch in Figure 4) is removed from the spatial contextual module.

• No direction-aware kernels: the convolution layers with the four direction-aware kernels in the spatial contextual module are removed. Only the room-boundary-guided attention mechanism is applied.

Table 5 shows the comparison results between the above schemes and the full method (i.e., with both attention and direction-aware kernels). Again, we trained and tested on the R3D dataset [10]. From Table 5, we can see that the spatial contextual module performs the best when equipped with both the attention mechanism and the direction-aware kernels.

4.5. Discussion

Application: 3D model reconstruction. Here, we take our floor plan recognition results to reconstruct 3D models. Figure 7 shows several examples of the constructed 3D floor plans. Our method is able to recognize walls of nonuniform thickness and a wide variety of shapes. It thus enables us to construct 3D room boundaries of various shapes, e.g., curved walls in the floor plan. One may notice that we only reconstruct the walls in 3D in Figure 7. In fact, we may further reconstruct the doors and windows, since our method has also recognized them in the layouts. For more reconstruction results, please refer to our supplementary material.

Limitations. Here, we discuss two challenging situations for which our method fails to produce plausible predictions. First, our network may fail to differentiate inside and outside regions, in case there are some special room structures in the floor plan, e.g., long and double-bended corridors. Second, our network may wrongly recognize large icons (e.g., a compass icon) in floor plans as wall elements. To address these issues, we believe that more data is needed for the network to learn a wider variety of floor plans and their semantics. Also, we may explore weakly-supervised learning for the problem to avoid the tedious annotations; please see the supplemental material for example failure cases.

5. Conclusion

This paper presents a new method for recognizing floor plan elements. There are three key contributions in this work. First, we explore the spatial relationship between floor plan elements, model a hierarchy of floor plan elements, and design a multi-task network to learn to recognize room-boundary and room-type elements in floor plans. Second, we further take the room-boundary features to guide the room-type prediction by formulating the spatial contextual module with the room-boundary-guided attention mechanism. Further, we design a cross-and-within-task weighted loss to balance the losses within each task and across tasks. In the end, we also prepared two datasets for floor plan recognition and extensively evaluated our network in various aspects. Results show the superiority of our network over the others in terms of the overall accuracy and Fβ metrics. In the future, we plan to further extract the dimension information in the floor plan images, and learn to recognize the text labels and symbols in floor plans.

References

[1] Christian Ah-Soon and Karl Tombre. Variations on the analysis of architectural drawings. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1997.
[2] Sheraz Ahmed, Marcus Liwicki, Markus Weber, and Andreas Dengel. Improved automatic analysis of architectural floor plans. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2011.
[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
[4] Lluís-Pere de las Heras, Joan Mas, Gemma Sanchez, and Ernest Valveny. Wall patch-based segmentation in architectural floorplans. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2011.
[5] Samuel Dodge, Jiu Xu, and Björn Stenger. Parsing floor plan images. In International Conference on Machine Vision Applications (MVA). IEEE, 2017.
[6] Philippe Dosch, Karl Tombre, Christian Ah-Soon, and Gérald Masini. A complete system for the analysis of architectural drawings. International Journal on Document Analysis and Recognition, 3(2):102–116, 2000.
[7] Lucile Gimenez, Sylvain Robert, Frédéric Suard, and Khaldoun Zreik. Automatic reconstruction of 3D building models from scanned 2D floor plans. Automation in Construction, 63:48–56, 2016.
[8] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip H. S. Torr. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):815–828, 2018.
[9] Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. RoomNet: End-to-end room layout estimation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[10] Chenxi Liu, Alex Schwing, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Rent3D: Floor-plan priors for monocular layout estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[11] Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. Raster-to-Vector: Revisiting floorplan transformation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[12] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[14] Sébastien Macé, Hervé Locteau, Ernest Valveny, and Salvatore Tabbone. A system to detect rooms in architectural floor plan images. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, 2010.
[15] Siu-Hang Or, Kin-Hong Wong, Ying-Kin Yu, and Michael Ming-Yuan Chang. Highly automatic approach to architectural floorplan image understanding and model generation. In Proc. of Vision, Modeling, and Visualization 2005 (VMV-2005), pages 25–32, 2005.
[16] Kathy Ryall, Stuart Shieber, Joe Marks, and Murray Mazer. Semi-automatic delineation of regions in floor plans. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1995.
[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[18] Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. HorizonNet: Learning room layout with 1D representation and pano stretch data augmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[19] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision (ECCV), 2018.
[20] Toshihiko Yamasaki, Jin Zhang, and Yuki Takada. Apartment structure estimation using fully convolutional networks and graph model. In Proceedings of the 2018 ACM Workshop on Multimedia for Real Estate Tech, 2018.
[21] Shang-Ta Yang, Fu-En Wang, Chi-Han Peng, Peter Wonka, Min Sun, and Hung-Kuo Chu. DuLa-Net: A dual-projection network for estimating room layouts from a single RGB panorama. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[22] Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision (ECCV), 2014.
[23] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. LayoutNet: Reconstructing the 3D room layout from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
