A Weak Structure Model For Regular Pattern Recognition Applied To Facade Images
Fig. 1. Hierarchy in probability model, numbers in brackets are section references.
2 Structural Recognition Framework
We consider the problem of recognizing elements in an image, like windows in a
facade. Our model parameters (variables) consist of complexity k (the number of
windows), shape attributes A (i.e. size, aspect), location attributes X (window
center locations) and element neighborhood relation N. The recognition task
can then be formulated as follows: Given image data I, we search for model
parameters $\theta = (k, A, X, N)$ by finding the mode of the joint distribution
$p(I, \theta)$:
$$\theta^* = \arg\max_{\theta}\; p(I \mid \theta)\, p(\theta), \tag{1}$$
which is computed with Bayes' theorem from the data likelihood $p(I \mid \theta)$ and
the structural model prior $p(\theta)$. We will decompose our probability model
hierarchically as shown in Fig. 1 and propose pdfs specific to the task of window
detection in facade images. Then we can apply the stochastic RJMCMC framework to
find the optimal value $\theta^*$.
[...] $x_i \in \mathbb{R}^2;\ i = 1, \ldots, k$ [...] $\prod_{i=1}^{k}$ [...]
$p(w_i \mid h_i)$ is the aspect ratio with distribution
$p(w_i \mid h_i) = \beta\!\left(\frac{w_i}{w_i + h_i}, \alpha_r, \beta_r\right)$.
When any of the windows overlap with another, we set the unit function
$\mathbf{1}(A \mid X) = 0$, effectively avoiding such a window configuration.
To model constraints on heights H, we introduce a set of latent variables
$h_c$, one for each component c of graph G(X) with neighborhood N. The height
similarity within components is enforced in
$$p(H \mid k, N, X) = \prod_{c} \Big[\, p(h_c) \prod_{i \in V_c} p(h_i \mid h_c) \Big], \tag{3}$$
where c is from the set of all components, $V_c$ is the set of windows in the
component c and $p(h_c) = \beta(h_c, \alpha_h, \beta_h)$ is the common height prior.
Each height in a component c should most probably be equal to $h_c$, which is
expressed by $p(h_i \mid h_c) = \mathcal{N}(h_i - h_c, 0, \sigma_h)$.
3.2 Structural Prior
The structure prior p(k, N, X) = p(N, X|k)p(k) combines structural regularity
p(N, X|k) and complexity p(k).
Structural Regularity. In order to model multiple assumptions on p(N, X|k),
we express it as a probability mixture [12]:
$$p(N, X \mid k) = \lambda_1\, p_a(X \mid N)\, p(N) + \lambda_2\, p_s(X \mid N)\, p(N) + \lambda_3\, p_c(N \mid X)\, p(X), \tag{4}$$
where $\sum_{i=1}^{3} \lambda_i = 1$, $\lambda_{1,2,3} = \frac{1}{3}$ and k was omitted in
$p(\cdot)$ for simplicity. We assume element locations in p(X) are mutually
independent and uniformly distributed in the image. The neighborhood prior
$p(N) = \prod_{(u,v)} p(l_{uv})$ takes into account the possibility of suppressing an
edge, where $p(l_{uv} = 0) = p_{sup}$, $p(l_{uv} = 1) = 1 - p_{sup}$ and $p_{sup} = 0.01$
is the probability of a suppressed edge.
Alignment. The first assumption on the position of elements is that neighboring
elements should be horizontally or vertically aligned. We model this by measuring
angles $\varphi(x_u, x_v) \in (-\frac{\pi}{4}, \frac{\pi}{4})$ between the line
connecting element locations $x_u x_v$ and the horizontal ($o_{uv} = h$) resp.
vertical ($o_{uv} = v$) direction, and express them in
$$p_a(X \mid N) = \prod_{(u,v) \in D(X)} p(x_u, x_v \mid l_{uv}), \tag{5}$$
where $p(x_u, x_v \mid l_{uv} = 1) = \beta(\bar{\varphi}(x_u, x_v), \alpha_\varphi)$,
$\alpha_\varphi = 50$ and
$\bar{\varphi}(x_u, x_v) = \frac{2}{\pi}\left(\varphi_{uv} + \frac{\pi}{4}\right) \in (0, 1)$
is the angle normalized to the unit interval. The probability in the case of a
suppressed edge is $p(x_u, x_v \mid l_{uv} = 0) = p_{a0}$.
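The alignment term for a single unsuppressed edge can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes that the density on the normalized angle is a symmetric Beta density with shape parameter 50, and the function names `beta_pdf`, `normalized_angle` and `alignment_density` are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Beta density on (0, 1), computed in the log domain for stability."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def normalized_angle(xu, xv):
    """Deviation of segment xu-xv from the nearest axis, mapped to [0, 1]."""
    phi = math.atan2(xv[1] - xu[1], xv[0] - xu[0])  # angle to the horizontal
    # Fold the angle to the deviation from the nearest axis, in [-pi/4, pi/4).
    phi = (phi + math.pi / 4) % (math.pi / 2) - math.pi / 4
    return 2.0 / math.pi * (phi + math.pi / 4)


def alignment_density(xu, xv, shape=50.0):
    """Assumed form of p(x_u, x_v | l_uv = 1): symmetric Beta, peak at 0.5."""
    return beta_pdf(normalized_angle(xu, xv), shape, shape)
```

A perfectly horizontal or vertical pair maps to the normalized value 0.5, where the assumed density peaks; a pair on a 45 degree diagonal maps to the boundary of the interval and receives (near) zero density.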
Radim Tyleček and Radim Šára
Spacing. The second assumption is that the distance between elements in a
horizontal or vertical neighborhood should most probably be equal. We model
this by comparing distances to horizontal and vertical neighbors in
$$p_s(X \mid N) = \prod_{(u,v,z) \in D^2(X)} p(x_u, x_v, x_z \mid l_{uv}, l_{vz}), \tag{6}$$
where (u, v, z) denotes a pair of edges (u, v), (v, z), $u \neq z$, with the common
vertex v and the same orientation. The distance term is expressed by
$p(x_u, x_v, x_z \mid l_{uv} = l_{vz} = 1) = \beta\!\left(\frac{\delta_{uv}}{\delta_{uv} + \delta_{vz}}, \alpha_\delta\right)$,
where $\alpha_\delta = 50$ and $\delta_{uv} = \|x_u - x_v\|$ are distances to the
neighbors. As in the previous case, the probability in the cases with any
suppressed edge is $p(x_u, x_v, x_z \mid l_{uv} \neq 1 \lor l_{vz} \neq 1) = p_{s0}$.
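A minimal sketch of the spacing term for one pair of same-orientation edges follows. As before, the symmetric Beta density with shape 50 is an assumption about the form of the distance term, and the names `symmetric_beta_pdf` and `spacing_density` are illustrative only.

```python
import math


def symmetric_beta_pdf(x, shape):
    """Beta(shape, shape) density on (0, 1); it peaks at x = 0.5."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_norm = math.lgamma(2 * shape) - 2 * math.lgamma(shape)
    return math.exp(log_norm + (shape - 1) * (math.log(x) + math.log(1 - x)))


def spacing_density(xu, xv, xz, shape=50.0):
    """Assumed form of p(x_u, x_v, x_z | l_uv = l_vz = 1)."""
    d_uv = math.dist(xu, xv)  # distance from v to its neighbor u
    d_vz = math.dist(xv, xz)  # distance from v to its neighbor z
    if d_uv + d_vz == 0.0:
        return 0.0  # degenerate configuration: all three points coincide
    return symmetric_beta_pdf(d_uv / (d_uv + d_vz), shape)
```

The ratio of distances equals 0.5 exactly when both neighbors are equally far from the common vertex, so equally spaced triples receive the highest density.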
Configurations. We model higher-order dependencies in the structure
configurations with
$$p_c(N \mid X) = \prod_{i=1}^{k} p(l_{ij} \mid (i, j) \in D(X)), \tag{7}$$
where the probabilities $p(l_{ij} \mid (i, j) \in D(X))$ model the expected degree of a
given vertex i, including the orientation of edges (i, j) connected to it, i.e. the
typical grid configuration is to have two vertical and two horizontal edges
incident with vertex i.
With the grid assumption and the window size prior, we can estimate the
number of rows $m = \frac{1}{2h}$ and columns $n$ (computed analogously from $h$
and the aspect ratio $r$), assuming the space between the windows to be equal to
the window size. This heuristic plays only a minor role in our model and helps us
to derive the vertex configuration probability $p(l_{ij} \mid (i, j) \in D(X))$. It is
given in Table 1, where rows and columns correspond to the number of horizontal
and vertical edges connected to the window vertex. The maximum degree of a vertex
in RNG is six with at most three horizontal and three vertical edges.
Table 1. Neighborhood configuration prior $p(l_{ij} \mid (i, j) \in D(X))$, where
$\deg_h(i)$, $\deg_v(i)$ are functions of neighboring labels $l_{ij}$. The
$p_{c0} = 10^{-4}$ is the probability of a single (unstructured) window,
$p_{c1} = 0.099$ is the probability of a single row or column of windows,
$p_{c2} = 0.9$ is the probability of a window grid, $p_{c3} = 10^{-5}$ is the
probability of more dense configurations.

deg_h(i), deg_v(i)   0h              1h              2h                     3h
0v                   p_c0            (1/2) p_c1      1/(m-2) p_c1           p_c3
1v                   (1/2) p_c1      (1/4) p_c2      2/(m-2) p_c2           p_c3
2v                   1/(n-2) p_c1    2/(n-2) p_c2    1/((m-2)(n-2)) p_c2    p_c3
3v                   p_c3            p_c3            p_c3                   p_c3
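Table 1 can be read as a lookup keyed by the horizontal and vertical degree of a vertex. The sketch below is one possible encoding under that reading; the function name `configuration_prior` and the requirement m, n > 2 (needed to keep the denominators positive) are assumptions of this example.

```python
def configuration_prior(deg_h, deg_v, m, n,
                        pc0=1e-4, pc1=0.099, pc2=0.9, pc3=1e-5):
    """Look up the vertex configuration prior of Table 1.

    deg_h, deg_v: number of horizontal / vertical edges at the vertex (0..3).
    m, n: estimated numbers of rows and columns (assumed > 2 here).
    """
    if m <= 2 or n <= 2:
        raise ValueError("this sketch assumes m > 2 and n > 2")
    if not (0 <= deg_h <= 3 and 0 <= deg_v <= 3):
        raise ValueError("degrees must lie in 0..3")
    if deg_h == 3 or deg_v == 3:
        return pc3  # dense configurations (last row and last column)
    table = {  # keys are (deg_h, deg_v)
        (0, 0): pc0,
        (1, 0): pc1 / 2,
        (2, 0): pc1 / (m - 2),
        (0, 1): pc1 / 2,
        (1, 1): pc2 / 4,
        (2, 1): 2 * pc2 / (m - 2),
        (0, 2): pc1 / (n - 2),
        (1, 2): 2 * pc2 / (n - 2),
        (2, 2): pc2 / ((m - 2) * (n - 2)),
    }
    return table[(deg_h, deg_v)]
```

For example, with m = n = 5 a grid corner (one horizontal and one vertical edge) is looked up as `configuration_prior(1, 1, 5, 5)`.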
Structural Complexity. The prior for the number of elements can be modeled
with a Poisson distribution p(k) = Pois(k, mn) based on the estimate of the
number of rows m and columns n given above.
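The complexity prior can be evaluated in the log domain, which is convenient when it is multiplied with the other terms of the model. This is a small sketch; the function names are illustrative and the values m = n = 5 in the usage note are only an example.

```python
import math


def log_poisson(k, rate):
    """Log of the Poisson probability mass Pois(k, rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def complexity_log_prior(k, m, n):
    """log p(k) with the expected number of elements m * n."""
    return log_poisson(k, m * n)
```

With m = n = 5, configurations with about 25 windows are preferred: `complexity_log_prior(25, 5, 5)` is larger than `complexity_log_prior(10, 5, 5)`.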
4 Data Likelihood
The data likelihood p(I|k, N, A, X) is solely task-specific and can be chosen
arbitrarily as long as it can be evaluated by means of a probability density or
a likelihood ratio.
In the task of window detection in facade images, the input is an image
$I = \{i;\ i = 1, \ldots, I_w I_h\}$ defined as a set of pixels and we assume it is
rectified, i.e. the window borders are parallel to the image borders, and
$I_w$, $I_h$ are the image width and height.
We want to express the probability of observing image I if window parameters
and structure are given. We combine two features: image edges J and color C in
p(I|k, A, X, N) = p(J|k, A, X, N)p(C|k, A, X, N). We use color to detect regions
of interest and edge features for localization of the window borders.
4.1 Edge Likelihood
We assume that window borders correspond to edges, and use the Canny detector
to find them. However, this model will not fully hold in real-world situations
when we obtain the input by detecting edges in a picture: there can be windows
which do not have all pixels with underlying edges and, vice versa, some edges
do not belong to any window at all. The latter case will typically prevail.
We use a binary imaging model for window edges represented by the oriented edge
image $J = \{J_i \in \{0, 1, 2\};\ i \in I\}$, where $J_i = 1$ if pixel i belongs to
a horizontal edge detected in I (foreground), resp. $J_i = 2$ for a vertical edge;
otherwise $J_i = 0$ (background). We define $d(J) \in (0, 1)$ as a distance
transform of the edge image J normalized by $\max(I_h, I_w)$. We use the gradient
of d(J) to distinguish between horizontal and vertical edges. Similarly, we
introduce the edge image R(A, X) rendered from the current configuration
specified by attributes A, X and the shape template in Fig. 2 with nearest
neighbor discretization. Assuming pixel independence, we can write
$p(J \mid A, X) = \prod_{i \in I} p(J_i \mid R_i(A, X))$, where the probability of
observing a pixel i in the edge image J given the rendered configuration R is
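$$\begin{aligned}
p(J_i = 0 \mid R_i = 0) &= p_{TN} = 1 - 2\,p_{FN}, \\
p(J_i \in \{1, 2\} \mid R_i = 0) &= p_{FN} = 0.1, \\
p(J_i = 0 \mid R_i \in \{1, 2\}) &= p_{FP}(d(i))\,(1 - p_{FX}), \quad d(i) > 0, \\
p(J_i = 1 \mid R_i = 1) = p(J_i = 2 \mid R_i = 2) &= p_{TP} = p_{FP}(0), \\
p(J_i = 2 \mid R_i = 1) = p(J_i = 1 \mid R_i = 2) &= p_{FX},
\end{aligned} \tag{8}$$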
where $p_{FP}(d(i)) = \beta(d(i), \alpha_{FP} = 500, 1)$ makes rectangles close to
edges more probable and acts as a guide for directing the random walk. The
$p_{FX} = 10^{-9}$ is the probability assigned when the edge specified by the
configuration crosses an image edge with the opposite direction.
The edge likelihood can be efficiently evaluated from pre-computed integral
edge images, one for each orientation, yielding constant computational
complexity O(1) per edge; this speed-up is possible thanks to rectified images and
helps make random sampling (described in Sect. 5) very efficient.
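The constant-time evaluation rests on the standard integral image (summed-area table). The sketch below shows the general technique on a binary edge map stored as a list of rows; it is not the original implementation, and the names `integral_image` and `box_sum` are illustrative.

```python
def integral_image(img):
    """Summed-area table with a zero border.

    S[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    height, width = len(img), len(img[0])
    S = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
    return S


def box_sum(S, x0, y0, x1, y1):
    """Sum of the image over the half-open box [x0, x1) x [y0, y1) in O(1)."""
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]
```

Because the image is rectified, each side of a window rectangle is an axis-aligned strip one pixel thick, so the number of edge pixels under it is a single `box_sum` call on the integral image of the matching orientation.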
4.2 Color Likelihood
A pixel color classifier matches the input RGB color image
$C = \{c_i \in (0, 1)^3;\ i = 1, \ldots$ [...]
$$[\ldots]\ \prod_{i \in C_f} p_f(c_i \mid A, X) \prod_{j \in C_b} p_b(c_j \mid A, X), \tag{9}$$
where the foreground color model is expressed by
$p_f(c_i \mid A, X) = \mathcal{N}(\mu_C, \Sigma_C)$, the background probability
$p_b(c_j \mid A, X) = p_b$ is constant and we evaluate foreground pixels only.
Similarly to the edge likelihood, the color likelihood can be evaluated using
pre-computed integral images in linear time.
5 Recognition Algorithm
We have chosen the reversible jump Markov Chain Monte Carlo (RJMCMC) framework
[13], which fits our task of finding the most probable interpretation of the
input image in terms of the target probability $p(\theta, I)$ in (1), which has a
very complex pdf as it is a joint probability of both attributes and structure.
Our solution [...]
We use an independent sampler $q(\theta \mid I)$ to initialize the Markov chain,
which samples the initial state $\theta_0$ either from the prior distribution
$q(\theta)$ or exploits some image information in $q(\theta \mid I)$. This involves
sampling the number of elements $k \sim q(k)$ first and then their attribute values
$(X, A) \sim q(X, A)$ independently. In practice we choose the sampler to start with
$k_0 = 1$.
The conditional sampler $q(\theta' \mid \theta, I)$ [...]. The main sampler only
chooses from q(m) which of the individual samplers m will be used to propose the
next move. We will now propose the set of samplers that will explore the space of
parameters $\theta$. Their design must fulfill the Markov Chain properties of
detailed balance and reversibility of all moves, i.e. given a move there must
always exist a reverse move $m'$, and their probability ratio must be reflected in
the acceptance of the Metropolis-Hastings (MH) algorithm:
$$A = \min\left\{1,\ \frac{p(\theta', I)}{p(\theta, I)}\, \frac{q(m' \mid \theta')}{q(m \mid \theta)}\right\}. \tag{10}$$
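The acceptance test in (10) is usually computed with log-probabilities to avoid numerical underflow. The following sketch shows that generic Metropolis-Hastings step; the function name `mh_accept` and its argument names are assumptions of this example, not part of the original implementation.

```python
import math
import random


def mh_accept(log_p_new, log_p_old, log_q_reverse=0.0, log_q_forward=0.0,
              rng=random):
    """Metropolis-Hastings acceptance test in the log domain.

    log_p_new, log_p_old: log target probability of the proposed and the
        current state.
    log_q_reverse, log_q_forward: log proposal probability of the reverse
        and of the forward move.
    Returns (accepted, acceptance_probability).
    """
    log_ratio = (log_p_new - log_p_old) + (log_q_reverse - log_q_forward)
    accept_prob = 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
    return rng.random() < accept_prob, accept_prob
```

A proposal that increases the target probability is always accepted, while a worse one is accepted with probability equal to the ratio, which lets the chain escape local maxima.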
5.1 Metropolis-Hastings Moves
Moves introduced in this section do not modify the model complexity k and can
thus be evaluated by the classical MH algorithm (10).
Attribute modification. This move picks an element $i \sim U(\{1, \ldots, k\})$ from
a discrete uniform distribution and perturbs some of its attribute values
randomly. Additionally, attribute samplers can be designed to exploit the image
likelihood to increase the acceptance rate. In the window detection scenario, we
have implemented three variants of this type of proposal:
Drift - random variation of position $x_i' = x_i + \epsilon$,
$\epsilon \sim \mathcal{N}(0, \sigma)$ without changing the size,
Resize - randomly pick one of the four window sides (left/right/top/bottom)
and move it by $\epsilon$,
Flip - fix one of the window sides and flip the window around it.
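The three proposals can be sketched on a window stored as a tuple (x, y, w, h) with center (x, y). This representation, the isotropic Gaussian perturbation and the function names are assumptions of this example.

```python
import random


def drift(win, sigma, rng=random):
    """Shift the window center by Gaussian noise; the size is unchanged."""
    x, y, w, h = win
    return (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma), w, h)


def resize(win, sigma, rng=random):
    """Move one randomly chosen side of the window by Gaussian noise."""
    x, y, w, h = win
    left, right = x - w / 2, x + w / 2
    top, bottom = y - h / 2, y + h / 2
    eps = rng.gauss(0.0, sigma)
    side = rng.choice(["left", "right", "top", "bottom"])
    if side == "left":
        left += eps
    elif side == "right":
        right += eps
    elif side == "top":
        top += eps
    else:
        bottom += eps
    # Keep the rectangle well-formed if a side crossed its opposite side.
    left, right = min(left, right), max(left, right)
    top, bottom = min(top, bottom), max(top, bottom)
    return ((left + right) / 2, (top + bottom) / 2, right - left, bottom - top)


def flip(win, rng=random):
    """Mirror the window around one randomly chosen (fixed) side."""
    x, y, w, h = win
    dx, dy = rng.choice([(-w, 0.0), (w, 0.0), (0.0, -h), (0.0, h)])
    return (x + dx, y + dy, w, h)
```

All three moves keep the number of windows fixed, so the proposed state can be scored with the classical acceptance ratio (10).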
Element resampling. This move is a more radical variant of the previous one:
we pick an element i and change all of its attributes by sampling from the
prior distribution $a_i', x_i' \sim q(a_i, x_i)$ or $a_i', x_i' \sim q(a_i, x_i \mid I)$
if possible.
Attribute constraint enforcement. This move proposes changes to the attributes
according to the current neighborhood, $a_i', x_i' \sim q(a_i, x_i \mid A, X, N)$. We
pick a random edge $(u, v) \sim U(D(X))$ and direction ($u \to v$ or $v \to u$) and
transfer attribute values over the edge from one element to another according to
the specific constraints, i.e. $a_u = a_v$. For facades, we transfer both position
and size from one element to the other in the dimension given by the orientation
of the connected edge, i.e. height and vertical position for a horizontal edge.
Structure modification. We include a move to allow changes to the neighborhood
structure: it picks a random edge $(u, v) \sim q_d(u, v)$ and changes its label
$l_{uv}' = 1 - l_{uv}$, effectively suppressing or recovering the edge.
Proposals for latent heights $h_c$ are performed similarly by choosing a component
c uniformly and then sampling $h_c \sim \mathcal{N}(\bar{h}_c, \sigma_h)$, where
$\bar{h}_c = \frac{1}{|V_c|} \sum_{i \in V_c} h_i$ is the mean height in the
component.
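A minimal sketch of the latent height proposal follows. It assumes that components are given as a dictionary mapping a component id to the list of window heights in that component; this data layout and the name `propose_latent_height` are hypothetical.

```python
import random


def propose_latent_height(heights_by_component, sigma_h, rng=random):
    """Pick a component uniformly and sample its latent height.

    The proposal is Gaussian, centered on the mean height of the windows
    in the chosen component, with standard deviation sigma_h.
    """
    c = rng.choice(sorted(heights_by_component))
    heights = heights_by_component[c]
    mean_height = sum(heights) / len(heights)
    return c, rng.gauss(mean_height, sigma_h)
```

Centering the proposal on the component mean keeps the latent height close to the values that the height similarity term in (3) favors.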
5.2 Reversible Jump Moves
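We also need to find the number of elements k, which controls the dimension of
parameters A, X. In order to compare the models in different dimensions, we
need to define dimension matching functions [...]
$$[\ldots]\ \frac{p(\theta', I)}{p(\theta, I)}\, \frac{q(m' \mid \theta')}{q(m \mid \theta)}\, \frac{\overleftarrow{q}(\overleftarrow{u})}{\overrightarrow{q}(\overrightarrow{u} \mid \theta)}\, |J|, \tag{11}$$
where $\rightarrow$ refers to the direct move, $\leftarrow$ to the reverse move, u are
dimension matching variables and
$J = \frac{\partial f(\theta, u)}{\partial(\theta, u)}$ [...] $= [a', x']$ [...]
$a', x' \sim q(a, x)$ and obtain a new state where $A' = \{A, a'\}$ and
$X' = \{X, x'\}$. The corresponding dimension matching function is
$\overrightarrow{f}(A, X, \overrightarrow{u}) = \overrightarrow{f}(\{A, X\}, [a', x'])$,
which inserts [...]
$(A', X', \overleftarrow{u}) = \overleftarrow{f}(\{A', X'\}, [\,])$, where $a_i, x_i$
are the removed attributes and $A = A' \setminus a_i$, $X = X' \setminus x_i$. The
corresponding birth move acceptance is then
$$\alpha_{birth} = \frac{p(\theta', I)}{p(\theta, I)}\, \frac{q(m' \mid \theta')}{q(m \mid \theta)}\, \frac{q(i \mid k')}{q([\ldots] \mid k)}\, \frac{1}{\overrightarrow{q}(a' \mid A)} \cdot 1, \tag{12}$$
where $\overrightarrow{q}(a') = \frac{1}{k}$ and $q([\ldots] \mid k) = \frac{1}{k}$ are
the probabilities of selecting the windows $a'$, $a_i$.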
Death. By removing an existing element from the set we propose a decrease of
dimension $k \to k - 1$ [...] $x' = x_u + \epsilon\,(x_v - x_u)$, where we choose
$\epsilon \sim U\{\frac{1}{2}, \frac{1}{3}, \frac{2}{3}, 2, -1\}$ [...]
$= \frac{1}{2}(h_u + h_v)$ and the width $w'$ analogously.
5.3 Convergence and Complexity
We have found that the typical necessary number of MCMC samples (classifier
calls) is proportional to the image size in pixels |I| (from 30% for easy
instances to 200% for difficult ones). This is good news, as we expected that the
number would grow exponentially with scene complexity. As a result, we fixed the
number of samples in our current method to a pessimistic estimate, but our
experiments suggest that a significantly shorter sampling time could be achieved
with a suitably designed stopping condition.
6 Experimental Results
We have performed a number of experiments with the implementation of window
detection in facades of various styles to demonstrate the universality of our
approach. We have run the Markov Chain for $5 \cdot 10^5$ iterations in our
experiments, which roughly equals visiting all pixels in the analyzed images.
Because the first public dataset with quantitative results known to us appeared
only very recently in [10], we are among the first to compare with them. The
test part of the dataset consists of 10 rectified and annotated images of facades
from a street in Paris, which share attributes of the Haussmannian style but
differ in lighting conditions. Direct comparison is not possible, because they
segment facade pixels into eight different classes of elements and our window
detector defines only two (window/non-window). To deal with this issue, we have
merged the columns of the confusion matrix given in [10] into two, and the results
are given in Table 2. All parameters of our model were fixed for this experiment;
specifically, the size prior was set such that the most probable relative window
height is h = 0.1 and aspect ratio r = 0.5.
The numbers in Table 2 for window and wall classes show that our weak
structure model slightly outperforms the Procedural Segmentation (PS) framework
[10]. This is clearly a success, because PS benefits from a randomized forest
combining 8 classifiers, trained on 15 × 15 pixel patches in 20 images from the
same street as the test data, and a grammar specifically designed for the
Haussmannian style. In contrast, our method is guided by far weaker cues: color of
individual pixels, rectangular shape matching with image edges and size prior. In
our case the dominant role is played by the weak structural model that emerges
from the data: it is able to select among objects of interest proposed by local
classifiers and, at the same time, support windows completing the structure even
where the classifier response is low. This allows us to achieve good results even
when illumination varies and partial occlusion of windows is present, as shown in
Fig. 3. Poor results of Randomized Forest (RF) segmentation from [10] included in
Table 2 give an idea of how entirely unstructured approaches perform on this data.

a) Monge No. 13 b) Monge No. 43 c) Monge No. 50
Fig. 3. Visualization of results on part of the Parisian dataset [10]. Facade a)
is occluded by plants; in facade b) a cast shadow is present. False positive
windows in c) are also window-like regions: they have a good response from both
classifiers and match with the neighbors. Detected windows are shown in red,
neighborhood edges in green and image edges are emphasized in blue. Results on
the complete test set are available as supplemental material.
For classes other than window and wall the results cannot be directly
compared with the other methods, but they allow us to analyze the behavior of our
method in such classes. Balconies typically overlap windows in the Haussmannian
style, but such overlaps are somewhat randomly annotated as window or balcony in
the ground truth [10], even when the appearance is the same, introducing some
amount of ambiguity in the results. The shop class areas are actually formed by
shop-windows and the wall around them, and the visualized results show that our
detector follows this interpretation. The roof area was difficult for our
approach, since the color classifier considers it window-like.

Table 2. Quantitative results on the Haussmannian dataset [10] shown in percentage
of pixels from the class specified in a row. The second column displays the
percentage of pixels of the given class in the whole test set. RF stands for
Randomized Forest, PS for Procedural Segmentation. Our window detection rate of
83% is comparable to the 81% rate for PS (in bold face).

ground truth [10]    RF [10]       PS [10]       proposed      mapping of our classes
class     area       hit   miss    hit   miss    hit   miss
window    11         30    70      81    19      83    17      window
wall      48         38    62      83    17      84    16      non-window

a) Modern facade b) Irregular facade c) Sparse structure
Fig. 4. Results on facade images from Prague.

Fig. 5. Interpreted facades of a modern building. Left: Simple shape template with
t = 1 fails to detect light windows. Right: Change to t = 0.33 improves the result
significantly as the response from edge likelihood is stronger.
While the authors in [10] claim their segmentation framework generalizes to
some mild variants of Haussmannian facades, we can say our framework is not
limited to any particular style at all. To support this, we demonstrate results on
modern buildings in Fig. 5 and Fig. 4 a).
Finally, we have made experiments with the loosely regular facade of Frank
Gehry's Dancing House shown in Fig. 4 b), where window alignment shows
significant deviation from a grid structure. We were successful in correctly
locating all windows lying on the major plane as well as their neighborhood. The
ability to handle sparse regular structures is presented in Fig. 4 c).
7 Conclusion and Future Work
We have presented a recognition framework that uses a weak structure model to
locate elements in images, and demonstrated its potential in the task of window
detection in facades. Our experiments have demonstrated that structural
regularity given by pair-wise attribute constraints can efficiently guide a
stochastic process that estimates element locations and neighborhood at the same
time. We have shown that the conjunction of a weak non-specific classifier and a
weak structural model can lead to performance that would be hardly achievable by a
well-trained specific classifier. Despite the seemingly complex description of the
model, the ideas are simple and the implementation is straightforward.
In future work, we would like to endow our recognition framework with more
powerful classifiers and an ability to handle relations on multiple levels that
would, e.g., allow two different structural components to overlap.
Acknowledgment. This work has been supported by Google Research Award,
by the Czech Ministry of Education under project MSM6840770012 and by Grant
Agency of the CTU Prague under project SGS10/278/OHK3/3T/13.
References
1. Micusik, B., Kosecka, J.: Piecewise planar city 3D modeling from street view
panoramic sequences. In: Proc. CVPR. (2009)
2. Hohmann, B., Krispel, U., Havemann, S., Fellner, D.: CITYFIT: High-quality
urban reconstructions by fitting shape grammars to images and derived textured
point cloud. In: Proc. of the International Workshop 3D-ARCH. (2009)
3. Pauly, M., Mitra, N., Wallner, J., Pottmann, H., Guibas, L.: Discovering structural
regularity in 3D geometry. Transactions on Graphics 27 (2008) 4343
4. Gips, J.: Shape grammars and their uses. Birkhäuser (1975)
5. Zhu, S., Mumford, D.: A stochastic grammar of images. Foundations and Trends
in Computer Graphics and Vision 2 (2006) 362
6. Alegre, F., Dellaert, F.: A probabilistic approach to the semantic interpretation of
building facades. In: International Workshop on Vision Techniques Applied to the
Rehabilitation of City Centres. (2004)
7. Müller, P., Zeng, G., Wonka, P., Van Gool, L.: Image-based procedural modeling
of facades. Transactions on Graphics 26 (2007) 85
8. Mayer, H., Reznik, S.: Building facade interpretation from uncalibrated wide-
baseline image sequences. ISPRS Journal of Photogrammetry and Remote Sensing
61 (2007) 371–380
9. Ripperda, N., Brenner, C.: Data driven rule proposal for grammar based facade
reconstruction. Photogrammetric Image Analysis 36 (2007) 16
10. Teboul, O., Simon, L., Koutsourakis, P., Paragios, N.: Segmentation of building
facades using procedural shape prior. In: Proc. CVPR. (2010)
11. Toussaint, G.T.: The relative neighbourhood graph of a finite planar set. Pattern
Recognition 12 (1980) 261–268
12. McLachlan, G.J.: Finite Mixture Models. Wiley (2000)
13. Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika 82 (1995) 711–732