Geo-Consistency For Multi-Camera Stereo
Marc-Antoine Drouin (National Research Council Canada) et al.
July 2005
Abstract
This paper presents a new model to overcome the occlusion problems coming from wide baseline multiple camera stereo. Rather than explicitly modeling occlusions in the matching cost function, it detects occlusions in the depth map obtained from regular efficient stereo matching algorithms. Occlusions are detected as inconsistencies of the depth map by computing the visibility of the map as it is reprojected into each camera. Our approach has the particularity of not discriminating between occluders and occludees. The matching cost function is modified according to the detected occlusions by removing the offending cameras from the computation of the matching cost. The algorithm gradually modifies the matching cost function according to the history of inconsistencies in the depth map, until convergence. While two graph-theoretic stereo algorithms are used in our experiments, our framework is general enough to be applied to many others. The validity of our framework is demonstrated using real imagery with different baselines.

Figure 1. Example of occlusion. Occluded pixels appear in white, occluders in black.
1. Introduction
The goal of binocular stereo is to reconstruct the 3D structure of a scene from two views. As the baseline gets wider, the problem of occlusion, which is often considered negligible with small baseline configurations, can become severe and limit the quality of the obtained depth map. Occlusion occurs when part of a scene is visible in one camera image but not in the other (see figure 1). The difficulty of detecting occlusion comes from the fact that it is induced by the 3D structure of the scene, which is unknown until the correspondence is established, as that is the final goal of the algorithm. We propose a novel multiple camera stereo algorithm that relies on photometric and geometric inconsistencies in the depth map to detect occlusions. As this algorithm is iterative, it does not explicitly model an occlusion state or add extra constraints to the cost function. This makes it possible to use a standard efficient algorithm during each iteration, instead of tackling a very difficult optimization problem. Furthermore, our approach guarantees to preserve the consistency between the recovered visibility and geometry, a property we call geo-consistency. In this paper, the maximum flow [19] and graph cut [2] formulations are used to solve each iteration. Our framework is general enough to be used with many other stereo algorithms. A survey paper by Scharstein and Szeliski [21] compares various standard algorithms.

The rest of this paper is divided as follows: in Section 2, previous work is presented. Section 3 describes occlusion modeling and geometric inconsistency. Our proposed algorithm is described in Section 4. Experimental results are presented in Section 5.
2. Previous work

In a recent empirical comparison of strategies to overcome occlusion for 2 cameras, Egnal [4] enumerates 5 basic ones: left-right checking, bimodality test, goodness jumps constraint, duality of depth discontinuity and occlusion, and uniqueness constraint. Some algorithms that have been proposed rely on one or more of these strategies, and are often based on varying a correlation window position or size [9, 6, 26, 10]. These methods are binocular in nature and do not generalize well to the case of multiple arbitrary cameras. Other algorithms use dynamic programming [16, 7, 3] because of its ability to efficiently solve more complex matching costs and smoothing terms. Two methods using graph theoretical approaches [8, 11] have been proposed, but again they do not generalize to multiple camera configurations.

When extending binocular stereo to multiple cameras, the amount of occlusion increases, since each pixel of the reference camera can be hidden in more than one supporting camera. This is particularly true when going from a single to a multiple-baseline configuration, such as a regular grid of cameras [15]. Some researchers have proposed specially designed algorithms to cope with occlusion in multiple camera configurations. Amongst these, Kang et al. [10] proposed a visibility approach. While they did not improve over adaptive windows, their scheme was based on the hypothesis that a low matching cost implies the absence of occlusion. This hypothesis is also made in [15, 20, 17, 18]. In contrast, we do not rely on such an assumption. In [27], a relief reconstruction approach based on belief propagation is presented, where the correct visibility is approximated by using a low resolution base surface obtained from manually established correspondences. In [14, 23], visibility-based methods are introduced. The matching cost incorporates the visibility information into a photo-consistency matching criterion, thereby implicitly modeling occlusion in the reconstruction process. Our method differs completely in the way it handles smoothing and by its ability to recover from bad "carving". Similarly, a level-set method [5] uses the visibility information from the evolving reconstructed surface to explicitly model occlusion. In [12], a stereo algorithm based on graph cuts is presented. It strictly enforces visibility constraints to guide the matching process and ensures that it does not contain any geometric inconsistencies. The formulation imposes strict constraints on the form of the smoothing term, constraints that do not apply to our method, as we will see.

3. Modeling occlusion and Geo-consistency

We have a set of reference pixels P, for which we want to compute depth, and a set of depth labels Z. A Z-configuration f : P → Z associates a depth label to every pixel. When occlusion is not modeled, the energy function to minimize is

    E(f) = Σ_{p∈P} e(p, f(p)) + Σ_{p∈P} Σ_{r∈N_p} s(p, r, f(p), f(r))    (1)

where the first sum is the pointwise likelihood, the second is the smoothing, and N_p is a neighborhood of pixel p. This can be solved efficiently because the likelihood term e(p, f(p)) is independent from e(p′, f(p′)) for p ≠ p′, and the smoothing term has a simple 2-site clique form.

To model occlusion, we must compute the volumetric visibility V_i(q, f) of a 3D reference point q from the point of view of a camera i, given a depth configuration f. It is set to 1 if the point is visible, and 0 otherwise. Visibility is a long range interaction, and knowledge about the immediate neighborhood configuration is insufficient most of the time for computing it. The visibility information is collected into a vector, the visibility mask

    V(q, f) = (V_1(q, f), . . . , V_N(q, f))

where N is the number of cameras other than the reference; a vector (1, . . . , 1) means that the 3D point is visible in all supporting cameras, while (0, . . . , 0) means it is invisible in all of them. We call M the set of all possible visibility masks; an M-configuration g : P → M associates a mask to every pixel. Using this, we transform Eq. 1 into an energy function with mask

    E(f, g) = Σ_{p∈P} e(p, f(p), g(p)) + smoothing.    (2)

Typically, we define

    e(p, z, m) = (m · C(p|z)) / |m|    for p ∈ P, z ∈ Z, m ∈ M

where the 3D point p|z is p augmented by z, and C(q) = (C_1(q), . . . , C_N(q)) is the vector of matching costs of 3D point q for each camera. We use |m| to represent the l1-norm, which is just the number of cameras used from q. The case where |m| = 0 is discussed in Section 4.2. A simple cost function is C_i(q) = (I_ref(M_ref q) − I_i(M_i q))², where M_ref and M_i are projection matrices from the world to the images of cameras ref and i respectively, and I_ref and I_i are these images. Now, in order to model occlusion properly, we simply need to examine the case g(p) = V(p|f(p), f).

If the visibility masks were already known and fixed, the occlusion problem would be solved and only photometric ambiguity would remain to be dealt with; the energy function (2) would then be relatively easy to minimize.
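To make the masked matching cost concrete, here is a minimal Python sketch of e(p, z, m) = (m · C(p|z)) / |m|. The intensity values and the three-camera setup are invented for illustration; only the formula itself comes from the text.

```python
# Sketch of the masked matching cost e(p, z, m) = (m . C(p|z)) / |m|.
# The per-camera costs C_i are toy squared intensity differences;
# all numeric values here are illustrative, not from the paper.

def matching_cost_vector(i_ref, i_supporting):
    """C(q): squared difference between the reference intensity and
    each supporting camera's intensity at the projection of q."""
    return [(i_ref - i) ** 2 for i in i_supporting]

def masked_cost(cost_vector, mask):
    """e(p, z, m): average cost over the cameras the 0/1 mask keeps.
    |m| is the l1-norm of the mask, i.e. the number of cameras used."""
    n_used = sum(mask)
    if n_used == 0:
        return None  # the |m| = 0 case is handled separately (Sec. 4.2)
    dot = sum(c * v for c, v in zip(cost_vector, mask))
    return dot / n_used

C = matching_cost_vector(100, [102, 98, 250])  # third camera occluded
print(masked_cost(C, (1, 1, 1)))  # all cameras kept
print(masked_cost(C, (1, 1, 0)))  # occluded camera removed
```

Note how removing the occluded camera from the mask drops the average cost sharply, which is the effect the detected inconsistencies are used to obtain.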
Since this is not the case and f and V(·, f) are dependent, we relax the problem by introducing the concept of geo-consistency: we say that a Z-configuration f is geo-consistent with an M-configuration g if

    g(p) ≤ V(p|f(p), f)    (3)

for each component of these vectors and all p ∈ P. The inequality thus allows the mask to contain a subset of the visible cameras. The removal of extra cameras has been observed to have little impact on the quality of the solution [15]. Our problem becomes the minimization of Eq. 2 in f and g, with the constraint that f is geo-consistent with g.

3.1. Solving simultaneously for depth and visibility

Let us define g⁰(p) = (1, . . . , 1) for all p ∈ P; this corresponds to the case where all cameras are visible from all points. Minimizing E(f, g⁰) in f is equivalent to minimizing E(f). In general, it is possible to minimize E(f, g) by explicitly testing all combinations of depth labels and visibility masks in Z × M. Since #M = 2^N, this effectively makes the problem too big to be solved except in the simplest cases. One way to reduce the number of visibility masks is to realize that, for a given camera configuration, some masks can occur for no configuration f. This makes it possible to precompute a smaller subset of M. Another way to reduce the number of masks is simply to decide on a reasonable subset to use [15]. Unfortunately, even with a small number of masks, it is still not practical to minimize in f and g simultaneously. We can, however, use photo-consistency alone to select the visibility mask of a pixel, if it is assumed equivalent to geo-consistency. In order to determine the mask for a pixel p at depth f(p), we can try each mask and select the most photo-consistent one, i.e. we define g*_f as

    g*_f(p) = arg min_{m∈M} e(p, f(p), m) w(m)

where w(m) is a weight function favoring g⁰ and eliminating improbable masks. The problem thus becomes the minimization of E(f, g*_f) in f. Since e is point-wise independent, the new problem reduces to the original formulation of Eq. 1 and is easily solved using standard algorithms. This technique is used in [15, 20, 17]. However, the selected masks are not guaranteed to preserve geo-consistency.

In space carving [14], the depth f(p) of a pixel is increased at a given step if it is not photo-consistent (which is determined using a threshold). When depth is changed at a point, the mask configuration g is updated accordingly, and so preserves geo-consistency. Space carving is a greedy algorithm that solves Eq. 2 subject to the constraint of Eq. 3 without smoothing. Kolmogorov and Zabih [12] tried to minimize an approximation of Eq. 2 subject to the constraint of Eq. 3 with spatial smoothing by moving iteratively from one geo-consistent solution to another.

4. Stereo with a new implicit occlusion model

We propose to reduce the dependency between f and g by making it temporal: we let f⁰ be the Z-configuration minimizing E(f, g⁰) in f, and for t > 0, we define iteratively f^t as the function minimizing

    Σ_{p∈P} e(p, f^t(p), V(p|f^t(p), f^{t−1})) + smoothing    (4)

and g^t as

    g^t(p) = V(p|f^t(p), f^{t−1}),

that is to say, f^t minimizes E(f^t, g^t), where g^t depends on f^t according to the above equation. This can be done using any standard algorithm. Unfortunately, this process does not always converge [10].

4.1. Using history for convergence

Because of the way g^t is defined, cameras that are removed at one iteration can be kept at the next, possibly introducing cycles. To guarantee convergence, we introduce a visibility history mask independent of the matching cost function value

    H(q, t) = (H_1(q, t), . . . , H_N(q, t))

where N is again the number of cameras other than the reference and

    H_i(q, t) = ∏_{0≤k≤t} V_i(q, f^k) = min_{0≤k≤t} V_i(q, f^k)    (5)

The new problem is obtained by substituting H for V in Eq. 4 to obtain

    E^t_H(f^t) = Σ_{p∈P} e(p, f^t(p), H(p|f^t(p), t−1)) + smoothing    (6)

Mutatis mutandis, f^t now minimizes E^t_H(f^t) and g^t(p) = H(p|f^t(p), t−1). This iterative process always converges (or stabilizes) in a polynomial number of steps. Indeed, H(q, t) is monotonically decreasing in t for all q; moreover, if H(q, t−1) = H(q, t) for all q, then f^t = f^{t+1}, since both are solutions to the same minimization problem, and the process has stabilized. Since each of the N · #P · #Z binary entries of H can only switch from 1 to 0, the number of iterations is bounded by N · #P · #Z.

Furthermore, after convergence, the final configuration f^{T+1} = f^T is geo-consistent with g^{T+1}; this comes from the fact that, for all p,

    g^{T+1}(p) = H(p|f^{T+1}(p), T) = H(p|f^T(p), T)
               ≤ V(p|f^T(p), f^T) = V(p|f^{T+1}(p), f^{T+1}).
Figure 3. Real (ground truth) status, in percentages, of pixels according to their classification, for scenes from the Middlebury comparative study [21]. In bold are the misclassifications favored by the overestimation of the disparity of occluded pixels.

                         Tsukuba Head and Lamp        Sawtooth
  Algo                occludee occluder regular  occludee occluder regular
  Real status of pixels classified from the depth map as occluders:
  bp [24]               44.8    16.3    38.9      42.6     3.8    53.6
  bnv [2]               50.4    15.4    34.2      42.6     4.3    53.3
  Real status of pixels classified from the depth map as occludees:
  bp [24]               15.5     5.9    76.6       5.5     1.1    93.4
  bnv [2]               16.4     5.8    77.8       7.2     1.1    91.7
  Real status of pixels classified from the depth map as regulars:
  bp [24]                1.0     2.0    97.0       0.5     1.5    98.0
  bnv [2]                1.0     2.0    96.9       0.5     1.5    98.0

Figure 2. Effect of object enlargement on classification of occluders and occludees of a scene viewed by 2 cameras (reference and supporting). The ground truth is in thick gray and the depth map in thick dashes. Occluders and occludees are shown for both ground truth (GT) and computed depth map (DM). Illustration of classification shift: the 5 zones respectively represent 1) regular pixels wrongly classified as occludees, 2) occludees correctly classified, 3) occludees wrongly classified as occluders, 4) occluders correctly classified, and 5) occluders wrongly classified as regular.

…ular are occluders than occludees. The observation above discourages the direct use of visibility to update the visibility history mask. Instead, we introduce a pseudo-visibility

    V′(q, f) = (V′_1(q, f), . . . , V′_N(q, f))
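The paper later computes this pseudo-visibility by rendering the depth map into each supporting camera and comparing minimal and maximal depth buffers: a point is pseudo-visible only when the two buffers agree at its projection. A toy Python sketch of that test, with a made-up integer `project` function standing in for the projection T_i:

```python
# Sketch of the pseudo-visibility test V'_i(q, f) = delta(L - G):
# render the depth map twice, keeping the minimal (L) and maximal (G)
# depth per projected location; a point is pseudo-visible only when
# both buffers agree at its projection. `project` is a hypothetical
# stand-in for T_i, and the point list is invented for illustration.

def depth_buffers(points, project):
    """Build min (L) and max (G) depth buffers over projected locations."""
    L, G = {}, {}
    for q, depth in points:
        u = project(q)
        L[u] = min(L.get(u, depth), depth)
        G[u] = max(G.get(u, depth), depth)
    return L, G

def pseudo_visible(q, L, G, project):
    """V'_i: 1 iff no other point of the mesh projects to q's location
    at a different depth (delta is 1 at 0 and 0 elsewhere)."""
    u = project(q)
    return 1 if L[u] == G[u] else 0

project = lambda q: q % 4                # toy projection T_i
points = [(0, 5.0), (4, 2.0), (1, 3.0)]  # q=0 and q=4 collide at u=0
L, G = depth_buffers(points, project)
print(pseudo_visible(0, L, G, project))  # colliding pair
print(pseudo_visible(1, L, G, project))  # alone at its location
```

Both colliding points get V′ = 0, which matches the framework's stated property of not discriminating between occluders and occludees.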
Figure 4. Left) Ordering constraint is satisfied. In this camera configuration, the epipolar lines are parallel to the X-axis. Line 2 is located to the left of line 1 in both images. Right) Ordering constraint is broken: line 2 appears to the left of line 1 in one image and to the right in the other.

…a reverse Z-buffer test. Two depth maps, L^f_i and G^f_i, are thus obtained; they contain the minimal and maximal depths observed by the camera. By comparing them, we can detect when two points of the mesh project to the same location for a given supporting camera. When using rectified images, this rendering process can be greatly sped up and simplified by replacing it with a line drawing using depth buffers. The pseudo-visibility function V′_i(q, f) can therefore be computed as

    V′_i(q, f) = δ(L^f_i(T_i q) − G^f_i(T_i q))

where δ is 1 at 0 and 0 elsewhere.

It is possible for a voxel to have all its cameras removed, i.e. H(p|z, t−1) = 0 even if V(p|z, t−1) ≠ 0. In practice, when this happens, we replace e(p, z, H(p|z, t−1)) by e(p, f^{t′+1}(p), H(p|z, t′)) in the minimization process that computes f^t (see Eq. 6), where t′ is the largest index such that H(p|z, t′) ≠ 0. In this case, depth is assigned only from the neighborhood through smoothing.

5. Experimental results

In all our experiments, the matching cost function was the same for all algorithms: that of [12], which is based on [1]. We used color images, but only the reference images in gray scale are shown here. As for the smoothing term, we used the experimentally defined smoothing function that also comes from [12]:

    s(p, r, f(p), f(r)) = λ g(p, r) l(f(p), f(r))

where g is defined as

    g(p, r) = 3 if |I_ref(M_ref p) − I_ref(M_ref r)| < 5, and 1 otherwise,

with l(p, r) = |f(p) − f(r)| for the maximum flow formulation [19] and l(p, r) = δ(f(p) − f(r)) for the graph cut formulation [2]. The parameter λ is user-defined. For each depth map computation, we chose the λ that achieved the best performance. A pixel disparity is considered erroneous if it differs by more than one disparity step from the ground truth. This error measurement is compatible with the one used in two comparative studies for 2-camera stereo [25, 21, 12].

When minimizing Eq. 6, a visibility mask must be kept for every voxel of the reconstruction volume, that is, for each p ∈ P and z ∈ Z. To reduce memory requirements and the number of iterations, we kept a single visibility history for each pixel p regardless of the disparity z, i.e. (5) becomes H_i(p, t) = ∏_{0≤k≤t} V_i(p|f^k(p), f^k). This saves a lot of memory, but convergence is no longer guaranteed. We simply stop iterating when H(p, t) = H(p, t−1) for all p ∈ P. We observed that running the algorithm any longer only produces minor modifications to f^t. However, the number of pixels with final zero masks increases, usually in regions where the ordering constraint is broken. Pixels with zero masks are more prone to error; we therefore tried to improve results by adding a second step that reintroduces eliminated cameras. This step consisted in fixing to their final values the depth labels of the pixels with non-zero final camera masks. The history of the others was discarded and the volumetric visibility recomputed, considering only occlusion caused by the fixed pixels. Finally, an additional minimization was run to produce a better depth map.

5.1. Middlebury

This dataset from Middlebury [22] consists of 6 series of 9 images of size 434 × 383. We used images 0 to 7 in our experiments. The disparities between images 2 and 6 range from 0 to 19 pixels, and 20 disparity steps were used. Since the ground truth was available for this dataset, we
Figure 6. Reference images for the Head and Lamp scene (left) and the Santa scene (right) from the Multiview Images database of the University of Tsukuba.