

Conference Paper · July 2005


DOI: 10.1109/CVPR.2005.168 · Source: IEEE Xplore



Geo-consistency for Wide Multi-Camera Stereo

Marc-Antoine Drouin Martin Trudeau Sébastien Roy


DIRO
Université de Montréal
{drouim,trudeaum,roys}@iro.umontreal.ca

Abstract

This paper presents a new model to overcome the occlusion problems coming from wide baseline multiple camera stereo. Rather than explicitly modeling occlusions in the matching cost function, it detects occlusions in the depth map obtained from regular efficient stereo matching algorithms. Occlusions are detected as inconsistencies of the depth map by computing the visibility of the map as it is reprojected into each camera. Our approach has the particularity of not discriminating between occluders and occludees. The matching cost function is modified according to the detected occlusions by removing the offending cameras from the computation of the matching cost. The algorithm gradually modifies the matching cost function according to the history of inconsistencies in the depth map, until convergence. While two graph-theoretic stereo algorithms are used in our experiments, our framework is general enough to be applied to many others. The validity of our framework is demonstrated using real imagery with different baselines.

Figure 1. Example of occlusion. Occluded pixels appear in white, occluders in black.

1. Introduction

The goal of binocular stereo is to reconstruct the 3D structure of a scene from two views. As the baseline gets wider, the problem of occlusion, which is often considered negligible with small baseline configurations, can become severe and limit the quality of the obtained depth map. Occlusion occurs when part of a scene is visible in one camera image but not in the other (see figure 1). The difficulty of detecting occlusion comes from the fact that it is induced by the 3D structure of the scene, which is unknown until the correspondence is established, as it is the final goal of the algorithm. We propose a novel multiple camera stereo algorithm that relies on photometric and geometric inconsistencies in the depth map to detect occlusions. As this algorithm is iterative, it does not explicitly model an occlusion state or add extra constraints to the cost function. This makes it possible to use a standard efficient algorithm during each iteration, instead of tackling a very difficult optimization problem. Furthermore, our approach guarantees to preserve the consistency between the recovered visibility and geometry, a property we call geo-consistency. In this paper, the maximum flow [19] and graph cut [2] formulations are used to solve each iteration. Our framework is general enough to be used with many other stereo algorithms. A survey paper by Scharstein and Szeliski [21] compares various standard algorithms.

The rest of this paper is divided as follows: in Section 2, previous work will be presented. Section 3 describes occlusion modeling and geometric inconsistency. Our proposed algorithm is described in Section 4. Experimental results are presented in Section 5.
2. Previous work

In a recent empirical comparison of strategies to overcome occlusion for 2 cameras, Egnal [4] enumerates 5 basic ones: left-right checking, the bimodality test, the goodness jumps constraint, the duality of depth discontinuity and occlusion, and the uniqueness constraint. Some algorithms that have been proposed rely on one or more of these strategies, and are often based on varying a correlation window position or size [9, 6, 26, 10]. These methods are binocular in nature and do not generalize well to the case of multiple arbitrary cameras. Other algorithms use dynamic programming [16, 7, 3] because of its ability to efficiently solve more complex matching costs and smoothing terms. Two methods using graph-theoretical approaches [8, 11] have been proposed, but again they do not generalize to multiple camera configurations.

When extending binocular stereo to multiple cameras, the amount of occlusion increases, since each pixel of the reference camera can be hidden in more than one supporting camera. This is particularly true when going from a single to a multiple-baseline configuration, such as a regular grid of cameras [15]. Some researchers have proposed specially designed algorithms to cope with occlusion in multiple camera configurations. Amongst these, Kang et al. [10] proposed a visibility approach. While they did not improve over adaptive windows, their scheme was based on the hypothesis that a low matching cost implies the absence of occlusion. This hypothesis is also made in [15, 20, 17, 18]. In contrast, we do not rely on such an assumption. In [27], a relief reconstruction approach based on belief propagation is presented, where the correct visibility is approximated by using a low resolution base surface obtained from manually established correspondences. In [14, 23], visibility-based methods are introduced. The matching cost incorporates the visibility information into a photo-consistency matching criterion, thereby implicitly modeling occlusion in the reconstruction process. Our method differs completely in the way it handles smoothing and by its ability to recover from bad "carving". Similarly, a level-set method [5] uses the visibility information from the evolving reconstructed surface to explicitly model occlusion. In [12], a stereo algorithm based on graph cuts is presented. It strictly enforces visibility constraints to guide the matching process and ensures that it does not contain any geometric inconsistencies. The formulation imposes strict constraints on the form of the smoothing term, constraints that will not apply to our method as we will see.

3. Modeling occlusion and geo-consistency

We have a set of reference pixels P, for which we want to compute depth, and a set of depth labels Z. A Z-configuration f : P → Z associates a depth label to every pixel. When occlusion is not modeled, the energy function to minimize is

    E(f) = Σ_{p∈P} e(p, f(p)) + Σ_{p∈P} Σ_{r∈N_p} s(p, r, f(p), f(r))        (1)

where the first term is the pointwise likelihood, the second is the smoothing term, and N_p is a neighborhood of pixel p. This can be solved efficiently because the likelihood term e(p, f(p)) is independent from e(p′, f(p′)) for p ≠ p′, and the smoothing term has a simple 2-site clique form.

To model occlusion, we must compute the volumetric visibility V_i(q, f) of a 3D reference point q from the point of view of a camera i, given a depth configuration f. It is set to 1 if the point is visible, and 0 otherwise. Visibility is a long range interaction, and knowledge of the immediate neighborhood configuration is insufficient most of the time for computing it. The visibility information is collected into a vector, the visibility mask

    V(q, f) = (V_1(q, f), ..., V_N(q, f))

where N is the number of cameras other than the reference; a vector (1, ..., 1) means that the 3D point is visible in all supporting cameras, (0, ..., 0) that it is invisible instead. We call M the set of all possible visibility masks; an M-configuration g : P → M associates a mask to every pixel. Using this, we transform Eq. 1 into an energy function with mask

    E(f, g) = Σ_{p∈P} e(p, f(p), g(p)) + smoothing.        (2)

Typically, we define

    e(p, z, m) = (m · C(p|z)) / |m|    for p ∈ P, z ∈ Z, m ∈ M

where the 3D point p|z is p augmented by z, and C(q) = (C_1(q), ..., C_N(q)) is the vector of matching costs of a 3D point q for each camera. We use |m| to represent the l1-norm, which is just the number of cameras used for q. The case where |m| = 0 is discussed in Section 4.2. A simple cost function is C_i(q) = (I_ref(M_ref q) − I_i(M_i q))², where M_ref and M_i are the projection matrices from the world to the images of cameras ref and i respectively, and I_ref and I_i are these images. Now, in order to model occlusion properly, we simply need to examine the case g(p) = V(p|f(p), f).

If the visibility masks were already known and fixed, the occlusion problem would be solved and only photometric ambiguity would remain to be dealt with; the energy function (2) would then be relatively easy to minimize.
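As an illustration of the masked likelihood above, the sketch below evaluates e(p, z, m) = (m · C(p|z)) / |m| for one candidate depth. It is only a sketch, not the authors' implementation: the camera projectors are passed in as hypothetical callables standing in for the M_ref and M_i projection matrices, and a squared intensity difference is used for C_i, as in the simple cost function of the text.

```python
import numpy as np

def matching_costs(q, ref_image, ref_project, images, projects):
    """C(q) = (C_1(q), ..., C_N(q)): squared difference between the reference
    intensity I_ref(M_ref q) and each supporting intensity I_i(M_i q).
    `ref_project` and `projects` are hypothetical world-to-pixel callables."""
    xr, yr = ref_project(q)
    i_ref = float(ref_image[yr, xr])
    costs = []
    for img, proj in zip(images, projects):
        x, y = proj(q)
        costs.append((i_ref - float(img[y, x])) ** 2)
    return np.array(costs)

def masked_likelihood(costs, mask):
    """e(p, z, m) = (m . C(p|z)) / |m|: average cost over the cameras that the
    visibility mask m keeps.  The |m| = 0 case is handled separately (Sec. 4.2)."""
    m = np.asarray(mask, dtype=float)
    n_used = m.sum()                      # |m|, the number of cameras kept
    if n_used == 0:
        raise ValueError("empty visibility mask: see Section 4.2")
    return float(m @ costs) / n_used
```

Note how a mask component set to 0 simply drops that camera's cost from the average, which is exactly how the algorithm later removes offending cameras without an explicit occlusion state.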
Since this is not the case and f and V(·, f) are dependent, we relax the problem by introducing the concept of geo-consistency: we say that a Z-configuration f is geo-consistent with an M-configuration g if

    g(p) ≤ V(p|f(p), f)        (3)

for each component of these vectors and all p ∈ P. The inequality thus allows the mask to contain a subset of the visible cameras. The removal of extra cameras has been observed to have little impact on the quality of the solution [15]. Our problem becomes the minimization of Eq. 2 in f and g, with the constraint that f is geo-consistent with g.

3.1. Solving simultaneously for depth and visibility

Let us define g^0(p) = (1, ..., 1) for all p ∈ P; this corresponds to the case where all cameras are visible by all points. Minimizing E(f, g^0) in f is equivalent to minimizing E(f). In general, it is possible to minimize E(f, g) by explicitly testing all combinations of depth labels and visibility masks in Z × M. Since #M = 2^N, this effectively makes the problem too big to be solved except in the simplest cases. One way to reduce the number of visibility masks is to realize that, for a given camera configuration, some masks may occur for no configuration f. This makes it possible to precompute a smaller subset of M. Another way to reduce the number of masks is simply to decide on a reasonable subset to use [15]. Unfortunately, even with a small number of masks, it is still not practical to minimize in f and g simultaneously. We can however use photo-consistency alone to select the visibility mask of a pixel, if it is assumed equivalent to geo-consistency. In order to determine the mask for a pixel p at depth f(p), we can try each mask and select the most photo-consistent one, i.e. we define g*_f as

    g*_f(p) = argmin_{m∈M} e(p, f(p), m) w(m)

where w(m) is a weight function favoring g^0 and eliminating improbable masks. The problem thus becomes the minimization of E(f, g*_f) in f. Since e is point-wise independent, the new problem is reduced to the original formulation of Eq. 1 and is easily solved using standard algorithms. This technique is used in [15, 20, 17]. However, the selected masks are not guaranteed to preserve geo-consistency.

In space carving [14], the depth f(p) of a pixel is increased at a given step if it is not photo-consistent (which is determined using a threshold). When the depth is changed at a point, the mask configuration g is updated accordingly, and so preserves geo-consistency. Space carving is a greedy algorithm that solves Eq. 2 subject to the constraint of Eq. 3 without smoothing. Kolmogorov and Zabih [12] tried to minimize an approximation of Eq. 2 subject to the constraint of Eq. 3 with spatial smoothing, by moving iteratively from one geo-consistent solution to the other.

4. Stereo with a new implicit occlusion model

We propose to reduce the dependency between f and g by making it temporal: we let f^0 be the Z-configuration minimizing E(f, g^0) in f and, for t > 0, we define iteratively f^t as the function minimizing

    Σ_{p∈P} e(p, f^t(p), V(p|f^t(p), f^{t−1})) + smoothing        (4)

and g^t as

    g^t(p) = V(p|f^t(p), f^{t−1}),

that is to say, f^t minimizes E(f^t, g^t), where g^t depends on f^t according to the above equation. Now, this can be done using any standard algorithm. Unfortunately, this process does not always converge [10].

4.1. Using history for convergence

Because of the way g^t is defined, cameras that are removed at one iteration can be kept at the next, possibly introducing cycles. To guarantee convergence, we introduce a visibility history mask, independent of the matching cost function value,

    H(q, t) = (H_1(q, t), ..., H_N(q, t))

where N is again the number of cameras other than the reference and

    H_i(q, t) = Π_{0≤k≤t} V_i(q, f^k) = min_{0≤k≤t} V_i(q, f^k)        (5)

The new problem is obtained by substituting H for V in Eq. 4 to obtain

    E_H^t(f^t) = Σ_{p∈P} e(p, f^t(p), H(p|f^t(p), t−1)) + smoothing        (6)

Mutatis mutandis, f^t now minimizes E_H^t(f^t) and g^t(p) = H(p|f^t(p), t−1). This iterative process always converges (or stabilizes) in a polynomial number of steps. Indeed, H(q, t) is monotonically decreasing in t for all q; moreover, if H(q, t−1) = H(q, t) for all q, then f^t = f^{t+1}, since both are solutions of the same minimization problem, and the process has stabilized. We see that the number of iterations is bounded by N · #P · #Z.

Furthermore, after convergence, the final configuration f^{T+1} = f^T is geo-consistent with g^{T+1}; this comes from the fact that, for all p,

    g^{T+1}(p) = H(p|f^{T+1}(p), T) = H(p|f^T(p), T) ≤ V(p|f^T(p), f^T) = V(p|f^{T+1}(p), f^{T+1}).
We thus have an algorithm that converges to a geo-consistent solution, but that can transit through intermediate ones that are not. This type of behavior differentiates our approach from others that strictly enforce geo-consistency during the optimization process [14, 5, 12].

Figure 2. Effect of object enlargement on the classification of occluders and occludees in a scene viewed by 2 cameras. The ground truth is in thick gray and the depth map in thick dashes. Occluders and occludees are shown for both the ground truth (GT) and the computed depth map (DM), illustrating the classification shift. Respectively, the 5 zones represent: 1) regular pixels wrongly classified as occludees; 2) occludees correctly classified; 3) occludees wrongly classified as occluders; 4) occluders correctly classified; 5) occluders wrongly classified as regular.

Real (ground truth) status of pixels, by their depth-map classification (scenes from the Middlebury comparative study, percentages):

                            Tsukuba Head and Lamp        Sawtooth
                         occludee occluder regular  occludee occluder regular
classified as occluders
  bp [24]                  44.8    16.3    38.9      42.6     3.8    53.6
  bnv [2]                  50.4    15.4    34.2      42.6     4.3    53.3
classified as occludees
  bp [24]                  15.5     5.9    76.6       5.5     1.1    93.4
  bnv [2]                  16.4     5.8    77.8       7.2     1.1    91.7
classified as regulars
  bp [24]                   1.0     2.0    97.0       0.5     1.5    98.0
  bnv [2]                   1.0     2.0    96.9       0.5     1.5    98.0

Figure 3. Real (ground truth) status in percentages of pixels according to their classification. Examples from the Middlebury comparative study [21]. In bold are the misclassifications favored by the overestimation of the disparity of occluded pixels.

4.2. Pseudo-visibility

For a given f, an occluder p|f(p) is a 3D point blocking an occludee p′|f(p′) in some camera. Figure 1 illustrates the phenomenon. Each pixel of a depth map can be classified as an occluder, an occludee, or a regular pixel (neither occluder nor occludee). We have observed experimentally that many algorithms have a tendency to overestimate the disparity of occluded pixels. This has the effect of making close objects larger, creating a shift in the pixel classification of occludees and occluders. Occludees have a tendency to be classified as occluders, occluders as regular pixels, and regular pixels as occludees (see figure 2). To validate this assertion, we used the results of two of the best stereo matchers evaluated with the Middlebury dataset [21, 24, 2]. [2] was ranked the best stereo matcher in two comparative studies [25, 21]; [24] appeared later and achieved an even lower error rate. For each obtained depth map, we computed the percentage of pixels classified as occluders by the depth map that really are occludees, and that of pixels classified as occludees that really are regular (figure 3). Both turned out to be quite high. Since most pixels are regular, the percentage of wrong classification for them is low. Nevertheless, there is a clear bias: more pixels classified as regular are occluders than occludees. The observation above discourages the direct use of visibility to update the visibility history mask. Instead, we introduce a pseudo-visibility

    V′(q, f) = (V′_1(q, f), ..., V′_N(q, f))

which compensates for the bias by labeling both occluders and occludees as invisible. An obvious consequence of this definition is the fact that

    V′_i(p|f(p), f) ≤ V_i(p|f(p), f)    for all p ∈ P, 1 ≤ i ≤ N.

The ordering constraint simply states that, when scanning an epipolar line, the order in which we encounter two different objects visible in two images of a stereo pair must be the same in the two images (see Fig. 4, left). This constraint holds for most scenes [13]. While this constraint is broken in some rare cases (see Fig. 4, right), it remains a powerful tool when dealing with occlusion and outliers. If we represent the depth map as an opaque mesh, we are guaranteed to preserve the ordering constraint between the reference and any supporting camera for any point visible from them. If a set of pixels O breaks the ordering constraint between the reference camera and some supporting image i at iteration t, then according to our definition of pseudo-visibility (and using an opaque mesh), the history mask is updated to H_i(p|f^{t+1}(p), t) = 0 for all p in O. After convergence, for the final configuration f^T we have, for all p, H(p|f^{T+1}(p), T) = H(p|f^T(p), T−1); in particular, H_i(p|f^{T+1}(p), T) = 0 for all p ∈ O. Since the offending camera i was not used to compute the final solution, the ordering constraint is respected between the reference camera and the supporting camera i.

The pseudo-visibility masks V′_i are computed using rendering techniques. Two renderings of the current depth map f are done from the point of view of each supporting camera i: one with an ordinary Z-buffer and one with
a reverse Z-buffer test. Two depth maps L_i^f and G_i^f are thus obtained, containing the minimal and maximal depths observed by the camera. By comparing them, we can detect when two points of the mesh project to the same location for a given supporting camera. When using rectified images, this rendering process can be greatly sped up and simplified by replacing it with a line drawing using depth buffers. The pseudo-visibility function V′_i(q, f) can therefore be computed as

    V′_i(q, f) = δ(L_i^f(T_i q) − G_i^f(T_i q))

where δ is 1 at 0 and 0 elsewhere.

It is possible for a voxel to have all its cameras removed, i.e. H(p|z, t−1) = 0 even if V(p|z, t−1) ≠ 0. In practice, when this happens, we replace e(p, z, H(p|z, t−1)) by e(p, z, H(p|z, t′)) in the minimization process that computes f^t (see Eq. 6), where t′ is the largest index such that H(p|z, t′) ≠ 0. In this case, depth is assigned only from the neighborhood through smoothing.

Figure 4. Left) The ordering constraint is satisfied. In this camera configuration, the epipolar lines are parallel to the X-axis. Line 2 is located to the left of line 1 in both images. Right) The ordering constraint is broken: line 2 appears to the left of line 1 in one image and to the right in the other.

Error percentages for the scenes from Middlebury:

Algorithm   Barn1   Barn2   Bull   Poster   Venus   Sawtooth
FULL-BNV    3.5 %   3.1 %   0.7 %  3.7 %    3.4 %   3.3 %
FULL-MF     4.0 %   5.4 %   0.7 %  3.4 %    4.4 %   3.8 %
GEO-BNV     0.8 %   0.6 %   0.4 %  1.1 %    2.4 %   1.1 %
GEO-MF      1.5 %   0.9 %   0.3 %  1.4 %    3.4 %   1.5 %
KAN-BNV     1.4 %   1.5 %   0.9 %  1.1 %    4.0 %   1.5 %
KAN-MF      1.1 %   1.2 %   0.3 %  0.9 %    5.8 %   2.2 %

Figure 5. Error percentages for the different scenes of the Middlebury data set. The best performance for each image set is highlighted.

5. Experimental results

In all our experiments, the matching cost function was the same for all algorithms, that of [12], which is based on [1]. We used color images, but only the reference images in gray scale are shown here. As for the smoothing term, we used the experimentally defined smoothing function that also comes from [12]:

    s(p, r, f(p), f(r)) = λ g(p, r) l(f(p), f(r))

where g is defined as

    g(p, r) = 3 if |I_ref(M_ref p) − I_ref(M_ref r)| < 5, and 1 otherwise,

with l(f(p), f(r)) = |f(p) − f(r)| for the maximum flow [19] formulation and l(f(p), f(r)) = δ(f(p) − f(r)) for the graph cut formulation [2]. The parameter λ is user-defined. For each depth map computation, we chose the λ that achieved the best performance. A pixel disparity is considered erroneous if it differs by more than one disparity step from the ground truth. This error measurement is compatible with the one used in two comparative studies for 2-camera stereo [25, 21, 12].

When minimizing Eq. 6, a visibility mask must be kept for every voxel of the reconstruction volume, that is, for each p ∈ P and z ∈ Z. To reduce memory requirements and the number of iterations, we kept a single visibility history for each pixel p regardless of the disparity z, i.e. (5) becomes H_i(p, t) = Π_{0≤k≤t} V_i(p|f^k(p), f^k). This saves a lot of memory, but convergence is no longer guaranteed. We simply stop iterating when H(p, t) = H(p, t−1) for all p ∈ P. We observed that running the algorithm any longer only produces minor modifications to f^t. However, the number of pixels with final zero masks increases, usually in regions where the ordering constraint is broken. Pixels with zero masks are more prone to error; therefore we tried to improve results by adding a second step that reintroduces eliminated cameras. This step consisted in fixing to their final values the depth labels of the pixels with non-zero final camera masks. The history of the others was discarded and the volumetric visibility recomputed, considering only occlusion caused by the fixed pixels. Finally, an additional minimization was run to produce a better depth map.

5.1. Middlebury

This dataset from Middlebury [22] consists of 6 series of 9 images of size 434 × 383. We used images 0 to 7 in our experiments. The disparities between images 2 and 6 range from 0 to 19 pixels and 20 disparity steps were used. Since the ground truth was available for this dataset, we used it to compute error percentages when using the second image as the reference. We compared our method against Nakamura's [15] with a special choice of masks: either all the cameras to the left of the reference are visible, or all the cameras to the right are. This specialized version of Nakamura is described in [10, 20]. The abbreviation used for this method is KAN. Our method is denoted by GEO. The results of GEO after one iteration are also shown under the label FULL, since this is a case where no occlusion modeling is made. We used 2 different stereo matchers: maximum flow [19] (MF) and graph cuts [2] (BNV). Results are shown in Figure 5. While KAN's modeling of occlusion achieves impressive results, our approach using the BNV stereo matcher performs better in 4 of the 6 sequences of images and is close to KAN in the other two. Oddly enough, in the Venus scene, KAN had a higher error rate than FULL, even though FULL is a simplified version of KAN (a single mask with all the cameras). Our algorithm takes an average of 8 iterations to converge; the improvement after just 4 is minimal.

Figure 6. Reference images for the Head and Lamp scene (left) and the Santa scene (right) from the Multiview Images database of the University of Tsukuba.

5.2. Tsukuba Head and Lamp

This dataset is from the Multiview Image Database of the University of Tsukuba (see Figure 6). It is composed of a 5 × 5 image grid. Each image has a resolution of 384 × 288. The search interval is between 0 and 15 pixels and we used 16 disparity steps. We only used 5 images for each depth map computation. The reference image is the center one, and the 4 supporting images are at an equal distance from it, arranged in a cross shape. In addition to those of GEO-BNV and GEO-MF, the results of GEO-BNV when using the recovery method described in Section 5 are shown under the label "GEO-BNV pt". Some depth maps are shown in figure 7 and error percentages are shown in table 8. The entry KZ1 of the table comes directly from [12]. This method achieved a very low error rate. However, as the authors mentioned, the algorithm has trouble with low-textured regions (the top right corner for instance); the error is therefore somewhat underestimated by the removal of an 18-pixel border in the ground truth. We also computed the error after removing the pixels breaking the ordering constraint, in particular part of the arm of the lamp. The mask was determined by re-projecting the ground truth in each supporting camera; hence it differs for the two baselines.

Figure 7. Depth maps for the Head and Lamp scene (Multiview Images database of the University of Tsukuba); panels: GEO-BNV, GEO-BNV pt, ground truth, ordering constraint mask. Note for GEO-BNV how the errors are concentrated in regions breaking the ordering constraint. A mask of pixels breaking the ordering constraint for the smallest baseline is also shown.

Algorithm    Baseline   Error (whole image)   Error (mask)
GEO-BNV pt   1x         2.23%                 1.53%
KZ1          1x         2.30%                 2.01%
GEO-BNV      1x         2.46%                 1.64%
GEO-MF       1x         3.42%                 2.52%
GEO-BNV      2x         2.69%                 2.11%
GEO-MF       2x         2.62%                 1.28%

Figure 8. Percentages of error of the different algorithms for the Head and Lamp scene, using 5 images. The right column contains the amount of error computed after the removal of the pixels breaking the ordering constraint; the left shows it for all the pixels.

GEO-BNV performed almost as well as KZ1; when removing pixels breaking the ordering constraint, it achieved a slightly lower error rate. For some algorithms, the error rate decreased for the larger baseline. This counter-intuitive behavior is explained by the fact that the matching cost function in the lamp region is less ambiguous when the baseline is larger. Figure 9 shows the stability of our algorithm using graph cuts to changes of the smoothing parameter, giving the error percentage for 6 values of this parameter.

5.3. Baseline test

As the baseline increases, the amount of occlusion in the scene increases as well. A stereo matcher not affected by occlusion would give identical depth maps for different
baselines. To measure the level of resistance to change of the baseline for the different occlusion-overcoming strategies, we introduce the notion of depth map incompatibility. A pixel p is incompatible in two depth maps i and j if

    |f_i(p) − f_j(p)| > 1

(a difference of 1 is meaningless, as it could be the result of discretization errors). It is important to mention that a low incompatibility level is not necessarily a sign of a low error level in the depth map. But the amount of occlusion increases with the baseline, and so should the error and incompatibility levels for stereo matchers that do not model occlusion. To test the stability of our algorithm, we used the Santa scenes from the Multiview Image Database of the University of Tsukuba (see figure 6). This dataset contains 81 images in a 9 × 9 grid, and the focal distance of the camera was 10 mm, with successive baselines of 20, 40, 60 and 80 mm. We only used 5 images, in a cross-shape configuration. Images were reduced by a factor of 2 to achieve a resolution of 320 × 240. Each depth map was computed using 23 disparity steps. Note, in Figure 10, the details on the right side of the hat and on the candle. Again, for each depth map, the smoothing parameter was adjusted to obtain the best possible performance. Since no ground truth was available, the choice was made by visual inspection of every depth map.

Figure 11 contains bar charts of the percentages of pixels incompatible between the depth maps obtained for two baselines. In addition to GEO-MF and KZ1, results from the Nakamura approach [15, 20] using maximum flow (NAKA-MF) and graph cuts (NAKA-BNV) were also included. GEO-MF is twice as stable as NAKA-MF and yields less noisy depth maps. KZ1 and NAKA-BNV are less stable by a factor of 5 and more. The results for FULL-MF are again given. We can see in Figure 10 that GEO-MF achieves the best results for the third baseline. For the first baseline, KZ1, NAKA-MF and GEO-MF performed similarly. The running times for GEO-MF and GEO-BNV are respectively less than 5 and 9 minutes on a 2.0 GHz AMD Athlon(tm) XP 2600+.

              Smoothness parameter
Algorithm     1/30   1/10   1      2      3      4
GEO-BNV 1x    2.61   2.67   2.66   2.55   3.53   4.12

Figure 9. Resistance to change of the smoothing parameter for the Head and Lamp scene. The smoothing parameter increases by a factor of 120, while the error rate varies by less than 1.6% for the small baseline.

Figure 10. Depth maps obtained by 3 algorithms for 2 different baselines (1x and 3x) on the Santa scene (Multiview Image Database of the University of Tsukuba); panels: GEO-MF 1x/3x, KZ1 1x/3x, NAKA-MF 1x/3x.

Figure 11. Resistance to baseline change for 5 algorithms for the Santa scene (Multiview Image Database of the University of Tsukuba); each bar represents a percentage of incompatible pixels between depth maps obtained for two different baselines (1x vs 2x, 2x vs 3x, 3x vs 4x and 1x vs 4x, for GEO-MF, NAKA-MF, KZ1, NAKA-BNV and FULL-MF).

6. Conclusion

We have presented a new framework to model occlusion in stereo by introducing geo-consistency. We also provided a way to apply this framework to add occlusion modeling to standard stereo algorithms. Rather than explicitly model occlusion, our iterative approach relies on the geo-consistency of depth maps to determine the visibility of cameras and to aggressively remove them, adjusting the matching cost function to the scene structure and to the bias in the type of error committed by the stereo matcher. One of the main characteristics of our approach is that we do not discriminate between occluders and occludees. Our implicit occlusion model is successful in obtaining sharp and well-located depth discontinuities, and allows the use of efficient standard stereo matching algorithms. Moreover, our framework does not add any parameter or constraint to the matching process. The validity of our framework has been demonstrated on standard datasets with ground truth, and was compared to other state-of-the-art occlusion models for multiple view stereo. Our approach was also tested on increasingly wider baselines in order to demonstrate its stability to increasing amounts of occlusion in the scene. While the validity of our framework has been demonstrated using two stereo matching algorithms, it is general enough to be applied to others. It is not limited to regular grids of cameras and also works with other camera configurations.

As for future work, a better approach to recover from errors in scenes breaking the ordering constraint should be investigated.

[11] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions via graph cuts. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 508-515, 2001.
[12] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In Proc. of European Conference on Computer Vision, 2002.
[13] J. Krol and W. van der Grind. The double-nail illusion. Perception, 11:615-619, 1982.
[14] K. Kutulakos and S. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):133-144, 2000.
[15] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta. Occlusion detectable stereo - occlusion patterns in camera matrix -. In Proc. of IEEE Conference on Computer Vision and Pattern
vestigated. Also, the extension of this occlusion model to Recognition, 1996.
full volumetric reconstruction, where occlusion becomes [16] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline
the dominant problem, should be investigated. using dynamic programming. IEEE Trans. Pattern Analysis
and Machine Intelligence, 7(2):139–154, 1985.
7. Acknowledgment [17] J. Park and S. Inoue. Hierarchical depth mapping from mul-
tiple cameras. In Int. Conf. on Image Analysis and Process-
ing, volume 1, pages 685–692, Florence, Italy, 1997.
This work was made possible by NSERC (Canada) and [18] J. Park and S. Inoue. Acquisition of sharp depth map from
NATEQ (Québec) grants. multiple cameras. Signal Processing: Image Commun.,
14:7–19, 1998.
References [19] S. Roy. Stereo without epipolar lines : A maximum-flow for-
mulation. Int. J. Computer Vision, 34(2/3):147–162, 1999.
[20] M. Sanfourche, G. L. Besnerais, and F. Champagant. On
[1] S. Birchfield and C. Tomasi. A pixel dissimilarity measure
the choice of the correlation term for multi-baseline stereo-
that is insensitive to image sampling. IEEE Trans. Pattern
vision. In Proc. of the IEEE Conf. on British Computer Vi-
Anal. Mach. Intell., 20(4):401–406, 1998.
sion, September 2004.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
[21] D. Scharstein and R. Szeliski. A taxonomy and evaluation
ergy minimization via graph cut. In Proc. Int. Conference
of dense two-frame stereo correspondence algorithms. IJCV
on Computer Vision, pages 377–384, 1999.
47(1/2/3):7-42, April-June 2002., 47, 2002.
[3] I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. A
[22] D. Scharstein and R. Szeliski. High-accuracy stereo depth
maximum likelihood stereo algorithm. Computer Vision and
maps using structured light. In Proc. of IEEE Conference on
Image Understanding, 63(3):542–567, 1996.
[4] G. Egnal and R. P. Wildes. Detecting binocular half- Computer Vision and Pattern Recognition, 2003.
[23] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruc-
occlusions: Empirical comparisons of five approaches.
tion by voxel coloring. Int. J. Computer Vision, 35(2):151–
IEEE Trans. Pattern Anal. Mach. Intell., 24(8):1127–1133,
173, 1999.
2002.
[24] J. Sun, N. Zheng, and H. Shum. Stereo matching using be-
[5] O. D. Faugeras and R. Keriven. Complete dense stereovision
lief propagation. IEEE Trans. Pattern Analysis and Machine
using level set methods. In Proc. European Conference on
Intelligence, 25(7):787–800, July 2003.
Computer Vision, pages 379–393, 1998.
[6] A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with [25] R. Szeliski and R. Zabih. An experimental comparison of
multiple windowing. In Proc. of IEEE Conference on Com- stereo algorithms. In Vision Algorithms: Theory and Prat-
puter Vision and Pattern Recognition, 1997. ice, pages 1–19. Springer-Verlag, 1999.
[7] S. Intille and A. F. Bobick. Disparity-space images and large [26] O. Veksler. Fast variable window for stereo correspondence
occlusion stereo. In Proc. European Conference on Com- using integral images. In Proc. of IEEE Conference on Com-
puter Vision, pages 179–186, 2002. puter Vision and Pattern Recognition, 2003.
[8] H. Ishikawa and D. Geiger. Occlusions, discontinuities, and [27] G. Vogiatzis, P. Torr, S. M. Seitz, and R. Cipolla. Recon-
epipolar lines in stereo. In Fifth European Conference on structing relief surfaces. In Proc. of the IEEE Conf. on
Computer Vision, pages 232–248, 1998. British Computer Vision, September 2004.
[9] T. Kanade and M. Okutomi. A stereo matching algo-
rithm with an adaptive window: Theory and experiment.
IEEE Trans. Pattern Analysis and Machine Intelligence,
16(9):920–932, 1994.
[10] S. Kang, R. Szeliski, and J. Chai. Handling occlusions in
dense multiview stereo. In Proc. of IEEE Conference on
Computer Vision and Pattern Recognition, 2001.