0% found this document useful (0 votes)
55 views15 pages

Efficient Hybrid Tree-Based Stereo Matching With Applications To Postcapture Image Refocusing

Uploaded by

Isaac Anaman
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
55 views15 pages

Efficient Hybrid Tree-Based Stereo Matching With Applications To Postcapture Image Refocusing

Uploaded by

Isaac Anaman
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

3428 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO.

8, AUGUST 2014

Efficient Hybrid Tree-Based Stereo Matching With


Applications to Postcapture Image Refocusing
Dung T. Vu, Benjamin Chidester, Hongsheng Yang, Minh N. Do, Fellow, IEEE,
and Jiangbo Lu, Member, IEEE

Abstract— Estimating dense correspondence or depth infor- researchers due to its continued and extensive relevance within
mation from a pair of stereoscopic images is a fundamental computer vision. Examples of applications that can benefit
problem in computer vision, which finds a range of important from fast and reliable depth-estimation from stereo matching
applications. Despite intensive past research efforts in this topic,
it still remains challenging to recover the depth information both exist in a diverse array of fields, ranging from computa-
reliably and efficiently, especially when the input images contain tional photography, to robotics, to augmented reality, among
weakly textured regions or are captured under uncontrolled, real- others. For computational photography, depth information can
life conditions. Striking a desired balance between computational enable the creation of novel artistic effects, such as depth-
efficiency and estimation quality, a hybrid minimum spanning of-field rendering and depth-guided filtering, allowing for
tree-based stereo matching method is proposed in this paper. Our
method performs efficient nonlocal cost aggregation at pixel-level more creative and meaningful expressions of image con-
and region-level, and then adaptively fuses the resulting costs tent. To make these applications practical, the challenge
together to leverage their respective strength in handling large remains to acquire accurate depth information in a computa-
textureless regions and fine depth discontinuities. Experiments tionally efficient manner, avoiding sophisticated optimization
on the standard Middlebury stereo benchmark show that the methods.
proposed stereo method outperforms all prior local and nonlocal
aggregation-based methods, achieving particularly noticeable The significance of depth information for computational
improvements for low texture regions. To further demonstrate photography in particular has been demonstrated by the active
the effectiveness of the proposed stereo method, also motivated search for novel depth-acquisition methods within this area
by the increasing desire to generate expressive depth-induced of research. Examples of proposed methods include the work
photo effects, this paper is tasked next to address the emerging of Green et al. [2], which proposed a system of camera
application of interactive depth-of-field rendering given a real-
world stereo image pair. To this end, we propose an accurate thin- sensors to capture multiple images in a single exposure at
lens model for synthetic depth-of-field rendering, which considers different apertures and thereby estimate the depth of a scene.
the user-stroke placement and camera-specific parameters and Levin et al. [3] and Bando et al. [4] proposed the addi-
performs the pixel-adapted Gaussian blurring in a principled tion of a special filter within the aperture of camera lenses
way. Taking ∼1.5 s to process a pair of 640 × 360 images in (a patterned occluder, RGB color filter) to estimate depth.
the off-line step, our system named Scribble2focus allows users
to interactively select in-focus regions by simple strokes using The recently developed Lytro camera [5] houses a micro-
the touch screen and returns the synthetically refocused images lens array integrated on a digital image sensor to capture
instantly to the user. the ray directions of the entire light field and is one of the
Index Terms— Stereo matching, depth estimation, cost aggre- first attempts to commercialize a computational technique of
gation, depth of field, post-capture refocusing. depth acquisition. Despite these recent advances, and even
the development of the Lytro camera, these techniques have
I. I NTRODUCTION not yet reached widespread consumer adoption. Furthermore,
some of the recent techniques and hardware, such as the
D ESPITE the long history of research in stereo matching,
or stereo correspondence, this well-researched area con-
tinues to inspire new methods and to attract the attention of
PiCam [6] developed by Pelican Imaging, cannot avoid the
need for efficient correspondence search to align the images
captured from its camera array and to generate depth maps.
Manuscript received October 1, 2013; revised March 10, 2014; accepted Though many algorithms that emphasize time-sensitivity
May 21, 2014. Date of publication June 5, 2014; date of current version have been proposed to solve the stereo correspondence prob-
July 1, 2014. The associate editor coordinating the review of this manuscript
and approving it for publication was Prof. Janusz Konrad. lem, it still remains open to advances in estimation reliabil-
D. T. Vu and J. Lu are with the Advanced Digital Sciences Center, ity and computation reduction. Additionally, along with the
Singapore 138632 (e-mail: [email protected]; [email protected]). traditional difficulties of occlusion and depth discontinuities
J. Lu thanks NVIDIA Corporation for providing the Tegra 3 prototype tablet.
(Corresponding author: J. Lu.) that must be handled by such an algorithm, an algorithm
B. Chidester and M. N. Do are with the University of Illinois at Urbana- must also handle the imperfections of real-world images.
Champaign, Champaign, IL 61820 USA (e-mail: [email protected]; Many proposed algorithms perform well on undistorted images
[email protected]).
H. Yang is with the University of North Carolina, Charlotte, NC 28223 that have been captured in controlled illumination, such as
USA (e-mail: [email protected]). This work was mainly done when H. Yang the standard Middlebury test cases. Nevertheless, in a real
was working at ADSC. scenario, they are either too slow or fail to handle slight
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. radiometric distortions from the camera, especially in large
Digital Object Identifier 10.1109/TIP.2014.2329389 textureless regions, resulting in unsatisfactory depth estimation
1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3429

for some applications, particularly within computational image are estimated simultaneously by minimizing a global
photography. energy function with a certain smoothness condition. To solve
Therefore, to achieve aesthetic results for such applications, this minimization, various approaches exist. Among the most
we propose an efficient stereo matching technique that esti- popular are dynamic programming [7], Markov networks [8]
mates depth more robustly and reliably than other competing and graph cut [9]. Though they do not solve the minimization
methods, even in challenging, real-life cases. Our method is exactly, they perform well in many challenging region types,
based upon a recently popular Minimum Spanning Tree (MST) such as textureless regions and along depth discontinuities.
representation [1], but extends it to a hierarchical, coarse- However, since this work is concerned with applications for
to-fine representation, with non-local cost aggregation and which fast processing speed is necessary, these methods are
color-guided depth refinement. not considered, as they are computationally prohibitive.
Furthermore, to demonstrate the effectiveness of our Dynamic programming methods attempt to reduce the com-
algorithm and also motivated by the growing interest in depth- putational burden of global methods while maintaining some
induced photo effects, we have evaluated it in an application level of global connectedness for inference by reducing the
within computational photography of post-capture refocusing. support of the smoothness constraint to individual scan-lines
With the advent of mobile devices and the current popularity of the image, but this 1D relaxation usually suffers from the
of mobile photography and image editing, this application “streaking” effect due to the lack of enforced consistency
is highly relevant. To this end, we also propose an accu- between horizontal and vertical scan-lines.
rate thin-lens model for synthetic depth-of-field rendering. Given the requirement of speed, local methods [10]–[12]
Additionally, given the effectiveness of our stereo algorithm, are the most promising methods and are therefore the most
we propose a system, called Scribble2focus, for interactive comparable methods to that proposed. Local methods infer the
refocus-rendering using simple cues, either on the touch screen disparity of each pixel independently, usually by comparing
of a mobile device or on a PC. The system can operate on any windowed regions around the reference pixel and the can-
setup that can capture stereoscopic images, such as a PC or didate matching pixel of the corresponding matching image.
a mobile device equipped with a pair of consumer webcams. This approach results in faster, but often less reliable, depth
Depending on the computational power of capturing devices, inference. In particular, these methods struggle significantly
depth estimation can either be run on the device, or carried out in textureless regions due to matching ambiguity. The local
on a remote server before sending back the estimated depth methods that produce the best inference are those that are
map for interactive refocusing rendering. In fact, the proposed able to adaptively weight the support of the windows during
algorithms could also be used to process a stereoscopic pair of matching to handle object boundaries. However, generating
images captured from two viewpoints by a mobile device with the weights is costly, as this amounts to translation-varying
only a single camera, once the image pair is rectified with the filtering. Although the guided filter [13] was recently pro-
recovered epipolar geometry. posed as an efficient method for computing adaptive support
The paper structure is organized as follows. Section II weights [14], it still struggles with these notoriously difficult
reviews existing depth estimation methods, including textureless regions.
methods using stereoscopic images and methods using To have better representation of color patches and texture,
hardware modification. Section III discusses our depth map some researchers have proposed the use of segmented color-
estimation and refinement algorithm based on a multi-MST images for stereo matching [15]–[17]. These region-based
construction. Section IV discusses how we model and algorithms have smooth disparity estimates inside homoge-
render a real-life, depth-of-field effect. Section V explains neous regions, and color segments can help reduce the com-
the overview of our interactive, mobile photo-refocusing putation time. However, trusting in color segmentation alone
system. Finally, Section VI evaluates our algorithm on both is unreliable, as it may fail to provide a good representation of
the standard Middlebury benchmark for stereo matching regions in the image, resulting in erroneous depth estimation.
and real-life stereoscopic image pairs captured using an Another important direction has been the adoption of
Android tablet. We also discuss our proposed application, tree-based structures for stereo matching. Veksler [18] and
Scribble2focus, in this section. Deng and Lin [19] proposed to use tree-based graphical
models to improve the performance of dynamic program-
II. R ELATED W ORK ming to solve the stereo correspondence problem. While
As previously mentioned, various methods of depth infer- Veksler [18] directly uses the image pixels as nodes in the
ence are currently available for computational photogra- tree, Deng and Lin [19] proposed that pixels of similar
phy applications, and we now provide a more thorough color should be grouped as line segments and a tree is
consideration. constructed to connect all the line segments together. The tree-
Among many proposed algorithms for stereo matching, each based dynamic programming algorithms are much faster than
one is optimized for different criteria, but still struggling to MRF-based global methods, however, the resulting accuracy
strike a desired balance between speed and inference accuracy. is comparably weak and also outperformed by leading local
Generally, they can be grouped into one of two categories: methods [14], [20].
global methods or local methods. Global methods enforce The most closely related work to the proposed algorithm is
global consistency constraints upon the estimated depth map that of Yang, who proposed a non-local filter method [1]. This
using regularization. The matching pixel pairs of the entire method uses a local pixel dissimilarity cost with a non-linear,
3430 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

Fig. 1. Comparison between the non-local filter-based stereo matching algorithm [1] and our method on two challenging cases. (a) Original images. (b) Depth
estimation result of the non-local filter algorithm (left) and our method (right). Depth-of-field rendering using (c) the non-local method’s disparity map, and
(d) our method’s disparity map.

tree-based aggregation scheme to produce similar results to significant further development before they might be ready for
global optimization algorithms without incurring similarly widespread use. Whereas our method only requires two stereo
excessive computational complexity. The dissimilarity cost cameras, and even a single camera, such as a camera from a
between stereo image pairs is computed at each pixel, as in mobile phone, could be used by simply capturing images from
standard local methods. However, the image is represented as two unique perspectives, creating a parallax.
a planar graph, with each pixel being a node. Like in dynamic
programing methods, the fully connected graph is reduced to III. D EPTH F ROM S TEREOSCOPIC I MAGE PAIRS
a tree structure. In Yang’s method, the chosen structure is a
In this section, we present our method for depth inference
Minimum Spanning Tree (MST), which preserves connections
from calibrated stereoscopic images. For applications in such
between pixels of similar intensity. The aggregated cost at
areas as computational photography, robotics, or augmented
each pixel is computed by traversing the MST, so that every
reality, a stereo matching algorithm must effectively handle the
pixel contributes to the depth estimation of every other pixel,
challenge of real-world images while meeting computational
unlike standard local methods. Yang’s algorithm produces
constraints. For the specific application to image refocusing
competitive depth estimation with minimal computational cost.
and editing, it must also preserve the edges of the scene in the
However, even this approach still has difficulty with some
depth map, as the sharpness of object edges highly influences
regions of typical, real-life images taken by commodity cam-
the perceived visual quality of the edited image. Inspired by
eras, as shown in Fig. 1. In particular, the non-local filter
the strengths of both the MST cost aggregation method of [1]
fails to provide the correct depth estimates of large, textureless
and region-based stereo matching, we propose a region-based
regions like the wall or the box of uniform color. Due to the
enhancement to the MST. We extend the method of Yang [1]
particular illumination, the color intensity of the wall changes
by adding a second MST, that is created from a segmented
slightly from patch to patch, resulting in patches of different
version of the image, where each node is a superpixel from
estimated disparity. Our method improves upon that of [1] to
the segmented image. The region-level MST enables aggre-
handle these challenging regions by introducing an additional,
gation over a coarse scale of the image, which is helpful for
region-level MST, which significantly improves estimation
large regions of uniform color and texture, while the pixel-
quality with the computation time only slightly increased.
level MST enables aggregation over a finer scale, which is
Other depth inference methods that require hardware modi-
helpful for edge boundaries. In summary, our stereo algorithm
fications of the camera have also been proposed, and this area
performs the following steps: pixel-level cost initialization,
of research has recently received more attention, due in part
pixel-level and region-level MST construction, adaptive cost
to the popularity of the Microsoft Kinect camera [21]. The
aggregation on both MSTs, a Winner-Take-All strategy to
Kinect camera uses active infra-red illumination to find the
estimate the disparity map, and disparity map refinement using
scene’s depth map in real time. However, the produced depth
a non-local disparity refinement method [1] followed by Cross-
map has poor resolution, and furthermore, Kinect’s active light
based Local Multipoint Filtering (CLMF) [23].
system is susceptible to interference from other light sources
and therefore might not work well with certain materials or
in outdoor lighting. Coded aperture methods [3], [4] use a A. Depth From Multiple Minimum Spanning Trees
specially designed filter attached behind the camera’s aperture Fig. 2 shows the flow chart of our depth estimation and
to estimate the depth map of a scene from the depth-dependent, refinement process. Given the calibrated stereo image pair,
ray-diffusion characteristics of the filter. Zhou et al. [22] I0 and I1 , a disparity map D is recovered such that a pixel
proposed the use of photometric cues by turning on and off I0 ( p) at location p = (u, v) in the reference image I0 and
the flash during capture of stereoscopic images to improve the a pixel I1 ( pd ) at location pd ≡ p + (d, 0), a d horizontally
quality of inference at depth boundaries and to overcome the displaced pixel of p, in the matching image correspond to the
challenge of occlusion. These methods, however, still require same 3D point. A discrete disparity range, H = [dmin , dmax ],
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3431

In the graph, each vertex represents a pixel and is con-


nected via edges to its eight neighboring pixels. Each edge
weight ω P ( p, q) between the two connected vertices ( p, q) is
given as
ω P ( p, q) = |I ( p) − I (q)|.
Applying Kruskal’s algorithm on G P generates a
fully-connected MST. Since a tree has no cycles, it admits
an efficient, recursive computation of the aggregated cost.
Additionally, the MST in particular ensures that cost for a
particular pixel is only aggregated over neighboring pixels
of similar color. This results from the definition of the edge
weights, since the MST has a total weight less than or
equal to the total weight of every other possible spanning
tree of G P . Fig. 5 shows the connections, but not the edge
weights, of the generated pixel-level MST for a patch from
the Lamp Shade image.
2) Region-Level Cost Computation and MST Construction:
As we discussed in Section II in motivation of our method,
although the pure pixel-level MST method of [1] improves
upon local methods in troublesome textureless regions and
texture with large boundaries, it still produces unreliable
depth estimates that are especially problematic for our con-
sidered applications, as seen in Fig. 1. To overcome these
Fig. 2. Flow chart of the proposed stereo matching algorithm.
problems, we propose our modification of Yang’s non-local
filter method [1], which combines pixel-level and region-level
cost computation and MST construction. We use the Simple
is specified based on the focal length of the cameras, the
Linear Iterative Clustering (SLIC) method of [24] to find the
baseline between the stereoscopic images, and the desired
superpixel segmentations I S0 and I S1 for the respective input
depth resolution.
images, I0 and I1 . The resulting superpixels, or regions, from
1) Pixel-Level Cost Computation and MST Construction: SLIC adhere well to image boundaries. Additionally, SLIC
First, the matching cost between I0 and I1 is computed at each provides us with the freedom to tweak the compactness of
pixel for each disparity level. The common cost measurement, the generated superpixels, which is important, as the desired
truncated absolute differences (TAD), is adopted to minimize superpixel size in SLIC depends on the size of the input
the impact of outlier pixels. Similar to [20], the dissimilarity images. In our experiment with the NVIDIA tablet, the stereo
of pixels I0 ( p) and I1 ( pd ), denoted by Cd ( p), is given by a images have a resolution of 640×360, for which we chose the
convex combination of the color dissimilarity ei (measured in superpixel size to be 150 pixels. Fig. 3b shows the superpixel
L1 distance) and the gradient difference eg , segmentation result using SLIC. This size leads to a good
Cd ( p) = βei ( p, pd ) + (1 − β)eg ( p, pd ), (1) balance between large coverage of color patches in textureless
regions and minimal incorrect segmentation across object
where ei and eg are defined as follows, boundaries. The region cost Cd (S) for each superpixel S,
corresponding to a set of pixels in the original image, for
ei ( p, pd ) = min (|I0 ( p) − I1 ( pd )|, Ti ), each disparity d is given by the following expression,
  
eg ( p, pd ) = min |I0 ( p) − I1 ( pd )|, Tg ,
Cd (S) = Cd ( p), (2)
with I0 ( p) and I1 ( pd ) denoting the color vector and I0 ( p) and p∈S

I1 ( pd ) denoting the horizontal gradient at the corresponding where p ∈ S are all the pixels inside the superpixel S.
pixel. The two truncation parameters, Ti = 8 and Tg = 2, After segmentation, the resulting superpixel image does not
are set empirically to limit the negative impact of outliers. form a regular grid, so to represent the image as a graph
In experiments, the weight β is set to 0.11, G R = (V R , E R ), each superpixel is represented by a node, and
Yang [1] showed that cost aggregation on a MST produces a connection is made between every node. However, as shown
quality disparity estimation. The MST connects all the vertices in Fig. 4, this results in connections between nodes that are not
of the graph so that each pixel has support from every other truly “neighbors”, such as the red superpixel that is incorrectly
pixel in the image, depending on their similarity, without the connected to the superpixel R. For superpixels, we consider
computationally expensive calculation of adaptive windows as two nodes to be neighbors only if some pixel from one
in some accurate local methods. Following Yang’s method [1], superpixel is a neighbor with a pixel from the other superpixel.
we construct the pixel-level MST by creating a planar graph To prevent the error in aggregation that would be caused
G P = (V P , E P ) and applying to it Kruskal’s algorithm [25]. by connections between non-neighbors, we penalize these
3432 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

Fig. 3. Result from each step of disparity map estimation. (a) Original images. (b) Segmentation images from the SLIC algorithm [24]. (c) Edge density
images. (d) Disparity maps without the CLMF-1 refinement [23]. (e) Disparity maps after the CLMF-1 refinement [23].

Fig. 4. Region graph builder. Superpixels resulting from the SLIC over-
segmentation method are treated as nodes on an eight-connected undirected
regular graph. The red colored superpixel does not share any neighbour pixel
with superpixel R therefore the edge between them is penalized.

connections by setting their edge weights to the maximum


value. For neighboring nodes, S and T , the edge weight
between them is calculated based on the difference of the
color distribution within each node. There are a variety of
metrics that could be used to define the distance between
color distributions. We compute the color histogram and use
the difference of the dominant colors, I S and IT , also known
as the modes, of each region as the metric. This metric is
simple to compute and is more robust than the difference of
the mean colors of the regions, as SLIC sometimes generates
segments that slightly straddle regions of different color. The
edge weight is then computed as

ω R (S, T ) = |I S − IT |. (3)
Fig. 5. Close-up examination of the multiple Minimum Spanning Trees.
Finally, as in the construction of the pixel-level MST, we Row 1: (a) Original Lamp Shade image. (b) A close-up region from the
apply Kruskal’s algorithm [25] on G R to obtain the result- original image with the red test point. Row 2: (a) Pixel-level MST. (b) Region-
level MST. Row 3: the weighted contribution of all the pixels inside the close-
ing region-level MST. Fig. 5 shows how the region-level up region to the test point according to (a) the pixel-level MST and (b) the
MST looks like in a patch of the segmented Lamp Shade region-level MST. Row 4: depth estimation result by (a) using the pixel-level
image. MST only and (b) our method using adaptive fusion of the pixel-level and
region-level MSTs.
3) Adaptive Fusion of Pixel-Level and Region-Level Costs:
Once the pixel-wise cost for each disparity has been created,
as well as the MSTs, the cost must be aggregated for each distance D(Vi , V j ) of the path P(Vi , V j ) connecting Vi with
pixel. On each of the MSTs, we employ the non-local cost V j as the sum of all edge weights ω along the path,
aggregation by Yang [1] to find the cost at each node. Let us 
consider the MST structure T (V, E), where each Vi ∈ V is D(Vi , V j ) = ωE . (4)
a node and each E j ∈ E is an edge of T . We calculate the ω E ∈P(Vi ,V j )
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3433

For each node Vi , Cd (Vi ) denotes the matching cost for where p ∈ R and α R is the edge density of the region R.
disparity d and CdA (Vi ) denotes the aggregated cost. Based To find the edge density as shown in Fig. 3c, we first apply
on the MST structure, we aggregate the cost for each node the Canny edge detector [26] to find edges in the image and
non-locally with weighted support from every other node in then calculate the density as the ratio of the number of edge
the tree T (V, E) pixels, Ne , to the number of pixels in the region, N R ,
 Ne
CdA (Vi ) = W (Vi , V j )Cd (V j ), (5) αR = . (8)
V j ∈T NR
Finally, a WTA optimization is applied to find the best
where weight W (Vi , V j ) is calculated as an exponential
disparity value at each pixel based on the combined aggregated
function of the distance D(Vi , V j ),
cost CdA ( p). Fig. 3d shows the resulting depth maps.
 
D(Vi , V j )
W (Vi , V j ) = exp − . (6)
σ B. Disparity Map Refinement
We use σ to control the support in the cost aggregation process For non-global methods, it is common practice to apply
over the nodes. If we increase the σ value, distant nodes on some form of post-processing refinement, usually relying
the tree can provide a larger contribution. However, a large upon smoothness or consistency constraints, to the generated
σ value has a trade-off. Large contribution of far distance disparity map D. We apply a two-step refinement process.
nodes is good for low texture regions but it creates an error We first employ the non-local refinement method of Yang [1].
propagation problem for sharp edges and thin objects. Through In this method, we in turn consider the left and right images
experiments, we set the σ value of both pixel-level and region- as the reference and find their respective disparity maps.
level MST to be 0.1. Fig. 5 shows the weighted contribution A mutual consistency check is then employed to detect pixel
of nodes on the pixel-level and region-level MSTs to a test pairs with consistent disparity values, and if pixels are found
point within a patch of the image. to be consistent, they are marked as stable. A new pixel
In the cost aggregation process, Yang [1] utilizes the MST dissimilarity cost is assigned to each pixel based on the its
structure to efficiently compute the aggregated cost by two stability:
traversals of the tree. The aggregated costs are computed 
new |d − D( p)| p is stable and D( p) > 0,
recursively by using the sum at previous nodes on the MST. Cd ( p) = (9)
Hence, the algorithm only requires a few operations per node. 0 else.
We apply this process on both the pixel-level and region-level We then run pixel-level aggregation on the pixel-level MST
MSTs to obtain CdA ( p) and CdA (R) for each node. again. The cost of the stable pixels, and ultimately their depth
After aggregating the cost for both the pixel-level and values, are propagated to the unstable pixels, providing a more
region-level MSTs, we can now see the shortcomings of the consistent depth map. However, this process can improve the
pixel-level MST when it alone defines the aggregation. Con- quality mostly only for incorrect disparities that are caused
sider the case in Fig. 5. It can be seen that the pixel-level MST by occlusion. In the case of large textureless regions, pixels
accurately estimated the weighted contribution of pixels near might be wrongly classified as stable due to the ambiguity of
color and depth discontinuities. However, in the textureless matching. Therefore, we need a second refinement step that
region, the weighted contribution of neighboring pixels in the smooths large textureless regions and provides sharp depth
pixel-level MST was split among patches. This effect leads discontinuities along object boundaries.
to the wrong depth estimation inside the textureless region. To accomplish this, we use the color image for guidance.
The region-level MST outperforms the pixel-level MST in this The basic idea is to enforce the depth map to be coherent with
case, as it is able to correctly aggregate from support within the color image, based on the assumption that a local region
the uniformly yellow region and more accurately estimate the of pixels that have similar color are likely to exist in the same
depth for this region. disparity plane. The refined disparity map Dout is a locally
The key consideration is how to fuse the aggregation from filtered version of the original depth map, where the weights
each of the MSTs to produce more accurate depth estimation. are based on the spatial and color similarity of pixels in a local
Ideally, the region-level cost aggregation at a coarser level neighborhood N ( p),
would act to complement the finer, pixel-level aggregation 
and vice versa. In a textureless region, which has no depth Dout ( p) = w p j D( j ). (10)
j ∈N ( p)
discontinuities, the region-level MST should dominate the
cost aggregation and the resulting depth estimation. However, One commonly used weighting scheme is the bilateral
in a region of rich texture, the algorithm should rely more filter. Here, we use both a faster O(1) and more effective
on the finer, pixel-level cost. In our algorithm, we use the version called the cross-based multi-points filter (CLMF-1),
edge density as a measure of the level of texture within a which was recently proposed in [23]. The reason we choose
region. Then the pixel-level and region-level aggregated costs CLMF-1 over another well-known O(1) filtering technique–
are adaptively blended in an unsupervised fashion, according the guided-image filter (GF) [13]–is that the GF’s kernel
to edge density, as follows cannot handle more than two local color-line models. It has
the tendency to blend them together to create a color blending
CdA ( p) = α R CdA ( p) + (1 − α R )CdA (R), (7) effect when doing edge-preserving image smoothing, and a
3434 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

fuzzy depth boundary is verified in the depth up-sampling


experiment of [23]. The CLMF-1 explicitly adds a support
region estimation procedure, leading to a sharply, well-aligned
depth boundary in the filtering stage. Fig. 3e shows our
disparity refinement results.

IV. I NTERACTIVE R EFOCUS R ENDERING


In the previous section, we presented our depth map esti-
mation and refinement algorithm. The resulting depth map
from our algorithm, Dout , has sharp edges that align well
with objects. The disparity values of objects at the same depth
level are smooth and consistent. The acquisition of such a
depth map is of fundamental importance for various depth-
based image editing processes in computational photography.
To demonstrate the effectiveness of our stereo algorithm within
this realm, we present an interactive refocus-rendering applica-
tion. Provided with a pair of stereoscopic images captured by a
user, our application processes these images to obtain the depth
map of the scene. In the application, users are able to draw a
simple stroke through an object or region in the scene to denote
that it should be emphasized. If we were to use only color
segmentation and Gaussian filtering, the application would
easily fail, as non-connected objects at the same depth range
would be rendered differently. In this application, we must
intelligently infer an in-focus/out-of-focus effect, according to
the user’s input stroke and the scene’s depth map, to render Fig. 6. A thin lens model of how the objects are captured on an image plane.
a real-life, depth-of-field (d.o.f.) image. We also develop the Row 1: An object focuses on a image plane. Row 2: Farthest distance to the
proper method that should be used to render the refocusing camera for objects to have a permissible circle of confusion size. Row 3:
Nearest distance to the camera for objects to have the permissible circle of
effect that is true to the real physical model. Our method for confusion size.
this interactive refocus-rendering application is discussed in
two subsections: the physical model of the d.o.f. effect based
on the physical model of thin lens; and the rendering of the inferred disparity map into the real depth of the scene.
d.o.f. effect using a Gaussian point spread function (PSF) with From the input of a user’s stroke and the depth map Dout ,
a pixel-adapted sigma value. we interpret the user’s selection depth Z U and the desired
in-focus region. If the user’s stroke passes through multiple
A. Physical Model of Depth-of-Field Effect depth planes which are far from one another, then the user
has defined the d.o.f. directly. The in-focus depth range will
In photography, light rays emanating from a point in 3D
extend from the nearest depth plane Z N to the farthest depth
space pass through a lens and are focused onto an image plane,
plane Z F , and thus it is not necessary to define a permissible
which is a given distance behind the lens. At a certain distance
threshold C T . In this case, Z U is chosen to be
from the camera, these light rays will converge to an exact
point in the image plane, and as a result, the object will appear 1
ZU = (Z F − Z N ) + Z N . (11)
sharp in the captured image. If the object is too close to or too 3
far from the camera, the point of convergence of these rays will If the user’s stroke lies in only one depth plane, the algorithm
be either behind or in front of the image plane, and the object considers Z U to be the depth of the user’s selected stroke and
will appear blurred. This is caused by the diffusion of the the d.o.f. is calculated as follows (the derivation of which is
light rays throughout a region of the image, which is usually presented in Appendix A),
modeled as a circle and is called the circle of confusion,
although in actuality, the shape of this region depends on the d.o. f. = [Z N , Z F ] , (12)
shape of the aperture. These effects are visualized in Fig. 6. f 2 ZU
Although a point in the scene may project onto the image ZN = , (13)
C T N(Z U − f ) + f 2
plane with a circle of confusion of non-zero diameter, so long
f 2 ZU
as the diameter is below a specified threshold, denoted C T , ZF = . (14)
it is still considered to be in focus. The range of the distance C T N(Z U − f ) − f 2
from the camera for which an object is considered to be in As previously mentioned, if a point p lies outside of the
focus is the d.o.f. d.o.f., then it will appear blurred in the image based on its
Based on the camera’s intrinsic values and specifications distance from the camera sensor. To render the depth-based
(sensor size, focal length, baseline distance), we convert the refocusing effect, we therefore need to calculate the diameter
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3435

TABLE I Since our depth map refinement process, which uses the
S YMBOLS AND D ESCRIPTIONS FOR d.o.f. C ALCULATION color image as guidance, provides sharp, well-aligned depth
boundaries, we can safely rely on the produced depth map
Dout to not cause edge leaking problems in the refocused
image. A hard threshold on the depth is set based on the
previously calculated d.o.f., so that, if the depth value of the
pixel is within the d.o.f., then its color intensity is stored
directly in the refocused image I R , and any blur kernel that
is applied to the image excludes this point from its support.
For other points which are outside of the d.o.f., an adaptive
Gaussian PSF is applied. The diffusion of light rays from an
out-of-focus object in the camera’s image plane is similar
to the distribution of the energy of a Gaussian kernel of
appropriate spread. Therefore, the d.o.f.-rendered image I R
is generated according to the following relationship:

I ( p) Dout ( p) ∈ [Z N , Z F ],
I R ( p) = (16)
(I ∗ G)( p) Dout ( p) ∈/ [Z N , Z F ],
Fig. 7. If an object is out of focus range, it will create a circle of confusion where the filter coefficient for the Gaussian PSF G p (u, v)
on the image plane. is defined for each point p adaptively with relative image
coordinate (u, v) as
TABLE II
 
S YMBOLS AND D ESCRIPTIONS FOR C IRCLE OF 1 −(u 2 + v 2 )
G p (u, v) = exp · δ (Dout ( p + (u, v))) ,
C ONFUSION S IZE C ALCULATION 2σ p 2 2σ p 2
(17)
In Eq. (16), ∗ denotes the convolution operator. The binary
function δ (Dout ( p + (u, v))) in Eq. (17) evaluates whether
Dout ( p +(u, v)) ∈
/ [Z N , Z F ]. If the condition is true, the filter
kernel mask at point p + (u, v) is activated; otherwise, it is
of the circle of confusion of each point that lies outside of the set to 0 to avoid color bleeding between adjacent in-focus and
d.o.f. using the similar triangle formula, as in Fig. 7: out-of-focus regions. To model the out-of-focus effect, points
in the scene with a larger circle of confusion should be more
Cp z p − z U
heavily blurred and points with a smaller circle of confusion
= ,
f /N z p should incur less blur. Therefore, we relate the spread of the
(z p − z U  ) f /N kernel, parametrized by σ p , linearly to the circle of confusion
Cp = . (15) for each pixel p as follows:
z p
Cp
σp = K ∗ , (18)
In Eq. (15), z p stands for behind lens distance of point P. ps
The behind lens distance can be directly converted from
where K controls the linear relation and ps is the sensor’s
object’s distance to the camera sensor using the thin lens
pixel size. Although we have assumed a circular shape of the
model as shown in Fig. 12. Tables I and II summarize the
aperture in this paper, other real-life aperture shapes could also
symbols used for d.o.f. calculation and circle of confusion size
be implemented to further improve this artistic effect.
calculation, respectively.
Fig. 8 shows our d.o.f. rendering result as well as the value
of sigma at each pixel, which is displayed in the lower graph,
B. Depth-of-Field Rendering Using Gaussian according to various user strokes. The graph shows how the
Point Spread Function value of sigma changes according to the depth of scene and
the user-selected d.o.f.
To render a convincing and realistic refocused image I R
based on the proposed physical model of d.o.f., we propose
two primary techniques for the rendering process. Firstly, each V. I NTERACTIVE M OBILE P HOTO R EFOCUSING S YSTEM
pixel that is outside of the d.o.f. must be diffused through- Though any camera pair can be used to capture stereoscopic
out the region of its circle of confusion, which we accomplish images, motivated by the mobile computational photography
through convolution with a blur kernel with spread based on trend, we specially pick NVIDIA’s Tegra 3 Android tablet [27]
the diameter of the circle of confusion. Secondly, in-focus as an image acquisition and manipulation device. It is a
objects must have sharp edges that do not leak into the blurred powerful tablet, which houses a quad-core CPU and a ded-
background and foreground areas, so we adapt the support of icated GPU. At the current stage of the development, we
the blur kernel based on the depth of neighboring pixels. use the tablet’s stereo cameras to acquire images. The stereo
3436 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

Fig. 8. Depth-of-field rendering result with different focus regions. Green strokes are users’ strokes to select in-focus areas. Row 1: Estimated and refined
depth map. Row 2: Original all in-focus color images. Row 3-5: Rendering result. Row 6: Gaussian PSF sigma value (blue) and estimated disparity value
(red) of each pixel on the red horizontal line. This figure is best viewed on screen.

image pairs are then transferred to the PC for testing of A. System Overview
our stereo matching correspondence and refocus-rendering Fig. 9 presents our system design, which is separated
algorithm. In addition to robustness, all of the components into interactive and pre-processing sections. The first step is
of our algorithm are designed with close attention also to effi- the calibration of the stereo cameras, which involves cal-
ciency, so that the system may easily be ported as an Android culating the intrinsic and extrinsic parameters of the cam-
application to run on the tablet, which is our immediate future eras. This is accomplished by the standard checker-board
work. calibration method. On our mobile system, the stereo images
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3437

algorithm with larger images and additional difficult test cases.


In addition, our algorithm is evaluated on images captured by
the NVIDIA Tegra 3 tablet. The tablet’s camera sensors have
the following specifications:
• Baseline: b = 65mm,
• Focal length: f = 10.11mm,
• Sensor size: 4592μm × 3423μm,
• Pixel size: 1.75μm × 1.75μm,
• f-Number: N = 2.8.
The depth estimation and refinement process is applied using
a PC with an Intel Quad Core 2.8Ghz and 4GB RAM. For
an image pair of 640 × 360 resolution, our algorithm takes
an average of 1.55s to produce the final refined disparity map
Fig. 9. Interactive mobile refocusing system overview. Dout , with 60 pixels as the maximum disparity, which includes
245ms to generate the superpixel segmentation with a default
are captured by an Android application that is built upon the superpixel size of 150 pixels, 725ms to construct the multi-
FCam API [28]. After users capture the images, the system MST and aggregate the cost, and 585ms to refine the disparity
runs image processing in the background. The stereo images map. In the case of images of a smaller resolution of 384×288
are rectified and tone calibrated. The corresponding depth map and a smaller maximum disparity of 19 pixels, such as the
from this image pair is estimated and refined. Users can choose Tsukuba test image, our method only takes 0.47s to estimate
any color image to start the interactive refocus-rendering and refine the disparity map.
process. Strokes are drawn on the image in the GUI to indicate
the region of interest in the image. On a mobile device, this
A. Middlebury Stereo Matching Evaluation
could be done easily on its touch screen. The application then
renders a depth-of-field effect on the image, which keeps the We evaluate our depth estimation algorithm using the
region of interest in focus while synthetically defocusing other quality assessment method proposed by [30] and the
regions according to their distance from the camera. Middlebury stereo database [31]. The parameters are set to
constant values across all the test datasets: β = 0.11, Ti = 8,
Tg = 2, σ = 0.1. Evaluation is based on the rate of wrong
B. Camera Calibration
disparity values over the entire image and over three different
The standard checkerboard method is employed to find regions: non-occluded regions, discontinuous regions, and all
the cameras’ intrinsic and extrinsic parameters, followed by regions. Four standard datasets–Tsukuba, Venus, Teddy, and
estimation of the relative position and orientation between the Cones–were used to obtain evaluation results and temporary
two cameras. We compute the rectification transformation to rankings at the time of submission. Fig. 10 shows our depth
enforce that the cameras’ corresponding epipolar lines on the estimation results, both with and without our CLMF depth
left and right images have the same y-coordinate, so that the refinement method in comparison with the result of the
left and right images are shifted only by horizontal disparities. non-local filter [1] and the ground truth. Visually, our algo-
The intrinsic parameters of the cameras are also required for rithm produces better estimates of the depth of non-occluded
the later computation of the depth-of-field effect. Additionally, regions. For example, our algorithm successfully infers the
we balance the color between the left and right images by depth of the top-right portion of the wall in the Tsukuba
employing a Grey World algorithm [29]. This algorithm is case. It also provides a better result in the Teddy case in the
especially good at the removal of the color cast problem of area to the left of the teddy bear. The repetitive background
digital images. The algorithm is based upon the assumption pattern misleads the non-local filter algorithm into computing
that images which are captured from different camera sensors an unreliable pixel matching cost. Cost aggregation using only
or attributes of the same scene would converge to a similar the pixel-level MST further propagates this error, leading to
mean color. Therefore, we compute the mean color of each patches of incorrectly estimated disparity around the teddy
image and then transform each image according to the average bear. In our algorithm, the integration of a region-level MST
of the mean colors. After this calibration step, the output stereo into cost aggregation helps to suppress the propagation of this
image pairs are appropriately color-balanced and rectified. error. Fig. 10d shows that the application of depth refinement
using the corresponding color image further preserves the
VI. E XPERIMENT AND E VALUATION edges of objects, such as the table in the Tsukuba case, the
In this section, we evaluate our depth estimation and cones in the Cones case, and the papers in the Venus case.
refinement algorithm and discuss a depth-based post- Table III presents our method’s quantitative performance
processing application, Scribble2focus–an application for on four different test cases from the Middlebury dataset in
interactive refocusing. We use the Middlebury 2003 dataset comparison to state-of-the-art methods either focusing on the
to compare the ranking of our proposed stereo method cost aggregation step or using the tree structures. Based solely
with other state-of-the-art algorithms in stereo matching. The upon quantitative comparison, we found that our algorithm
Middlebury 2006 dataset is also used to further challenge our performs slightly better without the CLMF-based depth refine-
3438 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

Fig. 10. Results on four Middlebury stereo vision datasets: Tsukuba, Venus, Teddy, and Cones. (a) Original left image. Disparity map of (b) non-local cost
aggregation method [1], (c) our method without CLMF depth refinement, (d) our method with CLMF depth refinement, and (e) ground truth.

ment, so the results reported in Table III were generated preservation because of the guidance of the color image in
without the use of this refinement. The better quantitative the refinement process. The disparity of large, uniform-color
performance that is achieved by bypassing this additional objects, such as the yellow boxes in Lamp Shade and the
refinement step is likely due to the disparity averaging that wall in Middlebury, are well estimated with piece-wise smooth
is performed during the color-guided refinement, in addition disparity values. The edges in all the cases are sharp and well-
to discrepancies with the integer-valued ground truth. This preserved with few errors, especially in the Wood case. Visual
additional refinement creates sharper edges, which is desirable comparison with the non-local method [1] is provided in
for our refocusing application, but it may diffuse incorrect Fig. 1.
depth values into surrounding regions through convolution For the purpose of evaluating the processing speed,
with the non-linear kernel. These errors do not impact the we have run the non-local algorithm [1] source code on
visual quality of our refocused images, but if quantitative our Quad Core CPU with single-core implementation. The
performance is the primary consideration, it may be preferable non-local algorithm takes an average processing time of
to leave out this additional processing step. Generally, our 0.7 seconds while our method takes an average processing time
algorithm achieved the temporal rank 20t h (out of nearly 170 of 1.01 seconds on the Middlebury 2003 dataset. Such a tree-
methods) as of March 2014 while having competitive runtime based aggregation structure has provided a speed advantage
efficiency. Our algorithm is ranked above some sophisticated over other stereo matching algorithms: OverSegmBP [38]
local methods, such as Patch Match (rank 26rd ) [32] and also takes 50 seconds, GlobalGCP [36] takes 130 seconds, and
the original non-local filter (rank 40t h ) [1]. FastBilateral [39] takes 32 seconds. Compared to the non-local
To further evaluate our method’s performance, we use method, our algorithm is slower by 0.31 seconds, which is due
the challenging Middlebury 2006 dataset. The second row to the additional computation required to build the region-level
of Fig. 11 shows the result of our depth estimation and MST and perform region-level cost aggregation. However,
refinement method. The red-colored pixels mark errors in we believe that the increase in estimation accuracy, as shown
the estimated disparity map as compared to the ground truth by the difference between rankings on the Middlebury dataset,
of each test case. Generally, our stereo matching method and the better handling of challenging cases, as shown in
excels in large, low texture regions and provides good edge Fig. 1b, is worth the small trade-off in computation time for
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3439

TABLE III
M IDDLEBURY S TEREO M ATCHING E VALUATION

Fig. 11. Depth estimation result of our method on Middlebury 2006 dataset. First row presents the original left color images. Second row presents our depth
estimation result together with error pixels (red color) in comparison with the ground truth. Third row shows the synthetic refocusing effect based on the
depth estimation result.

most of the applications that we have considered, especially required an average of 0.7 seconds to calculate the d.o.f. model
computational photography applications. from the color and depth images and render the d.o.f. effect
according to the input scribble of the users. Fig. 8 shows
the input and output result of our Scribble2focus application.
B. Scribble2focus – An Interactive Photo Refocusing System The first row of the figure shows our depth estimation and
We experiment with our interactive, post-capture refocusing refinement result on three real-world cases captured by the
application, using the NVIDIA Tegra 3 tablet to capture several tablet. The result is not perfect, as our method still incurs
test images under different indoor and outdoor conditions. error in low contrast areas or areas of highly-varying texture.
We change the camera parameters manually to capture all-in- However, the visual quality of the estimated depth is good;
focus images. These images are passed to our Scribble2focus we can easily identify objects of different depth with well
application, which performs image rectification, image cal- preserved depth discontinuities. Through the Scribble2focus
ibration, depth estimation, and d.o.f. effect rendering. The interface, users simply mark a green scribble on captured
calibrated images have a resolution of 640 × 360. Our PC color images to indicate their region of interest. Rows 3-5
3440 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

TABLE IV
SSIM C OMPARISON OF THE N ON -L OCAL A GGREGATION M ETHOD [1] W ITH AND W ITHOUT O UR CLMF R EFINEMENT AND O UR M ETHOD

of Fig. 8 show the resulting real-to-life d.o.f. effect, which is Fig. 1(c), 1(d) shows a comparison between d.o.f. rendering
rendered according to users’ scribbles. The last row of Fig. 8, on depth estimation results from the non-local method [1]
which shows the adaptive Gaussian PSF’s sigma value for a and from our method. The cropped region clearly shows
scanline from each image, provides intuition of the blurring how incorrect depth estimation in textureless region would
effect for a given selected region. The plots also justify the create undesirable visual defects on the d.o.f.-rendered images.
importance of our physical thin-lens based refocusing model, Additionally, to evaluate the quality of our resulting refocused
as the relationship between the Gaussian blurring level and the images, we compared our method to the non-local aggregation
disparity value is not straightforward. In fact, the blur kernel method [1] quantitatively using the popular SSIM metric [43]
size for a given point is jointly decided by the user-scribble for visual quality. The reference refocusing image is generated
placement, the stereo image depth range, and the camera- by feeding the ground-truth disparity map into our thin-
specific parameters such as N, f and ps . lens based computational refocusing model. As shown in
1) Chairs Test Case in Fig. 8(a): In this test case, each chair Table IV, our method outperforms the non-local aggregation
lies in a different depth layer and is quite far apart from the method both with and without our CLMF refinement, though
others. The color image has relatively low texture because each our CLMF refinement improves the refocusing results of the
chair consists mostly of one color. Note that another challenge original non-local method.
in this test case is the whiteboard, which causes a strong
reflection and also creates a strong color border with the wall, VII. C ONCLUSION
which might result in different depth values on either side of We have proposed an efficient stereo matching algorithm
the border. Our method successfully estimates the depth layer for fast processing that is based on pixel-level and region-
of each chair and also the wall, and the depth map inside each level MST representations of a stereo pair of images. Fusion
layer is very smooth. The algorithm also does not make any of aggregation over these MSTs, one being of a finer resolution
mistakes with the wall and whiteboard. The d.o.f. rendering of the image and one being of a coarser resolution, allows for
according to each selected chair also shows that our algorithm better depth estimation over large, textureless regions while
preserves the edges well, as the color of the chair does not still preserving depth discontinuities at object boundaries.
leak into the surrounding out-of-focus areas. The result of our depth estimation method is superior to
2) Two People Test Case in Fig. 8(b): This test case was state-of-the-art local methods on the Middlebury benchmark.
also taken indoors and the color of the plain wall and two Experiments show that our method performs exceptionally
main people are similar. Our algorithm is able to estimate and well in the notoriously difficult low texture regions and is
separate depth layers robustly, though some errors still exist at able to preserve sharp depth discontinuities. Furthermore, our
the transition boundary between depth layers. The main reason method provides depth inference of high visual quality on
for this is the low contrast of the image. However, when we challenging, real-world cases captured under different indoor
apply the d.o.f. rendering, the result is still visually acceptable. and outdoor conditions. The addition of color-guided filtering
3) Outdoor Test Case in Fig. 8(c): This test case was taken of the disparity map using CLMF refines edge boundaries,
outdoors under strong sunlight. This setup is also challenging, resulting in sharper disparity discontinuities at object borders,
as it contains slanted surfaces, thin objects, and large texture- which we have shown is important for depth-based compu-
less regions. Our estimated depth map is visually acceptable–it tational photography applications. We plan to improve our
clearly shows objects with correct depth discontinuities. Note algorithm’s performance in areas of low contrast by further
that our method can even detect the connection between the research in robust matching measures and improved inference.
head of the motorbike and the mirror. More importantly, our We also evaluated this method in an application of
refocused images look visually plausible, creating convincing, refocus-rendering from computational photography. For this
d.o.f. effects based upon the user’s strokes. application, we derived a precise model of the lens to achieve
4) Middlebury 2006 Dataset in Fig. 11: To further demon- a realistic d.o.f. effect. Motivated by this application, we
strate the performance of our application, we also test our presented an interactive depth-of-field rendering application
d.o.f. rendering algorithm on the depth estimation result of the named Scribble2focus that uses our proposed stereo matching
Middlebury 2006 dataset. According to the information that algorithm to estimate depth. Our application enables the user
we were able to gather from the Middlebury 2006 dataset’s to easily and interactively create a real-life, depth-of-field
website [41] and Scharstein and Szeliski’s paper discussing effect by simply drawing a stroke through the region to be
the dataset [42], we assume that theses images were captured emphasized. We carefully study the physical model of a real-
with focal length f = 13.11mm and baseline b = 160mm. life, depth-of-field effect and utilize the acquired depth map
The last row of Fig. 11 shows our final rendering result for to guide the depth-based, pixel-adapted, Gaussian blurring in
different input strokes from the user, drawn in green. the rendering process. The Scribble2focus application provides
VU et al.: EFFICIENT HYBRID TREE-BASED STEREO MATCHING 3441

R EFERENCES
[1] Q. Yang, “A non-local cost aggregation method for stereo match-
ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012,
pp. 1402–1409.
[2] P. Green, W. Sun, W. Matusik, and F. Durand, “Multi-aperture photog-
raphy,” ACM Trans. Graph., vol. 26, no. 3, pp. 68:1–68:7, Jul. 2007.
[3] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth
from a conventional camera with a coded aperture,” ACM Trans. Graph.,
vol. 26, no. 3, pp. 70:1–70:9, Jul. 2007.
[4] Y. Bando, B.-Y. Chen, and T. Nishita, “Extracting depth and matte
using a color-filtered aperture,” ACM Trans. Graph., vol. 27, no. 5,
pp. 134:1–134:9, Dec. 2008.
[5] (2014, Jun. 13). About the Camera, Lytro, Mountain View, CA, USA
[Online]. Available: http://www.lytro.com/
[6] K. Venkataraman et al., “Picam: An ultra-thin high performance
monolithic camera array,” ACM Trans. Graph., vol. 32, no. 6,
Fig. 12. Relationship between the focal length f , the object distance to pp. 166:1–166:13, Nov. 2013.
 . According to the
camera Z U and the object’s image behind lens distance z U [7] A. F. Bobick and S. S. Intille, “Large occlusion stereo,” Int. J. Comput.
 = f ZU .
physical model of thin lens, z U can be expressed in the form z U Z −f
U
Vis., vol. 33, no. 3, pp. 181–200, 1999.
[8] J. Sun, Y. Li, S. B. Kang, and H.-Y. Shum, “Symmetric stereo matching
for occlusion handling,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2005, pp. 399–406.
users with quality, artistic, depth-of-field images with little [9] V. Kolmogorov and R. Zabih, “Computing visual correspondence with
effort. In future work, we plan to optimize the algorithm and occlusions using graph cuts,” in Proc. Int. Conf. Comput. Vis., Jul. 2001,
port the entire framework to Android devices. pp. 508–515.
[10] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for
correspondence search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28,
A PPENDIX A no. 4, pp. 650–656, Apr. 2006.
P ROOF OF D EPTH OF F IELD R ANGE [11] O. Veksler, “Fast variable window for stereo correspondence using
integral images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
According to Fig. 6, we use the similar triangle formula to Jun. 2003, pp. 556–561.
calculate z F as [12] K. Zhang, J. Lu, and G. Lafruit, “Cross-based local stereo matching
using orthogonal integral images,” IEEE Trans. Circuits Syst. Video
CT zU − z F Technol., vol. 19, no. 7, pp. 1073–1079, Jul. 2009.
= [13] K. He, J. Sun, and X. Tang, “Guided image filtering,” in Proc. Eur.
f /N zF Conf. Comput. Vis., 2010, pp. 1–14.
f zU [14] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz, “Fast
⇒ zF = .
CT N + f cost-volume filtering for visual correspondence and beyond,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3017–3024.
Fig. 12 shows how we use the thin lens model to calculate [15] Y. Wei and L. Quan, “Region-based progressive stereo matching,”
the behind lens distance. Application of this model gives us in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun./Jul. 2004,
pp. 106–113.
another expression of z F and z U in the form [16] L. Hong and G. Chen, “Segment-based stereo matching using
f ZF graph cuts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
zF = , Jun./Jul. 2004, pp. 74–81.
ZF − f [17] Y. Deng, Q. Yang, X. Lin, and X. Tang, “A symmetric patch-based
f ZU correspondence model for occlusion handling,” in Proc. 10th Int. Conf.
zU = . Comput. Vis., Oct. 2005, pp. 1316–1322.
ZU − f [18] O. Veksler, “Stereo correspondence by dynamic programming on a
Therefore, tree,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005,
pp. 384–390.
f ZF f zU [19] Y. Deng and X. Lin, “A fast line segment based dense stereo algorithm
zF = =
ZF − f CT N + f using tree dynamic programming,” in Proc. Eur. Conf. Comput. Vis.,
2006, pp. 201–212.
f zU
⇒ ZF = . [20] J. Lu, H. Yang, D. Min, and M. N. Do, “PatchMatch filter: Efficient
CT N + f − zU edge-aware filtering meets randomized search for fast correspondence
field estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
After substituting Z U , the farthest distance value for object to Jun. 2013, pp. 1854–1861.
be in focus is [21] J. Shotton et al., “Real-time human pose recognition in parts from single
depth images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
f 2 ZU Jun. 2011, pp. 1297–1304.
ZF = .
C T N(Z U − f ) − f 2 [22] C. Zhou, A. Troccoli, and K. Pulli, “Robust stereo with flash and no-
flash image pairs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Similarly, we calculate the nearest distance for objects to be Jun. 2012, pp. 342–349.
in focus as [23] J. Lu, K. Shi, D. Min, L. Lin, and M. N. Do, “Cross-based local
multipoint filtering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
f 2 ZU Jun. 2012, pp. 430–437.
ZN = .
C T N(Z U − f ) + f 2 [24] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk,
“SLIC superpixels compared to state-of-the-art superpixel methods,”
ACKNOWLEDGMENT IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282,
Nov. 2012.
This work was supported by the research grant for the [25] J. B. Kruskal, “On the shortest spanning subtree of a graph and the
Human Sixth Sense Programme at the Advanced Digital traveling salesman problem,” Proc. Amer. Math. Soc., vol. 7, no. 1,
pp. 48–50, Feb. 1956.
Sciences Center from Singapore’s Agency for Science, Tech- [26] J. Canny, “A computational approach to edge detection,” IEEE Trans.
nology and Research (A*STAR). Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, Jun. 1986.
3442 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2014

[27] (2014. Jun. 13). Tegra Developer Zone. NVIDIA, Santa Clara, CA, USA Hongsheng Yang received the B.Eng. degree in
[Online]. Available: http://developer.nvidia.com/tegra-start electronic information engineering from the Uni-
[28] A. Adams et al., “The frankencamera: An experimental platform versity of Electronic Science and Technology of
for computational photography,” ACM Trans. Graph., vol. 29, no. 4, China, Chengdu, China, in 2011. He was with
pp. 29:1–29:12, Jul. 2010. the Advanced Digital Sciences Center, Singapore,
[29] A. C. Shumate and H. Li. Gray World Algorithm. [Online]. Available: as a Software Engineer till 2013. He is cur-
http://scien.stanford.edu/pages/labsite/2000/psych221/projects/00/trek rently pursuing the Ph.D. degree in computer sci-
[30] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two- ence with the University of North Carolina at
frame stereo correspondence algorithms,” Int. J. Comput. Vis., vol. 47, Chapel Hill, Chapel Hill, NC, USA. His cur-
nos. 1–3, pp. 7–42, Apr. 2002. rent research interests include 3D computer vision,
[31] (2014, Jun. 13). Middlebury Stereo Vision, Middlebury College, Middle- large-scale scene understanding, and optimization
bury, VT, USA [Online]. Available: http://vision.middlebury.edu/stereo/ methods.
[32] M. Bleyer, C. Rhemann, and C. Rother, “Patchmatch stereo—Stereo
matching with slanted support windows,” in Proc. Brit. Mach. Vis. Conf.,
2011, pp. 14.1–14.11.
[33] X. Mei, X. Sun, W. Dong, H. Wang, and X. Zhang, “Segment-tree based
cost aggregation for stereo matching,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., Jun. 2013, pp. 313–320.
[34] D. Min, J. Lu, and M. N. Do, “Joint histogram-based cost aggregation
for stereo matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35,
no. 10, pp. 2539–2545, Oct. 2013. Minh N. Do (M’01–SM’07–F’14) was born in
[35] D. Chen, M. Ardabilian, X. Wang, and L. Chen, “An improved non- Vietnam in 1974. He received the B.Eng. degree
local cost aggregation method for stereo matching based on color and in computer engineering from the University of
boundary cue,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2013, Canberra, Canberra, ACT, Australia, in 1997, and
pp. 1–6. the Dr. Sci. degree in communication systems from
[36] L. Wang and R. Yang, “Global stereo matching leveraged by sparse the Swiss Federal Institute of Technology Lausanne,
ground control points,” in Proc. IEEE Conf. Comput. Vis. Pattern Lausanne, Switzerland, in 2001.
Recognit., Jun. 2011, pp. 3033–3040. He has been on the faculty with the University
[37] C. Lei, J. Selzer, and Y.-H. Yang, “Region-tree based stereo using of Illinois at Urbana-Champaign, Champaign, IL,
dynamic programming optimization,” in Proc. IEEE Conf. Comput. Vis. USA, since 2002, where he is currently a Professor
Pattern Recognit., Jun. 2006, pp. 2378–2385. with the Department of Electrical and Computer
[38] C. L. Zitnick and S. B. Kang, “Stereo for image-based rendering using Engineering, and hold joint appointments with the Coordinated Science
image over-segmentation,” Int. J. Comput. Vis., vol. 75, no. 1, pp. 49–65, Laboratory, the Beckman Institute for Advanced Science and Technology,
Oct. 2007. and the Department of Bioengineering. His research interests include image
[39] S. Mattoccia, S. Giardino, and A. Gambini, “Accurate and efficient cost and multidimensional signal processing, wavelets and multiscale geometric
aggregation strategy for stereo correspondence based on approximated analysis, computational imaging, augmented reality, and visual information
joint bilateral filtering,” in Proc. Asian Conf. Comput. Vis., 2009, representation.
pp. 23–27. Prof. Do was a recipient of the Silver Medal from the 32nd International
[40] M. Mueller, F. Zilly, and P. Kauff, “Adaptive cross-trilateral depth map Mathematical Olympiad in 1991, the University Medal from the University of
filtering,” in Proc. 3DTV-Conf., True Vis. Capture, Transmiss. Display Canberra in 1997, the Doctorate Award from the EPFL in 2001, the CAREER
3D Video, 3DTV-CON, Jun. 2010, pp. 1–4. Award from the National Science Foundation in 2003, and the Young Author
[41] (2006). Stereo Datasets with Ground Truth, Middle- Best Paper Award from the IEEE in 2008. He was named a Beckman
bury College, Middlebury, VT, USA [Online]. Available: Fellow at the Center for Advanced Study, UIUC, in 2006, and received
http://vision.middlebury.edu/stereo/data/scenes2006/ the Xerox Award for Faculty Research from the College of Engineering,
[42] D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using UIUC, in 2007. He was a member of the IEEE Signal Processing Theory
structured light,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., and Methods Technical Committee, the Image, Video, and Multidimensional
Jun. 2003, pp. 195–202. Signal Processing Technical Committee, and an Associate Editor of the IEEE
[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image T RANSACTIONS ON I MAGE P ROCESSING.
quality assessment: From error visibility to structural similarity,” IEEE
Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.

Dung T. Vu received the B.S. degree in electrical


engineering from the National University of Singa-
pore, Singapore, in 2010. He is currently with the
Advanced Digital Sciences Center, which was jointly Jiangbo Lu (M’09) received the B.S. and M.S.
founded by the University of Illinois at Urbana- degrees in electrical engineering from Zhejiang Uni-
Champaign, Champaign, IL, USA, and the Agency versity, Hangzhou, China, in 2000 and 2003, respec-
for Science, Technology, and Research, a Singapore tively, and the Ph.D. degree in electrical engineering,
government agency. His research interests include Katholieke Universiteit Leuven, Leuven, Belgium, in
3D computer vision, 3D reconstruction, and hybrid 2009.
sensor systems. He was with VIA-S3 Graphics, Shanghai, China,
from 2003 to 2004, as a Graphics Processing Unit
Architecture Design Engineer. In 2002 and 2005, he
conducted visiting research at Microsoft Research
Asia, Beijing, China. Since 2004, he has been with
Benjamin Chidester received the B.S. degree in the Multimedia Group, Interuniversity Microelectronics Center, Leuven, Bel-
electrical and computer engineering from Carnegie gium, as a Ph.D. Researcher. Since 2009, he has been with the Advanced
Mellon University, Pittsburgh, PA, USA, in 2009, Digital Sciences Center, Singapore, which is a joint research center between
and the M.S. degree in electrical and computer engi- the University of Illinois at Urbana-Champaign, Champaign, IL, USA, and the
neering from the University of Illinois at Urbana- Agency for Science, Technology and Research, Singapore, where he is leading
Champaign, Champaign, IL, USA, in 2013, where a few research projects. His research interests include computer vision, visual
he is currently pursuing the Ph.D. degree. He was a computing, image processing, video communication, interactive multimedia
recipient of one-Year Research Fellowship from MIT applications and systems, and efficient algorithms for various architectures.
Lincoln Labs in 2013. His interests include computer Dr. Lu is an Associate Editor of the IEEE T RANSACTIONS ON C IRCUITS
vision, image processing, and machine learning. AND S YSTEMS FOR V IDEO T ECHNOLOGY . He was a recipient of the 2012
TCSVT Best Associate Editor Award.

You might also like