Efficient Hybrid Tree-Based Stereo Matching With Applications to Postcapture Image Refocusing

Abstract— Estimating dense correspondence or depth information from a pair of stereoscopic images is a fundamental problem in computer vision, which finds a range of important applications. Despite intensive past research efforts on this topic, it still remains challenging to recover the depth information both reliably and efficiently, especially when the input images contain weakly textured regions or are captured under uncontrolled, real-life conditions. Striking a desired balance between computational efficiency and estimation quality, a hybrid minimum spanning tree-based stereo matching method is proposed in this paper. Our method performs efficient nonlocal cost aggregation at the pixel level and the region level, and then adaptively fuses the resulting costs together to leverage their respective strengths in handling large textureless regions and fine depth discontinuities. Experiments on the standard Middlebury stereo benchmark show that the proposed stereo method outperforms all prior local and nonlocal aggregation-based methods, achieving particularly noticeable improvements for low-texture regions. To further demonstrate the effectiveness of the proposed stereo method, and motivated by the increasing desire to generate expressive depth-induced photo effects, this paper next addresses the emerging application of interactive depth-of-field rendering given a real-world stereo image pair. To this end, we propose an accurate thin-lens model for synthetic depth-of-field rendering, which considers the user-stroke placement and camera-specific parameters and performs the pixel-adapted Gaussian blurring in a principled way. Taking ∼1.5 s to process a pair of 640 × 360 images in the off-line step, our system, named Scribble2focus, allows users to interactively select in-focus regions by simple strokes on the touch screen and returns the synthetically refocused images to the user instantly.

Index Terms— Stereo matching, depth estimation, cost aggregation, depth of field, post-capture refocusing.

Manuscript received October 1, 2013; revised March 10, 2014; accepted May 21, 2014. Date of publication June 5, 2014; date of current version July 1, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Janusz Konrad. D. T. Vu and J. Lu are with the Advanced Digital Sciences Center, Singapore 138632 (corresponding author: J. Lu). B. Chidester and M. N. Do are with the University of Illinois at Urbana-Champaign, Champaign, IL 61820 USA. H. Yang is with the University of North Carolina, Charlotte, NC 28223 USA; this work was mainly done while H. Yang was working at ADSC. J. Lu thanks NVIDIA Corporation for providing the Tegra 3 prototype tablet. Digital Object Identifier 10.1109/TIP.2014.2329389.

I. INTRODUCTION

DESPITE the long history of research in stereo matching, or stereo correspondence, this well-researched area continues to inspire new methods and to attract the attention of researchers due to its continued and extensive relevance within computer vision. Examples of applications that can benefit from fast and reliable depth estimation from stereo matching exist in a diverse array of fields, ranging from computational photography, to robotics, to augmented reality, among others. For computational photography, depth information can enable the creation of novel artistic effects, such as depth-of-field rendering and depth-guided filtering, allowing for more creative and meaningful expressions of image content. To make these applications practical, the challenge remains to acquire accurate depth information in a computationally efficient manner, avoiding sophisticated optimization methods.

The significance of depth information for computational photography in particular has been demonstrated by the active search for novel depth-acquisition methods within this area of research. Examples of proposed methods include the work of Green et al. [2], which proposed a system of camera sensors to capture multiple images in a single exposure at different apertures and thereby estimate the depth of a scene. Levin et al. [3] and Bando et al. [4] proposed the addition of a special filter within the aperture of camera lenses (a patterned occluder or an RGB color filter) to estimate depth. The recently developed Lytro camera [5] houses a micro-lens array integrated on a digital image sensor to capture the ray directions of the entire light field and is one of the first attempts to commercialize a computational technique of depth acquisition. Despite these recent advances, and even the development of the Lytro camera, these techniques have not yet reached widespread consumer adoption. Furthermore, some of the recent techniques and hardware, such as the PiCam [6] developed by Pelican Imaging, cannot avoid the need for efficient correspondence search to align the images captured from its camera array and to generate depth maps.

Though many algorithms that emphasize time-sensitivity have been proposed to solve the stereo correspondence problem, it still remains open to advances in estimation reliability and computation reduction. Additionally, along with the traditional difficulties of occlusion and depth discontinuities that must be handled by such an algorithm, an algorithm must also handle the imperfections of real-world images. Many proposed algorithms perform well on undistorted images that have been captured under controlled illumination, such as the standard Middlebury test cases. Nevertheless, in a real scenario, they are either too slow or fail to handle slight radiometric distortions from the camera, especially in large textureless regions, resulting in unsatisfactory depth estimation for some applications, particularly within computational photography.
Therefore, to achieve aesthetic results for such applications, we propose an efficient stereo matching technique that estimates depth more robustly and reliably than other competing methods, even in challenging, real-life cases. Our method is based upon a recently popular Minimum Spanning Tree (MST) representation [1], but extends it to a hierarchical, coarse-to-fine representation, with non-local cost aggregation and color-guided depth refinement.

Furthermore, to demonstrate the effectiveness of our algorithm, and also motivated by the growing interest in depth-induced photo effects, we have evaluated it in an application within computational photography: post-capture refocusing. With the advent of mobile devices and the current popularity of mobile photography and image editing, this application is highly relevant. To this end, we also propose an accurate thin-lens model for synthetic depth-of-field rendering. Additionally, given the effectiveness of our stereo algorithm, we propose a system, called Scribble2focus, for interactive refocus-rendering using simple cues, either on the touch screen of a mobile device or on a PC. The system can operate on any setup that can capture stereoscopic images, such as a PC or a mobile device equipped with a pair of consumer webcams. Depending on the computational power of the capturing device, depth estimation can either be run on the device or carried out on a remote server before sending back the estimated depth map for interactive refocusing rendering. In fact, the proposed algorithms could also be used to process a stereoscopic pair of images captured from two viewpoints by a mobile device with only a single camera, once the image pair is rectified with the recovered epipolar geometry.

The paper is organized as follows. Section II reviews existing depth estimation methods, including methods using stereoscopic images and methods using hardware modification. Section III discusses our depth map estimation and refinement algorithm based on a multi-MST construction. Section IV discusses how we model and render a real-life, depth-of-field effect. Section V presents the overview of our interactive, mobile photo-refocusing system. Finally, Section VI evaluates our algorithm on both the standard Middlebury benchmark for stereo matching and real-life stereoscopic image pairs captured using an Android tablet. We also discuss our proposed application, Scribble2focus, in this section.

II. RELATED WORK

As previously mentioned, various methods of depth inference are currently available for computational photography applications, and we now provide a more thorough consideration.

Among the many proposed algorithms for stereo matching, each one is optimized for different criteria, but all still struggle to strike a desired balance between speed and inference accuracy. Generally, they can be grouped into one of two categories: global methods or local methods. Global methods enforce global consistency constraints upon the estimated depth map using regularization. The matching pixel pairs of the entire image are estimated simultaneously by minimizing a global energy function with a certain smoothness condition. To solve this minimization, various approaches exist. Among the most popular are dynamic programming [7], Markov networks [8], and graph cuts [9]. Though they do not solve the minimization exactly, they perform well in many challenging region types, such as textureless regions and along depth discontinuities. However, since this work is concerned with applications for which fast processing speed is necessary, these methods are not considered, as they are computationally prohibitive.

Dynamic programming methods attempt to reduce the computational burden of global methods, while maintaining some level of global connectedness for inference, by reducing the support of the smoothness constraint to individual scan-lines of the image, but this 1D relaxation usually suffers from the "streaking" effect due to the lack of enforced consistency between horizontal and vertical scan-lines.

Given the requirement of speed, local methods [10]–[12] are the most promising methods and are therefore the most comparable to the one proposed here. Local methods infer the disparity of each pixel independently, usually by comparing windowed regions around the reference pixel and the candidate matching pixel of the corresponding matching image. This approach results in faster, but often less reliable, depth inference. In particular, these methods struggle significantly in textureless regions due to matching ambiguity. The local methods that produce the best inference are those that are able to adaptively weight the support of the windows during matching to handle object boundaries. However, generating the weights is costly, as this amounts to translation-varying filtering. Although the guided filter [13] was recently proposed as an efficient method for computing adaptive support weights [14], it still struggles with these notoriously difficult textureless regions.

To have a better representation of color patches and texture, some researchers have proposed the use of segmented color images for stereo matching [15]–[17]. These region-based algorithms produce smooth disparity estimates inside homogeneous regions, and color segments can help reduce the computation time. However, trusting color segmentation alone is unreliable, as it may fail to provide a good representation of regions in the image, resulting in erroneous depth estimation.

Another important direction has been the adoption of tree-based structures for stereo matching. Veksler [18] and Deng and Lin [19] proposed to use tree-based graphical models to improve the performance of dynamic programming in solving the stereo correspondence problem. While Veksler [18] directly uses the image pixels as nodes in the tree, Deng and Lin [19] proposed that pixels of similar color should be grouped into line segments, with a tree constructed to connect all the line segments together. The tree-based dynamic programming algorithms are much faster than MRF-based global methods; however, the resulting accuracy is comparably weak and also outperformed by leading local methods [14], [20].
Fig. 1. Comparison between the non-local filter-based stereo matching algorithm [1] and our method on two challenging cases. (a) Original images. (b) Depth estimation result of the non-local filter algorithm (left) and our method (right). Depth-of-field rendering using (c) the non-local method's disparity map, and (d) our method's disparity map.
The most closely related work to the proposed algorithm is that of Yang, who proposed a non-local filter method [1]. This method uses a local pixel dissimilarity cost with a non-linear, tree-based aggregation scheme to produce results similar to those of global optimization algorithms without incurring similarly excessive computational complexity. The dissimilarity cost between the stereo image pair is computed at each pixel, as in standard local methods. However, the image is represented as a planar graph, with each pixel being a node. As in dynamic programming methods, the fully connected graph is reduced to a tree structure. In Yang's method, the chosen structure is a Minimum Spanning Tree (MST), which preserves connections between pixels of similar intensity. The aggregated cost at each pixel is computed by traversing the MST, so that every pixel contributes to the depth estimation of every other pixel, unlike standard local methods. Yang's algorithm produces competitive depth estimation with minimal computational cost. However, even this approach still has difficulty with some regions of typical, real-life images taken by commodity cameras, as shown in Fig. 1. In particular, the non-local filter fails to provide the correct depth estimates of large, textureless regions like the wall or the box of uniform color. Due to the particular illumination, the color intensity of the wall changes slightly from patch to patch, resulting in patches of different estimated disparity. Our method improves upon that of [1] to handle these challenging regions by introducing an additional, region-level MST, which significantly improves estimation quality while only slightly increasing the computation time.

Other depth inference methods that require hardware modifications of the camera have also been proposed, and this area of research has recently received more attention, due in part to the popularity of the Microsoft Kinect camera [21]. The Kinect camera uses active infra-red illumination to find the scene's depth map in real time. However, the produced depth map has poor resolution; furthermore, Kinect's active light system is susceptible to interference from other light sources and therefore might not work well with certain materials or in outdoor lighting. Coded aperture methods [3], [4] use a specially designed filter attached behind the camera's aperture to estimate the depth map of a scene from the depth-dependent, ray-diffusion characteristics of the filter. Zhou et al. [22] proposed the use of photometric cues, by turning the flash on and off during capture of stereoscopic images, to improve the quality of inference at depth boundaries and to overcome the challenge of occlusion. These methods, however, still require significant further development before they might be ready for widespread use. Our method, in contrast, only requires two stereo cameras; even a single camera, such as the camera of a mobile phone, could be used by simply capturing images from two distinct viewpoints, creating a parallax.

III. DEPTH FROM STEREOSCOPIC IMAGE PAIRS

In this section, we present our method for depth inference from calibrated stereoscopic images. For applications in such areas as computational photography, robotics, or augmented reality, a stereo matching algorithm must effectively handle the challenge of real-world images while meeting computational constraints. For the specific application to image refocusing and editing, it must also preserve the edges of the scene in the depth map, as the sharpness of object edges highly influences the perceived visual quality of the edited image. Inspired by the strengths of both the MST cost aggregation method of [1] and region-based stereo matching, we propose a region-based enhancement to the MST. We extend the method of Yang [1] by adding a second MST, which is created from a segmented version of the image, where each node is a superpixel from the segmented image. The region-level MST enables aggregation over a coarse scale of the image, which is helpful for large regions of uniform color and texture, while the pixel-level MST enables aggregation over a finer scale, which is helpful for edge boundaries. In summary, our stereo algorithm performs the following steps: pixel-level cost initialization, pixel-level and region-level MST construction, adaptive cost aggregation on both MSTs, a Winner-Take-All strategy to estimate the disparity map, and disparity map refinement using a non-local disparity refinement method [1] followed by Cross-based Local Multipoint Filtering (CLMF) [23].

A. Depth From Multiple Minimum Spanning Trees

Fig. 2 shows the flow chart of our depth estimation and refinement process. Given the calibrated stereo image pair, I_0 and I_1, a disparity map D is recovered such that a pixel I_0(p) at location p = (u, v) in the reference image I_0 and a pixel I_1(p_d) at location p_d ≡ p + (d, 0), a pixel horizontally displaced from p by d, in the matching image correspond to the same 3D point. A discrete disparity range, H = [d_min, d_max], is considered.
with I_1(p_d) denoting the horizontal gradient at the corresponding pixel. The two truncation parameters, T_i = 8 and T_g = 2, are set empirically to limit the negative impact of outliers. In experiments, the weight β is set to 0.11.

Yang [1] showed that cost aggregation on an MST produces quality disparity estimation. The MST connects all the vertices of the graph so that each pixel has support from every other pixel in the image, depending on their similarity, without the computationally expensive calculation of adaptive windows used in some accurate local methods. Following Yang's method [1], we construct the pixel-level MST by creating a planar graph G_P = (V_P, E_P) and applying to it Kruskal's algorithm [25].
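For illustration, a minimal sketch of this construction is given below (Python/NumPy). SciPy's generic minimum-spanning-tree routine stands in for an explicit Kruskal implementation, and the 4-connected grid with absolute intensity differences on a single-channel image is our assumption for the example, not a prescription of the exact implementation.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def pixel_level_mst(gray):
    """Pixel-level MST: nodes are pixels, edges connect 4-neighbors, and each edge
    weight is the absolute intensity difference between the two pixels."""
    h, w = gray.shape
    idx = np.arange(h * w).reshape(h, w)
    g = gray.astype(np.float64)
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    weights = np.concatenate([np.abs(g[:, :-1] - g[:, 1:]).ravel(),
                              np.abs(g[:-1, :] - g[1:, :]).ravel()])
    # A tiny epsilon keeps zero-weight edges from being dropped by the sparse routine.
    graph = coo_matrix((weights + 1e-6, (rows, cols)), shape=(h * w, h * w))
    return minimum_spanning_tree(graph)  # sparse matrix whose nonzeros are the MST edges
```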
For the region-level tree, the image is first over-segmented into superpixels using the SLIC algorithm [24] (Fig. 3b), and a representative intensity I_S is computed for each superpixel from all the pixels p ∈ S inside the superpixel S. After segmentation, the resulting superpixel image does not form a regular grid, so to represent the image as a graph G_R = (V_R, E_R), each superpixel is represented by a node, and connections are made between the nodes on an eight-connected regular grid. However, as shown in Fig. 4, this results in connections between nodes that are not truly "neighbors", such as the red superpixel that is incorrectly connected to the superpixel R. For superpixels, we consider two nodes to be neighbors only if some pixel from one superpixel is a neighbor of a pixel from the other superpixel.
Fig. 3. Result from each step of disparity map estimation. (a) Original images. (b) Segmentation images from the SLIC algorithm [24]. (c) Edge density images. (d) Disparity maps without the CLMF-1 refinement [23]. (e) Disparity maps after the CLMF-1 refinement [23].
Fig. 4. Region graph builder. Superpixels resulting from the SLIC over-segmentation method are treated as nodes on an eight-connected undirected regular graph. The red colored superpixel does not share any neighboring pixel with superpixel R; therefore, the edge between them is penalized.
To prevent the error in aggregation that would be caused by connections between non-neighbors, we penalize these edges; the weight of an edge between two neighboring superpixels S and T is given by

\omega_R(S, T) = |I_S - I_T|.   (3)

Finally, as in the construction of the pixel-level MST, we apply Kruskal's algorithm [25] on G_R to obtain the resulting region-level MST. Fig. 5 shows how the region-level MST looks in a patch of the segmented Lamp Shade image.

Fig. 5. Close-up examination of the multiple Minimum Spanning Trees. Row 1: (a) Original Lamp Shade image. (b) A close-up region from the original image with the red test point. Row 2: (a) Pixel-level MST. (b) Region-level MST. Row 3: The weighted contribution of all the pixels inside the close-up region to the test point according to (a) the pixel-level MST and (b) the region-level MST. Row 4: Depth estimation result (a) using the pixel-level MST only and (b) using our method with adaptive fusion of the pixel-level and region-level MSTs.
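A compact sketch of the region-level construction is given below (Python/NumPy). It assumes, as the truncated definition above suggests, that I_S is the average intensity over the pixels of S, and it realizes the penalty on spurious connections simply by keeping only superpixel pairs that share a pixel boundary; these are choices made for the example, not a definitive implementation.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def region_level_mst(gray, labels):
    """Region-level MST: nodes are SLIC superpixels, edges connect superpixels that
    share at least one pixel boundary, and edge weights follow Eq. (3), |I_S - I_T|."""
    num = labels.max() + 1
    sums = np.bincount(labels.ravel(), weights=gray.ravel().astype(np.float64), minlength=num)
    counts = np.bincount(labels.ravel(), minlength=num)
    region_intensity = sums / counts
    # Label pairs that touch across horizontal or vertical pixel boundaries.
    pairs = np.concatenate([
        np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1),
        np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)])
    pairs = np.unique(np.sort(pairs[pairs[:, 0] != pairs[:, 1]], axis=1), axis=0)
    w = np.abs(region_intensity[pairs[:, 0]] - region_intensity[pairs[:, 1]]) + 1e-6
    graph = coo_matrix((w, (pairs[:, 0], pairs[:, 1])), shape=(num, num))
    return minimum_spanning_tree(graph)
```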
3) Adaptive Fusion of Pixel-Level and Region-Level Costs: Once the pixel-wise cost for each disparity has been created, as well as the MSTs, the cost must be aggregated for each pixel. On each of the MSTs, we employ the non-local cost aggregation of Yang [1] to find the cost at each node. Let us consider the MST structure T(V, E), where each V_i ∈ V is a node and each E_j ∈ E is an edge of T. We calculate the distance D(V_i, V_j) of the path P(V_i, V_j) connecting V_i with V_j as the sum of all edge weights ω along the path,

D(V_i, V_j) = \sum_{\omega_E \in P(V_i, V_j)} \omega_E.   (4)
For each node V_i, C_d(V_i) denotes the matching cost for disparity d and C_d^A(V_i) denotes the aggregated cost. Based on the MST structure, we aggregate the cost for each node non-locally, with weighted support from every other node in the tree T(V, E):

C_d^A(V_i) = \sum_{V_j \in T} W(V_i, V_j) \, C_d(V_j),   (5)

where the weight W(V_i, V_j) is calculated as an exponential function of the distance D(V_i, V_j),

W(V_i, V_j) = \exp\left(-\frac{D(V_i, V_j)}{\sigma}\right).   (6)

We use σ to control the support in the cost aggregation process over the nodes. If we increase the σ value, distant nodes on the tree can provide a larger contribution. However, a large σ value involves a trade-off: a large contribution from far-away nodes is good for low-texture regions, but it creates an error-propagation problem for sharp edges and thin objects. Through experiments, we set the σ value of both the pixel-level and region-level MSTs to 0.1. Fig. 5 shows the weighted contribution of nodes on the pixel-level and region-level MSTs to a test point within a patch of the image.

In the cost aggregation process, Yang [1] utilizes the MST structure to efficiently compute the aggregated cost by two traversals of the tree. The aggregated costs are computed recursively by reusing the sums at previously visited nodes on the MST. Hence, the algorithm only requires a few operations per node. We apply this process on both the pixel-level and region-level MSTs to obtain C_d^A(p) and C_d^A(R) for each node.
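A minimal sketch of this two-pass aggregation is given below (Python/NumPy). It assumes the MST has already been rooted and its nodes sorted root-first (e.g., by breadth-first search); array names such as parent_sim are ours, and the leaf-to-root and root-to-leaf recursions follow the scheme described in [1].

```python
import numpy as np

def tree_aggregate(cost, parent, order, parent_sim):
    """Non-local cost aggregation on a tree in two traversals.
    cost:       (num_nodes, num_disparities) initial matching cost per node.
    parent:     parent[i] is the parent of node i; the root satisfies parent[r] == r.
    order:      node indices sorted so that every parent precedes its children.
    parent_sim: parent_sim[i] = exp(-w(i, parent[i]) / sigma); unused for the root."""
    up = cost.astype(np.float64).copy()
    # Pass 1 (leaf-to-root): accumulate each subtree into its subtree root.
    for i in reversed(order):
        p = parent[i]
        if p != i:
            up[p] += parent_sim[i] * up[i]
    agg = up.copy()
    # Pass 2 (root-to-leaf): add the support that reaches each node from its parent side.
    for i in order:
        p = parent[i]
        if p != i:
            agg[i] = parent_sim[i] * agg[p] + (1.0 - parent_sim[i] ** 2) * up[i]
    return agg
```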
After aggregating the cost for both the pixel-level and region-level MSTs, we can now see the shortcomings of the pixel-level MST when it alone defines the aggregation. Consider the case in Fig. 5. It can be seen that the pixel-level MST accurately estimated the weighted contribution of pixels near color and depth discontinuities. However, in the textureless region, the weighted contribution of neighboring pixels in the pixel-level MST was split among patches. This effect leads to wrong depth estimation inside the textureless region. The region-level MST outperforms the pixel-level MST in this case, as it is able to correctly aggregate support within the uniformly yellow region and more accurately estimate the depth for this region.

The key consideration is how to fuse the aggregation from each of the MSTs to produce more accurate depth estimation. Ideally, the region-level cost aggregation at a coarser level would act to complement the finer, pixel-level aggregation, and vice versa. In a textureless region, which has no depth discontinuities, the region-level MST should dominate the cost aggregation and the resulting depth estimation. However, in a region of rich texture, the algorithm should rely more on the finer, pixel-level cost. In our algorithm, we use the edge density as a measure of the level of texture within a region. The pixel-level and region-level aggregated costs are then adaptively blended in an unsupervised fashion, according to the edge density, as follows:

C_d^A(p) = \alpha_R C_d^A(p) + (1 - \alpha_R) C_d^A(R),   (7)

where p ∈ R and α_R is the edge density of the region R. To find the edge density, as shown in Fig. 3c, we first apply the Canny edge detector [26] to find edges in the image and then calculate the density as the ratio of the number of edge pixels, N_e, to the number of pixels in the region, N_R:

\alpha_R = \frac{N_e}{N_R}.   (8)

Finally, a WTA optimization is applied to find the best disparity value at each pixel based on the combined aggregated cost C_d^A(p). Fig. 3d shows the resulting depth maps.
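A sketch of this fusion step is shown below (Python with OpenCV for the Canny detector). The Canny thresholds, array layout, and function name are illustrative assumptions rather than the exact settings used in our implementation.

```python
import cv2
import numpy as np

def fuse_costs(pixel_cost, region_cost, labels, gray):
    """Adaptive fusion of Eqs. (7)-(8): blend pixel-level and region-level aggregated
    costs using the per-region edge density as the blending weight.
    pixel_cost:  (H, W, D) cost aggregated on the pixel-level MST.
    region_cost: (R, D) cost aggregated on the region-level MST, indexed by superpixel id.
    labels:      (H, W) superpixel label of each pixel.
    gray:        (H, W) uint8 grayscale image used for edge detection."""
    edges = (cv2.Canny(gray, 50, 150) > 0).astype(np.float64)
    num_regions = labels.max() + 1
    edge_count = np.bincount(labels.ravel(), weights=edges.ravel(), minlength=num_regions)
    pixel_count = np.bincount(labels.ravel(), minlength=num_regions)
    alpha = edge_count / pixel_count                 # Eq. (8): edge density per region
    a = alpha[labels][..., None]                     # broadcast alpha_R to every pixel
    return a * pixel_cost + (1.0 - a) * region_cost[labels]   # Eq. (7)
```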
B. Disparity Map Refinement

For non-global methods, it is common practice to apply some form of post-processing refinement, usually relying upon smoothness or consistency constraints, to the generated disparity map D. We apply a two-step refinement process. We first employ the non-local refinement method of Yang [1]. In this method, we consider in turn the left and right images as the reference and find their respective disparity maps. A mutual consistency check is then employed to detect pixel pairs with consistent disparity values; if pixels are found to be consistent, they are marked as stable. A new pixel dissimilarity cost is assigned to each pixel based on its stability:

C_d^{new}(p) = \begin{cases} |d - D(p)| & \text{if } p \text{ is stable and } D(p) > 0, \\ 0 & \text{otherwise.} \end{cases}   (9)

We then run pixel-level aggregation on the pixel-level MST again. The costs of the stable pixels, and ultimately their depth values, are propagated to the unstable pixels, providing a more consistent depth map. However, this process mostly improves only incorrect disparities that are caused by occlusion. In the case of large textureless regions, pixels might be wrongly classified as stable due to the ambiguity of matching. Therefore, we need a second refinement step that smooths large textureless regions and provides sharp depth discontinuities along object boundaries.
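The mutual consistency check can be sketched as follows (Python/NumPy). It assumes rectified views with positive disparities and the usual left-to-right convention, so the match of a left pixel lies d columns to its left in the right image; the tolerance and function name are ours.

```python
import numpy as np

def mutual_consistency(disp_left, disp_right, tol=1):
    """Return a boolean map of 'stable' pixels whose left and right disparities agree
    within tol; these are the pixels whose cost is re-assigned by Eq. (9)."""
    h, w = disp_left.shape
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    matched_cols = np.clip(cols - disp_left.astype(int), 0, w - 1)
    right_at_match = disp_right[np.arange(h)[:, None], matched_cols]
    stable = np.abs(disp_left - right_at_match) <= tol
    return stable & (disp_left > 0)
```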
To accomplish this, we use the color image for guidance. The basic idea is to enforce the depth map to be coherent with the color image, based on the assumption that a local region of pixels with similar color is likely to lie in the same disparity plane. The refined disparity map D_out is a locally filtered version of the original depth map, where the weights are based on the spatial and color similarity of pixels in a local neighborhood N(p):

D_{out}(p) = \sum_{j \in N(p)} w_{pj} D(j).   (10)

One commonly used weighting scheme is the bilateral filter. Here, we use a faster, O(1), and more effective version called the cross-based local multipoint filter (CLMF-1), which was recently proposed in [23]. The reason we choose CLMF-1 over another well-known O(1) filtering technique, the guided image filter (GF) [13], is that the GF's kernel cannot handle more than two local color-line models; it has the tendency to blend them together, creating a color-blending effect when performing edge-preserving image smoothing.
TABLE I
SYMBOLS AND DESCRIPTIONS FOR d.o.f. CALCULATION

Fig. 7. If an object is out of the focus range, it will create a circle of confusion on the image plane.

TABLE II
SYMBOLS AND DESCRIPTIONS FOR CIRCLE OF CONFUSION SIZE CALCULATION

We calculate the size of the circle of confusion of each point that lies outside of the d.o.f. using the similar triangle formula, as in Fig. 7:

\frac{C_p}{f/N} = \frac{z_p - z_U}{z_p}, \qquad C_p = \frac{(z_p - z_U) \, f/N}{z_p}.   (15)

In Eq. (15), z_p stands for the behind-lens distance of point P. The behind-lens distance can be directly converted from the object's distance to the camera using the thin-lens model, as shown in Fig. 12. Tables I and II summarize the symbols used for the d.o.f. calculation and the circle-of-confusion size calculation, respectively.
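For concreteness, the following sketch (Python; the function names and example values are ours) evaluates the behind-lens distance and the circle-of-confusion diameter of Eq. (15) for a point at a given depth; the magnitude is used, since only the blur size matters.

```python
def behind_lens_distance(Z, f):
    """Thin-lens model: z = f * Z / (Z - f), with Z and f in the same unit (e.g., mm)."""
    return f * Z / (Z - f)

def circle_of_confusion(Z_p, Z_U, f, N):
    """Diameter C_p of the circle of confusion (Eq. (15)) for a point at depth Z_p
    when the camera is focused at depth Z_U; f is the focal length, N the f-number."""
    z_p = behind_lens_distance(Z_p, f)   # image-side distance of the point
    z_U = behind_lens_distance(Z_U, f)   # image-side distance of the in-focus plane
    return abs(z_p - z_U) * (f / N) / z_p

# Example: a 13.11 mm lens at f/2.8 focused at 1.5 m, for a point at 3 m.
print(circle_of_confusion(3000.0, 1500.0, 13.11, 2.8))
```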
B. Depth-of-Field Rendering Using a Gaussian Point Spread Function

To render a convincing and realistic refocused image I_R based on the proposed physical model of d.o.f., we employ two primary techniques in the rendering process. Firstly, each pixel that is outside of the d.o.f. must be diffused throughout the region of its circle of confusion, which we accomplish through convolution with a blur kernel whose spread is based on the diameter of the circle of confusion. Secondly, in-focus objects must have sharp edges that do not leak into the blurred background and foreground areas, so we adapt the support of the blur kernel based on the depth of neighboring pixels.

Since our depth map refinement process, which uses the color image as guidance, provides sharp, well-aligned depth boundaries, we can safely rely on the produced depth map D_out not to cause edge-leaking problems in the refocused image. A hard threshold on the depth is set based on the previously calculated d.o.f., so that, if the depth value of a pixel is within the d.o.f., then its color intensity is stored directly in the refocused image I_R, and any blur kernel that is applied to the image excludes this point from its support. For other points, which are outside of the d.o.f., an adaptive Gaussian PSF is applied. The diffusion of light rays from an out-of-focus object onto the camera's image plane is similar to the distribution of the energy of a Gaussian kernel of appropriate spread. Therefore, the d.o.f.-rendered image I_R is generated according to the following relationship:

I_R(p) = \begin{cases} I(p) & D_{out}(p) \in [Z_N, Z_F], \\ (I * G)(p) & D_{out}(p) \notin [Z_N, Z_F], \end{cases}   (16)

where the filter coefficient of the Gaussian PSF G_p(u, v) is defined for each point p adaptively, with relative image coordinate (u, v), as

G_p(u, v) = \frac{1}{2\sigma_p^2} \exp\left(\frac{-(u^2 + v^2)}{2\sigma_p^2}\right) \cdot \delta\left(D_{out}(p + (u, v))\right).   (17)

In Eq. (16), ∗ denotes the convolution operator. The binary function δ(D_out(p + (u, v))) in Eq. (17) evaluates whether D_out(p + (u, v)) ∉ [Z_N, Z_F]. If the condition is true, the filter kernel mask at point p + (u, v) is activated; otherwise, it is set to 0 to avoid color bleeding between adjacent in-focus and out-of-focus regions. To model the out-of-focus effect, points in the scene with a larger circle of confusion should be more heavily blurred and points with a smaller circle of confusion should incur less blur. Therefore, we relate the spread of the kernel, parametrized by σ_p, linearly to the circle of confusion of each pixel p as follows:

\sigma_p = K \cdot \frac{C_p}{p_s},   (18)

where K controls the linear relation and p_s is the sensor's pixel size. Although we have assumed a circular aperture in this paper, other real-life aperture shapes could also be implemented to further improve this artistic effect.
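A direct, unoptimized rendering sketch of Eqs. (16)–(18) is given below (Python/NumPy). It assumes a color image and a per-pixel sigma map already derived from the circle of confusion via Eq. (18), and it normalizes each masked kernel to unit sum to preserve brightness, a choice made here in place of the analytic constant of Eq. (17).

```python
import numpy as np

def render_dof(image, depth, sigma_map, z_near, z_far, radius=10):
    """Pixel-adapted Gaussian blur: in-focus pixels are copied unchanged (Eq. (16), first
    case); out-of-focus pixels are blurred with a spatially varying Gaussian whose
    support excludes in-focus pixels (the binary mask delta(.) of Eq. (17))."""
    h, w = depth.shape
    in_focus = (depth >= z_near) & (depth <= z_far)
    out = image.astype(np.float64).copy()
    vv, uu = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    for y in range(h):
        for x in range(w):
            if in_focus[y, x]:
                continue
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dv = vv[y0 - y + radius:y1 - y + radius, x0 - x + radius:x1 - x + radius]
            du = uu[y0 - y + radius:y1 - y + radius, x0 - x + radius:x1 - x + radius]
            s = sigma_map[y, x]
            kernel = np.exp(-(du ** 2 + dv ** 2) / (2.0 * s ** 2))
            kernel *= ~in_focus[y0:y1, x0:x1]          # drop in-focus support (delta mask)
            kernel /= kernel.sum() + 1e-12
            out[y, x] = np.sum(kernel[..., None] * image[y0:y1, x0:x1], axis=(0, 1))
    return out
```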
Fig. 8 shows our d.o.f. rendering results, as well as the value of sigma at each pixel, which is displayed in the lower graph, for various user strokes. The graph shows how the value of sigma changes according to the depth of the scene and the user-selected d.o.f.

V. INTERACTIVE MOBILE PHOTO REFOCUSING SYSTEM

Though any camera pair can be used to capture stereoscopic images, motivated by the mobile computational photography trend, we specifically pick NVIDIA's Tegra 3 Android tablet [27] as the image acquisition and manipulation device.
Fig. 8. Depth-of-field rendering results with different focus regions. Green strokes are the users' strokes to select in-focus areas. Row 1: Estimated and refined depth maps. Row 2: Original all-in-focus color images. Rows 3–5: Rendering results. Row 6: Gaussian PSF sigma value (blue) and estimated disparity value (red) of each pixel on the red horizontal line. This figure is best viewed on screen.
It is a powerful tablet, housing a quad-core CPU and a dedicated GPU. At the current stage of development, we use the tablet's stereo cameras to acquire images. The stereo image pairs are then transferred to the PC for testing of our stereo matching and refocus-rendering algorithms. In addition to robustness, all of the components of our algorithm are designed with close attention to efficiency, so that the system may easily be ported as an Android application to run on the tablet, which is our immediate future work.

A. System Overview

Fig. 9 presents our system design, which is separated into interactive and pre-processing sections. The first step is the calibration of the stereo cameras, which involves calculating the intrinsic and extrinsic parameters of the cameras. This is accomplished by the standard checkerboard calibration method.
Fig. 10. Results on four Middlebury stereo vision datasets: Tsukuba, Venus, Teddy, and Cones. (a) Original left image. Disparity maps of (b) the non-local cost aggregation method [1], (c) our method without CLMF depth refinement, (d) our method with CLMF depth refinement, and (e) the ground truth.
The results reported in Table III were therefore generated without the use of the CLMF refinement. The better quantitative performance that is achieved by bypassing this additional refinement step is likely due to the disparity averaging that is performed during the color-guided refinement, in addition to discrepancies with the integer-valued ground truth. This additional refinement creates sharper edges, which is desirable for our refocusing application, but it may diffuse incorrect depth values into surrounding regions through convolution with the non-linear kernel. These errors do not impact the visual quality of our refocused images, but if quantitative performance is the primary consideration, it may be preferable to leave out this additional processing step. Generally, our algorithm achieved the rank of 20th (out of nearly 170 methods) as of March 2014 while having competitive runtime efficiency. Our algorithm is ranked above some sophisticated local methods, such as PatchMatch stereo (rank 26th) [32] and the original non-local filter (rank 40th) [1].

To further evaluate our method's performance, we use the challenging Middlebury 2006 dataset. The second row of Fig. 11 shows the results of our depth estimation and refinement method. The red-colored pixels mark errors in the estimated disparity map as compared to the ground truth of each test case. Generally, our stereo matching method excels in large, low-texture regions and provides good edge preservation because of the guidance of the color image in the refinement process. The disparities of large, uniform-color objects, such as the yellow boxes in Lamp Shade and the wall in Middlebury, are well estimated with piece-wise smooth disparity values. The edges in all cases are sharp and well preserved with few errors, especially in the Wood case. A visual comparison with the non-local method [1] is provided in Fig. 1.

For the purpose of evaluating the processing speed, we have run the source code of the non-local algorithm [1] on our quad-core CPU with a single-core implementation. The non-local algorithm takes an average processing time of 0.7 seconds, while our method takes an average processing time of 1.01 seconds on the Middlebury 2003 dataset. Such a tree-based aggregation structure provides a speed advantage over other stereo matching algorithms: OverSegmBP [38] takes 50 seconds, GlobalGCP [36] takes 130 seconds, and FastBilateral [39] takes 32 seconds. Compared to the non-local method, our algorithm is slower by 0.31 seconds, which is due to the additional computation required to build the region-level MST and perform region-level cost aggregation. However, we believe that the increase in estimation accuracy, as shown by the difference between rankings on the Middlebury dataset, and the better handling of challenging cases, as shown in Fig. 1b, is worth the small trade-off in computation time for most of the applications that we have considered, especially computational photography applications.
TABLE III
MIDDLEBURY STEREO MATCHING EVALUATION
Fig. 11. Depth estimation results of our method on the Middlebury 2006 dataset. The first row presents the original left color images. The second row presents our depth estimation results together with error pixels (in red) in comparison with the ground truth. The third row shows the synthetic refocusing effect based on the depth estimation results.
B. Scribble2focus – An Interactive Photo Refocusing System

We experiment with our interactive, post-capture refocusing application, using the NVIDIA Tegra 3 tablet to capture several test images under different indoor and outdoor conditions. We manually set the camera parameters to capture all-in-focus images. These images are passed to our Scribble2focus application, which performs image rectification, image calibration, depth estimation, and d.o.f. effect rendering. The calibrated images have a resolution of 640 × 360. Our PC required an average of 0.7 seconds to calculate the d.o.f. model from the color and depth images and render the d.o.f. effect according to the input scribble of the user. Fig. 8 shows the input and output results of our Scribble2focus application. The first row of the figure shows our depth estimation and refinement results on three real-world cases captured by the tablet. The result is not perfect, as our method still incurs errors in low-contrast areas or areas of highly varying texture. However, the visual quality of the estimated depth is good; we can easily identify objects of different depth with well-preserved depth discontinuities. Through the Scribble2focus interface, users simply mark a green scribble on the captured color images to indicate their region of interest.
TABLE IV
SSIM COMPARISON OF THE NON-LOCAL AGGREGATION METHOD [1] WITH AND WITHOUT OUR CLMF REFINEMENT, AND OUR METHOD
Rows 3–5 of Fig. 8 show the resulting true-to-life d.o.f. effects, which are rendered according to the users' scribbles. The last row of Fig. 8, which shows the adaptive Gaussian PSF's sigma value for a scanline from each image, provides intuition about the blurring effect for a given selected region. The plots also justify the importance of our physical, thin-lens-based refocusing model, as the relationship between the Gaussian blurring level and the disparity value is not straightforward. In fact, the blur kernel size for a given point is jointly decided by the user-scribble placement, the stereo image depth range, and the camera-specific parameters such as N, f, and p_s.

1) Chairs Test Case in Fig. 8(a): In this test case, each chair lies in a different depth layer and is quite far apart from the others. The color image has relatively low texture because each chair consists mostly of one color. Note that another challenge in this test case is the whiteboard, which causes a strong reflection and also creates a strong color border with the wall, which might result in different depth values on either side of the border. Our method successfully estimates the depth layer of each chair and also of the wall, and the depth map inside each layer is very smooth. The algorithm also does not make any mistakes with the wall and whiteboard. The d.o.f. rendering according to each selected chair also shows that our algorithm preserves the edges well, as the color of the chair does not leak into the surrounding out-of-focus areas.

2) Two People Test Case in Fig. 8(b): This test case was also taken indoors, and the colors of the plain wall and the two main people are similar. Our algorithm is able to estimate and separate depth layers robustly, though some errors still exist at the transition boundary between depth layers. The main reason for this is the low contrast of the image. However, when we apply the d.o.f. rendering, the result is still visually acceptable.

3) Outdoor Test Case in Fig. 8(c): This test case was taken outdoors under strong sunlight. This setup is also challenging, as it contains slanted surfaces, thin objects, and large textureless regions. Our estimated depth map is visually acceptable: it clearly shows objects with correct depth discontinuities. Note that our method can even detect the connection between the head of the motorbike and the mirror. More importantly, our refocused images look visually plausible, creating convincing d.o.f. effects based upon the user's strokes.

4) Middlebury 2006 Dataset in Fig. 11: To further demonstrate the performance of our application, we also test our d.o.f. rendering algorithm on the depth estimation results for the Middlebury 2006 dataset. According to the information that we were able to gather from the Middlebury 2006 dataset's website [41] and Scharstein and Szeliski's paper discussing the dataset [42], we assume that these images were captured with focal length f = 13.11 mm and baseline b = 160 mm. The last row of Fig. 11 shows our final rendering results for different input strokes from the user, drawn in green.

Fig. 1(c) and 1(d) show a comparison between d.o.f. rendering on the depth estimation results from the non-local method [1] and from our method. The cropped region clearly shows how incorrect depth estimation in a textureless region creates undesirable visual defects on the d.o.f.-rendered image. Additionally, to evaluate the quality of our resulting refocused images, we compared our method to the non-local aggregation method [1] quantitatively using the popular SSIM metric [43] for visual quality. The reference refocusing image is generated by feeding the ground-truth disparity map into our thin-lens-based computational refocusing model. As shown in Table IV, our method outperforms the non-local aggregation method both with and without our CLMF refinement, though our CLMF refinement improves the refocusing results of the original non-local method.
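The SSIM computation itself is straightforward; a minimal example using scikit-image is shown below (our illustration only: the luminance-only comparison and the function name refocus_ssim are choices made here, not part of the evaluation code behind Table IV).

```python
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

def refocus_ssim(rendered_rgb, reference_rgb):
    """SSIM between a refocused image and the reference rendered from the ground-truth
    disparity, computed on the luminance channel; inputs are float RGB arrays in [0, 1]."""
    return structural_similarity(rgb2gray(rendered_rgb), rgb2gray(reference_rgb),
                                 data_range=1.0)
```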
VII. CONCLUSION

We have proposed an efficient stereo matching algorithm for fast processing that is based on pixel-level and region-level MST representations of a stereo pair of images. Fusion of aggregation over these MSTs, one being of a finer resolution of the image and one being of a coarser resolution, allows for better depth estimation over large, textureless regions while still preserving depth discontinuities at object boundaries. The result of our depth estimation method is superior to that of state-of-the-art local methods on the Middlebury benchmark. Experiments show that our method performs exceptionally well in the notoriously difficult low-texture regions and is able to preserve sharp depth discontinuities. Furthermore, our method provides depth inference of high visual quality on challenging, real-world cases captured under different indoor and outdoor conditions. The addition of color-guided filtering of the disparity map using CLMF refines edge boundaries, resulting in sharper disparity discontinuities at object borders, which we have shown is important for depth-based computational photography applications. We plan to improve our algorithm's performance in areas of low contrast through further research into robust matching measures and improved inference.

We also evaluated this method in an application of refocus-rendering from computational photography. For this application, we derived a precise model of the lens to achieve a realistic d.o.f. effect. Motivated by this application, we presented an interactive depth-of-field rendering application named Scribble2focus that uses our proposed stereo matching algorithm to estimate depth. Our application enables the user to easily and interactively create a real-life, depth-of-field effect by simply drawing a stroke through the region to be emphasized. We carefully studied the physical model of a real-life, depth-of-field effect and utilized the acquired depth map to guide the depth-based, pixel-adapted Gaussian blurring in the rendering process. The Scribble2focus application provides users with quality, artistic, depth-of-field images with little effort. In future work, we plan to optimize the algorithm and port the entire framework to Android devices.
APPENDIX A
PROOF OF DEPTH OF FIELD RANGE

According to Fig. 6, we use the similar triangle formula to calculate z_F as

\frac{C_T}{f/N} = \frac{z_U - z_F}{z_F} \;\Rightarrow\; z_F = \frac{f z_U}{C_T N + f}.

Fig. 12 shows how we use the thin-lens model to calculate the behind-lens distance. Application of this model gives another expression of z_F and z_U in the form

z_F = \frac{f Z_F}{Z_F - f}, \qquad z_U = \frac{f Z_U}{Z_U - f}.

Therefore,

z_F = \frac{f Z_F}{Z_F - f} = \frac{f z_U}{C_T N + f} \;\Rightarrow\; Z_F = \frac{f z_U}{z_U - C_T N - f}.

After substituting z_U, the farthest distance at which an object remains in focus is

Z_F = \frac{f^2 Z_U}{f^2 - C_T N (Z_U - f)}.

Similarly, we calculate the nearest distance at which an object remains in focus as

Z_N = \frac{f^2 Z_U}{f^2 + C_T N (Z_U - f)}.

Fig. 12. Relationship between the focal length f, the object's distance to the camera Z_U, and the object's image behind-lens distance z_U. According to the physical model of the thin lens, z_U can be expressed in the form z_U = f Z_U / (Z_U − f).
ACKNOWLEDGMENT

This work was supported by the research grant for the Human Sixth Sense Programme at the Advanced Digital Sciences Center from Singapore's Agency for Science, Technology and Research (A*STAR).

REFERENCES

[1] Q. Yang, "A non-local cost aggregation method for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1402–1409.
[2] P. Green, W. Sun, W. Matusik, and F. Durand, "Multi-aperture photography," ACM Trans. Graph., vol. 26, no. 3, pp. 68:1–68:7, Jul. 2007.
[3] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, "Image and depth from a conventional camera with a coded aperture," ACM Trans. Graph., vol. 26, no. 3, pp. 70:1–70:9, Jul. 2007.
[4] Y. Bando, B.-Y. Chen, and T. Nishita, "Extracting depth and matte using a color-filtered aperture," ACM Trans. Graph., vol. 27, no. 5, pp. 134:1–134:9, Dec. 2008.
[5] (2014, Jun. 13). About the Camera, Lytro, Mountain View, CA, USA [Online]. Available: http://www.lytro.com/
[6] K. Venkataraman et al., "PiCam: An ultra-thin high performance monolithic camera array," ACM Trans. Graph., vol. 32, no. 6, pp. 166:1–166:13, Nov. 2013.
[7] A. F. Bobick and S. S. Intille, "Large occlusion stereo," Int. J. Comput. Vis., vol. 33, no. 3, pp. 181–200, 1999.
[8] J. Sun, Y. Li, S. B. Kang, and H.-Y. Shum, "Symmetric stereo matching for occlusion handling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 399–406.
[9] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," in Proc. Int. Conf. Comput. Vis., Jul. 2001, pp. 508–515.
[10] K.-J. Yoon and I. S. Kweon, "Adaptive support-weight approach for correspondence search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 650–656, Apr. 2006.
[11] O. Veksler, "Fast variable window for stereo correspondence using integral images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2003, pp. 556–561.
[12] K. Zhang, J. Lu, and G. Lafruit, "Cross-based local stereo matching using orthogonal integral images," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp. 1073–1079, Jul. 2009.
[13] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 1–14.
[14] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz, "Fast cost-volume filtering for visual correspondence and beyond," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3017–3024.
[15] Y. Wei and L. Quan, "Region-based progressive stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun./Jul. 2004, pp. 106–113.
[16] L. Hong and G. Chen, "Segment-based stereo matching using graph cuts," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun./Jul. 2004, pp. 74–81.
[17] Y. Deng, Q. Yang, X. Lin, and X. Tang, "A symmetric patch-based correspondence model for occlusion handling," in Proc. 10th Int. Conf. Comput. Vis., Oct. 2005, pp. 1316–1322.
[18] O. Veksler, "Stereo correspondence by dynamic programming on a tree," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 384–390.
[19] Y. Deng and X. Lin, "A fast line segment based dense stereo algorithm using tree dynamic programming," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 201–212.
[20] J. Lu, H. Yang, D. Min, and M. N. Do, "PatchMatch filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1854–1861.
[21] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1297–1304.
[22] C. Zhou, A. Troccoli, and K. Pulli, "Robust stereo with flash and no-flash image pairs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 342–349.
[23] J. Lu, K. Shi, D. Min, L. Lin, and M. N. Do, "Cross-based local multipoint filtering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 430–437.
[24] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[25] J. B. Kruskal, "On the shortest spanning subtree of a graph and the traveling salesman problem," Proc. Amer. Math. Soc., vol. 7, no. 1, pp. 48–50, Feb. 1956.
[26] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, Jun. 1986.
[27] (2014, Jun. 13). Tegra Developer Zone, NVIDIA, Santa Clara, CA, USA [Online]. Available: http://developer.nvidia.com/tegra-start
[28] A. Adams et al., "The Frankencamera: An experimental platform for computational photography," ACM Trans. Graph., vol. 29, no. 4, pp. 29:1–29:12, Jul. 2010.
[29] A. C. Shumate and H. Li. Gray World Algorithm. [Online]. Available: http://scien.stanford.edu/pages/labsite/2000/psych221/projects/00/trek
[30] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. J. Comput. Vis., vol. 47, nos. 1–3, pp. 7–42, Apr. 2002.
[31] (2014, Jun. 13). Middlebury Stereo Vision, Middlebury College, Middlebury, VT, USA [Online]. Available: http://vision.middlebury.edu/stereo/
[32] M. Bleyer, C. Rhemann, and C. Rother, "PatchMatch stereo—Stereo matching with slanted support windows," in Proc. Brit. Mach. Vis. Conf., 2011, pp. 14.1–14.11.
[33] X. Mei, X. Sun, W. Dong, H. Wang, and X. Zhang, "Segment-tree based cost aggregation for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 313–320.
[34] D. Min, J. Lu, and M. N. Do, "Joint histogram-based cost aggregation for stereo matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 10, pp. 2539–2545, Oct. 2013.
[35] D. Chen, M. Ardabilian, X. Wang, and L. Chen, "An improved non-local cost aggregation method for stereo matching based on color and boundary cue," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2013, pp. 1–6.
[36] L. Wang and R. Yang, "Global stereo matching leveraged by sparse ground control points," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3033–3040.
[37] C. Lei, J. Selzer, and Y.-H. Yang, "Region-tree based stereo using dynamic programming optimization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 2378–2385.
[38] C. L. Zitnick and S. B. Kang, "Stereo for image-based rendering using image over-segmentation," Int. J. Comput. Vis., vol. 75, no. 1, pp. 49–65, Oct. 2007.
[39] S. Mattoccia, S. Giardino, and A. Gambini, "Accurate and efficient cost aggregation strategy for stereo correspondence based on approximated joint bilateral filtering," in Proc. Asian Conf. Comput. Vis., 2009, pp. 23–27.
[40] M. Mueller, F. Zilly, and P. Kauff, "Adaptive cross-trilateral depth map filtering," in Proc. 3DTV-Conf., True Vis. Capture, Transmiss. Display 3D Video (3DTV-CON), Jun. 2010, pp. 1–4.
[41] (2006). Stereo Datasets with Ground Truth, Middlebury College, Middlebury, VT, USA [Online]. Available: http://vision.middlebury.edu/stereo/data/scenes2006/
[42] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2003, pp. 195–202.
[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.

Hongsheng Yang received the B.Eng. degree in electronic information engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2011. He was with the Advanced Digital Sciences Center, Singapore, as a Software Engineer until 2013. He is currently pursuing the Ph.D. degree in computer science with the University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. His current research interests include 3D computer vision, large-scale scene understanding, and optimization methods.

Minh N. Do (M'01–SM'07–F'14) was born in Vietnam in 1974. He received the B.Eng. degree in computer engineering from the University of Canberra, Canberra, ACT, Australia, in 1997, and the Dr. Sci. degree in communication systems from the Swiss Federal Institute of Technology Lausanne, Lausanne, Switzerland, in 2001.

He has been on the faculty of the University of Illinois at Urbana-Champaign, Champaign, IL, USA, since 2002, where he is currently a Professor with the Department of Electrical and Computer Engineering, and holds joint appointments with the Coordinated Science Laboratory, the Beckman Institute for Advanced Science and Technology, and the Department of Bioengineering. His research interests include image and multidimensional signal processing, wavelets and multiscale geometric analysis, computational imaging, augmented reality, and visual information representation.

Prof. Do was a recipient of the Silver Medal from the 32nd International Mathematical Olympiad in 1991, the University Medal from the University of Canberra in 1997, the Doctorate Award from the EPFL in 2001, the CAREER Award from the National Science Foundation in 2003, and the Young Author Best Paper Award from the IEEE in 2008. He was named a Beckman Fellow at the Center for Advanced Study, UIUC, in 2006, and received the Xerox Award for Faculty Research from the College of Engineering, UIUC, in 2007. He was a member of the IEEE Signal Processing Theory and Methods Technical Committee and the Image, Video, and Multidimensional Signal Processing Technical Committee, and an Associate Editor of the IEEE Transactions on Image Processing.