

Depth Estimation from a Single Image using Deep Learned Phase Coded Mask
Harel Haim, Shay Elmalem, Raja Giryes, Member, IEEE, Alex M. Bronstein, Fellow, IEEE,
and Emanuel Marom, Fellow, IEEE

Abstract—Depth estimation from a single image is a well-known challenge in computer vision. With the advent of deep learning, several approaches for monocular depth estimation have been proposed, all of which have inherent limitations due to the scarce depth cues that exist in a single image. Moreover, these methods are computationally very demanding, which makes them inadequate for systems with limited processing power. In this paper, a phase-coded aperture camera for depth estimation is proposed. The camera is equipped with an optical phase mask that provides unambiguous depth-related color characteristics in the captured image. These are used for estimating the scene depth map with a fully convolutional neural network. The phase-coded aperture structure is learned jointly with the network weights using back-propagation. The strong depth cues (encoded in the image by the phase mask, designed together with the network weights) allow a much simpler neural network architecture for faster and more accurate depth estimation. The performance achieved on simulated images as well as on a real optical setup is superior to that of state-of-the-art monocular depth estimation methods (both with respect to depth accuracy and required processing power), and is competitive with more complex and expensive depth estimation methods such as light-field cameras.

Index Terms—Coded Aperture, Phase Mask, Depth Reconstruction, Deep Learning, Computational Camera.

H. Haim was with the Faculty of Electrical Engineering, Tel-Aviv University, Tel-Aviv, 6997801 Israel, and is currently with the Department of Computer Science, University of Toronto, Toronto, Canada. S. Elmalem, R. Giryes, A. M. Bronstein and E. Marom are with the Faculty of Electrical Engineering, Tel-Aviv University, Tel-Aviv, 6997801 Israel.

I. INTRODUCTION

Passive depth estimation is a well-known challenge in computer vision. A common solution is based on stereo vision, where two calibrated cameras capture the same scene from different views (similarly to the human eyes), so that the distance to every object can be inferred by triangulation. Yet, such a dual-camera system significantly increases the form factor, cost and power consumption.

The current electronics miniaturization trend (high-quality smartphone cameras, wearable devices, etc.) requires a compact and low-cost solution. This requirement dictates a more challenging task: passive depth estimation from a single image. While a single image lacks the depth cues that exist in a stereo image pair, there are still some depth cues, such as perspective lines and vanishing points, that enable depth estimation to some degree of accuracy. The ongoing deep learning revolution did not overlook this challenge, and several neural-network-based approaches to monocular depth estimation exist in the literature [1]–[8].

Eigen et al. [1] introduced a deep neural network for depth estimation that relies on depth cues in the RGB image. They used a multi-scale architecture with coarse and fine depth estimation networks concatenated to achieve both dynamic range and resolution. Two later publications by Cao et al. [2] and Liu et al. [3] employed the fully-convolutional network (FCN) architecture (originally presented by Long et al. [9] for semantic segmentation) for monocular depth estimation. In [2] the authors used a residual network [10] and refined the results using a conditional random field (CRF) prior, external to the network architecture. A similar approach of using a CRF to refine the initial result of a deep learning model was also used by Li et al. [4]. In [3] a simpler FCN model was proposed, but with the CRF operation integrated inside the network structure. This approach was further investigated using deeper networks and more sophisticated architectures [5], [6]. The challenge was also addressed with unsupervised learning, as presented by Garg et al. [7] and Godard et al. [8].

Common to all these approaches is the use of the depth cues in the RGB image 'as-is', as well as training and testing on well-known public datasets such as NYU Depth [11], [12] and Make3D [13]. Since the availability of reliable depth cues in a regular RGB image is limited, these approaches require large architectures with significant regularization (multi-scale models, ResNets, CRFs) as well as separate models for indoor and outdoor scenes. A modification of the image acquisition process itself seems necessary in order to allow a simpler model, generic enough to encompass both indoor and outdoor scenes.

Imaging methods that use an aperture coding mask (either phase or amplitude) have become more common in the last two decades. One of the first and most prominent studies in this field was carried out by Dowski and Cathey [14], where a cubic phase mask was designed to generate a constant point spread function (PSF) throughout the desired depth of field (DOF). Similar ideas were presented later in [15] using a random diffuser with focal sweep [16], or by using an uncorrected lens as a type of spectral focal sweep [17]. When a depth-independent PSF is achieved, an all-in-focus image can be recovered using non-blind deconvolution methods. However, in all these methods the captured and restored images have a similar response in the entire DOF, and thus depth information can only be recovered to some extent using monocular cues. In order to generate optical cues, the PSF should be depth-dependent. Related methods use an amplitude coded mask [18], [19] or a color-dependent ring mask [20], [21] such that objects at different depths exhibit a distinctive spatial/spectral structure.


The main drawback of these strategies is that the actual light efficiency is only 50% in [18], [19], 60% in [20] and 80% in [21], making them unsuitable for low-light conditions. Moreover, those techniques (except [21]) are based on the same low-DOF setup, having an f = 50mm, f/1.8 lens (27.8mm aperture). Thus, they are also unsuitable for small-scale cameras, which are less sensitive to small changes in focus.

Contribution. In this paper, we propose a novel deep learning framework for the joint design of a phase-coded aperture element and a corresponding FCN model for single-image depth estimation. A similar phase mask has been proposed by Milgrom et al. [22] for extended DOF imaging; its major advantage is a light efficiency above 95%. Our phase mask is designed to increase sensitivity to small focus changes, thus providing an accurate depth measurement for small-scale cameras (such as smartphone cameras).

In our system, the aperture coding mask is designed for encoding strong depth cues with negligible light throughput loss. The coded image is fed to an FCN, designed to decode the color-coded depth cues in the image and thus estimate the depth map. The phase mask structure is trained together with the FCN weights, allowing end-to-end system optimization. For training, we created the 'TAU-Agent' dataset¹, containing pairs of high-resolution realistic animation images and their perfectly registered pixel-wise depth maps.

¹Dataset will be publicly available upon publication of the paper.

Since the depth cues in the coded image are much stronger than their counterparts in a clear-aperture image, the proposed FCN is much simpler and smaller compared to other monocular depth estimation networks. The joint design and processing of the phase mask and the proposed FCN lead to improved overall performance: better accuracy and faster run-time compared to known monocular depth estimation methods. Also, the achieved performance is competitive with more complex, cumbersome and higher-cost depth estimation solutions such as light-field cameras.

The rest of the paper is organized as follows: Section II presents the phase-coded aperture used for encoding depth cues in the image, and its design process. Section III describes the FCN architecture used for depth estimation and its training process. Experimental results on synthetic data, as well as on real images acquired using an optical setup with a manufactured optimal aperture coding mask, are presented in Section IV. Our system is shown to exhibit superior performance in depth accuracy, system complexity, run-time and required processing power compared to competing methods. Section V concludes the paper.

II. PHASE-CODED APERTURE IMAGING FOR DEPTH ESTIMATION

The need to acquire high-quality images and videos of moving objects in low-light conditions establishes the well-known trade-off between the aperture size (F#) and the DOF in optical imaging systems. With conventional optics, increasing the light efficiency at the expense of reduced DOF poses inherent limitations on any purely computational technique, since the out-of-focus blur may result in information loss in parts of the image.

Fig. 1. Spatial frequency response and color channel separation. (a) Optical system response to normalized spatial frequency for different values of the defocus parameter ψ. (b) Comparison between contrast levels for a single normalized spatial frequency (0.25) as a function of ψ for a clear aperture (dotted) and with our new trained phase mask (solid).

These limitations can be overcome by manipulating the image acquisition process. A recent study by Haim et al. [23] used Milgrom's aperture phase coding technique [22] to achieve extended DOF imaging. In [23], the authors proposed a method for utilizing the diversity between the color channels (expressed in their respective PSFs) to find the corresponding blurring model for each small image patch, and used this model to restore the image. Here we adopt a similar phase mask for depth reconstruction. We show that this mask introduces depth-dependent color cues throughout the scene, which lead to fast and accurate depth estimation. Because the depth estimation is based on optical cues, our method generalizes better than current monocular depth estimation methods.

A. Out-of-focus imaging

An imaging system acquiring an out-of-focus (OOF) object can be described analytically using a quadratic phase error in its pupil plane [24]. In the case of a circular exit pupil with radius R, the defocus parameter is defined as

\psi = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} + \frac{1}{z_{img}} - \frac{1}{f}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_{img}} - \frac{1}{z_i}\right) = \frac{\pi R^2}{\lambda}\left(\frac{1}{z_o} - \frac{1}{z_n}\right),   (1)

where z_{img} is the sensor plane location of an object in the nominal position (z_n), z_i is the ideal image plane for an object located at z_o, and λ is the optical wavelength. Out-of-focus blur increases with the increase of |ψ|; the image exhibits a gradually decreasing contrast level that eventually leads to information loss (see Fig. 1(a)).
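To make Eq. (1) concrete, the following NumPy snippet (our illustration, not code from the paper) evaluates ψ for the lens parameters reported later in Section IV (f = 16mm, F/7, focus at 1.1m, ψ defined at the 455nm blue wavelength); it approximately reproduces the stated mapping of ψ = [−4, 10] to the 0.5–2.2m working range.

```python
import numpy as np

def defocus_psi(z_obj, z_focus, aperture_radius, wavelength):
    """Defocus parameter of Eq. (1): psi = (pi*R^2/lambda)*(1/z_o - 1/z_n).
    All distances are in meters."""
    return (np.pi * aperture_radius ** 2 / wavelength) * (1.0 / z_obj - 1.0 / z_focus)

# Values taken from the paper's real-world setup: f = 16 mm, F/7 lens
# (aperture radius ~ 16/7/2 mm), focused at 1.1 m, psi defined at 455 nm (blue).
R = 0.5 * 16e-3 / 7.0
for z in (0.5, 1.1, 2.2):
    print(z, round(defocus_psi(z, z_focus=1.1, aperture_radius=R, wavelength=455e-9), 1))
# Prints roughly psi = 10, 0, -4, consistent with the quoted 0.5-2.2 m range.
```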
B. Mask design

Both Milgrom et al. [22] and Haim et al. [23] have shown that phase masks with a single radially symmetric ring introduce diversity between the responses of the three major color channels (R, G and B) for different focus scenarios, such that the three channels jointly provide an extended DOF. In order to allow more flexibility in the system design, we use a mask with two or three rings, whereby each ring exhibits a different wavelength-dependent phase shift. In order to determine the optimal phase mask parameters within a deep-learning-based depth estimation framework, the imaging stage is modeled as the initial layer of a CNN model.


Fig. 2. Neural network architecture of the depth classification CNN (the 'inner' net in the FCN model of Fig. 4). Spatial dimension reduction is achieved by convolution stride instead of pooling layers. Every CONV block is followed by a BN-ReLU layer (not shown in this figure).

The inputs to this coded aperture convolution layer are the all-in-focus images and their corresponding depth maps. The parameters (or weights) of the layer are the radii r_i and phase shifts φ_i of the mask's rings. The layer's forward model is composed of the coded aperture PSF calculation (for each depth in the relevant depth range), followed by an imaging simulation using the all-in-focus input image and its corresponding depth map. The backward model uses the inputs from the next layer (backpropagated to the coded aperture convolutional layer) and the derivatives of the coded aperture PSF with respect to its weights, ∂PSF/∂r_i and ∂PSF/∂φ_i, in order to calculate the gradient descent step on the phase mask parameters. A detailed description of the coded aperture convolution layer and its forward and backward models is presented in the Appendix. One of the important hyper-parameters of such a layer is the depth range under consideration (in ψ terms). The ψ range setting, together with the lens parameters (focal length, F# and focus point), dictates the trade-off between the depth dynamic range and resolution. In this study, we set the range to ψ = [−4, 10]; its conversion to a metric depth range is presented in Section IV. As mentioned above, the optimization of the phase mask parameters is done by integrating the coded aperture convolutional layer into the CNN model detailed in the sequel, followed by end-to-end optimization of the entire model.

To validate the coded aperture layer, we compared the case where the CNN (described in the following section) is trained end-to-end with the phase coded aperture layer to the case where the phase mask is held fixed at its initial value. Several fixed patterns were examined; training the phase mask improved the classification error by 5% to 10%.

For the setup we used, the optimization process yielded a three-ring mask in which the outer ring is deeper than the middle one, as illustrated in Fig. 3. Such a design poses significant fabrication challenges for the chemical etching process used at our facilities. Since an optimized three-ring mask surpasses the two-ring mask only by a small margin, a two-ring limit was set in the training process in order to make fabrication simpler and more reliable; this resulted in the normalized ring radii r = {0.55, 0.8, 0.8, 1} and phases φ = {6.2, 12.3} [rad] (both ψ and φ are defined for the blue wavelength, where the RGB wavelengths taken are the peak wavelengths of the camera color filter response: λ_{R,G,B} = [610, 535, 455] nm). Figure 1(b) shows the diversity between the color channels for different depths (expressed in ψ values) when using a clear aperture (dotted plot) versus our optimized phase mask (solid plot).

Fig. 3. Aperture phase coding mask. (a) 3D illustration of the optimal three-ring mask. (b) Cross-section of the mask. The area marked in black acts as a circular pupil.
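For illustration only (not the authors' code), the short NumPy sketch below evaluates the piecewise-constant phase profile of this two-ring mask from the reported normalized radii and phases. Scaling the phase for the red and green channels by the wavelength ratio is our assumption (it neglects the dispersion of the mask material); the paper only states that φ is defined at the blue wavelength.

```python
import numpy as np

# Reported (normalized) ring radii and phase shifts, defined at 455 nm (blue).
RING_EDGES = [(0.55, 0.8), (0.8, 1.0)]
RING_PHASES_BLUE = [6.2, 12.3]          # [rad]
WAVELENGTHS = {"R": 610e-9, "G": 535e-9, "B": 455e-9}

def ring_phase_profile(rho, wavelength):
    """Phase CA(rho) of the two-ring mask at normalized pupil radius rho.
    Assumption: the phase of a fixed-height ring scales ~ 1/wavelength
    (material dispersion neglected)."""
    scale = 455e-9 / wavelength
    phase = np.zeros_like(rho)
    for (r1, r2), phi_blue in zip(RING_EDGES, RING_PHASES_BLUE):
        phase = np.where((rho > r1) & (rho <= r2), phi_blue * scale, phase)
    return phase

rho = np.linspace(0.0, 1.0, 6)
for channel, lam in WAVELENGTHS.items():
    print(channel, np.round(ring_phase_profile(rho, lam), 2))
```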
III. FCN FOR DEPTH ESTIMATION

We now turn to describe the architecture of our fully convolutional network (FCN) for depth estimation, which relies on the optical depth cues encoded in the image by the phase coded aperture incorporated in the lens, as described in Section II. These cues are used by the FCN model to estimate the scene depth map. Our network configuration is inspired by the FCN structure introduced by Long et al. [9]. In that work, an ImageNet classification CNN was converted to a semantic segmentation FCN by adding a deconvolution block to the ImageNet model and fine-tuning it for semantic segmentation (with several architecture variants for increased spatial resolution). For depth estimation using our phase coded aperture camera, a totally different 'inner net' should replace the 'ImageNet model'. The inner net should classify the different imaging conditions (i.e., ψ values), and the deconvolution block will turn the initial pixel labeling into a full depth estimation map. We tested two different 'inner' network architectures: the first based on the DenseNet architecture [25], and the second based on a traditional feed-forward architecture. An FCN based on each of the two inner nets is presented, and the trade-off is discussed. The following subsections present the ψ classification inner nets, and the FCN model based on them for depth estimation.

A. ψ classification CNN

As presented in Section II, the phase coded aperture is designed along with the CNN such that it encodes depth-dependent cues in the image by manipulating the response of the RGB channels for each depth.


Fig. 4. Network architecture of the depth estimation FCN. The depth (ψ) classification network (see Fig. 2) is wrapped in a deconvolution framework to provide a depth estimation map equal in size to the input image.

Using these strong optical cues, the depth slices (i.e., ψ values) can be classified using a CNN classification model.

For this task, we tested two different architectures: the first based on the DenseNet architecture for CIFAR-10, and the second based on a traditional feed-forward architecture of repeated blocks of convolutions, batch normalization [26] and rectified linear units [27] (CONV-BN-ReLU, see Fig. 2). Following the approach presented in [28], pooling layers are omitted in the second architecture, and a stride of 2 is used in the CONV layers for lateral dimension reduction. This allows much faster model evaluation (only 25% of the calculation in each CONV layer), with a minor loss in performance.

To reduce the model size and speed up its evaluation even more, the input (in both architectures) to the first CONV layer of the net is the captured raw image (in the mosaicked Bayer pattern). By setting the stride of the first CONV layer to 2, the filters' response remains shift-invariant (since the Bayer pattern period is 2). This way the input size is decreased by a factor of 3, with a minor loss in performance. It also removes the need for a demosaicking stage, allowing faster end-to-end performance (in cases where the RGB image is not needed as an output and one is interested only in the depth map). One can view the direct processing of mosaicked images as a case where the CNN representation power 'contains' the demosaicking operation, and therefore it is not really needed as a preprocessing step.
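The PyTorch sketch below illustrates this kind of feed-forward ψ-classification inner net (our sketch, not the authors' implementation; the number of blocks and the channel widths are assumptions chosen for illustration): stride-2 CONV-BN-ReLU blocks with no pooling, a single-channel mosaicked Bayer input, and 15 outputs for ψ = −4, ..., 10.

```python
import torch
import torch.nn as nn

class PsiClassifier(nn.Module):
    """Feed-forward CONV-BN-ReLU classifier for the 15 psi classes.
    Channel widths and depth are illustrative assumptions."""
    def __init__(self, num_classes=15, widths=(32, 64, 128, 256)):
        super().__init__()
        layers, in_ch = [], 1          # raw Bayer mosaic: one input channel
        for out_ch in widths:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, bayer):
        # bayer: (N, 1, H, W) raw mosaicked patch; stride 2 in the first CONV
        # keeps the response shift-invariant w.r.t. the 2x2 Bayer period.
        logits = self.classifier(self.features(bayer))
        return logits.mean(dim=(2, 3))  # one 15-way prediction per 32x32 patch

net = PsiClassifier()
print(net(torch.randn(4, 1, 32, 32)).shape)  # torch.Size([4, 15])
```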
Both inner classification net architectures are trained on the Describable Textures Dataset (DTD) [29]. About 40K texture patches (32x32 pixels each) were selected from the dataset. Each patch is 'replicated' in the dataset 15 times, where each replication corresponds to a different blur kernel (corresponding to the phase coded aperture for ψ = −4, −3, ..., 10). The first layer of both architectures represents the phase-coded aperture layer, whose inputs are the clean patch and its corresponding ψ value.

After the imaging stage is done (as explained in Section II), Additive White Gaussian Noise (AWGN) is added to each patch to make the network robust to the typical noise level that exists in images taken with a real-world camera. Though increasing the noise level improves the robustness, it is important to consider the trade-off between noise robustness and depth estimation accuracy, which limits the amount of noise that should be added in training and makes the noise level a hyper-parameter one should tune. In our tests, when we set a specific noise level, the accuracy of the depth results deteriorated for inputs with a higher noise level (as one would expect). At the same time, when we trained the CNN with relatively high noise levels, the system became more robust to noise at the expense of reduced accuracy for images with lower noise. Therefore, σ = 3 was chosen as a good compromise, since it resembles the noise level of images of a well-lit scene taken with the selected camera. Of course, one may consider a different noise level, according to the target camera and its expected noise level.

Data augmentation via four rotations is used to increase the dataset size as well as to achieve rotation invariance. The dataset size is about 2.4M patches, of which 80% are used for training and 20% for validation. Both architectures were trained to classify into 15 integer values of ψ (between −4 and 10) using the softmax loss. These nets are used as an initialization for the depth estimation FCN, as presented in Section III-C.

B. RGB-D Dataset

The deep learning based methods for depth estimation from a single image mentioned in Section I [1]–[8] rely strongly on the details of the input image. Thus, most studies in this field assume an input image with a large DOF, such that most of the acquired scene is in focus. This assumption is justified when the photos are taken by small-aperture cameras, as is the case in datasets such as NYU Depth [11], [12] and Make3D [13] that are commonly used for training and testing those depth estimation techniques.


However, such optical configurations limit the resolution and increase the noise level, thus reducing the image quality. Moreover, the depth maps in these datasets are prone to errors due to depth sensor inaccuracies and calibration issues (alignment and scaling) with the RGB sensor.

Our method is based on a phase-coded aperture imaging process, which encodes the image. To train or evaluate our method on images not taken with our camera, the phase coded aperture imaging process has to be simulated on those images. To simulate the imaging process properly, the input data should contain high-resolution, all-in-focus images with low noise, accompanied by accurate pixelwise depth maps. Depth datasets such as NYU Depth [11], [12] and Make3D [13] cannot be used for our coded aperture imaging simulation due to the limited image and depth resolution they provide. Proper input for such an imaging simulation may be generated primarily using 3D graphics simulation software.

We use the popular MPI-Sintel depth images dataset [30], created with the Blender 3D graphics software. The Sintel dataset contains 23 scenes with a total of 1k images. Yet, because it has been designed specifically for optical flow evaluation, the depth variation in each scene is limited. Thus, we could only use about 100 unique images, which are not enough for training. The need for additional data has led us to create a new Sintel-like dataset (using Blender) called 'TAU-Agent', which is based on the recent open movie 'Agent 327'. This new animated dataset, which relies on the new render engine 'Cycles', contains 300 realistic images (indoor and outdoor) with a resolution of 1024 x 512 and corresponding pixelwise depth maps. With rotation augmentation, our full dataset contains 840 scenes, of which 70% are used for training and the rest for validation.

C. Depth estimation FCN

Similarly to the FCN model presented by Long et al. [9], the inner ψ classification net is wrapped in a deconvolution framework, turning it into an FCN model (see Fig. 4). The desired output of our depth estimation FCN is a continuous depth estimation map. However, since training continuous models is prone to over-fitting and regression-to-the-mean issues, we pursue this goal in two stages. In the first stage, the FCN is trained for discrete depth estimation. In the second, the discrete FCN model is used as an initialization for the continuous model training.
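The wrapping step can be sketched as follows (ours, not the authors' code; the overall stride, channel count and upsampling kernel are assumptions): a stride-16 inner trunk is followed by a 1x1 scoring layer and a transposed convolution that brings the per-pixel ψ logits back to the input resolution.

```python
import torch
import torch.nn as nn

class DepthFCN(nn.Module):
    """Wraps a psi-classification 'inner net' with a deconvolution head so the
    per-patch labels become a dense psi map at the input resolution (sketch)."""
    def __init__(self, inner_features, num_classes=15, inner_stride=16, feat_ch=256):
        super().__init__()
        self.features = inner_features                    # CONV-BN-ReLU trunk
        self.score = nn.Conv2d(feat_ch, num_classes, 1)   # per-location psi logits
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=2 * inner_stride,
                                           stride=inner_stride,
                                           padding=inner_stride // 2)

    def forward(self, bayer):
        return self.upsample(self.score(self.features(bayer)))   # (N, 15, H, W)

# A stand-in trunk with four stride-2 CONV-BN-ReLU blocks (overall stride 16):
def make_trunk(widths=(32, 64, 128, 256)):
    layers, in_ch = [], 1
    for w in widths:
        layers += [nn.Conv2d(in_ch, w, 3, 2, 1), nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
        in_ch = w
    return nn.Sequential(*layers)

fcn = DepthFCN(make_trunk())
print(fcn(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 15, 128, 128])
```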
To train the discrete depth FCN, the Sintel and Agent dataset RGB images are blurred using the coded aperture imaging model, where each object is blurred with the blur kernel associated with its depth (indicated in the ground truth pixelwise depth map). The imaging is done in a quasi-continuous way, with a ψ step of 0.1 in the range ψ = [−4, 10]. This imaging simulation could be carried out in the same way as in the 'inner' net training, i.e., using the phase coded aperture layer as the first layer of the FCN model. However, such a step is very computationally demanding and does not provide a significant improvement (since the phase-coded aperture parameter tuning reached its optimum in the inner net training). Therefore, in the FCN training stage, the optical imaging simulation is done as a pre-processing step with the best phase mask achieved in the inner net training stage. In the discrete training step of the FCN, the ground-truth depth maps are discretized to ψ = −4, −3, ..., 10 values. The Sintel/Agent images (after imaging simulation with the coded aperture blur kernels, RGB-to-Bayer transformation and AWGN addition), along with the discretized depth maps, are used as the input data for training the discrete depth estimation FCN model. The FCN is trained to reconstruct the discrete depth of the input image using the softmax loss.
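A minimal sketch of this pre-processing is given below (ours; the PSF bank is assumed to be precomputed from the trained mask, the RGGB Bayer layout and noise level are illustrative, and the layer-wise blur-and-sum ignores occlusion effects at depth boundaries):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_coded_image(rgb, depth_psi, psf_bank, sigma=3.0):
    """rgb: (H,W,3) float image; depth_psi: (H,W) psi map; psf_bank: dict
    mapping each tabulated psi to a (k,k,3) per-channel PSF. H, W assumed even."""
    out = np.zeros_like(rgb)
    keys = np.array(sorted(psf_bank))
    # Assign each pixel to its nearest tabulated psi and blur layer by layer.
    nearest = keys[np.abs(depth_psi[..., None] - keys).argmin(axis=-1)]
    for psi in sorted(psf_bank):
        mask = (nearest == psi).astype(rgb.dtype)
        for c in range(3):
            out[..., c] += fftconvolve(rgb[..., c] * mask,
                                       psf_bank[psi][..., c], mode="same")
    # RGB -> Bayer mosaic (RGGB assumed) and additive white Gaussian noise.
    bayer = np.empty(rgb.shape[:2], dtype=rgb.dtype)
    bayer[0::2, 0::2] = out[0::2, 0::2, 0]   # R
    bayer[0::2, 1::2] = out[0::2, 1::2, 1]   # G
    bayer[1::2, 0::2] = out[1::2, 0::2, 1]   # G
    bayer[1::2, 1::2] = out[1::2, 1::2, 2]   # B
    return bayer + np.random.normal(0.0, sigma, bayer.shape)
```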
After training, both versions of the FCN model (based on the DenseNet architecture and on the traditional feed-forward architecture) achieved roughly the same performance, but with a significant increase in inference time (x3), training time (x5) and memory requirements (x10) for the DenseNet model. When examining the performance, one can see that most of the errors are in smooth/low-texture areas of the images, where our method (which relies on texture) is expected to be weaker. Yet, in areas with 'sufficient' texture, there are enough encoded depth cues, enabling good depth estimation even with a relatively simple DNN architecture. This similarity in performance between the DenseNet based model (one of the best CNN architectures known to date) and a simple feed-forward architecture is a clear example of the inherent power of optical image processing using a coded aperture; a task-driven design of the image acquisition stage can potentially save significant resources in the digital processing stage. As such, we decided to keep the simple feed-forward architecture as the chosen solution.

Fig. 5. (a) Confusion matrix for the depth segmentation FCN validation set. (b) MAPE as a function of the focus point using our continuous net.

To evaluate the discrete depth estimation accuracy, we calculated a confusion matrix for our validation set (~250 images, see Fig. 5(a)). After 1500 epochs, the net achieves a top-1 accuracy of 68%. However, the vast majority of the errors are to adjacent ψ values, and on 93% of the pixels the discrete depth estimation FCN recovers the correct depth with an error of up to ±1ψ. As already mentioned above, most of the errors originate from smooth areas, where no texture exists and therefore no depth-dependent color cues were encoded. This performance is sufficient as an initialization point for the continuous depth estimation network.

The discrete depth estimation (segmentation) FCN model is upgraded to a continuous depth estimation (regression) model using some modifications.


Fig. 6. Depth estimation results on a simulated image from the 'Agent' dataset: (a) original input image (the actual input to our net was the raw version of the presented image), (b) continuous ground truth, (c-d) continuous depth estimation achieved using the L1 loss (c) and the L2 loss (d).

The linear prediction results serve as an input to a 1x1 CONV layer, initialized with linear regression coefficients mapping the ψ predictions to continuous ψ values (ψ values can be easily translated to depth values in meters once the lens parameters and focus point are known).

The continuous network is fine-tuned in an end-to-end fashion, with a lower learning rate (by a factor of 100) for the pre-trained discrete network layers. The same Sintel & Agent images are used as input, but with the quasi-continuous depth maps (without discretization) as ground truth, and an L2 or L1 loss. After 200 epochs, the model converges to a Mean Absolute Difference (MAD) of 0.6ψ. Again, we found that most of the errors originate from smooth areas (as detailed in Section IV-A hereafter).
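A small sketch of such a head is shown below (ours; the paper only states that the 1x1 layer is initialized with linear-regression coefficients mapping the ψ predictions to continuous ψ values, so the softmax-expectation initialization used here is an assumption):

```python
import torch
import torch.nn as nn

PSI_VALUES = torch.arange(-4.0, 11.0)        # the 15 discrete psi classes

class ContinuousPsiHead(nn.Module):
    """1x1 convolution over the per-pixel class scores, initialized (our choice)
    so that it starts as the probability-weighted mean of the discrete psi values."""
    def __init__(self):
        super().__init__()
        self.regress = nn.Conv2d(15, 1, kernel_size=1)
        with torch.no_grad():
            self.regress.weight.copy_(PSI_VALUES.view(1, 15, 1, 1))
            self.regress.bias.zero_()

    def forward(self, logits):                       # logits: (N, 15, H, W)
        return self.regress(logits.softmax(dim=1))   # continuous psi map (N, 1, H, W)

head = ContinuousPsiHead()
print(head(torch.randn(1, 15, 8, 8)).shape)   # torch.Size([1, 1, 8, 8])
```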
IV. EXPERIMENTAL RESULTS AND COMPARISON

A. Validation set results

As a basic sanity check, the validation set images can be inspected visually. In Fig. 6 it can be seen that while the depth cues encoded in the input image are hardly visible to the naked eye, the proposed FCN model achieves quite accurate depth estimation maps compared to the ground truth. Most of the errors are concentrated in smooth areas, as mentioned in Section III-C. The continuous depth estimation smooths the initial discrete depth recovery, achieving a more realistic result.

As mentioned above, our method estimates the blur kernel (ψ value) using the optical cues encoded by the phase coded aperture. An important practical analysis is the translation of the ψ estimation map to a metric depth map. Using the lens parameters and the focus point, transforming from ψ to depth is straightforward (see Section II), and the relative metric depth error can then be analyzed. The ψ = [−4, 10] domain is spread over some depth dynamic range, depending on the chosen focus point. A nearby focus point dictates a small dynamic range and high depth resolution, and vice versa. However, since the FCN model is designed for ψ estimation, the model (and its ψ-related MAD) remains the same. After translating to metric maps, the Mean Absolute Percentage Error (MAPE) is different for each focus point. Such an analysis is presented in Fig. 5(b), where the aperture diameter is set to 2.3[mm] and the focus point changes from 0.1[m] to 2[m], resulting in a working distance of 9[cm] to 30[m]. One can see that the relative error is roughly linear in the focus point, and remains under 10% for a relatively wide focus-point range. A summary of the depth estimation performance with several error measures is presented in Table I.
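This translation is simply the inversion of Eq. (1). The NumPy sketch below (ours) converts an estimated ψ map to metric depth for a given focus distance and aperture; with a roughly 2.3mm aperture focused at 1.1m it maps ψ = [−4, 10] to approximately 0.5–2.2m, matching the real-world setup described in Section IV-B.

```python
import numpy as np

def psi_to_depth(psi_map, z_focus, aperture_radius, wavelength=455e-9):
    """Invert Eq. (1): 1/z_o = 1/z_n + psi * lambda / (pi * R^2). Distances in meters."""
    inv_depth = 1.0 / z_focus + psi_map * wavelength / (np.pi * aperture_radius ** 2)
    return 1.0 / inv_depth

psi = np.array([-4.0, 0.0, 10.0])
print(np.round(psi_to_depth(psi, z_focus=1.1, aperture_radius=1.15e-3), 2))
# -> roughly [2.1, 1.1, 0.5] meters
```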
Additional simulated scene examples are presented in Fig. 7. The proposed FCN model achieves accurate depth estimation maps compared to the ground truth. Notice the difference in the estimated maps when using the L1 loss (Fig. 7(c)) and the L2 loss (Fig. 7(d)). The L1-based model produces smoother output but reduces the ability to distinguish fine details, while the L2 model produces noisier output but provides sharper maps. This is illustrated in all scenes where the gap between the body and the hands of the characters is not visible in Fig. 7(c); note that in this case the L2 model produces a sharper separation (Fig. 7(d)). The estimated maps in Fig. 7(c-d) also reveal a few limitations of our method. In the top row, the fence behind the bike wheel is not visible since the fence wires are too thin. In the middle and bottom rows, the background details are not visible due to the low dynamic range in these areas (the background is too far from the camera). One may overcome the dynamic range limitations by changing the aperture size/focus point, as explained in the following.

As mentioned above, our system is designed to handle a ψ range of [−4, 10], but the metric range depends on the focus point selection (as presented above). This codependency allows one to use the same FCN model with different optical configurations. To demonstrate this important advantage, we simulated an image (Fig. 10(a)) captured with a lens having an aperture of 3.45[mm] (1.5 times the size of our original aperture used for training). The larger aperture provides better metric accuracy in exchange for a reduced dynamic range. The focus point was set to 48[cm], providing a working range of 39[cm] to 53[cm]. We then produced an estimated depth map, which was translated into point cloud data using the camera parameters (sensor size and lens focal length) from Blender. The 3D face reconstruction shown in Fig. 10(b) validates our metric depth estimation capabilities and demonstrates the efficiency of our strategy, as we were able to create this 3D model in real time.

B. Real-world results

To test the proposed depth estimation method, several experiments were carried out. The experimental setup included an f = 16mm, F/7 lens (LM16JCM-V by Kowa) with our phase coded aperture incorporated in the aperture stop plane (see Fig. 8(a)).


Fig. 7. Depth estimation results on simulated scenes from the 'Agent' dataset: (a) original input image (the actual input image used in our net was the raw version of the presented image), (b) continuous ground truth, (c-d) continuous depth estimation achieved by our FCN network when trained using (c) the L1 loss and (d) the L2 loss.

Fig. 8. Lab setup: (a) lens and phase mask, (b) indoor scene side view.

TABLE I
DEPTH ESTIMATION RESULTS SUMMARY

Measure                                     | Result
--------------------------------------------|--------------------------
Initial discrete depth segmentation         | 68% (top-1), 93% (±1ψ)
Continuous depth estimation error [ψ]       | 0.6
Val. set - rel. error [m]                   | 5.5%
Val. set - log10 error [m]                  | 0.056
Val. set - RMS error [m]                    | 0.12
Experimental scene (spot check, rel.) [m]   | 6.25%
Run time (Full-HD image) [s]                | 0.22

The lens was mounted on a camera made by IDS Imaging. The lens was focused to z_o = 1100mm, so that the ψ = [−4, 10] domain was spread between 0.5−2.2m. Several scenes were captured using the phase coded aperture camera, and the corresponding depth maps were calculated using the proposed FCN model.

For comparison, two competing solutions were examined on the same scenes: the Illum light-field camera (by Lytro), and the monocular depth estimation net proposed by Liu et al. [3]. Since the method in [3] assumes an all-in-focus image as input, we used the Lytro camera all-in-focus imaging option as the input to [3].

It is important to note that our proposed method provides depth maps in absolute values (meters), while the Lytro camera and [3] provide a relative depth map only (far/near values with respect to the scene). Another advantage of our technique is that it requires the incorporation of a very simple optical element into an existing lens, while light-field and other solutions (like stereo cameras) require a much more complicated optical setup. In a stereo camera, two calibrated and laterally separated cameras are mounted on a rigid base. In a light-field camera, special light-field optics and sensor are used. In both cases the cumbersome optical setup dictates a large volume and high cost.

We examined all the solutions on both indoor and outdoor scenes. Several examples are presented, with similar and different focus points. Indoor scene examples are shown in Fig. 9. Several objects were laid on a table with a poster in the background (see Fig. 8(b) for a side view of the scene). Since the scenes lack global depth cues, the method from [3] fails to estimate a correct depth map. The Lytro camera estimates the gradual depth structure of the scene with good object identification, but provides a relative scale only. Our method succeeds in identifying both the gradual depth of the table and the fine details of the objects (top row: note the screw located above the truck on the right; middle row: note the various groups of screws). Although some scene texture 'seeps' into our recovered depth map, it causes only a minor error in the depth estimation. A partial failure case appears in the leaflet scene (Fig. 9, bottom row), where our method misses only on texture-less areas. Performance on non-textured areas is the most challenging scenario for our method (since it relies on color-coded cues generated in textured areas), and it is the source of almost all failure cases. In most cases, our net learns to associate non-textured areas with their correct depth using adjacent locations in the scene that happen to have texture and are at a similar depth. However, this is not always the case (as can be seen in Fig. 9(d), bottom row), where it fails to do so in the blank white areas. This issue can be resolved using a much deeper network, and it imposes a performance vs. model complexity trade-off.

A similar comparison is presented for two outdoor scenes in Fig. 11. In the first row, we chose a scene consisting of a granulated wall.


Fig. 9. Indoor scene depth estimation. Left to right: (a) the scene and its depth map acquired using (b) the Lytro Illum camera, (c) the Liu et al. [3] monocular depth estimation net, and (d) our method. As each camera has a different field of view, the images were cropped to show roughly the same part of the scene. The depth scale on the right is relevant only for (d). Because the outputs of (b) & (c) provide only a relative depth map (and not absolute as in the case of (d)), their maps were brought manually to the same scale for visualization purposes.

Fig. 10. 3D face reconstruction: (a) input image, (b) point cloud map.

In this example, the global depth cues are also weak, and therefore the monocular depth estimation fails to separate the close vicinity of the wall (right part of the image). Both the Lytro and our phase coded aperture camera achieve a good depth estimation of the scene. Note, though, that our camera has the advantage of providing an absolute scale while using much simpler optics.

In the second row of Fig. 11, we chose a grassy slope with flowers. In this case, the global depth cues are stronger. Thus, the monocular method [3] does better than in the previous examples, but still achieves only a partial depth estimate. Lytro and our camera achieve good results.

Additional outdoor examples are presented in Fig. 12. Note that the scenes in the first five rows of Fig. 12 were taken with a different focus point (compared to the indoor and the rest of the outdoor scenes), and therefore the depth dynamic range and resolution are different (as can be seen in the depth scale in the right column). However, since our FCN model is trained for ψ estimation, all depth maps were obtained using the same network, and the absolute depth is calculated using the known focus point and the estimated ψ map.

Quantitative evaluation of the real-world setup with a camera equipped with our phase-coded aperture was performed 'in the wild', since an exact depth ground truth is difficult to acquire in the general case. For quantitative evaluation on real data, we performed a 'spot check': we measured the depth recovery error of our network for the known object distances in the lab setting of Fig. 9. We got an average depth estimation error of 6.25%. This accuracy is comparable to the Lytro accuracy (Zeller et al. [31]) and much better than the monocular method (25%), while both require a cumbersome calibration phase.

Besides the depth map recovery performance and the simpler optics, another important benefit of our proposed solution is the required processing power/run time. The fact that the depth cues are encoded by the phase mask enables a much simpler FCN architecture, and therefore a much faster inference time. This is because some of the processing is done by the optics (at the speed of light, with no processing resources needed). For example, for a full-HD image as input, our proposed network evaluates a full-HD depth map in 0.22s (using an Nvidia Titan X Pascal GPU). For the same sized input on the same GPU, the net presented in [3] evaluates a 3-times smaller depth map in 10s (timing was measured using the same machine and the implementation of the network available at the authors' website).


Of course, if a one-to-one input-image-to-depth-map resolution is not needed, the output size can be reduced and our FCN will run even faster.

Another advantage of our method is that the depth estimation relies mostly on local cues in the image. This allows performing the computations in a distributed manner: the image can simply be split and the depth map evaluated in parallel on different resources. The partial outputs can be recombined later with barely visible block artifacts.
recombined later with barely visible block artifacts. reference is a perfect spherical wave converging at the image
plane. Thus, for an in-focus and aberration free (or diffraction
V. S UMMARY AND CONCLUSIONS limited) system, the pupil function is just the identity for the
amplitude in the active area of the aperture, and zero for the
In this paper we have presented a method for real-time phase.
depth estimation from a single image using a phase coded Out-of-Focus (OOF): An imaging system acquiring an
aperture camera. The phase mask is designed together with the object in OOF conditions suffers from blur that degrades the
FCN model using back propagation, which allows capturing image quality. This results in low contrast, loss of sharpness
images with high light efficiency and color-coded depth cues, and even loss of information. The OOF error is expressed
such that each color channel responds differently to OOF analytically as a quadratic phase wave-front error in the pupil
scenarios. Taking advantage of this coded information, a function. To quantify the defocus condition, we introduce the
simple convolutional neural network architecture is proposed parameter ψ. For the case of a circular exit pupil with radius
to recover the depth map of the captured scene. R, we define ψ as:
This proposed scheme outperforms state-of-the-art monocu-
lar depth estimation methods by having better accuracy, more πR2 1

1 1

πR2

1 1

than an order of magnitude speed acceleration, less memory ψ= + − = −
λ zo zimg f λ zimg zi
requirements and hardware parallelization compliance. In ad- (3)
πR2 1
 
dition, our simple and low-cost solution shows comparable 1
= − ,
performance to expensive commercial solutions with complex λ zo zn
optics such as the Lytro camera. Moreover, as opposed to the where zimg is the image distance (or sensor plane location)
relative depth maps produced by the monocular methods and of an object in the nominal position zn , zi is the ideal image
the Lytro camera, our system provides an absolute (metric) plane for an object located at zo , and λ is the illumination
depth estimation, which can be useful to many computer vision wavelength. The defocus parameter ψ measures the maximum
applications, such as 3D modeling and augmented reality. quadratic phase error at the aperture edge. For a circular pupil:

A PPENDIX POOF = P (ρ, θ) exp{jψρ2 }, (4)


P HASE - CODED APERTURE IMAGING AS A NEURAL where POOF is the OOF pupil function, P (ρ, θ) is the in-focus
NETWORK LAYER pupil function, and ρ is the normalized pupil coordinate.
As described in the paper, our depth estimation method is Aperture Coding: As mentioned above, the pupil function
based on a phase-coded aperture lens that introduces depth- represents the amplitude and phase profile in the imaging
dependent color cues in the resultant image. The depth cues are system exit pupil. Therefore, by adding a coded pattern
later processed by a Fully-Convolutional Network (FCN) to (amplitude, phase or both) at the exit pupil,3 the PSF of the
produce a depth map of the scene. Since the depth estimation system can be manipulated by some pre-designed pattern. In
is done using deep learning, and in order to have an end-to- this case, the pupil function can be expressed as:
end deep learning based solution, we model the phase-coded
aperture imaging as a layer in the deep network and optimize PCA = P (ρ, θ)CA(ρ, θ), (5)
its parameters using backpropagation, along with the network
where PCA is the coded aperture pupil function, P (ρ, θ) is
weights. In the following we present in detail the forward and
the in-focus pupil function, and CA(ρ, θ) is the aperture/phase
backward model of the phase coded aperture layer.
mask function. In our case of phase coded aperture, CA(ρ, θ)
is a circularly symmetric piece-wise constant function repre-
A. Forward model senting the phase rings pattern. For the sake of simplicity, we
will consider a single ring phase mask, applying a φ phase shift
Following the imaging system model presented in [24], in a ring starting at r1 to r2 . Therefore, CA(ρ, θ) = CA(r, φ)
the physical imaging process is modeled as a convolution where:
of the aberration free geometrical image with the imaging
2 Note that in this model, the geometric image is a perfect reproduction of
system Point Spread Function (PSF). In other words, the final
image is the scaled projection of the scene onto the image the scene (up to scaling), with no resolution limit.
3 The exit pupil is not always accessible. Therefore, the mask may be added
plane, convolved with the system’s PSF, which contains all the also in the aperture stop, entrance pupil, or in any other surface conjugate to
system properties: wave aberrations, chromatic aberrations and the exit pupil.


Fig. 11. Outdoor scene depth estimation. Depth estimation results for a granulated wall (upper) and a grassy slope with flowers (lower). Left to right: (a) the scene and its depth map acquired using (b) the Lytro Illum camera, (c) the Liu et al. [3] monocular depth estimation net, and (d) our method. As each camera has a different field of view, the images were cropped to show roughly the same part of the scene. The depth scale on the right is relevant only for (d). Because the outputs of (b) & (c) provide only a relative depth map (and not absolute as in the case of (d)), their maps were brought manually to the same scale for visualization purposes. Additional examples appear in Fig. 12.

This example can be easily extended to a multiple-ring pattern.

Depth dependent coded PSF: Combining all factors, the complete term for the depth dependent coded pupil function becomes

P(\psi) = P(\rho, \theta)\, CA(r, \phi)\, \exp\{j\psi\rho^2\}.   (7)

Using the definition in (2), the depth dependent coded PSF(ψ) can be easily calculated.

Imaging Output: Using the coded aperture PSF, the imaging output can be calculated simply by

I_{out} = I_{in} * PSF(\psi).   (8)

This model limits us to a Linear Shift-Invariant (LSI) assumption. However, this is not the case in real imaging systems, where the PSF varies across the Field of View (FOV). This is solved by segmenting the FOV into blocks with a similar PSF, and then applying the LSI model in each block.

B. Backward model

As described in the previous subsection, the forward model of the phase coded aperture layer is expressed as

I_{out} = I_{in} * PSF(\psi).   (9)

The PSF(ψ) varies with the depth (ψ), but it also has a constant dependence on the phase ring pattern parameters r and φ, as expressed in (7). In the network training process, we are interested in determining both r and φ. Therefore, we need to evaluate three separate derivatives: ∂I_{out}/∂r_i for i = 1, 2 (the inner and outer radius of the phase ring, as detailed in (6)) and ∂I_{out}/∂φ. All three are derived in a similar fashion:

\frac{\partial I_{out}}{\partial r_i/\phi} = \frac{\partial}{\partial r_i/\phi}\left[I_{in} * PSF(\psi, r, \phi)\right] = I_{in} * \frac{\partial}{\partial r_i/\phi} PSF(\psi, r, \phi).   (10)

Thus, we need to calculate ∂PSF/∂r_i and ∂PSF/∂φ. Since both derivatives are similar, we start with ∂PSF/∂φ and then describe the differences in the derivation of ∂PSF/∂r_i. Using (2), we get

\frac{\partial}{\partial\phi} PSF(\psi, r, \phi) = \frac{\partial}{\partial\phi}\left[\mathcal{F}\{P(\psi, r, \phi)\}\,\overline{\mathcal{F}\{P(\psi, r, \phi)\}}\right] = \left[\frac{\partial}{\partial\phi}\mathcal{F}\{P(\psi, r, \phi)\}\right]\overline{\mathcal{F}\{P(\psi, r, \phi)\}} + \mathcal{F}\{P(\psi, r, \phi)\}\left[\frac{\partial}{\partial\phi}\overline{\mathcal{F}\{P(\psi, r, \phi)\}}\right].   (11)

We may see that the main term in (11) is \frac{\partial}{\partial\phi}\mathcal{F}\{P(\psi, r, \phi)\} or its complex conjugate. Due to the linearity of the derivative and the Fourier transform, the order of operations can be reversed and rewritten as \mathcal{F}\{\frac{\partial}{\partial\phi} P(\psi, r, \phi)\}. Therefore, the last term remaining for calculating the PSF derivative is

\frac{\partial}{\partial\phi} P(\psi, r, \phi) = \frac{\partial}{\partial\phi}\left[P(\rho, \theta)\, CA(r, \phi)\, \exp\{j\psi\rho^2\}\right] = P(\rho, \theta)\, \exp\{j\psi\rho^2\}\, \frac{\partial}{\partial\phi}\left[CA(r, \phi)\right] = \begin{cases} jP(\psi, r, \phi) & r_1 < \rho < r_2 \\ 0 & \text{otherwise.} \end{cases}   (12)

Similar to the derivation of ∂PSF/∂φ, for calculating ∂PSF/∂r_i we also need \frac{\partial}{\partial r_i} P(\psi, r, \phi). Similar to (12), we have

\frac{\partial}{\partial r_i} P(\psi, r, \phi) = \frac{\partial}{\partial r_i}\left[P(\rho, \theta)\, CA(r, \phi)\, \exp\{j\psi\rho^2\}\right] = P(\rho, \theta)\, \exp\{j\psi\rho^2\}\, \frac{\partial}{\partial r_i}\left[CA(r, \phi)\right].   (13)


Fig. 12. Outdoor scenes depth estimation. From left to right: (a) the scene and its depth map acquired using (b) Lytro Illum camera, (c) Liu et al. [3] monocular depth estimation net, (d) our method. See caption of Fig. 9 for full details.


Since the ring radius enters through a step function, this derivative has to be approximated. We found that tanh(100ρ) achieves good enough results for the phase step approximation.

With the full forward and backward models, the phase coded aperture layer can be incorporated as a part of the FCN model, and the phase mask parameters r and φ can be learned along with the network weights.
and the phase mask parameters r and φ can be learned along [18] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth
with the network weights. from a conventional camera with a coded aperture,” ACM Transactions
on Graphics, vol. 26, no. 3, p. 70, 2007.
[19] P. A. Shedligeri, S. Mohan, and K. Mitra, “Data driven coded aperture
ACKNOWLEDGMENT design for depth recovery,” CoRR, vol. abs/1705.10021, 2017. [Online].
Available: http://arxiv.org/abs/1705.10021
This research was partially supported by ERC-StG SPADE [20] A. Chakrabarti and T. Zickler, “Depth and deblurring from a spectrally-
varying depth-of-field,” in Computer Vision–ECCV 2012. Springer,
PI Giryes and ERC-StG RAPID PI Bronstein. The authors are 2012, pp. 648–661.
grateful to NVIDIA’s hardware grant for the donation of the [21] M. Martinello, A. Wajs, S. Quan, H. Lee, C. Lim, T. Woo, W. Lee, S.-S.
Titan X that was used in this research. Kim, and D. Lee, “Dual aperture photography: Image and depth from a
mobile camera,” 04 2015.
[22] B. Milgrom, N. Konforti, M. A. Golub, and E. Marom, “Novel approach
for extending the depth of field of barcode decoders by using rgb
R EFERENCES channels of information,” Optics express, vol. 18, no. 16, pp. 17 027–
17 039, 2010.
[1] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from [23] H. Haim, A. Bronstein, and E. Marom, “Computational multi-
a single image using a multi-scale deep network,” in Advances focus imaging combining sparse model with color dependent phase
in Neural Information Processing Systems 27, Z. Ghahramani, mask,” Opt. Express, vol. 23, no. 19, pp. 24 547–24 556, Sep 2015.
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-
Eds. Curran Associates, Inc., 2014, pp. 2366–2374. [Online]. 23-19-24547
Available: http://papers.nips.cc/paper/5539-depth-map-prediction-from- [24] J. Goodman, Introduction to Fourier Optics, 2nd ed. MaGraw-Hill,
a-single-image-using-a-multi-scale-deep-network.pdf 1996.
[2] Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular [25] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely
images as classification using deep fully convolutional residual connected convolutional networks,” in The IEEE Conference on Com-
networks,” CoRR, vol. abs/1605.02305, 2016. [Online]. Available: puter Vision and Pattern Recognition (CVPR), July 2017.
http://arxiv.org/abs/1605.02305 [26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
[3] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single deep network training by reducing internal covariate shift,” in
monocular images using deep convolutional neural fields,” IEEE Proceedings of the 32nd International Conference on Machine
Transactions on Pattern Analysis and Machine Intelligence, 2016. Learning (ICML-15), D. Blei and F. Bach, Eds. JMLR Workshop
[Online]. Available: http://dx.doi.org/10.1109/TPAMI.2015.2505283 and Conference Proceedings, 2015, pp. 448–456. [Online]. Available:
[4] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
surface normal estimation from monocular images using regression [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
on deep features and hierarchical crfs,” in The IEEE Conference on with deep convolutional neural networks,” in Advances in Neural
Computer Vision and Pattern Recognition (CVPR), June 2015. Information Processing Systems 25, F. Pereira, C. J. C. Burges,
[5] H. Jung and K. Sohn, “Single image depth estimation with integration L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012,
of parametric learning and non-parametric sampling,” Journal of Korea pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-
Multimedia Society, vol. 9, no. 9, Sep 2016. [Online]. Available: imagenet-classification-with-deep-convolutional-neural-networks.pdf
http://dx.doi.org/10.9717/kmms.2016.19.9.1659 [28] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A.
[6] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and Riedmiller, “Striving for simplicity: The all convolutional net.”
N. Navab, “Deeper depth prediction with fully convolutional residual CoRR, vol. abs/1412.6806, 2014. [Online]. Available: http://dblp.uni-
networks,” CoRR, vol. abs/1606.00373, 2016. [Online]. Available: trier.de/db/journals/corr/corr1412.html#SpringenbergDBR14
http://arxiv.org/abs/1606.00373 [29] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi,
[7] R. Garg, B. G. V. Kumar, G. Carneiro, and I. D. Reid, “Unsupervised “Describing textures in the wild,” in Proceedings of the IEEE Conf.
CNN for single view depth estimation: Geometry to the rescue,” in on Computer Vision and Pattern Recognition (CVPR), 2014.
Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, [30] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic
The Netherlands, October 11-14, 2016, Proceedings, Part VIII, 2016, pp. open source movie for optical flow evaluation,” in European Conf. on
740–756. Computer Vision (ECCV), ser. Part IV, LNCS 7577, A. Fitzgibbon et al.
[8] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular (Eds.), Ed. Springer-Verlag, Oct. 2012, pp. 611–625.
depth estimation with left-right consistency,” in The IEEE Conference [31] N. Zeller, C. A. Noury, F. Quint, C. Teulière, U. Stilla, and
on Computer Vision and Pattern Recognition (CVPR), July 2017. M. Dhome, “Metric calibration of a focused plenoptic camera based
[9] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks on a 3d calibration target,” ISPRS Annals of Photogrammetry, Remote
for semantic segmentation,” CVPR, Nov. 2015. Sensing and Spatial Information Sciences, vol. III-3, pp. 449–456,
[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image 2016. [Online]. Available: https://www.isprs-ann-photogramm-remote-
recognition,” in The IEEE Conference on Computer Vision and Pattern sens-spatial-inf-sci.net/III-3/449/2016/
Recognition (CVPR), June 2016.
[11] P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor segmen-
tation and support inference from rgbd images,” in ECCV, 2012.
[12] N. Silberman and R. Fergus, “Indoor scene segmentation using a
structured light sensor,” in Proceedings of the International Conference
on Computer Vision - Workshop on 3D Representation and Recognition,
2011.
