
Depth from Defocus with Learned Optics for Imaging and Occlusion-aware Depth Estimation


Hayato Ikoma, Cindy M. Nguyen, Christopher A. Metzler, Member, IEEE, Yifan Peng, Member, IEEE,
and Gordon Wetzstein, Senior Member, IEEE

Abstract—Monocular depth estimation remains a challenging problem, despite significant advances in neural network architectures
that leverage pictorial depth cues alone. Inspired by depth from defocus and emerging point spread function engineering approaches
that optimize programmable optics end-to-end with depth estimation networks, we propose a new and improved framework for depth
estimation from a single RGB image using a learned phase-coded aperture. Our optimized aperture design uses rotational symmetry
constraints for computational efficiency, and we jointly train the optics and the network using an occlusion-aware image formation
model that provides more accurate defocus blur at depth discontinuities than previous techniques do. Using this framework and a
custom prototype camera, we demonstrate state-of-the-art image and depth estimation quality among end-to-end optimized
computational cameras in simulation and experiment.

Index Terms—Computational Photography, Computational Optics

1 INTRODUCTION

Robust depth perception is a challenging, yet crucial capability for many computer vision and imaging problems in robotics [1], [2], autonomous driving [3], [4], [5], [6], augmented reality [7], and 3D photography [8]. Existing approaches building on time-of-flight, stereo pairs, or structured illumination require high-powered illumination and complex hardware systems, making monocular depth estimation (MDE) from just a single 2D image [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19] one of the most attractive solutions.

MDE approaches typically rely on pictorial depth cues, such as perspective, partial occlusions, and relative object sizes learned from a dataset of training images in a supervised manner. These contextual cues reliably help estimate the relative ordering of objects within a scene [10], [16]. Defocus blur is another complementary depth cue, which has been exploited in depth from defocus (DfD) approaches [20], [21], [22], [23], [24]. Recent methods also propose network architectures that learn both pictorial and depth cues simultaneously [25]. Defocus cues, however, are ambiguous, which is why many computational photography approaches use coded apertures to engineer the defocus blur to optically encode more information than the conventional defocus blur contains [26], [27], [28], [29]. Hand-crafted aperture designs have recently been improved using an end-to-end (E2E) optimization of optics and image processing [30], [31], [32].

While existing E2E coded aperture MDE techniques have proven to work well, these methods do not take full advantage of the available monocular depth cues. Specifically, the linear optical image formation models employed by these approaches [30], [31], [32] do not model defocus blur at occlusion boundaries accurately. Thus, prior works exclusively rely on defocus information in image regions of locally constant depth. It is well known in the vision science community, however, that defocus blur and the spatial relationships implied by occluding edges provide an even stronger depth cue than pictorial cues for human vision [33], [34], [35].

To alleviate this shortcoming in DfD, we propose a nonlinear occlusion-aware optical image formation model that models defocus blur at occlusion boundaries more accurately than previous E2E approaches. Moreover, we adopt a rotationally symmetric design of our optimized phase-coded aperture, reducing the computational complexity and memory requirements of the optimization by an order of magnitude. Finally, we derive an effective preconditioning approach that applies an approximate inverse of the optical image formation model to the sensor measurements. This approximate inverse makes it significantly easier for the MDE network to robustly infer a depth map from the coded sensor image. Our approach is uniquely robust in estimating not only the depth map, but also an all-in-focus image from a single optically coded sensor image, which is crucial for direct view or downstream tasks that rely on image and depth, such as classification or object detection.

Specifically, our contributions are the following:
• We formulate the E2E optimization of a phase-coded aperture and MDE network using an occlusion-aware image formation model, a rotationally symmetric aperture, and an effective preconditioning approach.
• We analyze the proposed framework and demonstrate that it outperforms standard and E2E MDE approaches with comparable network architectures.
• We build a camera prototype with a custom-fabricated diffractive optical element (DOE) in its aperture and demonstrate its performance for indoor and outdoor scenes along with high-quality RGBD video capture.

• H. Ikoma, C. M. Nguyen, Y. Peng, and G. Wetzstein are with the Department of Electrical Engineering, Stanford University. C. A. Metzler is with the Department of Computer Science, University of Maryland.
• Project website: https://www.computationalimaging.org

2 RELATED WORK

2.1 Monocular Depth Estimation (MDE)
Deep learning is an attractive approach for MDE as networks can identify features unknown to humans in depth estimation.
A variety of deep learning methods for MDE have been proposed using custom loss functions [10], [13], local and global constraints [16], [36], [37], and varying levels of supervision [38], [39], [40]. Geometrically-driven approaches learn surface normal estimation in conjunction with depth estimation using conditional random fields [41], two-stream CNNs [42], and 3D reconstruction from videos [43]; all showing high performance on datasets such as KITTI [44] and NYU Depth [45]. Other approaches include estimating relative depth maps [46] and using the spectral domain to augment estimation [47]. To generalize better across datasets, past works have also taken to incorporating physical camera parameters, such as defocus blur [25], [48], focal length [49], or other sensor information [50], to utilize their implicit encoding of depth cues. We propose a computational optics approach to jointly optimize a phase-coded aperture and neural network for passive 3D imaging from a single image.

2.2 Computational Imaging for Depth Estimation
Instead of relying on a single 2D image, several variants of DfD capture and process two or more images using a sum-modified-Laplacian operator [51], spatial-domain convolution transforms [52], and quadratic likelihood functions [53]. Dual-pixel sensors have also been demonstrated to capture a stereo pair with sufficient disparity to estimate depth [54]. Amplitude- [26], [27], [55] and phase-coded [56], [57] apertures have also been extensively studied, as have depth estimation techniques that utilize chromatic aberrations [23]. Most of these approaches, however, use conventional lenses or hand-crafted aperture designs and algorithms, which do not optimize the system performance in an E2E fashion.

2.3 Deep Optics
Jointly designing optics or sensor electronics and networks has been explored for color filter design [58], spectral imaging [59], superresolution localization microscopy [60], superresolution single-photon imaging [61], extended depth of field [62], achromatic imaging [63], HDR imaging [64], [65], image classification [66], and video compressive sensing [67], [68]. A recent survey of the use of artificial intelligence in optics can be found in [69].

Principled approaches to jointly optimizing camera optics and depth estimation networks have also recently been proposed. For example, Haim et al. [31] use concentric rings in a phase mask to induce chromatic aberrations, while Wu et al. [30] rely on defocus cues in their jointly optimized phase mask and CNN-based reconstruction. Chang et al. [32] use E2E optimization to design a freeform lens for the task. Deep optics has also been extended to extract a depth map and multispectral scene information from a sensor measurement [70].

Inspired by the idea of deep optics, we propose a novel approach to E2E depth imaging that makes several important improvements over existing approaches in this area [30], [31], [32]. First, we introduce an occlusion-aware image formation model that significantly improves our ability to model and optically encode defocus blur at occlusion boundaries. Second, we introduce a preconditioning approach that applies an approximate inverse of our nonlinear image formation model before feeding the data into the depth estimation network. Finally, we tailor a rotationally symmetric optical design, which was recently introduced for achromatic imaging with a single DOE [63], to the application of MDE with a phase-coded aperture. Our framework enables us to recover both an RGB image and a depth map from a single coded sensor image, providing significantly higher resolution and accuracy compared to estimates from related work.

3 PHASE-CODED 3D IMAGING SYSTEM

This section describes our E2E training pipeline from the image formation model to the neural network-based reconstruction algorithm. We consider a camera with a learnable phase-coded aperture and a CNN that estimates both an all-in-focus (AiF) RGB image and a depth map from a raw sensor image with coded depth of field. This pipeline is illustrated in Fig. 1.

3.1 Radially Symmetric Point Spread Function
As in most cameras, ours is comprised of a sensor and a conventional photographic compound lens that focuses the scene on the sensor. We modify this optical system by adding a DOE into its aperture plane. This phase-coded aperture allows us to directly control the depth-dependent point spread function (PSF) of the imaging system using variations in the surface height of the DOE. The goal of the E2E optimization procedure described in this section is to find a surface profile which shapes the PSF in a way that makes it easy and informative for the CNN to estimate per-pixel scene depth and color from a single image.

The PSF is modeled as [71]

  PSF(ρ, z, λ) = | (2π/(λs)) ∫₀^∞ r D(r, λ, z) P(r, λ) J₀(2πρr) dr |².  (1)

Here, ρ and r are the radial distances on the sensor and aperture planes, respectively, λ is the wavelength, and J₀(·) is the zeroth-order Bessel function of the first kind. In this formulation, the camera lens with focal length f is focused at some distance d. The Gaussian thin lens formula 1/f = 1/d + 1/s relates these quantities to the distance s between lens and sensor. The defocus factor D(r, λ, z), which models the depth variation of the PSF for a point at some distance z from the lens, is given by

  D(r, λ, z) = (z / (λ(r² + z²))) exp( i (2π/λ) ( √(r² + z²) − √(r² + d²) ) ).  (2)

We employ a radially symmetric DOE design [63], which reduces the number of DOE parameters to be optimized, memory requirements, and compute time of the PSF by an order of magnitude compared to the requirements of a nonsymmetric design. Finally, the phase delay P on the aperture plane is related to the surface profile h of a DOE with refractive index n(λ) as

  P(r, λ) = a(r) exp( i (2π/λ) (n(λ) − n_air) h(r) ),  (3)

where n_air ≈ 1.0 is the refractive index of air, and a is the transmissivity of the phase mask, which is typically 1, but can also include light-blocking regions which set the transmissivity locally to 0.

We include a more detailed derivation of these formulations in our Supplemental Material. Although these equations are based on standard optical models [71], in the Supplement we derive a novel formulation that allows us to evaluate the integral of Eq. 1 efficiently.
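As a concrete illustration of Eqs. (1)–(3), the radially symmetric PSF can be evaluated with a one-dimensional quadrature; everything depends only on the radial coordinate, which is what makes this parameterization an order of magnitude cheaper than a full 2D aperture. The following Python/NumPy sketch is our own illustration, not the released code: the example focal length, focus distance, and aperture radius are taken from the paper's setup, but the coordinate scaling follows Eq. (1) as printed and the exact normalization is the one derived in the Supplemental Material.

import numpy as np
from scipy.special import j0

def psf_radial(rho, z, lam, h, a, n, f=50e-3, d=1.7, r_max=2.8e-3, n_r=2000):
    # rho: sensor-plane radii; z: scene depth [m]; lam: wavelength [m]
    # h(r): DOE height profile [m]; a(r): transmissivity; n(lam): DOE refractive index
    s = 1.0 / (1.0 / f - 1.0 / d)                  # thin lens: 1/f = 1/d + 1/s
    r = np.linspace(1e-6, r_max, n_r)              # radial samples on the aperture plane

    # Eq. (2): depth-dependent defocus factor
    D = z / (lam * (r**2 + z**2)) * np.exp(
        1j * 2 * np.pi / lam * (np.sqrt(r**2 + z**2) - np.sqrt(r**2 + d**2)))
    # Eq. (3): phase delay of the DOE (n_air ~ 1.0)
    P = a(r) * np.exp(1j * 2 * np.pi / lam * (n(lam) - 1.0) * h(r))

    # Eq. (1): zeroth-order Hankel-type integral, evaluated per sensor radius rho
    integrand = r[None, :] * D[None, :] * P[None, :] * j0(2 * np.pi * np.outer(rho, r))
    field = 2 * np.pi / (lam * s) * np.trapz(integrand, r, axis=1)
    return np.abs(field) ** 2

A flat DOE (h = 0) with an open aperture (a = 1) reproduces the conventional defocus blur; optimizing the sampled height profile h is what the E2E training described below does.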
[Fig. 1 diagram: Training Images → PSF (from DOE surface profile) → Coded Image → RGB Volume → U-Net-style CNN with skip connections → Prediction → Loss.]
Fig. 1. Illustration of E2E optimization framework. RGBD images of a training set are convolved with the depth-dependent 3D PSF created by a lens
surface profile h and combined using alpha compositing. The resulting sensor image b is processed by an approximate-inverse-based preconditioner
before being fed into the CNN. A loss function L is applied to both the resulting RGB image and the depth map. The error is backpropagated into
the CNN parameters and the surface profile of the phase-coded aperture.

3.2 Image Formation Model with Occlusion

Prior work on E2E optimized phase-coded apertures for snapshot 3D imaging [30], [31], [32] used variants of a simple linear image formation model of the form

  b(λ) = Σ_{k=0}^{K−1} PSF_k(λ) ∗ l_k(λ) + η,  (4)

where ∗ is the 2D convolution operator, b(λ) is a single wavelength of the sensor image, and η is additive noise. For this model, the input RGBD image is quantized into K depth layers l_k, with k = 0 being the farthest layer.

A linear model can accurately reproduce defocus blur for image regions corresponding to a locally constant depth value. However, this approach is incapable of accurately modeling defocus blur at depth discontinuities. Defocus blur at these depth edges is crucial for human depth perception [33], [35]; we argue that an MDE network would similarly benefit from more accurate defocus blur at depth edges. To this end, we adopt a nonlinear differentiable image formation model based on alpha compositing [72], [73], [74] and combine it with our wavelength- and depth-dependent PSF as

  b(λ) = Σ_{k=0}^{K−1} l̃_k ∏_{k′=k+1}^{K−1} (1 − α̃_{k′}) + η,  (5)

where l̃_k := (PSF_k(λ) ∗ l_k) / E_k(λ) and α̃_k := (PSF_k(λ) ∗ α_k(λ)) / E_k(λ). The depth map is quantized into K depth layers to compose binary masks α_k. As the convolutions with the PSFs are naively performed with the sub-images l_k and the binary masks α_k, the energy, or brightness, is unrealistically reduced at the transition of depth layers. Therefore, to recover it, we apply a normalization with a factor E_k(λ) := PSF_k ∗ Σ_{k′=0}^{k} α_{k′}. We implement the convolutions with fast Fourier transforms (FFTs) and crop 32 pixels at the boundaries to reduce possible boundary artifacts.

As seen in Fig. 2, our nonlinear model produces a more realistic defocused image from RGBD input than previously used linear models. Compared with Wu et al. [30] and Chang et al. [32], our model's improvements are especially noticeable around depth discontinuities, which provide the downstream network superior defocus information. Compared to the direct linear model, our model produces more accurate defocus blur around texture and depth edges. The error maps shown in Fig. 2 are computed with respect to the ray-traced ground truth sensor image. Note that ray tracing is a valuable tool for verifying these different image formation models, but it is not a feasible tool for training our system. It takes too long to ray trace images on the fly during training, and it is infeasible to pre-compute every ray-traced image for every possible phase-coded aperture setting. Please refer to the Supplemental Material for additional discussions.

Fig. 2. Comparing image formation models that simulate defocus blur from an RGB image (top left) and a depth map (top right). Existing linear models, including Wu et al.'s [30] and Chang et al.'s [32] variants of it, do not model blur at depth discontinuities adequately. Our nonlinear occlusion-aware model achieves a more faithful approximation of a ray-traced ground truth image.
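The compositing of Eq. (5), including the normalization factor E_k, can be sketched as follows for a single wavelength. This is a simplified illustration in Python/NumPy, not the authors' implementation: it uses circular FFT convolutions and omits the 32-pixel boundary crop and the additive sensor noise η.

import numpy as np

def fft_conv2d(img, psf):
    # circular 2D convolution via FFTs with a centered PSF
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(np.fft.ifftshift(psf))))

def occlusion_aware_image(l, alpha, psfs, eps=1e-8):
    # l, alpha, psfs: arrays of shape (K, H, W); layer k = 0 is the farthest
    K = l.shape[0]
    cum_alpha = np.cumsum(alpha, axis=0)           # sum of alpha_k' for k' <= k
    E = np.stack([fft_conv2d(cum_alpha[k], psfs[k]) for k in range(K)]) + eps
    l_tilde = np.stack([fft_conv2d(l[k], psfs[k]) for k in range(K)]) / E
    a_tilde = np.stack([fft_conv2d(alpha[k], psfs[k]) for k in range(K)]) / E

    b = np.zeros_like(l[0])
    for k in range(K):                             # alpha-composite from far to near
        occlusion = np.prod(1.0 - a_tilde[k + 1:], axis=0) if k < K - 1 else 1.0
        b += l_tilde[k] * occlusion
    return b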
3.3 CNN-based Estimation of Image and Depth

In the E2E training, we utilize a CNN to jointly estimate an all-in-focus image and a depth map, i.e., an RGBD image. We describe its architecture and training details in the following.

3.3.1 Preconditioning with Approximate Inverse
Although the linear image formation model outlined in Eq. 4 is not accurate at occlusion boundaries, it provides a simple-enough framework to serve as a preconditioning step for our network. Specifically, we formulate the inverse problem of finding a multiplane representation l^(est) ∈ R^{M×N×K} from a single 2D sensor image as a Tikhonov-regularized least squares problem with regularization parameter γ:

  l^(est) = argmin_{l ∈ R^{M×N×K}} ‖ b − Σ_{k=0}^{K−1} PSF_k ∗ l_k ‖² + γ ‖l‖².  (6)

We omit the wavelength dependence for notational simplicity here. In our Supplemental Material, we derive a closed-form solution for this inverse problem in the frequency domain. It is implemented with FFTs, and edge-tapering is applied as a pre-processing step to reduce ringing artifacts [75]. This closed-form inverse of the linear image formation model maps the 2D sensor image into a layered 3D representation that has the sharpest details on the layer corresponding to the ground truth depth, even though it is incorrect at depth edges. Thus, in simplified terms, our CNN then has to find the layer with the sharpest details or highest gradients at each pixel. This pixel value is close to the sought-after RGB value, and the corresponding layer index is representative of its depth. It is therefore intuitive that the CNN will have an easier time learning the mapping from a layered depth representation to an RGB image and depth map, rather than having to learn the "full" inverse starting from the sensor image. These arguments are further discussed and experimentally validated in Section 4. The closed-form solution is fully differentiable, and its computational cost is dominated by the Fourier transform, so it is O(N² log N) for an image with N² pixels.
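Because the normal equations of Eq. (6) decouple per spatial frequency, the closed-form solution takes a familiar Wiener-like form: l̂_k = conj(P̂_k) b̂ / (Σ_{k′} |P̂_{k′}|² + γ), where P̂_k is the optical transfer function of depth layer k. The sketch below is our own single-channel illustration in Python/NumPy; the paper's actual implementation, including the edge-tapering pre-processing, is described in the Supplemental Material.

import numpy as np

def tikhonov_layered_deconv(b, psfs, gamma=1e-2):
    # b: sensor image (H, W); psfs: (K, H, W) centered PSFs; returns (K, H, W)
    B = np.fft.fft2(b)
    P = np.fft.fft2(np.fft.ifftshift(psfs, axes=(-2, -1)))    # per-layer OTFs
    denom = np.sum(np.abs(P) ** 2, axis=0) + gamma
    L = np.conj(P) * B[None, :, :] / denom[None, :, :]
    return np.real(np.fft.ifft2(L, axes=(-2, -1)))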
3.3.2 CNN Architecture
We use the well-known U-Net style network architecture [76] to estimate the 4-channel RGBD image (Fig. 1). The input is the channel-wise concatenation of the captured image and the multilayer deconvolved image (Eq. 6) and is transformed to a 32-channel feature map with a 1×1 convolution. Our CNN has skip connections and five scales with four consecutive downsamplings and upsamplings. Each scale has two consecutive convolutions to output features, and the number of channels for the features is set to [32, 64, 64, 128, 128], respectively. All convolution layers are followed by a batch normalization and a rectified linear unit (ReLU). The downsamplings and upsamplings are performed with maxpool and bilinear interpolation. Our CNN has ∼1M trainable parameters, which is significantly smaller than conventional MDE networks.

3.4 PSF Regularization
Due to memory constraints, the E2E pipeline has to be trained using image patches, which are significantly smaller than the full sensor. Therefore, the PSF is optimized only over the size of the image patch and has no constraints outside the patch. However, the PSF may create non-zero energy outside the patch, which would reduce the contrast of captured images in practice. To prevent these artifacts, we penalize the energy of the PSF with the regularizer

  L_PSF = Σ_{λ∈{R,G,B}} Σ_{k=0}^{K−1} Σ_{ρ>ρ_target} |PSF_k(ρ, λ)|²,  (7)

where PSF_k(·, λ) is a 1D PSF evaluated over a full sensor size and ρ_target is a target PSF radius. Although the evaluation of the multi-color 3D PSF over a full sensor is computationally expensive, the regularizer (Eq. 7) is inexpensive to evaluate while having the same effectiveness due to the rotational symmetry of the PSF. In our training, we used a target radius of 32 pixels.
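Owing to the 1D radial parameterization, the regularizer of Eq. (7) reduces to a simple sum over the radial samples that fall outside the target radius. A minimal sketch, with array shapes assumed for illustration:

import numpy as np

def psf_regularizer(psf_1d, rho, rho_target=32.0):
    # psf_1d: (K, 3, R) radial PSF samples per depth layer and color channel
    # rho: (R,) radial coordinates in pixels
    outside = rho > rho_target
    return np.sum(np.abs(psf_1d[..., outside]) ** 2)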
3.4.1 Training Loss Function
We train the network using a feature similarity loss for the RGB image, L_RGB, an L1 loss for the depth map, L_Depth, and a regularizer for the PSF, L_PSF:

  L = ψ_RGB L_RGB + ψ_Depth L_Depth + ψ_PSF L_PSF,  (8)

where ψ denotes the respective regularization weight. The feature similarity is evaluated on the input, conv1_2, conv2_2, and conv3_3 features of a pre-trained VGG-16 network [77]. The losses for the features are weighted with [0.250, 0.080, 0.250, 0.208], following the fine-tuned weights used in [78]. Since the preconditioning with the approximate inverse has worse performance at the boundaries due to edge-tapering, the 32 pixels at the boundaries are excluded from the evaluation of the loss. We set ψ_RGB = ψ_Depth = 1 and ψ_PSF = 45, all of which are manually tuned.

3.5 Training Details
The E2E model was trained for 100 epochs with the Adam optimizer (β₁ = 0.9, β₂ = 0.999) with a batch size of 3 and evaluated on the validation set at the end of every epoch. Among the 100 checkpoints, the one achieving the lowest validation loss is used for evaluating on the test set. Source code, pre-trained network models, and phase masks are available on the project website: https://www.computationalimaging.org/publications/deepopticsdfd.

4 ANALYSIS AND EVALUATION

In this section, we describe a number of qualitative and quantitative experiments we performed to evaluate our method and compare it to related work.

4.1 Datasets
For our simulated results, we use the cleanpass subset of the FlyingThings3D dataset for training [79], [80]. This dataset contains 22K and 8K pairs of an RGB image and corresponding depth maps for training and testing, respectively. The training set is divided into 18K and 4K pairs for training and validation, respectively. During training, we performed random cropping with window sizes of 384 × 384 pixels and random horizontal/vertical flipping to augment the training set. The target depth range was set to 1.0 m to 5.0 m, and the camera is focused at 1.7 m with an f-number of 6.3. When the depth map was converted to an alpha channel volume, it was resampled with the inverse perspective sampling scheme [54].
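The conversion of a metric depth map into the K binary alpha masks of Eq. (5), with planes spaced uniformly in inverse depth between 1 m and 5 m, can be sketched as follows. This is our interpretation of the inverse perspective sampling scheme [54], not the released code; the number of layers K is an illustrative assumption.

import numpy as np

def depth_to_alpha(depth, K=16, d_near=1.0, d_far=5.0):
    # depth: (H, W) in meters; returns alpha of shape (K, H, W) with layer 0 farthest
    inv_planes = np.linspace(1.0 / d_far, 1.0 / d_near, K)    # uniform in inverse depth
    planes = 1.0 / inv_planes                                  # planes[0] = d_far
    idx = np.argmin(np.abs(1.0 / depth[None] - inv_planes[:, None, None]), axis=0)
    alpha = np.zeros((K,) + depth.shape, dtype=np.float32)
    np.put_along_axis(alpha, idx[None], 1.0, axis=0)           # one-hot mask per pixel
    return alpha, planes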
[Fig. 3 panel labels: Ground truth (GT), AiF, DfD, Haim et al., Wu et al., Chang et al., Ours w/o pinv., Ours w/ pinv. Image PSNR values shown in the figure: GT, 24.4 dB, 24.0 dB, 23.2 dB, 22.4 dB, 31.8 dB, 31.9 dB; depth RMSE values: 0.896 m, 0.321 m, 0.682 m, 0.683 m, 0.679 m, 0.332 m, 0.276 m.]

Fig. 3. Top: Ground truth (GT) RGB image (left) along with the simulated sensor images of all baseline approaches. These baselines (columns 3–6)
do not attempt to estimate the GT RGB image. Estimated RGB images of our approach, without and with the proposed preconditioning. Bottom:
GT and estimated depth maps of all approaches. The quality of RGB image and depth map estimated by our method is best for this scene. PSNR
of the image and RMSE of the depth map are shown on the top right. See the Supplemental Material for their corresponding PSFs and captured
images.

TABLE 1
An ablation study and comparison to previous work in simulation. Top: all methods are implemented as described in their respective papers and
use their respective sensor image as input. The output of each network is compared to the ground truth depth map, and we additionally compare
either the estimated RGB image or, if an algorithm does not directly compute that, the sensor image to the all-in-focus reference image. Bottom:
an ablation of different variants of the proposed rotationally symmetric DOE design for the linear image formation, a linear image formation with
nonlinear refinement [30], and the proposed nonlinear model. Using a variety of different metrics on estimated RGB images and depth maps, we
demonstrate that the proposed approach is the best when using a comparable CNN architecture for all methods.

                                              |  Image: MAE↓ / PSNR↑ / SSIM↑  |  Depth: MAE↓ / RMSE↓ / log10↓ / δ<1.25↑ / δ<1.25²↑ / δ<1.25³↑
Prior work:
  All in focus (AiF)    | —                   |  GT / GT / GT                 |  0.357 / 0.500 / 0.099 / 0.658 / 0.807 / 0.874
  Depth from def. (DfD) | —                   |  3.24e-2 / 24.95 / 0.711      |  0.097 / 0.228 / 0.039 / 0.929 / 0.965 / 0.979
  Haim et al. [31]      | —                   |  3.28e-2 / 24.90 / 0.708      |  0.297 / 0.635 / 0.109 / 0.803 / 0.879 / 0.923
  Wu et al. [30]        | —                   |  3.49e-2 / 24.54 / 0.704      |  0.207 / 0.521 / 0.090 / 0.865 / 0.918 / 0.945
  Chang et al. [32]     | —                   |  3.62e-2 / 24.28 / 0.694      |  0.205 / 0.490 / 0.077 / 0.888 / 0.945 / 0.968
Rot. symmetric (ours):
  Linear w/o pinv       | —                   |  2.02e-2 / 30.01 / 0.870      |  0.268 / 0.598 / 0.108 / 0.845 / 0.898 / 0.925
  Linear w/ pinv        | —                   |  1.99e-2 / 30.86 / 0.891      |  0.258 / 0.554 / 0.103 / 0.856 / 0.899 / 0.927
  Linear w/o pinv       | Nonlin. w/o pinv    |  1.89e-2 / 31.43 / 0.900      |  0.127 / 0.264 / 0.065 / 0.901 / 0.952 / 0.964
  Linear w/ pinv        | Nonlin. w/ pinv     |  1.83e-2 / 31.58 / 0.902      |  0.095 / 0.203 / 0.038 / 0.931 / 0.969 / 0.979
  Nonlin. w/o pinv      | —                   |  1.82e-2 / 31.61 / 0.903      |  0.104 / 0.237 / 0.041 / 0.925 / 0.963 / 0.977
  Nonlin. w/ pinv       | —                   |  1.76e-2 / 31.88 / 0.905      |  0.089 / 0.191 / 0.034 / 0.941 / 0.970 / 0.981
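For reference, the depth metrics reported in Table 1 can be computed as in the following sketch; the exact masking and averaging conventions used for the paper's evaluation are not spelled out here.

import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    pred, gt = np.maximum(pred, eps), np.maximum(gt, eps)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "MAE":      np.mean(np.abs(pred - gt)),
        "RMSE":     np.sqrt(np.mean((pred - gt) ** 2)),
        "log10":    np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        "d<1.25":   np.mean(ratio < 1.25),
        "d<1.25^2": np.mean(ratio < 1.25 ** 2),
        "d<1.25^3": np.mean(ratio < 1.25 ** 3),
    }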

4.2 Baseline Comparisons
We compare our method to several alternative approaches:
• AiF: Applying the same depth estimation CNN we use in our model directly to a ground truth (GT) all-in-focus (AiF) image.
• DfD: Applying the same depth estimation CNN we use to a sensor image with a conventional (non-learned) defocus blur with a similar f-number as our setting.
• Haim et al. [31]: A three-ring phase-coded aperture design implemented with our radially symmetric PSF model. The step function representing the rings is implemented with tanh(100ρ) as proposed in that work.
• Wu et al. [30]: The PhaseCam3D approach implemented with a DOE size of 256 × 256 features and 55 Zernike coefficients. The DOE was initialized with the DOE which minimizes the mean of the Cramér-Rao lower bound for single-emitter localization as described in that work.
• Chang et al. [32]: A singlet lens introducing chromatic aberrations implemented with our radially symmetric PSF model. All optical parameters match our setup.

Related works are typically trained using only L_Depth with regularization of depth maps and the PSFs. For a fair comparison of optical models, we used only L_Depth for the respective works and the baselines. We reimplemented their image formation models by following their respective papers or the code provided by the authors.

4.3 Comparisons to Prior Work
Fig. 3 shows qualitative and quantitative results for one example from our test dataset. The ground truth RGB image and depth maps are shown on the left, followed by all baselines described above. Due to the fact that none of the baselines attempt to estimate an AiF RGB image, we evaluate their RGB image quality on the captured sensor image. Unsurprisingly, our estimated RGB image is significantly better than all of these sensor images when compared to the reference AiF image. When comparing the quality of the estimated depth maps, the conventional DfD approach does surprisingly well, much better than any of the optimized methods. This is likely due to the fact that all of these approaches use variants of the linear image formation model, which provide inaccurate defocus blur around depth discontinuities, whereas our implementation of DfD is trained with the nonlinear image formation model that all methods are tested against. Nevertheless, our approach outperforms all of these baselines when implemented with the proposed preconditioning using the approximate inverse (pinv). Without the preconditioning, our approach does slightly worse on the depth map than the DfD approach, which is understandable because our approach needs to recover both depth map and RGB image whereas DfD only estimates the depth map with the same CNN architecture. These trends are confirmed by the quantitative results shown in Table 1 (top).
4.4 Additional Ablations
We also ablate the proposed rotationally symmetric DOE design in more detail in Table 1 (bottom) by analyzing the importance of the nonlinear image formation model over the linear one with optional nonlinear refinement, as proposed by Wu et al. [30]. For all of these variants of our DOE design, using the pinv improves both image and depth quality compared to the results not using the pinv. Moreover, the nonlinear model also performs better than the linear variants.

In Table 2, we evaluate the effect of the relative weights of the image and depth terms of the loss function (Eq. 8). As expected, the relative weights between the two loss terms directly trade RGB PSNR for depth RMSE, with our choice of parameters (1.0, 1.0) being a good tradeoff between the two.

TABLE 2
Evaluating different weightings of the loss function.

  Loss weights (ψ_RGB, ψ_Depth) | Image PSNR | Depth RMSE
  (1.0, 1.0)                    | 31.88      | 0.191
  (1.0, 0.1)                    | 33.83      | 0.307
  (0.1, 1.0)                    | 29.91      | 0.184

TABLE 3
Evaluating diffraction efficiency (DE) using RGB PSNR / depth RMSE metrics. We train DOEs from scratch for several different DEs (rows) and test them using the same and other DEs (columns).

  Designed \ Tested | 100%         | 75%          | 50%
  100%              | 31.88 / 0.19 | 30.98 / 0.36 | 27.66 / 0.79
  75%               | 29.79 / 0.27 | 31.74 / 0.19 | 29.50 / 0.35
  50%               | 27.91 / 0.67 | 30.38 / 0.34 | 31.24 / 0.21

Fig. 4. (a) A disassembled camera lens next to our fabricated DOE with a 3D-printed mounting adapter. (b) A microscopic image of the fabricated DOE. The dark gray area is the DOE made of NOA61, and the light gray area is the light-blocking metal aperture made of chromium and gold. The black scale bar on the bottom right is 1 mm. (c) The height profile of the designed DOE. The maximum height is 2.1 µm.

5 EXPERIMENTAL ASSESSMENT

In this section, we discuss modifications to the training procedure that account for physical constraints as well as fabrication details and experimentally captured results.

5.1 Training for Camera Prototype
5.1.1 Additional Datasets
While the FlyingThings3D dataset provides complete depth maps aligned with RGB images, the images are synthetic and do not represent natural scenes. Therefore, we additionally used the DualPixels dataset [54] to learn the features of natural scenes. This dataset consists of a set of multi-view images and their depth maps captured by smartphone cameras. It has 2,506 captured images for training and 684 for validation. We used only the central view out of the five views for training. As the provided depth map is sparse, we inpainted the depth map to obtain a complete depth map [45], [81]. While the completed depth map is used for the simulated image formation during training, the loss function L_Depth is evaluated only at valid (i.e., non-inpainted) depth values. Training images are drawn from DualPixels and FlyingThings3D with the same probability.

5.1.2 PSF Model with Limited Diffraction Efficiency
As often observed in practice, our fabricated DOEs have an imperfect diffraction efficiency (DE), which means that some amount of the incident light passes straight through them without being diffracted. In this scenario, the measured PSF of the imaging system comprises a superposition of the native PSF of the focusing lens and the designed PSF created by the phase-coded aperture. With a DE of µ, we model the resulting PSF as

  PSF = µ · PSF_design + (1 − µ) · PSF_native.  (9)

To quantify our DE, we fabricated a diffraction grating and determined that the DE of our fabrication process is ∼70%. With this DE, the DOE and the network were jointly optimized for our physical prototype.

We parameterized the DOE height using 400 learnable parameters, which matches the accuracy of our fabrication technique reasonably well. For simulating the PSF, however, we upsample these 400 features to 4,000 pixels using nearest-neighbor upsampling to ensure accuracy.

To evaluate the impact of the limited DE of a physical DOE, we performed additional simulations analyzing the performance of various combinations of diffraction efficiencies for training and testing (Tab. 3). Unsurprisingly, optimizing for the correct DE is always best, with mismatches degrading performance. Reducing the DE also decreases the overall performance.

5.1.3 Robust Optimization of PSF
Our image formation model assumes shift invariance of the PSF on any one depth plane. In practice, however, the PSF slightly changes due to optical aberrations, as visualized in the Supplemental Material. Moreover, discretizing the scene depths does not model the PSFs between the sampled depth planes. We empirically found that this discrepancy destabilizes the accuracy of our method when applied to experimentally captured data. To overcome this issue, we randomly shift the red and blue channels of the PSF with a maximum shift of 2 pixels during training, leaving the green channel fixed. Furthermore, each depth plane is randomly sampled in between the equidistant depth planes for the PSF simulation per batch. The farthest plane is randomly sampled between 5 m and 100 m.
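The robustness measures of Secs. 5.1.2 and 5.1.3 amount to small perturbations of the simulated PSF during training. The sketch below shows one way to implement them; how these hook into the training loop, and whether the equidistant planes are spaced in depth or in inverse depth, are our assumptions (we use inverse depth here, consistent with the sampling of Sec. 4.1).

import numpy as np

def augment_psf(psf_design, psf_native, mu=0.7, max_shift=2, rng=np.random):
    # psf_*: (K, 3, H, W) depth- and color-dependent PSFs
    psf = mu * psf_design + (1.0 - mu) * psf_native            # Eq. (9)
    for c in (0, 2):                                           # jitter R and B channels only
        dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
        psf[:, c] = np.roll(psf[:, c], shift=(dy, dx), axis=(-2, -1))
    return psf

def sample_depth_planes(K=16, d_near=1.0, rng=np.random):
    # draw one random plane inside each inverse-depth bin; farthest plane in 5-100 m
    d_far = rng.uniform(5.0, 100.0)
    edges = np.linspace(1.0 / d_far, 1.0 / d_near, K + 1)
    inv_d = rng.uniform(edges[:-1], edges[1:])
    return np.sort(1.0 / inv_d)[::-1]                          # farthest plane first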
[Fig. 5 rows: design, capture, fitted; depth color scale 1.0 m to 5.0 m.]

Fig. 5. Depth-dependent point spread functions (PSFs). The designed PSF (top row) is optimized with our end-to-end simulator. Optical
imperfections result in the captured PSF (center row), slightly deviating from the design. Instead of working directly with the captured PSF, we
fit a parametric model to it (bottom row), which is then used to refine our CNN. The scale bar represents 100 µm. For visualization purposes, we
convert the linear intensity of the PSF to amplitude by applying a square root.

[Fig. 6 columns: Conventional, Estimated Depth (conventional), Coded Image, Estimated AiF, Estimated Depth; depth color scale 1.0 m to 5.0 m.]

Fig. 6. Experimentally captured results of indoor and outdoor scenes. From left: Images of scenes captured with a conventional camera, depth maps
estimated by a CNN from these conventional camera images, images captured by our phase-coded camera prototype with the optimized DOE, AiF
images estimated by our algorithm from these coded sensor images, depth maps estimated by our algorithm from these coded sensor images. A
top view of the indoor scene (top row) and the size of the receptive field of our neural network are visualized in the Supplemental Material.

5.1.4 Training Details
The camera settings, optimizer, and loss function are the same as in the ablation study except for the change of the weighting of the loss functions. We set ψ_RGB = ψ_Depth = ψ_PSF = 1.

5.1.5 Fabrication and Hardware Implementation
The trained DOE is fabricated using the imprint lithography technique. For this purpose, the designed phase mask is patterned on a positive photoresist layer (AZ-1512, MicroChemicals) with a grayscale lithography machine (MicroWriter ML3, Durham Magneto Optics), and its 3D structure is then replicated onto a UV-curable optical adhesive layer (NOA61, Norland Products) on a glass substrate. The glass substrate is also coated with a chromium-gold-chromium layer to block the incoming light around the DOE. Additional details on this fabrication procedure are described in [63].

The glass substrate with the DOE is mounted in the aperture plane of a compound lens (Yongnuo, 50 mm, f/1.8) with a custom 3D-printed holder. To reduce multiple reflections inside the lens, a black nylon sheet is also inserted between the DOE and the lens. The DOE has a diameter of 5.6 mm, which corresponds to f/6.3 for the compound lens. The lens is mounted on a machine vision camera (FLIR Grasshopper3), and images are captured in 16-bit raw mode. The fabricated DOE and our mounting system are shown in Fig. 4. Since we manually align the DOE and the light-blocking annulus (Fig. 4, b), these two are not perfectly aligned, partly contributing to the undiffracted light. Specifically, we measured a misalignment of ∼140 µm between these two components.
[Fig. 7 panels: Conventional, Estimated Depth (conventional), Estimated Depth (MiDaS), Coded Image, Estimated AiF (ours), Estimated Depth (ours); depth color scale 1.0 m to 5.0 m.]

Fig. 7. Selected frames of experimentally captured dynamic scenes. The full dynamic scenes are available as supplemental movies. From top-left to bottom-right: an image of the scene captured with a conventional camera, a depth map estimated by a CNN comparable to ours from this conventional camera image, a depth map estimated from the conventional image by MiDaS [19], an image captured by our phase-coded camera prototype with the optimized DOE, an AiF image estimated by our CNN from this coded sensor image, and a depth map estimated by our CNN from the coded sensor image.

5.2 Model Refinement with PSF Calibration
After fabricating and mounting the DOE in our camera, we record depth-dependent PSFs of this system by capturing a white LED with a 15 µm pinhole at multiple depths. For each depth, ten camera images are averaged to reduce capture noise, and the averaged image is demosaiced with bilinear interpolation. As shown in Fig. 5 (center row), the captured PSF is slightly different from the designed one (top row). This difference originates from various factors, including optical aberrations, misalignment of the DOE inside the compound lens, and fabrication errors. To accommodate for this difference with our RGB and depth estimation CNN, a PSF model is fitted with the MSE loss to the captured PSF by optimizing a rotationally symmetric height map and the diffraction efficiency in post-processing. With the fitted PSF (Fig. 5, bottom row), we refine our CNN with the same training procedure described before but with a fixed PSF for inference with captured images.

To optimize the robustness of our method during inference, we feed a set of horizontally and vertically flipped sensor images into our pre-trained network and take the average of their outputs as the final estimation. This inference-time augmentation is possible due to the rotational symmetry of the PSF.
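A minimal sketch of this inference-time augmentation, assuming model is any callable mapping a sensor image of shape (H, W, 3) to an RGBD estimate of shape (H, W, 4); the exact set of flips that is averaged is our assumption.

import numpy as np

def predict_with_flips(model, image):
    flips = [
        (lambda x: x,             lambda y: y),
        (lambda x: x[:, ::-1],    lambda y: y[:, ::-1]),       # horizontal flip
        (lambda x: x[::-1, :],    lambda y: y[::-1, :]),       # vertical flip
        (lambda x: x[::-1, ::-1], lambda y: y[::-1, ::-1]),    # both
    ]
    outputs = [unflip(model(flip(image))) for flip, unflip in flips]
    return np.mean(outputs, axis=0)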
5.3 Experimental Results
We show experimentally captured results in Fig. 6 and in the supplemental movies. These examples include scenes captured in both indoor and outdoor settings. The sensor images captured with our phase-coded aperture camera prototype (column 3) look more blurry than those of a conventional camera image of the same scenes (column 1). Notably, this depth-dependent blur encodes the optimized information that is used by our pre-trained and refined CNN to estimate all-in-focus images (column 4) and depth maps (column 5). The image quality of our estimated RGB images is very good and comparable to the reference images. Our depth maps show accurately estimated scene depth with fine details, especially in challenging areas like the plants in the bottom rows and the toys in the top row. Compared to depth maps estimated from the conventional camera images with a CNN architecture similar to that used by our approach (column 2), our depth maps are significantly more detailed. They can easily segment high-frequency objects apart, and they show an overall higher quality than this baseline does.

In Fig. 7 and the supplemental movies, we compare our estimated RGBD images against a baseline model trained on AiF images and a state-of-the-art MDE method (MiDaS) [19]. For MiDaS, we used the code with a trained checkpoint provided by the authors (v2.1). While MiDaS estimates a qualitatively good depth map, its estimation remains relative and is not consistent between different frames. On the other hand, our method estimates accurate depth in a temporally consistent manner.

Finally, we show experiments that help quantify the depth accuracy achieved by our prototype in Fig. 8. In this experiment, we capture five photographs of a scene where one object, i.e., the book, is moved to different distances of known values. We extract a region of interest (ROI) of size 50 × 50 pixels in each of the estimated depth maps and report the estimated depth as the mean value of the ROI. The estimated depth values (shown in the labels of the individual depth maps) are in good agreement with the calibrated ground truth distances, with a total root mean square error of 0.17 m for all five depth planes.

6 DISCUSSION

In summary, we present a new approach to jointly optimize a phase-coded aperture implemented with a single DOE and a CNN that estimates both an all-in-focus image and a depth map of a scene. Our approach is unique in leveraging a nonlinear image formation model that more accurately represents the defocus blur observed at depth discontinuities than previous approaches do. Our model also leverages a rotationally symmetric DOE/PSF design, which makes the training stage computationally tractable by reducing both memory consumption and the number of optimization variables by an order of magnitude compared to those of previous works. Although our nonlinear image formation model is marginally more computationally expensive than the linear model during training time, it is not part of the test/inference time, where this operation is performed physically by the optics.

We note that other parameterizations of the DOE could also provide computational benefits. For example, similar to Sitzmann et al. [62] and Wu et al. [30], we could use a Zernike representation of the DOE that matches the small number of parameters of our rotationally symmetric model. Although these two options would have the same number of parameters to optimize, the Zernike representation would be smooth and still require an order of magnitude more memory, which is the primary problem the rotationally symmetric model solves. The latter requires exclusively 1D computations to evaluate the whole rotationally symmetric 2D PSF. For the Zernike representation, all of these calculations need to be done in 2D at full resolution. Because we use an E2E-differentiable model, the huge amount of intermediate variables that need to be stored in the computational graph for these 2D calculations makes a Zernike-based option as memory intensive as other options.
[Fig. 8 panels: Coded Image, Estimated AiF, Estimated Depth; ground truth book distances 1.5 m, 2.0 m, 2.5 m, 3.0 m, 3.5 m; estimated distances 1.64 m, 2.12 m, 2.70 m, 3.27 m, 3.48 m; depth color scale 1.0 m to 5.0 m.]

Fig. 8. Experimental quantitative analysis. A scene containing several objects, including a book, is photographed multiple times with the book
positioned at different depths. The depth of this book is determined from the estimated depth maps. The root mean square error evaluated for all
five depth planes is 0.17 m.
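As a quick consistency check, the reported 0.17 m error can be reproduced from the per-plane values listed in Fig. 8, assuming the ground-truth and estimated distances pair up in the order given:

import numpy as np
gt  = np.array([1.5, 2.0, 2.5, 3.0, 3.5])
est = np.array([1.64, 2.12, 2.70, 3.27, 3.48])
print(np.sqrt(np.mean((est - gt) ** 2)))    # approx. 0.172 m, i.e. 0.17 m after rounding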

6.1 Limitations and Future Work
The primary limitations of our phase-coded aperture include its limited diffraction efficiency as well as some amount of shift variance of the measured PSFs (see the Supplemental Material). In this project, we were able to successfully work around these issues by optimizing a DOE that takes the limited diffraction efficiency into account and by randomly jittering the PSF during training, making it robust to slight shifts. Yet, the performance of similar types of computational imaging systems could be greatly improved by optimizing the fabrication processes and diffraction efficiency of the optical elements as well as the alignment and calibration of the fully integrated systems.

In our captured results (Fig. 6), we also see some edges of textured regions appearing in the estimated depth maps. These remaining imperfections could be introduced by any difference between our image formation model and the physical system, including a small amount of spatial variation of the PSF, optical aberrations, or a slight mismatch of the estimated and true diffraction efficiency of the DOE. Moreover, we only simulate the PSF at three discrete wavelengths, to keep memory usage reasonable, whereas the physical sensor integrates over a reasonably broad spectrum. Finally, we discretize the depth of the scene into layers whereas the physical model is continuous. We account for some of these issues by jittering the PSF during training, but not all of these physical effects can be perfectly modeled. Thus, although our approach shows significant improvements over related methods, there is further room for improving experimental results.

Network architectures and training procedures for MDE have greatly improved in performance at the cost of increased complexity (e.g., [18], [19]). These software-only approaches are very successful in estimating relative depth information of a scene, but they are unable to reliably estimate absolute scene depth. Depth-from-learned-defocus-type approaches have the ability to estimate robust absolute scene depth in regions where texture and depth edges are available, but our work and previous approaches in this area use relatively small networks that lack the capacity of modern monocular depth estimators and thus may not be able to learn contextual cues as effectively as those methods do. Therefore, it is important to explore different network architectures that are optimized to capture both the physical information provided by (coded) defocus blur as well as the contextual cues encoded by the pictorial scene information. Finally, treating the image and depth reconstruction tasks with separate networks could further improve the network capacity, but at the cost of increased memory consumption.

6.2 Conclusion
The emerging paradigm of E2E optimization of optics and image processing has great potential in various computational optics applications. We believe that depth-dependent PSF engineering in particular, for example to passively estimate the depth of a scene, is among the most promising directions of this paradigm, with potential impacts on robotics, autonomous driving, human-computer interaction, and beyond. With our work, we make significant progress towards making jointly optimized hardware-software systems practical in these applications.

ACKNOWLEDGMENTS
C.M.N. was supported by an NSF Graduate Research Fellowship under award DGE-1656518. C.A.M. was supported by an appointment to the Intelligence Community Postdoctoral Research Fellowship Program at Stanford University administered by Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy and the Office of the Director of National Intelligence (ODNI). G.W. was further supported by NSF awards 1553333 and 1839974, a Sloan Fellowship, and a PECASE by the ARO. Part of this work was performed at the Stanford Nano Shared Facilities (SNSF), supported by the National Science Foundation under award ECCS-2026822. We would like to thank the following Blend Swap users for models used in our Blender rendering: pujiyanto (wooden chair), wawanbreton (blue sofa), bogiva (kettle), danikreuter (vespa), animatedheaven (basketball), tikiteyboo (bottle crate), TowerCG (spider plant), oenvoyage (piggy bank), JSantel (banana), Rohit Miraje (purple jeep), mStuff (rubber duck), and sudeepsingh (blue car).
R EFERENCES [25] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Cham-
pagnat, “Deep Depth from Defocus: how can defocus blur improve 3d
[1] M. Ye, E. Johns, A. Handa, L. Zhang, P. Pratt, and G.-Z. Yang, “Self- estimation using dense neural networks?” in European Conference on
supervised siamese learning on stereo image pairs for depth estimation Computer Vision (ECCV), 2018, pp. 0–0.
in robotic surgery,” arXiv:1705.08260, 2017. [26] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth
[2] A. Sabnis and L. Vachhani, “Single image based depth estimation for from a conventional camera with a coded aperture,” ACM Transactions
robotic applications,” in IEEE Recent Advances in Intelligent Computa- on Graphics (TOG), vol. 26, no. 3, pp. 70–es, 2007.
tional Systems, 2011, pp. 102–106. [27] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin,
[3] N. Metni, T. Hamel, and F. Derkx, “Visual tracking control of aerial “Dappled photography: Mask enhanced cameras for heterodyned light
robotic systems with adaptive depth estimation,” in IEEE Conference on fields and coded aperture refocusing,” ACM Transactions on Graphics
Decision and Control, 2005, pp. 6078–6084. (TOG), vol. 26, no. 3, p. 69, 2007.
[4] J. Stowers, M. Hayes, and A. Bainbridge-Smith, “Altitude control of a
[28] C. Zhou, S. Lin, and S. Nayar, “Coded aperture pairs for depth from
quadrotor helicopter using depth map from Microsoft Kinect sensor,” in
defocus,” in International Conference on Computer Vision (ICCV), 2009,
IEEE International Conference on Mechatronics, 2011, pp. 358–362.
pp. 325–332.
[5] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q.
Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the [29] P. A. Shedligeri, S. Mohan, and K. Mitra, “Data driven coded aperture
gap in 3D object detection for autonomous driving,” in Conference on design for depth recovery,” in International Conference on Image Pro-
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8445–8453. cessing (ICIP), 2017, pp. 56–60.
[6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object [30] Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veer-
detection network for autonomous driving,” in Conference on Computer araghavan, “Phasecam3d—learning phase masks for passive single view
Vision and Pattern Recognition (CVPR), 2017, pp. 1907–1915. depth estimation,” in International Conference on Computational Pho-
[7] W. Lee, N. Park, and W. Woo, “Depth-assisted real-time 3D object tography (ICCP). IEEE, 2019, pp. 1–12.
detection for augmented reality,” in International Conference on Artificial [31] H. Haim, S. Elmalem, R. Giryes, A. M. Bronstein, and E. Marom, “Depth
Reality and Telexistence (ICAT), vol. 11, no. 2, 2011, pp. 126–132. estimation from a single image using deep learned phase coded mask,”
[8] J. Kopf, K. Matzen, S. Alsisan, O. Quigley, F. Ge, Y. Chong, J. Patterson, IEEE Transactions on Computational Imaging, vol. 4, no. 3, pp. 298–
J.-M. Frahm, S. Wu, M. Yu et al., “One shot 3D photography,” ACM 310, 2018.
Transactions on Graphics (TOG), vol. 39, no. 4, pp. 76–1, 2020. [32] J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation
[9] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single and 3d object detection,” in International Conference on Computer Vision
monocular images,” in Advances in Neural Information Processing (ICCV), 2019.
Systems, 2006, pp. 1161–1168. [33] J. A. Marshall, C. A. Burbeck, D. Ariely, J. P. Rolland, and K. E. Martin,
[10] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a “Occlusion edge blur: a cue to relative visual depth,” Journal of the
single image using a multi-scale deep network,” in Advances in Neural Optical Society of America A, vol. 13, no. 4, pp. 681–688, Apr 1996.
Information Processing Systems, 2014, pp. 2366–2374. [34] S. E. Palmer and T. Ghose, “Extremal edge: A powerful cue to depth per-
[11] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and ception and figure-ground organization,” Psychological Science, vol. 19,
surface normal estimation from monocular images using regression on no. 1, pp. 77–83, 2008.
deep features and hierarchical crfs,” in Conference on Computer Vision [35] M. Zannoli, G. D. Love, R. Narain, and M. S. Banks, “Blur and the
and Pattern Recognition (CVPR), 2015, pp. 1119–1127. perception of depth at occlusions,” Journal of Vision, vol. 16, no. 6, pp.
[12] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monoc- 17–17, 2016.
ular images using deep convolutional neural fields,” IEEE Transactions [36] L. He, M. Yu, and G. Wang, “Spindle-net: Cnns for monocular depth
on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024– inference with dilation kernel method,” in International Conference on
2039, 2015. Pattern Recognition (ICPR), 2018, pp. 2504–2509.
[13] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, [37] M. Heo, J. Lee, K.-R. Kim, H.-U. Kim, and C.-S. Kim, “Monocular depth
“Deeper depth prediction with fully convolutional residual networks,” estimation using whole strip masking and reliability-based refinement,”
in International Conference on 3D Vision (3DV), 2016, pp. 239–248. in European Conference on Computer Vision (ECCV), 2018, pp. 36–51.
[14] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale contin- [38] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia, “Learning monocular
uous crfs as sequential deep networks for monocular depth estimation,” in depth estimation infusing traditional stereo knowledge,” in Conference
Conference on Computer Vision and Pattern Recognition (CVPR), 2017, on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9799–
pp. 5354–5362. 9809.
[15] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single rgb images,” in International Conference on Computer Vision (ICCV), 2017, pp. 3372–3380.
[16] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2002–2011.
[17] Y. Cao, T. Zhao, K. Xian, C. Shen, Z. Cao, and S. Xu, “Monocular depth estimation with augmented ordinal depth relationships,” arXiv:1806.00585, 2018.
[18] I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv:1812.11941, 2018.
[19] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” arXiv:1907.01341, 2020.
[20] A. P. Pentland, “A new sense for depth of field,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 4, pp. 523–531, 1987.
[21] M. Watanabe and S. K. Nayar, “Rational filters for passive depth from defocus,” International Journal of Computer Vision, vol. 27, pp. 203–225, 1998.
[22] P. Favaro, “Recovering thin structures via nonlocal-means regularization with application to depth from defocus,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1133–1140.
[23] P. Trouvé, F. Champagnat, G. Le Besnerais, J. Sabater, T. Avignon, and J. Idier, “Passive depth estimation using chromatic aberration and a depth from defocus approach,” Applied Optics, vol. 52, no. 29, pp. 7152–7164, 2013.
[24] E. Alexander, Q. Guo, S. Koppal, S. Gortler, and T. Zickler, “Focal flow: Measuring distance and velocity with defocus and differential motion,” in European Conference on Computer Vision (ECCV), 2016, pp. 667–682.
[39] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 340–349.
[40] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 270–279.
[41] T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Depth estimation with occlusion modeling using light-field cameras,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 11, pp. 2170–2181, 2016.
[42] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 283–291.
[43] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia, “Lego: Learning edge with geometry all at once by watching videos,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 225–234.
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[45] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision (ECCV), 2012, pp. 746–760.
[46] J.-H. Lee and C.-S. Kim, “Monocular depth estimation using relative depth maps,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9729–9738.
[47] J.-H. Lee, M. Heo, K.-R. Kim, and C.-S. Kim, “Single-image depth estimation based on fourier domain analysis,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 330–339.
[48] P. P. Srinivasan, R. Garg, N. Wadhwa, R. Ng, and J. T. Barron, “Aperture supervision for monocular depth estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6393–6401.
[49] L. He, G. Wang, and Z. Hu, “Learning depth from single images with deep neural network embedding focal length,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4676–4689, 2018.
[50] M. Nishimura, D. B. Lindell, C. Metzler, and G. Wetzstein, “Disambiguating monocular depth estimation with a single transient,” in European Conference on Computer Vision (ECCV), 2020.
[51] S. K. Nayar and Y. Nakagawa, “Shape from focus,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 8, pp. 824–831, 1994.
[52] M. Subbarao and G. Surya, “Depth from defocus: a spatial domain approach,” International Journal of Computer Vision, vol. 13, no. 3, pp. 271–294, 1994.
[53] H. Tang, S. Cohen, B. Price, S. Schiller, and K. N. Kutulakos, “Depth from defocus in the wild,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2740–2748.
[54] R. Garg, N. Wadhwa, S. Ansari, and J. T. Barron, “Learning single camera depth estimation using dual-pixels,” in International Conference on Computer Vision (ICCV), 2019, pp. 7628–7637.
[55] C. Zhou, S. Lin, and S. K. Nayar, “Coded aperture pairs for depth from defocus and defocus deblurring,” International Journal of Computer Vision, vol. 93, no. 1, pp. 53–72, 2011.
[56] S. R. P. Pavani, M. A. Thompson, J. S. Biteen, S. J. Lord, N. Liu, R. J. Twieg, R. Piestun, and W. Moerner, “Three-dimensional, single-molecule fluorescence imaging beyond the diffraction limit by using a double-helix point spread function,” Proceedings of the National Academy of Sciences, vol. 106, no. 9, pp. 2995–2999, 2009.
[57] A. Levin, S. W. Hasinoff, P. Green, F. Durand, and W. T. Freeman, “4D frequency analysis of computational cameras for depth of field extension,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, pp. 1–14, 2009.
[58] A. Chakrabarti, “Learning sensor multiplexing design through back-propagation,” in Advances in Neural Information Processing Systems, 2016, pp. 3081–3089.
[59] L. Wang, T. Zhang, Y. Fu, and H. Huang, “HyperReconNet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2257–2270, May 2019.
[60] E. Nehme, D. Freedman, R. Gordon, B. Ferdman, L. E. Weiss, O. Alalouf, R. Orange, T. Michaeli, and Y. Shechtman, “DeepSTORM3D: Dense three dimensional localization microscopy and point spread function design by deep learning,” Nature Methods, vol. 17, pp. 734–740, 2020.
[61] Q. Sun, J. Zhang, X. Dun, B. Ghanem, Y. Peng, and W. Heidrich, “End-to-end learned, optically coded super-resolution spad camera,” ACM Transactions on Graphics (TOG), vol. 39, 2020.
[62] V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–13, 2018.
[63] X. Dun, H. Ikoma, G. Wetzstein, Z. Wang, X. Cheng, and Y. Peng, “Learned rotationally symmetric diffractive achromat for full-spectrum computational imaging,” Optica, vol. 7, no. 8, pp. 913–922, 2020.
[64] C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1375–1385.
[65] Q. Sun, E. Tseng, Q. Fu, W. Heidrich, and F. Heide, “Learning rank-1 diffractive optics for single-shot high dynamic range imaging,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[66] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Scientific Reports, vol. 8, no. 1, p. 12324, 2018.
[67] J. Martel, L. Müller, S. Carey, P. Dudek, and G. Wetzstein, “Neural Sensors: Learning Pixel Exposures for HDR Imaging and Video Compressive Sensing with Programmable Sensors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 1642–1653, 2020.
[68] Y. Li, M. Qi, R. Gulve, M. Wei, R. Genov, K. N. Kutulakos, and W. Heidrich, “End-to-end video compressive sensing using anderson-accelerated unrolled networks,” in International Conference on Computational Photography (ICCP), 2020, pp. 1–12.
[69] G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljacic, C. Denz, D. A. B. Miller, and D. Psaltis, “Inference in artificial intelligence with deep optics and photonics,” Nature, vol. 588, 2020.
[70] S.-H. Baek, H. Ikoma, D. S. Jeon, Y. Li, W. Heidrich, G. Wetzstein, and M. H. Kim, “End-to-end hyperspectral-depth imaging with learned diffractive optics,” arXiv:2009.00436, 2020.
[71] J. W. Goodman, Introduction to Fourier optics. Roberts and Company Publishers, 2005.
[72] S. W. Hasinoff and K. N. Kutulakos, “A layer-based restoration framework for variable-aperture photography,” in International Conference on Computer Vision (ICCV). IEEE, 2007, pp. 1–8.
[73] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo Magnification: learning view synthesis using multiplane images,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, 2018.
[74] X. Zhang, K. Matzen, V. Nguyen, D. Yao, Y. Zhang, and R. Ng, “Synthetic defocus and look-ahead autofocus for casual videography,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, Jul. 2019.
[75] S. J. Reeves, “Fast image restoration without boundary artifacts,” IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1448–1453, 2005.
[76] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention (MICCAI), 2015, pp. 234–241.
[77] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[78] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker, “DeepView: View synthesis with learned gradient descent,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2367–2376.
[79] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
[80] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2462–2470.
[81] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 689–694, Aug. 2004.

Hayato Ikoma is a Ph.D. student in Electrical Engineering at Stanford University (USA). He received a B.E. at University of Tokyo (Japan), and M.S. degrees at Kyoto University (Japan), Massachusetts Institute of Technology (USA), and École Normale Supérieure de Cachan (France). His research focuses on the development of computational imaging techniques for cameras and fluorescence optical microscopy.

Cindy M. Nguyen received her B.S. in Bioengineering at Stanford University, Stanford, CA, USA in 2019. She is currently a Ph.D. student in Electrical Engineering at Stanford University, Stanford, CA, USA. Her interests lie in applying optimization methods to problems in computer vision and computational imaging. She is a recipient of the NSF Graduate Research Fellowship.
Christopher A. Metzler (Member, IEEE) is an Assistant Professor of Computer Science (and Electrical and Computer Engineering by courtesy) at the University of Maryland, College Park. He received his B.S., M.S., and Ph.D. degrees in Electrical and Computer Engineering from Rice University, Houston, TX, USA in 2013, 2014, and 2019, respectively, and recently completed a two-year postdoc in the Stanford Computational Imaging Lab. He was an Intelligence Community Postdoctoral Research Fellow, an NSF Graduate Research Fellow, a DoD NDSEG Fellow, and a NASA Texas Space Grant Consortium Fellow. His research uses machine learning and statistical signal processing to develop data-driven solutions to challenging imaging problems.

Yifan (Evan) Peng (Member, IEEE) received a Ph.D. in Computer Science from the University of British Columbia, Canada, in 2018, and an M.Sc. and B.S., both in Optical Science & Engineering, from Zhejiang University, China, in 2013 and 2010, respectively. He is currently a Postdoctoral Fellow in the Stanford Electrical Engineering Department. His research focuses on incorporating optical and computational techniques for enabling new imaging modalities. He is working on computational cameras & displays with wave optics.

Gordon Wetzstein (Senior Member, IEEE) received the graduation (with Hons.) degree from the Bauhaus-Universität Weimar, Weimar, Germany and the Ph.D. degree in computer science from the University of British Columbia, BC, Canada, in 2011. He is currently an Assistant Professor of Electrical Engineering and, by courtesy, of Computer Science, with Stanford University, Stanford, CA, USA. He is the Leader of Stanford Computational Imaging Lab and a Faculty Co-Director of the Stanford Center for Image Systems Engineering. At the intersection of computer graphics and vision, computational optics, and applied vision science, his research has a wide range of applications in next-generation imaging, display, wearable computing, and microscopy systems. He is the recipient of an NSF CAREER Award, an Alfred P. Sloan Fellowship, an ACM SIGGRAPH Significant New Researcher Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), an SPIE Early Career Achievement Award, a Terman Fellowship, an Okawa Research Grant, the Electronic Imaging Scientist of the Year 2017 Award, an Alain Fournier Ph.D. Dissertation Award, Laval Virtual Award, and the Best Paper and Demo Awards at ICCP 2011, 2014, and 2016 and at ICIP 2016.