Depth Estimation From A Single Image Using Deep Learned Phase Coded Mask
Abstract—Depth estimation from a single image is a well-known challenge in computer vision. With the advent of deep learning, several approaches for monocular depth estimation have been proposed, all of which have inherent limitations due to the scarce depth cues that exist in a single image. Moreover, these methods are very demanding computationally, which makes them inadequate for systems with limited processing power. In this paper, a phase-coded aperture camera for depth estimation is proposed. The camera is equipped with an optical phase mask that provides unambiguous depth-related color characteristics for the captured image. These are used for estimating the scene depth map using a fully-convolutional neural network. The phase-coded aperture structure is learned jointly with the network weights using back-propagation. The strong depth cues (encoded in the image by the phase mask, designed together with the network weights) allow a much simpler neural network architecture for faster and more accurate depth estimation. Performance achieved on simulated images as well as on a real optical setup is superior to state-of-the-art monocular depth estimation methods (both with respect to the depth accuracy and the required processing power), and is competitive with more complex and expensive depth estimation methods such as light-field cameras.

Index Terms—Coded Aperture, Phase Mask, Depth Reconstruction, Deep Learning, Computational Camera.

H. Haim was with the Faculty of Electrical Engineering, Tel-Aviv University, Tel-Aviv, 6997801 Israel, and is currently with the Department of Computer Science, University of Toronto, Toronto, Canada. S. Elmalem, R. Giryes, A. M. Bronstein and E. Marom are with the Faculty of Electrical Engineering, Tel-Aviv University, Tel-Aviv, 6997801 Israel.

I. INTRODUCTION

Passive depth estimation is a well-known challenge in computer vision. A common solution is based on stereo vision, where two calibrated cameras capture the same scene from different views (similarly to the human eyes), and thus the distance to every object can be inferred by triangulation. Yet, such a dual camera system significantly increases the form factor, cost and power consumption.

The current electronics miniaturization trend (high-quality smart-phone cameras, wearable devices, etc.) requires a compact and low-cost solution. This requirement dictates a more challenging task: passive depth estimation from a single image. While a single image lacks the depth cues that exist in a stereo image pair, there are still some depth cues, such as perspective lines and vanishing points, that enable depth estimation to some degree of accuracy. The ongoing deep learning revolution did not overlook this challenge, and several neural network-based approaches to monocular depth estimation exist in the literature [1]–[8].

Eigen et al. [1] introduced a deep neural network for depth estimation that relies on depth cues in the RGB image. They used a multi-scale architecture with coarse and fine depth estimation networks concatenated to achieve both dynamic range and resolution. Two later publications by Cao et al. [2] and Liu et al. [3] employed the novel fully-convolutional network (FCN) architecture (originally presented by Long et al. [9] for scene semantic segmentation) for monocular depth estimation. In [2] the authors used a residual network [10], and refined the results using a conditional random field (CRF) prior, external to the network architecture. A similar approach of using a CRF to refine the initial result of a DL model was also taken by Li et al. [4]. In [3] a simpler FCN model was proposed, but with the CRF operation integrated inside the network structure. This approach was further researched using deeper networks and more sophisticated architectures [5], [6]. The challenge was also addressed in the unsupervised learning setting, as presented by Garg et al. [7] and Godard et al. [8].

Common to all these approaches is the use of the depth cues in the RGB image 'as-is', as well as training and testing on well-known public datasets such as NYU Depth [11], [12] and Make3D [13]. Since the availability of reliable depth cues in a regular RGB image is limited, these approaches require large architectures with significant regularization (multi-scale, ResNets, CRF) as well as a separation of the models into indoor/outdoor scenes. A modification of the image acquisition process itself seems necessary in order to allow using a simpler model, generic enough to encompass both indoor and outdoor scenes.

Imaging methods that use an aperture coding mask (phase or amplitude) have become more common in the last two decades. One of the first and prominent studies in this field was carried out by Dowski and Cathey [14], where a cubic phase mask was designed to generate a constant point spread function (PSF) throughout the desired depth of field (DOF). Similar ideas were presented later in [15] using a random diffuser with focal sweep [16], or by using an uncorrected lens as a type of spectral focal sweep [17]. When a depth-independent PSF is achieved, an all-in-focus image can be recovered using non-blind deconvolution methods. However, in all these methods the captured and restored images have a similar response in the entire DOF, and thus depth information can only be recovered to some extent using monocular cues.

In order to generate optical cues, the PSF should be depth-dependent. Related methods use an amplitude coded mask [18], [19] or a color-dependent ring mask [20], [21] such that objects at different depths exhibit a distinctive spatial/spectral
Fig. 2. Neural network architecture for the depth classification CNN (the 'inner' net in the FCN model in Fig. 4). Spatial dimension reduction is achieved by convolution stride instead of pooling layers. Every CONV block is followed by a BN-ReLU layer (not shown in this figure).
optimal phase mask parameters within a deep learning-based depth estimation framework, the imaging stage is modeled as the initial layer of a CNN model. The inputs to this coded aperture convolution layer are the all-in-focus images and their corresponding depth maps. The parameters (or weights) of the layer are the radii r_i and phase shifts φ_i of the mask's rings. The layer's forward model is composed of the coded aperture PSF calculation (for each depth in the relevant depth range) followed by an imaging simulation using the all-in-focus input image and its corresponding depth map. The backward model uses the inputs from the next layer (backpropagated to the coded aperture convolutional layer) and the derivatives of the coded aperture PSF with respect to its weights, ∂PSF/∂r_i and ∂PSF/∂φ_i, in order to calculate the gradient descent step on the phase mask parameters. A detailed description of the coded aperture convolution layer and its forward and backward models is presented in the Appendix. One of the important hyper-parameters of such a layer is the depth range under consideration (in ψ terms). The ψ range setting, together with the lens parameters (focal length, F# and focus point), dictates the trade-off between the depth dynamic range and resolution. In this study, we set the range to ψ = [−4, 10]; its conversion to a metric depth range is presented in Section IV. As mentioned above, the optimization of the phase mask parameters is done by integrating the coded aperture convolutional layer into the CNN model detailed in the sequel, followed by the end-to-end optimization of the entire model.

To validate the coded aperture layer, we compared the case where the CNN (described in the following section) is trained end-to-end with the phase coded aperture layer to the case where the phase mask is held fixed at its initial value. Several fixed patterns were examined; the training of the phase mask improved the classification error by 5% to 10%.

Fig. 3. Aperture phase coding mask. (a) 3D illustration of the optimal three-ring mask. (b) Cross-section of the mask. The area marked in black acts as a circular pupil.

For the setup we used, the optimization process yielded a three-ring mask such that the outer ring is deeper than the middle one, as illustrated in Fig. 3. Such a design poses significant fabrication challenges for the chemical etching process used at our facilities. Since an optimized three-ring mask surpasses the two-ring mask only by a small margin, in order to make the fabrication process simpler and more reliable, a two-ring limit was set in the training process; this resulted in the normalized ring radii r = {0.55, 0.8, 0.8, 1} and phases φ = {6.2, 12.3} [rad] (both ψ and φ are defined for the blue wavelength, where the RGB wavelengths taken are the peak wavelengths of the camera color filter response: λ_{R,G,B} = [610, 535, 455] nm). Figure 1(b) shows the diversity between the color channels for different depths (expressed in ψ values) when using a clear aperture (dotted plot) versus our optimized phase mask (solid plot).
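Conceptually, the coded aperture layer can be prototyped as a regular trainable module whose weights are the ring radii and phase. The sketch below is an illustrative approximation, not the implementation used in this work: it handles a single ring and a single wavelength, the ring edges are smoothed with sigmoids so that the radii stay differentiable, autograd stands in for the analytic gradients derived in the Appendix, and the grid size and normalization are simplified assumptions.

```python
# Illustrative sketch of a trainable phase coded aperture layer (single ring,
# monochromatic). The learnable weights are the normalized ring radii r and the
# ring phase phi; the forward pass builds P(psi) = P(rho, theta)*CA(r, phi)*exp(j*psi*rho^2),
# computes PSF = |F{P}|^2 and blurs an all-in-focus image with it.
import torch
import torch.nn.functional as F

class PhaseCodedAperture(torch.nn.Module):
    def __init__(self, n=65):
        super().__init__()
        self.r = torch.nn.Parameter(torch.tensor([0.55, 0.80]))   # inner/outer ring radius
        self.phi = torch.nn.Parameter(torch.tensor(6.2))          # ring phase [rad]
        x = torch.linspace(-1.0, 1.0, n)
        yy, xx = torch.meshgrid(x, x, indexing="ij")
        self.register_buffer("rho", torch.sqrt(xx ** 2 + yy ** 2))

    def psf(self, psi):
        aperture = (self.rho <= 1.0).float()                       # clear circular pupil
        # smooth ring indicator keeps the radii differentiable (sigmoid edge is an assumption)
        ring = torch.sigmoid(80 * (self.rho - self.r[0])) * torch.sigmoid(80 * (self.r[1] - self.rho))
        phase = self.phi * ring + psi * self.rho ** 2              # ring phase + quadratic defocus
        pupil = aperture * torch.exp(1j * phase)                   # coded pupil P(psi)
        psf = torch.fft.fftshift(torch.fft.fft2(pupil)).abs() ** 2 # PSF = |F{P}|^2
        return psf / psf.sum()

    def forward(self, img, psi):
        """Blur a (B, 1, H, W) all-in-focus image as if it lies at defocus psi."""
        kernel = self.psf(psi)[None, None]
        return F.conv2d(img, kernel, padding="same")
```

During training, each all-in-focus input is blurred with the PSF of its ground-truth ψ, so the loss gradients reach the mask parameters through the PSF computation.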
III. FCN FOR DEPTH ESTIMATION

We now turn to describe the architecture of our fully convolutional network (FCN) for depth estimation, which relies on the optical depth cues encoded in the image by the phase coded aperture incorporated in the lens, as described in Section II. These cues are used by the FCN model to estimate the scene depth map. Our network configuration is inspired by the FCN structure introduced by Long et al. [9]. In that work, an ImageNet classification CNN was converted to a semantic segmentation FCN by adding a deconvolution block to the ImageNet model and fine-tuning it for semantic segmentation (with several architecture variants for increased spatial resolution). For depth estimation using our phase coded aperture camera, a totally different 'inner net' should replace the 'ImageNet model'. The inner net should classify the different imaging conditions (i.e. ψ values), and the deconvolution block will turn the initial pixel labeling into a full depth estimation map. We tested two different 'inner' network architectures: the first based on the DenseNet architecture [25], and the second based on a traditional feed-forward architecture. An FCN based on both inner nets is presented, and the trade-off is discussed. The following subsections present the ψ classification inner nets, and the FCN model based on them for depth estimation.
Fig. 4. Network architecture for the depth estimation FCN. The depth (ψ) classification network (see Fig. 2) is wrapped in a deconvolution framework to provide a depth estimation map equal to the input image size.
A. ψ classification CNN

As presented in Section II, the phase coded aperture is designed along with the CNN such that it encodes depth-dependent cues in the image by manipulating the response of the RGB channels for each depth. Using these strong optical cues, the depth slices (i.e. ψ values) can be classified using some CNN classification model.

For this task, we tested two different architectures: the first based on the DenseNet architecture for CIFAR-10, and the second based on the traditional feed-forward architecture of repeated blocks of convolutions, batch normalization [26] and rectified linear units [27] (CONV-BN-ReLU, see Fig. 2). In view of the approach presented in [28], pooling layers are omitted in the second architecture, and a stride of size 2 is used in the CONV layers for lateral dimension reduction. This approach allows much faster model evaluation (only 25% of the calculations in each CONV layer), with a minor loss in performance.

To reduce the model size and speed up its evaluation even more, the input (in both architectures) to the first CONV layer of the net is the captured raw image (in a mosaicked Bayer pattern). By setting the stride of the first CONV layer to 2, the filters' response remains shift-invariant (since the Bayer pattern period is 2). This way the input size is decreased by a factor of 3, with a minor loss in performance. This also omits the need for a demosaicking stage, allowing faster end-to-end performance (in cases where the RGB image is not needed as an output, and one is interested only in the depth map). One can see the direct processing of mosaicked images as a case where the CNN representation power 'contains' the demosaicking operation, and therefore it is not really needed as a preprocessing step.

Both inner classification net architectures are trained on the Describable Textures Dataset (DTD) [29]. About 40K texture patches (32x32 pixels each) were selected from the dataset. Each patch was 'replicated' in the dataset 15 times, where each replication corresponds to a different blur kernel (corresponding to the phase coded aperture for ψ = −4, −3, ..., 10). The first layer of both architectures represents the phase-coded aperture layer, whose inputs are the clean patch and its corresponding ψ value.

After the imaging stage is done (as explained in Section II), Additive White Gaussian Noise (AWGN) is added to each patch to make the network robust to the typical noise level that exists in images taken with a real-world camera. Though increasing the noise level improves the robustness, it is important to consider the trade-off that exists between noise robustness and depth estimation accuracy, which limits the amount of noise that should be added in training, making the noise level a hyper-parameter one should tune. In our tests, when we set a specific noise level, the accuracy of the depth results deteriorates for inputs with a higher noise level (as one would expect). At the same time, when we train the CNN with relatively high noise levels, the system becomes more robust to noise at the expense of reduced accuracy for images with lower noise. Therefore, σ = 3 is chosen as a good compromise, since it resembles the noise level of images of a well-lit scene taken with the selected camera. Of course, one may consider a different noise level, according to the target camera and its expected noise level.

Data augmentation via four rotations is used to increase the dataset size as well as to achieve rotation invariance. The dataset size is about 2.4M patches, where 80% of it is used for training and 20% is used for validation. Both architectures were trained to classify into 15 integer values of ψ (between −4 and 10) using the softmax loss. These nets are used as an initialization for the depth estimation FCN, as presented in Section III-C.
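A minimal sketch of the feed-forward variant described above is given below (hypothetical PyTorch code; the number of blocks and the channel widths are assumptions, not the trained configuration). The first convolution has stride 2 so that it runs directly on the mosaicked Bayer input, and all further down-sampling is done by strided convolutions rather than pooling:

```python
# Hypothetical sketch of the feed-forward psi-classification net: CONV-BN-ReLU
# blocks, strided convolutions instead of pooling, Bayer raw input, 15 psi classes.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PsiClassifier(nn.Module):
    def __init__(self, num_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(1, 32, stride=2),    # stride 2 on the mosaicked Bayer input
            conv_bn_relu(32, 64, stride=2),   # strided CONV replaces pooling
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 128),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, num_classes),      # logits for psi = -4, ..., 10
        )

    def forward(self, bayer_patch):           # bayer_patch: (B, 1, 32, 32)
        return self.head(self.features(bayer_patch))

# training uses the softmax (cross-entropy) loss over the 15 discrete psi values:
# criterion = nn.CrossEntropyLoss()
```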
B. RGB-D Dataset

The deep learning based methods for depth estimation from a single image mentioned in Section I [1]–[8] rely strongly on the input image details. Thus, most studies in this field assume an input image with a large DOF such that most of the acquired scene is in focus. This assumption is justified when the photos are taken by small aperture cameras, as is the case in datasets such as NYU Depth [11], [12] and
Make3D [13] that are commonly used for the training and testing of those depth estimation techniques. However, such optical configurations limit the resolution and increase the noise level, thus reducing the image quality. Moreover, the depth maps in these datasets are prone to errors due to depth sensor inaccuracies and calibration issues (alignment and scaling) with the RGB sensor.

Our method is based on a phase-coded aperture imaging process, which encodes the image. To train or evaluate our method on images not taken with our camera, the phase coded aperture imaging process has to be simulated on those images. To simulate the imaging process properly, the input data should contain high-resolution, all-in-focus images with low noise, accompanied by accurate pixelwise depth maps. Using depth datasets such as NYU Depth [11], [12] and Make3D [13] for our coded aperture imaging simulation is impossible due to the limited image and depth resolution they provide. Proper input for such an imaging simulation may be generated primarily using 3D graphic simulation software.

We use the popular MPI-Sintel depth images dataset [30], created by the Blender 3D graphics software. The Sintel dataset contains 23 scenes with a total of 1k images. Yet, because it has been designed specifically for optical flow evaluation, the depth variation in each scene is limited. Thus, we could only use about 100 unique images, which are not enough for training. The need for additional data has led us to create a new Sintel-like dataset (using Blender) called 'TAU-Agent', which is based on the recent open movie 'Agent 327'. This new animated dataset, which relies on the new render engine 'Cycles', contains 300 realistic images (indoor and outdoor), with a resolution of 1024 × 512, and corresponding pixelwise depth maps. With rotation augmentation, our full dataset contained 840 scenes, where 70% are used for training and the rest for validation.
C. Depth estimation FCN

In similarity to the FCN model presented by Long et al. [9], the inner ψ classification net is wrapped in a deconvolution framework, turning it into an FCN model (see Fig. 4). The desired output of our depth estimation FCN is a continuous depth estimation map. However, since training continuous models is prone to over-fitting and regression-to-the-mean issues, we pursue this goal in two stages. In the first stage, the FCN is trained for discrete depth estimation. In the second stage, the discrete FCN model is used as an initialization for the continuous model training.
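A minimal sketch of such a wrapper is shown below (hypothetical PyTorch code; the kernel sizes and the number of up-sampling stages are assumptions chosen to match a backbone that down-samples by a factor of 8, e.g. the classifier sketched in Section III-A):

```python
# Hypothetical sketch of wrapping a psi-classification backbone in a deconvolution
# framework (in the spirit of Long et al. [9]) so that it outputs per-pixel psi logits
# with the same spatial size as the input.
import torch
import torch.nn as nn

class DepthEstimationFCN(nn.Module):
    def __init__(self, backbone_features, feat_channels=128, num_classes=15):
        super().__init__()
        self.features = backbone_features            # convolutional part of the inner net
        self.score = nn.Conv2d(feat_channels, num_classes, kernel_size=1)
        # three stride-2 transposed convolutions undo the x8 down-sampling of the backbone
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                             # x: (B, 1, H, W) mosaicked image
        logits = self.score(self.features(x))         # coarse per-location psi logits
        return self.upsample(logits)                  # (B, 15, H, W) pixelwise psi logits

# e.g. DepthEstimationFCN(PsiClassifier().features), reusing the classifier sketched earlier
```

In the discrete stage, the pixelwise logits are trained with the softmax (cross-entropy) loss against the discretized ψ labels, as described next.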
To train the discrete depth FCN, the RGB images of the Sintel and Agent datasets are blurred using the coded aperture imaging model, where each object is blurred using the corresponding blur kernel associated with its depth (indicated in the ground-truth pixelwise depth map). The imaging is done in a quasi-continuous way, with a ψ step of 0.1 in the range ψ = [−4, 10]. This imaging simulation can be carried out in the same way as in the 'inner' net training, i.e. using the phase coded aperture layer as the first layer of the FCN model. However, such a step is very computationally demanding, and does not provide significant improvement (since the phase-coded aperture parameter tuning reached its optimum in the inner net training). Therefore, in the FCN training stage, the optical imaging simulation is done as a pre-processing step with the best phase mask achieved in the inner net training stage. In the discrete training step of the FCN, the ground-truth depth maps are discretized to ψ = −4, −3, ..., 10 values. The Sintel/Agent images (after imaging simulation with the coded aperture blur kernels, RGB-to-Bayer transformation and AWGN addition), along with the discretized depth maps, are used as the input data for the discrete depth estimation FCN model training. The FCN is trained to reconstruct the discrete depth of the input image using the softmax loss.
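The pre-processing imaging simulation described above could look roughly as follows (an illustrative NumPy sketch; it uses integer ψ layers rather than the quasi-continuous 0.1 ψ step of the actual pipeline, assumes an RGGB Bayer layout, and takes the per-ψ RGB blur kernels as a precomputed psf_bank):

```python
# Illustrative sketch of the training-data simulation: blur each depth layer with its
# psi-dependent coded-aperture kernel, composite by the depth map, mosaic to Bayer,
# add AWGN, and return the image together with the discretized psi labels.
import numpy as np
from scipy.ndimage import convolve

def simulate_coded_capture(rgb, depth_psi, psf_bank, psi_values=range(-4, 11), sigma=3.0):
    """rgb: (H, W, 3) all-in-focus image; depth_psi: (H, W) ground-truth psi map."""
    blurred = np.zeros_like(rgb, dtype=np.float64)
    for psi in psi_values:
        mask = (np.round(depth_psi) == psi)
        if not mask.any():
            continue
        layer = np.stack([convolve(rgb[..., c], psf_bank[psi][..., c]) for c in range(3)], axis=-1)
        blurred[mask] = layer[mask]

    # RGB -> Bayer (RGGB layout assumed)
    bayer = np.zeros(rgb.shape[:2])
    bayer[0::2, 0::2] = blurred[0::2, 0::2, 0]   # R
    bayer[0::2, 1::2] = blurred[0::2, 1::2, 1]   # G
    bayer[1::2, 0::2] = blurred[1::2, 0::2, 1]   # G
    bayer[1::2, 1::2] = blurred[1::2, 1::2, 2]   # B

    noisy = bayer + np.random.normal(0.0, sigma, bayer.shape)   # AWGN
    return noisy, np.round(depth_psi).astype(int)               # image + discretized labels
```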
Fig. 5. (a) Confusion matrix for the depth segmentation FCN validation set. (b) MAPE as a function of the focus point using our continuous net.

After training, both versions of the FCN model (based on the DenseNet architecture and on the traditional feed-forward architecture) achieved roughly the same performance, but with a significant increase in inference time (x3), training time (x5) and memory requirements (x10) for the DenseNet model. When examining the performance, one can see that most of the errors are in smooth/low-texture areas of the images, where our method (which relies on texture) is expected to be weaker. Yet, in areas with 'sufficient' texture, there are enough encoded depth cues, enabling good depth estimation even with a relatively simple DNN architecture. This similarity in performance between the DenseNet based model (one of the best CNN architectures known to date) and a simple feed-forward architecture is a clear example of the inherent power of optical image processing using a coded aperture; a task-driven design of the image acquisition stage can potentially save significant resources in the digital processing stage. As such, we decided to keep the simple feed-forward architecture as the chosen solution.

To evaluate the discrete depth estimation accuracy, we calculated a confusion matrix for our validation set (∼250 images, see Fig. 5(a)). After 1500 epochs, the net achieves a top-1 accuracy of 68%. However, the vast majority of the errors are to adjacent ψ values, and on 93% of the pixels the discrete depth estimation FCN recovers the correct depth with an error of up to ±1ψ. As already mentioned above, most of the errors originate from smooth areas, where no texture exists and therefore no depth-dependent color cues were encoded. This performance is sufficient as an initialization point for the continuous depth estimation network.

The discrete depth estimation (segmentation) FCN model is
Fig. 10. 3D face reconstruction. (a) Input image. (b) Point cloud map.

granulated wall. In this example, the global depth cues are also weak, and therefore the monocular depth estimation fails to separate the close vicinity of the wall (right part of the image). Both the Lytro and our phase coded aperture camera achieve a good depth estimation of the scene. Note though that our camera has the advantage that it provides an absolute scale and uses much simpler optics.

In the second row of Fig. 11, we chose a grassy slope with flowers. In this case, the global depth cues are stronger. Thus, the monocular method [3] does better compared to the previous examples, but still achieves only a partial depth estimate. Lytro and our camera achieve good results.

Additional outdoor examples are presented in Fig. 12. Note that the scenes in the first five rows of Fig. 12 were taken with a different focus point (compared to the indoor and the rest of the outdoor scenes), and therefore the depth dynamic range and resolution are different (as can be seen in the depth scale in the right column). However, since our FCN model is trained for ψ estimation, all depth maps were obtained using the same network, and the absolute depth is calculated using the known focus point and the estimated ψ map.

Quantitative evaluation of the real-world setup with a camera equipped with our phase-coded aperture was performed 'in the wild', since exact depth ground truth is difficult to acquire in the general case. For a quantitative evaluation on real data, we performed a 'spot check': we measured the depth recovery error of our network for the known object distances in the lab setting of Fig. 9, and obtained an average depth estimation error of 6.25%. This accuracy is comparable to the Lytro accuracy (Zeller et al. [31]) and much better than the monocular method (25%), while both require a cumbersome calibration phase.

Besides the depth map recovery performance and the simpler optics, another important benefit of our proposed solution is the required processing power/run time. The fact that depth cues are encoded by the phase mask enables a much simpler FCN architecture, and therefore a much faster inference time. This is due to the fact that some of the processing is done by the optics (at the speed of light, with no processing resources needed). For example, for a full-HD image as input, our proposed network evaluates a full-HD depth map in 0.22s (using an Nvidia Titan X Pascal GPU). For the same sized input on the same GPU, the net presented in [3] evaluates a 3-times smaller depth map in 10s (timing was measured using the same machine and the implementation of the network available
at the authors' website). Of course, if a one-to-one input-image-to-depth-map output is not needed, the output size can be reduced and our FCN will run even faster.

Another advantage of our method is that the depth estimation relies mostly on local cues in the image. This allows performing the computations in a distributed manner. The image can simply be split and the depth map can be evaluated in parallel on different resources. The partial outputs can be recombined later with barely visible block artifacts.
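For illustration, such block-wise evaluation could be organized as follows (a minimal sketch; estimate_depth stands for any per-tile depth estimator and the tile size is an arbitrary choice):

```python
# Minimal sketch of tile-parallel depth estimation: split the image into blocks,
# estimate the depth of each block independently, and stitch the results back.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def depth_map_in_tiles(image, estimate_depth, tile=512):
    h, w = image.shape[:2]
    boxes = [(y, x) for y in range(0, h, tile) for x in range(0, w, tile)]
    out = np.zeros((h, w), dtype=np.float32)

    def run(box):
        y, x = box
        return box, estimate_depth(image[y:y + tile, x:x + tile])

    with ThreadPoolExecutor() as pool:        # tiles could also go to different machines/GPUs
        for (y, x), tile_depth in pool.map(run, boxes):
            out[y:y + tile_depth.shape[0], x:x + tile_depth.shape[1]] = tile_depth
    return out
```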
V. SUMMARY AND CONCLUSIONS

In this paper we have presented a method for real-time depth estimation from a single image using a phase coded aperture camera. The phase mask is designed together with the FCN model using back-propagation, which allows capturing images with high light efficiency and color-coded depth cues, such that each color channel responds differently to OOF scenarios. Taking advantage of this coded information, a simple convolutional neural network architecture is proposed to recover the depth map of the captured scene.

This proposed scheme outperforms state-of-the-art monocular depth estimation methods by having better accuracy, more than an order of magnitude speed acceleration, lower memory requirements and hardware parallelization compliance. In addition, our simple and low-cost solution shows comparable performance to expensive commercial solutions with complex optics such as the Lytro camera. Moreover, as opposed to the relative depth maps produced by the monocular methods and the Lytro camera, our system provides an absolute (metric) depth estimation, which can be useful for many computer vision applications, such as 3D modeling and augmented reality.

diffraction effects. In this model, the PSF calculation contains all the optical properties of the system. Following [24], the PSF of an incoherent imaging system is defined as:

PSF = |h_c|^2 = |F{P(ρ, θ)}|^2,    (2)

where h_c is the coherent system impulse response, and P(ρ, θ) is the system's exit pupil function (the amplitude and phase profile in the imaging system exit pupil). The pupil function reference is a perfect spherical wave converging at the image plane. Thus, for an in-focus and aberration-free (or diffraction-limited) system, the pupil function is just the identity for the amplitude in the active area of the aperture, and zero for the phase.

Out-of-Focus (OOF): An imaging system acquiring an object in OOF conditions suffers from blur that degrades the image quality. This results in low contrast, loss of sharpness and even loss of information. The OOF error is expressed analytically as a quadratic phase wave-front error in the pupil function. To quantify the defocus condition, we introduce the parameter ψ. For the case of a circular exit pupil with radius R, we define ψ as:

ψ = (πR^2/λ)(1/z_o + 1/z_img − 1/f) = (πR^2/λ)(1/z_img − 1/z_i) = (πR^2/λ)(1/z_o − 1/z_n),    (3)

where z_img is the image distance (or sensor plane location) of an object in the nominal position z_n, z_i is the ideal image plane for an object located at z_o, and λ is the illumination wavelength. The defocus parameter ψ measures the maximum quadratic phase error at the aperture edge. For a circular pupil:
CA(r, φ) = { exp{jφ},  r_1 < ρ < r_2;  1,  otherwise }.    (6)

This example can be easily extended to a multiple rings pattern.

Depth dependent coded PSF: Combining all factors, the complete term for the depth dependent coded pupil function becomes:

P(ψ) = P(ρ, θ) CA(r, φ) exp{jψρ^2}.    (7)

Using the definition in (2), the depth dependent coded PSF(ψ) can be easily calculated.

Imaging Output: Using the coded aperture PSF, the imaging output can be calculated simply by:

I_out = I_in ∗ PSF(ψ).    (8)

This model limits us to a Linear Shift-Invariant (LSI) model. However, this is not the case in real imaging systems, and the PSF varies across the Field of View (FOV). This is solved by segmenting the FOV to blocks with similar PSF, and then applying the LSI model in each block.
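A compact numerical sketch of this forward model, following Eqs. (2)-(3) and (6)-(8), is given below (the grid size, the single-ring mask values and the lens parameters in the usage example are illustrative assumptions, not the calibrated system of the paper):

```python
# Numerical sketch of the forward model: the defocus parameter psi for an object
# distance (Eq. (3)), the ring-coded pupil (Eqs. (6)-(7)), the resulting PSF (Eq. (2))
# and the imaging operation (Eq. (8)).
import numpy as np
from scipy.signal import fftconvolve

def psi_from_distance(z_o, z_n, R, lam):
    """Eq. (3): psi = (pi R^2 / lambda) * (1/z_o - 1/z_n)."""
    return np.pi * R ** 2 / lam * (1.0 / z_o - 1.0 / z_n)

def coded_pupil(psi, r1=0.55, r2=0.8, phi=6.2, n=257):
    """Eqs. (6)-(7): clear circular pupil * ring phase mask * quadratic defocus phase."""
    x = np.linspace(-1, 1, n)
    rho = np.sqrt(x[None, :] ** 2 + x[:, None] ** 2)
    aperture = (rho <= 1.0).astype(complex)
    ring = np.where((rho > r1) & (rho < r2), np.exp(1j * phi), 1.0)   # CA(r, phi)
    return aperture * ring * np.exp(1j * psi * rho ** 2)              # P(psi)

def coded_psf(psi, **kw):
    """Eq. (2): PSF = |F{P}|^2, normalized to unit energy."""
    psf = np.abs(np.fft.fftshift(np.fft.fft2(coded_pupil(psi, **kw)))) ** 2
    return psf / psf.sum()

def image_at_depth(img, psi, **kw):
    """Eq. (8): I_out = I_in * PSF(psi) for a single (constant-depth) object plane."""
    return fftconvolve(img, coded_psf(psi, **kw), mode="same")

# example (placeholder R and lambda): psi of an object at 2.0 m with the lens focused at 1.1 m
# psi = psi_from_distance(2.0, 1.1, R=2.3e-3, lam=455e-9)
```

Inverting psi_from_distance for z_o is also how an estimated ψ map and a known focus point can be converted to an absolute (metric) depth map, as mentioned in Section IV.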
B. Backward model

As described in the previous subsection, the forward model of the phase coded aperture layer is expressed as:

I_out = I_in ∗ PSF(ψ).    (9)

The PSF(ψ) varies with the depth (ψ), but it also has a constant dependence on the phase ring pattern parameters r and φ, as expressed in (7). In the network training process, we are interested in determining both r and φ. Therefore, we need to evaluate three separate derivatives: ∂I_out/∂r_i for i = 1, 2 (the inner and outer radius of the phase ring, as detailed in (6)) and ∂I_out/∂φ. All three are derived in a similar fashion:

∂I_out/∂r_i/φ = ∂/∂r_i/φ [I_in ∗ PSF(ψ, r, φ)] = I_in ∗ ∂/∂r_i/φ PSF(ψ, r, φ).    (10)

Thus, we need to calculate ∂PSF/∂r_i and ∂PSF/∂φ. Since the two derivatives are obtained in a similar manner, we start with ∂PSF/∂φ and then describe the differences in the derivation of ∂PSF/∂r_i later. Using (2), we get

∂/∂φ PSF(ψ, r, φ) = ∂/∂φ [F{P(ψ, r, φ)} (F{P(ψ, r, φ)})^*]
                  = [∂/∂φ F{P(ψ, r, φ)}] (F{P(ψ, r, φ)})^* + F{P(ψ, r, φ)} [∂/∂φ (F{P(ψ, r, φ)})^*].    (11)

We may see that the main term in (11) is ∂/∂φ [F{P(ψ, r, φ)}] or its complex conjugate. Due to the linearity of the derivative and the Fourier transform, the order of operations can be reversed and rewritten as F{∂/∂φ P(ψ, r, φ)}. Therefore, the last term remaining for calculating the PSF derivative is:

∂/∂φ P(ψ, r, φ) = ∂/∂φ [P(ρ, θ) CA(r, φ) exp{jψρ^2}] = P(ρ, θ) exp{jψρ^2} ∂/∂φ [CA(r, φ)]
                = { jP(ψ, r, φ),  r_1 < ρ < r_2;  0,  otherwise }.    (12)

Similar to the derivation of ∂PSF/∂φ, for calculating ∂PSF/∂r_i we also need ∂/∂r_i P(ψ, r, φ). Similar to (12), we