Joint Image and Depth Estimation With Mask-Based Lensless Cameras
Abstract—Mask-based lensless cameras replace the lens of a conventional camera with a custom mask. These cameras can potentially be very thin and even flexible. Recently, it has been demonstrated that such mask-based cameras can recover the light intensity and depth information of a scene. Existing depth recovery algorithms either assume that the scene consists of a small number of depth planes or solve a sparse recovery problem over a large 3D volume. Both of these approaches fail to recover scenes with large depth variations. In this paper, we propose a new approach for depth estimation based on an alternating gradient descent algorithm that jointly estimates a continuous depth map and the light distribution of the unknown scene from its lensless measurements. We present simulation results on image and depth reconstruction for a variety of 3D test scenes. A comparison between the proposed algorithm and other methods shows that our algorithm is more robust for natural scenes with a large range of depths. We built a prototype lensless camera and present experimental results for the reconstruction of intensity and depth maps of different real objects.

Index Terms—Lensless imaging, flatcam, depth estimation, non-convex optimization, alternating minimization.

I. INTRODUCTION

DEPTH estimation is an important and challenging problem that arises in a variety of applications, including computer vision, robotics, and autonomous systems. Existing depth estimation systems use stereo pairs of conventional (lens-based) cameras or time-of-flight sensors [2]–[4]. These cameras can be heavy and bulky and require a large space for their installation. Therefore, their adoption in portable and lightweight devices with strict physical constraints is still limited.

In this paper, we propose a joint image and depth estimation framework for a computational lensless camera that consists of a fixed, binary mask placed on top of a bare sensor. Such mask-based cameras offer an alternative design for building cameras without lenses. A recent example of a mask-based lensless camera is known as FlatCam [5]. In contrast with a lens-based camera, which is designed to map every point in the scene to a single pixel on the sensor, every sensor pixel in a FlatCam records light from many points in the scene. A single point source in the scene casts a shadow of the mask on the sensor, which shifts if the point moves parallel to the sensor plane and expands/shrinks if the point source moves toward/away from the sensor plane. The measurements recorded on the sensor thus represent a superposition of shifted and scaled versions of the mask shadows corresponding to light sources in different directions and depths. Image and depth information about the scene is thus encoded in the measurements, and we can solve an inverse problem to estimate both of them.

To jointly estimate the depth and light distribution, we propose a two-step approach that consists of an initialization step and an alternating gradient descent step to minimize our objective. To preserve sharp edges in the image intensity and depth map, we include an adaptive regularization penalty in our objective function.

An overview of the reconstruction framework is illustrated in Fig. 1. In this paper, we use the same sampling framework proposed in [6]. We initialize the estimate of the depth map by selecting a single plane or by solving the greedy algorithm proposed in [6]. The greedy method assumes that the scene consists of a small number of depth planes and fails to recover scenes with continuous depth variations. The method proposed in this paper can estimate continuous depth by minimizing an objective function with respect to image intensity and depth via alternating gradient descent. We present extensive simulation and real experimental results with different objects.

The main contributions of this paper are as follows.
• We propose a new computational framework for joint estimation of light intensity and depth maps from a single image of a mask-based lensless camera. In contrast to other methods, our method estimates the depth map on a continuous domain. Our algorithm consists of a careful initialization step based on greedy pursuit and an alternating minimization step based on gradient descent.
• The problem of joint image and depth recovery is highly nonconvex. To tackle this issue, we present different regularization schemes that offer robust recovery on a diverse dataset.
• We present simulation results on standard 3D datasets and demonstrate a significant improvement over existing methods for 3D imaging using coded mask-based lensless cameras.
• We built a hardware prototype to capture measurements of real objects. We present image and depth reconstruction results for these real objects using our proposed algorithm and a comparison with existing methods.

Manuscript received October 16, 2019; revised April 23, 2020 and July 6, 2020; accepted July 12, 2020. Date of publication July 20, 2020; date of current version July 31, 2020. This work was supported in part by a Google Faculty Award. A shorter version with preliminary results was presented in [1]. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hajime Nagahara. (Corresponding author: M. Salman Asif.) The authors are with the Department of Electrical and Computer Engineering, University of California, Riverside, CA 92521 USA (e-mail: [email protected]; [email protected]). This article has supplementary downloadable material available at https://ieeexplore.ieee.org, provided by the authors. Digital Object Identifier 10.1109/TCI.2020.3010360.
Fig. 1. A coded mask-based imaging model and an overview of the proposed continuous depth estimation framework.
II. RELATED WORK

A pinhole camera, also known as a camera obscura, is the simplest example of a mask-based lensless camera. Even though a pinhole can easily provide an image of the scene on a sensor plane, the image quality is often severely affected by noise because the amount of light collected is limited by the pinhole aperture [7]. Coded aperture-based lensless cameras avoid this problem by increasing the number of pinholes and allowing more light to reach the sensor [5], [8]–[12]. In contrast to a pinhole camera, where only one inverted image of the scene is obtained through a single pinhole, the measurements captured through a coded mask are a linear combination of all the pinhole images under every mask element. To recover an image of the scene, we need to solve a computational image recovery problem [5], [8], [12].

Recent work on mask-based lensless imaging broadly falls into two categories. FlatCam [6] uses a separable mask aligned with the sensor such that the sensor measurements corresponding to a plane at a fixed depth from the sensor can be written as a separable system. DiffuserCam [12] assumes that the mask size and the angular span of the object are small enough that the sensor measurements of a plane can be modeled as a convolution of the mask pattern with the image intensity at that plane. The convolutional model can be computationally efficient if the object falls within a small angular range because we can use the fast Fourier transform to compute convolutions. The separable model does not require a small angular range assumption. A number of methods based on deep learning have also been developed recently for both separable and convolutional imaging models to recover images at a fixed depth plane [13]–[15].

A coded aperture system offers another advantage by encoding light from different directions and depths differently. The depth-dependent imaging capability of coded aperture systems has been known since the pioneering work in this domain [8], [16]. However, the classical methods usually assume that the scene consists of a single plane at a known depth. In this paper, we assume that the depth map is arbitrarily distributed on a continuous domain and that the true depth map is unknown at the time of reconstruction.

The 3D lensless imaging problem has also recently been studied in [6], [11], [12], [17], [18]. These methods can broadly be divided into two categories. In the first category, the 3D scene is divided into a finite number of voxels. To recover the 3D light distribution, these methods solve an ℓ1 norm-based recovery problem under the assumption that the scene is very sparse [12], [17]. In the second category, the 3D scene is divided into an intensity map and multiple depth planes such that each pixel is assigned one intensity and one depth. To solve the intensity and depth recovery problem, these methods either sweep through the depth planes [11], [18] or assign a depth to each pixel using a greedy method [6]. Our proposed method belongs to the second category: we model the image intensity and depth separately and assume that the depth values of the scene are distributed on a continuous domain. To recover the 3D scene, we jointly estimate the image intensity and depth map from the available sensor measurements.

Joint estimation of an image intensity and depth map can be viewed as a nonlinear inverse problem in which the sampling function depends on the scene depth. Similar inverse problems also arise in many other fields, such as direction-of-arrival estimation in radar [19], super-resolution [20], and compressed sensing [21]–[23]. Similar to the joint estimation of image intensity and depth, the solution approaches to these problems consist of two main steps: identification of signal bases and estimation of signal intensities based on the identified bases. The problem of identifying the signal bases from continuously varying candidates is often called off-the-grid signal recovery. The methods for solving off-the-grid signal recovery problems can be divided into two main types. The first approach formulates the problem as a convex program on a continuous domain and solves it using an atomic norm minimization approach [24], [25]. The second approach linearizes the problem for the optimization parameter using a first-order approximation at every iteration [20], [26]. Our proposed algorithm is inspired by the second approach.

Mask-based lensless cameras have traditionally been used for imaging light at wavelengths beyond the visible spectrum [9], [10]. Other examples related to mask-based cameras include controllable apertures, coded masks for compressed sensing and computational imaging [27], [28], distributed lensless cameras [29], single-pixel cameras [30], and external mask settings [31].

Coded masks have also recently been used with conventional lens-based cameras to estimate depth and light fields [32]–[35]. Recently, a number of data-driven methods have been proposed to design custom phase masks and optical elements that estimate depth from a single image [36], [37]. An all-optical diffractive deep neural network is proposed in [38], [39], which can perform pattern recognition tasks, such as handwritten digit classification, using optical mask layers. Such networks can process images at very high speed with near-zero energy cost.
III. METHODS

A. Imaging Model

We divide the 3D scene under observation into N × N uniformly spaced directions. We use θ_i and θ_j to denote the angular directions of a light source with respect to the center of the sensor. The intensity and depth of the light source are denoted using l_{i,j} and z_{i,j}, respectively. Fig. 1(a) depicts the geometry of such an imaging model. A planar coded mask is placed on top of a planar sensor array at distance d. The M × M sensor array captures light coming from the scene, modulated by the coded mask.

Every light source in the scene casts a shadow of the mask on the sensor array, which we denote using basis functions ψ. We use s_u and s_v to index a pixel on the rectangular sensor array. The shadow cast by a light source with unit intensity at (θ_i, θ_j, z_{i,j}) can be represented as the following basis or point spread function:

ψ_{i,j}(s_u, s_v) = mask[α_{i,j} s_u + d tan(θ_i), α_{i,j} s_v + d tan(θ_j)],   (1)

where mask[u, v] denotes the transmittance of the mask pattern at location (u, v) on the mask plane and α_{i,j} is a variable that is related to the physical depth z_{i,j} with the following inverse relation:

α_{i,j} = 1 − d / z_{i,j}.   (2)

If the 3D scene consists of only a single point source at (θ_i, θ_j) with light intensity l_{i,j}, the measurement captured at sensor pixel (s_u, s_v) would be

y(s_u, s_v) = ψ_{i,j}(s_u, s_v) l_{i,j}.   (3)

The measurement recorded on any sensor pixel is the summation of contributions from each of the point sources in the 3D scene. The imaging model for a single sensor pixel can thus be represented by

y(s_u, s_v) = Σ_{i=1}^{N} Σ_{j=1}^{N} ψ_{i,j}(s_u, s_v) l_{i,j}.   (4)

We can write the imaging model for the entire sensor in a compact form as

y = Ψ(α) l + e,   (5)

where y ∈ R^{M²} is a vectorized form of the M × M matrix of sensor measurements, l ∈ R^{N²} is a vectorized form of the N × N matrix of light intensities from all the locations (θ_i, θ_j, α_{i,j}), and Ψ is a matrix with all the basis functions corresponding to (θ_i, θ_j, α_{i,j}). The basis functions in (5) are parameterized by the unknown α ∈ R^{N²}, and e denotes noise and other nonidealities in the system.

We can jointly estimate the light distribution (l) and the inverse depth map (α)¹ by solving the following optimization problem:

minimize_{α,l} (1/2) ‖y − Ψ(α) l‖²₂.   (6)

Note that if we know the true values of α (or we fix them to something), the problem in (6) reduces to a linear least-squares problem that can be efficiently solved via standard solvers. On the other hand, if we fix the value of l, the problem remains nonlinear with respect to α. In the next few sections, we discuss our approach for solving the problem in (6) via alternating minimization.

¹ α has an inverse relation with the depth map (2); therefore, we refer to it as the inverse depth map throughout the paper.
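To make the forward model concrete, the following NumPy sketch renders measurements for a separable mask under (1)–(5). The geometry values, the random placeholder mask, and the helper names (mask_1d, psf_1d, render) are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Illustrative geometry in mm (not the paper's exact values): d is the
# mask-to-sensor distance; sensor pixels have 50 um pitch, mask features 30 um.
d = 4.0
sensor = (np.arange(512) - 255.5) * 0.05             # sensor coordinates s_u (mm)
mask_x = (np.arange(1024) - 511.5) * 0.03            # mask feature centers (mm)
mask_t = (np.random.rand(1024) < 0.5).astype(float)  # placeholder transmittance

def mask_1d(u):
    """Sample the (smoothed) 1D mask transmittance at continuous locations u."""
    return np.interp(u, mask_x, mask_t, left=0.0, right=0.0)

def psf_1d(theta, alpha):
    """Eq. (1) along one axis: mask shadow for direction theta, inverse depth alpha."""
    return mask_1d(alpha * sensor + d * np.tan(theta))

def render(thetas, L, A):
    """Eqs. (4)-(5) for a separable mask: y = sum_ij l_ij * outer(psi_i, psi_j)."""
    y = np.zeros((sensor.size, sensor.size))
    for i, ti in enumerate(thetas):
        for j, tj in enumerate(thetas):
            y += L[i, j] * np.outer(psf_1d(ti, A[i, j]), psf_1d(tj, A[i, j]))
    return y
```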
B. Initialization

Since the minimization problem in (6) is not convex, a proper initialization is often needed to ensure convergence to a local minimum close to the optimal point. A naïve approach is to initialize all the point sources in the scene at the same depth plane. To select an initial depth plane, we sweep through a set of candidate depth planes and perform image reconstruction on one depth plane at a time by solving the following linear least squares problem:

minimize_l (1/2) ‖y − Ψ(α) l‖²₂.   (7)

We evaluate the loss value for all the candidate depth planes and pick the one with the smallest loss as our initial depth. The mask basis function in (1) changes as we change α, which has an inverse relation with the scene depth. We select candidate depths corresponding to uniformly sampled values of α, which yields non-uniform sampling of the physical scene depth. This single-depth initialization approach is computationally simple and provides a reasonable initialization of the light distribution to start with, especially when the scene is far from the sensor.
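A minimal sketch of this single-plane sweep, reusing psf_1d from the earlier snippet. Forming Ψ densely as below is only feasible for small illustrative sizes, and build_Psi is an assumed helper.

```python
import numpy as np

def build_Psi(thetas, alpha):
    """Columns of Psi(alpha): vectorized PSFs psi_ij from Eq. (1), with all
    scene pixels at a single shared inverse depth alpha (dense; small sizes only)."""
    cols = [np.outer(psf_1d(ti, alpha), psf_1d(tj, alpha)).ravel()
            for ti in thetas for tj in thetas]
    return np.stack(cols, axis=1)

def sweep_init(y, thetas, alphas):
    """Single-plane initialization: solve Eq. (7) per candidate alpha, keep the best."""
    best = None
    for a in alphas:
        Psi = build_Psi(thetas, a)
        l, *_ = np.linalg.lstsq(Psi, y.ravel(), rcond=None)
        loss = 0.5 * np.sum((y.ravel() - Psi @ l) ** 2)
        if best is None or loss < best[0]:
            best = (loss, a, l)
    return best  # (loss, alpha_init, l_init)
```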
Our second approach for initialization is the greedy method proposed in [6]. Greedy algorithms are widely used for sparse signal recovery [21]–[23]. Based on these algorithms, [6] proposed a greedy depth pursuit algorithm for depth estimation from FlatCam [5]. The algorithm works by iteratively updating the depth surface that best matches the observed measurements.

The depth pursuit method assumes that the scene consists of a small number of predefined depth planes. We start the program by initializing all the pixels at a single depth plane and estimating the light intensities l based on the initialized depth map. The first step is to select new candidate values for α. The new candidates are selected using the basis vectors that are most correlated with the current residual of the estimate. In the second step, the new candidates for α are appended to the current estimate, and we solve a least squares problem using the appended α. In the third step, we prune α by selecting each α_{i,j} as the value corresponding to the largest magnitude of l_{i,j}. Although this method may not estimate off-grid point sources well, it produces a good preliminary estimate of the scene.
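The three steps can be sketched as follows. This is a simplified reading of the pursuit in [6]: the helper Psi_at, the per-pixel pruning rule, and the fixed iteration count are assumptions rather than the exact algorithm.

```python
import numpy as np

def depth_pursuit(y, Psi_at, candidates, n_pix, iters=10):
    """Schematic greedy depth pursuit: Psi_at(k, a) returns the vectorized PSF of
    scene pixel k at inverse depth a; y is the vectorized sensor measurement."""
    alpha = np.full(n_pix, candidates[0])       # start all pixels on one plane
    for _ in range(iters):
        A = np.stack([Psi_at(k, alpha[k]) for k in range(n_pix)], axis=1)
        l, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ l                           # residual of current estimate
        # Steps 1-2: per pixel, append the candidate whose basis vector is most
        # correlated with the residual, then solve a larger least squares problem.
        new = np.array([max(candidates, key=lambda a: abs(Psi_at(k, a) @ r))
                        for k in range(n_pix)])
        A2 = np.concatenate(
            [A, np.stack([Psi_at(k, new[k]) for k in range(n_pix)], axis=1)], axis=1)
        l2, *_ = np.linalg.lstsq(A2, y, rcond=None)
        # Step 3: prune -- keep, per pixel, the alpha with the larger |l|.
        keep = np.abs(l2[:n_pix]) >= np.abs(l2[n_pix:])
        alpha = np.where(keep, alpha, new)
    return alpha
```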
C. Refinement via Alternating Gradient Descent

To solve the minimization problem in (6), we start with the preliminary image and depth estimates from the initialization step and alternately update the depth and light distribution via gradient descent. The main computational task in the gradient descent method is computing the gradient of the loss function with respect to α. To compute that gradient, we expand the loss function in (6) as

L = (1/2) Σ_{u,v=1}^{M} ( y(s_u, s_v) − Σ_{i,j=1}^{N} ψ_{i,j}(s_u, s_v) l_{i,j} )².   (8)

We define R_{u,v} = y(s_u, s_v) − Σ_{i,j=1}^{N} ψ_{i,j}(s_u, s_v) l_{i,j} as the residual approximation error at location (s_u, s_v). The derivative of the loss function with respect to α_{i,j} is given as

∂L/∂α_{i,j} = Σ_{u,v=1}^{M} R_{u,v} (∂R_{u,v}/∂α_{i,j}) = −l_{i,j} Σ_{u,v=1}^{M} R_{u,v} (∂ψ_{i,j}(s_u, s_v)/∂α_{i,j}).   (9)

We compute the derivative of the sensor response with respect to α_{i,j} using the total derivative² as follows:

∂ψ_{i,j}(s_u, s_v)/∂α_{i,j} = (∂ψ_{i,j}(s_u, s_v)/∂u_{i,j})(∂u_{i,j}/∂α_{i,j}) + (∂ψ_{i,j}(s_u, s_v)/∂v_{i,j})(∂v_{i,j}/∂α_{i,j}) = (∂ψ_{i,j}(s_u, s_v)/∂u_{i,j}) s_u + (∂ψ_{i,j}(s_u, s_v)/∂v_{i,j}) s_v.   (10)

Here, u_{i,j} = α_{i,j} s_u + d tan(θ_i) and v_{i,j} = α_{i,j} s_v + d tan(θ_j) denote two dummy variables that correspond to the specific location on the mask where a light ray from a point source at angle (θ_i, θ_j) and depth α_{i,j} intersects the mask plane on its way to sensor pixel (s_u, s_v). The terms ∂ψ_{i,j}(s_u, s_v)/∂u_{i,j} and ∂ψ_{i,j}(s_u, s_v)/∂v_{i,j} can be viewed as the derivatives of the mask pattern along the respective spatial coordinates, evaluated at (u_{i,j}, v_{i,j}). We compute these derivatives using finite differences of ψ_{i,j}(s_u, s_v) over a fine grid and linear interpolation.

² Recall that the total derivative of a multivariate function f(x, y) is (∂f(x, y)/∂x) dx + (∂f(x, y)/∂y) dy.
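One way to realize this finite-difference-plus-interpolation step, reusing mask_x, mask_1d, sensor, and d from the imaging-model sketch above (the fine-grid resolution is an assumption):

```python
import numpy as np

# Tabulate the mask derivative once on a fine grid, then interpolate.
fine = np.linspace(mask_x[0], mask_x[-1], 20001)
dmask_fine = np.gradient(mask_1d(fine), fine)   # finite-difference d(mask)/du

def dmask_1d(u):
    return np.interp(u, fine, dmask_fine, left=0.0, right=0.0)

def dpsf_dalpha_1d(theta, alpha):
    """Eq. (10) along one axis: d(psi)/d(alpha) = s * mask'(alpha*s + d*tan(theta))."""
    return sensor * dmask_1d(alpha * sensor + d * np.tan(theta))
```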
D. Algorithm Analysis

To solve the nonlinear least squares problem in (6) in our algorithms, we compute the gradient derived in (10) and use it as input to an optimization solver. Suppose ψ_i and ψ_j denote the basis function vectors evaluated on a 1D mask as

ψ_i(s_u) = mask[α_{i,j} s_u + d tan(θ_i)],  ψ_j(s_v) = mask[α_{i,j} s_v + d tan(θ_j)].   (11)

If we use a separable mask pattern, then the 2D mask function ψ_{i,j} in (1) can be computed as the outer product of two vectors: ψ_{i,j} = ψ_i ψ_j^T. Similarly, we define 1D sub-gradient functions g as

g_i(s_u) = ∂ψ_{i,j}(s_u, s_v)/∂u_{i,j},  g_j(s_v) = ∂ψ_{i,j}(s_u, s_v)/∂v_{i,j}.   (12)

Similar to (10), g_i and g_j are the sub-gradient functions along the 1D mask; they take nonzero values at locations where the mask pattern value changes and are zero elsewhere. Using the derivation in (10), the matrix containing ∂ψ_{i,j}(s_u, s_v)/∂α_{i,j} at all (s_u, s_v) can be computed as the following sum of two vector outer products:

∂ψ_{i,j}/∂α_{i,j} = g_i ψ_j^T + ψ_i g_j^T.   (13)

Using the derivation in (9), the derivative of the loss function with respect to each depth value can be computed using the following matrix multiplications, where R refers to the matrix of residuals R_{u,v} at all (s_u, s_v):

∂L/∂α_{i,j} = g_i^T R ψ_j + ψ_i^T R g_j.   (14)

Suppose we have M × M pixels on the sensor array. The computation in (14) takes 2M² + 2M multiplications. We then feed our gradients to the minFunc solver [40] with the L-BFGS algorithm [41] to solve the nonlinear optimization problem in (6).
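In code, each per-pixel gradient then reduces to two 1D matrix products; a sketch (the −l_{i,j} factor is carried over from (9)):

```python
import numpy as np

def grad_alpha_ij(R, psi_i, psi_j, g_i, g_j, l_ij):
    """Eqs. (13)-(14): gradient of the loss w.r.t. alpha_ij from the residual
    matrix R; the factor -l_ij comes from Eq. (9)."""
    return -l_ij * (g_i @ R @ psi_j + psi_i @ R @ g_j)
```

Each evaluation is two M × M matrix-vector products, consistent with the 2M² + 2M multiplication count above.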
E. Regularization Approaches

ℓ2 regularization on spatial gradients: The optimization problem in (6) is highly non-convex and contains several local minima; therefore, the estimate often gets stuck in a local minimum, and the estimated intensity and depth maps are coarse. To improve the performance of our algorithm for solving the non-convex problem in (6), we seek to exploit additional structure in the scene. A standard assumption is that the depths of neighboring pixels are usually close, which implies that the spatial differences of the (inverse) depth map are small. To incorporate this assumption into our model, we add a quadratic regularization term on the spatial gradients of the inverse depth map to our loss function. The quadratic regularization term is defined as an adaptively weighted regularization inspired by [42]:

R_W(α) = Σ_{i,j=1}^{N} W^r_{i,j} (α_{i,j} − α_{i+1,j})² + W^c_{i,j} (α_{i,j} − α_{i,j+1})²,   (16)

where W^r_{i,j} and W^c_{i,j} denote weights for the row and column differences, respectively. We aim to select these weights to promote depth similarity between neighboring pixels while avoiding smoothing of sharp edges; to promote this, we select the weights adaptively [42].
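A small sketch of evaluating (16) and its gradient for use inside the descent step; the array shapes and the placement of the weights are our assumptions:

```python
import numpy as np

def weighted_l2_penalty(alpha, Wr, Wc):
    """Eq. (16): weighted quadratic penalty on inverse-depth differences.
    alpha, Wr, Wc are N x N arrays; boundary terms are simply dropped."""
    dr = alpha[:-1, :] - alpha[1:, :]      # alpha_ij - alpha_(i+1)j
    dc = alpha[:, :-1] - alpha[:, 1:]      # alpha_ij - alpha_i(j+1)
    return np.sum(Wr[:-1, :] * dr ** 2) + np.sum(Wc[:, :-1] * dc ** 2)

def weighted_l2_grad(alpha, Wr, Wc):
    """Gradient of Eq. (16) w.r.t. alpha, added to the data-term gradient."""
    g = np.zeros_like(alpha)
    dr = alpha[:-1, :] - alpha[1:, :]
    dc = alpha[:, :-1] - alpha[:, 1:]
    g[:-1, :] += 2 * Wr[:-1, :] * dr
    g[1:, :] -= 2 * Wr[:-1, :] * dr
    g[:, :-1] += 2 * Wc[:, :-1] * dc
    g[:, 1:] -= 2 * Wc[:, :-1] * dc
    return g
```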
ℓ1 regularization on spatial gradients: An alternative is to penalize the ℓ1 norm of the spatial gradients of the inverse depth map, which takes the form

‖∇_r α‖_1 + ‖∇_c α‖_1,   (19)

where ∇_r and ∇_c denote the row and column finite-difference operators. To solve the nonlinear optimization problem with ℓ1-norm regularization, we write the optimization problem as

minimize_{α,l} (1/2) ‖y − Ψ(α) l‖²₂ + λ (‖d_r‖_1 + ‖d_c‖_1)  subject to  d_r = ∇_r α, d_c = ∇_c α.   (20)

We solve problem (20) using a split-Bregman method [46].
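A schematic split-Bregman loop for (20). The α-subproblem is abstracted behind an assumed solve_alpha callable, and the update structure below follows the generic method of [46]; it is a sketch, not the exact solver.

```python
import numpy as np

def shrink(x, t):
    """Soft thresholding: closed-form minimizer of t*|d| + (d - x)^2 / 2."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def split_bregman_tv(alpha0, solve_alpha, grad_r, grad_c, lam, mu, iters=20):
    """Generic split-Bregman iterations for (20). solve_alpha(dr, dc, br, bc) is
    assumed to (approximately) minimize the data term plus the quadratic coupling
    mu/2 * ||d - grad(alpha) - b||^2; grad_r/grad_c are forward differences."""
    alpha = alpha0.copy()
    dr = np.zeros_like(grad_r(alpha))
    dc = np.zeros_like(grad_c(alpha))
    br = np.zeros_like(dr)
    bc = np.zeros_like(dc)
    for _ in range(iters):
        alpha = solve_alpha(dr, dc, br, bc)           # alpha (and l) subproblem
        dr = shrink(grad_r(alpha) + br, lam / mu)     # closed-form d-updates
        dc = shrink(grad_c(alpha) + bc, lam / mu)
        br += grad_r(alpha) - dr                      # Bregman variable updates
        bc += grad_c(alpha) - dc
    return alpha
```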
A. Simulation Setup

To validate the performance of the proposed algorithm, we simulate a lensless imaging system using a binary planar mask with a separable maximum length sequence (MLS) pattern [47] that is placed 4 mm away from a planar sensor array. We used an MLS sequence of length 1024 and converted all the −1s to 0s to create a separable binary pattern. We used square mask features, each of which is 30 μm wide. Since we optimize the objective function in (6) with respect to α and need to compute the gradient in (9), we require the mask function to be smooth and differentiable with respect to α. Therefore, we convolved the binary pattern with a Gaussian blur kernel of length 15 μm and standard deviation 5. In our simulations, we do not explicitly model the diffraction blur. However, the Gaussian blur kernel that we apply to the mask function can be viewed as an approximation of the diffraction blur. The sensor contains 512 × 512 square pixels, each of which is 50 μm wide. The chief ray angle of each sensor pixel is ±18°. We assume that there is no noise added to the sensor measurements. In our experiments for continuous depth estimation, we fixed all the parameters to these default values and analyzed the performance with respect to a single parameter at a time.
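A plausible construction of such a mask profile with SciPy; the padding to length 1024, the fine-grid resolution, and the blur width in grid samples are assumptions:

```python
import numpy as np
from scipy.signal import max_len_seq
from scipy.ndimage import gaussian_filter1d

# MLS bits in {0, 1}; a 10-bit register gives length 2**10 - 1 = 1023,
# padded here to 1024 (the paper's exact construction may differ).
bits = max_len_seq(10)[0]
m1d = np.concatenate([bits, [bits[0]]]).astype(float)

# Represent the 1D profile on a fine grid (1 um per sample for 30 um features)
# and smooth it so that mask(u), and hence Eq. (1), is differentiable in alpha.
profile = np.repeat(m1d, 30)                     # 30 samples per mask feature
profile = gaussian_filter1d(profile, sigma=5.0)  # blur; sigma in 1 um samples

# With a separable mask, the 2D pattern is the outer product of this profile
# with itself (Eq. (11)); it never needs to be formed explicitly.
```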
Fig. 3. Left to right: original image and depth of the Cones scene; image and depth initialized via the greedy algorithm [6]; depth estimation using weighted ℓ2-based regularization. The depth in this scene varies from around 0.99 m to 1.7 m.

B. Reconstruction of Scenes With Continuous Depth

Depth datasets: We performed all our experiments on 3D images created using light intensities and depth information from Middlebury [48], Make3D [49], [50], and NYU Depth [51]; the test scenes and their depth ranges are listed in Table I.

Initialization via greedy method: Let us further discuss our simulation setup using the Cones scene, for which the results are presented in Fig. 3. We simulated the 3D scene using depth data from the Middlebury dataset [48]. We sample the scene at uniform angles to create a 128 × 128 image and its (inverse) depth map of the same size. We can compute the physical depth from α using (2). In our simulation, the depth of this scene ranges from around 0.99 m to 1.7 m. We used the depth pursuit greedy algorithm in [6] as our initialization method. We selected 15 candidate depths by uniformly sampling the inverse depth values α from 0.996 to 0.9976, which gives an effective depth range that matches the original depth. Since we are trying to gauge the performance of off-the-grid depth estimation, the candidate values of α are not exactly the same as the true values of α in our simulations. The output of the initialization algorithm is then fed into the alternating gradient descent method.

Performance metrics: We evaluate the performance of the recovered image intensity and depth independently of each other. We report the peak signal-to-noise ratio (PSNR) of the estimated intensity distribution and the root mean squared error (RMSE) of the estimated depth maps for all our experiments.
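For concreteness, the two metrics can be computed as follows (a peak intensity of 1.0 for normalized images is an assumption):

```python
import numpy as np

def psnr(est, ref, peak=1.0):
    """PSNR (dB) of the estimated intensity map against the ground truth."""
    return 10 * np.log10(peak ** 2 / np.mean((est - ref) ** 2))

def depth_rmse(z_est, z_ref):
    """RMSE of the estimated depth map, in the units of the ground truth."""
    return np.sqrt(np.mean((z_est - z_ref) ** 2))
```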
The estimates of image intensity and depth maps for the initialization and our proposed weighted TV-ℓ2 method are shown in Fig. 3, along with the PSNR and RMSE. We can observe that both the image and depth estimates from the greedy method [6] contain several spikes because of the model mismatch with the predefined depth grid. In contrast, many of these spikes are removed in the estimates from the proposed algorithm with weighted TV-ℓ2, while the edges are preserved.

Comparison of regularization methods: Here we present a comparison between three different regularization approaches. We reconstruct the image intensity and (inverse) depth map from the same measurements with TV-ℓ2, weighted TV-ℓ2, and TV-ℓ1 regularization. The results are shown in Fig. 4. Compared to the TV-ℓ2 method, we observe that both weighted TV-ℓ2 and TV-ℓ1 preserve the sharp edges in the image and depth estimates. Overall, in our experiments, weighted TV-ℓ2 provided the best results. Therefore, we used it as our default method for the rest of the paper.

Fig. 4. Comparison between reconstructions using three different regularization approaches from the same measurements.

C. Effects of Noise

Sensor noise exists widely in any observation process. The amplitude of the noise depends on the intensities of the sensor measurements and can adversely affect the reconstruction results. To investigate the effect of noise on our algorithm, we present simulation results for the reconstruction of scenes from the same sensor measurements under different levels of additive white Gaussian noise. The experiments are performed on multiple 3D scenes listed in Table I. Some examples of reconstruction with different levels of noise are shown in Fig. 5.

Fig. 5. Effects of noise: Reconstruction from measurements with signal-to-noise ratio (SNR) of 20 dB, 30 dB, and 40 dB, along with the PSNR of the reconstructed image and the RMSE of the reconstructed depth map. As expected, the quality of the reconstructed image and depth improves as the noise level is reduced. The sequence on the left is for Sword; the one on the right is for Playtable.

The plots recording the PSNR of image intensities and the RMSE of depth maps over a range of measurement SNR values are presented in Fig. 6. As we can observe from the curves, the quality of both the estimated image and depth improves when the measurements have small noise (high SNR), and the quality degrades as we add more noise to the measurements (low SNR). Another observation is that scenes that are farther away have higher RMSE. This is understandable because, as the scenes move farther, the α values of all the scene pixels get very close to 1 and we cannot resolve fine depth variations in the scene.
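A sketch of corrupting clean measurements at a prescribed SNR (defined as signal power over noise power, in dB):

```python
import numpy as np

def add_awgn(y, snr_db, rng=None):
    """Add white Gaussian noise scaled so that the measurement SNR equals snr_db."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
    return y + rng.normal(0.0, np.sqrt(noise_power), y.shape)
```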
Fig. 7. Reconstructions from measurements with different sizes of sensor pixels. The number of sensor pixels is fixed as 512 × 512. The quality of depth
reconstruction improves as we increase the size of sensor pixels.
Fig. 8. Comparison of existing 3D recovery methods for lensless imaging, 3D grid method in [12], [17] and greedy method in [6], with our proposed method.
3D grid method provides a 3D volume with multiple depth planes; therefore, we pick the depth with the largest light intensity along any angle for comparison.
Fig. 10. Experiments on real objects. (a) A slanted card; the depth range is 18–28 cm. (b) Two slanted cards; the depth range of the left card is 18–28 cm and that of the right card is 26–29 cm. (c) Hand sculpture; the depth range is 15–30 cm. (d) A mug with card texture; the depth range is 24–27 cm. We divide each group of real scenes into four columns: the first column shows the front view and side view of the scene, the second column is the result from the greedy algorithm in [6], the third column is the output of the sparse 3D grid recovery algorithm proposed in [12] and [17], and the last column is the image intensity and depth map estimated using our proposed algorithm.
sensor, we captured sensor measurements for an LED flashlight at 9 different angles at the same depth and merged them to estimate the mask function at that depth.

In our experiments, we captured the sensor measurements by placing an LED at z = 42 cm away from the sensor, which corresponds to the mask function in (1) evaluated at α = 1 − d/z = 0.9905 for d = 4 mm and z = 42 cm. We first resized the calibrated mask function to compute the mask function corresponding to α = 1.
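Under (1), changing the inverse depth only rescales the mask coordinates, so the calibrated mask function can be resampled rather than recalibrated; a 1D sketch (the interpolation order is an assumption, and cropping/padding back to the sensor grid is omitted):

```python
import numpy as np
from scipy.ndimage import zoom

def rescale_mask(mask_cal, alpha_cal, alpha_new):
    """Resample a mask function calibrated at inverse depth alpha_cal (0.9905
    here) to alpha_new. From Eq. (1), psi_alpha(s) = mask(alpha * s + ...), so
    psi_new(s) = psi_cal(s * alpha_new / alpha_cal): a pure coordinate rescaling,
    i.e., a zoom by the factor alpha_cal / alpha_new."""
    return zoom(np.asarray(mask_cal, dtype=float), alpha_cal / alpha_new, order=1)
```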
C. Reconstruction of Real Objects

We present results for four objects in Fig. 10: (a) a slanted card with depth range from 18 cm to 28 cm, (b) two slanted cards with depth ranges from 18 cm to 28 cm and from 26 cm to 29 cm, (c) a hand sculpture with depth range from 15 cm to 30 cm, and (d) a mug with card texture with depth from 24 cm to 27 cm. The figure is divided into four boxes. In each box, we present a front view and side view of the object along with the estimated scene intensity and depth maps for three methods: the greedy algorithm in [6], the sparse 3D volume recovery method from [12], [17], and our proposed method. For the greedy and 3D grid methods, we generated 15 candidate depth planes by uniformly sampling the inverse depth values α between 0.96 and 0.9905 (corresponding to depths of 10 cm and 42 cm, respectively).

All the objects in our experiments are placed in front of a black background, so the depth values for dark pixels are not meaningful. We can observe that in all these experiments, our proposed method provides a continuous depth map that is consistent with the real depth of the object in the scene. In comparison, both the greedy algorithm [6] and the sparse 3D volume recovery algorithm [12], [17] produce coarse and discretized depth maps. The intensity map recovered by our method is also visually better than those of the other methods.

Even though our proposed algorithm produces better intensity and depth maps than the greedy and 3D grid methods, we observed that the estimated depth has some errors in the darker parts of the objects. For instance, the left part of the mug is darker than the right part because the object was illuminated by a lamp on the right side. The left part appears to have errors in the depth estimate, as several pixels are assigned small depth values even though that part is in fact farther from the sensor. We observe a similar effect in other experiments, where the depth estimates for darker parts of the scene appear to have larger errors.
VI. CONCLUSION

We presented a new algorithm to jointly estimate the image and depth of a scene using a single snapshot of a mask-based lensless camera. Existing methods for 3D lensless imaging either estimate the scene over a predefined 3D grid (which is computationally expensive) or over a small number of candidate depth planes (which provides a coarse depth map). We divide the scene into an intensity map at uniform angles and a depth map on a continuous domain, which allows us to estimate a variety of scenes with different depth ranges using the same formulation. We jointly estimate the image intensity and depth map by solving a nonconvex problem. We initialize our estimates using a greedy method and add weighted regularization to enforce smoothness in the depth estimate while preserving sharp edges. We demonstrated with extensive simulations and experiments with real data that our proposed method can recover image and depth with high accuracy for a variety of scenes. We evaluated the performance of our method under different noise levels, sensor sizes, and numbers of sensor pixels and found it to be robust. We presented a comparison with existing methods for lensless 3D imaging and demonstrated, both in simulation and in real experiments, that our method provides significantly better results. We believe this work provides a step toward capturing complex scenes with lensless cameras, where depth estimation is a feature as well as a necessity: if the depth information is unavailable or inaccurate, it will cause artifacts in the recovered images.
REFERENCES

[1] Y. Zheng and M. S. Asif, "Image and depth estimation with mask-based lensless cameras," in Proc. IEEE Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process. (CAMSAP), 2019, pp. 91–95.
[2] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[3] S. B. Gokturk, H. Yalcin, and C. Bamji, "A time-of-flight depth sensor - system description, issues and solutions," in Proc. Conf. Comput. Vision Pattern Recognit. Workshop, Jun. 2004, pp. 35–35.
[4] F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich, "Low-budget transient imaging using photonic mixer devices," ACM Trans. Graph., vol. 32, no. 4, p. 45, 2013.
[5] M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, "FlatCam: Thin, lensless cameras using coded aperture and computation," IEEE Trans. Comput. Imag., vol. 3, no. 3, pp. 384–397, Sep. 2017.
[6] M. S. Asif, "Lensless 3D imaging using mask-based cameras," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 6498–6502.
[7] A. Yedidia, C. Thrampoulidis, and G. Wornell, "Analysis and optimization of aperture design in computational imaging," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2018, pp. 4029–4033. [Online]. Available: http://sigport.org/3049
[8] E. E. Fenimore and T. M. Cannon, "Coded aperture imaging with uniformly redundant arrays," Appl. Opt., vol. 17, no. 3, pp. 337–347, Feb. 1978. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-17-3-337
[9] A. Busboom, H. Elders-Boll, and H. D. Schotten, "Uniformly redundant arrays," Exp. Astron., vol. 8, no. 2, pp. 97–123, Jun. 1998. [Online]. Available: https://doi.org/10.1023/A:1007966830741
[10] T. M. Cannon and E. E. Fenimore, "Coded aperture imaging: Many holes make light work," Opt. Eng., vol. 19, pp. 283–289, Jun. 1980.
[11] V. Boominathan et al., "Lensless imaging: A computational renaissance," IEEE Signal Process. Mag., vol. 33, no. 5, pp. 23–35, 2016.
[12] N. Antipa et al., "DiffuserCam: Lensless single-exposure 3D imaging," Optica, vol. 5, no. 1, pp. 1–9, Jan. 2018. [Online]. Available: http://www.osapublishing.org/optica/abstract.cfm?URI=optica-5-1-1
[13] S. S. Khan, A. V. R., V. Boominathan, J. Tan, A. Veeraraghavan, and K. Mitra, "Towards photorealistic reconstruction of highly multiplexed lensless images," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2019, pp. 7859–7868.
[14] K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, "Learned reconstructions for practical mask-based lensless imaging," Opt. Express, vol. 27, no. 20, pp. 28075–28090, Sep. 2019. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-27-20-28075
[15] A. Dave, A. K. Vadathya, R. Subramanyam, R. Baburajan, and K. Mitra, "Solving inverse computational imaging problems using deep pixel-level prior," IEEE Trans. Comput. Imag., vol. 5, no. 1, pp. 37–51, 2019. [Online]. Available: https://doi.org/10.1109/TCI.2018.2882698
[16] H. H. Barrett, D. T. Wilson, G. D. DeMeester, and H. Scharfman, "Fresnel zone plate imaging in radiology and nuclear medicine," in Application of Optical Instrumentation in Medicine I, vol. 0035, P. L. Carson, W. R. Hendee, and W. C. Zarnstorff, Eds. International Society for Optics and Photonics, 1972, pp. 199–206.
[17] J. K. Adams et al., "Single-frame 3D fluorescence microscopy with ultraminiature lensless FlatScope," Sci. Adv., vol. 3, no. 12, 2017. [Online]. Available: https://advances.sciencemag.org/content/3/12/e1701548
[18] Y. Hua, S. Nakamura, M. S. Asif, and A. C. Sankaranarayanan, "SweepCam: Depth-aware lensless imaging using programmable masks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 7, pp. 1606–1617, 2020.
[19] Z. Tan, P. Yang, and A. Nehorai, "Joint sparse recovery method for compressed sensing with structured dictionary mismatches," IEEE Trans. Signal Process., vol. 62, no. 19, pp. 4997–5008, Oct. 2014.
[20] N. Boyd, G. Schiebinger, and B. Recht, "The alternating descent conditional gradient method for sparse inverse problems," in Proc. IEEE 6th Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process., Dec. 2015, pp. 57–60.
[21] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inform. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[22] D. Needell and J. A. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," Commun. ACM, vol. 53, no. 12, pp. 93–100, Dec. 2010. [Online]. Available: http://doi.acm.org/10.1145/1859204.1859229
[23] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, "Model-based compressive sensing," IEEE Trans. Inform. Theory, vol. 56, no. 4, pp. 1982–2001, 2010.
[24] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations Comput. Math., vol. 12, no. 6, pp. 805–849, Dec. 2012. [Online]. Available: https://doi.org/10.1007/s10208-012-9135-7
[25] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, "Compressed sensing off the grid," IEEE Trans. Inform. Theory, vol. 59, no. 11, pp. 7465–7490, Nov. 2013.
[26] Z. Yang, L. Xie, and C. Zhang, "Off-grid direction of arrival estimation using sparse Bayesian inference," IEEE Trans. Signal Process., vol. 61, no. 1, pp. 38–43, Jan. 2013.
[27] D. Takhar et al., "A new compressive imaging camera architecture using optical-domain compression," in Proc. Comput. Imag. IV, SPIE Electron. Imag., 2006, pp. 43–52.
[28] A. Zomet and S. K. Nayar, "Lensless imaging with a controllable aperture," in Proc. IEEE Comput. Vision Pattern Recognit., vol. 1, Jun. 2006, pp. 339–346.
[29] Y. Zheng and M. S. Asif, "Imaging with distributed lensless line sensors," in Proc. 53rd Asilomar Conf. Signals, Syst., Comput., Nov. 2019, pp. 1289–1293.
[30] M. F. Duarte et al., "Single-pixel imaging via compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 83–91, Mar. 2008.
[31] D. Reddy, J. Bai, and R. Ramamoorthi, "External mask based depth and light field camera," in Proc. IEEE Int. Conf. Comput. Vision Workshops, Dec. 2013, pp. 37–44.
[32] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, "Image and depth from a conventional camera with a coded aperture," ACM Trans. Graph., vol. 26, no. 3, pp. 70–es, 2007.
[33] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin, "Dappled photography: Mask enhanced cameras for heterodyned light fields and coded aperture refocusing," ACM Trans. Graph., vol. 26, no. 3, pp. 69–es, 2007.
[34] K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar, "Compressive light field photography using overcomplete dictionaries and optimized projections," ACM Trans. Graph. (Proc. SIGGRAPH), vol. 32, no. 4, pp. 1–11, 2013.
[35] M. Hirsch, S. Sivaramakrishnan, S. Jayasuriya, A. Wang, A. Molnar, R. Raskar, and G. Wetzstein, "A switchable light field camera architecture with angle sensitive pixels and dictionary-based sparse coding," in Proc. IEEE Int. Conf. Comput. Photography (ICCP), 2014, pp. 1–10.
[36] Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, "PhaseCam3D: Learning phase masks for passive single view depth estimation," in Proc. IEEE Int. Conf. Comput. Photography (ICCP), May 2019, pp. 1–12.
[37] J. Chang and G. Wetzstein, "Deep optics for monocular depth estimation and 3D object detection," in Proc. IEEE Int. Conf. Comput. Vision, 2019.
[38] X. Lin et al., "All-optical machine learning using diffractive deep neural networks," Science, vol. 361, no. 6406, pp. 1004–1008, 2018. [Online]. Available: https://science.sciencemag.org/content/361/6406/1004
[39] D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, "Analysis of diffractive optical neural networks and their integration with electronic neural networks," IEEE J. Sel. Topics Quantum Electron., vol. 26, no. 1, pp. 1–14, Jan. 2020.
[40] M. Schmidt, "minFunc: Unconstrained differentiable multivariate optimization in MATLAB," 2005. [Online]. Available: http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
[41] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 1, pp. 503–528, Aug. 1989. [Online]. Available: https://doi.org/10.1007/BF01589116
[42] Y. Liu, J. Ma, Y. Fan, and Z. Liang, "Adaptive-weighted total variation minimization for sparse data toward low-dose X-ray computed tomography image reconstruction," Phys. Med. Biol., vol. 57, no. 23, pp. 7923–7956, 2012.
[43] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. Sixth Int. Conf. Comput. Vision, Jan. 1998, pp. 839–846.
[44] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Trans. Graph., vol. 21, no. 3, pp. 257–266, Jul. 2002. [Online]. Available: http://doi.acm.org/10.1145/566654.566574
[45] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Phys. D, vol. 60, no. 1–4, pp. 259–268, Nov. 1992. [Online]. Available: https://doi.org/10.1016/0167-2789(92)90242-F
[46] T. Goldstein and S. Osher, "The split Bregman method for L1-regularized problems," SIAM J. Imag. Sci., vol. 2, no. 2, pp. 323–343, Apr. 2009. [Online]. Available: http://dx.doi.org/10.1137/080725891
[47] F. J. MacWilliams and N. J. A. Sloane, "Pseudo-random sequences and arrays," Proc. IEEE, vol. 64, no. 12, pp. 1715–1729, Dec. 1976.
[48] D. Scharstein, R. Szeliski, and R. Zabih, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," in Proc. IEEE Workshop Stereo Multi-Baseline Vis., Dec. 2001, pp. 131–140.
[49] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Adv. Neural Inf. Process. Syst. 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds. MIT Press, 2006, pp. 1161–1168. [Online]. Available: http://papers.nips.cc/paper/2921-learning-depth-from-single-monocular-images.pdf
[50] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, May 2009.
[51] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proc. Eur. Conf. Comput. Vision, 2012, pp. 746–760.

Yucheng Zheng received the B.Sc. degree in electrical engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2017. He is currently working toward the Ph.D. degree at the University of California, Riverside, CA, USA. His current research interests include computational imaging, computer vision, and signal processing.

M. Salman Asif received the B.Sc. degree from the University of Engineering and Technology, Lahore, Pakistan, in 2004, and the M.S.E.E. and Ph.D. degrees from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008 and 2013, respectively. He is an Assistant Professor in the Department of Electrical and Computer Engineering, University of California, Riverside, CA, USA. He was a Research Intern at Mitsubishi Electric Research Laboratories, Cambridge, MA, USA, in the Summer of 2009, and at Samsung Standards Research Laboratory, Richardson, TX, USA, in the Summer of 2010. He was a Senior Research Engineer at Samsung Research America, Dallas, TX, USA, from August 2012 to January 2014, and a Postdoctoral Researcher at Rice University from February 2014 to June 2016. His research interests include compressive sensing, computational and medical imaging, and machine learning.