Deep Compressed Sensing
where δ ∈ (0, 1) is a small constant. The RIP states that the projection from F preserves the distance between two signals, bounded by factors of 1 − δ and 1 + δ. This property holds with high probability for various random matrices F and sparse signals x. It guarantees that minimising the measurement error

    x̂ = arg min_x ‖m − F x‖²₂    (3)

under the constraint that x is sparse leads to accurate reconstruction, x̂ ≈ x, with high probability (Donoho, 2006).¹

¹ The theory can also be proved from the closely related and more general Restricted Eigenvalue condition (Bora et al., 2017). We focus on the RIP in this form for its more straightforward connection with the training loss (see section 3.1).

    ẑ ← ẑ − α ∂Eθ(m, z)/∂z |_{z=ẑ}    (7)

where α is a learning rate. One can take a specified number T of gradient descent steps. Typically, hundreds or thousands of gradient descent steps, and several re-starts from the initial step, are needed to obtain a sufficiently good ẑ (Bora et al., 2017; Bojanowski et al., 2018). This process is illustrated in Figure 1.

This work established the connection between compressed sensing and deep neural networks, and demonstrated performance superior to the Lasso (Tibshirani, 1996), especially when the number of measurements is small. The theoretical properties of CSGM have been examined more closely by Hand & Voroninski (2017), who also proved stronger convergence guarantees.
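For concreteness, the latent optimisation of eq. 7 can be written as a short loop. The following is a minimal PyTorch-style sketch, not the authors' implementation; the generator G, the random matrix F, the measurement m, the latent dimension and the step budget are placeholder assumptions.

    import torch

    def csgm_reconstruct(G, F, m, z_dim, steps=1000, lr=0.1):
        # Gradient descent on z (eq. 7) to minimise the measurement error
        # E(m, z) = ||m - F G(z)||^2, with G and F held fixed.
        # Assumes G(z) returns a flattened signal of shape (1, n) and F has shape (k, n).
        # In practice several random re-starts are used and the best z is kept.
        z = torch.randn(1, z_dim, requires_grad=True)
        optimiser = torch.optim.SGD([z], lr=lr)
        for _ in range(steps):
            optimiser.zero_grad()
            error = ((m - G(z) @ F.t()) ** 2).sum()
            error.backward()
            optimiser.step()
        return G(z).detach(), z.detach()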
More recently, Dhar et al. (2018) proposed additional constraints to allow sparse deviation from the generative model's support set, thus improving generalisation. However, CSGM still suffers from two restrictions:

1. The optimisation for reconstruction is still slow, as it requires thousands of gradient descent steps.

2. It relies on random measurement matrices, which are known to be sub-optimal for highly structured signals such as natural images. Learned measurements can perform significantly better (Weiss et al., 2007).

2.3. Model-Agnostic Meta-Learning

Meta-learning, or learning to learn, allows a model to adapt to new tasks by self-improving (Schmidhuber, 1987). Model-Agnostic Meta-Learning (MAML) provides a general method to adapt parameters for a number of tasks (Finn et al., 2017). Given a differentiable loss function L(Ti; θ) for task Ti sampled from the task distribution ptask(T), task-specific parameters are adapted by gradient descent from the initial parameters θ:

    θi ← θ − α ∇θ L(Ti; θ)    (8)

The initial parameters θ are trained to minimise the loss across all tasks:

    min_θ E_{Ti∼ptask(T)} [L(Ti; θi)]    (9)

Multiple steps and more sophisticated optimisation algorithms can be used in place of eq. 8. Although L is usually a highly non-convex function, by back-propagating through the gradient-descent process only a few gradient steps are sufficient to adapt to new tasks.
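As an illustration of eqs. 8 and 9, a single MAML meta-update with one inner gradient step might look as follows. This is a minimal PyTorch-style sketch under assumed interfaces (a list of parameter tensors and a task_loss(task, params) function), not the authors' implementation.

    import torch

    def maml_step(theta, tasks, task_loss, inner_lr=0.01, outer_lr=0.001):
        # theta: list of meta-parameters (leaf tensors with requires_grad=True).
        meta_loss = 0.0
        for task in tasks:
            # Inner loop (eq. 8): one gradient step from the shared initialisation.
            loss_i = task_loss(task, theta)
            grads = torch.autograd.grad(loss_i, theta, create_graph=True)
            theta_i = [p - inner_lr * g for p, g in zip(theta, grads)]
            # Outer objective (eq. 9): evaluate the adapted parameters.
            meta_loss = meta_loss + task_loss(task, theta_i)
        meta_grads = torch.autograd.grad(meta_loss, theta)
        with torch.no_grad():
            for p, g in zip(theta, meta_grads):
                p -= outer_lr * g
        return meta_loss.detach()

Because the inner step is kept in the graph (create_graph=True), the outer update implicitly uses second-order gradients.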
2.4. Generative Adversarial Networks

A Generative Adversarial Network (GAN) trains a parametrised generator Gθ to fool a discriminator Dφ that tries to distinguish real data from fake data sampled from the generator (Goodfellow et al., 2014). The generator Gθ is a deterministic function that transforms samples z from a source pz(z) into the same space as the data x, which has the distribution pdata(x). This adversarial game can be summarised by the following min-max problem with the value function V(Gθ, Dφ):

    min_{Gθ} max_{Dφ} V(Gθ, Dφ) = E_{x∼pdata(x)} [ln Dφ(x)] + E_{z∼pz(z)} [ln(1 − Dφ(Gθ(z)))]    (10)

GANs are usually difficult to train because of this adversarial game (Balduzzi et al., 2018). Training may either diverge or converge to a bad equilibrium with, for example, collapsed modes, unless extra care is taken in designing and training the model (Radford et al., 2015; Salimans et al., 2016).

A widely adopted trick is to use −ln(D(G(z))) as the objective for the generator (Goodfellow et al., 2014). Compared with eq. 10, this alternative objective avoids saturating the discriminator in the early stage of training, when the generator is too weak. However, it invalidates most theoretical analyses (Hu et al., 2018), since the new adversarial objective is no longer a zero-sum game (eq. 10).
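To make the two generator objectives concrete, the value function in eq. 10 and the widely adopted alternative can be sketched as follows. This is a PyTorch-style illustration, not the paper's code; D is assumed to output probabilities in (0, 1), and D, G, x_real and z are placeholders.

    import torch

    def gan_losses(D, G, x_real, z):
        # Discriminator: ascend eq. 10, i.e. minimise its negation.
        x_fake = G(z).detach()                      # block gradients into G
        d_loss = -(torch.log(D(x_real)).mean()
                   + torch.log(1.0 - D(x_fake)).mean())
        # Generator, minimax form of eq. 10: minimise ln(1 - D(G(z))).
        g_loss_minimax = torch.log(1.0 - D(G(z))).mean()
        # Non-saturating alternative: minimise -ln D(G(z)); stronger gradients
        # early in training, but no longer a zero-sum game with d_loss.
        g_loss_alternative = -torch.log(D(G(z))).mean()
        return d_loss, g_loss_minimax, g_loss_alternative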
In most GAN models, the discriminator becomes useless after training. Recently, Tao et al. (2018) and Azadi et al. (2019) proposed methods that use the discriminator for importance sampling. Our work provides an alternative: our model moves latent representations to areas more likely to generate realistic images, as deemed by the discriminator.

3. Deep Compressed Sensing

We start by showing the benefit of combining meta-learning with the model of Bora et al. (2017). We then generalise the measurement matrices to parametrised measurement functions, including deep neural networks. While previous work relies on random projections as measurement functions, our approach learns measurement functions by imposing the RIP as a training objective. We then derive two novel models by imposing properties other than the RIP on the measurements, including a GAN model with discriminator-guided latent optimisation, which leads to more stable training dynamics and better results.

3.1. Compressed Sensing with Meta-Learning

We hypothesise that the run-time efficiency and performance of CSGM (Bora et al., 2017; see section 2.2) can be improved by training the latent optimisation procedure with meta-learning, that is, by back-propagating through the gradient descent steps (Finn et al., 2017). The latent optimisation procedure for CS models can take hundreds or thousands of steps. By employing meta-learning to optimise this optimisation procedure, we aim to achieve similar results with far fewer updates.

To this end, the model parameters, as well as the latent optimisation procedure, are trained to minimise the expected measurement error:

    min_θ LG ,   for LG = E_{xi∼pdata(x)} [Eθ(mi, ẑi)]    (11)

where ẑi is obtained from gradient descent (eq. 7). The gradient descent in eq. 7 and the loss function in eq. 11 mirror their counterparts in MAML (eq. 8 and 9), except that:
1. Instead of the stochastic gradient computed in the outer loop, here each measurement error Eθ only depends on a single sample z, so eq. 7 computes the exact gradient of Eθ.

2. The online optimisation is over latent variables rather than parameters. There are usually far fewer latent variables than parameters, so the update is quicker.

Like in MAML, we implicitly perform second-order optimisation by back-propagating through the latent optimisation steps that compute ẑi when optimising eq. 11. We empirically observed that this dramatically improves the efficiency of latent optimisation, with only 3-5 gradient descent steps being sufficient to improve upon baseline methods.
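This back-propagation through the unrolled inner loop can be sketched as follows. It is a PyTorch-style illustration under assumed interfaces (a measurement-error function E(m, z) such as eq. 13), not the authors' code; create_graph=True is what makes the outer loss in eq. 11 differentiable with respect to the model parameters.

    import torch

    def latent_optimise(E, m, z_init, steps=3, alpha=0.1):
        # Unrolled gradient descent on z (eq. 7). Keeping the graph of each
        # step means the returned z is a differentiable function of the model
        # parameters, so minimising eq. 11 implicitly uses second-order
        # gradients, as in MAML.
        z = z_init.detach().requires_grad_(True)
        for _ in range(steps):
            error = E(m, z)
            grad_z, = torch.autograd.grad(error, z, create_graph=True)
            z = z - alpha * grad_z
        return z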
Unlike Bora et al. (2017), we also train the generator Gθ. Merely minimising eq. 11 would fail: the generator could exploit F by mapping all Gθ(z) into the null space of F. This trivial solution always gives zero measurement error, but may contain no useful information. Our solution is to enforce the RIP (eq. 2) via training, by minimising the measurement loss

    LF = E_{x1,x2} [(‖F(x1 − x2)‖₂ − ‖x1 − x2‖₂)²]    (12)

x1 and x2 can be sampled in various ways. While the choice is not unique, it is important to sample from both the data distribution pdata(x) and the generated samples Gθ(z), so that the trained RIP holds for both real and generated data. In our experiments, we randomly sampled one image from the data and two generated images, at the beginning and at the end of latent optimisation, and averaged the losses over the three pairs formed by these three points, as a form of "triplet loss".
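The measurement loss in eq. 12, with the triplet-style sampling described above, might be computed as in the sketch below (PyTorch-style; F is the random matrix, and x_data, x_gen_start, x_gen_end stand for the sampled data image and the two generated images; batch shapes are assumptions).

    import torch

    def rip_loss(F, x_data, x_gen_start, x_gen_end):
        # One term of eq. 12: (||F(x1 - x2)|| - ||x1 - x2||)^2, averaged over the batch.
        def pair_term(x1, x2):
            d = (x1 - x2).flatten(1)                 # pairwise difference, shape (B, n)
            return ((d @ F.t()).norm(dim=1) - d.norm(dim=1)).pow(2).mean()
        # Average over the three pairs formed by the data sample and the
        # generated samples at the start and end of latent optimisation.
        return (pair_term(x_data, x_gen_start)
                + pair_term(x_data, x_gen_end)
                + pair_term(x_gen_start, x_gen_end)) / 3.0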
Our algorithm is summarised in Algorithm 1. Since Algorithm 1 still uses a random measurement matrix F, it can be used like any other CS algorithm when ground-truth reconstructions are available for training the generator.

Algorithm 1  Compressed Sensing with Meta-Learning
    Input: minibatches of data {xi}, i = 1..N; random matrix F; generator Gθ; learning rate α; number of latent optimisation steps T
    Initialise generator parameters θ
    repeat
        for i = 1 to N do
            Measure the signal: mi ← F xi
            Sample ẑi ∼ pz(z)
            for t = 1 to T do
                Optimise ẑi ← ẑi − α ∂Eθ(mi, ẑi)/∂ẑi
            end for
        end for
        LG ← (1/N) Σ_{i=1..N} Eθ(mi, ẑi)
        Compute LF using eq. 12
        Update θ ← θ − ∂(LG + LF)/∂θ
    until the maximum number of training steps is reached

3.2. Deep Compressed Sensing with Learned Measurement Function

In Algorithm 1, we use the RIP property to train the generator. We can use the same approach to enforce the RIP property and learn the measurement function F itself, rather than using a random projection.
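Putting the pieces together, one outer step of Algorithm 1 could look like the sketch below. It reuses the hypothetical latent_optimise and rip_loss helpers sketched earlier; the optimiser (assumed to hold the generator's parameters), the shapes and the hyper-parameters are assumptions rather than the paper's implementation.

    import torch

    def train_step(G, F, x_batch, z_dim, optimiser, T=3, alpha=0.1):
        # One outer iteration of Algorithm 1 for a minibatch x_batch.
        m = x_batch.flatten(1) @ F.t()                      # m_i <- F x_i
        z0 = torch.randn(x_batch.size(0), z_dim)            # z_i ~ p_z(z)

        def E(m_batch, z):                                  # measurement error
            return ((m_batch - G(z).flatten(1) @ F.t()) ** 2).sum(dim=1).mean()

        x_gen_start = G(z0)                                 # before latent optimisation
        z_hat = latent_optimise(E, m, z0, steps=T, alpha=alpha)
        x_gen_end = G(z_hat)                                # after latent optimisation

        L_G = E(m, z_hat)                                   # eq. 11, minibatch estimate
        L_F = rip_loss(F, x_batch, x_gen_start, x_gen_end)  # eq. 12, triplet sampling
        optimiser.zero_grad()
        (L_G + L_F).backward()                              # update theta on L_G + L_F
        optimiser.step()
        return L_G.item(), L_F.item()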
3.2.1. Learning the Measurement Function

We start by generalising the measurement matrix F (eq. 1) and define a parametrised measurement function m ← Fφ(x). The model introduced in the previous section corresponds to the linear function Fφ(x) = F x; now both Fφ and Gθ can be deep neural networks. As in CS, the central problem in this generalised setting is to invert the measurement function and recover the signal, x ← Fφ⁻¹(m), by minimising a measurement error similar to eq. 6:

    Eθ(m, z) = ‖m − Fφ(Gθ(z))‖²₂    (13)

The distance-preserving property, as a counterpart of the RIP, can be enforced by minimising a loss similar to eq. 12:

    LF = E_{x1,x2} [(‖Fφ(x1 − x2)‖₂ − ‖x1 − x2‖₂)²]    (14)

Minimising LF provides a relaxation of the constraint specified by the RIP (eq. 2): when LF is small, the projection from Fφ better preserves the distance between x1 and x2. This relaxation enables us to transform the RIP into a training objective for the measurements, which can then be integrated into the training of other model components. Empirically, we found that this relaxation leads to high-quality reconstructions.

The rest of the algorithm is identical to Algorithm 1, except that we also update the measurement function's parameters φ. Consequently, different schemes can be employed to coordinate the updates of θ and φ, which will be discussed further in section 3.3. This extended algorithm is summarised in Algorithm 2. We call it Deep Compressed Sensing (DCS) to emphasise that both the measurement and the reconstruction can be deep neural networks. Next, we turn to generalising the measurements to properties other than the RIP.

Algorithm 2  Deep Compressed Sensing
    Input: minibatches of data {xi}, i = 1..N; measurement function Fφ; generator Gθ; learning rate α; number of latent optimisation steps T
    Initialise parameters θ, φ
    repeat
        for i = 1 to N do
            Measure the signal: mi ← Fφ(xi)
            Sample ẑi ∼ pz(z)
            for t = 1 to T do
                Optimise ẑi ← ẑi − α ∂Eθ(mi, ẑi)/∂ẑi
            end for
        end for
        LG ← (1/N) Σ_{i=1..N} Eθ(mi, ẑi)
        Compute LF using eq. 14
        Option 1 (joint update): θ ← θ − ∂(LG + LF)/∂θ,  φ ← φ − ∂(LG + LF)/∂φ
        Option 2 (alternating update): θ ← θ − ∂LG/∂θ,  φ ← φ − ∂LF/∂φ
    until the maximum number of training steps is reached

3.2.2. Generalised CS 1: CS-GAN

Here we consider an extreme case: a one-dimensional measurement that only encodes how likely it is that an input is a real data point rather than a fake one sampled from the generator. One way to formulate this is to train the measurement function Fφ using the following loss instead of eq. 14:

    LF = ‖Fφ(x) − 1‖²₂    if x ∼ pdata(x)
         ‖Fφ(x̂)‖²₂       if x̂ ∼ Gθ(ẑ), for any ẑ    (15)

Algorithm 2 then becomes Least Squares Generative Adversarial Networks (LSGAN; Mao et al., 2017) with latent optimisation; the two are exactly equivalent when latent optimisation is disabled (T = 0, i.e., zero steps).
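Eq. 15 amounts to a least-squares real/fake target on the one-dimensional measurement, as in LSGAN. A one-function PyTorch-style rendering, with F_phi, x_real and x_fake as placeholder names:

    def ls_measurement_loss(F_phi, x_real, x_fake):
        # Eq. 15: push measurements of real data towards 1 and measurements
        # of generated data towards 0 (inputs are assumed to be tensors).
        return ((F_phi(x_real) - 1.0) ** 2).mean() + (F_phi(x_fake) ** 2).mean()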
LSGAN is an alternative to the original GAN (Goodfellow et al., 2014) that can be motivated from the Pearson χ² divergence. To demonstrate a closer connection with the original GAN, we instead focus on another formulation whose measurement function is a binary classifier (the discriminator).

This is realised by using a binary classifier Dφ as the measurement function, where we interpret Dφ(x) as the probability that x comes from the dataset. In this case, the measurement function is equivalent to the discriminator in GANs. Consequently, we replace the squared loss in eq. 13 with the cross-entropy loss as the matching measurement loss function (Bishop, 2006), ignoring the expectation over x for brevity:

    LF = −t(x) ln[Dφ(x)] − (1 − t(x)) ln[1 − Dφ(x)]    (16)

where the binary scalar t(x) is an indicator that identifies whether x is a real data point:

    t(x) = 1  if x ∼ pdata(x);   t(x) = 0  if x ∼ Gθ(z), for any z    (17)

Similarly, a cross-entropy measurement error is employed to quantify the discrepancy between Dφ(Gθ(z)) and the scalar measurement m = Dφ(x):

    Eθ(m, z) = −m ln[Dφ(Gθ(z))] − (1 − m) ln[1 − Dφ(Gθ(z))]    (18)

At the minimum of LF = 0 (eq. 16), the optimal measurement function is achieved by the perfect classifier:

    Dφ(x) = 1  if x ∼ pdata(x);   Dφ(x) = 0  if x ∼ Gθ(z), for any z    (19)

We can therefore simplify eq. 18 by replacing m with its target value 1, as in teacher forcing (Williams & Zipser, 1989):

    E(m, z) = −ln[Dφ(Gθ(z))]    (20)

This objective recovers the vanilla GAN formulation with the commonly used alternative loss (Goodfellow et al., 2014), which we have derived here as a measurement error. When latent optimisation is disabled (T = 0), Algorithm 2 is identical to a vanilla GAN.

In our experiments (section 4.2), we observed that the additional latent optimisation steps introduced from the CS perspective significantly improved GAN training. We reckon this is because latent optimisation moves the representation to areas more likely to generate realistic images, as deemed by the discriminator. Since the gradient descent process remains local, the latent representations are still spread broadly in latent space, which avoids mode collapse. Although a sufficiently powerful generator Gθ can transform the source pz(z) into an arbitrarily complex distribution, a more informative source, as implicitly manifested by the optimised z, may significantly reduce the complexity required of Gθ, thus striking a better trade-off in terms of overall computation.
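A condensed sketch of one CS-GAN update is given below, with the discriminator playing the role of the measurement function and the latent step reusing the hypothetical latent_optimise helper from section 3.1. This is PyTorch-style pseudocode with placeholder networks and optimisers, not the paper's implementation; binary_cross_entropy corresponds to the negative log-likelihoods of eqs. 16 and 18, and the target of 1 to the simplification in eq. 20.

    import torch
    import torch.nn.functional as F_nn

    def cs_gan_step(G, D, x_real, z_dim, opt_g, opt_d, T=3, alpha=0.1):
        # D is assumed to output probabilities of shape (B, 1).
        ones = torch.ones(x_real.size(0), 1)
        zeros = torch.zeros(x_real.size(0), 1)

        # Latent optimisation: descend the measurement error of eq. 20,
        # moving z towards "real" as judged by D.
        def E(target, z):
            return F_nn.binary_cross_entropy(D(G(z)), target)

        z0 = torch.randn(x_real.size(0), z_dim)
        z_hat = latent_optimise(E, ones, z0, steps=T, alpha=alpha)
        x_fake = G(z_hat)

        # Discriminator update: eq. 16 with t(x) = 1 for data, 0 for samples.
        d_loss = (F_nn.binary_cross_entropy(D(x_real), ones)
                  + F_nn.binary_cross_entropy(D(x_fake.detach()), zeros))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator update: eq. 20, back-propagating through the unrolled steps.
        g_loss = F_nn.binary_cross_entropy(D(G(z_hat)), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

With T = 0 the latent step is skipped and the routine reduces to an ordinary GAN update, as stated above.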
3.2.3. Generalised CS 2: Semi-Supervised GANs

So far, we have shown two extreme cases of Deep Compressed Sensing: in one case, the distance-preserving measurements (section 3.2.1) essentially encode all the information needed to recover the original signals; on the other hand, the CS-GAN (section 3.2.2) has one-dimensional measurements that only indicate whether signals are real or fake. We now seek a middle ground, by using measurements that preserve class information for labelled data.

We generalise the CS-GAN by replacing the binary classifier (discriminator) Dφ with a multi-class classifier Cφ. For data with K classes, this classifier outputs K + 1 classes, with the (K + 1)'th class reserved for "fake" data that comes from the generator. This specification is the same as the classifier used in semi-supervised GANs (SGANs; Salimans et al., 2016). Consequently, we extend the binary indicator function in eq. 17 to a multi-class indicator, so that its k'th element tk(x) = 1 when x is in class k. The k'th output of the classifier, Cφ^k(x), indicates the predicted probability that x is in the k'th class, and a multi-class cross-entropy loss (eq. 21) is employed as the measurement loss.
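Eq. 21 itself is not reproduced in this excerpt, so the sketch below is a hypothetical rendering that follows the standard (K+1)-class classifier loss of semi-supervised GANs (Salimans et al., 2016); C, x_labelled, y, x_fake and num_classes are placeholder names.

    import torch
    import torch.nn.functional as F_nn

    def sgan_measurement_loss(C, x_labelled, y, x_fake, num_classes):
        # Assumed form of the class-preserving measurement loss: C(x) returns
        # unnormalised logits over K+1 classes; classes 0..K-1 are the real
        # classes and class K (index num_classes) is reserved for "fake".
        fake_label = torch.full((x_fake.size(0),), num_classes, dtype=torch.long)
        return (F_nn.cross_entropy(C(x_labelled), y)
                + F_nn.cross_entropy(C(x_fake), fake_label))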
Model      Property               Loss
CS         RIP                    N/A
DCS        Trained RIP            eq. 14
CS-GAN     Validity preserving    eq. 16
CS-SGAN    Class preserving       eq. 21

Here β is a scalar controlling the strength of the regulariser. The regulariser encourages small moves of z during optimisation, and can be interpreted as approximating an optimal transport cost (Villani, 2008). We found that values of β between 1.0 and 10.0 made little difference in training, and used β = 3.0 in our experiments with CS-GAN.
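The regularised objective itself is not reproduced in this excerpt. Purely as an illustration of "small moves of z", one assumed form is a squared penalty on the latent move added to the measurement error, as in the sketch below; the function names and the exact penalty are assumptions, not the paper's equation.

    def regularised_error(E, m, z, z_prev, beta=3.0):
        # Hypothetical sketch: add beta * ||z - z_prev||^2 to the measurement
        # error, discouraging large latent moves (a squared-distance proxy
        # for a transport cost). Inputs are assumed to be tensors.
        return E(m, z) + beta * ((z - z_prev) ** 2).sum()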
References

LeCun, Y., Cortes, C., and Burges, C. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Lu, X., Dong, W., Wang, P., Shi, G., and Xie, X. ConvCSNet: A convolutional compressive sensing framework based on deep learning. arXiv preprint arXiv:1801.10342, 2018.

Lustig, M., Donoho, D., and Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182-1195, 2007.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.

MacKay, D. J. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794-2802, 2017.

Metzler, C., Mousavi, A., and Baraniuk, R. Learned D-AMP: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pp. 1772-1783, 2017.

Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Miyato, T. and Koyama, M. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Mousavi, A., Patel, A. B., and Baraniuk, R. G. A deep learning approach to structured signal recovery. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1336-1343. IEEE, 2015.

Mousavi, A., Dasarathy, G., and Baraniuk, R. G. DeepCodec: Adaptive sensing and recovery via deep convolutional neural networks. arXiv preprint arXiv:1707.03386, 2017.

Mousavi, A., Dasarathy, G., and Baraniuk, R. G. A data-driven and distributed approach to sparse signal representation and recovery. In International Conference on Learning Representations, 2018.

Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642-2651. JMLR.org, 2017.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146-157. Springer, 2017.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Sun, J., Li, H., Xu, Z., et al. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems, pp. 10-18, 2016.

Tao, C., Chen, L., Henao, R., Feng, J., and Duke, L. C. Chi-square generative adversarial network. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4887-4896, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/tao18b.html.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288, 1996.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Weiss, Y., Chang, H. S., and Freeman, W. T. Learning compressed sensing. In Snowbird Learning Workshop, Allerton, CA. Citeseer, 2007.

Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.