
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 8, APRIL 15, 2017

Scalable Solvers of Random Quadratic Equations via Stochastic Truncated Amplitude Flow

Gang Wang, Student Member, IEEE, Georgios B. Giannakis, Fellow, IEEE, and Jie Chen, Senior Member, IEEE

Abstract—A novel approach termed stochastic truncated amplitude flow (STAF) is developed to reconstruct an unknown n-dimensional real-/complex-valued signal x from m "phaseless" quadratic equations of the form ψᵢ = |⟨aᵢ, x⟩|. This problem, also known as phase retrieval from magnitude-only information, is NP-hard in general. Adopting an amplitude-based nonconvex formulation, STAF leads to an iterative solver comprising two stages: s1) orthogonality-promoting initialization through a stochastic variance reduced gradient algorithm; and s2) a series of iterative refinements of the initialization using stochastic truncated gradient iterations. Both stages involve a single equation per iteration, thus rendering STAF a simple, scalable, and fast approach amenable to the large-scale implementations needed when n is large. When {aᵢ}ᵢ₌₁ᵐ are independent Gaussian, STAF provably recovers exactly any x ∈ Rⁿ exponentially fast based on the order of n quadratic equations. STAF is also robust in the presence of additive noise of bounded support. Simulated tests involving real Gaussian {aᵢ} vectors demonstrate that STAF empirically reconstructs any x ∈ Rⁿ exactly from about 2.3n magnitude-only measurements, outperforming state-of-the-art approaches and narrowing the gap from the information-theoretic number of equations m = 2n − 1. Extensive experiments using synthetic data and real images corroborate the markedly improved performance of STAF over existing alternatives.

Index Terms—Nonconvex optimization, phase retrieval, variance reduction, Kaczmarz algorithm.

Manuscript received September 15, 2015; revised December 15, 2016; accepted January 5, 2017. Date of publication January 16, 2017; date of current version February 7, 2017. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yue Rong. This work was supported in part by NSF under Grant 1500713 and Grant 1514056.
G. Wang is with the Digital Technology Center and the Electrical and Computer Engineering Department, University of Minnesota, Minneapolis, MN 55455 USA, and also with the School of Automation, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).
G. B. Giannakis is with the Digital Technology Center and the Electrical and Computer Engineering Department, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]).
J. Chen is with the School of Automation and State Key Laboratory of Intelligent Control and Decision of Complex Systems, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2017.2652392
1053-587X © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

I. INTRODUCTION

Consider the fundamental problem of reconstructing a general signal vector from magnitude-only measurements, e.g., the magnitude of the Fourier transform or any linear transform of the signal. This problem, also known as phase retrieval [1], arises in many fields of science and engineering, ranging from X-ray crystallography [2] and optics [3] to coherent diffraction imaging [4]. In such settings, due to the physical limitations of optical detectors such as photosensitive films, charge-coupled device (CCD) cameras, and human eyes, one records only the intensity of light (which describes the absolute counts of photons or electrons that strike the detectors) but loses the phase (where the wave peaks and troughs lie) [5]. It is known that when collecting the diffraction pattern at a large enough distance from the imaging plane, the field is given by the Fourier transform of the image (up to a known phase factor). Therefore, those optical devices in the far field essentially measure only the squared modulus of the Fourier transform of the object, whereas the phase of the incident light reaching the detector is missing. Nevertheless, much information is contained in the Fourier phase. It has been well documented that the Fourier phase of an image often encodes more structural information than its Fourier magnitude [6]. Recovering the phase from magnitude-only measurements is thus of paramount practical relevance. Further details concerning recent advances in the theory and practice of phase retrieval can be found in the review [5].

Succinctly stated, generalized phase retrieval amounts to solving a system of "phaseless" quadratic equations taking the form

$$\psi_i = |\langle a_i, x\rangle|, \qquad 1 \le i \le m \tag{1}$$

where x ∈ Rⁿ or Cⁿ is the wanted unknown, aᵢ ∈ Rⁿ or Cⁿ are known sensing/feature vectors, and ψ := [ψ₁ ··· ψₘ]ᵀ is the observed data vector. Equivalently, (1) can also be given in its squared form as yᵢ = |⟨aᵢ, x⟩|², where yᵢ := ψᵢ² denotes the intensity or the squared modulus.

In classical discretized one-dimensional (1D) phase retrieval, the amplitude vector ψ corresponds to the m-point (typically, m = 2n − 1) Fourier transform of the length-n signal x [5]. It has been established using the fundamental theorem of algebra that there is no unique solution in discretized 1D phase retrieval, even if one fixes the trivial ambiguities resulting from operations that preserve Fourier magnitudes, including the global phase shift, conjugate inversion, and spatial shift [7]. In fact, there are up to $2^{n-2}$ generally distinct signals with common ψ beyond trivial ambiguities [7]. To overcome this ill-posed character of 1D phase retrieval, different approaches have been suggested. Additional constraints on the unknown signal such as sparsity or non-negativity are enforced in [8]–[10] and [12]–[15]. Other viable options include introducing specific redundancy into the measurements leveraging, for example, the short-time Fourier transform [5], [16], or masks [17], or simply assuming random measurements (e.g., random Gaussian {aᵢ}


designs) [1], [12], [18], [19]. For analytic concreteness, we will henceforth assume random measurements ψᵢ collected from the real-valued Gaussian model (1), with independently and identically distributed (i.i.d.) aᵢ ∼ N(0, Iₙ). To demonstrate the effectiveness of our proposed algorithm, experimental implementations for the complex-valued Gaussian model with i.i.d. aᵢ ∼ CN(0, Iₙ) := N(0, Iₙ/2) + jN(0, Iₙ/2), and using real images, will be included as well.

It has been recently proved that when m ≥ 2n − 1 or m ≥ 4n − 4 generic measurements (e.g., from the Gaussian models) are acquired, the system in (1) uniquely determines an n-dimensional real- or complex-valued x (up to a global sign or phase) [20], [21], respectively. In the real case, m = 2n − 1 generic measurements are also proved necessary for uniqueness [20]. Postulating existence of a unique solution x, our goal is to devise simple yet effective algorithms amenable to large-scale implementation: i) that provably reconstruct x from a near-optimal number of quadratic equations as in (1); and ii) that simultaneously feature low per-iteration and computational complexities as well as a linear convergence rate.

Being a particular instance of nonconvex quadratic programming, the problem of solving quadratic equations subsumes as special cases various classical combinatorial optimization tasks involving Boolean variables (e.g., the NP-complete stone problem [22, Section 3.4.1], [18]). Considering for instance real-valued vectors aᵢ and x, this problem boils down to assigning signs sᵢ = ±1 such that the solution to the system of linear equations ⟨aᵢ, x⟩ = sᵢ√yᵢ, denoted by z, adheres to the given equations |⟨aᵢ, z⟩| = ψᵢ, 1 ≤ i ≤ m. It is clear that there are a total of $2^m$ different combinations of {sᵢ}ᵢ₌₁ᵐ, whereas only two combinations of these signs lead to x up to a global sign. The complex scenario becomes even more complicated: instead of assigning a series of signs {sᵢ}ᵢ₌₁ᵐ, one looks for a collection of unimodular complex constants {σᵢ ∈ C}ᵢ₌₁ᵐ such that the resulting linear system and the original quadratic system are equivalent. Furthermore, solving quadratic equations has also found application in estimating mixtures of linear regressions, in which the latent membership variables are viewed as the missing phases [23]. Despite its practical relevance across various science and engineering fields, solving systems of quadratic equations is combinatorial in nature, and NP-hard in general.

Notation: Lower- (upper-) case boldface letters denote column vectors (matrices), and calligraphic symbols are reserved for sets. The symbol ᵀ (ᴴ) stands for transposition (conjugate transposition), and ⪰ denotes positive semidefiniteness of matrices. For vectors, ‖·‖ signifies the Euclidean norm, and ‖·‖₁ denotes the ℓ₁-norm. The symbol ⌈·⌉ is the ceiling operation that returns the smallest integer greater than or equal to the given number. For a given function g(n) of integer n > 0, Θ(g(n)) denotes the set of functions Θ(g(n)) = {f(n) : there exist positive constants C₁, C₂, and n₀ such that 0 ≤ C₁g(n) ≤ f(n) ≤ C₂g(n) for all n ≥ n₀}; likewise, O(g(n)) = {f(n) : there exist positive constants C and n₀ such that 0 ≤ f(n) ≤ Cg(n) for all n ≥ n₀}, and Ω(g(n)) = {f(n) : there exist positive constants C and n₀ such that 0 ≤ Cg(n) ≤ f(n) for all n ≥ n₀}.

A. Prior Art

Adopting the least-squares criterion (which would coincide with the maximum likelihood one when assuming an additive white Gaussian noise model), the task of tackling the quadratic system in (1) can be reformulated as that of minimizing the following amplitude-based empirical loss [9], [12], [19]

$$\underset{z\in\mathbb{C}^n}{\text{minimize}}\quad \frac{1}{2m}\sum_{i=1}^{m}\left(\psi_i-\left|a_i^{H}z\right|\right)^2 \tag{2}$$

or the intensity-based one [1]

$$\underset{z\in\mathbb{C}^n}{\text{minimize}}\quad \frac{1}{2m}\sum_{i=1}^{m}\left(y_i-\left|a_i^{H}z\right|^2\right)^2 \tag{3}$$

and its counterpart for Poisson data [18]

$$\underset{z\in\mathbb{C}^n}{\text{minimize}}\quad \frac{1}{2m}\sum_{i=1}^{m}\left(\left|a_i^{H}z\right|^2-y_i\log\left|a_i^{H}z\right|^2\right). \tag{4}$$

Unfortunately, all three objective functions are nonconvex because of the modulus in (2), or the quadratic terms in (3) and (4). It is well known that nonconvex functions may exhibit many stationary points, and minimizing nonconvex objectives is in general NP-hard, and hence computationally intractable [24]. It is worth stressing that it is difficult to establish convergence even to a local minimum, due to the existence of complicated saddle point structures [24]–[26].

Past approaches for solving quadratic equations can be grouped in two categories: convex and nonconvex ones. The nonconvex ones include the "workhorse" alternating projection algorithms [9], [27]–[29], AltMinPhase [12] and TAF [14], [15], [19], [30], trust-region [31] and majorization-minimization [32], [33] methods, as well as the recently proposed Wirtinger-based variants such as (truncated) Wirtinger flow (WF/TWF) [1], [18], [34]. Based on STFT measurements, gradient descent-type algorithms starting with a least-squares initialization provably recover the signal from magnitude-only information under appropriate conditions [16]. Stochastic or incremental counterparts consisting of Kaczmarz and ITWF iterations have been reported too [35], [36]. On the other hand, the convex alternatives typically rely upon the so-called matrix-lifting technique to derive semidefinite programming-based solvers such as PhaseLift [37], PhaseCut [38], and CoRK [39]. For the Gaussian model, comparisons between convex and nonconvex solvers, in terms of the sample complexity and computational complexity needed to acquire an ε-accurate solution, are listed in Table I.

B. This Paper

Adopting the amplitude-based nonconvex formulation, this paper puts forth a new algorithm, referred to as stochastic truncated amplitude flow (STAF). STAF offers an iterative algorithm that builds upon but considerably broadens the scope of TAF [19]. Specifically, it operates in two stages: Stage one employs a stochastic variance reduced gradient algorithm to obtain an orthogonality-promoting initialization, whereas the second stage applies stochastic truncated amplitude-based iterations to refine the initial estimate. Our approach is shown capable of


TABLE I
COMPARISONS OF DIFFERENT ALGORITHMS

Algorithm                      | Sample complexity m                 | Computational complexity
PhaseLift [37]                 | O(n)                                | O(n³/ε²)
PhaseCut [38]                  | O(n)                                | O(n³/√ε)
AltMinPhase [12]               | O(n log n (log² n + log(1/ε)))      | O(n² log n (log² n + log²(1/ε)))
WF [1]                         | O(n log n)                          | O(n² log n log(1/ε))
TAF [19], TWF [18], ITWF [36]  | O(n)                                | O(n² log(1/ε))
This paper                     | O(n)                                | O(n² log(1/ε))

reconstructing any n-dimensional real-/complex-valued signal x from a nearly minimal number of magnitude-only measurements in linear time. Relative to TAF, the present paper's STAF is well suited for large-scale applications. Besides achieving order-optimal sample and computational complexities, STAF enjoys O(n) per-iteration complexity in both the initialization and refinement stages, which not only improves upon state-of-the-art alternatives that can afford O(n²), but is also order optimal. This makes STAF applicable and appealing for common large-scale imaging phase retrieval settings. Although ITWF adopts an incremental gradient method to achieve O(n) per-iteration complexity in its second stage, its first stage relies on the gradient-type power method of per-iteration complexity O(n²) to obtain a truncated spectral initialization [36]. Moreover, as will be demonstrated by our simulated tests, STAF outperforms state-of-the-art algorithms including TAF, ITWF, and (T)WF on both synthetic data and real images, in terms of both exact recovery performance and convergence speed. Specifically, for the real-valued Gaussian model, STAF empirically reconstructs any real-valued n-dimensional signal x from a number m ≈ 2.3n of magnitude measurements, which is close to the information-theoretic limit of m = 2n − 1. In sharp contrast, existing alternatives such as TAF, ITWF, and (T)WF typically require a few times more measurements to achieve exact recovery. Markedly improved performance is also witnessed for STAF when the complex-valued Gaussian model and coded diffraction patterns of real images [17] are employed.

Paper outline: The rest of the paper is organized as follows. Section II first reviews the truncated amplitude flow (TAF) algorithm, and subsequently motivates and derives the two stages of our proposed STAF algorithm. Section III summarizes STAF and establishes its theoretical performance. Extensive tests comparing STAF with state-of-the-art approaches on both synthetic data and real images are presented in Section IV. Finally, the main proofs are given in Section V, and technical details can be found in the Appendix.

II. ALGORITHM: STOCHASTIC TRUNCATED AMPLITUDE FLOW

In this section, TAF is first reviewed, and its limitations for large-scale applications are pointed out. To cope with these limitations, simple, scalable, and fast stochastic gradient descent (SGD)-type algorithms for both the initialization and gradient refinement stages are developed.

To begin with, a number of basic concepts are introduced. If x in the real case solves (1), so does −x. In the complex case, the solution set becomes {xe^{jφ}, ∀φ ∈ [0, 2π)}. This prompts the following definition of the Euclidean distance of any estimate z to the solution set of (1): dist(z, x) := min ‖z ± x‖ for real-valued signals, and dist(z, x) := min_{φ∈[0,2π)} ‖z − xe^{jφ}‖ for complex ones [1]. Define also the indistinguishable global phase constant in the real case as

$$\phi(z) := \begin{cases} 0, & \|z-x\| \le \|z+x\|,\\ \pi, & \text{otherwise.}\end{cases} \tag{5}$$

Henceforth, letting x be any solution of the given system in (1), we assume that φ(z) = 0; otherwise, z is replaced by e^{−jφ(z)}z, but for brevity of exposition, the phase adaptation term e^{−jφ(z)} shall be dropped whenever it is clear from the context.

A. Truncated Amplitude Flow

In this section, the two stages of TAF are outlined [19]. In stage one, TAF employs power iterations to compute an orthogonality-promoting initialization, while the second stage refines the initialization via gradient-type iterations. The orthogonality-promoting initialization builds upon a basic characteristic of high-dimensional spaces, which asserts that high-dimensional random vectors are almost always nearly orthogonal to each other [19]. Its core idea relies on approximating the unknown x by a vector z₀ ∈ Rⁿ most orthogonal to a carefully selected subset of design vectors {aᵢ}_{i∈I₀}, with the index set I₀ ⊆ [m] := {1, 2, . . . , m}. It is well known that the geometric relationship between any nonzero vectors p ∈ Rⁿ and q ∈ Rⁿ can be captured by their squared normalized inner-product, defined as cos²θ := |⟨p, q⟩|²/(‖p‖²‖q‖²), where θ ∈ [0, π] signifies the angle between p and q. Intuitively, the smaller cos²θ is, the more orthogonal the two vectors are. Assume with no loss of generality that ‖x‖ = 1, which will be justified shortly. Upon obtaining the squared normalized inner-products for all pairs {(aᵢ, x)}ᵢ₌₁ᵐ, collectively denoted by {cos²θᵢ}ᵢ₌₁ᵐ with θᵢ denoting the angle between aᵢ and x, the orthogonality-promoting initialization constructs I₀ by including the indices of the aᵢ's that produce the smallest |I₀| normalized inner-products. Precisely, z₀ can be found by solving [19]

$$\underset{\|z\|=1}{\text{minimize}}\quad z^{T}\left(\frac{1}{|\mathcal{I}_0|}\sum_{i\in \mathcal{I}_0}\frac{a_i a_i^{T}}{\|a_i\|^2}\right) z \tag{6}$$

where |I₀| is on the order of n. To be precise, as shown in [19, Theorem 1], one requires for exact recovery of TAF that m ≥


c₁|I₀| ≥ c₂n holds for certain numerical constants c₁, c₂ > 0.

Solving (6) amounts to finding the smallest eigenvalue and the associated eigenvector of Y₀ := (1/|I₀|)Σ_{i∈I₀} aᵢaᵢᵀ/‖aᵢ‖² ⪰ 0. Nevertheless, to avoid the O(n³) computational complexity of computing the eigenvector associated with the smallest eigenvalue in (6), an application of the standard concentration result

$$\sum_{i=1}^{m}\frac{a_i a_i^{T}}{\|a_i\|^2} \approx \frac{m}{n}\, I_n$$

simplifies that to computing the principal eigenvector of Ȳ₀ := (1/|Ī₀|)Σ_{i∈Ī₀} aᵢaᵢᵀ/‖aᵢ‖², where Ī₀ is the complement of I₀ in [m]. Upon collecting {aᵢ}_{i∈Ī₀} into an n × |Ī₀| data matrix D, one can rewrite Ȳ₀ = DDᵀ to arrive at the following principal component analysis (PCA) problem

$$\tilde{z}_0 := \arg\max_{\|z\|=1}\ \frac{1}{|\bar{\mathcal{I}}_0|}\, z^{T} D D^{T} z. \tag{7}$$

On the other hand, if ‖x‖ ≠ 1, the estimate z̃₀ is scaled by √((1/m)Σᵢ₌₁ᵐ yᵢ), a norm estimate of ‖x‖, to yield z₀ := √((1/m)Σᵢ₌₁ᵐ yᵢ) z̃₀. Further details can be found in [19, Section II.B].

Algorithm 1: Power Method.
1: Input: Matrix Ȳ₀ = DDᵀ.
2: Initialize a unit vector u₀ ∈ Rⁿ randomly.
3: For t = 0 to T − 1 do
     u_{t+1} = Ȳ₀u_t / ‖Ȳ₀u_t‖.
4: End for
5: Output: z̃₀ = u_T.

Fig. 1. Eigengaps δ of Ȳ₀ ∈ R^{n×n} averaged over 100 Monte Carlo realizations for fixed n = 10,000 and m/n varying by 0.2 from 1 to 6. Top: Real-valued Gaussian model with x ∼ N(0, Iₙ) and aᵢ ∼ N(0, Iₙ). Bottom: Complex-valued Gaussian model with x ∼ CN(0, Iₙ) and aᵢ ∼ CN(0, Iₙ).
When the signal dimension n is modest, problem (7) can be solved exactly by a full singular value decomposition (SVD) of D [40]. Yet this has running time of O(min{n²|Ī₀|, n|Ī₀|²}) (or simply O(n³), because |Ī₀| is required to be on the order of n), which grows prohibitively in large-scale applications. A common alternative is the power method tabulated in Algorithm 1, which was also employed by [1], [18], [19], [36] to find an initialization [40]. The power method, on the other hand, involves a matrix-vector multiplication Ȳ₀u_t per iteration, thus incurring per-iteration complexity of O(n|Ī₀|) or O(n²) by passing through the selected data {aᵢ}_{i∈Ī₀}. Furthermore, to produce an ε-accurate solution, it incurs runtime of [40]

$$O\!\left(\frac{n|\bar{\mathcal{I}}_0|}{\delta}\log(1/\epsilon)\right) \tag{8}$$

depending on the eigengap δ > 0, which is defined as the gap between the largest and the second largest eigenvalues of Ȳ₀, normalized by the largest one [40]. It is clear that when the eigengap δ is small, the runtime of O(n|Ī₀| log(1/ε)/δ) required by the power method amounts to many passes over the entire data, which could be prohibitive for large datasets [41]. Hence, the power method may not be appropriate for computing the initialization in large-scale applications, particularly those involving small eigengaps.

The second stage of TAF relies on truncated gradient iterations of the amplitude-based cost function (2). Specifically, with t ≥ 0 denoting the iteration number, the truncated gradient stage starts with the initial estimate z₀ and operates in the following iterative fashion

$$z_{t+1} = z_t - \frac{\mu}{m}\sum_{i\in \mathcal{I}_{t+1}}\left(a_i^{T}z_t-\psi_i\,\frac{a_i^{T}z_t}{|a_i^{T}z_t|}\right) a_i,\quad \forall t\ge 0 \tag{9}$$

where the index set responsible for the gradient regularization is given as [19]

$$\mathcal{I}_{t+1} := \left\{1\le i\le m\ \Big|\ \frac{|a_i^{T}z_t|}{|a_i^{T}x|}\ge \frac{1}{1+\gamma}\right\},\quad \forall t\ge 0 \tag{10}$$

for some regularization parameter γ > 0.

B. Variance-reducing Orthogonality-promoting Initialization

This section first presents empirical evidence showing that small eigengaps appear commonly in the orthogonality-promoting initialization approach. Fig. 1 plots empirical eigengaps of Ȳ₀ ∈ R^{n×n} under the real- and complex-valued Gaussian models over 100 Monte Carlo realizations using the default parameters of TAF, where n = 10,000 is fixed, and m/n, the ratio of the number of equations to unknowns, increases by 0.2 from 1 to 6. As shown in Fig. 1, the eigengaps of Ȳ₀ resulting from the orthogonality-promoting initialization in [19, Algorithm 1]


are rather small, particularly for small m/n close to the information limit 2. Using the power iterations of Algorithm 1, with runtime O(n|Ī₀| log(1/ε)/δ) as in (8), thus entails many passes over the entire data due to the small-eigengap factor 1/δ, and may not perform well for the large dimensions that are common in imaging applications [41]. On the other hand, instead of using the deterministic power method, stochastic and incremental algorithms have been advocated in [41], [42]. These algorithms perform a much cheaper update per iteration by choosing some i_t ∈ Ī₀ either uniformly at random or in a cyclic manner, and updating the current iterate using only a_{i_t}. They are shown to have per-iteration complexity of O(n), which is very appealing for large-scale applications. Building on recent advances in accelerating stochastic optimization schemes [43], a variance-reducing principal component analysis (VR-PCA) algorithm can be found in [41]. VR-PCA performs cheap stochastic iterations, yet its total runtime is O(n(|Ī₀| + 1/δ²) log(1/ε)), which depends only logarithmically on the solution accuracy ε > 0. This is in sharp contrast to the standard SGD variant, whose runtime depends on 1/ε due to the large variance of the stochastic gradients [42].

For the large-scale phase retrieval considered in most imaging applications, this paper advocates using VR-PCA to solve the orthogonality-promoting initialization problem in (7). We refer to the resulting algorithm as the variance-reducing orthogonality-promoting initialization (VR-OPI), summarized in Algorithm 2 next. Specifically, VR-OPI is a double-loop algorithm, with a single execution of the inner loop referred to as an iteration and one execution of the outer loop referred to as an epoch. In practice, the algorithm consists of S epochs, while each epoch runs T (typically taken to be the data size |Ī₀|) iterations. Note that the full gradient evaluated per execution of the outer loop, combined with the stochastic gradients inside the inner loop, can be shown capable of reducing the variance of the stochastic gradients [43].

Algorithm 2: Variance-reduced orthogonality-promoting initialization (VR-OPI).
1: Input: Data matrix D = {aᵢ}_{i∈Ī₀}; step size η = 20/m, number of epochs S = 100, and epoch length T = |Ī₀| (by default).
2: Initialize a unit vector ũ₀ ∈ Rⁿ randomly.
3: For s = 0 to S − 1 do
     w = (1/|Ī₀|) Σ_{i∈Ī₀} aᵢ(aᵢᵀũ_s)
     u₀ = ũ_s.
4:   For t = 0 to T − 1 do
       Pick i_t ∈ Ī₀ uniformly at random
       ν_{t+1} = u_t + η ( a_{i_t}( a_{i_t}ᵀu_t − a_{i_t}ᵀũ_s ) + w )
       u_{t+1} = ν_{t+1}/‖ν_{t+1}‖.
5:   End for
     ũ_{s+1} = u_T.
6: End for
7: Output: z̃₀ = ũ_S.

The following result, adopted from [41, Theorem 1], establishes the linear convergence rate of VR-OPI.

Proposition 1 ([41]): Let v₁ ∈ Rⁿ be an eigenvector of Ȳ₀ associated with the largest eigenvalue λ₁. Assume that max_{i∈[m]} ‖aᵢ‖² ≤ r := 2.3n (which holds with probability at least 1 − me^{−n/2}), that the two largest eigenvalues of Ȳ₀ are λ₁ > λ₂ > 0 with eigengap δ = (λ₁ − λ₂)/λ₁, and that ⟨ũ₀, v₁⟩ ≥ 1/√2. With any 0 < ε, ξ < 1, constant step size η > 0, and epoch length T chosen such that

$$\eta \le \frac{c_0 \xi^2}{r^2}\,\delta, \qquad T \ge \frac{c_1 \log(2/\xi)}{\eta\delta}, \qquad T\eta^2 r^2 + r\eta\sqrt{T\log(2/\xi)} \le c_2 \tag{11}$$

for certain universal constants c₀, c₁, c₂ > 0, successive estimates of VR-OPI (summarized in Algorithm 2) after S = ⌈log(1/ε)/log(2/ξ)⌉ epochs satisfy

$$|\langle \tilde{u}_S, v_1\rangle|^2 \ge 1-\epsilon \tag{12}$$

with probability exceeding 1 − ⌈log₂(1/ε)⌉ξ. Typical parameter values are η = 20/m, S = 100, and T = |Ī₀|.

The proof of Proposition 1 can be found in [41]. Even though the PCA in (7) is nonconvex, the SGD-based VR-OPI algorithm converges to the globally optimal solution under mild conditions [41]. Moreover, fixing any ξ ∈ (0, 1), the conditions in (11) hold true when T is chosen on the order of 1/(ηδ), and η sufficiently smaller than δ/r². Expressed differently, if VR-OPI runs T = Θ(r²/δ²) iterations per epoch for a total of S = Θ(log(1/ε)) epochs, then the returned VR-OPI estimate is ε-accurate with probability at least 1 − ⌈log₂(1/ε)⌉ξ. Since each epoch takes O(n(T + |Ī₀|)) time to implement, the total runtime is

$$O\!\left(n\left(|\bar{\mathcal{I}}_0| + \frac{r^2}{\delta^2}\right)\log(1/\epsilon)\right) \tag{13}$$

which validates the exponential convergence rate of VR-OPI. In addition, when δ/r ≥ Ω(1/√|Ī₀|), the total runtime reduces to O(n|Ī₀| log(1/ε)) up to log-factors. It is worth emphasizing that the required runtime is proportional to the time required to scan the selected data once, which is in stark contrast to the runtime of O(n|Ī₀| log(1/ε)/δ) when using the power method [40]. Simulated tests in Section IV corroborate the effectiveness of VR-OPI over the popular power method in processing data involving large dimensions m and/or n.

C. Stochastic Truncated Gradient Stage

Driven by the need to efficiently process large-scale phaseless data in imaging applications, a stochastic solution algorithm is put forth for minimizing the amplitude-based cost function in (2). To ensure good performance, the gradient regularization rule in (10) is also accounted for, leading to our truncated stochastic gradient iterations. It is worth mentioning that the Kaczmarz method [44] was also used for solving a system of quadratic equations in [35]. However, Kaczmarz variants with block or randomized updates converge at most to a neighborhood of the optimal solution x. The distance between the Kaczmarz estimates and x is bounded in terms of the dimension m and the size of the amplitude data vector ψ measured by the


ℓ₁- or ℓ∞-norm. Nevertheless, the obtained bounds of the form √m‖ψ‖₁ or √m‖ψ‖∞ are rather loose (m is typically very large), and less attractive than the geometric convergence to the global solution x to be established for the stochastic iterations of STAF as well.

Adopting the intensity-based Poisson likelihood function (4), an incremental version of TWF was developed in [36], which provably converges to x in linear time. Albeit achieving improved empirical performance and faster convergence over TWF in terms of the number of passes over the entire data needed to produce an ε-accurate solution [18], the number of measurements it requires for exact recovery is still relatively far from the information-theoretic limits. Specifically, for real-valued Gaussian aᵢ designs, ITWF requires about m ≥ 3.2n noiseless measurements to guarantee exact recovery, relative to 4.5n for TWF [18]. Recall that TAF achieves exact recovery from about 3n measurements [19]. Furthermore, gradient iterations can be trapped in saddle points when dealing with nonconvex optimization. In contrast, stochastic iterations are able to escape saddle points and converge globally to at least a local minimum [25]. Hence, besides the appealing computational advantage, stochastic counterparts of TAF may further improve the performance over TAF, as also asserted by the comparison between ITWF and TWF. In the following, we present two STAF variants: Starting with an initial estimate z₀ found using VR-OPI in Algorithm 2, the first variant successively updates z₀ through amplitude-based stochastic gradient iterations with a constant step size μ > 0 chosen on the order of 1/n, while the second operates much like the Kaczmarz method; both suitably account for the truncation rule in (10).

For simplicity of exposition, let us rewrite the amplitude-based cost function as follows

$$\underset{z\in\mathbb{R}^n}{\text{minimize}}\quad \ell(z) = \sum_{i=1}^{m}\ell_i(z) := \sum_{i=1}^{m}\frac{1}{2}\left(\psi_i - |a_i^{T}z|\right)^2 \tag{14}$$

where the factor 1/2 is introduced for notational convenience. It is clear that the cost ℓ(z), and each ℓᵢ(z) in (14), is nonconvex and nonsmooth; hence, the optimization in (14) is computationally

$$z_{t+1} = z_t - \mu_t\left(a_{i_t}^{T}z_t - \psi_{i_t}\,\frac{a_{i_t}^{T}z_t}{|a_{i_t}^{T}z_t|}\right) 1_{\left\{|a_{i_t}^{T}z_t|/|a_{i_t}^{T}x| \ \ge\ 1/(1+\gamma)\right\}}\, a_{i_t} \tag{15}$$

where μ_t is either set to be a constant μ > 0 on the order of 1/n, or taken as time-varying as in Kaczmarz's iteration, namely, μ_t = 1/‖a_{i_t}‖² [44]. The index i_t is sampled uniformly at random or with given probabilities from {1, 2, . . . , m}, or simply cycles through the entire set [m]. In addition, fixing the truncation threshold to γ = 0.7, the indicator function 1_{{|a_{i_t}ᵀz_t|/|a_{i_t}ᵀx| ≥ 1/(1+γ)}} in (15) takes the

Algorithm 3: Stochastic Truncated Amplitude Flow (STAF).
1: Input: Data {(aᵢ, ψᵢ)}ᵢ₌₁ᵐ; maximum number of iterations T = 500m; by default, step sizes μ = 0.8/n or μ = 1.2/n for the real- or complex-valued Gaussian models, truncation thresholds |Ī₀| = ⌈(1/6)m⌉ and γ = 0.7.
2: Evaluate Ī₀ to consist of the indices associated with the |Ī₀| largest values among {ψᵢ/‖aᵢ‖}.
3: Initialize z₀ = √((1/m)Σᵢ₌₁ᵐ ψᵢ²) z̃₀, where z̃₀ is obtained via Algorithm 2 with Ȳ₀ := (1/|Ī₀|)Σ_{i∈Ī₀} aᵢaᵢᵀ/‖aᵢ‖².
4: For t = 0 to T − 1 do

$$z_{t+1} = z_t - \mu\, a_{i_t}\left(a_{i_t}^{T}z_t - \psi_{i_t}\frac{a_{i_t}^{T}z_t}{|a_{i_t}^{T}z_t|}\right) 1_{\left\{|a_{i_t}^{T}z_t| \ \ge\ \frac{\psi_{i_t}}{1+\gamma}\right\}} \tag{17}$$

   where i_t is sampled uniformly at random from {1, 2, . . . , m}; or,

$$z_{t+1} = z_t - \frac{a_{i_t}}{\|a_{i_t}\|^2}\left(a_{i_t}^{T}z_t - \psi_{i_t}\frac{a_{i_t}^{T}z_t}{|a_{i_t}^{T}z_t|}\right) 1_{\left\{|a_{i_t}^{T}z_t| \ \ge\ \frac{\psi_{i_t}}{1+\gamma}\right\}} \tag{18}$$

   where i_t is sampled at random from {1, 2, . . . , m} with probability proportional to ‖a_{i_t}‖².
5: End for
6: Output: z_T.
t t
ally intractable in general [45]. Along the lines of nonconvex
value 1, if |aTi t z t |/|aTi t x| ≥ 1/(1 + γ) holds true; and 0 oth-
paradigms including WF [1], TWF [18], and TAF [19], our ap-
erwise. It is worth stressing that this truncation rule provably
proach to solving the problem at hand amounts to iteratively
rejects “bad” search directions with high probability. Moreover,
refining the initial estimate z 0 by means of truncated stochas-
this regularization maintains only gradient components of large
tic gradient iterations. This is in contrast to (T)WF and TAF,
enough |aTi z t | values, hence saving the objective function (2)
which rely on (truncated) gradient-type iterations [1], [18], [19].
from being non-differentiable at z t and simplifying the theo-
STAF processes one datum at a time and evaluates the general-
retical analysis. In the context of large-scale linear regressions
ized gradient of one component function i t (z) for some index
or dynamic tracking, similar ideas such as censoring have been
it ∈ {1, 2, . . . , m} per iteration t ≥ 0. Specifically, STAF suc-
pursued [46], [47], [48]. Numerical tests demonstrating the per-
cessively updates z 0 using the following truncated stochastic
formance improvement using the stochastic truncated iterations
gradient iterations for all t ≥ 0
will be presented in Section IV.
z t+1 = z t − μt ∇i t (z t )1{|aTi z t |/|aTi x|≥1/(1+γ )} (15)
t t III. MAIN RESULTS
with The proposed STAF scheme is summarized as Algorithm 3,
with either constant step size μ > 0 in the truncated stochas-
 aT z t  tic gradient iterations in (17), or with time-varying step size
∇i t (z t ) = aTi t z t − ψi t iTt ai t (16)
|ai t z t | μt = 1/ ai t 2 in the truncated Kaczmarz iterations in (18).

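Since magnitude-only data determine the signal only up to a global sign (real case) or a unimodular phase (complex case), the distance dist(z, x) appearing in the convergence claims and error metrics must be taken modulo this ambiguity. A small illustrative helper (our naming, not from the paper's code):

```python
import numpy as np

def dist(z, x):
    """Distance up to a global phase: min over |c| = 1 of ||z - c x||.

    The minimizing phase aligns x with z via c = x^H z / |x^H z|;
    for real-valued vectors this reduces to a sign flip.
    """
    inner = np.vdot(x, z)                      # x^H z (np.vdot conjugates x)
    c = inner / abs(inner) if abs(inner) > 0 else 1.0
    return np.linalg.norm(z - c * x)

def relative_error(z, x):
    """Relative error dist(z, x) / ||x|| used in the simulated tests."""
    return dist(z, x) / np.linalg.norm(x)
```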
WANG et al.: SCALABLE SOLVERS OF RANDOM QUADRATIC EQUATIONS VIA STAF 1967

Fig. 2. Error evolution of the iterates using: i) the power method in Algorithm 1; and ii) the variance-reducing orthogonality-promoting initialization in Algorithm 2 for solving problem (7) with step size η = 1. Top: noiseless real-valued Gaussian model with x ∼ N(0, I_n) and a_i ∼ N(0, I_n), where n = 10^4 and m = 2n − 1. Bottom: noiseless complex-valued Gaussian model with x ∼ CN(0, I_n) and a_i ∼ CN(0, I_n), where n = 10^4 and m = 4n − 4.

Equipped with an initialization obtained using VR-OPI, both STAF variants will be shown to converge at an exponential rate to the globally optimal solution with high probability, as soon as the ratio m/n of the number of equations to unknowns exceeds some numerical constant.

Assuming m independent data samples {(a_i; ψ_i)} drawn from the real-valued Gaussian model, the following establishes the theoretical performance of STAF in the absence of noise.

Theorem 1 (Exact recovery): Consider the noiseless measurements ψ_i = |a_i^T x| with an arbitrary signal x ∈ ℝ^n, and i.i.d. {a_i ∼ N(0, I_n)}_{i=1}^m. If μ_t is either set to a constant μ > 0 as per (17), or is time-varying μ_t = 1/‖a_{i_t}‖² as per (18) with the corresponding index sampling scheme, and also

    m ≥ c_0 n  and  μ ≤ μ_0/n    (19)

then with probability at least 1 − c_1 m exp(−c_2 n), the stochastic truncated amplitude flow (STAF) estimates (tabulated in Algorithm 3 with default parameters) satisfy

    E_{P_t}[ dist²(z_t, x) ] ≤ ρ ( 1 − ν/n )^t ‖x‖²,  t = 0, 1, ...    (20)

for ρ = 1/10 and some numerical constant ν > 0, where the expectation is taken over the path sequence P_t := {i_0, i_1, ..., i_{t−1}}, and c_0, c_1, c_2, μ_0 > 0 are certain universal constants.

The proof of Theorem 1 is deferred to the Appendix. Apparently, the mean-square distance between the iterate and the global solution is reduced by a factor of (1 − ν/n)^m after one pass through the entire data. Heed that the expectation E_{P_t}[·] in (20) is taken over the algorithmic randomness P_t rather than the data. This is important since, in general, the data may be modeled as deterministic. Although only performing the stochastic iterations in (17) and (18), STAF still enjoys a linear convergence rate. This is in sharp contrast to typical SGD methods, where variance reduction techniques controlling the variance of the stochastic gradients are required to achieve a linear convergence rate [41], as in Algorithm 2. Moreover, the largest constant step size that STAF can afford is estimated to be μ_0 = 0.8469, giving rise to a convergence factor of ν = 0.0696 in (20). When the truncated Kaczmarz iterations are implemented, ν is estimated to be 1.0091, much larger than in the constant step size case. Our experience with numerical experiments also confirms that the Kaczmarz-based STAF in (18) converges faster than the constant step-size one in (17), yet it is slightly more sensitive when additive noise is present in the data.

IV. SIMULATED TESTS

This section presents extensive numerical experiments evaluating the performance of STAF using both synthetic data and real images. STAF was thoroughly compared with existing alternatives including TAF [19], (T)WF [1], [18], and ITWF [36]. For fair comparisons, all the parameters pertinent to the implementation of each algorithm were set to their suggested values. The initialization in each scheme was found based on a number of (power/stochastic) iterations equivalent to 100 passes over the entire data, which was subsequently refined by a number of iterations corresponding to 1,000 passes, unless otherwise stated. All simulated estimates were averaged over 100 independent Monte Carlo trials. Two performance evaluation metrics were used: the relative root mean-square error, defined as Relative error := dist(z, x)/‖x‖, and the empirical successful recovery rate among 100 independent runs, in which a success is declared when the returned estimate incurs a relative error less than 10^−5 [1]. Tests using both noiseless and noisy real-/complex-valued Gaussian models ψ_i = |a_i^H x| + η_i were conducted, where the i.i.d. noise obeys η_i ∼ N(0, σ²‖x‖²). The Matlab implementations of STAF can be downloaded from http://www.tc.umn.edu/gangwang/STAF.

The first experiment compares VR-OPI in Algorithm 2 with the power method in Algorithm 1 for solving the orthogonality-promoting initialization optimization in (7). The comparison is carried out in terms of the number of data passes needed to achieve the same solution accuracy, in which one pass through the selected data amounts to a number |I_0| of gradient evaluations of component functions. First, synthetic-data-based experiments are conducted using the real-/complex-valued Gaussian models with n = 10,000 under the known sufficient conditions for uniqueness, i.e., m = 2n − 1 in the real case and m = 4n − 4 in the complex case. Fig. 2 plots the error evolution of the iterates u_t for the power method and VR-OPI, where the error in logarithmic scale is defined as log_10( 1 − ‖D^T u_t‖² / ‖D^T v_0‖² ), with the exact principal eigenvector v_0 computed from the SVD of Y_0 = DD^T in (7). Apparently, the inexpensive stochastic iterations of VR-OPI achieve a given solution accuracy with considerably fewer gradient evaluations or data passes in both the real and complex settings. This is important for tasks with large |I_0|, or equivalently large dimension m (since |I_0| = m/6 by default), because one less data pass implies |I_0| fewer gradient evaluations and thus results in considerable savings in computational resources.

Fig. 3. Empirical success rate for: i) WF [1]; ii) TWF [18]; iii) ITWF [36]; iv) TAF [19]; and v) STAF, with n = 1,000 and m/n varying by 0.1 from 1 to 7, under the same orthogonality-promoting initialization. Top: noiseless real-valued Gaussian model with x ∼ N(0, I_n) and a_i ∼ N(0, I_n); Bottom: noiseless complex-valued Gaussian model with x ∼ CN(0, I_n) and a_i ∼ CN(0, I_n).

Fig. 4. Relative error versus iterations using: i) WF [1]; ii) TWF [18]; iii) ITWF [36]; iv) TAF [19]; and v) STAF under the same orthogonality-promoting initialization. Top: noiseless real-valued Gaussian model with x ∼ N(0, I_n) and a_i ∼ N(0, I_n); Bottom: noiseless complex-valued Gaussian model with x ∼ CN(0, I_n) and a_i ∼ CN(0, I_n), where n = 1,000 and m = 5n.

The second experiment evaluates the refinement stage of STAF relative to its competing alternatives, including those of (T)WF, TAF, and ITWF, in a variety of settings. For fairness, all schemes were here initialized using the same orthogonality-promoting initialization found with 100 power iterations, and subsequently applied a number of iterations corresponding to T = 1,000 data passes. First, tests on the noiseless real- and complex-valued Gaussian models were conducted, with i.i.d. a_i ∼ N(0, I_1000), x ∼ N(0, I_1000), and i.i.d. a_i ∼ CN(0, I_1000), x ∼ CN(0, I_1000), respectively. Fig. 3 depicts the empirical success rate of all considered schemes with m/n varying by 0.1 from 1 to 7. Fig. 4 compares the convergence speed of the various schemes in terms of the number of data passes needed to produce solutions of a given accuracy. Apparently, starting with the same initialization, STAF outperforms its competing alternatives under both real- and complex-valued Gaussian models. In particular, SGD-based STAF improves in terms of exact recovery and convergence speed over the state-of-the-art gradient-type TAF, corroborating the benefit of using SGD-type solvers to cope with saddle points and local minima of nonconvex optimization [25], [36].

Fig. 5. Empirical success rate for: i) WF [1]; ii) TWF [18]; iii) ITWF [36]; iv) TAF [19]; and v) STAF, with n = 1,000 and m/n varying by 0.1 from 1 to 7. Top: noiseless real-valued Gaussian model with x ∼ N(0, I_n) and a_i ∼ N(0, I_n); Bottom: noiseless complex-valued Gaussian model with x ∼ CN(0, I_n) and a_i ∼ CN(0, I_n).

Fig. 6. Relative error versus iterations using: i) WF [1]; ii) TWF [18]; iii) ITWF [36]; iv) TAF [19]; and v) STAF, with n = 1,000 and m/n = 5. Top: noisy real-valued Gaussian model with x ∼ N(0, I_n) and a_i ∼ N(0, I_n); Bottom: noisy complex-valued Gaussian model with x ∼ CN(0, I_n) and a_i ∼ CN(0, I_n).

The previous experiment showed the improved performance of STAF under the same initialization. Now, we present numerical results comparing the different schemes equipped with their own initializations, namely, WF with spectral initialization [1], (I)TWF with truncated spectral initialization [18], TAF with orthogonality-promoting initialization using power iterations [19], and STAF with VR-OPI. Fig. 5 demonstrates the merits of STAF over its competing alternatives in exact recovery performance on the noiseless real-valued (top) and complex-valued (bottom) Gaussian models. Specifically, in the real case, STAF guarantees exact recovery from about 2.3n magnitude-only measurements, which is close to the information-theoretic limit of m = 2n − 1. In comparison, existing alternatives require a few times more measurements to achieve exact recovery. STAF also performs well in the complex case.

To demonstrate the robustness of STAF against additive noise, we perform stable phase retrieval under the noisy real-/complex-valued Gaussian model ψ_i = |a_i^H x| + η_i, with i.i.d. η_i ∼ N(0, σ²) and σ² = 0.1²‖x‖². The noisy data for the magnitude-square based algorithms were generated as y_i = ψ_i². The curves in Fig. 6 clearly show the near-perfect statistical performance and fast convergence of STAF.

Finally, to demonstrate the effectiveness and scalability of STAF on real data, the Milky Way Galaxy image^1 is considered. The colorful image of RGB bands is denoted by X ∈ ℝ^{1080×1920×3}, in which the first two indices encode the pixel location, and the third the color band. The algorithm was run independently on each of the three RGB images. We collected the physically realizable measurements called coded diffraction patterns (CDP) using random masks [17], which have also been used in [1], [18], [19], [36]. Letting x ∈ ℝ^n be a vectorization of a certain band of X, one has magnitude measurements of the form

    ψ^(k) = | F D^(k) x |,  1 ≤ k ≤ K    (21)

where n = 1,080 × 1,920 = 2,073,600, F is the n × n discrete Fourier transform matrix, and D^(k) is a diagonal matrix whose diagonal entries are sampled uniformly at random from the phase delays {1, −1, j, −j}, with j denoting the imaginary unit. CDP measurements were generated using K = 8 random masks, for a total of m = nK measurements. In this part, since the fast Fourier transform (FFT) can be implemented in O(n log n) instead of O(n²) operations, the advantage of using STAF with optimal per-iteration complexity is less pronounced. Hence, instead of processing one quadratic measurement per iteration, a block STAF version processes per iteration the n measurements associated with one random mask. That is, STAF samples at random the index k ∈ {1, 2, ..., K} of the masks in (21), and updates the iterate using all diffraction patterns corresponding to the k-th mask. In this case, STAF is able to leverage the efficient implementation of the FFT, and converges fast. Fig. 7 displays the recovered images, where the top is obtained after 100 data passes of VR-OPI iterations, and the bottom is produced by 100 data passes of STAF iterations refining the initialization. Apparently, the recovered images corroborate the effectiveness of STAF in real-world conditions.

^1 Downloaded from http://pics-about-space.com/milky-way-galaxy.

Fig. 7. Recovered images after: the variance-reducing orthogonality-promoting initialization stage (top panel), and the STAF refinement stage (bottom panel) on the Milky Way Galaxy image using K = 8 random masks.

V. CONCLUDING REMARKS

This paper developed a new linear-time algorithm, abbreviated as STAF, to solve systems of quadratic equations, which considerably broadens the scope of the state-of-the-art TAF algorithm in [19]. Adopting the amplitude-based nonconvex formulation, STAF is a two-stage iterative algorithm: it first obtains an orthogonality-promoting initialization using a stochastic variance-reduced gradient algorithm, and subsequently refines the initial estimate via truncated stochastic amplitude-based iterations. STAF was shown capable of recovering any signal from about as many equations as unknowns. In contrast to existing alternatives, both stages of STAF achieve optimal iteration and computational complexities, which makes it attractive for large-scale implementation. Numerical tests involving synthetic data and real images corroborate the merits of STAF in terms of both exact recovery performance and convergence speed over state-of-the-art approaches including TAF, (T)WF, and ITWF.

Pertinent future research directions include establishing analytical results for STAF in the presence of noise, and comparing the estimation performance of STAF with known Cramer-Rao bounds [29]. Another possibility consists of leveraging the orthogonality-promoting initialization in the context of robust phase retrieval and faster semidefinite optimization, and developing suitable gradient regularization rules for other nonconvex optimization tasks. Devising inexpensive stochastic-iteration-based solvers for compressive phase retrieval, as well as generalizations to two-dimensional phase retrieval problems, constitute additional directions for future research.

APPENDIX

Proof of Theorem 1

Recall from [19, Theorem 1] that when m/n exceeds some universal constant c_0 > 0, the estimate z_0 returned by the orthogonality-promoting initialization obeys, with high probability,

    dist(z_0, x) ≤ (1/10) ‖x‖.    (22)

Along the lines of (T)WF and TAF, to prove our Theorem 1 it suffices to show that successive STAF iterates z_t are, on average, locally contractive around the planted solution x, as asserted in the following proposition.

Proposition 2 (Local error contraction): Consider the noiseless measurements ψ_i = |a_i^T x| with an arbitrary signal x ∈ ℝ^n, and i.i.d. a_i ∼ N(0, I_n), 1 ≤ i ≤ m. Under the default algorithmic parameters given in Algorithm 3, there exist universal constants c_0, c_1, c_2 > 0 and ν > 0 such that, with probability at least 1 − c_2 m exp(−c_1 n), the following holds simultaneously for all z_t satisfying (22):

    E_{i_t}[ dist²(z_{t+1}, x) ] ≤ ( 1 − ν/n ) dist²(z_t, x)    (23)

provided that m ≥ c_0 n.

Proposition 2 demonstrates the monotonic decrease of the mean-square estimation error: once entering a reasonably small-sized neighborhood of x, successive iterates of STAF will be dragged toward x at a linear rate. Upon establishing the local error contraction property in (23), taking the expectation of both sides of (23) over i_{t−1} and applying Proposition 2 again yields a similar


relation for the previous iteration. Continuing this process to index it (rather than the data randomness) to obtain
reach the initialization z 0 and appealing to the initialization  
result in (22) collectively, leads to (20), hence completes the Ei t dist2 (z t+1 , x)
proof of Theorem 1. m
 
2μ  T aTi t z t
= ht − 2
ai t z t − ψi t T aT ht 1 T ψ

m i t =1
|ai t z t | i t |ai t zt |≥ 1 +i tγ
Proof of Proposition 2
m
 2
μ2  aT z t
To prove Proposition 2, let us first define the truncated gradi- + aTi t z t − ψi t iTt ai t 2
1 ψ
.
ent of (z) as follows m i =1 |ai t z t | |aTi t zt |≥ 1 +i tγ
t

(27)
m
 aT z
∇tr (z) = aTi z − ψi iT ai 1  (24) Now the task reduces to upper bounding the terms on the right
|ai z| |aTi zt |≥ 1 +1 γ ψ i hand side of (27). Note from (24) that by means of ∇tr (z t ),
i=1
the second term in (27) can be re-expressed as follows
which corresponds to the truncated gradient employed by TAF m
 
2μ  aT
z t
[19]. Instrumental in proving the local error contraction in − aTi t z t − ψi t iTt aTi t ht 1 T ψ

Proposition 2, the following lemma adopts a sufficient decrease m i =1 |ai t z t | |ai t zt |≥ 1 +i tγ
t
result from [19, Proposition 3]. The sufficient decrease is a key

step in establishing the local regularity condition [1], [18], [19], =− ∇tr (z t ), ht 
which suffices to prove linear convergence of iterative optimiza- m
2
tion algorithms. ≤ −4μ (1 − ζ1 − ζ2 − 2) h (28)
Proposition 3: [19, Proposition 3] Consider the noise-free
measurements ψi = |aTi x| with i.i.d. ai ∼ N (0, I n ), 1 ≤ i ≤ where the inequality follows from Proposition 3. Regarding
m, and γ = 0.7. For any fixed  > 0, there exist universal con- the last term in (27), since for the i.i.d. real-valued Gaussian
stants c0 , c1 , c2 > 0 such that if m > c0 n, then the following ai ’s, maxi t ∈[m ] ai t ≤ 2.3n holds with probability at least
holds with probability at least 1 − c2 exp(−c1 m), 1 − me−n /2 [19], and also 1 T ψi
t
≤ 1, then the next
{|ai z t |≥ 1 + γ }
t

  holds with high probability


1 2  2
h, ∇tr (z) ≥ 2 (1 − ζ1 − ζ2 − 2) h , m
m μ2  aT z t
aTi t z t − ψi t iTt ai t 2
1 ψ

h := z − x (25) m i =1 |ai t z t | |aTi t zt |≥ 1 +i tγ
t

m
2.3nμ2   T 2
n
for all x, z ∈ R such that h / x ≤ 1/10, where estimates ≤ ai t z t − aTi t x
m i =1
ζ1 ≈ 0.0782, and ζ2 ≈ 0.3894. t

Now let us turn to the term on the left hand side of (23), m
2.3nμ2   T 2
which after plugging in the update of z t+1 in (17) or (18), boils ≤ a z t − aTi t x
m i =1 i t
down to t

2
2.3nμ T T
dist2 (z t+1 , x) ≤ ht A Aht
m
 
 a T
z 2 ≤ 2.3(1 + δ)μ2 n ht 2
(29)
 t 
= ht − μt aTi t z t − ψi t Tt i
ai t 1 T ψi 
|ai t z t | |ai t zt |≥ 1 + γt
in which the second inequality comes from (|aTi t z t | −
 
aTi t z t |aTi t x|)2 ≤ (aTi t z t − aTi t x)2 , and the last inequality arises due
2 T
= ht − 2μt ai t z t − ψi t T aTi t ht 1 T ψ
 to the fact that λm ax (AT A) ≤ (1 + δ)m holds with probabil-
|ai t z t | |ai t zt |≥ 1 +i tγ
ity at least 1 − c2 exp(−c1 nδ 2 ), provided that m ≥ c0 nδ −2 for
 2 some universal constant c0 , c1 , c2 > 0 [49, Theorem 5.39].
aTi t z t
T
+ μt ai t z t − ψi t T
2
ai t 2 1 T ψ
 Substituting (28) and (29) into (27) establishes that
|ai t z t | |ai t zt |≥ 1 +i tγ
  
(26) Ei t dist2 (z t+1 , x) ≤ 1 − 4μ (1 − ζ1 − ζ2 − 2)

+ 2.3(1 + δ)μ2 n ht 2 (30)
where μt = μ > 0 with it ∈ {1, 2, . . . , m} sampled uniformly
at random in (17), or μt = 1/ ai t 2 with it ∈ {1, 2, . . . , m} holds with probability exceeding 1 − c2 m exp(−c1 n) provided
selected with probability proportional to ai t 2 in (18). that m ≥ c0 n, where c0 ≥ c0 δ −2 . To obtain legitimate estimates
Consider first the constant step size case in (17). Take the for the step size, fixing , δ > 0 to be sufficiently small con-
expectation of both sides in (26) with respect to the selection of stants, say e.g., 0.01, then using (30), μ can be chosen such that

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on September 09,2023 at 15:26:49 UTC from IEEE Xplore. Restrictions apply.
1972 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 8, APRIL 15, 2017

4(0.98 − ζ1 − ζ2 ) − 2.42μn > 0, yielding term on the right hand side of (33), one obtains that
m
 2
 ai t 2 1 aTi t z t
4(0.98 − ζ1 − ζ2 ) μ0
T
ai t z t − ψ i t T 1 T ψi

0.8469 A 2 a 2 |a z | |ai z t |≥ 1 + tγ
0<μ< ≈ := . (31) i =1
t
F i t i t t t
2.42n n n
m

1  2
= aTi t z t − aTi t x 1 ψ

Plugging μ = c3 /n for some 0 < c3 ≤ μ0 into (30), gives A 2
F i t =1
|aTi t zt |≥ 1 +i tγ
rise to m
1   T 2
≤ ai t z t − aTi t x
  A
ν
2
 F i t =1
Ei t dist2 (z t+1 , x) ≤ 1 − dist2 (z t , x) (32)
n 1
≤ hTt AT Aht
A 2
F
for ν := 4c3 (1− ζ1 − ζ2 −2) − 2.3c23 (1+δ) ≤ ν0 := 0.0697, (1 + δ)m 2
where the equality holds at the maximum step size μ = μ0 , ≤ ht
(1 − σ)mn
hence concluding the proof of Proposition 2 for the constant
step size case. (1 + δ) 2
≤ ht (35)
Now let us turn to the case of a time-varying step size. Specif- (1 − σ)n
ically, let μt = 1/ ai t 2 , and it be sampled at 
random from the which holds with high probability as soon as m ≥ c0 n ≥
set {1, 2, . . . , m} with probability ai t 2 / m i t =1 ai t
2
= c0 δ −2 n.
ai t / A F [50]. Taking the expectation of both sides in (26)
2 2
Putting results in (33), (34), and (35) together, one establishes
over it gives rise to that the following holds

  4
  Ei t dist (z t+1 , x) ≤ 1 −
2
(1 − ζ1 − ζ2 − 2)
Ei t dist2 (z t+1 , x) (1 + σ)n
  
m
aTi t ht aTi t z t (1 + δ)
+ ht 2 (36)
= ht 2
−2 T
ai t z t − ψi t T 1 T ψi
 (1 − σ)n
A 2
F |a i z t | |ai z t |≥ 1 + tγ
t
i =1
t t

 2 with probability at least 1 − c2 m exp(−c1 n) provided that m ≥


m
 c0 n. Hence, one can set in this case
1 aT z t
+ aTi t z t − ψi t iTt 1 ψ
.
A 2
F |ai t z t | |aTi t zt |≥ 1 +i tγ 4 (1 + δ)
i t =1 ν := (1 − ζ1 − ζ2 − 2) − .
(33) (1 + σ)n (1 − σ)n
Taking without loss of generality δ, σ,  to be 0.01, and substi-
tuting the estimates of ζ1 , ζ2 into (36), one arrives at ν = 1.0091
Consider random A := [a1 · · · am ]T with i.i.d. rows ai ∼ to deduce that
N (0, I n ), and any fixed σ > 0. Then, by means of Bernstein-   1.0091
type inequality
 [49, Proposition 5.16], m1n A 2F − 1| Ei t dist2 (z t+1 , x) ≤ 1 − dist2 (z t , x) (37)
= | m n i,j ai,j − 1| ≤ σ holds with probability at least 1 −
1 2 n
2 exp(−mnσ 2 /8). Therefore, the second term on the right hand which holds with high probability as soon as m ≥ c0 n, es-
side of (33) can be bounded as follows tablishing the local error contraction property of the truncated
Kaczmarz iterations in (18), as claimed in Proposition 2.
m
  Combining the results in (32) and (37), we proved the local
2  aT z t error contraction property in Proposition 2 of the two STAF
− aTi t z t − ψi t iTt aTi t ht 1 ψi

A 2
F i t =1 |ai t z t | |aTi z t |≥ 1 + tγ
t
variants under both constant and time-varying step sizes.

m
 
2  a T
z t
REFERENCES
≤− aT z t −ψi t Tti
aT ht 1 T ψi

(1 + σ)mni =1 i t |ai t z t | i t |ai z t |≥ 1 + tγ
t
[1] E. J. Candès, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtinger
t flow: Theory and algorithms,” IEEE Trans. Inf. Theory, vol. 61, no. 4,
4m pp. 1985–2007, Apr. 2015.
2
≤− (1 − ζ1 − ζ2 − 2) h [2] J. Miao, P. Charalambous, J. Kirz, and D. Sayre, “Extending the method-
(1 + σ)mn ology of X-ray crystallography to allow imaging of micrometre-sized
non-crystalline specimens,” Nature, vol. 400, no. 6742, pp. 342–344, Jul.
4 2
≤− (1 − ζ1 − ζ2 − 2) h (34) 1999.
(1 + σ)n [3] R. P. Millane, “Phase retrieval in crystallography and optics,” J. Opt. Soc.
Am. A, vol. 7, no. 3, pp. 394–411, 1990.
[4] O. Bunk et al., “Diffractive imaging for periodic samples: Retrieving one-
dimensional concentration profiles across microfluidic channels,” Acta
where the second inequality follows from Proposition 3, and the Crystallograph. A, Found. Crystallograph., vol. 63, no. 4, pp. 306–314,
last inequality from the fact that m ≥ c0 n. Concerning the last 2007.


[5] K. Jaganathan, Y. C. Eldar, and B. Hassibi, "Phase retrieval: An overview of recent developments," arXiv:1510.07713, 2015.
[6] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, "Phase retrieval via matrix completion," SIAM Rev., vol. 57, no. 2, pp. 225–251, May 2015.
[7] E. Hofstetter, "Construction of time-limited functions with specified autocorrelation functions," IEEE Trans. Inf. Theory, vol. 10, no. 2, pp. 119–126, Apr. 1964.
[8] Y. Shechtman, A. Beck, and Y. C. Eldar, "GESPAR: Efficient phase retrieval of sparse signals," IEEE Trans. Signal Process., vol. 62, no. 4, pp. 928–938, Feb. 2014.
[9] J. R. Fienup, "Phase retrieval algorithms: A comparison," Appl. Opt., vol. 21, no. 15, pp. 2758–2769, Aug. 1982.
[10] J. Ranieri, A. Chebira, Y. M. Lu, and M. Vetterli, "Phase retrieval for sparse signals: Uniqueness conditions," arXiv:1308.3058, 2013.
[11] K. Jaganathan, S. Oymak, and B. Hassibi, "Recovery of sparse 1-D signals from the magnitudes of their Fourier transform," in Proc. IEEE Int. Symp. Inf. Theory, 2012, pp. 1473–1477.
[12] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," IEEE Trans. Signal Process., vol. 63, no. 18, pp. 4814–4826, Sep. 2015.
[13] C. Qian, N. D. Sidiropoulos, K. Huang, L. Huang, and H. C. So, "Phase retrieval using feasible point pursuit: Algorithms and Cramer-Rao bound," IEEE Trans. Signal Process., vol. 64, no. 20, pp. 5282–5296, Oct. 2016.
[14] G. Wang, G. B. Giannakis, J. Chen, and M. Akçakaya, "SPARTA: Sparse phase retrieval via truncated amplitude flow," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., New Orleans, LA, USA, 2017.
[15] G. Wang, L. Zhang, G. B. Giannakis, J. Chen, and M. Akçakaya, "Sparse phase retrieval via truncated amplitude flow," arXiv:1611.07641, 2016.
[16] T. Bendory and Y. C. Eldar, "Non-convex phase retrieval from STFT measurements," arXiv:1607.08218, 2016.
[17] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval from coded diffraction patterns," Appl. Comput. Harmon. Anal., vol. 39, no. 2, pp. 277–299, Sep. 2015.
[18] Y. Chen and E. J. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," Commun. Pure Appl. Math., to be published.
[19] G. Wang, G. B. Giannakis, and Y. C. Eldar, "Solving systems of random quadratic equations via truncated amplitude flow," arXiv:1605.08285, 2016.
[20] R. Balan, P. Casazza, and D. Edidin, "On signal reconstruction without phase," Appl. Comput. Harmon. Anal., vol. 20, no. 3, pp. 345–356, May 2006.
[21] A. Conca, D. Edidin, M. Hering, and C. Vinzant, "An algebraic characterization of injectivity in phase retrieval," Appl. Comput. Harmon. Anal., vol. 38, no. 2, pp. 346–356, Mar. 2015.
[22] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Philadelphia, PA, USA: SIAM, 2001, vol. 2.
[23] Y. Chen, X. Yi, and C. Caramanis, "A convex formulation for mixed regression with two components: Minimax optimal rates," in Proc. 27th Conf. Learn. Theory, Paris, France, Jun. 2014, pp. 560–604.
[24] K. G. Murty and S. N. Kabadi, "Some NP-complete problems in quadratic
[32] T. Qiu, P. Babu, and D. P. Palomar, "PRIME: Phase retrieval via majorization-minimization," IEEE Trans. Signal Process., vol. 64, no. 19, pp. 5174–5186, Oct. 2016.
[33] T. Qiu and D. Palomar, "Undersampled phase retrieval via majorization-minimization," arXiv:1609.02842, 2016.
[34] H. Zhang, Y. Chi, and Y. Liang, "Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow," arXiv:1603.03805, 2016.
[35] K. Wei, "Solving systems of phaseless equations via Kaczmarz methods: A proof of concept study," Inverse Probl., vol. 31, no. 12, p. 125008, Nov. 2015.
[36] R. Kolte and A. Özgür, "Phase retrieval via incremental truncated Wirtinger flow," arXiv:1606.03196, 2016.
[37] E. J. Candès, T. Strohmer, and V. Voroninski, "PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming," Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, Nov. 2013.
[38] I. Waldspurger, A. d'Aspremont, and S. Mallat, "Phase recovery, MaxCut and complex semidefinite programming," Math. Program., vol. 149, no. 1–2, pp. 47–81, 2015.
[39] K. Huang, Y. C. Eldar, and N. D. Sidiropoulos, "Phase retrieval from 1D Fourier measurements: Convexity, uniqueness, and algorithms," arXiv:1603.05215, 2016.
[40] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD, USA: Johns Hopkins Univ. Press, 2012, vol. 3.
[41] O. Shamir, "Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity," in Proc. 33rd Int. Conf. Mach. Learn., New York, NY, USA, 2016.
[42] E. Oja, "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273, Nov. 1982.
[43] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 315–323.
[44] S. Kaczmarz, "Angenäherte Auflösung von Systemen linearer Gleichungen," Bull. Int. de l'Académie Polonaise des Sci. et des Lett., Classe des Sci. Math. et Nat., Sér. A, vol. 37, pp. 355–357, 1937.
[45] P. M. Pardalos and S. A. Vavasis, "Quadratic programming with one negative eigenvalue is NP-hard," J. Global Optim., vol. 1, no. 1, pp. 15–22, 1991.
[46] D. K. Berberidis, V. Kekatos, G. Wang, and G. B. Giannakis, "Adaptive censoring for large-scale regressions," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., South Brisbane, QLD, Australia, 2015, pp. 5475–5479.
[47] G. Wang, D. Berberidis, V. Kekatos, and G. B. Giannakis, "Online reconstruction from big data via compressive censoring," in Proc. IEEE Global Conf. Signal Inf. Process., Atlanta, GA, USA, 2014, pp. 326–330.
[48] J. Chen, G. Wang, and J. Sun, "Power scheduling for Kalman filtering over lossy wireless sensor networks," IET Control Theory Appl., to be published.
[49] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," arXiv:1011.3027, 2010.
[50] T. Strohmer and R. Vershynin, "A randomized Kaczmarz algorithm with exponential convergence," J. Fourier Anal. Appl., vol. 15, no. 2, pp. 262–
and nonlinear programming,” Math. Program., vol. 39, no. 2, pp. 117–129, 278, 2009.
1987.
[25] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—
Online stochastic gradient for tensor decomposition,” in Proc. 28th Conf.
Learn. Theory, 2015, pp. 797–842.
[26] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and
Y. Bengio, “Identifying and attacking the saddle point problem in high-
dimensional non-convex optimization,” in Proc. Adv. Neural Inf. Process.
Syst., 2014, pp. 2933–2941.
[27] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determi-
nation of phase from image and diffraction,” Optik, vol. 35, pp. 237–246,
Nov. 1972.
Gang Wang (S'12) received the B.Eng. degree in electrical engineering and automation from Beijing Institute of Technology, Beijing, China, in 2011. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA. His research interests focus on the areas of high-dimensional structured signal processing and (non)convex optimization for smart power grids.

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on September 09,2023 at 15:26:49 UTC from IEEE Xplore. Restrictions apply.
1974 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 8, APRIL 15, 2017

Georgios B. Giannakis (F'97) received the Diploma in electrical engineering from the National Technical University of Athens, Zografou, Greece, in 1981. From 1982 to 1986, he was with the University of Southern California, Los Angeles, CA, USA, where he received the M.Sc. degree in electrical engineering, the M.Sc. degree in mathematics, and the Ph.D. degree in electrical engineering, in 1983, 1986, and 1986, respectively. He was with the University of Virginia from 1987 to 1998, and since 1999, he has been a Professor with the University of Minnesota, Minneapolis, MN, USA, where he holds an Endowed Chair in Wireless Telecommunications, a University of Minnesota McKnight Presidential Chair in ECE, and is the Director of the Digital Technology Center.

His general interests span the areas of communications, networking, and statistical signal processing, subjects on which he has published more than 400 journal papers, 680 conference papers, 25 book chapters, two edited books, and two research monographs (h-index 119). His current research focuses on learning from Big Data, wireless cognitive radios, and network science with applications to social, brain, and power networks with renewables. He is the (co-)inventor of 28 issued patents, and the (co-)recipient of eight best paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000) and from EURASIP (2005), a Young Faculty Teaching Award, the G. W. Taylor Award for Distinguished Research from the University of Minnesota, and the IEEE Fourier Technical Field Award (2015). He is a Fellow of EURASIP, and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE-SP Society.

Jie Chen (SM'12) received the B.S., M.S., and Ph.D. degrees from the Beijing Institute of Technology, Beijing, China, in 1986, 1996, and 2001, respectively. He is currently a Professor in the School of Automation, Beijing Institute of Technology. His current research interests include complex systems, multi-agent systems, multi-objective optimization and decision, constrained nonlinear control, and optimization methods.

