
An Overview of the Simultaneous Perturbation Method for Efficient Optimization

James C. Spall

Johns Hopkins APL Technical Digest, Volume 19, Number 4 (1998)

Multivariate stochastic optimization plays a major role in the analysis and
control of many engineering systems. In almost all real-world optimization problems,
it is necessary to use a mathematical algorithm that iteratively seeks out the solution
because an analytical (closed-form) solution is rarely available. In this spirit, the
“simultaneous perturbation stochastic approximation (SPSA)” method for difficult
multivariate optimization problems has been developed. SPSA has recently attracted
considerable international attention in areas such as statistical parameter estimation,
feedback control, simulation-based optimization, signal and image processing, and
experimental design. The essential feature of SPSA—which accounts for its power and
relative ease of implementation—is the underlying gradient approximation that requires
only two measurements of the objective function regardless of the dimension of the
optimization problem. This feature allows for a significant decrease in the cost of
optimization, especially in problems with a large number of variables to be optimized.
(Keywords: Gradient approximation, Multivariate optimization, Simulation-based
optimization, Simultaneous perturbation stochastic approximation, SPSA, Stochastic
optimization.)

INTRODUCTION

This article is an introduction to the simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization of multivariate systems. Optimization algorithms play a critical role in the design, analysis, and control of most engineering systems and are in widespread use in the work of APL and other organizations:

The future, in fact, will be full of [optimization] algorithms. They are becoming part of almost everything. They are moving up the complexity chain to make entire companies more efficient. They also are moving down the chain as computers spread. (USA Today, 31 Dec 1997)

Before presenting the SPSA algorithm, we provide some general background on the stochastic optimization context of interest here.


The mathematical representation of most optimization problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in the objective function. Figure 1 depicts this process for a very simple case of only two variables, θ1 and θ2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective function). Most real-world problems would have many more variables. The illustration in Fig. 1 is typical of a stochastic optimization setting with noisy input information because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm).

Figure 1. Example of stochastic optimization algorithm minimizing loss function L(θ1, θ2). (The height of the vertical line after each optimization step denotes the loss function value; the search proceeds from an initial guess to the solution.)

Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gradient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from (generally noisy) measurements of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems. This article focuses on the case where such an approximation is going to be used as a result of direct gradient information not being readily available.

Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins-Monro1 stochastic approximation (R-M SA)] while requiring only loss function measurements. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in some areas (e.g., nonlinear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient.

Let us elaborate on the distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function. The prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton–Raphson, neural network backpropagation, and infinitesimal perturbation analysis–based optimization for discrete-event systems. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input–output relationships to calculate the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input–output relationships. The classical method for gradient-free stochastic optimization is the Kiefer–Wolfowitz finite-difference SA (FDSA) algorithm.2

Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss function–based gradient approximations when speed is measured in number of iterations.


Intuitively, this result is not surprising given the additional information required for the gradient-based algorithms. In particular, on the basis of asymptotic theory, the optimal rate of convergence measured in terms of the deviation of the parameter estimate from the true optimal parameter vector is of order k^{–1/2} for the gradient-based algorithms and of order k^{–1/3} for the algorithms based on gradient approximations, where k represents the number of iterations. (Special cases exist where the maximum rate of convergence for a nongradient algorithm is arbitrarily close to, or equal to, k^{–1/2}.)

In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input–output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be representative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and with simulation-based optimization such as infinitesimal perturbation analysis requires only one system run per iteration, whereas the approximation-based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is conveniently and reliably available, it is generally to one's advantage to use this information in the optimization process. The focus in this article is the case where such information is not readily available.

The next section describes SPSA and the related FDSA algorithm. Then some of the theory associated with the convergence and efficiency of SPSA is summarized. The following section is an illustration of the implications of the theory in an example related to neural networks. Then practical guidelines for implementation are presented, followed by a summary of some ancillary results and some extensions of the algorithm. Not covered here are global optimization methods such as genetic algorithms and simulated annealing; Spall3 presents some discussion of such methods in the context of stochastic approximation.

FDSA AND SPSA ALGORITHMS

This article considers the problem of minimizing a (scalar) differentiable loss function L(θ), where θ is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing θ* such that ∂L/∂θ = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L(θ) are available at various values of θ. These measurements may or may not include added noise. No direct measurements of ∂L/∂θ are assumed available, in contrast to the R-M framework. This section will describe the FDSA and SPSA algorithms. Although the emphasis of this article is SPSA, the FDSA discussion is included for comparison because FDSA is a classical method for stochastic optimization.

The SPSA and FDSA procedures are in the general recursive SA form:

    θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k) ,    (1)

where ĝ_k(θ̂_k) is the estimate of the gradient g(θ) ≡ ∂L/∂θ at the iterate θ̂_k based on the previously mentioned measurements of the loss function. Under appropriate conditions, the iteration in Eq. 1 will converge to θ* in some stochastic sense (usually "almost surely") (see, e.g., Fabian4 or Kushner and Yin5).

The essential part of Eq. 1 is the gradient approximation ĝ_k(θ̂_k). We discuss the two forms of interest here. Let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and c_k be some (usually small) positive number. One-sided gradient approximations involve measurements y(θ̂_k) and y(θ̂_k + perturbation), whereas two-sided gradient approximations involve measurements of the form y(θ̂_k ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference and simultaneous perturbation, respectively, which are discussed in the following paragraphs.

For the finite-difference approximation, each component of θ̂_k is perturbed one at a time, and corresponding measurements y(·) are obtained. Each component of the gradient estimate is formed by differencing the corresponding y(·) values and then dividing by a difference interval. This is the standard approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as the limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector.


Typically, the ith component of ĝ_k(θ̂_k) (i = 1, 2, …, p) for a two-sided finite-difference approximation is given by

    ĝ_{ki}(θ̂_k) = [y(θ̂_k + c_k e_i) − y(θ̂_k − c_k e_i)] / (2c_k) ,    (2)

where e_i denotes a vector with a one in the ith place and zeros elsewhere (an obvious analogue holds for the one-sided version; likewise for the simultaneous perturbation form below), and c_k denotes a small positive number that usually gets smaller as k gets larger.

The simultaneous perturbation approximation has all elements of θ̂_k randomly perturbed together to obtain two measurements of y(·), but each component ĝ_{ki}(θ̂_k) is formed from a ratio involving the individual components in the perturbation vector and the difference in the two corresponding measurements. For two-sided simultaneous perturbation, we have

    ĝ_{ki}(θ̂_k) = [y(θ̂_k + c_k Δ_k) − y(θ̂_k − c_k Δ_k)] / (2c_k Δ_{ki}) ,    (3)

where the distribution of the user-specified p-dimensional random perturbation vector, Δ_k = (Δ_{k1}, Δ_{k2}, …, Δ_{kp})^T, satisfies conditions discussed later in this article (superscript T denotes vector transpose).

Note that the number of loss function measurements y(·) needed in each iteration of FDSA grows with p, whereas with SPSA, only two measurements are needed independent of p because the numerator is the same in all p components. This circumstance, of course, provides the potential for SPSA to achieve a large savings (over FDSA) in the total number of measurements required to estimate θ when p is large. This potential is realized only if the number of iterations required for effective convergence to θ* does not increase in a way to cancel the measurement savings per gradient approximation at each iteration. A later section of this article will discuss this efficiency issue further, demonstrating when this potential can be realized by establishing that:

Under reasonably general conditions, SPSA and FDSA achieve the same level of statistical accuracy for a given number of iterations, even though SPSA uses p times fewer function evaluations than FDSA (because each gradient approximation uses only 1/p the number of function evaluations).
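To make the measurement-count contrast concrete, the following MATLAB fragment (written in the style of the boxed code insert that appears later in this article) computes both two-sided approximations at a current iterate. It is an illustrative sketch only: the function loss and the variables theta and ck are assumed to be defined as in that insert, and the Bernoulli ±1 perturbation is one valid choice for the components of Δ_k.

% Illustrative comparison of the two gradient approximations (sketch only).
% Assumes: theta is a p x 1 column vector, ck is a small positive scalar, and
% loss(theta) returns a (possibly noisy) measurement of L(theta).
p = length(theta);

% Finite-difference approximation (Eq. 2): requires 2*p loss measurements.
ghat_fd = zeros(p,1);
for i = 1:p
    e = zeros(p,1); e(i) = 1;                    % ith unit vector
    ghat_fd(i) = (loss(theta + ck*e) - loss(theta - ck*e))/(2*ck);
end

% Simultaneous perturbation approximation (Eq. 3): requires 2 loss measurements.
delta   = 2*round(rand(p,1)) - 1;                % Bernoulli +/-1 components
yplus   = loss(theta + ck*delta);
yminus  = loss(theta - ck*delta);
ghat_sp = (yplus - yminus)./(2*ck*delta);        % same numerator in all p components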
SELECTED APPLICATIONS OF SPSA

The efficiency issue mentioned in the preceding section (and treated in more detail in the next section) has profound implications for practical multivariate optimization. Many problems that formerly may have been considered intractable with conventional (say, FDSA) methods may now be solvable. In this section, we summarize three distinct examples, based on work at APL, where SPSA is providing a solution to a problem that appeared intractable using other available methods. In addition, to illustrate some of the other possible applications, we close with a brief summary of some additional projects based on SPSA, most of which have been developed at other institutions.

Signal Timing for Vehicle Traffic Control

A long-standing problem in traffic engineering is to optimize the flow of vehicles through a given road network (Fig. 2). Improving the timing of the traffic signals at intersections in a network is generally the most powerful and cost-effective means of achieving this goal. However, because of the many complex aspects of a traffic system—human behavioral considerations, vehicle flow interactions within the network, weather effects, traffic accidents, long-term (e.g., seasonal) variation, etc.—it has been notoriously difficult to determine the optimal signal timing, especially on a system-wide (multiple intersection) basis. Much of this difficulty has stemmed from the need to build extremely complex models of the traffic dynamics as a component of the control strategy. A "strategy" in this context is a set of rules providing real-time signal timing in response to minute-by-minute changes in the traffic conditions. The APL approach is fundamentally different from those in existence in that it eliminates the need for such complex models. SPSA is central to the approach by providing a means for making small simultaneous changes to all the signal timings in a network and using the information gathered in this way to update the system-wide timing strategy. By avoiding conventional "one-signal-at-a-time" changes to the signal timing strategies, the time it would take to produce an overall optimal strategy for the system is reduced from years or decades (obviously impractical!) to several months (quite reasonable). Note that, unlike the two examples that follow, the savings here is not computational per se, but is inherent in the need for data on a daily basis (and hence represents a reduction in physical experimental costs such as labor and time). This approach is described in detail in Spall and Chin6 and Chin et al.,7 including realistic simulations of a 9-intersection network within the central business district of Manhattan, New York, and a 12-intersection network in Montgomery County, Maryland.

Figure 2. Overall system-wide traffic control concept. Traffic control center provides timing information to signals in traffic network; information on traffic flow is fed back to traffic control center.

Optimal Targeting of Weapon Systems

This is an example of the use of simulations to optimize processes, something done in a wide range of DoD and non-DoD applications.


More specifically, given a number of projectiles that are going to be directed at a target, the problem is to optimally select a set of aim points with the goal of maximizing damage to the target while minimizing so-called collateral damage (damage to sensitive locations not directly associated with the military mission, e.g., schools and hospitals). The projectiles have inherent random variation and may be statistically dependent. So the targeter must allow for the statistical variation between the aim point and actual impact point in developing a strategy for determining aim points. In such cases it is desirable to use patterning of multiple projectiles. "Patterning" in this context means aiming the projectiles at a set of points that may not overlap each other or be within the target boundaries. Figure 3 illustrates the concept for one target and five projectiles; a bias away from the "stay-out zone" (e.g., a civilian facility) is apparent in this case where it is desired to destroy the target while minimizing the chances of producing damage to the stay-out zone. For scenarios with many projectiles that are independently targeted, the damage function—which must be evaluated to find the best aim point pattern—is likely to be analytically unavailable and will require estimation by Monte Carlo simulation. In particular, to evaluate the effectiveness of a given set of aim points, it is necessary to run one or more simulations (recall that there is random variation in the outcome of one "volley" of projectiles corresponding to one simulation). Commonly used techniques for solving the optimization problem by simulation are computationally intensive and prohibitively time-consuming since the damage function for many different sets of aim points must be evaluated (i.e., many simulations must be run). The SPSA method provides an efficient means of solving this multivariate problem, which for planar targets has a dimension of p = 2 × [no. of projectiles] (so p = 10 in the small-scale example of Fig. 3). SPSA works by varying all of the aim point coordinates simultaneously and running a simulation in the process of producing the gradient approximation for the optimization process. This procedure is repeated as the iteration for the optimization proceeds.


This method contrasts significantly with conventional methods where one would vary only one of the coordinates for one of the aim points prior to running a simulation, repeating that process as each coordinate for each aim point was changed at a specified nominal set of aim points to construct a gradient approximation at the given nominal point. The process is repeated as the nominal aim points are varied over the course of the optimization. By simultaneously changing the aim points, one is able to reduce by a factor of p the number of simulations needed, possibly reducing the run times from days to minutes or hours. A more complete description of this approach to determining aim points is given in Heydon et al.8 and Hill and Heydon.9

Figure 3. Example of optimal aim points (×) given stay-out zone. (Aim points are marked ×; axes are crossrange and downrange.)

Locating Buried Objects via Electrical Conductivity

ECOL (electrical conductivity object locator) is an approach to determining the location of buried objects via injecting electrical current into the ground in an area surrounding a candidate object. Measurements of the electric potential are taken near the surface, which then form the basis for constructing a subsurface characterization of the conductivity. The object being sought must have conductivity different from the surrounding soil, which would include metal or plastic objects of potential interest in mine sweeping or buried waste detection. Present technology is limited to searching for objects from 5 to 500 cm below the surface in an area ranging from 10 to 30 m². Several field demonstrations have been conducted on APL property, and the setup for one is shown in Fig. 4. The basis for ECOL is to use the contrast in conductivity between the buried object and surrounding soil to construct a finite-element model of the subsurface. This represents a demanding optimization problem attributable to the uncertainty about the nature of the subsurface and the many potential impurities (stones, sticks, tree roots, etc.) that can affect conductivity and the high dimensionality of the finite-element model. The inherent uncertainty about the subsurface makes gradient-based methods infeasible, and the high dimensionality makes the "one-variable-at-a-time" methods very time-consuming. SPSA was used to provide a relatively easy and rapid solution to this problem. For example, with surface data from one of the field experiments, effective convergence of the algorithm was achieved after about 4 min on a 180-MHz Pentium PC; the conventional finite-difference method would have taken approximately 6 to 7 h on the same PC. Larger-scale practical implementations would show an even much larger relative savings. A description of ECOL and some of the field experiments is given in Chin and Srinivasan.10

Figure 4. APL demonstration site for ECOL. Indicated probes measure the electric potential.

Some Other Applications

Some additional recent applications of SPSA, initiated both in and out of APL, are described in Hill and Fu11 and Fu and Hill12 (queuing systems), Hopkins13 (control of a heavy ion beam), Rezayat14 (industrial quality improvement), Maeda et al.15 (pattern recognition), Kleinman et al.16 (simulation-based optimization with applications to air traffic management), Cauwenberghs17 (neural network training), Spall and Cristion18,19 and Maeda and De Figueiredo20 (neural network training for adaptive control of dynamic systems), Gerencsér21 (classification of ECG signals for heart monitoring), Luman22 (simulation-based decision aiding), Alessandri and Parisini23 (statistical model parameter estimation/fault detection), Nechyba and Xu24 (human–machine interaction), Sadegh and Spall25 (sensor placement and configuration), and Chin26 (signal inversion for a complex physical model).

BASIC ASSUMPTIONS AND SUPPORTING THEORY

With the goal of minimizing a loss function L(θ) over feasible values of θ, the SPSA algorithm works by iterating from an initial guess of the optimal θ, where the iteration process depends on the aforementioned simultaneous perturbation approximation to the gradient g(θ). Spall27,28 presents sufficient conditions for convergence of the SPSA iterate (θ̂_k → θ* in the stochastic "almost sure" sense) using a differential equation approach well known in general SA theory (e.g., Kushner and Yin5). To establish convergence, conditions are imposed on both gain sequences (a_k and c_k), the user-specified distribution of Δ_k, and the statistical relationship of Δ_k to the measurements y(·). We will not repeat the conditions here since they are available in Spall.27,28 The essence of the main conditions is that a_k and c_k both go to 0 at rates neither too fast nor too slow, that L(θ) is sufficiently smooth (several times differentiable) near θ*, and that the {Δ_{ki}} are independent and symmetrically distributed about 0 with finite inverse moments E(|Δ_{ki}|^{–1}) for all k, i. One particular distribution for Δ_{ki} that satisfies these latter conditions is the symmetric Bernoulli ±1 distribution; two common distributions that do not satisfy the conditions (in particular, the critical finite inverse moment condition) are the uniform and normal distributions.

In addition to establishing the formal convergence of SPSA, Spall (Ref. 28, Sect. 4) shows that the probability distribution of an appropriately scaled θ̂_k is approximately normal (with a specified mean and covariance matrix) for large k. This asymptotic normality result, together with a parallel result for FDSA, can be used to study the relative efficiency of SPSA. This efficiency is the major theoretical result justifying the use of SPSA. The efficiency depends on the shape of L(θ), the values for {a_k} and {c_k}, and the distributions of the {Δ_{ki}} and measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in Spall (Ref. 28, Sect. 4) and Chin,29 in most practical problems SPSA will be asymptotically more efficient than FDSA. For example, if a_k and c_k are chosen as in the guidelines of Spall,30 then by equating the asymptotic mean-squared error E(||θ̂_k − θ*||²) in SPSA and FDSA, we find

    [No. of measurements of L(θ) in SPSA] / [No. of measurements of L(θ) in FDSA] → 1/p    (4)

as the number of loss measurements in both procedures gets large. Hence, Expression 4 implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process despite the complex nonlinear ways in which the sequence of gradient approximations manifests itself in the ultimate solution θ̂_k.

Relative to implementation in a practical problem, another way of looking at Expression 4 is that:

One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.

This surprising and significant result seems to run counter to all that one learns in engineering and scientific training. It is the qualifier "for optimization" that is critical to the validity of the statement. To view an online animated demonstration of this concept, select the blue box.

Let us provide some informal mathematical rationale for this key result. Figure 5 provides an example of a two-variable problem, where the level curves show points of equal value in the loss function. In a low- or no-noise setting, the FDSA algorithm will behave similarly to a traditional gradient descent algorithm in taking steps that provide the locally greatest reduction in the loss function. A standard result in calculus shows that this "steepest descent" direction is perpendicular to the level curve at that point, as shown in the steps for the FDSA algorithm of Fig. 5 (each straight segment is perpendicular to the level curve at the origin of the segment). Hence, the FDSA algorithm is behaving much as an aggressive skier might act in descending a hill by going in small segments that provide the steepest drop from the start of each segment. SPSA, on the other hand, with its random search direction, does not follow the path of locally steepest descent. On average, though, it will nearly follow the steepest descent path because the gradient approximation is an almost unbiased estimator of the gradient (i.e., E[ĝ_k(θ)] = g(θ) + small bias, where the small bias is proportional to c_k², and c_k is the small number mentioned earlier). Over the course of many iterations, the errors associated with the "misdirections" in SPSA will average out in a manner analogous to the way random errors cancel out in forming the sample mean of almost any random process (the a_k sequence in Eq. 1 governs this averaging). Figure 5 shows this effect at work in the way the SPSA search direction tends to "bounce around" the FDSA search direction, while ultimately settling down near the solution in the same number of steps. Although this discussion was motivated by the two-variable (p = 2) problem with no- or low-noise loss function measurements (so that the FDSA algorithm behaves very nearly like a true gradient descent algorithm), the same essential intuition applies in higher-dimension settings and noisier loss measurements. Noisy loss measurements imply that the FDSA algorithm will also not closely track a gradient descent algorithm as in Fig. 5; however, the relationship between SPSA and FDSA (which is what Expression 4 pertains to) will still be governed by the idea of averaging out the errors in directions over a large number of iterations.

Figure 5. Example of relative search paths for SPSA and FDSA in p = 2 problem. Deviations of SPSA from FDSA average out in reaching a solution in the same number of iterations; FDSA nearly follows the gradient descent path (perpendicular to level curves) in the low-noise setting. (The figure marks the common starting point for SPSA and FDSA and the solution.)
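A minimal numerical check of this near-unbiasedness property can be run in a few lines of MATLAB under the simplifying assumption of a noise-free quadratic loss L(θ) = θ^T θ (so the true gradient is 2θ); the variable names and values below are illustrative assumptions, not part of the original study.

% Monte Carlo check that the simultaneous perturbation gradient estimate is
% (nearly) unbiased for a simple quadratic loss. Sketch only; assumed values.
p     = 2;                          % dimension, as in the Fig. 5 example
theta = [1; -0.5];                  % arbitrary evaluation point
ck    = 0.1;                        % small difference interval
L     = @(t) t.'*t;                 % quadratic loss; true gradient is 2*theta
nrep  = 100000;                     % number of independent gradient estimates
gsum  = zeros(p,1);
for j = 1:nrep
    delta = 2*round(rand(p,1)) - 1; % Bernoulli +/-1 perturbation
    gsum  = gsum + (L(theta + ck*delta) - L(theta - ck*delta))./(2*ck*delta);
end
gbar = gsum/nrep                    % sample mean; should settle near 2*theta = [2; -1]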


EXPERIMENTAL RESULTS

Figure 6 shows the implications of the theoretical result just discussed in a practical setting. This graph shows the results of a simulation using a neural network to regulate the water purity and methane gas by-product of a wastewater treatment process (Spall and Cristion19 discuss this problem in detail). The vertical axis represents a normalized (for units) version of the loss function L(θ) (measuring the deviation from certain target values in water cleanliness and methane gas by-product), and the horizontal axis measures the iterations of the algorithms. The dimension of the θ vector in this case was 412, corresponding to the number of "connection weights" in the neural network. The essential point to observe in this graph is that the FDSA and SPSA approaches achieve very similar levels of accuracy for a given number of iterations after the first few iterations, but that SPSA only uses 2 experiments (loss evaluations) per iteration, whereas FDSA uses 824 experiments! This difference obviously leads to a very substantial savings—representing computational savings in a simulation exercise or direct time and money savings in a real waste treatment plant—in the overall problem of estimating the neural network weights. (The sharp initial decline of FDSA in Fig. 6 may be slightly misleading because the weights had yet to begin stabilizing and because the number of measurements is still large: over 1600 in the first two FDSA iterations versus 160 for all the iterations of SPSA.)

Figure 6. Relative performance of SPSA and FDSA for controller in a wastewater treatment system. (Normalized L(θ) versus number of iterations, up to 80 iterations; total L(θ) measurements: SPSA = 160, FDSA = 65,920; the minimum achievable long-run value is indicated.)

The numerical study in Fig. 6 is only one of many such examples. Further, there have been comparisons with other types of stochastic optimization algorithms. For example, Chin26 performs a comparison with the popular simulated annealing algorithm in the context of model estimation for a magnetosphere model and finds that SPSA significantly outperforms simulated annealing. The author has also performed comparisons with simulated annealing and several types of directed random search, and found similar superior relative performance. An open issue is to conduct a careful comparison with the popular genetic algorithm and related evolutionary methods.

IMPLEMENTATION OF SPSA

The following step-by-step summary shows how SPSA iteratively produces a sequence of estimates. The boxed insert presents an implementation of the following steps in MATLAB code.


Step 1: Initialization and coefficient selection. Set counter index k = 1. Pick initial guess and non-negative coefficients a, c, A, α, and γ in the SPSA gain sequences a_k = a/(A + k)^α and c_k = c/k^γ. The choice of the gain sequences (a_k and c_k) is critical to the performance of SPSA (as with all stochastic optimization algorithms and the choice of their respective algorithm coefficients). Spall30 provides some guidance on picking these coefficients in a practically effective manner. (In cases where the elements of θ have very different magnitudes, it may be desirable to use a matrix scaling of the gain a_k if prior information is available on the relative magnitudes. The next section discusses a second-order version of SPSA that automatically scales for different magnitudes.)
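As one concrete illustration of Step 1 (a sketch only, not a substitute for the guidance in Spall30), the fragment below sets the gains using values of the kind commonly quoted from those guidelines; every numerical value shown is an assumption to be tuned for the problem at hand.

% Illustrative gain-sequence initialization (assumed values; see Spall30 for guidance).
n     = 1000;             % maximum number of iterations (assumed)
alpha = 0.602;            % decay exponent for a_k often cited as practically effective
gamma = 0.101;            % decay exponent for c_k often cited as practically effective
A     = 0.10*n;           % stability constant, roughly 10% of the iteration budget
c     = 0.1;              % set near the standard deviation of the measurement noise
a     = 0.16;             % chosen so a/(A+1)^alpha times a typical ghat element
                          % gives the desired early change in the elements of theta
k     = 1;
ak    = a/(A + k)^alpha;  % gain for the update step (Eq. 6)
ck    = c/k^gamma;        % gain for the perturbation size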
Step 2: Generation of the simultaneous perturbation vector. Generate by Monte Carlo a p-dimensional random perturbation vector Δ_k, where each of the p components of Δ_k is independently generated from a zero-mean probability distribution satisfying the preceding conditions. A simple (and theoretically valid) choice for each component of Δ_k is to use a Bernoulli ±1 distribution with probability of 1/2 for each ±1 outcome. Note that uniform and normal random variables are not allowed for the elements of Δ_k by the SPSA regularity conditions (since they have infinite inverse moments).

Step 3: Loss function evaluations. Obtain two measurements of the loss function L(·) based on the simultaneous perturbation around the current θ̂_k: y(θ̂_k + c_kΔ_k) and y(θ̂_k − c_kΔ_k), with the c_k and Δ_k from Steps 1 and 2.

Step 4: Gradient approximation. Generate the simultaneous perturbation approximation to the unknown gradient g(θ̂_k):

    ĝ_k(θ̂_k) = [y(θ̂_k + c_k Δ_k) − y(θ̂_k − c_k Δ_k)] / (2c_k) · [Δ_{k1}^{–1}, Δ_{k2}^{–1}, …, Δ_{kp}^{–1}]^T ,    (5)

where Δ_{ki} is the ith component of the Δ_k vector (which may be ±1 random variables as discussed in Step 2); note that the common numerator in all p components of ĝ_k(θ̂_k) reflects the simultaneous perturbation of all components in θ̂_k, in contrast to the component-by-component perturbations in the standard finite-difference approximation.

Step 5: Updating the θ estimate. Use the standard SA form

    θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k)    (6)

to update θ̂_k to a new value θ̂_{k+1}. Modifications to the basic updating step in Eq. 6 are sometimes desirable to enhance convergence and impose constraints. These modifications block or alter the update to the new value of θ if the "basic" value from Eq. 6 appears undesirable. Reference 31 (Sect. 2) discusses several possibilities. One easy possibility if maximum and minimum allowable values can be specified on the components of θ is shown at the bottom of the boxed insert.

Step 6: Iteration or termination. Return to Step 2 with k + 1 replacing k. Terminate the algorithm if there is little change in several successive iterates or the maximum allowable number of iterations has been reached (more formal termination guidance is discussed in Pflug, Ref. 32, pp. 297–300).

MATLAB CODE

The following is a sample MATLAB code for performing n iterations of the standard (first-order) SPSA algorithm. Algorithm initialization for the program variables theta, n, p, a, A, c, alpha, and gamma is not shown here since that can be handled in many ways (e.g., read from another file, direct inclusion in the program, and user input during execution). The program calls an external function "loss" to obtain the (possibly noisy) measurements. The Δ_{ki} elements are generated according to a Bernoulli ±1 distribution.

for k=1:n
    ak=a/(k+A)^alpha;                  % step-size gain a_k
    ck=c/k^gamma;                      % perturbation-size gain c_k
    delta=2*round(rand(p,1))-1;        % Bernoulli +/-1 perturbation vector
    thetaplus=theta+ck*delta;          % the two perturbed design points
    thetaminus=theta-ck*delta;
    yplus=loss(thetaplus);             % two loss measurements per iteration
    yminus=loss(thetaminus);
    ghat=(yplus-yminus)./(2*ck*delta); % simultaneous perturbation gradient estimate (Eq. 5)
    theta=theta-ak*ghat;               % standard SPSA update (Eq. 6)
end
theta                                  % display the final estimate

If maximum and minimum values on the values of theta can be specified, say thetamax and thetamin, then the following two lines can be added below the theta update line to impose the constraints:

theta=min(theta,thetamax);
theta=max(theta,thetamin);

FURTHER RESULTS AND EXTENSIONS TO BASIC SPSA ALGORITHM

Sadegh and Spall33 consider the problem of choosing the best distribution for the Δ_k vector. On the basis of asymptotic distribution results, it is shown that the optimal distribution for the components of Δ_k is symmetric Bernoulli. This simple distribution has also proven effective in many finite-sample practical and simulation examples. The recommendation in Step 2 of the algorithm description follows from these findings.

(It should be noted, however, that other distributions are sometimes desirable. Since the user has full control over this choice and since the generation of Δ_k represents a trivial cost toward the optimization, it may be worth evaluating other possibilities in some applications. For example, Maeda and De Figueiredo20 used a symmetric two-part uniform distribution, i.e., a uniform distribution with a section removed near 0 [to preserve the finiteness of inverse moments], in an application for robot control.)

Some extensions to the basic SPSA algorithm are reported in the literature. For example, its use in feedback control problems, where the loss function changes with time, is given in Spall and Cristion.18,19,34 Reference 34 is the most complete methodological and theoretical treatment. Reference 18 also reports on a gradient smoothing idea (analogous to "momentum" in the neural network literature) that may help reduce noise effects and enhance convergence (and also gives guidelines for how the smoothing should be reduced over time to ensure convergence). Alternatively, it is possible to average several simultaneous perturbation gradient approximations at each iteration to reduce noise effects (at the cost of additional function measurements); this is discussed in Spall.28 An implementation of SPSA for global minimization is discussed in Chin35 (i.e., the case where there are multiple minimums at which g(θ) = 0); this approach is based on a step-wise (slowly decaying) sequence c_k (and possibly a_k). The problem of constrained (equality and inequality) optimization with SPSA is considered in Sadegh36 and Fu and Hill12 using a projection approach. A one-measurement form of the simultaneous perturbation gradient approximation is considered in Spall37; although it is shown in Ref. 37 that the standard two-measurement form will usually be more efficient (in terms of total number of loss function measurements to obtain a given level of accuracy in the θ iterate), there are advantages to the one-measurement form in real-time operations where the underlying system dynamics may change too rapidly to get a credible gradient estimate with two successive measurements.
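As a concrete illustration of the gradient-averaging idea mentioned in the preceding paragraph, the following MATLAB sketch (using the same variable names as the boxed code insert, with an averaging count m that is an assumed, illustrative quantity) averages several simultaneous perturbation gradient estimates before each update. It is one simple way to realize the idea, not necessarily the specific procedure analyzed in Ref. 28.

% One SPSA iteration with gradient averaging (illustrative sketch only).
% Assumes theta, p, ak, ck, and loss(.) are defined as in the boxed insert.
m    = 4;                                  % number of gradient estimates to average (assumed)
gbar = zeros(p,1);
for j = 1:m
    delta  = 2*round(rand(p,1)) - 1;       % Bernoulli +/-1 perturbation
    yplus  = loss(theta + ck*delta);       % two loss measurements per estimate
    yminus = loss(theta - ck*delta);
    gbar   = gbar + (yplus - yminus)./(2*ck*delta);
end
gbar  = gbar/m;                            % averaged gradient estimate
theta = theta - ak*gbar;                   % update of Eq. 6 with the averaged estimate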
An "accelerated" form of SPSA is reported in Spall.31,38 This approach extends the SPSA algorithm to include second-order (Hessian) effects with the aim of accelerating convergence in a stochastic analogue to the deterministic Newton–Raphson algorithm. Like the standard (first-order) SPSA algorithm, this second-order algorithm is simple to implement and requires only a small number—independent of p—of loss function measurements per iteration (no gradient measurements, as in standard SPSA). In particular, only four measurements are required to estimate the loss-function gradient and inverse Hessian at each iteration (and one additional measurement is sometimes recommended as a check on algorithm behavior). The algorithm is implemented with two simple parallel recursions: one for θ and one for the Hessian matrix of L(θ). The recursion for θ is a stochastic analogue of the well-known Newton–Raphson algorithm of deterministic optimization. The recursion for the Hessian matrix is simply a recursive calculation of the sample mean of per-iteration Hessian estimates formed using SP-type ideas.

CONCLUSION

Relative to standard deterministic methods, stochastic optimization considerably broadens the range of practical problems for which one can find rigorous optimal solutions. Algorithms of the stochastic optimization type allow for the effective treatment of problems in areas such as network analysis, simulation-based optimization, pattern recognition and classification, neural network training, image processing, and nonlinear control. It is expected that the role of stochastic optimization will continue to grow as modern systems increase in complexity and as population growth and dwindling natural resources force trade-offs that were previously unnecessary.

The SPSA algorithm has proven to be an effective stochastic optimization method. Its primary virtues are ease of implementation and lack of need for loss function gradient, theoretical and experimental support for relative efficiency, robustness to noise in the loss measurements, and empirical evidence of ability to find a global minimum when multiple (local and global) minima exist. SPSA is primarily limited to continuous-variable problems and, relative to other methods, is most effective when the loss function measurements include added noise. Numerical comparisons with techniques such as the finite-difference method, simulated annealing, genetic algorithms, and random search have supported the claims of SPSA's effectiveness in a wide range of practical problems. The rapidly growing number of applications throughout the world provides further evidence of the algorithm's effectiveness. To add to the effectiveness, there have been some extensions of the basic idea, including a stochastic analogue of the fast deterministic Newton–Raphson (second-order) algorithm, adaptations for real-time (control) implementations, and versions for some types of constrained and global optimization problems. Although much work continues in extending the basic algorithm to a broader range of real-world settings, SPSA addresses a wide range of difficult problems and should likely be considered for many of the stochastic optimization challenges encountered in practice.

REFERENCES

1 Robbins, H., and Monro, S., "A Stochastic Approximation Method," Ann. Math. Stat. 22, 400–407 (1951).
2 Kiefer, J., and Wolfowitz, J., "Stochastic Estimation of the Maximum of a Regression Function," Ann. Math. Stat. 23, 462–466 (1952).
3 Spall, J. C., "Stochastic Optimization, Stochastic Approximation, and Simulated Annealing," in Encyclopedia of Electrical and Electronics Engineering, J. G. Webster (ed.), Wiley, New York (in press, 1999).


4 Fabian, V., "Stochastic Approximation," in Optimizing Methods in Statistics, J. S. Rustagi (ed.), Academic Press, New York, pp. 439–470 (1971).
5 Kushner, H. J., and Yin, G. G., Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York (1997).
6 Spall, J. C., and Chin, D. C., "Traffic-Responsive Signal Timing for System-Wide Traffic Control," Transp. Res., Part C 5, 153–163 (1997).
7 Chin, D. C., Spall, J. C., and Smith, R. H., Evaluation and Practical Considerations for the S-TRAC System-Wide Traffic Signal Controller, Transportation Research Board 77th Annual Meeting, Preprint 98-1230 (1998).
8 Heydon, B. D., Hill, S. D., and Havermans, C. C., "Maximizing Target Damage Through Optimal Aim Point Patterning," in Proc. AIAA Conf. on Missile Sciences, Monterey, CA (1998).
9 Hill, S. D., and Heydon, B. D., Optimal Aim Point Patterning, Report SSD/PM-97-0448, JHU/APL, Laurel, MD (25 Jul 1997).
10 Chin, D. C., and Srinivasan, R., "Electrical Conductivity Object Locator," in Proc. Forum '97—A Global Conf. on Unexploded Ordnance, Nashville, TN, pp. 50–57 (1998).
11 Hill, S. D., and Fu, M. C., "Transfer Optimization via Simultaneous Perturbation Stochastic Approximation," in Proc. Winter Simulation Conf., C. Alexopoulos, K. Kang, W. R. Lilegdon, and D. Goldsman (eds.), pp. 242–249 (1995).
12 Fu, M. C., and Hill, S. D., "Optimization of Discrete Event Systems via Simultaneous Perturbation Stochastic Approximation," Trans. Inst. Industr. Eng. 29, 233–243 (1997).
13 Hopkins, H. S., Experimental Measurement of a 4-D Phase Space Map of a Heavy Ion Beam, Ph.D. thesis, Dept. of Nuclear Engineering, University of California—Berkeley (Dec 1997).
14 Rezayat, F., "On the Use of an SPSA-Based Model-Free Controller in Quality Improvement," Automatica 31, 913–915 (1995).
15 Maeda, Y., Hirano, H., and Kanata, Y., "A Learning Rule of Neural Networks via Simultaneous Perturbation and Its Hardware Implementation," Neural Networks 8, 251–259 (1995).
16 Kleinman, N. L., Hill, S. D., and Ilenda, V. A., "SPSA/SIMMOD Optimization of Air Traffic Delay Cost," in Proc. American Control Conf., pp. 1121–1125 (1997).
17 Cauwenberghs, G., Analog VLSI Autonomous Systems for Learning and Optimization, Ph.D. thesis, California Institute of Technology (1994).
18 Spall, J. C., and Cristion, J. A., "Nonlinear Adaptive Control Using Neural Networks: Estimation Based on a Smoothed Form of Simultaneous Perturbation Gradient Approximation," Stat. Sinica 4, 1–27 (1994).
19 Spall, J. C., and Cristion, J. A., "A Neural Network Controller for Systems with Unmodeled Dynamics with Applications to Wastewater Treatment," IEEE Trans. Syst., Man, Cybernetics—B 27, 369–375 (1997).
20 Maeda, Y., and De Figueiredo, R. J. P., "Learning Rules for Neuro-Controller via Simultaneous Perturbation," IEEE Trans. Neural Networks 8, 1119–1130 (1997).
21 Gerencsér, L., "The Use of the SPSA Method in ECG Analysis," IEEE Trans. Biomed. Eng. (in press, 1998).
22 Luman, R. R., Quantitative Decision Support for Upgrading Complex Systems of Systems, Ph.D. thesis, School of Engineering and Applied Science, George Washington University (1997).
23 Alessandri, A., and Parisini, T., "Nonlinear Modelling of Complex Large-Scale Plants Using Neural Networks and Stochastic Approximation," IEEE Trans. Syst., Man, Cybernetics—A 27, 750–757 (1997).
24 Nechyba, M. C., and Xu, Y., "Human-Control Strategy: Abstraction, Verification, and Replication," IEEE Control Syst. Magazine 17(5), 48–61 (1997).
25 Sadegh, P., and Spall, J. C., "Optimal Sensor Configuration for Complex Systems," in Proc. American Control Conf., Philadelphia, PA, pp. 3575–3579 (1998).
26 Chin, D. C., "The Simultaneous Perturbation Method for Processing Magnetospheric Images," Opt. Eng. (in press, 1999).
27 Spall, J. C., "A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer-Wolfowitz Setting," in Proc. IEEE Conf. on Decision and Control, pp. 1544–1548 (1988).
28 Spall, J. C., "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," IEEE Trans. Autom. Control 37, 332–341 (1992).
29 Chin, D. C., "Comparative Study of Stochastic Algorithms for System Optimization Based on Gradient Approximations," IEEE Trans. Syst., Man, and Cybernetics—B 27, 244–249 (1997).
30 Spall, J. C., "Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst. 34(3), 817–823 (1998).
31 Spall, J. C., Adaptive Simultaneous Perturbation Method for Accelerated Optimization, Memo PSA-98-017, JHU/APL, Laurel, MD (1998).
32 Pflug, G. Ch., Optimization of Stochastic Models: The Interface Between Simulation and Optimization, Kluwer Academic, Boston (1996).
33 Sadegh, P., and Spall, J. C., "Optimal Random Perturbations for Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," in Proc. American Control Conf., pp. 3582–3586 (1997).
34 Spall, J. C., and Cristion, J. A., "Model-Free Control of Nonlinear Stochastic Systems with Discrete-Time Measurements," IEEE Trans. Autom. Control 43, 1198–1210 (1998).
35 Chin, D. C., "A More Efficient Global Optimization Algorithm Based on Styblinski and Tang," Neural Networks 7, 573–574 (1994).
36 Sadegh, P., "Constrained Optimization via Stochastic Approximation with a Simultaneous Perturbation Gradient Approximation," Automatica 33, 889–892 (1997).
37 Spall, J. C., "A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation," Automatica 33, 109–112 (1997).
38 Spall, J. C., "Accelerated Second-Order Stochastic Optimization Using Only Function Measurements," in Proc. 36th IEEE Conf. on Decision and Control, pp. 1417–1424 (1997).

ACKNOWLEDGMENTS: This work was partially supported by U.S. Navy contract N00024-98-D-8124 and the APL Independent Research and Development (IR&D) Program. Parts of this article have benefited from helpful comments of Gert Cauwenberghs, Vaclav Fabian, Michael Fu, John L. Maryak, Boris T. Polyak, M. A. Styblinski, and Sid Yakowitz. I would also like to thank Larry Levy, the Strategic Systems Department, and the IR&D Committee for providing an environment in which initially "high risk" concepts like this algorithm can be developed into broadly useful methods.

THE AUTHOR

JAMES C. SPALL joined APL's Strategic Systems Department in 1983 after receiving an S.M. from M.I.T. and a Ph.D. from the University of Virginia. He was
appointed to the Principal Professional Staff in 1991. He also teaches in the JHU
Whiting School of Engineering. Dr. Spall has published many articles in the areas
of statistics and control and holds two U.S. patents. In 1990, he received the Hart
Prize as principal investigator of an outstanding Independent Research and
Development project at APL, and in 1997 he was the Chairman of the 4th
Symposium on Research and Development at APL. He is an Associate Editor for
the IEEE Transactions on Automatic Control and a Contributing Editor for the
Current Index to Statistics. He also was the editor and co-author for the book
Bayesian Analysis of Time Series and Dynamic Models (Marcel Dekker), and is the
author of the forthcoming book Introduction to Stochastic Search and Optimization
(Wiley). Dr. Spall is a senior member of IEEE, a member of the American
Statistical Association, and a fellow of the engineering honor society Tau Beta Pi.
His e-mail address is [email protected].

