An Overview of The Simultaneous Perturbation Method For Efficient Optimization
J. C. Spall
482 JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 19, NUMBER 4 (1998)
The mathematical representation of most optimization problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in the objective function. Figure 1 depicts this process for a very simple case of only two variables, u1 and u2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective function). Most real-world problems would have many more variables. The illustration in Fig. 1 is typical of a stochastic optimization setting with noisy input information because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm). Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gradient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measurements. Rather, these algorithms are based on an approximation to the gradient formed from (generally noisy) measurements of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems. This article focuses on the case where such a gradient approximation is used because direct gradient information is not readily available.

[Figure 1. Example of stochastic optimization algorithm minimizing loss function L(u1, u2). The height of the vertical line after each optimization step denotes the loss function value as the algorithm moves from the initial guess toward the solution.]

Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins–Monro1 stochastic approximation (R-M SA)] while requiring only loss function measurements. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient-based algorithms. Such a relationship can be notoriously difficult to develop in some areas (e.g., nonlinear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statistical parameter estimation), there may be large computational savings in calculating a loss function relative to that required in calculating a gradient.

Let us elaborate on the distinction between algorithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function. The prototype gradient-based algorithm is R-M SA, which may be considered a generalization of such techniques as deterministic steepest descent and Newton–Raphson, neural network backpropagation, and infinitesimal perturbation analysis–based optimization for discrete-event systems. The gradient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input–output relationships to calculate the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input–output relationships. The classical method for gradient-free stochastic optimization is the Kiefer–Wolfowitz finite-difference SA (FDSA) algorithm.2

Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss function–based gradient approximations when speed is measured in number of iterations. Intuitively, this result is not surprising given
the additional information required for the gradient-based algorithms. In particular, on the basis of asymptotic theory, the optimal rate of convergence measured in terms of the deviation of the parameter estimate from the true optimal parameter vector is of order k^(–1/2) for the gradient-based algorithms and of order k^(–1/3) for the algorithms based on gradient approximations, where k represents the number of iterations. (Special cases exist where the maximum rate of convergence for a nongradient algorithm is arbitrarily close to, or equal to, k^(–1/2).)

In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input–output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be representative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and, with simulation-based optimization such as infinitesimal perturbation analysis, requires only one system run per iteration, whereas the approximation-based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is conveniently and reliably available, it is generally to one's advantage to use this information in the optimization process. The focus in this article is the case where such information is not readily available.

The next section describes SPSA and the related FDSA algorithm. Then some of the theory associated with the convergence and efficiency of SPSA is summarized. The following section is an illustration of the implications of the theory in an example related to neural networks. Then practical guidelines for implementation are presented, followed by a summary of some ancillary results and some extensions of the algorithm. Not covered here are global optimization methods such as genetic algorithms and simulated annealing; Spall3 presents some discussion of such methods in the context of stochastic approximation.

FDSA AND SPSA ALGORITHMS

This article considers the problem of minimizing a (scalar) differentiable loss function L(u), where u is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing u* such that ∂L/∂u = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L(u) are available at various values of u. These measurements may or may not include added noise. No direct measurements of ∂L/∂u are assumed available, in contrast to the R-M framework. This section will describe the FDSA and SPSA algorithms. Although the emphasis of this article is SPSA, the FDSA discussion is included for comparison because FDSA is a classical method for stochastic optimization.

The SPSA and FDSA procedures are in the general recursive SA form:

    ûk+1 = ûk – ak ĝk(ûk) ,    (1)

where ĝk(ûk) is the estimate of the gradient g(u) ≡ ∂L/∂u at the iterate ûk based on the previously mentioned measurements of the loss function. Under appropriate conditions, the iteration in Eq. 1 will converge to u* in some stochastic sense (usually "almost surely") (see, e.g., Fabian4 or Kushner and Yin5).

The essential part of Eq. 1 is the gradient approximation ĝk(ûk). We discuss the two forms of interest here. Let y(·) denote a measurement of L(·) at a design level represented by the dot (i.e., y(·) = L(·) + noise) and ck be some (usually small) positive number. One-sided gradient approximations involve measurements y(ûk) and y(ûk + perturbation), whereas two-sided gradient approximations involve measurements of the form y(ûk ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference and simultaneous perturbation, respectively, which are discussed in the following paragraphs.

For the finite-difference approximation, each component of ûk is perturbed one at a time, and corresponding measurements y(·) are obtained. Each component of the gradient estimate is formed by differencing the corresponding y(·) values and then dividing by a difference interval. This is the standard approach to approximating gradient vectors and is motivated directly from the definition of a gradient as a vector of p partial derivatives, each constructed as the limit of the ratio of a change in the function value over a corresponding change in one component of the argument vector. Typically, the ith component of ĝk(ûk) (i = 1, 2, …, p) for a two-sided finite-difference approximation is given by

    ĝki(ûk) = [y(ûk + ck ei) – y(ûk – ck ei)] / (2ck) ,    (2)

where ei denotes a p-dimensional vector with a 1 in the ith place and 0's elsewhere.
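The two-sided finite-difference construction just described can be sketched in a few lines (a generic illustration, not code from the article; the function and variable names are mine):

```python
import numpy as np

def fd_gradient(y, u, ck):
    """Two-sided finite-difference gradient estimate.

    y  -- callable returning a (possibly noisy) loss measurement y(u)
    u  -- current parameter estimate (p-dimensional vector)
    ck -- difference interval (a small positive number)

    Perturbs each of the p components of u one at a time, so the
    full estimate costs 2p loss measurements.
    """
    p = len(u)
    g = np.zeros(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0  # unit vector in the ith coordinate
        g[i] = (y(u + ck * e_i) - y(u - ck * e_i)) / (2 * ck)
    return g
```

With noise-free measurements of a quadratic loss this recovers the gradient exactly; with noisy measurements it yields the estimate used in the Kiefer–Wolfowitz FDSA recursion.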
Figure 2. Overall system-wide traffic control concept. Traffic control center provides timing information to signals in traffic network;
information on traffic flow is fed back to traffic control center.
damage (damage to sensitive locations not directly associated with the military mission, e.g., schools and hospitals). The projectiles have inherent random variation and may be statistically dependent. So the targeter must allow for the statistical variation between the aim point and actual impact point in developing a strategy for determining aim points. In such cases it is desirable to use patterning of multiple projectiles. "Patterning" in this context means aiming the projectiles at a set of points that may not overlap each other or be within the target boundaries. Figure 3 illustrates the concept for one target and five projectiles; a bias away from the "stay-out zone" (e.g., a civilian facility) is apparent in this case where it is desired to destroy the target while minimizing the chances of producing damage to the stay-out zone. For scenarios with many projectiles that are independently targeted, the damage function—which must be evaluated to find the best aim point pattern—is likely to be analytically unavailable and will require estimation by Monte Carlo simulation. In particular, to evaluate the effectiveness of a given set of aim points, it is necessary to run one or more simulations (recall that there is random variation in the outcome of one "volley" of projectiles corresponding to one simulation). Commonly used techniques for solving the optimization problem by simulation are computationally intensive and prohibitively time-consuming since the damage function for many different sets of aim points must be evaluated (i.e., many simulations must be run). The SPSA method provides an efficient means of solving this multivariate problem, which for planar targets has a dimension of p = 2 × [no. of projectiles] (so p = 10 in the small-scale example of Fig. 3). SPSA works by varying all of the aim point coordinates simultaneously and running a simulation in the process of producing the gradient approximation for the optimization process. This procedure is repeated as the iteration for the optimization proceeds. This method contrasts significantly with conventional methods where one would vary only one of the coordinates for one of the aim points prior to running a simulation, repeating that process as each coordinate for each aim point was changed at a specified nominal set of aim points to construct a gradient approximation at the
[Figure graphic: schematic with labels "Power source," "Object," "Stay-out zone," and "Target," with downrange positions marked from 0 to 300 cm.]
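The contrast drawn in the preceding aim-point example, varying all coordinates at once rather than one coordinate per simulation run, is the simultaneous perturbation gradient approximation. A generic sketch follows (the damage simulation is replaced by an arbitrary loss callable; the names are illustrative, not from the article):

```python
import numpy as np

def sp_gradient(y, u, ck, rng):
    """Simultaneous perturbation gradient estimate.

    All p components of u are varied at once with a random +-1
    (symmetric Bernoulli) perturbation vector, so only two loss
    measurements (e.g., two simulation runs) are needed per
    gradient estimate, regardless of the dimension p.
    """
    delta = rng.choice([-1.0, 1.0], size=len(u))  # simultaneous perturbation
    y_plus = y(u + ck * delta)    # one measurement at the "+" pattern
    y_minus = y(u - ck * delta)   # one measurement at the "-" pattern
    return (y_plus - y_minus) / (2 * ck * delta)  # elementwise division by delta
```

Each estimate costs 2 measurements versus the 2p of the finite-difference form; for the five-projectile example (p = 10) that is a tenfold saving per gradient approximation.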
iterating from an initial guess of the optimal u, where the iteration process depends on the aforementioned simultaneous perturbation approximation to the gradient g(u).

Spall27,28 presents sufficient conditions for convergence of the SPSA iterate (ûk → u* in the stochastic "almost sure" sense) using a differential equation approach well known in general SA theory (e.g., Kushner and Yin5). To establish convergence, conditions are imposed on both gain sequences (ak and ck), the user-specified distribution of Δk, and the statistical relationship of Δk to the measurements y(·). We will not repeat the conditions here since they are available in Spall.27,28 The essence of the main conditions is that ak and ck both go to 0 at rates neither too fast nor too slow, that L(u) is sufficiently smooth (several times differentiable) near u*, and that the {Δki} are independent and symmetrically distributed about 0 with finite inverse moments E(|Δki|^(–1)) for all k, i. One particular distribution for Δki that satisfies these latter conditions is the symmetric Bernoulli ±1 distribution; two common distributions that do not satisfy the conditions (in particular, the critical finite inverse moment condition) are the uniform and normal distributions.

In addition to establishing the formal convergence of SPSA, Spall (Ref. 28, Sect. 4) shows that the probability distribution of an appropriately scaled ûk is approximately normal (with a specified mean and covariance matrix) for large k. This asymptotic normality result, together with a parallel result for FDSA, can be used to study the relative efficiency of SPSA. This efficiency is the major theoretical result justifying the use of SPSA. The efficiency depends on the shape of L(u), the values for {ak} and {ck}, and the distributions of the {Δki} and measurement noise terms. There is no single expression that can be used to characterize the relative efficiency; however, as discussed in Spall (Ref. 28, Sect. 4) and Chin,29 in most practical problems SPSA will be asymptotically more efficient than FDSA. For example, if ak and ck are chosen as in the guidelines of Spall,30 then by equating the asymptotic mean-squared error E(||ûk – u*||²) in SPSA and FDSA, we find

    (No. of measurements of L(u) in SPSA) / (No. of measurements of L(u) in FDSA) → 1/p    (4)

as the number of loss measurements in both procedures gets large. Hence, Expression 4 implies that the p-fold savings per iteration (gradient approximation) translates directly into a p-fold savings in the overall optimization process despite the complex nonlinear ways in which the sequence of gradient approximations manifests itself in the ultimate solution ûk.

Relative to implementation in a practical problem, another way of looking at Expression 4 is that:

    One properly chosen simultaneous random change in all the variables in a problem provides as much information for optimization as a full set of one-at-a-time changes of each variable.

This surprising and significant result seems to run counter to all that one learns in engineering and scientific training. It is the qualifier "for optimization" that is critical to the validity of the statement.

Let us provide some informal mathematical rationale for this key result. Figure 5 provides an example of a two-variable problem, where the level curves show points of equal value in the loss function. In a low- or no-noise setting, the FDSA algorithm will behave similarly to a traditional gradient descent algorithm in taking steps that provide the locally greatest reduction in the loss function. A standard result in calculus shows that this "steepest descent" direction is perpendicular to the level curve at that point, as shown in the steps for the FDSA algorithm of Fig. 5 (each straight segment is perpendicular to the level curve at the origin of the segment). Hence, the FDSA algorithm is behaving much as an aggressive skier might act in descending a hill by going in small segments that provide the steepest drop from the start of each segment. SPSA, on the other hand, with its random search direction, does not follow the path of locally steepest descent. On average, though, it will nearly follow the steepest descent path because the gradient approximation is an almost unbiased estimator of the gradient (i.e., E[ĝk(u)] = g(u) + small bias, where the small bias is proportional to ck², and ck is the small number mentioned earlier). Over the course of many iterations, the errors associated with the "misdirections" in SPSA will average out in a manner analogous to the way random errors cancel out in forming the sample mean of almost any random process (the ak sequence in Eq. 1 governs this averaging). Figure 5 shows this effect at work in the way the SPSA search direction tends to "bounce around" the FDSA search direction, while ultimately settling down near the solution in the same number of steps. Although this discussion was motivated by the two-variable (p = 2) problem with no- or low-noise loss function measurements (so that the FDSA algorithm behaves very nearly like a true gradient descent algorithm), the same essential intuition applies in higher-dimension settings and noisier loss measurements. Noisy loss measurements imply that the FDSA algorithm will also not closely track a gradient descent algorithm as in Fig. 5; however, the relationship between SPSA and FDSA (which is what Expression 4 pertains to) will still be governed by the idea of averaging out the errors in directions over a large number of iterations.
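Putting Eq. 1, the simultaneous perturbation gradient estimate, and decaying gain sequences together gives a complete, if minimal, SPSA routine. The sketch below is illustrative rather than code from the article: the gain forms ak = a/(k + 1 + A)^0.602 and ck = c/(k + 1)^0.101 follow the practical guidelines of Spall,30 while the specific constants and the synthetic noisy quadratic loss are assumptions made for this toy problem:

```python
import numpy as np

def spsa_minimize(y, u0, n_iter, a=0.1, c=0.1, A=100.0,
                  alpha=0.602, gamma=0.101, seed=1):
    """Minimize a noisy loss y(u) with basic SPSA (the recursion of Eq. 1)."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1 + A) ** alpha                  # step-size gain, decays to 0
        ck = c / (k + 1) ** gamma                      # difference interval, decays to 0
        delta = rng.choice([-1.0, 1.0], size=u.size)   # symmetric Bernoulli +-1
        ghat = (y(u + ck * delta) - y(u - ck * delta)) / (2 * ck * delta)
        u = u - ak * ghat                              # Eq. 1
    return u

# Toy problem: noisy measurements of a quadratic loss with minimum at u_star
u_star = np.array([1.0, -1.0, 0.5, 2.0, 0.0])
noise = np.random.default_rng(2)
y = lambda u: float(np.sum((u - u_star) ** 2)) + 0.01 * noise.standard_normal()
u_hat = spsa_minimize(y, np.zeros(5), n_iter=2000)
```

With only two loss measurements per iteration, the iterate settles near u* despite the measurement noise; an FDSA run of the same length would need 2p = 10 measurements per iteration.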
control over this choice and since the generation of Δk represents a trivial cost toward the optimization, it may be worth evaluating other possibilities in some applications. For example, Maeda and De Figueiredo20 used a symmetric two-part uniform distribution, i.e., a uniform distribution with a section removed near 0 [to preserve the finiteness of inverse moments], in an application for robot control.)

Some extensions to the basic SPSA algorithm are reported in the literature. For example, its use in feedback control problems, where the loss function changes with time, is given in Spall and Cristion.18,19,34 Reference 34 is the most complete methodological and theoretical treatment. Reference 18 also reports on a gradient smoothing idea (analogous to "momentum" in the neural network literature) that may help reduce noise effects and enhance convergence (and also gives guidelines for how the smoothing should be reduced over time to ensure convergence). Alternatively, it is possible to average several simultaneous perturbation gradient approximations at each iteration to reduce noise effects (at the cost of additional function measurements); this is discussed in Spall.28 An implementation of SPSA for global minimization is discussed in Chin35 (i.e., the case where there are multiple minimums at which g(u) = 0); this approach is based on a step-wise (slowly decaying) sequence ck (and possibly ak). The problem of constrained (equality and inequality) optimization with SPSA is considered in Sadegh36 and Fu and Hill12 using a projection approach. A one-measurement form of the simultaneous perturbation gradient approximation is considered in Spall37; although it is shown in Ref. 37 that the standard two-measurement form will usually be more efficient (in terms of total number of loss function measurements to obtain a given level of accuracy in the u iterate), there are advantages to the one-measurement form in real-time operations where the underlying system dynamics may change too rapidly to get a credible gradient estimate with two successive measurements.

An "accelerated" form of SPSA is reported in Spall.31,38 This approach extends the SPSA algorithm to include second-order (Hessian) effects with the aim of accelerating convergence in a stochastic analogue to the deterministic Newton–Raphson algorithm. Like the standard (first-order) SPSA algorithm, this second-order algorithm is simple to implement and requires only a small number—independent of p—of loss function measurements per iteration (no gradient measurements, as in standard SPSA). In particular, only four measurements are required to estimate the loss-function gradient and inverse Hessian at each iteration (and one additional measurement is sometimes recommended as a check on algorithm behavior). The algorithm is implemented with two simple parallel recursions: one for u and one for the Hessian matrix of L(u). The recursion for u is a stochastic analogue of the well-known Newton–Raphson algorithm of deterministic optimization. The recursion for the Hessian matrix is simply a recursive calculation of the sample mean of per-iteration Hessian estimates formed using SP-type ideas.

CONCLUSION

Relative to standard deterministic methods, stochastic optimization considerably broadens the range of practical problems for which one can find rigorous optimal solutions. Algorithms of the stochastic optimization type allow for the effective treatment of problems in areas such as network analysis, simulation-based optimization, pattern recognition and classification, neural network training, image processing, and nonlinear control. It is expected that the role of stochastic optimization will continue to grow as modern systems increase in complexity and as population growth and dwindling natural resources force trade-offs that were previously unnecessary.

The SPSA algorithm has proven to be an effective stochastic optimization method. Its primary virtues are ease of implementation, no need for measurements of the loss function gradient, theoretical and experimental support for relative efficiency, robustness to noise in the loss measurements, and empirical evidence of an ability to find a global minimum when multiple (local and global) minima exist. SPSA is primarily limited to continuous-variable problems and, relative to other methods, is most effective when the loss function measurements include added noise. Numerical comparisons with techniques such as the finite-difference method, simulated annealing, genetic algorithms, and random search have supported the claims of SPSA's effectiveness in a wide range of practical problems. The rapidly growing number of applications throughout the world provides further evidence of the algorithm's effectiveness. To add to this effectiveness, there have been some extensions of the basic idea, including a stochastic analogue of the fast deterministic Newton–Raphson (second-order) algorithm, adaptations for real-time (control) implementations, and versions for some types of constrained and global optimization problems. Although much work continues in extending the basic algorithm to a broader range of real-world settings, SPSA addresses a wide range of difficult problems and should likely be considered for many of the stochastic optimization challenges encountered in practice.

REFERENCES

1. Robbins, H., and Monro, S., "A Stochastic Approximation Method," Ann. Math. Stat. 22, 400–407 (1951).
2. Kiefer, J., and Wolfowitz, J., "Stochastic Estimation of the Maximum of a Regression Function," Ann. Math. Stat. 23, 462–466 (1952).
3. Spall, J. C., "Stochastic Optimization, Stochastic Approximation, and Simulated Annealing," in Encyclopedia of Electrical and Electronics Engineering, J. G. Webster (ed.), Wiley, New York (in press, 1999).
4. Fabian, V., "Stochastic Approximation," in Optimizing Methods in Statistics, J. S. Rustagi (ed.), Academic Press, New York, pp. 439–470 (1971).
5. Kushner, H. J., and Yin, G. G., Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York (1997).
6. Spall, J. C., and Chin, D. C., "Traffic-Responsive Signal Timing for System-Wide Traffic Control," Transp. Res., Part C 5, 153–163 (1997).
7. Chin, D. C., Spall, J. C., and Smith, R. H., Evaluation and Practical Considerations for the S-TRAC System-Wide Traffic Signal Controller, Transportation Research Board 77th Annual Meeting, Preprint 98-1230 (1998).
8. Heydon, B. D., Hill, S. D., and Havermans, C. C., "Maximizing Target Damage Through Optimal Aim Point Patterning," in Proc. AIAA Conf. on Missile Sciences, Monterey, CA (1998).
9. Hill, S. D., and Heydon, B. D., Optimal Aim Point Patterning, Report SSD/PM-97-0448, JHU/APL, Laurel, MD (25 Jul 1997).
10. Chin, D. C., and Srinivasan, R., "Electrical Conductivity Object Locator," in Proc. Forum '97—A Global Conf. on Unexploded Ordnance, Nashville, TN, pp. 50–57 (1998).
11. Hill, S. D., and Fu, M. C., "Transfer Optimization via Simultaneous Perturbation Stochastic Approximation," in Proc. Winter Simulation Conf., C. Alexopoulos, K. Kang, W. R. Lilegdon, and D. Goldsman (eds.), pp. 242–249 (1995).
12. Fu, M. C., and Hill, S. D., "Optimization of Discrete Event Systems via Simultaneous Perturbation Stochastic Approximation," Trans. Inst. Industr. Eng. 29, 233–243 (1997).
13. Hopkins, H. S., Experimental Measurement of a 4-D Phase Space Map of a Heavy Ion Beam, Ph.D. thesis, Dept. of Nuclear Engineering, University of California—Berkeley (Dec 1997).
14. Rezayat, F., "On the Use of an SPSA-Based Model-Free Controller in Quality Improvement," Automatica 31, 913–915 (1995).
15. Maeda, Y., Hirano, H., and Kanata, Y., "A Learning Rule of Neural Networks via Simultaneous Perturbation and Its Hardware Implementation," Neural Networks 8, 251–259 (1995).
16. Kleinman, N. L., Hill, S. D., and Ilenda, V. A., "SPSA/SIMMOD Optimization of Air Traffic Delay Cost," in Proc. American Control Conf., pp. 1121–1125 (1997).
17. Cauwenberghs, G., Analog VLSI Autonomous Systems for Learning and Optimization, Ph.D. thesis, California Institute of Technology (1994).
18. Spall, J. C., and Cristion, J. A., "Nonlinear Adaptive Control Using Neural Networks: Estimation Based on a Smoothed Form of Simultaneous Perturbation Gradient Approximation," Stat. Sinica 4, 1–27 (1994).
19. Spall, J. C., and Cristion, J. A., "A Neural Network Controller for Systems with Unmodeled Dynamics with Applications to Wastewater Treatment," IEEE Trans. Syst., Man, Cybernetics—B 27, 369–375 (1997).
20. Maeda, Y., and De Figueiredo, R. J. P., "Learning Rules for Neuro-Controller via Simultaneous Perturbation," IEEE Trans. Neural Networks 8, 1119–1130 (1997).
21. Gerencsér, L., "The Use of the SPSA Method in ECG Analysis," IEEE Trans. Biomed. Eng. (in press, 1998).
22. Luman, R. R., Quantitative Decision Support for Upgrading Complex Systems of Systems, Ph.D. thesis, School of Engineering and Applied Science, George Washington University (1997).
23. Alessandri, A., and Parisini, T., "Nonlinear Modelling of Complex Large-Scale Plants Using Neural Networks and Stochastic Approximation," IEEE Trans. Syst., Man, Cybernetics—A 27, 750–757 (1997).
24. Nechyba, M. C., and Xu, Y., "Human-Control Strategy: Abstraction, Verification, and Replication," IEEE Control Syst. Magazine 17(5), 48–61 (1997).
25. Sadegh, P., and Spall, J. C., "Optimal Sensor Configuration for Complex Systems," in Proc. Amer. Control Conf., Philadelphia, PA, pp. 3575–3579 (1998).
26. Chin, D. C., "The Simultaneous Perturbation Method for Processing Magnetospheric Images," Opt. Eng. (in press, 1999).
27. Spall, J. C., "A Stochastic Approximation Algorithm for Large-Dimensional Systems in the Kiefer–Wolfowitz Setting," in Proc. IEEE Conf. on Decision and Control, pp. 1544–1548 (1988).
28. Spall, J. C., "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," IEEE Trans. Autom. Control 37, 332–341 (1992).
29. Chin, D. C., "Comparative Study of Stochastic Algorithms for System Optimization Based on Gradient Approximations," IEEE Trans. Syst., Man, Cybernetics—B 27, 244–249 (1997).
30. Spall, J. C., "Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization," IEEE Trans. Aerosp. Electron. Syst. 34(3), 817–823 (1998).
31. Spall, J. C., Adaptive Simultaneous Perturbation Method for Accelerated Optimization, Memo PSA-98-017, JHU/APL, Laurel, MD (1998).
32. Pflug, G. Ch., Optimization of Stochastic Models: The Interface Between Simulation and Optimization, Kluwer Academic, Boston (1996).
33. Sadegh, P., and Spall, J. C., "Optimal Random Perturbations for Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," in Proc. American Control Conf., pp. 3582–3586 (1997).
34. Spall, J. C., and Cristion, J. A., "Model-Free Control of Nonlinear Stochastic Systems with Discrete-Time Measurements," IEEE Trans. Autom. Control 43, 1198–1210 (1998).
35. Chin, D. C., "A More Efficient Global Optimization Algorithm Based on Styblinski and Tang," Neural Networks 7, 573–574 (1994).
36. Sadegh, P., "Constrained Optimization via Stochastic Approximation with a Simultaneous Perturbation Gradient Approximation," Automatica 33, 889–892 (1997).
37. Spall, J. C., "A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation," Automatica 33, 109–112 (1997).
38. Spall, J. C., "Accelerated Second-Order Stochastic Optimization Using Only Function Measurements," in Proc. 36th IEEE Conf. on Decision and Control, pp. 1417–1424 (1997).

ACKNOWLEDGMENTS: This work was partially supported by U.S. Navy contract N00024-98-D-8124 and the APL Independent Research and Development (IR&D) Program. Parts of this article have benefited from helpful comments of Gert Cauwenberghs, Vaclav Fabian, Michael Fu, John L. Maryak, Boris T. Polyak, M. A. Styblinski, and Sid Yakowitz. I would also like to thank Larry Levy, the Strategic Systems Department, and the IR&D Committee for providing an environment in which initially "high risk" concepts like this algorithm can be developed into broadly useful methods.