A Rapid Learning and Dynamic Stepwise Updating Algorithm for Flat Neural Networks and the Application to Time-Series Prediction

C. L. Philip Chen and John Z. Wan
Abstract—A fast learning algorithm is proposed to find the optimal weights of flat neural networks (especially, the functional-link network). Although flat networks are used for nonlinear function approximation, they can be formulated as linear systems. Thus, the weights of the networks can be solved easily using a linear least-squares method. This formulation makes it easy to update the weights instantly for both a newly added pattern and a newly added enhancement node. A dynamic stepwise updating algorithm is proposed to update the weights of the system on-the-fly. The model is tested on several time-series data sets, including an infrared laser data set, a chaotic time-series, a monthly flour price data set, and a nonlinear system identification problem. The simulation results are compared to existing models that require more complex architectures and more costly training. The results indicate that the proposed model is very attractive for real-time processes.

Manuscript received March 17, 1996; revised September 9, 1996 and July 5, 1997. This work was supported under Air Force Contract F33610-D-5964, Wright Laboratory, Wright-Patterson AFB, OH, under Grant N00014-92-J-4096 from ONR, and under Grant F49620-94-0277 from the Air Force Office of Scientific Research.
C. L. P. Chen is with the Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435 USA. He is also with MLIM, Materials Directorate, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433 USA (e-mail: [email protected]).
J. Z. Wan is with Lexis-Nexis Data Central, Dayton, OH 45343 USA.
Publisher Item Identifier S 1083-4419(99)00899-7.

I. INTRODUCTION

FEEDFORWARD artificial neural networks have been a popular research subject recently. The research topics vary from theoretical views of learning algorithms, such as the learning and generalization properties of the networks, to a variety of applications in control, classification, biomedicine, manufacturing, business forecasting, etc. The backpropagation (BP) supervised learning algorithm is one of the most popular learning algorithms developed for layered networks [1], [2]. Improving the learning speed of BP and increasing the generalization capability of the networks have played a central role in neural network research [3]–[9]. Apart from multilayer network architectures and the BP algorithm, various simplified architectures or different nonlinear activation functions have been devised. Among those, the so-called flat networks, including the functional-link neural network and the radial basis function network, have been proposed [10]–[15]. These flat networks remove the drawback of a long learning process, with the advantage of learning only one set of weights. Most importantly, the literature has reported satisfactory generalization capability in function approximation [14]–[16].

This paper proposes a one-step fast learning algorithm and a stepwise update algorithm for flat networks. Although only the functional-link network is used as a prototype here, the proposed algorithms can also be applied to the radial basis function network. The algorithms are developed based on the formulation of the functional-link network as a set of linear system equations. Because the system equations of the radial basis function network have a similar form to those of the functional-link network, and both networks share a similar "flat" architecture, the proposed update algorithm can be applied to the radial basis function network as well. The most significant advantage of the stepwise approach is that the weight connections of the network can be updated easily when a new input is given after the network has been trained: the weights are updated based only on the original weights and the new inputs. The stepwise approach is also able to update the weights instantly when a new neuron is added to the existing network if the desired error criterion cannot be met. With the proposed approach, flat networks become very attractive in terms of learning speed.

The proposed work has been applied to time-series applications, including an infrared laser data set, a chaotic time-series, a monthly flour price data set, and a nonlinear system identification problem. The time-series is modeled by the AR(k) (autoregression with delay k) model. During the training stage, a different number of enhancement nodes may be added as necessary, and the update of the weights is carried out by the proposed algorithm. Contrary to traditional BP learning and multilayer models, the training of this network is fast because of the one-step learning procedure and the dynamic updating algorithm. The proposed work has also been applied to nonlinear system identification problems involving discrete-time single-input, single-output (SISO) and multiple-input, multiple-output (MIMO) plants that can be described by difference equations [16]. The system identification model extends the time-series to more than one dimension, that is, by the addition of the state variables. With the proposed algorithm, the training is easy and fast, and the results are also very promising.

The paper is organized as follows. Section II briefly discusses the concept of the functional-link network and its linear formulation. Sections III and IV introduce the proposed dynamic stepwise update algorithm, followed by the refinement of the model in Section V. Section VI discusses the training procedures. Finally, several examples and conclusions are given.
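The functional-link formulation referred to above treats the network as a linear system in its output weights: the inputs are augmented with randomly weighted nonlinear enhancement nodes, and the output weights are obtained in one step by a least-squares (pseudoinverse) solve. A minimal sketch of this idea, assuming tanh enhancement activations; the function and variable names are illustrative, not the paper's notation:

    import numpy as np

    def build_augmented_matrix(X, n_enh, rng, We=None):
        """Augment inputs with randomly weighted enhancement nodes.

        X     : (P, n) matrix of P input patterns
        n_enh : number of enhancement nodes
        We    : optional fixed enhancement weights, shape (n + 1, n_enh)
        Returns the augmented matrix A = [X | tanh([X, 1] @ We)] and We.
        """
        P, n = X.shape
        Xb = np.hstack([X, np.ones((P, 1))])        # bias term for the enhancement layer
        if We is None:
            We = rng.uniform(-1.0, 1.0, size=(n + 1, n_enh))
        H = np.tanh(Xb @ We)                        # enhancement-node outputs (assumed tanh)
        return np.hstack([X, H]), We

    def one_step_weights(A, Y):
        """One-step least-squares solve for the output weights, W = A^+ Y."""
        return np.linalg.pinv(A) @ Y

    # usage: 200 patterns, 5 inputs, a single output
    rng = np.random.default_rng(0)
    X, Y = rng.standard_normal((200, 5)), rng.standard_normal((200, 1))
    A, We = build_augmented_matrix(X, n_enh=10, rng=rng)
    W = one_step_weights(A, Y)                      # flat-network output weights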
Let $A_k$ denote the augmented (functional-link) input matrix built from $k$ training patterns, with pseudoinverse $A_k^+$, target matrix $Y_k$, and weight matrix $W_k = A_k^+ Y_k$. When a new pattern is entered, a new row $a^T$ is appended, giving $A_{k+1} = [\,A_k\,;\; a^T\,]$, and the pseudoinverse can be updated as

$$ A_{k+1}^+ = [\; A_k^+ - b\,d^T \quad b \;] \qquad (3) $$

where $d^T = a^T A_k^+$, $c = a - A_k^T d$, and

$$ b = \begin{cases} c\,(c^T c)^{-1}, & \text{if } c \neq 0 \quad \text{(a)} \\ (1 + d^T d)^{-1} A_k^+ d, & \text{if } c = 0 \quad \text{(b).} \end{cases} $$

So the pseudoinverse of $A_{k+1}$ can be updated based only on $A_k^+$, $d$, and the newly added row vector $a$, without recomputing the entire new pseudoinverse. Again, the new weights follow directly: let the output vector be partitioned as $Y_{k+1} = [\,Y_k\,;\; y^T\,]$, where $y$ is the target for the new pattern; therefore

$$ W_{k+1} = W_k + b\,(y^T - a^T W_k). $$

Similarly, when a new enhancement node is added, a new column $a$ is appended, giving $A' = [\,A_k \;\; a\,]$, and

$$ (A')^+ = \begin{bmatrix} A_k^+ - d\,b^T \\ b^T \end{bmatrix} \qquad (4) $$

where $d = A_k^+ a$, $c = a - A_k d$, and $b^T = (c^T c)^{-1} c^T$ if $c \neq 0$, or $b^T = (1 + d^T d)^{-1} d^T A_k^+$ if $c = 0$. The new weights are then $W' = [\,W_k - d\,b^T Y_k\,;\; b^T Y_k\,]$.

Fig. 3. Illustration of the stepwise update algorithm.
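A minimal numerical sketch of the added-pattern update in (3), using NumPy only; the names are illustrative:

    import numpy as np

    def add_pattern(A_pinv, W, a, y, A, tol=1e-10):
        """Stepwise update of the pseudoinverse and weights for one new pattern.

        A_pinv : current pseudoinverse A_k^+, shape (n, k)
        W      : current weights, shape (n, m)
        a      : new augmented input row, shape (n,)
        y      : new target row, shape (m,)
        A      : current augmented input matrix A_k, shape (k, n)
        """
        d = A_pinv.T @ a                       # d^T = a^T A_k^+
        c = a - A.T @ d
        if np.linalg.norm(c) > tol:            # case (a): new row outside the row space
            b = c / (c @ c)
        else:                                  # case (b): new row linearly dependent
            b = (A_pinv @ d) / (1.0 + d @ d)
        new_pinv = np.hstack([A_pinv - np.outer(b, d), b[:, None]])
        new_W = W + np.outer(b, y - a @ W)     # W_{k+1} = W_k + b (y^T - a^T W_k)
        new_A = np.vstack([A, a])
        return new_pinv, new_W, new_A

For the QR-based variant discussed next, scipy.linalg.qr_insert updates an existing QR factorization when a row or column is appended, which is the operation that variant relies on.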
An alternative is to maintain a QR decomposition, in which case the updating of the pseudoinverse (and therefore of the weight matrix) involves only multiplication by finitely sparse matrices and backward substitutions. Suppose we have the QR decomposition of $A_k$ and denote $A_k = Q R$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. When a new row or a new column is added, the QR decomposition can be updated based on a finite number of Givens rotations [18]. Denote the updated factorization $A_{k+1} = \tilde{Q} \tilde{R}$, where $\tilde{Q}$ remains orthogonal, $\tilde{R}$ is an upper triangular matrix, and both are obtained through finitely many Givens rotations. The pseudoinverse of $A_{k+1}$ is then $A_{k+1}^+ = \tilde{R}^{-1} \tilde{Q}^T$, where $\tilde{R}^{-1}$ (and eventually the weight matrix) can be computed by backward substitution. This stepwise weight update can thus be summarized as follows: update $Q$ and $R$ by finitely many Givens rotations, then obtain the weights by backward substitution.

IV. TRAINING WITH WEIGHTED LEAST SQUARES

In training and testing the fitness of a model, the error is minimized in the sense of least mean squares, that is, in general

$$ E = \frac{1}{P} \sum_{p=1}^{P} \bigl(y_p - \hat{y}_p\bigr)^2 \qquad (5) $$

where $P$ is the number of patterns, $y_p$ is the actual output, and $\hat{y}_p$ is the network output. In other words, the average difference between the network output and the actual output is minimized over the span of the whole training data. If an overall fit is hard to achieve, it might be reasonable to train the network so that it achieves a better fit for the most recent data. This leads to the so-called weighted least-squares problem. The stepwise updating of the weight matrix based on weighted least squares is derived as follows.

Let $\Lambda_k = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ be the weight factor matrix. Also, let $A_k$ represent the input matrix with $k$ patterns, and let $A_{k+1}$ be $A_k$ with an added new row $a^T$, that is

$$ A_{k+1} = \begin{bmatrix} A_k \\ a^T \end{bmatrix}, \qquad \Lambda_{k+1} = \mathrm{diag}(\Lambda_k, \lambda_{k+1}). \qquad (8) $$
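As a rough illustration of the weighted criterion that the stepwise derivation updates, here is a minimal batch weighted least-squares solve, assuming the input matrix has full column rank and using exponentially decaying weight factors that emphasize recent patterns (an illustrative choice; names are not the paper's notation):

    import numpy as np

    def weighted_ls_weights(A, Y, lam=0.98):
        """Batch weighted least squares: minimize sum_i lam^(k-i) * ||y_i - a_i^T W||^2.

        A   : (k, n) augmented input matrix
        Y   : (k, m) targets
        lam : forgetting factor in (0, 1]; lam = 1 recovers ordinary least squares
        """
        k = A.shape[0]
        w = lam ** np.arange(k - 1, -1, -1)     # oldest pattern weighted least
        Aw = A * w[:, None]                      # row-scaled matrix, i.e., Lambda A
        return np.linalg.solve(A.T @ Aw, A.T @ (Y * w[:, None]))

With lam = 1 this reduces to the ordinary least-squares solution $W = A^+ Y$ used elsewhere in the paper.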
The updating rule for the weight matrix, if $A_{k+1}$ is of full rank (i.e., $c \neq 0$), is given by (12). Equation (12) is exactly the same as the weighted recursive least-squares method [23], in which only the full-rank condition is discussed. However, (11) is more complete because it covers both the $c \neq 0$ and $c = 0$ cases. Thus, the weighted weight matrix can be easily updated based on the current weights and the new observations, without running a complete training cycle, as long as the weighted pseudoinverse is maintained. A similar derivation can be applied to the network with an added neuron.

V. REFINE THE MODEL

Let us take a look again at an input matrix $A$ of size $k \times n$, which represents $k$ observations of $n$ variables. The singular value decomposition of $A$ is

$$ A = U S V^T $$

where $U$ is an orthogonal matrix of the eigenvectors of $A A^T$, $V$ is an orthogonal matrix of the eigenvectors of $A^T A$, and $S$ is a $k \times n$ "diagonal" matrix whose diagonals are the singular values of $A$. That is,

$$ S = \mathrm{diag}(s_1, s_2, \ldots, s_r, 0, \ldots, 0), \qquad s_1 \ge s_2 \ge \cdots \ge s_r > 0 $$

where $r$ is the rank of the matrix $A$. $A^T A$ is the so-called correlation matrix, whose eigenvalues are the squares of the singular values. Small singular values might be the result of noise in the data or of round-off errors in computations. This can lead to very large values of the weights, because the pseudoinverse of $A$ is given by $A^+ = V S^+ U^T$, where

$$ S^+ = \mathrm{diag}(1/s_1, 1/s_2, \ldots, 1/s_r, 0, \ldots, 0). $$

Clearly, small singular values of $A$ will result in very large values of the weights, which will, in turn, amplify any noise in the new data. The same question arises as more and more enhancement nodes are added to the model during training. A possible solution is to round off small singular values to zeros and therefore avoid large values of the weights. If there is a gap among the singular values, it is easy to cut off at the gap. Otherwise, one of the following approaches may work.
1) Set an upper bound on the norm of the weights. This will provide a criterion to cut off small singular values. The result is an optimal solution within a bounded region.
2) Investigate the relation between the cutoff values and the performance of the network in terms of prediction error. If there is a point where the performance is not improved when small singular values are included, it is then reasonable to set a cutoff value corresponding to that point.
The orthogonal least-squares learning approach is another way to generate a set of weights that can avoid the ill-conditioning problem [13]. Furthermore, regularization and cross-validation methods are techniques to avoid both overfitting and generalization problems [24].

VI. TIME-SERIES APPLICATIONS

The literature has discussed time-series forecasting using different neural network models [25], [26]. Here the algorithm proposed above is applied to the forecasting model. Represent the time-series by the AR(k) (autoregression with delay k) model. Suppose $\{x(t)\}$ is a stationary time-series. The AR(k) model can be represented by the following equation:

$$ x(t) = a_1 x(t-1) + a_2 x(t-2) + \cdots + a_k x(t-k) $$

where the $a_i$'s are the autoregression parameters. In terms of a flat neural-network architecture, the AR(k) model can be described as a functional-link network with $k$ input nodes, a number of enhancement nodes, and a single output node. The enhancement nodes artificially increase the dimension of the input space, or the rank of the input data matrix. The network includes $k$ input nodes and a single output node. During the training stage, a variable number of enhancement nodes may be added as necessary. Contrary to the traditional error-backpropagation models, the training of this network is fast because of the one-step learning procedure and the dynamic updating algorithm mentioned above. To improve the performance in some special situations, a weighted least-squares criterion may be used to optimize the weights instead of the ordinary least-squares error.

Using the stepwise updating algorithm, this section discusses the procedure of training the neural network for time-series forecasting. First, the available data on a single time-series are split into a training set and a testing set. Let $x(t+j)$ denote the data $j$ time steps after $x(t)$, and assume that there are $N$ training data points. The training stage proceeds as follows.

Step 1—Construct Input and Output: Build an input matrix of size $(N-k) \times k$, where $k$ is the delay time. The $i$th row consists of $x(i), x(i+1), \ldots, x(i+k-1)$, and the target output vector is built from the corresponding values $x(i+k)$.

Step 2—Obtain the Weight Matrix: Find the pseudoinverse of the input matrix and the weight matrix $W = A^+ Y$. This gives the linear least-squares fit with $k$ lags, or AR(k). Predictions can be produced either single step ahead or as iterated predictions. The network outputs are then compared to the actual continuation of the data using the testing data. The error will be large most of the time, especially when we deal with a nonlinear time-series.

Step 3—Add a New Enhancement Node if the Error is Above the Desired Level: If the error is above the desired level, a new hidden node will be added. The weights from the input nodes to the enhancement node can be randomly generated, but a numerical rank check may be necessary to ensure that the added enhancement node increases the rank of the augmented matrix by one. At this time the pseudoinverse of the new matrix can be updated by using (4).
Fig. 4. (a) Prediction of the time-series 60 steps into the future, (b) network prediction (first 50 points), and (c) network prediction (first 100 points).
Step 4—Stepwise Update the Weight Matrix: After entering a new input pattern into the input matrix (i.e., adding a new row to $A$ and forming $A_{k+1}$), the new weight matrix can be obtained or updated using either (3) or the QR-decomposition algorithm. Then the testing data are applied again to check the error level.

Step 5—Looping for Further Training: Repeat by going to Step 3 until the desired error level is achieved. (A code sketch of the whole procedure is given below.)
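A minimal sketch of Steps 1–5 for a single time-series, under the assumptions stated earlier (tanh enhancement nodes with random input weights); for clarity the pseudoinverse is recomputed at each step here, whereas an actual implementation would apply the stepwise update of Section III. All names are illustrative:

    import numpy as np

    def make_ar_data(x, k):
        """Step 1: rows are [x(i), ..., x(i+k-1)], target is x(i+k)."""
        X = np.array([x[i:i + k] for i in range(len(x) - k)])
        y = x[k:].reshape(-1, 1)
        return X, y

    def train_flat_forecaster(x_train, x_test, k=15, max_nodes=80, tol=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        X, y = make_ar_data(x_train, k)
        Xt, yt = make_ar_data(x_test, k)
        A, At = X, Xt                                   # start from the plain AR(k) model
        enh = []                                        # random enhancement-node weights
        W = np.linalg.pinv(A) @ y                       # Step 2: one-step least squares
        for _ in range(max_nodes):                      # Steps 3-5: grow until error is acceptable
            err = np.mean((At @ W - yt) ** 2)
            if err < tol:
                break
            we = rng.uniform(-1, 1, size=(k + 1, 1))    # Step 3: one new enhancement node
            enh.append(we)
            h = np.tanh(np.hstack([X, np.ones((len(X), 1))]) @ we)
            ht = np.tanh(np.hstack([Xt, np.ones((len(Xt), 1))]) @ we)
            A, At = np.hstack([A, h]), np.hstack([At, ht])
            W = np.linalg.pinv(A) @ y                   # Step 4 (full recompute; see Section III
                                                        # for the stepwise update of A^+ and W)
        return W, enh

The same construction carries over to the system-identification examples by adding past inputs and outputs of the plant (the state variables) as extra columns of the input matrix.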
It is worth noting that having more enhancement nodes does not necessarily mean better performance. In particular, a larger than necessary number of enhancement nodes usually makes the augmented input matrix very ill-conditioned and therefore prone to computational error. Theoretically, the rank of the expanded input matrix will be increased by one, but this is not always the case as observed in practice. Suppose the expanded input matrix has the singular value decomposition $U S V^T$, where $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix whose diagonal entries give the singular values in ascending order. Let the condition number of the matrix be the ratio of the largest singular value to the smallest one. If the small singular values are not rounded off to zeros, the condition number will be huge. In other words, the matrix will be extremely ill-conditioned, and the least-squares solution resulting from the pseudoinverse will be very sensitive to small perturbations, which is not desirable. A possible solution is to cut off any small singular values (and therefore reduce the rank).

If the error is not under the desired level after training, extra input nodes can be produced based on the original input nodes and the enhanced input nodes, where the weights are fixed. This is similar to the idea of the "cascade-correlation" network structure [27], but one-step learning is utilized here, which is much more efficient.
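The singular-value cutoff described above corresponds to the rcond threshold of the standard pseudoinverse routine; a minimal sketch, where the threshold value is an illustrative choice rather than the paper's:

    import numpy as np

    def truncated_pinv_weights(A, Y, rel_cutoff=1e-6):
        """Least-squares weights with small singular values rounded to zero.

        Singular values below rel_cutoff * s_max are discarded, which bounds the
        condition number of the retained spectrum and keeps the weights from blowing up.
        """
        return np.linalg.pinv(A, rcond=rel_cutoff) @ Y

    # Equivalent explicit form, useful for inspecting the cutoff:
    def truncated_pinv_weights_svd(A, Y, rel_cutoff=1e-6):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        keep = s > rel_cutoff * s[0]                    # s is returned in descending order
        S_inv = np.where(keep, 1.0 / s, 0.0)
        return (Vt.T * S_inv) @ (U.T @ Y)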
TABLE I
PREVIOUS RESULTS FOR EXAMPLE 1
VII. EXAMPLES AND DISCUSSION

The proposed time-series forecasting model is tested on several time-series data sets, including an infrared laser data set, a chaotic time-series, a monthly flour price data set, and a nonlinear system identification problem. The following examples not only show the effectiveness of the proposed method but also demonstrate a relatively fast way of forecasting time-series.

The nonlinear system identification of discrete-time SISO and MIMO plants can be described by difference equations [16]. The most common equation for system identification is

$$ y_p(t+1) = f\bigl(y_p(t), \ldots, y_p(t-n+1)\bigr) + g\bigl(u(t), \ldots, u(t-m+1)\bigr) $$

where $(u(t), y_p(t))$ represents the input–output pair of the plant at time $t$, and $f$ and $g$ are differentiable functions. The system identification model extends the input dimension, that is, by the addition of the state variables. The training concept is similar to the one-dimensional (i.e., time) time-series prediction. The proposed algorithm can also be applied to multilag, MIMO systems easily, as shown in Example 4.

Example 1: This is one of the data sets used in a competition of time-series prediction held in 1992 [28]. The training data set contains 1000 points of the fluctuations in a far-infrared laser, as shown in Fig. 4. The goal is to predict the continuation of the time-series beyond the sample data. During the course of the competition, the physical background of the data set was withheld to avoid biasing the final prediction results. Therefore we do not use any information other than the time-series itself to build our network model. To determine the size of the network, first we use a simple linear net as a preliminary fit, i.e., AR(k), where k is the value of the so-called lag. After comparing the single-step error versus the value of k, it is noted that the optimal choice for the lag value lies between 10 and 15. So we use an AR(15) model and add nonlinear enhancement nodes as needed. Training starts with a simple linear network with 15 inputs and one output node. Enhancement nodes are added one at a time and the weights are updated using (4), as described in Section III. After about 80 enhancement nodes are added, the network can perform single-step predictions exceptionally well. Since the goal is to predict multiple steps beyond the training data, iterated prediction is also produced. Fig. 4 shows the 60-step iterated prediction into the future, compared to the actual continuation of the time-series. The whole procedure, including training and producing predictions, took less than 20 s on a DEC Alpha machine, compared to the huge computation with over 1000 parameters to adapt and an overnight training time using the backpropagation training algorithm. To compare the prediction with previous work [28], the normalized mean squared error (NMSE) is defined as

$$ \mathrm{NMSE} = \frac{1}{\hat{\sigma}^2 N} \sum_{t \in T} \bigl(y_t - \hat{y}_t\bigr)^2 $$

where $N$ denotes the number of points in the test set $T$, $\hat{\sigma}^2$ denotes the sample variance of the observed values in $T$, and $y_t$ and $\hat{y}_t$ are the target and predicted values, respectively. A network with 25 lags and 50 enhancement nodes is used for predicting 50 and 100 steps ahead using 1000 data points for training. For the 50-steps-ahead prediction, the NMSE is about $4.15 \times 10^{-}$, and the NMSE for the 100-steps-ahead prediction is about $8.1 \times 10^{-}$. The results are better than those previously reported, shown in Table I, in both speed (hours, days, or weeks in the earlier work) and accuracy [28] (see [28, p. 64, Table II]).

Example 2: A time-series produced by iterating the logistic map

$$ x_{n+1} = a\,x_n (1 - x_n) $$

or, equivalently,

$$ x_{n+1} = a\,x_n - a\,x_n^2 $$

is probably the simplest system capable of displaying deterministic chaos. This first-order difference equation, also known as the Feigenbaum equation, has been extensively studied as a model of biological populations with nonoverlapping generations, where $x_n$ represents the normalized population.
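A small sketch of how such an experiment can be set up, reusing the illustrative names of the earlier sketches: generating logistic-map data (with a = 4, an illustrative choice in the chaotic regime) and scoring iterated predictions with the NMSE defined above:

    import numpy as np

    def logistic_series(n, a=4.0, x0=0.2):
        """Iterate x_{n+1} = a * x_n * (1 - x_n)."""
        x = np.empty(n)
        x[0] = x0
        for i in range(1, n):
            x[i] = a * x[i - 1] * (1.0 - x[i - 1])
        return x

    def nmse(y_true, y_pred):
        """Normalized mean squared error: mean squared error over the sample variance."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

    def iterated_prediction(W, enh, history, k, n_steps):
        """Feed predictions back as inputs to forecast n_steps beyond the history."""
        window = list(history[-k:])
        out = []
        for _ in range(n_steps):
            row = np.array(window[-k:])
            a_row = np.hstack([row] + [np.tanh(np.append(row, 1.0) @ we).ravel() for we in enh])
            y_hat = (a_row @ W).item()
            out.append(y_hat)
            window.append(y_hat)
        return np.array(out)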
Fig. 6. (a) Flour price indexes, (b) network modeling, (c) network prediction (one-lag), (d) network prediction error (one-lag), and (e) iterated prediction (multilag) of flour price indexes of three cities, where the solid lines are the predicted values and the dashed lines are the actual values.
Fig. 7(a) and (b) show the plots of $y_{p1}$ and $y_{p2}$, respectively. The training time is again very fast, about 30 s on a DEC workstation.

Fig. 7. (a) Identification of a MIMO system, $y_{p1}$; (b) identification of a MIMO system, $y_{p2}$; (c) the difference $(y_{p1} - \hat{y}_{p1})$; and (d) the difference $(y_{p2} - \hat{y}_{p2})$.

VIII. CONCLUSION

In summary, the proposed algorithm is simple, fast, and easy to update, and several examples show promising results. There are two points that we want to emphasize: 1) the proposed learning algorithm for the functional-link net is very fast and efficient; the fast learning makes it possible to use a trial-and-error approach to fine-tune some hard-to-determine parameters [e.g., the number of enhancement (hidden) nodes, the dimension of the state space, or the AR parameter $k$]; and 2) the training algorithm allows us to update the weight matrix incrementally whenever a new observation or a new enhancement node is added, without retraining the network from scratch.

ACKNOWLEDGMENT

The authors are indebted to Dr. Y.-H. Pao, the pioneer of the functional-link neural network, for his encouragement and discussions on this work. The authors also deeply thank AFOSR and the Materials Directorate, Wright Laboratory, Wright-Patterson Air Force Base, for their support.

REFERENCES

[1] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral science," Ph.D. dissertation, Harvard Univ., Cambridge, MA, Nov. 1974.
[2] ——, "Backpropagation through time: What it does and how to do it," Proc. IEEE, vol. 78, pp. 1550–1560, Oct. 1990.
[3] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. New York: Wiley, 1992.
[4] L. F. Wessels and E. Barnard, "Avoiding false local minima by proper initialization of connections," IEEE Trans. Neural Networks, vol. 3, pp. 899–905, 1992.
[5] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, pp. 295–307, 1988.
[6] H. Drucker and Y. le Cun, "Improving generalization performance using double backpropagation," IEEE Trans. Neural Networks, vol. 3, pp. 991–997, 1992.
[7] S. J. Perantonis and D. A. Karras, "An efficient constrained learning algorithm with momentum acceleration," Neural Networks, vol. 8, no. 2, pp. 237–249, 1994.
[8] D. A. Karras and S. J. Perantonis, "An efficient constrained training algorithm for feedforward networks," IEEE Trans. Neural Networks, vol. 6, pp. 1420–1434, Nov. 1995.
[9] D. S. Chen and C. Jain, "A robust back propagation learning algorithm for function approximation," IEEE Trans. Neural Networks, vol. 5, pp. 467–479, May 1994.
[10] Y. H. Pao and Y. Takefuji, "Functional-link net computing: Theory, system architecture, and functionalities," IEEE Comput., vol. 3, pp. 76–79, 1991.
[11] Y. H. Pao, G. H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," in Neurocomputing. Amsterdam, The Netherlands: Elsevier, 1994, vol. 6, pp. 163–180.
[12] B. Igelnik and Y. H. Pao, "Stochastic choice of basis functions in adaptive function approximation and the functional-link net," IEEE Trans. Neural Networks, vol. 6, pp. 1320–1329, 1995.
[13] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Networks, vol. 2, pp. 302–309, Mar. 1991.
[14] D. S. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive methods," Complex Syst., vol. 2, pp. 321–355, 1988.
[15] Y. H. Pao, G. H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," in Neurocomputing. Amsterdam, The Netherlands: Elsevier, 1994, vol. 6, pp. 163–180.
[16] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.
[17] C. L. P. Chen, "A rapid supervised learning neural network for function interpolation and approximation," IEEE Trans. Neural Networks, vol. 7, pp. 1220–1230, Sept. 1996.
[18] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[19] M. H. Hassoun, Fundamentals of Artificial Neural Networks. Cambridge, MA: MIT Press, 1995.
[20] A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications. New York: Wiley, 1974.
[21] F. H. Kishi, "On line computer control techniques and their application to re-entry aerospace vehicle control," in Advances in Control Systems Theory and Applications, C. T. Leondes, Ed. New York: Academic, 1964, pp. 245–257.
[22] B. Widrow, "Generalization and information storage in networks of adaline neurons," in Self-Organizing Systems, M. C. Jovitz et al., Eds., 1962, pp. 435–461.
[23] C. R. Johnson, Jr., Lectures on Adaptive Parameter Estimation. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[24] M. J. L. Orr, "Regularization in the selection of radial basis function centers," Neural Computat., vol. 7, pp. 606–623, 1995.
[25] V. R. Vemuri and R. D. Rogers, Eds., Artificial Neural Networks: Forecasting Time Series. Los Alamitos, CA: IEEE Comput. Soc. Press, 1993.
[26] A. Khotanzad, R. Hwang, A. Abaye, and D. Maratukulam, "An adaptive modular artificial neural network hourly load forecaster and its implementation at electric utilities," IEEE Trans. Power Syst., vol. 10, pp. 1716–1922, 1995.
[27] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," Adv. Neural Inf. Process. Syst., 1989.
[28] A. S. Weigend and N. A. Gershenfeld, Eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Reading, MA: Addison-Wesley, 1994.
[29] K. Chakraborty, K. Mehrotra, C. Mohan, and S. Ranka, "Forecasting the behavior of multivariate time series using neural networks," Neural Networks, vol. 5, pp. 961–970, 1992.

John Z. Wan received the B.S. degree from JanXi University, China, the M.S. degree from Zhongshan University, China, and the Ph.D. degree from the University of Cincinnati, Cincinnati, OH, in 1982, 1985, and 1993, respectively, all in mathematics. He also studied at Wright State University for the M.S. degree in computer science.
He worked at Armstrong Laboratory, Wright-Patterson Air Force Base, for one year before moving to Lexis-Nexis Data Central, Dayton, OH, where he is currently a Software Engineer.