
Artificial Neural Networks

Steven Walczak
University of Colorado, Denver

Narciso Cerpa
University of Talca, Chile

I. Introduction to Artificial Neural Networks


II. Need for Guidelines
III. Input Variable Selection
IV. Learning Method Selection
V. Architecture Design
VI. Training Samples Selection
VII. Conclusions

GLOSSARY

Architecture The several different topologies into which artificial neural networks can be organized. Processing elements or neurons can be interconnected in different ways.
Artificial neural network Model that emulates a biological neural network using a reduced set of concepts from a biological neural system.
Learning method Algorithm for training the artificial neural network.
Processing element An artificial neuron that receives input(s), processes the input(s), and delivers a single output.
Summation function Computes the internal stimulation, or activation level, of the artificial neuron.
Training sample Training cases that are used to adjust the weights.
Transformation function A linear or nonlinear relationship between the internal activation level and the output.
Weight The relative importance of each input to a processing element.

ARTIFICIAL NEURAL NETWORKS (ANNs) have been used to support applications across a variety of business and scientific disciplines during the past years. These computational models of neuronal activity in the brain are defined and illustrated through some brief examples. Neural network designers typically perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into ANNs. Once the input variables present in the neural network's input vector have been selected, training data for these variables with known output values must be acquired. Recent research has shown that smaller training set sizes produce better performing neural networks, especially for time-series applications.


Summarizing, this article presents an introduction to artificial neural networks and also a general heuristic methodology for designing high-quality ANN solutions to various domain problems.

I. INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS

Artificial neural networks (sometimes just called neural networks or connectionist models) provide a means for dealing with complex pattern-oriented problems of both categorization and time-series (trend analysis) types. The nonparametric nature of neural networks enables models to be developed without having any prior knowledge of the distribution of the data population or possible interaction effects between variables, as required by commonly used parametric statistical methods.

As an example, multiple regression requires that the error term of the regression equation be distributed normally (with µ = 0) and also be nonheteroscedastic. Another statistical technique that is frequently used for performing categorization is discriminant analysis, but discriminant analysis requires that the predictor variables be multivariate normally distributed. Because such assumptions are removed from ANN models, the ease of developing a domain problem solution is increased with artificial neural networks. Another factor contributing to the success of ANN applications is their ability to create nonlinear models as well as traditional linear models and, hence, artificial neural network solutions are applicable across a wider range of problem types (both linear and nonlinear).

In the following sections, a brief history of artificial neural networks is presented. Next, a detailed examination of the components of an artificial neural network model is given with respect to the design of artificial neural network models of business and scientific domain problems.

A. Biological Basis of Artificial Neural Networks

Artificial neural networks are a technology based on studies of the brain and nervous system, as depicted in Fig. 1. These networks emulate a biological neural network but they use a reduced set of concepts from biological neural systems. Specifically, ANN models simulate the electrical activity of the brain and nervous system. Processing elements (also known as either a neurode or perceptron) are connected to other processing elements. Typically the neurodes are arranged in a layer or vector, with the output of one layer serving as the input to the next layer and possibly other layers. A neurode may be connected to all or a subset of the neurodes in the subsequent layer, with these connections simulating the synaptic connections of the brain. Weighted data signals entering a neurode simulate the electrical excitation of a nerve cell and consequently the transference of information within the network or brain.

FIGURE 1 Sample artificial neural network architecture (not all weights are shown).

The input values to a processing element, i_n, are multiplied by a connection weight, w_(n,m), that simulates the strengthening of neural pathways in the brain. It is through the adjustment of the connection strengths, or weights, that learning is emulated in ANNs.

All of the weight-adjusted input values to a processing element are then aggregated using a vector-to-scalar function such as summation (i.e., y = Σ_i w_(i,j) x_i), averaging, input maximum, or mode value to produce a single input value to the neurode. Once the input value is calculated, the processing element then uses a transfer function to produce its output (and consequently the input signals for the next processing layer). The transfer function transforms the neurode's input value. Typically this transformation involves the use of a sigmoid, hyperbolic-tangent, or other nonlinear function. The process is repeated between layers of processing elements until a final output value, o_n, or vector of values is produced by the neural network.

Theoretically, to simulate the asynchronous activity of the human nervous system, the processing elements of the artificial neural network should also be activated with the weighted input signal in an asynchronous manner. Most software and hardware implementations of artificial neural networks, however, implement a more discretized approach that guarantees that each processing element is activated once for each presentation of a vector of input values.
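As a minimal sketch of this computation, the following Python fragment computes a single neurode's output using a weighted summation and a sigmoid transfer function (the input and weight values are arbitrary illustrations):

    import math

    def neurode_output(inputs, weights):
        # Summation function: compute the internal activation level.
        activation = sum(w * x for w, x in zip(weights, inputs))
        # Sigmoid transfer function: transform the activation into the output.
        return 1.0 / (1.0 + math.exp(-activation))

    # Example: three weighted input signals entering one neurode.
    print(neurode_output([0.5, 0.1, 0.9], [0.8, -0.4, 0.3]))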
B. History and Resurgence of Artificial Neural Networks

The idea of combining multiple processing elements into a network is attributed to McCulloch and Pitts in the early 1940s, and Hebb in 1949 is credited with being the first to define a learning rule to explain the behavior of networks of neurons. In the late 1950s, Rosenblatt developed the first perceptron learning algorithm. Soon after Rosenblatt's discovery, Widrow and Hoff developed a similar learning rule for electronic circuits. Artificial neural network research continued strongly throughout the 1960s.

In 1969, Minsky and Papert published their book, Perceptrons, in which they showed the computational limits of single-layer neural networks, which were the type of artificial neural networks being used at that time. The theoretical limitations of perceptron-like networks led to a decrease in funding, and subsequently research, on artificial neural networks.

Finally, in 1986, McClelland and Rumelhart and the PDP research group published the Parallel Distributed Processing texts. These texts published the backpropagation learning algorithm, which enabled multiple layers of perceptrons to be trained [and thus introduced the hidden layer(s) to artificial neural networks], and was the birth of MLPs (multiple-layered perceptrons). Following the discovery of MLPs and the backpropagation algorithm, a revitalization of research and development efforts in artificial neural networks took place.

In the past years, ANNs have been used to support applications across a diversity of business and scientific disciplines (e.g., financial, manufacturing, marketing, telecommunications, and biomedical). This proliferation of neural network applications has been facilitated by the emergence of neural network shells (e.g., Brainmaker, Neuralyst, Neuroshell, and Professional II Plus) and tool add-ins (for SAS, MATLAB, and Excel) that provide developers with the means for specifying the ANN architecture and training the neural network. These shells and add-in tools enable ANN developers to build ANN solutions without requiring an in-depth knowledge of ANN theory or terminology. Please see either of these World Wide Web sites (active on December 31, 2000): http://www.faqs.org/faqs/ai-faq/neural-nets/part6/ or http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/software.html for additional links to neural network shell software available commercially.

Neural networks may use different learning algorithms, and we can classify them into two major categories based on the input format: binary-valued input (i.e., 0s and 1s) or continuous-valued input. These two categories can be subdivided into supervised learning and unsupervised learning. Supervised learning algorithms use the difference between the desired and actual output to adjust and finally determine the appropriate weights for the ANN. In a variation of this approach, some supervised learning algorithms are informed whether the output for the input is correct, and the network adjusts its weights with the aim of achieving correct results. Hopfield networks (binary) and backpropagation (continuous) are examples of supervised learning algorithms. Unsupervised learning algorithms only receive input stimuli, and the network organizes itself with the aim of having hidden processing elements that respond differently to each set of input stimuli. The network does not require information on the correctness of the output. ART I (binary) and Kohonen (continuous) are examples of unsupervised learning algorithms.

Neural network applications are frequently viewed as black boxes that mystically determine complex patterns in data. However, ANN designers must perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into artificial neural networks. Successful artificial neural network development requires a deep understanding of the steps involved in designing ANNs.

ANN design requires the developer to make many decisions such as input values, training and test data set sizes, learning algorithm, network architecture or topology, and transformation function. Several of these decisions are dependent on each other. For example, the ANN architecture and the learning algorithm will determine the type of input value (i.e., binary or continuous). Therefore, it is essential to follow a methodology or a well-defined sequence of steps when designing ANNs. These steps are listed below:

• Determine data to use.
• Determine input variables.
• Separate data into training and test sets.
• Define the network architecture.
• Select a learning algorithm.
• Transform variables to network inputs.
• Train (repeat until ANN error is below acceptable value).
• Test (on hold-out sample to validate generalization of the ANN).
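As a minimal sketch of this sequence, the following Python fragment walks the steps using scikit-learn's MLPClassifier in place of a commercial shell; the synthetic data set, architecture, and parameter choices are placeholders rather than recommendations:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    # Determine data to use and input variables (synthetic placeholder data).
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    # Separate data into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Transform variables to network inputs.
    scaler = StandardScaler().fit(X_train)

    # Define the architecture, select a learning algorithm, and train.
    ann = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                        max_iter=2000, random_state=0)
    ann.fit(scaler.transform(X_train), y_train)

    # Test on the hold-out sample to validate generalization.
    print("hold-out accuracy:", ann.score(scaler.transform(X_test), y_test))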
In the following sections we discuss the need for guidelines, and discuss heuristics for input variable selection, learning method selection, architecture design, and training sample selection. Finally, we conclude and summarize a set of guidelines for ANN design.

II. NEED FOR GUIDELINES

Artificial neural networks have been applied to a wide variety of business, engineering, medical, and scientific problems. Several research results have shown that ANNs outperform traditional statistical techniques (e.g., regression or logit) as well as other standard machine learning techniques (e.g., the ID3 algorithm) for a large class of problem types.

Many of these ANN applications, such as financial time series (e.g., foreign exchange rate forecasts), are difficult to model. Artificial neural networks provide a valuable tool for building nonlinear models of data, especially when the underlying laws governing the system are unknown. Artificial neural network forecasting models have outperformed both statistical and other machine learning models of financial time series, achieving forecast accuracies of more than 60%, and thus are being widely used to model the behavior of financial time series. Other categorization-based applications of ANNs are achieving success rates of well over 90%.

Development of effective neural network models is difficult. Most artificial neural network designers develop multiple neural network solutions with regard to the network's architecture—the quantity of nodes and their arrangement in hidden layers. Two critical design issues are still a challenge for artificial neural network developers: selection of appropriate input variables and capturing a sufficient quantity of training examples to permit the neural network to adequately model the application.

Many different types of ANN applications have been developed in the past several years and are continuing to be developed. Industrial applications exist in the financial, manufacturing, marketing, telecommunications, biomedical, and many other domains. While business managers are seeking to develop new applications using ANNs, a basic misunderstanding of the source of intelligence in an ANN exists. As mentioned above, the development of new ANN applications has been facilitated by the emergence of a variety of neural network shells that allow anyone to produce neural network systems by simply specifying the ANN architecture and providing a set of training data to be used by the shell to train the ANN. These shell-based neural networks may fail or produce suboptimal results unless a deeper understanding of how to use and incorporate domain knowledge in the ANN is obtained by the designers of ANNs in business and industrial domains.

The traditional view of an ANN is of a program that emulates biological neural networks and "learns" to recognize patterns or categorize input data by being trained on a set of sample data from the domain. These programs learn through training and subsequently have the ability to generalize broad categories from specific examples. This is the unique perceived source of intelligence in an ANN. However, experienced ANN application designers typically perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into the design of ANNs even before the learning-through-training process has begun. The selection of the input variables to be used by the ANN is quite a complex task, due to the misconception that the more input a network is fed, the more successful the results produced. This is only true if the information fed is critical to making the decisions; noisy input variables commonly result in very poor generalization performance.

Design of optimal neural networks is problematic in that there exist a large number of alternative ANN physical architectures and learning methods, all of which may be applied to a given domain problem. Selecting the appropriate size of the training data set presents another challenge, since it implies direct and indirect costs, and it can also affect the generalization performance.

A general heuristic or rule of thumb for the design of neural networks in time-series domains is that the more knowledge that is available to the neural network for forming its model, the better the ultimate performance of the neural network.

A minimum of 2 years of training data is considered to be a nominal starting point for financial time series. Time-series models are considered to improve as more data are incorporated into the modeling process. Research has indicated that currency exchange rates have a long-term memory, implying that larger periods of time (data) will produce more comprehensive models and produce better generalization. However, this has been challenged in recent research and will be discussed in Section VI.

Neural network researchers have built forecasting and trading systems with training data from 1 to 16 years, including various training set sizes in between the two extremes. However, researchers typically use all of the data in building the neural network forecasting model, with no attempt at comparing data quantity effects on the quality of the produced forecasting models.

In this article, a set of guidelines for incorporating knowledge into an ANN and using domain knowledge to design optimal ANNs is described. The guidelines for designing ANNs are made up of the following steps: knowledge-based selection of input values, selection of a learning method, architecture design, and training sample selection. The majority of the ANN design steps described will focus mainly on feed-forward supervised learning (and more specifically backpropagation) ANN applications. Following these guidelines will enable developers and researchers to take advantage of the power of ANNs and will afford economic benefit by producing an ANN that outperforms similar ANNs with improperly specified design parameters.

Artificial neural network designers must determine the optimal set of design criteria, specified as follows:

• Appropriate input (independent) variables.
• Best learning method: Learning methods can be classified into either supervised or unsupervised learning methods. Within these learning methods there are many alternatives, each of which is appropriate for different distributions or types of data.
• Appropriate architecture: The number of hidden layers, depending on the selected learning method; the quantity of processing elements (nodes) per hidden layer.
• Appropriate amount of training data: Time-series and classification problems.

The designer's choices for these design criteria will affect the performance of the resulting ANN on out-of-sample data. Inappropriate selection of the values for these design factors may produce ANN applications that perform worse than random selection of an output (dependent) value.

III. INPUT VARIABLE SELECTION

The generalization performance of supervised learning artificial neural networks (e.g., backpropagation) usually improves when the network size is minimized with respect to the weighted connections between processing nodes (elements of the input, hidden, and output layers). ANNs that are too large tend to overfit or memorize the input data. Conversely, ANNs with too few weighted connections do not contain enough processing elements to correctly model the input data set, underfitting the data. Both of these situations result in poor out-of-sample generalization.

Therefore, when developing supervised learning neural networks (e.g., backpropagation, radial basis function, or fuzzy ARTMAP), the developer must determine what input variables should be selected to accurately model the domain.

ANN designers must spend a significant amount of time performing the task of knowledge acquisition to avoid the fact that "garbage in, garbage out" also applies to ANN applications. ANNs, as well as other artificial intelligence (AI) techniques, are highly dependent on the specification of input variables. However, ANN designers tend to misspecify input variables.

Input variable misspecification occurs because ANN designers follow the expert system approach of incorporating as much domain knowledge as possible into an intelligent system. ANN performance improves as additional domain knowledge is provided through the input variables. This belief is correct, because if a sufficient amount of information representing critical decision criteria is not given to an ANN, it cannot develop a correct model of the domain. Most ANN designers believe that since ANNs learn, they will be able to determine those input variables that are important and develop a corresponding model through the modification of the weights associated with the connections between the input layer and the hidden layers.

Noise input variables produce poor generalization performance in ANNs. The presence of too many input variables causes poor generalization when the ANN not only models the true predictors, but also includes the noise variables in the model. Interaction between input variables produces critical differences in output values, further obscuring the ideal problem model when unnecessary variables are included in the set of input values.

As indicated above, and shown in the following sections, both under- and overspecification of input variables produce suboptimal performance. The following section describes the guidelines for selecting input (independent) variables for an ANN solution to a domain problem.

A. Determination of Input Variables

Two approaches exist regarding the selection of input parameter variables for supervised learning neural networks. In the first approach, it is thought that since a neural network that utilizes supervised training will adjust its connection weights to better approximate the desired output values, all possible domain-relevant variables should be given to the neural network as input values. The idea is that the connection weights that indicate the contribution of nonsignificant variables will approach zero and thus effectively eliminate any effect on the output value from these variables:

    lim_(t→∞) ε_t ⇒ 0,

where ε is the error term of the neural network and t is the number of training iterations.

The second approach emphasizes the fact that the weighted connections never achieve a value of true zero, and thus there will always be some contribution to the output value of the neural network by all of the input variables. Hence, ANN designers must research domain variables to determine their potential contribution to the desired output values. Selection of input variables for neural networks is a complex, but necessary, task. Selection of irrelevant variables may cause output value fluctuations of up to 7%. Designers should determine applicability through knowledge acquisition from experts in the domain, similar to expert systems development. Highly correlated variables should be removed from the input vector because they can multiply the effect of those variables and consequently cause noise in the output values. This process should produce an expert-specified set of significant variables that are not intercorrelated, and which will yield the optimal performance for supervised learning neural networks.

The first step in determining the optimal set of input variables is to perform standard knowledge acquisition. Typically, this involves consultation with multiple domain experts. Various researchers have indicated the requirement for extensive knowledge acquisition utilizing domain experts to specify ANN input variables. The primary purpose of the knowledge acquisition phase is to guarantee that the input variable set is not underspecified, providing all relevant domain criteria to the ANN.

Once a base set of input variables is defined through knowledge acquisition, the set can be pruned to eliminate variables that contribute noise to the ANN and consequently reduce the ANN generalization performance. ANN input variables need to be predictive, but should not be correlated. Correlated variables degrade ANN performance by interacting with each other, as well as other elements, to produce a biased effect. The designer should calculate the correlation of pairs of variables—a Pearson correlation matrix—to identify "noise" variables. If two variables have a high correlation, then one of these two variables may be removed from the set of variables without adversely affecting the ANN performance. Alternatively, a chi-square test may be used for categorical variables. The cutoff value for variable elimination is an arbitrary value and must be determined separately for every ANN application, but any correlation absolute value of 0.20 or higher indicates a probable noise source to the ANN.
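A simple sketch of this pruning step, using a pandas correlation matrix with the 0.20 cutoff mentioned above (the data frame of candidate variables is assumed to exist):

    import pandas as pd

    def prune_correlated(df, cutoff=0.20):
        # Drop one variable from each pair whose |Pearson r| meets the cutoff.
        corr = df.corr().abs()
        drop = set()
        cols = list(corr.columns)
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                if corr.loc[a, b] >= cutoff and a not in drop and b not in drop:
                    drop.add(b)  # keep the first variable, remove its partner
        return df.drop(columns=sorted(drop))

    # candidates = pd.DataFrame(...)   # expert-specified input variables
    # pruned = prune_correlated(candidates)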
Additional statistical techniques may be applied, depending on the distribution properties of the data set. Stepwise multiple or logistic regression and factor analysis provide viable tools for evaluating the predictive value of input variables and may serve as a secondary filter to the Pearson correlation matrix. Multiple regression and factor analysis perform best with normally distributed linear data, while logistic regression assumes a curvilinear relationship.

Several researchers have shown that smaller input variable sets can produce better generalization performance by an ANN. As mentioned above, high correlation values of variables that share a common element need to be disregarded. Smaller input variable sets frequently improve the ANN generalization performance and reduce the net cost of data acquisition for development and usage of the ANN. However, care must be taken when removing variables from the ANN's input set to ensure that a complete set of noncorrelated predictor variables is available for the ANN; otherwise the reduced variable sets may worsen generalization performance.

IV. LEARNING METHOD SELECTION

After determining a heuristically optimal set of input variables using the methods from the previous section, an ANN learning method must be selected. The learning method is what enables the ANNs to correctly model categorization and time-series problems. Artificial neural network learning methods can be divided into two distinct categories: unsupervised learning and supervised learning. Both unsupervised and supervised learning methods require a collection of training examples that enable the ANN to model the data set and produce accurate output values.

Unsupervised learning systems, such as adaptive resonance theory (ART), self-organizing map (SOM, also called Kohonen networks), or Hopfield networks, do not require that the output value for a training sample be provided at the time of training.

Supervised learning systems, such as backpropagation (MLP), radial basis function (RBF), counterpropagation, or fuzzy ARTMAP networks, require that a known output value for all training samples be provided to the ANN.
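The practical difference shows up in the training interface. In the sketch below, k-means clustering stands in for a category-forming unsupervised method (scikit-learn provides no ART or SOM implementation), while the supervised learner must also be given the known output values:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=200, n_features=4, random_state=1)

    # Unsupervised: categories are formed from the input data alone.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

    # Supervised: known output values must accompany every training sample.
    model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000).fit(X, y)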

FIGURE 2 Kohonen layer (12-node) learning of a square.

Unsupervised learning methods determine output values directly from the input variable data set. Most unsupervised learning methods have less computational complexity and less generalization accuracy than supervised methods, because the answers must be contained within, or directly learned from, the input values. Hence, unsupervised learning techniques are typically used for classification problems, where the desired classes are self-descriptive. For example, the ART algorithm is a good technique to use for performing object recognition in pictorial or graphical data. An example of a problem that has been solved with ART-based ANNs is the recognition of hand-written numerals. The hand-written numerals 0–9 are each unique, although in some cases similar (for example, 1 and 7, or 3 and 8), and define the pattern to be learned: the shapes of the numerals 0–9. The advantage of using unsupervised learning methods is that these ANNs can be designed to learn much more rapidly than supervised learning systems.

A. Unsupervised Learning

The unsupervised learning algorithms—ART, SOM (Kohonen), and Hopfield—form categories based on the input data. Typically, this requires a presentation of each of the training examples to the unsupervised learning ANN. Distinct categories of the input vector are formed and re-formed as new input examples are presented to the ANN.

The ART learning algorithm establishes a category for the initial training example. As additional examples are presented to the ART-based ANN, new categories are formed based on how closely the new example matches one of the existing categories with respect to both negative inhibition and positive excitation of the neurodes in the network. As a worst case, an ART-trained ANN may produce M distinct categories for M input examples. When building ART-based networks, the architecture of the network is given explicitly by the quantity of input values and the desired number of categories (output values). The hidden layer, or what is usually called the F1 layer, is the same size as the input layer and serves as the feature detector for the categories. The output, or F2, layer is defined by the quantity of categories to be defined.

SOM-trained networks are composed of a Kohonen layer of neurodes that is two dimensional, as opposed to the vector alignments of most other ANNs. The collection of neurodes (also called the grid) maps input values onto the grid of neurodes to preserve order, which means that two input values that are close together will be mapped to the same neurode. The Kohonen grid is connected to both an input and an output layer. As training progresses, the neurodes in the grid attempt to approximate the feature space of the input by adjusting the collection of values mapped onto each neurode. A graphical example of the learning process in the Kohonen layer of the SOM is shown in Fig. 2, which is a grid of 12 neurodes (3 × 4) that is trying to learn the category of a hollow square object. Figures 2a–d represent the two-dimensional coordinates of each of the 12 Kohonen-layer processing elements.

The Hopfield training algorithm is similar in nature to the ART training algorithm. Both require a hidden layer (in this case called the Hopfield layer, as opposed to an F1 layer for ART-based ANNs) that is the same size as the input layer. The Hopfield algorithm is based on spin glass physics and views the state of the network as an energy surface. Both SOM- and Hopfield-trained ANNs have been used to solve traveling salesman problems in addition to the more traditional image processing of unsupervised learning ANNs. Hopfield ANNs are also used for optimization problems. A difficulty with Hopfield ANNs is the capacity of the network, which is estimated at n/(4 ln n), where n is the number of neurodes in the Hopfield layer (for example, a 100-neurode Hopfield layer can store only about 100/(4 ln 100) ≈ 5 patterns).

B. Supervised Learning

The backpropagation learning algorithm is one of the most popular design choices for implementing ANNs, since this algorithm is available and supported by most commercial neural network shells and is based on a very robust paradigm. Backpropagation-trained ANNs have been shown to be universal approximators, and they are able to learn arbitrary category mappings. Various researchers have supported this finding and shown the superiority of backpropagation-trained ANNs to different ANN learning paradigms including radial basis function (RBF), counterpropagation, and fuzzy adaptive resonance theory. An ANN's performance has been found to be more dependent on data representation than on the selection of a learning rule.

Learning rules other than backpropagation perform well if the data from the domain have specific properties. The mathematical specifications of the various ANN learning methods described in this section are available in the reference articles and books given at the end of this article.

Backpropagation is the superior learning method when a sufficient number of noise/error-free training examples exist, regardless of the complexity of the specific domain problem. Backpropagation ANNs can handle noise in the training data, and they may actually generalize better if some noise is present in the training data. However, too many erroneous training values may prevent the ANN from learning the desired model.

For ANN applications that provide only a few training examples or very noisy training data, other supervised learning methods should be selected. RBF networks perform well in domains with limited training sets, and counterpropagation networks perform well when a sufficient number of training examples is available but may contain very noisy data. For resource allocation (configuration) problems, backpropagation produced the best results, although the first appearance of the problem indicated that counterpropagation might outperform backpropagation due to anticipated noise in the training data set. Hence, although properties of the data population may strongly indicate the preference of a particular training method, because of the strength of the backpropagation network this type of learning method should always be tried in addition to any other methods prescribed by domain data tendencies.

Domains that have a large collection of relatively error-free historical examples with known outcomes suit backpropagation ANN implementations. Both the ART and RBF ANNs have worse performance than the backpropagation ANN for this specific domain problem.

Many other ANN learning methods exist, and each is subject to constraints on the type of data that is best processed by that specific learning method. For example, general regression neural networks are capable of solving any problem that can also be solved by a statistical regression model, but do not require that a specific model type (e.g., multiple linear or logistic) be specified in advance. However, regression ANNs suffer from the same constraints as regression models, such as the linear or curvilinear relationship of the data with heteroscedastic error. Likewise, learning vector quantization (LVQ) networks try to divide input values into disjoint categories, similar to discriminant analysis, and consequently have the same data distribution requirements as discriminant analysis. Research using resource allocation problems has indicated that LVQ neural networks produced the second-best allocation results, which pointed to the previously unknown perception that the categories used for allocating resources were unique.

To summarize, backpropagation MLP networks are usually implemented due to their robust and generalized problem-solving capabilities. General regression networks are implemented to simulate statistical regression models. Radial basis function networks are implemented to resolve domain problems having a partial sample or a training data set that is too small. Both counterpropagation and fuzzy ARTMAP networks are implemented to resolve the difficulty of extremely noisy training data. The combination of unsupervised (clustering and ART) learning techniques with supervised learning may improve the performance of neural networks in noisy domains. Finally, learning vector quantization networks are implemented to exploit the potential for unique decision criteria of disjoint sets.

The selection of a learning method is an open problem, and ANN designers must use the constraints of the training data set for determining the optimal learning method. If reasonably large quantities of relatively noise-free training examples are available, then backpropagation provides an effective learning method, which is relatively easy to implement.
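The constraints discussed above can be compressed into a toy decision routine; the example-count threshold below is an arbitrary placeholder, not a value from the research literature:

    def suggest_learning_method(n_examples, noisy, outputs_known):
        # Rough mapping of training-data constraints to a learning method.
        if not outputs_known:
            return "unsupervised method (e.g., ART, SOM, or Hopfield)"
        if n_examples < 100:                      # hypothetical smallness cutoff
            return "radial basis function (RBF)"  # suits limited training sets
        if noisy:
            return "counterpropagation or fuzzy ARTMAP"
        return "backpropagation (and try it in addition to any other choice)"

    print(suggest_learning_method(n_examples=5000, noisy=False, outputs_known=True))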
V. ARCHITECTURE DESIGN

The architecture of an ANN consists of the number of layers of processing elements or nodes, including input, output, and any hidden layers, and the quantity of nodes contained in each layer. Selection of input variables (i.e., the input vector) was discussed in Section III, and the output vector is normally predefined by the problem to be solved with the ANN. Design of hidden layers is dependent on the selected learning algorithm (discussed in Section IV). For example, unsupervised learning methods such as ART normally require a first hidden layer quantity of nodes equal to the size of the input layer. Supervised learning systems are generally more flexible in the design of hidden layers. The remaining discussion focuses on backpropagation ANN systems or other similar supervised learning ANNs. The designer should determine the following aspects regarding the hidden layers of the ANN architecture: (1) the number of hidden layers and (2) the number of nodes in the hidden layer(s).

A. Number of Hidden Layers

It is possible to design an ANN with no hidden layers, but these types of ANNs can only classify input data that is linearly separable, which severely limits their application.

Artificial neural networks that contain hidden layers have the ability to deal robustly with nonlinear and complex problems and therefore can operate on more interesting problems. The quantity of hidden layers is associated with the complexity of the domain problem to be solved. ANNs with a single hidden layer create a hyperplane. ANNs with two hidden layers combine hyperplanes to form convex decision areas, and ANNs with three hidden layers combine convex decision areas to form convex decision areas that contain concave regions. The convexity or concavity of a decision region corresponds roughly to the number of unique inferences or abstractions that are performed on the input variables to produce the desired output result.

Increasing the number of hidden unit layers enables a trade-off between smoothness and closeness-of-fit. A greater quantity of hidden layers enables an ANN to improve its closeness-of-fit, while a smaller quantity improves the smoothness or extrapolation capabilities of the ANN.

Several researchers have indicated that a single hidden layer architecture, with an arbitrarily large quantity of hidden nodes in the single layer, is capable of modeling any categorization mapping. On the other hand, two hidden layer networks outperform their single hidden layer counterparts for specific problems. A heuristic for determining the quantity of hidden layers required by an ANN is as follows: "As the dimensionality of the problem space increases—higher order problems—the number of hidden layers should increase correspondingly."

The number of hidden layers is heuristically set by determining the number of intermediate steps, dependent on previous categorizations, required to translate the input variables into an output value. Therefore, domain problems that have a standard nonlinear equation solution are solvable by a single hidden layer ANN.

B. Number of Nodes per Hidden Layer

When choosing the number of nodes to be contained in a hidden layer, there is a trade-off between training time and the accuracy of training. A greater number of hidden unit nodes results in a longer (slower) training period, while fewer hidden units provide shorter (faster) training, but at the cost of having fewer feature detectors. Too many hidden nodes in an ANN enable it to memorize the training data set, which produces poor generalization performance. Some of the heuristics used for selecting the quantity of hidden nodes for an ANN are:
• 75 percent of the quantity of input nodes,
• 50 percent of the quantity of input and output nodes, or
• 2n + 1 hidden layer nodes, where n is the number of nodes in the input layer.

These algorithmic heuristics do not utilize domain knowledge for estimating the quantity of hidden nodes and may be counterproductive.
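For concreteness, the three rules of thumb can be computed directly (a trivial sketch; results are rounded node counts):

    def hidden_node_heuristics(n_inputs, n_outputs):
        # The three algorithmic rules of thumb listed above.
        return {
            "75% of input nodes": round(0.75 * n_inputs),
            "50% of input and output nodes": round(0.50 * (n_inputs + n_outputs)),
            "2n + 1 rule": 2 * n_inputs + 1,
        }

    print(hidden_node_heuristics(n_inputs=10, n_outputs=2))  # 8, 6, and 21 nodes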
As with the knowledge acquisition and elimination of correlated input variables heuristic for defining the optimal input node set, the number of decision factors (DFs) heuristically determines the optimal number of hidden units for an ANN. Knowledge acquisition or existing knowledge bases may be used to determine the DFs for a particular domain and consequently the hidden layer architecture and optimal quantity of hidden nodes. Decision factors are the separable elements that help to form the unique categories of the input vector space. The DFs are comparable to the collection of heuristic production rules used in an expert system.

An example of the DF design principle is provided by the NETTalk neural network research project. NETTalk has 203 input nodes representing seven textual characters, and 33 output units representing the phonetic notation of the spoken text words. Hidden units were varied from 0 to 120. NETTalk improved output accuracy as the number of hidden units was increased from 0 to 120, but only a minimal improvement in the output accuracy was observed between 60 and 120 hidden units. This indicates that the ideal quantity of DFs for the NETTalk problem was around 60; adding hidden units beyond 60 increased the training time, but did not provide any appreciable difference in the ANN's performance.

Several researchers have found that ANNs perform poorly until a sufficient number of hidden units is available to represent the correlations between the input vector and the desired output values. Increasing the number of hidden units beyond the sufficient number served to increase training time without a corresponding increase in output accuracy. Knowledge acquisition is necessary to determine the optimal input variable set to be used in an ANN system. During the knowledge acquisition phase, additional knowledge engineering can be performed to determine the DFs and subsequently the minimum number of hidden units required by the ANN architecture. The ANN designer must acquire the heuristic rules or clustering methods used by domain experts, similar to the knowledge that must be acquired during the knowledge acquisition process for expert systems. The number of heuristic rules or clusters used by domain experts is equivalent to the DFs used in the domain.

Researchers have explored and shown techniques for automatically producing an ANN architecture with the exact number of hidden units required to model the DFs for the problem space.
The approach used by these automatic methods consists of three steps:

1. Initially create a neural network architecture with a very small or very large number of hidden units.
2. Train the network for some predetermined number of epochs.
3. Evaluate the error of the output nodes.

If the error exceeds a set threshold value, then a hidden unit is added or deleted, respectively, and the process is repeated until the error term is less than the threshold value. Another method to automatically determine the optimum architecture is to use genetic algorithms to generate multiple ANN architectures and select the architectures with the best performance. Determining the optimum number of hidden units for an ANN application is a very complex problem, and an accurate method for automatically determining the DF quantity of hidden units without performing the corresponding knowledge acquisition remains a current research topic.
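A minimal sketch of the growing variant of this procedure, assuming a hypothetical train_and_evaluate(n_hidden) helper that trains for a fixed number of epochs and returns the output error:

    def search_hidden_units(train_and_evaluate, start=1, max_units=100,
                            threshold=0.05):
        # Grow the hidden layer until the output error falls below the threshold.
        n_hidden = start
        while n_hidden <= max_units:
            error = train_and_evaluate(n_hidden)  # steps 2 and 3: train, evaluate
            if error < threshold:
                return n_hidden                   # error acceptable; stop growing
            n_hidden += 1                         # otherwise add a hidden unit
        return max_units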
In this section, the heuristic architecture design principle of acquiring decision factors to determine the quantity of hidden nodes and the configuration of hidden layers has been presented. A number of hidden nodes equal to the number of DFs is required by an ANN to perform robustly in a domain and produce accurate results. This concept is similar to the principle of a minimum-size input vector determined through knowledge acquisition, presented in Section III. The knowledge acquisition process for ANN designers must acquire the heuristic decision rules or clustering methods of domain experts. The DFs for a domain are equivalent to the heuristic decision rules used by domain experts. Further analysis of the DFs to determine the dimensionality of the problem space enables the knowledge engineer to configure the hidden nodes into the optimal number of hidden layers for efficient modeling of the problem space.

VI. TRAINING SAMPLES SELECTION

Acquisition of training data has direct costs associated with the data themselves, and indirect costs due to the fact that larger training sets require a larger quantity of training epochs to optimize the neural network's learning. The common belief is that the generalization performance of a neural network will increase when larger quantities of training samples are used to train the neural network, especially for time-series applications of neural networks. Based on this belief, the neural network designer must acquire as much data as possible to ensure the optimal learning of a neural network.

A "rule of thumb" lower bound on the number of training examples required to train a backpropagation ANN is four times the number of weighted connections contained in the network. Therefore, if a training database contains only 100 training examples, the maximum size of the ANN is 25 connections, or approximately 10 nodes, depending on the ANN architecture. While the general heuristic of four times the number of connections is applicable to most classification problems, time-series problems, including the prediction of financial time series (e.g., stock values), are more dependent on business cycles. Recent research has conclusively shown that a maximum of 1 or 2 years of data is all that is required to produce optimal forecasting results for ANNs performing financial time-series prediction.

Another issue to be considered during training sample selection is how well the samples in the training set model the real world. If training samples are skewed such that they only cover a small portion of the possible real-world instances that a neural network will be asked to classify or predict, then the neural network can only learn how to classify or predict results for this subset of the domain. Therefore, developers should take care to ensure that their training set samples have a similar distribution to the domain in which the neural network must operate.

Artificial neural network training sets should be representative of the population at large. This indicates that categorization-based ANNs require at least one example of each category to be classified and that the distribution of training data should approximate the distribution of the population at large. A small amount of additional examples from each category will help to improve the generalization performance of the ANN. Thus a categorization ANN trying to classify items into one of seven categories with distributions of 5, 10, 10, 15, 15, 20, and 25% would need a minimum of 20 training examples, but would benefit by having 40–100 training examples. Time-series domain problems are dependent on the distribution of the time series, with the neural network normally requiring one complete cycle of data. Again, recent research in financial time series has demonstrated that 1- and 2-year cycle times are prevalent, and thus the minimum required training data for a financial time-series ANN would be from 1 to 2 years of training examples.

Based on these more recent findings, we suggest that neural network developers should use an iterative approach to training. Starting with a small quantity of training data, train the neural network, then increase the quantity of samples in the training data set and repeat training until a decrease in performance occurs.

Development of optimal neural networks is a difficult and complex task. Limiting both the set of input variables to those that are thought to be predictive and the training set size increases the probability of developing robust and highly accurate neural network models.
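Both rules of thumb are easy to state as code (a sketch; the connection count would come from the chosen architecture, and the category shares from the domain):

    import math

    def min_examples_classification(n_weighted_connections):
        # Lower bound: four times the number of weighted connections.
        return 4 * n_weighted_connections

    def min_examples_by_category(category_shares):
        # At least one training example of the rarest category.
        return math.ceil(1.0 / min(category_shares))

    print(min_examples_classification(25))                                 # 100
    print(min_examples_by_category([.05, .10, .10, .15, .15, .20, .25]))   # 20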

Most neural network models of financial time series are homogeneous. Homogeneous models utilize data from the specific time series being forecast or directly obtainable from that time series (e.g., a k-day trend or moving average). Heterogeneous models utilize information from outside the time series in addition to the time series itself. Homogeneous models rely on the predictive capabilities of the time series itself, corresponding to a technical analysis as opposed to a fundamental analysis.
Most neural network forecasting in the capital markets produces an output value that is the future price or exchange rate. Measuring the mean standard error of these neural networks may produce misleading evaluations of the neural networks' capabilities, since even very small errors that are incorrect in the direction of change will result in a capital loss. Instead of measuring the mean standard error of a forecast, some researchers argue that a better method for measuring the performance of neural networks is to analyze the direction of change. The direction of change is calculated by subtracting today's price from the forecast price and determining the sign (positive or negative) of the result. The percentage of correct direction-of-change forecasts is equivalent to the percentage of profitable trades enabled by the ANN system.
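A short sketch of this evaluation measure (the price series below are arbitrary illustrations):

    def direction_of_change_accuracy(todays_prices, forecasts, actuals):
        # Percentage of forecasts whose sign of change matches the actual change.
        hits = 0
        for today, forecast, actual in zip(todays_prices, forecasts, actuals):
            forecast_up = forecast - today >= 0   # predicted direction
            actual_up = actual - today >= 0       # realized direction
            hits += forecast_up == actual_up
        return 100.0 * hits / len(forecasts)

    print(direction_of_change_accuracy([1.10, 1.12], [1.11, 1.13], [1.13, 1.10]))  # 50.0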
The effect on the quality of the neural network model forecasting outputs achieved from the quantities of training data has been called the "time-series (TS) recency effect." The TS recency effect states that for time-series data, model construction data that are closer in time to the values to be forecast produce better forecasting models. This effect is similar to the concept of a random walk model, which assumes future values are only affected by the previous time period's value, but it is able to use a wider range of proximal data for formulating the forecasts.

Requirements for training or modeling knowledge were investigated when building nonlinear financial time-series forecasting models with neural networks. Homogeneous neural network forecasting models were developed for trading the U.S. dollar against various other foreign currencies (i.e., dollar/pound, dollar/mark, dollar/yen). Various training sets were used, ranging from 22 years to 1 year of historic training data. The differences between the neural network models for a specific currency existed only in the quantity of training data used to develop each time-series forecasting model. The researchers critically examined the qualitative effect of training set size on neural network foreign exchange rate forecasting models. Training data sets of up to 22 years of data were used to predict 1-day future spot rates for several nominal exchange rates. Multiple neural network forecasting models for each exchange rate were trained on incrementally larger quantities of training data. The resulting outputs were used to empirically evaluate whether neural network exchange rate forecasting models achieve optimal performance in the presence of a critical amount of data used to train the network. Once this critical quantity of data is obtained, addition of more training data does not improve and may, in fact, hinder the forecasting performance of the neural network forecasting model. For most exchange rate predictions, a maximum of 2 years of training data produces the best neural network forecasting model performance. Hence, this finding leads to the induction of the empirical hypothesis for a time-series recency effect. The TS recency effect can be summarized in the following statement: "The use of data that are closer in time to the data that are to be forecast by the model produces a higher quality model."

The TS recency effect provides several direct benefits for both neural network researchers and developers:

• A new paradigm for choosing training samples for producing a time-series model
• Higher quality models, by having better forecasting performance through the use of smaller quantities of data
• Lower development costs for neural network time-series models, because fewer training data are required
• Less development time, because smaller training set sizes typically require fewer training iterations to accurately model the training data.

The time-series recency effect refutes existing heuristics and is a call to revise previous claims of longevity effects in financial time series. The empirical method used to evaluate and determine the critical quantity of training data for exchange rate forecasting is generalized for application to other financial time series, indicating the generality of the TS recency effect to other financial time series.

The TS recency effect offers an explanation as to why previous research efforts using neural network models have not surpassed the 60% prediction accuracy demonstrated as a realistic threshold by researchers. The difficulty in most prior neural network research is that too much data is typically used. In attempting to build the best possible forecasting model, as was perceived at that time, too much training data is used (typically 4–6 years of data), thus violating the TS recency effect by introducing data into the model that is not representative of the current time-series behavior. Training, test, and general-use data represent an important and recurring cost for information systems in general and neural networks in particular. Thus, if the 2-year training set produces the best performance and represents the minimal quantity of data required to achieve this level of performance, then this minimal amount of data is all that should be used to minimize the costs of neural network development and maintenance.

Thus, if the 2-year training set produces the best performance and represents the minimal quantity of data required to achieve this level of performance, then this minimal amount of data is all that should be used to minimize the costs of neural network development and maintenance. For example, the Chicago Mercantile Exchange (CME) sells historical data on commodities (including currency exchange rates) at the cost of $100 per year per commodity. At this rate, using 1–2 years of data instead of the full 22 years provides an immediate data cost savings of $2000 to $2100 for producing the neural network models.

The only variation in the ANN models above was the quantity of data used to build them. It may be argued that certain years of training data contain noise and would thus adversely affect the forecasting performance of the neural network model. In that case, the addition of more (older) training data that is error free should compensate for the noise effects in the middle data, creating a U-shaped performance curve: the most recent data provide high performance, and the largest quantity of available data also provides high performance by drowning out the noise in the middle-time-frame samples.

The TS recency effect has been demonstrated for the three most widely traded currencies against the U.S. dollar. These results contradict current approaches, which state that as the quantity of training data used in constructing neural network models increases, the forecasting performance of the neural networks correspondingly improves. The results were tested for robustness by extending the research method to other foreign currencies. Three additional currencies were selected: the French franc, the Swiss franc, and the Italian lira. These three currencies were chosen to approximate the set of nominal currencies used in the previous study.

Results for the six different ANN models for each of the three new currencies show that the full 22-year training data set continues to be outperformed by either the 1- or 2-year training sets. The exception is the French franc, which has equivalent performance for the most recent and the largest training data sets. The result that the 22-year data set cannot outperform the smaller 1- or 2-year training data sets provides further empirical evidence that a critical amount of training data, less than the full 22 years for the foreign exchange time series, produces optimal performance for neural network financial time-series models.

The French franc ANN models, similar to those for the Japanese yen, have identical performance between the largest (22-year) data set and the smallest (1-year) data set. Because no increase in performance is provided through the use of additional data, economics dictates that the smaller 1-year set be used as the training paradigm for the French franc, producing a possible $2100 savings in data costs. Additionally, the TS recency effect is supported by all three currencies; however, the Swiss franc achieves its maximum performance with 4 years of training data. The quality of the ANN outputs for the Swiss franc model continually increases as new training data years are added, through the fourth year, then drops precipitously as additional data are added to the training set. The Swiss franc results still support the research goal of determining a critical training set size and the discovered TS recency effect. However, they indicate that validation tests should be performed individually for each financial time series to determine the minimum quantity of data required for producing the best forecasting performance.

While a significant amount of evidence has been acquired to support the TS recency effect for ANN models of foreign exchange rates, can the TS recency effect be generalized to apply to other financial time series? The knowledge that only a few years of data are necessary to construct neural network models with maximum forecasting performance would serve to save neural network developers significant development time, effort, and costs. On the other hand, the dollar/Swiss franc ANNs described above indicate that a cutoff of 2 years of training data may not always be appropriate.

A method for determining the optimal training set size for financial time series ANN models has been proposed. This method consists of the following steps:

1. Create a 1-year training set using the most recent data; determine an appropriate test set.
2. Train with the 1-year set and test (baseline); record performance.
3. Add 1 year of training data, the year closest to the current training set.
4. Train with the newest training set, and test on the original test set; record performance.
5. If the performance of the newest training set is at least as good as the previous performance,
   then go to step 3;
   otherwise, use the previous training data set, which produced the best performance.

This is an iterative approach that starts with a single year of training data and continues to add additional years of training data until the trained neural network's performance begins to decrease. In other words, the process continues to search for better training set sizes as long as the performance increases or remains the same. The optimal training set size is then set to be the smallest quantity of training data that achieves the best forecasting performance. A minimal implementation sketch of this search is shown below.
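The following sketch expresses the method as a search loop. It is not drawn from the article itself: `train` and `evaluate` are hypothetical helpers that fit a backpropagation ANN and score its forecasts (e.g., directional accuracy) on the fixed test set, and `series_by_year` is assumed to list each year's training cases with the most recent year first.

```python
def optimal_training_window(series_by_year, test_set, train, evaluate):
    """Find the smallest training window with the best test performance.

    series_by_year: yearly lists of training cases, most recent year first.
    train/evaluate: hypothetical helpers that fit an ANN and score its
    forecasts on the fixed test set.
    """
    best_perf, best_years = float("-inf"), 0
    training_data = []
    for years, year_of_data in enumerate(series_by_year, start=1):
        training_data = training_data + year_of_data  # steps 1/3: grow the window
        model = train(training_data)                  # steps 2/4: train ...
        perf = evaluate(model, test_set)              # ... and test on the same test set
        if perf > best_perf:
            best_perf, best_years = perf, years       # record the new best window
        elif perf < best_perf:
            break                                     # step 5: performance fell -- stop
        # equal performance: keep searching, but retain the smaller window as best
    return best_years, best_perf
```

Ties deliberately keep the smaller window, matching the rule that the optimal size is the smallest quantity of data achieving the best performance.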
Because the described method is a result of the empirical evidence acquired using foreign exchange rates, it stands to reason that testing the method on additional neural network foreign exchange rate forecasting models would merely continue to validate it. Therefore, three new financial time series were used to demonstrate the robustness of the specified method: the DJIA stock index closing values, the closing price of the individual DIS (Walt Disney Co.) stock, and the CAC-40 French stock index closing values. Data samples from January 1977 to August 1999, to simulate the 22 years of data used in the foreign exchange neural network training, were used for the DJIA and DIS time series, and data values from August 1988 to May 1999 were used for the CAC-40 index.

Following the method discussed above, three backpropagation ANNs, one for each of the three time series, were trained on the 1998 data set and tested a single time on the 1999 data values (164 cases for the DJIA and DIS; 123 cases for the CAC-40). Then a single year was added to the training set, and a new ANN model was trained and tested a single time, with the process repeated until a decrease in forecasting performance occurred. An additional 3 years of training data, in 1-year increments, were then added to the training sets and evaluated to strengthen the conclusion that the optimal training set size had been found. A final test of the usefulness of the generalized method for determining minimum optimal training set sizes was performed by training similar neural network models on the full 22-year training set for the DJIA index and DIS stock ANNs and on the 10-year training set for all three networks (10 years being the maximum data quantity available for the CAC-40). Each of the ANNs trained on the "largest" training sets was then tested on the 1999 test data set to evaluate its forecasting performance.

For both the DJIA and the DIS stock, the 1-year training data set was immediately identified as the best size for a training data set as soon as the ANN trained on the 2-year data set was tested. The CAC-40 ANN forecasting model, however, achieved its best performance with a 2-year training data set. While the forecasting accuracy for these three new financial time series did not reach the 60% forecasting accuracy achieved by many of the foreign exchange forecasting ANNs, it did support the generalized method for determining minimum necessary training data sets and consequently lends support to the time-series recency effect. Once the correct or best performing minimum training set was identified by the generalized method, no other ANN model trained on a larger training set was able to outperform the "minimum" training set.

The results for the DIS stock value are slightly better. The conclusion was that the ANN model, which used approximately 4 years of training data, emulated a simple efficient market. A random walk model of the DIS stock produced a 50% prediction accuracy, so the DIS artificial neural network forecasting model did outperform the random walk model, but not by a statistically significant amount. An improvement to the ANN model's stock price change predictions may be achieved by following the generalized method for determining the best size training set and reducing the overall quantity of training data, thus limiting the effect of nonrelevant data.
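To make the significance point concrete, the sketch below shows one conventional way (not taken from the article) to compare a model's directional accuracy against the 50% random-walk baseline, using a one-sided normal-approximation test; the helper names are illustrative.

```python
import numpy as np

def directional_accuracy(forecast_changes, actual_changes):
    """Fraction of forecasts whose sign matches the actual price change."""
    hits = np.sign(forecast_changes) == np.sign(actual_changes)
    return float(np.mean(hits))

def beats_random_walk(accuracy, n_cases, p0=0.5, z_crit=1.645):
    """One-sided test that accuracy exceeds the random-walk baseline p0."""
    se = np.sqrt(p0 * (1.0 - p0) / n_cases)   # standard error under the null
    return (accuracy - p0) / se > z_crit       # True if significant at ~5%
```

With the 164 test cases used for the DIS series, accuracy would need to exceed roughly 56% to beat the random walk at the 5% level under this test, which is why a result only slightly above 50% is not statistically significant.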
Again, as an alternative evaluation mechanism, a simulation is run with the CAC-40 stock index data. A starting value of $10,000, with sufficient funds and/or credit to enable a position on 100 index options contracts, is assumed. Options are purchased or sold consistent with the ANN forecasts of the direction of change in the CAC-40 index, and all options contracts are sold at the end of the year-long simulation. The 2-year training data set model produces a net gain of $16,790, while using the full 10-year training data set produces a net loss of $15,010. The simulation results thus yield a net difference between the TS recency effect model (2 years) and the heuristic greatest-quantity model (10 years) of $31,800, or roughly three times the size of the initial investment.
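A simulation of this kind reduces to a short loop. The sketch below is a simplified reconstruction under stated assumptions, not the article's actual test harness: gains are marked to market in index points per contract, transaction costs and option premiums are ignored, and `model.predict` is a hypothetical helper returning the forecast direction of the next day's change.

```python
def run_trading_simulation(model, closes, contracts=100, stake=10_000.0):
    """Trade index options on forecast direction; return the net gain/loss."""
    balance = stake
    for i in range(len(closes) - 1):
        # Go long on a forecast rise, short on a forecast fall
        direction = 1 if model.predict(closes[: i + 1]) > 0 else -1
        # Settle the day's move across all contracts (simplified accounting)
        balance += direction * contracts * (closes[i + 1] - closes[i])
    return balance - stake
```

Under these assumptions, a correct directional forecast earns the day's index move on each contract, and an incorrect one loses it.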
VII. CONCLUSIONS

General guidelines for the development of artificial neural networks are few, so this article presents several heuristics for developing ANNs that produce optimal generalization performance. Extensive knowledge acquisition is the key to the design of ANNs.

First, the correct input vector for the ANN must be determined by capturing all relevant decision criteria used by domain experts for solving the domain problem to be modeled by the ANN and by eliminating correlated variables.

Second, the selection of a learning method is an open problem; an appropriate learning method can be selected by examining the set of constraints imposed by the collection of training examples available for training the ANN.

Third, the architecture of the hidden layers is determined by further analyzing a domain expert's clustering of the input variables or the heuristic rules used for producing an output value from the input variables. The collection of clustering/decision heuristics used by the domain expert has been called the set of decision factors (DFs). The quantity of DFs is equivalent to the minimum number of hidden units required by an ANN to correctly represent the problem space of the domain.
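As a worked illustration of the DF heuristic (a sketch with hypothetical numbers, not values from the article), suppose knowledge acquisition elicits four decision factors for a currency-forecasting problem. Together with the training-data rule repeated in guideline 4 below, the minimum network and sample sizes follow directly; the one-bias-weight-per-node counting convention here is an assumption.

```python
# Hypothetical decision factors elicited from a domain expert
decision_factors = ["recent trend", "volatility", "rate spread", "momentum"]

n_inputs = 6                      # assumed size of the pruned input vector
n_hidden = len(decision_factors)  # minimum hidden units = number of DFs
n_outputs = 1                     # e.g., next-day direction of change

# Weighted connections for one hidden layer, one bias weight per node
n_weights = (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

# Guideline 4 below: classification wants ~4x as many training cases
min_cases = 4 * n_weights

print(n_hidden, n_weights, min_cases)  # -> 4 33 132
```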
Use of the knowledge-based design heuristics enables an ANN designer to build a minimum-size ANN that is capable of robustly dealing with specific domain problems.
The future may hold automatic methods for determining the optimum configuration of the hidden layers of ANNs. Minimum-size ANN configurations guarantee optimal results with the minimum amount of training time.

Finally, a new time-series model effect, termed the time-series recency effect, has been described and demonstrated to work consistently across six different currency exchange time-series ANN models. The TS recency effect claims that model-building data that is nearer in time to the out-of-sample values to be forecast produces more accurate forecasting models. The empirical results discussed in this article show that, frequently, a smaller quantity of training data will produce a better performing backpropagation neural network model of a financial time series. Research indicates that for financial time series, 2 years of training data are frequently all that is required to produce optimal forecasting accuracy. Results from the Swiss franc models alert the neural network researcher that the TS recency effect may extend beyond 2 years. A generalized method is presented for determining the minimum training set size that produces the best forecasting performance. Neural network researchers and developers using the generalized method for determining the minimum necessary training set size will be able to implement artificial neural networks with the highest forecasting performance at the least cost.

Future research can continue to provide evidence for the TS recency effect by examining the effect of training set size on additional financial time series (e.g., any other stock or commodity and any other index value). The TS recency effect may not be limited to financial time series; evidence from nonfinancial time-series neural network implementations already indicates that smaller quantities of more recent modeling data are capable of producing high-performance forecasting models.

Additionally, the TS recency effect has been demonstrated with neural network models trained using backpropagation. The common belief is that the TS recency effect holds for all supervised learning neural network training algorithms (e.g., radial basis function, fuzzy ARTMAP, probabilistic) and is therefore a general principle for time-series modeling not restricted to backpropagation neural network models.

In conclusion, it has been noted that ANN systems incur costs from training data. This cost is not only financial, but also has an impact on development time and effort. Empirical evidence demonstrates that frequently only 1 or 2 years of training data will produce the "best" performing backpropagation-trained neural network forecasting models. The proposed method for identifying the minimum necessary training set size for optimal performance enables neural network researchers and implementers to develop the highest quality financial time-series forecasting models in the shortest amount of time and at the lowest cost.

Therefore, the set of general guidelines for designing ANNs can be summarized as follows:

1. Perform extensive knowledge acquisition. This knowledge acquisition should be targeted at identifying the necessary domain information required for solving the problem and identifying the decision factors that are used by domain experts for solving the type of problem to be modeled by the ANN.
2. Remove noise variables. Identify highly correlated variables via a Pearson correlation matrix or chi-square test, and keep only one variable from each correlated group (see the sketch after this list). Identify and remove noncontributing variables, depending on data distribution and type, via discriminant/factor analysis or step-wise regression.
3. Select an ANN learning method based on the demographic features of the data and the decision problem. If supervised learning methods are applicable, then implement backpropagation in addition to any other method indicated by the data demographics (e.g., radial-basis function for small training sets or counterpropagation for very noisy training data).
4. Determine the amount of training data. For time series, follow the methodology described in Section VI; for classification problems, use four times the number of weighted connections as training cases.
5. Determine the number of hidden layers. Analyze the complexity, and the number of unique steps, of the traditional expert decision-making solution. If in doubt, use a single hidden layer, but realize that additional nodes may be required to adequately model the domain problem.
6. Set the quantity of hidden nodes in the last hidden layer equal to the number of decision factors used by domain experts to solve the problem, using the knowledge acquired during step 1 of these guidelines.
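The sketch below illustrates guideline 2's correlation screen: a minimal pandas implementation, assuming a numeric feature table; the 0.8 cutoff is an illustrative threshold, not one prescribed by the article.

```python
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Keep only one variable from each highly correlated group."""
    corr = features.corr().abs()   # Pearson correlation matrix, by default
    cols = list(corr.columns)
    keep, drop = [], set()
    for i, col in enumerate(cols):
        if col in drop:
            continue
        keep.append(col)
        # Mark every later column that is strongly correlated with `col`
        for other in cols[i + 1:]:
            if other not in drop and corr.loc[col, other] > threshold:
                drop.add(other)    # keep the first variable, drop the rest
    return features[keep]
```

Chi-square tests (for categorical variables) and step-wise regression would be applied in the same spirit to remove noncontributing variables.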
SEE ALSO THE FOLLOWING ARTICLES

ARTIFICIAL INTELLIGENCE • COMPUTER NETWORKS • EVOLUTIONARY ALGORITHMS AND METAHEURISTICS

BIBLIOGRAPHY

Bansal, A., Kauffman, R. J., and Weitz, R. R. (1993). "Comparing the modeling performance of regression and neural networks as data quality varies: A business value approach," J. Management Infor. Syst. 10 (1), 11–32.
Barnard, E., and Wessels, L. (1992). "Extrapolation and interpolation in neural network classifiers," IEEE Control Syst. 12 (5), 50–53.
Carpenter, G. A., and Grossberg, S. (1988). "The ART of adaptive pattern recognition by a self-organizing neural network," Computer 21 (3), 77–88.
Carpenter, G. A., Grossberg, S., Markuzon, N., and Reynolds, J. H. (1992). "Fuzzy ARTMAP: A neural network architecture for incremental learning of analog multidimensional maps," IEEE Trans. Neural Networks 3 (5), 698–712.
Dayhoff, J. (1990). "Neural Network Architectures: An Introduction," Van Nostrand Reinhold, New York.
Fu, L. (1996). "Neural Networks in Computer Intelligence," McGraw-Hill, New York.
Gately, E. (1996). "Neural Networks for Financial Forecasting," Wiley, New York.
Hammerstrom, D. (1993). "Neural networks at work," IEEE Spectrum 30 (6), 26–32.
Haykin, S. (1994). "Neural Networks: A Comprehensive Foundation," Macmillan, New York.
Hecht-Nielsen, R. (1988). "Applications of counterpropagation networks," Neural Networks 1, 131–139.
Hertz, J., Krogh, A., and Palmer, R. (1991). "Introduction to the Theory of Neural Computation," Addison-Wesley, Reading, MA.
Hopfield, J. J., and Tank, D. W. (1986). "Computing with neural circuits: A model," Science 233 (4764), 625–633.
Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer feedforward networks are universal approximators," Neural Networks 2 (5), 359–366.
Kohonen, T. (1988). "Self-Organization and Associative Memory," Springer-Verlag, Berlin.
Li, E. Y. (1994). "Artificial neural networks and their business applications," Infor. Management 27 (5), 303–313.
Medsker, L., and Liebowitz, J. (1994). "Design and Development of Expert Systems and Neural Networks," Macmillan, New York.
Mehra, P., and Wah, B. W. (1992). "Artificial Neural Networks: Concepts and Theory," IEEE, New York.
Moody, J., and Darken, C. J. (1989). "Fast learning in networks of locally-tuned processing elements," Neural Comput. 1 (2), 281–294.
Smith, M. (1993). "Neural Networks for Statistical Modeling," Van Nostrand Reinhold, New York.
Specht, D. F. (1991). "A general regression neural network," IEEE Trans. Neural Networks 2 (6), 568–576.
White, H. (1990). "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings," Neural Networks 3 (5), 535–549.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). "Neural networks: Applications in industry, business and science," Commun. ACM 37 (3), 93–105.