Classification by Backpropagation: A Multilayer Feed-Forward Neural Network, Defining a Network Topology, and Backpropagation
Each layer is made up of units. The inputs to the network correspond to the attributes
measured for each training tuple. The inputs are fed simultaneously into the units
making up the input layer. These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of “neuronlike” units, known as a
hidden layer. The outputs of the hidden layer units can be input to another hidden
layer, and so on. The number of hidden layers is arbitrary, although in practice, usually
only one is used. The weighted outputs of the last hidden layer are input to units making
up the output layer, which emits the network’s prediction for given tuples.
The units in the input layer are called input units. The units in the hidden layers and
output layer are sometimes referred to as neurodes, due to their symbolic biological
basis, or as output units. The multilayer neural network shown in Figure 9.2 has two
layers of weighted units: one hidden layer and one output layer. Therefore, we say that it is a two-layer neural network. (The
input layer is not counted because it serves only to pass the input values to the next
layer.) Similarly, a network containing two hidden layers is called a three-layer neural
network, and so on. It is a feed-forward network since none of the weights cycles back
to an input unit or to a previous layer’s output unit. It is fully connected in that each
unit provides input to each unit in the next forward layer.
Each output unit takes, as input, a weighted sum of the outputs from units in the
previous layer (see Figure 9.4 later). It applies a nonlinear (activation) function to the
weighted input. Multilayer feed-forward neural networks are able to model the class pre-
diction as a nonlinear combination of the inputs. From a statistical point of view, they
perform nonlinear regression. Multilayer feed-forward networks, given enough hidden
units and enough training samples, can closely approximate any function.
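To make this concrete, the following minimal sketch (in Python with NumPy; predict, layers, and the layer sizes are illustrative choices, not from the text) shows a fully connected feed-forward pass in which each unit applies a nonlinear activation to a weighted sum of the previous layer's outputs, so the prediction is a nonlinear combination of the inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(x, layers):
    # Fully connected feed-forward pass: each layer is a (weights, biases)
    # pair, and each unit applies a nonlinear activation to a weighted sum
    # of the previous layer's outputs.
    out = np.asarray(x, dtype=float)
    for W, b in layers:
        out = sigmoid(out @ W + b)
    return out

# A two-layer network (one hidden layer plus one output layer);
# the input layer is not counted because it only passes values through.
rng = np.random.default_rng(0)
layers = [(rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, 2)),
          (rng.uniform(-0.5, 0.5, (2, 1)), rng.uniform(-0.5, 0.5, 1))]
print(predict([1.0, 0.0, 1.0], layers))
```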
9.2.3 Backpropagation
“How does backpropagation work?” Backpropagation learns by iteratively processing a
data set of training tuples, comparing the network’s prediction for each tuple with the
actual known target value. The target value may be the known class label of the training
tuple (for classification problems) or a continuous value (for numeric prediction). For
each training tuple, the weights are modified so as to minimize the mean-squared error
between the network’s prediction and the actual target value. These modifications are
made in the “backwards” direction (i.e., from the output layer) through each hidden
layer down to the first hidden layer (hence the name backpropagation). Although it is
not guaranteed, in general the weights will eventually converge, and the learning process
stops. The algorithm is summarized in Figure 9.3. The steps involved are expressed in
terms of inputs, outputs, and errors, and may seem awkward if this is your first look at
neural network learning. However, once you become familiar with the process, you will
see that each step is inherently simple. The steps are described next.
Initialize the weights: The weights in the network are initialized to small random num-
bers (e.g., ranging from −1.0 to 1.0, or −0.5 to 0.5). Each unit has a bias associated with
it, as explained later. The biases are similarly initialized to small random numbers.
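A sketch of this initialization step, under the same illustrative NumPy setting as above (init_layer and the layer sizes are hypothetical names and choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, low=-0.5, high=0.5):
    # Weights and biases start as small random numbers, e.g. in [-0.5, 0.5].
    W = rng.uniform(low, high, size=(n_in, n_out))   # one weight per connection
    theta = rng.uniform(low, high, size=n_out)       # one bias per unit
    return W, theta

# For example: 3 input units, 2 hidden units, 1 output unit.
layers = [init_layer(3, 2), init_layer(2, 1)]
```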
Each training tuple, X, is processed by the following steps.
Propagate the inputs forward: First, the training tuple is fed to the network’s input
layer. The inputs pass through the input units, unchanged. That is, for an input unit, j,
its output, Oj, is equal to its input value, Ij.

Figure 9.4 Hidden or output layer unit j: the inputs to unit j are the outputs of the previous layer. These are multiplied by their corresponding weights (w1j, ..., wnj) to form a weighted sum Σ, which is added to the bias θj associated with unit j; a nonlinear activation function f is then applied to the net input to produce the unit's output. (For ease of explanation, the inputs to unit j are labeled y1, y2, ..., yn. If unit j were in the first hidden layer, these inputs would correspond to the input tuple (x1, x2, ..., xn).)

Next, the net input and output of each unit
in the hidden and output layers are computed. The net input to a unit in the hidden or
output layers is computed as a linear combination of its inputs. To help illustrate this
point, a hidden layer or output layer unit is shown in Figure 9.4. Each such unit has
a number of inputs to it that are, in fact, the outputs of the units connected to it in
the previous layer. Each connection has a weight. To compute the net input to the unit,
each input connected to the unit is multiplied by its corresponding weight, and this is
summed. Given a unit j in a hidden or output layer, the net input, Ij, to unit j is

    I_j = \sum_i w_{ij} O_i + \theta_j,    (9.4)
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is
the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts
as a threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an acti-
vation function to it, as illustrated in Figure 9.4. The function symbolizes the activation
of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given
the net input Ij to unit j, the output Oj of unit j is computed as

    O_j = \frac{1}{1 + e^{-I_j}}.    (9.5)
This function is also referred to as a squashing function, because it maps a large input
domain onto the smaller range of 0 to 1. The logistic function is nonlinear and
differentiable, allowing the backpropagation algorithm to model classification problems
that are linearly inseparable.
We compute the output values, Oj , for each hidden layer, up to and including the
output layer, which gives the network’s prediction. In practice, it is a good idea to
cache (i.e., save) the intermediate output values at each unit as they are required again
later when backpropagating the error. This trick can substantially reduce the amount of
computation required.
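As a sketch of this forward step (again illustrative Python/NumPy, with a forward helper that caches the per-layer outputs for reuse during backpropagation):

```python
import numpy as np

def forward(x, layers):
    # Propagate the inputs forward, caching every layer's outputs O_j,
    # since they are needed again when backpropagating the error.
    outputs = [np.asarray(x, dtype=float)]   # input units: O_j equals I_j
    for W, theta in layers:
        I = outputs[-1] @ W + theta                # net input, Eq. (9.4)
        outputs.append(1.0 / (1.0 + np.exp(-I)))   # logistic output, Eq. (9.5)
    return outputs                                 # outputs[-1] is the prediction
```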
Backpropagate the error: The error is propagated backward by updating the weights
and biases to reflect the error of the network’s prediction. For a unit j in the output
layer, the error Errj is computed by

    Err_j = O_j (1 - O_j)(T_j - O_j),    (9.6)

where Oj is the actual output of unit j, and Tj is the known target value of the given
training tuple. Note that Oj(1 − Oj) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the
units connected to unit j in the next layer is considered. The error of a hidden layer
unit j is

    Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk},    (9.7)
where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
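Eqs. (9.6) and (9.7) translate directly into code. A sketch that pairs with the forward helper above (backprop_errors is an illustrative name, not from the text):

```python
import numpy as np

def backprop_errors(outputs, target, layers):
    # outputs: per-layer activations cached by forward(); target: the T_j values.
    errs = [None] * len(layers)
    O = outputs[-1]
    errs[-1] = O * (1 - O) * (target - O)                 # output layer, Eq. (9.6)
    for l in range(len(layers) - 2, -1, -1):
        W_next, _ = layers[l + 1]
        O = outputs[l + 1]
        errs[l] = O * (1 - O) * (errs[l + 1] @ W_next.T)  # hidden layer, Eq. (9.7)
    return errs
```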
The weights and biases are updated to reflect the propagated errors. Weights are
updated by the following equations, where Δwij is the change in weight wij:

    \Delta w_{ij} = (l)\, Err_j\, O_i,    (9.8)
    w_{ij} = w_{ij} + \Delta w_{ij}.    (9.9)
“What is l in Eq. (9.8)?” The variable l is the learning rate, a constant typically having
a value between 0.0 and 1.0. Backpropagation learns using a gradient descent method
to search for a set of weights that fits the training data so as to minimize the mean-
squared distance between the network’s class prediction and the known target value of
the tuples.¹ The learning rate helps avoid getting stuck at a local minimum in decision
space (i.e., where the weights appear to converge, but are not the optimum solution) and
encourages finding the global minimum. If the learning rate is too small, then learning
will occur at a very slow pace. If the learning rate is too large, then oscillation between
inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where
t is the number of iterations through the training set so far.

¹A method of gradient descent was also used for training Bayesian belief networks, as described in Section 9.1.2.
Biases are updated by the following equations, where Δθj is the change in bias θj:

    \Delta \theta_j = (l)\, Err_j,    (9.10)
    \theta_j = \theta_j + \Delta \theta_j.    (9.11)
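Eqs. (9.8) through (9.11) then amount to one in-place update step per layer. A sketch under the same assumptions as the earlier helpers:

```python
import numpy as np

def update(layers, outputs, errs, lr):
    # Apply Delta w_ij = (l) Err_j O_i and Delta theta_j = (l) Err_j, then
    # add the increments to the weights and biases, Eqs. (9.8) to (9.11).
    for l, (W, theta) in enumerate(layers):
        W += lr * np.outer(outputs[l], errs[l])   # Eqs. (9.8)-(9.9)
        theta += lr * errs[l]                     # Eqs. (9.10)-(9.11)
```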
Note that here we are updating the weights and biases after the presentation of each
tuple. This is referred to as case updating. Alternatively, the weight and bias incre-
ments could be accumulated in variables, so that the weights and biases are updated
after all the tuples in the training set have been presented. This latter strategy is called
epoch updating, where one iteration through the training set is an epoch. In the-
ory, the mathematical derivation of backpropagation employs epoch updating, yet
in practice, case updating is more common because it tends to yield more accurate
results.
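The two strategies differ only in where the increments are applied. A sketch reusing the forward, backprop_errors, and update helpers from the earlier sketches:

```python
import numpy as np

def train_epoch(data, layers, lr, epoch_update=False):
    # Case updating applies the increments after each tuple; epoch
    # updating accumulates them and applies them once per epoch.
    if epoch_update:
        acc = [[np.zeros_like(W), np.zeros_like(t)] for W, t in layers]
    for x, target in data:
        outputs = forward(x, layers)
        errs = backprop_errors(outputs, np.asarray(target, dtype=float), layers)
        if epoch_update:
            for l in range(len(layers)):
                acc[l][0] += lr * np.outer(outputs[l], errs[l])
                acc[l][1] += lr * errs[l]
        else:
            update(layers, outputs, errs, lr)   # case updating
    if epoch_update:
        for (W, theta), (dW, dtheta) in zip(layers, acc):
            W += dW
            theta += dtheta
```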
Terminating condition: Training stops when

- All Δwij in the previous epoch are so small as to be below some specified threshold, or
- The percentage of tuples misclassified in the previous epoch is below some threshold, or
- A prespecified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights
will converge.
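A helper along these lines can check the three terminating conditions (the threshold values shown are illustrative placeholders, not recommendations):

```python
import numpy as np

def should_stop(weight_deltas, error_rate, epoch,
                delta_eps=1e-4, error_eps=0.05, max_epochs=500_000):
    # Stop when every weight change in the epoch was tiny, OR the
    # misclassification rate fell below a threshold, OR the epoch
    # budget is exhausted. Any single condition suffices.
    all_small = all(np.all(np.abs(d) < delta_eps) for d in weight_deltas)
    return all_small or error_rate < error_eps or epoch >= max_epochs
```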
“How efficient is backpropagation?” The computational efficiency depends on the
time spent training the network. Given |D| tuples and w weights, each epoch requires
O(|D| × w) time. However, in the worst-case scenario, the number of epochs can be
exponential in n, the number of inputs. In practice, the time required for the networks
to converge is highly variable. A number of techniques exist that help speed up the train-
ing time. For example, a technique known as simulated annealing can be used, which
also ensures convergence to a global optimum.
Example 9.1 Sample calculations for learning by the backpropagation algorithm. Figure 9.5 shows
a multilayer feed-forward neural network. Let the learning rate be 0.9. The initial weight
and bias values of the network are given in Table 9.1, along with the first training tuple,
X = (1, 0, 1), with a class label of 1.
This example shows the calculations for backpropagation, given the first training
tuple, X. The tuple is fed into the network, and the net input and output of each unit
are computed. These values are shown in Table 9.2. The error of each unit is computed
and propagated backward. The error values are shown in Table 9.3. The weight and bias
updates are shown in Table 9.4.
Figure 9.5 Example of a multilayer feed-forward neural network: input units 1, 2, and 3 (fed by x1, x2, x3), hidden units 4 and 5, and output unit 6, connected by weights w14, w15, w24, w25, w34, w35, w46, and w56.
Table 9.1 Initial input, weight, and bias values.

    x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
    1   0   1   0.2  −0.3  0.4  0.1  −0.5  0.2  −0.3  −0.2  −0.4  0.2  0.1
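The example's numbers can be verified directly from Table 9.1. The following self-contained sketch reproduces the forward pass and the backpropagated errors for this tuple; running it yields O4 ≈ 0.332, O5 ≈ 0.525, O6 ≈ 0.474, Err6 ≈ 0.1311, Err5 ≈ −0.0065, and Err4 ≈ −0.0087.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Values from Table 9.1.
x = np.array([1.0, 0.0, 1.0])                 # training tuple X
W1 = np.array([[0.2, -0.3],                   # w14, w15
               [0.4,  0.1],                   # w24, w25
               [-0.5, 0.2]])                  # w34, w35
theta_h = np.array([-0.4, 0.2])               # theta_4, theta_5
W2 = np.array([[-0.3], [-0.2]])               # w46, w56
theta_o = np.array([0.1])                     # theta_6
T = 1.0                                       # known class label

# Forward pass, Eqs. (9.4) and (9.5).
O_h = sigmoid(x @ W1 + theta_h)               # O4 ~ 0.332, O5 ~ 0.525
O_o = sigmoid(O_h @ W2 + theta_o)             # O6 ~ 0.474

# Backpropagated errors, Eqs. (9.6) and (9.7).
err_o = O_o * (1 - O_o) * (T - O_o)           # Err6 ~ 0.1311
err_h = O_h * (1 - O_h) * (err_o @ W2.T)      # Err4 ~ -0.0087, Err5 ~ -0.0065

print(O_h, O_o, err_o, err_h)
```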