which reflects a contrast between the average firing rate during the outcome, represented by the first
term, and that over the expectation, represented by the second term.
We will see how this XCAL rule is related to the backpropagation error-minimizing rule, but achieves
this function in a more biologically constrained way. This was the same goal of previous attempts
including the GeneRec (generalized recirculation) algorithm (O'Reilly, 1996), which is equivalent to
the Contrastive Hebbian Learning (CHL) equation (Movellan, 1990):

$$\Delta w = \left( x^+ y^+ \right) - \left( x^- y^- \right)$$
Here, the first term is the activity of the sending and receiving units during the outcome (in the plus
phase), while the second term is the activity during the expectation (in the minus phase). CHL is
so-named because it involves the contrast or difference between two Hebbian-like terms. As you can
see, XCAL is essentially equivalent to CHL, despite a few differences:

- XCAL uses the XCAL dWt function instead of a direct subtraction, which causes weight changes to go to 0 when short-term activity is 0 (as dictated by the biology); a sketch of this function follows the list.
- XCAL is based on average activations across the entire evolution of attractors (reflected by accumulated Ca++ levels), instead of on single points of activation (i.e., the final attractor state in each of two phases, as used somewhat unrealistically in CHL -- how would the plasticity rules 'know' exactly what counts as the final state of each phase?).
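For concreteness, here is a minimal Python sketch of a piecewise-linear XCAL dWt function of the kind described in the main chapter. The specific functional form and the reversal parameter theta_d = 0.1 are our assumptions for illustration, not specifications from this subtopic:

```python
import numpy as np

def xcal_dwt(xy, theta_p, theta_d=0.1):
    """Piecewise-linear XCAL dWt function (illustrative sketch).

    xy: short-term synaptic activity (product of send/recv activity).
    theta_p: floating threshold separating LTD from LTP.
    theta_d: assumed reversal point as a fraction of theta_p.
    Note that the weight change is exactly 0 when xy is 0.
    """
    xy = np.asarray(xy, dtype=float)
    return np.where(xy > theta_p * theta_d,
                    xy - theta_p,                      # LTP-like region
                    -xy * (1.0 - theta_d) / theta_d)   # LTD dip back to 0

print(xcal_dwt([0.0, 0.05, 0.5, 1.0], theta_p=0.5))
# [ 0.   -0.45  0.    0.5 ]
```

The key property for present purposes is visible at xy = 0: unlike a direct CHL subtraction, the weight change is exactly zero when there is no short-term activity.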
Both of these factors are discussed more in the subtopic on Implementational Details. But for the
present purposes, we can safely ignore them, which allows us to leverage all of the analysis that
went into understanding GeneRec -- itself a large step towards biological plausibility relative to
backpropagation.
The core of this analysis revolves around the following simpler version of the GeneRec equation,
which we call the GeneRec delta equation:

$$\Delta w = x^- \left( y^+ - y^- \right)$$
where the weight change is driven only by the delta in activity on the receiving unit y between the
plus (outcome) and minus (expectation) phases, multiplied by the sending unit activation x. One can
derive the full CHL equation from this simpler GeneRec delta equation by adding a constraint that
the weight changes computed by the sending unit to the receiving unit be the same as those of the
receiving unit to the sending unit (i.e., a symmetry constraint based on bidirectional connectivity),
and by replacing the minus phase activation for the sending unit with the average of the minus and
plus phase activations (which ends up being equivalent to the midpoint method for integrating a
differential equation). You can find the actual mathematics of this derivation later in this subtopic, but
you can take our word for it for the time being.
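To make the form of these two equations concrete, here is a minimal Python sketch of both updates (our own illustration; the variable names and learning rate are arbitrary):

```python
import numpy as np

def generec_delta(x_minus, y_minus, y_plus, lrate=0.1):
    """GeneRec delta equation: dW = lrate * x- * (y+ - y-)."""
    # Outer product applies the rule to every sending/receiving pair.
    return lrate * np.outer(x_minus, y_plus - y_minus)

def chl(x_minus, x_plus, y_minus, y_plus, lrate=0.1):
    """Full CHL equation: dW = lrate * ((x+ y+) - (x- y-))."""
    return lrate * (np.outer(x_plus, y_plus) - np.outer(x_minus, y_minus))
```

Note how the GeneRec version uses only the minus-phase sending activation, whereas CHL contrasts two full Hebbian products.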
Interestingly, the GeneRec delta equation is equivalent in form to the delta rule, which we derive
below as the optimal way to reduce error in a two layer network (input units sending to output units,
with no hidden units in between). The delta rule was originally derived in 1960 by Widrow and Hoff,
and it is also basically equivalent to a gradient descent solution to linear regression. This is very
basic old-school math.
But two-layer networks are very limited in what they can compute. As we discussed in the Networks
Chapter, you really need those hidden layers to form higher-level ways of re-categorizing the input,
to solve challenging problems (you will also see this directly in the simulation explorations in this
chapter). As we discuss more below, the limitations of the delta rule and two-layer networks were
highlighted in a very critical paper by Minsky and Papert in 1969, which brought research in the field
of neural network models nearly to a standstill for roughly 20 years.
Figure 1: Illustration of backpropagation computation in a three-layer network. First, the feedforward activation
pass generates a pattern of activations across the units in the network, cascading from input to hidden to
output. Then, "delta" values are propagated backward in the reverse direction across the same weights. The
delta sum is broken out in the hidden layer to facilitate comparison with the GeneRec algorithm as shown in the
next figure.
Figure 2: Illustration of GeneRec/XCAL computation in a three-layer network, for comparison with the previous figure
showing backpropagation. Activations settle in the expectation/minus phase, in response to input activations
presented to the input layer. Activation flows bidirectionally, so that the hidden units are driven both by inputs
and activations that arise on the output units. In the outcome/plus phase, "target" values drive the output unit
activations, and due to the bidirectional connectivity, these also influence the hidden units in the plus phase.
Mathematically, changing the weights based on the difference in hidden layer activation states between the
plus and minus phases results in a close approximation to the delta value computed by backpropagation. This
same rule is then used to change the weights into the hidden units from the input units (delta times sending
activation), which is the same form used in backpropagation, and identical in form to the delta rule.
The delta backpropagation equation specifies the weight change in terms of the sending activation x, the derivative of the receiving unit's activation function y', and a sum of weighted delta values:

$$\Delta w = x \; y' \sum_k \delta_k w_k$$

where the sum is over the delta values of units in the layer above the layer containing the current receiving unit y (with each such unit indexed by the subscript k), and $w_k$ is the weight from the receiving unit y to the k'th such unit in the next layer. For the output units, the delta value is simply the difference between the target and the actual output activation:

$$\delta_k = t_k - z_k$$

and it is used to drive learning by changing the weight from sending unit y in the hidden layer to a given output unit z:

$$\Delta w = y \left( t_k - z_k \right)$$
You should recognize that this is exactly the delta rule as described above (where we keep in mind
that y is now a sending activation to the output units). The delta rule is really the essence of all error-driven learning methods.
Now let's get back to the delta backpropagation equation, and see how we can get from it to
GeneRec (and thus to XCAL). We just need to replace the $\delta_k$ term with the $(t_k - z_k)$ value for the
output units, and then do some basic rearranging of terms, and we get very close to the GeneRec
delta equation:

$$\Delta w = x \; y' \left( \sum_k t_k w_k - \sum_k z_k w_k \right)$$

If you compare this last equation with the GeneRec delta equation, they would be equivalent (except
for the y' term that we're still ignoring) if we made the following definitions:

$$y^+ = \sum_k t_k w_k, \qquad y^- = \sum_k z_k w_k$$
Interestingly, these sum terms are identical to the net input that unit y would receive from unit z if the
weight went the other way, or, critically, if y also received a symmetric, bidirectional
connection from z, in addition to sending activity to z. Thus, we arrive at the critical insight behind the
GeneRec algorithm relative to the backpropagation algorithm: bidirectional connections allow the error signal to be communicated as the difference between two activation states (outcome versus expectation), using only standard net-input computations, instead of requiring an explicit backward propagation of delta values.
The only wrinkle in this argument at this point is that we had to assign the activation states of the
receiving unit to be equal to those net-input like terms (even though we use non-linear thresholded
activation functions), and also those net input terms ignore the other inputs that the receiving unit
should also receive from the sending units in the input layer. The second problem is easily
dispensed with, because those inputs from the input layer would be common to both "phases" of
activation, and thus they cancel out when we subtract $y^-$ from $y^+$. The first problem can be solved by
finally no longer ignoring the y' term -- it turns out that the difference between a function evaluated at
two different points can be approximated as the difference between the two points, times the
derivative of the function:

$$f(a) - f(b) \approx \left( a - b \right) f'(b)$$
So we can now say that the activation states of y are a function of these net input terms:

$$y^+ = f(\eta^+), \qquad y^- = f(\eta^-)$$

where $\eta^+ = \sum_k t_k w_k$ and $\eta^- = \sum_k z_k w_k$, and thus their difference can be approximated by the difference in net inputs times the activation function derivative:

$$y^+ - y^- \approx y' \left( \eta^+ - \eta^- \right)$$

which gets us right back to the GeneRec delta equation as being a good approximation to the delta
backpropagation equation:

$$\Delta w = x \left( y^+ - y^- \right) \approx x \; y' \sum_k \left( t_k - z_k \right) w_k$$
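Here is a quick numerical check of that first-order approximation (our own illustration, using tanh as a stand-in for any differentiable activation function f):

```python
import numpy as np

f = np.tanh
f_prime = lambda x: 1.0 - np.tanh(x) ** 2   # derivative of tanh

eta_minus, eta_plus = 0.50, 0.55            # net inputs in the two phases
exact = f(eta_plus) - f(eta_minus)          # true activation difference
approx = (eta_plus - eta_minus) * f_prime(eta_minus)
print(exact, approx)                        # ~0.0384 vs ~0.0393
```

The approximation is good as long as the expectation and outcome net inputs are not too far apart, which is the typical case once learning is underway.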
So if you've followed along to this point, you can now rest easy by knowing that the GeneRec (and
thus XCAL) learning functions are actually very good approximations to error backpropagation. As
we noted at the outset, XCAL uses bidirectional activation dynamics to communicate error signals
throughout the network, in terms of averaged activity over two distinct states of activation
(expectation followed by outcome), whereas backpropagation uses a biologically implausible
procedure that propagates a single error value (outcome - expectation) backward across weight
values, in the opposite direction of the way that activation typically flows.
Derivation of the Delta Rule

The delta rule can be derived directly by minimizing the sum squared error (SSE) at the output:

$$SSE = \sum_k \left( t_k - z_k \right)^2$$

which is the sum over output units (indexed by k) of the squared difference between the target
activation t and the actual output activation that the network produced (z). There is typically an extra sum here too, over all
the different input/output patterns that the network is being trained on, but it cancels out for all of the
following math, so we can safely ignore it.
In the context of the expectation and outcome framework of the main chapter, the outcomes are the
targets, and the expectations are the output activity of the network.
For the time being, we assume a linear activation function of the activations from sending units y, and
that we just have a simple two-layer network with these sending units projecting directly to the output
units:

$$z_k = \sum_j y_j w_{jk}$$
We take the negative of the derivative of SSE with respect to the weight w (so that the weight change goes downhill on the error), which is more easily computed
by breaking it down into two parts using the chain rule: first get the derivative of SSE with respect
to the output activation z, and multiply that by the derivative of z with respect to the weight:

$$\Delta w_{jk} = -\frac{\partial SSE}{\partial w_{jk}} = -\frac{\partial SSE}{\partial z_k} \frac{\partial z_k}{\partial w_{jk}}$$

When you break down each step separately, it is all very straightforward:

$$\frac{\partial SSE}{\partial z_k} = -2 \left( t_k - z_k \right), \qquad \frac{\partial z_k}{\partial w_{jk}} = y_j$$

(the other elements of the sums drop out because the first partial derivative is with respect to $z_k$, so the
derivative for all other z's is zero, and similarly the second partial derivative is with respect to $y_j$, so
the derivative for all other y's is zero.)

Thus, the negative of $\frac{\partial SSE}{\partial w_{jk}}$ is (absorbing the constant factor of 2 into the learning rate):

$$\Delta w_{jk} = \left( t_k - z_k \right) y_j$$
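As an illustration (ours, not the book's), here is the delta rule learning a linear mapping in Python, with the constant factor of 2 absorbed into the learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 patterns over 3 sending units
W_true = np.array([[0.5], [-1.0], [2.0]])
T = X @ W_true                           # targets from a linear "teacher"

W = np.zeros((3, 1))                     # 3 sending units -> 1 output unit
lrate = 0.05
for epoch in range(50):
    for x, t in zip(X, T):
        z = x @ W                        # linear output activation
        W += lrate * np.outer(x, t - z)  # delta rule: dw = lrate * send * (t - z)
    # weights gradually descend the SSE gradient toward W_true

print(W.ravel())                         # approaches [0.5, -1.0, 2.0]
```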
The centerpiece of Minsky and Papert's critique was that two-layer networks trained by the delta rule cannot solve the exclusive-or (XOR) problem, where the output should be active when one input or the other is active, but not both. In retrospect, it seems
obvious that the problem was the use of a two-layer network, but as often happens, this critique left
a bad "odor" over the field, and people simply pursued other approaches (mainly symbolic AI, which
Minsky was an advocate for).
Error Backpropagation
Then, in 1986 (17 years later), David Rumelhart and colleagues published a paper on the
backpropagation learning algorithm, which extended the delta-rule style error-driven learning to
networks with three or more layers. The addition of the extra layer(s) now allows such networks to
solve XOR and any other kind of problem (there are proofs about the universality of the learning
procedure). The problem is that above, we only considered how to change weights from a sending
unit y to an output unit z, based on the error between the target t and actual output activity. But for
multiple stages of hidden layers, how do we adjust the weights from the inputs to the hidden units?
Interestingly, the mathematics of this involves simply adding a few more steps to the chain rule.
The overall derivation is as follows. The goal is to again minimize the error (SSE) as a function of the
weights, this time the weights $w_{ij}$ from input unit $x_i$ to hidden unit $y_j$:

$$-\frac{\partial SSE}{\partial w_{ij}} = -\sum_k \frac{\partial SSE}{\partial z_k} \frac{\partial z_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial y_j} \; \frac{\partial y_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial w_{ij}}$$
Although this looks like a lot, it is really just applying the same chain rule as above repeatedly. To
know how to change the weights from input unit $x_i$ to hidden unit $y_j$, we have to know
how the SSE changes with output activity $z_k$, how output activity changes with its net input $\eta_k$, how this net
input changes with hidden unit activity $y_j$, how in turn this activity changes with its net
input $\eta_j$ to hidden unit $y_j$, and finally, how this net input changes with the weights from sending unit
$x_i$. Once all of these factors are computed, they can be multiplied together to determine the weight changes for
all sending units to all hidden units (and also, as derived earlier, for all hidden units to all output
units).
We again assume a linear activation function at the output for simplicity, so that $\frac{\partial z_k}{\partial \eta_k} = 1$. We allow
for non-linear activation functions in the hidden units y, and simply refer to the derivative of this
activation function as $y'$ (which for the common sigmoidal activation function turns out to
be $y(1-y)$, but we leave it in generic form here so that it can be applied to any differentiable
activation function). The solution to the above equation is then, applying each step in order:

$$\Delta w_{ij} = x_i \; y'_j \sum_k \left( t_k - z_k \right) w_{jk}$$

as specified earlier.
You can see that this weight change occurs not only in proportion to the error at the output, and the
'backpropagated' error at the hidden unit activity y, but also to the activity of the sending
unit $x_i$. So, once again, the learning rule assigns credit/blame to change weights based on
active input units that contributed to the error, weighted by the degree to which errors at the level of
hidden unit activities contribute to errors at the output (which is captured by the weight $w_{jk}$). At all steps
along the process, the appropriate units and weights are factored in to minimize errors. This
procedure can be repeated for any arbitrary number of layers, with repeated application of the chain
rule.
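A minimal Python sketch of one such backpropagation update for a three-layer network (our own illustration, with sigmoid hidden units and linear outputs, matching the assumptions in the derivation above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_xy, W_yz, lrate=0.1):
    """One backpropagation weight update for input x and target t."""
    # Feedforward pass: input -> hidden (sigmoid) -> output (linear)
    y = sigmoid(x @ W_xy)
    z = y @ W_yz

    # Output delta is just (t - z) for linear output units
    delta_z = t - z
    # Hidden delta: backpropagated error times y' = y * (1 - y)
    delta_y = (delta_z @ W_yz.T) * y * (1.0 - y)

    # Both layers use the same form: delta times sending activation
    W_yz += lrate * np.outer(y, delta_z)
    W_xy += lrate * np.outer(x, delta_y)
    return W_xy, W_yz
```

Note that the hidden-unit delta is exactly the $y' \sum_k \delta_k w_k$ term from the delta backpropagation equation.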
Biological Implausibility
todo
https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Learning/Backpropagation
Introduction
Artificial neural networks (ANNs) are a powerful class of models used for nonlinear
regression and classification tasks that are motivated by biological neural computation.
The general idea behind ANNs is pretty straightforward: map some input onto a desired
target value using a distributed cascade of nonlinear transformations (see Figure 1).
However, for many, myself included, the learning algorithm used to train ANNs can
be difficult to get your head around at first. In this post I give a step-by-step walkthrough of the derivation of the gradient descent learning algorithm commonly used to train
ANNs (aka the backpropagation algorithm) and try to provide some high-level insights
into the computations being performed during learning.
An ANN consists of an input layer, an output layer, and any number (including zero) of
hidden layers situated between the input and output layers. Figure 1 diagrams an ANN
with a single hidden layer. The feed-forward computations performed by the ANN are as
follows: The signals from the input layer $a_i$ are multiplied by a set of fully-connected
weights $w_{ij}$ connecting the input layer to the hidden layer. These weighted signals are then
summed and combined with a bias $b_j$ (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal

$$z_j = b_j + \sum_i a_i w_{ij}$$

for the hidden layer. The pre-activation signal is then transformed by the hidden layer
activation function $g_j$ to form the feed-forward activation signals leaving the hidden layer, $a_j = g_j(z_j)$. In a similar fashion, the hidden layer
activation signals $a_j$ are multiplied by the weights connecting the hidden layer to the output layer $w_{jk}$, a bias $b_k$ is added, and the resulting signal is transformed by the output activation function $g_k$ to form the network
output $a_k = g_k(z_k)$. The output is then compared to a desired target $t_k$, and the error between the two is calculated. Training a neural network involves determining the set of parameters $\theta = \{\mathbf{W}, \mathbf{b}\}$ that
minimize the errors that the network makes. Often the choice for the error function is
the sum of the squared difference between the target values $t_k$ and the network output $a_k$:

$$E = \frac{1}{2} \sum_{k \in K} \left( a_k - t_k \right)^2 \qquad \text{Equation (1)}$$
This problem can be solved using gradient descent, which requires determining
$\frac{\partial E}{\partial \theta}$ for all $\theta$ in the model. Note that, in general, there are two sets of parameters:
those parameters that are associated with the output layer (i.e., $w_{jk}$, $b_k$), and thus
directly affect the network output error; and the remaining parameters that are
associated with the hidden layer(s), and thus affect the output error indirectly.
Before we begin, let's define the notation that will be used in the remainder of the
derivation. Please refer to Figure 1 for any clarification.

- $z_j$: input to node $j$ in layer $l$
- $g_j$: activation function for node $j$ in layer $l$ (applied to $z_j$)
- $a_j = g_j(z_j)$: output/activation of node $j$ in layer $l$
- $w_{ij}$: weight connecting node $i$ in layer $l-1$ to node $j$ in layer $l$
- $b_j$: bias for node $j$ in layer $l$
- $t_k$: target value for node $k$ in the output layer
Since the output layer parameters directly affect the value of the error function,
determining the gradients for those parameters is fairly straight-forward:
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_{k \in K} \left( a_k - t_k \right)^2 = \left( a_k - t_k \right) \frac{\partial a_k}{\partial w_{jk}} \qquad \text{Equation (2)}$$
Here, we've used the Chain Rule. (Also notice that the summation disappears in the
derivative. This is because when we take the partial derivative with respect to
the $k$-th dimension/node, the only term that survives in the error gradient
is the $k$-th, and thus we can ignore the remaining terms in the summation.) The
remaining step is to evaluate $\frac{\partial a_k}{\partial w_{jk}}$. Also, recall that $a_k = g_k(z_k)$. Thus
$$\frac{\partial a_k}{\partial w_{jk}} = \frac{\partial a_k}{\partial z_k} \frac{\partial z_k}{\partial w_{jk}} = g_k'(z_k) \frac{\partial z_k}{\partial w_{jk}} \qquad \text{Equation (3)}$$
where, again, we use the Chain Rule. Now, recall that $z_k = b_k + \sum_j a_j w_{jk}$,
and thus $\frac{\partial z_k}{\partial w_{jk}} = a_j$, giving:
$$\frac{\partial E}{\partial w_{jk}} = \left( a_k - t_k \right) g_k'(z_k) \, a_j \qquad \text{Equation (4)}$$
The gradient of the error function with respect to the output layer weights is a product
of three terms. The first term is the difference between the network output and the
target value, $(a_k - t_k)$. The second term is the derivative of the output layer activation
function, $g_k'(z_k)$. And the third term is the activation output of node $j$ in the hidden layer, $a_j$.
If we define

$$\delta_k = \left( a_k - t_k \right) g_k'(z_k)$$

we obtain the following expression for the derivative of the error with respect to the
output weights $w_{jk}$:

$$\frac{\partial E}{\partial w_{jk}} = \delta_k \, a_j \qquad \text{Equation (5)}$$
Here the $\delta_k$ terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error signal.
Loosely speaking, Equation (5) can be interpreted as determining how much
each $w_{jk}$ contributes to the error signal by weighting the error signal by the
magnitude of the output activation from the previous (hidden) layer associated with
each weight (see Figure 1). The gradients with respect to each parameter
are thus considered to be the contribution of the parameter to the error signal and
should be negated during learning. Thus the output weights are updated as

$$w_{jk} \leftarrow w_{jk} - \eta \, \delta_k \, a_j$$

where $\eta$ is the learning rate.
As we'll see shortly, the process of backpropagating the error signal can iterate all the
way back to the input layer by successively projecting $\delta_k$ back through $w_{jk}$,
then through the activation function for the hidden layer via $g_j'(z_j)$ to give the error
signal $\delta_j$, and so on.
As for the gradient with respect to the output layer biases, we follow the same
routine as above for $w_{jk}$. However, the third term in Equation (3) is $\frac{\partial z_k}{\partial b_k} = 1$, giving

$$\frac{\partial E}{\partial b_k} = \left( a_k - t_k \right) g_k'(z_k) = \delta_k \qquad \text{Equation (6)}$$
Thus the gradient for the biases is simply the back-propagated error from the output
units. One interpretation of this is that the biases are weights on activations that are
always equal to one, regardless of the feed-forward signal. Thus the bias gradients
aren't affected by the feed-forward signal, only by the error.
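Putting Equations (4)-(6) into code, here is how the output layer gradients might be computed in Python (our sketch, not the post's code; `g_prime` stands for the derivative of the output activation function $g_k'$):

```python
import numpy as np

def output_layer_gradients(a_j, z_k, a_k, t_k, g_prime):
    """Gradients for output weights w_jk and biases b_k."""
    delta_k = (a_k - t_k) * g_prime(z_k)  # error signal (Equation 5)
    dE_dw_jk = np.outer(a_j, delta_k)     # weight gradients (Equation 5)
    dE_db_k = delta_k                     # bias gradients (Equation 6)
    return dE_dw_jk, dE_db_k
```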
Now consider the gradients for the hidden layer weights $w_{ij}$. As with the output weights, we start from the error function and apply the Chain Rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_{k \in K} \left( a_k - t_k \right)^2 = \sum_{k \in K} \left( a_k - t_k \right) \frac{\partial a_k}{\partial w_{ij}}$$

Notice here that the sum does not disappear because, due to the fact that the layers
are fully connected, each of the hidden unit outputs affects the state of each output
unit. Continuing on, noting that $a_k = g_k(z_k)$:

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k \in K} \left( a_k - t_k \right) g_k'(z_k) \frac{\partial z_k}{\partial w_{ij}} \qquad \text{Equation (7)}$$
Here, again, we use the Chain Rule. Ok, now here's where things get slightly more
involved. Notice that the partial derivative in the third term in Equation (7) is with
respect to $w_{ij}$, but the target of the derivative, $z_k$, is a function of index $j$. How the heck
do we deal with that!? Well, if we expand $z_k$, we find that it is composed of other sub-functions:

$$z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j) \, w_{jk} = b_k + \sum_j g_j\!\left( b_j + \sum_i a_i w_{ij} \right) w_{jk} \qquad \text{Equation (8)}$$
From the last term in Equation (8) we see that $z_k$ is indirectly dependent
on $w_{ij}$. Equation (8) also suggests that we can use the Chain Rule to
calculate $\frac{\partial z_k}{\partial w_{ij}}$. This is probably the trickiest part of the derivation, and goes like so:

$$\frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial a_j} \frac{\partial a_j}{\partial z_j} \frac{\partial z_j}{\partial w_{ij}} = w_{jk} \, g_j'(z_j) \, a_i \qquad \text{Equation (9)}$$
Now, plugging Equation (9) into Equation (7) gives the following expression for the hidden layer weight gradients:

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k \in K} \left( a_k - t_k \right) g_k'(z_k) \, w_{jk} \, g_j'(z_j) \, a_i = \left( \sum_{k \in K} \delta_k w_{jk} \right) g_j'(z_j) \, a_i \qquad \text{Equation (10)}$$
Notice that the gradient for the hidden layer weights has a similar form to that of the
gradient for the output layer weights. Namely, the gradient is some term weighted by
the output activations from the layer below ($a_i$). For the output weight gradients, that term was the error signal passed through the derivative of the output
activation function, $\delta_k$; here, the corresponding term is
the output error signal backpropagated to the hidden layer, then weighted by the input
to the hidden layer. To make this idea more explicit, we can define the resulting error
signal backpropagated to layer $j$ as $\delta_j$, which includes all terms in Equation (10) that involve index $j$:

$$\delta_j = g_j'(z_j) \sum_{k \in K} \delta_k w_{jk} \qquad \text{Equation (11)}$$

giving the following expression for the hidden weight gradients:

$$\frac{\partial E}{\partial w_{ij}} = \delta_j \, a_i$$
This suggests that, in order to calculate the weight gradients at any layer $l$ in an
arbitrarily-deep network, we simply need to calculate the backpropagated error signal that reaches that layer, $\delta_l$, and weight it by the feed-forward signal $a_{l-1}$
feeding into that layer! Analogously, the gradient for the hidden layer weights can be
interpreted as a proxy for the contribution of the weights to the output error signal,
which can only be observed, from the point of view of the weights, by backpropagating
the error signal to the hidden layer.
Calculating the gradients for the hidden layer biases follows a very similar procedure to
that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to
calculate $\frac{\partial z_k}{\partial b_j}$. However, unlike Equation (9), the third term that results for the biases
is slightly different:

$$\frac{\partial z_k}{\partial b_j} = w_{jk} \, g_j'(z_j) \frac{\partial z_j}{\partial b_j} = w_{jk} \, g_j'(z_j) \left( 1 \right) \qquad \text{Equation (12)}$$

giving

$$\frac{\partial E}{\partial b_j} = g_j'(z_j) \sum_{k \in K} \delta_k w_{jk} = \delta_j$$
In a similar fashion to the calculation of the bias gradients for the output layer, the
gradients for the hidden layer biases are simply the backpropagated error signal
reaching that layer. This suggests that we can also calculate the bias gradients at any
layer $l$ in an arbitrarily-deep network by simply calculating the backpropagated
error signal reaching that layer, $\delta_l$.
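Analogously to the output layer snippet above, Equations (9)-(12) for the hidden layer might look like this in Python (again our sketch; `g_prime_hidden` is the hidden activation derivative $g_j'$):

```python
import numpy as np

def hidden_layer_gradients(a_i, z_j, delta_k, W_jk, g_prime_hidden):
    """Gradients for hidden weights w_ij and biases b_j."""
    # Backpropagated error signal at the hidden layer (Equation 11)
    delta_j = g_prime_hidden(z_j) * (W_jk @ delta_k)
    dE_dw_ij = np.outer(a_i, delta_j)   # weight gradients (Equation 11)
    dE_db_j = delta_j                   # bias gradients (Equation 12)
    return dE_dw_ij, dE_db_j
```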
Wrapping up
In this post we went over some of the formal details of the backpropagation learning
algorithm. The math covered in this post allows us to train arbitrarily deep neural
networks by re-applying the same basic computations. Those computations are:
1. Calculate the feed-forward activations from the input through each layer of the network
2. Calculate the error signal at the output by comparing the network output to the targets
3. Backpropagate the error signals by weighting them by the weights in previous layers and
the gradients of the associated activation functions
4. Calculate the gradients $\frac{\partial E}{\partial \theta}$ for the parameters based on the backpropagated
error signals and the feed-forward activations
5. Update the parameters using the calculated gradients, $\theta \leftarrow \theta - \eta \frac{\partial E}{\partial \theta}$
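Tying the five steps together, here is a minimal end-to-end training sketch in Python (our own illustration, not code from the post; it uses sigmoid hidden units, linear outputs, and the XOR problem mentioned earlier in this document):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: the classic problem a network with no hidden layer cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)  # hidden -> output
eta = 0.5

for step in range(5000):
    A1 = sigmoid(X @ W1 + b1)                   # 1. feed-forward pass
    A2 = A1 @ W2 + b2                           #    (linear output units)
    delta_k = A2 - T                            # 2. output error (g_k' = 1)
    delta_j = (delta_k @ W2.T) * A1 * (1 - A1)  # 3. backpropagate error
    # 4./5. gradients from error signals and activations, then update
    W2 -= eta * A1.T @ delta_k / len(X); b2 -= eta * delta_k.mean(0)
    W1 -= eta * X.T @ delta_j / len(X); b1 -= eta * delta_j.mean(0)

print(A2.round(2).ravel())                      # approaches [0, 1, 1, 0]
```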
The only real constraints on model construction are ensuring that the error
function $E$ and the activation functions $g$ are differentiable. For more
details on implementing ANNs and seeing them at work, stay tuned for the next post.
https://theclevermachine.wordpress.com/2014/09/06/derivation-error-backpropagation-gradient-descent-for-neural-networks/