
Estimating the Mean and Variance of the Target Probability Distribution

David A. Nix and Andreas S. Weigend
Department of Computer Science and Institute of Cognitive Science
University of Colorado at Boulder
Boulder, CO 80309-0430, USA
[email protected]
Abstract: We introduce a method that estimates the mean and the variance of the probability distribution of the target as a function of the input, given an assumed target error-distribution model. Through the activation of an auxiliary output unit, this method provides a measure of the uncertainty of the usual network output for each input pattern. We here derive the cost function and weight-update equations for the example of a Gaussian target error distribution, and we demonstrate the feasibility of the network on a synthetic problem where the true input-dependent noise level is known.

I. INTRODUCTION

Feed-forward artificial neural networks are widely used and well-suited for function-approximation (regression) tasks, particularly when there exists a sufficiently large data set from which to train. In almost any real-world problem, paired input-target data contains noise, one form of which is observational noise that corrupts the target values (e.g., [1]). Thus, when we attempt to approximate the function f(x), we assume our measured data d(x) can be modeled by

    d(x) = f(x) + n(x)        (1)

where the additive noise n(x) can be viewed as errors on the target values that move the targets away from their true values f(x) to their observed values d(x). In a function-approximation task, the output of a network given a particular input pattern, y(x), can be interpreted as an estimate μ̂(x) of the true mean μ(x) of this noisy target distribution around f(x), given an appropriate error model for n(x) [2] [3] [4].

While an estimate of the mean of the target distribution (the expectation value) for a given input pattern is indeed valuable information, we sometimes want to know more. In addition to the network predicting d(x) by estimating the mean of the target distribution (y(x) = μ̂(x)), we would also like the network to quantify the uncertainty of its prediction by simultaneously estimating the degree of noise about μ̂(x) based on the noise observed in the training data (see [2] [5]).

Just as the network output y varies with the input pattern x, the quantitative uncertainty due to n(x), defined as the true variance σ² of the target error distribution around f(x), is also some function of x. This function, σ²(x), may be constant (i.e., independent of the input) if the target noise level is uniform over the range of input values. Alternatively, in the case we want to consider here, the level of noise may vary systematically over the input space. In either case, not only do we want the network to learn an output function y(x) ≈ f(x) that estimates the true mean μ(x) of the corresponding target distribution, but we also want to simultaneously learn a function s²(x) that estimates the true variance σ²(x) of that distribution, given an appropriate assumption as to the distribution's form.

Based on a maximum-likelihood formulation of a feed-forward neural network for function approximation [2] [3] [4], we here introduce a network that calculates s²(x) ≈ σ²(x), the estimated variance of the target error distribution as a function of the input, in addition to the usual output y(x) = μ̂(x) ≈ μ(x) = f(x). We derive the method in full for the case where we assume the targets are normally distributed about f(x), and we apply this derivation to a synthetic example problem where σ²(x) is known.

II. THE METHOD

A. The Idea

How does the network estimate s²(x)? To the output unit y that computes μ̂(x_i), we add a complementary "s² unit" that computes s²(x_i), the estimate of σ²(x_i) given input pattern x_i.

Since σ²(x) can never be negative or zero, we choose an exponential activation function for s²(x) to naturally impose these bounds:

    s²(x_i) = exp( Σ_k w_{s²k} h_k^{s²}(x_i) + β )        (2)

where β is the bias for the s² unit and h_k^{s²}(x_i) is the activation of hidden unit k, for input x_i, in the hidden layer feeding directly into the s² unit.

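As a concrete illustration of Eq. (2), the following minimal NumPy sketch (ours, not the authors' code; parameter names and layer shapes are placeholders) computes both network outputs for a single input pattern, assuming the split-hidden-unit architecture described below in Section II.B: a linear y unit fed by one set of tanh hidden units, and an s² unit whose exponential activation guarantees a strictly positive variance estimate.

    import numpy as np

    def forward(x, params):
        """One forward pass of the split-hidden-unit network (cf. Eq. (2))."""
        x = np.atleast_1d(x)                       # allow scalar inputs
        # Hidden layer feeding the output unit y (tanh activations).
        h_y = np.tanh(params["W_jm"] @ x + params["b_j"])
        y = params["w_yj"] @ h_y + params["b_y"]   # linear output unit
        # Separate hidden layer feeding the s^2 unit (tanh activations).
        h_s = np.tanh(params["W_km"] @ x + params["b_k"])
        # Exponential activation keeps the variance estimate positive (Eq. (2)).
        s2 = np.exp(params["w_s2k"] @ h_s + params["beta"])
        return y, s2, h_y, h_s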
Having selected a particular network architecture (see Figure 1; inputs x are indexed by m, hidden units by j and k), we employ the same gradient-descent (backpropagation) learning as in the usual case to find w_jm and w_yj and, in addition, a set of weights w_km and w_{s²k} that calculate s²(x). Thus, after each pattern i is presented, all the weights in the network are adapted to minimize some cost function C according to

    Δw_yj = -η ∂C_i/∂w_yj        (3)
    Δw_jm = -η ∂C_i/∂w_jm        (4)
    Δw_{s²k} = -η ∂C_i/∂w_{s²k}        (5)
    Δw_km = -η ∂C_i/∂w_km        (6)

where η is the learning rate and C_i is the contribution of pattern i to the overall cost function C. (The biases of all units are treated as an additional weight connected to a unit clamped at 1 and are updated accordingly.)

[Figure 1: Architecture for a network with one output unit y and one s² unit. Both sets of hidden units are connected to the same input units, but no connections are shared (not all connections are shown).]

We obtain a form for C by expressing our goal as maximizing the log likelihood of the targets (having assumed our patterns are independently and identically distributed), given the input patterns and the network N (e.g., [2] [4] [5]). That is, we attempt to maximize

    Σ_i ln P(d(x_i) | x_i, N)        (7)

The exact form of C depends on the assumption we make as to the form of this target probability distribution such that y(x) = μ̂(x).

B. Details

B.1. Architecture

The s² unit is fully connected to its own set of hidden units, h^{s²} (indexed by k), just as the output unit y is connected to its hidden units h^y (indexed by j) (see Figure 1). Alternatively, we could connect both y and s² to a common large set of hidden units, but our experience has been that the method works better when we use a split-hidden-unit architecture. In addition, we could easily add a second hidden layer (split or shared) as required by the particular form of f(x) and/or σ²(x), but we will restrict our attention here to the case of a single hidden layer.

B.2. Learning Dynamics

All initial weights are drawn from a uniform random distribution on [-1, 1] and scaled by the reciprocal of the number of incoming connections. These initial weights produce a net input to the s² unit's exponential activation function (Eq. (2)) of approximately zero, corresponding to an initial estimate of approximately unit variance over the entire input space. Because it would be premature to make differing estimates of the noise level over the input space before f(x) is at least roughly approximated by y(x), we save computations by not updating the weights w_km and w_{s²k} until y(x) is somewhat close to f(x).

Additionally, if the variance is either much larger or much smaller than unity, the bias β in Eq. (2) will have to grow either very positive or very negative for s²(x) to approximate σ²(x) well. Since β has a natural interpretation as the natural log of the mean value of σ²(x), we can accelerate the learning of s²(x) by setting β equal to the natural log of the current mean variance over the training data at the end of each epoch (for early training epochs). The precise form of the variance depends on the assumed form of the target error distribution model (see the example below).

However, once training has proceeded to the point where the approximation of f(x) is improving only very slowly, we assume that y(x) is a rough approximation of f(x) that additional training will fine-tune. At this point, and for subsequent training, we update the weights that calculate s²(x) according to Eqs. (13) and (14). Furthermore, we no longer set the s² unit's bias β to the natural log of the current mean variance over the training data; instead we update β by gradient descent just like all the other weights and biases. Training is then continued until C does not decrease significantly further.

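A minimal sketch of the initialization rule just described (ours; it reuses the parameter names of the forward-pass sketch above and the layer sizes of the synthetic example in Section III, and the biases are simply zeroed here for brevity): weights are drawn uniformly from [-1, 1] and scaled by the reciprocal of the fan-in, which keeps the initial net input to the s² unit near zero and hence the initial variance estimate near one.

    import numpy as np

    def init_weights(n_out, n_in, rng):
        """Uniform in [-1, 1], scaled by 1/fan-in (Section II.B.2)."""
        return rng.uniform(-1.0, 1.0, size=(n_out, n_in)) / n_in

    rng = np.random.default_rng(0)
    params = {
        "W_jm":  init_weights(30, 1, rng),   # hidden layer for y (30 tanh units)
        "w_yj":  init_weights(1, 30, rng)[0],
        "W_km":  init_weights(10, 1, rng),   # hidden layer for s^2 (10 tanh units)
        "w_s2k": init_weights(1, 10, rng)[0],
        "b_j": np.zeros(30), "b_k": np.zeros(10),
        "b_y": 0.0, "beta": 0.0,
    }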

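The two-phase schedule of Section B.2 might be organized as follows. This is a sketch under our own assumptions: backprop_y_only and backprop_all are hypothetical helpers implementing Eqs. (11)-(12) and Eqs. (11)-(14) respectively, data is an iterable of (input, target) pairs, and the switch from phase one to phase two is simplified to a fixed threshold on the per-epoch drop in training NMSE.

    import numpy as np

    def train(data, params, eta=1e-4, switch_tol=1e-3, max_epochs=3000):
        """Phase 1: adapt only the y path; set beta from the residuals each epoch.
        Phase 2: adapt every weight and bias, including beta, by gradient descent."""
        phase_two = False
        prev_nmse = np.inf
        target_var = np.var([d for _, d in data])
        for epoch in range(max_epochs):
            for x, d in data:                           # pattern-by-pattern updates
                if phase_two:
                    backprop_all(x, d, params, eta)     # Eqs. (11)-(14), including beta
                else:
                    backprop_y_only(x, d, params, eta)  # Eqs. (11)-(12) only
            residuals = np.array([d - forward(x, params)[0] for x, d in data])
            nmse = np.mean(residuals ** 2) / target_var
            if not phase_two:
                # Bias trick: beta <- ln(mean squared residual) at the end of each epoch.
                params["beta"] = np.log(np.mean(residuals ** 2))
                if prev_nmse - nmse < switch_tol:       # NMSE no longer dropping sharply
                    phase_two = True
            prev_nmse = nmse
        return params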
III. A SPECIFIC EXAMPLE

A. Normally Distributed Errors

Least-squares regression techniques can be interpreted as maximum likelihood with an underlying Gaussian error model. In this simple case of assuming normally distributed errors around f(x), we have

    P(d_i | x_i, N) = (1 / sqrt(2π σ²(x_i))) exp( -[d_i - y(x_i)]² / (2σ²(x_i)) )        (8)

as the target probability distribution for input pattern x_i, where, as before, y(x_i) corresponds to the mean of this distribution and σ²(x_i) is the variance. If we take the natural log of both sides, we get

    ln P(d_i | x_i, N) = -(1/2) ln(2π) - [d_i - y(x_i)]² / (2σ²(x_i)) - (1/2) ln σ²(x_i)        (9)

as the log likelihood to be maximized. The first term on the right is a constant and can be ignored for maximization. Since maximizing a value is the same as minimizing the negative of that value, we write what remains of the right-hand side of Eq. (9) as a cost function C to be minimized over all patterns i:

    C = Σ_i C_i = Σ_i { [d_i - y(x_i)]² / (2σ²(x_i)) + (1/2) ln σ²(x_i) }        (10)

Using Eq. (10) for C, we obtain our weight-update equations by first specifying a linear activation function for y and tanh activation functions for the hidden units. (We could select any appropriate hidden-unit activation function for either set of hidden units; for simplicity we choose all tanh activation functions.) Then we approximate σ²(x_i) by s²(x_i), the activation of the s² unit, and take the derivatives in Eqs. (3)-(6) for pattern i:

    Δw_yj = η [d_i - y(x_i)] / s²(x_i) × h_j^y(x_i)        (11)
    Δw_jm = η [d_i - y(x_i)] / s²(x_i) × w_yj [1 - h_j^y(x_i)²] x_{m,i}        (12)
    Δw_{s²k} = η ( [d_i - y(x_i)]² - s²(x_i) ) / (2 s²(x_i)) × h_k^{s²}(x_i)        (13)
    Δw_km = η ( [d_i - y(x_i)]² - s²(x_i) ) / (2 s²(x_i)) × w_{s²k} [1 - h_k^{s²}(x_i)²] x_{m,i}        (14)

Despite one of the tasks of the Santa Fe Time Series Prediction and Analysis Competition having been to generate error bars in addition to the predicted values themselves, none of the entries contained principled uncertainty estimates [7]. Ignoring the possibility of a variable σ²(x) is equivalent to assuming σ²(x_i) to be a constant independent of x_i. With this assumption, the second term in Eq. (10) is a constant that can be ignored for minimization, and the 1/(2σ²(x_i)) factor in Eq. (10) is a constant that is incorporated into the learning rate in Eqs. (11) and (12). This assumption results in the standard equations for backpropagation using a sum-squared-error cost function. However, since we are specifically allowing for a variable σ²(x_i), we explicitly keep these terms in the cost function.

B. A Synthetic Example Problem

B.1. The Problem

To demonstrate the application of Eqs. (10)-(14), we construct a one-dimensional example problem where the true f(x) and σ²(x) are known. We consider an amplitude-modulation equation of the form

    f(x) = m(x) sin(w_c x)        (15)

where m(x) = sin(w_m x). For this simple example we choose w_c = 5 and w_m = 4 over the interval x ∈ [0, π/2].

We generate our target values according to Eq. (1), where n(x) is zero-mean Gaussian noise with variance σ²(x) that changes according to

    σ²(x) = 0.02 + 0.02 [1 - m(x)]².        (16)

We generate 5000 patterns and randomly assign approximately 25% of these to a cross-validation set to guard against overfitting [8]. We perform gradient-descent learning on the remainder of the patterns, updating the weights according to Eqs. (11)-(14) after the presentation of each pattern. We connect thirty tanh hidden units to y and ten tanh hidden units to s². To avoid artifacts, we use a conservative learning rate of η = 10⁻⁴ and do not use momentum.

As described above, initially only the weights and biases that calculate y(x) are updated after each pattern presentation, and β in Eq. (2) is set at the end of each epoch to the natural log of the mean-squared error over the entire training set. After the normalized mean-squared error stops decreasing sharply, all parameters in the network are adapted by Eqs. (11)-(14) after each pattern presentation, including β.

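To make the Gaussian case concrete, the following sketch (ours, not the authors' code) evaluates the per-pattern cost of Eq. (10) with σ²(x_i) replaced by s²(x_i), together with the gradients behind the output-side updates of Eqs. (11) and (13); the input-side updates of Eqs. (12) and (14) follow by the chain rule through the tanh hidden units. It reuses the hypothetical forward helper sketched in Section II.

    import numpy as np

    def cost_and_grads(x, d, params):
        """Per-pattern Gaussian negative log likelihood (Eq. (10)) and the
        gradients that give the output-side updates (11) and (13)."""
        y, s2, h_y, h_s = forward(x, params)
        resid = d - y
        cost = 0.5 * resid ** 2 / s2 + 0.5 * np.log(s2)       # C_i of Eq. (10)
        # dC_i/dy = -resid/s2, so Delta w_yj = eta * (resid/s2) * h_y    (Eq. (11))
        grad_w_yj = -(resid / s2) * h_y
        # dC_i/ds2 = (s2 - resid^2)/(2 s2^2) and ds2/dw_s2k = s2 * h_s, so
        # Delta w_s2k = eta * (resid^2 - s2)/(2 s2) * h_s                (Eq. (13))
        grad_w_s2k = ((s2 - resid ** 2) / (2.0 * s2)) * h_s
        return cost, grad_w_yj, grad_w_s2k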

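The synthetic problem of Eqs. (15)-(16) takes only a few lines to generate. This sketch is our own illustration; the paper does not say how the inputs were sampled, so we draw them uniformly on [0, π/2], and the roughly 25% cross-validation split is done by random assignment.

    import numpy as np

    def make_data(n=5000, w_c=5.0, w_m=4.0, seed=0):
        """Targets d(x) = f(x) + n(x) with input-dependent Gaussian noise."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(0.0, np.pi / 2.0, size=n)
        m = np.sin(w_m * x)
        f = m * np.sin(w_c * x)                    # Eq. (15)
        var = 0.02 + 0.02 * (1.0 - m) ** 2         # Eq. (16)
        d = f + rng.normal(0.0, np.sqrt(var))      # Eq. (1)
        cv = rng.random(n) < 0.25                  # ~25% held out for cross-validation
        return (x[~cv], d[~cv]), (x[cv], d[cv])

    # Zip the returned arrays into (input, target) pairs for the training sketch above:
    train_pairs = list(zip(*make_data()[0]))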
B.2. Results

The learning curve for the normalized mean-squared error (NMSE) is plotted in Figure 2. The curve has two primary descent phases, corresponding to learning each of the two significant oscillations in f(x), x ∈ [0, π/2]. The curve starts to level out at about epoch 1500, at which point y(x) approximates the gross features of f(x). After epoch 1500, Eqs. (11)-(14) are used to update all parameters in the network.

[Figure 2: Learning curve for the normalized mean-squared error (NMSE_D = MSE_D / σ²_D, where D is either the training or the cross-validation data).]

While the NMSE continues to decrease slightly as the approximation y(x) ≈ f(x) is fine-tuned, we see in Figure 3 that the normalized cost (NC) continues to decrease steadily as s²(x) learns to approximate σ²(x). Training is continued until epoch 3000. Note that no overfitting with respect to either the NMSE or the NC is observed on the cross-validation set.

[Figure 3: Learning curve for the normalized cost (NC_D = C_D / σ²_D, where D is either the training or the cross-validation data).]

In Figure 4 we plot the training data, the true function f(x), and the output of the network, y(x), over the interval x ∈ [0, π/2]. We see that the network's approximation y(x) closely matches the shape of f(x). Note that the approximation is slightly better for smaller x than for larger x; this is due in part to the robust regression effect described below.

[Figure 4: Training data, true function f(x), and estimate y(x) (epoch 3000).]

The true variance σ²(x), given by Eq. (16), is plotted in Figure 5 along with the network's estimate s²(x) over the interval x ∈ [0, π/2]. We see that s²(x) closely follows the shape of σ²(x). The slight differences are due to a combination of two factors. First, the approximation of f(x) is not perfect, so some error is introduced into the approximation of σ²(x). Second, with a finite sample size the target noise will not have exactly the true σ²(x). These slight differences aside, however, for a given input x we have accurate estimates y and s² of both the mean and the variance of the target probability distribution.

[Figure 5: True variance σ²(x) and estimate s²(x) (epoch 3000).]


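The normalized measures plotted in Figures 2 and 3 are straightforward to compute. In this sketch (ours), we read NMSE_D and NC_D as the mean squared error and the mean per-pattern cost of Eq. (10) on data set D, each divided by the variance of the targets in D; averaging rather than summing the cost is our assumption. The hypothetical forward helper from Section II is reused.

    import numpy as np

    def nmse_and_nc(data, params):
        """NMSE_D = MSE_D / var(d) and NC_D = C_D / var(d) for a data set D."""
        preds = np.array([forward(x, params)[:2] for x, _ in data])  # columns: y, s^2
        y, s2 = preds[:, 0], preds[:, 1]
        d = np.array([t for _, t in data])
        mse = np.mean((d - y) ** 2)
        cost = np.mean(0.5 * (d - y) ** 2 / s2 + 0.5 * np.log(s2))   # mean C_i, Eq. (10)
        return mse / np.var(d), cost / np.var(d)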
IV. DISCUSSION

A. Robust Regression

Naively, one might expect that allowing for variations in σ²(x) (with the addition of the s² unit and the resulting modifications of the standard backpropagation weight-update equations) does not alter the way the network approximates f(x). However, this is not the case. According to Eqs. (11) and (12), as long as σ²(x) is constant over all x_i, the effective learning rate is constant over all patterns (η/σ²). However, for input patterns where σ²(x) is smaller than average, the learning rate η is effectively amplified compared to patterns for which σ²(x) is larger than average. Thus, this particular estimation of σ²(x) has the side-effect of biasing the network's allocation of its resources towards lower-noise regions, discounting regions of the input space where the network is producing larger-than-average errors. Through this side-effect, this procedure implements a form of robust regression, emphasizing low-noise regions of the input space in the allocation of the network's remaining resources.

B. Overfitting

In the example problem above, a sufficiently large data set was used such that overfitting of y(x) to the training data was not observed. However, in most applications data sets are more limited, and overfitting can present a serious problem. In our method, an accurate approximation of σ²(x) depends on the quality of y(x) as an approximation of f(x) without overfitting. To see why this is so, consider the extreme case where a network overfits the training data such that the error is zero on every training pattern. Then the estimated variance would be zero even though in reality there may be considerable target noise about the true f(x).

In addition, we must also be concerned with overfitting s²(x) to the training data. For example, take a situation in which the true variance is constant over the input space, yet in one small region we have four one-dimensional input patterns arranged such that the outer two patterns have small errors from f(x) and the inner two have large errors. We do not want s²(x) to estimate a sudden increase in the variance in the region of the inner two patterns. Therefore, in applying our technique to relatively short data sets, we must use the same anti-overfitting weaponry as is required when attempting function approximation with any sparse data set (e.g., adding complexity penalty terms to Eq. (10) [2] [6]).

V. CONCLUSIONS

We have introduced a method to estimate the uncertainty of the output of a network that tries to approximate a function. This is accomplished by learning a second function s²(x) that estimates σ²(x), the variance of the target probability distribution around f(x) as a function of the input x. This function provides a quantitative estimate of the target noise level depending on the location in input space and, therefore, provides a measure of the uncertainty of y(x).

We have derived the specific weight-update equations for the case of Gaussian noise on the outputs, i.e., we have shown how to estimate the second moment (variance) of the target distribution in addition to the usual estimation of the first moment (mean). The extension of this technique to other error models is straightforward (e.g., a Poisson model could be used when the errors are suspected to be Poisson distributed).

For very sparse data sets, we may only be able to reasonably estimate the first moment of the target distribution. Estimating both the first and second moments is a reasonable goal if we are dealing with a moderately sized data set. For extremely large data sets, however, one can be more ambitious and aim for estimating the entire probability density function using connectionist methods [9] or hidden Markov models with mixed states [10].

We will apply our method to the real-world Data Set A (from a laser) from the Santa Fe Time Series Analysis and Prediction Competition [5] [7].

ACKNOWLEDGMENTS

We would like to thank David Rumelhart and Barak Pearlmutter for discussing the problem and the approach. We would also like to thank Wray Buntine for emphasizing the potential problem of overfitting of the variance function. This work was supported by a Graduate Fellowship from the Office of Naval Research and by NSF grant RIA ECS-9309786.

REFERENCES

[1] M. Casdagli, S. Eubank, J.D. Farmer, and J. Gibson, "State Space Reconstruction in the Presence of Noise." Physica D, vol. 51, pp. 52-98, 1991.
[2] W.L. Buntine and A.S. Weigend, "Bayesian Backpropagation." Complex Systems, vol. 5, pp. 603-643, 1991.

[3] D. MacKay, "A Practical Bayesian Framework for Backpropagation Networks." Neural Computation, vol. 4, no. 3, pp. 448-472, 1992.
[4] D.E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin, "Backpropagation: The Basic Theory." In Backpropagation: Theory, Architectures and Applications, Y. Chauvin and D.E. Rumelhart, eds., Lawrence Erlbaum, 1994.
[5] N.A. Gershenfeld and A.S. Weigend, "The Future of Time Series." In Time Series Prediction: Forecasting the Future and Understanding the Past, A.S. Weigend and N.A. Gershenfeld, eds., Addison-Wesley, pp. 1-70, 1994.
[6] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart, "Predicting Sunspots and Exchange Rates with Connectionist Networks." In Nonlinear Modeling and Forecasting, M. Casdagli and S. Eubank, eds., Addison-Wesley, pp. 395-432, 1992.
[7] A.S. Weigend and N.A. Gershenfeld, eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Santa Fe Institute Studies in the Sciences of Complexity, Proc. Vol. XV, Addison-Wesley, 1994.
[8] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart, "Predicting the Future: A Connectionist Approach." International Journal of Neural Systems, vol. 1, pp. 193-209, 1990.
[9] A.S. Weigend, "Predicting Predictability." Preprint, Department of Computer Science, University of Colorado at Boulder, in preparation, 1994.
[10] A.M. Fraser and A. Dimitriadis, "Forecasting Probability Densities Using Hidden Markov Models with Mixed States." In Time Series Prediction: Forecasting the Future and Understanding the Past, A.S. Weigend and N.A. Gershenfeld, eds., Addison-Wesley, pp. 265-282, 1994.

