Estimating The Mean and Variance of The Target Probability Distribution
Thus, when we attempt to approximate the function $f(\vec{x})$, we assume our measured data $d(\vec{x})$ can be modeled by

$$d(\vec{x}) = f(\vec{x}) + n(\vec{x}) \qquad (1)$$

where the additive noise $n(\vec{x})$ can be viewed as errors on the target values that move the targets away from their true values $f(\vec{x})$ to their observed values $d(\vec{x})$. In a function-approximation task, the output of a network given a particular input pattern, $y(\vec{x})$, can be interpreted as an estimate $\hat{\mu}(\vec{x})$ of the true mean $\mu(\vec{x})$ of this noisy target distribution around $f(\vec{x})$, given an appropriate error model for $n(\vec{x})$ [2] [3] [4].

While an estimate of the mean of the target distribution (the expectation value) for a given input pattern is indeed valuable information, we sometimes want to know more. In addition to the network's predicting $d(\vec{x})$ by estimating the mean of the target distribution ($y(\vec{x}) = \hat{\mu}(\vec{x})$), we would also like the network to quantify the uncertainty of its predictions: we want it to estimate $\sigma^2(\vec{x})$, the variance of the target error distribution as a function of the input, in addition to the usual output $y(\vec{x}) = \hat{\mu}(\vec{x})$. We derive the method in full for the case where we assume the targets are normally distributed about $f(\vec{x})$, and we apply this derivation to a synthetic example problem where $\sigma^2(\vec{x})$ is known.

II. THE METHOD

A. The Idea

How does the network estimate $\sigma^2(\vec{x})$? To the output unit $y$ that computes $\hat{\mu}(\vec{x}_i)$, we add a complementary "$s^2$ unit" that computes $s^2(\vec{x}_i)$, the estimate of $\sigma^2(\vec{x}_i)$ given input pattern $\vec{x}_i$.

Since $\sigma^2(\vec{x})$ can never be negative or zero, we choose an exponential activation function for $s^2(\vec{x})$ to naturally impose these bounds:

$$s^2(\vec{x}_i) = \exp\!\left[\sum_k w_{s^2 k}\, h_k^{s^2}(\vec{x}_i) + \beta\right] \qquad (2)$$

where $\beta$ is the bias for the $s^2$ unit and $h_k^{s^2}(\vec{x}_i)$ is the activation of hidden unit $k$ for input $\vec{x}_i$ in the hidden layer feeding directly into the $s^2$ unit.
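The forward pass of such a network is easy to prototype. The following is a minimal NumPy sketch, not the authors' implementation: it assumes one hidden layer of tanh units for each of the two branches (separate hidden layers, as in the architecture described below), with the exponential activation of Eq. (2) on the variance side; the layer size and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 1 input, 10 hidden units per branch (assumed, not from the paper).
n_in, n_hid = 1, 10

# Branch that estimates the mean: input -> hidden -> linear output y.
W_jm = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights w_jm
w_yj = rng.normal(scale=0.1, size=n_hid)           # hidden-to-y weights w_yj

# Branch that estimates the variance: separate hidden layer -> s^2 unit.
W_km  = rng.normal(scale=0.1, size=(n_hid, n_in))  # input-to-hidden weights w_km
w_s2k = rng.normal(scale=0.1, size=n_hid)          # hidden-to-s^2 weights w_{s^2 k}
beta  = 0.0                                        # bias of the s^2 unit

def forward(x):
    """Return (y, s2) for a single input pattern x of shape (n_in,)."""
    h_y  = np.tanh(W_jm @ x)            # hidden activations feeding y
    y    = w_yj @ h_y                   # estimate of the mean of the target distribution
    h_s2 = np.tanh(W_km @ x)            # hidden activations feeding the s^2 unit
    s2   = np.exp(w_s2k @ h_s2 + beta)  # Eq. (2): the exponential keeps s^2 > 0
    return y, s2

y_hat, s2_hat = forward(np.array([0.3]))
```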
Having selected a particular network architecture (see Figure 1; input components $x_m$ are indexed by $m$, hidden units by $j$ and $k$), we employ the same gradient-descent (backpropagation) learning as in the usual case to find $w_{jm}$ and $w_{yj}$, in order to also find a set of weights $w_{km}$ and $w_{s^2 k}$ that calculate $s^2(\vec{x})$. Thus, after each pattern $i$ is presented, all the weights in the network are adapted to minimize some cost function $C$ according to

$$\Delta w_{yj} = -\eta\,\frac{\partial C}{\partial w_{yj}} \qquad (3)$$

$$\Delta w_{s^2 k} = -\eta\,\frac{\partial C}{\partial w_{s^2 k}} \qquad (4)$$

Figure 1: Architecture for a network with one output unit $y$ and one $s^2$ unit. Both sets of hidden units are connected to the same input units, but no connections are shared (not all connections are shown).
III. A SPECIFIC EXAMPLE

A. Normally Distributed Errors

Least-squares regression techniques can be interpreted as maximum likelihood with an underlying Gaussian error model. In this simple case of assuming normally distributed errors around $f(\vec{x})$, we have

$$P(d_i \mid \vec{x}_i, \mathcal{N}) = \frac{1}{\sqrt{2\pi\,\sigma^2(\vec{x}_i)}}\,\exp\!\left[-\frac{\bigl(d_i - y(\vec{x}_i)\bigr)^2}{2\,\sigma^2(\vec{x}_i)}\right] \qquad (8)$$

as the target probability distribution for input pattern $\vec{x}_i$, where, as before, $y(\vec{x}_i)$ corresponds to the mean of this distribution and $\sigma^2(\vec{x}_i)$ is the variance. If we take the natural log of both sides, we get

$$\ln P(d_i \mid \vec{x}_i, \mathcal{N}) = -\tfrac{1}{2}\ln(2\pi) - \tfrac{1}{2}\ln \sigma^2(\vec{x}_i) - \frac{\bigl(d_i - y(\vec{x}_i)\bigr)^2}{2\,\sigma^2(\vec{x}_i)} \qquad (9)$$

as the log likelihood to be maximized. The first term on the right is a constant and can be ignored for maximization. Since maximizing a value is the same as minimizing the negative of that value, we write what remains of the right-hand side of Eq. (9) as a cost function $C$ to be minimized over all patterns $i$:

$$C = \sum_i \left[\frac{\bigl(d_i - y(\vec{x}_i)\bigr)^2}{2\,\sigma^2(\vec{x}_i)} + \tfrac{1}{2}\ln \sigma^2(\vec{x}_i)\right] \qquad (10)$$

One of the tasks in the Santa Fe Time Series Prediction Competition has been to generate error bars in addition to the predicted values themselves, yet none of the entries contained principled uncertainty estimates [7]. Ignoring the possibility of a variable $\sigma^2(\vec{x})$ is equivalent to assuming $\sigma^2(\vec{x}_i)$ to be a constant independent of $\vec{x}_i$. With this assumption, the second term in Eq. (10) is a constant that can be ignored for minimization, and the $1/2\sigma^2(\vec{x}_i)$ term in Eq. (10) is a constant that is incorporated into the learning rate in Eqs. (11) and (12). This assumption results in the standard equations for backpropagation using a sum-squared-error cost function. However, since we are specifically allowing for a variable $\sigma^2(\vec{x}_i)$, we explicitly keep these terms in the cost function.
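To make the minimization concrete, here is a small NumPy sketch of one online gradient step on the per-pattern cost of Eq. (10), reusing the weights and forward pass from the sketch above. The gradients follow from differentiating Eq. (10) through Eq. (2); the learning rate and all names are illustrative, and this is a restatement rather than the paper's Eqs. (11)-(14) verbatim.

```python
def train_step(x, d, eta=0.01):
    """One online gradient-descent step on C_i from Eq. (10) for pattern (x, d)."""
    global W_jm, w_yj, W_km, w_s2k, beta
    # Forward pass (same quantities as in forward() above).
    h_y  = np.tanh(W_jm @ x)
    y    = w_yj @ h_y
    h_s2 = np.tanh(W_km @ x)
    z    = w_s2k @ h_s2 + beta
    s2   = np.exp(z)

    # Per-pattern cost: C_i = (d - y)^2 / (2 s^2) + 0.5 * ln s^2.
    dC_dy = -(d - y) / s2                    # error term weighted by 1 / s^2
    dC_dz = 0.5 * (1.0 - (d - y) ** 2 / s2)  # chain rule through s^2 = exp(z)

    # Gradients w.r.t. the weights, computed before any update is applied.
    g_w_yj  = dC_dy * h_y
    g_W_jm  = dC_dy * np.outer(w_yj * (1.0 - h_y ** 2), x)
    g_w_s2k = dC_dz * h_s2
    g_beta  = dC_dz
    g_W_km  = dC_dz * np.outer(w_s2k * (1.0 - h_s2 ** 2), x)

    # Generic gradient-descent updates in the spirit of Eqs. (3)-(4).
    w_yj  -= eta * g_w_yj
    W_jm  -= eta * g_W_jm
    w_s2k -= eta * g_w_s2k
    beta  -= eta * g_beta
    W_km  -= eta * g_W_km

    return (d - y) ** 2 / (2.0 * s2) + 0.5 * np.log(s2)
```

Note how the squared-error term is weighted by $1/s^2(\vec{x}_i)$; this weighting is the source of the robust-regression behavior discussed in Section IV.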
B. A Synthetic Example Problem

B.1. The Problem

To demonstrate the application of Eqs. (10)-(14), we construct a one-dimensional example problem where the true $f(x)$ and $\sigma^2(x)$ are known. We consider an amplitude-modulation equation of the form

$$f(x) = m(x)\,\sin(\omega_c x) \qquad (15)$$

where $m(x) = \sin(\omega_m x)$. For this simple example we choose $\omega_c = 5$ and $\omega_m = 4$ over the interval $x \in [0, \pi/2]$. We generate our target values according to Eq. (1), where $n(x)$ is zero-mean Gaussian noise whose variance $\sigma^2(x)$ itself varies with $x$.
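A data set of this kind is easy to generate. The sketch below follows Eqs. (1) and (15) with $\omega_c = 5$ and $\omega_m = 4$; since the paper's specific variance profile $\sigma^2(x)$ is not reproduced here, the `sigma2` function is an arbitrary smooth, strictly positive stand-in chosen purely for illustration.

```python
import numpy as np

def f_true(x, w_c=5.0, w_m=4.0):
    """Eq. (15): amplitude-modulated sine, f(x) = sin(w_m x) * sin(w_c x)."""
    return np.sin(w_m * x) * np.sin(w_c * x)

def sigma2(x):
    """Stand-in input-dependent noise variance (NOT the paper's profile)."""
    return 0.02 + 0.1 * np.sin(2.0 * x) ** 2

def make_dataset(n=500, seed=0):
    """Targets d(x) = f(x) + n(x), with n(x) ~ N(0, sigma2(x)), per Eq. (1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, np.pi / 2.0, size=n)
    d = f_true(x) + rng.normal(0.0, np.sqrt(sigma2(x)), size=n)
    return x, d

x_train, d_train = make_dataset()
```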
Figure 2: Learning curve for normalized mean-squared error ($\mathrm{NMSE}_V = \mathrm{MSE}_V / \sigma^2_V$, where $V$ is either the training or the cross-validation data).

Figure 4: Training data, true function $f(x)$, and estimate $y(x)$ (epoch 3000).
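For reference, the normalized error plotted in Figure 2 can be computed as in this small helper (a sketch, taking $\sigma^2_V$ to be the variance of the targets in the respective data set, as is standard for this normalization):

```python
import numpy as np

def nmse(d, y):
    """NMSE_V = MSE_V / var(d): a value of 1.0 means 'predict the set mean'."""
    return np.mean((d - y) ** 2) / np.var(d)
```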
… of $\sigma^2(x)$. Second, with a finite sample size the target noise will not have exactly the true $\sigma^2(x)$. With these slight differences aside, however, for a given input $x$ we have accurate estimates $y$ and $s^2$ of both the mean and the variance of the target probability distribution.

IV. DISCUSSION

A. Robust Regression

Naively, one might expect that allowing for variations in $\sigma^2(\vec{x})$ (with the addition of the $s^2$ unit and the resulting modifications of the standard backpropagation weight-update equations) does not alter the way the network approximates $f(\vec{x})$. However, this is not the case. According to Eqs. (11) and (12), as long as $\sigma^2(\vec{x})$ is constant over all $\vec{x}_i$, the effective learning rate is constant over all patterns ($\eta/\sigma^2$). However, for input patterns where $\sigma^2(\vec{x})$ is smaller than average, the learning rate $\eta$ is effectively amplified compared to patterns for which $\sigma^2(\vec{x})$ is larger than average. Thus, this particular estimation of $\sigma^2(\vec{x})$ has the side-effect of biasing the network's allocation of its resources towards lower-noise regions, discounting regions of the input space where the network is producing larger-than-average errors. Through this side-effect, this procedure implements a form of robust regression, emphasizing low-noise regions of the input space in the allocation of the network's remaining resources.
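The effective learning rate of $\eta/\sigma^2$ can be read off directly from the cost of Eq. (10). As a short check (our own restatement, not the paper's Eqs. (11)-(12) verbatim), differentiating the per-pattern cost with respect to the output gives

$$\frac{\partial C_i}{\partial y(\vec{x}_i)} = -\,\frac{d_i - y(\vec{x}_i)}{\sigma^2(\vec{x}_i)}, \qquad \Delta w_{yj} = -\eta\,\frac{\partial C_i}{\partial w_{yj}} = \frac{\eta}{\sigma^2(\vec{x}_i)}\,\bigl(d_i - y(\vec{x}_i)\bigr)\,h_j(\vec{x}_i),$$

so each pattern's contribution to the mean-branch update is the ordinary backpropagation update scaled by $1/\sigma^2(\vec{x}_i)$, i.e., an effective learning rate of $\eta/\sigma^2(\vec{x}_i)$.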
B. Overfitting

In the example problem above, a sufficiently large data set was used such that overfitting of $y(\vec{x})$ to the training data was not observed. However, in most applications data sets are more limited, and overfitting can present a serious problem. In our method, an accurate approximation of $\sigma^2(\vec{x})$ depends on the quality of $y(\vec{x})$ as an approximation of $f(\vec{x})$ without overfitting. To see why this is so, consider the extreme case where a network overfits the training data such that the error is zero on every training pattern. Then the estimated variance would be zero even though in reality there may be considerable target noise about the true $f(\vec{x})$.

In addition, we must also be concerned with overfitting $s^2(\vec{x})$ to the training data. For example, take a situation in which the true variance is constant over the input space, yet in one small region we have four one-dimensional input patterns arranged such that the outer two patterns have small errors from $f(x)$ and the inner two have large errors. We do not want $s^2(x)$ to estimate a sudden increase in the variance in the region of the inner two patterns. Therefore, in applying our technique to relatively short data sets, we must use the same anti-overfitting weaponry as is required when attempting function approximation with any sparse data set (e.g., adding complexity penalty terms to Eq. (10) [6] [8]).
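As one illustration of such a penalty, a generic weight-decay term (used here purely as an example of a complexity penalty, not as the specific penalty of the cited references) modifies the cost to

$$\tilde{C} = C + \lambda \sum_{w} w^2,$$

where the sum runs over all weights and $\lambda$ sets the penalty strength; each weight's gradient in Eqs. (3)-(4) then simply acquires an extra $2\lambda w$ term.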
V. CONCLUSIONS

We have introduced a method to estimate the uncertainty of the output of a network that tries to approximate a function. This is accomplished by learning a second function $s^2(\vec{x})$ that estimates $\sigma^2(\vec{x})$, the variance of the target probability distribution around $f(\vec{x})$, as a function of the input $\vec{x}$. This function provides a quantitative estimate of the target noise level depending on the location in input space and, therefore, provides a measure of the uncertainty of $y(\vec{x})$.

We have derived the specific weight update equations for the case of Gaussian noise on the outputs, i.e., we have shown how to estimate the second moment (variance) of the target distribution in addition to the usual estimation of the first moment (mean). The extension of this technique to other error models is straightforward (e.g., a Poisson model could be used when the errors are suspected to be Poisson distributed).

For very sparse data sets, we may only be able to reasonably estimate the first moment of the target distribution. Estimating both the first and second moments is a reasonable goal if we are dealing with a moderately sized data set. For extremely large data sets, however, one can be more ambitious and aim for estimating the entire probability density function using connectionist methods [9] or hidden Markov models with mixed states [10].

We will apply our method to the real-world Data Set A (from a laser) from the Santa Fe Time Series Analysis and Prediction Competition [5] [7].

ACKNOWLEDGMENTS

We would like to thank David Rumelhart and Barak Pearlmutter for discussing the problem and the approach. We would also like to thank Wray Buntine for emphasizing the potential problem of overfitting of the variance function. This work was supported by a Graduate Fellowship from the Office of Naval Research and by NSF grant RIA ECS-9309786.

REFERENCES

[1] M. Casdagli, S. Eubank, J.D. Farmer, and J. Gibson, "State Space Reconstruction in the Presence of Noise." Physica D, vol. 51, pp. 52-98, 1991.

[2] W.L. Buntine and A.S. Weigend, "Bayesian Backpropagation." Complex Systems, vol. 5, pp. 603-643, 1991.
[3] D. MacKay, "A Practical Bayesian Framework for Backpropagation." Neural Computation, vol. 4, no. 3, pp. 448-472, 1992.

[4] D.E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin, "Backpropagation: The Basic Theory." In Backpropagation: Theory, Architectures and Applications, Y. Chauvin and D.E. Rumelhart, eds., Lawrence Erlbaum, 1994.

[5] N.A. Gershenfeld and A.S. Weigend, "The Future of Time Series." In Time Series Prediction: Forecasting the Future and Understanding the Past, A.S. Weigend and N.A. Gershenfeld, eds., Addison-Wesley, pp. 1-70, 1994.

[6] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart, "Predicting Sunspots and Exchange Rates with Connectionist Networks." In Nonlinear Modeling and Forecasting, M. Casdagli and S. Eubank, eds., Addison-Wesley, pp. 395-432, 1992.

[7] A.S. Weigend and N.A. Gershenfeld, eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Santa Fe Institute Studies in the Sciences of Complexity, Proc. Vol. XV, Addison-Wesley, 1994.

[8] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart, "Predicting the Future: A Connectionist Approach," International Journal of Neural Systems, vol. 1, pp. 193-209, 1990.

[9] A.S. Weigend, "Predicting Predictability," Preprint, Department of Computer Science, University of Colorado at Boulder, in preparation, 1994.

[10] A.M. Fraser and A. Dimitriadis, "Forecasting Probability Densities Using Hidden Markov Models with Mixed States." In Time Series Prediction: Forecasting the Future and Understanding the Past, A.S. Weigend and N.A. Gershenfeld, eds., Addison-Wesley, pp. 265-282, 1994.