Dynamic Logistic Regression
All content following this page was uploaded by Stephen J. Roberts on 23 December 2013.
Abstract—We propose an online learning algorithm for training a logistic regression model on nonstationary classification problems. The nonstationarity is captured by modelling the weights in a logistic regression classifier as evolving according to a first order Markov process. The weights are updated using the extended Kalman filter formalism and nonstationarities are tracked by inferring a time-varying state noise variance parameter. We describe an algorithm for doing this based on maximising the evidence of updated predictions. The algorithm is illustrated on a number of synthetic problems.

In section 2 we briefly review steepest descent and Bayesian algorithms for stationary logistic regression. In section 3 we describe how the Bayesian formalism is extended to the case of nonstationary logistic regression and in section 4 we describe algorithms for inferring the state noise variance. In section 5 we present results from numerical examples. Throughout the paper we consider only the case of binary logistic regression although the methods are readily extended to multiple classes. Further details of this research are available in a technical report [10].
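As a concrete illustration of the stationary Bayesian updates reviewed in section 2 (the activation variance of equation 6, the Newton updates of equations 10 and 11, and the moderated output of equations 8 and 9), the following is a minimal Python sketch. It is not from the paper: it uses plain lists rather than a linear algebra library, and all function names are ours.

```python
import math

def g(a):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-a))

def bayes_update(w, S, x, z):
    """One online Newton-style update (equations 6, 10, 11):
    prior N(w, S) -> posterior after observing input x, label z."""
    d = len(w)
    Sx = [sum(S[i][j] * x[j] for j in range(d)) for i in range(d)]
    s2 = sum(x[i] * Sx[i] for i in range(d))          # equation 6
    y = g(sum(w[i] * x[i] for i in range(d)))         # unmoderated output
    c = y * (1.0 - y) / (1.0 + y * (1.0 - y) * s2)
    # equation 10: rank-one reduction of the covariance
    S_new = [[S[i][j] - c * Sx[i] * Sx[j] for j in range(d)] for i in range(d)]
    # equation 11: the posterior covariance acts as the learning rate
    Sx_new = [sum(S_new[i][j] * x[j] for j in range(d)) for i in range(d)]
    w_new = [w[i] + Sx_new[i] * (z - y) for i in range(d)]
    return w_new, S_new

def moderated_output(w, S, x):
    """Moderated prediction of equations 8-9."""
    d = len(w)
    a = sum(w[i] * x[i] for i in range(d))
    s2 = sum(x[i] * sum(S[i][j] * x[j] for j in range(d)) for i in range(d))
    K = (1.0 + math.pi * s2 / 8.0) ** -0.5            # equation 9
    return g(K * a)                                   # equation 8
```

Starting from a zero mean and unit covariance, one update on (x, z) = ([1, 0], 1) moves the first weight to 0.4 and shrinks its variance from 1.0 to 0.8, illustrating the implicit, decreasing learning rate; the moderated output is pulled towards 0.5 relative to g(0.4).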
[…] is that the joint density p(x_t, z_t; x_{t+1}, z_{t+1}; …; x_{t+T}, z_{t+T}) may vary with time, t.

A problem with steepest descent learning for stationary logistic regression is that there are two conflicting requirements on the learning rate parameter α; at the start of learning we wish α to be large so that learning is fast but at the end of learning α must be small to ensure that the learning process converges. Whilst it is true that α can be set to be proportional to 1/t, methods based on this approach are entirely ad hoc and have practical difficulties, e.g. how is the constant of proportionality chosen?

This problem can be overcome by taking a Bayesian approach instead of a maximum likelihood approach. The Bayesian algorithm, as we shall see, possesses an implicit learning rate parameter which naturally decreases as learning progresses.

B. Bayesian Learning

In Bayesian learning we consider the evolution of the distribution of weight values, p(ŵ). This distribution is often approximated as a Gaussian. Before data point (x_t, z_t) arrives the weights are distributed according to the prior distribution N(ŵ_{t-1}, Σ_{t-1}), where N(m, C) is a normal distribution with mean m and covariance C. After arrival of the new data point this prior is updated to a posterior, N(ŵ_t, Σ_t), according to Bayes' rule.

The mean and covariance of this posterior can be evaluated using a variational approximation due to Jaakkola and Jordan [4], by Newton's method as described by Spiegelhalter and Lauritzen [13], or by the Extended Kalman Filter as described by Niranjan [9]. In this paper, we consider the latter two methods.

In Bayesian learning the prior weight distribution (i.e. before the class label is observed and the weights are updated) implies a prior distribution on the activation, p(a_t) = N(ā_t, s_t²), where ā_t = ŵ_{t-1}^T x_t is the mean activation and s_t² is the variance of the activation given by

s_t² = x_t^T Σ_{t-1} x_t    (6)

This gives rise to the concept of a `moderated' output, ỹ_t, defined as

ỹ_t = P(ẑ_t = 1) = ∫ P(ẑ_t = 1 | a_t) p(a_t) da_t    (7)

This integral can be accurately approximated by [8]

ỹ_t = g(K(s_t) ā_t)    (8)

where

K(s_t) = (1 + π s_t² / 8)^{-1/2}    (9)

The unmoderated output, y_t, is given by y_t = g(ā_t). The `moderation' changes the actual output, y_t, to a moderated output, ỹ_t, which is nearer to 0.5 by an amount which is dependent on the prior uncertainty on the model parameters. Moderated outputs are typically better than unmoderated outputs in terms of the likelihood of predictions [8].

If the posterior distribution is approximated by a Gaussian, its mean and covariance can be found via Newton's method as formulated by Spiegelhalter and Lauritzen [13]. The update equations are

Σ_t = Σ_{t-1} − [y_t(1 − y_t) / (1 + y_t(1 − y_t) s_t²)] (Σ_{t-1} x_t)(Σ_{t-1} x_t)^T    (10)

ŵ_t = ŵ_{t-1} + Σ_t x_t (z_t − y_t)    (11)

A comparison of equation 11 with equation 5 shows that the effective learning rate is equal to the covariance matrix Σ_t. We can therefore evaluate an average effective learning rate as ⟨α⟩ = (1/p) Tr(Σ_t), where p is the number of weights. The learning rate is therefore adaptive; as the classifier sees more data from the same stationary regime the covariance reduces which, in turn, reduces the learning rate.

Alternatively, we could use the Extended Kalman Filter (EKF) paradigm to approximate the posterior distribution. This produces an update equation for the covariance which is identical to that obtained from Newton's method. But the update for the mean is different and is given by

ŵ_t = ŵ_{t-1} + [Σ_{t-1} / (1 + y_t(1 − y_t) s_t²)] x_t (z_t − y_t)    (12)

This is different to equation 11 in that learning will be slower for larger activation uncertainty, s_t², or output uncertainty, y_t(1 − y_t).

III. Nonstationary Logistic Regression

We now consider a nonstationary logistic regression model where the weights evolve dynamically and the classifications are made according to the usual logistic function

w_t = w_{t-1} + n_t    (13)

y_t = g(w_t^T x_t)    (14)

where n_t is `state' noise generated from a zero mean normal distribution with isotropic covariance matrix q_t I. The important difference between this nonstationary model and the stationary model of equation 1 is that the weights w_t are now time-varying.

We now take a Bayesian-learning approach to the problem of estimating the weights online. The situation is exactly the same as for the stationary case except that the prior distribution is now N(ŵ_{t-1}, Σ_{t-1} + q_t I). This is because we are adding state noise before the new observation is made (see equation 13). If we use the Newton updates to estimate the posterior distribution then the mean and covariance are given by equations 10, 11 and 6 except that Σ_{t-1} is replaced by
Σ_{t-1} + q_t I. Our dynamic nonstationary learning rule is hence:

Σ_t = (Σ_{t-1} + q_t I) − [y_t(1 − y_t) / (1 + y_t(1 − y_t) s_t²)] [(Σ_{t-1} + q_t I) x_t][(Σ_{t-1} + q_t I) x_t]^T    (15)

ŵ_t = ŵ_{t-1} + Σ_t x_t (z_t − y_t)    (16)

where s_t² is the variance of the activation and is given by

s_t² = x_t^T (Σ_{t-1} + q_t I) x_t    (17)

The EKF learning rule is identical except that equation 16 is replaced by equation 12. The use of an isotropic covariance matrix for the state noise, q_t I, assumes that the weight changes are of similar magnitude. Whilst this may be a reasonable assumption if we consider just the input weights, it may not be so reasonable if we consider the input weights and the bias weight. However, by scaling the input data to have zero mean the bias weights will fluctuate around zero. Alternatively, one could use a separate noise variance parameter for the bias weight or, indeed, have separate noise variance parameters for every weight as discussed in the context of prediction problems by DeFreitas and Niranjan [3].

IV. Estimating State Noise

We now describe an online method for updating the state noise parameter, q_t, based on maximising the `evidence' of the observations. This approach has been used in the context of Kalman Filters and Extended Kalman Filters and was developed originally by Jazwinski [6] and has since been re-discovered by DeFreitas and Niranjan and placed in the context of regularization in neural networks [3].

We note, however, that maximising the evidence of the observations in a classification context, as opposed to a prediction context, may not be the optimal strategy. This is because, in the logistic regression model, the observations are class labels which are binary random variables (drawn from a binomial distribution with a `hit' probability given by the output of the logistic unit). Each class label therefore contains much less information (at most 1 bit) than the equivalent observation in the prediction context (a Gaussian random variable). The resulting estimates of state noise variance are commensurately noisier.

We therefore investigate an alternative strategy: maximising the evidence of updated predictions which, we believe, is a more appropriate criterion for classification problems. The state noise variance parameter, q_t, which maximises this criterion can be found by a line search procedure which is also described in this section.

A. Maximising Evidence of Updated Predictions

For reasons previously discussed we choose to maximise the evidence of the model trained on the new class label, rather than maximising the evidence of the new class label itself. This can be achieved by minimising

G_N(ŷ, ỹ) = − Σ_{t=1}^{N} [ŷ_t ln ỹ_t + (1 − ŷ_t) ln(1 − ỹ_t)]    (18)

where ŷ_t = g(w_t^T x_t). Note that this `updated prediction' is based on the updated weights whereas the moderated prediction is based on the weights from the previous time step (if this were not the case, the optimal value of q would be q = 0).

This criterion may also be understood from a `variance-matching' perspective. The maximum evidence criterion corresponds to setting the state noise so that the observed prediction variance is matched by the estimated prediction variance. In the prediction context (see, for example, section 5.3.1 in [3]) these variances parameterise a Gaussian distribution whereas in the classification context the variances parameterise a binomial distribution.

The evidence of updated predictions is maximised when the following variance matching criterion is minimised

V(ŷ, ỹ) = Σ_{t=1}^{N} [ŷ_t(1 − ŷ_t) − ỹ_t(1 − ỹ_t)]    (19)

Note that the corresponding criterion for maximising the evidence of class labels contains the term z_t(1 − z_t) which is zero. This provides another explanation of why we should maximise the evidence of updated predictions rather than of class labels.

B. Line Search Method

The value of q which minimises G_N(ŷ, ỹ) can be found with a line search algorithm. Line searches are iterative schemes which, for example, fit a quadratic polynomial to a cost function evaluated at three distinct points; the new parameter estimate is given as the minimum of that quadratic function. The next iteration uses this new point and two of the previous points. In this paper we use Brent's line search algorithm ([11], page 402).

The state noise can be estimated by moving a window along one sample at a time and performing a new line search. This is, however, rather computationally intensive and may be unnecessary as q is likely to evolve on a slower time-scale than the weights. We therefore choose to move the window along N samples at a time and then to re-estimate q, i.e. q_t = q_{t-1} unless t is exactly divisible by N, in which case q_t is re-estimated using line search.
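The dynamic learning rule of equations 15-17 and the window-based choice of q by minimising G_N (equation 18) can be sketched as follows. This is a minimal Python illustration under our own naming, not the paper's code: it replaces Brent's line search with a coarse grid over [0, q_max] to stay self-contained, and clamps probabilities away from 0 and 1 for numerical safety.

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

def dynamic_update(w, S, x, z, q):
    """One nonstationary update (equations 15-17): the prior covariance
    is inflated to S + q*I before the Newton-style correction."""
    d = len(w)
    P = [[S[i][j] + (q if i == j else 0.0) for j in range(d)] for i in range(d)]
    Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
    s2 = sum(x[i] * Px[i] for i in range(d))                     # equation 17
    y = g(sum(w[i] * x[i] for i in range(d)))
    c = y * (1.0 - y) / (1.0 + y * (1.0 - y) * s2)
    S_new = [[P[i][j] - c * Px[i] * Px[j] for j in range(d)]
             for i in range(d)]                                  # equation 15
    w_new = [w[i] + sum(S_new[i][j] * x[j] for j in range(d)) * (z - y)
             for i in range(d)]                                  # equation 16
    return w_new, S_new

def updated_prediction_cost(w, S, window, q):
    """G_N of equation 18: cross-entropy of the moderated predictions
    (previous weights) against the updated predictions (new weights)."""
    G, eps = 0.0, 1e-12
    d = len(w)
    for x, z in window:
        a = sum(w[i] * x[i] for i in range(d))
        s2 = sum(x[i] * sum((S[i][j] + (q if i == j else 0.0)) * x[j]
                            for j in range(d)) for i in range(d))
        y_tilde = g(a * (1.0 + math.pi * s2 / 8.0) ** -0.5)      # equations 8-9
        w, S = dynamic_update(w, S, x, z, q)
        y_hat = g(sum(w[i] * x[i] for i in range(d)))
        G -= (y_hat * math.log(y_tilde + eps)
              + (1.0 - y_hat) * math.log(1.0 - y_tilde + eps))
    return G

def estimate_q(w, S, window, q_max=1.0, steps=20):
    """Pick q in [0, q_max] minimising G_N over the window. The paper
    uses Brent's line search; a coarse grid stands in for it here."""
    candidates = [q_max * k / steps for k in range(steps + 1)]
    return min(candidates, key=lambda q: updated_prediction_cost(
        list(w), [row[:] for row in S], window, q))
```

Note that `estimate_q` evaluates the cost on copies of the weight mean and covariance, so the search itself leaves the running estimates untouched; only the selected q is then used in subsequent `dynamic_update` calls.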
We note that if the window size, N, is set too small then it is possible for the state noise estimates to become infinite. This therefore suggests both that reasonably large values of N should be used and that there should be an upper limit, q_max, on the state noise estimate.

[Figure: (a) estimate of state noise, q, versus t; (b) average effective learning rate ⟨α⟩ versus t.]

In this paper we use the following heuristic for setting N and q_max. Firstly, different segments of the time series are used to train different logistic regression models using a batch method (e.g. Bayesian logistic re[gression …] […em]pirical covariance. We then subtract the estimated covariance from the empirical covariance and infer that whatever covariance remains is due to state noise (and not just parameter uncertainty from having finite data sets). The average diagonal entries correspond to esti[mates …]
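Parts of the heuristic above are lost in this copy, but its core step, subtracting the average estimated posterior covariance of batch-trained weight vectors from their empirical covariance and reading the average diagonal of the residue as a state noise level, can be sketched as follows. This is our reconstruction, not the paper's code; the inputs (per-segment weight vectors and their estimated covariance matrices) can come from any batch Bayesian fit.

```python
def q_max_heuristic(weight_samples, estimated_covs):
    """Residual-covariance heuristic for q_max: empirical covariance of
    batch-fitted weight vectors, minus their average estimated posterior
    covariance; the mean residual diagonal entry is the noise level.
    Clamping negative residuals to zero is our own guard."""
    d = len(weight_samples[0])
    n = len(weight_samples)
    mean_w = [sum(w[i] for w in weight_samples) / n for i in range(d)]
    emp = [[sum((w[i] - mean_w[i]) * (w[j] - mean_w[j])
                for w in weight_samples) / n
            for j in range(d)] for i in range(d)]
    m = len(estimated_covs)
    avg_est = [[sum(C[i][j] for C in estimated_covs) / m
                for j in range(d)] for i in range(d)]
    resid_diag = [max(emp[i][i] - avg_est[i][i], 0.0) for i in range(d)]
    return sum(resid_diag) / d
```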
Fig. 4. Two data sets (a) A and (b) B used to form the discrete change data set.

Fig. 2. Estimation of weights using nonstationary learning rule with the state noise estimated by the line search method; true weights (solid line), estimated weights (dashed line).
Fig. 5. (a) Estimate of state noise, q, and (b) effective learning rate of the nonstationary rule (solid line) and the stationary rule (dotted line). The nonstationary rule increases the effective learning rate in response to the change in classification regime at t = 150, whereas the effective learning rate for the stationary rule decreases monotonically as it believes that all data points belong to the same classification regime.

We generated data sets A and B, shown in Figure 4. Each data set consists of examples drawn from one of two isotropic Gaussians corresponding to one of two class labels. Each data set is optimally separated by a linear decision boundary shown in the Figure. A data set was then formed by drawing 150 points from data set A followed by 150 points from data set B. Thus at t = 150 the optimal decision boundary switches from that shown in Figure 4(a) to that in Figure 4(b). Note that if the points in the overall data set were randomly
[…]sible. […] that obtained by the stationary and steepest descent algorithms. Again, the nonstationary algorithm is seen to be the superior method as it is able to increase the […]

Fig. 6. Cumulative prediction error, E, of nonstationary, stationary and steepest descent learning rules on a data set which has a discrete transition between classification regimes at t = 150.
References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
[2] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1996.
[3] J.F.G. DeFreitas, M. Niranjan, and A.H. Gee. Hierarchical Bayesian-Kalman Models for Regularisation and […]