


Dynamic Logistic Regression
William D. Penny and Stephen J. Roberts
{w.penny, [email protected]}
Department of Electrical and Electronic Engineering, Imperial College, London SW7 2BT, U.K.

Abstract: We propose an online learning algorithm for training a logistic regression model on nonstationary classification problems. The nonstationarity is captured by modelling the weights in a logistic regression classifier as evolving according to a first order Markov process. The weights are updated using the extended Kalman filter formalism and nonstationarities are tracked by inferring a time-varying state noise variance parameter. We describe an algorithm for doing this based on maximising the evidence of updated predictions. The algorithm is illustrated on a number of synthetic problems.

I. Introduction


This paper proposes an online learning algorithm for training a logistic regression model on nonstationary classification problems. By nonstationary¹ we mean that the statistics of each class may vary with time or, equivalently from the classification perspective, that the optimal decision boundary changes with time.

The simplest online algorithm for training a logistic regression model is steepest descent. This algorithm has a learning rate parameter which we would wish to be small in regions of stationarity, to allow convergence, and large in regions of nonstationarity, to allow tracking. Our problem, therefore, can be seen as equivalent to estimating a time-varying learning rate.

In this paper the weights are modelled as evolving according to a first order Markov process. This state-space approach (where the state is the weight vector) possesses an implicit learning rate which decreases as more data is seen from the same stationary regime and increases when a new regime is encountered. The ability to adapt to a new regime is based on inferring a state-noise variance parameter. We present an algorithm for doing this based on maximising the evidence of model predictions.

In section 2 we briefly review steepest descent and Bayesian algorithms for stationary logistic regression. In section 3 we describe how the Bayesian formalism is extended to the case of nonstationary logistic regression and in section 4 we describe algorithms for inferring the state noise variance. In section 5 we present results from numerical examples. Throughout the paper we consider only the case of binary logistic regression, although the methods are readily extended to multiple classes. Further details of this research are available in a technical report [10].

This work was supported in part by the UK Engineering and Physical Sciences Research Council (EPSRC), grant number GR/K79062.

¹ We assume the usual technical definition of nonstationarity (see e.g. [2]), which for a sequence of input-output pairs, {x_t, z_t}, is that the joint density p(x_t, z_t, x_{t+1}, z_{t+1}, ..., x_{t+T}, z_{t+T}) may vary with time, t.

II. Stationary Logistic Regression

In a logistic regression model the predicted class label, Ẑ_t, is generated according to

    y_t = P(Ẑ_t = 1 | w) = g(w^T x_t)    (1)

where g(a_t) is the logistic function

    g(a_t) = exp(a_t) / (1 + exp(a_t))    (2)

and a_t is the 'activation' at time t, where w is a column vector of weights and x_t is a column vector of inputs. Note that this is a stationary model because the weight vector w is not time-varying.

A. Steepest Descent Learning

The weights in a logistic regression model can be adapted using a steepest descent algorithm which maximises the likelihood, L(z_t), of the class labels, where

    L(z_t) = P(Ẑ_t = z_t | w) = y_t^{z_t} (1 - y_t)^{1 - z_t}    (3)

Maximising the likelihood can be achieved by minimising the negative log likelihood, or cross entropy,

    G(z_t, y_t) = -ln L(z_t) = -[z_t ln y_t + (1 - z_t) ln(1 - y_t)]    (4)

The corresponding steepest descent learning rule is [1]

    ŵ_t = ŵ_{t-1} + α x_t (z_t - y_t)    (5)

where α is the learning rate and z_t is the true class label.
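To make the update concrete, here is a minimal Python/NumPy sketch of the steepest descent rule of equation 5; the function names and the way the step is packaged are our own choices, not code from the paper.

import numpy as np

def logistic(a):
    # Logistic function g(a) of equation 2: exp(a) / (1 + exp(a)).
    return 1.0 / (1.0 + np.exp(-a))

def steepest_descent_step(w, x, z, alpha):
    # One online update of equation 5: w_t = w_{t-1} + alpha * x_t * (z_t - y_t),
    # where y_t is the predicted class probability of equation 1.
    y = logistic(w @ x)
    return w + alpha * x * (z - y)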
A problem with steepest descent learning for stationary logistic regression is that there are two conflicting requirements on the learning rate parameter α: at the start of learning we wish α to be large so that learning is fast, but at the end of learning α must be small to ensure that the learning process converges. Whilst it is true that α can be set to be proportional to 1/t, methods based on this approach are entirely ad hoc and have practical difficulties, e.g. how is the constant of proportionality chosen? This problem can be overcome by taking a Bayesian approach instead of a maximum likelihood approach. The Bayesian algorithm, as we shall see, possesses an implicit learning rate parameter which naturally decreases as learning progresses.

B. Bayesian Learning

In Bayesian learning we consider the evolution of the distribution of weight values, p(ŵ). This distribution is often approximated as a Gaussian. Before data point (x_t, z_t) arrives, the weights are distributed according to the prior distribution N(ŵ_{t-1}, Σ_{t-1}), where N(m, C) is a normal distribution with mean m and covariance C. After arrival of the new data point this prior is updated to a posterior, N(ŵ_t, Σ_t), according to Bayes' rule.

The mean and covariance of this posterior can be evaluated using a variational approximation due to Jaakkola and Jordan [4], by Newton's method as described by Spiegelhalter and Lauritzen [13], or by the Extended Kalman Filter as described by Niranjan [9]. In this paper we consider the latter two methods.

In Bayesian learning the prior weight distribution (i.e. before the class label is observed and the weights are updated) implies a prior distribution on the activation, P(a_t) = N(ā_t, s_t²), where ā_t = ŵ_{t-1}^T x_t is the mean activation and s_t² is the variance of the activation, given by

    s_t² = x_t^T Σ_{t-1} x_t    (6)

This gives rise to the concept of a 'moderated' output, ỹ_t, defined as

    ỹ_t = P(Ẑ_t = 1) = ∫ P(Ẑ_t = 1 | a_t) p(a_t) da_t    (7)

This integral can be accurately approximated by [8]

    ỹ_t = g(K(s_t) ā_t)    (8)

where

    K(s_t) = (1 + π s_t² / 8)^{-1/2}    (9)

The unmoderated output, y_t, is given by y_t = g(ā_t). The 'moderation' changes the actual output, y_t, to a moderated output, ỹ_t, which is nearer to 0.5 by an amount which is dependent on the prior uncertainty on the model parameters. Moderated outputs are typically better than unmoderated outputs in terms of the likelihood of predictions [8].
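Continuing the sketch above, the moderated output of equations 6-9 could be computed as follows; maintaining the weight mean w and covariance S outside the function is our assumption about usage, not a prescription from the paper.

def moderated_output(w, S, x):
    # Mean activation and activation variance (equation 6).
    a_bar = w @ x
    s2 = x @ S @ x
    # MacKay's approximation to the integral of equation 7 (equations 8-9).
    K = (1.0 + np.pi * s2 / 8.0) ** -0.5
    # The result is pulled towards 0.5 as the activation variance s2 grows.
    return logistic(K * a_bar)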
If the posterior distribution is approximated by a Gaussian, its mean and covariance can be found via Newton's method as formulated by Spiegelhalter and Lauritzen [13]. The update equations are

    Σ_t = Σ_{t-1} - [y_t(1 - y_t) / (1 + y_t(1 - y_t) s_t²)] (Σ_{t-1} x_t)(Σ_{t-1} x_t)^T    (10)

    ŵ_t = ŵ_{t-1} + Σ_t x_t (z_t - y_t)    (11)

A comparison of equation 11 with equation 5 shows that the effective learning rate is equal to the covariance matrix Σ_t. We can therefore evaluate an average effective learning rate as ⟨α⟩ = (1/p) Tr(Σ_t), where p is the number of weights. The learning rate is therefore adaptive; as the classifier sees more data from the same stationary regime the covariance reduces which, in turn, reduces the learning rate.

Alternatively, we could use the Extended Kalman Filter (EKF) paradigm to approximate the posterior distribution. This produces an update equation for the covariance which is identical to that obtained from Newton's method, but the update for the mean is different and is given by

    ŵ_t = ŵ_{t-1} + [Σ_{t-1} / (1 + y_t(1 - y_t) s_t²)] x_t (z_t - y_t)    (12)

This is different to equation 11 in that learning will be slower for larger activation uncertainty, s_t², or output uncertainty, y_t(1 - y_t).
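A sketch of one Bayesian update step under the Gaussian approximation is given below; it implements the shared covariance update of equation 10 together with either the Newton mean update of equation 11 or the EKF mean update of equation 12. The packaging and names are ours, not the authors' code.

def bayes_step(w, S, x, z, use_ekf=True):
    # Unmoderated output and activation variance (equations 1 and 6).
    y = logistic(w @ x)
    s2 = x @ S @ x
    Sx = S @ x
    denom = 1.0 + y * (1.0 - y) * s2
    # Covariance update (equation 10); identical for Newton and the EKF.
    S_new = S - (y * (1.0 - y) / denom) * np.outer(Sx, Sx)
    if use_ekf:
        # EKF mean update (equation 12): slower for larger s2 or y(1-y).
        w_new = w + (Sx / denom) * (z - y)
    else:
        # Newton mean update (equation 11): effective learning rate is S_new.
        w_new = w + (S_new @ x) * (z - y)
    return w_new, S_new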
III. Nonstationary Logistic Regression

We now consider a nonstationary logistic regression model where the weights evolve dynamically and the classifications are made according to the usual logistic function:

    w_t = w_{t-1} + n_t    (13)

    y_t = g(w_t^T x_t)    (14)

where n_t is 'state' noise generated from a zero mean normal distribution with isotropic covariance matrix q_t I. The important difference between this nonstationary model and the stationary model of equation 1 is that the weights w_t are now time-varying.

We now take a Bayesian-learning approach to the problem of estimating the weights online. The situation is exactly the same as for the stationary case except that the prior distribution is now N(ŵ_{t-1}, Σ_{t-1} + q_t I). This is because we are adding state noise before the new observation is made (see equation 13). If we use the Newton updates to estimate the posterior distribution, then the mean and covariance are given by equations 10, 11 and 6 except that Σ_{t-1} is replaced by Σ_{t-1} + q_t I. Our dynamic nonstationary learning rule is hence:

    Σ_t = (Σ_{t-1} + q_t I) - [y_t(1 - y_t) / (1 + y_t(1 - y_t) s_t²)] [(Σ_{t-1} + q_t I) x_t][(Σ_{t-1} + q_t I) x_t]^T    (15)

    ŵ_t = ŵ_{t-1} + Σ_t x_t (z_t - y_t)    (16)

where s_t² is the variance of the activation and is given by

    s_t² = x_t^T (Σ_{t-1} + q_t I) x_t    (17)

The EKF learning rule is identical except that equation 16 is replaced by equation 12. The use of an isotropic covariance matrix for the state noise, q_t I, assumes that the weight changes are of similar magnitude. Whilst this may be a reasonable assumption if we consider just the input weights, it may not be so reasonable if we consider the input weights and the bias weight. However, by scaling the input data to have zero mean, the bias weights will fluctuate around zero. Alternatively, one could use a separate noise variance parameter for the bias weight or, indeed, have separate noise variance parameters for every weight, as discussed in the context of prediction problems by DeFreitas and Niranjan [3].
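Since the dynamic rule only changes the prior covariance, a sketch can simply add the state noise and reuse the stationary update defined earlier (the function packaging is again our own):

def dynamic_step(w, S, x, z, q, use_ekf=True):
    # Equations 13 and 15-17: the prior becomes N(w, S + q*I) before the
    # new class label is observed; the update itself is otherwise unchanged.
    S_prior = S + q * np.eye(len(w))
    return bayes_step(w, S_prior, x, z, use_ekf=use_ekf)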
IV. Estimating state noise

We now describe an online method for updating the state noise parameter, q_t, based on maximising the 'evidence' of the observations. This approach has been used in the context of Kalman Filters and Extended Kalman Filters; it was developed originally by Jazwinski [6] and has since been re-discovered by DeFreitas and Niranjan and placed in the context of regularization in neural networks [3].

We note, however, that maximising the evidence of the observations in a classification context, as opposed to a prediction context, may not be the optimal strategy. This is because, in the logistic regression model, the observations are class labels, which are binary random variables (drawn from a binomial distribution with a 'hit' probability given by the output of the logistic unit). Each class label therefore contains much less information (at most 1 bit) than the equivalent observation in the prediction context (a Gaussian random variable). The resulting estimates of state noise variance are commensurately noisier.

We therefore investigate an alternative strategy: maximising the evidence of updated predictions, which, we believe, is a more appropriate criterion for classification problems. The state noise variance parameter, q_t, which maximises this criterion can be found by a line search procedure which is also described in this section.

A. Maximising Evidence of Updated Predictions

For reasons previously discussed, we choose to maximise the evidence of the model trained on the new class label, rather than maximising the evidence of the new class label itself. This can be achieved by minimising

    G_N(ŷ, ỹ) = - Σ_{t=1}^{N} [ŷ_t ln ỹ_t + (1 - ŷ_t) ln(1 - ỹ_t)]    (18)

where ŷ_t = g(ŵ_t^T x_t). Note that this 'updated prediction' is based on the updated weights, whereas the moderated prediction is based on the weights from the previous time step (if this were not the case, the optimal value of q would be q = 0).

This criterion may also be understood from a 'variance-matching' perspective. The maximum evidence criterion corresponds to setting the state noise so that the observed prediction variance is matched by the estimated prediction variance. In the prediction context (see, for example, section 5.3.1 in [3]) these variances parameterise a Gaussian distribution, whereas in the classification context the variances parameterise a binomial distribution. The evidence of updated predictions is maximised when the following variance matching criterion is minimised:

    V(ŷ, ỹ) = Σ_{t=1}^{N} [ŷ_t(1 - ŷ_t) - ỹ_t(1 - ỹ_t)]    (19)

Note that the corresponding criterion for maximising the evidence of class labels contains the term z_t(1 - z_t), which is zero. This provides another explanation of why we should maximise the evidence of updated predictions rather than of class labels.
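A sketch of the criterion of equation 18, given arrays of updated predictions ŷ_t and moderated predictions ỹ_t collected over a window; the clipping constant is our own numerical safeguard, not part of the paper.

def evidence_of_updated_predictions(y_hat, y_mod, eps=1e-12):
    # Cross entropy of equation 18 between updated predictions,
    # y_hat_t = g(w_t' x_t), and moderated predictions made from the
    # previous time step; lower values mean higher evidence.
    y_hat = np.asarray(y_hat)
    y_mod = np.clip(np.asarray(y_mod), eps, 1.0 - eps)
    return -np.sum(y_hat * np.log(y_mod) + (1.0 - y_hat) * np.log(1.0 - y_mod))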
B. Line Search Method

The value of q which minimises G_N(ŷ, ỹ) can be found with a line search algorithm. Line searches are iterative schemes which, for example, fit a quadratic polynomial to a cost function evaluated at three distinct points; the new parameter estimate is given as the minimum of that quadratic function. The next iteration uses this new point and two of the previous points. In this paper we use Brent's line search algorithm ([11], page 402).

The state noise can be estimated by moving a window along one sample at a time and performing a new line search. This is, however, rather computationally intensive and may be unnecessary, as q is likely to evolve on a slower time-scale than the weights. We therefore choose to move the window along N samples at a time and then to re-estimate q, i.e. q_t = q_{t-1} unless t is exactly divisible by N, in which case q_t is re-estimated using line search.
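The windowed re-estimation could be sketched as follows, using SciPy's bounded scalar minimiser (a Brent-style method) in place of the Numerical Recipes routine cited above; the window format, the bounds and the cap at q_max are our assumptions.

from scipy.optimize import minimize_scalar

def estimate_q(window, w, S, q_max, use_ekf=True):
    # window is a list of (x, z) pairs. For a candidate q, run the dynamic
    # filter over the window and accumulate the criterion of equation 18.
    def cost(q):
        w_c, S_c, total = w.copy(), S.copy(), 0.0
        for x, z in window:
            S_prior = S_c + q * np.eye(len(w_c))
            y_mod = moderated_output(w_c, S_prior, x)   # previous-step prediction
            w_c, S_c = bayes_step(w_c, S_prior, x, z, use_ekf)
            y_hat = logistic(w_c @ x)                   # updated prediction
            total += evidence_of_updated_predictions([y_hat], [y_mod])
        return total
    return minimize_scalar(cost, bounds=(0.0, q_max), method='bounded').x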
We note that if the window size, N, is set too small then it is possible for the state noise estimates to become infinite. This suggests both that reasonably large values of N should be used and that there may be a need to impose a maximum state noise estimate, q_max.

In this paper we use the following heuristic for setting N and q_max. Firstly, different segments of the time series are used to train different logistic regression models using a batch method (e.g. Bayesian logistic regression using the evidence formalism [8]). This allows weight vectors and covariance matrices to be estimated for each segment. We then calculate the difference in weight vector between consecutive segments and form an outer product. This acts as an estimate of the empirical covariance. We then subtract the estimated covariance from the empirical covariance and infer that whatever covariance remains is due to state noise (and not just parameter uncertainty from having finite data sets). The average diagonal entries correspond to estimated q values and we set q_max to the median value (a conservative but robust estimate). The window length N is then initially set to a small value, say 10, and is increased if the line search estimate of q is greater than q_max.

In practice we find that, as expected, maximising the evidence of updated predictions is much more robust to choices of N and q_max than is maximising the evidence of class labels, and find that the above heuristic scheme is often not necessary. All of the numerical examples in this paper estimate the state noise variance by maximising the evidence of updated predictions.

V. Results

A. Continuous drift

We generated a 'continuous drift' data set as follows. Two Gaussians were first defined as having means m_1 = [1, 1]^T and m_2 = [-1, -1]^T and covariances Σ_1 = Σ_2 = Σ = diag(v, v), and N_D = 300 data points were drawn from each Gaussian. The data were then randomly ordered to create a set of 600 input vectors, {x_t}. The weights in a logistic regression model were then initialised by assuming class 1 data points came from Gaussian 1 and class 2 from Gaussian 2 and using the relations in [1] (section 3.1.3). The variance v was chosen so that 70% of the data could be correctly classified (the appropriate value of v was found via Monte Carlo simulation). The weights were then allowed to evolve according to the diffusion in equation 13, where the state noise at time t has isotropic variance q(t). Various profiles for q(t) were tried, the one we report here being zero except between t = 200 and t = 220, where q(t) = 0.1. The true posterior probability at time t was then evaluated by passing the input at time t through a logistic node, y_t = g(w_t^T x_t). Class labels, z_t, were then generated from this posterior probability.

We then trained logistic regression models using stationary, nonstationary and steepest descent learning algorithms. In the nonstationary model the state noise was estimated using the line search method to maximise the evidence of updated predictions.

Fig. 1. (a) Estimate of state noise, q, and (b) effective learning rate of nonstationary rule (solid line) and stationary rule (dotted line). The nonstationary rule increases the effective learning rate in response to the change in classification regime at t = 200, whereas the effective learning rate for the stationary rule decreases monotonically as it believes that all data points belong to the same classification regime.

Figure 1(a) shows the true state noise and the estimated state noise trajectories, and Figure 1(b) shows the corresponding learning rates for the stationary and nonstationary algorithms. Figure 2 shows the true weights and the weights as estimated by the nonstationary rule, and Figure 3 compares the cumulative prediction error (as measured by the cross entropy between moderated predictions and true class labels) with that obtained by the stationary and steepest descent algorithms.

The reason why nonstationary learning performs so much better is that the learning rate is increased when the new classification regime is encountered (see Figure 1(b)). The EKF approximation was preferred to Newton's method as it was less sensitive to overestimates of state noise (due to the denominator terms in equation 12).
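For reference, a sketch of the 'continuous drift' generator described at the start of this subsection; the initial weight values and the random seed handling are our simplifications (the paper derives the initial weights from the class Gaussians via [1], section 3.1.3, and sets v by Monte Carlo simulation).

def continuous_drift(v, ND=300, seed=0):
    rng = np.random.default_rng(seed)
    m1, m2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
    # N_D points per class, then a random ordering of the 600 inputs.
    X = np.vstack([rng.normal(m1, np.sqrt(v), (ND, 2)),
                   rng.normal(m2, np.sqrt(v), (ND, 2))])
    X = X[rng.permutation(len(X))]
    # State noise profile: zero except q(t) = 0.1 for 200 <= t < 220.
    q = np.zeros(len(X))
    q[200:220] = 0.1
    w = np.array([1.0, 1.0])   # placeholder initial weights
    y, z = np.zeros(len(X)), np.zeros(len(X), dtype=int)
    for t in range(len(X)):
        w = w + rng.normal(0.0, np.sqrt(q[t]), size=2)   # diffusion, equation 13
        y[t] = logistic(w @ X[t])                        # true posterior, equation 14
        z[t] = rng.binomial(1, y[t])                     # sampled class label
    return X, z, y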
Fig. 2. Estimation of weights using nonstationary learning rule with the state noise estimated by the line search method; true weights (solid line), estimated weights (dashed line).

Fig. 3. Cumulative prediction error, E, of nonstationary, stationary and steepest descent learning rules on a data set which has a slow transition between classification regimes starting at t = 200.

B. Discrete change

We generated data sets A and B, shown in Figure 4. Each data set consists of examples drawn from one of two isotropic Gaussians corresponding to one of two class labels. Each data set is optimally separated by a linear decision boundary shown in the figure. A data set was then formed by drawing 150 points from data set A followed by 150 points from data set B. Thus at t = 150 the optimal decision boundary switches from that shown in Figure 4(a) to that in Figure 4(b). Note that if the points in the overall data set were randomly ordered then it could only be classified with an accuracy of 50%, i.e. the chance rate. If, however, the temporal ordering is preserved, near perfect classification is possible.

Fig. 4. Two data sets (a) A and (b) B used to form the discrete change data set.

Figure 5(a) shows the estimated state noise trajectory and Figure 5(b) shows the corresponding learning rates for the stationary and nonstationary algorithms. Figure 6 compares the cumulative prediction error with that obtained by the stationary and steepest descent algorithms. Again, the nonstationary algorithm is seen to be the superior method as it is able to increase the implicit learning rate when a new classification regime is encountered.

Fig. 5. (a) Estimate of state noise, q, and (b) effective learning rate of the nonstationary rule (solid line) and the stationary rule (dotted line). The nonstationary rule increases the effective learning rate in response to the change in classification regime at t = 150, whereas the effective learning rate for the stationary rule decreases monotonically as it believes that all data points belong to the same classification regime.

Fig. 6. Cumulative prediction error, E, of nonstationary, stationary and steepest descent learning rules on a data set which has a discrete transition between classification regimes at t = 150.
VI. Discussion

We have investigated an online learning algorithm for training a logistic regression model on nonstationary classification problems. The nonstationarity is captured by modelling the weights in a logistic regression classifier as evolving according to a first order Markov process. The weights were updated using the extended Kalman filter formalism. Similar approaches exist in the statistics literature and are known as Dynamic Generalised Linear Models (DGLIMs) [15],[14].

The nonstationarities were tracked by inferring a time-varying state noise variance parameter using a maximum evidence algorithm widely used for nonstationary prediction problems [5],[3], but adapted to the classification context by choosing to maximise the evidence of updated predictions rather than of observations.

An alternative approach to estimating state-noise parameters is outlined in Li and Bar-Shalom [7], where a mixture of models, each having a fixed state-noise value, is employed. In its simplest incarnation two models would be used: one with zero state noise and one with a 'maximum' state noise. Estimates of state noise are then obtained by estimating mixing coefficients.

Our observation that state-noise estimation is easier if the targets are 'updated predictions', rather than binary class labels, suggests the following heuristic estimation scheme: if the model is tracking the data at all, then the estimated weights at time t should be closer to the true weights at time t - N than are the estimated weights at time t - N themselves. Therefore, for the purposes of estimating state noise, the model at time t could be used to generate pseudo-targets for previous time steps, i.e. z_{t-N} ≈ y(x_{t-N}; w_t). Experimentation with such an approach shows that good results can be obtained as long as some initial tracking ability is present in the model. This suggests a multiple-pass algorithm where, firstly, a weight trajectory is estimated and then, secondly, a state noise trajectory is estimated, and these two steps are iterated until a consistent solution is reached. Such a scheme is reminiscent of the forward and backward recursions involved in the EM (Expectation-Maximisation) estimation [12] of model parameters and seems a promising avenue for further research. However, such 'smoothing' approaches (as opposed to filtering approaches) mean that the algorithms are no longer 'online'.

References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
[2] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1996.
[3] J.F.G. DeFreitas, M. Niranjan, and A.H. Gee. Hierarchical Bayesian-Kalman Models for Regularisation and ARD in Sequential Learning. Technical Report CUED/F-INFENG/TR 307, Department of Engineering, Cambridge University, 1998. Also available from http://svr-www.eng.cam.ac.uk/~jfgf/publications.html.
[4] T. S. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. Technical Report 9702, MIT Computational Cognitive Science, January 1997.
[5] A. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, 1970.
[6] A.H. Jazwinski. Adaptive Filtering. Automatica, 5:475-485, 1969.
[7] X.R. Li and Y. Bar-Shalom. A recursive multiple model approach to noise identification. IEEE Transactions on Aerospace and Electronic Systems, 30(3):671-684, 1994.
[8] D.J.C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720-736, 1992.
[9] M. Niranjan. Risk prediction in pregnancy. Talk at the Isaac Newton Institute, November 1997.
[10] W.D. Penny and S.J. Roberts. Nonstationary Logistic Regression. Technical report, Department of Electrical Engineering, Imperial College, 1999. Available from http://www.ee.ic.ac.uk/research/neural/wpenny.html.
[11] W. H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C, second edition. Cambridge, 1992.
[12] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-346, 1999.
[13] D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605, 1990.
[14] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 2nd edition, 1997.
[15] M. West, J. Harrison, and H.S. Migon. Dynamic Generalized Linear Models and Bayesian Forecasting. Journal of the American Statistical Association, 80(389):73-83, 1985.
