
NOTE Communicated by Leo Breiman

Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure

Marco Saerens
[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium, and SmalS-MvM, Research Section, Brussels, Belgium

Patrice Latinne
[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium

Christine Decaestecker
[email protected]
Laboratory of Histopathology, cp 620, Université Libre de Bruxelles, B-1070 Brussels, Belgium

It sometimes happens (for instance, in case control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) on this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier with respect to these new a priori probabilities without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide if the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy, when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.

Neural Computation 14, 21–41 (2001) © 2001 Massachusetts Institute of Technology

1 Introduction

In supervised classification tasks, the a priori probabilities of the classes in a training set sometimes do not reflect the "true" a priori probabilities of the real-world data on which the trained classifier has to be applied. For instance, this happens when the sample used for training is stratified by the value of the discrete response variable (i.e., the class membership). Consider, for example, an experimental setting, a case control study, where we select 50% of individuals suffering from a disease (the cases) and 50% of individuals who do not suffer from this disease (the controls), and suppose that we make a set of measurements on these individuals. The resulting observations are used to train a model that classifies the data into the two target classes: disease and no disease. In this case, the a priori probabilities of the two classes in the training set are 0.5 each. Once we apply the trained model in a real-world situation (new cases), we have no idea of the true a priori probability of disease (also called "disease prevalence" in biostatistics). It has to be estimated from the new data. Moreover, the outputs of the model have to be adjusted accordingly. In other words, the classification model is trained on a data set with a priori probabilities that are different from the real-world conditions.
In this situation, knowledge of the "true" a priori probabilities of the real-world data would be an asset for the following important reasons:
• Optimal Bayesian decision making is based on the a posteriori probabilities of the classes conditional on the observation (we have to select the class label that has maximum estimated a posteriori probability). Now, following Bayes' rule, these a posteriori probabilities depend in a nonlinear way on the a priori probabilities. Therefore, a change of the a priori probabilities (as is the case for the real-world data versus the training set) may have an important impact on the a posteriori probabilities of membership, which themselves affect the classification rate. In other words, even if we use an optimal Bayesian model, if the a priori probabilities of the classes change, the model will not be optimal anymore in these new conditions. But knowing the new a priori probabilities of the classes would allow us to correct (by Bayes' rule) the output of the model in order to recover the optimal decision.
• Many classification methods, including neural network classifiers, provide estimates of the a posteriori probabilities of the classes. From the previous point, this means that applying such a classifier as is on new data having different a priori probabilities from the training set can result in a loss of classification accuracy, in comparison with an equivalent classifier that relies on the "true" a priori probabilities of the new data set.

This is the primary motivation of this article: to introduce a procedure allowing the correction of the estimated a posteriori probabilities, that is, the classifier's outputs, in accordance with the new a priori probabilities of the real-world data, in order to make more accurate predictions, even if these a priori probabilities of the new data set are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. The experimental section (section 4) will confirm that a significant increase in classification accuracy can be obtained when correcting the outputs of the classifier with respect to new a priori probability conditions.
For the sake of completeness, notice also that there exists another approach, the min-max criterion, which avoids the estimation of the a priori probabilities on the new data. Basically, the min-max criterion says that one should use the Bayes decision rule that corresponds to the least favorable a priori probability distribution (see, e.g., Melsa & Cohn, 1978, or Hand, 1981).
In brief, we present a simple iterative procedure that estimates the new a priori probabilities of a new data set and adjusts the outputs of the classifier, which is supposed to approximate the a posteriori probabilities, without having to refit the model (section 2). This algorithm is a simple instance of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997), which aims to maximize the likelihood of the new observed data. We also discuss a simple statistical test (a likelihood ratio test) that can be applied in order to decide if the a priori probabilities have changed or not from the training set to the new data set (section 3). We illustrate the procedure on artificial and real classification tasks and analyze its robustness with respect to imperfect estimation of the a posteriori probabilities provided by the classifier (section 4). Comparisons with a standard procedure used for a priori probability estimation (also in section 4) and a discussion of related work (section 5) are also provided.

2 Correcting a Posteriori Probability Estimates with Respect to New a Priori Probabilities

2.1 Data Classification. One of the most common uses of data is classification. Suppose that we want to forecast the unknown discrete value of a dependent (or response) variable v based on a measurement (or observation) vector x. This discrete dependent variable takes its value in V = {v_1, ..., v_n}, the n class labels.
A training example is therefore a realization of a random feature vector, x, measured on an individual and allocated to one of the n classes v_i ∈ V. A training set is a collection of such training examples recorded for the purpose of model building (training) and forecasting based on that model.
The a priori probability of belonging to class v_i in the training set will be denoted p_t(v_i) (in the sequel, subscript t will be used for estimates carried out on the basis of the training set). In the case control example, p_t(v_1) = p_t(disease) = 0.5 and p_t(v_2) = p_t(no disease) = 0.5.
For the purpose of training, we suppose that for each class v_i, observations on N_ti individuals belonging to that class (with Σ_{i=1}^{n} N_ti = N_t, the total number of training examples) have been independently recorded according to the within-class probability density, p(x | v_i). Indeed, case control studies involve direct sampling from the within-class probability densities, p(x | v_i). In a case control study with two classes (as reported in section 1), this means that we made independent measurements on N_t1 individuals who contracted the disease (the cases), according to p(x | disease), and on N_t2 individuals who did not (the controls), according to p(x | no disease). The a priori probabilities of the classes in the training set are therefore estimated by their frequencies, p̂_t(v_i) = N_ti / N_t.
Let us suppose that we have trained a classification model (the classifier), and denote by p̂_t(v_i | x) the estimated a posteriori probability of belonging to class v_i provided by the classifier, given that the feature vector x has been observed, in the conditions of the training set. The classification model (whose parameters are estimated on the basis of the training set, as indicated by subscript t) could be an artificial neural network, a logistic regression, or any other model that provides as outputs estimates of the a posteriori probabilities of the classes given the observation. This is, for instance, the case if we use the least-squares error or the Kullback-Leibler divergence as a criterion for training and if the minimum of the criterion is reached (see, e.g., Richard & Lippmann, 1991, or Saerens, 2000, for a recent discussion). We therefore suppose that the model has n outputs, g_i(x) (i = 1, ..., n), providing estimated posterior probabilities of membership, p̂_t(v_i | x) = g_i(x). In the experimental section (section 4), we will show that even imperfect approximations of these output probabilities allow reasonably good output corrections by the procedure presented below.
Let us now suppose that the trained classification model has to be applied to another data set (new cases, e.g., real-world data to be scored) for which the class frequencies, estimating the a priori probabilities p(v_i) (no subscript t), are suspected to be different from p̂_t(v_i). The a posteriori probabilities provided by the model for these new cases will have to be corrected accordingly. As detailed in the next two sections, two cases must be considered, according to whether estimates of the new a priori probabilities p̂(v_i) are, or are not, available for this new data set.

2.2 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Known. In the sequel, we assume that the generation of the observations within the classes, and thus the within-class densities, p(x | v_i), does not change from the training set to the new data set (only the relative proportion of measurements observed from each class has changed). This is a natural requirement; it supposes that we choose the training set examples only on the basis of the class labels v_i, not on the basis of x. We also assume that we have an estimate of the new a priori probabilities, p̂(v_i).
Suppose now that we are working on a new data set to be scored. Bayes' theorem provides

    p̂_t(x | v_i) = p̂_t(v_i | x) p̂_t(x) / p̂_t(v_i),    (2.1)

where the a posteriori probabilities p̂_t(v_i | x) are obtained by applying the trained model as is (subscript t) on some observation x of the new data set (i.e., by scoring the data). These are the estimated a posteriori probabilities in the conditions of the training set (relying on the a priori probabilities of the training set).
The corrected a posteriori probabilities, p̂(v_i | x) (relying this time on the estimated a priori probabilities of the new data set), obey the same equation, but with p̂(v_i) as the new a priori probabilities and p̂(x) as the new probability density function (no subscript t):

    p̂(x | v_i) = p̂(v_i | x) p̂(x) / p̂(v_i).    (2.2)

Since the within-class densities do not change from training to real-world data (p̂_t(x | v_i) = p̂(x | v_i)), by equating equation 2.1 to 2.2 and defining f(x) = p̂_t(x) / p̂(x), we find

    p̂(v_i | x) = f(x) [p̂(v_i) / p̂_t(v_i)] p̂_t(v_i | x).    (2.3)
Since Σ_{i=1}^{n} p̂(v_i | x) = 1, we easily obtain

    f(x) = [ Σ_{j=1}^{n} (p̂(v_j) / p̂_t(v_j)) p̂_t(v_j | x) ]^(−1),

and consequently

    p̂(v_i | x) = [p̂(v_i) / p̂_t(v_i)] p̂_t(v_i | x) / Σ_{j=1}^{n} [p̂(v_j) / p̂_t(v_j)] p̂_t(v_j | x).    (2.4)

This well-known formula can be used to compute the corrected a posteriori probabilities, p̂(v_i | x), in terms of the outputs provided by the trained model, g_i(x) = p̂_t(v_i | x), and the new priors p̂(v_i). We observe that the new a posteriori probabilities p̂(v_i | x) are simply the a posteriori probabilities in the conditions of the training set, p̂_t(v_i | x), weighted by the ratio of the new priors to the old priors, p̂(v_i) / p̂_t(v_i). The denominator of equation 2.4 ensures that the corrected a posteriori probabilities sum to one.
However, in many real-world cases, we do not know what the real-world a priori probabilities p(v_i) are, since we do not know the class labels for the new data. This is the subject of the next section.
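As a concrete sketch, equation 2.4 amounts to a per-class reweighting followed by renormalization of the classifier outputs. The minimal NumPy implementation below is ours (function and variable names are not from the paper):

```python
import numpy as np

def adjust_posteriors(post_t, priors_t, priors_new):
    """Correct a posteriori probabilities for new a priori probabilities (equation 2.4).

    post_t     : (N, n) array of posteriors estimated under the training-set priors
    priors_t   : (n,) training-set priors p_t(v_i)
    priors_new : (n,) new priors p(v_i)
    """
    post_t = np.asarray(post_t, dtype=float)
    weight = np.asarray(priors_new, dtype=float) / np.asarray(priors_t, dtype=float)
    unnorm = post_t * weight                             # weight by the ratio of new to old priors
    return unnorm / unnorm.sum(axis=1, keepdims=True)    # renormalize so each row sums to one

# Example: with equal training priors, a (0.9, 0.1) prior simply reweights the outputs.
adjusted = adjust_posteriors([[0.5, 0.5]], [0.5, 0.5], [0.9, 0.1])
```

Note that the correction is per observation: the same prior ratio is applied to every x, but the normalizing denominator depends on x.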

2.3 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Unknown. When the new a priori probabilities are not known in advance, we cannot use equation 2.4, and the p(v_i) probabilities have to be estimated from the new data set. In this section, we present an already known standard procedure used for new a priori probability estimation (the only one available in the literature, to our knowledge); then we introduce our original method based on the EM algorithm.

2.3.1 Method 1: Confusion Matrix. The standard procedure used for a priori probability estimation is based on the computation of the confusion matrix, p̂(d_i | v_j), an estimate of the probability of taking the decision d_i (classifying an observation into class v_i) while the observation in fact belongs to class v_j (see, e.g., McLachlan, 1992, or McLachlan & Basford, 1988). In the sequel, this method will be referred to as the confusion matrix method. Here is its rationale. First, the confusion matrix p̂_t(d_i | v_j) is estimated on the training set from cross-tabulated classification frequencies provided by the classifier. Once this confusion matrix has been computed on the training set, it is used to infer the a priori probabilities on a new data set by solving the following system of n linear equations,

    p̂(d_i) = Σ_{j=1}^{n} p̂_t(d_i | v_j) p̂(v_j),   i = 1, ..., n,    (2.5)

with respect to the p̂(v_j), where p̂(d_i) is simply the marginal probability of classifying an observation into class v_i, estimated by the class label frequency after applying the classifier to the new data set. Once the p̂(v_j) are computed from equation 2.5, we use equation 2.4 to infer the new a posteriori probabilities.
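A minimal sketch of the confusion matrix method, assuming the decisions d_i are the classifier's argmax outputs (naming is ours, and the clipping step at the end is a common practical safeguard rather than part of the paper's description):

```python
import numpy as np

def estimate_priors_confusion(conf_t, decision_freq):
    """Solve equation 2.5, decision_freq = conf_t @ priors, for the new priors.

    conf_t        : (n, n) matrix with entry [i, j] = p_t(d_i | v_j),
                    estimated on the training set (columns sum to one)
    decision_freq : (n,) frequency of each decision d_i on the new data set
    """
    priors = np.linalg.solve(np.asarray(conf_t, float), np.asarray(decision_freq, float))
    # The linear solution is not guaranteed to lie in the probability simplex;
    # clipping negatives and renormalizing is a pragmatic fix (an assumption of ours).
    priors = np.clip(priors, 0.0, None)
    return priors / priors.sum()

# Example: a classifier that is 90% / 80% accurate on the two classes.
conf_t = np.array([[0.9, 0.2],
                   [0.1, 0.8]])
freq = conf_t @ np.array([0.3, 0.7])   # decision frequencies implied by true priors (0.3, 0.7)
priors = estimate_priors_confusion(conf_t, freq)
```

With exact decision frequencies the true priors are recovered; with empirical frequencies the solve inherits their sampling noise.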

2.3.2 Method 2: EM Algorithm. We now present a new procedure for a priori and a posteriori probability adjustment, based on the EM algorithm (Dempster et al., 1977; McLachlan & Krishnan, 1997). This iterative algorithm increases the likelihood of the new data at each iteration until a local maximum is reached.
Once again, let us suppose that we record a set of N new independent realizations of the stochastic variable x, X_1^N = (x_1, x_2, ..., x_N), sampled from p(x), in a new data set to be scored by the model. The likelihood of these new observations is defined as

    L(x_1, x_2, ..., x_N) = Π_{k=1}^{N} p(x_k)
                          = Π_{k=1}^{N} [ Σ_{i=1}^{n} p(x_k, v_i) ]
                          = Π_{k=1}^{N} [ Σ_{i=1}^{n} p(x_k | v_i) p(v_i) ],    (2.6)

where the within-class densities, that is, the probabilities of observing x_k given class v_i, remain the same (p(x_k | v_i) = p_t(x_k | v_i)), since we assume that only the a priori probabilities change from the training set to the new data set. We have to determine the estimates p̂(v_i) that maximize the likelihood 2.6 with respect to p(v_i). While a closed-form solution to this problem cannot be found, we can obtain an iterative procedure for estimating the new p(v_i) by applying the EM algorithm.
As before, let us define g_i(x_k) as the model's output value corresponding to class v_i for the observation x_k of the new data set to be scored. The model outputs provide an approximation of the a posteriori probabilities of the classes given the observation in the conditions of the training set (subscript t), while the a priori probabilities of the training set are estimated by class frequencies:

    p̂_t(v_i | x_k) = g_i(x_k),    (2.7)
    p̂_t(v_i) = N_ti / N_t.    (2.8)
Let us define p̂^(s)(v_i) and p̂^(s)(v_i | x_k) as the estimates of the new a priori and a posteriori probabilities at step s of the iterative procedure. If the p̂^(s)(v_i) are initialized with the frequencies of the classes in the training set (see equation 2.8), the EM algorithm provides the following iterative steps (see the appendix) for each new observation x_k and each class v_i:

    p̂^(0)(v_i) = p̂_t(v_i)

    p̂^(s)(v_i | x_k) = [p̂^(s)(v_i) / p̂_t(v_i)] p̂_t(v_i | x_k) / Σ_{j=1}^{n} [p̂^(s)(v_j) / p̂_t(v_j)] p̂_t(v_j | x_k)

    p̂^(s+1)(v_i) = (1/N) Σ_{k=1}^{N} p̂^(s)(v_i | x_k),    (2.9)

where p̂_t(v_i | x_k) and p̂_t(v_i) are given by equations 2.7 and 2.8. Notice the similarity between equations 2.4 and 2.9. At each iteration step s, both the a posteriori probabilities (p̂^(s)(v_i | x_k)) and the a priori probabilities (p̂^(s)(v_i)) are reestimated sequentially for each new observation x_k and each class v_i. The iterative procedure proceeds until the convergence of the estimated probabilities p̂^(s)(v_i).
Of course, if we have some a priori knowledge about the values of the prior probabilities, we can take these as starting values for the initialization of the p̂^(0)(v_i). Notice also that although we did not encounter this problem in our simulations, we must keep in mind that local maxima problems may potentially occur (the EM algorithm finds a local maximum of the likelihood function).
In order to obtain good a priori probability estimates, it is necessary that the a posteriori probabilities relative to the training set are reasonably well approximated (i.e., sufficiently well estimated by the model). The robustness of the EM procedure with respect to imperfect a posteriori probability estimates will be investigated in the experimental section (section 4).
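The iteration 2.9 is straightforward to implement once the classifier outputs on the new data are available. A minimal sketch (our naming, with a tolerance-based stopping rule as one possible convergence criterion):

```python
import numpy as np

def em_adjust(post_t, priors_t, max_iter=100, tol=1e-8):
    """EM re-estimation of new priors and posteriors (equation 2.9).

    post_t   : (N, n) classifier outputs g_i(x_k) on the new, unlabeled data
    priors_t : (n,) training-set priors p_t(v_i)
    Returns the estimated new priors and the adjusted posteriors.
    """
    post_t = np.asarray(post_t, dtype=float)
    priors_t = np.asarray(priors_t, dtype=float)
    p = priors_t.copy()                                   # p^(0) = training-set priors
    for _ in range(max_iter):
        unnorm = post_t * (p / priors_t)                  # reweight, as in equation 2.4
        post = unnorm / unnorm.sum(axis=1, keepdims=True)
        p_next = post.mean(axis=0)                        # new prior = average posterior
        if np.abs(p_next - p).max() < tol:
            p = p_next
            break
        p = p_next
    return p, post

# Example: outputs that mostly favor class 1 pull the estimated prior above 0.5.
post_t = np.array([[0.8, 0.2]] * 6 + [[0.3, 0.7]] * 4)
p_new, post_new = em_adjust(post_t, [0.5, 0.5])
```

Each pass costs O(Nn), so the procedure scales linearly in the size of the data set to be scored.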

3 Testing for Different A Priori Probabilities

In this section, we show that the likelihood ratio test can be used in order to decide if the a priori probabilities have significantly changed from the training set to the new data set. Before adjusting the a priori probabilities (when the trained classification model is simply applied to the new data), the likelihood of the new observations is

    L_t(x_1, x_2, ..., x_N) = Π_{k=1}^{N} p̂_t(x_k)
                            = Π_{k=1}^{N} [ p̂(x_k | v_i) p̂_t(v_i) / p̂_t(v_i | x_k) ],    (3.1)

whatever the class label v_i, and where we used the fact that p_t(x_k | v_i) = p(x_k | v_i).
After the adjustment of the a priori and a posteriori probabilities, we compute the likelihood in the same way:

    L(x_1, x_2, ..., x_N) = Π_{k=1}^{N} p̂(x_k)
                          = Π_{k=1}^{N} [ p̂(x_k | v_i) p̂(v_i) / p̂(v_i | x_k) ],    (3.2)

so that the likelihood ratio is

    L(x_1, ..., x_N) / L_t(x_1, ..., x_N)
        = Π_{k=1}^{N} [ p̂(x_k | v_i) p̂(v_i) / p̂(v_i | x_k) ] / Π_{k=1}^{N} [ p̂(x_k | v_i) p̂_t(v_i) / p̂_t(v_i | x_k) ]
        = Π_{k=1}^{N} [ p̂(v_i) / p̂(v_i | x_k) ] / Π_{k=1}^{N} [ p̂_t(v_i) / p̂_t(v_i | x_k) ],    (3.3)

and the log-likelihood ratio is

    log [ L(x_1, ..., x_N) / L_t(x_1, ..., x_N) ]
        = Σ_{k=1}^{N} log p̂_t(v_i | x_k) − Σ_{k=1}^{N} log p̂(v_i | x_k) + N log p̂(v_i) − N log p̂_t(v_i).    (3.4)

From standard statistical inference (see, e.g., Mood, Graybill, & Boes, 1974; Papoulis, 1991), 2 × log [L(x_1, ..., x_N) / L_t(x_1, ..., x_N)] is asymptotically distributed as a chi square with (n − 1) degrees of freedom (χ²_(n−1), where n is the number of classes). Indeed, since Σ_{i=1}^{n} p̂(v_i) = 1, there are only (n − 1) degrees of freedom. This allows us to test if the new a priori probabilities differ significantly from the original ones and thus to decide if the a posteriori probabilities (i.e., the model outputs) need to be corrected. Notice also that standard errors on the estimated a priori probabilities can be obtained through the computation of the observed information matrix, as detailed in McLachlan & Krishnan (1997).
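For the two-class case, the test can be sketched as follows (our naming throughout; the p-value helper uses the closed form of the χ²₁ survival function via the complementary error function and therefore applies only when n = 2):

```python
import math
import numpy as np

def lr_statistic(post_t, post_new, priors_t, priors_new, i=0):
    """Twice the log-likelihood ratio of equation 3.4.

    post_t, post_new     : (N, n) posteriors before / after adjustment
    priors_t, priors_new : (n,) priors before / after adjustment
    The value is the same for any fixed class index i.
    """
    post_t, post_new = np.asarray(post_t, float), np.asarray(post_new, float)
    N = post_t.shape[0]
    log_ratio = (np.log(post_t[:, i]).sum() - np.log(post_new[:, i]).sum()
                 + N * math.log(priors_new[i]) - N * math.log(priors_t[i]))
    return 2.0 * log_ratio

def chi2_1_pvalue(stat):
    """Survival function of a chi square with 1 degree of freedom (n = 2 only)."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Example: adjust posteriors via equation 2.4, then test the prior change.
post_t = np.array([[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]])
priors_t, priors_new = np.array([0.5, 0.5]), np.array([0.8, 0.2])
unnorm = post_t * (priors_new / priors_t)
post_new = unnorm / unnorm.sum(axis=1, keepdims=True)
stat = lr_statistic(post_t, post_new, priors_t, priors_new)
```

When the adjusted quantities come from equation 2.4, the statistic is invariant to the choice of i, mirroring the "whatever the class label v_i" remark above; for n > 2, a general χ² survival function (e.g., from a statistics library) would replace the erfc shortcut.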

4 Experimental Evaluation

4.1 Simulations on Artificial Data. We present a simple experiment that illustrates the iterative adjustment of the a priori and a posteriori probabilities. We chose a conventional multilayer perceptron (with one hidden layer, softmax output functions, trained with the Levenberg-Marquardt algorithm) as a classification model, as well as a database labeled Ringnorm, introduced by Breiman (1998).[1] This database consists of 7400 cases described by 20 numerical features and divided into two equidistributed classes (each drawn from a multivariate normal distribution with a different variance-covariance matrix).

[1] Available online at http://www.cs.utoronto.ca/~delve/data/datasets.html.
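For readers who want a comparable synthetic benchmark, the sampler below generates a Ringnorm-like problem. The exact parameters (class 1 drawn from N(0, 4I), class 2 from N(a, I) with a = 2/√d in every coordinate) are our reading of Breiman's definition and should be treated as an assumption, not a quotation of the paper:

```python
import numpy as np

def ringnorm_like(n, prior1=0.5, d=20, seed=None):
    """Sample a Ringnorm-like two-class problem.

    Class 1: N(0, 4I); class 2: N(a, I) with a = 2/sqrt(d) per coordinate
    (parameters assumed, after Breiman, 1998). Returns (X, y), y in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    y = (rng.random(n) >= prior1).astype(int)     # y = 0 is class 1, y = 1 is class 2
    a = 2.0 / np.sqrt(d)
    x1 = rng.normal(0.0, 2.0, size=(n, d))        # standard deviation 2, i.e., covariance 4I
    x2 = rng.normal(a, 1.0, size=(n, d))
    X = np.where(y[:, None] == 0, x1, x2)
    return X, y

X, y = ringnorm_like(1000, prior1=0.5, seed=0)
```

The `prior1` argument makes it easy to reproduce the shifted-prior test sets used below.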

Table 1: Results of the Estimation of Priors on the Test Sets, Averaged on 10 Runs, Ringnorm Artificial Data Set.

    True Priors | Estimated Prior by Using     | Log-Likelihood Ratio Test:
                | EM       | Confusion Matrix  | Number of Times the Test Is Significant
    10%         | 14.7%    | 18.1%             | 10
    20%         | 21.4     | 24.2              | 10
    30%         | 33.0     | 34.4              | 10
    40%         | 42.5     | 42.7              | 10
    50%         | 49.2     | 49.0              | 0
    60%         | 59.0     | 57.1              | 10
    70%         | 66.8     | 64.8              | 10
    80%         | 77.3     | 73.9              | 10
    90%         | 85.6     | 80.9              | 10

Note: The neural network has been trained on a data set with a priori probabilities of (50%, 50%).

Ten replications of the following experimental design were applied. First, a training set of 500 cases of each class was extracted from the data (p_t(v_1) = p_t(v_2) = 0.50) and was used for training a neural network with 10 hidden units. For each training set, nine independent test sets of 1000 cases were selected according to the following a priori probability sequence: p(v_1) = 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90 (with p(v_2) = 1 − p(v_1)). Then, for each test set, the EM procedure (see equation 2.9), as well as the confusion matrix procedure (see equation 2.5), were applied in order to estimate the new a priori probabilities and adjust the a posteriori probabilities provided by the model (p̂_t(v_1 | x) = g(x)). In each experiment, a maximum of five iteration steps of the EM algorithm was sufficient to ensure the convergence of the estimated probabilities.
Table 1 shows the estimated a priori probabilities for v_1. With respect to the EM algorithm, it also shows the number of times the likelihood ratio test was significant at p < 0.01 on these 10 replications. Table 2 presents the classification rates (computed on the test set) before and after the probability adjustments, as well as when the true priors of the test set (p(v_i), which are unknown in a real-world situation) were used to adjust the classifier's outputs (using equation 2.4). This latter result can be considered an optimal reference in this experimental context.
The results reported in Table 1 show that the EM algorithm was clearly superior to the confusion matrix method for a priori probability estimation and that the a priori probabilities are reasonably well estimated. Except in the cases where p(v_i) = p_t(v_i) = 0.50, the likelihood ratio test revealed in each replication a significant difference (at p < 0.01) between the training and the test set a priori probabilities (p̂_t(v_i) ≠ p̂(v_i)). The a priori estimates appeared slightly biased toward 50%; this appears to be a bias affecting the neural network classifier trained on an equidistributed training set.

Table 2: Classification Rates on the Test Sets, Averaged on 10 Runs, Ringnorm Artificial Data Set.

    True Priors | Percentage of Correct Classification
                | No Adjustment | After Adjustment by Using
                |               | EM      | Confusion Matrix | True Priors
    10%         | 90.1%         | 93.6%   | 93.1%            | 94.0%
    20%         | 90.3          | 91.9    | 91.7             | 92.2
    30%         | 88.6          | 89.9    | 89.8             | 90.0
    40%         | 90.4          | 90.4    | 90.4             | 90.6
    50%         | 87.0          | 86.9    | 86.8             | 87.0
    60%         | 90.0          | 90.0    | 90.0             | 90.0
    70%         | 89.2          | 89.8    | 89.7             | 90.2
    80%         | 89.5          | 90.7    | 90.7             | 91.0
    90%         | 88.5          | 91.6    | 91.3             | 92.0

By looking at Table 2 (classification results), we observe that the impact of the adjustment of the outputs on classification accuracy can be significant. The effect was most beneficial when the new a priori probabilities, p(v_i), are far from the training set ones (p_t(v_i) = 0.50). Notice that in each case, the classification rates obtained after adjustment were close to those obtained by using the true a priori probabilities of the test sets. Although the EM algorithm provides better estimates of the a priori probabilities, we found no difference between the EM algorithm and the confusion matrix method in terms of classification accuracy. This could be due to the high recognition rates observed for this problem. Notice also that we observe a small degradation in classification accuracy if we adjust the a priori probabilities when not necessary (p_t(v_i) = p(v_i) = 0.5), as indicated by the likelihood ratio test.
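To make the mechanics of such an experiment concrete, here is a self-contained toy rerun of the idea on a one-dimensional two-Gaussian problem (not the paper's Ringnorm setup; all numerical choices below are ours): exact Bayes posteriors play the role of a perfectly trained classifier with 50/50 training priors, and the EM iteration of equation 2.9 recovers the shifted priors of the new data.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-1.0, 1.0])          # class-conditional means, unit variances

def bayes_posteriors(x, priors):
    """Exact posteriors for two Gaussian classes N(mu_i, 1)."""
    lik = np.exp(-0.5 * (x[:, None] - mu) ** 2)     # unnormalized densities
    unnorm = lik * priors
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# New data drawn with priors (0.8, 0.2), while the "classifier" assumes (0.5, 0.5).
true_new_priors = np.array([0.8, 0.2])
y = (rng.random(5000) >= true_new_priors[0]).astype(int)
x = rng.normal(mu[y], 1.0)
post_t = bayes_posteriors(x, np.array([0.5, 0.5]))

# EM iterations of equation 2.9, starting from the training-set priors.
p = np.array([0.5, 0.5])
for _ in range(50):
    unnorm = post_t * (p / 0.5)
    post = unnorm / unnorm.sum(axis=1, keepdims=True)
    p = post.mean(axis=0)
```

With 5000 points, the estimate p typically lands within a couple of percentage points of the true (0.8, 0.2), illustrating on a toy scale the behavior reported in Table 1.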

4.2 Robustness Evaluation on Artificial Data. This section investigates the robustness of the EM-based procedure with respect to imperfect estimates of the a posteriori probability values provided by the classifier, as well as to the size of the training and the test sets (the test set alone is used to estimate the new a priori probabilities). In order to degrade the classifier outputs, we gradually decreased the size of the training set in steps. Symmetrically, in order to reduce the amount of data available to the EM and the confusion matrix algorithms, we also gradually decreased the size of the test set. For each condition, we compared the classifier outputs with those obtained with a Bayesian classifier based on the true data distribution (which is known for an artificial data set such as Ringnorm). We were thus able to quantify the error level of the classifier with respect to the true a posteriori probabilities (how far is our neural network from the Bayesian classifier?) and to evaluate the effects of a decrease in the training or test set sizes on the a priori estimates provided by EM and on the classification performance.

Table 3: Averaged Results for Estimation of the Priors, Ringnorm Data Set, Averaged on 10 Runs.

    Training Set Size | Test Set Size | Mean Absolute Deviation        | Estimated Prior for v_1 (p(v_1) = 0.20) by Using
    (#v_1, #v_2)      | (#v_1, #v_2)  | (1/N) Σ_{k=1}^{N} |b(x_k) − g(x_k)| | EM     | Confusion Matrix
    (500, 500)        | (200, 800)    | 0.107                          | 22.0%  | 24.7%
                      | (100, 400)    | 0.110                          | 21.6   | 24.5
                      | (40, 160)     | 0.104                          | 20.4   | 23.5
                      | (20, 80)      | 0.122                          | 22.7   | 26.7
    (250, 250)        | (200, 800)    | 0.139                          | 22.1   | 25.3
                      | (100, 400)    | 0.140                          | 22.6   | 25.6
                      | (40, 160)     | 0.134                          | 23.1   | 25.8
                      | (20, 80)      | 0.167                          | 22.7   | 26.0
    (100, 100)        | (200, 800)    | 0.183                          | 24.1   | 27.5
                      | (100, 400)    | 0.185                          | 24.4   | 28.2
                      | (40, 160)     | 0.181                          | 23.5   | 27.3
                      | (20, 80)      | 0.180                          | 26.6   | 29.2
    (50, 50)          | (200, 800)    | 0.202                          | 24.9   | 28.5
                      | (100, 400)    | 0.199                          | 25.3   | 29.0
                      | (40, 160)     | 0.203                          | 24.3   | 27.6
                      | (20, 80)      | 0.189                          | 22.3   | 26.0

Note: The true priors of the test sets are (20%, 80%).

As for the experiment reported above, a multilayer perceptron was trained on the basis of an equidistributed training set (p_t(v_1) = 0.5 = p_t(v_2)). An independent and unbalanced test set (with p(v_1) = 0.20 and p(v_2) = 0.80) was selected and scored by the neural network. The experiments (10 replications in each condition) were carried out on the basis of training and test sets with decreasing sizes (1000, 500, 200, and 100 cases), as detailed in Table 3.
We first compared the artificial neural network's output values (g(x) = p̂_t(v_1 | x), obtained by scoring the test sets with the trained neural network) with those provided by the Bayesian classifier (b(x) = p_t(v_1 | x), obtained by scoring the test sets with the Bayesian classifier); that is, we measured the discrepancy between the outputs of the neural and the Bayesian classifiers before output adjustment. For this purpose, we computed the average absolute deviation between the output values of the two classifiers (the average of |b(x) − g(x)|) before output adjustment.
Then for each test set, the EM and the confusion matrix procedures were applied to the outputs of the neural classifier in order to estimate the new a priori probabilities and the new a posteriori probabilities. The results for a priori probability estimation are detailed in Table 3.
By looking at the mean absolute deviation in Table 3, it can be seen that, as expected, decreasing the training set size results in a degradation in the estimation of the a posteriori probabilities (an increase of absolute deviation of about 0.10 between large, i.e., N_t = 1000, and small, i.e., N_t = 100, training set sizes). Of course, the prior estimates degraded accordingly, but only slightly. The EM algorithm appeared to be more robust than the confusion matrix method. Indeed, on average (over all the experiments), the EM method overestimated the prior p(v_1) by 3.3%, while the confusion matrix method overestimated it by 6.6%. In contrast, decreasing the size of the test set seems to have very little effect on the results.

[Figure 1 appears here: average classification rate (between 80% and 100%) for each combination of training set size ((500, 500), (250, 250), (100, 100), or (50, 50) cases) and test set size ((200, 800), (100, 400), (40, 160), or (20, 80) cases), with four curves: using true priors, no adjustment, after adjustment by EM, and after adjustment by confusion matrix.]

Figure 1: Classification rates obtained on the Ringnorm data set. Results are reported for four different conditions: (1) without adjusting the classifier output (no adjustment); (2) adjusting the classifier output by using the confusion matrix method (after adjustment by confusion matrix); (3) adjusting the classifier output by using the EM algorithm (after adjustment by EM); and (4) adjusting the classifier output by using the true a priori probability of the new data (using true priors). The results are plotted as a function of the different sizes of both the training and the test sets.
Figure 1 shows the classification rates (averaged over the 10 replications)
of the neural network before and after the output adjustments made by the
EM and the confusion matrix methods. It also illustrates the degradation in
classifier performance due to the decrease in the size of the training sets: a
loss of about 8% between large (i.e., N_t = 1000) and small (i.e., N_t = 100)
training set sizes. The classification rates obtained after the adjustments
made by the confusion matrix method are very close to those obtained with
the EM method. In fact, the EM method almost always (in 15 of the 16
conditions) provided better results, but the differences in accuracy between
the two methods are very small (0.3% on average). As already observed in
the first experiment (see Table 2), the classification rates obtained after adjustment
by the EM or the confusion matrix method are very close to those
obtained by using the true a priori probabilities (a difference of 0.2% on average).
Finally, we clearly observe (see Figure 1) that adjusting the outputs
of the classifier always increased classification accuracy significantly.

4.3 Tests on Real Data. We also tested the a priori estimation and out-
put readjustment method on three real medical data sets from the UCI
repository (Blake, Keogh, & Merz, 1998) in order to confirm our results on
more realistic data. These data are Pima Indian Diabetes (2 classes of 268
and 500 cases, 8 features), Breast Cancer Wisconsin (2 classes of 239 and 444
cases after omission of the 16 cases with missing values, 9 features), and Bupa
Liver Disorders (2 classes of 145 and 200 cases, 6 features). A training set of
50 cases of each class was selected in each data set and used for training a
multilayer neural network; the remaining cases were used to select an
independent test set. In order to increase the difference between the class
distributions in the training (0.50, 0.50) and the test sets, we omitted a num-
ber of cases from the smallest class in order to obtain a class distribution
of (p(ω1) = 0.20, p(ω2) = 0.80) for the test set. Ten different selections of
training and test sets were carried out, and for each of them, the training
phase was replicated 10 times, giving a total of 100 trained neural networks
for each data set.
On average over the 100 experiments, Table 4 details the a priori proba-
bilities estimated by means of the EM and the confusion matrix methods, as
well as the classification rates before and after the probability adjustments.
These results show that the EM prior estimates were generally better than
the confusion matrix ones (except for the Liver data). Moreover, adjusting
the classifier outputs on the basis of the new a priori probabilities always
increased the classification rates and provided accuracy levels not too far from
those obtained by using the true priors for adjusting the outputs (given in
the last column of Table 4). However, except for the Diabetes data, for which
EM gave better results, the adjustments made on the basis of the EM and the
confusion matrix methods seemed to have the same effect on the accuracy
improvement.

Table 4: Classification Results on Three Real Data Sets.

Data Set    True Priors   Estimated by EM   Estimated by Confusion Matrix   No Adjustment   Adjusted by EM   Adjusted by Confusion Matrix   Adjusted by True Priors
Diabetes    20%           24.8%             31.3%                           67.4%           76.3%            74.4%                          78.3%
Breast      20            18.0              26.2                            91.3            92.0             92.1                           92.6
Liver       20            24.6              21.5                            68.0            75.7             75.5                           79.1

Note: The last four columns give the percentage of correct classification. The neural
network has been trained on a learning set with a priori probabilities of (50%, 50%).
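For reference, the confusion matrix method used as a baseline above is, in its usual textbook form, the solution of a small linear system: the rates at which the classifier assigns new cases to each class are linked to the unknown priors through the confusion probabilities measured on labeled data. The following sketch follows that common formulation; the function and variable names are ours, not the article's.

```python
import numpy as np

def confusion_matrix_priors(decisions_new, conf_prob):
    """Confusion-matrix estimate of the new a priori probabilities.

    decisions_new : (N,) array of class indices assigned by the classifier
                    on the new, unlabeled data set.
    conf_prob     : (n, n) array with conf_prob[j, i] = p(decide class j | true class i),
                    estimated on labeled data (columns sum to 1).
    Solves f = conf_prob @ p for p, where f holds the observed decision
    frequencies on the new data set."""
    n = conf_prob.shape[0]
    f = np.bincount(decisions_new, minlength=n) / len(decisions_new)
    p, *_ = np.linalg.lstsq(conf_prob, f, rcond=None)
    p = np.clip(p, 0.0, None)     # guard against small negative solutions
    return p / p.sum()
```

Least squares (rather than a direct solve) keeps the sketch usable when the confusion matrix is ill conditioned, a situation in which this method is known to degrade.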

5 Related Work

The problem of estimating the parameters of a model by including unlabeled
data in addition to the labeled samples has been studied in both the machine
learning and the artificial neural network communities. In this case, we
speak about learning from partially labeled data (see, e.g., Shahshahani
& Landgrebe, 1994; Ghahramani & Jordan, 1994; Castelli & Cover, 1995;
Towell, 1996; Miller & Uyar, 1997; Nigam, McCallum, Thrun, & Mitchell,
2000). The purpose is to use both labeled and unlabeled data for learning,
since unlabeled data are usually easy to collect, while labeled data are much
more difficult to obtain. In this framework, the labeled part (the training
set in our case) and the unlabeled part (the new data set in our case) are
combined in one data set, and a partly supervised EM algorithm is used to
fit the model (a classifier) by maximizing the full likelihood of the complete
set of data (training set plus new data set). For instance, Nigam et al. (2000)
use the EM algorithm to learn classifiers that take advantage of both labeled
and unlabeled data.

This procedure could easily be applied to our problem: adjusting the a
posteriori probabilities provided by a classifier to new a priori conditions.
Moreover, it makes fully efficient use of the available data. On the downside,
however, the model has to be completely refitted each time it is applied to
a new data set. This is the opposite of the approach discussed in this article,
where the model is fitted only on the training set. When applied to a new
data set, the model is not modified; only its outputs are recomputed based
on the new observations.

Related problems involving missing data have also been studied in ap-
plied statistics. Some good recent reference pointers are Scott and Wild
(1997) and Lawless, Kalbfleisch, and Wild (1999).

6 Conclusion

We presented a simple procedure for adjusting the outputs of a classifier to
new a priori class probabilities. This procedure is a simple instance of the EM
algorithm. When deriving this procedure, we relied on three fundamental
assumptions:

1. The a posteriori probabilities provided by the model are reasonably
well approximated (our readjustment procedure can be applied only
if the classifier provides as output an estimate of the a posteriori
probabilities), which means that the classifier predicts probabilities
of belonging to the classes that are sufficiently close to the observed
probabilities.

2. The training set selection (the sampling) has been performed on the
basis of the discrete dependent variable (the classes), and not of the ob-
served input variable x (the explanatory variable), so that the within-
class probability densities do not change.

3. The new data set to be scored is large enough to allow an accurate
estimate of the new a priori class probabilities.
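The adjustment made possible by assumption 2 (equation 2.4 in the article) amounts to reweighting each classifier output by the ratio of new to training priors and renormalizing. A minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

def adjust_posteriors(post_t, priors_t, priors_new):
    """Correct posteriors computed under training priors priors_t so that
    they are valid under priors_new; the within-class densities are
    assumed unchanged (assumption 2)."""
    # Reweight each class column by its prior ratio, then renormalize rows.
    w = np.asarray(post_t) * (np.asarray(priors_new) / np.asarray(priors_t))
    return w / w.sum(axis=-1, keepdims=True)

# A balanced output of (0.5, 0.5) under (0.5, 0.5) training priors becomes
# (0.2, 0.8) once new priors of (0.2, 0.8) are taken into account.
adjusted = adjust_posteriors([[0.5, 0.5]], [0.5, 0.5], [0.2, 0.8])
```

Because the correction only rescales and renormalizes, the decision boundary moves exactly as if the classifier had been trained under the new priors, without any refitting.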

If sampling also occurs on the basis of x, the usual sample survey so-
lution to this problem is to use weighted maximum likelihood estimators
with weights inversely proportional to the selection probabilities, which are
supposed to be known (see, e.g., Kish and Frankel, 1974).
Experimental results show that our new procedure based on EM per-
forms better than the standard method (based on the confusion matrix) for
new a priori probability estimation. The results also show that even if the
classifier's output provides imperfect a posteriori probability estimates:

• The EM procedure is able to provide reasonably good estimates of the
new a priori probabilities.

• The classifier with adjusted outputs always performs better than the
original one if the a priori conditions differ from the training set to the
real-world data. The gain in classification accuracy can be significant.

• The classification performances after adjustment by EM are relatively
close to the results obtained by using the true priors (which are un-
known in a real-world situation), even when the a posteriori probabil-
ities are imperfectly estimated.

Additionally, the quality of the estimates does not appear to depend
strongly on the size of the new data set. All these results enable us to relax
to a certain extent the first and third assumptions above.
We also observed that adjusting the outputs of the classifier when not
needed (i.e., when the a priori probabilities of the training set and the real-
world data do not differ) can result in a decrease in classification accuracy.
We therefore showed that a likelihood ratio test can be used in order to de-
cide whether the a priori probabilities have significantly changed from the
training set to the new data set. The readjustment procedure should be applied
only when a significant change of a priori probabilities is found.
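Such a test can be computed directly from the classifier's outputs: the likelihood ratio between candidate new priors (typically the EM estimates) and the training priors reduces to a sum of logs of prior-reweighted posteriors, and the statistic −2 log λ is compared, asymptotically, to a chi-square quantile with n − 1 degrees of freedom (3.84 at the 5% level for two classes). The sketch below is our reconstruction of this idea, not code from the article:

```python
import numpy as np

def lr_statistic(post_t, priors_t, priors_new):
    """-2 log likelihood-ratio statistic for H0: the priors did not change.

    post_t    : (N, n) posteriors output by the classifier on the new data.
    priors_t  : training-set priors (the H0 priors).
    priors_new: alternative priors, typically the EM estimates.
    Compare the returned value to a chi-square quantile with n - 1
    degrees of freedom."""
    ratio = np.asarray(priors_new) / np.asarray(priors_t)
    # Per observation, the likelihood ratio against H0 is sum_i post[k, i] * ratio[i];
    # under H0 (ratio of ones) every term is 1 and the statistic is exactly 0.
    return 2.0 * np.sum(np.log(np.asarray(post_t) @ ratio))
```

When `priors_new` is the maximum likelihood (EM) estimate, the statistic is nonnegative by construction, so only large positive values argue for readjusting the outputs.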

Notice that the EM-based adjustment procedure could be useful in the
context of disease prevalence estimation. In this application, the primary
objective is the estimation of the class proportions in an unlabeled data
set (i.e., the class a priori probabilities); classification accuracy is not important
per se.
Another important problem, also encountered in medicine, concerns the
automatic estimation of the proportions of different cell populations con-
stituting, for example, a smear or a lesion (such as a tumor). Mixed tumors
are composed of two or more cell populations with different lineages, as,
for example, in brain glial tumors (Decaestecker et al., 1997). In this case, a
classifier is trained on a sample of images of reference cells provided from
tumors with a pure lineage (which did not present diagnostic difficulties)
and labeled by experts. When a tumor is suspected to be mixed, the classifier
is applied to a sample of cells from this tumor (a few hundred) in order to
estimate the proportions of the different cell populations. The main motiva-
tion for determining the proportions of the different cell populations
in these mixed tumors is that the different lineage components may signif-
icantly differ with respect to their susceptibility to aggressive progression
and may thus influence patients' prognoses. In this case, the primary goal
is the determination of the proportions of cell populations, corresponding to
the new a priori probabilities.
Another practical use of our readjustment procedure is the automatic
labeling of geographical maps based on remote sensing information. Each
portion of the map has to be labeled according to its nature (e.g., forest,
agricultural zone, urban zone). In this case, the a priori probabilities are
unknown in advance and vary considerably from one image to another,
since they directly depend on the geographical area that has been observed
(e.g., urban area, country area).
We are now actively working on these biomedical and geographical problems.

Appendix: Derivation of the EM Algorithm

Our derivation of the iterative process (see equation 2.9) closely follows the
estimation of mixing proportions of densities (see McLachlan & Krishnan,
1997). Indeed, $p(\mathbf{x} \mid \omega_i)$ can be viewed as a probability density defined by
equation 2.1.

The EM algorithm supposes that there exists a set of unobserved data,
defined here as the class labels of the observations of the new data set. In order
to pose the problem as an incomplete-data one, associated with the new
observed data $X_1^N = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$, we introduce as the unobservable data
$Z_1^N = (\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N)$, where each vector $\mathbf{z}_k$ is associated with one of the $n$
mutually exclusive classes: $\mathbf{z}_k$ represents the class label $\in (\omega_1, \ldots, \omega_n)$ of
the observation $\mathbf{x}_k$. More precisely, each $\mathbf{z}_k$ is defined as an indicator
vector: if $z_{ki}$ is component $i$ of vector $\mathbf{z}_k$, then $z_{ki} = 1$ and $z_{kj} = 0$ for
each $j \neq i$ if and only if the class label associated with observation $\mathbf{x}_k$ is
$\omega_i$. For instance, if the observation $\mathbf{x}_k$ is assigned to class label $\omega_i$, then
$\mathbf{z}_k = [0, \ldots, 0, 1, 0, \ldots, 0]^T$, with the 1 in position $i$.

Let us denote by $\boldsymbol{\pi} = [p(\omega_1), p(\omega_2), \ldots, p(\omega_n)]^T$ the vector of a priori
probabilities (the parameters) to be estimated. The likelihood of the com-
plete data (for the new data set) is

\[
L(X_1^N, Z_1^N \mid \boldsymbol{\pi})
= \prod_{k=1}^{N} \prod_{i=1}^{n} \big[ p(\mathbf{x}_k, \omega_i) \big]^{z_{ki}}
= \prod_{k=1}^{N} \prod_{i=1}^{n} \big[ p(\mathbf{x}_k \mid \omega_i)\, p(\omega_i) \big]^{z_{ki}},
\tag{A.1}
\]

where $p(\mathbf{x}_k \mid \omega_i)$ is constant (it does not depend on the parameter vector $\boldsymbol{\pi}$)
and the $p(\omega_i)$ probabilities are the parameters to be estimated.

The log-likelihood is

\[
l(X_1^N, Z_1^N \mid \boldsymbol{\pi}) = \log \big[ L(X_1^N, Z_1^N \mid \boldsymbol{\pi}) \big]
= \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log \big[ p(\omega_i) \big]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log \big[ p(\mathbf{x}_k \mid \omega_i) \big].
\tag{A.2}
\]

Since the $Z_1^N$ data are unobservable, during the E-step, we replace the
log-likelihood function by its conditional expectation over $p(Z_1^N \mid X_1^N, \boldsymbol{\pi})$:
$E_{Z_1^N}[l \mid X_1^N, \boldsymbol{\pi}]$. Moreover, since we need to know the value of $\boldsymbol{\pi}$ in order to
compute $E_{Z_1^N}[l \mid X_1^N, \boldsymbol{\pi}]$ (the expected log-likelihood), we use, as a current
guess for $\boldsymbol{\pi}$, the current value (at iteration step $s$) of the parameter vector,
$\widehat{\boldsymbol{\pi}}^{(s)} = [\widehat{p}^{(s)}(\omega_1), \widehat{p}^{(s)}(\omega_2), \ldots, \widehat{p}^{(s)}(\omega_n)]^T$:

\[
Q(\boldsymbol{\pi}, \widehat{\boldsymbol{\pi}}^{(s)})
= E_{Z_1^N} \big[ l(X_1^N, Z_1^N \mid \boldsymbol{\pi}) \mid X_1^N, \widehat{\boldsymbol{\pi}}^{(s)} \big]
= \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N} \big[ z_{ki} \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)} \big] \log \big[ p(\omega_i) \big]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N} \big[ z_{ki} \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)} \big] \log \big[ p(\mathbf{x}_k \mid \omega_i) \big],
\tag{A.3}
\]

where we assumed that the complete-data observations $\{(\mathbf{x}_k, \mathbf{z}_k),\ k = 1, \ldots, N\}$
are independent. We obtain for the expectation of the unobservable data

\[
E_{Z_1^N} \big[ z_{ki} \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)} \big]
= 0 \cdot p(z_{ki} = 0 \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)})
+ 1 \cdot p(z_{ki} = 1 \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)})
= p(z_{ki} = 1 \mid \mathbf{x}_k, \widehat{\boldsymbol{\pi}}^{(s)})
= \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k)
= \frac{\dfrac{\widehat{p}^{(s)}(\omega_i)}{\widehat{p}_t(\omega_i)}\, \widehat{p}_t(\omega_i \mid \mathbf{x}_k)}
       {\displaystyle \sum_{j=1}^{n} \dfrac{\widehat{p}^{(s)}(\omega_j)}{\widehat{p}_t(\omega_j)}\, \widehat{p}_t(\omega_j \mid \mathbf{x}_k)},
\tag{A.4}
\]

where we used equation 2.4 at the last step. The expected likelihood is
therefore

\[
Q(\boldsymbol{\pi}, \widehat{\boldsymbol{\pi}}^{(s)})
= \sum_{k=1}^{N} \sum_{i=1}^{n} \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k) \log \big[ p(\omega_i) \big]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k) \log \big[ p(\mathbf{x}_k \mid \omega_i) \big],
\tag{A.5}
\]

where $\widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k)$ is given by equation A.4.

For the M-step, we compute the maximum of $Q(\boldsymbol{\pi}, \widehat{\boldsymbol{\pi}}^{(s)})$ (see equation A.5)
with respect to the parameter vector $\boldsymbol{\pi} = [p(\omega_1), p(\omega_2), \ldots, p(\omega_n)]^T$. The new
estimate at time step $(s+1)$ will therefore be the value of the parameter vector
$\boldsymbol{\pi}$ that maximizes $Q(\boldsymbol{\pi}, \widehat{\boldsymbol{\pi}}^{(s)})$. Since we have the constraint $\sum_{i=1}^{n} p(\omega_i) = 1$,
we define the Lagrange function as

\[
\ell(\boldsymbol{\pi}) = Q(\boldsymbol{\pi}, \widehat{\boldsymbol{\pi}}^{(s)}) + \lambda \Big[ 1 - \sum_{i=1}^{n} p(\omega_i) \Big]
= \sum_{k=1}^{N} \sum_{i=1}^{n} \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k) \log \big[ p(\omega_i) \big]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k) \log \big[ p(\mathbf{x}_k \mid \omega_i) \big]
+ \lambda \Big[ 1 - \sum_{i=1}^{n} p(\omega_i) \Big].
\tag{A.6}
\]

By computing $\dfrac{\partial \ell(\boldsymbol{\pi})}{\partial p(\omega_j)} = 0$, we obtain

\[
\sum_{k=1}^{N} \widehat{p}^{(s)}(\omega_j \mid \mathbf{x}_k) = \lambda\, p(\omega_j)
\tag{A.7}
\]

for $j = 1, \ldots, n$. If we sum this equation over $j$, we obtain the value of the
Lagrange parameter, $\lambda = N$, so that

\[
p(\omega_j) = \frac{1}{N} \sum_{k=1}^{N} \widehat{p}^{(s)}(\omega_j \mid \mathbf{x}_k),
\tag{A.8}
\]

and the next estimate of $p(\omega_i)$ is therefore

\[
\widehat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{k=1}^{N} \widehat{p}^{(s)}(\omega_i \mid \mathbf{x}_k),
\tag{A.9}
\]

so that equations A.4 (E-step) and A.9 (M-step) are repeated until the con-
vergence of the parameter vector $\boldsymbol{\pi}$. The overall procedure is summarized
in equation 2.9. It can be shown that this iterative process increases the
likelihood (see equation 2.6) at each step (see, e.g., Dempster et al., 1977;
McLachlan & Krishnan, 1997).
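The iteration defined by equations A.4 (E-step) and A.9 (M-step) maps onto a few lines of array code. A minimal sketch, with names of our own choosing: `post_t` holds the classifier's training-based posteriors on the new, unlabeled data set, and `priors_t` the training-set priors.

```python
import numpy as np

def em_priors(post_t, priors_t, n_iter=1000, tol=1e-8):
    """EM estimation of the new a priori probabilities (equations A.4, A.9).

    post_t  : (N, n) array; row k holds the posteriors p_t(omega_i | x_k)
              output by a classifier trained with priors priors_t.
    priors_t: (n,) array of training-set priors p_t(omega_i).
    Returns the estimated priors p(omega_i) of the new data set."""
    priors_t = np.asarray(priors_t, dtype=float)
    priors = priors_t.copy()                   # p^(0) := training priors
    for _ in range(n_iter):
        # E-step (A.4): reweight posteriors by the prior ratio, renormalize rows.
        w = post_t * (priors / priors_t)
        post_s = w / w.sum(axis=1, keepdims=True)
        # M-step (A.9): new priors = average of the corrected posteriors.
        new_priors = post_s.mean(axis=0)
        if np.max(np.abs(new_priors - priors)) < tol:
            return new_priors
        priors = new_priors
    return priors
```

On convergence, the row-normalized quantities computed in the E-step are exactly the adjusted posteriors of equation 2.4 evaluated at the estimated priors, so classification under the new conditions comes at no extra cost.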

Acknowledgments

Part of this work was supported by project RBC-BR 216/4041 from the
Région de Bruxelles-Capitale, and funding from the SmalS-MvM. P. L. is
supported by a grant under an Action de Recherche Concertée program of
the Communauté Française de Belgique. C. D. is a research associate with
the FNRS (Belgian National Scientific Research Fund). We also thank the
two anonymous reviewers for their pertinent and constructive remarks.

References

Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning
databases. Irvine, CA: University of California, Department of Informa-
tion and Computer Science. Available online at: http://www.ics.uci.edu/
~mlearn/MLRepository.html.
Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–849.
Castelli, V., & Cover, T. (1995). On the exponential value of labelled samples.
Pattern Recognition Letters, 16, 105–111.
Decaestecker, C., Lopes, M.-B., Gordower, L., Camby, I., Cras, P., Martin, J.-J.,
Kiss, R., VandenBerg, S., & Salmon, I. (1997). Quantitative chromatin pat-
tern description in Feulgen-stained nuclei as a diagnostic tool to characterise
the oligodendroglial and astroglial components in mixed oligo-astrocytomas.
Journal of Neuropathology and Experimental Neurology, 56, 391–402.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm (with discussion). Journal of the Royal Statis-
tical Society B, 39, 1–38.
Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data
via an EM algorithm. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Ad-
vances in neural information processing systems, 6 (pp. 120–127). San Mateo, CA:
Morgan Kaufmann.
Hand, D. (1981). Discrimination and classification. New York: Wiley.
Kish, L., & Frankel, M. (1974). Inference from complex samples (with discussion).
Journal of the Royal Statistical Society B, 36, 1–37.
Lawless, J., Kalbfleisch, J., & Wild, C. (1999). Semiparametric methods for
response-selective and missing data problems in regression. Journal of the
Royal Statistical Society B, 61, 413–438.
McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New
York: Wiley.
McLachlan, G., & Basford, K. (1988). Mixture models, inference and applications to
clustering. New York: Marcel Dekker.
McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York:
Wiley.
Melsa, J., & Cohn, D. (1978). Decision and estimation theory. New York: McGraw-
Hill.
Miller, D., & Uyar, S. (1997). A mixture of experts classifier with learning based on
both labeled and unlabeled data. In M. Mozer, M. Jordan, & T. Petsche (Eds.),
Advances in neural information processing systems, 9 (pp. 571–578). Cambridge,
MA: MIT Press.
Mood, A., Graybill, F., & Boes, D. (1974). Introduction to the theory of statistics (3rd
ed.). New York: McGraw-Hill.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from
labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.).
New York: McGraw-Hill.
Richard, M., & Lippmann, R. (1991). Neural network classifiers estimate
Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Saerens, M. (2000). Building cost functions minimizing to some summary statis-
tics. IEEE Transactions on Neural Networks, 11, 1263–1271.
Scott, A., & Wild, C. (1997). Fitting regression models to case-control data by
maximum likelihood. Biometrika, 84, 57–71.
Shahshahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in
reducing the small sample size problem and mitigating the Hughes phe-
nomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095.
Towell, G. (1996). Using unlabeled data for supervised learning. In D. Touretzky,
M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing
systems, 8 (pp. 647–653). Cambridge, MA: MIT Press.

Received April 19, 2000; accepted March 30, 2001.


Copyright of Neural Computation is the property of MIT Press and its content may not be copied or emailed to
multiple sites or posted to a listserv without the copyright holder's express written permission. However, users
may print, download, or email articles for individual use.
