Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure
Marco Saerens
[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium,
and SmalS-MvM, Research Section, Brussels, Belgium
Patrice Latinne
[email protected]
IRIDIA Laboratory, cp 194/6, UniversitÂe Libre de Bruxelles, B-1050 Brussels, Belgium
Christine Decaestecker
[email protected]
Laboratory of Histopathology, cp 620, Université Libre de Bruxelles, B-1070 Brussels,
Belgium
1 Introduction
alent classifier that relies on the “true” a priori probabilities of the new data set.
2.1 Data Classification. One of the most common uses of data is classification. Suppose that we want to forecast the unknown discrete value of a dependent (or response) variable based on a measurement vector—or observation vector—$x$. This discrete dependent variable takes its value in $V = \{v_1, \ldots, v_n\}$, the set of $n$ class labels.

A training example is therefore a realization of a random feature vector, $x$, measured on an individual and allocated to one of the $n$ classes $v_i \in V$. A training set is a collection of such training examples recorded for
does not change from the training set to the new data set (only the relative
proportion of measurements observed from each class has changed). This is
a natural requirement; it supposes that we choose the training set examples
only on the basis of the class labels vi , not on the basis of x. We also assume
that we have an estimate of the new a priori probabilities, $\hat{p}(v_i)$.
Suppose now that we are working on a new data set to be scored. Bayes' theorem provides

$$\hat{p}_t(x \mid v_i) = \frac{\hat{p}_t(v_i \mid x)\,\hat{p}_t(x)}{\hat{p}_t(v_i)}, \qquad (2.1)$$

$$\hat{p}(x \mid v_i) = \frac{\hat{p}(v_i \mid x)\,\hat{p}(x)}{\hat{p}(v_i)}, \qquad (2.2)$$

where the subscript $t$ refers to estimates computed on the training set. Since the within-class densities do not change from the training set to the new data set, $\hat{p}(x \mid v_i) = \hat{p}_t(x \mid v_i)$, and equating equations 2.1 and 2.2 gives

$$\hat{p}(v_i \mid x) = f(x)\,\frac{\hat{p}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x), \qquad (2.3)$$

where $f(x) = \hat{p}_t(x)/\hat{p}(x)$ does not depend on the class label.
Since $\sum_{i=1}^{n} \hat{p}(v_i \mid x) = 1$, we easily obtain

$$f(x) = \left[\sum_{j=1}^{n} \frac{\hat{p}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x)\right]^{-1},$$

and consequently
$$\hat{p}(v_i \mid x) = \frac{\dfrac{\hat{p}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x)}. \qquad (2.4)$$
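To make the correction of equation 2.4 concrete, here is a minimal NumPy sketch; the function name `adjust_posteriors` and the array layout (one row of posterior estimates per observation) are our own illustrative choices, not part of the original text.

```python
import numpy as np

def adjust_posteriors(post_train, priors_train, priors_new):
    """Correct a posteriori probabilities for new a priori probabilities (equation 2.4).

    post_train   : (N, n) array of p_t(v_i | x_k), the outputs of the classifier
                   trained under the training-set priors.
    priors_train : (n,) array of training priors p_t(v_i).
    priors_new   : (n,) array of new priors p(v_i) assumed for the data to be scored.
    """
    # Reweight each posterior by the ratio of new to training priors ...
    weighted = post_train * (priors_new / priors_train)
    # ... and renormalize each row so the corrected posteriors sum to one.
    return weighted / weighted.sum(axis=1, keepdims=True)

# Two-class example: balanced training priors, but class 2 assumed four times
# more frequent than class 1 in the new data set.
post = np.array([[0.7, 0.3], [0.4, 0.6]])
print(adjust_posteriors(post, np.array([0.5, 0.5]), np.array([0.2, 0.8])))
```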
p(x), in a new data set to be scored by the model. The likelihood of these new observations is defined as
$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} p(x_k) = \prod_{k=1}^{N}\left[\sum_{i=1}^{n} p(x_k, v_i)\right] = \prod_{k=1}^{N}\left[\sum_{i=1}^{n} p(x_k \mid v_i)\,p(v_i)\right]. \qquad (2.6)$$
$$\hat{p}_t(v_i \mid x_k) = g_i(x_k), \qquad (2.7)$$

$$\hat{p}_t(v_i) = \frac{N_{t_i}}{N_t}, \qquad (2.8)$$

where $g_i(x_k)$ denotes the $i$th output of the trained model for observation $x_k$, and $N_{t_i}$ is the number of training examples belonging to class $v_i$ among the $N_t$ training examples.
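As a small illustration of how the quantities in equations 2.7 and 2.8 might be obtained in practice, the sketch below trains a hypothetical probabilistic classifier on a toy two-class problem; `LogisticRegression` simply stands in for any model whose outputs approximate the training-set posteriors, and all names and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two Gaussian classes, balanced in the training set.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X_train, y_train)

# New data to be scored, where the second class is in fact much more frequent.
X_new = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (80, 2))])

# Equation 2.7: the model outputs g_i(x_k) serve as the estimates p_t(v_i | x_k).
post_train = clf.predict_proba(X_new)

# Equation 2.8: training priors estimated by the class frequencies N_{t_i} / N_t.
priors_train = np.bincount(y_train) / len(y_train)
```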
Let us define as $\hat{p}^{(s)}(v_i)$ and $\hat{p}^{(s)}(v_i \mid x_k)$ the estimations of the new a priori and a posteriori probabilities at step $s$ of the iterative procedure. If the $\hat{p}^{(s)}(v_i)$ are initialized by the frequencies of the classes in the training set (see equation 2.8), the EM algorithm provides the following iterative steps (see the appendix) for each new observation $x_k$ and each class $v_i$:
$$\hat{p}^{(0)}(v_i) = \hat{p}_t(v_i)$$

$$\hat{p}^{(s)}(v_i \mid x_k) = \frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x_k)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x_k)}$$

$$\hat{p}^{(s+1)}(v_i) = \frac{1}{N}\sum_{k=1}^{N} \hat{p}^{(s)}(v_i \mid x_k), \qquad (2.9)$$
where $\hat{p}_t(v_i \mid x_k)$ and $\hat{p}_t(v_i)$ are given by equations 2.7 and 2.8. Notice the similarity between equations 2.4 and 2.9. At each iteration step $s$, both the a posteriori probabilities $\hat{p}^{(s)}(v_i \mid x_k)$ and the a priori probabilities $\hat{p}^{(s)}(v_i)$ are reestimated sequentially for each new observation $x_k$ and each class $v_i$. The iterative procedure proceeds until the convergence of the estimated probabilities $\hat{p}^{(s)}(v_i)$.
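The iteration in equation 2.9 can be sketched in a few lines of NumPy; the function name `em_priors`, the tolerance, and the iteration cap are illustrative choices rather than part of the paper.

```python
import numpy as np

def em_priors(post_train, priors_train, tol=1e-6, max_iter=1000):
    """EM estimation of the new priors p(v_i) (equation 2.9), from the classifier
    outputs post_train[k, i] ~ p_t(v_i | x_k) and the training priors p_t(v_i)."""
    priors = priors_train.copy()                     # p^(0)(v_i) = p_t(v_i)
    for _ in range(max_iter):
        # E-step: adjusted posteriors p^(s)(v_i | x_k), same form as equation 2.4.
        weighted = post_train * (priors / priors_train)
        post_new = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new priors as the average of the adjusted posteriors.
        priors_next = post_new.mean(axis=0)
        if np.abs(priors_next - priors).max() < tol:
            break
        priors = priors_next
    return priors_next, post_new

# Continuing the previous sketch:
# new_priors, adjusted_post = em_priors(post_train, priors_train)
```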
Of course, if we have some a priori knowledge about the values of the prior probabilities, we can take these starting values for the initialization of the $\hat{p}^{(0)}(v_i)$. Notice also that although we did not encounter this problem in our simulations, we must keep in mind that local maxima problems potentially may occur (the EM algorithm finds a local maximum of the likelihood function).
In order to obtain good a priori probability estimates, it is necessary that
the a posteriori probabilities relative to the training set are reasonably well
approximated (i.e., sufficiently well estimated by the model). The robust-
ness of the EM procedure with respect to imperfect a posteriori probability
estimates will be investigated in the experimental section (section 4).
In this section, we show that the likelihood ratio test can be used in order to decide if the a priori probabilities have significantly changed from the training set to the new data set. Before adjusting the a priori probabilities (when the trained classification model is simply applied to the new data), the likelihood of the new observations is
$$L_t(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}_t(x_k) = \prod_{k=1}^{N}\left[\frac{\hat{p}(x_k \mid v_i)\,\hat{p}_t(v_i)}{\hat{p}_t(v_i \mid x_k)}\right], \qquad (3.1)$$

whatever the class label $v_i$, and where we used the fact that $p_t(x_k \mid v_i) = p(x_k \mid v_i)$.
After the adjustment of the a priori and a posteriori probabilities, we compute the likelihood in the same way:

$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}(x_k) = \prod_{k=1}^{N}\left[\frac{\hat{p}(x_k \mid v_i)\,\hat{p}(v_i)}{\hat{p}(v_i \mid x_k)}\right]. \qquad (3.2)$$
The log-likelihood ratio is therefore

$$\log\left[\frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)}\right] = \sum_{k=1}^{N}\log\left[\hat{p}_t(v_i \mid x_k)\right] - \sum_{k=1}^{N}\log\left[\hat{p}(v_i \mid x_k)\right] + N\log\left[\hat{p}(v_i)\right] - N\log\left[\hat{p}_t(v_i)\right]. \qquad (3.4)$$
From standard statistical inference (see, e.g., Mood, Graybill, & Boes, 1974; Papoulis, 1991), $2\log\left[L(x_1, x_2, \ldots, x_N)/L_t(x_1, x_2, \ldots, x_N)\right]$ is asymptotically distributed as a chi square with $(n-1)$ degrees of freedom ($\chi^2_{(n-1)}$, where $n$ is the number of classes). Indeed, since $\sum_{i=1}^{n} \hat{p}(v_i) = 1$, there are only $(n-1)$ degrees of freedom. This allows us to test if the new a priori probabilities differ significantly from the original ones and thus to decide if the a posteriori probabilities (i.e., the model outputs) need to be corrected. Notice also that standard errors on the estimated a priori probabilities can be obtained through the computation of the observed information matrix, as detailed in McLachlan and Krishnan (1997).
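A hedged sketch of this decision rule, reusing the illustrative `em_priors` output from the earlier snippet: it evaluates the statistic $2\log(L/L_t)$ through equation 3.4 (any fixed class label gives the same value) and compares it with a $\chi^2_{n-1}$ quantile using SciPy.

```python
import numpy as np
from scipy.stats import chi2

def priors_changed(post_train, priors_train, priors_new, post_new, alpha=0.05):
    """Likelihood ratio test of equation 3.4: have the a priori probabilities changed?

    post_train : (N, n) posteriors p_t(v_i | x_k) from the trained classifier.
    post_new   : (N, n) adjusted posteriors p(v_i | x_k) (equation 2.4 / E-step).
    priors_train, priors_new : training priors and estimated new priors.
    """
    i = 0  # equation 3.4 holds for any fixed class label v_i
    n_obs = post_train.shape[0]
    log_ratio = (np.log(post_train[:, i]).sum() - np.log(post_new[:, i]).sum()
                 + n_obs * (np.log(priors_new[i]) - np.log(priors_train[i])))
    stat = 2.0 * log_ratio
    dof = len(priors_train) - 1
    # Reject "priors unchanged" when the statistic exceeds the chi-square quantile.
    return stat > chi2.ppf(1.0 - alpha, dof), stat

# changed, stat = priors_changed(post_train, priors_train, new_priors, adjusted_post)
```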
4 Experimental Evaluation
1 Available online at http://www.cs.utoronto.ca/~delve/data/datasets.html.
Table 3: Averaged Results for Estimation of the Priors, Ringnorm Data Set, Averaged on 10 Runs.
[Figure 1 appears here. Vertical axis: average classification rate in each condition (80% to 100%). Horizontal axis: the 16 combinations of training and test set sizes, from #Training = (50, 50), #Test = (20, 80) to #Training = (500, 500), #Test = (200, 800). One curve per condition: no adjustment, after adjustment by EM, after adjustment by confusion matrix, and using true priors.]
Figure 1: Classification rates obtained on the Ringnorm data set. Results are reported for four different conditions: (1) without adjusting the classifier output (no adjustment); (2) adjusting the classifier output by using the confusion matrix method (after adjustment by confusion matrix); (3) adjusting the classifier output by using the EM algorithm (after adjustment by EM); and (4) adjusting the classifier output by using the true a priori probability of the new data (using true priors). The results are plotted as a function of the different sizes of both the training and the test sets.
classifier performances due to the decrease in the size of the training sets: a loss of about 8% between large (i.e., $N_t = 1000$) and small (i.e., $N_t = 100$) training set sizes. The classification rates obtained after the adjustments made by the confusion matrix method are very close to those obtained with the EM method. In fact, the EM method almost always (15 times out of the 16 conditions) provided better results, but the differences in accuracy between the two methods are very small (0.3% on average). As already observed in the first experiment (see Table 2), the classification rates obtained after adjustment by the EM or the confusion matrix method are very close to those obtained by using the true a priori probabilities (a difference of 0.2% on average). Finally, we clearly observe (see Figure 1) that by adjusting the outputs of the classifier, we always increased classification accuracy significantly.
4.3 Tests on Real Data. We also tested the a priori estimation and output readjustment method on three real medical data sets from the UCI repository (Blake, Keogh, & Merz, 1998) in order to confirm our results on more realistic data. These data are Pima Indian Diabetes (2 classes of 268 and 500 cases, 8 features), Breast Cancer Wisconsin (2 classes of 239 and 444 cases after omission of the 16 cases with missing values, 9 features), and Bupa Liver Disorders (2 classes of 145 and 200 cases, 6 features). A training set of 50 cases of each class was selected in each data set and used for training a multilayer neural network; the remaining cases were used to select an independent test set. In order to increase the difference between the class distributions in the training (0.50, 0.50) and the test sets, we omitted a number of cases from the smallest class in order to obtain a class distribution of ($p(v_1) = 0.20$, $p(v_2) = 0.80$) for the test set. Ten different selections of training and test sets were carried out, and for each of them, the training phase was replicated 10 times, giving a total of 100 trained neural networks for each data set.
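The test-set construction described above (dropping cases from the smaller class until the test priors reach roughly 0.20/0.80) could be sketched as follows; this is our own illustrative resampling helper, not the authors' exact selection procedure.

```python
import numpy as np

def subsample_to_priors(X, y, target_priors, seed=0):
    """Drop cases from over-represented classes so that the empirical class
    distribution of (X, y) approximately matches target_priors, e.g. [0.2, 0.8]."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = np.asarray(target_priors, dtype=float)
    # Largest total size such that target * size fits within the available counts.
    size = int(np.floor((counts / target).min()))
    keep = []
    for c, p in zip(classes, target):
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=int(round(p * size)), replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```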
On average over the 100 experiments, Table 4 details the a priori probabilities estimated by means of the EM and the confusion matrix methods, as well as the classification rates before and after the probability adjustments.

Table 4: A Priori Probabilities of the Smallest Class (True and Estimated) and Classification Rates on the Three Real Data Sets.

                    A Priori Probability           Classification Rate
Data Set     True    EM     Confusion    No          After   After Confusion   Using True
                            Matrix       Adjustment  EM      Matrix            Priors
Diabetes     20%     24.8%  31.3%        67.4%       76.3%   74.4%             78.3%
Breast       20      18.0   26.2         91.3        92.0    92.1              92.6
Liver        20      24.6   21.5         68.0        75.7    75.5              79.1

Note: The neural network has been trained on a learning set with a priori probabilities of (50%, 50%).

These results show that the EM prior estimates were generally better than the confusion matrix ones (except for the Liver data). Moreover, adjusting the classifier outputs on the basis of the new a priori probabilities always increased classification rates and provided accuracy levels not too far from those obtained by using the true priors for adjusting the outputs (given in the last column of Table 4). However, except for the Diabetes data, for which EM gave better results, the adjustments made on the basis of the EM and the confusion matrix methods seemed to have the same effect on the accuracy improvement.
5 Related Work
6 Conclusion
assumptions:
If sampling also occurs on the basis of x, the usual sample survey so-
lution to this problem is to use weighted maximum likelihood estimators
with weights inversely proportional to the selection probabilities, which are
supposed to be known (see, e.g., Kish and Frankel, 1974).
Experimental results show that our new procedure based on EM per-
forms better than the standard method (based on the confusion matrix) for
new a priori probability estimation. The results also show that even if the
classifier's output provides imperfect a posteriori probability estimates,
Appendix

Our derivation of the iterative process (see equation 2.9) closely follows the estimation of mixing proportions of densities (see McLachlan & Krishnan, 1997). Indeed, $p(x \mid v_i)$ can be viewed as a probability density defined by equation 2.1.

The EM algorithm supposes that there exists a set of unobserved data, defined here as the class labels of the observations of the new data set. In order to pose the problem as an incomplete-data one, associated with the new observed data, $X_1^N = (x_1, x_2, \ldots, x_N)$, we introduce as the unobservable data $Z_1^N = (z_1, z_2, \ldots, z_N)$, where each vector $z_k$ is associated with one of the $n$ mutually exclusive classes: $z_k$ will represent the class label $\in \{v_1, \ldots, v_n\}$ of the observation $x_k$. More precisely, each $z_k$ will be defined as an indicator vector, with component $z_{ki} = 1$ if the observation $x_k$ belongs to class $v_i$ and $z_{ki} = 0$ otherwise. Denoting by $\pi$ the vector of parameters to be estimated, the complete-data likelihood is
$$L(X_1^N, Z_1^N \mid \pi) = \prod_{k=1}^{N}\prod_{i=1}^{n}\left[p(x_k, v_i)\right]^{z_{ki}} = \prod_{k=1}^{N}\prod_{i=1}^{n}\left[p(x_k \mid v_i)\,p(v_i)\right]^{z_{ki}}, \qquad (A.1)$$

where $p(x_k \mid v_i)$ is constant (it does not depend on the parameter vector $\pi$) and the $p(v_i)$ probabilities are the parameters to be estimated.
The log-likelihood is

$$l(X_1^N, Z_1^N \mid \pi) = \log\left[L(X_1^N, Z_1^N \mid \pi)\right] = \sum_{k=1}^{N}\sum_{i=1}^{n} z_{ki}\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} z_{ki}\log\left[p(x_k \mid v_i)\right]. \qquad (A.2)$$
Since the $Z_1^N$ data are unobservable, during the E-step, we replace the log-likelihood function by its conditional expectation over $p(Z_1^N \mid X_1^N, \pi)$: $E_{Z_1^N}[l \mid X_1^N, \pi]$. Moreover, since we need to know the value of $\pi$ in order to compute this expectation, we evaluate it at the current parameter estimate $\hat{\pi}^{(s)}$:

$$E_{Z_1^N}\left[l \mid X_1^N, \hat{\pi}^{(s)}\right] = \sum_{k=1}^{N}\sum_{i=1}^{n} E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right]\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right]\log\left[p(x_k \mid v_i)\right]. \qquad (A.3)$$

The conditional expectation of each indicator variable is
$$E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right] = p\left(z_{ki} = 1 \mid x_k, \hat{\pi}^{(s)}\right) = \hat{p}^{(s)}(v_i \mid x_k) = \frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x_k)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x_k)}, \qquad (A.4)$$
where we used equation 2.4 at the last step. The expected likelihood is therefore

$$Q(\pi, \hat{\pi}^{(s)}) = \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(x_k \mid v_i)\right]. \qquad (A.5)$$
During the M-step, we maximize $Q(\pi, \hat{\pi}^{(s)})$ subject to the constraint $\sum_{i=1}^{n} p(v_i) = 1$. Introducing a Lagrange multiplier $\lambda$, we define the Lagrangian

$$\ell(\pi) = \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(x_k \mid v_i)\right] + \lambda\left[1 - \sum_{i=1}^{n} p(v_i)\right]. \qquad (A.6)$$
By computing $\dfrac{\partial \ell(\pi)}{\partial p(v_j)} = 0$, we obtain

$$\sum_{k=1}^{N} \hat{p}^{(s)}(v_j \mid x_k) = \lambda\, p(v_j). \qquad (A.7)$$

Summing equation A.7 over $j$, and using $\sum_{j=1}^{n} p(v_j) = 1$ together with $\sum_{j=1}^{n} \hat{p}^{(s)}(v_j \mid x_k) = 1$, yields

$$\lambda = N, \qquad (A.8)$$

and therefore the reestimation formula

$$\hat{p}^{(s+1)}(v_i) = \frac{1}{N}\sum_{k=1}^{N} \hat{p}^{(s)}(v_i \mid x_k), \qquad (A.9)$$
so that equations A.4 (E-step) and A.9 (M-step) are repeated until the convergence of the parameter vector $\pi$. The overall procedure is summarized
in equation 2.9. It can be shown that this iterative process increases the
likelihood (see equation 2.6) at each step (see, e.g., Dempster et al., 1977;
McLachlan & Krishnan, 1997).
Acknowledgments
Part of this work was supported by project RBC-BR 216/4041 from the Région de Bruxelles-Capitale, and funding from the SmalS-MvM. P. L. is supported by a grant under an Action de Recherche Concertée program of the Communauté Française de Belgique. C. D. is a research associate with the FNRS (Belgian National Scientific Research Fund). We also thank the
two anonymous reviewers for their pertinent and constructive remarks.
References
Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning
databases. Irvine, CA: University of California, Department of Informa-
tion and Computer Science. Available online at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–849.
Castelli, V., & Cover, T. (1995). On the exponential value of labelled samples.
Pattern Recognition Letters, 16, 105–111.
Decaestecker, C., Lopes, M.-B., Gordower, L., Camby, I., Cras, P., Martin, J.-J.,
Kiss, R., VandenBerg, S., & Salmon, I. (1997). Quantitative chromatin pat-
tern description in Feulgen-stained nuclei as a diagnostic tool to characterise
the oligodendroglial and astroglial components in mixed oligoastrocytomas.
Journal of Neuropathology and Experimental Neurology, 56, 391–402.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm (with discussion). Journal of the Royal Statis-
tical Society B, 39, 1–38.
Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data
via an EM algorithm. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Ad-
vances in neural information processing systems, 6 (pp. 120–127). San Mateo, CA:
Morgan Kaufmann.
Hand, D. (1981). Discrimination and classification. New York: Wiley.
Kish, L., & Frankel, M. (1974). Inference from complex samples (with discussion).
Journal of the Royal Statistical Society B, 36, 1–37.
Lawless, J., Kalbfleisch, J., & Wild, C. (1999). Semiparametric methods for
response-selective and missing data problems in regression. Journal of the
Royal Statistical Society B, 61, 413–438.
McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New
York: Wiley.
McLachlan, G., & Basford, K. (1988). Mixture models, inference and applications to
clustering. New York: Marcel Dekker.
McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York:
Wiley.
Melsa, J., & Cohn, D. (1978). Decision and estimation theory. New York: McGraw-
Hill.
Miller, D., & Uyar, S. (1997). A mixture of experts classifier with learning based on
both labeled and unlabeled data. In M. Mozer, M. Jordan, & T. Petsche (Eds.),
Advances in neural information processing systems, 9 (pp. 571–578). Cambridge,
MA: MIT Press.
Mood, A., Graybill, F., & Boes, D. (1974). Introduction to the theory of statistics (3rd
ed.). New York: McGraw-Hill.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from
labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.),
New York: McGraw-Hill.
Richard, M., & Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Saerens, M. (2000). Building cost functions minimizing to some summary statis-
tics. IEEE Transactions on Neural Networks, 11, 1263–1271.
Scott, A., & Wild, C. (1997). Fitting regression models to case-control data by
maximum likelihood. Biometrika, 84, 57–71.
Shahshahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095.
Towell, G. (1996). Using unlabeled data for supervised learning. In D. Touretzky,
M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing
systems, 8 (pp. 647–653). Cambridge, MA: MIT Press.