Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure
Marco Saerens
[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium,
and SmalS-MvM, Research Section, Brussels, Belgium
Patrice Latinne
[email protected]
IRIDIA Laboratory, cp 194/6, UniversitÂe Libre de Bruxelles, B-1050 Brussels, Belgium
Christine Decaestecker
[email protected]
Laboratory of Histopathology, cp 620, Université Libre de Bruxelles, B-1070 Brussels,
Belgium
1 Introduction
alent classifier that relies on the “true” a priori probabilities of the new data set.
2.1 Data Classification. One of the most common uses of data is classification. Suppose that we want to forecast the unknown discrete value of a dependent (or response) variable based on a measurement vector—or observation vector—$x$. This discrete dependent variable takes its value in $V = \{v_1, \ldots, v_n\}$, the set of $n$ class labels.

A training example is therefore a realization of a random feature vector, $x$, measured on an individual and allocated to one of the $n$ classes $v_i \in V$. A training set is a collection of such training examples recorded for
does not change from the training set to the new data set (only the relative
proportion of measurements observed from each class has changed). This is
a natural requirement; it supposes that we choose the training set examples
only on the basis of the class labels vi , not on the basis of x. We also assume
that we have an estimate of the new a priori probabilities, $\hat{p}(v_i)$.
Suppose now that we are working on a new data set to be scored. Bayes' theorem provides

$$\hat{p}_t(x \mid v_i) = \frac{\hat{p}_t(v_i \mid x)\,\hat{p}_t(x)}{\hat{p}_t(v_i)}, \qquad (2.1)$$

$$\hat{p}(x \mid v_i) = \frac{\hat{p}(v_i \mid x)\,\hat{p}(x)}{\hat{p}(v_i)}, \qquad (2.2)$$

where the subscript $t$ refers to estimates computed on the training set. Since the within-class densities do not change from the training set to the new data set, $\hat{p}(x \mid v_i) = \hat{p}_t(x \mid v_i)$, and equating equations 2.1 and 2.2 gives

$$\hat{p}(v_i \mid x) = f(x)\,\frac{\hat{p}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x), \qquad (2.3)$$

where $f(x) = \hat{p}_t(x)/\hat{p}(x)$ does not depend on the class label.
Since $\sum_{i=1}^{n} \hat{p}(v_i \mid x) = 1$, we easily obtain

$$f(x) = \left[\sum_{j=1}^{n} \frac{\hat{p}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x)\right]^{-1},$$

and consequently
$$\hat{p}(v_i \mid x) = \frac{\dfrac{\hat{p}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x)}. \qquad (2.4)$$
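To make the correction of equation 2.4 concrete, here is a minimal NumPy sketch; the function name `adjust_posteriors` and the array layout (one row of posterior estimates per observation) are our own illustrative choices, not part of the original text.

```python
import numpy as np

def adjust_posteriors(post_train, priors_train, priors_new):
    """Correct a posteriori probabilities for new a priori probabilities (equation 2.4).

    post_train   : (N, n) array of p_t(v_i | x_k), the outputs of the classifier
                   trained under the training-set priors.
    priors_train : (n,) array of training priors p_t(v_i).
    priors_new   : (n,) array of new priors p(v_i) assumed for the data to be scored.
    """
    # Reweight each posterior by the ratio of new to training priors ...
    weighted = post_train * (priors_new / priors_train)
    # ... and renormalize each row so the corrected posteriors sum to one.
    return weighted / weighted.sum(axis=1, keepdims=True)

# Two-class example: balanced training priors, but class 2 assumed four times
# more frequent than class 1 in the new data set.
post = np.array([[0.7, 0.3], [0.4, 0.6]])
print(adjust_posteriors(post, np.array([0.5, 0.5]), np.array([0.2, 0.8])))
```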
p(x), in a new data set to be scored by the model. The likelihood of these new observations is defined as
$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} p(x_k) = \prod_{k=1}^{N}\left[\sum_{i=1}^{n} p(x_k, v_i)\right] = \prod_{k=1}^{N}\left[\sum_{i=1}^{n} p(x_k \mid v_i)\,p(v_i)\right]. \qquad (2.6)$$
$$\hat{p}_t(v_i \mid x_k) = g_i(x_k), \qquad (2.7)$$

$$\hat{p}_t(v_i) = \frac{N_{t_i}}{N_t}, \qquad (2.8)$$

where $g_i(x_k)$ denotes the $i$th output of the trained model for observation $x_k$, and $N_{t_i}$ is the number of training examples belonging to class $v_i$ among the $N_t$ training examples.
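As a small illustration of how the quantities in equations 2.7 and 2.8 might be obtained in practice, the sketch below trains a hypothetical probabilistic classifier on a toy two-class problem; `LogisticRegression` simply stands in for any model whose outputs approximate the training-set posteriors, and all names and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two Gaussian classes, balanced in the training set.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X_train, y_train)

# New data to be scored, where the second class is in fact much more frequent.
X_new = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (80, 2))])

# Equation 2.7: the model outputs g_i(x_k) serve as the estimates p_t(v_i | x_k).
post_train = clf.predict_proba(X_new)

# Equation 2.8: training priors estimated by the class frequencies N_{t_i} / N_t.
priors_train = np.bincount(y_train) / len(y_train)
```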
Let us define as $\hat{p}^{(s)}(v_i)$ and $\hat{p}^{(s)}(v_i \mid x_k)$ the estimations of the new a priori and a posteriori probabilities at step $s$ of the iterative procedure. If the $\hat{p}^{(s)}(v_i)$ are initialized by the frequencies of the classes in the training set (see equation 2.8), the EM algorithm provides the following iterative steps (see the appendix) for each new observation $x_k$ and each class $v_i$:
$$\hat{p}^{(0)}(v_i) = \hat{p}_t(v_i)$$

$$\hat{p}^{(s)}(v_i \mid x_k) = \frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x_k)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x_k)}$$

$$\hat{p}^{(s+1)}(v_i) = \frac{1}{N}\sum_{k=1}^{N} \hat{p}^{(s)}(v_i \mid x_k), \qquad (2.9)$$
where $\hat{p}_t(v_i \mid x_k)$ and $\hat{p}_t(v_i)$ are given by equations 2.7 and 2.8. Notice the similarity between equations 2.4 and 2.9. At each iteration step $s$, both the a posteriori probabilities $\hat{p}^{(s)}(v_i \mid x_k)$ and the a priori probabilities $\hat{p}^{(s)}(v_i)$ are reestimated sequentially for each new observation $x_k$ and each class $v_i$. The iterative procedure proceeds until the convergence of the estimated probabilities $\hat{p}^{(s)}(v_i)$.
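The iteration in equation 2.9 can be sketched in a few lines of NumPy; the function name `em_priors`, the tolerance, and the iteration cap are illustrative choices rather than part of the paper.

```python
import numpy as np

def em_priors(post_train, priors_train, tol=1e-6, max_iter=1000):
    """EM estimation of the new priors p(v_i) (equation 2.9), from the classifier
    outputs post_train[k, i] ~ p_t(v_i | x_k) and the training priors p_t(v_i)."""
    priors = priors_train.copy()                     # p^(0)(v_i) = p_t(v_i)
    for _ in range(max_iter):
        # E-step: adjusted posteriors p^(s)(v_i | x_k), same form as equation 2.4.
        weighted = post_train * (priors / priors_train)
        post_new = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new priors as the average of the adjusted posteriors.
        priors_next = post_new.mean(axis=0)
        if np.abs(priors_next - priors).max() < tol:
            break
        priors = priors_next
    return priors_next, post_new

# Continuing the previous sketch:
# new_priors, adjusted_post = em_priors(post_train, priors_train)
```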
Of course, if we have some a priori knowledge about the values of the prior probabilities, we can take these starting values for the initialization of the $\hat{p}^{(0)}(v_i)$. Notice also that although we did not encounter this problem in our simulations, we must keep in mind that local maxima problems potentially may occur (the EM algorithm finds a local maximum of the likelihood function).
In order to obtain good a priori probability estimates, it is necessary that
the a posteriori probabilities relative to the training set are reasonably well
approximated (i.e., sufficiently well estimated by the model). The robust-
ness of the EM procedure with respect to imperfect a posteriori probability
estimates will be investigated in the experimental section (section 4).
In this section, we show that the likelihood ratio test can be used in order to decide if the a priori probabilities have significantly changed from the training set to the new data set. Before adjusting the a priori probabilities (when the trained classification model is simply applied to the new data), the likelihood of the new observations is
$$L_t(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}_t(x_k) = \prod_{k=1}^{N}\left[\frac{\hat{p}(x_k \mid v_i)\,\hat{p}_t(v_i)}{\hat{p}_t(v_i \mid x_k)}\right], \qquad (3.1)$$

whatever the class label $v_i$, and where we used the fact that $p_t(x_k \mid v_i) = p(x_k \mid v_i)$.
After the adjustment of the a priori and a posteriori probabilities, we compute the likelihood in the same way:

$$L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}(x_k) = \prod_{k=1}^{N}\left[\frac{\hat{p}(x_k \mid v_i)\,\hat{p}(v_i)}{\hat{p}(v_i \mid x_k)}\right]. \qquad (3.2)$$
The log-likelihood ratio is therefore

$$\log\left[\frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)}\right] = \sum_{k=1}^{N}\log\left[\hat{p}_t(v_i \mid x_k)\right] - \sum_{k=1}^{N}\log\left[\hat{p}(v_i \mid x_k)\right] + N\log\left[\hat{p}(v_i)\right] - N\log\left[\hat{p}_t(v_i)\right]. \qquad (3.4)$$
From standard statistical inference (see, e.g., Mood, Graybill, & Boes, 1974; Papoulis, 1991), $2\log\left[L(x_1, x_2, \ldots, x_N)/L_t(x_1, x_2, \ldots, x_N)\right]$ is asymptotically distributed as a chi square with $(n-1)$ degrees of freedom ($\chi^2_{(n-1)}$, where $n$ is the number of classes). Indeed, since $\sum_{i=1}^{n} \hat{p}(v_i) = 1$, there are only $(n-1)$ degrees of freedom. This allows us to test if the new a priori probabilities differ significantly from the original ones and thus to decide if the a posteriori probabilities (i.e., the model outputs) need to be corrected. Notice also that standard errors on the estimated a priori probabilities can be obtained through the computation of the observed information matrix, as detailed in McLachlan and Krishnan (1997).
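A hedged sketch of this decision rule, reusing the illustrative `em_priors` output from the earlier snippet: it evaluates the statistic $2\log(L/L_t)$ through equation 3.4 (any fixed class label gives the same value) and compares it with a $\chi^2_{n-1}$ quantile using SciPy.

```python
import numpy as np
from scipy.stats import chi2

def priors_changed(post_train, priors_train, priors_new, post_new, alpha=0.05):
    """Likelihood ratio test of equation 3.4: have the a priori probabilities changed?

    post_train : (N, n) posteriors p_t(v_i | x_k) from the trained classifier.
    post_new   : (N, n) adjusted posteriors p(v_i | x_k) (equation 2.4 / E-step).
    priors_train, priors_new : training priors and estimated new priors.
    """
    i = 0  # equation 3.4 holds for any fixed class label v_i
    n_obs = post_train.shape[0]
    log_ratio = (np.log(post_train[:, i]).sum() - np.log(post_new[:, i]).sum()
                 + n_obs * (np.log(priors_new[i]) - np.log(priors_train[i])))
    stat = 2.0 * log_ratio
    dof = len(priors_train) - 1
    # Reject "priors unchanged" when the statistic exceeds the chi-square quantile.
    return stat > chi2.ppf(1.0 - alpha, dof), stat

# changed, stat = priors_changed(post_train, priors_train, new_priors, adjusted_post)
```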
4 Experimental Evaluation
1 Available online at http://www.cs.utoronto.ca/~delve/data/datasets.html.
Table 3: Averaged Results for Estimation of the Priors, Ringnorm Data Set, Averaged on 10 Runs.
[Figure 1 appears here. Vertical axis: average classification rate in each condition (80% to 100%). Horizontal axis: the 16 combinations of training and test set sizes, from #Training = (50, 50), #Test = (20, 80) to #Training = (500, 500), #Test = (200, 800). One curve per condition: no adjustment, after adjustment by EM, after adjustment by confusion matrix, and using true priors.]
Figure 1: Classification rates obtained on the Ringnorm data set. Results are reported for four different conditions: (1) without adjusting the classifier output (no adjustment); (2) adjusting the classifier output by using the confusion matrix method (after adjustment by confusion matrix); (3) adjusting the classifier output by using the EM algorithm (after adjustment by EM); and (4) adjusting the classifier output by using the true a priori probability of the new data (using true priors). The results are plotted as a function of the different sizes of both the training and the test sets.
classifier performances due to the decrease in the size of the training sets: a loss of about 8% between large (i.e., $N_t = 1000$) and small (i.e., $N_t = 100$) training set sizes. The classification rates obtained after the adjustments made by the confusion matrix method are very close to those obtained with the EM method. In fact, the EM method almost always (15 times out of the 16 conditions) provided better results, but the differences in accuracy between the two methods are very small (0.3% on average). As already observed in the first experiment (see Table 2), the classification rates obtained after adjustment by the EM or the confusion matrix method are very close to those obtained by using the true a priori probabilities (a difference of 0.2% on average). Finally, we clearly observe (see Figure 1) that by adjusting the outputs of the classifier, we always increased classification accuracy significantly.
4.3 Tests on Real Data. We also tested the a priori estimation and output readjustment method on three real medical data sets from the UCI repository (Blake, Keogh, & Merz, 1998) in order to confirm our results on more realistic data. These data are Pima Indian Diabetes (2 classes of 268 and 500 cases, 8 features), Breast Cancer Wisconsin (2 classes of 239 and 444 cases after omission of the 16 cases with missing values, 9 features), and Bupa Liver Disorders (2 classes of 145 and 200 cases, 6 features). A training set of 50 cases of each class was selected in each data set and used for training a multilayer neural network; the remaining cases were used to select an independent test set. In order to increase the difference between the class distributions in the training (0.50, 0.50) and the test sets, we omitted a number of cases from the smallest class in order to obtain a class distribution of ($p(v_1) = 0.20$, $p(v_2) = 0.80$) for the test set. Ten different selections of training and test sets were carried out, and for each of them, the training phase was replicated 10 times, giving a total of 100 trained neural networks for each data set.
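The test-set construction described above (dropping cases from the smaller class until the test priors reach roughly 0.20/0.80) could be sketched as follows; this is our own illustrative resampling helper, not the authors' exact selection procedure.

```python
import numpy as np

def subsample_to_priors(X, y, target_priors, seed=0):
    """Drop cases from over-represented classes so that the empirical class
    distribution of (X, y) approximately matches target_priors, e.g. [0.2, 0.8]."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = np.asarray(target_priors, dtype=float)
    # Largest total size such that target * size fits within the available counts.
    size = int(np.floor((counts / target).min()))
    keep = []
    for c, p in zip(classes, target):
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=int(round(p * size)), replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```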
On average over the 100 experiments, Table 4 details the a priori probabilities estimated by means of the EM and the confusion matrix methods, as well as the classification rates before and after the probability adjustments.

Table 4: A Priori Probabilities of the Smallest Class (True and Estimated) and Classification Rates on the Three Real Data Sets.

                    A Priori Probability           Classification Rate
Data Set     True    EM     Confusion    No          After   After Confusion   Using True
                            Matrix       Adjustment  EM      Matrix            Priors
Diabetes     20%     24.8%  31.3%        67.4%       76.3%   74.4%             78.3%
Breast       20      18.0   26.2         91.3        92.0    92.1              92.6
Liver        20      24.6   21.5         68.0        75.7    75.5              79.1

Note: The neural network has been trained on a learning set with a priori probabilities of (50%, 50%).

These results show that the EM prior estimates were generally better than the confusion matrix ones (except for the Liver data). Moreover, adjusting the classifier outputs on the basis of the new a priori probabilities always increased classification rates and provided accuracy levels not too far from those obtained by using the true priors for adjusting the outputs (given in the last column of Table 4). However, except for the Diabetes data, for which EM gave better results, the adjustments made on the basis of the EM and the confusion matrix methods seemed to have the same effect on the accuracy improvement.
5 Related Work
6 Conclusion
assumptions:
If sampling also occurs on the basis of x, the usual sample survey so-
lution to this problem is to use weighted maximum likelihood estimators
with weights inversely proportional to the selection probabilities, which are
supposed to be known (see, e.g., Kish and Frankel, 1974).
Experimental results show that our new procedure based on EM per-
forms better than the standard method (based on the confusion matrix) for
new a priori probability estimation. The results also show that even if the
classifier's output provides imperfect a posteriori probability estimates,
Appendix

Our derivation of the iterative process (see equation 2.9) closely follows the estimation of mixing proportions of densities (see McLachlan & Krishnan, 1997). Indeed, $p(x \mid v_i)$ can be viewed as a probability density defined by equation 2.1.

The EM algorithm supposes that there exists a set of unobserved data, defined here as the class labels of the observations of the new data set. In order to pose the problem as an incomplete-data one, associated with the new observed data, $X_1^N = (x_1, x_2, \ldots, x_N)$, we introduce as the unobservable data $Z_1^N = (z_1, z_2, \ldots, z_N)$, where each vector $z_k$ is associated with one of the $n$ mutually exclusive classes: $z_k$ will represent the class label $\in \{v_1, \ldots, v_n\}$ of the observation $x_k$. More precisely, each $z_k$ will be defined as an indicator vector, with component $z_{ki} = 1$ if the observation $x_k$ belongs to class $v_i$ and $z_{ki} = 0$ otherwise. Denoting by $\pi$ the vector of parameters to be estimated, the complete-data likelihood is
$$L(X_1^N, Z_1^N \mid \pi) = \prod_{k=1}^{N}\prod_{i=1}^{n}\left[p(x_k, v_i)\right]^{z_{ki}} = \prod_{k=1}^{N}\prod_{i=1}^{n}\left[p(x_k \mid v_i)\,p(v_i)\right]^{z_{ki}}, \qquad (A.1)$$

where $p(x_k \mid v_i)$ is constant (it does not depend on the parameter vector $\pi$) and the $p(v_i)$ probabilities are the parameters to be estimated.
The log-likelihood is

$$l(X_1^N, Z_1^N \mid \pi) = \log\left[L(X_1^N, Z_1^N \mid \pi)\right] = \sum_{k=1}^{N}\sum_{i=1}^{n} z_{ki}\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} z_{ki}\log\left[p(x_k \mid v_i)\right]. \qquad (A.2)$$
Since the $Z_1^N$ data are unobservable, during the E-step, we replace the log-likelihood function by its conditional expectation over $p(Z_1^N \mid X_1^N, \pi)$: $E_{Z_1^N}[l \mid X_1^N, \pi]$. Moreover, since we need to know the value of $\pi$ in order to compute this expectation, we evaluate it at the current parameter estimate $\hat{\pi}^{(s)}$:

$$E_{Z_1^N}\left[l \mid X_1^N, \hat{\pi}^{(s)}\right] = \sum_{k=1}^{N}\sum_{i=1}^{n} E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right]\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right]\log\left[p(x_k \mid v_i)\right]. \qquad (A.3)$$

The conditional expectation of each indicator variable is
$$E_{Z_1^N}\left[z_{ki} \mid x_k, \hat{\pi}^{(s)}\right] = p\left(z_{ki} = 1 \mid x_k, \hat{\pi}^{(s)}\right) = \hat{p}^{(s)}(v_i \mid x_k) = \frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\,\hat{p}_t(v_i \mid x_k)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\,\hat{p}_t(v_j \mid x_k)}, \qquad (A.4)$$
where we used equation 2.4 at the last step. The expected likelihood is therefore

$$Q(\pi, \hat{\pi}^{(s)}) = \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(x_k \mid v_i)\right]. \qquad (A.5)$$
During the M-step, we maximize $Q(\pi, \hat{\pi}^{(s)})$ subject to the constraint $\sum_{i=1}^{n} p(v_i) = 1$. Introducing a Lagrange multiplier $\lambda$, we define the Lagrangian

$$\ell(\pi) = \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(v_i)\right] + \sum_{k=1}^{N}\sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k)\log\left[p(x_k \mid v_i)\right] + \lambda\left[1 - \sum_{i=1}^{n} p(v_i)\right]. \qquad (A.6)$$
By computing $\dfrac{\partial \ell(\pi)}{\partial p(v_j)} = 0$, we obtain

$$\sum_{k=1}^{N} \hat{p}^{(s)}(v_j \mid x_k) = \lambda\, p(v_j). \qquad (A.7)$$

Summing equation A.7 over $j$, and using $\sum_{j=1}^{n} p(v_j) = 1$ together with $\sum_{j=1}^{n} \hat{p}^{(s)}(v_j \mid x_k) = 1$, yields

$$\lambda = N, \qquad (A.8)$$

and therefore the reestimation formula

$$\hat{p}^{(s+1)}(v_i) = \frac{1}{N}\sum_{k=1}^{N} \hat{p}^{(s)}(v_i \mid x_k), \qquad (A.9)$$
so that equations A.4 (E-step) and A.9 (M-step) are repeated until the convergence of the parameter vector $\pi$. The overall procedure is summarized
in equation 2.9. It can be shown that this iterative process increases the
likelihood (see equation 2.6) at each step (see, e.g., Dempster et al., 1977;
McLachlan & Krishnan, 1997).
Acknowledgments
Part of this work was supported by project RBC-BR 216/4041 from the Région de Bruxelles-Capitale, and funding from the SmalS-MvM. P. L. is supported by a grant under an Action de Recherche Concertée program of the Communauté Française de Belgique. C. D. is a research associate with the FNRS (Belgian National Scientific Research Fund). We also thank the
two anonymous reviewers for their pertinent and constructive remarks.
References
Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning
databases. Irvine, CA: University of California, Department of Informa-
tion and Computer Science. Available online at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–849.
Castelli, V., & Cover, T. (1995). On the exponential value of labelled samples.
Pattern Recognition Letters, 16, 105–111.
Decaestecker, C., Lopes, M.-B., Gordower, L., Camby, I., Cras, P., Martin, J.-J.,
Kiss, R., VandenBerg, S., & Salmon, I. (1997). Quantitative chromatin pat-
tern description in Feulgen-stained nuclei as a diagnostic tool to characterise
the oligodendroglial and astroglial components in mixed oligoastrocytomas.
Journal of Neuropathology and Experimental Neurology, 56, 391–402.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm (with discussion). Journal of the Royal Statis-
tical Society B, 39, 1–38.
Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data
via an EM algorithm. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Ad-
vances in neural information processing systems, 6 (pp. 120–127). San Mateo, CA:
Morgan Kaufmann.
Hand, D. (1981). Discrimination and classification. New York: Wiley.
Kish, L., & Frankel, M. (1974). Inference from complex samples (with discussion).
Journal of the Royal Statistical Society B, 36, 1–37.
Lawless, J., Kalbfleisch, J., & Wild, C. (1999). Semiparametric methods for
response-selective and missing data problems in regression. Journal of the
Royal Statistical Society B, 61, 413–438.
McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New
York: Wiley.
McLachlan, G., & Basford, K. (1988). Mixture models, inference and applications to
clustering. New York: Marcel Dekker.
McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York:
Wiley.
Melsa, J., & Cohn, D. (1978). Decision and estimation theory. New York: McGraw-
Hill.
Miller, D., & Uyar, S. (1997). A mixture of experts classifier with learning based on
both labeled and unlabeled data. In M. Mozer, M. Jordan, & T. Petsche (Eds.),
Advances in neural information processing systems, 9 (pp. 571–578). Cambridge,
MA: MIT Press.
Mood, A., Graybill, F., & Boes, D. (1974). Introduction to the theory of statistics (3rd
ed.). New York: McGraw-Hill.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from
labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.),
New York: McGraw-Hill.
Richard, M., & Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Saerens, M. (2000). Building cost functions minimizing to some summary statis-
tics. IEEE Transactions on Neural Networks, 11, 1263–1271.
Scott, A., & Wild, C. (1997). Fitting regression models to case-control data by
maximum likelihood. Biometrika, 84, 57–71.
Shahshahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095.
Towell, G. (1996). Using unlabeled data for supervised learning. In D. Touretzky,
M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing
systems, 8 (pp. 647–653). Cambridge, MA: MIT Press.