((x1, y1), . . . , (xl, yl), (xl+1, Yj))

order in which examples appear should not have any impact on the non-conformity score αi.

2.1. Measuring Non-conformity

We can measure the non-conformity αi of each example zi in a bag ⦃z1, . . . , zn⦄ with the aid of some traditional machine learning method, which we call the underlying algorithm of the CP. Given a bag of examples ⦃z1, . . . , zn⦄ as training set, each such method creates a prediction rule D⦃z1,...,zn⦄, which maps any unclassified example x to a label ŷ. As this prediction rule is based on the examples in the bag, the degree of disagreement between D⦃z1,...,zn⦄ (for xi) and the actual label yi of the example zi tells us how different zi is from the rest of the examples in the bag. Therefore, this gives us a measure of the non-conformity of the example zi.

Alternatively, we can create the prediction rule D⦃z1,...,zi−1,zi+1,...,zn⦄ using all the examples in the bag except zi, and measure the degree of disagreement between D⦃z1,...,zi−1,zi+1,...,zn⦄ and yi.

3. Inductive Conformal Predictors

Before we describe the general way ICPs work, let us first take a look at the cause of the computational inefficiency of CPs. In order for a CP to calculate the p-value of every possible label y ∈ {Y1, . . . , Yc} for a new test example xl+g (l is the number of training examples), it has to compute the non-conformity score of each example in the bags

⦃(x1, y1), . . . , (xl, yl), (xl+g, Y1)⦄
⋮
⦃(x1, y1), . . . , (xl, yl), (xl+g, Yc)⦄.

This means that it has to train its underlying algorithm c times so as to generate a prediction rule based on each bag, and apply each of these prediction rules l + 1 times. The fact that these computations are repeated for each new test example xl+g explains why CPs cannot be used with an underlying algorithm that requires long training times, or when dealing with large data sets.

Inductive Conformal Predictors are based on the same general idea described in section 2, but follow a different approach which allows them to train their underlying algorithm just once. This is achieved by splitting the training set (of size l) into two smaller sets: the proper training set with m < l examples and the calibration set with q := l − m examples. The proper training set is used for creating the prediction rule D⦃z1,...,zm⦄ and only the examples in the calibration set are used for calculating the p-value of each possible classification of the new test example. More specifically, the non-conformity score αm+i of each example zm+i in the calibration set ⦃zm+1, . . . , zm+q⦄ is calculated as the degree of disagreement between the prediction rule D⦃z1,...,zm⦄ (for xm+i) and the true label ym+i. In the same way, the non-conformity score αl+g(Yj) for each possible classification Yj of the new test example xl+g is calculated as the degree of disagreement between D⦃z1,...,zm⦄ and Yj. Notice that the non-conformity scores of the examples in the calibration set only need to be computed once.

Using these non-conformity scores, the p-value of each possible label Yj of xl+g can be calculated as

p(Yj) = #{i = m + 1, . . . , m + q, l + g : αi ≥ αl+g(Yj)} / (q + 1).   (3)

The p-values obtained by both CPs and ICPs for each possible classification can be used in two different modes:

• For each test example, output the predicted classification together with a confidence and credibility measure for that classification.

• Given a confidence level 1 − δ, where δ > 0 is the significance level (typically a small constant), output the appropriate set of classifications such that one can be 1 − δ confident that the true label will be in that set.

In the first case we predict the classification with the largest p-value. Actually, for the ICP we predict the classification with the smallest non-conformity score; this is also the classification with the largest p-value, but in the case where more than one classification shares this p-value we select the most conforming one. The confidence of this prediction is one minus the second largest p-value, and its credibility is the p-value of the output prediction, i.e. the largest p-value.

In the second case we output the set

{Yu : p(Yu) > δ},

where u = 1, . . . , c (c is the number of possible classifications). In other words, we output the set consisting of all the classifications that have a greater than δ p-value of being the true label according to our non-conformity measure.

3.1. Computational Efficiency Improvement

In order to demonstrate the computational efficiency difference between CP and ICP, let us consider the computational complexity of each method with respect to the complexity of its underlying algorithm U. The complexity of U when applied to a data set with l training examples and r test examples will be

Θ(Utrain(l) + rUapply),

where Utrain(l) is the time required by U to generate its prediction rule and Uapply is the time needed to apply this rule
to a new example. Note that although the complexity of any algorithm also depends on the number of attributes d that describe each example, this was not included in our notation for simplicity reasons. The corresponding complexity of the CP will be

Θ(rc(Utrain(l + 1) + (l + 1)Uapply)),

where c is the number of possible labels for the task at hand; we assume that the computation of the non-conformity scores and p-values for each possible label is relatively fast compared to the time it takes to train and apply the underlying algorithm. Analogously, the complexity of the ICP will be

Θ(Utrain(l − q) + (q + r)Uapply),

where q is the size of the calibration set. Notice that the ICP takes less time than the original method to generate the prediction rule, since Utrain(l − q) < Utrain(l), while it then repeats Uapply a somewhat larger number of times. The time needed for applying the prediction rule of most inductive algorithms, however, is insignificant compared to the amount of time spent generating it. Consequently, the ICP will in most cases be slightly faster than the original method, as it spends less time during the most complex part of its underlying algorithm's computations. On the contrary, the corresponding CP repeats a slightly bigger number of computations than the total computations of its underlying algorithm rc times, which makes it much slower than both the original method and the ICP.

4. Neural Networks ICP

In this section we analyse the Neural Networks ICP. This method can be implemented in conjunction with any Neural Network for pattern recognition as long as it uses the 1-of-n output encoding, which is the typical encoding used for such networks. We first give a detailed description of this encoding and then move on to the definition of two non-conformity measures for Neural Networks. Finally, we detail the Neural Networks ICP algorithm.

4.1. Output Encoding

Typically the output layer of a classification Neural Network consists of c units, each representing one of the c possible classifications of the problem at hand; thus each label is encoded into c target outputs. To explicitly describe this encoding, consider the label yi = Yu of a training example i, where Yu ∈ {Y1, . . . , Yc} is one of the c possible classifications. The resulting target outputs for yi will be ti1, . . . , tic, where

tij = 1, if j = u,
      0, otherwise,

for j = 1, 2, . . . , c. Here we assumed that the Neural Network in question has a softmax output layer, as this was the case for the networks used in our experiments. The values 0 and 1 can be adjusted accordingly depending on the range of the output activation functions of the network being used. As a result of this encoding, the prediction ŷg of the network, for a test pattern g, will be the label corresponding to its highest output value.

4.2. Non-conformity Measure

According to the above encoding, for an example i with true classification Yu, the higher the output oiu (which corresponds to that classification) the more conforming the example, and the higher the other outputs the less conforming the example. In fact, the most important of all other outputs is the one with the maximum value maxj=1,...,c:j≠u oij, since that is the one which might be very near or even higher than oiu.

So a natural non-conformity measure for an example zi = (xi, yi) where yi = Yu would be defined as

αi = maxj=1,...,c:j≠u oij − oiu,   (4)

or as

αi = (maxj=1,...,c:j≠u oij) / (oiu + γ),   (5)

where the parameter γ ≥ 0 in the second definition enables us to adjust the sensitivity of our measure to small changes of oiu depending on the data in question. We added this parameter in order to gain control over which category of outputs will be more important in determining the resulting non-conformity scores; by increasing γ one reduces the importance of oiu and consequently increases the importance of all other outputs.

4.3. The Algorithm

We can now use the non-conformity measure (4) or (5) to compute the non-conformity score of each example in the calibration set and of each pair (xl+g, Yu). These can then be fed into the p-value function (3), giving us the p-value for each classification Yu. The exact steps the Neural Network ICP follows are:

• Split the training set into the proper training set with m < l examples and the calibration set with q := l − m examples.

• Use the proper training set to train the Neural Network.

• For each example zm+t = (xm+t, ym+t), t = 1, . . . , q, in the calibration set:

  – supply the input pattern xm+t to the trained network to obtain the output values o1(m+t), . . . , oc(m+t), and

  – calculate the non-conformity score αm+t of the pair (xm+t, ym+t) by applying (4) or (5) to these values.

                         Satellite   Shuttle   Segment
  Hidden Units                23         12        11
  Hidden Learning Rate      0.002      0.002     0.002
  Output Learning Rate      0.001      0.001     0.001
  Momentum Rate               0.1        0         0.1

Table 1. The parameters used in our experiments for each data set.
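The calibration and prediction steps above, including both output modes, can be sketched in Python. This is a minimal illustration under our own naming, not the implementation used in the paper: the trained network is represented only by the softmax output vectors it produces, and the function names and the `gamma`/`ratio` parameters are ours.

```python
def nonconformity(outputs, u, gamma=0.0, ratio=False):
    """Non-conformity score of one example from its c softmax outputs.

    outputs: list of c output values; u: index of the true (or
    hypothesised) classification Yu.  ratio=False gives measure (4),
    ratio=True gives measure (5) with sensitivity parameter gamma >= 0.
    """
    rival = max(o for j, o in enumerate(outputs) if j != u)  # max over j != u of o_ij
    if ratio:
        return rival / (outputs[u] + gamma)   # measure (5)
    return rival - outputs[u]                 # measure (4)


def icp_p_values(cal_outputs, cal_labels, test_outputs, **kw):
    """p-value of every possible classification of one test example.

    Implements equation (3): the q calibration scores are computed once,
    and each candidate label Yj is ranked against them, with the test
    example itself counted in both the numerator and the denominator.
    """
    cal_scores = [nonconformity(o, y, **kw)
                  for o, y in zip(cal_outputs, cal_labels)]
    q = len(cal_scores)
    p = []
    for j in range(len(test_outputs)):
        a_test = nonconformity(test_outputs, j, **kw)
        p.append((sum(a >= a_test for a in cal_scores) + 1) / (q + 1))
    return p


def forced_prediction(p):
    """First mode: the label with the largest p-value, its confidence
    (one minus the second largest p-value) and its credibility
    (the largest p-value itself)."""
    ranked = sorted(range(len(p)), key=lambda u: p[u])
    pred = ranked[-1]
    return pred, 1.0 - p[ranked[-2]], p[pred]


def prediction_region(p, delta):
    """Second mode: the prediction set {Yu : p(Yu) > delta}."""
    return [u for u in range(len(p)) if p[u] > delta]
```

For instance, with three calibration examples and a test pattern whose softmax outputs favour the first class, `icp_p_values` returns one p-value per label, from which the forced prediction with its confidence/credibility and the 1 − δ prediction region both follow directly.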
Table 3. Results of the second mode of the Neural Networks ICP method.
Table 4. The processing times of our method and its underlying algorithm.
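The processing-time comparison reflects the complexity formulas of section 3.1, which can be made concrete with a small back-of-the-envelope helper. The unit-cost model below (`t_train` as a function of the training-set size, a constant `t_apply`) is our own illustration, not measured data from the experiments.

```python
def cp_cost(l, r, c, t_train, t_apply):
    """Total cost of a CP: for each of the r test examples and each of
    the c labels, retrain on l + 1 examples and apply the rule l + 1
    times, i.e. rc(Utrain(l + 1) + (l + 1)Uapply)."""
    return r * c * (t_train(l + 1) + (l + 1) * t_apply)


def icp_cost(l, q, r, t_train, t_apply):
    """Total cost of an ICP: train once on the l - q proper training
    examples, then score the q calibration and r test examples,
    i.e. Utrain(l - q) + (q + r)Uapply."""
    return t_train(l - q) + (q + r) * t_apply
```

With, say, a quadratic training cost and unit application cost, the ICP total stays close to a single training run while the CP total is multiplied by rc, matching the discussion above.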
measures it produces are useful in practice. Of course, as a result of removing some examples from the training set to form the calibration set, it sometimes suffers a small, and usually negligible, loss of accuracy compared with the original network. This is not the case, however, for large data sets, which contain enough training examples so that the removal of the calibration examples does not make any difference to the training of the Neural Network.

Acknowledgements

This work was supported by the Cyprus Research Promotion Foundation through research contract PLHRO/0506/22 ("Development of New Conformal Prediction Methods with Applications in Medical Diagnosis").