


Conformal Prediction with Neural Networks

Harris Papadopoulos
Frederick Institute of Technology,
7 Y. Frederickou St., Palouriotisa,
Nicosia 1036, Cyprus
[email protected]

Volodya Vovk
Department of Computer Science,
Royal Holloway, University of London,
Egham, Surrey TW20 0EX, England
[email protected]

Alex Gammerman
Department of Computer Science,
Royal Holloway, University of London,
Egham, Surrey TW20 0EX, England
[email protected]

Abstract

Conformal Prediction (CP) is a method that can be used for complementing the bare predictions produced by any traditional machine learning algorithm with measures of confidence. CP gives good accuracy and confidence values, but unfortunately it is quite computationally inefficient. This computational inefficiency problem becomes huge when CP is coupled with a method that requires long training times, such as Neural Networks. In this paper we use a modification of the original CP method, called Inductive Conformal Prediction (ICP), which allows us to construct a Neural Network confidence predictor without the massive computational overhead of CP. The method we propose accompanies its predictions with confidence measures that are useful in practice, while still preserving the computational efficiency of its underlying Neural Network.

1. Introduction

Traditional machine learning algorithms for pattern recognition just output bare predictions, without producing any indication of how likely each of these predictions is. However, knowing the likelihood of given predictions is highly desirable in many real-world applications. There are two main areas in mainstream machine learning that can be used for obtaining such an indication: the Bayesian framework and the theory of Probably Approximately Correct learning (PAC theory). The Bayesian framework can be used to complement individual predictions with probabilistic measures of their quality, while PAC methods can be applied to a given algorithm to obtain upper bounds on the probability of its error with respect to some confidence level. However, each of these approaches has its drawbacks.

Bayesian methods require some strong assumptions about the stochastic mechanism generating the data, and if these assumptions are violated their outputs can become quite misleading. This negative aspect of Bayesian methods is demonstrated experimentally in [4]. PAC theory, on the other hand, is capable of producing non-trivial bounds only if the data set is particularly clean; when it is not, which is the case for most data sets, the bounds obtained by PAC methods are very weak and as such not very useful in practice. A demonstration of the crudeness of PAC bounds can be found in [6], where there is an example of Littlestone and Warmuth's bound (found in [1], Theorems 4.25 and 6.8) applied to the USPS data set.

This problem was addressed in [9] and [11], where what we call in this paper "Conformal Prediction" (CP) was proposed. Conformal Predictors are built on top of traditional algorithms, called underlying algorithms, but unlike the latter they complement each of their predictions with a measure of confidence; they also produce a measure of "credibility", which serves as an indicator of how suitable the training data are for classifying the example.

The confidence measures produced by CPs are useful in practice, while their accuracy is comparable to, and sometimes even better than, that of traditional machine learning methods. Their only disadvantage is their relative computational inefficiency. This is basically due to the use of transductive inference, since all computations have to start from scratch for every test example. Unfortunately, this computational inefficiency problem renders the application of the CP method highly unsuitable for any approach that requires long training times, such as Neural Networks.
In order to overcome this problem, a modification of the original CP approach called Inductive Conformal Prediction (ICP) was proposed in [7] for regression and in [8] for pattern recognition. As suggested by its name, the ICP method replaces the transductive inference used in the original approach with inductive inference. As a result, ICPs are almost as computationally efficient as their underlying algorithms.

In this paper we apply the ICP method to Neural Networks for pattern recognition, one of the most widely used approaches for solving machine learning problems. One attractive characteristic of Neural Networks is that although they require relatively long training times, the application of the learned network to a new example is typically very fast. This characteristic would not have been preserved if the original CP method were applied to this approach; in fact, the processing of a single new example by the resulting method would have taken longer than the training time required by its underlying Neural Network. On the contrary, our algorithm preserves this characteristic, while the training time it requires is more or less the same as that of its underlying network.

The rest of this paper is structured as follows. The general idea of Conformal Prediction is summarised in section 2. In section 3 we analyse the general ICP method and demonstrate the computational complexity difference between ICP and CP. Section 4 details our algorithm, while section 5 lists some experimental results of our method when applied to various data sets. Finally, section 6 consists of our conclusions.

2. Conformal Prediction

Here we will give an outline of the main idea behind conformal prediction; for more details see [12]. We are given a training set (z_1, . . . , z_l) of examples, where each z_i ∈ Z is a pair (x_i, y_i); x_i ∈ R^d is the vector of attributes for example i and y_i is the classification of that example. We are also given a new unclassified example x_{l+1} and our task is to predict the classification y_{l+1} of this example. We know a priori the set of all possible labels Y_1, . . . , Y_c and our only assumption, as with all the problems we are interested in, is the general i.i.d. model (z_1, z_2, . . . are independent and identically distributed). Now suppose we can measure how likely it is that a given sequence of classified examples was drawn independently from the same probability distribution; in other words, how typical the sequence is with respect to the i.i.d. model. Then by measuring the typicalness of the extended sequence

  ((x_1, y_1), . . . , (x_l, y_l), (x_{l+1}, Y_j))

we would in effect be measuring the likelihood of the label Y_j being the true label of our new example x_{l+1}, since this is the only component of the sequence that was not given to us.

In the spirit of Per Martin-Löf [3], a function p : Z* → [0, 1] is a test for randomness with respect to the i.i.d. model if

• for all n ∈ N, all δ ∈ [0, 1] and all probability distributions P on Z,

  P^n{z ∈ Z^n : p(z) ≤ δ} ≤ δ;   (1)

• p is semi-computable from above.

We will use the term p-value function to refer to such a function, since this definition is practically equivalent to the notion of p-values used in traditional statistics. In effect, the second requirement, that p is semi-computable from above, is completely irrelevant from the practical point of view, since the p-value functions of any interest in applications of statistics are always computable.

Therefore, we can obtain the typicalness of a sequence of examples by using a computable function p : Z* → [0, 1] which satisfies (1). We will call the output of this function for the sequence

  ((x_1, y_1), . . . , (x_l, y_l), (x_{l+1}, Y_j))

(where Y_j is one of the c possible labels of our new example) the p-value of Y_j, and denote it by p(Y_j). If the p-value of a given label is under some very low threshold, say 0.05, this means that the label is highly unlikely, since such sequences will be generated at most 5% of the time by any i.i.d. process.

A p-value function can be constructed by considering how different each example in our sequence is from all the other examples. For this reason we use a family of functions A_n : Z^(n−1) × Z → R, n = 1, 2, . . ., which assign a numerical score

  α_i = A_n(⦃z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_n⦄, z_i)

to each example z_i, indicating how different it is from the examples in the bag¹ ⦃z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_n⦄; such families of functions are called non-conformity measures. The non-conformity scores of all the examples can now be used for computing the p-value of our sequence with the following function:

  p(z_1, . . . , z_n) = #{i = 1, . . . , n : α_i ≥ α_n} / n.   (2)

This function satisfies (1); for a proof see [6].

¹ Here we use the concept of a bag so as to formalize the fact that the order in which examples appear should not have any impact on the non-conformity score α_i.
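To make the p-value function (2) concrete, here is a minimal sketch in Python (our own code, not from the paper) of how the p-value of a hypothetically labelled new example is computed from the non-conformity scores of the extended sequence:

```python
import numpy as np

def p_value(alphas: np.ndarray) -> float:
    """p-value of equation (2): the fraction of examples that are at
    least as non-conforming as the new one. By convention alphas[-1]
    is the score of the new example with its hypothetical label."""
    return float(np.sum(alphas >= alphas[-1])) / len(alphas)

# The new example conforms well (low score), so its p-value is high:
print(p_value(np.array([0.9, 0.3, 1.2, 0.1])))  # all 4 scores >= 0.1 -> 1.0
```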
2.1. Measuring Non-conformity

We can measure the non-conformity α_i of each example z_i in a bag ⦃z_1, . . . , z_n⦄ with the aid of some traditional machine learning method, which we call the underlying algorithm of the CP. Given a bag of examples ⦃z_1, . . . , z_n⦄ as training set, each such method creates a prediction rule D_⦃z_1,...,z_n⦄, which maps any unclassified example x to a label ŷ. As this prediction rule is based on the examples in the bag, the degree of disagreement between D_⦃z_1,...,z_n⦄ (for x_i) and the actual label y_i of the example z_i tells us how different z_i is from the rest of the examples in the bag. Therefore, this gives us a measure of the non-conformity of example z_i.

Alternatively, we can create the prediction rule D_⦃z_1,...,z_{i−1},z_{i+1},...,z_n⦄ using all the examples in the bag except z_i, and measure the degree of disagreement between D_⦃z_1,...,z_{i−1},z_{i+1},...,z_n⦄ (for x_i) and y_i.

3. Inductive Conformal Predictors

Before we describe the general way ICPs work, let us first take a look at the cause of the computational inefficiency of CPs. In order for a CP to calculate the p-value of every possible label y ∈ {Y_1, . . . , Y_c} for a new test example x_{l+g} (l is the number of training examples), it has to compute the non-conformity score of each example in the bags

  ⦃(x_1, y_1), . . . , (x_l, y_l), (x_{l+g}, Y_1)⦄
  ⋮
  ⦃(x_1, y_1), . . . , (x_l, y_l), (x_{l+g}, Y_c)⦄.

This means that it has to train its underlying algorithm c times so as to generate a prediction rule based on each bag, and apply each of these prediction rules l + 1 times. The fact that these computations are repeated for each new test example x_{l+g} explains why CPs cannot be used with an underlying algorithm that requires long training times, or when dealing with large data sets.

Inductive Conformal Predictors are based on the same general idea described in section 2, but follow a different approach which allows them to train their underlying algorithm just once. This is achieved by splitting the training set (of size l) into two smaller sets: the proper training set with m < l examples and the calibration set with q := l − m examples. The proper training set is used for creating the prediction rule D_⦃z_1,...,z_m⦄ and only the examples in the calibration set are used for calculating the p-value of each possible classification of the new test example. More specifically, the non-conformity score α_{m+i} of each example z_{m+i} in the calibration set ⦃z_{m+1}, . . . , z_{m+q}⦄ is calculated as the degree of disagreement between the prediction rule D_⦃z_1,...,z_m⦄ (for x_{m+i}) and the true label y_{m+i}. In the same way, the non-conformity score α^(Y_j)_{l+g} for each possible classification Y_j of the new test example x_{l+g} is calculated as the degree of disagreement between D_⦃z_1,...,z_m⦄ and Y_j. Notice that the non-conformity scores of the examples in the calibration set only need to be computed once.

Using these non-conformity scores, the p-value of each possible label Y_j of x_{l+g} can be calculated as

  p(Y_j) = #{i = m+1, . . . , m+q, l+g : α_i ≥ α^(Y_j)_{l+g}} / (q + 1).   (3)

The p-values obtained by both CPs and ICPs for each possible classification can be used in two different modes:

• For each test example, output the predicted classification together with a confidence and a credibility measure for that classification.

• Given a confidence level 1 − δ, where δ > 0 is the significance level (typically a small constant), output the appropriate set of classifications such that one can be 1 − δ confident that the true label will be in that set.

In the first case we predict the classification with the largest p-value. Actually, for the ICP we predict the classification with the smallest non-conformity score; this is also the classification with the largest p-value, but in the case where more than one classification shares this p-value, we select the most conforming one. The confidence of this prediction is one minus the second largest p-value, and its credibility is the p-value of the output prediction, i.e. the largest p-value.

In the second case we output the set

  {Y_u : p(Y_u) > δ},

where u = 1, . . . , c (c is the number of possible classifications). In other words, we output the set consisting of all the classifications that have a greater than δ p-value of being the true label according to our non-conformity measure. Both modes are illustrated in the sketch below.
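The following sketch (our own helper functions, assuming the p-values of all c labels are already available in a dictionary; ties in p-values, which the ICP breaks by non-conformity score, are ignored for brevity) shows both modes:

```python
def forced_prediction(p_values: dict):
    """Mode 1: the label with the largest p-value, together with
    confidence (one minus the second largest p-value) and
    credibility (the largest p-value)."""
    ranked = sorted(p_values.values(), reverse=True)
    prediction = max(p_values, key=p_values.get)
    return prediction, 1.0 - ranked[1], ranked[0]

def prediction_set(p_values: dict, delta: float) -> set:
    """Mode 2: all labels whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > delta}
```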
3.1. Computational Efficiency Improvement

In order to demonstrate the computational efficiency difference between CP and ICP, let us consider the computational complexity of each method with respect to the complexity of its underlying algorithm U. The complexity of U when applied to a data set with l training examples and r test examples will be

  Θ(U_train(l) + r · U_apply),

where U_train(l) is the time required by U to generate its prediction rule and U_apply is the time needed to apply this rule to a new example. Note that although the complexity of any algorithm also depends on the number of attributes d that describe each example, this is not included in our notation for simplicity. The corresponding complexity of the CP will be

  Θ(rc(U_train(l + 1) + (l + 1) · U_apply)),

where c is the number of possible labels for the task at hand; we assume that the computation of the non-conformity scores and p-values for each possible label is relatively fast compared to the time it takes to train and apply the underlying algorithm. Analogously, the complexity of the ICP will be

  Θ(U_train(l − q) + (q + r) · U_apply),

where q is the size of the calibration set. Notice that the ICP takes less time than the original method to generate the prediction rule, since U_train(l − q) < U_train(l), while it then repeats U_apply a somewhat larger number of times. The time needed for applying the prediction rule of most inductive algorithms, however, is insignificant compared to the time spent generating it. Consequently, the ICP will in most cases be slightly faster than the original method, as it spends less time on the most complex part of its underlying algorithm's computations. On the contrary, the corresponding CP repeats a slightly larger number of computations than the total computations of its underlying algorithm rc times, which makes it much slower than both the original method and the ICP.
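As a rough numerical illustration of these formulas (the timing constants below are invented purely for the example and do not come from the paper), one can compare the dominant costs of CP and ICP:

```python
def cp_cost(l, r, c, train, apply):
    # Theta(rc(U_train(l+1) + (l+1) U_apply)): c retrainings per test example.
    return r * c * (train(l + 1) + (l + 1) * apply)

def icp_cost(l, q, r, train, apply):
    # Theta(U_train(l-q) + (q+r) U_apply): a single training run.
    return train(l - q) + (q + r) * apply

# Hypothetical constants, with sizes loosely modelled on the Satellite set:
# l = 4435 training and r = 2000 test examples, c = 6 labels, q = 199.
train = lambda n: 0.25 * n  # assumed linear training time (a simplification)
apply = 0.001               # assumed per-example application time
print(cp_cost(4435, 2000, 6, train, apply))    # ~1.3e7 time units
print(icp_cost(4435, 199, 2000, train, apply)) # ~1.1e3 time units
```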
αi = , (5)
oiu + γ
4. Neural Networks ICP where the parameter γ ≥ 0 in the second definition enables
us to adjust the sensitivity of our measure to small changes
In this section we analyse the Neural Networks ICP. This of oiu depending on the data in question. We added this
method can be implemented in conjunction with any Neural parameter in order to gain control over which category of
Network for pattern recognition as long as it uses the 1-of- outputs will be more important in determining the resulting
n output encoding, which is the typical encoding used for non-conformity scores; by increasing γ one reduces the im-
such networks. We first give a detailed description of this portance of oiu and consequently increases the importance
encoding and then move on to the definition of two non- of all other outputs.
conformity measures for Neural Networks. Finally, we de-
tail the Neural Networks ICP algorithm. 4.3. The Algorithm

4.1. Output Encoding We can now use the non-conformity measure (4) or (5)
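In code, this encoding is straightforward (a sketch; the function name is ours):

```python
import numpy as np

def encode_label(u: int, c: int) -> np.ndarray:
    """1-of-c target encoding: t_j = 1 if j == u and 0 otherwise."""
    t = np.zeros(c)
    t[u] = 1.0
    return t
```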
4.2. Non-conformity Measure

According to the above encoding, for an example i with true classification Y_u, the higher the output o_{iu} (which corresponds to that classification) the more conforming the example, and the higher the other outputs the less conforming the example. In fact, the most important of all the other outputs is the one with the maximum value, max_{j=1,...,c : j≠u} o_{ij}, since that is the one which might be very near, or even higher than, o_{iu}.

So a natural non-conformity measure for an example z_i = (x_i, y_i) where y_i = Y_u would be defined as

  α_i = max_{j=1,...,c : j≠u} o_{ij} − o_{iu},   (4)

or as

  α_i = max_{j=1,...,c : j≠u} o_{ij} / (o_{iu} + γ),   (5)

where the parameter γ ≥ 0 in the second definition enables us to adjust the sensitivity of our measure to small changes of o_{iu}, depending on the data in question. We added this parameter in order to gain control over which category of outputs will be more important in determining the resulting non-conformity scores; by increasing γ one reduces the importance of o_{iu} and consequently increases the importance of all the other outputs.
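The two measures transcribe directly into code (a sketch; `outputs` holds the network's c output values for one example and `u` is the index of the true, or hypothesised, label):

```python
import numpy as np

def nonconformity_diff(outputs: np.ndarray, u: int) -> float:
    """Measure (4): largest competing output minus the true-label output."""
    return float(np.max(np.delete(outputs, u)) - outputs[u])

def nonconformity_ratio(outputs: np.ndarray, u: int, gamma: float = 0.0) -> float:
    """Measure (5): largest competing output over the true-label output;
    gamma >= 0 damps the sensitivity to small changes of o_iu."""
    return float(np.max(np.delete(outputs, u)) / (outputs[u] + gamma))
```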
4.3. The Algorithm

We can now use the non-conformity measure (4) or (5) to compute the non-conformity score of each example in the calibration set and of each pair (x_{l+g}, Y_u). These can then be fed into the p-value function (3), giving us the p-value of each classification Y_u. The exact steps the Neural Network ICP follows are:

• Split the training set into the proper training set with m < l examples and the calibration set with q := l − m examples.

• Use the proper training set to train the Neural Network.

• For each example z_{m+t} = (x_{m+t}, y_{m+t}), t = 1, . . . , q, in the calibration set:
  – supply the input pattern x_{m+t} to the trained network to obtain the output values o^{m+t}_1, . . . , o^{m+t}_c, and
  – calculate the non-conformity score α_{m+t} of the pair (x_{m+t}, y_{m+t}) by applying (4) or (5) to these values.

• For each test pattern x_{l+g}, g = 1, . . . , r:
  – supply the input pattern x_{l+g} to the trained network to obtain the output values o^{l+g}_1, . . . , o^{l+g}_c,
  – consider each possible classification Y_u, u = 1, . . . , c, and:
    ∗ compute the non-conformity score α^(Y_u)_{l+g} of the pair (x_{l+g}, Y_u) by applying (4) or (5) to the outputs of the network,
    ∗ calculate the p-value p(Y_u) of the pair (x_{l+g}, Y_u) by applying (3) to the non-conformity scores of the calibration examples and α^(Y_u)_{l+g},
  – predict the classification with the smallest non-conformity score,
  – output as confidence for this prediction one minus the second largest p-value, and
  – output as credibility the p-value of the output prediction, i.e. the largest p-value.

Note that the p-values obtained by this method can also be used in the second mode described in section 3. A sketch of the whole procedure is given below.
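Putting the steps together, here is a compact sketch of the procedure (our own code; it assumes a network object already trained on the proper training set that exposes a `predict_proba`-style method returning one row of softmax outputs per example, reuses the `nonconformity_ratio` helper sketched in section 4.2, and breaks p-value ties arbitrarily rather than by non-conformity score, for brevity):

```python
import numpy as np

def nn_icp(model, X_cal, y_cal, X_test, c, gamma=0.0):
    """Neural Networks ICP: returns (prediction, confidence, credibility)
    for each test pattern, following the steps listed above."""
    # Non-conformity scores of the calibration examples (computed once).
    cal_alphas = np.array([
        nonconformity_ratio(o, u, gamma)
        for o, u in zip(model.predict_proba(X_cal), y_cal)
    ])
    results = []
    for o in model.predict_proba(X_test):
        # p-value of each possible classification Y_u, equation (3):
        # the test example's own score always satisfies alpha >= alpha,
        # hence the "+ 1" in both numerator and denominator.
        p_values = {
            u: (np.sum(cal_alphas >= nonconformity_ratio(o, u, gamma)) + 1)
               / (len(cal_alphas) + 1)
            for u in range(c)
        }
        ranked = sorted(p_values.values(), reverse=True)
        prediction = max(p_values, key=p_values.get)
        results.append((prediction, 1.0 - ranked[1], ranked[0]))
    return results
```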
5. Experimental Results

Our method was tested on the Satellite, Shuttle and Segment data sets, which were used in the experiments of the Statlog Project [2] (see also [5]).

The Satellite data set consists of 6435 satellite images, split into 4435 training examples and 2000 test examples, described by 36 attributes. The classification task is to distinguish among 6 different soil conditions that are represented in the images. The calibration set was formed from 199 of the 4435 training examples.

The Shuttle data set consists of 43500 training examples and 14500 test examples with 9 attributes each, describing the conditions in a space shuttle. The classification task is to choose which one of the 7 different sets of actions should be taken according to the conditions. In this case we used 999 of the training examples to form the calibration set.

The Segment data set consists of 2310 outdoor images described by 18 attributes each. The classification task is to choose between: brick-face, sky, foliage, cement, window, path, grass. For our experiments on this data set we used 10-fold cross-validation, as this was the testing procedure followed in the Statlog project. The set was divided into 10 equally sized parts and our tests were repeated 10 times, each time using one of the 10 parts as the test set and the remaining 9 as the training set. Consequently, the resulting training and test sets consisted of 2079 and 231 examples respectively. Of the 2079 training examples, 199 were used to form the calibration set.

Our experiments on these data sets were performed using 2-layer fully connected networks, with sigmoid hidden units and softmax output units. The numbers of input and output units were determined by the format of each data set, being equal to the number of attributes and the number of possible classifications respectively. These networks were trained with the backpropagation algorithm minimizing a cross-entropy loss function. The number of hidden units and the learning and momentum rates used for each data set are reported in table 1. It is worth noting that the same parameters were used both for our method and for its underlying algorithm.

Table 1. The parameters used in our experiments for each data set.

                          Satellite   Shuttle   Segment
  Hidden Units                   23        12        11
  Hidden Learning Rate        0.002     0.002     0.002
  Output Learning Rate        0.001     0.001     0.001
  Momentum Rate                 0.1         0       0.1

Here we report the error percentages of our method and compare them to those of its underlying algorithm as well as to those of some other traditional methods. In addition, we check the quality of its p-values by analysing the results obtained from its second mode, described in section 3, for the 99%, 95% and 90% confidence levels. For the purpose of reporting the results of this mode we separate its outputs into three categories:

• a set with more than one label,
• a set with only one label,
• the empty set.

Our main concern here will be the number of outputs that belong to the first category; we want this number to be relatively small, since these are the examples for which the ICP cannot settle on a single label at the required confidence level 1 − δ. In addition to the percentage of examples in each category, we also report the number of errors made by our method in this mode.
This is the number of examples for which the true label was not included in the set output by the ICP, including all cases where the set output by the ICP was empty. Over many runs on different sets (both training and test) generated from the same i.i.d. distribution, the percentage of these errors will be close to the corresponding significance level δ; an experimental demonstration of this can be found in [10]. Finally, we examine the computational efficiency of our method by comparing its processing times with those of its underlying algorithm.

In table 2 we compare the performance of our method on the three Statlog project data sets (in terms of error percentage) with that of its underlying algorithm (we denote this as backpropagation) and that of 6 other traditional methods. These are the k-Nearest Neighbours algorithm, two decision tree algorithms, namely C4.5 and CART, the Naive Bayes classifier, a Bayesian network algorithm called CASTLE, and a linear discriminant algorithm. The results of the k-Nearest Neighbours algorithm were produced by the authors, while those of all the other methods were reported in [2] and [5]. Note that the aim of our method is not to outperform other algorithms but to produce more information with each prediction. So in comparing these error percentages we want to show that the accuracy of our method is not inferior to that of traditional algorithms.

Table 2. Error rate comparison of our method with traditional algorithms.

                               Percentage of error (%)
  Learning Algorithm        Satellite   Shuttle   Segment
  Neural Networks ICP           10.40    0.0414      3.46
  Backpropagation               10.24    0.0414      3.20
  k-Nearest Neighbours           9.45      0.12      3.68
  C4.5                          15.00      0.10      4.00
  CART                          13.80      0.08      4.00
  Naive Bayes                   28.70      4.50     26.50
  CASTLE                        19.40      3.80     11.20
  Linear Discriminant           17.10      4.83     11.60

We did not perform the same experiments with the corresponding original CP algorithm, due to the huge amount of time that would have been needed to do so. However, its results in terms of error percentages would not have been significantly different from those of its underlying algorithm (backpropagation). So the first two rows of table 2 also serve as a performance comparison between ICP and CP.

Table 2 clearly shows that the accuracy of the Neural Networks ICP is comparable to that of traditional methods. Of course, our main comparison here is with the performance of its underlying algorithm, since that is where ICPs base their predictions and since that is also the performance of the corresponding CP. By comparing its results to those of the backpropagation method, we can see that although in most cases the ICP suffers a small loss of accuracy, this loss is negligible. Moreover, we observe that as the data set gets bigger, the difference between the error percentage of the ICP and that of its underlying algorithm becomes smaller. In fact, for the Shuttle data set, which is the biggest, our method gives exactly the same results as its underlying network.

Table 3 details the performance of the second mode of our method on each of the three data sets. Here we can see that the percentage of examples for which our method needs to output more than one label is relatively small, even for a confidence level as high as 99%, bearing in mind the difficulty of each task and the performance of the underlying algorithm on each data set. This reflects the quality of the p-values calculated by this method and consequently the usefulness of its confidence measures.

Finally, table 4 lists the processing times of our method together with those of its underlying algorithm. In the case of the Segment data set the times listed are for the total duration of our experiments on all 10 splits. As mentioned in our computational complexity comparison of section 3.1, in most cases the ICP is faster than its underlying algorithm because it uses fewer training examples. In the case of Neural Networks, this reduction in training examples slightly reduces the training time per epoch and, for more or less the same number of epochs, results in a shorter total training time. This was the case for the Satellite and Shuttle data sets. However, for the Segment data set the number of epochs increased, which resulted in a slightly longer total training time for the ICP.

Based on our computational complexity analysis of section 3.1, if we were to perform the same experiments using the original CP method coupled with Neural Networks, it would have taken approximately 183 days for the Satellite data set, 53 years for the Shuttle data set and 93 days for the Segment data set. This shows the huge computational efficiency improvement of ICP in the case of Neural Networks. In fact, it shows that ICP is the only conformal prediction method that can be used with this approach.
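As noted above, over many i.i.d. runs the error percentage of the second mode should stay close to the significance level δ; a minimal sketch (our own helper, not from the paper) of such an empirical check:

```python
def empirical_error_rate(prediction_sets, true_labels) -> float:
    """Fraction of examples whose true label is missing from the output
    set (an empty set therefore always counts as an error); this should
    be close to the significance level delta used to form the sets."""
    errors = sum(1 for s, y in zip(prediction_sets, true_labels) if y not in s)
    return errors / len(true_labels)
```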
Table 3. Results of the second mode of the Neural Networks ICP method.

             Non-conformity  Confidence  Only one   More than      No         Errors
  Data Set   Measure         Level       label (%)  one label (%)  label (%)  (%)
  Satellite  (4)             99%             60.72          39.28       0.00    1.11
                             95%             84.42          15.58       0.00    4.67
                             90%             96.16           3.02       0.82    9.59
             (5)             99%             61.69          38.31       0.00    1.10
                             95%             85.70          14.30       0.00    4.86
                             90%             96.11           3.10       0.79    9.43
  Shuttle    (4)             99%             99.23           0.00       0.77    0.77
                             95%             93.52           0.00       6.48    6.48
                             90%             89.08           0.00      10.92   10.92
             (5)             99%             99.30           0.00       0.70    0.70
                             95%             93.86           0.00       6.14    6.14
                             90%             88.72           0.00      11.28   11.28
  Segment    (4)             99%             90.69           9.31       0.00    0.95
                             95%             97.71           1.25       1.04    3.68
                             90%             94.68           0.00       5.32    6.71
             (5)             99%             91.73           8.27       0.00    1.04
                             95%             97.79           1.21       1.00    3.55
                             90%             94.76           0.00       5.24    6.67

Table 4. The processing times of our method and its underlying algorithm.

                               Time (in seconds)
  Learning Algorithm        Satellite   Shuttle   Segment
  Neural Networks ICP            1077     11418      5322
  Backpropagation                1321     16569      4982
6. Conclusions

We have presented the Neural Networks Inductive Conformal Predictor. Unlike traditional algorithms, this method accompanies each of its predictions with confidence and credibility measures. Moreover, its computational efficiency is virtually the same as that of its underlying network.

The experimental results produced by applying this method to various data sets show that its accuracy is comparable to that of traditional methods, while the confidence measures it produces are useful in practice. Of course, as a result of removing some examples from the training set to form the calibration set, it sometimes suffers a small, and usually negligible, loss of accuracy compared with the original network. This is not the case, however, for large data sets, which contain enough training examples that the removal of the calibration examples does not make any difference to the training of the Neural Network.

Acknowledgements

This work was supported by the Cyprus Research Promotion Foundation through research contract PLHRO/0506/22 ("Development of New Conformal Prediction Methods with Applications in Medical Diagnosis").

References

[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Methods. Cambridge University Press, Cambridge, 2000.
[2] R. D. King, C. Feng, and A. Sutherland. Statlog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3):259–287, 1995. See also http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.
[3] P. Martin-Löf. The definition of random sequences. Information and Control, 9:602–619, 1966.
[4] T. Melluish, C. Saunders, I. Nouretdinov, and V. Vovk. Comparing the Bayes and Typicalness frameworks. In Proceedings of the 12th European Conference on Machine Learning (ECML'01), volume 2167 of Lecture Notes in Computer Science, pages 360–371. Springer, 2001.
[5] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. See also http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.
[6] I. Nouretdinov, V. Vovk, M. V. Vyugin, and A. Gammerman. Pattern recognition and density estimation under the general i.i.d. assumption. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT'01) and 5th European Conference on Computational Learning Theory (EuroCOLT'01), volume 2111 of Lecture Notes in Computer Science, pages 337–353. Springer, 2001.
[7] H. Papadopoulos, K. Proedrou, V. Vovk, and A. Gammerman. Inductive confidence machines for regression. In Proceedings of the 13th European Conference on Machine Learning (ECML'02), volume 2430 of Lecture Notes in Computer Science, pages 345–356. Springer, 2002.
[8] H. Papadopoulos, V. Vovk, and A. Gammerman. Qualified predictions for large data sets in the case of pattern recognition. In Proceedings of the 2002 International Conference on Machine Learning and Applications (ICMLA'02), pages 159–163. CSREA Press, 2002.
[9] C. Saunders, A. Gammerman, and V. Vovk. Transduction with confidence and credibility. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, volume 2, pages 722–726, Los Altos, CA, 1999. Morgan Kaufmann.
[10] V. Vovk. On-line confidence machines are well-calibrated. In Proceedings of the 43rd Annual Symposium on Foundations of Computer Science (FOCS'02), pages 187–196, Los Alamitos, CA, 2002. IEEE Computer Society.
[11] V. Vovk, A. Gammerman, and C. Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the 16th International Conference on Machine Learning (ICML'99), pages 444–453, San Francisco, CA, 1999. Morgan Kaufmann.
[12] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
