
Democratic Co-Learning

Yan Zhou
School of Computer and Information Sciences
University of South Alabama
Mobile, AL 36688
[email protected]
Sally Goldman
Department of Computer Science and Engineering
Washington University
St. Louis, MO 63130-4899
[email protected]

Abstract

For many machine learning applications it is important to develop algorithms that use both labeled and unlabeled data. We present democratic co-learning, in which multiple algorithms instead of multiple views enable learners to label data for each other. Our technique leverages off the fact that different learning algorithms have different inductive biases and that better predictions can be made by the voted majority. We also present democratic priority sampling, a new example selection method for active learning.

1. Introduction

In many practical learning scenarios there is only a small amount of labeled data (which is often costly to obtain) along with a large pool of unlabeled data. One of many example applications is content-based image retrieval, in which a user (via relevance feedback) labels a small number of images as desirable or undesirable, while an extremely large pool of unlabeled images is available. The goal of the content-based image retrieval system is to determine which images the user finds desirable. We use semi-supervised learning to refer to settings in which unlabeled data is used to augment labeled data when the amount of labeled data is insufficient.

In a single-view semi-supervised method the learner receives a single set of attributes to use for learning. In a multi-view approach (such as co-training [1]), the learner receives two or more independent and redundant sets of attributes, where each view individually is adequate for learning. While there are applications with two such views, there are also many settings in which there are not. Nigam and Ghani [11] showed that co-training has a strong dependence on its assumption of an independent and redundant feature split.

The question we address in this paper is how unlabeled data can be used to improve the accuracy of supervised learning algorithms in situations when:

– only a small amount of labeled data is available,
– there is a large pool of unlabeled data, and
– there are not two independent and redundant sets of attributes.

Our work replaces the need for two attribute sets by leveraging the fact that different learning algorithms have different inductive biases even when seeing the same data.

Our work is motivated, in part, by the empirical success of ensemble methods (e.g. boosting [8] or bagging [9]) in which individual classifiers are trained from different training sets using re-sampling techniques on the labeled data. There are two important questions we must address:

1. How can one create the set of hypotheses to combine to obtain better accuracy, given that there is not enough labeled data to apply re-sampling techniques?

2. How can one make use of the large unlabeled pool of data?

In our work, we use an ensemble-style approach, but rather than creating the classifiers with a single algorithm run on different subsets of the labeled data (which is not an option because of the limited amount of labeled data), we instead run different algorithms on the same set of data. Also, ensemble methods do not use unlabeled data as an additional source of knowledge, but rather are designed for settings in which

there is a sufficient source of labeled data but only weak learning algorithms. Our early work [7] demonstrates that two different algorithms can successfully label data from the unlabeled pool for each other. More recently, such an approach has been successfully applied to content-based image retrieval [19].

We present democratic co-learning, a new single-view semi-supervised technique that can be used for applications without two independent and redundant feature sets and that is applicable with a small pool of labeled data. In democratic co-learning, a set of different learning algorithms is employed to train a set of classifiers separately on the labeled data set. The output concepts are combined using weighted voting to predict a label for each unlabeled example. The newly labeled examples are added to the training sets of the classifiers that predict differently than the majority. The process is repeated until no more data can be added to the training set of any classifier. We also present democratic priority sampling to select the examples for which to request labels in active learning. Finally, we obtain active democratic co-learning, which uses democratic priority sampling to select examples to be actively labeled and democratic co-learning to label additional examples.

2. Related Work

Like ensemble methods (e.g. boosting [8] or bagging [9]), democratic co-learning integrates a group of learners to boost the overall accuracy, exploiting differences in the bias between methods or methods that allow locally different models. However, there are fundamental differences in both design and motivation. An ensemble method improves itself by creating random subsets or purposely biased distributions from the training data, which is inapplicable when the amount of training data is small.

In general, the semi-supervised learning problem has been studied in two settings: multi-view and single-view. In a single-view semi-supervised method the learner receives a single set of attributes to use for learning. In a multi-view approach (such as the co-training procedure of Blum and Mitchell [1]), the learner receives two or more independent and redundant sets of attributes, where each view individually is adequate for learning. Democratic co-learning is a new single-view approach.

Expectation-Maximization (EM) [6] can be viewed as a single-view semi-supervised learning algorithm by treating the unlabeled examples as having a hidden variable (the label). Used in this way, EM begins with an initial classifier trained on the labeled examples. It then repeatedly uses the current classifier to temporarily label the unlabeled examples and trains a new classifier on all labeled examples (the original and the newly labeled) until it converges. While the EM algorithm works well when the assumed model of the data holds, violations of these assumptions often result in poor performance [10]. Democratic co-learning differs from other single-view algorithms such as EM [6] in that, like the statistical co-learning algorithm introduced in our early work [7], it uses multiple learning algorithms to serve a role similar to the one multiple views play in co-training.
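To make the retrain-and-relabel loop just described concrete, the sketch below gives a minimal hard-label variant, assuming a scikit-learn-style classifier with fit/predict methods. The function name and the stopping rule are of our own choosing, and EM proper would carry probabilistic (soft) labels rather than the hard labels used here.

    import numpy as np

    def em_style_self_training(clf, X_lab, y_lab, X_unlab, max_iter=50):
        # Hard-label sketch: train on L, temporarily label U, retrain on
        # L plus the temporary labels, and stop once the labels stabilize.
        clf.fit(X_lab, y_lab)
        prev = None
        for _ in range(max_iter):
            y_tmp = clf.predict(X_unlab)          # temporary labels for U
            if prev is not None and np.array_equal(y_tmp, prev):
                break                             # labels stable: converged
            prev = y_tmp
            clf.fit(np.vstack([X_lab, X_unlab]),
                    np.concatenate([y_lab, y_tmp]))
        return clf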
Blum and Mitchell [1] introduced the multi-view semi-supervised learning approach. They make the strong assumption that the instance space can be represented using two different views (i.e. two independent and redundant sets of attributes) and that either view by itself would be sufficient for perfect classification if there were enough labeled data. They presented a co-training algorithm for this situation and gave both empirical and theoretical results evaluating it. While there are settings such as these in which there are two independent (and sufficiently redundant) views, there are also many settings in which such redundant views are not available. Nigam and Ghani [11] have shown that co-training has a strong dependence on its assumption of an independent and redundant feature split. In this paper, we present a new single-view technique, democratic co-learning, that is applicable to settings that violate the assumption of independent and redundant feature sets. Our technique leverages off the fact that different learning algorithms have different inductive biases and that better predictions can be made by the voted majority.

Co-EM [11] integrates co-learning and EM by using the hypothesis learned in one view to probabilistically label the examples in the other view. The primary difference between co-EM and co-training is that, like EM, co-EM assigns a temporary label to each unlabeled example from scratch at each iteration, whereas co-training selects a subset of the unlabeled examples to permanently label. In both cases, the hypothesis obtained from one view is used to perform labeling for the other view.

Two-view EM (2v-EM) [12] aims to demonstrate that the strength of co-training and co-EM does not come merely from combining classifiers learned from different views. 2v-EM performs EM on each view in isolation and then combines the predictions of the hypotheses learned in each view. Using text-categorization benchmarks, they showed that when the requirement of two independent and redundant views is severely violated, 2v-EM can outperform co-training and co-EM.

             single-view          multi-view           single-view
             single learner       single learner       multiple learners

Non-active   EM                   Co-Training          Statistical Co-Learning
                                  Co-EM                Democratic Co-Learning*
                                  2v-EM

Active       QBC (+EM)            Co-Testing           Active Democratic
             Uncertainty          Co-Test(co-EM)       Co-Learning*
             Sampling (+EM)

Table 1. A framework for classifying semi-supervised algorithms.

While democratic co-learning has similarities with statistical co-learning from our earlier work [7], there are major differences. First, statistical co-learning uses two learning algorithms and requires each to output a hypothesis that partitions the domain into equivalence classes. For example, the decision tree output by C4.5 defines one equivalence class per leaf. This assumption limits the applicability of that approach. Also, we used statistical tests to decide when one algorithm should label data for the other, yet the amount of labeled data available was insufficient for applying those tests. Democratic co-learning resolves both of these problems by using an ensemble-like method that reduces the need for statistical tests, enabling it to be applied to any three or more standard supervised learning algorithms.

Some useful insights for our work come from meta-learning. In theory, there is no single learning algorithm that will be superior on all problems [2]. It has also been shown that classifiers with uncorrelated errors may reduce the error rate when using a combined model [5]. Chan and Stolfo [3] considered learning in a distributed setting in which the labeled data is spread over many locations, so that each learning algorithm only sees a subset of the labeled data. While the setting for their research is quite different from ours, it showed that since different learning algorithms use different representations for their hypotheses and have different inductive biases, the underlying strategies embodied by different learning algorithms may complement each other by effectively reducing the space of incorrect classifications of a learned concept [3]. In their multi-algorithm meta-learning strategy [4], Chan et al. provided only a fraction of the labeled data to each base classifier, yet the resulting combined classifier obtained better overall accuracy than a classifier trained from all the available data. One key difference from our work is that they assume each learner only sees a small amount of labeled data because the data is distributed. As in their work, we expect different algorithms to infer different patterns in the data. Another difference is that we use the classifiers not only to boost performance but also to label data in U, increasing the pool of labeled data for the learning algorithms that did not infer the same patterns.

We briefly review work on active learning. Uncertainty sampling [13, 14] repeatedly selects the unlabeled example with the most "uncertain" membership and asks the oracle to provide the correct label; the learning algorithm then rebuilds its hypothesis based on the new training set. Query-by-committee (QBC) [13, 8] measures the degree to which a group of classifiers disagree, rather than using a single classifier to measure the certainty of its classification. In QBC, committee members can be generated on different subsets of the training data, or randomly chosen according to the posterior distribution of possible models given the training data. Instead of basing priorities on the number of disagreements, we consider a variant [15] of QBC where the priority of example x is computed using the entropy of the classifications voted by the committee members,

    priority(x) = -\sum_{j=1}^{m} (v_j / k) \log(v_j / k),

for k the number of committee members, m the total number of labels, and v_j the number of votes for label j. Examples with the highest entropy are selected for labeling.
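As a quick illustration, this vote entropy can be computed directly from the committee's predictions for one example. The following is a minimal sketch in Python; the function name is our own rather than anything prescribed by [15].

    import math
    from collections import Counter

    def vote_entropy(votes):
        # votes: the labels predicted for one example by the k members.
        k = len(votes)
        return -sum((v / k) * math.log(v / k)
                    for v in Counter(votes).values())

    # A 2/2 split is more uncertain than a 3/1 split, so it gets priority.
    assert vote_entropy(["a", "a", "b", "b"]) > vote_entropy(["a", "a", "a", "b"])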
Co-testing [12] is an active multi-view learning approach that repeatedly trains one hypothesis for each view and selects as a query an unlabeled example on which the two hypotheses predict differently (a contention point). The contention point on which the combined prediction of the two classifiers is least confident is selected. Co-Test(co-EM) [16] combines co-testing and co-EM to obtain an active multi-view semi-supervised learning algorithm. Their experiments show that Co-Test(co-EM) outperforms other non-active multi-view algorithms without using more labeled data and is better able to overcome violations of the assumption of two independent and redundant views.

Table 1 classifies semi-supervised techniques based on whether they use a single-view or multi-view approach and on whether active learning is used. Our new contributions are marked with an asterisk.

3. Democratic Co-Learning

We now present democratic co-learning. Let L be the set of labeled data, U the set of unlabeled data, and A_1, ..., A_n (for n >= 3) the provided supervised learning algorithms.¹ Democratic co-learning begins by training all n learners on the original labeled data set L. For every example x in the unlabeled data set U, each learner predicts a label c_1(x), ..., c_n(x) for x. Let c be the majority prediction. In Section 3.1, we introduce several labeling criteria that must be satisfied before example x will be labeled with c for the learners that did not predict c for x. All n learners are then re-trained using the updated training data, and this process is repeated until no more data is selected for labeling. The final hypothesis makes predictions using a variant of a weighted majority vote among the n learners (see Section 3.2). The detailed democratic co-learning procedure is shown in Figure 1.

¹ While we describe democratic co-learning for any number of supervised learning algorithms, in our empirical work we only consider n = 3.

    Input: labeled data L, unlabeled data U,
           the n different learning algorithms A_1, ..., A_n
    For i = 1, ..., n
        L_i = L                  /* labeled data for A_i */
        e_i = 0                  /* estimate for # mislabeled exs in L_i */
    Repeat until none of L_1, ..., L_n change
        For i = 1, ..., n
            Run learner A_i with data L_i to compute hypothesis H_i
        For each unlabeled example x
            Collect the votes H_1(x), ..., H_n(x) over the possible
            labels c_1, ..., c_r and let c be the majority prediction
        /* --- Choose which exs to propose for labeling --- */
        For i = 1, ..., n
            Use L to compute a 95%-conf. interval [l_i, h_i] for H_i
            w_i = (l_i + h_i) / 2
        For i = 1, ..., n
            L_i' = {}            /* data proposed for adding to L_i */
        For each x in U
            If the sum of w_i over the learners voting for the majority
               label c exceeds the sum of w_i over the minority groups
                Add (x, c) to L_i' for each i such that H_i(x) != c
        /* --- Estimate if adding L_i' to L_i improves accuracy --- */
        For i = 1, ..., n
            e_i' = estimated # of mislabeled exs in L_i'
                   (from the confidence of the majority group)
                                 /* est. of new error rate */
            q_i  = |L_i| (1 - 2 e_i / |L_i|)^2
            q_i' = (|L_i| + |L_i'|)(1 - 2 (e_i + e_i')/(|L_i| + |L_i'|))^2
                                 /* if L_i' added */
            If q_i' > q_i
                L_i = L_i ∪ L_i'
                e_i = e_i + e_i'
    Return Combine(H_1, ..., H_n)

    Figure 1. Democratic co-learning.
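For readers who prefer running code, the following is a simplified sketch of the outer loop in Python, assuming scikit-learn-style learners with fit/predict; the names and simplifications are ours. It keeps only the majority-vote part of the labeling rule; the confidence weighting of Section 3.1 and the error-rate test of Figure 1 are left out here and sketched separately below.

    import numpy as np
    from collections import Counter

    def democratic_colearning(learners, X_lab, y_lab, X_unlab, max_rounds=10):
        # Each learner keeps its own growing training set L_i; an example
        # from U is proposed to learner i when i disagrees with a majority.
        n = len(learners)
        L = [(X_lab, y_lab)] * n
        taken = [set() for _ in range(n)]    # indices already given to i
        for _ in range(max_rounds):
            hyps = [clf.fit(Xi, yi) for clf, (Xi, yi) in zip(learners, L)]
            preds = np.array([h.predict(X_unlab) for h in hyps])
            changed = False
            for i in range(n):
                add = []
                for j in range(X_unlab.shape[0]):
                    label, count = Counter(preds[:, j]).most_common(1)[0]
                    if count > n / 2 and preds[i, j] != label and j not in taken[i]:
                        add.append((j, label))
                        taken[i].add(j)
                if add:
                    idx = [j for j, _ in add]
                    Xi, yi = L[i]
                    L[i] = (np.vstack([Xi, X_unlab[idx]]),
                            np.concatenate([yi, [lab for _, lab in add]]))
                    changed = True
            if not changed:
                break                        # no L_i changed this round
        return hyps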
    3.2. Combining
3.1. Labeling Criteria

No unlabeled example is labeled by one learner for another unless a majority of the learners agree on the label. In addition to this majority vote requirement, we also require that the sum of the mean confidence values of the learners in the majority group be greater than the sum of the mean confidence values of the learners in the minority groups, where the mean confidence of a learner is (l + h)/2 for l and h defined by the 95%-confidence interval [l, h]. We have performed experiments with 90% and 99% confidence intervals and the results were very similar. Using a vote weighted by a measure of confidence eliminates the possibility that a majority of learners make the same wrong prediction, each with very low confidence. For example, suppose there are three co-learners in a binary classification problem. One learner predicts "positive" for unlabeled example x with 99% confidence and the other two predict x is "negative", each with a confidence of 30%. In this case, we would not want to let the two learners that predict x is negative label x for the learner predicting x is positive.

In order to balance the benefit of adding more labeled examples to the training data against the increase in noise rate that may occur in the labels, we use the same tests as in our earlier work to estimate whether the increase in the labeled data is sufficient to compensate for the increase in the number of mislabeled examples. The details can be seen in Figure 1.
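The two quantities driving these decisions can be sketched as follows. The paper does not pin down how the 95% confidence interval for a hypothesis is computed, so the normal approximation for a binomial proportion is assumed here; the q criterion follows the |L|(1 - 2e/|L|)^2 form indicated in Figure 1.

    import math

    def conf_interval(correct, total, z=1.96):
        # Assumed normal-approximation 95% interval for a learner's accuracy.
        p = correct / total
        m = z * math.sqrt(p * (1 - p) / total)
        return max(0.0, p - m), min(1.0, p + m)

    def mean_confidence(correct, total):
        # w = (l + h) / 2, the midpoint used to weight each learner's vote.
        l, h = conf_interval(correct, total)
        return (l + h) / 2

    def worth_adding(size_L, e, size_new, e_new):
        # Accept the proposed labels only if the enlarged, noisier training
        # set still scores higher under q = |L| * (1 - 2e/|L|)^2.
        q = size_L * (1 - 2 * e / size_L) ** 2
        q_new = (size_L + size_new) * (
            1 - 2 * (e + e_new) / (size_L + size_new)) ** 2
        return q_new > q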
3.2. Combining

The easiest way to create the final hypothesis is to use a standard majority vote among the possible class labels. In order to combine better, in addition to the number of votes for each label, we also consider each individual classifier's confidence value (as measured by the mean of the 95%-confidence interval²) in its prediction. We partition the classifiers into groups, one for each possible label, and use an m-estimator to adjust the average of the mean confidence values of each group so that a group's average mean confidence is discounted more if the group is smaller. Let n_j be the size of group G_j and w_j its average mean confidence. Based on some preliminary experiments not reported here, we use a Laplace correction of (n_j w_j + 0.5)/(n_j + 1) to avoid a zero frequency of votes and to bias towards a voting power of 0.5 when the group size is too small. The group of classifiers with the highest discounted confidence value is used to predict for the example. When the confidence values of the classifiers within a group have a large variance, the adjustment made above may not be effective. Hence, we ignore any classifier whose confidence value is less than 50%. See Figure 2.

² Again, empirically we found little difference when using either a 90% or 99%-confidence interval.

    Combine(H_1, ..., H_n)
    For i = 1, ..., n
        Use L_i to compute a 95%-conf. interval [l_i, h_i] for H_i
        w_i = (l_i + h_i) / 2
    For each example x in the instance space
        For j = 1, ..., r
            If H_i predicts c_j and w_i > 0.5
                Allocate H_i to group G_j
        For j = 1, ..., r
            /* compute group average mean confidence */
            n_j = |G_j|
            W_j = (n_j avg_{H_i in G_j} w_i + 0.5) / (n_j + 1)
        Predict the label c_j with the maximum W_j for j = 1, ..., r
    Return

    Figure 2. Combine procedure.
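A direct transcription of Figure 2 into Python might look as follows; the exact form of the Laplace correction is an assumption on our part, chosen to match the stated behavior of discounting small groups toward a voting power of 0.5.

    from collections import defaultdict

    def combine(predictions, confidences):
        # predictions[i]: the label H_i predicts; confidences[i]: its w_i.
        groups = defaultdict(list)
        for label, w in zip(predictions, confidences):
            if w > 0.5:                     # ignore low-confidence classifiers
                groups[label].append(w)
        best, best_score = None, float("-inf")
        for label, ws in groups.items():
            n_j, avg = len(ws), sum(ws) / len(ws)
            score = (n_j * avg + 0.5) / (n_j + 1)  # assumed Laplace correction
            if score > best_score:
                best, best_score = label, score
        return best

    # The example from Section 3.1: one 99%-confident "pos" vote beats two
    # 30%-confident "neg" votes (the latter fall below the 0.5 threshold).
    assert combine(["pos", "neg", "neg"], [0.99, 0.30, 0.30]) == "pos"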

4. Democratic Priority Sampling

In this section, we present democratic priority sampling. As in democratic co-learning, we begin by using the labeled data to train the n different learners to obtain the n classifiers H_1, ..., H_n. One possible way to then select the example to actively label would be to use the vote entropy as in QBC. However, we also want to incorporate the confidence of each individual classifier into the priority estimate. Hence we define a confidence-weighted vote entropy, computed by weighting the vote entropy by the mean confidence of the classifiers. We did test using an unweighted majority but obtained better results using a weighted majority vote.

More formally, let r be the number of different labels and let S_j contain the set of classifiers among H_1, ..., H_n that predict label c_j for x. We define the priority of unlabeled example x as

    priority(x) = -\sum_{j=1}^{r} (W_j / W) \log(W_j / W),   W_j = \sum_{H_i \in S_j} w_i,

where w_i = (l_i + h_i)/2 is the mean of the 95%-confidence interval of H_i and W = \sum_{i=1}^{n} w_i. The example with the highest priority is given to an expert for labeling. Then the hypotheses are recomputed using the larger pool of labeled data and the process is repeated.

While there are many similarities between democratic priority sampling and QBC, there are two key differences. First, the committee members are obtained by using different learning algorithms rather than the same learning algorithm trained on different data. Second, we use a weighted variant of vote entropy to incorporate the confidence estimates into the priorities.
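In code, the confidence-weighted vote entropy is a small variation on the plain vote entropy of Section 2. This sketch assumes the per-classifier weights w_i have already been computed as confidence-interval midpoints.

    import math
    from collections import defaultdict

    def priority(labels, weights):
        # labels[i]: H_i's prediction for x; weights[i]: its w_i.
        W = sum(weights)
        Wj = defaultdict(float)
        for label, w in zip(labels, weights):
            Wj[label] += w                   # W_j = sum of w_i over S_j
        return -sum((wj / W) * math.log(wj / W) for wj in Wj.values())

    # The unlabeled example maximizing priority(...) is sent to the expert,
    # the hypotheses are retrained, and the process repeats.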
5. Empirical Results

In this section we present our empirical results. As the three base learning algorithms we use naive Bayes (NB), C4.5 [17], and 3-nearest neighbor (3-NN) [18]. In all of our experiments, we compute the reported accuracy using a test set that is roughly the same size as the unlabeled pool.

5.1. Non-active Co-learning

We present results for the non-active setting. In the left plot in Figure 3, we compare democratic co-learning with naive Bayes, C4.5, 3-NN, and the results of using combining alone on the DNA data set. Democratic co-learning outperforms the three individual algorithms, and the gain was not achieved by simply combining the predictions made by the three learners. In the right plot in Figure 3, we compare democratic co-learning with naive Bayes, C4.5, and 3-NN when each is combined with EM to use the unlabeled data. In each of these plots |L| varies between 35 and 100, with an independent run of each algorithm performed for each integer in this range. The purpose of these experiments is to evaluate how the performance of these methods is affected by varying the size of the pool of labeled data. Across all values of |L| we tested, democratic co-learning outperforms the three individual algorithms when they are combined with EM to make use of the unlabeled data. Notice that EM may have a negative impact on poor classifiers trained over insufficient labeled data.

We now consider a single value of |L| over a variety of data sets. We show the performance of each base algorithm, as well as the performance when we just use the combining method of democratic co-learning, to demonstrate that we are making use of the unlabeled data as opposed to having our gains come from the ensemble of the three base algorithms. We also compared our work to other semi-supervised learning algorithms. For statistical co-learning we use naive Bayes and C4.5, since they generally perform better than 3-NN. To create a hypothesis from naive Bayes that partitions the input domain as required by statistical co-learning, we take all of the data in U, label it according to the naive Bayes hypothesis, and then use C4.5 to create the equivalence classes (one per leaf). We use eight of the UCI³ benchmark data sets. For all data sets |L| = 40, except for the adult data set where |L| = 60. Table 2 shows other statistics about the data sets. We created 20 different data sets by randomly partitioning the data into L, U, and the test data. In addition, we picked random partitions in which democratic co-learning labeled at least one example in U.

³ http://www.ics.uci.edu/~mlearn/MLRepository.html

[Two plots of accuracy versus |L|: the left compares DemoCoL with Naive Bayes, C4.5, 3-NN, and Combine Only; the right compares DemoCoL with EM-Naive Bayes, EM-C4.5, and EM-3NN.]

Figure 3. Results on DNA data. The x-axis is |L| and the y-axis is the accuracy.

data     #      exs from U labeled for
set      atts   NB     C4.5   3-NN    |L|   |U|    avg. # rounds
flare    10     108    151    80      40    515    2.7
monk2    6      40     84     40      40    193    2.3
vote     16     66     40     40      40    200    2.2
DNA      180    367    289    432     40    1588   2.8
cancer   9      59     40     45      40    124    2.1
adult    14     413    130    353     60    1691   2.6
3-of-9   9      40     91     40      40    234    2.3
xd6      9      86     115    40      40    463    2.7

Table 2. A summary of the amount of data labeled by democratic co-learning.

A key contribution of our work is a semi-supervised learning technique that can be applied when there are not such independent and redundant sets of attributes. Since the work on two-view approaches generally only reports results on data sets that naturally have two appropriate feature sets, comparing our work to those approaches requires that we re-implement their work. We have elected to do this for the Blum and Mitchell co-training procedure [1], which we refer to as two-view co-training. In order to create two views, we randomly partition the features into two sets and then treat these as our two views, as done by Nigam and Ghani [11]. We also tested how sensitive the performance of two-view co-training is to the random choice of the partition of the features: for each of the UCI data sets we fixed the choice of which examples to place in L, U, and the test set and then randomly picked 20 different random partitions of the features into two sets. For these we found a standard deviation of anywhere from 0.03 to 0.06. A random split of this kind can be produced as sketched below.
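The partition takes only a few lines; this sketch uses NumPy, with the function name and the caller-supplied seed being our own choices.

    import numpy as np

    def random_two_view_split(X, seed=None):
        # Shuffle the feature indices and cut them in half to form two
        # artificial "views" for two-view co-training.
        rng = np.random.default_rng(seed)
        perm = rng.permutation(X.shape[1])
        half = X.shape[1] // 2
        return X[:, perm[:half]], X[:, perm[half:]]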
Finally, we present results obtained by using EM with each of the three base algorithms. To create a measure of the best performance one could expect for the given data sets, the column labeled "data in U labeled" shows the best result obtained among any of the base algorithms (naive Bayes, C4.5, and 3-NN) when all examples in U are correctly labeled and placed in L. Due to the small size of L, and therefore the considerable variation in performance, a paired t-test is used to determine the statistical significance of the difference made by democratic co-learning. Our results are shown in Table 3. The value in parentheses is the paired t-test value between democratic co-learning and that method. A positive value indicates that democratic co-learning performed better. Any value greater than 2.093 is statistically significant at the 95% confidence level. All values greater than 2.861 are also statistically significant at the 99% level, and all values greater than 3.8834 are statistically significant at the 99.9% level. Due to space constraints, the standard deviation is only shown for democratic co-learning.
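The significance test itself is standard; a minimal sketch with SciPy is shown below, on hypothetical accuracy arrays standing in for the 20 per-partition results.

    import numpy as np
    from scipy.stats import ttest_rel

    # Hypothetical per-partition test accuracies over 20 random splits.
    acc_demo = np.array([0.82, 0.79, 0.85, 0.81, 0.80] * 4)
    acc_other = np.array([0.78, 0.77, 0.83, 0.79, 0.78] * 4)

    # A positive t favors democratic co-learning; with 19 degrees of
    # freedom, t > 2.093 is significant at the 95% level, t > 2.861 at
    # the 99% level, and t > 3.8834 at the 99.9% level (two-tailed).
    t_stat, p_value = ttest_rel(acc_demo, acc_other)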

Algorithm                flare      monk2      vote       DNA
Demo. Co-Learning
Combining Only           (1.997)    (4.798)    (2.042)    (7.025)
Data in U Labeled
Statistical Co-Learning  (2.112)    (0.993)    (2.879)    (3.566)
Two-View Co-Training     (5.841)    (4.854)    (9.959)    (-1.966)
EM-NB                    (4.930)    (6.210)    (8.285)    (13.852)
EM-C4.5                  (2.612)    (10.846)   (1.630)    (6.740)
EM-3NN                   (5.303)    (1.000)    (12.157)   (8.693)
NB                       (5.714)    (6.425)    (7.546)    (13.350)
C4.5                     (2.612)    (10.846)   (1.630)    (6.740)
3NN                      (5.333)    (2.836)    (9.259)    (8.584)

Algorithm                cancer     adult      3-of-9     xd6
Demo. Co-Learning
Combining Only           (2.697)    (4.154)    (5.181)    (2.667)
Data in U Labeled
Statistical Co-Learning  (0.386)    (2.781)    (5.221)    (3.484)
Two-View Co-Training     (2.752)    (-0.280)   (9.931)    (7.175)
EM-NB                    (5.351)    (4.358)    (14.386)   (22.433)
EM-C4.5                  (1.462)    (7.452)    (7.808)    (6.580)
EM-3NN                   (-1.40)    (2.630)    (4.944)    (13.638)
NB                       (8.923)    (3.927)    (4.816)    (4.680)
C4.5                     (1.462)    (7.398)    (7.881)    (6.111)
3NN                      (3.206)    (4.091)    (4.116)    (3.239)

Table 3. Our non-active learning results.

As compared to combining alone, democratic co-learning performs better at the 95% confidence level for 6 of the 8 data sets and at the 90% confidence level for the other two data sets. So democratic co-learning is making use of the unlabeled data and not just benefiting from the use of an ensemble method of combining. As compared to the other 5 semi-supervised methods, democratic co-learning performs statistically significantly better at the 95% level in 32 of the 40 tests we performed. (In fact, in 27 of the 40 tests, our improvements are statistically significant at the 99% confidence level.) Of the 8 tests in which the difference in performance was not statistically significant, democratic co-learning performed better in all but two.

5.2. Active Co-learning

Table 4 shows our active learning results. For uncertainty sampling we use naive Bayes, where the normalized probability measure of naive Bayes is used to give an uncertainty value. For QBC we use a committee whose members are each trained with naive Bayes on a random subset (without replacement) of the examples from L. The active learning is used to select 40 additional examples to have labeled.

We first show the best result obtained among the base algorithms when all data in U is properly labeled. Next we compare democratic priority sampling (with no use of the unlabeled data except in serving as a pool of data for which labels may be requested) with QBC and uncertainty sampling. For QBC and uncertainty sampling, we show the paired t-test value with respect to democratic priority sampling. Finally, we compare the following active and semi-supervised algorithms: active democratic co-learning, co-testing, and Co-Test(co-EM), showing the paired t-test values with respect to active democratic co-learning.

For the active approaches in which the unlabeled data is only used as a pool for the active learner, democratic priority sampling performed better on 5 of the 8 data sets than each of QBC and uncertainty sampling, but only 2 of these 5 cases (for each data set) were statistically significant at the 95% level. We are currently repeating these experiments using 20 different random choices for L, U, and the test data, and we believe that we will find statistically significant improvements in more cases. For the active semi-supervised algorithms, democratic co-learning performed better than each of co-testing and Co-Test(co-EM) on 5 of the 8 data sets, with 4 of the 5 (for each data set) being statistically significant at the 95% level.

Algorithm                flare      monk2      vote       DNA
Data in U Labeled
Demo. Priority Samp.
Query By Committee       (4.654)    (1.165)    (2.190)    (-2.631)
Uncertainty Sampling     (0.246)    (0.631)    (2.120)    (0.703)
Active Demo. Co-Learn.
CoTesting                (-0.714)   (0.000)    (2.581)    (5.160)
Co-Test(co-EM)           (-0.805)   (-0.690)   (4.493)    (2.189)

Algorithm                cancer     adult      3-of-9     xd6
Data in U Labeled
Demo. Priority Samp.
Query By Committee       (0.753)    (-0.708)   (1.616)    (3.105)
Uncertainty Sampling     (-3.903)   (-0.717)   (2.175)    (2.933)
Active Demo. Co-Learn.
CoTesting                (-0.051)   (5.086)    (4.563)    (8.610)
Co-Test(co-EM)           (-0.941)   (4.695)    (7.260)    (14.529)

Table 4. Our active learning results.

We also ran a paired t-test between democratic priority sampling and active democratic co-learning. For the 3-of-9 and DNA data sets the improvement of active democratic co-learning was statistically significant at the 95% level, and for the vote and xd6 data sets the improvement of active democratic co-learning was statistically significant at the 90% level. For the flare and monk2 data sets there really is not much room for improvement. Similarly, in comparing the performance between democratic co-learning and active democratic co-learning, the use of active learning generally improved the performance on data sets in which the performance of democratic co-learning was not already close to that obtained when all data in U is given the proper label.

6. Concluding Remarks

We have demonstrated that democratic co-learning, a single-view, multiple-algorithm semi-supervised learning technique, is statistically superior to many semi-supervised learning approaches when there are not two sufficiently independent and redundant sets of attributes. Using data from the UCI repository, we have compared the performance of democratic co-learning to combining alone (without using the unlabeled data) and to other single-view and multi-view semi-supervised learning algorithms. Democratic co-learning performed better at the 95% confidence level in 38 of the 48 tests that we performed in the non-active learning setting. For the other 10 tests there was no significant difference in performance between democratic co-learning and the other approaches studied.

In general, co-learning works well if the estimated mean confidence reflects which learner is better and when the multiple classifiers are good in different regions, enabling them to classify data for each other. Finally, there needs to be room for improvement by at least one of the supervised learning algorithms if it received more correctly labeled data. Democratic co-learning also outperformed each of the three individual algorithms when combined with EM, and by picking learners that work in very different ways, we can increase the diversity needed for them to be able to label data for each other.

References

[1] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proc. of the 11th Annual Conf. on Computational Learning Theory (1998) 92–100.
[2] Schaffer, C.: A conservation law for generalization performance. In: Proc. of the 11th Int. Conf. on Machine Learning, San Mateo: Morgan Kaufmann (1994) 259–265.
[3] Chan, P. K., Stolfo, S.: On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Integration of Information 8(1) (1998) 5–28.
[4] Chan, P. K., Stolfo, S.: Scaling learning by meta-learning over disjoint and partially replicated data. In: Proc. of the 9th Florida AI Research Symposium (1996) 151–155.
[5] Ali, K., Pazzani, M.: Error reduction through learning multiple descriptions. Machine Learning 24 (1996) 173–202.
[6] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.
[7] Goldman, S., Zhou, Y.: Enhancing supervised learning with unlabeled data. In: Proc. of the 17th Int. Conf. on Machine Learning, San Francisco: Morgan Kaufmann (2000) 327–334.
[8] Freund, Y., Seung, H., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Machine Learning 28 (1997) 133–168.
[9] Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140.

Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004)
1082-3409/04 $20.00 © 2004 IEEE
Authorized licensed use limited to: ASU Library. Downloaded on July 05,2022 at 21:26:23 UTC from IEEE Xplore. Restrictions apply.
[10] Nigam, K., McCallum, A. K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39 (2000) 103–134.
[11] Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proc. of the 9th Int. Conf. on Information and Knowledge Management (2000) 86–93.
[12] Muslea, I., Minton, S., Knoblock, C.: Selective sampling with redundant views. In: Proc. of AAAI-2000 (2000) 621–626.
[13] Seung, H. S., Opper, M., Sompolinsky, H.: Query by committee. In: Proc. of the ACM Workshop on Computational Learning Theory (1992) 287–294.
[14] Lewis, D. D., Gale, A. W.: A sequential algorithm for training text classifiers. In: Proc. of the Special Interest Group on Info. Retrieval, AAAI Press and MIT Press (1994) 3–12.
[15] Dagan, I., Engelson, S.: Committee-based sampling for training probabilistic classifiers. In: Proc. of the 12th Int. Conf. on Machine Learning, San Francisco: Morgan Kaufmann (1995) 150–157.
[16] Muslea, I., Minton, S., Knoblock, C.: Selective sampling + semi-supervised learning = robust multi-view learning. In: IJCAI-01 Workshop on Text Learning: Beyond Supervision (2001).
[17] Quinlan, R.: Induction of decision trees. Machine Learning 1 (1986) 81–106.
[18] Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1967) 21–27.
[19] Zhou, Z., Chen, K., Jiang, Y.: Exploiting unlabeled data in content-based image retrieval. In: Proc. of the 15th European Conf. on Machine Learning (2004).

