Democratic Co-Learning
Yan Zhou
School of Computer and Information Sciences
University of South Alabama
Mobile, AL 36688
[email protected]
Sally Goldman
Department of Computer Science and Engineering
Washington University
St. Louis, MO 63130-4899
[email protected]
Abstract

For many machine learning applications it is important to develop algorithms that use both labeled and unlabeled data. We present democratic co-learning, in which multiple algorithms, rather than multiple views, enable learners to label data for each other. Our technique leverages the fact that different learning algorithms have different inductive biases and that better predictions can be made by the voted majority. We also present democratic priority sampling, a new example selection method for active learning.

1. Introduction

In many practical learning scenarios there is only a small amount of labeled data (which is often costly to obtain) along with a large pool of unlabeled data. One of many example applications is content-based image retrieval, in which a user (via relevance feedback) labels a small number of images as desirable or undesirable, yet an extremely large pool of unlabeled images is available. The goal of the content-based image retrieval system is to determine which images the user finds desirable. We use semi-supervised learning to refer to settings in which unlabeled data is used to augment labeled data when the amount of labeled data is insufficient.

In a single-view semi-supervised method the learner receives a single set of attributes to use for learning. In a multi-view approach (such as co-training [1]), the learner receives two or more independent and redundant sets of attributes, where each view individually is adequate for learning. While there are applications with two such views, there are also many settings in which there are not. Nigam and Ghani [11] showed that co-training depends strongly on its assumption of an independent and redundant feature split.

The question we address in this paper is how unlabeled data can be used to improve the accuracy of supervised learning algorithms in situations when:

– only a small amount of labeled data is available,
– there is a large pool of unlabeled data, and
– there are not two independent and redundant sets of attributes.

Our work replaces the need for two attribute sets by leveraging the fact that different learning algorithms have different inductive biases even when seeing the same data. Our work is motivated, in part, by the empirical success of ensemble methods (e.g. boosting [8] or bagging [9]), in which individual classifiers are trained on different training sets obtained by re-sampling the labeled data. There are two important questions we must address:

1. How can one create the set of hypotheses to combine to obtain better accuracy given that there is not enough labeled data to apply re-sampling techniques?

2. How can one make use of the large pool of unlabeled data?

In our work, we use an ensemble-style approach, but rather than creating the classifiers by running a single algorithm on different subsets of the labeled data (which is not an option because of the limited amount of labeled data), we instead run different algorithms on the same set of data. Also, ensemble methods do not use unlabeled data as an additional source of knowledge; rather, they are designed for settings in which there is a sufficient supply of labeled data but only weak learning algorithms.
Our early work [7] demonstrates that two different algorithms can successfully label data from the unlabeled pool for each other. More recently, such an approach has been successfully applied to content-based image retrieval [19].

We present democratic co-learning, a new single-view semi-supervised technique that can be used for applications without two independent and redundant feature sets and that is applicable with a small pool of labeled data. In democratic co-learning, a set of different learning algorithms is employed to train a set of classifiers separately on the labeled data set. The output concepts are combined using weighted voting to predict a label for each unlabeled example. A newly labeled example is added to the training sets of the classifiers that predict differently than the majority. The process is repeated until no more data can be added to the training set of any classifier. We also present democratic priority sampling to select the examples for which to request labels in active learning. Finally, we obtain active democratic co-learning, which uses democratic priority sampling to select examples to be actively labeled and uses democratic co-learning to label additional examples.
2. Related Work

Like ensemble methods (e.g. boosting [8] or bagging [9]), democratic co-learning integrates a group of learners to boost the overall accuracy, exploiting differences in the bias between methods or methods that allow locally different models. However, there are fundamental differences in both design and motivation. An ensemble method improves itself by creating random subsets or purposely biased distributions from the training data, which is inapplicable when the amount of training data is small.

In general the semi-supervised learning problem has been studied in two settings: multi-view and single-view. In a single-view semi-supervised method the learner receives a single set of attributes to use for learning. In a multi-view approach (such as the co-training procedure of Blum and Mitchell [1]), the learner receives two or more independent and redundant sets of attributes where each view individually is adequate for learning. Democratic co-learning is a new single-view approach.

Expectation-Maximization (EM) [6] can be viewed as a single-view semi-supervised learning algorithm by treating the label of each unlabeled example as a hidden variable. Used in this way, EM begins with an initial classifier trained on the labeled examples. It then repeatedly uses the current classifier to temporarily label the unlabeled examples and trains a new classifier on all labeled examples (the original and the newly labeled) until it converges. While the EM algorithm works well when the assumed model of the data holds, violations of these assumptions often result in poor performance [10]. Democratic co-learning differs from other single-view algorithms such as EM [6] in that, like the statistical co-learning algorithm introduced in our early work [7], it uses multiple learning algorithms to serve a role similar to the one that multiple views provide in co-training.
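As a concrete illustration of this iterate-until-convergence loop, here is a minimal sketch. It is a hard-label variant (EM proper would carry the posterior probabilities as soft labels), and it assumes a scikit-learn-style classifier; it is not the implementation evaluated in this paper.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_style_self_labeling(X_labeled, y_labeled, X_unlabeled, max_iters=50):
    """EM-style semi-supervised loop: temporarily label the unlabeled
    pool with the current classifier, retrain on everything, and stop
    once the temporary labels no longer change."""
    clf = GaussianNB().fit(X_labeled, y_labeled)
    prev = None
    for _ in range(max_iters):
        temp = clf.predict(X_unlabeled)          # temporary labels
        if prev is not None and np.array_equal(temp, prev):
            break                                # converged
        prev = temp
        X_all = np.vstack([X_labeled, X_unlabeled])
        y_all = np.concatenate([y_labeled, temp])
        clf = GaussianNB().fit(X_all, y_all)     # retrain from scratch
    return clf
```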
Blum and Mitchell [1] introduced the multi-view semi-supervised learning approach. They make the strong assumption that the instance space can be represented using two different views (i.e. two independent and redundant sets of attributes) and that either view by itself would be sufficient for perfect classification if there were enough labeled data. They presented a co-training algorithm for this situation and gave both empirical and theoretical results evaluating it. While there are settings in which two such independent (and sufficiently redundant) views exist, there are also many settings in which redundant views are not available. Nigam and Ghani [11] have shown that co-training depends strongly on its assumption of an independent and redundant feature split. In this paper, we present a new single-view technique, democratic co-learning, that is applicable to settings that violate the assumption of independent and redundant feature sets. Our technique leverages the fact that different learning algorithms have different inductive biases and that better predictions can be made by the voted majority.

Co-EM [11] integrates co-training and EM by using the hypothesis learned in one view to probabilistically label the examples in the other view. The primary difference between co-EM and co-training is that, like EM, co-EM assigns a temporary label to each unlabeled example from scratch at each iteration, whereas co-training selects a subset of the unlabeled examples to permanently label. In both cases, the hypothesis obtained from one view is used to perform labeling for the other view.

Two-view EM (2v-EM) [12] aims to demonstrate that the strength of co-training and co-EM does not come merely from combining classifiers learned from different views. 2v-EM performs EM on each view in isolation and then combines the predictions of the hypotheses learned in each view. Using text-categorization benchmarks, its authors showed that when the requirement of two independent and redundant views is severely violated, 2v-EM can outperform co-training and co-EM.
              single-view        multi-view        single-view
              single learner     single learner    multiple learners

  Non-active  EM                 Co-Training       Statistical Co-Learning
                                 Co-EM             Democratic Co-Learning*
                                 2v-EM

  Active      QBC (+EM)          Co-Testing        Active Democratic
              Uncertainty        Co-Test(co-EM)    Co-Learning*
              Sampling (+EM)

  Table 1. Semi-supervised learning techniques classified by view and by the use of active learning (* = our new contributions).
While democratic co-learning has similarities with the statistical co-learning of our earlier work [7], there are major differences. First, statistical co-learning uses two learning algorithms and requires each to output a hypothesis that partitions the domain into equivalence classes. For example, the decision tree output by C4.5 defines one equivalence class per leaf. This assumption limits the applicability of that approach. Also, we used statistical tests to decide when one algorithm should label data for the other, yet the amount of labeled data available was insufficient for applying those tests. Democratic co-learning resolves both of these problems by using an ensemble-like method that reduces the need for statistical tests, enabling it to be applied with any three or more standard supervised learning algorithms.

Some useful insights for our work come from meta-learning. In theory, there is no single learning algorithm that is superior on all problems [2]. It has also been shown that classifiers with uncorrelated errors may reduce the error rate when combined into a single model [5]. Chan and Stolfo [3] considered learning in a distributed setting in which the labeled data is distributed over many locations, so each learning algorithm sees only a subset of the labeled data. While the setting for their research is quite different from ours, they showed that since different learning algorithms use different representations for their hypotheses and have different inductive biases, the underlying strategies embodied by different learning algorithms may complement each other by effectively reducing the space of incorrect classifications of a learned concept [3]. In their multi-algorithm meta-learning strategy [4], Chan et al. provided only a fraction of the labeled data to each base classifier, yet the resulting combined classifier obtained a better overall accuracy than a classifier trained from all the available data. One key difference from our work is that they assume each learner sees only a small amount of labeled data because the data is distributed. As in their work, we expect different algorithms to infer different patterns in the data. Another difference is that we use the classifiers not only to boost performance but also to label data in order to increase the pool of labeled data for the learning algorithms that did not infer the same patterns.

We briefly review work on active learning. Uncertainty sampling [13, 14] repeatedly selects the unlabeled example with the most "uncertain" membership and asks the oracle to provide the correct label. The learning algorithm then rebuilds its hypothesis based on the new training set. Query-by-committee (QBC) [13, 8] measures the degree to which a group of classifiers disagree rather than using a single classifier to measure the certainty of its classification. In QBC, committee members can be generated on different subsets of the training data, or randomly chosen according to the posterior distribution of possible models given the training data. Instead of basing priorities on the number of disagreements, we consider a variant [15] of QBC in which the priority of example x is computed using the entropy of the classifications voted by the committee members:

    priority(x) = -\sum_{j=1}^{c} \frac{V_j}{k} \log \frac{V_j}{k},

where k is the number of committee members, c the total number of labels, and V_j the number of votes for label c_j. Examples with the highest entropy are selected for labeling.
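For illustration, the vote entropy of a single example can be computed directly from the committee's predicted labels (a sketch of our own, not code from [15]):

```python
from collections import Counter
from math import log

def vote_entropy(votes):
    """Vote entropy of one unlabeled example: `votes` is the list of
    labels predicted by the k committee members.  Higher entropy means
    more disagreement, hence higher priority for labeling."""
    k = len(votes)
    return -sum((v / k) * log(v / k) for v in Counter(votes).values())

# Example: a 5-member committee split 3/2 is a better query than 5/0.
assert vote_entropy(["a", "a", "a", "b", "b"]) > vote_entropy(["a"] * 5)
```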
Co-testing [12] is an active multi-view learning approach that repeatedly trains one hypothesis for each view and selects as a query an unlabeled example on which the two hypotheses predict differently (a contention point). The contention point on which the combined prediction of the two classifiers is least confident is selected. Co-Test(co-EM) [16] combines co-testing and co-EM to obtain an active multi-view semi-supervised learning algorithm. Their experiments show that co-Test(co-EM) outperforms other non-active multi-view algorithms without using more labeled data and is better able to overcome violations of the assumption of two independent and redundant views.

Table 1 classifies semi-supervised techniques based on whether they use a single-view or multi-view approach and on whether active learning is used. Our new contributions are marked with an asterisk.

3. Democratic Co-Learning

We now present democratic co-learning. Let L be the set of labeled data, U the set of unlabeled data, and A_1, ..., A_n (for n >= 3) the provided supervised learning algorithms.¹ Democratic co-learning begins by training all learners on the original labeled data set L. For every example x in the unlabeled data set U, each learner predicts a label for x. Let c be the majority prediction. In Section 3.1, we introduce several labeling criteria that must be satisfied before example x is labeled with c for the learners that did not predict c for x.

¹ While we describe democratic co-learning for any number of supervised learning algorithms, in our empirical work we only consider n = 3.
All learners are then re-trained using the updated training data, and this process is repeated until no more data is selected for labeling. The final hypothesis makes predictions using a variant of a weighted majority vote among the n learners (see Section 3.2). The detailed democratic co-learning procedure is shown in Figure 1.

    L is the labeled data, U is the unlabeled data
    A_1, ..., A_n are the different learning algorithms
    For i = 1, ..., n
        L_i = L        /* labeled data for A_i */
        e_i = 0        /* estimate for # mislabeled exs in L_i */
    Repeat until none of L_1, ..., L_n change
        For i = 1, ..., n
            Run learner A_i with data L_i to compute hypothesis H_i
        For each unlabeled example x in U
            For possible labels c_1, ..., c_k
                tally the votes of H_1, ..., H_n for label c_j on x
            let c be the majority prediction for x
        /*-- Choose which exs to propose for labeling --*/
        For i = 1, ..., n
            Use L to compute a 95%-conf. int. [l_i, h_i] for H_i
        For each x in U whose majority label c satisfies the labeling
        criteria of Section 3.1
            add (x, c) to L_i for each learner A_i whose H_i did not
            predict c, and update the estimate e_i

    Figure 1. The democratic co-learning procedure.

3.1. Labeling Criteria

The first criterion requires that the sum of the mean confidence values of the learners in the majority group is greater than the sum of the mean confidence values of the learners in the minority groups, where the mean confidence of learner A_i is w_i = (l_i + h_i)/2, defined by the 95%-confidence interval [l_i, h_i] of its accuracy. We have performed experiments with 90% and 99% confidence intervals and the results were very similar. Using a vote weighted by a measure of confidence eliminates the possibility that a majority of learners make the same wrong prediction, each with very low confidence. For example, suppose there are three co-learners in a binary classification problem. One learner predicts "positive" for unlabeled example x with 99% confidence and the other two predict that x is "negative", each with a confidence of 30%. In this case, we would not want to let the two learners that predict x is negative label x for the learner predicting that x is positive.

In order to balance the benefit of adding more labeled examples to the training data against the increase in the noise rate that may occur in the labels, we use the same tests as in our earlier work to estimate whether the increase in the amount of labeled data is sufficient to compensate for the increase in the number of mislabeled examples. The details can be seen in Figure 1.
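As an illustration of the control flow in Figure 1 and the confidence test above, the following is a minimal sketch. It assumes scikit-learn-style base learners and a cross-validated normal-approximation confidence interval for accuracy, and it omits the test involving the mislabeling estimates e_i, so it is not the full procedure evaluated in this paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def mean_confidence(learner, X, y, z=1.96):
    """w_i = (l_i + h_i) / 2 for a 95%-confidence interval [l_i, h_i]
    on the accuracy of the learner.  The interval here is a
    cross-validated normal approximation clipped to [0, 1] -- an
    assumption of this sketch, not a detail taken from the paper."""
    acc = cross_val_score(clone(learner), X, y, cv=5).mean()
    half = z * np.sqrt(max(acc * (1 - acc), 1e-12) / len(y))
    l, h = max(acc - half, 0.0), min(acc + half, 1.0)
    return (l + h) / 2

def democratic_co_learning(learners, X_lab, y_lab, X_unlab):
    """Simplified loop of Figure 1: each learner keeps its own labeled
    pool; an unlabeled example is pushed to the dissenting learners
    when the majority group's summed mean confidence exceeds that of
    the minority groups."""
    pools = [(X_lab.copy(), y_lab.copy()) for _ in learners]
    labeled = set()           # unlabeled examples already proposed
    changed = True
    while changed:
        changed = False
        fitted = [clone(a).fit(X, y) for a, (X, y) in zip(learners, pools)]
        w = [mean_confidence(a, X, y) for a, (X, y) in zip(learners, pools)]
        for idx, x in enumerate(X_unlab):
            if idx in labeled:
                continue
            votes = [h.predict(x.reshape(1, -1))[0] for h in fitted]
            conf = {}
            for v, wi in zip(votes, w):
                conf[v] = conf.get(v, 0.0) + wi
            c = max(conf, key=conf.get)  # most-confident label
                                         # (head-count majority in the paper)
            # labeling criterion: majority confidence beats the minorities
            if conf[c] > sum(wc for lab, wc in conf.items() if lab != c):
                labeled.add(idx)
                changed = True
                for i, v in enumerate(votes):
                    if v != c:           # dissenting learner gets (x, c)
                        Xi, yi = pools[i]
                        pools[i] = (np.vstack([Xi, [x]]), np.append(yi, c))
    return fitted, w
```

The returned hypotheses and weights would then be fed to the combining procedure of Section 3.2.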
3.2. Combining

The Combine procedure used to form the final hypothesis is shown in Figure 2: the classifiers are grouped by the label they predict for an example, the average mean confidence of each group is computed, and the label of the most confident group is returned.

    Combine(H_1, ..., H_n)
    For i = 1, ..., n
        Use L to compute a 95%-conf. int. [l_i, h_i] for H_i
    For each example x in the instance space
        For i = 1, ..., n
            If H_i predicts c_j for x and ...
                Allocate H_i to group G_j
        For j = 1, ..., k
            /* compute group average mean confidence */
            conf(G_j) = (sum of w_i = (l_i + h_i)/2 over H_i in G_j) / |G_j|
        Predict for x the label c_j of the group G_j with the
        highest average mean confidence
    Return the resulting classifier

    Figure 2. Combine procedure.
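A compact sketch of this grouping rule (following the reconstructed Figure 2 above; the condition lost from the figure fragment is omitted, so treat this as an approximation):

```python
from collections import defaultdict

def combine(predictions, mean_confidences):
    """predictions[i] is H_i's label for x; mean_confidences[i] is
    w_i = (l_i + h_i) / 2.  Classifiers are grouped by predicted label
    and the label of the group with the highest average mean
    confidence wins."""
    groups = defaultdict(list)
    for label, w in zip(predictions, mean_confidences):
        groups[label].append(w)
    return max(groups, key=lambda c: sum(groups[c]) / len(groups[c]))

# Three learners: one confident "pos" beats two unconfident "neg"s.
print(combine(["pos", "neg", "neg"], [0.99, 0.30, 0.30]))  # -> pos
```

This weighted rule reproduces the behavior wanted in the Section 3.1 example: a single high-confidence learner is not outvoted by several low-confidence ones.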
4. Democratic Priority Sampling

Democratic priority sampling uses the labeled data to train the n different learners to obtain the classifiers H_1, ..., H_n. One possible way to then select the example to actively label would be to use the vote entropy as in QBC. However, we also want to incorporate the confidence of each individual classifier in the priority estimate. Hence we define a confidence-weighted vote entropy, computing the vote entropy weighted by the mean confidence of the classifiers. We did test using an unweighted majority but obtained better results using a weighted majority vote.

More formally, let k be the number of different labels and let S_j contain the set of classifiers among H_1, ..., H_n that predict label c_j for x. We define the priority of unlabeled example x as

    priority(x) = -\sum_{j=1}^{k} \frac{W_j}{W} \log \frac{W_j}{W},
    where W_j = \sum_{H_i \in S_j} w_i and W = \sum_{i=1}^{n} w_i,

and w_i = (l_i + h_i)/2 is the mean of the 95%-confidence interval [l_i, h_i] of H_i. The example with the highest priority is given to an expert for labeling. Then the hypotheses are recomputed using the larger pool of labeled data and the process is repeated.

While there are many similarities between democratic priority sampling and QBC, there are two key differences. First, the committee members are obtained by using different learning algorithms rather than the same learning algorithm trained on different data. Second, we use a weighted variant of vote entropy to incorporate the confidence estimates into the priorities.
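Under the reconstruction of the formula above, the only change from the plain vote entropy of Section 2 is that each classifier's vote is weighted by its mean confidence w_i rather than counted. A short sketch:

```python
from math import log

def weighted_vote_entropy(predictions, mean_confidences):
    """Confidence-weighted vote entropy: W_j sums the w_i of the
    classifiers predicting label c_j, and the entropy is taken over
    the distribution W_j / W."""
    W = sum(mean_confidences)
    W_j = {}
    for label, w in zip(predictions, mean_confidences):
        W_j[label] = W_j.get(label, 0.0) + w
    return -sum((wj / W) * log(wj / W) for wj in W_j.values())

# The unlabeled example with the highest priority is sent to the expert.
pool = {"x1": (["a", "a", "b"], [0.9, 0.8, 0.7]),
        "x2": (["a", "a", "a"], [0.9, 0.8, 0.7])}
query = max(pool, key=lambda x: weighted_vote_entropy(*pool[x]))
print(query)  # -> x1 (more disagreement)
```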
5. Empirical Results

In this section we present our empirical results. As the three base learning algorithms we use naive Bayes (NB), C4.5 [17], and 3-nearest neighbor (3-NN) [18]. In all of our experiments, we compute the reported accuracy using a test set that is roughly the same size as the unlabeled pool.

5.1. Non-active Co-learning

We present results for the non-active setting. In the left plot of Figure 3, we compare democratic co-learning with naive Bayes, C4.5, 3-NN, and the results of using combining alone on the DNA data set. Democratic co-learning outperforms the three individual algorithms, and the gain was not achieved by simply combining the predictions made by the three learners. In the right plot of Figure 3, we compare democratic co-learning with naive Bayes, C4.5, and 3-NN when each is combined with EM to use the unlabeled data. In each of these plots |L| varies between 35 and 100, with an independent run of each algorithm performed for each integer in this range. The purpose of these experiments is to evaluate how the performance of these methods is affected by varying the size of the pool of labeled data. Across all values of |L| we tested, democratic co-learning outperforms the three individual algorithms when they are combined with EM to make use of the unlabeled data. Notice that EM may have a negative impact on poor classifiers trained over insufficient labeled data.

We now consider a single value of |L| over a variety of data sets. We show the performance of each base algorithm, as well as the performance when we just use the combining method of democratic co-learning, to demonstrate that we are making use of the unlabeled data as opposed to having our gains come from the ensemble of the three base algorithms. We also compared our work to other semi-supervised learning algorithms. For statistical co-learning we use naive Bayes and C4.5 since they generally perform better than 3-NN. To create a hypothesis from naive Bayes that partitions the input domain as required by statistical co-learning, we take all of the data in U, label it according to the naive Bayes hypothesis, and then use C4.5 to create the equivalence classes (one per leaf). We use eight of the UCI³ benchmark data sets. For all data sets … except for the adult data set, where … . Table 2 shows other statistics about the data sets. We created 20 different data sets by randomly partitioning the data into L, U, and the test data. In addition, we picked random partitions in which democratic co-learning labeled at least one example in U.

³ http://www.ics.uci.edu/~mlearn/MLRepository.html
Figure 3. Results on DNA data. The x-axis is |L| and the y-axis is the accuracy. Left plot: Naive Bayes, C4.5, 3-NN, DemoCoL, and Combine Only; right plot: DemoCoL, EM-Naive Bayes, EM-C4.5, and EM-3NN.
A key contribution of our work is a semi-supervised learning technique that can be applied when there is no such independent and redundant set of attributes. Since the work on two-view approaches generally reports results only on data sets that naturally have two appropriate feature sets, comparing our work to those approaches requires that we re-implement their work. We have elected to do this for the Blum and Mitchell co-training procedure [1], which we refer to as two-view co-training. In order to create two views, we randomly partition the features into two sets and then treat these as our two views, as done by Nigam and Ghani [11]. We also tested how sensitive the performance of two-view co-training was to the random choice of the partition of the features. For each of the UCI data sets we fixed the choice of which examples to place in L, U, and the test set and then randomly picked 20 different partitions of the features into two sets. For these we found a standard deviation of anywhere from 0.03 to 0.06. Finally, we present results obtained by using EM with each of the three base algorithms. To create a measure of the best performance one could expect for the given data sets, the column labeled "data in U labeled" shows the best result obtained among any of the base algorithms (naive Bayes, C4.5, and 3-NN) when all examples in U are correctly labeled and placed in L.

Due to the small size of L, and therefore the considerable variation in performance, a paired t-test is used to determine the statistical significance of the difference made by democratic co-learning. Our results are shown in Table 3. The value in parentheses is the paired t-test value between democratic co-learning and that method. A positive value indicates that democratic co-learning performed better. Any value greater than 2.093 is statistically significant at the 95% confidence level or higher. All values greater than 2.861 are also statistically significant at the 99% level, and all values greater than 3.8834 are statistically significant at the 99.9% level. Due to space constraints the standard deviation is only shown for democratic co-learning.

As compared to combining alone, democratic co-learning performs better at the 95% confidence level for 6 of the 8 data sets and at the 90% confidence level for the other two data sets. So democratic co-learning is making use of the unlabeled data and not just benefiting from the use of an ensemble method of combining.
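For reference, paired t-test values such as those in Tables 3 and 4 can be computed with a standard routine; the accuracy vectors below are hypothetical stand-ins for the 20 per-partition accuracies:

```python
import numpy as np
from scipy.stats import ttest_rel

# Accuracies over the 20 random partitions (hypothetical numbers).
demo_col = np.array([0.81, 0.79, 0.84, 0.80] * 5)
combine_only = np.array([0.78, 0.77, 0.80, 0.79] * 5)

t, p = ttest_rel(demo_col, combine_only)
# With 19 degrees of freedom, t > 2.093 is significant at the 95%
# level (two-sided), t > 2.861 at 99%, and t > 3.883 at 99.9%.
print(f"t = {t:.3f}, p = {p:.4f}")
```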
  Algorithm                  flare      monk2      vote       DNA
  Demo. Co-Learning          …          …          …          …
  Combining Only             (1.997)    (4.798)    (2.042)    (7.025)
  Data in U Labeled          …          …          …          …
  Statistical Co-Learning    (2.112)    (0.993)    (2.879)    (3.566)
  Two-View Co-Training       (5.841)    (4.854)    (9.959)    (-1.966)
  EM-NB                      (4.930)    (6.210)    (8.285)    (13.852)
  EM-C4.5                    (2.612)    (10.846)   (1.630)    (6.740)
  EM-3NN                     (5.303)    (1.000)    (12.157)   (8.693)
  NB                         (5.714)    (6.425)    (7.546)    (13.350)
  C4.5                       (2.612)    (10.846)   (1.630)    (6.740)
  3NN                        (5.333)    (2.836)    (9.259)    (8.584)

  Table 3. Our non-active learning results.
As compared to the other 5 semi-supervised methods, democratic co-learning performs statistically significantly better at the 95% level in 32 of the 40 tests we performed. (In fact, in 27 of the 40 tests, our improvements are statistically significant at the 99% confidence level.) Of the 8 tests in which the difference in performance was not statistically significant, democratic co-learning performed better in all but two.

5.2. Active Co-learning

Table 4 shows our active learning results. For uncertainty sampling we use naive Bayes, where the normalized probability measure of naive Bayes is used to give an uncertainty value. For QBC we use … different committee members, each trained with naive Bayes on a random subset (without replacement) of … examples from L. The active learning is used to select 40 additional examples to have labeled.

We first show the best result obtained among the base algorithms when all data in U is properly labeled. Next we compare democratic priority sampling (with no use of the unlabeled data except to serve as a pool of data from which labels may be requested) with QBC and uncertainty sampling. For QBC and uncertainty sampling, we show the paired t-test value with respect to democratic priority sampling. Finally, we compare the following active semi-supervised algorithms: active democratic co-learning, co-testing, and co-Test(co-EM), showing the paired t-test values with respect to active democratic co-learning.

For the active approaches in which the unlabeled data is used only as a pool for the active learner, democratic priority sampling performed better than each of QBC and uncertainty sampling in 5 of the 8 data sets, but only 2 of these 5 cases (for each) were statistically significant at the 95% level. We are currently repeating these experiments using 20 different random choices for L, U, and the test data, and we believe that we will find statistically significant improvements in more cases. For the active semi-supervised algorithms, active democratic co-learning performed better than each of co-testing and co-Test(co-EM) in 5 of the 8 data sets, with 4 of the 5 (for each) being statistically significant at the 95% level.

We also ran a paired t-test between democratic priority sampling and active democratic co-learning. For the 3-of-9 and DNA data sets the improvement of active democratic co-learning was statistically significant at the 95% level, and for the vote and XD6 data sets the improvement was statistically significant at the 90% level.
  Algorithm                  flare      monk2      vote       DNA
  Data in U Labeled          …          …          …          …
  Demo. Priority Samp.       …          …          …          …
  Query By Committee         (4.654)    (1.165)    (2.190)    (-2.631)
  Uncertainty Sampling       (0.246)    (0.631)    (2.120)    (0.703)
  Active Demo. Co-Learn.     …          …          …          …
  CoTesting                  (-0.714)   (0.000)    (2.581)    (5.160)
  Co-Test(co-EM)             (-0.805)   (-0.690)   (4.493)    (2.189)

  Algorithm                  cancer     adult      3-of-9     xd6
  Data in U Labeled          …          …          …          …
  Demo. Priority Samp.       …          …          …          …
  Query By Committee         (0.753)    (-0.708)   (1.616)    (3.105)
  Uncertainty Sampling       (-3.903)   (-0.717)   (2.175)    (2.933)
  Active Demo. Co-Learn.     …          …          …          …
  CoTesting                  (-0.051)   (5.086)    (4.563)    (8.610)
  Co-Test(co-EM)             (-0.941)   (4.695)    (7.260)    (14.529)

  Table 4. Our active learning results.
For the flare and monk2 data sets there really is not much room for improvement. Similarly, in comparing the performance of democratic co-learning and active democratic co-learning, the use of active learning generally improved performance on the data sets in which the performance of democratic co-learning was not already close to that obtained when all data in U is given the proper label.

6. Concluding Remarks

We have demonstrated that democratic co-learning, a single-view, multiple-algorithm, semi-supervised learning technique, is statistically superior to many semi-supervised learning approaches when there are not two sufficiently independent and redundant sets of attributes. Using data from the UCI repository, we have compared the performance of democratic co-learning to combining alone (without using the unlabeled data) and to other single-view and multi-view semi-supervised learning algorithms. Democratic co-learning performed better at the 95% confidence level in 38 of the 48 tests that we performed in the non-active learning setting. For the other 10 tests there was no significant difference in performance between democratic co-learning and the other approaches studied.

In general, co-learning works well if the estimated mean confidence reflects which learner is better and when the multiple classifiers are good in different regions, enabling them to classify data for each other. Finally, there must be room for at least one of the supervised learning algorithms to improve if it received more correctly labeled data. Democratic co-learning also outperformed each of the three individual algorithms when they were combined with EM. By picking learners that work in very different ways, we can increase the diversity needed for them to be able to label data for each other.

References

[1] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proc. of the 11th Annual Conf. on Computational Learning Theory. (1998) 92–100.
[2] Schaffer, C.: A conservation law for generalization performance. In: Proc. of the 11th Int. Conf. on Machine Learning, San Mateo: Morgan Kaufmann (1994) 259–265.
[3] Chan, P.K., Stolfo, S.: On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Integration of Information 8(1) (1998) 5–28.
[4] Chan, P.K., Stolfo, S.: Scaling learning by meta-learning over disjoint and partially replicated data. In: Proc. of the 9th Florida AI Research Symposium. (1996) 151–155.
[5] Ali, K., Pazzani, M.: Error reduction through learning multiple descriptions. Machine Learning 24 (1996) 173–202.
[6] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.
[7] Goldman, S., Zhou, Y.: Enhancing supervised learning with unlabeled data. In: Proc. of the 17th Int. Conf. on Machine Learning, San Francisco: Morgan Kaufmann (2000) 327–334.
[8] Freund, Y., Seung, H., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Machine Learning 28 (1997) 133–168.
[9] Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140.
[10] Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39 (2000) 103–134.
[11] Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proc. of the 9th Int. Conf. on Information and Knowledge Management. (2000) 86–93.
[12] Muslea, I., Minton, S., Knoblock, C.: Selective sampling with redundant views. In: Proc. of AAAI-2000. (2000) 621–626.
[13] Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proc. of the ACM Workshop on Computational Learning Theory. (1992) 287–294.
[14] Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: Proc. of the Special Interest Group on Info. Retrieval, AAAI Press and MIT Press (1994) 3–12.
[15] Dagan, I., Engelson, S.: Committee-based sampling for training probabilistic classifiers. In: Proc. of the 12th Int. Conf. on Machine Learning, San Francisco: Morgan Kaufmann (1995) 150–157.
[16] Muslea, I., Minton, S., Knoblock, C.: Selective sampling + semi-supervised learning = robust multi-view learning. In: IJCAI-01 Workshop on Text Learning: Beyond Supervision. (2001).
[17] Quinlan, R.: Induction of decision trees. Machine Learning 1 (1986) 81–106.
[18] Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1967) 21–27.
[19] Zhou, Z., Chen, K., Jiang, Y.: Exploiting unlabeled data in content-based image retrieval. In: Proc. of the 15th European Conf. on Machine Learning. (2004).