The Experimental Study of Machine Learning
Pat Langley and Dennis Kibler
learning algorithms. After this, we address two broad classes of independent variables: aspects of the algorithm and aspects of the environment. Finally, we consider some issues in the design and execution of experiments. Many of our suggestions are similar to the excellent points made by Cohen (1991) in his discussion of artificial intelligence, but they seem worth instantiating for the field of machine learning.
the past. Also, because any given set of instances may not be representative of the
domain, it is important to average over the results of runs on many sets of training
and test problems that have been selected randomly from those available.3
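As a concrete illustration of this averaging scheme, here is a brief sketch in Python (ours, not code from any study discussed in this paper); the train and accuracy arguments are stand-ins for whatever induction algorithm and performance measure a particular study employs.

```python
import random

def average_over_splits(instances, train, accuracy,
                        n_runs=20, test_fraction=0.3, seed=0):
    """Average test performance over many random training/test partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        data = instances[:]
        rng.shuffle(data)
        cut = int(len(data) * (1.0 - test_fraction))
        training, test = data[:cut], data[cut:]
        model = train(training)                 # induce knowledge from the training set
        scores.append(accuracy(model, test))    # measure performance on unseen cases
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, sd                             # report the average and its variability
```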
One can use a similar scheme to study incremental systems, which process one
experience at a time. In this case, one presents training instances one at a time and,
after every nth instance, turns learning off and runs the system on a separate test set. Alternatively, one can treat each instance first as a test datum and then as a
training datum, but this requires that one run the system more times. In either case,
the result is a learning curve that shows change in performance as a function of the
number of instances encountered. Although learning curves are informative, one can
also condense this information into more succinct summary measures, such as the
asymptotic performance and the number of instances needed to reach this asymptote.
When studying incremental methods, it is important to average not only over different training and test sets, but also over different orders of the training instances, since order can influence the course of learning in most incremental systems. However, in some contexts a researcher may be interested in examining order effects themselves, in which case he or she should systematically vary this factor like any other
independent variable.
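The following sketch summarizes this design; it is our own illustration, and make_learner, learn, and accuracy are hypothetical stand-ins for an incremental induction system and its performance measure.

```python
import random

def learning_curve(train_set, test_set, make_learner, accuracy,
                   n_orders=10, seed=0):
    """Average an incremental learner's test performance over random training orders."""
    rng = random.Random(seed)
    totals = [0.0] * (len(train_set) + 1)
    for _ in range(n_orders):
        order = train_set[:]
        rng.shuffle(order)                          # vary the order to average out order effects
        learner = make_learner()
        totals[0] += accuracy(learner, test_set)    # performance before any training
        for i, instance in enumerate(order, start=1):
            learner.learn(instance)                 # learning on: process one instance
            totals[i] += accuracy(learner, test_set)  # learning off: test on the held-out set
    return [t / n_orders for t in totals]           # one point per number of instances seen
```

One can then read the asymptotic level and the number of instances needed to reach it directly off the returned curve.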
Figure 1. Learning curve for Fisher's Cobweb algorithm on Congressional voting records, as reported by McKusick and Langley (1991). The curve plots predictive accuracy (percent, 0 to 100) against the number of training instances (0 to 60).
to predict the class label for each test case. The percentage of correctly classified instances was recorded, learning was enabled, the system was presented with the next training instance, and the cycle continued. The learning curve shown in Figure 1 was averaged over ten runs based on different random orderings of the training data. Thus, each point on the curve shows the average percentage of the test set that Cobweb correctly classified.
The shape of the curve reveals that most learning occurs rather early. Asymptotic
accuracy is approximately 92%, yet Cobweb reaches the 85% level after fewer than
ten training instances. The system remains stable at this point for some time, then
rises to above 88% around 20 instances. Slight improvements occur with additional
instances, but the most important learning seems complete by this point. These results contradict some claims (e.g., Mitchell et al., 1986) that inductive methods require very many instances to acquire useful knowledge. But they also suggest that the behavior of House members may be quite regular and thus simple to induce. As
a result, it can be dangerous to draw conclusions about the behavior of a learning
algorithm from studies with a single domain. We will return to this issue later.
problem-solving traces into rules for selecting operators, states, and goals during
planning. Minton's system then uses these rules to constrain search on new problems.
Gratch's study uses CPU time as the measure of problem-solving performance.
One could examine the number of search nodes considered in solving problems, but
as Minton has shown, the amount of search is only one facet of problem-solving efficiency. His definition of the utility of acquired knowledge also includes the cost of applying that knowledge in controlling search, and CPU time takes this into account. Langley and Allen (in press) use a related measure, the total number of unifications required to solve a set of problems, which is less dependent on implementation and machine. They also examine both search nodes and match cost in an attempt to determine the source of power or difficulty.
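To make these dependent measures concrete, the sketch below records CPU time, nodes examined, and unifications for each test problem; it is our illustration, and the solve interface and its counters are hypothetical.

```python
import time

def measure_performance(solver, problems):
    """Collect several efficiency measures for a problem solver on a set of test problems."""
    records = []
    for problem in problems:
        start = time.process_time()              # CPU time rather than wall-clock time
        result = solver.solve(problem)           # hypothetical solver interface
        elapsed = time.process_time() - start
        records.append({
            "solved": result.solved,
            "cpu_seconds": elapsed,              # includes both search and match cost
            "nodes": result.nodes,               # amount of search alone
            "unifications": result.unifications, # match cost, less machine dependent
        })
    return records
```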
The independent variable in Gratch's experiment is the number of training problems
on which the system has practiced. He generated 220 problems from a special variant
of the blocks world (described by Etzioni, 1990), dividing these into 100 training
tasks, 20 `settling' problems, and 100 test cases. After every ten training problems,
Gratch disabled Prodigy's explanation-based learning component and ran the
system on the settling problems, which it used to gather statistics about the utility of
individual control rules. During this stage, Prodigy deleted rules that appeared to
increase the overall cost of planning. After this, the experimenter disabled this facet
of learning as well and measured the CPU time required to solve all problems in the
test set.
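The overall design can be summarized by the sketch below; it is our paraphrase of the procedure just described, with hypothetical method names standing in for the actual Prodigy operations.

```python
def gratch_protocol(prodigy, training, settling, test, block_size=10):
    """Train in blocks, settle the rule set, then time the full test set."""
    curve = []
    for start in range(0, len(training), block_size):
        for problem in training[start:start + block_size]:
            prodigy.solve_and_learn(problem)      # explanation-based learning enabled
        prodigy.disable_ebl()
        for problem in settling:
            prodigy.solve(problem)                # gather utility statistics and delete
                                                  # rules that raise the cost of planning
        prodigy.disable_utility_learning()        # now no learning of any kind
        curve.append(sum(prodigy.timed_solve(p) for p in test))  # total CPU time on the test set
        prodigy.enable_all_learning()             # resume learning for the next block
    return curve
```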
Figure 2 shows the learning curve for this domain. The times are averaged over ten random orderings of the training and settling problems, since order effects can occur in problem-solving domains as easily as in classification tasks. The results are intriguing. The search-control knowledge that Prodigy acquired actually increases its problem-solving time. This cost begins to decrease after the initial large rise, but it never quite returns to the level that existed before learning. Presumably, this effect occurs because the cost of matching the acquired rules' complex conditions more than offsets the savings due to reduced search, even though the settling phase was designed to avoid this problem.
For our purposes, these results demonstrate the clear need for experimental studies of learning's effect on performance. Without such evaluation, one cannot know whether learning is actually beneficial. With such experiments, one can identify the source of the degradation and modify the learning scheme to improve performance.
Fortunately, this is not the complete story on Prodigy; in fact, Etzioni carefully
designed this particular variant of the blocks world to encourage Prodigy to acquire
expensive search-control rules. We will return to this point in Section 5.
Segre, Elkan, and Russell (1991) have noted an important complication in the experimental study of learning in problem-solving domains. Most problem solvers include some computational limit and give up on a problem when they exceed it. Thus, reporting only efficiency results can be misleading; it is essential to include information
about the percentage of test problems that the system has solved. Gratch was careful
Figure 2. Learning curve for a reduced version of Prodigy on problem-solving tasks from a variant of the blocks world, as reported by Gratch (1991). The curve plots planning time (CPU seconds) against the number of training problems (0 to 100).
to use only problems that Prodigy could solve within its computational limits, but
this may not be practical for some real-world problems. In some domains, the quality of problem solutions can also be important. Langley and Drummond (1990) suggest some ways in which to instantiate this dependent variable.
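A small sketch of this bookkeeping (our illustration, with hypothetical record fields): report the fraction of test problems solved within the resource bound alongside efficiency and solution quality, the latter two computed only over the solved problems.

```python
def summarize_results(records):
    """Report coverage as well as efficiency and solution quality on solved problems."""
    solved = [r for r in records if r["solved"]]
    coverage = len(solved) / len(records)

    def mean(key):
        return sum(r[key] for r in solved) / len(solved) if solved else None

    return {
        "coverage": coverage,                             # fraction solved within the limit
        "mean_cpu_seconds": mean("cpu_seconds"),          # efficiency on solved problems only
        "mean_solution_length": mean("solution_length"),  # one simple quality measure
    }
```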
4. Varying the Learning Method
One of the most difficult problems confronting psychology is teasing apart the relative effects of heredity and experience, of `nature' and `nurture'. Machine learning is more fortunate, in that it can experimentally control a learning system's `innate' features (nature) and the training instances it encounters (nurture). Here we examine methods for evaluating the effect of system characteristics, delaying the role of experience until the following section.
The obvious way to examine the influence of system features on behavior is to compare different algorithms on the same task. That is, one runs two or more learning systems on a given domain, measures their performance on the same test cases, and compares the results. Until recently, such comparative studies were rare in the literature, but now they have become almost the default, and the availability of standard databases has provided a variety of domains to use in such experiments.4
4. In fact, most of the domains we mention in this paper are available by ftp from ics.uci.edu using the account and password anonymous. The various data sets reside in the directory pub/machine-learning-databases.
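As a schematic of such a comparative study, the sketch below trains several present-day scikit-learn induction algorithms on identical training sets and scores them on identical test cases; it is our illustration, and scikit-learn was of course not used in the studies discussed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier      # decision-tree induction (CART)
from sklearn.linear_model import Perceptron          # a simple linear threshold unit
from sklearn.neural_network import MLPClassifier     # a backpropagation-style network

def compare_algorithms(X, y, n_runs=10):
    """Score several induction algorithms on the same random training/test splits."""
    results = {"tree": [], "perceptron": [], "network": []}
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=run)
        learners = {
            "tree": DecisionTreeClassifier(random_state=run),
            "perceptron": Perceptron(random_state=run),
            "network": MLPClassifier(max_iter=2000, random_state=run),
        }
        for name, learner in learners.items():
            learner.fit(X_tr, y_tr)
            results[name].append(learner.score(X_te, y_te))   # predictive accuracy
    return {name: float(np.mean(scores)) for name, scores in results.items()}
```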
Table 1. Percentage accuracies and training times for three induction algorithms in diagnosing
soybean diseases (Shavlik, Mooney, & Towell, 1991).
Figure 3. The effect of lesioning Stagger's Boolean mechanism for creating new conceptual components (Schlimmer, 1987). The plot shows predictive accuracy (percent, 0 to 100) for the full system (weight plus Boolean learning) and for the lesioned version (weight learning only).
5. Cohen and Howe (1988) have referred to such experiments as ablation studies.
Table 2. Percentage accuracies for three induction algorithms on four classification domains (Shavlik et al., 1991).
ing disorders based on 58 features (taken from Bareiss, 1989), and a task that involves
determining whether a patient has heart disease, given eight nominal attributes and
six numeric ones. The table reports only accuracy on the test sets. For comparison,
we have repeated the results for the soybean domain.
Recall that on the soybean data, the knowledge induced by both the Backpropagation and Perceptron methods performed better than the ID3 algorithm. However, by examining behavior across domains, Shavlik et al. demonstrated that this result is misleading. Behavior in a single domain, even a real-world one, does not necessarily generalize to other domains. On both the chess and audiology testbeds, both ID3 and Backpropagation are significantly more accurate (at the 0.05 level) than the Perceptron learning algorithm, but there is no significant difference between the two more sophisticated methods. Backpropagation does significantly better than ID3 in diagnosing heart disease, but the induced decision trees in turn outperform the learned perceptrons (at the 0.01 level).
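To illustrate the kind of test involved, one can run a paired t-test over matched data partitions and ask whether the accuracy difference between two methods is reliable at a chosen level; this is our sketch, and the original analysis may have used a different statistical procedure. The accuracy lists below are hypothetical.

```python
from scipy import stats

def compare_accuracies(acc_a, acc_b, alpha=0.05):
    """Paired t-test on per-partition accuracies of two learning methods."""
    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)   # paired: same partitions for both methods
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}

# Hypothetical per-partition accuracies for two methods on the same ten partitions.
method_a = [0.91, 0.88, 0.93, 0.90, 0.92, 0.89, 0.94, 0.90, 0.91, 0.93]
method_b = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.89, 0.86, 0.85, 0.88]
print(compare_accuracies(method_a, method_b))
```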
These results make one more confident in the non-naive approaches, but one would still like to understand the reasons for ID3's poor behavior on the soybean data. One possibility is that this domain is nearly linearly separable, but that the separating hyperplane is not orthogonal to any of the axes in the instance space. Thus, the Perceptron technique can accurately classify instances using a linear unit, whereas ID3 is forced to approximate this boundary with a highly disjunctive decision tree, in which each terminal node is based on a small sample.
In the midst of this discussion, we should not forget one of the main points of the Shavlik et al. study. Connectionist and `symbolic' induction algorithms, although they rely on different representations of knowledge and use different methods to acquire that knowledge, are dealing with essentially the same problem, and this means that one can compare them on the same tasks. This form of comparative study is much healthier for the field than rhetorical arguments about the limitations of existing methods and the advantages of new approaches.
Experimental studies of problem-solving systems can also use multiple domains to
evaluate learning algorithms. In Section 3.3 we reviewed results from Gratch's (1991)
study of Prodigy on a single domain, but in fact he examined the system's behavior
Figure 4. Learning curves for the Prodigy algorithm on three problem-solving domains (Gratch, 1991): the normal blocks world, the modified blocks world, and an extended Strips domain. Planning time (CPU seconds, 0 to 2500) is plotted against the number of training problems (0 to 100).
on others as well. Figure 4 incorporates the learning curves for an extended version of
the Strips planning domain and for the original version of the blocks world used by
Minton (1990). The results here are much more encouraging, with Prodigy showing
clear improvement by the tenth training problem in both cases. After this point, the
system seems to have stabilized, apparently having completed its acquisition of useful
search-control knowledge.
This raises questions about why Prodigy encounters difficulty in the original domain we examined. As mentioned earlier, Etzioni (1990) designed this variant of the blocks world, which includes a single additional operator that lets one move two blocks at a time, to produce just such a negative effect in Prodigy. He provides an interesting analysis of the causes of the system's divergent behaviors in these domains. This technique of altering an existing domain to elicit some effect is a powerful experimental tool, and it leads naturally into our next topic.
Figure 5. The effect of three types of noise on predictive accuracy in Quinlan's (1986) ID3. Percentage error (0 to 50) is plotted against the level of noise (0 to 100 percent) for noise in the class label, noise in a single attribute, and noise in all attributes.
that noise in all attributes would make learning more difficult than noise in any single feature, including the class label. Indeed, the curve for this condition goes up rapidly,6 but then actually decreases and levels off at a 26% error rate. Quinlan explains this surprising result by noting that, beyond a certain noise level, ID3's pruning technique leads to one-node trees that simply predict the most frequent class. The dip in the curve suggests that the parameter setting for the statistical test is slightly high, allowing some overfitting to occur around the 40% noise level.
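For concreteness, here is one plausible way to inject a given level of noise into the class label, a single attribute, or all attributes of a set of discrete-valued instances; this is our sketch and may differ in detail from Quinlan's exact procedure.

```python
import random

def add_noise(instances, legal_values, level, target="class", seed=0):
    """Replace the targeted field(s) with a randomly chosen legal value with probability `level`.

    instances:    list of (attribute_list, class_label) pairs
    legal_values: dict mapping each attribute index, and the key "class", to a list of legal values
    target:       "class", a single attribute index, or "all" attributes
    """
    rng = random.Random(seed)
    noisy = []
    for attributes, label in instances:
        attributes = list(attributes)
        if target == "class":
            if rng.random() < level:
                label = rng.choice(legal_values["class"])
        else:
            indices = range(len(attributes)) if target == "all" else [target]
            for i in indices:
                if rng.random() < level:
                    attributes[i] = rng.choice(legal_values[i])
        noisy.append((attributes, label))
    return noisy
```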
Figure 6. Learning curves for Gennari's (1990) Classit on domains with varying numbers of irrelevant attributes (0, 4, 8, and 16). Absolute error (0 to 30) is plotted against the number of training instances (0 to 60).
information that has already been summarized at a parent node, and thus emphasizes
attributes that serve to distinguish concepts at the same level.
Gennari (1990) examined the effect of this factor on the behavior of Classit, an extension of Cobweb that handles both symbolic and numeric attributes. He used a set of artificial domains that involved four separate classes, each differing in their values on four relevant numeric attributes. However, the domains varied in the number of irrelevant attributes (which have the same probability distribution independent of class) from zero to sixteen. All domains had small but definite amounts of attribute noise, and training instances were unclassified. The performance task involved predicting the numeric value of a single relevant attribute omitted from each test instance, and the dependent measure was the absolute error between the actual and predicted values.
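The sketch below gives one way to generate artificial domains of this sort and to compute the error of the naive strategy that always predicts an attribute's training mean, which serves as the kind of straw baseline mentioned below; it is our reconstruction of the general design rather than Gennari's code, and the particular class means and noise level are assumptions.

```python
import random

def make_domain(n_irrelevant, n_instances, noise_sd=0.1, seed=0):
    """Four classes defined by four relevant numeric attributes, plus irrelevant ones."""
    rng = random.Random(seed)
    class_means = {c: [c * 2.0] * 4 for c in range(4)}   # assumed per-class attribute means
    instances = []
    for _ in range(n_instances):
        c = rng.randrange(4)                             # class label is hidden from the learner
        relevant = [rng.gauss(m, noise_sd) for m in class_means[c]]
        irrelevant = [rng.gauss(0.0, 1.0) for _ in range(n_irrelevant)]  # same distribution for every class
        instances.append(relevant + irrelevant)
    return instances

def naive_error(train, test, attribute):
    """Absolute error of always predicting one attribute's mean over the training data."""
    mean = sum(x[attribute] for x in train) / len(train)
    return sum(abs(x[attribute] - mean) for x in test) / len(test)
```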
Figure 6 presents the results, which are based on ten different orders of randomly generated training instances. The graph suggests that Classit is robust with respect to irrelevant attributes, with an asymptote around 2.0 regardless of the number of irrelevant terms. This is close to the `ideal' error of 0.47, which is the error for the best possible predictions that could be based on the observed training instances. Classit's asymptote is also considerably less than that of a naive algorithm which simply predicts the mean value for each attribute, independent of its class. This provides another example of how one can use straw algorithms and optimal ones to calibrate learning behavior. The system's rate of learning does seem affected by the number of irrelevant attributes, but Classit appears to scale well on this dimension,
[Figure: predictive accuracy (50 to 100 percent) as a function of the number of peaks (log base 2, 0 to 10), for class noise levels of 0% and 30%.]
Figure 8. Theoretical and observed learning curves for Labyrinth and Cobweb in the presence of background knowledge (Thompson, Langley, & Iba, 1991). Predictive accuracy (percent, 0 to 100) is plotted against the number of training instances (0 to 60), with theoretical and observed curves shown for each system.
pose. Alternatively, one can use the standard deviation of each point on the curve to
draw `error bars' around the curve, then see whether the theoretical curve falls within
these ranges, as Pazzani and Sarrett (1990) have done. But in general, theory-laden
sciences like physics have less need of statistical hypothesis testing than experiment-
driven ones like psychology, and we hope that as machine learning matures, it will
progress from the latter into the former.
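A brief sketch of that error-bar check, as our own illustration rather than Pazzani and Sarrett's code: compute the mean and standard deviation of each point on the observed curve across runs and ask whether the theoretical curve falls inside the resulting band.

```python
import numpy as np

def within_error_bars(observed_runs, theoretical, k=1.0):
    """Check, point by point, whether a theoretical curve lies within k standard
    deviations of the mean observed curve (observed_runs is a runs-by-points array)."""
    observed = np.asarray(observed_runs, dtype=float)
    theory = np.asarray(theoretical, dtype=float)
    mean = observed.mean(axis=0)
    sd = observed.std(axis=0)
    inside = (theory >= mean - k * sd) & (theory <= mean + k * sd)
    return mean, sd, inside          # inside[i] is True where the theory fits the data
```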
7. Conclusions
One can trace experimental approaches to machine learning back more than two decades (e.g., Hunt, Marin, & Stone, 1966), but the `modern' era of experimentation began about five years ago. Since then, the number of experimental studies has grown at a rapid pace, with researchers identifying new dependent and independent variables, testing existing systems on new domains, and improving these systems when they encounter difficulties. Many experimental studies produce unexpected results, forcing the experimenter to think deeply about reasons for the observed learning behavior.
In general, the field of machine learning occupies a much healthier methodological state than a decade ago. However, the experimental method has been adopted more quickly in some areas than in others. Early experimentation focused on inductive approaches to classification, as the current paper reflects in its examples, but recent years have seen many analogous studies of learning in problem-solving domains and experiments on explanation-based methods. Researchers have also started to measure the influence of background knowledge on inductive learning.
In summary, machine learning occupies a fortunate position that makes systematic experimentation easy and profitable. Some methodological questions remain unanswered, but researchers have made an excellent start, and we expect the future holds improved dependent measures, better independent variables, and more useful experimental designs. There remains room for improvement in all areas of machine learning, but the discipline seems well on its way to developing a sound experimental tradition.
However, these successes do not mean that empirical researchers should report gratuitous experiments any more than theoreticians should publish vacuous proofs. Whether they lead to positive or negative results, experiments are worthwhile only to the extent that they illuminate the nature of learning mechanisms and the reasons for their success or failure. Although experimental studies are not the only path to understanding, we feel they constitute one of machine learning's brightest hopes for rapid scientific progress, and we encourage other researchers to join in our field's evolution toward an experimental science.
Acknowledgements
We would like to thank David Aha, Wayne Iba, David Ruby, Jeff Schlimmer, Kevin Thompson, and Tom Dietterich for helpful comments on previous drafts. Earlier versions of this paper appeared in the journal Machine Learning and in the Proceedings
References
Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37-66.
Bareiss, E. R. (1989). Exemplar-based knowledge acquisition: A unified approach to concept representation, classification, and learning. Boston: Academic Press.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261-284.
Cohen, P. R. (1991). A survey of the Eighth National Conference on Artificial Intelligence: Pulling together or pulling apart? AI Magazine, 12, 16-41.
Cohen, P. R., & Howe, A. E. (1988). The invisible hand: How evaluation guides AI research (COINS Technical Report 88-21). Amherst: University of Massachusetts, Department of Computer and Information Science.
DeJong, G., & Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145-176.
Dietterich, T. G. (1990). Machine learning. Annual Review of Computer Science, 4.
Elio, R., & Watanabe, L. (1991). An incremental deductive strategy for controlling constructive induction in learning from examples. Machine Learning, 7, 7-44.
Etzioni, O. (1990). Why Prodigy/EBL works. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 916-922). Boston, MA: AAAI Press.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172.
Gennari, J. H. (1990). An experimental study of concept formation. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine.
Gratch, J. (1991). Utility generalization and composability problems in explanation-based learning (Tech. Rep. No. UIUCDCS-R-91-1681). Urbana: University of Illinois, Department of Computer Science.
Haussler, D. (1987). Learning conjunctive concepts in structural domains. Proceedings of the Sixth National Conference on Artificial Intelligence (pp. 466-470). Seattle, WA: AAAI Press.
Haussler, D. (1990). Probably approximately correct learning. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1101-1108). Boston, MA: AAAI Press.
Hawking, S. (1988). A brief history of time. New York: Bantam Books.
Hunt, E. B., Marin, J., & Stone, P. J. (1966). Experiments in induction. New York: Academic Press.
Rendell, L., & Cho, H. (1990). Empirical learning as a function of concept character. Machine Learning, 5, 267-298.
Robertson, G. G., & Riolo, R. L. (1988). A tale of two classifier systems. Machine Learning, 3, 139-159.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Rosenbloom, P., & Newell, A. (1987). Learning by chunking: A production system model of practice. In D. Klahr, P. Langley, & R. Neches (Eds.), Production system models of learning and development. Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.
Schlimmer, J. C. (1987). Concept acquisition through representational adjustment. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine, CA.
Segre, A., Elkan, C., & Russell, A. (1991). A critical look at experimental evaluations of EBL. Machine Learning, 6, 183-195.
Shapiro, A. (1987). Structured induction in expert systems. Reading, MA: Addison-Wesley.
Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning: An experimental comparison. Machine Learning, 6, 111-143.
Simon, H. A. (1969). The sciences of the artificial. Cambridge, MA: MIT Press.
Tambe, M., Newell, A., & Rosenbloom, P. S. (1990). The problem of expensive chunks and its solution by restricting expressiveness. Machine Learning, 5, 299-348.
Thompson, K., Langley, P., & Iba, W. F. (1991). Using background knowledge in concept formation. Proceedings of the Eighth International Workshop on Machine Learning. Evanston, IL: Morgan Kaufmann.
Utgoff, P. E. (1988). Perceptron trees: A case study in hybrid concept representations. Proceedings of the Seventh National Conference on Artificial Intelligence (pp. 601-606). St. Paul, MN: AAAI Press.