

Copyright © 1991 Patrick W. Langley. All rights reserved.

The Experimental Study of Machine Learning

Pat Langley ([email protected])


AI Research Branch, Mail Stop 244-17, NASA Ames Research Center,
Moffett Field, CA 94035 USA
Dennis Kibler ([email protected])
Department of Information & Computer Science, University of California,
Irvine, CA 92717 USA

1. Science and Observation


Machine learning is often characterized as a scientific discipline, and this suggests we
incorporate knowledge of science and its methods into the goals and techniques of
the field. Research in AI and cognitive science further suggests that one can view
science as a search through a space of theories that requires two active components:
a generator and a test. The generator produces new theories or variants on existing
theories, whereas the test yields information concerning the quality of theories. Science
incorporates a variety of tests that guide the theory-generation process. These
include evaluation metrics like elegance and internal consistency, which it shares with
other intellectual endeavors such as mathematics and philosophy.
However, science diverges from philosophy in its emphasis on observation. No
matter how elegant or consistent, a theory that disagrees with the data must be
rejected or improved. Observation acts as the most important factor in the evaluation
function that directs scientists' search through the space of theories. Hawking (1988)
holds a similar view on the evaluation of scientific knowledge:
A theory is a good theory if it satisfies two requirements: It must accurately
describe a large class of observations . . . , and it must make definite predictions
about the results of future observations.
Thus, he distinguishes between two sorts of observations: those made before the
theory is forwarded (which it must cover) and those made after its generation (which
it must predict). This distinction reflects the two roles played by empirical results:
the suggestion of new candidate theories and the evaluation of existing ones.
The success of physics, perhaps the most well-developed scientific discipline, should
clarify the importance of observation. This field is primarily concerned with
understanding the nature of the physical world: the structure and processes that govern
matter and energy. This shared goal holds physics together as a field, but its progress
derives primarily from its continued, repeated testing of theories against observation.
Data play a central role in selecting among competing theories, and anomalies suggest
improvements on incorrect theories. Over time, old theories are rejected and new ones
emerge with higher predictive accuracy and greater generality.

2. The Role of Experiments in Machine Learning


Machine learning is another science, albeit a much younger one than physics. Our
discipline is primarily concerned with understanding the computational mechanisms
that underlie learning, and this shared purpose holds machine learning together as a
coherent field. Like physics, the success of machine learning as a scientific discipline
will rest on its ability to combine theory and observation, using data to drive theory
selection and revision.
The field of machine learning focuses on intelligent artifacts: systems created by
the researchers who study them. Thus, it constitutes what Simon (1969) has called a
science of the artificial. As such, there is a temptation to emphasize formal analysis
and theoretical approaches. Indeed, considerable progress has recently occurred on
the theoretical front,1 both in formalizing the nature of learning algorithms and in
characterizing their behavior. In this view, machine learning is primarily a
mathematical science.
Despite this progress, many learning algorithms are too complex for formal analysis,
at least at the level of generality assumed by most theoretical treatments. As a result,
empirical studies of the behavior of machine learning algorithms must retain a central
role. Fortunately, the artificial nature of learning algorithms allows control over a wide
range of factors, making it more akin to experimental disciplines such as physics and
chemistry than to observational sciences such as astronomy or sociology. It is this
view, machine learning as an experimental science, that we pursue in this paper.
The goal of scientific experimentation is to better understand a class of behaviors
and the conditions under which they occur. Ideally, this will lead to empirical laws
and theories, as well as to tests of those theories. In our field, the central behavior is
learning. The conditions involve the algorithm employed, the domain knowledge, and
the environment in which learning occurs. Lacking a formal analysis, an implemented
learning algorithm is necessary but not sufficient for understanding; one should also
attempt to specify when it operates well and the reasons for that behavior. Such
generalizations provide the raw material for forming and testing theories of machine
learning. Moreover, they can suggest improved algorithms that exhibit more desirable
learning behaviors.
As normally defined, an experiment involves systematically varying one or more
independent variables and examining their effect on some dependent variables. Thus,
a machine learning experiment requires more than a single observation of a system's
behavior; it requires a number of observations made under different conditions. In
each case, one must measure some aspect of the system's behavior for comparison
across the different conditions.
We have organized the remainder of the paper in these terms. We begin by
examining some dependent variables that can be used in the experimental study of
learning algorithms. After this, we address two broad classes of independent variables:
aspects of the algorithm and aspects of the environment. Finally, we consider
some issues in the design and execution of experiments. Many of our suggestions are
similar to the excellent points made by Cohen (1991) in his discussion of artificial
intelligence, but they seem worth instantiating for the field of machine learning.

1. Kearns, Li, Pitt, and Valiant (1987), Dietterich (1990), and Haussler (1990) provide informative
reviews of progress in the area of learnability theory.

3. Dependent Measures of Learning


Most definitions of learning rely on some notion of improved performance. Thus,
various performance measures are the natural dependent variables for machine learning
experiments, just as they are for studies of human learning. Other measures, like
`understandability' of the acquired structures, may also be informative, but these are
not relevant unless accompanied by performance improvement.2 In some cases,
intuitively plausible learning methods actually lead to worse performance (Minton, 1985),
so performance measures are central to evaluating almost any learner's behavior.

3.1 Measures of Performance


Many measures of performance are possible. For supervised concept induction tasks,
in which each instance has an associated class name, the most obvious metric is
the percentage of correctly classified instances (Quinlan, 1986). One cannot use this
dependent variable for unsupervised induction tasks like conceptual clustering, since
no class name is available. However, one can replace it with a more general measure:
the ability to predict a missing attribute's value, averaged across all attributes; Fisher
(1987) refers to this performance task as flexible prediction.
More complex domains require more sophisticated measures of performance. For
grammar-induction tasks, one can record the percentage of correctly parsed sentences
and the percentage of correctly rejected non-sentences. For problem-solving domains,
one can examine the percentage of problems solved or the quality of the resulting
solution paths (Langley & Drummond, 1990). One can also measure the total CPU time
or number of nodes considered during search (Minton, 1985). The last two metrics
are concerned with efficiency rather than correctness, and thus seem appropriate for
explanation-based approaches to learning, which have been largely concerned with
the compilation of knowledge rather than its acquisition (Mitchell, Keller, &
Kedar-Cabelli, 1986; DeJong & Mooney, 1986).
Given a particular performance criterion, one must implement this measure in some
fashion. In nonincremental settings, one can present the learning system with a training
set and then evaluate its performance on a separate test set. This is an important
methodological point. The goal of learning is typically to use acquired knowledge
to aid behavior in novel situations, not on problems that have been encountered in
the past. Also, because any given set of instances may not be representative of the
domain, it is important to average over the results of runs on many sets of training
and test problems that have been selected randomly from those available.3

2. As in psychology, we make a clear distinction between performance, an agent's behavior at a given
instant in time, and learning, the change in an agent's performance over time. In this framework,
the phrases learning performance and performance of a learning system are oxymorons.
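
The protocol just described, training on one random partition and testing on another, with
results averaged over several partitions, is easy to express in code. The sketch below is a
minimal illustration in Python; the `train` factory and its `classify` method are hypothetical
placeholders for whatever induction algorithm is under study.

```python
import random

def evaluate(train, instances, labels, n_splits=10, test_fraction=0.25, seed=0):
    """Average accuracy over several random train/test partitions.

    `train` is a placeholder: it takes (training_instances, training_labels)
    and returns an object with a .classify(instance) method, standing in for
    whatever induction algorithm is being evaluated.
    """
    rng = random.Random(seed)
    data = list(zip(instances, labels))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(data)                                  # fresh random partition
        cut = int(len(data) * (1.0 - test_fraction))
        training, test = data[:cut], data[cut:]
        model = train([x for x, _ in training], [y for _, y in training])
        correct = sum(1 for x, y in test if model.classify(x) == y)
        accuracies.append(correct / len(test))             # accuracy on unseen cases
    return sum(accuracies) / len(accuracies)
```
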
One can use a similar scheme to study incremental systems, which process one
experience at a time. In this case, one presents training instances one at a time and,
after every nth instance, turns learning off and runs the system on a separate test
set. Alternatively, one can treat each instance first as a test datum and then as a
training datum, but this requires that one run the system more times. In either case,
the result is a learning curve that shows change in performance as a function of the
number of instances encountered. Although learning curves are informative, one can
also condense this information into more succinct summary measures, such as the
asymptotic performance and the number of instances needed to reach this asymptote.
When studying incremental methods, it is important not only to average over different
training and test sets, but also over different orders of the training instances,
since this can influence the course of learning in most incremental systems. However,
in some contexts a researcher may be interested in examining order effects themselves,
in which case he or she should systematically vary this factor like any other
independent variable.
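
A sketch of the first incremental scheme described above (pause learning after every nth
training instance and test on a held-out set), averaged over several random orderings, might
look as follows. The `make_learner` factory and its `update` and `classify` methods are
hypothetical names rather than part of any system discussed here.

```python
import random

def learning_curve(make_learner, training, test, n_orders=10, step=1, seed=0):
    """Mean learning curve over several random orderings of the training data.

    `make_learner` is a placeholder factory returning an object with
    .update(instance, label) and .classify(instance) methods.  Returns a
    list of (instances_seen, mean_accuracy) points.
    """
    rng = random.Random(seed)
    n_points = len(training) // step
    sums = [0.0] * n_points
    for _ in range(n_orders):
        order = list(training)
        rng.shuffle(order)                     # a fresh random order for this run
        learner = make_learner()
        for i, (x, y) in enumerate(order, start=1):
            learner.update(x, y)               # treat the instance as a training datum
            if i % step == 0:                  # pause learning and test
                correct = sum(1 for tx, ty in test if learner.classify(tx) == ty)
                sums[i // step - 1] += correct / len(test)
    return [((k + 1) * step, s / n_orders) for k, s in enumerate(sums)]
```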

3.2 Performance in Classification Domains


Fisher (1987) has described Cobweb, an incremental unsupervised algorithm for
inducing probabilistic concepts. The system organizes its acquired knowledge as a
hierarchy of concepts, which it modifies with each training instance. However, our
concern here is not with Fisher's algorithm but with experimentation, so let us consider
a recent study of this system by McKusick and Langley (1991). Figure 1 presents
a learning curve for Cobweb in a particular classification domain.
The data used in this study were collected by Schlimmer (1987) from the Congressional
Quarterly. They describe votes of the 435 members of the 1984 U.S. House of
Representatives on 16 issues, such as aid to El Salvador, funding for the MX missile,
and duty-free exports. Thus, there are 435 instances, each consisting of 16 Boolean
attributes that specify whether a given House member voted `yea' or `nay'. Each
instance also falls into one of two classes; 267 of the members were Democrats and
168 were Republicans. Although Fisher's system was designed for unsupervised tasks, it can
also learn from such supervised data, so the dependent measure here was predictive
accuracy on the class label, rather than Fisher's measure of flexible prediction.
McKusick and Langley presented Cobweb with a random sample of 100 training
instances from this domain and tested it on a separate set of 25 randomly selected
instances. After each training case, learning was disabled and the system was asked
to predict the class label for each test case. The percentage of correctly classified
instances was recorded, learning was enabled, the system was presented with the
next training instance, and the cycle continued. The learning curve shown in Figure
1 was averaged over ten runs based on different random orderings of the training
data. Thus, each point on the curve shows the average percentage of the test set that
Cobweb correctly classified.

3. One can achieve similar effects through cross-validation studies, in which one iterates through
each of the N available instances, in each case running the system on the other N - 1 instances and
using the selected instance to test performance. One then averages the results for all N runs to
estimate typical performance.

Figure 1. Learning curve (predictive accuracy vs. number of training instances) for Fisher's Cobweb
algorithm on Congressional voting records, as reported by McKusick and Langley (1991).

The shape of the curve reveals that most learning occurs rather early. Asymptotic
accuracy is approximately 92%, yet Cobweb reaches the 85% level after fewer than
ten training instances. The system remains stable at this point for some time, then
rises to above 88% around 20 instances. Slight improvements occur with additional
instances, but the most important learning seems complete by this point. These
results contradict some claims (e.g., Mitchell et al., 1986) that inductive methods
require very many instances to acquire useful knowledge. But the result also suggests
that the behavior of House members may be quite regular and thus simple to induce. As
a result, it can be dangerous to draw conclusions about the behavior of a learning
algorithm from studies with a single domain. We will return to this issue later.

3.3 Performance in Problem-solving Domains


Learning curves can also be used to examine performance improvement in problem
solving. Let us consider a study by Gratch (1991) using a reduced version of
Prodigy-EBL (Minton, 1990), a well-known and successful algorithm for acquiring
search-control rules. The learning system employs an explanation-based method to transform
problem-solving traces into rules for selecting operators, states, and goals during
planning. Minton's system then uses these rules to constrain search on new problems.
Gratch's study uses CPU time as the measure of problem-solving performance.
One could examine the number of search nodes considered in solving problems, but
as Minton has shown, the amount of search is only one facet of problem-solving
efficiency. His definition of the utility of acquired knowledge also includes the cost of
applying that knowledge in controlling search, and CPU time takes this into account.
Langley and Allen (in press) use a related measure, the total number of unifications
required to solve a set of problems, which is less dependent on implementation and
machine. They also examine both search nodes and match cost in an attempt to
determine the source of power or difficulty.
The independent variable in Gratch's experiment is the number of training problems
on which the system has practiced. He generated 220 problems from a special variant
of the blocks world (described by Etzioni, 1990), dividing these into 100 training
tasks, 20 `settling' problems, and 100 test cases. After every ten training problems,
Gratch disabled Prodigy's explanation-based learning component and ran the
system on the settling problems, which it used to gather statistics about the utility of
individual control rules. During this stage, Prodigy deleted rules that appeared to
increase the overall cost of planning. After this, the experimenter disabled this facet
of learning as well and measured the CPU time required to solve all problems in the
test set.
Figure 2 shows the learning curve for this domain. The times are averaged over
ten random orderings of the training and settling problems, since order effects can
occur in problem-solving domains as easily as in classification tasks. The results are
intriguing. The search-control knowledge that Prodigy acquired actually increases
its problem-solving time. This begins to decrease after the initial large rise, but it
never quite returns to the level that existed before learning. Presumably, this effect
occurs because the cost of matching the acquired rules' complex conditions more than
offsets the savings due to reduced search, even though the settling phase was designed
to avoid this problem.
For our purposes, these results demonstrate the clear need for experimental studies
of learning's effect on performance. Without such evaluation, one cannot know
whether learning is actually beneficial. With such experiments, one can identify the
source of the degradation and modify the learning scheme to improve performance.
Fortunately, this is not the complete story on Prodigy; in fact, Etzioni carefully
designed this particular variant of the blocks world to encourage Prodigy to acquire
expensive search-control rules. We will return to this point in Section 5.
Figure 2. Learning curve (planning time in CPU seconds vs. number of training problems) for a
reduced version of Prodigy on problem-solving tasks from a variant of the blocks world, as reported
by Gratch (1991).

Segre, Elkan, and Russell (1991) have noted an important complication in the
experimental study of learning in problem-solving domains. Most problem solvers include
some computational limit and give up on a problem when they exceed it. Thus,
reporting only efficiency results can be misleading; it is essential to include information
about the percentage of test problems that the system has solved. Gratch was careful
to use only problems that Prodigy could solve within its computational limits, but
this may not be practical for some real-world problems. Also, in some domains, the
quality of problem solutions can also be important. Langley and Drummond (1990)
suggest some ways in which to instantiate this dependent variable.
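
A minimal sketch of this kind of evaluation, reporting coverage alongside efficiency, appears
below. The `solve` function and its time-limit argument are hypothetical placeholders rather
than part of Prodigy or Gratch's actual harness, and CPU time is measured here with Python's
process timer.

```python
import time

def evaluate_solver(solve, test_problems, cpu_limit=60.0):
    """Report both efficiency and coverage for a problem solver.

    `solve` is a placeholder: it is assumed to take (problem, limit) and to
    return a solution, or None if the computational limit is exceeded.
    Returns the mean CPU time over solved problems and the fraction of
    problems solved within the limit.
    """
    times, solved = [], 0
    for problem in test_problems:
        start = time.process_time()           # CPU time, not wall-clock time
        solution = solve(problem, cpu_limit)
        elapsed = time.process_time() - start
        if solution is not None:
            solved += 1
            times.append(elapsed)
    mean_time = sum(times) / len(times) if times else float('inf')
    return mean_time, solved / len(test_problems)
```
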
4. Varying the Learning Method
One of the most difficult problems confronting psychology is teasing apart the relative
effects of heredity and experience, of `nature' and `nurture'. Machine learning
is more fortunate, in that it can experimentally control a learning system's `innate'
features (nature) and the training instances it encounters (nurture). Here we examine
methods for evaluating the effect of system characteristics, delaying the role of
experience until the following section.
The obvious way to examine the influence of system features on behavior is to compare
different algorithms on the same task. That is, one runs two or more learning
systems on a given domain, measures their performance on the same test cases, and
compares the results. Until recently, such comparative studies were rare in the literature,
but now they have become almost the default, and the availability of standard
databases has provided a variety of domains to use in such experiments.4

4. In fact, most of the domains we mention in this paper are available by ftp from ics.uci.edu
using the account and password anonymous. The various data sets reside in the directory
pub/machine-learning-databases.

4.1 Gross Comparisons of Learning Methods


Comparative studies come in a variety of forms. If one's goal is a computational
model of human learning, then one should compare the algorithm's behavior with
that of human learners. For example, children pass through a number of `stages'
in their acquisition of language, and one can compare the model's learning curves
with those of children (Langley, 1982). Similarly, a model of skill acquisition should
account for the widely observed `power law' of learning (Rosenbloom & Newell, 1987).
Many factors affect human learning, making its experimental study difficult, but the
psychological literature is filled with studies awaiting computational explanations.
More often, a machine learning researcher is interested in an algorithm's behavior
for its own sake. However, even when studying an individual learning method, it
is best to place that method's behavior in context. One can usually compare the
system's performance to that of a `straw algorithm' that uses a simple-minded strategy.
For instance, in classification domains one can use an algorithm that predicts
the most frequently occurring class. If this covers 90% of the instances, then a more
sophisticated learner that achieves 91% accuracy is not impressive, and should be
examined for ways to improve its learning ability. This approach is different from using
a nonlearning performance system to establish a baseline, but the spirit is similar.
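
Such a straw algorithm takes only a few lines; the sketch below always predicts the most
frequent class seen during training (the `fit` and `classify` method names are simply
illustrative conventions, not part of any system described here).

```python
from collections import Counter

class MajorityClass:
    """Straw classifier: always predict the most frequent training class."""

    def fit(self, instances, labels):
        # Remember only the single most common class label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def classify(self, instance):
        return self.default
```
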
Shavlik, Mooney, and Towell (1991) provide an excellent example of comparing
alternative learning methods on a diagnostic task originally reported by Reinke (1984).
The goal is to classify soybean plants into one of 17 different disease categories based
on 50 nominal (symbolic) attributes, such as weather, time of year, and characteristics
of leaves and stems. There are 17 examples of each disease, giving a total of 289 cases.
Shavlik et al. randomly selected two-thirds of these as training instances, reserving
the remainder as test cases. They averaged their results over ten such partitions of
the soybean data set.
The authors examined the behavior of three algorithms on this domain. The
Perceptron algorithm (Rosenblatt, 1962), one of the simplest forms of connectionist
learning methods, represents knowledge as a single linear threshold unit. This results
in the well-known limitation that it can only discriminate concepts that are linearly
separable, that is, which can be separated by a single hyperplane drawn through
the instance space. Thus, Shavlik et al. included this method as a straw algorithm.
However, they also studied the behavior of Backpropagation (Rumelhart, Hinton,
& Williams, 1986), a more popular connectionist technique that supports learning
in multi-layer networks. Finally, they examined Quinlan's (1986) ID3, a widely used
algorithm for inducing decision trees.
Table 1 summarizes the behavior of these induction algorithms on the soybean domain
along three dimensions: classification accuracy on the training set, accuracy on
the test set, and training time (measured in CPU seconds). All systems achieve
perfect accuracy on the training set, but only the Perceptron method, which has
limited representational power, might have performed poorly on this front. A system
that simply remembers all observed instances could fare as well; this is the reason
accuracy on training cases is seldom useful. As expected, Backpropagation requires
considerably more computation than either ID3 or the simpler connectionist scheme.
This is not a performance measure but an indication of learning cost. Nevertheless,
it can be an important factor on sizable domains, and may be worth reporting.

Table 1. Percentage accuracies and training times for three induction algorithms in diagnosing
soybean diseases (Shavlik, Mooney, & Towell, 1991).

Algorithm      Accuracy (Training)   Accuracy (Test)   CPU Time (Training)
Perceptron     100.0 ± 0.0           92.9 ± 2.1          35.8 ± 5.2
ID3            100.0 ± 0.0           89.0 ± 2.0         161.0 ± 8.9
Backprop.       99.9 ± 0.2           94.1 ± 2.5        5260.0 ± 7390.0

The surprise comes when we examine classification accuracy on the test cases. The
Perceptron method appears to perform slightly worse than Backpropagation,
but it does much better than expected given the abuse it has taken over the years (e.g.,
Minsky & Papert, 1969). Even more unexpected, this technique actually has higher
accuracy than ID3, which employs a much more sophisticated induction technique.
The straw program refused to be blown over in this domain, suggesting it deserves
more attention than it has traditionally received. However, it would be premature to
conclude that one technique is superior to another based on their behavior in a single
domain, as we will argue in Section 5.
Note that the table presents not only the means for each combination of algorithm
and dependent measure, but the standard deviations as well. One can use this
information, together with the number of runs, to determine the probability that the
observed differences are due to chance. Shavlik et al. report that, using a t test,
the difference between the Perceptron and ID3 accuracies is significant at the 0.01
level. This means that the probability of observing such a difference by chance alone
is less than one percent. However, they also report that the apparent difference
between Backpropagation and Perceptron is not significant at this level. In the
absence of additional evidence, one must conclude that they are effectively equivalent
on this domain. Significance tests are especially important in comparative studies
and, although they are reported in few of the studies we will describe, we encourage
researchers to use them whenever possible.
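
The sketch below shows the mechanics of such a test, assuming SciPy is available; the two
accuracy lists would hold one value per random train/test partition, and the 0.01 threshold is
simply the significance level discussed above.

```python
from scipy import stats

def compare_accuracies(acc_a, acc_b, alpha=0.01):
    """Two-sample t test on per-run test-set accuracies of two methods.

    acc_a and acc_b each hold one accuracy per random train/test partition.
    Returns the p value and whether the observed difference is significant
    at the chosen level.
    """
    t_stat, p_value = stats.ttest_ind(acc_a, acc_b)
    return p_value, p_value < alpha
```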

4.2 Parametric Studies of Learning Methods


Given the complexity of many learning algorithms, one may not be satisfied with
comparisons between entire systems. The goal of experimentation is not to blindly
label one method as superior to another, but to understand the reasons for behavioral
differences. Finer-grained studies can be very useful in pursuit of this end.

An obvious approach is to examine the effect of parameters occurring in a system.
In such cases, one can determine the importance of a parameter on algorithm behavior
by systematically varying its settings and observing the results. Ideally, behavior will
be `acceptable' within a wide range of parameter values, with the system's behavior
changing slowly as the parameter varies. Alternatively, one might identify an optimal
setting that holds across different domains.
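
In code, such a parametric study is little more than a loop over settings with repeated runs at
each one. In the sketch below, `run_once` is a hypothetical callback that trains and tests the
system at a given parameter value; reporting a standard deviation alongside each mean helps
show whether behavior is stable over a range of settings rather than at a single best point.

```python
import statistics

def parameter_sweep(run_once, settings, n_runs=10):
    """Systematically vary one parameter, averaging over repeated runs.

    run_once(setting, run_index) is a placeholder that trains and tests the
    system with the given parameter value and returns an accuracy.
    """
    results = {}
    for setting in settings:
        scores = [run_once(setting, run) for run in range(n_runs)]
        results[setting] = (statistics.mean(scores), statistics.stdev(scores))
    return results

# e.g., with a hypothetical harness for a CN2-style pruning test:
# parameter_sweep(run_cn2_at_significance_level, [0.90, 0.95, 0.99])
```
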
Clark and Niblett (1989) provide an example of a parametric study. They describe
CN2, an algorithm that combines aspects of both Quinlan's (1986) ID3 and Michalski
and Chilausky's (1980) AQ11. The system carries out a beam search through a space
of rules, guided by an evaluation function based on information theory. Once it has
decided on a rule to cover some training instances, it removes these and iterates to
find additional rules. Because CN2 uses a statistical test to determine when to stop
adding conditions and rules, it should be able to avoid overfitting the training data
in noisy domains.
However, any statistical test requires one to specify some level of significance for
making decisions. Clark and Niblett ran CN2 with different significance levels on a
medical diagnosis task that involved classifying patients as either healthy or having
some form of lymph cancer. They carried out five runs on this domain, in each one
randomly selecting 70% of the 148 instances as training cases and the rest as test cases.
Using 90% as the significance level, CN2's average accuracy on the test instances was
78%. In contrast, at the 95% and 99% levels it was 81% and 82%, respectively. Thus,
the parameter setting appears to have some effect, but the difference is not a major
one. Of course, one cannot tell the actual amount of noise in such a real-world domain,
as we discuss further in Section 5.
Robertson and Riolo (1988) report another parametric study, in this case using a
genetic algorithm. This class of methods retains a number of rules in memory, which
compete for the chance to generate offspring through operations analogous to genetic
mutation and crossover. The authors hypothesize that one factor in genetic learning
is the number of copies retained of a given rule. Thus, they test their CFS system
with different limits on this number, measuring its behavior on a task that involves
learning to predict sequences of symbols. The results suggested a `U-shaped' curve,
in which performance increased with the number of copies allowed, but only up to a
certain point, beyond which it dropped again. Detecting such regularities can let one
fine-tune a learning system to increase its learning rate or asymptotic performance.

4.3 Lesion Studies of Learning Components


Parametric studies are not the only means of exploring the sources of power in an
intelligent system. One of the most common techniques in neuroscience involves the
excision of a well-defined area of the brain to determine its role in behavior, and there
is no difficulty in adapting this notion to the study of artifacts.
Figure 3. The effect of lesioning Stagger's Boolean mechanism for creating new conceptual
components (Schlimmer, 1987); the two curves plot predictive accuracy against the number of
training instances for weight learning alone and for weight learning plus Boolean feature construction.

Many machine learning systems contain a number of independent components, and
each component's usefulness can be studied through `lesion' experiments.5 In other
words, one runs the system with and without a given component, measuring the
difference on some performance dimension. If a component does not aid the overall
learning process, then it can be safely omitted from the system.
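
A lesion study can be organized the same way as the comparisons above, as in the sketch below;
`run_system` is a hypothetical harness that trains and tests the system with the named
components switched off and returns a performance score.

```python
def lesion_study(run_system, components):
    """Compare the full system against versions with one component disabled.

    run_system(disabled) is a placeholder harness: it trains and tests the
    system with the named components switched off and returns a performance
    score such as predictive accuracy on a held-out test set.
    """
    results = {"full system": run_system(disabled=set())}
    for component in components:
        results["without " + component] = run_system(disabled={component})
    return results

# e.g., with a hypothetical harness: lesion_study(run_learner, ["Boolean features"])
```
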
For example, Schlimmer (1987) describes a lesion study using an artificial task that
involves predicting the output of a 1 × 2 multiplexer. He designed this experiment
to evaluate the relative impact of two separate learning components in his Stagger
system. The first component assigns weights to the various components in a manner
reminiscent of the Perceptron algorithm, but using information about conditional
probabilities to speed learning. The second component augments this process by
forming logical combinations (conjuncts, disjuncts, and negations) of features that
have high diagnostic power, which are then used by the weight-learning routine.
Figure 3 presents the learning curves for the two experimental conditions, which
demonstrate the advantage of augmenting weight learning with a method for
introducing Boolean features. In this case, the main difference lies in Stagger's
asymptotic accuracy, which is much greater for the combined method. However, the
two-component version does take somewhat more training instances to reach its higher
asymptote than does the weight-learning method to reach its lower one.

5. Cohen and Howe (1988) have referred to such experiments as ablation studies.

Although this experiment focuses on a knowledge-lean method, lesion studies also
seem well suited to knowledge-intensive learning methods. One might `lobotomize' a
system by removing some of its knowledge or some of its mechanisms, then observe
the effect on learning. For instance, in explanation-based approaches, overly specific
domain theories would presumably lead to less transfer and thus to slower learning.
We will briefly describe one study involving the impact of knowledge in Section 6.
Before closing our discussion of experiments with learning methods, we should
emphasize that the goal of such studies is not to demonstrate superiority of one
method over another, but to increase understanding. Experiments may indeed reveal
limitations of particular methods or components, but this knowledge can in turn
suggest improved versions of the initial algorithm. For example, Aha, Kibler, and
Albert (1991) describe a learning algorithm that simply stores training instances
and uses a nearest-neighbor technique to classify new cases. Experiments reveal
drawbacks of this method, which they attempt to remedy by placing constraints on
the storage of instances. Lesion studies indicate the usefulness of this extension,
but further experiments suggest other problems, which they mitigate with another
addition to their instance-based algorithm. This process of incremental refinement
relies on understanding the reasons for a learning method's behavior, and this should
be the primary aim of experimentation.

5. Varying Characteristics of the Domain


As we mentioned earlier, innate biases or `nature' are not the only influence on a
learning system. One must also examine the effect of experience or `nurture' on
behavior, and this means systematically varying the environment or domain in which
the learner acquires knowledge. This presents the machine learning experimenter with
a choice. One can employ `natural' domains like the diagnostic tasks we examined
earlier. Alternatively, one can use `artificial' domains that have been designed with
specific characteristics in mind. In this section we examine these two options. As we will see,
each approach has its advantages and disadvantages, and we recommend both for the
experimental study of machine learning.

5.1 Studies with Natural Domains


Natural domains, such as Reinke's (1984) soybean diagnosis task, are the most obvious
testbeds because they show real-world relevance. Also, successful runs on a number
of different natural domains provide evidence of generality. For example, let us return
to Shavlik et al.'s study, which we discussed in Section 4.1, and consider it in more
depth.
Table 2 presents additional results for their three algorithms on four separate
classification tasks. These include the soybean domain described earlier, a task that
involves predicting the winner of chess end games based on 36 high-level features
(taken from Shapiro, 1987), an audiology domain that requires diagnosis of 24 hearing
disorders based on 58 features (taken from Bareiss, 1989), and a task that involves
determining whether a patient has heart disease, given eight nominal attributes and
six numeric ones. The table reports only accuracy on the test sets. For comparison,
we have repeated the results for the soybean domain.

Table 2. Percentage accuracies for three induction algorithms on four classification domains
(Shavlik et al., 1991).

Algorithm      Soybean Disease   Chess End Game   Audiology Diagnosis   Heart Disease
Perceptron     92.9 ± 2.1        93.9 ± 2.2       73.5 ± 3.9            60.5 ± 7.9
ID3            89.0 ± 2.0        97.0 ± 1.6       75.5 ± 4.4            71.2 ± 5.2
Backprop.      94.1 ± 2.5        96.3 ± 1.0       77.7 ± 3.8            80.6 ± 3.1

Recall that on the soybean data, the knowledge induced by both the Backpropagation
and Perceptron methods performed better than the ID3 algorithm. However,
by examining behavior across domains, Shavlik et al. demonstrated that this
result is misleading. Behavior in a single domain, even a real-world one, does not
necessarily generalize to other domains. On both the chess and audiology testbeds, both
ID3 and Backpropagation are significantly more accurate (at the 0.05 level) than
the Perceptron learning algorithm, but there is no significant difference between
the two more sophisticated methods. Backpropagation does significantly better
than ID3 in diagnosing heart disease, but the induced decision trees outperform the
learned perceptrons in turn (at the 0.01 level).
These results make one more confident in the non-naive approaches, but one would
still like to understand the reasons for ID3's poor behavior on the soybean data. One
possibility is that this domain is nearly linearly separable, but that the hyperplane is
not orthogonal to any of the axes in the instance space. Thus, the Perceptron
technique can accurately classify instances using a linear unit, whereas ID3 is forced
to approximate this with a highly disjunctive decision tree, in which each terminal
node is based on a small sample.
In the midst of this discussion, we should not forget one of the main points of the
Shavlik et al. study. Connectionist and `symbolic' induction algorithms, although
they rely on different representations of knowledge and use different methods to
acquire that knowledge, are dealing with essentially the same problem, and this means
that one can compare them on the same tasks. This form of comparative study is
much healthier for the field than rhetorical arguments about the limitations of existing
methods and the advantages of new approaches.
Experimental studies of problem-solving systems can also use multiple domains to
evaluate learning algorithms. In Section 3.3 we reviewed results from Gratch's (1991)
study of Prodigy on a single domain, but in fact he examined the system's behavior
on others as well. Figure 4 incorporates the learning curves for an extended version of
the Strips planning domain and for the original version of the blocks world used by
Minton (1990). The results here are much more encouraging, with Prodigy showing
clear improvement by the tenth training problem in both cases. After this point, the
system seems to have stabilized, apparently having completed its acquisition of useful
search-control knowledge.

Figure 4. Learning curves (planning time in CPU seconds vs. number of training problems) for the
Prodigy algorithm on three problem-solving domains: the normal blocks world, the modified blocks
world, and an extended Strips domain (Gratch, 1991).

This raises issues about the reasons Prodigy encounters difficulty in the original
domain we examined. As mentioned earlier, Etzioni (1990) designed this variant of
the blocks world, which includes a single additional operator that lets one move two
blocks at a time, to produce just such a negative effect in Prodigy. He provides
an interesting analysis of the causes for the system's divergent behaviors in these
domains. This technique, altering an existing domain to elicit some effect, is a
powerful experimental tool, and it leads naturally into our next topic.

5.2 Noise in Artificial Domains


Studies with multiple natural domains are much more revealing than single-domain
studies, in that they give evidence about the generality of learning phenomena. However,
they provide little aid in understanding the effects of domain characteristics,
since they do not let one independently vary different aspects of the environment. A
given natural domain may be difficult along many dimensions, and one would like to
know which factor is responsible for particular aspects of behavior. Artificial domains
provide a way out of this dilemma by letting one control domain characteristics as
independent variables. Instead of carrying out experiments with real-world domains
having unknown characteristics, one can design domains that have exactly the features
one wants to study.
For instance, Breiman, Friedman, Olshen, and Stone (1984) report an artificial
domain they designed to test the effectiveness of their Cart algorithm for decision-tree
induction. The domain concerns a simulated LED display in which digits are
described by seven Boolean features. The performance task involves classifying
particular displays as one of the ten digits, which one must learn from classified training
instances. However, to make the learning task difficult, they added random noise to
features in the training instances, thus simulating a faulty display. To be specific,
they introduced a ten percent noise level for each feature, by which they meant that
each Boolean value was inverted with 0.1 probability.
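
A generator for this kind of domain is short to write. The sketch below uses the usual
seven-segment encodings of the ten digits (an illustrative rendering, not taken from Breiman
et al.'s description) and inverts each Boolean feature with the stated probability.

```python
import random

# Standard seven-segment patterns for the digits 0-9 (segments a-g);
# a 1 means the segment is lit.
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0),
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

def led_instance(noise=0.1, rng=random):
    """Generate one (features, digit) pair from the noisy LED domain.

    Each of the seven Boolean features is inverted with probability
    `noise`, simulating a faulty display.
    """
    digit = rng.randrange(10)
    features = [1 - s if rng.random() < noise else s for s in SEGMENTS[digit]]
    return features, digit

def led_sample(n, noise=0.1, seed=0):
    rng = random.Random(seed)
    return [led_instance(noise, rng) for _ in range(n)]
```
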
Breiman et al. compared Cart's accuracy on this domain to a `straw algorithm'
that simply predicts the most frequent class. Moreover, using their knowledge about
the probability of noise in features, they computed the predictive accuracy for an
optimal classifier. Thus, they established best-case learning behavior, which would
be impossible for a real-world problem. For the LED domain with ten percent noise,
the optimal accuracy is 74%. Because Cart uses a statistical pruning technique to
avoid overfitting the training data, they expected it would approach this level. Their
experimental results backed this prediction, showing an accuracy of 70% for Cart
after 200 training instances, but only a 10% accuracy for the frequency method. Thus,
their algorithm fares almost as well as possible on the LED task.
The Breiman et al. study used an artificial domain with controlled noise level, but
it did not systematically vary this variable to determine the algorithms' behaviors
across a range of noise levels. Quinlan (1986) provides an example of this type of
experiment. He studied the classification accuracy of the trees induced by his ID3
algorithm when he varied the amount of noise in the training instances.
In particular, Quinlan examined the effect of noise when it occurred in a single
(nominal) attribute, when it was present in all attributes, and when it occurred in
the class label. The definition of noise in this study is somewhat different, referring
to the probability of replacing the actual value with a randomly selected value (which
might still be correct). Thus, the maximum noise level is 100%, in which case
the attribute or label contains no useful information. The ID3 algorithm differs from
Cart in its response to overfitting, halting construction of the decision tree when a
statistical test indicates that the training data fail to justify further splits. However,
Quinlan anticipated that this approach would let the system degrade gracefully for
all three forms of noise.
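
The sketch below implements noise in Quinlan's sense: with the given probability a value is
replaced by one drawn uniformly from its domain, which may happen to be the original value,
so a 100% level destroys all information without making every value wrong. The function and
parameter names are illustrative, not taken from Quinlan's study.

```python
import random

def add_noise(value, level, domain, rng=random):
    """With probability `level`, replace the value with one drawn uniformly
    from `domain`; the replacement may equal the original."""
    return rng.choice(domain) if rng.random() < level else value

def corrupt(instances, labels, attr_noise, class_noise, domains, classes, seed=0):
    """Return noisy copies of attribute vectors and class labels.

    `domains[j]` lists the legal values of attribute j; `classes` lists the labels.
    """
    rng = random.Random(seed)
    noisy_x = [[add_noise(v, attr_noise, domains[j], rng)
                for j, v in enumerate(x)] for x in instances]
    noisy_y = [add_noise(y, class_noise, classes, rng) for y in labels]
    return noisy_x, noisy_y
```
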
Figure 5 shows the results when noise was added to training instances taken from
a task involving chess end games, similar but not identical to that used in the Shavlik
et al. (1991) study. Noise in the class label degrades performance much more than
noise in individual attributes, as one might expect. Also, the former changes in a
roughly linear fashion, whereas the latter appears logarithmic. One might predict
that noise in all attributes would make learning more difficult than noise in any single
feature, including the class label. Indeed, the curve for this condition goes up rapidly,6
but then actually decreases and levels off at a 26% error rate. Quinlan explains this
surprising result by noting that, beyond a certain noise level, ID3's pruning technique
leads to one-node trees that simply predict the most frequent class. The dip in the
curve suggests the parameter setting for the statistical test is slightly high, allowing
some overfitting to occur around the 40% noise level.

6. Note that the dependent variable reported in this case is percentage error, rather than the accuracy
measure used in the previous studies we have examined.

Figure 5. The effect of three types of noise (in the class label, in one attribute, and in all
attributes) on percentage error in Quinlan's (1986) ID3, plotted against the level of noise.

5.3 The Effect of Irrelevant Attributes


Artificial domains are also useful for examining the effect of irrelevant attributes on
learning. In general, as the number of attributes increases, the number of possible
concept descriptions grows exponentially (Haussler, 1987). Intuitively, learning should
be more difficult in domains that contain more alternative hypotheses. If an algorithm
has no way to identify relevant features early in training, increasing the number of
attributes could drastically slow the rate of learning.
However, the effect of irrelevant features on any particular system is an empirical
question, and many induction algorithms include techniques that should let them
effectively ignore attributes that contain no useful information. For instance, Fisher's
(1987) Cobweb system uses an information-theoretic evaluation function to classify
instances through its probabilistic concept hierarchy. This function subtracts out the
information that has already been summarized at a parent node, and thus emphasizes
attributes that serve to distinguish concepts at the same level.

Figure 6. Learning curves (absolute error vs. number of training instances) for Gennari's (1990)
Classit on domains with 0, 4, 8, and 16 irrelevant attributes.

Gennari (1990) examined the effect of this factor on the behavior of Classit,
an extension of Cobweb that handles both symbolic and numeric attributes. He
used a set of artificial domains that involved four separate classes, each differing in
their values on four relevant numeric attributes. However, the domains varied in
the number of irrelevant attributes, which have the same probability distribution
independent of class, from zero to sixteen. All domains had small but definite
amounts of attribute noise, and training instances were unclassified. The performance
task involved predicting the numeric values of single relevant attributes omitted from
test instances, and the dependent measure was the absolute error between the actual
and predicted values.
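
A generator in the spirit of these domains might look as follows; the class means, spread, and
noise level below are illustrative choices rather than Gennari's actual settings, with only the
number of irrelevant attributes intended as the independent variable.

```python
import random

def make_domain(n_irrelevant, n_classes=4, n_relevant=4, noise_sd=0.25, seed=0):
    """Return a generator of instances with relevant and irrelevant attributes.

    Each class has its own mean on the relevant numeric attributes, while
    irrelevant attributes are drawn from one distribution regardless of
    class; a little Gaussian noise is added to every relevant value.
    """
    rng = random.Random(seed)
    # One distinct mean vector per class on the relevant attributes.
    class_means = [[float(c + 1)] * n_relevant for c in range(n_classes)]

    def instance():
        c = rng.randrange(n_classes)
        relevant = [m + rng.gauss(0, noise_sd) for m in class_means[c]]
        irrelevant = [rng.gauss(0, 1) for _ in range(n_irrelevant)]
        return relevant + irrelevant, c       # class label kept only for scoring

    return instance

# e.g. generators for the 0-, 4-, 8-, and 16-irrelevant conditions:
# generators = {k: make_domain(k) for k in (0, 4, 8, 16)}
```
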
Figure 6 presents the results, which are based on ten different orders of randomly
generated training instances. The graph suggests that Classit is robust with respect
to irrelevant attributes, with an asymptote around 2.0, regardless of the number of
irrelevant terms. This is close to the `ideal' error of 0.47, which is the error for
the best possible predictions that could be based on the observed training instances.
Classit's asymptote is also considerably less than that of a naive algorithm which
simply predicts the mean value for each attribute, independent of its class. This
provides another example of how one can use straw algorithms and optimal ones to
calibrate learning behavior. The system's rate of learning does seem affected by the
number of irrelevant attributes, but Classit appears to scale well on this dimension,
at least in the current domain.


Although the notion of irrelevancy has been most widely studied for inductive
learning and for classification tasks, it has clear analogues in other approaches and
different domains. For instance, Iba (1989) has demonstrated that promiscuous learning
of macro-operators can degrade the performance of problem-solving systems. He
shows that one can use statistical and other methods to eliminate such knowledge
structures, retaining ones that actually reduce search on test problems. His studies
focused on difficult but well-structured puzzles that had many aspects of artificial
domains. Tambe, Newell, and Rosenbloom (1990) use an even more idealized search
problem to study the effect of expensive rules on learning in problem solving.
Similarly, we suspect that irrelevant knowledge could slow the learning rate of analytic
learning approaches by producing misleading explanations or making derivations
intractable. Techniques for selecting among competing explanations and selecting
likely search paths could play a similar role to the evaluation function that Classit
uses to ignore irrelevant attributes. Artificial domains, including both relevant and
irrelevant background knowledge, are an obvious approach to testing this hypothesis.
Elio and Watanabe (1991) describe one such study, in which they use carefully
designed rules to study how the size and `shape' of background knowledge affects
constructive induction.

5.4 The Effect of Concept Complexity


Another important characteristic of classification domains is the complexity of the
concepts that describe their regularity, and one can use artificial domains to study the
effect of concept complexity on learning. For instance, Langley (1987) systematically
varied the number of conjunctive and disjunctive features in concepts, studying the
impact of these factors on an incremental learning algorithm. Iba, Wogulis, and
Langley (1988) report the results of a similar study with the Hillary system.
However, Rendell and Cho (1990) have argued that many real-world concepts are
much more complex than those typically used in experimental studies. They view
concepts as functions over the space of instances, measuring complexity as the number
of `peaks' or disjoint regions of classes in this space. There are now many algorithms
that can acquire disjunctive concepts, but the authors hypothesized that existing
techniques would break down on domains involving very many peaks.
To test this hypothesis, Rendell and Cho used an automated data generator to
produce training and test sets for a variety of domains that had between one and 1000
peaks. Figure 7 shows the results for PLS1, a nonincremental induction algorithm
that is similar to ID3, based on training sets with 2000 instances. The predictive
accuracy of the induced concept decreases nearly linearly with the log of the domain
complexity, even when the training data are free of noise. A similar but more ragged
effect occurs when there is 30% noise in the class label, though this curve is lower
overall. Also, Quinlan's ID3 algorithm produces a similar degradation as complexity
increases. Rendell and Cho suggest methods for representation change as one
approach to grouping peaks and thus reducing effective complexity.

Figure 7. Predictive accuracy of the PLS system as a function of concept complexity (number of
peaks, log base 2), for 0% and 30% class noise, as reported by Rendell and Cho (1990).


Issues of complexity are not limited to classification tasks. One can also vary the
regularity of problem spaces, the structure of grammars, and the form of scientific
laws. Artificial domains have a role to play in these domains as well, although clear
definitions of complexity have not yet been forwarded for these more advanced data
structures. Extending complexity measures to non-classification tasks is a prerequisite
for understanding such domains, and thus should have priority in future work.

6. Stages of the Experimental Process


Before closing, it seems worth reviewing the basic steps involved in the experimental
study of machine learning. The basic procedure differs little from that in other
experimental sciences, except for the nature of the independent and dependent variables,
which we have discussed in the previous sections. Many of our points will appear
obvious to readers, but given the youth of our field, they are worth reiterating.
6.1 Formulating Hypotheses
In many situations, a researcher has clear expectations about the effects he will
observe in an experiment. If so, it is important to state these hypotheses explicitly
and to use them in focusing his/her experimental design. In many cases, these will
be vague and qualitative. For instance, an experimenter will typically believe that an
algorithm will lead to improved performance as the result of experience. Similarly,
he/she may predict that an induction algorithm with pruning will produce more
accurate decision trees in a noisy domain than one without pruning.
Some studies, particularly those involving natural domains, are so exploratory that
no clear hypotheses suggest themselves. But many experiments are based on some
analogy with previous studies, and in these situations, it seems worth stating pre-
dictions formally. In our own experience, predictions are often violated, and having
stated them at the outset helps one focus attention on interesting phenomena, even
when they are qualitative in nature.
In some cases, one has a clear model of both the algorithm and the learning
environment, particularly when working with simple algorithms and artificial domains. If
one is willing to make sufficient assumptions about the distribution of training data,
one can make detailed predictions about the system's behavior, as Cohen (1991) has
encouraged. For example, Pazzani and Sarrett (1990) present an average-case analysis
of a conjunctive induction algorithm, which lets them predict detailed learning
curves for domains with various characteristics.
In a similar vein, Thompson, Langley, and Iba (1991) describe an analysis that lets
them predict the benefit their Labyrinth system receives from background knowledge
in comparison to Fisher's Cobweb, which cannot use the same form of knowledge.
To accomplish this, they make assumptions about the number of concepts in a
background is-a hierarchy, the number of components associated with each concept,
and the number of possible types for each component. They also assume a regular
structure for the background knowledge and a uniform distribution of instances.7
From this they calculate the theoretical learning curves presented in Figure 8. Such
detailed hypotheses are not required for progress in machine learning, but they have
clear advantages over qualitative predictions.

6.2 Designing Experiments and Selecting Samples


Having decided on a set of hypotheses, the researcher must next design one or more
experiments to test them. The obvious requirement here is to decide on the depen-
dent and independent variables. Since we have spent many of the preceding pages
examining the various options, we will not repeat them here. In most cases, the hy-
potheses themselves will suggest a small set of variables, and the experimenter need
only decide which measures best suit his/her purpose. A complete design must also
include decisions about the number of runs to average across, the range of each inde-
pendent variable, and the step size for each such factor. If the independent variables
are qualitative in nature, one must specify the set of values they take on. For exam-
ple, one must enumerate the algorithms to be tested, the components to be lesioned,
or the natural domains from which one will draw instances.

7. Most theoretical analyses of learning tasks and algorithms have aimed for distribution-independent
results. However, this bias differs from those of more mature sciences like physics
and chemistry, which are willing to make detailed assumptions to generate precise predictions,
then to reconsider those assumptions if predictions are violated.

Figure 8. Theoretical and observed learning curves (predictive accuracy vs. number of training
instances) for Labyrinth and Cobweb in the presence of background knowledge (Thompson,
Langley, & Iba, 1991).


Another issue in experimental design involves sampling strategies. In the natural
sciences, one can never control all possible variables. As a result, researchers must
collect multiple observations for each cell in their experimental design and average the
resulting values. As a science of the artificial, machine learning can avoid some but
not all of these complications. One has control over the learning algorithm and the
environment, but practical concerns still come into play. In particular, one cannot
examine all possible training and test sets in a natural domain, so typically one
randomly selects a number of such sets for use in an experiment, then averages over
the results. Similarly, one cannot examine all possible training orders for incremental
learning methods, so one must resort to a set of randomly selected orders.
Basic experimental method recommends varying the value of one independent variable
while holding the others constant. However, one can apply this process iteratively to
obtain factorial designs, in which one observes the dependent measure(s) under all
combinations of independent values. This lets one move beyond isolated effects and
look for interactions between independent variables. For instance, one might hypothesize
that a decision-tree algorithm will fare better in one environment and that a perceptron
method will fare better in another, as argued by Utgoff (1988). Factorial designs let
one measure such interactions. The results of Rendell and Cho's study, illustrated in
Figure 7, revealed no interaction between complexity and noise; rather, their effects
on accuracy appeared to be additive.
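The Python sketch below illustrates a two-factor factorial design: it crosses every noise level with every level of concept complexity, fills each cell with an accuracy, and compares the observed cell values against a purely additive prediction built from row and column effects; sizable residuals would signal an interaction. The run_condition function is a hypothetical stand-in for a full experiment, and its numbers are invented.

    from itertools import product

    def run_condition(noise, complexity):
        # Hypothetical stand-in: replace with the averaged accuracy actually
        # measured for this cell of the design.
        return 95.0 - 40.0 * noise - 5.0 * complexity

    noise_levels = [0.0, 0.1, 0.2]
    complexities = [1, 2, 3]

    # Fill every cell of the factorial design.
    cells = {(n, c): run_condition(noise=n, complexity=c)
             for n, c in product(noise_levels, complexities)}

    # Additive prediction for a cell: grand mean + row effect + column effect.
    grand = sum(cells.values()) / len(cells)
    row = {n: sum(cells[(n, c)] for c in complexities) / len(complexities) - grand
           for n in noise_levels}
    col = {c: sum(cells[(n, c)] for n in noise_levels) / len(noise_levels) - grand
           for c in complexities}

    for (n, c), observed in cells.items():
        residual = observed - (grand + row[n] + col[c])
        print(f"noise={n:.1f} complexity={c}: observed={observed:.1f} "
              f"interaction residual={residual:+.2f}")
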

6.3 Running Experiments and Compiling Results


Given a clear experimental design, one can carry out the experiment that it specifies.
For this one must gather the training instances or problems, implement or access the
algorithms, run the algorithms on the training cases, and measure their performance
for each sample in each experimental condition (i.e., combination of independent
variables). One then averages across all samples in a condition and organizes the
results in some readable format such as tables or graphs. This step is probably the
least controversial activity in an experimental study.
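As a small illustration of this compilation step, the Python fragment below averages raw per-sample accuracies within each experimental condition and prints them as a readable table; the condition names and numbers are invented.

    from statistics import mean, stdev

    # Raw accuracies for each condition (invented numbers for illustration).
    raw = {
        ("decision-tree", "0% noise"):  [88.0, 90.5, 87.0, 91.0],
        ("decision-tree", "10% noise"): [80.0, 78.5, 82.0, 79.0],
        ("perceptron",    "0% noise"):  [84.0, 85.5, 83.0, 86.0],
        ("perceptron",    "10% noise"): [76.0, 74.0, 77.5, 75.0],
    }

    print(f"{'algorithm':<15}{'condition':<12}{'mean':>8}{'s.d.':>8}")
    for (algorithm, condition), scores in raw.items():
        print(f"{algorithm:<15}{condition:<12}{mean(scores):8.1f}{stdev(scores):8.1f}")
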
We have seen many examples of experimental results in this paper, including learning
curves, asymptotic accuracies in comparative studies, and the effects of noise and
other factors on asymptotes and learning rates. Such statistics are the most obvious
product of scientific experimentation. Figure 8 presents another example, in this case
the results of Thompson et al.'s comparative study of Labyrinth and Cobweb in
the presence of background knowledge.

6.4 Testing Hypotheses


Once the experimenter has collected and organized the data, they can be used to draw
tentative conclusions. In an exploratory study, the results may suggest hypotheses
that require additional experiments. In other cases, one will have hypotheses and use
the observations to test them. Thus, one can examine learning curves to determine
whether the acquired knowledge actually improves performance, or one can compare
different experimental conditions to see whether the number of irrelevant variables
affects asymptotic accuracy. In some cases, regularities in the data may suggest de-
tailed models that would explain them. For instance, both Quinlan's results on noise
(Figure 5) and Rendell and Cho's findings (Figure 7) involved near-linear relations
that call out for explanations.
As we saw in Section 4.1, one can use statistical methods to test some hypotheses,
and these indicate the confidence with which one can believe apparent differences.
This confidence level is affected by three factors: the observed differences between
conditions, the number of samples in each condition, and the variances of those sam-
ples. Thus, even a large difference may not be robust if the sample is small or the
variance is high, making it desirable to use significance tests whenever possible. Such
tests make the most sense when comparing nominal conditions, such as alternative al-
gorithms or different natural domains. Other statistical methods, such as correlation
analysis, can be used for numeric variables.
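Assuming a library such as SciPy is available, the fragment below shows how one might apply both kinds of test: a two-sample t test to decide whether two algorithms' mean accuracies differ reliably, and a correlation coefficient relating a numeric factor to accuracy. All numbers are invented for illustration.

    from scipy import stats

    # Accuracies from repeated runs of two algorithms (invented numbers).
    algorithm_a = [88.0, 90.5, 87.0, 91.0, 89.5]
    algorithm_b = [84.0, 85.5, 83.0, 86.0, 84.5]

    # Two-sample t test: is the apparent difference in means reliable?
    t_statistic, p_value = stats.ttest_ind(algorithm_a, algorithm_b)
    print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")

    # Correlation between a numeric independent variable and accuracy.
    noise_levels = [0.00, 0.05, 0.10, 0.15, 0.20]
    accuracies   = [91.0, 88.5, 86.0, 83.0, 80.5]
    r, p = stats.pearsonr(noise_levels, accuracies)
    print(f"r = {r:.2f}, p = {p:.4f}")
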
Ironically, significance tests are least relevant when one has a detailed model that
makes numeric predictions. Consider the theoretical and observed learning curves in
Figure 8. The analysis specifies a clear difference in learning rate between the two
algorithms, but predicts the same asymptote. These trends are clearly apparent in
the experimental curves as well. Here the issue is the degree to which the predictions
match the observations. One can use a technique like correlation analysis for this
purpose. Alternatively, one can use the standard deviation of each point on the curve to
draw `error bars' around the curve, then see whether the theoretical curve falls within
these ranges, as Pazzani and Sarrett (1990) have done. But in general, theory-laden
sciences like physics have less need of statistical hypothesis testing than experiment-
driven ones like psychology, and we hope that as machine learning matures, it will
progress from the latter into the former.
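The Python fragment below sketches the error-bar check just described: for each point on the learning curve it computes the mean and standard deviation of the observed accuracies and reports whether the theoretical prediction falls within one standard deviation of the observed mean. The predictions and observations here are invented numbers, not those of Figure 8.

    from statistics import mean, stdev

    # Observed accuracies from several runs at each training-set size, and a
    # theoretical prediction for each size (all numbers invented).
    observed = {
        10: [62.0, 65.5, 60.0, 64.0],
        20: [74.0, 76.5, 72.0, 75.0],
        40: [84.0, 86.0, 83.5, 85.0],
    }
    predicted = {10: 63.0, 20: 77.0, 40: 84.5}

    for size, scores in observed.items():
        m, s = mean(scores), stdev(scores)
        within = abs(predicted[size] - m) <= s
        print(f"{size:3d} instances: observed {m:.1f} +/- {s:.1f}, "
              f"predicted {predicted[size]:.1f}, within error bar: {within}")
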

6.5 Explaining Unexpected Results


Hypotheses in machine learning are based on some model of an algorithm and an
environment, whether this is explicit or not. Results that agree with a hypothesis
lend evidence to that model, though they do not `confirm' it; science can never draw
final conclusions about any situation. Results that diverge from one's expectations
count as evidence against a model, and thus require additional explanation.
In some cases, explanations of rejected hypotheses may involve altering assumptions
about the environment. Thus, one may posit that the Perceptron algorithm did
well on a particular domain because it was linearly separable, even though this was
not anticipated at the outset. Other explanations concern the algorithms themselves.
For instance, Thompson et al. suggest that Labyrinth's and Cobweb's behaviors
diverge slightly from the theoretical curves in Figure 8 because they cannot retrieve
some instances due to poor indexing.
In either case, faulty predictions indicate that one's model needs improvement,
often making them more significant than positive results. More important, they can
indicate directions in which to make changes. The ensuing altered models, whether
formal or informal, suggest new hypotheses and predictions, which in turn suggest
new experiments to test them. In other words, the iterative loop of hypothesize and
test is as valid for machine learning as for any other experimental discipline.

6.6 Communicating Experimental Results


Like other sciences, machine learning is largely a communal activity, and this makes
clear communication essential. Replication plays an important role in physics, chem-
istry, biology, and other mature sciences, since it ensures that results are robust and
general before they become widely accepted. Such replication would aid our field as
well, but it requires detailed enough descriptions to let researchers at other sites re-
peat the conditions of original studies. Machine learning has made an excellent start
in using a standard set of natural domains and in providing pseudocode descriptions
of algorithms, which allow reconstruction of learning systems even when the original
code is unavailable.
However, replication also requires precise descriptions of the independent and de-
pendent variables, the number of runs, the sampling strategy, and other details of the
experimental design. Factors used to generate artificial data, such as one's definition
of noise and irrelevant attributes, are also essential. Finally, one should include
information about statistical tests used to evaluate hypotheses in communications of
experimental results, since these depend on assumptions that others may question.
Clear descriptions in a technical report or an archival journal constitute the final stage
in an experimental study.

7. Conclusions
One can trace experimental approaches to machine learning back more than two
decades (e.g., Hunt, Marin, & Stone, 1966), but the `modern' era of experimentation
began about five years ago. Since then, the number of experimental studies has
grown at a rapid pace, with researchers identifying new dependent and independent
variables, testing existing systems on new domains, and improving these systems when
they encounter difficulties. Many experimental studies produce unexpected results,
forcing the experimenter to think deeply about reasons for the observed learning
behavior.
In general, the eld of machine learning occupies a much healthier methodologi-
cal state than a decade ago. However, the experimental method has been adapted
more quickly to some areas than others. Early experimentation focused on inductive
approaches to classification, as the current paper reflects in its examples, but recent
years have seen many analogous studies of learning in problem-solving domains and
experiments on explanation-based methods. Researchers have also started to measure
the influence of background knowledge on inductive learning.
In summary, machine learning occupies a fortunate position that makes systematic
experimentation easy and profitable. Some methodological questions remain unan-
swered, but researchers have made an excellent start, and we expect the future holds
improved dependent measures, better independent variables, and more useful experi-
mental designs. There remains room for improvement in all areas of machine learning,
but the discipline seems well on its way to developing a sound experimental tradition.
However, these successes do not mean that empirical researchers should report
gratuitous experiments any more than theoreticians should publish vacuous proofs.
Whether they lead to positive or negative results, experiments are worthwhile only
to the extent that they illuminate the nature of learning mechanisms and the reasons
for their success or failure. Although experimental studies are not the only path
to understanding, we feel they constitute one of machine learning's brightest hopes
for rapid scientific progress, and we encourage other researchers to join in our field's
evolution toward an experimental science.

Acknowledgements
We would like to thank David Aha, Wayne Iba, David Ruby, Jeff Schlimmer, Kevin
Thompson, and Tom Dietterich for helpful comments on previous drafts. Earlier ver-
sions of this paper appeared in the journal Machine Learning and in the Proceedings
of the Third European Working Session on Learning.

References
Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Bareiss, E. R. (1989). Exemplar-based knowledge acquisition: A unified approach to concept representation, classification, and learning. Boston: Academic Press.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–284.
Cohen, P. R. (1991). A survey of the Eighth National Conference on Artificial Intelligence: Pulling together or pulling apart? AI Magazine, 12, 16–41.
Cohen, P. R., & Howe, A. E. (1988). The invisible hand: How evaluation guides AI research (COINS Technical Report 88-21). Amherst: University of Massachusetts, Department of Computer and Information Science.
DeJong, G., & Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176.
Dietterich, T. G. (1990). Machine learning. Annual Review of Computer Science, 4.
Elio, R., & Watanabe, L. (1991). An incremental deductive strategy for controlling constructive induction in learning from examples. Machine Learning, 7, 7–44.
Etzioni, O. (1990). Why Prodigy/EBL works. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 916–922). Boston, MA: AAAI Press.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139–172.
Gennari, J. H. (1990). An experimental study of concept formation. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine.
Gratch, J. (1991). Utility generalization and composability problems in explanation-based learning (Tech. Rep. No. UIUUCDCS-R-91-1681). Urbana: University of Illinois, Department of Computer Science.
Haussler, D. (1987). Learning conjunctive concepts in structural domains. Proceedings of the Sixth National Conference on Artificial Intelligence (pp. 466–470). Seattle, WA: AAAI Press.
Haussler, D. (1990). Probably approximately correct learning. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1101–1108). Boston, MA: AAAI Press.
Hawking, S. (1988). A brief history of time. New York: Bantam Books.
Hunt, E. B., Marin, J., & Stone, P. J. (1966). Experiments in induction. New York: Academic Press.
Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.
Iba, W., Wogulis, J., & Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. Proceedings of the Fifth International Conference on Machine Learning (pp. 73–79). Ann Arbor, MI: Morgan Kaufmann.
Kearns, M., Li, M., Pitt, L., & Valiant, L. G. (1987). Recent results on Boolean concept learning. Proceedings of the Fourth International Workshop on Machine Learning (pp. 337–352). Irvine, CA: Morgan Kaufmann.
Langley, P. (1982). Language acquisition through error recovery. Cognition and Brain Theory, 5, 211–255.
Langley, P. (1987). A general theory of discrimination learning. In D. Klahr, P. Langley, & R. Neches (Eds.), Production system models of learning and development. Cambridge, MA: MIT Press.
Langley, P., & Drummond, M. (1990). Toward an experimental science of planning. Proceedings of the 1990 Darpa Workshop on Innovative Approaches to Planning, Scheduling, and Control (pp. 109–114). San Diego, CA: Morgan Kaufmann.
McKusick, K. B., & Langley, P. (1991). Constraints on tree structure in concept formation. Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Sydney: Morgan Kaufmann.
Michalski, R. S., & Chilausky, R. L. (1980). Knowledge acquisition by encoding expert rules versus computer induction from examples: A case study involving soybean pathology. International Journal of Man-Machine Studies, 12, 63–87.
Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. (1986). Explanation-based learning: A unifying view. Machine Learning, 1, 47–80.
Minton, S. N. (1985). Selectively generalizing plans for problem solving. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 596–599). Los Angeles: Morgan Kaufmann.
Minton, S. N. (1990). Quantitative results concerning the utility of explanation-based learning. Artificial Intelligence, 42, 363–391.
Nilsson, N. J. (1965). Learning machines. New York: McGraw-Hill.
Pazzani, M. J., & Sarrett, W. (1990). Average case analysis of conjunctive learning algorithms. Proceedings of the Seventh International Conference on Machine Learning (pp. 339–347). Austin, TX: Morgan Kaufmann.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Reinke, R. (1984). Knowledge acquisition and refinement tools for the ADVISE meta-expert system. Master's thesis, Department of Computer Science, University of Illinois, Urbana.

Rendell, L., & Cho, H. (1990). Empirical learning as a function of concept character. Machine Learning, 5, 267–298.
Robertson, G. G., & Riolo, R. L. (1988). A tale of two classifier systems. Machine Learning, 3, 139–159.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.
Rosenbloom, P., & Newell, A. (1987). Learning by chunking: A production system model of practice. In D. Klahr, P. Langley, & R. Neches (Eds.), Production system models of learning and development. Cambridge, MA: MIT Press.
Schlimmer, J. C. (1987). Concept acquisition through representational adjustment. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine, CA.
Segre, A., Elkan, C., & Russell, A. (1991). A critical look at experimental evaluations of EBL. Machine Learning, 6, 183–195.
Shapiro, A. (1987). Structured induction in expert systems. Reading, MA: Addison-Wesley.
Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning: An experimental comparison. Machine Learning, 6, 111–143.
Simon, H. A. (1969). The sciences of the artificial. Cambridge, MA: MIT Press.
Tambe, M., Newell, A., & Rosenbloom, P. S. (1990). The problem of expensive chunks and its solution by restricting expressiveness. Machine Learning, 5, 299–348.
Thompson, K., Langley, P., & Iba, W. F. (1991). Using background knowledge in concept formation. Proceedings of the Eighth International Workshop on Machine Learning. Evanston, IL: Morgan Kaufmann.
Utgoff, P. E. (1988). Perceptron trees: A case study in hybrid concept representations. Proceedings of the Seventh National Conference on Artificial Intelligence (pp. 601–606). St. Paul, MN: AAAI Press.
