WEKA
A Machine Learning Workbench for Data Mining
Len Trigg
Reel Two, P O Box 1538, Hamilton, New Zealand
[email protected]
Keywords: machine learning software, data mining, data preprocessing, data visu-
alization, extensible workbench
1. Introduction
Experience shows that no single machine learning method is appro-
priate for all possible learning problems. The universal learner is an
idealistic fantasy. Real datasets vary, and to obtain accurate models the
bias of the learning algorithm must match the structure of the domain.
The Weka workbench is a collection of state-of-the-art machine learn-
ing algorithms and data preprocessing tools. It is designed so that users
can quickly try out existing machine learning methods on new datasets
in very flexible ways. It provides extensive support for the whole process
of experimental data mining, including preparing the input data, evalu-
ating learning schemes statistically, and visualizing both the input data
and the result of learning. This has been accomplished by including a
wide variety of algorithms for learning different types of concepts, as well
as a wide range of preprocessing methods. This diverse and comprehen-
sive set of tools can be invoked through a common interface, making it
possible for users to compare different methods and identify those that
are most appropriate for the problem at hand.
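One way to see the effect of this uniformity from the Java side is the
following minimal sketch, which assumes the Weka 3 API of the time and a
placeholder dataset file name; it estimates the accuracy of two different
learning schemes on the same data using ten-fold cross-validation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareSchemes {
  public static void main(String[] args) throws Exception {
    // Load a dataset; "iris.arff" is a placeholder file name.
    Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
    data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

    // Two schemes from different classifier subpackages, evaluated identically.
    Classifier[] schemes = { new J48(), new NaiveBayes() };
    for (int i = 0; i < schemes.length; i++) {
      Evaluation eval = new Evaluation(data);
      eval.crossValidateModel(schemes[i], data, 10, new Random(1)); // ten-fold CV
      System.out.println(schemes[i].getClass().getName() + ": "
          + eval.pctCorrect() + "% correct");
    }
  }
}

Because every scheme conforms to the same classifier contract, comparing a
different algorithm requires changing only the object that is instantiated.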
The workbench includes methods for all the standard data mining
problems: regression, classification, clustering, association rule mining,
and attribute selection. Getting to know the data is a very important
part of data mining, and many data visualization facilities and data
preprocessing tools are provided. All algorithms and methods take their
input in the form of a single relational table, which can be read from a
file or generated by a database query.
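A minimal sketch of this input pathway, again assuming the Weka 3 Java API
and a placeholder file name, reads the single input table from an ARFF file
and applies one of the unsupervised preprocessing filters; a database query,
for example via weka.experiment.InstanceQuery, yields the same kind of
object.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class PrepareData {
  public static void main(String[] args) throws Exception {
    // The single relational table, read here from a file
    // ("weather.arff" is a placeholder name).
    Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
    data.setClassIndex(data.numAttributes() - 1);

    // One of the unsupervised attribute filters: discretize numeric attributes.
    Discretize discretize = new Discretize();
    discretize.setInputFormat(data);
    Instances filtered = Filter.useFilter(data, discretize);

    System.out.println(filtered.numInstances() + " instances, "
        + filtered.numAttributes() + " attributes after filtering");
  }
}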
System Architecture
In order to make its operation as flexible as possible, the workbench
was designed with a modular, object-oriented architecture that allows
new classifiers, filters, clustering algorithms and so on to be added easily.
A set of abstract Java classes, one for each major type of component,
was designed and placed in a corresponding top-level package.
All classifiers reside in subpackages of the top level “classifiers” pack-
age and extend a common base class called “Classifier.” The Classifier
class prescribes a public interface for classifiers and a set of conventions
by which they should abide. Subpackages group components accord-
ing to functionality or purpose. For example, filters are separated into
those that are supervised or unsupervised, and then further by whether
they operate on an attribute or instance basis. Classifiers are organized
according to the general type of learning algorithm, so there are sub-
packages for Bayesian methods, tree inducers, rule learners, etc.
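To make these conventions concrete, the following sketch shows roughly what
a new classifier looks like. The class itself is hypothetical and not part
of the distribution, and the details assume the Weka 3 class library, in
which Classifier is the abstract base class that all learning schemes
extend.

package weka.classifiers.misc; // illustrative subpackage placement

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

/**
 * A toy scheme that always predicts the first class value. It exists only
 * to illustrate the contract prescribed by the Classifier base class.
 */
public class FirstClassPredictor extends Classifier {

  /** Index of the class value to predict. */
  private double m_prediction;

  /** Builds the "model" from the training data. */
  public void buildClassifier(Instances data) throws Exception {
    m_prediction = 0; // always the first class value
  }

  /** Predicts the class of a single instance. */
  public double classifyInstance(Instance instance) throws Exception {
    return m_prediction;
  }
}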
All components rely to a greater or lesser extent on supporting classes
that reside in a top level package called “core.” This package provides
classes and data structures that read data sets, represent instances and
attributes, and provide various common utility methods. The core pack-
age also contains additional interfaces that components may implement
in order to indicate that they support extra functionality. For
example, a classifier can implement the “WeightedInstancesHandler” in-
terface to indicate that it can take advantage of instance weights.
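A sketch of how a hypothetical classifier would declare and use this
capability, again assuming the Weka 3 class library:

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.WeightedInstancesHandler;

/**
 * Hypothetical scheme that declares, via the marker interface, that it
 * makes use of instance weights during training.
 */
public class WeightAwareLearner extends Classifier
    implements WeightedInstancesHandler {

  private double m_totalWeight;

  public void buildClassifier(Instances data) throws Exception {
    m_totalWeight = 0;
    for (int i = 0; i < data.numInstances(); i++) {
      // Each instance carries a weight that weight-aware schemes can exploit.
      m_totalWeight += data.instance(i).weight();
    }
    // A real scheme would fold these weights into its sufficient statistics.
  }

  public double classifyInstance(Instance instance) throws Exception {
    return 0; // placeholder prediction
  }
}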
A major part of the appeal of the system for end users lies in its graph-
ical user interfaces. In order to maintain flexibility it was necessary to
engineer the interfaces to make it as painless as possible for developers
to add new components to the workbench. To this end, the user interfaces
capitalize upon Java’s introspection mechanisms to discover and expose each
component’s options automatically, so that newly added components can be
configured through the graphical interfaces without any extra interface code.
Applications
Weka was originally developed for the purpose of processing agri-
cultural data, motivated by the importance of this application area in
New Zealand. However, the machine learning methods and data engi-
neering capability it embodies have grown so quickly, and so radically,
that the workbench is now commonly used in all forms of data min-
ing applications—from bioinformatics to competition datasets issued by
major conferences such as Knowledge Discovery in Databases.
New Zealand has several research centres dedicated to agriculture and
horticulture, which provided the original impetus for our work, and many
of our early applications. For example, we worked on predicting the
internal bruising sustained by different varieties of apple as they make
their way through a packing-house on a conveyor belt (Holmes et al., 1998);
predicting, in real time, the quality of a mushroom from a photograph in
order to provide automatic grading (Kusabs et al., 1998); and classifying
kiwifruit vines into twelve classes, based on visible-NIR spectra, in order
to determine which of twelve pre-harvest fruit management treatments
has been applied to the vines (Holmes and Hall, 2002). The applicability
of the workbench in agricultural domains was the subject of user studies
(McQueen et al., 1998) that demonstrated a high level of satisfaction with
the tool and gave some advice on improvements.
There are countless other applications, actual and potential. As just
one example, Weka has been used extensively in the field of bioinformatics.
Published studies include automated protein annotation (Bazzan et al., 2002),
probe selection for gene expression arrays (Tobler et al., 2002), plant
genotype discrimination (Taylor et al., 2002), and classifying gene
expression profiles and extracting rules from them (Li et al., 2003).
Text mining is another major field of application, and the workbench has
been used to automatically extract key phrases from text (Frank et al.,
1999), and for document categorization (Sauban and Pfahringer, 2003) and
word sense disambiguation (Pedersen, 2002).
The workbench makes it very easy to perform interactive experiments,
so it is not surprising that most work has been done with small to
medium-sized datasets.
Acknowledgments
Many thanks to past and present members of the Waikato machine
learning group and the many external contributors for all the work they
have put into Weka.
References
Bazzan, A. L., Engel, P. M., Schroeder, L. F., and da Silva, S. C. (2002).
Automated annotation of keywords for proteins related to mycoplas-
mataceae using machine learning techniques. Bioinformatics, 18:35S–
43S.
Frank, E., Holmes, G., Kirkby, R., and Hall, M. (2002). Racing commit-
tees for large datasets. In Proceedings of the International Conference
on Discovery Science, pages 153–164. Springer-Verlag.
Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., and Nevill-Manning,
C. G. (1999). Domain-specific keyphrase extraction. In Proceedings
of the 16th International Joint Conference on Artificial Intelligence,
pages 668–673. Morgan Kaufmann.
Holmes, G., Cunningham, S. J., Rue, B. D., and Bollen, F. (1998). Pre-
dicting apple bruising using machine learning. Acta Hort, 476:289–
296.
Holmes, G. and Hall, M. (2002). A development environment for predic-
tive modelling in foods. International Journal of Food Microbiology,
73:351–362.
Holmes, G., Kirkby, R., and Pfahringer, B. (2003). Mining data streams
using option trees. Technical Report 08/03, Department of Computer
Science, University of Waikato.
Kusabs, N., Bollen, F., Trigg, L., Holmes, G., and Inglis, S. (1998).
Objective measurement of mushroom quality. In Proceedings of the New
Zealand Institute of Agricultural Science and the New Zealand Society for
Horticultural Science Annual Convention, page 51.
Li, J., Liu, H., Downing, J. R., Yeoh, A. E.-J., and Wong, L. (2003).
Simple rules underlying gene expression profiles of more than six sub-
types of acute lymphoblastic leukemia (ALL) patients. Bioinformatics,
19:71–78.
McQueen, R., Holmes, G., and Hunt, L. (1998). User satisfaction with
machine learning as a data analysis method in agricultural research.
New Zealand Journal of Agricultural Research, 41(4):577–584.
Pedersen, T. (2002). Evaluating the effectiveness of ensembles of decision
trees in disambiguating Senseval lexical samples. In Proceedings of the
ACL-02 Workshop on Word Sense Disambiguation: Recent Successes
and Future Directions.
Sauban, M. and Pfahringer, B. (2003). Text categorisation using doc-
ument profiling. In Proceedings of the 7th European Conference on
Principles and Practice of Knowledge Discovery in Databases, pages
411–422. Springer.
Taylor, J., King, R. D., Altmann, T., and Fiehn, O. (2002). Application
of metabolomics to plant genotype discrimination using statistics and
machine learning. Bioinformatics, 18:241S–248S.
Tobler, J. B., Molla, M., Nuwaysir, E., Green, R., and Shavlik, J. (2002).
Evaluating machine learning approaches for aiding probe selection for
gene-expression arrays. Bioinformatics, 18:164S–171S.