Bioinformatics: Applications Note
Received on December 3, 2003; revised on February 3, 2004; accepted on February 26, 2004
Advance Access publication April 8, 2004
INTRODUCTION
To
ABSTRACT
Summary: The Weka machine learning workbench provides
a general-purpose environment for automatic classification,
regression, clustering and feature selection, all common data
mining problems in bioinformatics research. It contains an
extensive collection of machine learning algorithms and data
pre-processing methods complemented by graphical user
interfaces for data exploration and the experimental comparison of different machine learning techniques on the same
problem. Weka can process data given in the form of a single
relational table. Its main objectives are to (a) assist users in
extracting useful information from data and (b) enable them to
easily identify a suitable algorithm for generating an accurate
predictive model from it.
Availability: http://www.cs.waikato.ac.nz/ml/weka
Contact: [email protected]
E.Frank et al.
of all major learning techniques for classification and regression: decision trees, rule sets, Bayesian classifiers, support
vector machines, logistic and linear regression, multi-layer
perceptrons and nearest-neighbor methods. It also contains
meta-learners such as bagging, boosting, stacking and
schemes that perform automatic parameter tuning using cross-validation, cost-sensitive classification, and so on. Learning
algorithms can be evaluated using cross-validation or a holdout set, and Weka provides standard numeric performance
measures (e.g. accuracy, root mean squared error), as well
as graphical means for visualizing classifier performance
(e.g. receiver operating characteristic curves and precision-recall curves). It is possible to visualize the predictions of
a classification or regression model, enabling the identification of outliers, and to load and save models that have been
generated.
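
The evaluation procedure described above can be illustrated with a minimal sketch. It is written in plain Python rather than against Weka's own Java API, and every name in it (`k_fold_indices`, `nearest_neighbor_predict`, `cross_validate`, `rmse`) is hypothetical. It assumes a 1-nearest-neighbor classifier and a small toy dataset purely for illustration, and shows how k-fold cross-validation yields an accuracy estimate and how root mean squared error is computed for numeric predictions.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training instance closest to x (1-NN)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def cross_validate(X, y, k=5, seed=0):
    """Estimate accuracy of the 1-NN classifier by k-fold cross-validation."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        # Train on every instance that is not in the current fold.
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        # Test on the held-out fold only.
        for i in fold:
            if nearest_neighbor_predict(train_X, train_y, X[i]) == y[i]:
                correct += 1
    return correct / len(X)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Toy data (an assumption for illustration): two well-separated classes.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0), (5.3, 5.2)]
y = ["a"] * 5 + ["b"] * 5

accuracy = cross_validate(X, y, k=5)
print("cross-validated accuracy:", accuracy)
print("RMSE:", rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Each instance is used for testing exactly once and for training in the remaining folds, so the accuracy estimate never relies on a prediction for an instance the classifier was trained on.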
The third panel in the Explorer, Cluster, gives access
to Weka's clustering algorithms. These include k-means,
mixtures of normal distributions with diagonal covariance
matrices estimated using EM, and a heuristic incremental
hierarchical clustering scheme. Cluster assignments can be
visualized and compared with actual clusters defined by one
of the attributes in the data.
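
As a rough illustration of k-means and of comparing cluster assignments with a class attribute, the following plain-Python sketch may help. It does not use Weka's implementation; the function names, the choice of the first k points as initial centroids, and the majority-class mapping used to score agreement are all assumptions made for this example.

```python
import math
from collections import Counter


def k_means(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def classes_to_clusters_accuracy(assignment, labels):
    """Map each cluster to its majority class; return the fraction of matching instances."""
    correct = 0
    for c in set(assignment):
        counts = Counter(l for a, l in zip(assignment, labels) if a == c)
        correct += counts.most_common(1)[0][1]
    return correct / len(labels)


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.9, 5.3)]
labels = ["a", "b", "a", "b", "a", "b"]

assignment, centroids = k_means(points, k=2)
agreement = classes_to_clusters_accuracy(assignment, labels)
print("assignment:", assignment)
print("agreement with class attribute:", agreement)
```

The class attribute is used only after clustering, to score the result; it plays no part in forming the clusters themselves.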
Weka also contains algorithms for generating association
rules that can be used to identify relationships between
groups of attributes in the data. These are available from
the Explorer's Associate panel. However, more interesting
ACKNOWLEDGEMENTS
Many people have contributed to the Weka project, in particular Richard Kirkby, Ashraf Kibriya and Bernhard Pfahringer,
and we thank them all for their invaluable efforts. We would
also like to thank Yu Wang for suggesting that we write
this note, and the New Zealand Foundation for Research,
Science & Technology for funding the project.