
BIOINFORMATICS APPLICATIONS NOTE

Vol. 20 no. 15 2004, pages 2479–2481

doi:10.1093/bioinformatics/bth261

Data mining in bioinformatics using Weka

Eibe Frank1,*, Mark Hall1, Len Trigg2, Geoffrey Holmes1 and Ian H. Witten1

1Department of Computer Science, University of Waikato, Private Bag 3105,
Hamilton, New Zealand and 2Reel Two, PO Box 1538, Hamilton, New Zealand

Received on December 3, 2003; revised on February 3, 2004; accepted on February 26, 2004
Advance Access publication April 8, 2004

*To whom correspondence should be addressed.

Bioinformatics 20(15) © Oxford University Press 2004; all rights reserved.

ABSTRACT
Summary: The Weka machine learning workbench provides
a general-purpose environment for automatic classification,
regression, clustering and feature selection, all common data
mining problems in bioinformatics research. It contains an
extensive collection of machine learning algorithms and data
pre-processing methods, complemented by graphical user
interfaces for data exploration and the experimental comparison
of different machine learning techniques on the same
problem. Weka can process data given in the form of a single
relational table. Its main objectives are to (a) assist users in
extracting useful information from data and (b) enable them to
easily identify a suitable algorithm for generating an accurate
predictive model from it.
Availability: http://www.cs.waikato.ac.nz/ml/weka
Contact: [email protected]

INTRODUCTION

Bioinformatics research entails many problems that can be
cast as machine learning tasks. In classification or regression,
the task is to predict the outcome associated with a particular
individual given a feature vector describing that individual;
in clustering, individuals are grouped together because they
share certain properties; and in feature selection, the task is
to select those features that are important in predicting the
outcome for an individual.
The Weka data mining suite provides algorithms for all three
problem types. In the bioinformatics arena, it has been used
for automated protein annotation (Kretschmann et al., 2001;
Bazzan et al., 2002), probe selection for gene-expression
arrays (Tobler et al., 2002), experiments with automatic
cancer diagnosis (Li et al., 2003a), developing a computational
model for frame-shifting sites (Bekaert et al., 2003), plant
genotype discrimination (Taylor et al., 2002), classifying gene
expression profiles (Li and Wong, 2002) and extracting rules
from them (Li et al., 2003b). Many of the algorithms in Weka
are described in Witten and Frank (2000).
Real datasets vary: no single algorithm is superior on all
data mining problems. The algorithm needs to match the
structure of the problem to obtain useful information or an
accurate model. The aim in developing Weka was to permit
maximum flexibility when trying machine learning methods
on new datasets. This includes algorithms for learning
different types of models (e.g. decision trees, rule sets,
linear discriminants), feature selection schemes (fast filtering
as well as wrapper approaches) and pre-processing methods
(e.g. discretization, arbitrary mathematical transformations
and combinations of attributes). By providing a diverse set of
methods that are available through a common interface, Weka
makes it easy to compare different solution strategies based
on the same evaluation method and identify the one that is
most appropriate for the problem at hand. It is implemented
in Java and runs on almost any computing platform.

THE WEKA EXPLORER

The main interface in Weka is the Explorer, shown in Figure 1.
It has a set of panels, each of which can be used to perform
a certain task. The Preprocess panel, selected in Figure 1,
retrieves data from a file, SQL database or URL. (A limitation
is that all the data are kept in main memory, so subsampling
may be needed for very large datasets.) Then the data can
be pre-processed using one of Weka's filtering tools. For
example, one can delete all instances (i.e. rows) in the data for
which a certain attribute (i.e. column) has a particular value.
An undo facility is provided to revert to an earlier state of the
data if needed. The Preprocess panel also shows a histogram of
the attribute that is currently selected and some statistics about
it; histograms for all attributes can be shown simultaneously
in a separate window.
Once a dataset has been loaded (and perhaps processed by
one or more filters), one of the other panels in the Explorer can
be used to perform further analysis. If the data entail a
classification or regression problem, it can be processed in the
Classify panel. This provides an interface to learning algorithms
for classification and regression models (both are called
classifiers in Weka), and evaluation tools for analyzing the
outcome of the learning process. Weka has implementations
of all major learning techniques for classification and
regression: decision trees, rule sets, Bayesian classifiers, support
vector machines, logistic and linear regression, multi-layer
perceptrons and nearest-neighbor methods. It also contains
meta-learners such as bagging, boosting, stacking and
schemes that perform automatic parameter tuning using
cross-validation, cost-sensitive classification, and so on. Learning
algorithms can be evaluated using cross-validation or a
hold-out set, and Weka provides standard numeric performance
measures (e.g. accuracy, root mean squared error), as well
as graphical means for visualizing classifier performance
(e.g. receiver operating characteristic curves and
precision-recall curves). It is possible to visualize the predictions of
a classification or regression model, enabling the
identification of outliers, and to load and save models that have been
generated.
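
The cross-validation procedure mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than against the Weka API, and the function names (`k_fold_indices`, `nearest_neighbour`, `cross_validate`), the toy data and the choice of a 1-nearest-neighbor learner are all assumptions made for illustration, not part of Weka.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_neighbour(train, x):
    """Predict the label of the training instance closest to x (1-NN)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda row: dist2(row[0], x))[1]


def cross_validate(data, k=5, seed=0):
    """Pooled accuracy over all held-out predictions.

    data is a list of (feature_tuple, label) pairs. Each fold is held out
    once while the learner sees only the remaining folds.
    """
    correct = 0
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        for i in test_idx:
            features, label = data[i]
            if nearest_neighbour(train, features) == label:
                correct += 1
    return correct / len(data)


# Hypothetical toy data: two well-separated classes on a line.
data = [((float(i),), "low") for i in range(10)] + \
       [((float(i),), "high") for i in range(100, 110)]
accuracy = cross_validate(data, k=5)
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```

A hold-out evaluation is the special case in which a single fold is set aside once instead of rotating through all k folds.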
The third panel in the Explorer, Cluster, gives access
to Weka's clustering algorithms. These include k-means,
mixtures of normal distributions with diagonal covariance
matrices estimated using EM, and a heuristic incremental
hierarchical clustering scheme. Cluster assignments can be
visualized and compared with actual clusters defined by one
of the attributes in the data.
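
As a rough illustration of k-means and of comparing cluster assignments with a class attribute, the following Python sketch may help. It is not Weka's implementation: the function name `k_means`, the naive initialization from the first k points and the toy data are assumptions chosen to keep the example short and deterministic.

```python
from collections import Counter


def k_means(points, k, iterations=20):
    """Basic k-means (Lloyd's algorithm) on tuples of floats.

    Returns (centroids, assignments). Initialization uses the first k
    points, which is an illustrative simplification.
    """
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Hypothetical data: two groups, with a class attribute kept aside.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8),
          (0.8, 1.1), (9.1, 8.9), (8.8, 9.2)]
labels = ["a", "b", "a", "a", "b", "b"]

centroids, assignments = k_means(points, k=2)

# Cross-tabulate cluster index against the class attribute.
table = Counter(zip(assignments, labels))
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster}, class {label}: {count}")
```

The cross-tabulation at the end plays the role of the comparison described above: if each cluster lines up with one class, the table has a single non-zero entry per cluster.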
Weka also contains algorithms for generating association
rules that can be used to identify relationships between
groups of attributes in the data. These are available from
the Explorer's Associate panel. However, more interesting
in the context of bioinformatics is the fifth panel, which offers
methods for identifying those subsets of attributes that are
predictive of another (target) attribute in the data. Weka contains
several methods for searching through the space of attribute
subsets, as well as evaluation measures for attributes and
attribute subsets. Search methods include best-first search,
forward selection, genetic algorithms and a simple ranking
of attributes. Evaluation measures include correlation- and
entropy-based criteria as well as the performance of a
selected learning scheme (e.g. a decision tree learner) for a
particular subset of attributes. Different search and
evaluation methods can be combined, making the system very
flexible.
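
The separation between search method and evaluation measure can be sketched as follows. This is a plain-Python illustration under stated assumptions, not Weka code: `rank_attributes` stands in for a simple correlation-based ranking, `forward_selection` for a greedy search that accepts any evaluation function, and `loo_accuracy` (leave-one-out accuracy of a 1-nearest-neighbor learner) for a wrapper-style evaluation. All names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_attributes(rows, target):
    """Rank attribute indices by |correlation| with the target, best first."""
    n_attrs = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], target))
              for j in range(n_attrs)]
    return sorted(range(n_attrs), key=lambda j: scores[j], reverse=True)


def forward_selection(n_attrs, evaluate):
    """Greedily add the attribute that most improves evaluate(subset)."""
    selected, best = [], float("-inf")
    while len(selected) < n_attrs:
        candidates = [j for j in range(n_attrs) if j not in selected]
        score, j = max((evaluate(selected + [j]), j) for j in candidates)
        if score <= best:
            break  # no candidate improves the current subset
        selected.append(j)
        best = score
    return selected


def loo_accuracy(rows, target, subset):
    """Leave-one-out accuracy of 1-NN restricted to the given attributes."""
    hits = 0
    for i, row in enumerate(rows):
        others = [k for k in range(len(rows)) if k != i]
        nearest = min(others, key=lambda k: sum(
            (rows[k][j] - row[j]) ** 2 for j in subset))
        hits += target[nearest] == target[i]
    return hits / len(rows)


# Hypothetical data: attribute 0 tracks the class, attribute 1 is noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0),
        (1.0, 5.1), (1.1, 0.9), (1.2, 9.1)]
target = [0, 0, 0, 1, 1, 1]

ranking = rank_attributes(rows, target)
selected = forward_selection(
    len(rows[0]), lambda subset: loo_accuracy(rows, target, subset))
print("ranking:", ranking, "selected subset:", selected)
```

Because `forward_selection` only needs a function from subsets to scores, swapping the wrapper evaluation for a filter criterion changes one argument and leaves the search untouched.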
The last panel in the Explorer, Visualization, shows a matrix
of scatter plots for all pairs of attributes in the data. Any matrix
element can be selected and enlarged in a separate window,
where one can zoom in on subsets of the data and retrieve
information about individual data points. A jitter option for
exposing obscured data points is also provided.

Fig. 1. The Weka Explorer.

OTHER INTERFACES TO WEKA

All the learning techniques in Weka can be accessed from
the command line, as part of shell scripts, or from within
other Java programs using the Weka API. Weka also contains
an alternative graphical user interface, called Knowledge
Flow, which can be used instead of the Explorer. It caters
for a more process-oriented view of data mining, where
individual learning components (represented by Java beans)
can be connected graphically to create a flow of
information. Finally, there is a third graphical user interface, the
Experimenter, which is designed for experiments that
compare the performance of (multiple) learning schemes on
(multiple) datasets. Experiments can be distributed across
multiple computers running remote experiment servers.

ACKNOWLEDGEMENTS

Many people have contributed to the Weka project, in particular
Richard Kirkby, Ashraf Kibriya and Bernhard Pfahringer,
and we thank them all for their invaluable efforts. We would
also like to thank Yu Wang for suggesting that we write
this note, and the New Zealand Foundation for Research,
Science & Technology for funding the project.

REFERENCES
Bazzan,A.L., Engel,P.M., Schroeder,L.F. and Da Silva,S.C. (2002)
Automated annotation of keywords for proteins related to
mycoplasmataceae using machine learning techniques. Bioinformatics,
18 (Suppl. 2), 35S–43S.
Bekaert,M., Bidou,L., Denise,A., Duchateau-Nguyen,G.,
Forest,J.P., Froidevaux,C., Hatin,I., Rousset,J.P. and Termier,M.
(2003) Towards a computational model for -1 eukaryotic
frameshifting sites. Bioinformatics, 19, 327–335.
Kretschmann,E., Fleischmann,W. and Apweiler,R. (2001)
Automatic rule generation for protein annotation with the C4.5 data
mining algorithm applied on SWISS-PROT. Bioinformatics, 17,
920–926.
Li,J. and Wong,L. (2002) Identifying good diagnostic gene groups
from gene expression profiles using the concept of emerging
patterns. Bioinformatics, 18, 725–734.
Li,J., Liu,H., Ng,S.K. and Wong,L. (2003a) Discovery of
significant rules for classifying cancer diagnosis data. Bioinformatics,
19 (Suppl. 2), II93–II102.
Li,J., Liu,H., Downing,J.R., Yeoh,A.E. and Wong,L. (2003b)
Simple rules underlying gene expression profiles of more than
six subtypes of acute lymphoblastic leukemia (ALL) patients.
Bioinformatics, 19, 71–78.
Taylor,J., King,R.D., Altmann,T. and Fiehn,O. (2002) Application
of metabolomics to plant genotype discrimination using statistics
and machine learning. Bioinformatics, 18 (Suppl. 2), 241S–248S.
Tobler,J.B., Molla,M.N., Nuwaysir,E.F., Green,R.D. and
Shavlik,J.W. (2002) Evaluating machine learning approaches for
aiding probe selection for gene-expression arrays. Bioinformatics,
18 (Suppl. 1), 164S–171S.
Witten,I.H. and Frank,E. (2000) Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann, San Francisco, CA.
