Weka Software Manual
AI Tools Seminar

Rossen Dimov ([email protected])

Supervisors: Michael Feld, Dr. Michael Kipp, Dr. Alassane Ndiaye
and Dr. Dominik Heckmann
({michael.feld, michael.kipp, alassane.ndiaye, dominik.heckmann}@dfki.de)
1 Introduction
Machine learning algorithms induce classification rules from a dataset of
instances and thus broaden the domain knowledge and understanding.
WEKA is a workbench for machine learning that is intended to make
the application of machine learning techniques to a variety of real-world
problems easier and more intuitive. The environment targets not only the
machine learning expert but also the domain specialist. That is why interactive
modules for data processing, data and trained-model visualization, database
connection and cross-validation are provided. They complement the basic
functionality that a machine learning system needs to support -
classification, regression, clustering and attribute selection.
It is developed at the University of Waikato, New Zealand. The project
started about twelve years ago, when the authors needed to apply machine
learning techniques to an agricultural problem. Now version 3.5.5 is
available, and two years ago the authors also published a book [4]. This
book covers the different algorithms with their possible weak and strong
points and all preprocessing and evaluation methods. It also contains a
detailed description of all four graphical modules and a basic introduction
on how to use the Java interface in your own programs. The project is
developed and distributed under the GPL license and has a subdomain on
the Sourceforge portal.
This article covers a description of the main features of WEKA version
3.5.5 and an application for spam detection using the Java interface.
Some basic machine learning definitions used in the following parts are
introduced by an example:
An example of a dataset is a collection of records of days on which the
weather conditions were appropriate for surfing. The temperature, the
humidity and the speed of the wind are attributes, which can be measured
and can be nominal (enumerated) and/or numerical. Surfing or not surfing
are the values of the class attribute. The record for one single day
represents one instance. Classification is used for predicting the value
of the class attribute for future days.
Simple editing operations on the dataset, like changing single values of
concrete instances or removing attributes from all instances, can be done
by hand. Automatic operations are done by filters. Usually the data format
needs to be transformed for various reasons, depending on the machine
learning scheme that will be used. For example, a machine learning algorithm
might only accept numeric attribute values, so all non-numeric attributes
have to be transformed in order for this algorithm to be used. A filter is
chosen from a tree view that contains all available filters - see figure 1.
Each of them has a description of how it works and a reference on all
parameters it uses. Most of the filters are explained in detail in the
book, but since there are newer versions of WEKA, new filters have also
been implemented and can be chosen.
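As an illustration of what such a transformation does, the following plain-Java sketch (our own code and names, not a WEKA filter) replaces one nominal attribute by numeric 0/1 indicator columns:

```java
import java.util.*;

// Sketch of what a typical attribute-transformation filter does: replace
// one nominal attribute by numeric 0/1 indicator attributes, so that
// schemes which only accept numeric input can be applied.
// Illustrative only; this is not WEKA code.
public class NominalToBinarySketch {
    // values: the nominal values of one attribute for all instances;
    // returns one 0/1 column per distinct value, in sorted value order.
    public static int[][] toIndicators(String[] values) {
        SortedSet<String> distinct = new TreeSet<>(Arrays.asList(values));
        List<String> order = new ArrayList<>(distinct);
        int[][] cols = new int[values.length][order.size()];
        for (int i = 0; i < values.length; i++)
            cols[i][order.indexOf(values[i])] = 1;
        return cols;
    }

    public static void main(String[] args) {
        int[][] enc = toIndicators(new String[] { "sunny", "rainy", "sunny" });
        System.out.println(Arrays.deepToString(enc)); // [[0, 1], [1, 0], [0, 1]]
    }
}
```

WEKA's built-in filters do this kind of job directly on its Instances objects and additionally keep the dataset's attribute metadata up to date.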
A list of some filters follows:
The trained classifier can be evaluated either with a separate test set,
through k-fold cross-validation, or by splitting the input dataset into a
training and a test set. The result of the evaluation is shown in the ’Classifier
output’ pane; see figure 2.

Figure 2: The ’Classify’ panel. The J48 classifier with the chosen parameters,
evaluated with 10-fold cross-validation. The output pane shows a textual
representation of the built classifier and some statistics.

It contains a textual representation of the built
model and statistics about the accuracy of the classifier, such as the TP (True
Positive) rate, the FP (False Positive) rate and the confusion matrix.
The TP rate of a class shows the percentage of instances of that class
whose predicted class is identical with the actual one. The FP rate of a
class shows the percentage of instances of the other classes that are
wrongly predicted as belonging to this class. The confusion matrix shows,
for each actual class, the number of instances assigned to every possible
class according to the classifier’s prediction.
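These per-class rates can be computed directly from the confusion matrix. A minimal sketch in plain Java (class and method names are our own, not WEKA API):

```java
// Sketch: computing per-class TP and FP rates from a confusion matrix.
// Rows are actual classes, columns are predicted classes.
public class ConfusionStats {
    // Fraction of instances of class c that are predicted as c.
    public static double tpRate(int[][] m, int c) {
        int actual = 0;
        for (int pred = 0; pred < m[c].length; pred++) actual += m[c][pred];
        return actual == 0 ? 0.0 : (double) m[c][c] / actual;
    }

    // Fraction of instances of the other classes that are predicted as c.
    public static double fpRate(int[][] m, int c) {
        int others = 0, wronglyAsC = 0;
        for (int a = 0; a < m.length; a++) {
            if (a == c) continue;
            for (int pred = 0; pred < m[a].length; pred++) others += m[a][pred];
            wronglyAsC += m[a][c];
        }
        return others == 0 ? 0.0 : (double) wronglyAsC / others;
    }

    public static void main(String[] args) {
        // 40 spam correctly caught, 10 spam missed;
        // 5 non-spam flagged as spam, 45 non-spam kept.
        int[][] m = { { 40, 10 }, { 5, 45 } };
        System.out.println("TP rate (spam): " + tpRate(m, 0)); // 40/50 = 0.8
        System.out.println("FP rate (spam): " + fpRate(m, 0)); // 5/50 = 0.1
    }
}
```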
There is a special group of classifiers called meta-classifiers. They are
used to enhance the performance or to extend the capabilities of other
classifiers.
A list of some important meta-classifiers is shown here:
The trained classifier can be saved to disk. This is possible due to the
serialization mechanism supported by the Java programming language.
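A minimal sketch of this mechanism, with a hypothetical DummyModel standing in for a trained classifier (WEKA classifiers implement java.io.Serializable and can be written and read back the same way):

```java
import java.io.*;

// Sketch: saving and loading an object with Java serialization.
// DummyModel is a stand-in for a trained classifier.
public class SerializationDemo {
    static class DummyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        double threshold = 0.5; // pretend this was learned from data
    }

    public static void save(Object o, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(o);
        }
    }

    public static Object load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return in.readObject();
        }
    }

    // Round-trip a model through a temporary file and return its field.
    public static double roundTrip() {
        try {
            File f = File.createTempFile("model", ".ser");
            save(new DummyModel(), f);
            DummyModel restored = (DummyModel) load(f);
            f.delete();
            return restored.threshold;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // prints 0.5
    }
}
```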
Besides the classification schemes, WEKA supplies two other scheme types -
association rules and clustering.
Figure 3: The Apriori association rule learner applied to the training data.
All generated rules are shown in the ’Output’ pane.
2.1.5 Clustering
There are nine clustering algorithms implemented in WEKA. They do not
try to predict the value of the class attribute but divide the training
set into clusters. All instances in one cluster are close to each other,
according to an appropriate metric, and far from the instances in the
other clusters. The interface for choosing and configuring them is the
same as for filters and classifiers. There are options for choosing test
and training sets. The results shown in the output pane are quite similar
to those produced after building a classifier. See figure 4.
Figure 4: The SimpleKMeans algorithm applied to the training data; the
two resulting clusters are shown in the ’Output’ pane.
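The idea behind an algorithm like SimpleKMeans can be sketched in a few lines of plain Java. This is illustrative only, on one-dimensional data with fixed initial centroids; it is not WEKA's implementation:

```java
import java.util.Arrays;

// Sketch of the k-means idea: assign each point to the nearest centroid,
// move each centroid to the mean of its points, and repeat until the
// assignments stop changing.
public class KMeansSketch {
    public static int[] cluster(double[] points, double[] centroids) {
        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // assignment step: nearest centroid wins
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best]))
                        best = c;
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            // update step: move centroids to the mean of their clusters
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < points.length; i++) { sum[assign[i]] += points[i]; count[assign[i]]++; }
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] points = { 1.0, 1.2, 0.8, 9.9, 10.1, 10.3 };
        int[] assign = cluster(points, new double[] { 1.0, 10.0 });
        System.out.println(Arrays.toString(assign)); // [0, 0, 0, 1, 1, 1]
    }
}
```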
Attribute selection aims to find the one subset of attributes that works
best for classifying. Two operators are needed - a
subset evaluator and a search method. The search method traverses the
attribute subset space and uses the evaluator as a quality measure. Both
of them can be chosen and configured similarly to the filters and
classifiers. After an attribute selection is performed, a list of all
attributes and their relevance ranks is shown in the ’Output’ pane. See figure 5.
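One simple search strategy is ranking: score every attribute with the evaluator, sort by score and keep the best k. A minimal sketch in plain Java (names are our own; the scores would come from an evaluator):

```java
import java.util.*;

// Sketch of a ranking-style attribute search: given one relevance score
// per attribute, return the indices of the k best-scoring attributes.
public class RankerSketch {
    public static int[] topK(double[] scores, int k) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort attribute indices by descending score
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int[] out = new int[k];
        for (int i = 0; i < k; i++) out[i] = idx[i];
        return out;
    }

    public static void main(String[] args) {
        // relevance scores for four attributes; keep the best two
        double[] scores = { 0.1, 0.9, 0.4, 0.7 };
        System.out.println(Arrays.toString(topK(scores, 2))); // [1, 3]
    }
}
```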
Some of the attribute evaluators are shown here:
Figure 6: In the ’Visualize’ panel all possible 2D distributions are shown.
Figure 7: The Knowledge Flow.
Figure 8: The Experimenter set up with two different data sets and three
different classifiers, one of them with two different sets of parameters.
ing and test set and all different learning schemes. In the advanced mode
there is an option for performing a distributed experiment using RMI.
Various statistics, such as the error rate or the percentage of incorrectly
classified instances, can be shown in the ’Analyze’ panel and used for
finding the best classifier.
certain package, specifying its options and providing input data. The output
looks like the output in the ’Output’ pane of the ’Classify’, ’Cluster’ or
’Associate’ panels of the Explorer.
3 Experimental results
A small spam filtering application was developed using WEKA’s classifiers
as a practical assignment for the seminar. The team included three students
attending the seminar.
The system reads emails from a directory on the hard drive. Then it
builds six classifiers, saves them on the hard drive and evaluates them.
The classification step consists of loading one of the saved classifiers
and putting labels on the emails located in a user-defined folder.
Emails were collected from a repository containing all emails of a bankrupt
US company. They are parsed in order to extract the body and the subject
of the email, together with some other features that we assumed could be
important for correct prediction, namely: whether there are any HTML tags,
whether there are any JScripts, whether the To field is missing, the number
of recipients and the percentage of capital letters. Then from each email
an instance was created, using all these features as attributes. From the
produced instances, initial training and test sets were created.
Then follows data preprocessing in terms of applying a WEKA filter. The
filter used is weka.filters.unsupervised.attribute.StringToWordVector. It
converts string attributes into a set of attributes representing word
occurrence information from the text contained in the strings. A parameter
that makes words occurring in only one of the classes more important is
set. This way each instance is converted to an instance with about 4000
attributes. Given that we used 2000 spam and 2000 non-spam emails for
building the classifiers, this is already a big input training set. After
running into java.lang.OutOfMemoryError a few times, we decided to apply
attribute selection.
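To illustrate the idea behind this filter, here is a plain-Java sketch, not the WEKA implementation, that turns strings into word-presence vectors over a shared vocabulary:

```java
import java.util.*;

// Sketch of the word-vector idea behind StringToWordVector: build a
// vocabulary from all texts, then encode each text as a 0/1 vector of
// word occurrences. Illustrative only; WEKA's filter is far richer
// (counts, TF-IDF, stemming, per-class word weighting, ...).
public class WordVectorSketch {
    public static List<String> vocabulary(List<String> texts) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String t : texts) vocab.addAll(Arrays.asList(t.toLowerCase().split("\\W+")));
        vocab.remove(""); // artifact of splitting on non-word characters
        return new ArrayList<>(vocab);
    }

    public static int[] encode(String text, List<String> vocab) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        int[] v = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) v[i] = words.contains(vocab.get(i)) ? 1 : 0;
        return v;
    }

    public static void main(String[] args) {
        List<String> mails = Arrays.asList("Buy cheap pills", "Meeting at noon");
        List<String> vocab = vocabulary(mails);
        System.out.println(vocab);                              // [at, buy, cheap, meeting, noon, pills]
        System.out.println(Arrays.toString(encode("Buy pills", vocab))); // [0, 1, 0, 0, 0, 1]
    }
}
```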
A meta-classifier that reduces the dimensionality of training and test
data is used to combine a non-meta classifier with an attribute selection
method. All five classifiers - J48, IBk, NaiveBayesUpdateable,
MultilayerPerceptron and SMO - are used together with the
ChiSquaredAttributeEval attribute evaluator and the Ranker search method.
The 250 most relevant attributes were chosen and used in building these
classifiers.
We also implemented a brand new classifier. It extends the base class
weka.classifiers.Classifier, implements the buildClassifier method and
overrides the classifyInstance method. It combines the decision tree, the
k-nearest-neighbours and the support vector machine classifiers already
built with the reduced number of attributes. For classifying, it performs
majority voting over the three classifiers.
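The voting rule itself is simple; a minimal sketch in plain Java (illustrative names, not the seminar code itself):

```java
import java.util.*;

// Sketch of the majority-voting rule used by a combined classifier:
// each base classifier predicts a class label, and the label with the
// most votes wins. Ties keep the first classifier's prediction.
public class MajorityVote {
    public static String vote(String... predictions) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : predictions) counts.merge(p, 1, Integer::sum);
        String best = predictions[0];
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > counts.get(best)) best = e.getKey();
        return best;
    }

    public static void main(String[] args) {
        // e.g. the tree and the SVM say spam, the k-NN says ham -> spam wins 2:1
        System.out.println(vote("spam", "ham", "spam")); // spam
    }
}
```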
The evaluation of each classifier is performed with 10-fold cross-validation.
The following results are obtained for building time and accuracy:
The low value we got for the naive Bayes classifier was suspicious enough
at that phase to make us think that the training data was somehow biased.
This was confirmed when we tested our best classifier - the support vector
machine - with an independent test set. The accuracy was about 50%, with a
lot of false positive (non-spam mails classified as spam) predictions.
4 Conclusion
The WEKA machine learning workbench provides an environment with
algorithms for data preprocessing, feature selection, classification,
regression and clustering. They are complemented by graphical user
interfaces for exploring the input data and the built models. It also
supports experimental comparison of one algorithm with varying parameters,
or of different algorithms, applied to one or more datasets. All this is
done in order to facilitate the process of extracting useful information
from the data. The input dataset is in the form of a table. Every row
represents a single instance and every column represents a different
attribute.
Perhaps the most important feature is the uniform Java interface to all
algorithms. They are organized in packages, and when weka.jar is added to
a standalone project, only import statements are needed in order to get
access to any functionality of WEKA.
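A minimal sketch of how this looks in code, assuming weka.jar is on the classpath and an ARFF file named weather.arff exists (the file name is our assumption):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Minimal sketch of the uniform Java interface: load a dataset, build a
// classifier and evaluate it with 10-fold cross-validation. Requires
// weka.jar on the classpath; "weather.arff" is an assumed example file.
public class WekaFromCode {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        J48 tree = new J48();          // any classifier is used the same way
        tree.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

Swapping J48 for any other classifier, clusterer or filter follows the same pattern, which is what makes the uniform interface so convenient.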
WEKA has some memory issues. It expects the data set to be loaded
completely into main memory, which is not possible for some data mining
tasks, and it is very slow on large data sets. For k-fold cross-validation
WEKA creates k copies of the original data. Only one copy exists at a
time, but the resources spent on copying are wasted.
There are a lot of projects that use WEKA to some extent or even extend
it. One of them is BioWEKA [1], which is used for knowledge discovery and
data analysis in biology, biochemistry and bioinformatics. It is an open
source project by the Ludwig-Maximilians-Universitaet Muenchen. For
example, there are filters for translating DNA to RNA sequences and vice
versa.
Another project is YALE (Yet Another Learning Environment) [2], which is
implemented at the University of Dortmund. It supports the composition
and analysis of complex operator chains consisting of different nested
preprocessing steps, classifier building, evaluation and complex feature
generators for introducing new attributes.
References
[1] Jan E. Gewehr, Martin Szugat, and Ralf Zimmer. BioWeka - Extending
the Weka Framework for Bioinformatics. Bioinformatics, page btl671, 2007.
[3] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham.
Weka: Practical machine learning tools and techniques with Java
implementations. 1999.
[4] Ian H. Witten and Eibe Frank. Data mining: Practical machine learning
tools and techniques. 2005.