Unit V
The Header of the ARFF file contains the name of the relation, a list of the attributes
(the columns in the data), and their types. An example header on the standard IRIS
dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
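@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}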
Examples
Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.
The ARFF Header section of the file contains the relation declaration and the attribute declarations.
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes
spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the dataset has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The format is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to
be included in the name then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by
Weka:
numeric
<nominal-specification>
string
date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, string and date are case insensitive.
Numeric attributes
Numeric attributes can be real or integer numbers. For example:
@ATTRIBUTE sepallength NUMERIC
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}. For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
String attributes
String attributes allow us to create attributes containing arbitrary textual values. This
is very useful in text-mining applications, as we can create datasets with string
attributes and then use Weka filters to manipulate strings (like the StringToWordVector
filter). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format
used by Java's SimpleDateFormat class). The default format string accepts the ISO-8601
combined date and time format: "yyyy-MM-dd'T'HH:mm:ss".
Dates must be specified in the data section as the corresponding string representations
of the date/time (see example below).
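For example, a relation containing a single date attribute might look like this:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"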
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data declaration is a single line denoting the start of the data segment in the
file. The format is:
@data
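Each instance is then represented on a single line, with the attribute values separated by commas in the order the attributes were declared; a missing value is represented by a single question mark. For example, the first instances of the Iris dataset look like this:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa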
Example
The following training set records four characteristics of 20 vertebrates together with their class (mammals or non-mammals):

Name           GiveBirth  CanFly  LiveInWater  HaveLegs  Class
Human          yes        no      no           yes       mammals
Python         no         no      no           no        non-mammals
Salmon         no         no      yes          no        non-mammals
Whale          yes        no      yes          no        mammals
Frog           no         no      sometimes    yes       non-mammals
Komodo         no         no      no           yes       non-mammals
Bat            yes        yes     no           yes       mammals
Pigeon         no         yes     no           yes       non-mammals
Cat            yes        no      no           yes       mammals
Leopard shark  yes        no      yes          no        non-mammals
Turtle         no         no      sometimes    yes       non-mammals
Penguin        no         no      sometimes    yes       non-mammals
Porcupine      yes        no      no           yes       mammals
Eel            no         no      yes          no        non-mammals
Salamander     no         no      sometimes    yes       non-mammals
Gila monster   no         no      no           yes       non-mammals
Platypus       no         no      no           yes       mammals
Owl            no         yes     no           yes       non-mammals
Dolphin        yes        no      yes          no        mammals
Eagle          no         yes     no           yes       non-mammals
program.arff:
@relation program
@attribute GiveBirth {Yes,No}
@attribute CanFly {Yes,No}
@attribute LiveInWater {Yes,No,Sometimes}
@attribute HaveLegs {Yes,No}
@attribute class {Mammals,Non-mammals}
@data
Yes,No,No,Yes,Mammals
No,No,No,No,Non-mammals
No,No,Yes,No,Non-mammals
Yes,No,Yes,No,Mammals
No,No,Sometimes,Yes,Non-mammals
No,No,No,Yes,Non-mammals
Yes,Yes,No,Yes,Mammals
No,Yes,No,Yes,Non-mammals
Yes,No,No,Yes,Mammals
Yes,No,Yes,No,Non-mammals
No,No,Sometimes,Yes,Non-mammals
No,No,Sometimes,Yes,Non-mammals
Yes,No,No,Yes,Mammals
No,No,Yes,No,Non-mammals
No,No,Sometimes,Yes,Non-mammals
No,No,No,Yes,Non-mammals
No,No,No,Yes,Mammals
No,Yes,No,Yes,Non-mammals
Yes,No,Yes,No,Mammals
No,Yes,No,Yes,Non-mammals
Iris plants database
https://archive.ics.uci.edu/ml/datasets/Iris
This data differs from the data presented in Fisher's article (as identified by Steve Chadwick, spchadwick@espeedaz.net). The 35th sample should be 4.9,3.1,1.5,0.2,"Iris-setosa", where the error is in the fourth feature. The 38th sample should be 4.9,3.6,1.4,0.1,"Iris-setosa", where the errors are in the second and third features.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Breast cancer Wisconsin database
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)
Prediction Model of Loss Payment Ratio of Motors, using 1985 Auto Imports Database
Overview
The objective of this project is to train a prediction model to infer the normalized loss ratio
of automobiles. The project has four stages. First, in the project setup stage, the data
is prepared for processing. Second, exploratory data analysis is conducted to
visualize the data. In the third stage, a prediction model is implemented. Lastly,
performance is recorded and visualized.
Dataset Size: 205 instances with 26 attributes (the 1985 Auto Imports database from the UCI repository).
Introduction to WEKA:
WEKA stands for Waikato Environment for Knowledge Analysis.
Weka, developed at the University of Waikato in New Zealand, is open-source data mining
software written in Java and issued under the GNU General Public License [3]. It is a
state-of-the-art facility for developing machine learning (ML) techniques and applying them
to real-world data mining problems. It is a collection of machine learning algorithms for
data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements
algorithms for data preprocessing, classification, regression, clustering, association rule
mining and visualization. New machine learning schemes can also be developed with this package.
Building a decision tree with J48
Select the Classify tab and click Choose. From the trees group, select J48, then click Start. When the run finishes, right-click the result entry (e.g. 18:35:26 - trees.J48) in the result list to visualize the output.

DECISION TREE:
GiveBirth = Yes: Mammals (7.0/1.0)
GiveBirth = No: Non-mammals (13.0/1.0)

To run a baseline instead, click Choose, select ZeroR from the rules group, and click Start.
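The same J48 model can also be built with Weka's Java API instead of the Explorer. Below is a minimal sketch, assuming the program.arff file defined above is in the working directory (the file path and class name are assumptions):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildJ48 {
    public static void main(String[] args) throws Exception {
        // Load the dataset defined above (the file path is an assumption)
        Instances data = new DataSource("program.arff").getDataSet();
        // Tell Weka which attribute is the class (here, the last one)
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();        // the same learner as trees > J48 in the GUI
        tree.buildClassifier(data);
        System.out.println(tree);    // prints the pruned tree as text
    }
}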
Click the “Open file” button from the Preprocess section and load your .arff file from your local
file system. If you couldn’t convert your .csv to .arff, don’t worry, because Weka will do that
for you.
Figure 3.1 Preprocess of Iris Dataset
If you could follow all the steps so far, you have loaded your dataset successfully and you’ll see
the attribute names (illustrated in the red area on the images above). The pre-processing stage is
handled by Filters in Weka; you can click the ‘Choose’ button in the Filter panel and apply any
filter you want. For example, if you would like to use Association Rule Mining as a training model,
you have to discretize numeric and continuous attributes. To do that you can follow the path:
Choose -> Filter -> Supervised -> Attribute -> Discretize.
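The same discretization can be applied programmatically. Here is a minimal sketch using Weka's Java API (the file name is an assumption; the supervised Discretize filter needs the class attribute to be set):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        // Load the dataset (placeholder path; use your own .arff file)
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // supervised filter needs a class
        // Same filter as Choose -> Filter -> Supervised -> Attribute -> Discretize
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized.numAttributes() + " attributes after discretization");
    }
}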
Classification
The concept of classification is basically to distribute data among the various classes defined on a
dataset. Classification algorithms learn this form of distribution from a given training set and
then try to classify the data correctly when it comes to test data for which the class is not
specified. The values that specify these classes in the dataset are given a label name, and that
label is used to determine the class of the data given during testing.
For this tutorial we will use the Iris dataset to illustrate the usage of classification with Weka.
You can download the dataset from the UCI repository linked above. Since the Iris dataset doesn’t
need pre-processing, we can do classification directly on it. Weka is a good tool for beginners; it
includes a tremendous number of algorithms. After you load your dataset, by clicking the Classify
section you can switch to another window, which we will talk about in this post.
In the Classify section, as you can see in Area 1 of Figure 4.1, ZeroR is the default classifier
in Weka. But since the ZeroR algorithm’s performance is not good for the Iris dataset, we’ll
switch it for the J48 algorithm, known for its very good success rate on our dataset. By clicking
the Choose button in Area 1 of Figure 4.1, a new algorithm can be selected from the list. The J48
algorithm is inside the trees directory in the Classifier list. Before running the algorithm
we have to select the test options from Area 2. There are 4 test options:
Use training set: Evaluates your model on the same dataset it was originally trained with.
Supplied test set: Evaluates your model on a dataset you supply externally. Select a dataset file
by clicking the Set button.
Cross-validation: The cross-validation option is widely used, especially if you have a limited
amount of data. The number you enter in the Folds field is used to divide your dataset into that
many subsets (let’s say it is 10). The original dataset is randomly partitioned into 10 subsets.
Weka then uses subset 1 for testing and the other 9 for training in the first run, subset 2 for
testing and the other 9 for training in the second run, and repeats this 10 times in total,
incrementing the test subset each time. In the end, the average success rate is reported to the user.
Percentage split: Divides your dataset into training and test sets according to the percentage you
enter. By default the value is 66%, meaning 66% of your dataset will be used as the training set and
the remaining 34% will be your test set.
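These evaluation modes can also be reproduced with Weka's Java API. A minimal sketch of the cross-validation option with J48 follows (the file path and random seed are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet(); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);              // class is the last attribute
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation, as selected in Test Options
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());   // accuracy, error rates, etc.
        System.out.println(eval.toMatrixString());    // the confusion matrix
    }
}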
By clicking the text area (the arrow on Figure 4.2) you can edit the parameters of the algorithm
according to your needs.
I chose 10-fold cross-validation from Test Options with the J48 algorithm, selected the class
feature from the drop-down list, and clicked the “Start” button from Area 2 in Figure 4.3.
According to the result, the success rate is 96%; you can see it in the Classifier Output shown
in Area 1 of Figure 4.3.
Run Information in Area 1 gives you detailed results, as you can see in Figure 4.4. It consists
of 5 parts. The first one is Run Information, which gives detailed information about the dataset
and the model you used. As you can see in Figure 4.4, we used J48 as the classification model, our
dataset was the Iris dataset, and its features are sepallength, sepalwidth, petallength, petalwidth
and class. Our test mode is 10-fold cross-validation. Since J48 is a decision tree learner, our
model is a pruned tree. As you can see on the tree, the first branching happens on petalwidth,
the petal width of the flowers: if the value is smaller than or equal to 0.6, the species is
Iris-setosa; otherwise another branch checks a further attribute to decide the species. In the
tree structure, ‘:’ introduces the class label.
The Classifier Model part illustrates the model as a tree and gives some information about the
tree, like the number of leaves and the size of the tree. Next is the stratified cross-validation
part, which shows the error rates. By checking this part you can see how successful your model is.
For example, our model correctly classified 96% of the instances and the mean absolute error
is 0.035, which is acceptable for the Iris dataset and our model.
You can see a Confusion Matrix and a detailed Accuracy Table at the bottom of the report.
F-Measure and ROC Area are important metrics for evaluating models, and they are derived from the
confusion matrix. A confusion matrix represents the True Positive, True Negative, False Positive
and False Negative counts, which I explain next. If you already understand confusion matrices
you can skip directly to the Visualizing the Result part.
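As a concrete illustration, Weka prints a three-class confusion matrix for Iris in the layout below (the counts here are illustrative of a typical run, not copied from Figure 4.4):

   a  b  c   <-- classified as
  50  0  0 |  a = Iris-setosa
   0 47  3 |  b = Iris-versicolor
   0  3 47 |  c = Iris-virginica

Each row is the actual class and each column the predicted class, so the 3 in row b, column c means three Iris-versicolor samples were wrongly predicted as Iris-virginica. The diagonal sums to 144 correct out of 150 instances, i.e. 96% accuracy, matching the success rate reported above.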
If you’d like to visualize these results you can use the graphic presentations shown in Figure 4.5
below.
By right-clicking the result entry and selecting Visualize tree you’ll see an illustration of your
model like the one in Figure 4.6.
If you’d like to see the classification errors illustrated, select Visualize Classifier Errors in
the same menu. By sliding the jitter control (see Area 1 in Figure 4.6) you can spread out all
samples on the coordinate plane. The X axis represents the predicted class, the Y axis represents
the actual class. Squares represent wrongly classified samples and stars represent correctly
classified samples. Blue ones are Iris-setosa, red ones are Iris-versicolor, and green ones are
Iris-virginica. So a red square means our model classified the sample as Iris-versicolor but it
was supposed to be Iris-virginica.
Weka Machine Learning Algorithms
Weka has a lot of machine learning algorithms, and this is one of the great benefits of using
Weka as a platform for machine learning.
A downside is that it can be a little overwhelming to know which algorithms to use, and
when. Also, the algorithms have names that may not be familiar to you, even if you know
them in other contexts.
In this section we will start off by looking at some well known algorithms supported by
Weka. What we will learn in this post applies to the machine learning algorithms used
across the Weka platform, but the Explorer is the best place to learn more about the
algorithms as they are all available in one easy place.
You can choose a machine learning algorithm by clicking the “Choose” button.
Clicking on the “Choose” button presents you with a list of machine learning algorithms to
choose from. They are divided into a number of main groups:
bayes: Algorithms that use Bayes’ Theorem in some core way, like Naive Bayes.
functions: Algorithms that estimate a function, like Linear Regression.
lazy: Algorithms that use lazy learning, like k-Nearest Neighbors.
meta: Algorithms that use or combine multiple algorithms, like ensemble methods.
misc: Implementations that do not neatly fit into the other groups, like running a saved
model.
rules: Algorithms that use rules, like One Rule (OneR).
trees: Algorithms that use decision trees, like J48.
Association rules
Association rule learners find associations between attributes. Between any attributes: there’s no
particular class attribute. Rules can predict any attribute, or indeed any combination of attributes.
To find them we need a different kind of algorithm. “Support” and “confidence” are two
measures of a rule that are used to evaluate them, and rank them. The most popular association
rule learner, and the one used in Weka, is called Apriori.
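As a quick hand-computed illustration (not Weka output), consider the animal dataset above and the rule CanFly=Yes → Class=Non-mammals. Four of the 20 instances have CanFly=Yes (Bat, Pigeon, Owl, Eagle) and three of those are non-mammals, so the rule’s support is 3/20 = 0.15 and its confidence is 3/4 = 0.75.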
Associator
Click on the Associate tab and click on the Choose button. Select the Apriori associator as
shown in the screenshot.
To set the parameters for the Apriori algorithm, click on its name; a window will pop up, as shown
below, that allows you to set the parameters.
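Apriori can likewise be run from the Java API. A minimal sketch follows (the file name and number of rules are assumptions; Apriori requires nominal, or previously discretized, attributes):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        // Load a dataset with nominal attributes (placeholder path)
        Instances data = new DataSource("program.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);          // same as the numRules parameter in the dialog
        apriori.buildAssociations(data);
        System.out.println(apriori);      // prints the best rules with support/confidence
    }
}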