
Unit V WEKA Tool

Datasets – Introduction - ARFF File Format


ARFF files have two distinct sections. The first section is the Header information,
which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes
(the columns in the data), and their types. An example header on the standard IRIS
dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris

@ATTRIBUTE sepallength NUMERIC


@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:


@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments.


The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

Examples
Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.

The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and attributes
declarations.

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>

where <relation-name> is a string. The string must be quoted if the name includes
spaces.
The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements.


Each attribute in the data set has its own @attribute statement, which uniquely
defines the name of that attribute and its data type. The order in which the attributes
are declared indicates the column position in the data section of the file. For example, if
an attribute is the third one declared, then Weka expects that all of that attribute's values
will be found in the third comma-delimited column.

The format for the @attribute statement is:


@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to
be included in the name then the entire name must be quoted.

The <datatype> can be any of the four types currently (version 3.2.1) supported by
Weka:

 numeric
 <nominal-specification>
 string
 date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, string and date are case insensitive.

Numeric attributes

Numeric attributes can be real or integer numbers.

Nominal attributes

Nominal values are defined by providing a <nominal-specification> listing the
possible values:

{<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.

String attributes

String attributes allow us to create attributes containing arbitrary textual values. This
is very useful in text-mining applications, as we can create datasets with string
attributes and then apply Weka filters that manipulate strings (like the
StringToWordVector filter). String attributes are declared as follows:
@ATTRIBUTE LCC string
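
As an illustration, a string attribute can be turned into word-count features with the StringToWordVector filter mentioned above. A minimal Java sketch, assuming a hypothetical file text.arff containing at least one string attribute:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextToVectors {
    public static void main(String[] args) throws Exception {
        // Placeholder file: an ARFF with at least one string attribute
        Instances data = DataSource.read("text.arff");

        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);
        // Each string value becomes a sparse vector of word features
        Instances vectors = Filter.useFilter(data, filter);
        System.out.println(vectors.numAttributes() + " attributes after filtering");
    }
}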

Date attributes

Date attribute declarations take the form:


@attribute <name> date [<date-format>]

where <name> is the name of the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (the same patterns as
Java's SimpleDateFormat class). The default format string accepts the ISO-8601
combined date and time format: "yyyy-MM-dd'T'HH:mm:ss".

Dates must be specified in the data section as the corresponding string representations
of the date/time (see example below).
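
A small example in that style, with a custom date format (the date values are quoted because they contain spaces):

@RELATION Timestamps

@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"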
ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the actual
instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data segment in the
file. The format is:
@data

Example
Name           GiveBirth  CanFly  LiveInWater  HaveLegs  Class
Human          yes        no      no           yes       mammals
Python         no         no      no           no        non-mammals
Salmon         no         no      yes          no        non-mammals
Whale          yes        no      yes          no        mammals
Frog           no         no      sometimes    yes       non-mammals
Komodo         no         no      no           yes       non-mammals
Bat            yes        yes     no           yes       mammals
Pigeon         no         yes     no           yes       non-mammals
Cat            yes        no      no           yes       mammals
Leopard shark  yes        no      yes          no        non-mammals
Turtle         no         no      sometimes    yes       non-mammals
Penguin        no         no      sometimes    yes       non-mammals
Porcupine      yes        no      no           yes       mammals
Eel            no         no      yes          no        non-mammals
Salamander     no         no      sometimes    yes       non-mammals
Gila monster   no         no      no           yes       non-mammals
Platypus       no         no      no           yes       mammals
Owl            no         yes     no           yes       non-mammals
Dolphin        yes        no      yes          no        mammals
Eagle          no         yes     no           yes       non-mammals

program.arff:

@relation program

@attribute GiveBirth {Yes,No}
@attribute CanFly {Yes,No}
@attribute LiveInWater {Yes,No,Sometimes}
@attribute HaveLegs {Yes,No}
@attribute class {Mammals,Non-mammals}

@data
Yes,No,No,Yes,Mammals
No,No,No,No,Non-mammals
No,No,Yes,No,Non-mammals
Yes,No,Yes,No,Mammals
No,No,Sometimes,Yes,Non-mammals
No,No,No,Yes,Non-mammals
Yes,Yes,No,Yes,Mammals
No,Yes,No,Yes,Non-mammals
Yes,No,No,Yes,Mammals
Yes,No,Yes,No,Non-mammals
No,No,Sometimes,Yes,Non-mammals
No,No,Sometimes,Yes,Non-mammals
Yes,No,No,Yes,Mammals
No,No,Yes,No,Non-mammals
No,No,Sometimes,Yes,Non-mammals
No,No,No,Yes,Non-mammals
No,No,No,Yes,Mammals
No,Yes,No,Yes,Non-mammals
Yes,No,Yes,No,Mammals
No,Yes,No,Yes,Non-mammals
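
The file above can also be loaded and classified programmatically through Weka's Java API. A minimal sketch, assuming program.arff is in the working directory and the Weka jar is on the classpath; it builds the J48 tree shown later in this unit:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ProgramJ48 {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; DataSource picks a loader from the file extension
        Instances data = DataSource.read("program.arff");
        // The class attribute is the last one declared in the header
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J48 (C4.5) decision tree on the full dataset
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);  // prints the pruned tree
    }
}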
Iris plants database
https://archive.ics.uci.edu/ml/datasets/Iris

Data Set Information:


This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is
a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data
set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is
linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant. This is an exceedingly simple domain.

This data differs from the data presented in Fisher's article (identified by Steve
Chadwick, spchadwick '@' espeedaz.net). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa"
where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are
in the second and third features.

Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Breast Cancer Wisconsin (Original) Data Set


https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

Data Set Information:


Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this
chronological grouping of the data. This grouping information appears immediately below, having been
removed from the data itself:

Group 1: 367 instances (January 1989)


Group 2: 70 instances (October 1989)
Group 3: 31 instances (February 1990)
Group 4: 17 instances (April 1990)
Group 5: 48 instances (August 1990)
Group 6: 49 instances (Updated January 1991)
Group 7: 31 instances (June 1991)
Group 8: 86 instances (November 1991)
-----------------------------------------
Total: 699 points (as of the donated database on 15 July 1992)
Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has
only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following
statements summarize changes to the original Group 1 set of data:

##### Group 1 : 367 points: 200B 167M (January 1989)

##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805

##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record


##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
##### : Changed 0 to 1 in field 6 of sample 1219406
##### : Changed 0 to 1 in field 8 of following sample:
##### : 1182404,2,3,1,1,1,2,0,1,1,1

Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

1985 Auto Imports Database


https://github.com/jihoonerd/1985_Auto_Imports_Database

Prediction Model of Loss Payment Ratio of Motors, using 1985 Auto Imports Database

Overview

The objective of this project is to train a prediction model that infers the normalized loss
ratio of automobiles. The project has four stages. First, in the project setup stage, the
data is prepared for processing. Second, exploratory data analysis is conducted to
visualize the data. In the third stage, a prediction model is implemented. Lastly,
performance is recorded and visualized.

Data Set Information:


This data set consists of three types of entities: (a) the specification of an auto in terms
of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses
in use as compared to other cars. The second rating corresponds to the degree to
which the auto is more risky than its price indicates. Cars are initially assigned a risk
factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is
adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A
value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. The third
factor is the relative average loss payment per insured vehicle year. This value is
normalized for all autos within a particular size classification (two-door small, station
wagons, sports/speciality, etc...), and represents the average loss per car per year.
Note: Several of the attributes in the database could be used as a "class" attribute.

Dataset Size:

 Number of Instances: 205


 Number of Attributes: 26 total
o 15 continuous
o 1 integer
o 10 nominal

Introduction to WEKA:
WEKA stands for Waikato Environment for Knowledge Analysis.
 Weka, developed at the University of Waikato in New Zealand, is an open-source data
mining tool written in Java.
 It contains a collection of algorithms for data mining tasks, including data preprocessing,
association mining, classification, regression, clustering and visualization.

WEKA is a data mining system developed by the University of Waikato in New Zealand that implements
data mining algorithms. WEKA is a state-of-the-art facility for developing machine learning (ML)
techniques and applying them to real-world data mining problems. It is a collection of machine
learning algorithms for data mining tasks. The algorithms are applied directly to a dataset. WEKA
implements algorithms for data preprocessing, classification, regression, clustering and association rules; it
also includes visualization tools. New machine learning schemes can also be developed with this
package. WEKA is open-source software issued under the GNU General Public License [3].

 Weka supports four file formats:


1. .arff
2. .csv
3. .names and
4. .data
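
For example, a .csv file can be converted to .arff programmatically. A minimal sketch using Weka's converter classes (the file names here are placeholders):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file; the first row is treated as attribute names
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("program.csv"));   // placeholder input
        Instances data = loader.getDataSet();

        // Write the same instances out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("program.arff"));     // placeholder output
        saver.writeBatch();
    }
}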

Procedure to load the dataset:

1. In the Weka Explorer, click Open file and load program.arff.
2. Select the Classify tab and click Choose.
3. Under trees, select J48.
4. Click Start.
5. In the result list, right-click the entry (e.g. 18:35:26 - trees.J48).
6. Select Visualize tree.

DECISION TREE:

GiveBirth = Yes: Mammals (7.0/1.0)
GiveBirth = No: Non-mammals (13.0/1.0)

To run a rules-based classifier instead:

1. Select the Classify tab and click Choose.
2. Under rules, select ZeroR.
3. Click Start.

Load Your Data

Click the “Open file” button in the Preprocess section and load your .arff file from your local
file system. If you couldn’t convert your .csv to .arff, don’t worry, because Weka will do that
for you.
Figure 3.1 Preprocess of Iris Dataset

If you have followed all the steps so far, you have loaded your dataset successfully and you’ll see
the attribute names (illustrated in the red area on the above images). The pre-processing stage is
handled by Filters in Weka: you can click the ‘Choose’ button under Filter and apply any filter you want.
For example, if you would like to use Association Rule Mining as a training model, you have to
discretize numeric and continuous attributes. To do that you can follow the path:
Choose -> Filter -> Supervised -> Attribute -> Discretize.
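
The same filter can be applied through the Java API. A minimal sketch, assuming iris.arff is available and the class attribute is the last one:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        // The supervised Discretize filter needs a class attribute
        data.setClassIndex(data.numAttributes() - 1);

        Discretize filter = new Discretize();
        filter.setInputFormat(data);  // learn the cut points from the data
        Instances discretized = Filter.useFilter(data, filter);
        System.out.println(discretized.toSummaryString());
    }
}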

Classification

The concept of classification is basically to distribute data among the various classes defined on a
dataset. Classification algorithms learn this distribution from a given training set and then try to
classify test data, for which the class is not specified, correctly. The values that specify these
classes in the dataset are given a label name, and that label is used to determine the class of the
data presented during testing.

For this tutorial we will use the Iris dataset to illustrate classification with Weka. You
can download the dataset from here. Since the Iris dataset doesn’t need pre-processing, we can
classify it directly. Weka is a good tool for beginners; it includes a tremendous
number of algorithms. After you load your dataset, clicking the Classify section switches you
to another window, which we will talk about in this post.

In the Classify section, as you can see in Area 1 of Figure 4.1, ZeroR is the default
classifier in Weka. But since the ZeroR algorithm's performance is not good on the Iris dataset, we'll
switch it for the J48 algorithm, known for its very good success rate on our dataset. By clicking
the Choose button in Area 1 of Figure 4.1, a new algorithm can be selected from the
list. The J48 algorithm is inside the trees directory in the Classifier list. Before running the algorithm
we have to select the test options from Area 2. The test options consist of 4 options:

Use training set: Evaluates your model on the same dataset you originally trained your
model with.

Supplied test set: Evaluates your model on a dataset you supply externally. Select a dataset
file by clicking the Set button.

Cross-validation: The cross-validation option is widely used, especially if you have a limited
amount of data. The number you enter in the Folds field determines how many subsets your
dataset is divided into (let's say it is 10). The original dataset is randomly partitioned into 10 subsets.
After that, Weka uses set 1 for testing and the other 9 sets for training in the first run, then uses set 2
for testing and the other 9 sets for training, and repeats this 10 times in total, incrementing the
test set number each time. In the end, the average success rate is reported to the user.

Percentage split: Divides your dataset into training and test sets according to the percentage you enter. By
default the value is 66%, meaning 66% of your dataset will be used as the training set and
the remaining 34% will be your test set.
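
These options map directly onto Weka's Evaluation class. A minimal sketch of the two most common ones, 10-fold cross-validation and a 66% percentage split (the random seed of 1 is an arbitrary choice):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TestOptions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(cv.toSummaryString("=== 10-fold CV ===", false));

        // 66% percentage split
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(tree, test);
        System.out.println(split.toSummaryString("=== 66% split ===", false));
    }
}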

By clicking the text area (the arrow on Figure 4.2) you can edit the parameters of the algorithm
according to your needs.

I chose 10-fold cross-validation from Test Options using the J48 algorithm, selected my class
feature from the drop-down list as class, and clicked the “Start” button in Area 2 of Figure 4.3.
According to the result, the success rate is 96%; you can see it in the Classifier Output
shown in Area 1 of Figure 4.3.
Run Information in Area 1 will give you detailed results, as you can see in Figure 4.4. It consists
of 5 parts; the first one is Run Information, which gives detailed information about the dataset
and the model you used. As you can see in Figure 4.4, we used J48 as the classification model, our
dataset was the Iris dataset, and its features are sepallength, sepalwidth, petallength, petalwidth and class.
Our test mode is 10-fold cross-validation. Since J48 is a decision tree, our model created a
pruned tree. As you can see on the tree, the first branch is on petalwidth, the petal width of the
flowers: if the value is smaller than or equal to 0.6, the species is Iris-setosa;
otherwise another branch checks another attribute to decide the species. In the tree
structure, ':' represents the class label.

The Classifier Model part illustrates the model as a tree and gives some information about the
tree, like the number of leaves, the size of the tree, etc. Next is the stratified cross-validation part, which
shows the error rates. By checking this part you can see how successful your model is. For
example, our model correctly classified 96% of the instances, and our mean absolute error
is 0.035, which is acceptable for the Iris dataset and our model.
You can see a Confusion Matrix and a detailed Accuracy Table at the bottom of the report. F-
Measure and ROC Area rates are important for the models, and they are derived from the
confusion matrix. A confusion matrix represents the True Positive, True Negative, False Positive
and False Negative counts, which I explain next. If you already understand confusion matrices
you can skip directly to the Visualizing the Result part.
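
The same report sections can be printed programmatically. Continuing the TestOptions sketch above (all method names below are from Weka's Evaluation class):

// after cv.crossValidateModel(...) in the earlier sketch:
System.out.println(cv.toMatrixString("=== Confusion Matrix ==="));
System.out.println(cv.toClassDetailsString("=== Detailed Accuracy By Class ==="));
System.out.println("Correct: " + cv.pctCorrect() + " %");
System.out.println("Mean absolute error: " + cv.meanAbsoluteError());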

Visualizing the Result

If you’d like to visualize these results you can use the graphic presentations shown in
Figure 4.5 below.

By right-clicking the result entry and selecting Visualize tree you’ll see your model’s illustration,
as in Figure 4.6.
If you’d like to see classification errors illustrated, select Visualize classifier errors in the same
menu. By sliding Jitter (see Area 1 in Figure 4.6) you can see all the samples on the
coordinate plane. The X axis represents the predicted class, the Y axis the
actual class. Squares represent wrongly classified samples; stars represent correctly
classified samples. Blue ones are Iris-setosa, red ones Iris-versicolor, and green
ones Iris-virginica. So a red square means our model classified the sample as Iris-
versicolor when it was supposed to be Iris-virginica.
Weka Machine Learning Algorithms
Weka has a lot of machine learning algorithms. This is great; it is one of the large benefits of
using Weka as a platform for machine learning.

A downside is that it can be a little overwhelming to know which algorithms to use, and
when. Also, the algorithms have names that may not be familiar to you, even if you know
them in other contexts.

In this section we will start off by looking at some well-known algorithms supported by
Weka. What we learn in this post applies to the machine learning algorithms used
across the Weka platform, but the Explorer is the best place to learn more about them,
as they are all available in one easy place.

1. Open the Weka GUI Chooser.


2. Click the “Explorer” button to open the Weka explorer.
3. Open a dataset, such as the Pima Indians dataset from the data/diabetes.arff file in your
Weka installation.
4. Click “Classify” to open the Classify tab.
The Classify tab of the Explorer is where you can learn about the various
algorithms and explore predictive modeling.

You can choose a machine learning algorithm by clicking the “Choose” button.

Clicking on the “Choose” button presents you with a list of machine learning algorithms to
choose from. They are divided into a number of main groups:

 bayes: Algorithms that use Bayes Theorem in some core way, like Naive Bayes.

 functions: Algorithms that estimate a function, like Linear Regression.

 lazy: Algorithms that use lazy learning, like k-Nearest Neighbors.

 meta: Algorithms that use or combine multiple algorithms, like Ensembles.

 misc: Implementations that do not neatly fit into the other groups, like running a saved
model.

 rules: Algorithms that use rules, like One Rule.

 trees: Algorithms that use decision trees, like Random Forest.


The tab is called “Classify” and the algorithms are listed under an overarching group called
“Classifiers”. Nevertheless, Weka supports both classification (predict a category) and regression
(predict a numeric value) predictive modeling problems.
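
To illustrate, one representative from several of these groups can be instantiated through the Java API (the particular algorithms chosen below are arbitrary):

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.RandomForest;

public class AlgorithmGroups {
    public static void main(String[] args) throws Exception {
        // One classifier per group; all share the Classifier interface,
        // so any of them can be passed to an Evaluation or buildClassifier call
        Classifier[] classifiers = {
            new NaiveBayes(),    // bayes
            new IBk(3),          // lazy: k-nearest neighbors with k = 3
            new OneR(),          // rules
            new RandomForest()   // trees
        };
        for (Classifier c : classifiers) {
            System.out.println(c.getClass().getName());
        }
    }
}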

Weka Clustering Algorithms


A clustering algorithm finds groups of similar instances in the entire dataset. WEKA supports
several clustering algorithms such as EM, FilteredClusterer, HierarchicalClusterer,
SimpleKMeans and so on. You should understand these algorithms completely to fully exploit
the WEKA capabilities.
As in the case of classification, WEKA allows you to visualize the detected clusters graphically.
To demonstrate clustering, we will use the provided iris database. The data set contains
three classes of 50 instances each; each class refers to a type of iris plant.
Click on the Cluster tab to apply the clustering algorithms to our loaded data. Click on
the Choose button and you will see the list of algorithms available in Weka.
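
A minimal SimpleKMeans sketch on the iris data. The class attribute is removed first, since clustering ignores it, and k = 3 is chosen to match the three species:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Drop the class attribute (the last one) before clustering
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(noClass);
        System.out.println(eval.clusterResultsToString());
    }
}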

Association rules
Association rule learners find associations between any attributes: there is no
particular class attribute. Rules can predict any attribute, or indeed any combination of attributes.
To find them we need a different kind of algorithm. "Support" and "confidence" are two
measures of a rule that are used to evaluate and rank rules. The most popular association
rule learner, and the one used in Weka, is called Apriori.
Associator
Click on the Associate tab and click on the Choose button. Select the Apriori associator as
shown in the screenshot.

To set the parameters for the Apriori algorithm, click on its name; a window will pop up
that allows you to set the parameters.
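
A minimal Apriori sketch, assuming a fully nominal dataset such as the program.arff file from earlier (Apriori cannot handle numeric attributes):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("program.arff");  // all attributes nominal

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);        // report the 10 best rules
        apriori.buildAssociations(data);
        System.out.println(apriori);    // prints the rules with support/confidence
    }
}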
