
Data Mining Lab Manual

AIM: This lab will familiarize students with the process of KDD and with Data Mining
concepts, and also enable them to work with the WEKA tool. The version we use is
WEKA 3.7.5. It can be downloaded from the site:
http://www.cs.waikato.ac.nz/ml/weka/index.html

Textbook: Data Mining: Practical Machine Learning Tools and Techniques (Second
Edition), by Ian H. Witten, Eibe Frank, and Mark A. Hall

WEKA uses a native file format called ARFF, which is described separately below. It can
also open other file formats such as .csv.
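For example, a CSV file can usually be converted to ARFF from the command line with
Weka's CSVLoader converter. This is an illustrative sketch only; the file names are
placeholders:

java -classpath weka.jar weka.core.converters.CSVLoader mydata.csv > mydata.arff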

For further information visit http://weka.wikispaces.com/

Attribute-Relation File Format (ARFF)
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list
of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato
for use with the Weka machine learning software.

Overview

ARFF files have two distinct sections. The first section is the Header information, which
is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset
looks like this:

% 1. Title: Iris Plants Database


%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and
@DATA declarations are case insensitive.

Examples

Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.

The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and attribute
declarations.

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The format is:

@relation <relation-name>

where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements.


Each attribute in the data set has its own @attribute statement, which uniquely defines
the name of that attribute and its data type. The order in which the attributes are declared
indicates the column position in the data section of the file. For example, if an attribute is
the third one declared, then Weka expects all of that attribute's values to be found in the
third comma-delimited column.

The format for the @attribute statement is:

@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to be
included in the name then the entire name must be quoted.

The <datatype> can be any of the four types currently (version 3.2.1) supported by
Weka:

- numeric
- <nominal-specification>
- string
- date [<date-format>]

where <nominal-specification> and <date-format> are defined below. The keywords
numeric, string and date are case insensitive.

Numeric attributes

Numeric attributes can be real or integer numbers.

Nominal attributes

Nominal values are defined by providing a <nominal-specification> listing the possible
values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

For example, the class value of the Iris dataset can be defined as follows:

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.

String attributes

String attributes allow us to create attributes containing arbitrary textual values. This is
very useful in text-mining applications, as we can create datasets with string attributes,
then write Weka Filters to manipulate strings (like StringToWordVectorFilter). String
attributes are declared as follows:

@ATTRIBUTE LCC string

Date attributes

Date attribute declarations take the form:

@attribute <name> date [<date-format>]

where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used by
SimpleDateFormat). The default format string accepts the ISO-8601 combined date and
time format: "yyyy-MM-dd'T'HH:mm:ss".

Dates must be specified in the data section as the corresponding string representations of
the date/time (see example below).

ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the actual
instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data segment in the file.
The format is:

@data

The instance data

Each instance is represented on a single line, with carriage returns denoting the end of the
instance.

Attribute values for each instance are delimited by commas. They must appear in the
order that they were declared in the header section (i.e. the data corresponding to the nth
@attribute declaration is always the nth field of the instance).

Missing values are represented by a single question mark, as in:

@data
4.4,?,1.5,?,Iris-setosa

Values of string and nominal attributes are case sensitive, and any that contain space
must be quoted, as follows:

@relation LCCvsLCSH

@attribute LCC string


@attribute LCSH string

@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

Dates must be specified in the data section using the string representation specified in the
attribute declaration. For example:

@RELATION Timestamps

@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"

Sparse ARFF files

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not
explicitly represented.

Sparse ARFF files have the same header (i.e. @relation and @attribute tags), but the
data section is different. Instead of representing each value in order, like this:

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

the non-zero attributes are explicitly identified by attribute number and their value stated,
like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

Each instance is surrounded by curly braces, and the format for each entry is: <index>
<space> <value> where index is the attribute index (starting from 0).

Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a
value is unknown, you must explicitly represent it with a question mark (?).

Warning: There is a known problem saving SparseInstance objects from datasets that
have string attributes. In Weka, string and nominal data values are stored as numbers;
these numbers act as indexes into an array of possible attribute values (this is very
efficient). However, the first string value is assigned index 0: this means that, internally,
this value is stored as a 0. When a SparseInstance is written, string instances with internal
value 0 are not output, so their string value is lost (and when the arff file is read again, the
default value 0 is the index of a different string value, so the attribute value appears to
change). To get around this problem, add a dummy string value at index 0 that is never
used whenever you declare string attributes that are likely to be used in SparseInstance
objects and saved as Sparse ARFF files.

INSTRUCTIONS ON HOW TO WORK WITH WEKA

WEKA Explorer
Classification using Explorer
Load data - select "Use training set" - choose algorithm - Start - save model -
select "Supplied test set" - open test data file - More options - Output predictions -
select CSV format and select a directory - load saved model - re-evaluate model on
current test set
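The same workflow can also be scripted against the Weka Java API. The sketch below is a
minimal illustration and not part of the lab steps; weather.arff ships with Weka, while
weather-test.arff and j48.model are placeholder names.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ExplorerClassificationSteps {
    public static void main(String[] args) throws Exception {
        // Load data and mark the last attribute as the class (as the Explorer does)
        Instances train = DataSource.read("weather.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Choose algorithm and Start: build J48 on the training set
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Save model
        SerializationHelper.write("j48.model", tree);

        // Supplied test set + re-evaluate model on current test set
        J48 loaded = (J48) SerializationHelper.read("j48.model");
        Instances test = DataSource.read("weather-test.arff");
        test.setClassIndex(test.numAttributes() - 1);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(loaded, test);
        System.out.println(eval.toSummaryString());
    }
}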

Clustering using Explorer


Load data - select "Use training set" - choose algorithm - Start - visualize cluster
assignments - X axis = cluster, Y axis = instance number, color = selected attribute - and
make a study of the clusters
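A corresponding programmatic sketch of the clustering workflow, given here only as an
illustration and assuming weather.arff is available locally:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExplorerClusteringSteps {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");

        // Choose algorithm: SimpleKMeans with 2 clusters, then Start
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(data);

        // Print the centroids and report each instance's cluster assignment
        System.out.println(kmeans);
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("instance " + i + " -> cluster "
                + kmeans.clusterInstance(data.instance(i)));
        }
    }
}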

WEKA Experimenter
Setting up an Experiment in Weka
1. Open the Experimenter from the main GUI.
2. Click New, then choose weather.arff by clicking Add new under Datasets.
3. Click Add new under Algorithms and choose the classifier J48 and then OneR.
4. Browse and name the output file under Results Destination.
5. Start running the experiment by choosing the Run tab and clicking Start.
6. Start analyzing the results by choosing the Analyse tab.
7. Browse for the output file to analyze under Source, or press the Experiment button
to load the latest experiment results to analyze.
8. Configure your analysis under Configure test and test:
• Percent correct.
• Percent incorrect and others.

WEKA EXERCISES

Data Preprocessing

Exercise 1) Attribute Relevance Ranking


For each step, open the indicated file in the “Preprocess” window. Then, go to the
“Attribute Selection” window and set the “Attribute selection mode” to “Use full training
set”. For the case mentioned below, perform attribute ranking using the following attribute
selection methods with default parameters:
a) InfoGainAttributeEval; and
b) GainRatioAttributeEval.
These attribute selection methods should consider only non-class dimensions (for each
set, the class attribute is indicated above the “Start” button). Record the output of each
run in a text file called “output.txt”. For that, copy the output of the run from the
“Attribute selection output” window in the Explorer and paste it at the end of the
“output.txt” file.

a). Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute
ranking methods with default parameters.
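The same ranking can also be reproduced outside the GUI. The following is a minimal,
illustrative sketch, assuming contact-lenses.arff is in the working directory and the class
is the last attribute:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class attribute

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // or GainRatioAttributeEval
        selector.setSearch(new Ranker());                   // ranks the non-class attributes
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
    }
}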

Evaluation
Once you have performed the experiments, you should spend some time evaluating your
results. In particular, try to answer at least the following questions: Why would one need
attribute relevance ranking? Do these attribute-ranking methods often agree or disagree?
On which data set(s), if any, do these methods disagree? Do discretization and its method
affect the results of attribute ranking? Do missing values affect the results of attribute
ranking? Record these and any other observations in a Word file called
“Observations.doc”.

Exercise 2
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch
the Explorer window and select the "Preprocess" tab.
2. Open the iris data-set ("iris.arff"; this should be in the ./data/ directory of the Weka
install). What information do you have about the data set (e.g. number of instances,
attributes and classes)? What type of attributes does this data-set contain (nominal or
numeric)? What are the classes in this data-set? Which attribute has the greatest standard
deviation? What does this tell you about that attribute? (You might also find it useful to
open "iris.arff" in a text editor.)

3. Under "Filter" choose the "Standardize" filter and apply it to all attributes. What does it
do? How does it affect the attributes' statistics? Click "Undo" to un-standardize the data
and now apply the "Normalize" filter to all the attributes. What does it do? How does it
affect the attributes' statistics? How does it differ from "Standardize"? Click "Undo" again
to return the data to its original state. (A programmatic sketch of these two filters is given
after this exercise.)

4. At the bottom right of the window there should be a graph which visualizes the data-
set. Making sure "Class: class (Nom)" is selected in the drop-down box, click "Visualize
All". What can you interpret from these graphs? Which attribute(s) discriminate best
between the classes in the data-set? How do the "Standardize" and "Normalize"
filters affect these graphs?

5. Under "Filter" choose the "AttributeSelection" filter. What does it do? Are the attributes
it selects the same as the ones you chose as discriminatory above? How does its behavior
change as you alter its parameters?

6. Select the "Visualize" tab. This shows you 2D scatter plots of each attribute against
each other attribute (similar to the F1 vs F2 plots from tutorial 1). Make sure the drop-
down box at the bottom says "Color: class (Nom)". Pay close attention to the plots
between attributes you think discriminate best between classes, and the plots
between attributes selected by the "AttributeSelection" filter. Can you verify from these
plots whether your thoughts and the "AttributeSelection" filter are correct? Which
attributes are correlated?
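The sketch below shows one way the two filters from step 3 might be applied
programmatically. It is illustrative only and assumes iris.arff is in the working directory:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Standardize;

public class ScaleIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Standardize: zero mean and unit variance for each numeric attribute
        Standardize standardize = new Standardize();
        standardize.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, standardize);

        // Normalize: rescale each numeric attribute to the range [0, 1]
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, normalize);

        System.out.println(standardized.toSummaryString());
        System.out.println(normalized.toSummaryString());
    }
}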

Exercise 3
1. Download the Old Faithful data-set.
2. Load this data into Excel. There are 2 attributes and 2 classes. Sort the data by class
(be careful to sort the entire row), line or bar plot each of the features individually and
save the graphs in a Word file. What do you notice on the plots from a visual inspection?
3. For each class and each feature, compute the minimum, maximum, mean and
standard deviation.
4. Generate a pairwise scatter plot for the combination F1 vs F2. Can you visually
guess whether these attributes are related or not?
5. Based on the scatter plot generated in point 4, determine the data points that are the
outliers (extreme high or low values). Do this manually by visually inspecting the scatter
plot; remove at least 5 points.
6. Compute the correlation between features for each class separately and
create a correlation matrix. What does it show?
7. Normalise all features to the range [0, 1]. There are several ways this can be done; we
will use the standard min-max normalization (see the note below). Recompute point 6:
has it made a difference?
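As a reminder (not part of the original exercise text), min-max normalization rescales each
value x of a feature to x' = (x - min) / (max - min), where min and max are that feature's
minimum and maximum, so that every feature ends up in the range [0, 1].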

Clustering
Exercise 1) Clustering using K-Means

Get to the Weka Explorer environment and load the training file using the Preprocess
mode. Try first with weather.arff. Get to the Cluster mode (by clicking on the Cluster
tab) and select a clustering algorithm, for example SimpleKMeans. Then click on Start
and you get the clustering result in the output window. The actual clustering for this
algorithm is shown as one instance for each cluster representing the cluster centroid.

Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data  

=== Clustering model (full training set) ===


  kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 16.156838252701938

Cluster centroids:

Cluster 0
 Mean/Mode:  rainy 75.625  86      FALSE yes
 Std Devs:   N/A      6.5014  7.5593 N/A     N/A
Cluster 1
 Mean/Mode:  sunny 70.8333 75.8333 TRUE yes
 Std Devs:   N/A      6.1128 11.143  N/A     N/A
 

=== Evaluation on training set ===


 

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 32.31367650540387

Cluster centroids:

Cluster 0
 Mean/Mode:  rainy 75.625  86      FALSE yes
 Std Devs:   N/A      6.5014  7.5593 N/A     N/A
Cluster 1
 Mean/Mode:  sunny 70.8333 75.8333 TRUE yes
 Std Devs:   N/A      6.1128 11.143  N/A     N/A

Clustered Instances

0       8 ( 57%)
1       6 ( 43%)

Evaluation

The way Weka evaluates the clusterings depends on the cluster mode you select. Four
different cluster modes are available (as buttons in the Cluster mode panel):
1. Use training set (default). After generating the clustering Weka classifies the
training instances into clusters according to the cluster representation and
computes the percentage of instances falling in each cluster. For example, the
above clustering produced by k-means shows 57% (8 instances) in cluster 0 and
43% (6 instances) in cluster 1.
2. In Supplied test set or Percentage split Weka can evaluate clusterings on
separate test data if the cluster representation is probabilistic (e.g. for EM).
3. Classes to clusters evaluation. In this mode Weka first ignores the class attribute
and generates the clustering. Then during the test phase it assigns classes to the
clusters, based on the majority value of the class attribute within each cluster.
Then it computes the classification error, based on this assignment and also shows
the corresponding confusion matrix. An example of this for k-means is shown
below.

Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
Ignored:
              play
Test mode:    Classes to clusters evaluation on training data

=== Clustering model (full training set) ===


 

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 11.156838252701938

Cluster centroids:

Cluster 0
 Mean/Mode:  rainy 75.625  86      FALSE
 Std Devs:   N/A      6.5014  7.5593 N/A
Cluster 1
 Mean/Mode:  sunny 70.8333 75.8333 TRUE
 Std Devs:   N/A      6.1128 11.143  N/A
 

=== Evaluation on training set ===


 

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 22.31367650540387

Cluster centroids:

Cluster 0
 Mean/Mode:  rainy 75.625  86      FALSE
 Std Devs:   N/A      6.5014  7.5593 N/A
Cluster 1
 Mean/Mode:  sunny 70.8333 75.8333 TRUE
 Std Devs:   N/A      6.1128 11.143  N/A

Clustered Instances

0       8 ( 57%)
1       6 ( 43%)
 

Class attribute: play


Classes to Clusters:

 0 1  <-- assigned to cluster


 5 4 | yes
 3 2 | no

Cluster 0 <-- yes


Cluster 1 <-- no

Incorrectly clustered instances : 7.0  50      %
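For reference, the sketch below shows one way a comparable classes-to-clusters evaluation
might be run from Java code. It is illustrative only and assumes weather.arff is available
with the class (play) as the last attribute:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClusters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class attribute: play

        // Build the clusterer on a copy of the data with the class removed
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(noClass);

        // Evaluate against the class labels (classes-to-clusters)
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}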


EM

The EM clustering scheme generates probabilistic descriptions of the clusters in terms of
mean and standard deviation for the numeric attributes, and value counts (incremented
by 1 and modified with a small value to avoid zero probabilities) for the nominal ones.
In "Classes to clusters" evaluation mode this algorithm also outputs the log-likelihood,
assigns classes to the clusters and prints the confusion matrix and the error rate, as shown
in the example below.

Clustered Instances

0       4 ( 29%)
1      10 ( 71%)
 

Log likelihood: -8.36599


 

Class attribute: play


Classes to Clusters:

 0 1  <-- assigned to cluster


 2 7 | yes
 2 3 | no

Cluster 0 <-- no
Cluster 1 <-- yes

Incorrectly clustered instances : 5.0  35.7143 %

Classification

1. The distinct stages of designing a classification model are outlined below:
- Collect your raw data
- Clean your data (e.g. outlier removal, missing data removal etc.)
- Preprocess the data (e.g. normalization, standardization, etc.)
- Determine the type of problem (i.e. classification or regression)
- Pick an appropriate classifier (e.g. multilayer perceptron, decision tree,
linear regression, etc.)
- Choose some default parameters for the classifier; the choice of classifier
and parameters constitutes your model

2. Pick a training/testing strategy (e.g. percentage split, cross-validation etc.)
- Train the classifier using your training/testing strategy
- Analyse the performance of your model
- If your results are unsatisfactory, consider altering your model (i.e. changing the
classifier, its parameters, and/or your training/testing strategy) and re-training/testing
- If your results are satisfactory, validate your model on an unseen set of cleaned and
preprocessed data.

Classification Exercises
Exercise 1.

1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software,
launch the Explorer window and select the "Preprocess" tab. Open the iris data-set
("iris.arff"; this should be in the ./data/ directory of the Weka install).
2. Select the "Classify" tab. Under the "Test options" section you have four different
testing options. How does each of these options select the training/testing data (we
cannot use the "supplied test set" option as we have no applicable file)? Which testing
mode do you think will perform best? (The "ExplorerGuide.pdf", in the ./ directory of
the Weka install, may help.)

3. Under "Classifier" select "MultilayerPerceptron". What type of classifier is this? How
does this classifier work? What main parameters can be specified for this classifier?

4. Under "Test options" select "Use training set" and under "More options" check "Output
predictions". Now click "Start" to start training the model. You should see a stream of
output appear in the window named "Classifier output". What does each of the following
sections tell you about the model?
(a) "Predictions on ..."
(b) "Summary"
(c) "Detailed accuracy by class"
(d) "Confusion matrix"

5. Under "Results list" you should see your model; right click on it and select "Visualise
classifier errors". Points marked with a square are errors, i.e. incorrectly classified. How
do you think the classifier performed on the test data?
6. Under "Test options" vary the option selected, i.e. "cross-validation" or "percentage
split", and their parameters, i.e. "folds" and "%". Then start the training phase again for
each model. For each model analyse the classifier output and visualise the classifier
errors. How do the different training techniques affect the model? Which technique
performed the best? How does this compare to your initial prediction in question 2?
7. Repeat exercise 6 with the "J48" (Decision Tree) and "RBFNetwork" classifiers.
How do these compare to each other? How do these compare to the
"MultilayerPerceptron"? (A sketch of running such a comparison from code is given
below.)
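A hedged sketch of running such a comparison from code, using 10-fold cross-validation
on iris.arff; the classifier list and file location are assumptions and not part of the
exercise:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new MultilayerPerceptron(), new J48() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1)); // 10-fold CV
            System.out.printf("%s: %.2f%% correct%n",
                model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}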

Interpreting Weka Output

Below is the output from Weka when using the weka.classifiers.trees.J48 classifier with
the file $WEKAHOME/data/iris.arff as a training file and no testing file. I.e. using the
command:

java weka.classifiers.trees.J48 -t $WEKAHOME/data/iris.arff

In square brackets ([,]) there are comments on how to interpret the output.

J48 pruned tree


------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves : 5

Size of the tree : 9

[ Above is the decision tree constructed by the J48 classifier. This indicates how the
classifier uses the attributes to make a decision. The leaf nodes indicate which class an
instance will be assigned to should that node be reached. The numbers in brackets after
the leaf nodes indicate the number of instances assigned to that node, followed by how
many of those instances are incorrectly classified as a result. With other classifiers some
other output will be given that indicates how the decisions are made, e.g. a rule set. Note
that the tree has been pruned. An unpruned tree can be produced by using the "-U"
option. ]

Time taken to build model: 0.05 seconds


Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances 147 98 %


Incorrectly Classified Instances 3 2 %
Kappa statistic 0.97
Mean absolute error 0.0233
Root mean squared error 0.108
Relative absolute error 5.2482 %
Root relative squared error 22.9089 %
Total Number of Instances 150

[ This gives the error levels when applying the classifier to the training data it was
constructed from. For our purposes the most important figures here are the numbers of
correctly and incorrectly classified instances. With the exception of the Kappa statistic,
the remaining statistics compute various error measures based on the class probabilities
assigned by the tree. ]

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica

[ This shows for each class, how instances from that class received the various
classifications. E.g. for class "b", 49 instances were correctly classified but 1 was put into
class "c". ]

=== Stratified cross-validation ===

Correctly Classified Instances 144 96 %


Incorrectly Classified Instances 6 4 %
Kappa statistic 0.94
Mean absolute error 0.035
Root mean squared error 0.1586
Relative absolute error 7.8705 %
Root relative squared error 33.6353 %
Total Number of Instances 150

[ This gives the error levels during a 10-fold cross-validation. The "-x" option can be
used to specify a different number of folds. The correctly/incorrectly classified instances
refers to the case where the instances are used as test data and again are the most
important statistics here for our purposes. ]

=== Confusion Matrix ===

a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 2 48 | c = Iris-virginica

[ This is the confusion matrix for the 10-fold cross-validation, showing what
classification the instances from each class received when it was used as testing data. E.g.
for class "a" 49 instances were correctly classified and 1 instance was assigned to class
"b". ]

Prediction:

Linear regression
Linear regression can be very useful in association analysis of numerical values; in fact,
regression analysis is a powerful approach to modeling the relationship between a
dependent variable and one or more independent variables. Simple regression is when we
predict from one independent variable, and multiple regression is when we predict from
more than one independent variable. The model we attempt to fit is a linear one, which is,
very simply, drawing a line through the data. Of all the lines that can possibly be drawn
through the data, we are looking for the one that best fits the data. In fact, we look to find
a line that best satisfies
y = β0 + β1x + ε
The most accurate model is therefore the one that yields the best-fit line to the data in
question; we are looking for the minimal sum of squared deviations between actual and
fitted values. This is called the method of least squares. Now that we have briefly
reminded ourselves of the very basics of regression, let's move directly on to an example
in Weka.

Exercise 1

(a) In Weka go back to the "Preprocess" tab. Open the iris data-set ("iris.arff"; this should
be in the ./data/ directory of the Weka install).
(b) In the "Attributes" section (bottom left of the screen) select the "class" feature and
click "Remove". We need to do this, as simple linear regression cannot deal with non-
numeric values.
(c) Next select the "Classify" tab to get into the Classification perspective of Weka, and
choose "LinearRegression" (under "functions").
(d) Clicking on the textbox next to the "Choose" button brings up the parameter editor
window. Click on the "More" button to get information about the parameters. Make sure
that "attributeSelectionMethod" is set to "No attribute selection" and
"eliminateColinearAttributes" is set to "False".
(e) Finally make sure that you select the attribute "petalwidth" in the dropdown box
just under "Test Options". Hit Start to run the regression.

Inspect the results; in particular, pay attention to the Linear Regression Model formula
returned, and the coefficients and intercept of the straight-line equation. As this is a
numeric prediction/regression problem, accuracy is measured with Root Mean Squared
Error, Mean Absolute Error and the like. As most of you will have noticed, you can
repeat this process for regressing the other features in turn, and compare how well the
different features can be predicted.
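A minimal sketch of the same experiment from the Java API; the file name and the choice
of petalwidth as target mirror the steps above, and this is illustrative only:

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RegressPetalWidth {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("iris.arff");

        // Step (b): drop the nominal class attribute (the last one)
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(iris);
        Instances numericOnly = Filter.useFilter(iris, remove);

        // Step (e): predict petalwidth from the remaining attributes
        numericOnly.setClass(numericOnly.attribute("petalwidth"));

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(numericOnly);
        System.out.println(lr);   // prints the fitted regression formula

        Evaluation eval = new Evaluation(numericOnly);
        eval.evaluateModel(lr, numericOnly);
        System.out.println("RMSE on training data: " + eval.rootMeanSquaredError());
    }
}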

Exercise 2
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “cpu” dataset (i.e., contained in the “cpu.arff” file).
- For each attribute and for each of its possible values, how many instances in
each class have the feature value (i.e., the class distribution of the feature
values)?
• Go to the “Classify” tab. Select the SimpleLinearRegression learner. Choose
“Percentage split” (66% for training) test mode. Run the classifier and observe the
results shown in the “Classifier output” window.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression
function?
- Visualize the errors made by the learned regression function. In the plot, how
can you see the detailed information of a predicted instance?
• Now, in the “Test options” panel select the “Cross-validation” option (10 folds). Run
the classifier and observe the results shown in the “Classifier output” window.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression
function?
- Visualize the errors made by the learned regression function. In the plot, how
can you see the detailed information of a predicted instance?

Association Rules

In this tutorial we will first look at association rules, using the APRIORI algorithm in
Weka. APRIORI works with categorical values only. Therefore we will use a different
dataset called "adult". This dataset contains census data about 48,842 US adults. The aim
is to predict whether their income exceeds $50,000. The dataset is taken from the Delve
website, and originally came from the UCI Machine Learning Repository. More
information about it is available in the original UCI Documentation.

Download a copy of adult.arff and load it into Weka.

This dataset is not immediately ready for use with APRIORI. First, reduce its size by
taking a random sample. You can do this with the 'ResampleFilter' in the Preprocess tab
sheet: click on the label under 'Filters', choose 'ResampleFilter' from the drop-down
menu, set the 'sampleSizePercentage' (e.g. to 15), click 'OK' and 'Add', and click 'Apply
Filters'. The 'Working relation' is now a subsample of the original adult dataset. Now we
have to get rid of the numerical attributes. You can choose to discard them, or to
discretise them. We will discretise the first attribute ('age'): choose the 'DiscretizeFilter',
set 'attributeIndices' to 'first', bins to a low number, like 4 or 5, and the other options to
'False'. Then add this new filter to the others. We will get rid of the other numerical
attributes: choose an 'AttributeFilter', set 'invertSelection' to 'False', and enter the indices
of the remaining numeric attributes (3,5,11-13). Apply all the filters together now. Then
click on 'Replace' to make the resulting 'Working relation' the new 'Base relation'.

Now go to the 'Associate' tab sheet and click under 'Associator'. Set 'numRules' to 25, and
keep the other options on their defaults. Click 'Start' and observe the results. What do you
think about these rules? Are they useful?
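The preprocessing and Apriori run described above can also be scripted. The following is a
rough, illustrative sketch; the parameter values mirror the text and adult.arff is assumed
to be available locally:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.Resample;

public class AdultApriori {
    public static void main(String[] args) throws Exception {
        Instances adult = DataSource.read("adult.arff");

        // Reduce the dataset to a 15% random sample
        Resample resample = new Resample();
        resample.setSampleSizePercent(15);
        resample.setInputFormat(adult);
        Instances sample = Filter.useFilter(adult, resample);

        // Discretize the first attribute (age) into a small number of bins
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("first");
        discretize.setBins(5);
        discretize.setInputFormat(sample);
        sample = Filter.useFilter(sample, discretize);

        // Remove the remaining numeric attributes
        Remove remove = new Remove();
        remove.setAttributeIndices("3,5,11-13");
        remove.setInputFormat(sample);
        sample = Filter.useFilter(sample, remove);

        // Mine at most 25 association rules
        Apriori apriori = new Apriori();
        apriori.setNumRules(25);
        apriori.buildAssociations(sample);
        System.out.println(apriori);
    }
}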

From the previous, it is obvious that some attributes should not be examined
simultaneously because they lead to trivial results. Go back to the 'Preprocess' sheet. If
you have replaced the original 'Base relation' by the 'Working relation', you can include
and exclude attributes very easily: delete all filters from the 'Filters' window, then remove
the check mark next to the attributes you want to get rid of and click 'Apply Filters'. You
now have a new 'Working relation'. Try to remove different combinations of the
attributes that lead to trivial association rules. Run APRIORI several times and look for
interesting rules. You will find that there is often a whole range of rules which are all
based on the same simpler rule. Also, you will often get rules that don't include the target
class. This is why in most cases you would use APRIORI for dataset exploration rather
than for predictive modelling.

Exercise 2
Association analysis is concerned with discovering interesting correlations or other
relationships between variables in large databases. We are interested in relationships
between features themselves, rather than features and class as in the standard
classification problem setting. Hence searching for association patterns is no different
from classification except that instead of predicting just the class, we try to predict
arbitrary attributes or attribute combinations.

1. Fire up the Weka software, launch the Explorer window and select the "Preprocess" tab.
Open the weather.nominal data-set ("weather.nominal.arff"; this should be in the ./data/
directory of the Weka install).

2. Often we are in search of discovering association rules showing attribute-value
conditions that occur frequently together in a given set of data, such as: buys(X,
"computer") AND buys(X, "scanner") => buys(X, "printer") [support = 2%, confidence =
60%], where confidence and support are measures of rule interestingness. A support of
2% means that 2% of all transactions under analysis show that computer, scanner and
printer are purchased together. A confidence of 60% means that 60% of the customers
who purchased a computer and a scanner also bought a printer. We are interested in
association rules that apply to a reasonably large number of instances and have a
reasonably high accuracy on the instances to which they apply.

Weka has three built-in association rule learners: "Apriori", "Predictive Apriori" and
"Tertius". However, they are not capable of handling numeric data, therefore in this
exercise we use the weather data.
(a) Select the "Associate" tab to get into the Association rule mining perspective of Weka.
Under "Associator" select and run each of the following: "Apriori", "Predictive Apriori"
and "Tertius".
Briefly inspect the output produced by each Associator and try to interpret its meaning.
(b) In association rule mining the number of possible association rules can be very large
even with tiny datasets, hence it is in our best interest to reduce the count of rules found
to only the most interesting ones. This is usually achieved by setting minimum thresholds
on support and confidence values. Still in the "Associate" view, select the "Apriori"
algorithm again, click on the textbox next to the "Choose" button and try, in turn,
different values for the following parameters: "lowerBoundMinSupport" (min threshold
for support) and "minMetric" (min threshold for confidence). As you change these
parameter values, what do you notice about the rules that are found by the associator?
Note that the parameter "numRules" limits the maximum number of rules that the
associator looks for; you can try changing this value.

(c) This time run the Apriori algorithm with the "outputItemSets" parameter set to true.
You will notice that the algorithm now also outputs a list of "Generated sets of large
itemsets:" at different levels. If you have the module's Data Mining book by Witten &
Frank with you, then you can compare and contrast the Apriori associator's output with
the association rules on pages 114-116 (I will have a couple of copies circulating in the lab
during the session, just ask me for one). I also strongly recommend reading through
chapter 4.5 in your own time, while playing with the weather data in Weka; this chapter
gives a nice and easy introduction to association rules. Notice in particular how the item
sets and association rules compare between Weka and tables 4.10-4.11 in the book.
(d) Compare the association rules output from Apriori and Tertius (you can do this by
navigating through the already built associator models in the "Result list" on the right
side of the screen).

Make sure that the Apriori algorithm shows at least 20 rules. Think about how the
association rules generated by the two different methods compare to each other.
Something to always remember with association rules is that they should not be used for
prediction directly, that is, without further analysis or domain knowledge, as they do not
necessarily indicate causality.

They are however a very helpful starting point for further exploration and for building a
better understanding of our data.

As you should certainly know by this point, in order to identify associations between
parameters a correlation matrix and a scatter plot matrix can be very useful.

Exercise 3: Boolean association rule mining in Weka

The dataset studied is the weather dataset from Weka’s data folder.
The goal of this data mining study is to find strong association rules in the
weather.nominal dataset. Answer the following questions:
a. What type of variables are in this dataset (numeric / ordinal / categorical) ?
b. Load the data in Weka Explorer. Select the Associate tab. How many different
association rule mining algorithms are available?
c. Choose the Apriori algorithm with the following parameters (which you can set by
clicking on the chosen algorithm): support threshold = 15%
(lowerBoundMinSupport = 0.15), confidence threshold = 90% (metricType =
confidence, minMetric = 0.9), number of rules = 50 (numRules = 50). After
starting the algorithm, how many rules do you find? Could you use the regular
weather dataset to get the results? Explain why.
d. Paste a screenshot of the Explorer window showing at least the first 20 rules.
e. Define the concepts of support, confidence, and lift for a rule. Write here the first
rule discovered. What is its support? Its confidence? Interpret the meaning of
these terms and this rule in this particular example.

f. Apriori algorithm generates association rules from frequent itemsets. How many
itemsets of size 4 were found? Which rule(s) have been generated from itemset of
size 4 (temperature=mild, windy=false, play=yes, outlook=rainy)? List their
numbers in the list of rules.

The KDD Process in Weka


This experiment will use the Weka data mining tool. Weka is an open source Java
development environment for data mining from the University of Waikato in New
Zealand.

Heart disease datasets


The dataset studied is the heart disease dataset from UCI repository. Two different
datasets are provided: heart-h.arff (Hungarian data), and heart-c.arff (Cleveland data).
These datasets describe factors of heart disease. Both these data sets are available to you
on the assignment page.
The data mining project goal is to better understand the risk factors for heart disease, as
represented in the 14th attribute: num (<50 means no disease, and values <50-1 to <50-4
represent increasing levels of heart disease).
The question on which this machine learning study concentrates is whether it is possible
to predict heart disease from the other known data about a patient. The data mining task
of choice to answer this question will be classification/prediction, and several different
algorithms will be used to find which one provides the best predictive power. However
this exercise focuses on the various aspects of the KDD process.

1. Data preparation- integration


We want to merge the two datasets into one, in a step called data integration. Revise the
ARFF notation from the tutorial, which is Weka's data representation language. Answer
the following questions:
a. Define what data integration means. (in your own words)
b. Is there an entity identification or schema integration problem in this dataset? If
yes, how to fix it?
c. Is there a redundancy problem in this dataset? If yes, how to fix it?
d. Are there data value conflicts in this dataset? If yes, how to fix it?
e. Integrate the two datasets into one single dataset, which will be used as a starting point
for the next questions, and load it in the Explorer. How many instances do you have?
How many attributes? (You could do this using Excel or another spreadsheet program.
First, save your individual files as "csv" files in Weka and open them in a spreadsheet
viewing program. Copy the rows from one file to the other and save the merged file as
csv. Open it in Weka and save it in ARFF format. Take care of the above questions and
think about rectifying potential problems.) A programmatic alternative is sketched below.
f. Paste a screenshot of the Explorer window.
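For reference, a minimal sketch of merging the two files from the Java API, under the
assumption that both ARFF headers declare identical attributes in the same order
(otherwise the schemas must be reconciled first); the output file name is a placeholder:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class MergeHeartDatasets {
    public static void main(String[] args) throws Exception {
        Instances hungarian = DataSource.read("heart-h.arff");
        Instances cleveland = DataSource.read("heart-c.arff");

        // Append the Cleveland rows to a copy of the Hungarian data
        Instances merged = new Instances(hungarian);
        for (int i = 0; i < cleveland.numInstances(); i++) {
            merged.add(cleveland.instance(i));
        }

        // Write the merged dataset back out as ARFF
        ArffSaver saver = new ArffSaver();
        saver.setInstances(merged);
        saver.setFile(new File("heart-merged.arff"));
        saver.writeBatch();
    }
}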

2. Descriptive data summarization


Before preprocessing the data, an important step is to get acquainted with the data – also
called data understanding.
a. Stay in the Preprocess tab for now. Study for example the age attribute. What is
its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute, i.e. the minimum, lower quartile
(25%), median, upper quartile (75%), and maximum. Is this figure provided in Weka?
c. Specify which attributes are numeric, which are ordinal, and which are
categorical/nominal.
d. Interpret the graphic showing in the lower right corner of the Explorer. How can
you name this graphic? What do the red and blue colors mean (pay attention to
the pop-up messages that appear when dragging the mouse over the graphic)?
What does this graphic represent?
e. Visualize all the attributes in graphic format. Paste a screenshot.
f. Comment on what you learn from these graphics.
g. Switch to the Visualize tab. By selecting the maximum jitter, and looking at the
num column – the last one – can you determine which attributes seem to be the
most linked to heart disease? Paste the boxplot representing the attribute you find
the most predictive of heart disease (Y) as a function of num (X).
h. Does any pair of different attributes seem correlated?

3. Data preparation – selection


The datasets studied have already been processed by selecting a subset of attributes
relevant for the data mining project.
a. From the documentation provided in the dataset, how many attributes were
originally in these datasets?
b. With Weka, attribute selection can be achieved either from the specific Select
attributes tab, or within Preprocess tab. List the different options in Weka for
selecting attributes, with a short explanation about the corresponding method.

4. Data preparation - cleaning


Data cleaning deals with such defects of real-world data as incompleteness, noise, and
inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the
data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values, and
which Weka filters implement them – if available. Remove the missing values with the
method of your choice, explaining which filter you are using and why you make this
choice.
b. Noisy data. List the methods seen in class for dealing with noisy data, and which
Weka filters implement them – if available.
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.

5. Data preparation - transformation


1. Among the different data transformation techniques, explore those available
through the Weka Filters. Stay in the Preprocess tab for now. Study the following data
transformation only:
a. Attribute construction – for example, adding an attribute representing the sum of
two other ones. Which Weka filter allows you to do this?
b. Normalize an attribute. Which Weka filter allows you to do this? Can this filter
perform min-max normalization? Z-score normalization? Decimal normalization?
Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice – state
which one you choose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.

6. Data preparation- reduction


Often, data mining datasets are too large to process directly. Data reduction techniques
are used to preprocess the data. Once the data mining project has been successful on these
reduced data, the larger dataset can be processed too.
a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction
method is to select rows from a dataset. This is called sampling. How do you perform
sampling with Weka filters? Can they perform the two main methods: Simple Random
Sample Without Replacement, and Simple Random Sample With Replacement? (A
sketch using the Resample filter is given below.)
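As a pointer (an assumption to verify rather than part of the question), the unsupervised
instance filter Resample supports both variants via its noReplacement flag. A minimal
sketch, with a placeholder input file name:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class SampleRows {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart-cleaned.arff"); // placeholder file name

        Resample resample = new Resample();
        resample.setSampleSizePercent(20);   // keep 20% of the rows
        resample.setNoReplacement(true);     // true = without replacement, false = with
        resample.setInputFormat(data);
        Instances sample = Filter.useFilter(data, resample);

        System.out.println(sample.numInstances() + " instances in the sample");
    }
}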

Case Study #1
Data Preprocessing with Weka

The goal of this case study is to investigate how to preprocess data using Weka data
mining tool.

This assignment will use the Weka data mining tool. Weka is an open source Java
development environment for data mining from the University of Waikato in New
Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/.

Weka is really an asset for learning data mining because it is freely available, students
can study how the different data mining models are implemented, and they can develop
customized Java data mining applications. Moreover, data mining results from Weka can
be published in the most respected journals and conferences, which makes it a de facto
development environment of choice for research in data mining, where researchers often
need to develop new data mining methods.

How to use Weka

Weka can be used in four different modes: through a command line interface (CLI),
through a graphical user interface called the Explorer, through the Knowledge Flow, and
through the Experimenter. The Knowledge Flow allows large datasets to be processed in
an incremental manner, while the other modes can only process small to medium size
datasets. The Experimenter provides an environment for testing and comparing several
data mining algorithms.
The explanations for this assignment focus on Explorer processing of a dataset,
nevertheless the CLI can produce the same functionality, and thus can be chosen as an
alternative. Moreover, this assignment will use only the data preprocessing capabilities of
Weka, which may only require some Java development, whereas similar functionality in
SPSS/CLEMENTINE would require mastering a more complex suite of functions, and
learning a new programming language, called CLEM.

Heart disease datasets

The dataset studied is the heart disease dataset from UCI repository (datasets-UCI.jar).
Two different datasets are provided: heart-h.arff (Hungarian data), and heart-c.arff
(Cleveland data). These datasets describe factors of heart disease. They can be
downloaded from: http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.

The data mining project goal is to better understand the risk factors for heart
disease, as represented in the 14th attribute: num (<50 means no disease, and values <50-
1 to <50-4 represent increasing levels of heart disease).
The question on which this machine learning study concentrates is whether it is
possible to predict heart disease from the other known data about a patient. The data
mining task of choice to answer this question will be classification/prediction, and
several different algorithms will be used to find which one provides the best predictive
power.

1. Data preparation- integration

We want to merge the two datasets into one, in a step called data integration. Revise arff
notation from the tutorial, which is Weka data representation language. Answer the
following questions:
g. Define what data integration means.
h. Is there an entity identification or schema integration problem in this dataset? If
yes, how to fix it?
i. Is there a redundancy problem in this dataset? If yes, how to fix it?
j. Are there data value conflicts in this dataset? If yes, how to fix it?
k. Integrate the two datasets into one single dataset, which will be used as a starting
point for the next questions, and load it in the Explorer. How many instances do
you have? How many attributes?
l. Paste a screenshot of the Explorer window.

2. Descriptive data summarization

Before preprocessing the data, an important step is to get acquainted with the data – also
called data understanding in CRISP-DM.
a. Stay in the Preprocess tab for now. Study for example the age attribute. What is
its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute. Is this figure provided in
Weka?
c. Specify which attributes are numeric, which are ordinal, and which are
categorical/nominal.
d. Interpret the graphic showing in the lower right corner of the Explorer. How can
you name this graphic? What do the red and blue colors mean (pay attention to
the pop-up messages that appear when dragging the mouse over the graphic)?
What does this graphic represent?
e. Visualize all the attributes in graphic format. Paste a screenshot.
f. Comment on what you learn from these graphics.
g. Switch to the Visualize tab. What is the term used in the textbook to name the
series of boxplots represented? By selecting the maximum jitter, and looking at
the num column – the last one – can you determine which attributes seem to be
the most linked to heart disease? Paste the boxplot representing the attribute you
find the most predictive of heart disease (Y) as a function of num (X).
h. Does any pair of different attributes seem correlated?

3. Data preparation – selection

The datasets studied have already been processed by selecting a subset of attributes
relevant for the data mining project.
a. From the documentation provided in the dataset, how many attributes were
originally in these datasets?
b. With Weka, attribute selection can be achieved either from the specific Select
attributes tab, or within Preprocess tab. List the different options in Weka for
selecting attributes, with a short explanation about the corresponding method.
c. In comparison with the methods for attribute selection detailed in the textbook,
are any missing? Are any provided in Weka not provided in the textbook?

4. Data preparation - cleaning

Data cleaning deals with such defects of real-world data as incompleteness, noise, and
inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the
data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values,
and which Weka filters implement them – if available. Remove the missing
values with the method of your choice, explaining which filter you are using and
why you make this choice. If a filter is not available for your method of choice,
develop a new one that you add to the available filters as a Java class.
b. Noisy data. List the methods seen in class for dealing with noisy data, and which
Weka filters implement them – if available.
c. Outlier detection. List the methods seen in class for detecting outliers. How
would you detect outliers with Weka? Are there any outliers in this dataset, and if
yes, list some of them.
d. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.

5. Data preparation - transformation

Among the different data transformation techniques, explore those available through the
Weka Filters. Stay in the Preprocess tab for now. Study the following data
transformation only:
a. Attribute construction – for example adding an attribute representing the sum of
two other ones. Which Weka filter allows you to do this?
b. Normalize an attribute. Which Weka filter allows you to do this? Can this filter
perform Min-max normalization? Z-score normalization? Decimal normalization?
Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice – state
which one you choose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.

6. Data preparation- reduction

Often, data mining datasets are too large to process directly. Data reduction techniques
are used to preprocess the data. Once the data mining project has been successful on these
reduced data, the larger dataset can be processed too.
a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction
method is to select rows from a dataset. This is called sampling. How do you perform
sampling with Weka filters? Can they perform the two main methods: Simple Random
Sample Without Replacement, and Simple Random Sample With Replacement?

Case Study 2

Classification / Prediction / Cluster analysis

The goal of this assignment is to review prediction mining principles and methods,
cluster analysis principles and methods, and to apply them to a dataset using Weka data
mining tool.

Heart dataset
The first dataset studied is the Cleveland dataset from the UCI repository. This dataset
describes numeric factors of heart disease. It can be downloaded from
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html and is contained in the
datasets-numeric.jar archive.

Zoo dataset
The second dataset studied is the zoo dataset from UCI repository. This dataset describes
animals with categorical features. It can be downloaded from
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html and is contained in the
datasets-UCI.jar archive.

1. Prediction in Weka

The goal of this data mining study is to predict the severity of heart disease in the
Cleveland dataset (variable num) based on the other attributes. Answer the following
questions:
m. What types of variables are in this dataset (numeric / ordinal / categorical)?
n. Load the data in Weka Explorer. Select the Classify tab. How many different
prediction algorithms are available (under functions)?
o. Explain what is prediction in data mining.
p. Choose LinearRegression algorithm. Explain what is the principle of this
algorithm.
q. Results of this algorithm can be interpreted in the following way. The first part of
the output represents the coefficients of the linear equation of the type
num = w0 + w1a1 + … + wkak.
The numbers provided in front of each attribute ak represent the wk. Based on this,
interpret the results you get from running LinearRegression on the dataset. What
is the equation of the line found?
r. The second part of the results states the correlation coefficient, which measures
the statistical correlation between the predicted and actual values (a coefficient of
+1 indicates a perfect positive relationship, 0 indicates no linear relationship, and
–1 indicates a perfect negative relationship). Only positive correlations make
sense in regression, and a coefficient above 0.5 signals a large correlational effect.
The remaining figures are the mean absolute error (the average prediction error),
the root mean squared error (the square root of the mean squared error), which
is the most commonly used error measure, the relative absolute error (which
compares this error with the one obtained if the prediction had been the mean),

the root relative squared error (the square root of the error in comparison with the
one obtained if the prediction had been the mean), and the total number of
instances considered.
The overall interpretation of these is the following: a prediction is good when the
correlation coefficient is as large as possible, and all the errors are as small as
possible. These figures are used to compare several prediction results. How do
you evaluate the fit of the equation found above, meaning how strong is this
prediction?
s. It is also notable that an important figure is the square of the correlation
coefficient (R2). In statistical regression analysis, from which this prediction
method originates, the most used success measures are R and R2. The latter represents the
percentage of variation in the target figure accounted for by the model. For
example, if we want to predict a sales volume based on three factors such as the
advertising budget, the number of plays on the radio per week, and the
attractiveness of the band, and if we get a correlation coefficient R of 0.8, then
we learn from the model that R2 = 64% of the variability in the outcome (the sales
volume) is accounted for by the three factors. How much of the variability of num
can be predicted by the other attributes?
t. Are these results compatible with the results of assignment #1, which used
classification to predict num?
u. Now compare these figures with the other classifiers provided in functions and
fill in the following table (except the last line); a sketch for producing these figures
programmatically is given after this list of questions:

Method                             Correlation   Mean       Root mean   Relative   Root relative
                                   coefficient   absolute   squared     absolute   squared
                                                 error      error       error      error
LinearRegression
SMOreg
MultilayerPerceptron
MultilayerPerceptron (optimized)

v. Which prediction method provides best results with this dataset?


w. Try using the other functions to calculate the same regression. What problem(s)
are you facing?
x. Explain what logistic regression is, and how it differs from linear regression.
y. Is logistic regression in fact a prediction method? If not, what kind of data mining
method is logistic regression?
z. In the MultilayerPerceptron function, how many input nodes does this
multilayer perceptron have?
aa. In the MultilayerPerceptron function, how many output nodes does this
multilayer perceptron have?
bb. In the MultilayerPerceptron function, how many hidden layers does this
multilayer perceptron have?
cc. After choosing GUI in the panel of MultilayerPerceptron options, paste here a
screenshot of the graphical representation of this neural network.
dd. What is its learning rate?

ee. By changing the MultilayerPerceptron parameters, what is the configuration for
the best results you get?
ff. What best prediction results do you get (fill in the table above)?
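The figures required for the table can also be produced programmatically. The following is a rough sketch, assuming the cleveland data are stored in cleveland.arff (file name assumed) with num as the last attribute; adjust the class index if your file differs:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictNum {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cleveland.arff");    // file name assumed
        data.setClassIndex(data.numAttributes() - 1);          // num assumed to be the last attribute

        LinearRegression lr = new LinearRegression();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(lr, data, 10, new Random(1));  // 10-fold cross-validation

        System.out.println("Correlation coefficient:         " + eval.correlationCoefficient());
        System.out.println("Mean absolute error:             " + eval.meanAbsoluteError());
        System.out.println("Root mean squared error:         " + eval.rootMeanSquaredError());
        System.out.println("Relative absolute error (%):     " + eval.relativeAbsoluteError());
        System.out.println("Root relative squared error (%): " + eval.rootRelativeSquaredError());
    }
}

The same harness can be reused for SMOreg and MultilayerPerceptron by swapping in the corresponding classifier object.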

2. Clustering in Weka

The goal of this data mining study is to find groups of animals in the zoo dataset, and to
check whether these groups correspond to the real animal types in the dataset.
i. What types of variables are in this dataset?
j. How many rows / cases are there?
k. How many animal types are represented in this dataset? List them here.
l. After removing the type attribute, go to the Cluster tab. How many clustering
algorithms are available in Weka?
m. List the clustering algorithms seen in class, and map these to the ones provided in
Weka.
n. Start using the SimpleKMeans clusterer, choosing 7 clusters (a programmatic sketch of
this run is given after this list of questions). Do the clusters learnt and their centroids
seem to match the animal types?
o. Compare results with EM clusterer (with 7 clusters),
MakeDensityBasedClusterer, FarthestFirst (with 7 clusters), and Cobweb.
Which algorithm seems to provide the best clustering match for this dataset?
p. Explain the principles of SimpleKMeans, EM, MakeDensityBasedClusterer, and
Cobweb clustering algorithms.
q. Are results easy to interpret, even with the tree visualizations provided?
r. What would make it easier to evaluate the usefulness of the clusters found?

a. List some animals that are misclassified, meaning classified in a cluster that does
not correspond to their actual type, for instance a mammal clustered with fish, or a
reptile clustered with amphibian.
b. By modifying the selected parameters, improve the classification, explain which
modifications you made, and paste here the resulting dendrogram.
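A sketch of the same clustering study through the API is shown below, assuming the zoo data are in zoo.arff and that the type attribute is the last one (check its index in your copy of the file):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterZoo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("zoo.arff");   // file name assumed

        // Remove the type attribute before clustering (assumed to be the last attribute)
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noType = Filter.useFilter(data, remove);

        // k-means with 7 clusters, one per animal type
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(7);
        km.buildClusterer(noType);

        // Prints the cluster sizes and the centroid of each cluster
        System.out.println(km);
    }
}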

Exercises on the WEKA tool

1. Launch the WEKA tool, and activate the Explorer environment.

2. Open the “weather.nominal” dataset


- How many instances (examples) are contained in the dataset?
- How many attributes are used to represent the instances?
- Which attribute is the class label?
- What is the data type (e.g., numeric, nominal, etc.) of the attributes in the dataset?
- For each attribute and for each of its possible values, how many instances in each
class have the attribute value (i.e., the class distribution of the attribute values)?

3. Go to the Classify tab. Select the ZeroR classifier. Choose the “Cross-validation”
(10 folds) test mode. Run the classifier and observe the results shown in the “Classifier
output” window.
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the classifier errors. In the plot, how can you differentiate between the
correctly and incorrectly classified instances? In the plot, how can you see the
detailed information of an incorrectly classified instance?
- How can you save the learned classifier to a file?
- How can you load a learned classifier from a file?

4. Choose the “Percentage split” (66% for training) test mode. Run the ZeroR classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified? Why is this number smaller than
the one observed in the previous experiment (i.e., using the cross-validation test
mode)?
- What is the MAE made by the classifier?
- Visualize the classifier errors to see the detailed information.

5. Now, select the Id3 classifier (i.e., you can find this classifier in the weka.classifiers.trees
group). Choose the “Cross-validation” (10 folds) test mode. Run the Id3 classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the MAE made by the classifier?
- Visualize the classifier errors.
- Compare these results with those observed for the ZeroR classifier in the cross-
validation test mode. Which classifier, ZeroR or Id3, shows a better prediction
performance for the current dataset and the cross-validation test mode?

6. Choose the “Percentage split” (66% for training) test mode. Run the Id3 classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the MAE made by the classifier?
- Visualize the classifier errors.
- Compare the results made by the Id3 classifier for the two considered test modes.
In which test mode does the classifier produce a better result (i.e., a smaller
error)?
- Which classifier, ZeroR or Id3, shows a better prediction performance for the
current dataset and the splitting test mode?

Exercises on the probabilistic models

• Let’s assume we have the following data set, recorded over a period of 25 days, of
whether or not a person played tennis depending on the outlook and wind conditions.
• Each instance (example) is represented by three attributes.
o Outlook: a value of {Sunny, Overcast, Rain}.

o Wind: a value of {Weak, Strong}.
o PlayTennis: the classification attribute (i.e., Yes- the person plays tennis; No- the
person does not play tennis).

Date Outlook Wind PlayTennis


1 Sunny Weak No
2 Sunny Strong No
3 Overcast Weak Yes
4 Rain Weak Yes
5 Rain Weak Yes
6 Rain Strong No
7 Overcast Strong Yes
8 Sunny Weak No
9 Sunny Weak Yes
10 Rain Weak Yes
11 Sunny Strong Yes
12 Overcast Strong Yes
13 Overcast Weak Yes
14 Rain Strong No
15 Sunny Strong Yes
16 Overcast Strong No
17 Overcast Weak Yes
18 Rain Weak No
19 Sunny Weak No
20 Rain Strong Yes
21 Sunny Weak Yes
22 Overcast Weak No
23 Rain Weak Yes
24 Sunny Strong Yes
25 Overcast Weak No
• We want to predict if the person will play tennis in the three future days (a sketch that
cross-checks the manual computation is given after this list).
o Day 26: (Outlook=Sunny, Wind=Strong) → PlayTennis=?
o Day 27: (Outlook=Overcast, Wind=Weak) → PlayTennis=?
o Day 28: (Outlook=Rain, Wind=Weak) → PlayTennis=?
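As a cross-check of the manual computation, the short sketch below applies the naïve Bayes formula directly to class-conditional counts tallied by hand from the 25-day table above; verify the counts against the table before relying on the output:

public class PlayTennisNB {
    // Counts tallied by hand from the 25-day table (15 Yes days, 10 No days)
    static double pOutlook(String o, boolean yes) {
        if (yes) {  // out of 15 Yes days: Sunny 5, Overcast 5, Rain 5
            return 5.0 / 15;
        } else {    // out of 10 No days: Sunny 4, Overcast 3, Rain 3
            return o.equals("Sunny") ? 4.0 / 10 : 3.0 / 10;
        }
    }

    static double pWind(String w, boolean yes) {
        if (yes) return w.equals("Strong") ? 6.0 / 15 : 9.0 / 15;  // Yes: Strong 6, Weak 9
        return w.equals("Strong") ? 4.0 / 10 : 6.0 / 10;           // No:  Strong 4, Weak 6
    }

    public static void main(String[] args) {
        double pYes = 15.0 / 25, pNo = 10.0 / 25;
        String[][] days = { {"Sunny", "Strong"}, {"Overcast", "Weak"}, {"Rain", "Weak"} }; // days 26-28
        for (String[] d : days) {
            double yesScore = pYes * pOutlook(d[0], true) * pWind(d[1], true);
            double noScore  = pNo  * pOutlook(d[0], false) * pWind(d[1], false);
            System.out.printf("Outlook=%s, Wind=%s -> PlayTennis=%s (Yes=%.4f, No=%.4f)%n",
                    d[0], d[1], yesScore > noScore ? "Yes" : "No", yesScore, noScore);
        }
    }
}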

Use the WEKA tool


• Convert the dataset provided above (i.e., Days 1-25) into the ARFF format (supported
by WEKA), and save it in the “play_tennis.arff” file.
• For the three future days (i.e., Days 26-28), set the values on the PlayTennis attribute by
the predictions (i.e., computed manually in Part I, by the Naïve Bayes classification
approach).
Convert the data of these three days (i.e., Days 26-28) into the ARFF format, and save it
in the “play_tennis_test.arff” file.
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “play_tennis” dataset (i.e., saved in the “play_tennis.arff” file).
- For each attribute and for each of its possible values, how many instances in each class
have the feature value (i.e., the class distribution of the feature values)?

• Go to the “Classify” tab. Select the NaiveBayes classifier. Choose “Percentage split”
(66% for training) test mode. Run the classifier and observe the results shown in the
“Classifier output” window.
- How many instances are used for training? How many for the test?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the classifier errors. In the plot, how can you differentiate between the
correctly and incorrectly classified instances? In the plot, how can you see the detailed
information of an incorrectly classified instance?
- How can you save the learned classifier to a file?

• Now, let’s use a separate test dataset. In the “Test options” panel select the “Supplied
test set” option. Activate the nearby “Set...” button and locate the “play_tennis_test.arff”
file.
Run the classifier and observe the results shown in the “Classifier output” window.
- How many instances are used for training? How many for the test?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Compare the test results with those observed in the previous experiment (i.e., using the
splitting test mode). (A programmatic sketch of this supplied-test-set run follows.)
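A rough API equivalent of this supplied-test-set run is sketched below, assuming the two ARFF files created earlier share the same attribute declarations and that PlayTennis is the last attribute in both:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PlayTennisWeka {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("play_tennis.arff");        // days 1-25
        Instances test  = DataSource.read("play_tennis_test.arff");   // days 26-28
        train.setClassIndex(train.numAttributes() - 1);                // PlayTennis assumed last
        test.setClassIndex(test.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());   // error figures, including the MAE
        System.out.println(eval.toMatrixString());    // confusion matrix
    }
}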

Exercises on Nearest Neighbor Learner


• We use a subset of the “Iris Plants Database” dataset (i.e., provided by WEKA,
contained in the “iris.arff” file).
• Each plant record (i.e., example) is represented by 5 attributes.
- SepalLength – the plant’s sepal length in cm.
- SepalWidth – the plant’s sepal width in cm.
- PetalLength – the plant’s petal length in cm.
- PetalWidth – the plant’s petal width in cm.
- Class – the classification attribute, with the possible values {Iris-setosa, Iris-
versicolor, Iris-virginica}.
PlantID  SepalLength  SepalWidth  PetalLength  PetalWidth  Class
1        5.1          3.5         1.4          0.2         Iris-setosa
2        7.1          3.0         5.9          2.1         Iris-virginica
3        5.4          3.4         1.5          0.4         Iris-setosa
4        6.4          3.2         4.5          1.5         Iris-versicolor
5        6.3          3.3         4.7          1.6         Iris-versicolor
6        7.3          2.9         6.3          1.8         Iris-virginica
7        4.4          2.9         1.4          0.2         Iris-setosa
8        4.9          3.1         1.5          0.1         Iris-setosa
9        5.8          2.8         5.1          2.4         Iris-virginica
10       5.6          2.9         3.6          1.3         Iris-versicolor
11       6.9          3.2         5.7          2.3         Iris-virginica
12       6.0          3.4         4.5          1.6         Iris-versicolor
13       7.2          3.0         5.8          1.6         Iris-virginica
14       4.8          3.4         1.9          0.2         Iris-setosa
15       6.8          2.8         4.8          1.4         Iris-versicolor

How to access a database using WEKA

Go to the Control Panel


Choose Administrative Tools
Choose Data Sources (ODBC)
At the User DSN tab, choose Add...
Choose database
Microsoft Access
Note: Make sure your database is not open in another application before following the
steps below.
Choose the Microsoft Access driver and click Finish
Give the source a name by typing it into the Data Source Name field
In the Database section, choose Select...
Browse to find your database file, select it and click OK
Click OK to finalize your DSN

You will need to configure a file called DatabaseUtils.props. This file already exists
under the path weka/experiment/ in the weka.jar file (which is just a ZIP file) that is part
of the Weka download. In this directory you will also find a sample file for ODBC
connectivity, called DatabaseUtils.props.odbc, and one specifically for MS Access, called
DatabaseUtils.props.msaccess (>3.4.14, >3.5.8, >3.6.0), also using ODBC. You should
use one of the sample files as basis for your setup, since they already contain default
values specific to ODBC access.

This file needs to be recognized when the Explorer starts. You can achieve this by
making sure it is in the working directory or the home directory (if you are unsure what

the terms working directory and home directory mean, see the Notes section).
The easiest is probably the second alternative, as the setup will apply to all the Weka
instances on your machine.

Just make sure that the file contains the following lines at least:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname

where dbname is the name you gave the user DSN. (This can also be changed once the
Explorer is running.)

Start up the Weka Explorer.


Choose Open DB...
The URL should read "jdbc:odbc:dbname" where dbname is the name you gave the user
DSN.
Click Connect
Enter a Query, e.g., "select * from tablename" where tablename is the name of the
database table you want to read. Or you could put a more complicated SQL query here
instead.
Click Execute
When you're satisfied with the returned data, click OK to load the data into the
Preprocess panel. (A programmatic alternative using InstanceQuery is sketched below.)
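If you prefer to pull the table from code rather than through the Open DB dialog, the weka.experiment.InstanceQuery class offers the same functionality. The sketch below reuses the DSN name dbname and the table name tablename from the steps above (both are placeholders for your own names):

import weka.core.Instances;
import weka.experiment.InstanceQuery;

public class LoadFromDatabase {
    public static void main(String[] args) throws Exception {
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:odbc:dbname");     // dbname = the user DSN created above
        query.setQuery("select * from tablename");    // tablename = the table you want to read
        Instances data = query.retrieveInstances();   // runs the query and builds a dataset
        query.disconnectFromDatabase();
        System.out.println(data.numInstances() + " rows loaded, "
                + data.numAttributes() + " attributes");
    }
}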

WEKA KnowledgeFlow

With the Knowledge Flow interface, users select Weka components from a tool bar, place
them on a layout canvas, and connect them into a directed graph that processes and
analyzes data. It provides an alternative to the Explorer for those who like thinking in
terms of how data flows through the system. It also allows the design and execution of
configurations for streamed data processing, which the Explorer cannot do. You invoke
the Knowledge Flow interface by selecting KnowledgeFlow from the choices in the right
panel.

Table Visualization and Evaluation Components

Name                             Function

Visualization components:
DataVisualizer                   Visualize data in a two-dimensional scatter plot
ScatterPlotMatrix                Matrix of scatter plots
AttributeSummarizer              Set of histograms, one for each attribute
ModelPerformanceChart            Draw ROC and other threshold curves
CostBenefitAnalysis              Visualize cost or benefit trade-offs
TextViewer                       Visualize data or models as text
GraphViewer                      Visualize tree-based models
StripChart                       Display a scrolling plot of data

Evaluation components:
TrainingSetMaker                 Make a dataset into a training set
TestSetMaker                     Make a dataset into a test set
CrossValidationFoldMaker         Split a dataset into folds
TrainTestSplitMaker              Split a dataset into training and test sets
InstanceStreamToBatchMaker       Collect instances from a stream and assemble them into a batch dataset
ClassAssigner                    Assign one of the attributes to be the class
ClassValuePicker                 Choose a value for the positive class
ClassifierPerformanceEvaluator   Collect evaluation statistics for batch evaluation
IncrementalClassifierEvaluator   Collect evaluation statistics for incremental evaluation
ClustererPerformanceEvaluator    Collect evaluation statistics for clusterers
PredictionAppender               Append a classifier's predictions to a dataset
SerializedModelSaver             Save trained models as serialized Java objects

Evaluating numeric prediction


All the evaluation measures we have described pertain to classification situations rather
than numeric prediction situations. The basic principles—using an independent test set
rather than the training set for performance evaluation, the holdout method, and cross-
validation—apply equally well to numeric prediction. But the basic quality measure
offered by the error rate is no longer appropriate: errors are not simply present or absent;
they come in different sizes.

Several alternative measures, described below, can be used to evaluate the success of
numeric prediction. The predicted values on the test instances are p1, p2, . . ., pn ; the
actual values are a1, a2, . . ., an. pi refers to the numeric value of the prediction for the ith
test instance.

Mean-squared error is the principal and most commonly used measure; sometimes the
square root is taken to give it the same dimensions as the predicted value itself. Many
mathematical techniques, such as linear regression, use the mean-squared error because it
tends to be the easiest measure to manipulate mathematically. However, here we are
considering it as a performance measure: all the performance measures are easy to
calculate, so mean-squared error has no particular advantage. The question is, is it an
appropriate measure for the task at hand?

Mean absolute error is an alternative: just average the magnitude of the individual errors
without taking account of their sign. Mean-squared error tends to exaggerate the effect of
outliers—instances whose prediction error is larger than the others—but absolute error
does not have this effect: all sizes of error are
treated evenly according to their magnitude.

Sometimes it is the relative rather than absolute error values that are of importance. For
example, if a 10% error is equally important whether it is an error of 50 in a prediction of
500 or an error of 0.2 in a prediction of 2, then averages of absolute error will be
meaningless: relative errors are appropriate. This effect would be taken into account by
using the relative errors in the mean-squared error calculation or the mean absolute error
calculation.

Relative squared error refers to something quite different. The error is made
relative to what it would have been if a simple predictor had been used. The simple
predictor in question is just the average of the actual values from the training data. Thus
relative squared error takes the total squared
error and normalizes it by dividing by the total squared error of the default predictor.

Relative absolute error is just the total absolute error, with the same kind of
normalization. In these three relative error measures, the errors are normalized by the
error of the simple predictor that predicts average values.

The final measure is the correlation coefficient, which measures the statistical correlation
between the a’s and the p’s. The correlation coefficient ranges from +1 for perfectly
correlated results, through 0 when there is no relationship, to -1 when the results are
perfectly negatively correlated.
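For reference, with predicted values p1, …, pn, actual values a1, …, an, and ā denoting the mean of the actual values over the training data, the standard definitions of these measures are:

mean absolute error         = ( |p1 - a1| + … + |pn - an| ) / n
root mean squared error     = sqrt( ( (p1 - a1)^2 + … + (pn - an)^2 ) / n )
relative absolute error     = ( |p1 - a1| + … + |pn - an| ) / ( |a1 - ā| + … + |an - ā| )
root relative squared error = sqrt( ( (p1 - a1)^2 + … + (pn - an)^2 ) / ( (a1 - ā)^2 + … + (an - ā)^2 ) )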

SOLVED EXAMPLES FOR LAB

1) Association Rules
Exercise 1. Basic association rule creation manually.
The 'database' below has four transactions. What association rules can be found in this
set, if the minimum support (i.e., coverage) is 60% and the minimum confidence (i.e.,
accuracy) is 80%?
Trans_id Itemlist
T1 {K, A, D, B}
T2 {D, A, C, E, B}
T3 {C, A, B, E}
T4 {B, A, D}

The solution:
Let’s first make a tabular and binary representation of the data:
Transaction  A B C D E K
T1           1 1 0 1 0 1
T2           1 1 1 1 1 0
T3           1 1 1 0 1 0
T4           1 1 0 1 0 0

STEP 1. Form the item sets. Let's start by forming the item sets containing one item. The
number of occurrences and the support of each item set are given after it. In order to reach
a minimum support of 60%, an item has to occur in at least 3 transactions.
A 4, 100%
B 4, 100%
C 2, 50%
D 3, 75%
E 2, 50%
K 1, 25%
STEP 2. Now let's form the item sets containing 2 items. We only take the item sets from
the previous phase whose support is 60% or more.
A B 4, 100%
A D 3, 75%
B D 3, 75%
STEP 3. The item sets containing 3 items. We only take the item sets from the previous
phase whose support is 60% or more.
A B D 3, 75%
STEP 4. Let's now form the rules and calculate their confidence (c). We only take the item
sets from the previous phases whose support is 60% or more.
Rules:
A -> B P(B|A) = |A∩B| / |A| = 4/4, c: 100%
B -> A c: 100%
A -> D c: 75%
D -> A c: 100%
B -> D c: 75%
D -> B c: 100%
AB -> D c: 75%
D -> AB c: 100%
AD -> B c: 100%
B - > AD c: 75%
BD -> A c: 100%
A -> BD c: 75%
The rules with a confidence measure of 75% are pruned, and we are left with the
following rule set:
A -> B
B -> A
D -> A
D -> B
D -> AB
AD-> B
DB-> A

Exercise 2. Initial experiments with Weka's association rule generation tool.

Launch Weka and try to do with it the calculations you performed manually in the
previous exercise. Use the Apriori algorithm for generating the association rules.
The Solution:
The file may be given to Weka in e.g. two different formats. They are called ARFF
(attribute-relation
file format) and CSV (comma separated values). Both are given below:

CSV format:
exista,existb,existc,existd,existe,existk
TRUE,TRUE,FALSE,TRUE,FALSE,TRUE
TRUE,TRUE,TRUE,TRUE,TRUE,FALSE
TRUE,TRUE,TRUE,FALSE,TRUE,FALSE
TRUE,TRUE,FALSE,TRUE,FALSE,FALSE

Exercise 3. Weka and the command line parameters of the Apriori algorithm.

Can you modify the options in such a way that you get the same rules as in Exercise 1?
The Solution
The options offered are as follows:
numRules (-N): the required number of rules, e.g. 100
metricType (-T): the metric by which rules are ranked; 0 = confidence
minMetric (-C): the minimum confidence, e.g. 0.8
delta (-D): iteratively decreases the minimum support by this factor; support is reduced until
the lower bound for minimum support has been reached or the required number of rules has
been generated, e.g. 0.5
upperBoundMinSupport (-U): the upper bound for minimum support
lowerBoundMinSupport (-M): the lower bound for minimum support
significanceLevel (-S): the significance level; when it is -1.0, the significance test is not used
verbose (-V): report progress iteratively
The parameters presented above produce the same results as the ones we calculated manually.
Figure 2: Running the Apriori algorithm in Weka.

Figure 3: Setting the parameters of the apriori algorithm. Information about the contents
of the parameters may also be found here.

WEKA’s Apriori explanation

The default values for Number of rules, the decrease for Minimum support (delta factor) and minimum
Confidence values are 10, 0.05 and 0.9. Rule Support is the proportion of examples covered by the
LHS and RHS while Confidence is the proportion of examples covered by the LHS that are also
covered by the RHS. So if a rule's RHS and LHS covers 50% of the cases then the rule has 0.5
support, if the LHS of a rule covers 200 cases and of these the RHS covers 50 cases then the
confidence is 0.25. With default settings Apriori tries to generate 10 rules by starting with a minimum
support of 100%, iteratively decreasing support by the delta factor until minimum non-zero support is
reached or the required number of rules with at least minimum confidence has been generated. If we
examine Weka's output, a Minimum support of 0.15 indicates the minimum support reached in order
to generate the 10 rules with the specified minimum metric, here confidence of 0.9. The item set sizes
generated are displayed; e.g. there are 6 four-item sets having the required minimum support. By
default rules are sorted by confidence and any ties are broken based on support. The number preceding
==> indicates the number of cases covered by the LHS and the value following the rule is the number
of cases covered by the RHS. The value in parenthesis is the rule's confidence.
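The same run can be reproduced from code. The sketch below assumes the four-transaction data have been saved as transactions.arff (a hypothetical file name) with the TRUE/FALSE attributes shown above:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.arff");   // hypothetical file name

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.6);   // minimum support 60%
        apriori.setMinMetric(0.8);              // minimum confidence 80%
        apriori.setNumRules(20);                // upper limit on the number of rules reported
        apriori.buildAssociations(data);

        System.out.println(apriori);            // prints the rules found
    }
}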

2) Clustering

Exercise 1. Suppose that we have the following data:

a(2, 0) b(1,2) c(2,2) d(3,2) e(2,3) f(3,3) g(2,4) h(3,4) i(4,4) j(3,5)
Identify the cluster by applying the k-means algorithm, with k=2. Try using initial cluster
centers as far apart as possible.
Solution:

Initialization
The following chart gives the squares of the pairwise Euclidean distance:
    a   b   c   d   e   f   g   h   i   j
a   0   5   4   5   9  10  16  17  20  26
b       0   1   4   2   5   5   8  13  13
c           0   1   1   2   4   5   8  10
d               0   2   1   5   4   5   9
e                   0   1   1   2   5   5
f                       0   2   1   2   4
g                           0   1   4   2
h                               0   1   1
i                                   0   2
j                                       0

The initial cluster centers will be “a” and “j”.

Iteration 1:
Distance from points to cluster centroids:
Cluster 1 Cluster 2 Cluster assignment
a 0.0000 5.0990 1
b 2.2360 3.6055 1
c 2.0000 3.1622 1
d 2.2360 3.0000 1
e 3.0000 2.2360 2
f 3.1622 2.0000 2
g 4.0000 1.4142 2
h 4.1231 1.0000 2
i 4.4721 1.4142 2
j 5.0990 0.0000 2

Cluster 1: a b c d centroid: (2, 1.5)


Cluster 2: e f g h i j centroid (2.8333, 3.8333)

Iteration 2:
Distance from points to cluster centroids:
Cluster 1 Cluster 2 Cluster assignment
a 1.5000 3.9228 1
b 1.1180 2.5927 1
c 0.5000 2.0138 1
d 1.1180 1.8408 1
e 1.5000 1.1785 2
f 1.8027 0.8498 2
g 2.5000 0.8498 2
h 2.6925 0.2357 2
i 3.2015 1.1785 2
j 3.6400 1.1785 2

Cluster 1: a b c d centroid: (2, 1.5)


Cluster 2 e f g h i j centroid (2.8333, 3.8333)
At this point, the cluster centroids are the same as the last iteration, and so k-means
would halt.
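For those who want to verify the arithmetic, here is a small stand-alone sketch of the same k-means run on the ten points, starting from a and j as the initial centroids (plain Java, no WEKA needed):

public class KMeansByHand {
    public static void main(String[] args) {
        double[][] pts = { {2,0},{1,2},{2,2},{3,2},{2,3},{3,3},{2,4},{3,4},{4,4},{3,5} };
        String labels = "abcdefghij";
        double[][] centroids = { {2, 0}, {3, 5} };   // start from points a and j

        int[] assign = new int[pts.length];
        for (int iter = 0; iter < 10; iter++) {      // more than enough iterations to converge
            // Assignment step: each point goes to the nearest centroid
            for (int i = 0; i < pts.length; i++) {
                assign[i] = dist(pts[i], centroids[0]) <= dist(pts[i], centroids[1]) ? 0 : 1;
            }
            // Update step: recompute each centroid as the mean of its cluster
            double[][] sums = new double[2][2];
            int[] counts = new int[2];
            for (int i = 0; i < pts.length; i++) {
                sums[assign[i]][0] += pts[i][0];
                sums[assign[i]][1] += pts[i][1];
                counts[assign[i]]++;
            }
            for (int c = 0; c < 2; c++) {
                centroids[c][0] = sums[c][0] / counts[c];
                centroids[c][1] = sums[c][1] / counts[c];
            }
        }
        for (int i = 0; i < pts.length; i++)
            System.out.println(labels.charAt(i) + " -> cluster " + (assign[i] + 1));
        System.out.printf("centroid 1: (%.4f, %.4f), centroid 2: (%.4f, %.4f)%n",
                centroids[0][0], centroids[0][1], centroids[1][0], centroids[1][1]);
    }

    static double dist(double[] p, double[] q) {
        return Math.sqrt((p[0] - q[0]) * (p[0] - q[0]) + (p[1] - q[1]) * (p[1] - q[1]));
    }
}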

Exercise 2:
The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.

(a) For each of the following sets of initial centroids, create two clusters
by assigning each point to the nearest centroid, and then calculate the total squared error
for each set of two clusters. Show both the clusters and the total squared error for each set
of centroids.
i. {18, 45}
First cluster is 6, 12, 18, 24, 30.
Error = 360.
Second cluster is 42, 48.
Error = 18.
Total Error = 378

ii. {15, 40}
First cluster is 6, 12, 18, 24.
Error = 180.
Second cluster is 30, 42, 48.
Error = 168.
Total Error = 348.

3) Classification

Exercise 1: k-nearest neighbor Algorithm

Consider the one-dimensional data set shown in the table below.


x     y
0.5   -
3.0   -
4.5   +
4.6   +
4.9   +
5.2   -
5.3   -
5.5   +
7.0   -
9.5   -

(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest
neighbors (using majority vote).
Answer:
1-nearest neighbor: +,
3-nearest neighbor: −,
5-nearest neighbor: +,
9-nearest neighbor: −.
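The majority votes can be checked with the short sketch below (plain Java; the arrays hold the ten points from the table):

import java.util.Arrays;
import java.util.Comparator;

public class OneDimKnn {
    public static void main(String[] args) {
        double[] x = {0.5, 3.0, 4.5, 4.6, 4.9, 5.2, 5.3, 5.5, 7.0, 9.5};
        char[]   y = {'-', '-', '+', '+', '+', '-', '-', '+', '-', '-'};
        double query = 5.0;

        // Sort the indices by distance to the query point
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(x[i] - query)));

        // Majority vote among the k nearest neighbours
        for (int k : new int[] {1, 3, 5, 9}) {
            int plus = 0;
            for (int i = 0; i < k; i++) if (y[idx[i]] == '+') plus++;
            System.out.println(k + "-nearest neighbours -> " + (plus > k - plus ? "+" : "-"));
        }
    }
}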

Exercise 2
Consider the data set shown below.

A B C Class
0 0 0 +
0 0 1 -
0 1 1 -
0 1 1 -
0 0 1 +
1 0 1 +
1 0 1 -
1 0 1 -
1 1 1 +
1 0 1 +
a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+),
P(A|−), P(B|−), and P(C|−).
Answer:
P(A = 1|−) = 2/5 = 0.4, P(B = 1|−) = 2/5 = 0.4,
P(C = 1|−) = 5/5 = 1, P(A = 0|−) = 3/5 = 0.6,
P(B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0;
P(A = 1|+) = 3/5 = 0.6, P(B = 1|+) = 1/5 = 0.2,
P(C = 1|+) = 4/5 = 0.8, P(A = 0|+) = 2/5 = 0.4,
P(B = 0|+) = 4/5 = 0.8, P(C = 0|+) = 1/5 = 0.2.

(b) Use the estimate of conditional probabilities given in the previous question
to predict the class label for a test sample (A = 0, B = 1, C = 0)
using the naïve Bayes approach.
Answer:
Let P(A = 0, B = 1, C = 0) = K.
P(+|A = 0, B = 1, C = 0)
= P(A = 0, B = 1, C = 0|+) × P(+) / P(A = 0, B = 1, C = 0)
= (P(A = 0|+) × P(B = 1|+) × P(C = 0|+) × P(+)) / K
= (0.4 × 0.2 × 0.2 × 0.5) / K
= 0.008/K.
P(−|A = 0, B = 1, C = 0)
= P(A = 0, B = 1, C = 0|−) × P(−) / P(A = 0, B = 1, C = 0)
= (P(A = 0|−) × P(B = 1|−) × P(C = 0|−) × P(−)) / K
= (0.6 × 0.4 × 0 × 0.5) / K
= 0/K.
Since 0.008/K > 0/K, the class label should be '+'.

KNOWLEDGEFLOW LAYOUT SAMPLES

Clustering Using KnowledgeFlow


k-Means Algorithm
a) Using cross-validation

b) Using hold-out method

Preprocessing Filters

a)

Classification

1. J48: This is an implementation of C4.5 / ID3 algorithm.

a) Using crossvalidation

Note: for any classification algorithm the same flowlayout can be followed. Just replace
J48 with appropriate algorithm.

PredictionAppender icon is used to see the predictions made by the algorithm on


unknown tuples.

b) Using hold-out method

Association Rules
a) Apriori

ArffLoader -- Apriori -- TextViewer

LAB EXERCISES WHICH CAN BE ASKED IN THE EXAM

1.
Dataset: cpu.arff
a. What is normalization? What are the various normalization techniques?
Illustrate normalization using WEKA on the given dataset and write down a
sample result.
b. Use KnowledgeFlow to illustrate the above question.

c. Dataset: Glass.arff
How many attributes are there in the dataset? What are their names? What is the class
attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-
validation to test its performance, leaving the number of folds at the default value of 10.
Recall that you can examine the classifier options in the Generic Object Editor window
that pops up when you click the text beside the Choose button. The default value of the
KNN field is 1: this sets the number of neighboring instances to use when classifying.
d. Use the Knowledge Flow canvas and develop a FlowLayout for k-means
execution on the above dataset.

2a. Dataset: cpu.arff
What is standardization? Explain. Illustrate standardization on the given dataset
using WEKA and write down a sample result. Show the same using
KnowledgeFlow.

b. iris.arff
For J48, compare cross-validated accuracy and the size of the trees generated for (1) the
raw data, (2) data discretized by the unsupervised discretization method in default mode,
and (3) data discretized by the same method with binary attributes.
Show the same using KnowledgeFlow.
c. Use experimenter to compare any two classifiers of your choice on iris
dataset.

3
a. Dataset: iris.arff. What is sampling? Explain various sampling techniques.
Illustrate sampling with replacement using WEKA on the given dataset and
write down a sample result.
b. Show the same using KnowledgeFlow.

c. Load the iris data using the Preprocess panel. Evaluate C4.5 on this data
using (a) Holdout method and (b) cross-validation. What is the estimated
percentage of correct classifications for (a) and (b)?

d. Use Knowledge flow and develop a FlowLayout for C4.5 execution

4. a. What is discretization? Explain how you discretize numeric variables. Illustrate
discretization using WEKA on the given dataset and write down a sample result.
b. Dataset: iris.arff
Explain the k-means algorithm. Perform clustering on the given dataset using the
holdout method in WEKA. Write down the results you get and interpret
your results.

e. Use the Knowledge Flow canvas and develop a FlowLayout for C4.5 execution.

5. a. Explain the terms absolute error, squared error, mean absolute error and root mean
squared error.
b. Dataset: weather.nominal.arff
Using the Apriori algorithm, calculate all frequent itemsets (L's) for the following data:
TID List of item_ids

T100 I1,I2,I5

T200 I2,I4

T300 I2,I3

T400 I1,I2,I4

T500 I1,I3

T600 I2,I3

T700 I1,I3

T800 I1,I2,I3,I5

T900 I1,I2,I3

Illustrate Apriori algorithm using WEKA and display the 10 most significant rules you
get using the default values of support and confidence.

Show the same using KnowledgeFlow.

c. Use experimenter to compare any two classifiers of your choice on iris dataset.

6.
a. Open the “weather.nominal” dataset
- How many instances (examples) are contained in the dataset?
- How many attributes are used to represent the instances?
- Which attribute is the class label?
- What is the data type of the attributes in the dataset?
- For each attribute and for each of its possible values, how many instances in
each class have the attribute value?
b. Select the Id3 classifier. Choose the “Cross-validation” (10 folds) test mode. Run the
Id3 classifier and observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the Mean Absolute Error made by the classifier?
- Visualize the classifier errors.
- Compare these results with those observed for the ZeroR classifier in the cross-
validation test mode. Which classifier, ZeroR or Id3, shows a better prediction
performance for the current dataset and the cross-validation test mode?
c. Use Knowledge flow canvas and develop a FlowLayout for C4.5 execution

7.
a. Dataset: cpu.arff
What is normalization? What are the various normalization techniques? Illustrate
normalization using WEKA on the given dataset and write down a sample result.
Show the same using KnowledgeFlow.

Dataset: cpu.arff
b. Answer the following questions:
gg. What types of variables are in this dataset (numeric / ordinal / categorical)?
hh. Load the data in Weka Explorer. Select the Classify tab. How many different
prediction algorithms are available (under functions)?
ii. Explain what prediction is in data mining.
jj. Choose the LinearRegression algorithm. Explain the principle of this
algorithm.
kk. Results of this algorithm can be interpreted in the following way. The first part of
the output represents the coefficients of the linear equation of the type
num = w0 + w1a1 + … + wkak.
The numbers provided in front of each attribute ak represent the wk. Based on this,
interpret the results you get from running LinearRegression on the dataset. What
is the equation of the line found?

Show the same using KnowledgeFlow.

c. Use experimenter to compare any two classifiers of your choice on iris dataset.

8.
a. What is discretization? Explain how you discretize numeric variables. Illustrate
discretization using WEKA on the given dataset and write down a sample result.

Show the same using KnowledgeFlow.

b. Choose the “Percentage split” (66% for training) test mode. Run the Id3 classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the Mean Absolute Error made by the classifier?
- Compare the results made by the Id3 classifier for the two considered test
modes, hold-out and cross-validation. In which test mode does the classifier
produce a better result (i.e., a smaller error)?
c. Use Knowledge flow canvas and develop a FlowLayout for k-means execution

9. a) Describe the decision tree induction algorithm.

Consider the following AllElectronics dataset:
Age          income   student   credit_rating   buys_computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no
Using WEKA, what class label will the decision tree algorithm J48 predict for the following
tuple?
(youth, medium, yes, fair, ?)

b) repeat the above using knowledgeFlow

10. Describe the naïve Bayes classification algorithm.

Consider the following AllElectronics dataset:

Age          income   student   credit_rating   buys_computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no

Using WEKA, what class label will naïve Bayes predict for the following tuple?
(youth, medium, yes, fair, ?)

11. a) Consider the labor dataset. Use appropriate WEKA filters and replace missing
values in the dataset. What are the filters you will use?

b) Suppose a group of 12 sales price records has been sorted as follows:
5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering

12. The 'database' below has four transactions. Use Apriori to find all frequent itemsets.
What association rules can be found in this set using WEKA, if the minimum support (i.e.,
coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?
Trans_id Itemlist
T1 {K, A, D, B}
T2 {D, A C, E, B}
T3 {C, A, B, E}
T4 {B, A, D}

13. Load the data file weather.arff.

In the Explorer window, note things like:
• How many attributes are there?
• How many instances are there?
• What is the type of the attributes?

Classify the data set weather.arff
• Choose the classifier J4.8.
Look at the classification output and try to interpret the results:
• What is 10-fold cross-validation?
• What is the tree size?
• How many correctly classified instances did J4.8 produce (in percentage)?
• Did you notice the accuracy and the confusion matrix? Write them down and interpret them.

Change classifier to OneR in Rules and compare the results with J4.8.

14. a) Suppose that we have the following data:
a(2, 0) b(1,2) c(2,2) d(3,2) e(2,3) f(3,3) g(2,4) h(3,4) i(4,4) j(3,5)
Identify the cluster by applying the k-means algorithm, with k=2. Try using initial cluster
centers as far apart as possible.
b) Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal,
trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.

15. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?

16. Use the two methods below to normalize the following group of data (a short sketch
applying both formulas follows this question):
200; 300; 400; 600; 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
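A minimal sketch applying both formulas to these five values is given below (plain Java):

public class NormalizeValues {
    public static void main(String[] args) {
        double[] v = {200, 300, 400, 600, 1000};

        // Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
        double min = v[0], max = v[0];
        for (double d : v) { min = Math.min(min, d); max = Math.max(max, d); }
        System.out.print("min-max: ");
        for (double d : v) System.out.printf("%.3f ", (d - min) / (max - min));
        System.out.println();

        // Z-score normalization: v' = (v - mean) / stdDev
        double mean = 0;
        for (double d : v) mean += d;
        mean /= v.length;
        double var = 0;
        for (double d : v) var += (d - mean) * (d - mean);
        double std = Math.sqrt(var / v.length);   // population standard deviation
        System.out.print("z-score: ");
        for (double d : v) System.out.printf("%.3f ", (d - mean) / std);
        System.out.println();
    }
}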

17. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0,
1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.

(d) Comment on which method you would prefer to use for the given data, giving reasons
as to why.

18. Suppose a group of 12 sales price records has been sorted as follows:
5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning

19. Describe the k-nearest neighbor classification algorithm.

Consider the following AllElectronics dataset:

Age          income   student   credit_rating   buys_computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no

Using WEKA, what class label will the kNN algorithm predict for the following tuple?
(youth, medium, yes, fair, ?)

20. What is linear regression? What is it used for? Explain.
Consider the dataset below, which captures the size of houses and their price in
rupees. Build a regression equation and predict the price for the following house:
(1002, ?)
Size_of_House Price
2104 460
1416 232
1534 315
852 178
2287 367
568 100
989 187

21. What is linear regression? What is it used for? Explain.

Consider the dataset below, where x is the number of years of working experience of a college
graduate and y is the corresponding salary of the graduate. Build a regression equation
and predict the salary of a college graduate whose experience is 10 years. (A least-squares
sketch follows the table.)
Salary data
X years experience Y salary in($1000s)
3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
16 83
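The regression coefficients can be computed with the standard least-squares formulas. The sketch below uses the salary data from the table and predicts the salary for 10 years of experience (plain Java):

public class SalaryRegression {
    public static void main(String[] args) {
        double[] x = {3, 8, 9, 13, 3, 6, 11, 21, 1, 16};       // years of experience
        double[] y = {30, 57, 64, 72, 36, 43, 59, 90, 20, 83}; // salary in $1000s

        double xBar = mean(x), yBar = mean(y);
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xBar) * (y[i] - yBar);
            den += (x[i] - xBar) * (x[i] - xBar);
        }
        double w1 = num / den;          // slope
        double w0 = yBar - w1 * xBar;   // intercept

        System.out.printf("y = %.2f + %.2f x%n", w0, w1);
        System.out.printf("predicted salary for 10 years of experience: %.1f (in $1000s)%n",
                w0 + w1 * 10);
    }

    static double mean(double[] v) {
        double s = 0;
        for (double d : v) s += d;
        return s / v.length;
    }
}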

22. A database has five transactions. Use Apriori to find all frequent itemsets. What
association rules can be found in this set using WEKA, if the minimum support (i.e.,
coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?
TID Items_bought

T100 {M,O,N,K,E,Y}

T200 {D,O,N,K,E,Y}

T300 {M,A,K,E}

T400 {M,U,C,K,Y}

T500 {C,O,O,K,I,E}

23. Let’s assume that we have collected the following data set of users who decided to
buy a computer and others who decided not to.
UserID Age Income Student Credit_Rating Buy_Computer
1 Young High No Fair No
2 Young High No Excellent No
3 Medium High No Fair Yes
4 Old Medium No Fair Yes
5 Old Low Yes Fair Yes
6 Old Low Yes Excellent No
7 Medium Low Yes Excellent Yes
8 Young Medium No Fair No
9 Young Low Yes Fair Yes
10 Old Medium Yes Fair Yes
11 Young Medium Yes Excellent Yes
12 Medium Medium No Excellent Yes
13 Medium High Yes Fair Yes
14 Old Medium No Excellent No
15 Medium Medium Yes Fair No
16 Medium Medium Yes Excellent Yes
17 Young Low Yes Excellent Yes
18 Old High No Fair No
19 Old Low No Excellent No
20 Young Medium Yes Excellent Yes

• We want to predict, for each of the following users, if she/he will buy a computer or
not.
- User #21. A young student with medium income and fair credit rating.
- User #22. A young non-student with low income and fair credit rating.
- User #23. A medium student with high income and excellent credit rating.
- User #24. An old non-student with high income and excellent credit rating.
Use WEKA and give the predictions
