Data Mining Lab Manual
AIM: This lab will familiarize students with the process of KDD and Data Mining
concepts, and also enable them to work with the WEKA tool. The version we use is
WEKA 3.7.5. It can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/index.html
Textbook: Data Mining: Practical Machine Learning Tools and Techniques (Second
Edition), by Ian H. Witten, Eibe Frank, and Mark A. Hall.
The WEKA software uses a native file format called ARFF, which is described separately below. It can also
open other file formats, such as .csv.
Attribute-Relation File Format (ARFF)
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list
of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato
for use with the Weka machine learning software.
Overview
ARFF files have two distinct sections. The first section is the Header information, which
is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset
looks like this:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data section of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and
@DATA declarations are case insensitive.
Examples
Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
Attribute declarations take the form of an ordered sequence of @attribute statements. Each
attribute has its own @attribute statement, which defines the name of that attribute and its
data type. The format is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be
included in the name then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by
Weka:
numeric
<nominal-specification>
string
date [<date-format>]
Numeric attributes
Numeric attributes are specified by the keyword numeric.
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the possible
values: {<nominal-name1>, <nominal-name2>, ...}. For example, the class value of the Iris
dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary textual values. This is
very useful in text-mining applications, as we can create datasets with string attributes,
then write Weka Filters to manipulate strings (like StringToWordVectorFilter). String
attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used by
SimpleDateFormat). The default format string accepts the ISO-8601 combined date and
time format: "yyyy-MM-dd'T'HH:mm:ss".
Dates must be specified in the data section as the corresponding string representations of
the date/time (see example below).
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data declaration is a single line denoting the start of the data segment in the file.
The format is:
@data
Each instance is represented on a single line, with carriage returns denoting the end of the
instance.
Attribute values for each instance are delimited by commas. They must appear in the
order that they were declared in the header section (i.e. the data corresponding to the nth
@attribute declaration is always the nth field of the instance).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space
must be quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the
attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Sparse ARFF files
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not
explicitly represented.
Sparse ARFF files have the same header (i.e. @relation and @attribute tags) but the
data section is different. Instead of representing each value in order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their value stated,
like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is: <index>
<space> <value> where index is the attribute index (starting from 0).
Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a
value is unknown, you must explicitly represent it with a question mark (?).
Warning: There is a known problem saving SparseInstance objects from datasets that
have string attributes. In Weka, string and nominal data values are stored as numbers;
these numbers act as indexes into an array of possible attribute values (this is very
efficient). However, the first string value is assigned index 0: this means that, internally,
this value is stored as a 0. When a SparseInstance is written, string instances with internal
value 0 are not output, so their string value is lost (and when the arff file is read again, the
default value 0 is the index of a different string value, so the attribute value appears to
change). To get around this problem, add a dummy string value at index 0 that is never
used whenever you declare string attributes that are likely to be used in SparseInstance
objects and saved as Sparse ARFF files.
INSTRUCTIONS ON HOW TO WORK WITH WEKA
WEKA Explorer
Classification using Explorer
Load data -- select "Use training set" -- choose an algorithm -- Start -- save the model.
Then: select "Supplied test set" -- Open the test data file -- More options -- Output
predictions -- select CSV format and an output directory -- load the saved model --
Re-evaluate model on current test set.
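For those who prefer to script this workflow, a rough equivalent using the Weka Java API is sketched below (a sketch only: the file names data/weather.arff, data/weather-test.arff and weather.model are placeholders, and J48 merely stands in for whatever algorithm you chose in the Explorer):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ExplorerWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Load the training data; the last attribute is taken as the class.
        Instances train = DataSource.read("data/weather.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // "Use training set" + "Start": build the classifier on all training instances.
        Classifier cls = new J48();
        cls.buildClassifier(train);

        // "Save model": serialize the trained classifier to disk.
        SerializationHelper.write("weather.model", cls);

        // Later: "Load model" and "Re-evaluate model on current test set".
        Classifier loaded = (Classifier) SerializationHelper.read("weather.model");
        Instances test = DataSource.read("data/weather-test.arff"); // hypothetical test file
        test.setClassIndex(test.numAttributes() - 1);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(loaded, test);
        System.out.println(eval.toSummaryString());
        // Per-instance predictions ("Output predictions") can be obtained by calling
        // loaded.classifyInstance(...) on each test instance and writing the results to a CSV file.
    }
}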
WEKA Experimenter
Setting up an Experiment in Weka
1. Open the Experimenter from the main GUI.
2. Click New, then choose weather.arff by clicking Add new under Datasets.
3. Click Add new under Algorithms and choose the J48 classifier and then OneR.
4. Browse and name the output file under Results Destination.
5. Start running the experiment by choosing the Run tab and clicking Start.
6. Start analyzing the results by choosing the Analyse tab.
7. Browse for the output file to analyze under Source, or press the Experiment button
to load the latest experiment results.
8. Configure your analysis under Configure test and compare measures such as
Percent correct, Percent incorrect, and others.
WEKA EXERCISES
Data Preprocessing
a). Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute
ranking methods with default parameters.
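The exercise does not name the two ranking methods, so as an illustration only, the sketch below ranks the attributes with InfoGainAttributeEval and GainRatioAttributeEval through the Weka Java API (the data path is an assumption):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes by information gain with respect to the class.
        AttributeSelection infoGain = new AttributeSelection();
        infoGain.setEvaluator(new InfoGainAttributeEval());
        infoGain.setSearch(new Ranker());
        infoGain.SelectAttributes(data);
        System.out.println(infoGain.toResultsString());

        // Repeat with gain ratio and compare the two orderings.
        AttributeSelection gainRatio = new AttributeSelection();
        gainRatio.setEvaluator(new GainRatioAttributeEval());
        gainRatio.setSearch(new Ranker());
        gainRatio.SelectAttributes(data);
        System.out.println(gainRatio.toResultsString());
    }
}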
Evaluation
Once you have performed the experiments, you should spend some time evaluating your
results. In particular, try to answer at least the following questions: Why would one need
attribute relevance ranking? Do these attribute-ranking methods often agree or disagree?
On which data set(s), if any, do these methods disagree? Does discretization (and the
discretization method used) affect the results of attribute ranking? Do missing values affect the results of attribute
ranking? Record these and any other observations in a Word file called
“Observations.doc”.
Exercise 2
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch
the Explorer window and select the "Preprocess" tab.
2. Open the iris data set ("iris.arff"; this should be in the ./data/ directory of the Weka
install). What information do you have about the data set (e.g. number of instances,
attributes and classes)? What type of attributes does this data set contain (nominal or
numeric)? What are the classes in this data set? Which attribute has the greatest standard
deviation? What does this tell you about that attribute? (You might also find it useful to
open "iris.arff" in a text editor.)
3. Under "Filter" choose the "Standardize" filter and apply it to all attributes. What does it
do? How does it affect the attributes' statistics? Click "Undo" to un-standardize the data,
then apply the "Normalize" filter to all the attributes. What does it do? How does it affect
the attributes' statistics? How does it differ from "Standardize"? Click "Undo" again to
return the data to its original state. (A programmatic sketch of these two filters is given
after this exercise.)
4. At the bottom right of the window there should be a graph which visualizes the data
set. Making sure "Class: class (Nom)" is selected in the drop-down box, click "Visualize
All". What can you interpret from these graphs? Which attribute(s) discriminate best
between the classes in the data set? How do the "Standardize" and "Normalize"
filters affect these graphs?
5. Under "Filter" choose the "AttributeSelection" filter. What does it do? Are the attributes
it selects the same as the ones you chose as discriminatory above? How does its behavior
change as you alter its parameters?
6. Select the "Visualize" tab. This shows you 2D scatter plots of each attribute against
each other attribute (similar to the F1 vs F2 plots from tutorial 1). Make sure the drop-down
box at the bottom says "Color: class (Nom)". Pay close attention to the plots
between attributes you think discriminate best between classes, and the plots
between attributes selected by the "AttributeSelection" filter. Can you verify from these
plots whether your thoughts and the "AttributeSelection" filter are correct? Which
attributes are correlated?
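As noted in step 3, a minimal sketch of the same two filters applied through the Weka Java API is given below (the data path is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.Standardize;

public class ScalingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");

        // Standardize: each numeric attribute ends up with zero mean and unit variance.
        Standardize standardize = new Standardize();
        standardize.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, standardize);

        // Normalize: each numeric attribute is rescaled to the [0, 1] interval.
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, normalize);

        // Compare the attribute statistics of the two results with the original data.
        System.out.println(standardized.toSummaryString());
        System.out.println(normalized.toSummaryString());
    }
}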
Exercise 3
1. Download the Old Faithful data set.
2. Open this data in Excel. There are 2 attributes and 2 classes. Sort the data by class
(be careful to sort the entire row), line or bar plot each of the features individually and
save the graphs in a Word file. What do you notice on the plots from a visual inspection?
3. For each feature (per class), compute its minimum, maximum, mean and
standard deviation.
4. Generate a pairwise scatter plot for the combination F1 vs F2. Can you visually
guess whether these attributes are related or not?
5. Based on the scatter plot generated in point 4, determine the data points that are
outliers (extreme high or low values). Do this manually by visually inspecting the scatter
plot; remove at least 5 points.
6. Compute the correlation between the features for each class separately and
create a correlation matrix. What does it show?
7. Normalise all features to the range [0, 1]. There are several ways this can be done; we
will use standard min-max normalization. Recompute step 6: has it made a difference?
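For step 7, a brief reminder of the standard min-max normalization (generic notation, not specific to this dataset): each value v of a feature A is rescaled to v' = (v - min_A) / (max_A - min_A), so the smallest observed value of the feature maps to 0 and the largest to 1.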
Clustering
Exercise 1) Clustering using K-Means
Get to the Weka Explorer environment and load the training file using the Preprocess
mode. Try first with weather.arff. Get to the Cluster mode (by clicking on the Cluster
tab) and select a clustering algorithm, for example SimpleKMeans. Then click on Start
and you get the clustering result in the output window. The actual clustering for this
algorithm is shown as one instance for each cluster representing the cluster centroid.
Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data
Number of iterations: 4
Within cluster sum of squared errors: 16.156838252701938
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE yes
Std Devs: N/A 6.5014 7.5593 N/A N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE yes
Std Devs: N/A 6.1128 11.143 N/A N/A
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 32.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE yes
Std Devs: N/A 6.5014 7.5593 N/A N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE yes
Std Devs: N/A 6.1128 11.143 N/A N/A
Clustered Instances
0 8 ( 57%)
1 6 ( 43%)
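For reference, roughly the same run can be reproduced from the Weka Java API; the sketch below mirrors the scheme line above (-N 2 -S 10, i.e. two clusters and random seed 10), with the data path as an assumption:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.arff");

        // Two clusters, seed 10, as in "weka.clusterers.SimpleKMeans -N 2 -S 10".
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);
        kMeans.setSeed(10);
        kMeans.buildClusterer(data);

        // "Use training set" evaluation: cluster the same data and print sizes and centroids.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kMeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}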
Evaluation
The way Weka evaluates the clusterings depends on the cluster mode you select. Four
different cluster modes are available (as buttons in the Cluster mode panel):
1. Use training set (default). After generating the clustering Weka classifies the
training instances into clusters according to the cluster representation and
computes the percentage of instances falling in each cluster. For example, the
above clustering produced by k-means shows 57% (8 instances) in cluster 0 and
43% (6 instances) in cluster 1.
2. In Supplied test set or Percentage split Weka can evaluate clusterings on
separate test data if the cluster representation is probabilistic (e.g. for EM).
3. Classes to clusters evaluation. In this mode Weka first ignores the class attribute
and generates the clustering. Then during the test phase it assigns classes to the
clusters, based on the majority value of the class attribute within each cluster.
Then it computes the classification error, based on this assignment and also shows
the corresponding confusion matrix. An example of this for k-means is shown
below.
Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
Ignored:
play
Test mode: Classes to clusters evaluation on training data
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 11.156838252701938
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE
Std Devs: N/A 6.1128 11.143 N/A
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 22.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE
Std Devs: N/A 6.1128 11.143 N/A
Clustered Instances
0 8 ( 57%)
1 6 ( 43%)
EM
Clustered Instances
0 4 ( 29%)
1 10 ( 71%)
Cluster 0 <-- no
Cluster 1 <-- yes
Classification
Classification Exercises
Exercise 1.
4. Under "Test options" select "Use training set" and under "More options" check "Output
predictions". Now click "Start" to start training the model. You should see a stream of
output appear in the window named "Classifier output". What do each of the following
sections tell you about the model?
(a) "Predictions on ..."
(b) "Summary"
(c) "Detailed accuracy by class"
(d) "Confusion matrix"
5. Under "Results list" you should see your model; right click on it and select "Visualize
classifier errors". Points marked with a square are errors, i.e. incorrectly classified
instances. How do you think the classifier performed on the test data?
6. Under "Test options" vary the option selected, i.e. "Cross-validation" or "Percentage
split", and their parameters, i.e. "Folds" and "%". Then start the training phase again for
each model. For each model analyse the classifier output and visualize the classifier
errors. How do the different training techniques affect the model? Which technique
performed the best? How does this compare to your initial prediction in 4?
7. Repeat exercise 6 with the "J48" (decision tree) and "RBFNetwork" classifiers.
How do these compare to each other? How do these compare to the
"MultilayerPerceptron"?
Below is the output from Weka when using the weka.classifiers.trees.J48 classifier with
the file $WEKAHOME/data/iris.arff as a training file and no testing file, i.e. using a
command of the form:
java weka.classifiers.trees.J48 -t $WEKAHOME/data/iris.arff
In square brackets ([ ]) there are comments on how to interpret the output.
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
[ Above is the decision tree constructed by the J48 classifier. This indicates how the
classifier uses the attributes to make a decision. The leaf nodes indicate which class an
instance will be assigned to should that node be reached. The numbers in brackets after
the leaf nodes indicate the number of instances assigned to that node, followed by how
many of those instances are incorrectly classified as a result. With other classifiers some
other output will be given that indicates how the decisions are made, e.g. a rule set. Note
that the tree has been pruned. An unpruned tree can be produced by using the "-U"
option. ]
[ This gives the error levels when applying the classifier to the training data it was
constructed from. For our purposes the most important figures here are the numbers of
correctly and incorrectly classified instances. With the exception of the Kappa statistic,
the remaining statistics compute various error measures based on the class probabilities
assigned by the tree. ]
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
[ This shows for each class, how instances from that class received the various
classifications. E.g. for class "b", 49 instances were correctly classified but 1 was put into
class "c". ]
[ This gives the error levels during a 10-fold cross-validation. The "-x" option can be
used to specify a different number of folds. The correctly/incorrectly classified instances
refers to the case where the instances are used as test data and again are the most
important statistics here for our purposes. ]
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
[ This is the confusion matrix for the 10-fold cross-validation, showing what
classification the instances from each class received when it was used as testing data. E.g.
for class "a" 49 instances were correctly classified and 1 instance was assigned to class
"b". ]
Prediction:
Linear regression
Linear regression can be very useful in association analysis of numerical values; in fact,
regression analysis is a powerful approach to modeling the relationship between a
dependent variable and one or more independent variables. Simple regression is when we
predict from one independent variable, and multiple regression is when we predict from
more than one independent variable. The model we attempt to fit is a linear one, which is,
very simply, drawing a line through the data. Of all the lines that can possibly be drawn
through the data, we are looking for the one that best fits the data. In fact, we look to find
a line that best satisfies
y = β0 + β1x + ε
The most accurate model is therefore the one that yields the best-fitting line to the data in
question: we are looking for the minimal sum of squared deviations between the actual
and fitted values, which is called the method of least squares. Now that we have briefly
reminded ourselves of the very basics of regression, let's move directly on to an example
in Weka.
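As a brief reminder of the textbook least-squares solution for the simple (one-variable) case: the fitted slope and intercept are b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1·x̄, i.e. the values that minimize the sum of squared deviations Σ(yi - b0 - b1·xi)².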
Exercise 1
(a) In Weka go back to the "Preprocess" tab. Open the iris data set ("iris.arff"; this should
be in the ./data/ directory of the Weka install).
(b) In the "Attributes" section (bottom left of the screen) select the "class" attribute and
click "Remove". We need to do this, as simple linear regression cannot deal with
non-numeric values.
(c) Next select the "Classify" tab to get into the classification perspective of Weka, and
choose "LinearRegression" (under "functions").
(d) Clicking on the textbox next to the "Choose" button brings up the parameter editor
window. Click on the "More" button to get information about the parameters. Make sure
that "attributeSelectionMethod" is set to "No attribute selection" and
"eliminateColinearAttributes" is set to "False".
(e) Finally make sure that you select the attribute "petalwidth" in the drop-down box
just under "Test options". Hit Start to run the regression.
Inspect the results; in particular, pay attention to the Linear Regression Model formula
returned, and the coefficients and intercept of the straight-line equation. As this is a
numeric prediction/regression problem, accuracy is measured with Root Mean Squared
Error, Mean Absolute Error and the like. You can repeat this process for regressing each
of the other features in turn, and compare how well the different features can be predicted.
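A rough Java equivalent of steps (a) to (e) is sketched below, for reference only (the data path is an assumption; the parameter constants are those of weka.classifiers.functions.LinearRegression in Weka 3.x):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RegressionSketch {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("data/iris.arff");

        // Step (b): drop the nominal class attribute (the last one).
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(iris);
        Instances numericOnly = Filter.useFilter(iris, remove);

        // Step (e): regress petalwidth (now the last remaining attribute) on the other features.
        numericOnly.setClassIndex(numericOnly.numAttributes() - 1);

        // Steps (c) and (d): LinearRegression with no attribute selection or collinearity removal.
        LinearRegression lr = new LinearRegression();
        lr.setEliminateColinearAttributes(false);
        lr.setAttributeSelectionMethod(
                new SelectedTag(LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));
        lr.buildClassifier(numericOnly);
        System.out.println(lr); // the fitted linear model (coefficients and intercept)

        // Error measures on the training data (MAE, RMSE, correlation coefficient).
        Evaluation eval = new Evaluation(numericOnly);
        eval.evaluateModel(lr, numericOnly);
        System.out.println(eval.toSummaryString());
    }
}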
Exercise 2
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “cpu” dataset (i.e., contained in the “cpu.arff” file).
- For each attribute and for each of its possible values, how many instances in
each class have the feature value (i.e., the class distribution of the feature
values)?
• Go to the “Classify” tab. Select the SimpleLinearRegression learner. Choose
“Percentage split” (66% for training) test mode. Run the classifier and observe the
results shown in the “Classifier output” window.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression
function?
- Visualize the errors made by the learned regression function. In the plot, how
can you see the detailed information of a predicted instance?
• Now, in the “Test options” panel select the “Cross-validation” option (10 folds). Run
the classifier and observe the results shown in the “Classifier output” window.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression
function?
- Visualize the errors made by the learned regression function. In the plot, how
can you see the detailed information of a predicted instance?
Association Rules
In this tutorial we will first look at association rules, using the APRIORI algorithm in
Weka. APRIORI works with categorical values only. Therefore we will use a different
dataset called "adult". This dataset contains census data about 48,842 US adults. The aim
is to predict whether their income exceeds $50000. The dataset is taken from the Delve
website, and originally came from the UCI Machine Learning Repository. More
information about it is available in the original UCI Documentation.
This dataset is not immediately ready for use with APRIORI. First, reduce its size by
taking a random sample. You can do this with the 'ResampleFilter' in the preprocess tab
sheet: click on the label under 'Filters', choose 'ResampleFilter' from the drop down
menu, set the 'sampleSizePercentage' (e.g. to 15), click 'OK' and 'Add', and click 'Apply
Filters'. The 'Working relation' is now a subsample of the original adult dataset. Now we
have to get rid of the numerical attributes. You can choose to discard them, or to
discretise them. We will discretise the first attribute ('age'): choose the 'DiscretizeFilter',
set 'attributeIndices' to 'first', bins to a low number, like 4 or 5, and the other options to
'False'. Then add this new filter to the others. We will get rid of the other numerical
attributes: choose an 'AttributeFilter', set 'invertSelection' to 'False', and enter the indices
of the remaining numeric attributes (3,5,11-13). Apply all the filters together now. Then
click on 'Replace' to make the resulting 'Working relation' the new 'Base relation'.
Now go to the 'Associate' tab sheet and click under 'Associator'. Set 'numRules' to 25, and
keep the other options on their defaults. Click 'Start' and observe the results. What do you
think about these rules? Are they useful?
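The same preprocessing and mining steps can also be scripted. In recent Weka versions the old 'ResampleFilter', 'DiscretizeFilter' and 'AttributeFilter' labels correspond to the Resample, Discretize and Remove filters, so the sketch below uses those (the adult.arff file name and the attribute indices 3,5,11-13 are taken from the text; everything else is an assumption):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.Resample;

public class AprioriSketch {
    public static void main(String[] args) throws Exception {
        Instances adult = DataSource.read("adult.arff");

        // 1. Take a 15% random sample of the instances.
        Resample resample = new Resample();
        resample.setSampleSizePercent(15);
        resample.setInputFormat(adult);
        Instances reduced = Filter.useFilter(adult, resample);

        // 2. Discretize the first attribute (age) into a small number of bins.
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("first");
        discretize.setBins(4);
        discretize.setInputFormat(reduced);
        Instances discretized = Filter.useFilter(reduced, discretize);

        // 3. Remove the remaining numeric attributes (indices from the text).
        Remove remove = new Remove();
        remove.setAttributeIndices("3,5,11-13");
        remove.setInputFormat(discretized);
        Instances categorical = Filter.useFilter(discretized, remove);

        // 4. Mine 25 association rules with Apriori, keeping the other options at their defaults.
        Apriori apriori = new Apriori();
        apriori.setNumRules(25);
        apriori.buildAssociations(categorical);
        System.out.println(apriori);
    }
}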
From the previous, it is obvious that some attributes should not be examined
simultaneously because they lead to trivial results. Go back to the 'Preprocess' sheet. If
you have replaced the original 'Base relation' by the 'Working relation', you can include
and exclude attributes very easily: delete all filters from the 'Filters' window, then remove
the check mark next to the attributes you want to get rid of and click 'Apply Filters'. You
now have a new 'Working relation'. Try to remove different combinations of the
attributes that lead to trivial association rules. Run APRIORI several times and look for
interesting rules. You will find that there is often a whole range of rules which are all
based on the same simpler rule. Also, you will often get rules that don't include the target
class. This is why in most cases you would use APRIORI for dataset exploration rather
than for predictive modelling.
Exercise 2
Association analysis is concerned with discovering interesting correlations or other
relationships between variables in large databases. We are interested in relationships
between features themselves, rather than between features and the class as in the standard
classification problem setting. Hence searching for association patterns is no different
from classification except that instead of predicting just the class, we try to predict
arbitrary attributes or attribute combinations.
1. Fire up the Weka software, launch the Explorer window and select the "Preprocess" tab.
Open the weather.nominal data set ("weather.nominal.arff"; this should be in the ./data/
directory of the Weka install).
Weka has three built-in association rule learners: "Apriori", "Predictive Apriori" and
"Tertius". However, they are not capable of handling numeric data, which is why in this
exercise we use the nominal weather data.
(a) Select the "Associate" tab to get into the association rule mining perspective of Weka.
Under "Associator" select and run each of the following: "Apriori", "Predictive Apriori"
and "Tertius".
Briefly inspect the output produced by each associator and try to interpret its meaning.
(b) In association rule mining the number of possible association rules can be very large
even with tiny datasets, hence it is in our best interest to reduce the count of rules found
to only the most interesting ones. This is usually achieved by setting minimum thresholds
on support and confidence values. Still in the "Associate" view, select the "Apriori"
algorithm again, click on the textbox next to the "Choose" button and try, in turn,
different values for the following parameters: "lowerBoundMinSupport" (minimum
threshold for support) and "minMetric" (minimum threshold for confidence). As you
change these parameter values, what do you notice about the rules that are found by the
associator? Note that the parameter "numRules" limits the maximum number of rules
that the associator looks for; you can try changing this value.
(c) This time run the Apriori algorithm with the "outputItemSets" parameter set to true.
You will notice that the algorithm now also outputs a list of "Generated sets of large
itemsets:" at different levels. If you have the module's Data Mining book by Witten &
Frank with you, you can compare and contrast the Apriori associator's output with
the association rules on pages 114-116 (I will have a couple of copies circulating in the lab
during the session; just ask me for one). I also strongly recommend reading through
chapter 4.5 in your own time, while playing with the weather data in Weka; this chapter
gives a nice and easy introduction to association rules. Notice in particular how the item
sets and association rules from Weka compare with tables 4.10-4.11 in the book.
(d) Compare the association rules output from Apriori and Tertius (you can do this by
navigating through the already built associator models in the "Result list" on the right
side of the screen).
Make sure that the Apriori algorithm shows at least 20 rules. Think about how the
association rules generated by the two different methods compare to each other.
Something to always remember with association rules is that they should not be used for
prediction directly, that is, without further analysis or domain knowledge, as they do not
necessarily indicate causality.
They are, however, a very helpful starting point for further exploration and for building a
better understanding of our data.
As you should certainly know by this point, a correlation matrix and a scatter plot matrix
can be very useful for identifying associations between parameters.
The dataset studied is the weather dataset from Weka’s data folder
The goal of this data mining study is to find strong association rules in the
weather.nominal dataset. Answer the following questions:
a. What type of variables are in this dataset (numeric / ordinal / categorical) ?
b. Load the data in Weka Explorer. Select the Associate tab. How many different
association rule mining algorithms are available?
c. Choose the Apriori algorithm with the following parameters (which you can set by
clicking on the chosen algorithm): support threshold = 15%
(lowerBoundMinSupport = 0.15), confidence threshold = 90% (metricType =
confidence, minMetric = 0.9), number of rules = 50 (numRules = 50). After
starting the algorithm, how many rules do you find? Could you use the regular
weather dataset to get the results? Explain why.
d. Paste a screenshot of the Explorer window showing at least the first 20 rules.
e. Define the concepts of support, confidence, and lift for a rule. Write here the first
rule discovered. What is its support? Its confidence? Interpret the meaning of
these terms and this rule in this particular example.
f. Apriori algorithm generates association rules from frequent itemsets. How many
itemsets of size 4 were found? Which rule(s) have been generated from itemset of
size 4 (temperature=mild, windy=false, play=yes, outlook=rainy)? List their
numbers in the list of rules.
b. Is there an entity identification or schema integration problem in this dataset? If
yes, how to fix it?
c. Is there a redundancy problem in this dataset? If yes, how to fix it?
d. Are there data value conflicts in this dataset? If yes, how to fix them?
e. Integrate the two datasets into one single dataset, which will be used as a starting point
for the next questions, and load it in the Explorer. How many instances do you have?
How many attributes? (You could do this using Excel or another spreadsheet program.
First, save your individual files as CSV files in Weka and open them in a spreadsheet
program. Copy the rows from one file to the other and save the merged file as CSV, then
open it in Weka. Take care of the above questions and think about rectifying
potential problems.)
f. Paste a screenshot of the Explorer window.
Data cleaning can be accomplished in Weka by applying filters to the data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values, and
which Weka filters implement them – if available. Remove the missing values with the
method of your choice, explaining which filter you are using and why you make this
choice.
b. Noisy data. List the methods seen in class for dealing with noisy data, and which
Weka filters implement them – if available.
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.
Case Study #1
Data Preprocessing with Weka
The goal of this case study is to investigate how to preprocess data using Weka data
mining tool.
This assignment will be using Weka data mining tool. Weka is an open source Java
development environment for data mining from the University of Waikato in New
Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/,
Weka is really an asset for learning data mining because it is freely available, students
can study how the different data mining models are implemented, and develop
customized Java data mining applications. Moreover, data mining results from Weka can
be published in the most respected journals and conferences, which makes it a de facto
development environment of choice for research in data mining, where researchers often
need to develop new data mining methods.
Weka can be used in four different modes: through a command line interface (CLI),
through a graphical user interface called the Explorer, through the Knowledge Flow, and
through the Experimenter. The Knowledge Flow allows processing large datasets in an
incremental manner, while the other modes can only process small to medium sized
datasets. The Experimenter provides an environment for testing and comparing several
data mining algorithms.
The explanations for this assignment focus on Explorer processing of a dataset,
nevertheless the CLI can produce the same functionality, and thus can be chosen as an
alternative. Moreover, this assignment will use only the data preprocessing capabilities of
Weka, which may only require some Java development, whereas similar functionality in
SPSS/CLEMENTINE would require mastering a more complex suite of functions, and
learning a new programming language, called CLEM.
The dataset studied is the heart disease dataset from UCI repository (datasets-UCI.jar).
Two different datasets are provided: heart-h.arff (Hungarian data), and heart-c.arff
(Cleveland data). These datasets describe factors of heart disease. They can be
downloaded from: http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.
The data mining project goal is to better understand the risk factors for heart
disease, as represented in the 14th attribute, num (<50 means no disease, and values
>50_1 to >50_4 represent increasing levels of heart disease).
The question on which this machine learning study concentrates is whether it is
possible to predict heart disease from the other known data about a patient. The data
mining task of choice to answer this question will be classification/prediction, and
several different algorithms will be used to find which one provides the best predictive
power.
We want to merge the two datasets into one, in a step called data integration. Revise the
ARFF notation from the tutorial, which is Weka's data representation language. Answer the
following questions:
g. Define what data integration means.
h. Is there an entity identification or schema integration problem in this dataset? If
yes, how to fix it?
i. Is there a redundancy problem in this dataset? If yes, how to fix it?
j. Are there data value conflicts in this dataset? If yes, how to fix them?
k. Integrate the two datasets into one single dataset, which will be used as a starting
point for the next questions, and load it in the Explorer. How many instances do
you have? How many attributes?
l. Paste a screenshot of the Explorer window.
Before preprocessing the data, an important step is to get acquainted with the data – also
called data understanding in CRISP-DM.
a. Stay in the Preprocess tab for now. Study for example the age attribute. What is
its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute. Is this figure provided in
Weka?
c. Specify which attributes are numeric, which are ordinal, and which are
categorical/nominal.
d. Interpret the graphic showing in the lower right corner of the Explorer. How can
you name this graphic? What do the red and blue colors mean (pay attention to
the pop-up messages that appear when dragging the mouse over the graphic)?
What does this graphic represent?
e. Visualize all the attributes in graphic format. Paste a screenshot.
f. Comment on what you learn from these graphics.
g. Switch to the Visualize tab. What is the term used in the textbook to name the
series of boxplots represented? By selecting the maximum jitter, and looking at
the num column – the last one – can you determine which attributes seem to be
the most linked to heart disease? Paste the boxplot representing the attribute you
find the most predictive of heart disease (Y) as a function of num (X).
h. Does any pair of different attributes seem correlated?
The datasets studied have already been processed by selecting a subset of attributes
relevant for the data mining project.
a. From the documentation provided in the dataset, how many attributes were
originally in these datasets?
b. With Weka, attribute selection can be achieved either from the specific Select
attributes tab, or within Preprocess tab. List the different options in Weka for
selecting attributes, with a short explanation about the corresponding method.
c. In comparison with the methods for attribute selection detailed in the textbook,
are any missing? Are any provided in Weka not provided in the textbook?
Data cleaning deals with such defects of real-world data as incompleteness, noise, and
inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the
data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values,
and which Weka filters implement them – if available. Remove the missing
values with the method of your choice, explaining which filter you are using and
why you make this choice. If a filter is not available for your method of choice,
develop a new one that you add to the available filters as a Java class.
b. Noisy data. List the methods seen in class for dealing with noisy data, and which
Weka filters implement them – if available.
c. Outlier detection. List the methods seen in class for detecting outliers. How
would you detect outliers with Weka? Are there any outliers in this dataset, and if
yes, list some of them.
d. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.
Among the different data transformation techniques, explore those available through the
Weka Filters. Stay in the Preprocess tab for now. Study the following data
transformation only:
a. Attribute construction: for example, adding an attribute representing the sum of
two other ones. Which Weka filter allows you to do this?
b. Normalize an attribute. Which Weka filter allows you to do this? Can this filter
perform Min-max normalization? Z-score normalization? Decimal normalization?
Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice – state
which one you choose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot
showing at least the first 10 rows of this dataset – with all the columns.
Often, data mining datasets are too large to process directly. Data reduction techniques
are used to preprocess the data. Once the data mining project has been successful on these
reduced data, the larger dataset can be processed too.
a. Stay in the Preprocess tab for now. Beside attribute selection, a reduction method
is to select rows from a dataset. This is called sampling. How to perform sampling
with Weka filters? Can it perform the two main methods: Simple Random
Sample Without Replacement, and Simple Random Sample With Replacement?
Case Study 2
Classification / Prediction / Cluster analysis
The goal of this assignment is to review prediction mining principles and methods,
cluster analysis principles and methods, and to apply them to a dataset using Weka data
mining tool.
Heart dataset
The first dataset studied is the Cleveland dataset from the UCI repository. This dataset
describes numeric factors of heart disease. It can be downloaded from
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html and is contained in the
datasets-numeric.jar archive.
Zoo dataset
The second dataset studied is the zoo dataset from the UCI repository. This dataset describes
animals with categorical features. It can be downloaded from
http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html and is contained in the
datasets-UCI.jar archive.
1. Prediction in Weka
The goal of this data mining study is to predict the severity of heart disease in the
cleveland dataset (variable num) based on the other attributes. Answer the following
questions:
m. What types of variables are in this dataset (numeric / ordinal / categorical)?
n. Load the data in Weka Explorer. Select the Classify tab. How many different
prediction algorithms are available (under functions)?
o. Explain what prediction is in data mining.
p. Choose the LinearRegression algorithm. Explain the principle of this algorithm.
q. Results of this algorithm can be interpreted in the following way. The first part of
the output represents the coefficients of the linear equation of the type
num = w0 + w1a1 + … + wkak.
The numbers provided in front of each attribute ak represent the wk. Based on this,
interpret the results you get from running LinearRegression on the dataset. What
is the equation of the line found?
r. The second part of the results states the correlation coefficient, which measures
the statistical correlation between the predicted and actual values (a coefficient of
+1 indicates a perfect positive relationship, 0 indicates no linear relationship, and
–1 indicates a perfect negative relationship). Only positive correlations make
sense in regression, and a coefficient above 0.5 signals a large correlational effect.
The remaining figures are the mean absolute error (the average prediction error),
the root mean squared error (the square root of the mean squared error), which
is the most commonly used error measure, the relative absolute error (which
compares this error with the one obtained if the prediction had been the mean),
the root relative squared error (the square root of the error in comparison with the
one obtained if the prediction had been the mean), and the total number of
instances considered.
The overall interpretation of these is the following: a prediction is good when the
correlation coefficient is as large as possible, and all the errors are as small as
possible. These figures are used to compare several prediction results. How do
you evaluate the fit of the equation provided in e), meaning how strong is this
prediction?
s. It is also notable that an important figure is the square of the correlation
coefficient (R2). In statistical regression analysis, where this prediction method originated,
the most widely used success measures are R and R2. The latter represents the
percentage of variation in the target figure accounted for by the model. For
example, if we want to predict a sales volume based on three factors such as the
advertising budget, the number of plays on the radio per week, and the
attractiveness of the band, and if we get a correlation coefficient R of 0.8, then
we learn from the model that R2 = 64% of the variability in the outcome (the sales
volume) is accounted for by the three factors. How much of the variability of num
can be predicted by the other attributes?
t. Are these results compatible with the results of assignment #1, which used
classification to predict num?
u. Now compare these figures with the other classifiers provided in functions and
fill-in the following table (except the last line):
ee. By changing the MultilayerPerceptron parameters, what is the configuration for
the best results you get?
ff. What best prediction results do you get (fill in the table above)?
2. Clustering in Weka
The goal of this data mining study is to find groups of animals in the zoo dataset, and to
check whether these groups correspond to the real animal types in the dataset.
i. What types of variables are in this dataset?
j. How many rows / cases are there?
k. How many animal types are represented in this dataset? List them here.
l. After removing the type attribute, go to the Cluster tab. How many clustering
algorithms are available in Weka?
m. List the clustering algorithms seen in class, and map these to the ones provided in
Weka.
n. Start using the SimpleKMeans clusterer choosing 7 clusters. Do the clusters learnt
and their centroids seem to match the animal types?
o. Compare results with EM clusterer (with 7 clusters),
MakeDensityBasedClusterer, FarthestFirst (with 7 clusters), and Cobweb.
Which algorithm seems to provide the best clustering match for this dataset?
p. Explain the principles of SimpleKMeans, EM, MakeDensityBasedClusterer, and
Cobweb clustering algorithms.
q. Are results easy to interpret, even with the tree visualizations provided?
r. What would make it easier to evaluate the usefulness of the clusters found?
a. List some animals that are misclassified, meaning classified in a cluster that does
not correspond to their actual type, for instance a mammal clustered with fish, or a
reptile clustered with amphibian.
b. By modifying the selected parameters, improve the classification, explain which
modifications you made, and paste here the resulting dendrogram.
3. Go to the Classify tab. Select the ZeroR classifier. Choose the “Cross-validation”
(10 folds) test mode. Run the classifier and observe the results shown in the “Classifier
output” window.
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the classifier errors. In the plot, how can you differentiate between the
correctly and incorrectly classified instances? In the plot, how can you see the
detailed information of an incorrectly classified instance?
- How can you save the learned classifier to a file?
- How can you load a learned classifier from a file?
4. Choose the “Percentage split” (66% for training) test mode. Run the ZeroR classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified? Why is this number smaller than
that observed in the previous experiment (i.e., using the cross-validation test
mode)?
- What is the MAE made by the classifier?
- Visualize the classifier errors to see the detailed information.
5. Now, select the Id3 classifier (i.e., you can find this classifier in the weka.classifiers.trees
group). Choose the “Cross-validation” (10 folds) test mode. Run the Id3 classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the MAE made by the classifier?
- Visualize the classifier errors.
- Compare these results with those observed for the ZeroR classifier in the cross-
validation test mode. Which classifier, ZeroR or Id3, shows a better prediction
performance for the current dataset and the cross-validation test mode?
6. Choose the “Percentage split” (66% for training) test mode. Run the Id3 classifier and
observe the results shown in the “Classifier output” window.
- How many instances are incorrectly classified?
- What is the MAE made by the classifier?
- Visualize the classifier errors.
- Compare the results made by the Id3 classifier for the two considered test modes.
In which test mode does the classifier produce a better result (i.e., a smaller
error)?
- Which classifier, ZeroR or Id3, shows a better prediction performance for the
current dataset and the splitting test mode?
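The comparisons in steps 3 to 6 can also be reproduced from the Weka Java API. In the sketch below the training file name is an assumption, J48 stands in for Id3 (Id3 is not bundled with every recent Weka distribution), and the percentage split is done without the Explorer's preliminary shuffling:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("play_tennis.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (Classifier cls : new Classifier[] { new ZeroR(), new J48() }) {
            // 10-fold cross-validation, as in the "Cross-validation" test option.
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(cls, data, 10, new Random(1));

            // 66% / 34% percentage split (no shuffling here, unlike the Explorer default).
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            cls.buildClassifier(train);
            Evaluation split = new Evaluation(train);
            split.evaluateModel(cls, test);

            System.out.println(cls.getClass().getSimpleName()
                    + "  cross-validation error rate: " + cv.errorRate()
                    + "  percentage-split error rate: " + split.errorRate());
        }
    }
}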
• Let's assume we have the following data set, which recorded, over a period of 25 days,
whether or not a person played tennis depending on the outlook and wind conditions.
• Each instance (example) is represented by the three attributes.
o Outlook: a value of {Sunny, Overcast, Rain}.
o Wind: a value of {Weak, Strong}.
o PlayTennis: the classification attribute (i.e., Yes- the person plays tennis; No- the
person does not play tennis).
• For each attribute and for each of its possible values, how many instances in each class
have the feature value (i.e., the class distribution of the feature values)?
• Go to the “Classify” tab. Select the NaiveBayes classifier. Choose “Percentage split”
(66% for training) test mode. Run the classifier and observe the results shown in the
“Classifier output” window.
- How many instances are used for training? How many for testing?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the classifier errors. In the plot, how can you differentiate between the
correctly and incorrectly classified instances? In the plot, how can you see the detailed
information of an incorrectly classified instance?
- How can you save the learned classifier to a file?
• Now, let’s use a separate test dataset. In the “Test options” panel select the “Supplied
test set” option. Activate the nearby “Set...” button and locate the “play_tennis_test.arff”
file.
Run the classifier and observe the results shown in the “Classifier output” window.
- How many instances are used for training? How many for testing?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) made by the classifier?
- What can you infer from the information shown in the Confusion Matrix?
- Compare the test results with those observed in the previous experiment (i.e., using the
splitting test mode).
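A minimal sketch of the "supplied test set" evaluation with NaiveBayes through the Weka Java API (only play_tennis_test.arff is named in the text; the training file name is an assumption):

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("play_tennis.arff"); // assumed training file
        Instances test = DataSource.read("play_tennis_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train NaiveBayes on the training set only.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        // Evaluate on the separate, supplied test set.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString()); // error counts, MAE, etc.
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}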
5   6.3  3.3  4.7  1.6  Iris-versicolor
6   7.3  2.9  6.3  1.8  Iris-virginica
7   4.4  2.9  1.4  0.2  Iris-setosa
8   4.9  3.1  1.5  0.1  Iris-setosa
9   5.8  2.8  5.1  2.4  Iris-virginica
10  5.6  2.9  3.6  1.3  Iris-versicolor
11  6.9  3.2  5.7  2.3  Iris-virginica
12  6.0  3.4  4.5  1.6  Iris-versicolor
13  7.2  3.0  5.8  1.6  Iris-virginica
14  4.8  3.4  1.9  0.2  Iris-setosa
15  6.8  2.8  4.8  1.4  Iris-versicolor
You will need to configure a file called DatabaseUtils.props. This file already exists
under the path weka/experiment/ in the weka.jar file (which is just a ZIP file) that is part
of the Weka download. In this directory you will also find a sample file for ODBC
connectivity, called DatabaseUtils.props.odbc, and one specifically for MS Access, called
DatabaseUtils.props.msaccess (>3.4.14, >3.5.8, >3.6.0), also using ODBC. You should
use one of the sample files as basis for your setup, since they already contain default
values specific to ODBC access.
This file needs to be recognized when the Explorer starts. You can achieve this by
making sure it is in the working directory or the home directory (if you are unsure what
the terms working directory and home directory mean, see the Notes section).
The easiest is probably the second alternative, as the setup will apply to all the Weka
instances on your machine.
Just make sure that the file contains the following lines at least:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname
where dbname is the name you gave the user DSN. (This can also be changed once the
Explorer is running.)
WEKA KnowledgeFlow
With the Knowledge Flow interface, users select Weka components from a tool bar, place
them on a layout canvas, and connect them into a directed graph that processes and
analyzes data. It provides an alternative to the Explorer for those who like thinking in
terms of how data flows through the system. It also allows the design and execution of
configurations for streamed data processing, which the Explorer cannot do. You invoke
the Knowledge Flow interface by selecting KnowledgeFlow from the choices in the right
panel.
Visualization components (Name: Function):
DataVisualizer: Visualize data in a two-dimensional scatter plot
ScatterPlotMatrix: Matrix of scatter plots
AttributeSummarizer: Set of histograms, one for each attribute
ModelPerformanceChart: Draw ROC and other threshold curves
CostBenefitAnalysis: Visualize cost or benefit tradeoffs
TextViewer: Visualize data or models as text
GraphViewer: Visualize tree-based models
StripChart: Display a scrolling plot of data
Evaluation components (Name: Function):
TrainingSetMaker: Make dataset into a training set
TestSetMaker: Make dataset into a test set
CrossValidationFoldMaker: Split dataset into folds
TrainTestSplitMaker: Split dataset into training and test sets
InstanceStreamToBatchMaker: Collect instances from a stream and assemble them into a batch dataset
ClassAssigner: Assign one of the attributes to be the class
ClassValuePicker: Choose a value for the positive class
ClassifierPerformanceEvaluator: Collect evaluation statistics for batch evaluation
IncrementalClassifierEvaluator: Collect evaluation statistics for incremental evaluation
ClustererPerformanceEvaluator: Collect evaluation statistics for clusterers
PredictionAppender: Append a classifier's predictions to a dataset
SerializedModelSaver: Save trained models as serialized Java objects
Several alternative measures, summarized at the end of this section, can be used to
evaluate the success of numeric prediction. The predicted values on the test instances are
p1, p2, ..., pn; the actual values are a1, a2, ..., an. pi refers to the numeric value of the
prediction for the ith test instance.
Mean-squared error is the principal and most commonly used measure; sometimes the
square root is taken to give it the same dimensions as the predicted value itself. Many
mathematical techniques, such as linear regression, use the mean-squared error because it
tends to be the easiest measure to manipulate mathematically. However, here we are
considering it as a performance measure: all the performance measures are easy to
calculate, so mean-squared error has no particular advantage. The question is, is it an
appropriate measure for the task at hand?
Mean absolute error is an alternative: just average the magnitude of the individual errors
without taking account of their sign. Mean-squared error tends to exaggerate the effect of
outliers—instances whose prediction error is larger than the others—but absolute error
does not have this effect: all sizes of error are
treated evenly according to their magnitude.
Sometimes it is the relative rather than absolute error values that are of importance. For
example, if a 10% error is equally important whether it is an error of 50 in a prediction of
500 or an error of 0.2 in a prediction of 2, then averages of absolute error will be
meaningless: relative errors are appropriate. This effect would be taken into account by
using the relative errors in the mean-squared error calculation or the mean absolute error
calculation.
Relative squared error refers to something quite different. The error is made
relative to what it would have been if a simple predictor had been used. The simple
predictor in question is just the average of the actual values from the training data. Thus
relative squared error takes the total squared
error and normalizes it by dividing by the total squared error of the default predictor.
Relative absolute error is just the total absolute error, with the same kind of
normalization. In these three relative error measures, the errors are normalized by the
error of the simple predictor that predicts average values.
The final measure is the correlation coefficient, which measures the statistical correlation
between the a’s and the p’s. The correlation coefficient ranges from 1 for perfectly
correlated results, through 0 when there is no relation, to -1 when the results are perfectly
correlated negatively.
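For reference, the usual definitions behind these measures are sketched below (standard textbook formulas; p1..pn are the predicted values, a1..an the actual values, ā the average of the actual values used by the simple predictor, and p̄ the average of the predicted values):
Mean-squared error: ((p1 - a1)² + ... + (pn - an)²) / n
Root mean-squared error: the square root of the mean-squared error
Mean absolute error: (|p1 - a1| + ... + |pn - an|) / n
Relative squared error: Σ(pi - ai)² / Σ(ai - ā)²
Root relative squared error: the square root of the relative squared error
Relative absolute error: Σ|pi - ai| / Σ|ai - ā|
Correlation coefficient: S_PA / sqrt(S_P · S_A), where S_PA = Σ(pi - p̄)(ai - ā) / (n - 1), S_P = Σ(pi - p̄)² / (n - 1) and S_A = Σ(ai - ā)² / (n - 1)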
SOLVED EXAMPLES FOR LAB
1) Association Rules
Exercise 1. Basic association rule creation manually.
The 'database' below has four transactions. What association rules can be found in this
set, if the minimum support (i.e coverage) is 60% and the minimum confidence (i.e.
accuracy) is 80% ?
Trans_id Itemlist
T1 {K, A, D, B}
T2 {D, A, C, E, B}
T3 {C, A, B, E}
T4 {B, A, D}
The solution:
Let’s first make a tabular and binary representation of the data:
Transaction  A  B  C  D  E  K
T1           1  1  0  1  0  1
T2           1  1  1  1  1  0
T3           1  1  1  0  1  0
T4           1  1  0  1  0  0
STEP 1. Form the item sets. Let's start by forming the item set containing one item. The
number of occurrences and the support of each item set is given after it. In order to reach
a minimum support of 60%, the item has to occur in at least 3 transactions.
A 4, 100%
B 4, 100%
C 2, 50%
D 3, 75%
E 2, 50%
K 1, 25%
STEP 2. Now let's form the item sets containing 2 items. We only take the item sets from
the previous phase whose support is 60% or more.
A B 4, 100%
A D 3, 75%
B D 3, 75%
STEP 3. The item sets containing 3 items. We only take the item sets from the previous
phase whose support is 60% or more.
A B D  3, 75%
STEP 4. Let's now form the rules and calculate their confidence (c). We only take the item
sets from the previous phases whose support is 60% or more.
Rules:
A -> B   c: P(B|A) = |A ∩ B| / |A| = 4/4 = 100%
B -> A c: 100%
A -> D c: 75%
D -> A c: 100%
B -> D c: 75%
D -> B c: 100%
AB -> D c: 75%
D -> AB c: 100%
AD -> B c: 100%
B - > AD c: 75%
BD -> A c: 100%
A -> BD c: 75%
The rules with a confidence of only 75% are pruned (they fall below the 80% minimum),
and we are left with the following rule set:
A -> B
B -> A
D -> A
D -> B
D -> AB
AD-> B
DB-> A
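The support and confidence values above can be checked with a small sketch (Python, for illustration only; the transaction contents are those of the table above):

from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence of the rule lhs -> rhs
    return support(lhs | rhs) / support(lhs)

# frequent itemsets of size 1, 2 and 3 with minimum support 60%
items = sorted(set().union(*transactions))
for size in (1, 2, 3):
    for combo in combinations(items, size):
        s = support(set(combo))
        if s >= 0.6:
            print(set(combo), "support:", s)

print("A -> B confidence:", confidence({"A"}, {"B"}))   # 1.0, kept
print("A -> D confidence:", confidence({"A"}, {"D"}))   # 0.75, pruned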
Launch Weka and try to perform with it the calculations you did manually in the
previous exercise. Use the Apriori algorithm for generating the association rules.
The Solution:
The file may be given to Weka in, for example, two different formats: ARFF
(attribute-relation file format) and CSV (comma-separated values). Both are given below:
CSV format:
exista,existb,existc,existd,existe,existk
TRUE,TRUE,FALSE,TRUE,FALSE,TRUE
TRUE,TRUE,TRUE,TRUE,TRUE,FALSE
TRUE,TRUE,TRUE,FALSE,TRUE,FALSE
TRUE,TRUE,FALSE,TRUE,FALSE,FALSE
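ARFF format (a sketch of an equivalent encoding; the relation name is arbitrary and the attribute names follow the CSV header above):

@relation transactions
@attribute exista {TRUE,FALSE}
@attribute existb {TRUE,FALSE}
@attribute existc {TRUE,FALSE}
@attribute existd {TRUE,FALSE}
@attribute existe {TRUE,FALSE}
@attribute existk {TRUE,FALSE}
@data
TRUE,TRUE,FALSE,TRUE,FALSE,TRUE
TRUE,TRUE,TRUE,TRUE,TRUE,FALSE
TRUE,TRUE,TRUE,FALSE,TRUE,FALSE
TRUE,TRUE,FALSE,TRUE,FALSE,FALSE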
Figure 3: Setting the parameters of the Apriori algorithm.
The default values for Number of rules, the decrease for Minimum support (the delta factor) and
minimum Confidence are 10, 0.05 and 0.9 respectively. Rule support is the proportion of examples
covered by both the LHS and the RHS, while confidence is the proportion of examples covered by the
LHS that are also covered by the RHS. So if a rule's LHS and RHS together cover 50% of the cases,
the rule has support 0.5; if the LHS of a rule covers 200 cases and of these the RHS covers 50 cases,
the confidence is 0.25. With the default settings, Apriori tries to generate 10 rules by starting with a
minimum support of 100% and iteratively decreasing it by the delta factor until the minimum
non-zero support is reached or the required number of rules with at least the minimum confidence has
been generated. If we examine Weka's output, a minimum support of 0.15 indicates the support that
had to be reached in order to generate the 10 rules with the specified minimum metric, here a
confidence of 0.9. The item set sizes generated are displayed; e.g., there are 6 four-item sets having
the required minimum support. By default, rules are sorted by confidence and ties are broken based
on support. The number preceding ==> is the number of cases covered by the LHS, and the number
following the rule is the number of cases covered by both the LHS and the RHS. The value in
parentheses is the rule's confidence.
2) Clustering
Consider the following ten points:
a(2, 0)  b(1, 2)  c(2, 2)  d(3, 2)  e(2, 3)  f(3, 3)  g(2, 4)  h(3, 4)  i(4, 4)  j(3, 5)
Identify the clusters by applying the k-means algorithm with k = 2. Try using initial
cluster centers that are as far apart as possible.
Solution:
Initialization: choose the two points that are farthest apart as the initial cluster centers.
The following chart gives the squares of the pairwise Euclidean distances; the farthest pair
is a and j (squared distance 26), so a(2, 0) and j(3, 5) become the initial centroids of
cluster 1 and cluster 2 respectively.
    a   b   c   d   e   f   g   h   i   j
a   0   5   4   5   9  10  16  17  20  26
b       0   1   4   2   5   5   8  13  13
c           0   1   1   2   4   5   8  10
d               0   2   1   5   4   5   9
e                   0   1   1   2   5   5
f                       0   2   1   2   4
g                           0   1   4   2
h                               0   1   1
i                                   0   2
j                                       0
Iteration 1:
Distance from points to cluster centroids:
Cluster 1 Cluster 2 Cluster assignment
a 0.0000 5.0990 1
b 2.2360 3.6055 1
c 2.0000 3.1622 1
d 2.2360 3.0000 1
e 3.0000 2.2360 2
f 3.1622 2.0000 2
g 4.0000 1.4142 2
h 4.1231 1.0000 2
i 4.4721 1.4142 2
j 5.0990 0.0000 2
New centroids: cluster 1 = mean of {a, b, c, d} = (2, 1.5); cluster 2 = mean of
{e, f, g, h, i, j} = (2.83, 3.83).
Iteration 2:
Distance from points to cluster centroids:
Cluster 1 Cluster 2 Cluster assignment
a 1.5000 3.9228 1
b 1.1180 2.5927 1
c 0.5000 2.0138 1
d 1.1180 1.8408 1
e 1.5000 1.1785 2
f 1.8027 0.8498 2
g 2.5000 0.8498 2
h 2.6925 0.2357 2
i 3.2015 1.1785 2
j 3.6400 1.1785 2
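The assignments after iteration 2 are identical to those after iteration 1, so the algorithm has converged: cluster 1 = {a, b, c, d} and cluster 2 = {e, f, g, h, i, j}. The two iterations can be reproduced with a short sketch (Python, for illustration only; the initial centroids are a and j):

import math

points = {"a": (2, 0), "b": (1, 2), "c": (2, 2), "d": (3, 2), "e": (2, 3),
          "f": (3, 3), "g": (2, 4), "h": (3, 4), "i": (4, 4), "j": (3, 5)}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

centroids = [points["a"], points["j"]]      # the two points farthest apart
for iteration in (1, 2):
    # assignment step: each point joins the cluster of its nearest centroid
    clusters = {0: [], 1: []}
    for name, p in points.items():
        nearest = min((0, 1), key=lambda k: dist(p, centroids[k]))
        clusters[nearest].append(name)
    print("iteration", iteration, clusters)
    # update step: each centroid becomes the mean of its cluster
    centroids = [(sum(points[n][0] for n in clusters[k]) / len(clusters[k]),
                  sum(points[n][1] for n in clusters[k]) / len(clusters[k]))
                 for k in (0, 1)]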
Exercise 2:
The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}.
(a) For each of the following sets of initial centroids, create two clusters
by assigning each point to the nearest centroid, and then calculate the total squared error
for each set of two clusters. Show both the clusters and the total squared error for each set
of centroids.
i. {18, 45}
First cluster: {6, 12, 18, 24, 30}, centroid 18.
Error = (6-18)² + (12-18)² + (18-18)² + (24-18)² + (30-18)² = 144 + 36 + 0 + 36 + 144 = 360.
Second cluster: {42, 48}, centroid 45.
Error = (42-45)² + (48-45)² = 9 + 9 = 18.
Total error = 360 + 18 = 378
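A quick check of this result (Python, for illustration only):

points = [6, 12, 18, 24, 30, 42, 48]
centroids = [18, 45]

total = 0
for x in points:
    nearest = min(centroids, key=lambda c: abs(x - c))   # assign to nearest centroid
    total += (x - nearest) ** 2                          # accumulate squared error
print(total)   # 360 + 18 = 378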
3) Classification
Consider the following one-dimensional data set, in which each value x carries a class
label (+ or -):
x     Class
3.0   -
4.5   +
4.6   +
4.9   +
5.2   -
5.3   -
5.5   +
7.0   -
9.5   -
(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest
neighbors (using majority vote).
Answer:
1-nearest neighbor: +,
3-nearest neighbor: −,
5-nearest neighbor: +,
9-nearest neighbor: −.
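These answers can be reproduced with a small majority-vote sketch (Python, for illustration only):

data = [(3.0, "-"), (4.5, "+"), (4.6, "+"), (4.9, "+"), (5.2, "-"),
        (5.3, "-"), (5.5, "+"), (7.0, "-"), (9.5, "-")]
x = 5.0

for k in (1, 3, 5, 9):
    # the k training points closest to x vote on the class
    neighbours = sorted(data, key=lambda d: abs(d[0] - x))[:k]
    votes = [label for _, label in neighbours]
    print(k, max(set(votes), key=votes.count))   # 1: +, 3: -, 5: +, 9: -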
Exercise 2
Consider the data set shown below:
A B C Class
0 0 0 +
0 0 1 -
0 1 1 -
0 1 1 -
0 0 1 +
1 0 1 +
1 0 1 -
1 0 1 -
1 1 1 +
1 0 1 +
a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+),
P(A|−), P(B|−), and P(C|−).
Answer:
P(A = 1|−) = 2/5 = 0.4, P(B = 1|−) = 2/5 = 0.4, P(C = 1|−) = 5/5 = 1,
P(A = 0|−) = 3/5 = 0.6, P(B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0/5 = 0;
P(A = 1|+) = 3/5 = 0.6, P(B = 1|+) = 1/5 = 0.2, P(C = 1|+) = 4/5 = 0.8,
P(A = 0|+) = 2/5 = 0.4, P(B = 0|+) = 4/5 = 0.8, P(C = 0|+) = 1/5 = 0.2.
(b) Use the estimates of the conditional probabilities given above to predict the class
label for a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.
Answer:
Let P(A = 0, B = 1, C = 0) = K.
P(+|A = 0, B = 1, C = 0)
  = P(A = 0, B = 1, C = 0|+) × P(+) / P(A = 0, B = 1, C = 0)
  = (P(A = 0|+) × P(B = 1|+) × P(C = 0|+) × P(+)) / K
  = (0.4 × 0.2 × 0.2 × 0.5) / K
  = 0.008/K.
P(−|A = 0, B = 1, C = 0)
  = P(A = 0, B = 1, C = 0|−) × P(−) / P(A = 0, B = 1, C = 0)
  = (P(A = 0|−) × P(B = 1|−) × P(C = 0|−) × P(−)) / K
  = (0.6 × 0.4 × 0 × 0.5) / K
  = 0/K = 0.
Since 0.008/K > 0, the class label should be '+'.
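The estimates and the prediction can be reproduced with a short sketch (Python, for illustration only; the score printed for each class is the unnormalized value, i.e. the quantity multiplied by K above):

records = [
    ((0, 0, 0), "+"), ((0, 0, 1), "-"), ((0, 1, 1), "-"), ((0, 1, 1), "-"),
    ((0, 0, 1), "+"), ((1, 0, 1), "+"), ((1, 0, 1), "-"), ((1, 0, 1), "-"),
    ((1, 1, 1), "+"), ((1, 0, 1), "+"),
]

def cond_prob(attr, value, cls):
    # P(attribute = value | class), estimated as a relative frequency
    rows = [r for r, c in records if c == cls]
    return sum(r[attr] == value for r in rows) / len(rows)

test = (0, 1, 0)                      # (A = 0, B = 1, C = 0)
for cls in ("+", "-"):
    prior = sum(c == cls for _, c in records) / len(records)
    score = prior
    for i, v in enumerate(test):
        score *= cond_prob(i, v, cls)
    print(cls, score)                 # +: 0.008, -: 0.0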
b) Using hold-out method
Preprocessing Filters
a)
Classification
1. J48: This is Weka's implementation of the C4.5 decision tree algorithm (the successor of ID3).
a) Using cross-validation
Note: for any classification algorithm the same flow layout can be followed; just replace
J48 with the appropriate algorithm.
Association Rules
a) Apriori
1. Dataset: cpu.arff
a. What is normalization? What are the various normalization techniques? Illustrate
normalization using WEKA on the given dataset and write down a sample result.
b. Use KnowledgeFlow to illustrate the above question.
c. Dataset: Glass.arff
How many attributes are there in the dataset? What are their names? What is the class
attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-
validation to test its performance, leaving the number of folds at the default value of 10.
Recall that you can examine the classifier options in the Generic Object Editor window
that pops up when you click the text beside the Choose button. The default value of the
KNN field is 1; this sets the number of neighboring instances to use when classifying.
d. Use the Knowledge Flow canvas and develop a flow layout for k-means execution
on the above dataset.
b. Dataset: iris.arff
For J48, compare cross-validated accuracy and the size of the trees generated for (1) the
raw data, (2) data discretized by the unsupervised discretization method in default mode,
and (3) data discretized by the same method with binary attributes.
Show the same using KnowledgeFlow.
c. Use the Experimenter to compare any two classifiers of your choice on the iris
dataset.
3.
a. Dataset: iris.arff. What is sampling? Explain various sampling techniques.
Illustrate sampling with replacement using WEKA on the given dataset and
write down a sample result.
b. Show the same using KnowledgeFlow.
c. Load the iris data using the Preprocess panel. Evaluate C4.5 on this data
using (a) the holdout method and (b) cross-validation. What is the estimated
percentage of correct classifications for (a) and (b)?
4. a. What is discretization? Explain how you would discretize numeric variables. Illustrate
discretization using WEKA on the given dataset and write down a sample result.
b. Dataset: iris.arff
Explain the k-means algorithm. Perform clustering on the given dataset using the
holdout method in WEKA. Write down the results you get and interpret them.
5. a. Explain the terms absolute error, squared error, mean absolute error and root mean
squared error.
b. Dataset: weather.nominal.arff
Using the Apriori algorithm, calculate all frequent itemsets (L's) for the following data:
TID List of item_ids
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3
Illustrate the Apriori algorithm using WEKA and display the 10 most significant rules you
get using the default values of support and confidence.
c. Use the Experimenter to compare any two classifiers of your choice on the iris dataset.
6.
a. Open the "weather.nominal" dataset.
- How many instances (examples) are contained in the dataset?
- How many attributes are used to represent the instances?
- Which attribute is the class label?
- What is the data type of the attributes in the dataset?
- For each attribute and for each of its possible values, how many instances in
each class have that attribute value?
b. Select the Id3 classifier. Choose the "Cross-validation" (10 folds) test mode. Run the
Id3 classifier and observe the results shown in the "Classifier output" window.
- How many instances are incorrectly classified?
- What is the mean absolute error made by the classifier?
- Visualize the classifier errors.
- Compare these results with those observed for the ZeroR classifier in the cross-
validation test mode. Which classifier, ZeroR or Id3, shows a better prediction
performance for the current dataset and the cross-validation test mode?
c. Use the Knowledge Flow canvas and develop a flow layout for C4.5 execution.
7.
a. Dataset: cpu.arff
What is normalization? What are the various normalization techniques? Illustrate
normalization using WEKA on the given dataset and write down a sample result.
Show the same using KnowledgeFlow.
Dataset: cpu.arff
b. Answer the following questions:
i. What types of variables are in this dataset (numeric / ordinal / categorical)?
ii. Load the data in the Weka Explorer. Select the Classify tab. How many different
prediction algorithms are available (under functions)?
iii. Explain what prediction is in data mining.
iv. Choose the LinearRegression algorithm. Explain the principle of this algorithm.
v. Results of this algorithm can be interpreted in the following way. The first part of
the output gives the coefficients of a linear equation of the form
num = w0 + w1*a1 + … + wk*ak.
The number provided in front of each attribute ak is the weight wk. Based on this,
interpret the results you get from running LinearRegression on the dataset. What
is the equation of the line found?
c. Use the Experimenter to compare any two classifiers of your choice on the iris dataset.
8.
a. What is discretization? Explain how you would discretize numeric variables. Illustrate
discretization using WEKA on the given dataset and write down a sample result.
b. Choose the "Percentage split" (66% for training) test mode. Run the Id3 classifier and
observe the results shown in the "Classifier output" window.
- How many instances are incorrectly classified?
- What is the mean absolute error made by the classifier?
- Compare the results made by the Id3 classifier for the two considered test
modes, hold-out and cross-validation. In which test mode does the classifier
produce a better result (i.e., a smaller error)?
c. Use the Knowledge Flow canvas and develop a flow layout for k-means execution.
age          income  student  credit_rating  class
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle_aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle_aged  medium  no       excellent      yes
middle_aged  high    yes      fair           yes
senior       medium  no       excellent      no
Using WEKA, what class label will the decision tree algorithm J48 predict for the
following tuple?
(youth, Medium, Yes, Fair, ?)
age          income  student  credit_rating  class
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle_aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle_aged  medium  no       excellent      yes
middle_aged  high    yes      fair           yes
senior       medium  no       excellent      no
Using WEKA, what class label will Naive Bayes predict for the following tuple?
(youth, Medium, Yes, Fair, ?)
11. a) Consider the labor dataset. Use appropriate WEKA filters to replace missing
values in the dataset. What filters will you use?
12. The 'database' below has four transactions. Use Apriori to find all frequent itemsets.
What association rules can be found in this set using WEKA, if the minimum support
(i.e., coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?
Trans_id  Itemlist
T1        {K, A, D, B}
T2        {D, A, C, E, B}
T3        {C, A, B, E}
T4        {B, A, D}
Classify the data set weather.arff.
• Choose the classifier J4.8.
Look at the classification output and try to interpret the results:
• What is 10-fold cross-validation?
• What is the tree size?
• What percentage of instances did J4.8 classify correctly?
• Did you notice the accuracy and the confusion matrix? Write them down and interpret them.
Change the classifier to OneR (under Rules) and compare the results with J4.8.
15. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
16. Use the two methods below to normalize the following group of data:
200; 300; 400; 600; 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
17. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons
as to why.
18. Suppose a group of 12 sales price records has been sorted as follows:
5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
Using WEKA, what class label will the kNN algorithm predict for the following tuple?
(youth, Medium, Yes, Fair, ?)
20. What is linear regression? What do you use it for? Explain.
Consider the dataset below, which captures data about the size of houses and their price
in rupees. Build a regression equation and predict the price for the following house:
(1002, ?)
Size_of_House Price
2104 460
1416 232
1534 315
852 178
2287 367
568 100
989 187
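One way to sanity-check the answer is a least-squares fit outside WEKA (Python, for illustration only; the exercise itself asks for WEKA's LinearRegression):

sizes  = [2104, 1416, 1534, 852, 2287, 568, 989]
prices = [460, 232, 315, 178, 367, 100, 187]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
# ordinary least squares for: price = w0 + w1 * size
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
      / sum((x - mean_x) ** 2 for x in sizes))
w0 = mean_y - w1 * mean_x
print("price = %.2f + %.4f * size" % (w0, w1))
print("predicted price for a house of size 1002:", round(w0 + w1 * 1002, 1))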
21. A database has five transactions. Use Apriori to find all frequent itemsets. What
association rules can be found in this set using WEKA, if the minimum support (i.e.,
coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?
TID   Items_bought
T100  {M, O, N, K, E, Y}
T200  {D, O, N, K, E, Y}
T300  {M, A, K, E}
T400  {M, U, C, K, Y}
T500  {C, O, O, K, I, E}
22. Let’s assume that we have collected the following data set of users who decided to
buy a computer and others who decided not to.
UserID Age Income Student Credit_Rating Buy_Computer
1 Young High No Fair No
2 Young High No Excellent No
3 Medium High No Fair Yes
4 Old Medium No Fair Yes
5 Old Low Yes Fair Yes
6 Old Low Yes Excellent No
7 Medium Low Yes Excellent Yes
8 Young Medium No Fair No
9 Young Low Yes Fair Yes
10 Old Medium Yes Fair Yes
11 Young Medium Yes Excellent Yes
12 Medium Medium No Excellent Yes
13 Medium High Yes Fair Yes
14 Old Medium No Excellent No
15 Medium Medium Yes Fair No
16 Medium Medium Yes Excellent Yes
17 Young Low Yes Excellent Yes
18 Old High No Fair No
19 Old Low No Excellent No
20 Young Medium Yes Excellent Yes
• We want to predict, for each of the following users, whether he or she will buy a
computer or not.
- User #21: a young student with medium income and fair credit rating.
- User #22: a young non-student with low income and fair credit rating.
- User #23: a student of medium age with high income and excellent credit rating.
- User #24: an old non-student with high income and excellent credit rating.
Use WEKA and give the predictions.