Lab 01-PhamBinhDuong ITCSIU21054

The document serves as an introduction to Weka, an open-source software for data mining, detailing its features and functionalities. It outlines tasks for exploring datasets, building classifiers, and utilizing filters, along with specific examples using various datasets such as weather and glass. Additionally, it emphasizes the importance of visualizing data and classifier errors in the data mining process.


Introduction to Data Mining

Lab 1: Introduction to Weka

1.1. Introduction
Weka is an open-source software package available at www.cs.waikato.ac.nz/ml/weka; the name stands for the Waikato Environment for Knowledge Analysis. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of data mining methods. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research.

In the first class, we are going to get started with Weka: exploring the “Explorer” interface, exploring some datasets, building a classifier, using filters, and visualizing a dataset (see the class 1 lecture by Ian H. Witten [1]).

Task: Take notes on how you find the Explorer, and answer the questions in the following sections.

1.2. Exploring the Explorer


Follow the instructions in [1]

1.3. Exploring datasets


Follow the instructions in [1]

In the dataset weather.nominal.arff, how many attributes are there in the relation? What are their values? What is the class attribute and what are its values? Count the instances for each attribute value.

Relation: weather.symbolic (#Instances: 14, #Attributes: 5)

Attribute       Values     #Instances
outlook         sunny      5
                overcast   4
                rainy      5
                (3 distinct values)
temperature     hot        4
                mild       6
                cool       4
                (3 distinct values)
humidity        high       7
                normal     7
                (2 distinct values)
windy           TRUE       6
                FALSE      8
                (2 distinct values)
play (class)    yes        9
                no         5
                (2 distinct values)
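These counts come straight from the “Selected attribute” panel of the Explorer’s Preprocess tab. They can also be reproduced through Weka’s Java API; the sketch below is a minimal example (the dataset path is an assumption and depends on where Weka is installed):

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSummary {
    public static void main(String[] args) throws Exception {
        // Path is an assumption; point it at your local copy of the dataset.
        Instances data = DataSource.read("data/weather.nominal.arff");
        for (int i = 0; i < data.numAttributes(); i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.println(data.attribute(i).name()
                + " (distinct: " + stats.distinctCount + ")");
            if (data.attribute(i).isNominal()) {
                // nominalCounts[j] = number of instances with the j-th value
                for (int j = 0; j < stats.nominalCounts.length; j++) {
                    System.out.println("  " + data.attribute(i).value(j)
                        + ": " + stats.nominalCounts[j]);
                }
            } else {
                // Numeric attributes get the min/max/mean/stdDev summaries
                System.out.println("  min=" + stats.numericStats.min
                    + ", max=" + stats.numericStats.max
                    + ", mean=" + stats.numericStats.mean
                    + ", stdDev=" + stats.numericStats.stdDev);
            }
        }
    }
}

The same program works unchanged for weather.numeric.arff and glass.arff, which is one way to check the numeric summaries in the next two tables.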

Similarly, examine the datasets weather.numeric.arff and glass.arff.

Relation: weather (#Instances: 14, #Attributes: 5)

Attribute       Values / Statistics                        #Instances
outlook         sunny                                      5
                overcast                                   4
                rainy                                      5
                (3 distinct values)
temperature     Minimum: 64, Maximum: 85,                  (12 distinct values)
                Mean: 73.571, StdDev: 6.572
humidity        Minimum: 65, Maximum: 96,                  (10 distinct values)
                Mean: 81.643, StdDev: 10.285
windy           TRUE                                       6
                FALSE                                      8
                (2 distinct values)
play (class)    yes                                        9
                no                                         5
                (2 distinct values)

Relation: Glass (#Instances: 214, #Attributes: 10)

Attribute      Statistics                                                   Distinct
RI             Minimum: 1.511, Maximum: 1.534, Mean: 1.518, StdDev: 0.003   178
Na             Minimum: 10.73, Maximum: 17.38, Mean: 13.408, StdDev: 0.817  142
Mg             Minimum: 0, Maximum: 4.49, Mean: 2.685, StdDev: 1.441        94
Al             Minimum: 0.29, Maximum: 3.5, Mean: 1.445, StdDev: 0.499      118
Si             Minimum: 69.81, Maximum: 75.41, Mean: 72.651, StdDev: 0.775  133
K              Minimum: 0, Maximum: 6.21, Mean: 0.497, StdDev: 0.652        65
Ca             Minimum: 5.43, Maximum: 16.19, Mean: 8.957, StdDev: 1.423    143
Ba             Minimum: 0, Maximum: 3.15, Mean: 0.175, StdDev: 0.497        34
Fe             Minimum: 0, Maximum: 0.51, Mean: 0.057, StdDev: 0.097        32
Type (class)   build wind float 70, build wind non-float 76,                6
               vehic wind float 17, vehic wind non-float 0,
               containers 13, tableware 9, headlamps 29

Create a file in ARFF format and examine it.
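An ARFF file is plain text: a header that names the relation and declares each attribute, followed by a @data section with one instance per line. Below is a minimal sketch of what the header of the gameandgrade file might look like; the attribute names are taken from the table that follows, while the exact declarations and the data line shown are illustrative:

@relation gameandgrade

@attribute Sex numeric
@attribute 'School Code' numeric
@attribute 'Playing Years' numeric
@attribute 'Playing Often' numeric
@attribute 'Playing Hours' numeric
@attribute 'Playing Games' numeric
@attribute 'Parent Revenue' numeric
@attribute 'Father Education' numeric
@attribute 'Mother Education' numeric
@attribute Grade numeric

@data
% one comma-separated instance per line (the values here are made up)
1,4,2,3,1,1,2,4,3,85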

Relation: gameandgrade (#Instances: 770, #Attributes: 10)

Attribute          Statistics                                            Distinct
Sex                Minimum: 0, Maximum: 1, Mean: 0.499, StdDev: 0.5      2
School Code        Minimum: 1, Maximum: 11, Mean: 4.944, StdDev: 3       11
Playing Years      Minimum: 0, Maximum: 4, Mean: 1.584, StdDev: 1.407    5
Playing Often      Minimum: 0, Maximum: 5, Mean: 2.243, StdDev: 1.924    6
Playing Hours      Minimum: 0, Maximum: 5, Mean: 1.488, StdDev: 1.338    6
Playing Games      Minimum: 0, Maximum: 2, Mean: 0.706, StdDev: 0.459    3
Parent Revenue     Minimum: 0, Maximum: 4, Mean: 1.838, StdDev: 1.064    5
Father Education   Minimum: 0, Maximum: 6, Mean: 3.718, StdDev: 1.172    7
Mother Education   Minimum: 0, Maximum: 6, Mean: 3.41, StdDev: 1.176     7
Grade (class)                                                            105

1.4. Building a classifier


Follow the instructions in [1]

Examine the output of J48 vs. RandomTree applied to dataset glass.arff

Algorithm    Pruned/Unpruned   minNumObj   Leaves   Correctly Classified Instances
J48          pruned            2           30       143
J48          unpruned          2           30       144
RandomTree                                          150

Examine the confusion matrix after each run of an algorithm.

[Confusion matrix of the J48 pruned tree]

[Confusion matrix of the J48 unpruned tree]

[Confusion matrix of the random tree]
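The pruned/unpruned comparison above can also be scripted against the Weka API instead of clicking through the Explorer. A minimal sketch, assuming glass.arff sits on a local path and using 10-fold cross-validation with seed 1, the Explorer's default:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareTrees {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(true); // toggle for the pruned/unpruned rows above
        tree.setMinNumObj(2);   // the minNumObj parameter
        tree.buildClassifier(data);
        System.out.println(tree); // prints the tree and its number of leaves

        // 10-fold cross-validation; the summary reports correctly classified
        // instances and toMatrixString() prints the confusion matrix
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}

Swapping J48 for weka.classifiers.trees.RandomTree (with its default settings) gives the last row of the table.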

1.5. Using a filter


Follow the instructions in [1], and note the following:

_Use a filter to remove an attribute 

- What are attributeIndices?

- A parameter of the Remove filter that specifies the column numbers (indices) of the attributes to be removed.

_Remove instances where humidity is high 

- What are nominalIndices?

- A parameter of the RemoveWithValues filter that specifies, as label indices, which value(s) of the chosen nominal attribute to match. For example, to remove instances where humidity is “high”, point the filter at the humidity attribute and set nominalIndices to 1, the position of the “high” label in the attribute declaration.
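Both filters can also be driven from code. Below is a minimal sketch of the two operations on the weather data; the dataset path is an assumption, and the indices follow the attribute order listed in section 1.3 (humidity is the third attribute, with “high” declared as its first value):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class FilterExamples {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");

        // Remove an attribute: attributeIndices lists the columns to delete
        Remove remove = new Remove();
        remove.setAttributeIndices("3"); // delete humidity
        remove.setInputFormat(data);
        Instances noHumidity = Filter.useFilter(data, remove);
        System.out.println(noHumidity.numAttributes()); // 4 attributes remain

        // Remove instances where humidity is high:
        // attributeIndex picks the column, nominalIndices picks the label
        RemoveWithValues rwv = new RemoveWithValues();
        rwv.setAttributeIndex("3");  // the humidity attribute
        rwv.setNominalIndices("1");  // "high" is its first declared label
        rwv.setInputFormat(data);
        Instances noHigh = Filter.useFilter(data, rwv);
        System.out.println(noHigh.numInstances()); // humidity=high rows removed
    }
}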

_Fewer attributes, better classification:

Follow the instructions in [1] and review the outputs of J48 applied to glass.arff:

Filter                 Leaves   Correctly Classified   Remark
                                Instances
Original               30       143                    Baseline using all attributes;
                                                       accuracy 66.8224%.
Remove Fe              26       144                    Removing the Fe attribute reduced
                                                       the tree size and slightly improved
                                                       the accuracy, to 67.2897%.
Remove all attributes  21       147                    Retaining only RI and Mg yielded
except RI and Mg                                       the simplest tree and the highest
                                                       accuracy, 68.6916%.
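The last row of that table can be produced with the same Remove filter by inverting the selection, so the listed columns are kept rather than deleted. A minimal sketch, assuming the glass.arff attribute order from section 1.3 (RI is attribute 1, Mg is 3, and the class Type is 10):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepRIandMg {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff"); // path is an assumption
        Remove keep = new Remove();
        keep.setAttributeIndices("1,3,10"); // RI, Mg and the class Type
        keep.setInvertSelection(true);      // inverted: listed columns are kept
        keep.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, keep);
        System.out.println(reduced.numAttributes()); // 3 attributes remain
    }
}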

1.6. Visualizing your data


Follow the instructions in [1]. How do you find “Visualize classifier errors”?

After running J48 for iris.arff, determine:

- How many instances are predicted wrong?

- 6 (using the J48 classifier, unpruned, with minNumObj = 15)

- What are they?

Instance   Predicted class   Actual class
15         Iris-virginica    Iris-versicolor
73         Iris-virginica    Iris-versicolor
119        Iris-virginica    Iris-versicolor
92         Iris-versicolor   Iris-virginica
109        Iris-versicolor   Iris-virginica
98         Iris-versicolor   Iris-setosa
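In the Explorer, “Visualize classifier errors” is reached by right-clicking the run in the result list; misclassified instances show up as squares in the scatter plot. The same list can be pulled out of an evaluation programmatically; a minimal sketch, with the iris.arff path as an assumption:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.Prediction;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrisErrors {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(true);
        tree.setMinNumObj(15); // the settings used for the answer above

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        // predictions() holds one entry per test instance; note that
        // cross-validation reorders instances, so the index printed here is
        // the fold-by-fold test order, not the row number in the ARFF file
        int index = 0;
        for (Prediction p : eval.predictions()) {
            index++;
            if (p.predicted() != p.actual()) {
                System.out.println("instance " + index
                    + ": predicted " + data.classAttribute().value((int) p.predicted())
                    + ", actual " + data.classAttribute().value((int) p.actual()));
            }
        }
    }
}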
