
Introduction to Data Mining

Lab 1: Introduction to Weka

1.1. Introduction
Weka is open-source software available at www.cs.waikato.ac.nz/ml/weka. Weka stands for the
Waikato Environment for Knowledge Analysis. It offers clean, spare implementations of the simplest
techniques, designed to aid understanding of data mining methods. It also provides a workbench
that includes full, working, state-of-the-art implementations of many popular learning schemes that can
be used for practical data mining or for research.

In the first class we are going to get started with Weka: exploring the “Explorer” interface, examining
some datasets, building a classifier, using filters, and visualizing your dataset. (See the lecture of class 1
by Ian H. Witten, [1].)

Task: Take notes on how you use the Explorer, and answer the questions in the following sections.

1.2. Exploring the Explorer


Follow the instructions in [1]

1.3. Exploring datasets


Follow the instructions in [1]

In the dataset weather.nominal.arff, how many attributes are there in the relation? What are their values?
What is the class and what are its values? Count the instances for each attribute value.
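As a cross-check on the counts Weka shows, per-value instance counts can be reproduced with a short script. A minimal sketch in plain Python (no Weka required); the embedded data is the standard weather.nominal.arff dataset shipped with Weka:

```python
from collections import Counter

ARFF = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""

def value_counts(arff_text):
    """Return {attribute: Counter(value -> #instances)} for an ARFF string."""
    names, counters, in_data = [], {}, False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        low = line.lower()
        if low.startswith('@attribute'):
            name = line.split()[1]
            names.append(name)
            counters[name] = Counter()
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            # count each attribute's value on this instance line
            for name, value in zip(names, line.split(',')):
                counters[name][value.strip()] += 1
    return counters

counts = value_counts(ARFF)
print(counts['outlook'])  # sunny: 5, overcast: 4, rainy: 5
```

The printed counts should match the table below.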

Relation: weather.symbolic
#Instances: 14
#Attributes: 5

Attribute      Values     #Instances
outlook        sunny      5
               overcast   4
               rainy      5      (3 distinct)
temperature    hot        4
               mild       6
               cool       4      (3 distinct)
humidity       high       7
               normal     7      (2 distinct)
windy          TRUE       6
               FALSE      8      (2 distinct)
play (class)   yes        9
               no         5      (2 distinct)

Similarly, examine datasets: weather.numeric.arff and glass.arff.

weather.numeric.arff

Relation: weather
#Instances: 14
#Attributes: 5

Attribute      Values / Statistics                              #Instances
outlook        sunny                                            5
               overcast                                         4
               rainy                                            5    (3 distinct)
temperature    Minimum 64, Maximum 85, Mean 73.571,
               StdDev 6.572                                     (12 distinct)
humidity       Minimum 65, Maximum 96, Mean 81.643,
               StdDev 10.285                                    (10 distinct)
windy          TRUE                                             6
               FALSE                                            8    (2 distinct)
play (class)   yes                                              9
               no                                               5    (2 distinct)

glass.arff

Relation: Glass
#Instances: 214
#Attributes: 10

Attribute   Statistics                                            Distinct
RI          Minimum 1.511, Maximum 1.534, Mean 1.518, StdDev 0.003    178
Na          Minimum 10.73, Maximum 17.38, Mean 13.408, StdDev 0.817   142
Mg          Minimum 0, Maximum 4.49, Mean 2.685, StdDev 1.442          94
Al          Minimum 0.29, Maximum 3.5, Mean 1.445, StdDev 0.499       118
Si          Minimum 69.81, Maximum 75.41, Mean 72.651, StdDev 0.775   133
K           Minimum 0, Maximum 6.21, Mean 0.497, StdDev 0.652          65
Ca          Minimum 5.43, Maximum 16.19, Mean 8.957, StdDev 1.423     143
Ba          Minimum 0, Maximum 3.15, Mean 0.175, StdDev 0.497          34
Fe          Minimum 0, Maximum 0.51, Mean 0.057, StdDev 0.097          32

Class: Type (6 distinct in the data)
  build wind float        70
  build wind non-float    76
  vehic wind float        17
  vehic wind non-float     0
  containers              13
  tableware                9
  headlamps               29

Create a file in ARFF format and examine it.

Relation: air_quality
#Instances: 10
#Attributes: 5

Attribute    Statistics                                           Distinct
temperature  Minimum 20, Maximum 35, Mean 27.8, StdDev 4.803          10
humidity     Minimum 50, Maximum 90, Mean 70.8, StdDev 13.155         10
CO2_level    Minimum 300, Maximum 800, Mean 535, StdDev 171.675        9
wind_speed   Minimum 2, Maximum 7, Mean 4.1, StdDev 1.663              6

Class: pollution (3 distinct)
  low        4
  moderate   3
  high       3
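The ARFF format itself is plain text: a @relation line, one @attribute declaration per attribute, then a @data section with one comma-separated instance per line. A sketch of how such an air_quality file could start (the two data rows are illustrative only, not the actual ten instances behind the statistics above):

```
@relation air_quality

@attribute temperature numeric
@attribute humidity numeric
@attribute CO2_level numeric
@attribute wind_speed numeric
@attribute pollution {low, moderate, high}

@data
25, 60, 400, 3, low
33, 85, 760, 6, high
```

Nominal attributes list their allowed values in braces; numeric attributes are declared with the keyword numeric.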

1.4. Building a classifier


Follow the instructions in [1]

Examine the output of J48 vs. RandomTree applied to dataset glass.arff

Algorithm    Pruned/unpruned   minNumObj   No. of leaves   Correctly classified instances
J48          unpruned          15          8               131
RandomTree   N/A               N/A         N/A             150

Evaluate the confusion matrix each time you run an algorithm.

J48 (unpruned, minNumObj = 15):

The classifier is skewed towards predicting a = build wind float and b = build wind non-float.

RandomTree:

The classifier is also skewed towards a = build wind float and b = build wind non-float; however,
RandomTree classifies more instances correctly than J48.
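One way to quantify that skew is to compute overall accuracy and per-class recall from the confusion matrix Weka prints. A minimal sketch (the 3-class matrix below is made up for illustration; it is not Weka's actual output for glass.arff):

```python
def evaluate(matrix):
    """Overall accuracy and per-class recall for a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    recalls = [row[i] / sum(row) if sum(row) else 0.0
               for i, row in enumerate(matrix)]
    return correct / total, recalls

# Illustrative matrix for classes a, b, c
m = [[50, 15, 5],   # actual a
     [20, 50, 6],   # actual b
     [10, 10, 10]]  # actual c
acc, recalls = evaluate(m)
print(round(acc, 3), [round(r, 2) for r in recalls])
# -> 0.625 [0.71, 0.66, 0.33]
```

A low recall for a class (here c) is exactly the "skew" visible in the matrix: most of its instances are pulled into the dominant classes.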

1.5. Using a filter


Follow the instructions in [1], and record your remarks:

- Use a filter to remove an attribute.
  What are attributeIndices?

- Remove instances where humidity is high.
  What are nominalIndices?

- Fewer attributes, better classification:

Follow the instructions in [1] and review the outputs of J48 applied to glass.arff:

Filter                                    Leaf size   Correctly classified instances   Remark
Original
Remove Fe
Remove all attributes except RI and Mg
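To see what attributeIndices-style selection does, here is a sketch that mimics Weka's 1-based index semantics in plain Python. The Remove filter itself takes a range string such as "9" or "3-5"; the helper below is hypothetical and for illustration only, not Weka's API:

```python
def remove_attributes(instances, indices):
    """Drop the attributes at the given 1-based positions from each instance,
    mimicking the effect of Weka's Remove filter with attributeIndices set."""
    drop = set(indices)
    return [[v for j, v in enumerate(row, start=1) if j not in drop]
            for row in instances]

data = [['sunny', 'hot', 'high', 'FALSE', 'no'],
        ['overcast', 'hot', 'high', 'FALSE', 'yes']]
# Remove attribute 3 (humidity), as attributeIndices = 3 would
print(remove_attributes(data, [3]))
# -> [['sunny', 'hot', 'FALSE', 'no'], ['overcast', 'hot', 'FALSE', 'yes']]
```

The key point is that Weka's indices are 1-based, so attribute 1 is the first column, not the zeroth.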

1.6. Visualizing your data


Follow the instructions in [1]. How do you find “Visualize classifier errors”?

After running J48 on iris.arff, determine:

- How many instances are predicted wrong?

- What are they?

Instance Predicted class Actual class
