
Introduction to Data Mining

Lab 3 – Simple Classifiers

3.1. Simplicity first!

In this third lab class, we learn how to run and evaluate some simple data mining algorithms on datasets using Weka (see the class 3 lectures by Ian H. Witten [1]).

In this section, we examine how OneR ("one attribute does all the work") works. Open weather.nominal.arff, run OneR, and look at the classifier model. What does it look like?

- Remarks:

Use OneR to build a classifier (effectively a one-level decision tree) for some datasets. Compared with ZeroR, how does OneR perform?

Dataset | OneR accuracy | ZeroR accuracy
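
If you want to script this comparison rather than click through the Explorer, the following is a minimal sketch using Weka's Java API (OneR, ZeroR, Evaluation and DataSource are standard Weka classes; the ARFF file path is an assumption for your local installation):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Lab3OneRVsZeroR {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute ("play") as the class.
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (Classifier cls : new Classifier[] { new OneR(), new ZeroR() }) {
            // Build on the full dataset just to inspect the model Weka prints.
            cls.buildClassifier(data);
            System.out.println(cls);

            // 10-fold cross-validation, matching the Explorer's default test option.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(cls, data, 10, new Random(1));
            System.out.printf("%s accuracy: %.2f%%%n",
                    cls.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

The same experiment can also be run from the command line, e.g. `java weka.classifiers.rules.OneR -t weather.nominal.arff`, where -t names the training file.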

3.2. Overfitting
What is "overfitting"? - Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. It can result from an overly complex model, noise or errors in the data, or an unsuitable fitting criterion, and it leads to poor predictions on new data. To avoid it, use cross-validation, pruning, etc. [ref: http://en.wikipedia.org/wiki/Overfitting]

[1] Ian H. Witten, Data Mining with Weka (MOOC), http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Follow the instructions in [1] to run OneR on the weather.numeric and diabetes datasets.

Write down the results in the following table (using cross-validation):

Dataset                                  | OneR (classifier model; accuracy) | ZeroR (classifier model; accuracy)
weather.numeric                          |                                   |
weather.numeric without the outlook att. |                                   |
diabetes                                 |                                   |
diabetes with minBucketSize = 1          |                                   |

What does OneR's minBucketSize parameter do? -

Remark? -
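
To reproduce the last row of the table above programmatically, OneR's bucket size can be changed through the same API as in the earlier sketch; a hedged fragment (to be placed in a main() like the one above, with the file path again an assumption):

```java
// diabetes.arff with OneR's minBucketSize lowered from its default of 6 to 1.
Instances data = DataSource.read("data/diabetes.arff");
data.setClassIndex(data.numAttributes() - 1);

OneR oneR = new OneR();
oneR.setMinBucketSize(1);    // tiny buckets let the rule follow the training data very closely

oneR.buildClassifier(data);
System.out.println(oneR);    // the rule becomes long and very specific -- a symptom of overfitting

Evaluation eval = new Evaluation(data);
eval.crossValidateModel(oneR, data, 10, new Random(1));
System.out.printf("Cross-validated accuracy: %.2f%%%n", eval.pctCorrect());
```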

3.3. Using probabilities


Lecture of Naïve Bayes: [1]

- Naïve Bayes assumes that all attributes contribute equally and independently, so do not include identical (duplicate) attributes.

Follow the instructions in [1] to examine NaiveBayes on weather.nominal.

Classifier model | Performance (what percentage of the total instances are classified correctly?)
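
A minimal sketch for producing the entries of this table with the Weka API (same assumptions about class names and file paths as in the earlier snippets):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Lab3NaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);   // per-class, per-attribute counts (the "classifier model" column)

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        // pctCorrect() is the percentage of correctly classified instances.
        System.out.printf("Correctly classified instances: %.2f%%%n", eval.pctCorrect());
    }
}
```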

3.4. Decision Trees


Lecture of decision trees: [1]

How to calculate entropy and information gain?

Entropy measures the impurity of a collection.


$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

Information Gain measures the Expected Reduction in Entropy.

Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
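
As a worked illustration (the table below still asks you to derive these values yourself), take the standard 14-instance weather data with 9 "yes" and 5 "no" instances; splitting on outlook gives sunny (2 yes, 3 no), overcast (4 yes, 0 no) and rainy (3 yes, 2 no):

```latex
% Entropy of the whole set (9 yes, 5 no)
\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940

% Entropy of the subsets created by outlook
\mathrm{Entropy}(S_{\mathrm{sunny}}) \approx 0.971,\quad
\mathrm{Entropy}(S_{\mathrm{overcast}}) = 0,\quad
\mathrm{Entropy}(S_{\mathrm{rainy}}) \approx 0.971

% Information gain of splitting on outlook
\mathrm{Gain}(S,\mathrm{outlook})
  = 0.940 - \tfrac{5}{14}(0.971) - \tfrac{4}{14}(0) - \tfrac{5}{14}(0.971)
  \approx 0.247
```

Outlook has the highest gain of the four attributes (the gains of humidity, windy and temperature are roughly 0.15, 0.05 and 0.03), so it is selected at the root.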

Build a decision tree for the weather data step by step:

Compute Entropy and Info. Gain | Selected attribute

Final decision tree

Use Weka to examine J48 on the weather data.
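
A minimal scripted version of this step, under the same assumptions as the earlier sketches (J48 is Weka's implementation of the C4.5 decision-tree learner):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Lab3J48Weather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();       // default settings build a pruned C4.5 tree
        tree.buildClassifier(data);
        System.out.println(tree);   // textual tree -- compare it with the tree you built by hand
    }
}
```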

3.5. Pruning decision trees
Follow the lecture on pruning decision trees in [1].

Why pruning? - Prevent overfitting to noise in the data.

In Weka, look at the J48 learner. What do the parameters minNumObj and confidenceFactor mean?

-
-

Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:

Dataset            | J48 (default, pruned) | J48 (unpruned)
diabetes.arff      |                       |
breast-cancer.arff |                       |
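
A hedged sketch for filling this table through the API (the defaults confidenceFactor = 0.25 and minNumObj = 2 correspond to the "default, pruned" column, and setUnpruned(true) is the API equivalent of the -U option; file paths are assumptions):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Lab3Pruning {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] { "data/diabetes.arff", "data/breast-cancer.arff" }) {
            Instances data = DataSource.read(file);
            data.setClassIndex(data.numAttributes() - 1);

            J48 pruned = new J48();        // defaults: confidenceFactor = 0.25, minNumObj = 2
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);    // switch pruning off

            for (J48 tree : new J48[] { pruned, unpruned }) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));
                System.out.printf("%s (unpruned=%b): %.2f%%%n",
                        file, tree.getUnpruned(), eval.pctCorrect());
            }
        }
    }
}
```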

3.6. Nearest neighbor


Follow the lecture in [1]

“Instance‐based” learning = “nearest‐neighbor” learning

What is k‐nearest‐neighbors (K-NN)? –

Follow the instructions in [1] to run lazy > IBk on the glass dataset with k = 1, 5, and 20, and then fill in its accuracy in the following table:

Dataset | IBk, k = 1 | IBk, k = 5 | IBk, k = 20
Glass   |            |            |
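
A minimal sketch of the same experiment through the API (lazy > IBk in the Explorer is the weka.classifiers.lazy.IBk class; the glass.arff path is an assumption):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Lab3IBk {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[] { 1, 5, 20 }) {
            IBk knn = new IBk();
            knn.setKNN(k);   // number of neighbours used for the majority vote
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("IBk, k = %d: %.2f%%%n", k, eval.pctCorrect());
        }
    }
}
```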
