Lab3 NguyenQuocKhanh ITITIU18186


Interna Nguyen Quoc Khanh – ITITIU18186

Introduction to Data Mining

Lab 3 – Simple Classifiers

3.1. Simplicity first!

In the third class, we learn how to apply some data mining algorithms to datasets using
Weka. (See the lecture of class 3 by Ian H. Witten [1].)

In this section, we learn how OneR (one attribute does all the work) works. Open weather.nominal.arff,
run OneR, and look at the classifier model. What does it look like?



- Remarks: When we run OneR, it selects outlook from the 4 attributes and predicts the class from
outlook's value alone. On the training data, this rule classifies 10 of 14 instances correctly. Under
cross-validation, however, the rule is rebuilt on each training fold and can differ from fold to fold, so
only 6 instances are classified correctly, a success rate of just 42.86%.
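The remark above can be illustrated with a minimal Python sketch of the OneR idea. It is not Weka's implementation: the function name `one_r` is hypothetical, the 14 rows are assumed to match the standard weather.nominal.arff file, and ties between attributes are broken by taking the first one.

```python
from collections import Counter, defaultdict

# Rows assumed to match weather.nominal.arff:
# outlook, temperature, humidity, windy, play
ROWS = [line.split() for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def one_r(rows):
    """Return (attribute index, rule, training hits) of the best one-attribute rule."""
    best = None
    for a in range(len(ATTRIBUTES)):
        # Count the class values seen for each value of attribute a.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        hits = sum(rule[row[a]] == row[-1] for row in rows)
        # Strict '>' keeps the first attribute when two attributes tie.
        if best is None or hits > best[2]:
            best = (a, rule, hits)
    return best


attr, rule, hits = one_r(ROWS)
print(ATTRIBUTES[attr], rule, f"{hits}/{len(ROWS)}")
```

On these rows the sketch picks outlook with 10 of 14 training instances correct, which is the training-set figure quoted in the remark; the cross-validated figure is lower because each fold learns its rule from fewer instances.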

Use OneR to build a one-level decision tree for some datasets. Compared with ZeroR, how does OneR perform?

Dataset                 OneR accuracy (%)   ZeroR accuracy (%)

weather.nominal.arff    42.86               64.29
supermarket.arff        67.21               63.71
glass.arff              57.94               35.51
vote.arff               95.63               61.38
iris.arff               92.00               33.33
diabetes.arff           71.48               65.10
labor.arff              71.93               64.91
soybean.arff            39.97               13.47
breast-cancer.arff      65.73               70.28
credit-g.arff           66.10               70.00
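For reference, the ZeroR baseline in the table simply predicts the majority class. The sketch below is a hedged illustration, not Weka's code: `zero_r` is a hypothetical helper, and the label list assumes the 9 "yes" and 5 "no" instances of the weather data.

```python
from collections import Counter


def zero_r(labels):
    """Return the majority class and its share of the labels, in percent."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, 100.0 * count / len(labels)


# Class distribution assumed for the weather data: 9 yes, 5 no.
labels = ["yes"] * 9 + ["no"] * 5
majority, accuracy = zero_r(labels)
print(majority, round(accuracy, 2))
```

Any classifier that scores below this baseline, as OneR does on weather.nominal.arff, is doing worse than always guessing the most common class.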

3.2. Overfitting
What is “overfitting”? - Overfitting occurs when a statistical model describes random error or noise
instead of the underlying relationship. It is caused by an overly complex model, noise or errors in the
data, or an unsuitable fitting criterion, and it leads to poor prediction on new data. To avoid it, use
cross-validation, pruning... [ref:]
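A small Python sketch can make the gap between training accuracy and held-out accuracy concrete. It memorizes one class per temperature value, which is only loosely analogous to OneR with a very small bucket size and is not Weka's discretization. The (temperature, play) pairs are assumed to match the standard weather.numeric file, and `fit` and `predict` are hypothetical helper names.

```python
from collections import Counter

# (temperature, play) pairs assumed to match weather.numeric
DATA = [(85, "no"), (80, "no"), (83, "yes"), (70, "yes"), (68, "yes"),
        (65, "no"), (64, "yes"), (72, "no"), (69, "yes"), (75, "yes"),
        (75, "yes"), (72, "yes"), (81, "yes"), (71, "no")]


def fit(train):
    """Memorize the majority class for every temperature value seen."""
    by_value = {}
    for value, label in train:
        by_value.setdefault(value, Counter())[label] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    default = Counter(label for _, label in train).most_common(1)[0][0]
    return rule, default


def predict(model, value):
    rule, default = model
    # Values never seen in training fall back to the majority class.
    return rule.get(value, default)


# Accuracy on the data the rule was built from.
model = fit(DATA)
train_hits = sum(predict(model, v) == y for v, y in DATA)

# Leave-one-out: rebuild the rule without instance i, then test on it.
loo_hits = 0
for i, (v, y) in enumerate(DATA):
    held_out_model = fit(DATA[:i] + DATA[i + 1:])
    loo_hits += predict(held_out_model, v) == y

print(f"training: {train_hits}/14, leave-one-out: {loo_hits}/14")
```

The memorized rule looks nearly perfect on its own training data but does much worse on instances it has not seen, which is the pattern that cross-validation is meant to expose.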

Follow the instructions in [1], run OneR on the weather.numeric and diabetes datasets…

Write down the results in the following table: (cross-validation used)

Dataset                        OneR                   ZeroR

weather.numeric                Classifier model:      Classifier model:
                               Accuracy: 42.86        Accuracy: 64.29

weather.numeric                Classifier model:      Classifier model:
w/o outlook att.               Accuracy: 50.00        Accuracy: 64.29

diabetes                       Classifier model:      Classifier model:
                               Accuracy: 71.48        Accuracy: 65.10

diabetes w/                    Classifier model:      Classifier model: None
minBucketSize 1                Accuracy: 57.16        Accuracy: None

MinBucketSize? - 1

Remark? -


3.3. Using probabilities

Lecture of Naïve Bayes: [1]

- All attributes contribute equally and independently, so there should be no identical (redundant) attributes.
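The independence assumption means the class score is just the prior multiplied by one conditional probability per attribute. The sketch below shows this on the weather data under stated assumptions: the rows are assumed to match weather.nominal.arff, `naive_bayes_posterior` is a hypothetical helper, and no smoothing is applied, so Weka's NaiveBayes output may differ slightly.

```python
from collections import Counter

# Rows assumed to match weather.nominal.arff:
# outlook, temperature, humidity, windy, play
ROWS = [line.split() for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]


def naive_bayes_posterior(rows, instance):
    """Return normalized class probabilities for one instance (no smoothing)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(rows)  # prior P(class)
        for a, value in enumerate(instance):
            matches = sum(1 for row in rows if row[-1] == cls and row[a] == value)
            score *= matches / n_cls  # P(attribute value | class)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


posterior = naive_bayes_posterior(ROWS, ["sunny", "cool", "high", "true"])
print({cls: round(p, 3) for cls, p in posterior.items()})
```

For this instance the sketch gives roughly 0.205 for "yes" and 0.795 for "no".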

Follow the instructions in [1] to examine NaiveBayes on weather.nominal

Classifier model Performance

3.4. Decision Trees

Lecture of decision trees: [1]

How to calculate entropy and information gain?

Entropy measures the impurity of a collection.

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$


Information Gain measures the Expected Reduction in Entropy.

Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has
value v.
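The two definitions translate directly into code. The following is a minimal sketch working on class counts; the function names `entropy` and `information_gain` are my own, and the example counts assume the weather data (9 yes, 5 no overall; temperature splits them into 2/2, 4/2 and 3/1).

```python
from math import log2


def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    expected = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - expected


# Whole weather data set: 9 yes, 5 no.
print(round(entropy([9, 5]), 2))

# Split on temperature: hot (2 yes, 2 no), mild (4, 2), cool (3, 1).
print(round(information_gain([9, 5], [[2, 2], [4, 2], [3, 1]]), 2))
```

The first value is the 0.94 entropy of the full data set, and the second is the small gain (about 0.03) obtained by splitting on temperature.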

Build a decision tree for the weather data step by step:

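Compute Entropy and Info. Gain                         Selected attribute

Temperature has 3 distinct values:                     Temperature
- Cool
  4 records, 3 are yes:
  $-\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) = 0.81$
- Hot
  4 records, 2 are yes:
  $-\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) = 1.0$
- Mild
  6 records, 4 are yes:
  $-\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) = 0.92$

Expected new entropy:

$\frac{4}{14}\times 0.81 + \frac{4}{14}\times 1.0 + \frac{6}{14}\times 0.92 = 0.91$
Information Gain:
$0.94 - 0.91 = 0.03$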
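Windy has 2 distinct values:
- False
  8 records, 6 are yes:
  $-\left(\frac{6}{8}\log_2\frac{6}{8} + \frac{2}{8}\log_2\frac{2}{8}\right) = 0.81$
- True
  6 records, 3 are yes:
  $-\left(\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right) = 1.0$
Expected new entropy:
$\frac{8}{14}\times 0.81 + \frac{6}{14}\times 1.0 = 0.89$
Information Gain:
$0.94 - 0.89 = 0.05$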
Humidity has 2 distinct values:                        Humidity
- Normal
  7 records, 6 are yes:
  $-\left(\frac{6}{7}\log_2\frac{6}{7} + \frac{1}{7}\log_2\frac{1}{7}\right) = 0.59$
- High
  7 records, 3 are yes:
  $-\left(\frac{3}{7}\log_2\frac{3}{7} + \frac{4}{7}\log_2\frac{4}{7}\right) = 0.99$
Expected new entropy:
$\frac{7}{14}\times 0.59 + \frac{7}{14}\times 0.99 = 0.79$
Information Gain:
$0.94 - 0.79 = 0.15$
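As a cross-check on the hand calculation, the gain of every attribute can be computed directly from the records. This is a sketch under stated assumptions: the rows are assumed to match weather.nominal.arff, the helper names `entropy` and `gain` are hypothetical, and outlook is included even though it is not worked out in the table above.

```python
from collections import Counter
from math import log2

# Rows assumed to match weather.nominal.arff:
# outlook, temperature, humidity, windy, play
ROWS = [line.split() for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, a):
    """Information gain of splitting rows on the attribute at index a."""
    labels = [row[-1] for row in rows]
    expected = 0.0
    for value in set(row[a] for row in rows):
        subset = [row[-1] for row in rows if row[a] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - expected


gains = {name: round(gain(ROWS, a), 3) for a, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(gains)
print("highest gain:", best)
```

Computed this way, outlook has the highest gain at the root (about 0.247), so it would be the first split when all four attributes are compared.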
Final decision tree

Use Weka to examine J48 on the weather data.

3.5. Pruning decision trees

Follow the lecture of pruning decision tree in [1] …

Why pruning? - Prevent overfitting to noise in the data.

In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?

- minNumObj: the minimum number of instances per leaf.

- confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).

Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:

Dataset               J48 (default, pruned)   J48 (unpruned)

diabetes.arff         73.82%                  72.66%
breast-cancer.arff    75.52%                  69.58%

3.6. Nearest neighbor

Follow the lecture in [1]

“Instance‐based” learning = “nearest‐neighbor” learning

What is k-nearest-neighbors (k-NN)? - Find the k training instances closest to the new instance and predict the majority class among those neighbors.
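The majority vote among neighbors can be sketched in a few lines of Python. This is an illustration, not Weka's IBk: `knn_predict` is a hypothetical helper, Euclidean distance is assumed, and the toy points and labels are made up for the example.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Predict the majority class among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (point, class label)
TRAIN = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.0, 2.0), "b")]

pred_k1 = knn_predict(TRAIN, (1.8, 1.8), k=1)
pred_k5 = knn_predict(TRAIN, (1.8, 1.8), k=5)
print(pred_k1, pred_k5)
```

With k = 1 the single closest point decides the class, while with k = 5 the vote among a wider neighborhood gives a different answer here, which shows why accuracy can change with k.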


Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill in its
accuracy (%) in the following table:

Dataset   IBk, k = 1   IBk, k = 5   IBk, k = 20

Glass     70.56        67.76        65.42
