Lab3 NguyenQuocKhanh ITITIU18186
In this section, we learn how OneR ("one attribute does all the work") works. Open weather.nominal.arff,
run OneR, and look at the classifier model. What does it show?
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Internal – Nguyen Quoc Khanh – ITITIU18186
- Remarks: When we run OneR, it examines each of the 4 attributes and picks outlook, predicting the class
from that single attribute's value. The resulting rule classifies 10/14 training instances correctly.
Cross-validation, however, generates different outcomes: only 6 of the 14 instances are correctly
classified, a success rate of just 42.86%.
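The behaviour described above can be sketched in a few lines of Python (a minimal re-implementation of the OneR idea for nominal attributes, not Weka's code; the standard 14-instance weather.nominal data is typed in directly):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, target="play"):
    """Return (attribute, rule, n_correct) for the best one-attribute rule.

    For every attribute, predict the majority class of each attribute value,
    then keep the attribute whose rule classifies the most training
    instances correctly.
    """
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[attr]][inst[target]] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        correct = sum(counts[rule[v]] for v, counts in by_value.items())
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    return best

# The standard 14-instance weather.nominal data.
rows = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]
instances = [dict(zip(attrs + ["play"], row)) for row in rows]

attr, rule, correct = one_r(instances, attrs)
print(attr, rule, f"{correct}/14")  # outlook wins with 10/14 on the training set
```

On this data humidity also reaches 10/14; the sketch keeps the first attribute that attains the best score, which here agrees with Weka's choice of outlook.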
Use OneR to build one-rule classifiers for some datasets. Compared with ZeroR, how does OneR perform?
3.2. Overfitting
What is “overfitting”? - Overfitting occurs when a statistical model describes random error or noise
instead of the underlying relationship. It arises from an overly complex model, noise or errors in the
data, or an unsuitable fitting criterion, and it leads to poor predictions. To avoid it, use
cross-validation, pruning, etc. [ref: http://en.wikipedia.org/wiki/Overfitting]
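A toy sketch of why held-out evaluation catches overfitting (hypothetical data, for illustration only): the "model" below simply memorizes its training set, so it fits perfectly in training but generalizes no better than guessing on unseen instances.

```python
# Maximally complex model: one "rule" per training instance.
train = [(i, "yes" if i % 2 == 0 else "no") for i in range(50)]
test = [(i, "yes" if i % 2 == 0 else "no") for i in range(50, 100)]

memorized = dict(train)

def predict(x):
    # Unseen instances fall back to a default guess.
    return memorized.get(x, "yes")

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 on training, only 0.5 on held-out data
```

Cross-validation repeats this train/held-out split several times, so a model that only memorizes noise is exposed by its poor held-out accuracy.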
Follow the instructions in [1]: run OneR on the weather.numeric and diabetes datasets, and compare its
cross-validated accuracy with ZeroR:

Dataset                            ZeroR accuracy (%)   OneR accuracy (%)
weather.numeric                    64.29                42.96
weather.numeric w/o outlook att.   64.29                50.00
diabetes                           65.10                71.48
diabetes w/ minBucketSize 1        None                 57.16

minBucketSize? - 1 (the minimum number of instances OneR requires in each bucket when it discretizes a
numeric attribute; setting it to 1 lets the rule chase individual instances and overfit)
Remark? -
Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which
attribute A has value v.
Windy has 2 distinct values: False and True.
False: 8 records, 6 are yes:
Entropy = −((6/8)·log2(6/8) + (2/8)·log2(2/8)) = 0.81
True: 6 records, 3 are yes:
Entropy = −((3/6)·log2(3/6) + (3/6)·log2(3/6)) = 1.00
Expected new entropy:
(8/14)·0.81 + (6/14)·1.00 = 0.89
Information Gain:
0.94 − 0.89 = 0.05
Humidity has 2 distinct values: Normal and High.
Normal: 7 records, 6 are yes:
Entropy = −((6/7)·log2(6/7) + (1/7)·log2(1/7)) = 0.59
High: 7 records, 3 are yes:
Entropy = −((3/7)·log2(3/7) + (4/7)·log2(4/7)) = 0.99
Expected new entropy:
(7/14)·0.59 + (7/14)·0.99 = 0.79
Information Gain:
0.94 − 0.79 = 0.15
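These calculations can be checked with a short script (assuming the standard 14-instance weather data, 9 yes / 5 no overall, so Entropy(S) ≈ 0.94):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

H_S = entropy([9, 5])  # entropy before the split, ~0.94

# Windy: false -> 8 records (6 yes), true -> 6 records (3 yes).
gain_windy = H_S - ((8/14) * entropy([6, 2]) + (6/14) * entropy([3, 3]))

# Humidity: normal -> 7 records (6 yes), high -> 7 records (3 yes).
gain_humidity = H_S - ((7/14) * entropy([6, 1]) + (7/14) * entropy([3, 4]))

print(round(H_S, 2), round(gain_windy, 2), round(gain_humidity, 2))
```

Humidity's larger gain (0.15 vs. 0.05) is why a decision-tree learner would prefer splitting on humidity rather than windy at this node.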
In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor? - minNumObj is
the minimum number of instances allowed at each leaf of the tree; confidenceFactor is the confidence
threshold used for pruning (smaller values produce more aggressive pruning).
Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:
Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then record its
accuracy in the following table:
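A minimal sketch of the k-nearest-neighbour idea behind IBk (pure Python with made-up 1-D points standing in for one glass feature, not the real dataset):

```python
import math
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance, as IBk uses by default)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training points: class "b" forms a small local cluster.
train = [((1.0,), "a"), ((1.2,), "a"), ((1.4,), "a"),
         ((3.0,), "b"), ((3.2,), "b"), ((9.0,), "a")]

print(knn_predict(train, (3.1,), 1))  # nearest point is class "b"
print(knn_predict(train, (3.1,), 5))  # larger k votes in more "a" points
```

This shows why accuracy varies with k in the lab table: small k follows local structure closely (and any noise in it), while large k smooths predictions toward the more frequent class.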