Lab3 Form
In the third class, we are going to learn how to examine some data mining algorithms on datasets using
Weka. (See the class 3 lectures by Ian H. Witten [1].)
In this section, we learn how OneR (one attribute does all the work) works. Open weather.nominal.arff,
run OneR, and look at the classifier model. What does it look like?
- Remarks:
Use OneR to build a one-level decision tree (a rule on a single attribute) for some datasets. Compared with ZeroR, how does OneR perform?
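To make the OneR idea concrete before running it in Weka, here is a minimal Python sketch of OneR and ZeroR on the nominal weather data. It is an illustration under assumptions, not Weka's implementation: the rows are a hand-transcribed copy of the 14 weather instances, the function names `zero_r` and `one_r` are made up for this sketch, and ties between attributes are broken by taking the first one.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Hand-transcribed copy of the 14 nominal weather instances (assumption:
# check against weather.nominal.arff). The last column is the class "play".
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def zero_r(rows):
    """ZeroR: always predict the most frequent class. Returns (class, #correct)."""
    return Counter(r[-1] for r in rows).most_common(1)[0]


def one_r(rows, attributes):
    """OneR: for each attribute, map every value to its majority class and
    keep the attribute whose rule makes the fewest errors on the training data."""
    best = None
    for i, name in enumerate(attributes):
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[i]][r[-1]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:  # strict '<': first attribute wins ties
            best = (name, rule, errors)
    return best


majority, zero_r_correct = zero_r(ROWS)
attribute, rule, errors = one_r(ROWS, ATTRIBUTES)
print(f"ZeroR: always '{majority}', {zero_r_correct}/{len(ROWS)} correct")
print(f"OneR:  {attribute} -> {rule}, {len(ROWS) - errors}/{len(ROWS)} correct")
```

Both figures are accuracies on the training data itself, so they are optimistic; compare them with what Weka reports under cross-validation.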
3.2. Overfitting
What is “overfitting”? - Overfitting occurs when a statistical model describes random error or noise
instead of the underlying relationship. It can be caused by an overly complex model, by noise or errors
in the data, or by an unsuitable fitting criterion, and it leads to poor predictions on new data.
To avoid it, use cross-validation, or pruning... [ref:
http://en.wikipedia.org/wiki/Overfitting]
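The effect can be seen in a small Python sketch. It uses made-up toy data and a simplified bucket rule on one numeric attribute: sort the values, cut them into consecutive buckets of a fixed size, and predict each bucket's majority class. This is only a stand-in for the idea behind a minimum bucket size, not Weka's actual OneR discretization, and all names (`fit_bucket_rule`, `predict`, `accuracy`) are hypothetical.

```python
from collections import Counter

# Toy data (made up for illustration): the label is 1 when x >= 10,
# except for four deliberately flipped "noisy" points.
NOISY = {3, 7, 12, 16}
data = [(x, int(x >= 10) ^ int(x in NOISY)) for x in range(20)]
train = [p for p in data if p[0] % 2 == 0]  # even x: training set
test = [p for p in data if p[0] % 2 == 1]   # odd x: held-out test set


def fit_bucket_rule(points, bucket_size):
    """Sort by x, cut into consecutive buckets, and store each bucket's
    upper bound together with its majority label."""
    pts = sorted(points)
    rule = []
    for start in range(0, len(pts), bucket_size):
        bucket = pts[start:start + bucket_size]
        majority = Counter(label for _, label in bucket).most_common(1)[0][0]
        nxt = start + bucket_size
        # Upper bound: midpoint to the next bucket, or +inf for the last bucket.
        upper = (bucket[-1][0] + pts[nxt][0]) / 2 if nxt < len(pts) else float("inf")
        rule.append((upper, majority))
    return rule


def predict(rule, x):
    """Return the label of the first bucket whose upper bound covers x."""
    for upper, label in rule:
        if x <= upper:
            return label


def accuracy(rule, points):
    return sum(predict(rule, x) == y for x, y in points) / len(points)


for size in (1, 5):
    rule = fit_bucket_rule(train, size)
    print(f"bucket size {size}: train {accuracy(rule, train):.0%}, "
          f"test {accuracy(rule, test):.0%}")
```

With one instance per bucket the rule memorizes the training set, noise included, and does worse on the held-out points than the coarser rule does.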
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Follow the instructions in [1] and run OneR on the weather.numeric and diabetes datasets…
                                   Accuracy:            Accuracy:
weather.numeric w/o outlook att.   Classifier model:    Classifier model:
                                   Accuracy:            Accuracy:
diabetes                           Classifier model:    Classifier model:
                                   Accuracy:            Accuracy:
diabetes w/ minBucketSize 1        Classifier model:    Classifier model:
                                   Accuracy:            Accuracy:
MinBucketSize? -
Remark? -
Classifier model    Performance (what percentage of all instances are classified correctly?)
Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)
$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has
value v.
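As a worked check of the gain formula, the following Python sketch computes the information gain of splitting the nominal weather data on outlook. The function names are made up for this sketch, and the class counts are an assumption taken from the standard 14-instance weather data: 9 yes and 5 no overall, with sunny = 2 yes / 3 no, overcast = 4 yes / 0 no, rainy = 3 yes / 2 no.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(parent_counts, subsets):
    """Gain(S, A): entropy of S minus the size-weighted entropy of each S_v."""
    total = sum(parent_counts)
    after = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - after


# Class counts [yes, no] before the split and for each value of outlook.
gain = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(f"Gain(S, outlook) = {gain:.3f} bits")
```

A pure subset such as overcast (4 yes, 0 no) has entropy 0, which is why splitting on outlook removes a noticeable share of the uncertainty.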
3.5. Pruning decision trees
Follow the lecture on pruning decision trees in [1] …
In Weka, look at the J48 learner. What do the parameters minNumObj and confidenceFactor do?
-
-
Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:
diabetes.arff
breast‐cancer.arff
Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill in its
accuracy for each k in the following table:
Glass
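To see why the choice of k matters, here is a minimal Python sketch of k-nearest-neighbour classification, the method behind IBk. It is not Weka's implementation: the points are made-up toy data (not the glass dataset), `knn_predict` is a hypothetical helper, plain Euclidean distance is assumed, and the k values are chosen to suit the toy data.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Toy 2-D training data (made up): a cluster of 'a', a cluster of 'b',
# and one noisy 'b' sitting inside the 'a' cluster.
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((5.0, 5.0), "b"), ((6.0, 5.5), "b"), ((5.5, 6.5), "b"),
    ((2.2, 1.8), "b"),  # noisy instance
]


def knn_predict(train, query, k):
    """Predict the majority label among the k training points closest to query."""
    neighbours = sorted(train, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


query = (2.1, 1.6)  # lies inside the 'a' cluster, right next to the noisy point
for k in (1, 3, 7):
    print(f"k = {k}: predicted class {knn_predict(train, query, k)}")
```

With k = 1 the single noisy neighbour decides the prediction, a moderate k outvotes it, and a k as large as the whole training set simply returns the overall majority class.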