CS583 Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Training: Learn a model using the training data.
Testing: Test the model using unseen test data to assess the model accuracy:
$\text{Accuracy} = \dfrac{\text{Number of correct classifications}}{\text{Total number of test cases}}$
Here $\Pr(c_j)$ is the probability of class $c_j$ in the data set, and $\sum_{j=1}^{|C|} \Pr(c_j) = 1$.
$entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times entropy(D_j)$
$entropy_{Age}(D) = \frac{5}{15} \times entropy(D_1) + \frac{5}{15} \times entropy(D_2) + \frac{5}{15} \times entropy(D_3)$
$= \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722 = 0.888$

Age      Yes   No   entropy(D_i)
young     2     3   0.971
middle    3     2   0.971
old       4     1   0.722
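The numbers above can be reproduced with a short sketch; the counts come from the Age table, and the helper below is illustrative, not code from the course.

```python
# A short sketch (illustrative, not course code) that reproduces the Age example.
from math import log2

def entropy(counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each value of Age, taken from the table above.
partitions = {"young": [2, 3], "middle": [3, 2], "old": [4, 1]}
n = sum(sum(c) for c in partitions.values())                      # 15 examples

# Weighted average entropy after partitioning D on Age: entropy_Age(D).
entropy_age = sum(sum(c) / n * entropy(c) for c in partitions.values())
print(round(entropy_age, 3))                                      # 0.888
```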
Efficiency: time to construct the model and time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability: understandability of, and insight provided by, the model
Compactness of the model: size of the tree, or the
number of rules.
$p = \dfrac{TP}{TP + FP}, \qquad r = \dfrac{TP}{TP + FN}$
Precision p is the number of correctly classified
positive examples divided by the total number of
examples that are classified as positive.
Recall r is the number of correctly classified positive
examples divided by the total number of actual
positive examples in the test set.
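As a quick illustration of these two definitions, a small sketch with made-up counts might look like this:

```python
# A small sketch of the two definitions above, with made-up counts.
def precision_recall(tp, fp, fn):
    """Precision p = TP/(TP+FP); recall r = TP/(TP+FN)."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision_recall(tp=8, fp=2, fn=4))   # (0.8, 0.666...)
```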
An example
Then we have
[Lift chart: percent of total positive cases (y-axis) versus percent of testing cases (x-axis), comparing the lift curve with the random baseline.]
Classify the test instance into the class $c_j$ such that $\Pr(C = c_j \mid A_1 = a_1, ..., A_{|A|} = a_{|A|})$ is maximal.
Apply Bayes’ Rule:

$\Pr(C = c_j \mid A_1 = a_1, ..., A_{|A|} = a_{|A|})$
$= \dfrac{\Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|})}$
$= \dfrac{\Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\sum_{r=1}^{|C|} \Pr(A_1 = a_1, ..., A_{|A|} = a_{|A|} \mid C = c_r)\,\Pr(C = c_r)}$

With the conditional independence assumption, the denominator can be written as

$\sum_{r=1}^{|C|} \Pr(C = c_r) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_r)$
We are done!
How do we estimate $\Pr(A_i = a_i \mid C = c_j)$? Easy! It is simply the fraction of the class-$c_j$ training examples that have $A_i = a_i$.
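A minimal sketch of this counting estimate; the data and names below are illustrative, not from the slides.

```python
# A minimal sketch of the counting estimate; the data and names are illustrative.
def estimate(examples, i, a, c):
    """Relative-frequency estimate of Pr(A_i = a | C = c) from (attributes, class) pairs."""
    in_class = [x for x, y in examples if y == c]
    if not in_class:
        return 0.0
    return sum(1 for x in in_class if x[i] == a) / len(in_class)

data = [(("young", "high"), "No"),
        (("young", "low"), "Yes"),
        (("old", "low"), "Yes")]
print(estimate(data, i=0, a="young", c="Yes"))   # 0.5
```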
For Naïve Bayes text classification with the multinomial model, let $N_{ti}$ be the number of times word $w_t$ occurs in document $d_i$, so that $\sum_{t=1}^{|V|} N_{ti} = |d_i|$ and, for each class,

$\sum_{t=1}^{|V|} \Pr(w_t \mid c_j; \Theta) = 1. \qquad (25)$

The parameters are estimated (with Laplace smoothing) as

$\Pr(w_t \mid c_j; \hat{\Theta}) = \dfrac{1 + \sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)} \qquad (27)$

$\Pr(c_j \mid \hat{\Theta}) = \dfrac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} \qquad (28)$
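A possible sketch of equations (27) and (28) for the fully supervised case, where $\Pr(c_j \mid d_i)$ is 1 if $d_i$ carries label $c_j$ and 0 otherwise; the documents, labels, and vocabulary below are made up for illustration.

```python
# A sketch of equations (27) and (28) for the fully supervised case, where
# Pr(c_j | d_i) is 1 if document d_i carries label c_j and 0 otherwise.
# Documents, labels, and the vocabulary below are made up for illustration.
import numpy as np

def train_multinomial_nb(docs, labels, vocab, classes):
    word_idx = {w: t for t, w in enumerate(vocab)}
    V = len(vocab)
    word_probs = np.zeros((len(classes), V))
    priors = np.zeros(len(classes))
    for j, c in enumerate(classes):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = np.zeros(V)
        for d in class_docs:
            for w in d:
                counts[word_idx[w]] += 1                     # sum of N_ti over docs of class c
        word_probs[j] = (1 + counts) / (V + counts.sum())    # equation (27)
        priors[j] = len(class_docs) / len(docs)              # equation (28)
    return priors, word_probs

docs = [["cheap", "deal", "deal"], ["meeting", "notes"], ["cheap", "cheap", "offer"]]
labels = ["spam", "ham", "spam"]
vocab = ["cheap", "deal", "meeting", "notes", "offer"]
priors, word_probs = train_multinomial_nb(docs, labels, vocab, ["spam", "ham"])
print(priors)          # [0.667, 0.333]
print(word_probs[0])   # smoothed Pr(w_t | spam)
```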
For SVM, the two constraints $\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b \ge 1$ for $y_i = 1$ and $\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b \le -1$ for $y_i = -1$ are summarized by

$y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \ge 1, \quad i = 1, 2, ..., r.$

The dual formulation, in terms of the Lagrange multipliers $\alpha_i$, is

$L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i, j=1}^{r} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle, \qquad (55)$
In the transformed (feature) space, these computations only require dot products $\langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle$ and never the mapped vector $\phi(\mathbf{x})$ in its explicit form. This is a crucial point.
Thus, if we have a way to compute the dot product $\langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle$ using the input vectors $\mathbf{x}$ and $\mathbf{z}$ directly, there is no need to know the feature vector $\phi(\mathbf{x})$ or even $\phi$ itself.
In SVM, this is done through the use of kernel functions, denoted by $K$:

$K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle \qquad (82)$
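As an illustration (not from the slides), the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\langle \mathbf{x} \cdot \mathbf{z} \rangle)^2$ in two dimensions equals the dot product of the explicit map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, yet never needs to form $\phi(\mathbf{x})$:

```python
# An illustration (not from the slides): the degree-2 polynomial kernel
# K(x, z) = (<x . z>)^2 equals the dot product of the explicit feature map
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), but never forms phi(x) itself.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0 -- dot product in feature space
print(poly_kernel(x, z))        # 1.0 -- same value, computed directly from x and z
```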
[k-nearest neighbor example figure: a new point ●; estimate Pr(science | ●) from its nearest neighbors.]
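A minimal sketch of that estimate, assuming Euclidean distance and counting over the k nearest neighbors (the data and names are illustrative):

```python
# A minimal sketch (illustrative names): estimate Pr(science | new point) as the
# fraction of the k nearest neighbors that are labeled "science".
import numpy as np

def knn_prob(X_train, labels, x_new, k, target="science"):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return sum(1 for i in nearest if labels[i] == target) / k

X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.15, 0.25]])
labels = ["science", "science", "sports", "sports", "science"]
print(knn_prob(X_train, labels, np.array([0.2, 0.2]), k=3))   # 1.0
```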
For bagging, each bootstrap training set is drawn from the original training set by sampling with replacement, for example:

Original         1 2 3 4 5 6 7 8
Training set 1   2 7 8 3 7 6 3 1
Training set 2   7 8 5 6 4 2 7 1
Training set 3   3 6 2 7 5 6 2 2
Training set 4   4 5 1 4 6 4 3 8

(Bagging Predictors, Leo Breiman, 1996)
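Such bootstrap samples can be drawn with a short sketch like the one below, which samples n indices uniformly with replacement from the original n examples (the seed and the actual draws are arbitrary):

```python
# A sketch of how such bootstrap samples can be drawn: each training set samples
# n indices uniformly with replacement from the original n examples.
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(1, 9)                    # examples 1..8, as in the table above
for t in range(4):
    sample = rng.choice(original, size=len(original), replace=True)
    print(f"Training set {t + 1}: {sample}")  # actual draws depend on the seed
```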
Boosting
A family of methods:
We only study AdaBoost (Freund & Schapire, 1996)
Training
Produce a sequence of classifiers (the same base
learner)
Each classifier is dependent on the previous one,
and focuses on the previous one’s errors
Examples that are incorrectly predicted in previous
classifiers are given higher weights
Testing
For a test case, the results of the series of
classifiers are combined to determine the final
class of the test case.
Training set: $(x_1, y_1, w_1), (x_2, y_2, w_2), ..., (x_n, y_n, w_n)$, where the weights $w_i$ are non-negative and sum to 1.
Build a classifier $h_t$ whose accuracy on the weighted training set is greater than ½ (better than random), then change the weights.
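A rough sketch of this training loop, using decision stumps as the base learner and the common exponential re-weighting rule; this is an illustration under those assumptions, not the exact AdaBoost.M1 pseudocode from the slides.

```python
# A rough sketch of the training loop: decision stumps as the base learner and the
# common exponential re-weighting rule. Labels y must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # non-negative weights summing to 1
    ensemble = []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = max(np.sum(w[pred != y]), 1e-10)    # weighted training error
        if err >= 0.5:                            # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)            # misclassified examples gain weight
        w /= w.sum()                              # renormalize to sum to 1
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    """Combine the series of classifiers by a weighted vote."""
    return np.sign(sum(alpha * h.predict(X) for alpha, h in ensemble))
```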
[Figures: Bagged C4.5 vs. C4.5, Boosted C4.5 vs. C4.5, and Boosting vs. Bagging.]
Genetic algorithms
Fuzzy classification