CS583 Chapter 4 Supervised Learning
Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
An example application
An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of newly
admitted patients.
A decision is needed: whether to put a new patient
in an intensive-care unit.
Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
Problem: to predict high-risk patients and
discriminate them from low-risk patients.
The entropy of the data set D is

    entropy(D) = - Σ_{j=1}^{|C|} Pr(c_j) log_2 Pr(c_j),  where Σ_{j=1}^{|C|} Pr(c_j) = 1

Splitting D on an attribute A into subsets D_1, ..., D_v gives

    entropy_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × entropy(D_j)

For the Age attribute in the example data:

    entropy_Age(D) = (5/15) × entropy(D_1) + (5/15) × entropy(D_2) + (5/15) × entropy(D_3)
                   = (5/15) × 0.971 + (5/15) × 0.971 + (5/15) × 0.722
                   = 0.888

    Age      Yes   No   entropy(D_i)
    young     2     3     0.971
    middle    3     2     0.971
    old       4     1     0.722
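As a sanity check on the arithmetic above, here is a minimal Python sketch (function and variable names are ours, not the slides') that recomputes entropy_Age(D) from the Age table:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts in each subset D_i produced by splitting D on Age.
subsets = {"young": (2, 3), "middle": (3, 2), "old": (4, 1)}
D = sum(sum(c) for c in subsets.values())                      # |D| = 15

# entropy_Age(D) = sum_i |D_i|/|D| * entropy(D_i)
entropy_age = sum(sum(c) / D * entropy(c) for c in subsets.values())
print(round(entropy_age, 3))                                   # 0.888
```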
Efficiency
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability:
understandability of, and insight provided by, the model
Compactness of the model: size of the tree, or the
number of rules.
    p = TP / (TP + FP).        r = TP / (TP + FN).
Precision p is the number of correctly classified
positive examples divided by the total number of
examples that are classified as positive.
Recall r is the number of correctly classified positive
examples divided by the total number of actual
positive examples in the test set.
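A minimal sketch of these two measures in Python (the counts below are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0   # correct positives / predicted positives
    r = tp / (tp + fn) if (tp + fn) else 0.0   # correct positives / actual positives
    return p, r

# Hypothetical example: 80 true positives, 20 false positives, 40 false negatives.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)                        # 0.8 and 0.666...
f1 = 2 * p * r / (p + r)           # the F-score combines precision and recall
```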
An example
Then we have the following lift chart:

[Lift chart: x-axis — percent of testing cases (0 to 100%); y-axis — percent of total positive cases (0 to 100%); the lift curve is plotted against the random (diagonal) baseline.]
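A hedged sketch of how such a lift chart can be computed (the function name and the binning into deciles are our choices, not from the slides): rank the test cases by the classifier's score and record, for each fraction of cases examined, the fraction of all positives captured.

```python
def lift_curve_points(scores, labels, n_bins=10):
    """(percent of test cases examined, percent of all positives captured) points,
    with cases ranked by predicted score, most-likely-positive first."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    total_pos = sum(lab for _, lab in ranked)      # labels assumed to be 0/1
    n = len(ranked)
    points = [(0.0, 0.0)]
    for b in range(1, n_bins + 1):
        upto = round(b * n / n_bins)
        captured = sum(lab for _, lab in ranked[:upto])
        points.append((100.0 * upto / n, 100.0 * captured / total_pos))
    return points   # the random baseline is simply the diagonal (x, x)

# Usage sketch: points = lift_curve_points(model_scores, true_labels)
```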
We look for the class c_j such that Pr(C = c_j | A_1 = a_1, ..., A_|A| = a_|A|) is maximal. By Bayes' rule and the conditional independence assumption,

    Pr(C = c_j | A_1 = a_1, ..., A_|A| = a_|A|)
      = Pr(A_1 = a_1, ..., A_|A| = a_|A| | C = c_j) Pr(C = c_j) / Σ_{r=1}^{|C|} Pr(A_1 = a_1, ..., A_|A| = a_|A| | C = c_r) Pr(C = c_r)
      = Pr(C = c_j) Π_{i=1}^{|A|} Pr(A_i = a_i | C = c_j) / Σ_{r=1}^{|C|} Pr(C = c_r) Π_{i=1}^{|A|} Pr(A_i = a_i | C = c_r)
We are done!
How do we estimate Pr(A_i = a_i | C = c_j)? Easy!
    Pr(C = f) Π_{j=1}^{2} Pr(A_j = a_j | C = f) = 1/2 × 1/5 × 2/5 = 1/25
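To make the counting concrete, here is a minimal naive Bayes sketch for categorical attributes. The toy data, attribute values, and class labels below are invented for illustration; they are not the slides' example table.

```python
from collections import Counter, defaultdict

# Toy training data: two categorical attributes and a class label in {"t", "f"}.
data = [
    ("m", "q", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"), ("g", "q", "t"),
    ("g", "q", "f"), ("g", "s", "f"), ("h", "q", "f"), ("h", "q", "f"), ("m", "s", "f"),
]

class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)            # (attribute index, class) -> value counts
for *attrs, c in data:
    for i, a in enumerate(attrs):
        cond_counts[(i, c)][a] += 1

def score(attrs, c):
    """Pr(C=c) * prod_i Pr(A_i=a_i | C=c), all estimated by relative frequencies."""
    s = class_counts[c] / len(data)
    for i, a in enumerate(attrs):
        s *= cond_counts[(i, c)][a] / class_counts[c]
    return s

new = ("m", "q")
print(max(class_counts, key=lambda c: score(new, c)))   # class with the largest score
```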
Each document d_i has |d_i| words, with N_{ti} the number of times word w_t occurs in d_i, so Σ_{t=1}^{|V|} N_{ti} = |d_i|, and the class-conditional word probabilities satisfy

    Σ_{t=1}^{|V|} Pr(w_t | c_j; Θ) = 1.   (25)

With Laplacian smoothing, the word probabilities are estimated as

    Pr(w_t | c_j; Θ̂) = (1 + Σ_{i=1}^{|D|} N_{ti} Pr(c_j | d_i)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N_{si} Pr(c_j | d_i)).   (27)

The class prior probabilities are estimated as

    Pr(c_j | Θ̂) = Σ_{i=1}^{|D|} Pr(c_j | d_i) / |D|.   (28)

A test document d_i is assigned to the class with the highest posterior probability

    Pr(c_j | d_i; Θ̂) = Pr(c_j | Θ̂) Π_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_j; Θ̂) / Σ_{r=1}^{|C|} Pr(c_r | Θ̂) Π_{k=1}^{|d_i|} Pr(w_{d_i,k} | c_r; Θ̂).
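A minimal sketch of the estimation and classification steps above, assuming hard labels (Pr(c_j | d_i) is 1 for the labeled class and 0 otherwise) and Laplacian smoothing as in (27); the toy documents and class names are invented:

```python
from collections import Counter
import math

# Toy labeled documents (already tokenized); text and labels are illustrative only.
docs = [
    (["ball", "team", "score"], "sports"),
    (["team", "win", "score"], "sports"),
    (["atom", "energy", "lab"], "science"),
    (["lab", "experiment", "energy"], "science"),
]
vocab = sorted({w for words, _ in docs for w in words})
labels = sorted({c for _, c in docs})

# Eq. (28) with hard labels: Pr(c_j) = (# docs labeled c_j) / |D|
prior = {c: sum(1 for _, lab in docs if lab == c) / len(docs) for c in labels}

# Eq. (27): Pr(w_t | c_j) = (1 + N_tj) / (|V| + sum_s N_sj), with add-one smoothing
word_counts = {c: Counter() for c in labels}
for words, lab in docs:
    word_counts[lab].update(words)
cond = {c: {w: (1 + word_counts[c][w]) / (len(vocab) + sum(word_counts[c].values()))
            for w in vocab} for c in labels}

def classify(words):
    """argmax_c of log Pr(c) + sum_k log Pr(w_k | c); unseen words are skipped."""
    return max(labels, key=lambda c: math.log(prior[c])
               + sum(math.log(cond[c][w]) for w in words if w in cond[c]))

print(classify(["energy", "experiment", "team"]))   # -> "science" on this toy data
```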
    ||w|| = √(⟨w · w⟩) = √(w_1² + w_2² + ... + w_n²)   (37)
    y_i(⟨w · x_i⟩ + b) ≥ 1,  i = 1, 2, ..., r   summarizes
        ⟨w · x_i⟩ + b ≥ 1    for y_i = 1
        ⟨w · x_i⟩ + b ≤ -1   for y_i = -1.
    L_P = (1/2)⟨w · w⟩ - Σ_{i=1}^{r} α_i [y_i(⟨w · x_i⟩ + b) - 1]   (41)

For the soft-margin problem with slack variables ξ_i:
    Subject to: y_i(⟨w · x_i⟩ + b) ≥ 1 - ξ_i,  i = 1, 2, ..., r
                ξ_i ≥ 0,  i = 1, 2, ..., r
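In practice the problem is handed to an off-the-shelf optimizer rather than solved by hand. A minimal sketch using scikit-learn's SVC (our choice of tool, not the slides') for a soft-margin linear SVM; the parameter C controls how heavily slack (margin violations) is penalized:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny 2-D dataset with labels y_i in {-1, +1}; the values are illustrative only.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)   # larger C -> less slack tolerated

w, b = clf.coef_[0], clf.intercept_[0]
print(y * (X @ w + b))      # functional margins y_i(<w, x_i> + b): >= 1 up to slack
print(clf.support_)         # indices of the support vectors
```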
The SVM computations in (79) and (80) only require dot products ⟨φ(x) · φ(z)⟩ and never the mapped vector φ(x) in its explicit form. This is a crucial point.
Thus, if we have a way to compute the dot product ⟨φ(x) · φ(z)⟩ using the input vectors x and z directly, there is no need to know the feature vector φ(x) or even the mapping φ itself.
In SVM, this is done through the use of kernel functions, denoted by K:
    K(x, z) = ⟨φ(x) · φ(z)⟩   (82)
    ⟨x · z⟩² = (x_1 z_1 + x_2 z_2)²
             = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2²
             = ⟨(x_1², x_2², √2 x_1 x_2) · (z_1², z_2², √2 z_1 z_2)⟩
             = ⟨φ(x) · φ(z)⟩,
where φ(x) = (x_1², x_2², √2 x_1 x_2). This shows that the kernel ⟨x · z⟩² is a dot product in a transformed feature space.
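A quick numeric check of this identity (the values of x and z are arbitrary), comparing the kernel computed in input space with the explicit dot product in feature space:

```python
import math

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 3.0), (2.0, -1.0)
lhs = dot(x, z) ** 2          # K(x, z) = <x . z>^2, computed directly from x and z
rhs = dot(phi(x), phi(z))     # <phi(x) . phi(z)>, computed in the transformed space
print(lhs, rhs)               # both print 1.0: (1*2 + 3*(-1))^2 = 1
```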
Kernel trick
The derivation in (84) is only for illustration
purposes.
We do not need to find the mapping function.
We can simply apply the kernel function directly, by
replacing all the dot products ⟨φ(x) · φ(z)⟩ in (79) and
(80) with the kernel function K(x, z) (e.g., the
polynomial kernel K(x, z) = ⟨x · z⟩^d in (83)).
This strategy is called the kernel trick.
[kNN illustration: a new point — Pr(science | ●)?]
Testing
Classify each new instance by voting of the k
classifiers (equal weights)
Example: bootstrap samples drawn (with replacement) from an original training set of 8 examples:

    Original          1  2  3  4  5  6  7  8
    Training set 1    2  7  8  3  7  6  3  1
    Training set 2    7  8  5  6  4  2  7  1
    Training set 3    3  6  2  7  5  6  2  2
    Training set 4    4  5  1  4  6  4  3  8
(Bagging Predictors, Leo Breiman, 1996)
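A minimal bagging sketch in Python, using scikit-learn decision trees as the base learner (our substitute for C4.5) and integer class labels; the names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    """Train k trees, each on a bootstrap sample (drawn with replacement) of the data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(k))]

def bagging_predict(models, X):
    """Equal-weight majority vote of the k classifiers (labels must be 0, 1, ...)."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage sketch: models = bagging_fit(X_train, y_train)
#               y_hat  = bagging_predict(models, X_test)
```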
Boosting
A family of methods:
We only study AdaBoost (Freund & Schapire, 1996)
Training
Produce a sequence of classifiers (the same base
learner)
Each classifier is dependent on the previous one,
and focuses on the previous one’s errors
Examples that are incorrectly predicted in previous
classifiers are given higher weights
Testing
For a test case, the results of the series of
classifiers are combined to determine the final
class of the test case.
Weighted training set: (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), where the weights wi are non-negative and sum to 1.
Build a classifier ht whose accuracy on the weighted training set is greater than ½ (better than random).
Change the weights and repeat.
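A minimal AdaBoost sketch following this scheme, with decision stumps from scikit-learn as the base learner (our choice) and labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=20):
    """AdaBoost: reweight misclassified examples, combine stumps by weighted vote."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # non-negative weights summing to 1
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = max(np.sum(w[pred != y]), 1e-10)   # weighted error; avoid log(0)
        if err >= 0.5:                           # must beat random guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight
        w = w * np.exp(-alpha * y * pred)        # grow weights of misclassified examples
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Sign of the alpha-weighted vote of the sequence of classifiers."""
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))
```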
[Experimental comparison plots: Bagged C4.5 vs. C4.5; Boosted C4.5 vs. C4.5; Boosting vs. Bagging.]
Genetic algorithms
Fuzzy classification