210 Handout
(Machine Learning Techniques)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
more powerful features for diversity: row p_i other than the natural basis
• projection (combination) with a random row p_i of P: φ_i(x) = p_i^T x
• often consider a low-dimensional projection:
  only d'' non-zero components in p_i
• includes random subspace as a special case:
  d'' = 1 and p_i ∈ natural basis
• original RF considers d' random low-dimensional projections for
  each b(x) in C&RT (a minimal code sketch follows)
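A minimal sketch of such a random-combination branch, assuming Python/NumPy; the helper names random_projection and stump_error_on_projection are hypothetical, and plain 0/1 error stands in for the impurity (e.g. Gini) that C&RT would actually use:

```python
import numpy as np

def random_projection(d, d2, rng):
    """Sample a random row p_i with only d'' (= d2) non-zero components,
    so phi_i(x) = p_i^T x is a low-dimensional random combination."""
    p = np.zeros(d)
    idx = rng.choice(d, size=d2, replace=False)   # which features get combined
    p[idx] = rng.normal(size=d2)                  # random combination weights
    return p

def stump_error_on_projection(X, y, p):
    """Branch b(x) = sign(p^T x - theta): a decision stump on the projected
    feature, i.e. a perceptron in the original input space (y in {-1, +1})."""
    z = X @ p
    best_err, best_theta = np.inf, 0.0
    for theta in np.unique(z):
        pred = np.where(z > theta, 1, -1)
        err = min(np.mean(pred != y), np.mean(pred == y))  # either branch orientation
        if err < best_err:
            best_err, best_theta = err, theta
    return best_err, best_theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)
# try d' candidate projections, keep the one giving the purest split
candidates = [random_projection(10, d2=3, rng=rng) for _ in range(5)]
print(min(stump_error_on_projection(X, y, p)[0] for p in candidates))
```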
Fun Time
Within an RF that contains random-combination C&RT trees, which of the
following hypotheses is equivalent to each branching function b(x)
within the tree?
1 a constant
2 a decision stump
3 a perceptron
4 none of the other choices
Reference Answer: 3
In each b(x), the input vector x is first
projected by a random vector v and then
thresholded to make a binary decision, which
is exactly what a perceptron does.
Bagging Revisited
function Bag(D, A)
  For t = 1, 2, . . . , T
    1 request size-N' data D̃_t by bootstrapping with D
    2 obtain base g_t by A(D̃_t)
  return G = Uniform({g_t})

               g_1    g_2    g_3   ···   g_T
  (x_1, y_1)   D̃_1     ?     D̃_3         D̃_T
  (x_2, y_2)    ?      ?     D̃_3         D̃_T
  (x_3, y_3)    ?     D̃_2     ?          D̃_T
      ···
  (x_N, y_N)   D̃_1    D̃_2     ?           ?

  (a ‘?’ in column g_t means (x_n, y_n) was not sampled into D̃_t,
   i.e. it is out-of-bag (OOB) for g_t)
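A sketch of Bag(D, A) in Python under the assumption that A is a scikit-learn decision tree; bag() and G_uniform() are hypothetical helper names, not a library API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag(X, y, T=25, N_prime=None, rng=None):
    """Bag(D, A): bootstrap T size-N' datasets from D, train one base g_t on each."""
    rng = rng or np.random.default_rng(0)
    N = len(y)
    N_prime = N_prime or N                             # the slides take N' = N
    trees, oob_masks = [], []
    for _ in range(T):
        idx = rng.integers(0, N, size=N_prime)         # bootstrap: sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        oob_masks.append(~np.isin(np.arange(N), idx))  # the '?' entries of the table
    return trees, oob_masks

def G_uniform(trees, X):
    """G = Uniform({g_t}): plain majority vote of the bagged trees (labels in {-1, +1})."""
    return np.sign(np.mean([g.predict(X) for g in trees], axis=0))
```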
if N' = N
• probability for (x_n, y_n) to be OOB for g_t: (1 − 1/N)^N
• if N large:
  (1 − 1/N)^N = 1 / (N/(N−1))^N = 1 / (1 + 1/(N−1))^N ≈ 1/e
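A quick numerical check of this approximation (assuming Python):

```python
import math

# probability that a fixed example is out-of-bag after N' = N bootstrap draws
for N in (10, 100, 1126, 10**6):
    print(f"N = {N:>7}   (1 - 1/N)^N = {(1 - 1/N)**N:.6f}   1/e = {1/math.e:.6f}")
```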
OOB Validation
               g_1    g_2    g_3   ···   g_T      g_1^-     g_2^-    ···   g_M^-
  (x_1, y_1)   D̃_1     ?     D̃_3         D̃_T      D_train   D_train        D_train
  (x_2, y_2)    ?      ?     D̃_3         D̃_T      D_val     D_val          D_val
  (x_3, y_3)    ?     D̃_2     ?          D̃_T      D_val     D_val          D_val
      ···
  (x_N, y_N)   D̃_1     ?      ?           ?       D_train   D_train        D_train

the OOB examples play the role of a validation set: evaluate each (x_n, y_n)
only with G_n^-, the vote of the trees for which it is OOB, giving
E_oob(G) = (1/N) Σ_{n=1}^{N} err(y_n, G_n^-(x_n))
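One way to obtain E_oob(G) in practice, sketched with scikit-learn's bagging implementation (oob_score_ reports OOB accuracy, so the OOB error is one minus that); the toy data set is made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# each (x_n, y_n) is scored only by the trees whose bootstrap sample missed it,
# i.e. the '?' columns of that example in the table above
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
G = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                      oob_score=True, random_state=0).fit(X, y)
print("E_oob(G) ≈", 1 - G.oob_score_)
```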
Fun Time
For a data set with N = 1126, what is the probability that (x_1126, y_1126)
is not sampled after bootstrapping N' = N samples from the data set?
1 0.113
2 0.368
3 0.632
4 0.887
Reference Answer: 2
The value of (1 − 1/N)^N with N = 1126 is about
0.367716, which is close to 1/e ≈ 0.367879.
Feature Selection
for x = (x_1, x_2, . . . , x_d), want to remove
• redundant features: like keeping one of ‘age’ and ‘full birthday’
• irrelevant features: like insurance type for cancer prediction
and only ‘learn’ the subset-transform Φ(x) = (x_{i_1}, x_{i_2}, . . . , x_{i_{d'}})
with d' < d for g(Φ(x))
advantages:
• efficiency: simpler hypothesis and shorter prediction time
• generalization: ‘feature noise’ removed
• interpretability

disadvantages:
• computation: ‘combinatorial’ optimization in training
• overfit: ‘combinatorial’ selection
• mis-interpretability
want: importance(i) for i = 1, 2, . . . , d, so that the top-d' features can be
selected; RF estimates importance(i) with a permutation test on the OOB
examples, comparing E_oob(G) against E_oob^(p)(G), the OOB error obtained after
the values of feature i are permuted (a sketch follows)
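A minimal permutation-test sketch (assuming Python/scikit-learn); it permutes feature values on a held-out set rather than strictly within the OOB examples, and takes importance as permuted error minus original error, so an informative feature gets a positive score and a constant feature gets zero:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
X[:, 3] = 5566.0                                   # a constant feature, as in the quiz below
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

G = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base_err = 1 - G.score(X_val, y_val)               # stands in for E_oob(G)
for i in range(X.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # permute the i-th feature's values
    print(f"importance({i}) ≈ {(1 - G.score(X_perm, y_val)) - base_err:+.4f}")
```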
Fun Time
For RF, if the 1126-th feature within the data set is a constant 5566,
what would importance(i) be?
1 0
2 1
3 1126
4 5566
Reference Answer: 1
When a feature is a constant, permutation
does not change its value. Then, E_oob(G) and
E_oob^(p)(G) are the same, and thus
importance(i) = 0.
Fun Time
Which of the following is not the best use of Random Forest?
1 train each tree with bootstrapped data
2 use E_oob to validate the performance
3 conduct feature selection with permutation test
4 fix the number of trees, T, to the lucky number 1126
Reference Answer: 4
A good value of T can depend on the nature of
the data and the stability of the whole random
process.
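One way to choose T empirically rather than by a lucky number, sketched with scikit-learn: grow the same forest incrementally (warm_start) and watch where the OOB estimate flattens out; the data set here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for T in (25, 50, 100, 200, 400):
    forest.set_params(n_estimators=T)   # warm_start keeps earlier trees, adds new ones
    forest.fit(X, y)
    print(f"T = {T:4d}   OOB accuracy = {forest.oob_score_:.4f}")
# a reasonable T is where the OOB curve stops moving for this particular data set
```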
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models