210 Handout
(Machine Learning Techniques)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
more powerful features for diversity: row p_i other than the natural basis
• projection (combination) with a random row p_i of P: φ_i(x) = p_i^T x
• often consider a low-dimensional projection:
  only d'' non-zero components in p_i
• includes random subspace as a special case:
  d'' = 1 and p_i ∈ natural basis
• original RF considers d' random low-dimensional projections for
  each b(x) in C&RT (a minimal code sketch follows)
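A minimal sketch of such a random-combination branch, assuming Python/NumPy; the helper names random_projection and stump_error_on_projection are hypothetical, and plain 0/1 error stands in for the impurity (e.g. Gini) that C&RT would actually use:

```python
import numpy as np

def random_projection(d, d2, rng):
    """Sample a random row p_i with only d'' (= d2) non-zero components,
    so phi_i(x) = p_i^T x is a low-dimensional random combination."""
    p = np.zeros(d)
    idx = rng.choice(d, size=d2, replace=False)   # which features get combined
    p[idx] = rng.normal(size=d2)                  # random combination weights
    return p

def stump_error_on_projection(X, y, p):
    """Branch b(x) = sign(p^T x - theta): a decision stump on the projected
    feature, i.e. a perceptron in the original input space (y in {-1, +1})."""
    z = X @ p
    best_err, best_theta = np.inf, 0.0
    for theta in np.unique(z):
        pred = np.where(z > theta, 1, -1)
        err = min(np.mean(pred != y), np.mean(pred == y))  # either branch orientation
        if err < best_err:
            best_err, best_theta = err, theta
    return best_err, best_theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)
# try d' candidate projections, keep the one giving the purest split
candidates = [random_projection(10, d2=3, rng=rng) for _ in range(5)]
print(min(stump_error_on_projection(X, y, p)[0] for p in candidates))
```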
Fun Time
Within an RF that contains random-combination C&RT trees, which of the
following hypotheses is equivalent to each branching function b(x)
within the tree?
1 a constant
2 a decision stump
3 a perceptron
4 none of the other choices
Reference Answer: 3
In each b(x), the input vector x is first
projected by a random vector v and then
thresholded to make a binary decision, which
is exactly what a perceptron does.
Bagging Revisited
function Bag(D, A)
  For t = 1, 2, . . . , T
    1 request size-N' data D̃_t by bootstrapping with D
    2 obtain base g_t by A(D̃_t)
  return G = Uniform({g_t})

               g_1    g_2    g_3   ···   g_T
  (x_1, y_1)   D̃_1     ?     D̃_3         D̃_T
  (x_2, y_2)    ?      ?     D̃_3         D̃_T
  (x_3, y_3)    ?     D̃_2     ?          D̃_T
      ···
  (x_N, y_N)   D̃_1    D̃_2     ?           ?

  (a ‘?’ in column g_t means (x_n, y_n) was not sampled into D̃_t,
   i.e. it is out-of-bag (OOB) for g_t)
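A sketch of Bag(D, A) in Python under the assumption that A is a scikit-learn decision tree; bag() and G_uniform() are hypothetical helper names, not a library API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag(X, y, T=25, N_prime=None, rng=None):
    """Bag(D, A): bootstrap T size-N' datasets from D, train one base g_t on each."""
    rng = rng or np.random.default_rng(0)
    N = len(y)
    N_prime = N_prime or N                             # the slides take N' = N
    trees, oob_masks = [], []
    for _ in range(T):
        idx = rng.integers(0, N, size=N_prime)         # bootstrap: sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        oob_masks.append(~np.isin(np.arange(N), idx))  # the '?' entries of the table
    return trees, oob_masks

def G_uniform(trees, X):
    """G = Uniform({g_t}): plain majority vote of the bagged trees (labels in {-1, +1})."""
    return np.sign(np.mean([g.predict(X) for g in trees], axis=0))
```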
if N' = N
• probability for (x_n, y_n) to be OOB for g_t: (1 − 1/N)^N
• if N large:
  (1 − 1/N)^N = 1 / (N/(N−1))^N = 1 / (1 + 1/(N−1))^N ≈ 1/e
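A quick numerical check of this approximation (assuming Python):

```python
import math

# probability that a fixed example is out-of-bag after N' = N bootstrap draws
for N in (10, 100, 1126, 10**6):
    print(f"N = {N:>7}   (1 - 1/N)^N = {(1 - 1/N)**N:.6f}   1/e = {1/math.e:.6f}")
```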
OOB Validation
               g_1    g_2    g_3   ···   g_T      g_1^-     g_2^-    ···   g_M^-
  (x_1, y_1)   D̃_1     ?     D̃_3         D̃_T      D_train   D_train        D_train
  (x_2, y_2)    ?      ?     D̃_3         D̃_T      D_val     D_val          D_val
  (x_3, y_3)    ?     D̃_2     ?          D̃_T      D_val     D_val          D_val
      ···
  (x_N, y_N)   D̃_1     ?      ?           ?       D_train   D_train        D_train

the OOB examples play the role of a validation set: evaluate each (x_n, y_n)
only with G_n^-, the vote of the trees for which it is OOB, giving
E_oob(G) = (1/N) Σ_{n=1}^{N} err(y_n, G_n^-(x_n))
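One way to obtain E_oob(G) in practice, sketched with scikit-learn's bagging implementation (oob_score_ reports OOB accuracy, so the OOB error is one minus that); the toy data set is made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# each (x_n, y_n) is scored only by the trees whose bootstrap sample missed it,
# i.e. the '?' columns of that example in the table above
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
G = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                      oob_score=True, random_state=0).fit(X, y)
print("E_oob(G) ≈", 1 - G.oob_score_)
```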
Fun Time
For a data set with N = 1126, what is the probability that (x_1126, y_1126)
is not sampled after bootstrapping N' = N samples from the data set?
1 0.113
2 0.368
3 0.632
4 0.887
Reference Answer: 2
The value of (1 − 1/N)^N with N = 1126 is about
0.367716, which is close to 1/e ≈ 0.367879.
Feature Selection
for x = (x_1, x_2, . . . , x_d), want to remove
• redundant features: like keeping one of ‘age’ and ‘full birthday’
• irrelevant features: like insurance type for cancer prediction
and only ‘learn’ the subset-transform Φ(x) = (x_{i_1}, x_{i_2}, . . . , x_{i_{d'}})
with d' < d for g(Φ(x))
advantages:
• efficiency: simpler hypothesis and shorter prediction time
• generalization: ‘feature noise’ removed
• interpretability

disadvantages:
• computation: ‘combinatorial’ optimization in training
• overfit: ‘combinatorial’ selection
• mis-interpretability
want: importance(i) for i = 1, 2, . . . , d, so that the top-d' features can be
selected; RF estimates importance(i) with a permutation test on the OOB
examples, comparing E_oob(G) against E_oob^(p)(G), the OOB error obtained after
the values of feature i are permuted (a sketch follows)
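A minimal permutation-test sketch (assuming Python/scikit-learn); it permutes feature values on a held-out set rather than strictly within the OOB examples, and takes importance as permuted error minus original error, so an informative feature gets a positive score and a constant feature gets zero:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
X[:, 3] = 5566.0                                   # a constant feature, as in the quiz below
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

G = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base_err = 1 - G.score(X_val, y_val)               # stands in for E_oob(G)
for i in range(X.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # permute the i-th feature's values
    print(f"importance({i}) ≈ {(1 - G.score(X_perm, y_val)) - base_err:+.4f}")
```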
Fun Time
For RF, if the 1126-th feature within the data set is a constant 5566,
what would importance(i) be?
1 0
2 1
3 1126
4 5566
Reference Answer: 1
When a feature is a constant, permutation
does not change its value. Then, E_oob(G) and
E_oob^(p)(G) are the same, and thus
importance(i) = 0.
Fun Time
Which of the following is not the best use of Random Forest?
1 train each tree with bootstrapped data
2 use E_oob to validate the performance
3 conduct feature selection with permutation test
4 fix the number of trees, T, to the lucky number 1126
Reference Answer: 4
A good value of T can depend on the nature of
the data and the stability of the whole random
process.
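One way to choose T empirically rather than by a lucky number, sketched with scikit-learn: grow the same forest incrementally (warm_start) and watch where the OOB estimate flattens out; the data set here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for T in (25, 50, 100, 200, 400):
    forest.set_params(n_estimators=T)   # warm_start keeps earlier trees, adds new ones
    forest.fit(X, y)
    print(f"T = {T:4d}   OOB accuracy = {forest.oob_score_:.4f}")
# a reasonable T is where the OOB curve stops moving for this particular data set
```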
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models