
STAT 432: Basics of Statistical Learning

Tree and Random Forests

Ruoqing Zhu, Ph.D. <[email protected]>

https://teazrq.github.io/stat432/

University of Illinois at Urbana-Champaign


November 20, 2018

1/50
Classification and Regression Trees (CART)
Tree-based Methods

• Tree-based methods are nonparametric methods that recursively partition the feature space into hyper-rectangular subsets and make predictions on each subset.
• Two main streams of models:
– Classification and Regression Trees (CART): Breiman, Friedman, Olshen and Stone (1984)
– ID3/C4.5: Quinlan (1986, 1993)
• Both are among the top algorithms in data mining (Wu et al., 2008)
• In statistics, CART is more popular.

3/50
Titanic Survival

(Figure: an example classification tree built on the Titanic survival data.)

4/50
Classification and Regression Trees

• Example: independent x1 and x2 from Uniform[−1, 1],

  P(Y = blue | x1² + x2² < 0.6) = 90%
  P(Y = orange | x1² + x2² ≥ 0.6) = 90%

• Existing methods require a transformation of the feature space to deal with this model; trees and random forests do not.
• How does a tree work for classification?

5/50
Example

(Figure: scatter plot of the simulated points; blue points concentrate inside the circle x1² + x2² < 0.6 and orange points outside.)
6/50
Example

(Figure: the same scatter plot shown with successive axis-aligned splits, illustrating how the tree partitions the feature space into rectangles.)
7/50
Example

• There are many popular packages that can fit a CART model: rpart, tree, and party.
• Read the reference manual carefully!

> library(rpart)
> x1 = runif(500, -1, 1)
> x2 = runif(500, -1, 1)
> y = rbinom(500, size = 1, prob = ifelse(x1^2 + x2^2 < 0.6, 0.9, 0.1))
> cart.fit = rpart(as.factor(y) ~ x1 + x2, data = data.frame(x1, x2, y))
> plot(cart.fit)
> text(cart.fit)
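As a quick sanity check (not on the original slide), one can look at the in-sample predictions; predict() with type = "class" is standard rpart usage:

> pred = predict(cart.fit, type = "class")
> table(truth = y, predicted = pred)   # confusion table on the training data
> mean(pred != y)                      # training misclassification rate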

8/50
Example

(Figure: the fitted classification tree as drawn by plot() and text(). The root split is x2 < −0.644432; subsequent splits include x1 < 0.767078, x2 < 0.722725, x1 < −0.76759, x1 < −0.61653, x2 < 0.356438, x1 < 0.547162, x2 < −0.28598, and x2 < 0.369569; terminal nodes are labeled 0 or 1.)
9/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data

Root node

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node

Root node

Splitting rule 1{X^(j) ≤ c}

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node

(Diagram: the root node is split by Age ≤ 45; "No" leads to an internal node and "Yes" leads to node A1.)

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node

(Diagram: the root node is split by Age ≤ 45 into an internal node and node A1; the internal node is split again by Female into nodes A2 and A3.)

10/50
Tree Algorithm: Recursive Partitioning

• Initialize the root node: all training data
• Find a splitting rule 1{X^(j) ≤ c} and split the node
• Recursively apply the procedure on each daughter node
• Predict each terminal node using within-node data

(Diagram: the same tree with a prediction attached to each terminal node: f̂(x) for x ∈ A1, f̂(x) for x ∈ A2, and f̂(x) for x ∈ A3.)

10/50
Classification and Regression Trees

• How to construct the splitting rules?


• Classification problems
• Regression problems
• How to deal with categorical predictors
• Tree pruning

11/50
Constructing Splitting Rules
Splitting Using Continuous Covariates

• Splits on continuous predictors take the form 1{X^(j) ≤ c}
• At a node A with |A| observations

  {(x_i, y_i) : x_i ∈ A, 1 ≤ i ≤ n}

• We want to split this node into two child nodes AL and AR:

  AL = {x ∈ A : x^(j) ≤ c}
  AR = {x ∈ A : x^(j) > c}

• This is done by calculating and comparing the impurity before and after a split.

13/50
Impurity for Classification

• We need to define the impurity criteria for classification and regression problems separately.
• Before the split, we evaluate the impurity of the entire node A using the Gini index.
• Gini impurity is used as the measurement. Suppose we have K different classes; then

  Gini = Σ_{k=1}^K p_k (1 − p_k) = 1 − Σ_{k=1}^K p_k²

• Interpretation: Gini = 0 means a pure node (only one class); a larger Gini means a more diverse node.

14/50
Impurity for Classification

• After the split, we want each child node to be as pure as possible, i.e., the sum of their Gini impurities should be as small as possible.
• Maximize the Gini impurity reduction of the split:

  score = Gini(A) − (|AL|/|A|) Gini(AL) − (|AR|/|A|) Gini(AR),

  where | · | denotes the cardinality (sample size) of a node.

• Note 1: Gini(AL) and Gini(AR) are calculated within their respective nodes.
• Note 2: An alternative (and equivalent) definition is to minimize (|AL|/|A|) Gini(AL) + (|AR|/|A|) Gini(AR).

15/50
Impurity for Classification

• Calculating the Gini index based on the samples is very simple:


• First, for any node A, we estimate the class frequencies p̂_k:

  p̂_k = ( Σ_i 1{y_i = k} 1{x_i ∈ A} ) / ( Σ_i 1{x_i ∈ A} ),

  which is the proportion of samples with class label k in node A.

• Then the Gini impurity is

  Gini(A) = Σ_{k=1}^K p̂_k (1 − p̂_k) = 1 − Σ_{k=1}^K p̂_k²

• Do the same for AL and AR , then calculate the score of a split.
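A small R sketch of these calculations (not from the slides; gini_node() and split_score() are made-up helper names):

# Gini impurity of a node, estimated from its class labels y
gini_node <- function(y) {
  p <- table(y) / length(y)          # p-hat_k for each class k
  1 - sum(p^2)
}

# Gini reduction score of the split 1{x <= c}, given the node's covariate
# column x and labels y (assumes c is an interior cut, so both children are non-empty)
split_score <- function(x, y, c) {
  left <- x <= c
  gini_node(y) -
    mean(left)  * gini_node(y[left]) -
    mean(!left) * gini_node(y[!left])
}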

16/50
Choosing the Split

• To define a split 1{X^(j) ≤ c}, we need to know
• the variable index j
• the cutting point c
• To find the best split at a node, we do an exhaustive search:
• Go through each variable j and all of its possible cutting points c
• For each combination of j and c, calculate the score of that split
• Compare all such splits and choose the one with the best score
• Note: to exhaust all cutting points, we only need to examine the midpoints between consecutive order statistics (see the sketch below).
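A minimal exhaustive search under these assumptions (best_split() is a hypothetical helper, not part of rpart):

# Exhaustive search for the best split at a node (illustration only)
best_split <- function(X, y) {
  gini <- function(y) 1 - sum((table(y) / length(y))^2)
  best <- list(score = -Inf, j = NA, c = NA)
  for (j in seq_len(ncol(X))) {
    xs   <- sort(unique(X[, j]))
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2     # midpoints of order statistics
    for (c in cuts) {
      left  <- X[, j] <= c
      score <- gini(y) - mean(left) * gini(y[left]) - mean(!left) * gini(y[!left])
      if (score > best$score) best <- list(score = score, j = j, c = c)
    }
  }
  best
}

# e.g. best_split(cbind(x1, x2), y) returns the chosen variable index j,
# the cut point c, and the Gini reduction score for the simulated data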

17/50
Other Impurity Measures

• Gini index is not the only measurement.


• ID3/C4.5 uses Shannon entropy from information theory:

  Entropy(A) = − Σ_{k=1}^K p̂_k log(p̂_k)

• Misclassification error:

  Error(A) = 1 − max_{k=1,...,K} p̂_k

• Similarly, we can use these measures to define the reduction of impurity and search for the best splitting rule

18/50
Comparing Impurity Measures

       Class 1   Class 2   p̂1     p̂2     Gini    Entropy   Error
  A       7         3      7/10    3/10   0.420   0.611     0.3
  AL      3         0      3/3     0      0       0         0
  AR      4         3      4/7     3/7    0.490   0.683     3/7

  score_Gini    = 0.420 − (3/10 · 0 + 7/10 · 0.490) = 0.077
  score_Entropy = 0.611 − (3/10 · 0 + 7/10 · 0.683) = 0.133
  score_Error   = 3/10 − (3/10 · 0 + 7/10 · 3/7) = 0
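These numbers are easy to verify; a throwaway R check (natural log for the entropy, which matches the 0.611 and 0.683 above):

p <- function(counts) counts / sum(counts)           # class proportions
gini    <- function(counts) 1 - sum(p(counts)^2)
entropy <- function(counts) -sum(ifelse(p(counts) > 0, p(counts) * log(p(counts)), 0))
error   <- function(counts) 1 - max(p(counts))

A <- c(7, 3); AL <- c(3, 0); AR <- c(4, 3)
gini(A)    - (3/10 * gini(AL)    + 7/10 * gini(AR))     # 0.077
entropy(A) - (3/10 * entropy(AL) + 7/10 * entropy(AR))  # 0.133
error(A)   - (3/10 * error(AL)   + 7/10 * error(AR))    # 0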

19/50
Comparing Different Measures

• Gini index and Shannon entropy are more sensitive to changes in the node probabilities
• They prefer to create more "pure" nodes
• Misclassification error can be used for evaluating a tree, but may not be sensitive enough for building a tree

20/50
Regression Problems

• When the outcome Y is continuous, all we need is a corresponding impurity measure
• Use variance instead of Gini, and consider the weighted variance reduction:

  score = Var(A) − (|AL|/|A|) Var(AL) − (|AR|/|A|) Var(AR),

  where for any node A, Var(A) is just the variance of the node samples:

  Var(A) = (1/|A|) Σ_{i ∈ A} (y_i − ȳ_A)²,

  |A| is the cardinality of A and ȳ_A is the within-node mean.
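The same split-score logic in R for a continuous outcome (a sketch; var_node() and reg_split_score() are made-up names):

# Within-node "variance" as defined on the slide (divide by |A|, not |A| - 1)
var_node <- function(y) mean((y - mean(y))^2)

# Variance reduction of the split 1{x <= c}
reg_split_score <- function(x, y, c) {
  left <- x <= c
  var_node(y) - mean(left) * var_node(y[left]) - mean(!left) * var_node(y[!left])
}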

21/50
Categorical Predictors

• If X^(j) is a categorical variable taking values in {1, . . . , C}, we search for a subset of categories S ⊂ {1, . . . , C} and define the child nodes

  AL = {x ∈ A : x^(j) ∈ S}
  AR = {x ∈ A : x^(j) ∉ S}

• There are at most 2^(C−1) − 1 possible splits
• When C is large, exhaustively searching for the best subset S can be computationally intense.
• In the R randomForest package, C can be at most 53 (and when C is larger than 10, the search is not exhaustive).
• Some heuristic methods are used, such as randomly sampling a subset of {1, . . . , C} as S.
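For illustration (not from the slides), categorical predictors are simply passed as factors; the toy data below are made up:

library(randomForest)
set.seed(1)
n   <- 300
grp <- factor(sample(LETTERS[1:6], n, replace = TRUE))   # categorical predictor, C = 6
x   <- runif(n)
y2  <- factor(rbinom(n, 1, ifelse(grp %in% c("A", "B"), 0.8, 0.2)))
fit <- randomForest(y2 ~ grp + x, data = data.frame(y2, grp, x))
fit$confusion    # out-of-bag confusion matrix; splits on grp use category subsets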

22/50
Overfitting and Tree Pruning

• There is a close connection with the (adaptive) histogram estimator
• A large tree (with too many splits) can easily overfit the data
• Small terminal nodes ⇐⇒ small bias, large variance
• A small tree may not capture important structures
• Large terminal nodes ⇐⇒ large bias, small variance
• Tree size is measured by the number of splits

23/50
Overfitting and Tree Pruning

• Balancing tree size and accuracy follows the same "loss + penalty" framework
• One possible approach is to split tree nodes only if the decrease in the loss exceeds a certain threshold; however, this can be short-sighted
• A better approach is to grow a large tree, then prune it

24/50
Cost-Complexity Pruning

• First, fit the maximum tree Tmax (possibly one observation per
terminal node).
• Specify a complexity penalty parameter α.
• For any sub-tree T ⪯ Tmax of Tmax, calculate

  Cα(T) = Σ_{terminal nodes A of T} |A| · Gini(A) + α|T|
        = C(T) + α|T|,

  where |A| is the cardinality of node A and |T| is the cardinality (number of terminal nodes) of tree T.

• Find the T that minimizes Cα(T)
• A large α gives small trees
• Choose α using cross-validation (or a plot); see the sketch below
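In rpart, cost-complexity pruning is driven by the complexity parameter cp (essentially α rescaled by the root-node risk); a sketch using the cart.fit object from the earlier example:

> printcp(cart.fit)                     # CP table with cross-validated error (xerror)
> plotcp(cart.fit)                      # plot of cross-validated error vs. cp
> best.cp = cart.fit$cptable[which.min(cart.fit$cptable[, "xerror"]), "CP"]
> pruned.fit = prune(cart.fit, cp = best.cp)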

25/50
Missing Values

• If each variable independently has a 5% chance of being missing, then with 50 variables only about 7.7% of the samples have complete measurements (0.95^50 ≈ 0.077).
• The traditional approaches are to discard observations with missing values, or to impute them
• Tree-based methods can handle missing values by either treating them as a separate category, or by using surrogate variables whenever the splitting variable is missing.
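For the surrogate-variable approach, rpart exposes controls in rpart.control(); a sketch (see ?rpart.control for the exact semantics):

> # surrogate splits are on by default; these arguments control how they are used
> fit.na = rpart(as.factor(y) ~ x1 + x2, data = data.frame(x1, x2, y),
+               control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
> summary(fit.na)   # lists the surrogate variables chosen at each split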

26/50
Remark

• Advantages of tree-based methods:
• Handle both categorical and continuous variables in a simple and natural way
• Invariant under all monotone transformations of the variables
• Robust to outliers
• Flexible model structure, captures interactions, easy to interpret
• Limitations:
• Small changes in the data can result in a very different series of splits
• Non-smooth. Other techniques such as multivariate adaptive regression splines (MARS, Friedman 1991) can be used to generate smoother models.

27/50
Random Forests
Weak and Strong Learners

• Back in the mid-to-late 90's, researchers started to investigate whether aggregating "weak learners" (unstable, less accurate) can produce a "strong learner".
• Bagging, boosting, and random forests are all methods along this line.
• Bagging and random forests learn individual trees with some random perturbations, and "average" them.
• Boosting progressively learns models of small magnitude, then "adds" them.
• In general, Boosting, Random Forests ≻ Bagging ≻ Single Tree (in terms of predictive performance).

29/50
Bagging Predictors

• Bagging stands for "Bootstrap aggregating"
• Draw B bootstrap samples from the training dataset, fit CART to each of them, then average the trees
• "Averaging" is symbolic: what we are really doing is getting the prediction from each tree and averaging the predicted values.
• Motivation: CART is unstable; perturbing and averaging can improve stability and lead to better accuracy

30/50
Ensemble of Trees

(Diagram: the training data Dn generate bootstrap samples 1, 2, . . . , B; a tree f̂_1(x), f̂_2(x), . . . , f̂_B(x) is fit to each, and the trees are combined into the final predictor f̂(x).)

31/50
Bagging Predictors

• Bootstrap samples are drawn with replacement. Fit a CART model to each bootstrap sample (each tree may require pruning).
• To combine the bootstrap learners, for classification:

  f̂_bagging(x) = Majority Vote { f̂_b(x) }_{b=1}^B,

  and for regression:

  f̂_bagging(x) = (1/B) Σ_{b=1}^B f̂_b(x)

• Bagging dramatically reduces the variance of the individual learners
• CART can be replaced by other weak learners
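A bare-bones bagging loop in R (a sketch reusing the earlier simulated x1, x2, y; not the implementation behind the slides' figures):

library(rpart)
set.seed(42)
B     <- 100
n     <- length(y)
dat   <- data.frame(x1, x2, y = as.factor(y))
votes <- matrix(0, nrow = n, ncol = B)
for (b in 1:B) {
  idx        <- sample(n, n, replace = TRUE)            # bootstrap sample
  tree.b     <- rpart(y ~ x1 + x2, data = dat[idx, ])   # the b-th tree
  votes[, b] <- as.numeric(as.character(predict(tree.b, dat, type = "class")))
}
bag.pred <- ifelse(rowMeans(votes) > 0.5, 1, 0)          # majority vote
mean(bag.pred != y)                                      # training error of the bagged classifier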

32/50
CART vs. Bagging

(Figure: side-by-side comparison of a single CART fit and the bagged fit.)

33/50
Remarks about Bagging

• Why does bagging work?
• Averaging (nearly) independent copies of f̂(x) can lead to reduced variance
• The "independence" is introduced by bootstrapping
• However, the simple structure of trees is lost after averaging, so the result is difficult to interpret

34/50
Remarks about Bagging

• But the performance of bagging in practice is oftentimes not satisfactory. Why?
• The trees are not really independent...
• Different trees are highly correlated, which makes averaging less effective
• How can we further de-correlate the trees?

35/50
Random Forests

• Several articles came out in the late 90's discussing the advantages of using random features; these papers greatly influenced Breiman's idea of random forests.
• For example, in Ho (1998), each tree is constructed using a randomly selected subset of features
• Random forests take this a step further: at each split, only a random subset of features is considered
• Important tuning parameters: mtry and nodesize

36/50
Tuning Parameter: mtry

• An important tuning parameter of random forests is mtry


• At each split, randomly select mtry variables from the entire set
of features {1, . . . , p}
• Search for the best variable and the splitting point out of these
mtry variables
• Split and proceed to child nodes

37/50
Tuning Parameter: nodesize

• Another important tuning parameter is (terminal) nodesize


• Random forests do not perform pruning!
• Instead, splitting does not stop until the terminal node size is less than or equal to nodesize, and the entire (unpruned) tree is used.
• nodesize controls the trade-off between bias and variance in each tree, similar to k in kNN
• In the most extreme case, nodesize = 1 means each observation is fit exactly, but this is not 1NN!

38/50
Tuning parameters

• A summary of important tuning parameters in random forests (using the R package randomForest):
– ntree: number of trees; set it to be large. Default 500.
– mtry: number of variables considered at each split. Default p/3 for regression, √p for classification.
– nodesize: terminal node size. Default 5 for regression, 1 for classification.
– sampsize: bootstrap sample size, usually n with replacement.
• Overall, tuning is very crucial in random forests (see the sketch below)
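A hedged sketch of fitting a forest with these parameters on the earlier simulated data (the mtry values compared are arbitrary, since p = 2 here):

> library(randomForest)
> dat = data.frame(x1, x2, y = as.factor(y))
> rf.fit = randomForest(y ~ x1 + x2, data = dat,
+                       ntree = 1000, mtry = 1, nodesize = 25)
> rf.fit                      # prints the out-of-bag (OOB) error estimate
> # crude tuning: compare the final OOB error across candidate mtry values
> sapply(1:2, function(m)
+   randomForest(y ~ x1 + x2, data = dat, ntree = 1000, mtry = m)$err.rate[1000, "OOB"])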

39/50
CART vs. Bagging vs. RF

(Figure: three panels comparing the CART, Bagging, and RF fits.)

RF: ntree = 1000, mtry = 1, nodesize = 25

40/50
Smoothness Effect of Random Forests

(Figure: two panels, CART vs. RF, illustrating the smoothing effect of averaging. Age: continuous; Diagnosis: categorical.)

41/50
Random Forests vs. Kernel
Random Forests vs. Kernel

• Random forests are essentially kernel methods


• However, the distance used in random forests is adaptive to the
true underlying structure
• This can be seen from the kernel weights derived from a random forest

43/50
RF vs. Kernel

Random forest kernel at two different target points

44/50
RF vs. Kernel

Gaussian Kernel at two different target points

45/50
Variable Importance
Variable Importance

• Random forests have a built-in variable selection tool: variable importance
• Variable importance utilizes the samples that are not selected by bootstrapping (the out-of-bag data):
• For the b-th tree, use the corresponding out-of-bag data as the testing set to obtain the prediction error Err_b^0
• For each variable j, randomly permute its values among these testing samples and recalculate the prediction error Err_b^j
• Calculate, for each j,

  VI_b^j = Err_b^j / Err_b^0 − 1

• Average VI_b^j across all trees:

  VI_j = (1/B) Σ_{b=1}^B VI_b^j
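In the randomForest package, this permutation importance is computed when importance = TRUE is set at fit time; a sketch using the simulated data from earlier:

> dat = data.frame(x1, x2, y = as.factor(y))
> rf.fit = randomForest(y ~ x1 + x2, data = dat, ntree = 1000, importance = TRUE)
> importance(rf.fit, type = 1)   # type 1: permutation-based mean decrease in accuracy
> varImpPlot(rf.fit)             # plots both importance measures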

47/50
Variable Importance

• This essentially works like a cross-validation:


• the in-bag samples are training samples,
• the out-of-bag samples are testing samples
• a bootstrapped cross-validation
• Usually the misclassification error is used instead of the Gini index
• A higher VI means a larger loss of accuracy when the information in X^(j) is destroyed, hence a more important variable.

48/50
Variable Importance in RF

(Figure: variable importance bar plot; the horizontal axis lists the variables x1 through x50 and the vertical axis shows importance values from 0 to 10.)

Same simulation setting as the "circle" example, with 48 additional noise variables.
49/50
Remarks about Random Forests

• Performs well on high-dimensional data


• Tuning parameters are crucial
• Difficult to interpret
• Adaptive kernel

50/50
