
ML Lec6

The document discusses the transition from decision trees to random forests, highlighting the advantages of using an ensemble of decision trees to improve classification accuracy and generalization. It outlines the algorithm for constructing random forests, including the selection of variables and training samples, as well as the pros and cons of this method. Additionally, it emphasizes the efficiency, speed, and robustness of random forests in handling large datasets and missing values.


Lecture 6: Random Forest

Outline
1. From decision tree to random forest
2. Algorithm
3. Pros & Cons
4. More information

1. From Decision Tree to Random Forest

• To improve a decision tree's generalization capability, we may use a new approach.
• Power of the crowds.

Definition

• A single decision tree does not perform well.
• But it is super fast; what if we learn multiple trees?
• A random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
• The term came from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995.
• The method combines Breiman's "bagging" idea and the random selection of features.

Decision trees

• Decision trees are the individual learners that are combined.
• They are one of the most popular learning methods, commonly used for data exploration.
• One type of decision tree is called CART: classification and regression tree.
• CART uses greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions.
• Regions should be pure with respect to the response variable.
• A simple model is fit in each region: a majority vote for classification, a constant value for regression.
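As a small, concrete reference point for the CART description above, here is a minimal sketch of growing a single, unpruned classification tree. It assumes scikit-learn is available and uses the Iris data purely for illustration; neither is prescribed by the lecture.

# Minimal sketch: a single CART-style tree (assumes scikit-learn; the Iris data
# and parameter choices are illustrative, not from the lecture).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Greedy, top-down, binary recursive partitioning; the tree is left fully grown
# (unpruned), matching the lecture's description.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print("single-tree test accuracy:", tree.score(X_test, y_test))

A single tree like this trains almost instantly, which is what motivates learning many of them.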
Decision trees involve greedy, recursive partitioning.
[Figure: a simple dataset with two predictors, TI and PE, and the greedy, recursive partitioning of the feature space along TI and PE.]

2. Algorithm

Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
3. Choose a training set for this tree by choosing n times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases to estimate the error of the tree by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).

For prediction, a new sample is pushed down each tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the majority vote (classification) or average (regression) over all trees is reported as the random forest prediction.
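The construction above can be written out in a few lines. The following is a minimal illustration, not the lecture's reference implementation: it assumes scikit-learn's DecisionTreeClassifier as the base learner (its max_features option performs the random selection of m variables at each split), and the function names fit_forest / predict_forest are made up for the example.

# Sketch of the random-forest construction described above (steps 1-5 plus prediction).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, m=None, seed=0):
    """Grow n_trees unpruned trees, each on a bootstrap sample, with m random
    candidate variables per split. Assumes integer class labels."""
    N, M = X.shape                        # step 1: N training cases, M variables
    m = m or max(1, int(np.sqrt(M)))      # step 2: m << M (sqrt(M) is a common default)
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)  # step 3: bootstrap sample, drawn with replacement
        tree = DecisionTreeClassifier(    # steps 4-5: m random variables per node, no pruning
            max_features=m, random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    # Push each new sample down every tree and report the majority vote across trees.
    votes = np.array([tree.predict(X) for tree in forest])   # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

With integer labels (e.g., the Iris data from the earlier sketch), predict_forest(fit_forest(X_train, y_train), X_test) reproduces the mode-of-the-trees prediction rule given in the definition.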

Algorithm flow chart
[Figure: flow chart of the random forest construction algorithm above.]

Practical considerations

• Splits are chosen according to a purity measure:
  • e.g., squared error (regression), Gini index or deviance (classification)
• How to select n?
  • Build trees until the error no longer decreases.
• How to select m?
  • Try the recommended default, half of it, and twice it, and pick the best.
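Acting on the "default, half, and double" advice for m can be done by comparing out-of-bag error for the three candidates. A minimal sketch, assuming scikit-learn's RandomForestClassifier and a synthetic dataset chosen only for illustration:

# Sketch: pick m (max_features) by comparing OOB error for the recommended default,
# half of it, and twice it, as suggested above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
M = X.shape[1]
default_m = max(1, int(np.sqrt(M)))            # a common default for classification

for m in (max(1, default_m // 2), default_m, min(M, 2 * default_m)):
    rf = RandomForestClassifier(n_estimators=300, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"m = {m}: OOB error = {1 - rf.oob_score_:.3f}")   # pick the m with lowest error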
3. Pros and Cons

The advantages of random forest:

• It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of which variables are important in the classification.
• It generates an internal unbiased estimate of the generalization error as the forest building progresses.
• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
• It has methods for balancing error in data sets where the class populations are unbalanced.
• Generated forests can be saved for future use on other data.
• Prototypes are computed that give information about the relation between the variables and the classification.
• It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) giving interesting views of the data.
• The capabilities above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
• It offers an experimental method for detecting variable interactions.

Disadvantages

• Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
• For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.

4. Additional information

Estimating the test error:
• While growing the forest, estimate the test error from the training samples.
• For each tree grown, about 37% (roughly 1/e) of the samples are not selected in the bootstrap; these are called the out-of-bag (OOB) samples.
• Using the OOB samples as input to the corresponding tree, predictions are made as if they were novel test samples.
• Through book-keeping, a majority vote (classification) or average (regression) is computed over all trees for each OOB sample.
• Such an estimated test error is very accurate in practice, given a reasonable n.
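The OOB bookkeeping just described can be written out directly. The sketch below is illustrative only (scikit-learn trees, a synthetic dataset, and made-up variable names are assumptions, not part of the slides): each sample accumulates votes only from the trees whose bootstrap sample excluded it, and the error of the resulting majority vote is the OOB test-error estimate.

# Sketch of the OOB (out-of-bag) test-error bookkeeping described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
N, n_trees, n_classes = len(y), 200, len(np.unique(y))
rng = np.random.default_rng(0)
votes = np.zeros((N, n_classes))                     # OOB vote counts per sample and class

for _ in range(n_trees):
    idx = rng.integers(0, N, size=N)                 # bootstrap sample for this tree
    oob = np.setdiff1d(np.arange(N), idx)            # ~37% of samples are out of bag
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1 << 31))).fit(X[idx], y[idx])
    votes[oob, tree.predict(X[oob])] += 1            # treat OOB samples as novel test points

covered = votes.sum(axis=1) > 0                      # samples left out by at least one tree
oob_pred = votes[covered].argmax(axis=1)             # majority vote over the relevant trees
print("OOB test-error estimate:", np.mean(oob_pred != y[covered]))

scikit-learn's RandomForestClassifier(oob_score=True) performs this same bookkeeping internally, as used in the m-selection sketch above.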
Estimating the importance of each input:
• Denote by ê the OOB estimate of the loss when using the original training set, D.
• For each input x_p, where p ∈ {1, ..., k}:
  • Randomly permute the p-th input to generate a new set of samples D' = {(y_1, x'_1), ..., (y_N, x'_N)}.
  • Compute the OOB estimate ê_p of the prediction error with the new samples.
• A measure of the importance of predictor x_p is ê_p - ê, the increase in error due to random perturbation of the p-th predictor (a code sketch is given after the summary below).

Summary:
• Fast, fast, fast!
  • RF is fast to build. Even faster to predict!
  • Practically speaking, not requiring cross-validation for model selection alone speeds training by 10x-100x or more.
  • Fully parallelizable ... to go even faster!
• Automatic predictor (input) selection from a large number of candidates.
• Resistance to overtraining.
• Ability to handle data without preprocessing:
  • data does not need to be rescaled, transformed, or modified
  • resistant to outliers
  • automatic handling of missing values
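Returning to the permutation-importance procedure above, here is a minimal sketch. It assumes a scikit-learn random forest and, for brevity, uses a plain held-out split in place of the OOB samples when computing ê and ê_p; the dataset and names are illustrative. (scikit-learn's sklearn.inspection.permutation_importance implements the same idea.)

# Sketch of the permutation-importance measure ê_p - ê described in section 4.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
e_hat = 1 - rf.score(X_ho, y_ho)                     # ê: error on the unpermuted data

rng = np.random.default_rng(0)
for p in range(X.shape[1]):
    X_perm = X_ho.copy()
    X_perm[:, p] = rng.permutation(X_perm[:, p])     # permute the p-th input to form D'
    e_hat_p = 1 - rf.score(X_perm, y_ho)             # ê_p: error with the permuted input
    print(f"importance of x_{p}: {e_hat_p - e_hat:+.4f}")   # ê_p - ê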

Want to know more?

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
