ML Lec6
4. Additional information
Disadvantages
• Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.

Estimating the test error:
• While growing the forest, the test error can be estimated from the training samples themselves (see the sketch below).
• For each tree grown, roughly 33-36% of the samples are not selected in its bootstrap sample (the fraction (1 - 1/N)^N approaches e^(-1) ≈ 36.8% as N grows); these are called out-of-bag (OOB) samples.
• Using the OOB samples as input to the corresponding tree, predictions are made as if they were novel test samples.
• Through book-keeping, a majority vote (classification) or average (regression) is computed over all OOB predictions from all trees.
• The resulting test-error estimate is very accurate in practice, given a reasonable n.
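A minimal sketch of the OOB error estimate, assuming scikit-learn (the synthetic data from make_classification is purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # oob_score=True scores each training sample using only the trees
    # whose bootstrap sample did NOT contain it.
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)

    print("OOB accuracy:", rf.oob_score_)      # estimated test accuracy
    print("OOB error:   ", 1 - rf.oob_score_)  # estimated test error

No separate validation set or cross-validation loop is needed here, which is the speed advantage noted in the Summary below.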
Estimating the importance of each input:
• Denote by ê the OOB estimate of the loss when using the original training set D.
• For each input xp, where p ∈ {1, …, k}:
  • Randomly permute the pth input to generate a new set of samples D' = {(y1, x'1), …, (yN, x'N)}.
  • Compute the OOB estimate êp of the prediction error with the new samples.
• A measure of the importance of predictor xp is êp − ê, the increase in error due to random permutation of the pth predictor (see the sketch below).
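A sketch of this permutation-importance loop, assuming scikit-learn. For simplicity it measures the error on a held-out split rather than on the OOB samples (scikit-learn does not directly expose per-tree OOB indices), but the êp − ê computation is the same idea; sklearn.inspection.permutation_importance packages the same procedure.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    e_hat = 1 - rf.score(X_val, y_val)  # baseline error ê

    rng = np.random.default_rng(0)
    for p in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, p] = rng.permutation(X_perm[:, p])      # permute pth input
        e_p = 1 - rf.score(X_perm, y_val)                 # êp on permuted data
        print(f"importance of x{p}: {e_p - e_hat:+.4f}")  # êp − ê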
Summary:
• Fast, fast, fast!
  • RF is fast to build, and even faster to predict.
  • Practically speaking, not requiring cross-validation for model selection alone speeds up training by 10x-100x or more.
  • Fully parallelizable … to go even faster (see the sketch below)!
• Automatic selection of predictors (inputs) from a large number of candidates.
• Resistance to overtraining.
• Ability to handle data without preprocessing:
  • data does not need to be rescaled, transformed, or modified
  • resistant to outliers
  • automatic handling of missing values
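Since every tree is grown and queried independently, training and prediction parallelize trivially. A small sketch, again assuming scikit-learn, where n_jobs=-1 uses all available cores:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

    # n_jobs=-1 builds the independent trees on all available cores;
    # the same setting also parallelizes predict().
    rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)
    rf.fit(X, y)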
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm