Boosting and Additive Trees (2)

Yi Zhang, Kevyn Collins-Thompson


Advanced Statistical Seminar 11-745
Oct 29, 2002
Recap: Boosting (1)
• Background: Ensemble Learning
• Boosting Definitions, Example
• AdaBoost
• Boosting as an Additive Model
• Boosting Practical Issues
• Exponential Loss
• Other Loss Functions
• Boosting Trees
• Boosting as Entropy Projection
• Data Mining Methods
Outline for This Class
• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
– Right sized trees for boosting
– Number of iterations
– Regularization
• Understand the final model (Interpretation)
– Single variable
– Correlation of variables
Numerical Optimization
• Goal: Find the f that minimizes the loss function over the training data (a sketch follows this slide):

  $$\hat{f} = \arg\min_{f} L(f) = \arg\min_{f} \sum_{i=1}^{N} L(y_i, f(x_i))$$

• Gradient descent search in the unconstrained function space to minimize the loss on the training data:

  $$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}, \qquad \mathbf{g}_m = (g_{1m}, g_{2m}, \ldots, g_{Nm})^{T}$$

  $$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m)$$

  $$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m\, \mathbf{g}_m$$

• The loss on the training data converges to zero: $\mathbf{f}_m = \{f_m(x_1), f_m(x_2), \ldots, f_m(x_N)\} \rightarrow \{y_1, y_2, \ldots, y_N\}$
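The steepest-descent recursion above can be made concrete with a short numeric sketch. Python and NumPy are assumptions (the slides name no implementation language), and squared-error loss is assumed so that the gradient components are simply $f(x_i) - y_i$.

```python
import numpy as np

# Minimal sketch of gradient descent in function space, assuming
# squared-error loss L(y, f) = (y - f)^2 / 2, so g_im = f(x_i) - y_i.
# The "function" is represented only by its N fitted values on the
# training points, as on the slide.
def functional_gradient_descent(y, n_steps=10):
    f = np.zeros_like(y, dtype=float)       # f_0
    for m in range(n_steps):
        g = f - y                           # gradient components g_im
        gg = np.dot(g, g)
        if gg == 0:                         # training loss already zero
            break
        rho = np.dot(g, f - y) / gg         # exact line search: argmin_rho L(f - rho*g)
        f = f - rho * g                     # steepest-descent update f_m
    return f

y = np.array([1.0, 2.0, 3.0])
print(functional_gradient_descent(y))       # reproduces y: training loss -> 0
```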
Gradient Search on Constrained Function Space: Gradient Tree Boosting
• Introduce a tree at the m-th iteration whose predictions $t_m$ are as close as possible to the negative gradient (see the sketch below):

  $$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \left( -g_{im} - T(x_i; \Theta) \right)^2$$

• Advantage compared with unconstrained gradient search: robust, less likely to overfit
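A hedged sketch of one constrained step, assuming Python with scikit-learn (neither is named in the slides) and squared-error loss, so the negative gradient is just the ordinary residual; the data and tree size are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data standing in for the training set and the current fit f_{m-1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
f_prev = np.zeros_like(y)

neg_gradient = y - f_prev                       # -g_m for squared-error loss
tree = DecisionTreeRegressor(max_leaf_nodes=8)  # small tree: the weak learner T(x; Theta)
tree.fit(X, neg_gradient)                       # least-squares fit to the negative gradient
```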
Algorithm 3: MART
1. Initialize $f_0(x)$ to a single terminal node tree
2. For $m = 1$ to $M$:

   a) Compute pseudo-residuals $r_{im}$ based on the loss function

   b) Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm},\ j = 1, 2, \ldots, J_m$

   c) Find the optimal value of the coefficient within each region $R_{jm}$:
      $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma\big)$$

   d) $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$

   End For
3. Output: $\hat{f} = f_M$
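A minimal sketch of Algorithm 3 for the special case of squared-error loss, again assuming Python with scikit-learn: the pseudo-residuals are $y - f_{m-1}(x)$ and the optimal per-region coefficient is the mean residual in that region. Other losses change only steps (a) and (c). The helper names mart_fit and mart_predict are hypothetical, not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, M=100, J=8):
    f0 = y.mean()                                # step 1: best single-terminal-node tree
    f = np.full(len(y), f0)
    trees, gammas = [], []
    for m in range(M):
        r = y - f                                                  # (a) pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)   # (b) terminal regions R_jm
        leaf = tree.apply(X)                                       # region index of each x_i
        gamma = {j: r[leaf == j].mean() for j in np.unique(leaf)}  # (c) gamma_jm per region
        f = f + np.array([gamma[j] for j in leaf])                 # (d) f_m = f_{m-1} + sum_j gamma_jm I(x in R_jm)
        trees.append(tree)
        gammas.append(gamma)
    return f0, trees, gammas

def mart_predict(X, f0, trees, gammas):
    f = np.full(X.shape[0], f0)
    for tree, gamma in zip(trees, gammas):
        f += np.array([gamma[j] for j in tree.apply(X)])
    return f
```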
View Boosting as a Linear Model
• Basis expansion:
– Use basis functions $T_m$ (m = 1..M, each $T_m$ a weak learner) to transform the input vector X into T-space, then use a linear model in this new space
• Special to boosting: the choice of basis function $T_m$ depends on $T_1, \ldots, T_{m-1}$
Improve Boosting as a Linear Model
• Bias-variance trade-off
Recap: Linear Models in Chapter 3
1. Subset selection (feature selection, discrete)
2. Coefficient shrinkage (smoothing: ridge, lasso)
3. Using derived input directions (PCA, PLS)
• Multiple outcome shrinkage and selection
– Exploit correlations in different outcomes
This Chapter: Improve Boosting
1. Size of the constituent trees J
2. Number of boosting iterations M (subset selection)
3. Regularization (shrinkage)
Right-Sized Trees for Boosting (?)
• The best tree for one step is not the best in the long run
– Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation
• Simple approach: restrict all trees to the same size J
• J limits the interaction level among the input features in the tree-based approximation (see the sketch below)
• In practice low-order interaction effects tend to dominate, and empirically $4 \le J \le 8$ works well (?)
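One way to realize the "same size J for every tree" idea, assuming scikit-learn's gradient boosting as the implementation: max_leaf_nodes plays the role of J, and the value 6 is just an illustrative choice inside the 4 ≤ J ≤ 8 range, not something prescribed by the slides.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative data; every constituent tree is capped at J = 6 terminal nodes.
X, y = make_friedman1(n_samples=500, random_state=0)
model = GradientBoostingRegressor(
    n_estimators=200,
    max_leaf_nodes=6,   # J: same size for all trees, bounding the interaction level
    random_state=0,
).fit(X, y)
```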
Number of Boosting Iterations (Subset Selection)
• Boosting will overfit as $M \rightarrow \infty$
• Use a validation set to choose M (see the sketch below)
• Other methods … (later)
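A sketch of the validation-set approach, assuming scikit-learn; the dataset, split, and settings are illustrative. The learning_rate parameter here is the shrinkage factor ν discussed on the next slide; a smaller ν typically pushes the best M higher.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Fit with a generous M, score the staged predictions f_1, ..., f_M on a
# held-out set, and keep the iteration with the lowest validation error.
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  random_state=0).fit(X_tr, y_tr)
val_err = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]
best_M = int(np.argmin(val_err)) + 1
print("best number of boosting iterations M:", best_M)
```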
Shrinkage
• Scale the contribution of each tree by a factor $0 < \nu < 1$ to control the learning rate (the learning_rate parameter in the validation sketch above plays the role of $\nu$):

  $$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$$

• Both $\nu$ and M control prediction risk on the training data, and they operate dependently: $\nu \downarrow\ \Rightarrow\ M \uparrow$
Penalized Regression
• Ridge regression or lasso regression (see the sketch below):

  $$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

  $$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \quad \text{(ridge regression, L2 norm)}$$

  $$J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{(lasso, L1 norm)}$$
Algorithm 4: Forward Stagewise Linear
1. Initialize $\check{\alpha}_k = 0,\ k = 1, \ldots, K$; set $\varepsilon > 0$ to some small constant and M large
2. For $m = 1$ to $M$:

   a) $(\beta^*, k^*) = \arg\min_{\beta,\, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \check{\alpha}_l T_l(x_i) - \beta\, T_k(x_i) \Big)^2$

   b) $\check{\alpha}_{k^*} \leftarrow \check{\alpha}_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$

3. Output: $f_M(x) = \sum_{k=1}^{K} \check{\alpha}_k T_k(x)$
If $\hat{\alpha}(\lambda)$ is monotone in $\lambda$, then $\sum_k |\alpha_k| = \varepsilon \cdot M$ and the solution of Algorithm 4 is identical to the lasso regression result described on page 64, with $(\varepsilon, M)$ playing the role of $\lambda$ in the lasso regression.
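A minimal NumPy sketch of Algorithm 4. The helper name forward_stagewise is hypothetical, T is assumed to be the N x K matrix of basis-function values $T_k(x_i)$ (for example the tree design matrix from the previous sketch), and the values of eps and M are illustrative.

```python
import numpy as np

def forward_stagewise(T, y, eps=0.01, M=5000):
    N, K = T.shape
    alpha = np.zeros(K)
    for m in range(M):
        resid = y - T @ alpha
        # (a) best single basis function and its least-squares coefficient beta
        betas = T.T @ resid / np.sum(T ** 2, axis=0)
        sse = np.sum(resid ** 2) - betas ** 2 * np.sum(T ** 2, axis=0)
        k_star = int(np.argmin(sse))
        # (b) take a small step of size eps in the direction of sign(beta*)
        alpha[k_star] += eps * np.sign(betas[k_star])
    return alpha
```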
More About Algorithm 4
• Algorithm 4 ≈ Algorithm 3 + shrinkage
• L1 norm vs. L2 norm: more details later
– Chapter 12, after learning SVMs
Interpretation: Understanding the Final Model
• Single decision trees are easy to interpret
• A linear combination of trees is difficult to understand
– Which features are important?
– What are the interactions between features?
Relative Importance of Individual Variables
– For a single tree T, define the importance of $x_\ell$ as

  $$\mathcal{I}_\ell^2(T) = \sum_{\text{nodes splitting on } x_\ell} \big(\text{improvement in squared-error risk over a constant fit over the region}\big)$$

– For additive trees, define the importance of $x_\ell$ as

  $$\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)$$

– For K-class classification, just treat it as K two-class classification tasks (see the sketch below)
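For a boosted model fit with scikit-learn, the feature_importances_ attribute reports an impurity-based (squared-error reduction) importance averaged over the constituent trees and normalized, roughly in the spirit of the definition above; the data and settings here are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
model = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)
for j, imp in enumerate(model.feature_importances_):
    print(f"x{j}: relative importance {imp:.3f}")
```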
Partial Dependence Plots
• Visualize the dependence of the approximation f(x) on the joint values of important features
• Usually the size of the subset is small (1-3)
• Define the average or partial dependence:

  $$f_S(X_S) = E_{X_C} f(X_S, X_C) = \int f(X_S, X_C)\, P(X_C)\, dX_C$$

• It can be estimated empirically using the training data (see the sketch below):

  $$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
10.50 vs. 10.52

$$\text{10.50:} \quad f_S(X_S) = E_{X_C} f(X_S, X_C) = \int_{X_C} f(X_S, X_C)\, P(X_C)\, dX_C$$

$$\text{10.52:} \quad \tilde{f}_S(X_S) = E\big(f(X_S, X_C) \mid X_S\big) = \int_{X_C} f(X_S, X_C)\, P(X_C \mid X_S)\, dX_C$$

• The two are the same if the predictor variables are independent
• Why use 10.50 instead of 10.52 to measure partial dependence?
– Example 1: $f(X) = h_1(X_S) + h_2(X_C)$

  $$\text{10.50:} \quad f_S(X_S) = \int_{X_C} \big(h_1(X_S) + h_2(X_C)\big) P(X_C)\, dX_C = h_1(X_S) + \int h_2(X_C)\, P(X_C)\, dX_C = h_1(X_S) + \text{constant}$$

– Example 2: $f(X) = h_1(X_S) \cdot h_2(X_C)$, for which 10.50 similarly gives $h_1(X_S) \cdot \int h_2(X_C)\, P(X_C)\, dX_C = h_1(X_S) \times \text{constant}$
Conclusion
• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
– Right sized trees for boosting
– Number of iterations
– Regularization
• Understand the final model (interpretation)
– Single variable
– Correlation of variables
