
CSCE-421 Machine Learning

Boosting

Instructor: Guni Sharon 1


Examples by: Kilian Weinberger
Announcements
 Midterm on Tuesday, November-23 (in class)
 Covering all topics up to the exam date
 Written exam
 One theoretical question and four multiple-choice questions
 We will have a preparation class (Nov-18)
 Go over the proofs (lectures + assignments). If unsure, post questions on Campuswire; I
will address specific pre-asked questions on Nov-18.
 Due:
 Assignment (P3): SVM, linear regression and kernelization, due Tuesday Nov-16
 Quiz 5: decision trees and bagging, due Thursday Nov-18
 Assignment (P4): Decision trees, due Thursday Nov-25
2
Disadvantages of Bagging
 Loss of interpretability: the underlying model might be
interpretable, e.g., decision trees. However, an ensemble
prediction is harder to make sense of.
 Computationally expensive: Bagging slows down and grows more
intensive as the number of iterations increases
 Unstable benefit (across models): Bagging provides little benefit for
models that already have low variance
 “bagging a linear regression model will effectively just return the original
predictions” [Hands-On Machine Learning, Boehmke & Greenwell]
3
Random Forest
 A Random Forest is essentially bagged decision trees with a
modified splitting criterion
1. Sample $m$ data sets $D_1, \dots, D_m$ from $D$ with replacement.
2. For each $D_i$, train a decision tree $h_i$ with the modified splitting criterion.
3. Predict: $\bar{h}(x) = \frac{1}{m}\sum_{i=1}^{m} h_i(x)$ (average / majority vote).
 Splitting criterion: split on the feature that maximizes IG, but don't
consider all possible features
 Consider only a random subsample of $k \le d$ features at each split (see the sketch below)

4
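A minimal sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier is available and using its max_features option to implement the random feature subsample at each split (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, m=100, k="sqrt", seed=None):
    """Train m trees, each on a bootstrap sample, splitting on a random feature subset."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)               # sample n points with replacement
        tree = DecisionTreeClassifier(max_features=k)  # random subset of features per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    """Majority vote over the individual tree predictions."""
    votes = np.array([t.predict(X) for t in trees])    # shape (m, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```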
Advantages of Random Forest
 Easy to implement; works well "out of the box"
 Only two hyperparameters: the number of trees $m$ and the number of
features $k$ sampled at each split
 RF is not sensitive to the hyperparameter values
 Known values that usually work well: $k = \sqrt{d}$, and $m$ as large as you
can afford (see the usage example below)
 Insensitive to the feature domains (scale, magnitude, missing values)
 Doesn't require data preprocessing
5
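For reference, a hedged usage example with scikit-learn's RandomForestClassifier, which exposes exactly these two knobs as n_estimators and max_features (the dataset chosen here is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# m = n_estimators (as large as you can afford), k = max_features ("sqrt" of d)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")

# No scaling or other preprocessing of X is needed.
print(cross_val_score(rf, X, y, cv=5).mean())
```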
What about bias?
 Trees with high variance – use bagging!
 What if we have high bias = underfitting = the model is too weak
= can’t capture data structure
 Both training and testing losses are high
 More training data won’t help
 Can this problem be addressed with an ensemble approach?
 Can weak learners be combined to generate a strong learner with low
bias?
 A weak learner is only slightly better than random guessing
7
Ensemble loss
 The loss of an ensemble $H$: $\ell(H) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(H(x_i), y_i\big)$
 where $H(x) = \sum_{t} \alpha_t h_t(x)$
 Instead of the unweighted average used in bagging, $\frac{1}{m}\sum_{i=1}^{m} h_i(x)$,
we now consider a scaled (weighted) sum of the ensemble members

 Question: can we define a new member $h$ of our ensemble that, if added,
will reduce the ensemble loss?

 Claim: when we group multiple weak classifiers, with each one
progressively learning from the others' wrongly classified objects, we can
build one such strong model
8
Boosting
 Schapire, Robert E. (1990). "The Strength of Weak Learnability”

 How can we approximate the loss around a known point, i.e., $\ell(H + \alpha h)$
for the current ensemble $H$ and a candidate new learner $h$?

 Use a 2nd-order Taylor series expansion around $H$
 The derivative of summed terms = the summation of the terms' derivatives,
so $\nabla\ell(H)$ decomposes over the training samples
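As a reference for the expansion the slide invokes, here is the Taylor series of the ensemble loss, viewed as a function of the prediction vector $(H(x_1), \dots, H(x_n))$ and perturbed by $\alpha h$ (this write-up, including the notation, is my own sketch of the step):

$$
\ell(H + \alpha h) = \ell(H)
+ \alpha \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i)
+ \frac{\alpha^2}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\partial^2 \ell}{\partial H(x_i)\, \partial H(x_j)}\, h(x_i)\, h(x_j)
+ O(\alpha^3)
$$

For a small step size $\alpha$ the first-order term dominates, which is what the next slide uses to pick $h$.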
Boosting

$\ell(H + \alpha h) \;\approx\; \underbrace{\ell(H)}_{\text{independent of } h} \;+\; \underbrace{\alpha}_{\text{scalar}} \;\Big\langle \underbrace{\nabla \ell(H)}_{\text{independent of } h},\; h \Big\rangle$

 Makes sense!
 $\nabla\ell(H)$ = how to change the current predictions such that the loss is increased
 Find a new classifier $h$ that points in the other direction
 The inner product is minimized for opposing vectors

10
Example

 $\nabla\ell(H)$ gives the update (direction) for $H$ that will maximize the loss; the new learner should point the opposite way

11
Gradient boost
 Task: train a new tree $h$ such that $\sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i) < 0$
 Must be better than random:
 = the inner product of $h$ with $\nabla\ell(H)$ is negative (angle > 90°)
 = taking a step $\alpha h$ moves in the right direction, i.e., reduces the loss

12
Gradient boost

 Consider the magnitude of $h$, $\sum_i h(x_i)^2$, as some constant $c$:
$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i)
 = \operatorname*{argmin}_{h} \sum_{i=1}^{n} \Big(\frac{\partial \ell}{\partial H(x_i)} + h(x_i)\Big)^2$
 Adding a constant ($\sum_i h(x_i)^2 = c$) does not change the argmin
 $\sum_i \big(\frac{\partial \ell}{\partial H(x_i)}\big)^2$ is independent of $h$, so it does not change the argmin

13
Gradient boost

 Consider the squared loss $\ell(H) = \frac{1}{2}\sum_{i=1}^{n} \big(H(x_i) - y_i\big)^2$

 $\frac{\partial \ell}{\partial H(x_i)} = H(x_i) - y_i$ = the current error of $H$ on $x_i$

$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i=1}^{n} \big(h(x_i) + H(x_i) - y_i\big)^2$

 That is, train a new model to minimize the squared difference
between its output $h(x_i)$ and the current error in $H$ (i.e., fit the residuals $y_i - H(x_i)$)

14
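A tiny numeric illustration of this identity (the arrays and names are made up for illustration): with squared loss, the negative gradient at each training point is exactly the residual the new learner should fit.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0, 0.5])   # targets y_i
H = np.array([2.5, -0.2, 2.0, 1.0])   # current ensemble predictions H(x_i)

grad = H - y                          # d/dH(x_i) of (1/2) * sum (H(x_i) - y_i)^2
residuals = y - H                     # what the new learner h should predict

assert np.allclose(grad, -residuals)  # the negative gradient equals the residual
```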
Sanity check

 Currently:
 However:
 Minimum value for is at
 Adding the new learner to would reduce the error

 Assuming step size

15
Gradient boost for trees
 Hypothesis space: all regression trees with limited depth (usually a small fixed depth)
 Highly biased model = weak learner

1. Until convergence:
 train a regression tree $h$ minimizing $\sum_i \big(h(x_i) - (y_i - H(x_i))\big)^2$,
i.e., fit the residuals instead of the original labels $y_i$
 update $H \leftarrow H + \alpha h$ (a runnable sketch follows below)

 Hyperparameters: the step size $\alpha$ and the tree depth

16
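A minimal gradient-boosted-regression-trees sketch for squared loss, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the constant initialization, the fixed number of rounds, and all names are illustrative choices, not the lecture's prescription:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, alpha=0.1, max_depth=3):
    """Squared-loss gradient boosting: each tree is fit to the current residuals."""
    base = y.mean()
    H = np.full(len(y), base)            # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - H                # negative gradient of (1/2) * sum (H - y)^2
        h = DecisionTreeRegressor(max_depth=max_depth)  # weak (high-bias) learner
        h.fit(X, residuals)
        H += alpha * h.predict(X)        # H <- H + alpha * h
        trees.append(h)
    return trees, base, alpha

def gb_predict(trees, base, alpha, X):
    return base + alpha * sum(t.predict(X) for t in trees)
```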
Adaptive Boost
 AdaBoost loss (exponential loss): $\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)}$
 Assume: binary classification, $y_i \in \{-1, +1\}$
 Assume: the weak learners always return $h(x) \in \{-1, +1\}$

 $\alpha$ is no longer a hyperparameter but an adaptive step size value, where a
better $h$ will be assigned a higher $\alpha$
 Merge with gradient boosting:
$h_{t+1} = \operatorname*{argmin}_{h} \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i)
 = \operatorname*{argmin}_{h} \sum_{i=1}^{n} -y_i\, e^{-y_i H(x_i)}\, h(x_i)$
17
AdaBoost

18
AdaBoost

 We assumed that $y_i, h(x_i) \in \{-1, +1\}$, so $y_i h(x_i) = +1$ for correctly
classified samples and $-1$ for misclassified ones
 $e^{-y_i H(x_i)}$ is a constant per iteration (independent of the new learner $h$)

19
AdaBoost

 That is, the added learner should minimize exponential loss for misclassified
samples

20
AdaBoost

 We define $w_i = \frac{e^{-y_i H(x_i)}}{\sum_{j=1}^{n} e^{-y_j H(x_j)}}$
 The normalized loss contribution of each training sample

 Moving forward we will say that $\epsilon = \sum_{i:\, h(x_i) \neq y_i} w_i$,
the total weight of the samples that the new learner misclassifies

 What happens if $H$ already classifies every training sample correctly?
 Can we still define the weights $w_i$?

21
 Yes!
 As long as $\sum_i e^{-y_i H(x_i)} > 0$ (which is always true), we have a meaningful value for every $w_i$

 That is, we can still reduce the loss even when the training error is zero
 This is good news! Why?
 Even when $H$ fits our training data perfectly, we can continue training it and
widen the classification margin
 Now that we can define the new learner $h$, let's add it to our ensemble (with what step size $\alpha$?)

22
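A short snippet making these definitions concrete (the function and variable names are mine; h_pred stands for the new weak learner's ±1 predictions):

```python
import numpy as np

def adaboost_weights(y, H):
    """Normalized exponential-loss contribution w_i of each training sample."""
    losses = np.exp(-y * H)        # e^{-y_i H(x_i)}, positive even at zero training error
    return losses / losses.sum()

def weighted_error(w, y, h_pred):
    """epsilon: total weight of the samples the new learner misclassifies."""
    return w[h_pred != y].sum()
```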
Adaptive Boosting

 $h$ is a weak learner. As a result, its contribution to the accuracy of
$H$ is noisy
 Intuitively, a better $h$ should have a larger $\alpha$
 Can we formulate this intuition as an optimization problem?
 Yes! $\alpha = \operatorname*{argmin}_{\alpha} \ell(H + \alpha h)
 = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} e^{-y_i \left(H(x_i) + \alpha h(x_i)\right)}$

23
Adaptive Boosting

 Convex function of $\alpha$ = we can find the argmin in closed form by setting
$\frac{\partial\, \ell(H + \alpha h)}{\partial \alpha} = 0$
 We assume that $y_i h(x_i) \in \{-1, +1\}$

24
Adaptive Boosting

 $\frac{\partial\, \ell(H + \alpha h)}{\partial \alpha} = 0$ (see 4 slides back)

 Define $\epsilon = \sum_{i:\, h(x_i) \neq y_i} w_i$, the weighted error of $h$
 (at minimum loss) $(1 - \epsilon)\, e^{-\alpha} = \epsilon\, e^{\alpha}$
 ($\Rightarrow e^{2\alpha} = \frac{1 - \epsilon}{\epsilon}$)

 $\alpha = \frac{1}{2}\ln\frac{1 - \epsilon}{\epsilon}$ (Wow! The optimal step size!)

25
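For completeness, one way to carry out the derivation this slide summarizes, using the definitions of $w_i$ and $\epsilon$ from the previous slides (the intermediate steps are my own reconstruction):

$$
\begin{aligned}
\ell(H + \alpha h) &= \sum_{i=1}^{n} e^{-y_i H(x_i)}\, e^{-\alpha y_i h(x_i)}
 \;\propto\; \sum_{i:\,h(x_i) = y_i} w_i\, e^{-\alpha} + \sum_{i:\,h(x_i) \neq y_i} w_i\, e^{\alpha}
 \;=\; (1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} \\
0 &= \frac{\partial}{\partial \alpha}\Big[(1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha}\Big]
 = -(1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha}
 \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
 \;\Longrightarrow\; \alpha = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon}
\end{aligned}
$$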
AdaBoost

 Work in iterations!
 At each iteration we need to re-compute all the weights $w_i$
 We can instead simply update them, $w_i \leftarrow \frac{w_i\, e^{-\alpha y_i h(x_i)}}{Z}$, where $Z$
renormalizes the weights to sum to 1 (we won't prove this)

26
AdaBoost

 The weak learner must be better than a random classifier ($\epsilon < \frac{1}{2}$)

 The normalized exponential loss $\frac{1}{n}\sum_{i} e^{-y_i H(x_i)}$ is an upper bound on the {0/1} error
rate (a full code sketch follows below)

27
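A compact AdaBoost sketch tying the pieces together, assuming labels in {-1, +1} and scikit-learn decision stumps as the weak learners; the stopping rule, stump depth, and all names are illustrative choices rather than the lecture's exact algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with depth-1 trees (stumps). Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initial sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=w)             # weak learner on the weighted data
        pred = h.predict(X)
        eps = w[pred != y].sum()                 # weighted training error epsilon
        if eps <= 0.0 or eps >= 0.5:             # perfect or no-better-than-random stump: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # optimal step size
        w = w * np.exp(-alpha * y * pred)        # up-weight misclassified samples
        w /= w.sum()                             # renormalize (Z)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def ada_predict(learners, alphas, X):
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)
```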
What did we learn?
 Boosting = iteratively build an ensemble where each new learner
$h_{t+1}$ is trained to reduce the error of the current ensemble $H_t$
 Boosting is an extremely powerful algorithm that turns any weak
learner (better than random) into a strong learner
 For AdaBoost (= adaptive step size and exponential loss), the
training error decreases exponentially with the number of iterations
 It requires only a small number of steps until it is consistent with the training set
(we didn't prove this)

31
What next?
 Class: Midterm!
 Assignments:
 Assignment (P3): SVM, linear regression and kernelization, due Tuesday
Nov-16
 Assignment (P4): Decision trees, due Thursday Nov-25
 Quizzes:
 Quiz 5: decision trees and bagging, due Thursday Nov-18

32
