Boosting
Andrew Ng
1 Boosting
We have seen so far how to solve classification (and other) problems when we
have a data representation already chosen. We now talk about a procedure,
known as boosting, which was originally discovered by Rob Schapire, and
further developed by Schapire and Yoav Freund, that automatically chooses
feature representations. We take an optimization-based perspective, which
is somewhat different from the original interpretation and justification of
Freund and Schapire, but which lends itself to our approach of (1) choose a
representation, (2) choose a loss, and (3) minimize the loss.
Before formulating the problem, we give a little intuition for what we are going to do. Roughly, the idea of boosting is to take a weak learning algorithm—any learning algorithm that gives a classifier that is slightly better than random—and transform it into a strong classifier, which does much, much better than random. To build a bit of intuition for what this means, consider a hypothetical digit recognition experiment, where we wish to distinguish 0s from 1s, and we receive images we must classify. Then a natural weak learner might be to take the middle pixel of the image; if it is colored, call the image a 1, and if it is blank, call the image a 0. This classifier may be far from perfect, but it is likely better than random. Boosting procedures proceed by taking a collection of such weak classifiers and reweighting their contributions to form a classifier with much better accuracy than any individual classifier.
With that in mind, let us formulate the problem. Our interpretation of boosting is as a coordinate descent method in an infinite dimensional space, which—while it sounds complex—is not so bad as it seems. First, we assume we have raw input examples x ∈ Rⁿ with labels y ∈ {−1, 1}, as is usual in binary classification. We also assume we have an infinite collection of feature functions φ_j : Rⁿ → {−1, 1} and an infinite vector θ = [θ₁ θ₂ · · ·]ᵀ, which we assume always has only a finite number of non-zero entries. For our classifier we use
$$h_\theta(x) = \operatorname{sign}\Big(\sum_{j=1}^{\infty} \theta_j \phi_j(x)\Big).$$
Then we say that there is a weak learner with margin γ > 0 if for any distribution p on the m training examples there exists one weak hypothesis φ_j such that
$$\sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)} \neq \phi_j(x^{(i)})\big\} \le \frac{1}{2} - \gamma. \tag{1}$$
That is, we assume that there is some classifier that does slightly better than
random guessing on the dataset. The existence of a weak learning algorithm
is an assumption, but the surprising thing is that we can transform any weak
learning algorithm into one with perfect accuracy.
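To make the weak-learning condition (1) concrete, the following sketch checks it numerically for a finite pool of candidate hypotheses. This is illustrative only: the function names, the pool `phis`, and the tiny dataset are hypothetical, not from the notes.

```python
import numpy as np

def weighted_error(phi, X, y, p):
    """p-weighted misclassification error of a weak hypothesis phi."""
    preds = np.array([phi(x) for x in X])
    return float(np.sum(p * (preds != y)))

def has_weak_learner(phis, X, y, p, gamma):
    """Check the margin-gamma condition (1): some phi in the pool has
    p-weighted error at most 1/2 - gamma under distribution p."""
    return any(weighted_error(phi, X, y, p) <= 0.5 - gamma for phi in phis)
```

In a full boosting implementation this check would be replaced by an actual search over hypotheses, as in the decision stump procedure of Section 3.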
In more generality, we assume we have access to a weak learner, which is an algorithm that takes as input a training set {(x^{(i)}, y^{(i)})}_{i=1}^m and a distribution (weights) p^{(1)}, . . . , p^{(m)} on the examples, with ∑_{i=1}^m p^{(i)} = 1 and p^{(i)} ≥ 0, and returns a classifier doing slightly better than random under that distribution. We will
show how, given access to a weak learning algorithm, boosting can return a
classifier with perfect accuracy on the training data. (Admittedly, we would
like the classifier to generalize well to unseen data, but for now, we ignore
this issue.)
We first show how to compute the exact form of the coordinate descent
update for the risk J(θ). Coordinate descent iterates as follows:
(i) Choose a coordinate j ∈ N
(ii) Update θ_j to
$$\theta_j = \arg\min_{\theta_j} J(\theta)$$
Fixing the coordinate j being updated, define
$$w^{(i)} := \exp\Big(-y^{(i)} \sum_{k \neq j} \theta_k \phi_k(x^{(i)})\Big)$$
to be a weight, and let α = θ_j. We can then express
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} w^{(i)} \exp\big(-y^{(i)} \phi_j(x^{(i)})\, \alpha\big).$$
Now, define
$$W^{+} := \sum_{i \,:\, y^{(i)} \phi_j(x^{(i)}) = 1} w^{(i)} \qquad \text{and} \qquad W^{-} := \sum_{i \,:\, y^{(i)} \phi_j(x^{(i)}) = -1} w^{(i)}.$$
Then we have the following progress guarantee (Lemma 1):
$$J(\theta^{(t)}) \le \sqrt{1 - 4\gamma^2}\; J(\theta^{(t-1)}).$$
For each iteration t = 1, 2, . . .:
(iii) Compute
$$W_t^{+} = \sum_{i \,:\, y^{(i)} \phi_t(x^{(i)}) = 1} w^{(i)} \qquad \text{and} \qquad W_t^{-} = \sum_{i \,:\, y^{(i)} \phi_t(x^{(i)}) = -1} w^{(i)}$$
and set
$$\theta_t = \frac{1}{2} \log \frac{W_t^{+}}{W_t^{-}}.$$
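The boosting loop above (reweight the examples, query the weak learner, apply the closed-form update for θ_t) can be sketched as follows. This is an illustrative implementation: the `weak_learner` callable and the small numerical guard against a perfect weak hypothesis (which would make W_t^- zero) are assumptions, not part of the notes.

```python
import numpy as np

def boost(X, y, weak_learner, T):
    """Sketch of the boosting loop: maintain weights w^(i), call the weak
    learner on the normalized distribution p, and set theta_t by the
    closed-form coordinate update theta_t = 0.5 * log(W+/W-)."""
    m = len(y)
    w = np.ones(m)            # w^(i) = exp(-y^(i) sum_tau theta_tau phi_tau(x^(i)))
    hypotheses, thetas = [], []
    for _ in range(T):
        p = w / w.sum()                 # distribution over training examples
        phi = weak_learner(X, y, p)     # returns a function x -> {-1, +1}
        preds = np.array([phi(x) for x in X])
        W_plus = w[y * preds == 1].sum()
        W_minus = w[y * preds == -1].sum()
        eps = 1e-12                     # guard: a perfect hypothesis gives W_minus = 0
        theta = 0.5 * np.log(max(W_plus, eps) / max(W_minus, eps))
        w = w * np.exp(-y * preds * theta)   # multiplicative weight update
        hypotheses.append(phi)
        thetas.append(theta)
    def h(x):                 # final classifier sign(sum_t theta_t phi_t(x))
        return int(np.sign(sum(t * phi(x) for t, phi in zip(thetas, hypotheses))))
    return h
```

Note how each round's weight update concentrates the distribution p on the examples the current combined classifier gets wrong, which is what forces subsequent weak hypotheses to attend to them.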
As the proof of the lemma is somewhat involved and not the central focus of
these notes—though it is important to know one’s algorithm will converge!—
we defer the proof to Appendix A.1. Let us describe how it guarantees
convergence of the boosting procedure to a classifier with zero training error.
We initialize the procedure at θ^{(0)} = 0, so that the initial empirical risk J(θ^{(0)}) = 1. Now, we note that for any θ, the misclassification error satisfies
$$1\big\{\operatorname{sign}(\theta^{\top} \phi(x)) \neq y\big\} = 1\big\{y\, \theta^{\top} \phi(x) \le 0\big\} \le \exp\big(-y\, \theta^{\top} \phi(x)\big),$$
and so if J(θ) < 1/m then the vector θ makes no mistakes on the training data.
After t iterations of boosting, we find that the empirical risk satisfies
$$J(\theta^{(t)}) \le \big(1 - 4\gamma^2\big)^{t/2} J(\theta^{(0)}) = \big(1 - 4\gamma^2\big)^{t/2}.$$
To find how many iterations are required to guarantee J(θ^{(t)}) < 1/m, we take logarithms to find that J(θ^{(t)}) < 1/m if
$$\frac{t}{2} \log\big(1 - 4\gamma^2\big) < \log \frac{1}{m}, \qquad \text{or} \qquad t > \frac{2 \log m}{-\log(1 - 4\gamma^2)}.$$
Using a first order Taylor expansion, that is, that log(1 − 4γ²) ≤ −4γ², we see that if the number of rounds of boosting—the number of weak classifiers we use—satisfies
$$t > \frac{\log m}{2\gamma^2} \ge \frac{2 \log m}{-\log(1 - 4\gamma^2)},$$
then J(θ^{(t)}) < 1/m.
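As a quick sanity check on this bound (not part of the notes; the function name is illustrative), the required number of rounds can be evaluated directly:

```python
import math

def rounds_needed(m, gamma):
    """Smallest integer t satisfying t > log(m) / (2 * gamma**2), which by
    the bound above guarantees J(theta^(t)) < 1/m, hence zero training error."""
    return math.floor(math.log(m) / (2 * gamma ** 2)) + 1
```

For instance, with m = 1000 training examples and margin γ = 0.1, the bound asks for 346 rounds, and one can verify that (1 − 4γ²)^{t/2} is indeed below 1/m at that t.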
3 Implementing weak-learners
One of the major advantages of boosting algorithms is that they automatically generate features from raw data for us. Moreover, because the weak hypotheses always return values in {−1, 1}, there is no need to normalize features to have similar scales when using learning algorithms, which in practice can make a large difference. Additionally, and while this is not theoretically well-understood, many types of weak-learning procedures introduce non-linearities intelligently into our classifiers, which can yield much more expressive models than the simpler linear models of the form θᵀx that we have seen so far.
3.1 Decision stumps
The prototypical weak learner is a decision stump, a classifier that thresholds a single feature of the input:
$$\phi_{j,s}(x) = \operatorname{sign}(x_j - s). \tag{2}$$
These classifiers are simple enough that we can fit them efficiently even to a weighted dataset, as we now describe.
Indeed, a decision stump weak learner proceeds as follows. We begin with a distribution—a set of weights p^{(1)}, . . . , p^{(m)} summing to 1—on the training set, and we wish to choose a decision stump of the form (2) to minimize the error on the training set. That is, we wish to find a threshold s ∈ R and index j such that
$$\widehat{\mathrm{Err}}(\phi_{j,s}, p) = \sum_{i=1}^{m} p^{(i)}\, 1\big\{\phi_{j,s}(x^{(i)}) \neq y^{(i)}\big\} = \sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)}\big(x_j^{(i)} - s\big) \le 0\big\} \tag{3}$$
is minimized.
As the only values s for which the error of the decision stump can change are the values x_j^{(i)}, a bit of clever book-keeping allows us to compute the error of every candidate threshold efficiently: sorting the values of feature j so that the indices i_1, . . . , i_m satisfy x_j^{(i_1)} ≤ x_j^{(i_2)} ≤ · · · ≤ x_j^{(i_m)}, we have
$$\sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)}\big(x_j^{(i)} - s\big) \le 0\big\} = \sum_{k=1}^{m} p^{(i_k)}\, 1\big\{y^{(i_k)}\big(x_j^{(i_k)} - s\big) \le 0\big\},$$
and a single pass over the sorted values yields the error at all m thresholds.
(You should convince yourself that this is true.) Thus, it is important to also track the smallest value of $1 - \widehat{\mathrm{Err}}(\phi_{j,s}, p)$ over all thresholds, because this may be smaller than $\widehat{\mathrm{Err}}(\phi_{j,s}, p)$; in that case the sign-flipped stump $-\phi_{j,s}$ is the better weak learner. Using this procedure for our weak learner (Fig. 1) gives the basic, but extremely useful, boosting classifier.
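The book-keeping above might be implemented as follows. This is a sketch, assuming numeric features in a NumPy array; the function name is hypothetical, and the sort-and-sweep costs O(m log m) per feature.

```python
import numpy as np

def fit_stump(X, y, p):
    """Fit a weighted decision stump phi(x) = sign * (1 if x_j > s else -1).
    The weighted error can only change at the observed values x_j^(i), so
    for each feature we sort those values and sweep once, updating the error
    incrementally. We also track 1 - error, since flipping the stump's sign
    may give a better weak learner."""
    m, n = X.shape
    best_err, best = np.inf, None
    for j in range(n):
        order = np.argsort(X[:, j])
        xs, ys, ps = X[order, j], y[order], p[order]
        # threshold below all points: every example is predicted +1
        err = ps[ys == -1].sum()
        for k in range(m):
            # raise the threshold to xs[k]: example k is now predicted -1
            err += ps[k] if ys[k] == 1 else -ps[k]
            if k + 1 < m and xs[k + 1] == xs[k]:
                continue  # a threshold must separate distinct values
            for sign, e in ((1, err), (-1, 1.0 - err)):
                if e < best_err:
                    best_err, best = e, (j, xs[k], sign)
    j, s, sign = best
    return lambda x: sign * (1 if x[j] > s else -1)
```

Plugging this into a boosting loop as the weak learner gives the boosted-decision-stumps classifier discussed in this section.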
[Figure 1: plot on axes ranging from 0 to 1.]
3.2 Example
We now give an example showing the behavior of boosting on a simple
dataset. In particular, we consider a problem with data points x ∈ R2 ,
where the optimal classifier is
$$y = \begin{cases} 1 & \text{if } x_1 < .6 \text{ and } x_2 < .6 \\ -1 & \text{otherwise.} \end{cases} \tag{4}$$
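To reproduce such an experiment, one might sample a dataset from the concept (4) as follows. The sampling scheme, seed, and function name are assumptions; the notes do not specify how the points were generated.

```python
import numpy as np

def toy_dataset(m, seed=0):
    """Sample m points uniformly on [0,1]^2 and label them with the
    axis-aligned concept of Eq. (4): y = 1 iff x1 < .6 and x2 < .6."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(m, 2))
    y = np.where((X[:, 0] < 0.6) & (X[:, 1] < 0.6), 1, -1)
    return X, y
```

Because the positive region is an axis-aligned box, each of its two edges can be matched by a decision stump, which is why boosted stumps approximate this concept quickly.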
[Figure: the boosted classifier on the example dataset after 2, 4, 5, and 10 iterations; each panel plots the unit square with axes from 0 to 1.]
3.3 Other strategies
There are a huge number of variations on the basic boosted decision stumps
idea. First, we do not require that the input features xj be real-valued. Some
of them may be categorical, meaning that xj ∈ {1, 2, . . . , k} for some k, in
which case natural decision stumps are of the form
$$\phi_j(x) = \begin{cases} 1 & \text{if } x_j = l \\ -1 & \text{otherwise,} \end{cases}$$
for each choice of l ∈ {1, 2, . . . , k}.
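Such a categorical stump is easy to express directly. A sketch, assuming categories are encoded as integers (the function name is hypothetical):

```python
def categorical_stump(j, l):
    """Weak hypothesis for a categorical feature: phi_j(x) = 1 if x_j == l,
    and -1 otherwise."""
    return lambda x: 1 if x[j] == l else -1
```

Fitting such a stump to a weighted dataset only requires comparing the weighted error of each (feature, category) pair, so no sorting is needed.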
A Appendices
A.1 Proof of Lemma 1
We now return to prove the progress lemma. We prove this result by directly
showing the relationship of the weights at time t to those at time t − 1. In
particular, we note by inspection that
1 2
q
(t)
J(θ ) = + −α − α
min{Wt e + Wt e } = Wt+ Wt−
m α m
while
$$J(\theta^{(t-1)}) = \frac{1}{m} \sum_{i=1}^{m} \exp\Big(-y^{(i)} \sum_{\tau=1}^{t-1} \theta_\tau \phi_\tau(x^{(i)})\Big) = \frac{1}{m}\big(W_t^{+} + W_t^{-}\big).
$$
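The closed-form minimum over α used in the expression for J(θ^{(t)}) is easy to verify numerically. This is a sanity check on the identity min_α {W⁺e^{−α} + W⁻e^{α}} = 2√(W⁺W⁻), not part of the proof; the function names are illustrative.

```python
import math

def inner_objective(alpha, w_plus, w_minus):
    """The quantity minimized in the coordinate update: W+ e^{-a} + W- e^{a}."""
    return w_plus * math.exp(-alpha) + w_minus * math.exp(alpha)

def alpha_star(w_plus, w_minus):
    """Closed-form minimizer alpha* = 0.5 * log(W+/W-); the attained
    minimum value is 2 * sqrt(W+ * W-)."""
    return 0.5 * math.log(w_plus / w_minus)
```

Setting the derivative −W⁺e^{−α} + W⁻e^{α} to zero gives α* directly, and substituting back yields the stated minimum.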
Rewriting this expression by noting that the sum on the right is W_t^{-}, we have
$$W_t^{-} \le \Big(\frac{1}{2} - \gamma\Big)\big(W_t^{+} + W_t^{-}\big), \qquad \text{or} \qquad W_t^{+} \ge \frac{1 + 2\gamma}{1 - 2\gamma}\, W_t^{-}.$$
By substituting $\alpha = \frac{1}{2} \log \frac{1+2\gamma}{1-2\gamma}$ in the minimum defining J(θ^{(t)}), we obtain
$$\begin{aligned}
J(\theta^{(t)}) &\le \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}} \right) \\
&= \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}}\,(1 - 2\gamma + 2\gamma) \right) \\
&\le \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}}\,(1 - 2\gamma) + 2\gamma \sqrt{\frac{1-2\gamma}{1+2\gamma}}\, W_t^{+} \right) \\
&= \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}}\,(1 + 2\gamma) + W_t^{-} \sqrt{1 - 4\gamma^2} \right) \\
&= \frac{\sqrt{1 - 4\gamma^2}}{m}\big(W_t^{+} + W_t^{-}\big) = \sqrt{1 - 4\gamma^2}\; J(\theta^{(t-1)}),
\end{aligned}$$
where the third step uses $W_t^{-} \le \frac{1-2\gamma}{1+2\gamma} W_t^{+}$. This completes the proof of the lemma.