
In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML 2010), Haifa, Israel, June 2010.

Boosting for Regression Transfer

David Pardoe and Peter Stone {dpardoe, pstone}@cs.utexas.edu


The University of Texas at Austin, 1 University Station C0500, Austin, TX 78712 USA

Abstract

The goal of transfer learning is to improve the learning of a new target concept given knowledge of related source concept(s). We introduce the first boosting-based algorithms for transfer learning that apply to regression tasks. First, we describe two existing classification transfer algorithms, ExpBoost and TrAdaBoost, and show how they can be modified for regression. We then introduce extensions of these algorithms that improve performance significantly on controlled experiments in a wide range of test domains.

1. Introduction

The idea behind transfer learning (Pan & Yang, 2009) is that it is easier to learn a new concept (such as how to play the trombone) if you are already familiar with a similar concept (such as playing the trumpet). In the context of supervised learning, inductive transfer learning is often framed as the problem of learning a concept of interest, called the target concept, given data from multiple sources: a typically small amount of target data that reflects the target concept, and a larger amount of source data that reflects one or more different, but possibly related, source concepts. A number of algorithms have been developed to address this situation in classification settings, but much less attention has been paid to regression settings.

One general approach that has been applied successfully to classification transfer is boosting. In this paper, we introduce and evaluate the first boosting-based algorithms for regression transfer. These algorithms can be divided into two categories: algorithms that make use of models trained on the source data, and algorithms that use the source data directly. We first describe an existing classification transfer algorithm from each category (ExpBoost (Rettinger et al., 2006) and TrAdaBoost (Dai et al., 2007), respectively), and we show how these algorithms can be modified for a regression setting. Next, we present the primary contribution of this paper: two new algorithms designed to overcome shortcomings observed in these modified algorithms. Finally, we present experimental results for all algorithms in seven test domains.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. Regression Transfer

In this section, we specify our learning problem and outline two approaches to solving this problem. Then we provide necessary background on boosting for regression problems.

2.1. Problem Specification

Our goal is to learn a model of a concept c_target mapping feature vectors from the space X to labels in the space Y. In binary classification problems, Y = {0, 1}, while in the regression problems studied here, Y = R. We are given a set of training instances T_target = {(x_i, y_i)}, with x_i ∈ X and y_i ∈ Y for 1 ≤ i ≤ n, that reflect c_target. In addition, we are given data sets T_source^1, ..., T_source^B reflecting B different, but possibly related, concepts also mapping X to Y. In order to learn the most accurate possible model of c_target, we must decide how to use both the target and source data sets. If T_target is sufficiently large, we can likely learn a good model using only this data. However, if T_target is small and one or more of the source concepts is similar to c_target, then we may be able to use the source data to improve our model.

2.2. ExpBoost and TrAdaBoost

In this paper, we will consider regression transfer algorithms that fit into two categories: those that make use of models trained on the source data, and those that use the source data directly as training data. The algorithms we will present in these two categories are inspired by two boosting-based algorithms for classification transfer, ExpBoost (Rettinger et al., 2006) and TrAdaBoost (Dai et al., 2007). Boosting is an ensemble method in which a sequence of models (or hypotheses) h_1, ..., h_N, each mapping from X to Y, are iteratively fit to some transformation of a data set using a base learner. The outputs of these models are then combined into a final hypothesis h_f.
In ExpBoost, a separate hypothesis (or expert, hence the name) h^i is learned for each of the B source data sets, and learning is performed using only T_target. At each step of the boosting process, ExpBoost chooses to use either the hypothesis h_t learned from the weighted training data or one of the experts, depending on which is most accurate.

In contrast, TrAdaBoost uses the source data sets directly by combining them with T_target to form a single data set. At each boosting step, TrAdaBoost increases the relative weights of target instances that are misclassified. When a source instance is misclassified, however, its weight is decreased. In this way, TrAdaBoost aims to identify and make use of those source instances that are most similar to the target data while ignoring those that are dissimilar.

We provide additional details on these algorithms and their extensions below, but first we address the issue of applying boosting algorithms to regression problems.

2.3. AdaBoost and Regression

One of the best known boosting methods for classification, and the one upon which ExpBoost and TrAdaBoost are based, is AdaBoost (specifically, AdaBoost.M1) (Freund & Schapire, 1997). In AdaBoost, each training instance receives a weight w_i that is used when learning each hypothesis; this weight indicates the relative importance of each instance and is used in computing the error of a hypothesis on the data set. After each iteration, instances are reweighted, with those instances that are not correctly classified by the last hypothesis receiving larger weights (as in step 5 of Algorithm 1). Thus, as the process continues, learning focuses on those instances that are most difficult to classify.

A number of methods have been proposed for modifying AdaBoost for regression, and as TrAdaBoost and ExpBoost are based on AdaBoost, these modifications can be used on them as well. In our work, we explored two of these methods that have been shown to be generally effective and that can be applied to TrAdaBoost and ExpBoost in a straightforward way: AdaBoost.R2 and AdaBoost.RT.

The key to AdaBoost is the reweighting of those instances that are misclassified at each iteration. In regression problems, the output given by a hypothesis h_t for an instance x_i is not correct or incorrect, but has a real-valued error e_i = |y_i − h_t(x_i)| that may be arbitrarily large. Thus, we need a method of mapping an error e_i into an adjusted error e'_i that can be used in the reweighting formula used by AdaBoost.

The method used in AdaBoost.R2 (Drucker, 1997) is to express each error in relation to the largest error D = max_i |e_i| in such a way that each adjusted error e'_i is in the range [0, 1]. In particular, one of three possible loss functions is used: e'_i = e_i/D (linear), e'_i = e_i^2/D^2 (square), or e'_i = 1 − exp(−e_i/D) (exponential). The degree to which instance x_i is reweighted in iteration t thus depends on how large the error of h_t is on x_i relative to the error on the worst instance. AdaBoost.RT (Shrestha & Solomatine, 2006), on the other hand, continues to label each output as correct (e'_i = 0) or incorrect (e'_i = 1) using an error threshold φ. That is, if e_i > φ, then e'_i = 1; otherwise, e'_i = 0.

In preliminary experiments, we found AdaBoost.R2 with the linear loss function to work consistently well, and were unable to find values of φ that allowed AdaBoost.RT to regularly match this performance. In the remainder of this paper we consider only AdaBoost.R2 with the linear loss function, shown in Algorithm 1.

Algorithm 1 AdaBoost.R2 (Drucker, 1997)

Input: the labeled target data set T of size n, the maximum number of iterations N, and a base learning algorithm Learner. Unless otherwise specified, set the initial weight vector w^1 such that w_i^1 = 1/n for 1 ≤ i ≤ n.
For t = 1, ..., N:
  1. Call Learner with the training set T and the distribution w^t, and get a hypothesis h_t : X → R.
  2. Calculate the adjusted error e_i^t for each instance: let D_t = max_{j=1..n} |y_j − h_t(x_j)|; then e_i^t = |y_i − h_t(x_i)| / D_t.
  3. Calculate the adjusted error of h_t: ε_t = Σ_{i=1..n} e_i^t w_i^t; if ε_t ≥ 0.5, stop and set N = t − 1.
  4. Let β_t = ε_t / (1 − ε_t).
  5. Update the weight vector: w_i^{t+1} = w_i^t β_t^{(1 − e_i^t)} / Z_t, where Z_t is a normalizing constant.
Output the hypothesis: h_f(x) = the weighted median of h_t(x) for 1 ≤ t ≤ N, using ln(1/β_t) as the weight for hypothesis h_t.
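For concreteness, the following is a minimal Python/NumPy sketch of Algorithm 1 with the linear loss. The scikit-learn decision-tree base learner, the function names, and the default parameters are illustrative assumptions on our part; the experiments in this paper were run with WEKA learners, so this is a sketch rather than the reference implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaboost_r2(X, y, n_iter=30, make_learner=lambda: DecisionTreeRegressor(max_depth=4)):
    """Sketch of AdaBoost.R2 (Algorithm 1) with the linear loss."""
    n = len(y)
    w = np.full(n, 1.0 / n)                              # w_i^1 = 1/n
    models, betas = [], []
    for _ in range(n_iter):
        h = make_learner().fit(X, y, sample_weight=w)    # step 1
        abs_err = np.abs(y - h.predict(X))
        D = abs_err.max()
        if D == 0:                                       # perfect fit: keep it and stop
            models.append(h); betas.append(1e-10); break
        e = abs_err / D                                  # step 2: adjusted errors in [0, 1]
        eps = np.sum(e * w)                              # step 3
        if eps >= 0.5:
            break
        beta = eps / (1.0 - eps)                         # step 4
        models.append(h); betas.append(beta)
        w = w * beta ** (1.0 - e)                        # step 5
        w /= w.sum()                                     # Z_t normalization
    return models, np.log(1.0 / np.array(betas))

def weighted_median_predict(models, log_weights, X):
    """h_f(x): weighted median of the h_t(x), weighting h_t by ln(1/beta_t)."""
    preds = np.array([m.predict(X) for m in models])     # shape (T, num_points)
    order = np.argsort(preds, axis=0)
    cum = np.cumsum(log_weights[order], axis=0)
    idx = np.argmax(cum >= 0.5 * log_weights.sum(), axis=0)
    cols = np.arange(preds.shape[1])
    return preds[order[idx, cols], cols]

Typical use would be models, lw = adaboost_r2(X_train, y_train) followed by weighted_median_predict(models, lw, X_test).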
3. Using Source Models

In this section we describe four regression transfer algorithms based on making use of source models. In addition to the target data, each algorithm receives as input a set of experts H^B = {h^1, ..., h^B}, each corresponding to a source data set.

3.1. ExpBoost.R2

Combining the principles of AdaBoost.R2 with those of ExpBoost results in the new regression algorithm ExpBoost.R2.
The steps involving computing the adjusted error and outputting the final hypothesis correspond to the same steps from Algorithm 1. The primary difference is in step 1 of each boosting iteration. After obtaining h_t, ExpBoost.R2 computes the weighted errors of each expert in H^B on the current weighting of T, and if any expert has a lower weighted error than h_t, h_t is replaced with the best expert.
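As a concrete illustration of that modified step, here is a small Python sketch of one ExpBoost.R2 iteration; fit_base (a weighted fitting routine) and the expert objects (anything with a predict method) are assumed interfaces of our own, not code from the paper.

import numpy as np

def expboost_r2_step(X, y, w, fit_base, experts):
    """One ExpBoost.R2 boosting iteration: keep the newly fit hypothesis only if
    no source expert has a lower weighted error under the current weights w."""
    h_t = fit_base(X, y, w)                           # hypothesis from the weighted target data
    candidates = [h_t] + list(experts)
    weighted_err = [np.sum(w * np.abs(y - c.predict(X))) for c in candidates]
    return candidates[int(np.argmin(weighted_err))]   # h_t or the best expert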
3.2. (Boosted) Transfer Stacking

In ExpBoost.R2, the final hypothesis that is produced represents a combination of the provided experts and additional hypotheses learned with the base learner. However, at each boosting iteration, ExpBoost.R2 must choose between either the newly learned hypothesis or a single expert. We now consider relaxing this constraint by allowing a linear combination of hypotheses to be chosen at each iteration.

The details of our approach are very similar to those of stacking (or stacked generalization) (Wolpert, 1992). Stacking is an ensemble approach in which a meta-level model combines multiple base models, all trained independently on the same set of data using different learning algorithms. The meta-level model is learned (typically using linear regression) from a meta-level data set created as follows. For each instance in the original training set, a meta-level instance is created using the outputs of each base model as features and using the original label. Cross validation is performed for each base learner so that the output for each instance in the original training set is obtained when it is out-of-sample. Once the meta-level model is learned, a new instance is handled by using the model to combine the outputs of the base models on the instance.

Here, instead of multiple base learners, we consider a single base learner and (potentially) multiple experts previously trained on source data; hence cross validation is required only for the base learner and not for the experts. Thus, at each boosting iteration, we perform linear least squares regression to find a linear combination of the new hypothesis and experts that best fits the data for the current iteration, and store the result as the iteration's hypothesis. As a result of the similarity to stacking, we call this combination approach transfer stacking. Since our full boosting approach reduces to calling AdaBoost.R2 with transfer stacking as the base learner, we give details only for transfer stacking, shown as Algorithm 2.

Algorithm 2 Transfer Stacking

Input: a labeled data set T = {(x_i, y_i)} of size n, a set of experts H^B = {h^1, ..., h^B}, the number of folds F for cross validation, and a base learning algorithm Learner.
  1. Let O_{i,j} = h^j(x_i) for 1 ≤ i ≤ n and 1 ≤ j ≤ B.
  2. Perform F-fold cross validation on T using Learner. For 1 ≤ i ≤ n, let O_{i,B+1} equal the output of the learned model for the fold where instance i is in the validation set.
  3. Call Learner with the full training set T and get hypothesis h^{B+1}.
  4. Perform linear least squares regression on the system of equations (Σ_{j=1..B+1} a_j O_{i,j}) + a_{B+2} = y_i for 1 ≤ i ≤ n; that is, find the linear combination of hypotheses that minimizes squared error.
Output the hypothesis: h_f(x) = a_1 h^1(x) + ... + a_{B+1} h^{B+1}(x) + a_{B+2}.

We note that transfer stacking by itself (without the use of boosting) could be used as a transfer algorithm, and so in the experiments of Section 6 we evaluate both plain transfer stacking (using AdaBoost.R2 as its base learner for a fair comparison) and the full approach, which we call boosted transfer stacking.
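The following is a minimal scikit-learn/NumPy sketch of Algorithm 2; the decision-tree base learner, the function names, and the closure used for the final hypothesis are our own illustrative choices.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def transfer_stacking(X, y, experts, n_folds=10,
                      make_learner=lambda: DecisionTreeRegressor(max_depth=4)):
    """Sketch of Algorithm 2 (transfer stacking)."""
    n = len(y)
    O = np.column_stack([h.predict(X) for h in experts])         # step 1: expert outputs
    oof = np.empty(n)                                             # step 2: out-of-fold outputs
    for train_idx, val_idx in KFold(n_splits=n_folds).split(X):
        fold_model = make_learner().fit(X[train_idx], y[train_idx])
        oof[val_idx] = fold_model.predict(X[val_idx])
    h_new = make_learner().fit(X, y)                              # step 3: model on the full set
    A = np.column_stack([O, oof, np.ones(n)])                     # step 4: least squares fit of
    a, *_ = np.linalg.lstsq(A, y, rcond=None)                     # coefficients a_1 .. a_{B+2}
    hypotheses = list(experts) + [h_new]

    def h_f(X_new):
        feats = np.column_stack([h.predict(X_new) for h in hypotheses] + [np.ones(len(X_new))])
        return feats @ a
    return h_f

Boosted transfer stacking would then amount to handing a learner of this form to AdaBoost.R2 as its base learner, with the fits additionally made to respect the per-instance boosting weights.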
3.3. Best Expert

Finally, as a baseline, we test the algorithm that simply uses the best expert from H^B; that is, the expert with the lowest error on the target training data.

4. Using Source Data Directly

We now describe three algorithms that take as input both the target and all source data sets and that train on a combination of this data.

4.1. TrAdaBoost.R2

Combining the principles of AdaBoost.R2 with those of TrAdaBoost results in the new regression algorithm TrAdaBoost.R2. TrAdaBoost.R2 takes two data sets as input, T_target and T_source, of size m and n, respectively, and combines them into a single set T used in boosting. Although the original work on TrAdaBoost does not consider the issue of multiple sources, we are interested in cases where any number of sources may exist. When there is more than one source, we simply combine all source data sets into a single data set. As TrAdaBoost.R2 handles the reweighting of each training instance separately, there should be no harm in mixing data in this fashion, but care should be taken in setting the initial weight vector. Our experiments involve source data sets of (roughly) equal sizes, and so we simply assign all source instances (and target instances) the same weight. If one source data set were larger than another, however, setting weights uniformly would result in more emphasis being given to that source, at least in early boosting iterations.
As with ExpBoost.R2, the steps involving computing the adjusted error correspond to the same steps from Algorithm 1. The primary difference between TrAdaBoost.R2 and Algorithm 1 is in step 5 of each iteration. Instead of treating all data equally, TrAdaBoost.R2 increases the weights of target instances by setting w_i^{t+1} = w_i^t β_t^{−e_i^t} / Z_t and decreases the weights of source instances by setting w_i^{t+1} = w_i^t β^{e_i^t} / Z_t, where β = 1 / (1 + sqrt(2 ln n / N)). In addition, TrAdaBoost.R2 considers only the final ⌈N/2⌉ hypotheses when taking the weighted median to determine output (as a result of theoretical considerations in the original TrAdaBoost).
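A sketch of just that reweighting step, under our own naming (adj_err holds the adjusted errors e_i^t, is_source marks source instances, n_source is the number of source instances, and n_iter is the number of boosting iterations N):

import numpy as np

def tradaboost_r2_reweight(w, adj_err, is_source, beta_t, n_source, n_iter):
    """TrAdaBoost.R2 weight update: target weights grow with error via
    beta_t ** (-e_i), while source weights shrink with error via beta ** e_i."""
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_source) / n_iter))
    new_w = np.where(is_source,
                     w * beta ** adj_err,        # source: decrease with error
                     w * beta_t ** -adj_err)     # target: increase with error
    return new_w / new_w.sum()                   # Z_t normalization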
4.2. Two-stage TrAdaBoost.R2

In analyzing the performance of TrAdaBoost.R2, we observed it to be highly susceptible to overfitting (that is, beyond some point, accuracy decreased as the number of boosting iterations N increased). In contrast, AdaBoost.R2 and the algorithms of Section 3 do not appear to suffer from this problem. After experimenting with cross validation to select N, we still saw mixed performance from TrAdaBoost.R2. Closer inspection of the results revealed two problems. First, when the size of T_source is much larger than T_target, it can take many iterations for the total weight of the target instances to approach the total weight of the source instances, and by this time the weights of the target data may be heavily skewed – those target instances that are either outliers or most dissimilar to the source data may represent most of the weight. Second, even those source instances that are representative of the target concept tend to have their weights reduced to zero eventually. The use of the adjusted error scheme from AdaBoost.R2 is the reason. Whereas in TrAdaBoost the relevant source instances will generally be classified correctly and not have their weights reduced, in TrAdaBoost.R2 even small errors lead to weight reductions. The fact that TrAdaBoost.R2 uses only the hypotheses generated during the final half of boosting iterations exacerbates this problem. (We note that we also tried using all hypotheses, with mixed results.)

To address these problems, we designed a version of TrAdaBoost.R2 that adjusts instance weights in two stages. In stage one, the weights of source instances are adjusted downwards gradually until reaching a certain point (determined through cross validation). In stage two, the weights of all source instances are frozen while the weights of target instances are updated as normal in AdaBoost.R2. Only the hypotheses generated in stage two are stored and used to determine the output of the resulting model. We call this algorithm two-stage TrAdaBoost.R2, and show it in Algorithm 3. Note that the weighting factor β_t is not chosen based on the hypothesis error, as before, but is chosen to result in a certain total weight for the target instances. In this way, the total weight of the target instances increases uniformly from m/(n + m) to 1 in S steps. In our implementation, we approximated the value of β_t satisfying the conditions shown in Algorithm 3 using a binary search. In addition, it is not necessary to progress through all S steps once it has been determined that errors are increasing.

Algorithm 3 Two-stage TrAdaBoost.R2

Input: two labeled data sets T_source (of size n) and T_target (of size m), the number of steps S, the maximum number of boosting iterations N, the number of folds F for cross validation, and a base learning algorithm Learner. Let T be the combination of T_source and T_target such that the first n instances in T are those from T_source. Set the initial weight vector w^1 such that w_i^1 = 1/(n + m) for 1 ≤ i ≤ n + m.
For t = 1, ..., S:
  1. Call AdaBoost.R2' with T, distribution w^t, N, and Learner to obtain model_t, where AdaBoost.R2' is identical to AdaBoost.R2 except that the weights of the first n instances are never modified. Similarly, use F-fold cross validation to obtain an estimate error_t of the error of model_t.
  2. Call Learner with T and distribution w^t, and get a hypothesis h_t : X → R.
  3. Calculate the adjusted error e_i^t for each instance as in AdaBoost.R2.
  4. Update the weight vector:
       w_i^{t+1} = w_i^t β_t^{e_i^t} / Z_t   for 1 ≤ i ≤ n
       w_i^{t+1} = w_i^t / Z_t               for n + 1 ≤ i ≤ n + m
     where Z_t is a normalizing constant, and β_t is chosen such that the resulting weight of the target (final m) instances is m/(n + m) + (t/(S − 1)) · (1 − m/(n + m)).
Output model_t where t = argmin_i error_i.
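As an illustration of that binary search, here is a hedged Python sketch (our own naming and conventions, with source instances stored first as in Algorithm 3):

import numpy as np

def two_stage_beta_search(w, adj_err, n_source, target_frac, tol=1e-8):
    """Find beta_t in (0, 1) such that, after each source weight is scaled by
    beta_t ** e_i and the vector is renormalized, the target instances
    (indices >= n_source) hold a fraction target_frac of the total weight."""
    def target_fraction(beta):
        cand = w.copy()
        cand[:n_source] *= beta ** adj_err[:n_source]   # only source weights move
        return cand[n_source:].sum() / cand.sum()

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # a smaller beta shrinks source weights more, raising the target fraction
        if target_fraction(mid) > target_frac:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    new_w = w.copy()
    new_w[:n_source] *= beta ** adj_err[:n_source]
    return new_w / new_w.sum(), beta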
4.3. Best Uniform Initial Weight

Finally, as another baseline for comparison, we test an algorithm that simply calls AdaBoost.R2 with the combined source and target data, but attempts to find the best initial ratio of total weight between the source and target data. As in two-stage TrAdaBoost.R2, we try total target weights ranging from m/(n + m) to 1 in S steps and choose the best weighting using cross validation. However, in this case all source instances have equal initial weights (i.e., there is no attempt to set individual weights based on errors), and no distinction is made between source and target instances once AdaBoost.R2 is called – source instances with high errors will have their weights increased just like target instances will. (In fact, this is not a boosting-specific algorithm, as any learner could be used in place of AdaBoost.R2; we use AdaBoost.R2 as the learner only to allow a direct comparison between the results.)
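A small sketch of the corresponding search grid (illustrative names only, with source instances stored before target instances as elsewhere in this section):

import numpy as np

def candidate_initial_weights(n_source, n_target, n_steps):
    """Yield (target_fraction, initial_weight_vector) pairs, with the total
    target weight spaced evenly from m/(n+m) to 1 and spread uniformly
    within each block."""
    n, m = n_source, n_target
    for frac in np.linspace(m / (n + m), 1.0, n_steps):
        w = np.concatenate([np.full(n, (1.0 - frac) / n),   # uniform source block
                            np.full(m, frac / m)])          # uniform target block
        yield frac, w

Each candidate weighting would then be passed to AdaBoost.R2 and scored by cross validation.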
5. Data Transformation

In Section 2.1, we stated that both source and target concepts had labels in the same output space. In many regression settings in which we might consider transfer, however, different concepts might have labels with considerably different label distributions. While we largely view this as a data preparation issue (e.g., labels can be expressed in comparable terms, such as using relative instead of absolute prices in financial data) and thus beyond the scope of this paper, in our experiments we do take some simple measures to ensure similar label distributions.
In algorithms making use of experts trained on source data, we can directly modify the experts so that their outputs on the target data fall in an appropriate range. We do so by evaluating the experts on the target training data (thus making use only of data available to the learner) and performing linear regression to find the linear transformation that best fits the outputs to the true labels. This transformation is then applied whenever the expert is used by the learning algorithm. In algorithms using the source data directly, for each source data set we train an expert on the set, find the linear transformation in the same manner, and then apply this transformation to the labels in the source data set before passing it to the learning algorithm. On the data sets described in the following section, we found that this procedure was worthwhile, as it often resulted in a significant increase in accuracy while only occasionally producing a slight decrease in accuracy. We note that trying regression with higher degree polynomials tended to produce modest improvements at best and large decreases in accuracy at worst.
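For the expert-based case, this amounts to a univariate least-squares fit of the expert's outputs to the target training labels; the sketch below uses our own naming and wraps the expert in a closure. For the source-data case, the same fitted map would instead be applied to the source labels before training.

import numpy as np

def fit_output_transform(expert, X_target, y_target):
    """Fit a linear map a * h(x) + b from an expert's outputs to the target
    training labels, and return the expert with that map applied."""
    out = expert.predict(X_target)
    A = np.column_stack([out, np.ones(len(out))])
    (a, b), *_ = np.linalg.lstsq(A, y_target, rcond=None)
    return lambda X: a * expert.predict(X) + b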
6. Experiments

We now evaluate our boosting algorithms on seven different problems: four data sets from the UCI Repository, a space of artificial data sets created from a known function, and two prediction problems from multiagent systems. Experiments were performed using the WEKA 3.4 (Witten & Frank, 1999) machine learning package with default parameters for the base learners. In the first group of experiments, we tested two base learners. For the remainder, we used the regression algorithm in WEKA giving the lowest error when used alone as the base learner. The following parameters were used (where appropriate): N = 30, S = 30, and F = 10. Experts were generated by running AdaBoost.R2 on a complete source data set. We used AdaBoost.R2 as the baseline non-transfer algorithm in each experiment as it consistently produced lower errors than using the base learner alone and offers a fair comparison against boosting transfer algorithms. Results said to be significant are statistically significant (p < .05) according to paired t-tests.

6.1. Four UCI Data Sets

We begin by comparing the results of all eight algorithms described above on four data sets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html): concrete strength, housing, auto MPG, and automobile. (We chose the first four data sets that represented standard regression problems and had a few hundred instances; no other data sets were tried.) We divide these standard regression data sets into target and source sets by using a variation on the technique used by Dai et al. (2007) in a classification setting. For each data set, we identify a continuous feature that has a moderate degree of correlation (around 0.4) with the label. We then sort the instances by this feature, divide the set in thirds (low, medium, and high), and remove this feature from the resulting sets. By dividing based on a feature moderately correlated with the label, we hope to produce three data sets that represent slightly different concepts; if the correlation were zero, the concepts might be identical, and if the correlation were high, the concepts might be significantly different and have very different label ranges. In each experiment, we use one data set as the target and the other two as sources, for a total of 12 experiments.

Table 1 shows the results of all eight learning algorithms on all 12 experiments using both M5P model trees and neural networks as base learners. Target data training sets contained 25 instances. (Increasing this number resulted in qualitatively similar results.) Source data sets ranged from 68 to 343 instances. Each result represents the average RMS error over 30 runs. Numbers in bold represent results that are among the best – either the lowest error, or not significantly higher. Numbers in italics represent results that are not significantly better than AdaBoost.R2, that is, those where transfer failed.

The best expert is significantly better than AdaBoost.R2 exactly half of the time, but is sometimes much worse, suggesting that the degree of similarity between source and target data sets varies considerably across the range of experiments. Not surprisingly, the cases where the best expert fares worst are often those where other expert-based algorithms fare poorly.
ExpBoost.R2 performs poorly, beating AdaBoost.R2 significantly only five out of 24 times. Transfer stacking (performing stacking once with AdaBoost.R2 as a base learner) and boosted transfer stacking do much better, each beating AdaBoost.R2 significantly 15 times, suggesting that there is a benefit to considering linear combinations of models instead of only individual models. Interestingly, the error of transfer stacking is usually fairly close to that of boosted transfer stacking when both perform well. When both perform poorly, however, the error of boosted transfer stacking is typically close to that of AdaBoost.R2, while the error of transfer stacking is much worse. It may be the case that performing transfer stacking across multiple boosting iterations is not necessary for effective transfer but is effective in preventing overfitting when transfer is not possible.

TrAdaBoost.R2 (with the number of boosting iterations chosen using cross validation) gives promising but somewhat erratic results, beating AdaBoost.R2 significantly 16 times but performing much worse in a few cases. Two-stage TrAdaBoost.R2 produces much better results and is the clear winner in this set of experiments, finishing among the top algorithms 20 out of 24 times and failing to significantly beat AdaBoost.R2 only once. Interestingly, simply finding the best uniform initial weighting also performs well, significantly outperforming AdaBoost.R2 17 times.

Overall, these results suggest that making use of source data directly is more effective than using source experts. However, the expert-based algorithms still had the best performance in a few cases, and it is worth noting that they are much less computationally intensive, due to using smaller amounts of data (target data only) and not requiring extensive cross validation.

For the remaining experiments, we note that boosted transfer stacking and two-stage TrAdaBoost.R2 (the primary contributions of this paper) continue to perform as well as or (usually) better than their counterparts (algorithms using source experts or source data, respectively), and so for clarity we omit the results of the other transfer algorithms.

Table 1. RMS error on four UCI data sets, each divided into three concepts, using M5P model trees and neural networks as base learners. Bold: lowest error; italic: not significantly better than AdaBoost.R2 (95% confidence in each case).

Base learner M5P:
  Algorithm             | Concrete Strength    | Housing          | Auto MPG         | Automobile
  AdaBoost.R2           | 10.26  11.01  13.26  | 3.65  3.59  6.52 | 2.90  2.92  4.35 | 1963  3576  4893
  best expert           |  8.63   8.08   9.55  | 2.98  5.27  9.98 | 2.38  2.57  4.44 | 1374  3741  6059
  ExpBoost.R2           | 10.11   9.64  11.76  | 3.02  3.74  6.66 | 2.53  2.94  4.39 | 1791  3661  4932
  boosted t. stacking   |  8.47   7.48  10.03  | 3.03  3.99  7.24 | 2.30  2.75  4.48 | 1325  3480  4811
  transfer stacking     |  8.60   7.31  10.17  | 3.07  5.49  8.39 | 2.47  2.65  4.57 | 1327  3631  5640
  best unif. init. wt.  | 10.25   6.98   8.66  | 2.99  3.42  6.52 | 2.35  2.59  4.33 | 1734  2678  2940
  TrAdaBoost.R2 (CV)    | 10.76   7.04   9.71  | 3.38  3.57  7.03 | 2.19  2.58  4.24 | 1815  2851  3527
  2-Stage TrAdaBoost    |  8.74   6.49   8.66  | 2.99  3.12  6.12 | 2.14  2.52  4.21 | 1564  2555  3202

Base learner NN:
  Algorithm             | Concrete Strength    | Housing          | Auto MPG         | Automobile
  AdaBoost.R2           | 10.47  11.95  14.84  | 3.89  3.67  7.54 | 2.76  3.55  5.17 | 1593  3200  3836
  best expert           | 10.12   9.67  13.87  | 7.00  4.99  9.22 | 2.66  2.77  4.43 | 1481  2484  6119
  ExpBoost.R2           | 10.14  11.62  13.37  | 3.88  3.66  7.63 | 2.79  3.50  5.20 | 1392  3174  3829
  boosted t. stacking   |  9.49   9.80  12.93  | 3.75  3.58  7.74 | 2.48  3.00  4.40 | 1215  2640  3761
  transfer stacking     |  9.65   9.46  13.13  | 4.63  4.36  8.37 | 2.43  2.83  4.53 | 1144  2632  5081
  best unif. init. wt.  | 10.48   8.02  10.77  | 3.89  3.00  6.46 | 2.44  2.80  4.19 | 1312  2277  2858
  TrAdaBoost.R2 (CV)    | 11.34   9.05  11.91  | 4.02  3.29  7.68 | 2.33  2.80  4.37 | 1718  2573  3268
  2-Stage TrAdaBoost    | 10.43   8.09   9.92  | 3.27  2.99  6.45 | 2.14  2.60  4.18 | 1290  2276  2843

6.2. Friedman #1

Friedman #1 (Friedman, 1991) is a well known regression problem, and we use a modified version that allows us to generate a variety of related concepts. Each instance x is a feature vector of length ten, with each component x_i drawn independently from the uniform distribution [0, 1]. The label for each instance is dependent on only the first five features:

  y = a_1 · 10 sin(π(b_1 x_1 + c_1) · (b_2 x_2 + c_2)) + a_2 · 20(b_3 x_3 + c_3 − 0.5)^2 + a_3 · 10(b_4 x_4 + c_4) + a_4 · 5(b_5 x_5 + c_5) + N(0, 1)

where N is the normal distribution, and each a_i, b_i, and c_i is a fixed parameter. In the original Friedman #1 problem, each a_i and b_i is 1 while each c_i is 0, and we use these values when generating the target data set T_target. To generate each of the source data sets, we draw each a_i and b_i from N(1, 0.1d) and each c_i from N(0, 0.05d), where d is a parameter that controls how similar the source and target data sets are.
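A sketch of this generator follows; we read the second argument of N(·, ·) as a standard deviation, and the function names and defaults are our own.

import numpy as np

def friedman1_variant(n, a=None, b=None, c=None, rng=None):
    """Draw n instances of the modified Friedman #1 problem. With the defaults
    a_i = b_i = 1 and c_i = 0 this is the original (target) concept."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.ones(4) if a is None else a
    b = np.ones(5) if b is None else b
    c = np.zeros(5) if c is None else c
    X = rng.uniform(0.0, 1.0, size=(n, 10))              # only the first five features matter
    y = (a[0] * 10 * np.sin(np.pi * (b[0] * X[:, 0] + c[0]) * (b[1] * X[:, 1] + c[1]))
         + a[1] * 20 * (b[2] * X[:, 2] + c[2] - 0.5) ** 2
         + a[2] * 10 * (b[3] * X[:, 3] + c[3])
         + a[3] * 5 * (b[4] * X[:, 4] + c[4])
         + rng.normal(0.0, 1.0, size=n))                 # N(0, 1) noise
    return X, y

def source_concept_params(d, rng=None):
    """Perturbed parameters for one source concept: a_i, b_i ~ N(1, 0.1 d), c_i ~ N(0, 0.05 d)."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.normal(1.0, 0.1 * d, size=4),
            rng.normal(1.0, 0.1 * d, size=5),
            rng.normal(0.0, 0.05 * d, size=5))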
We performed experiments using several values of d and values of 1 and 5 for B (the number of source data sets), expecting that transfer would be most effective for smaller values of d and the larger value of B. For each value of d, we randomly generated 100 of each of the following: i) target training data sets (of varying sizes), ii) target testing sets (of size 10,000), and iii) groups of 5 source data sets (each of size 1000). Neural networks were chosen as the best base learner.

Figure 1 shows the results when d = 1; results for other values of d were qualitatively similar. As expected, using transfer increased accuracy the most for lower values of d and higher values of B. When we used one source, boosted transfer stacking significantly outperformed AdaBoost.R2 when there were 250 target instances or fewer, while two-stage TrAdaBoost.R2 was significantly better than either algorithm for up to 300 instances. With five sources, boosted transfer stacking actually performed slightly better than 2-stage TrAdaBoost.R2 (the difference was significant for at least 75 instances), and both transfer algorithms were significantly better than AdaBoost.R2 for all points plotted.

[Figure 1. Friedman #1 (NN): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2 (1 source and 5 sources), and boosted transfer stacking (1 source and 5 sources).]

6.3. TAC SCM and TAC Travel

While the previous data sets are useful for testing the performance of our algorithms, it is important to also experiment with naturally occurring data in domains where transfer would be applied in the real world. We now consider two such domains, taken from two e-commerce scenarios from the Trading Agent Competition: a supply chain management scenario (TAC SCM) (Eriksson et al., 2006), and a travel agent scenario (TAC Travel) (Wellman et al., 2007). In both scenarios, autonomous agents compete against each other in simulated economies to maximize profits. Many agents use some form of learning to make predictions about future prices, but the manner in which these prices change over time can depend heavily on the identities of the competing agents – essentially, different groups of agents represent different economies. This fact suggests the possibility of an agent using transfer learning to make use of past experience in different economies. In fact, many agents designed for the competition, while not explicitly casting the problem as transfer learning, deal in some way with the issue of making use of training data from these different sources. While these competitions are only abstractions of real-life markets, opportunities for applying transfer learning certainly exist in real markets as well, and these competitions represent valuable testbeds for research into these opportunities.

The first scenario we consider is TAC SCM, in which agents compete as computer manufacturers. We collected experience in three different economies as follows. We generated three source data sets using three different groups of agent binaries provided by competition participants. The target data set came from the final round of the 2006 competition. Each instance consists of 31 features of the economy at some point in time and is labeled with a particular change in future computer prices that would be of interest to an agent. Full details of these data sets are available in (Pardoe & Stone, 2007).

In TAC Travel, agents complete travel packages by bidding in simultaneous auctions for flights, hotels, and entertainment. We consider the problem of predicting the closing prices of hotel auctions given the current state of all auctions, represented by 51 features (as described in (Schapire et al., 2002)). We use data from the 2006 competition final round as the target data, and data from the 2004 and 2005 final rounds as the source data sets. (In each year's final round, all games consisted of the same agents, but between years the agents changed.)

Learning curves for 30 runs on each data set are shown in Figures 2 and 3. M5P model trees were chosen as the best base learner in both cases. Both two-stage TrAdaBoost.R2 and boosted transfer stacking significantly outperform AdaBoost.R2 for any number of target instances. Two-stage TrAdaBoost.R2 significantly outperforms boosted transfer stacking for any number of instances on the SCM data set and for 150 instances or less on the Travel data set.

[Figure 2. TAC SCM (M5P): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.]
[Figure 3. TAC Travel (M5P): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.]

7. Related Work

The most closely related work to this research, that on boosting and classification transfer, is described in Section 2. One important item we have not discussed, as this paper is empirical in its focus, is the theoretical properties of the algorithms discussed here. One of the attractive features of AdaBoost is its theoretical guarantees (e.g., convergence to zero error on the training set) (Freund & Schapire, 1997). We note that no theoretical results currently exist for ExpBoost or AdaBoost.R2; however, analogues of the main properties of AdaBoost have been proven to apply to TrAdaBoost, and a straightforward transformation of these proofs shows that these properties also extend to the combination of TrAdaBoost and AdaBoost.RT (mentioned in Section 2.3). Developing theoretical guarantees for the other algorithms discussed here, in both classification and regression settings, is an important area for future work.
The lowest common denominator of transfer learning methods is the leveraging of information from a source domain to speed up or otherwise improve learning in a different target domain. Transfer learning bears resemblance to classic case-based reasoning (Kolodner, 1993), especially in the need to reason about the similarity between tasks and instances. More recently, transfer learning has been studied in a variety of different settings, including statistical relational learning (Mihalkova et al., 2007), reinforcement learning (Taylor, 2009), and classification as described in Section 2. A key property of the classification and our regression setting is that the source and target domains typically have the same input and output spaces (X and Y in our notation), which is not always the case, for example in the reinforcement learning setting. As such, the problem studied here could be considered one of concept drift (Schlimmer & Granger, 1986), in which the target concept changes over time. This property also differentiates our setting from multitask learning (Caruana, 1997), in which multiple related concepts sharing an input representation but with potentially unrelated outputs are to be learned simultaneously; however, some multitask learning methods could potentially be modified to address our setting.

8. Conclusions and Future Work

We explored a number of boosting-based regression transfer algorithms that make use of either models trained on source data or the source data itself. The primary contribution of this paper is the introduction of boosted transfer stacking and two-stage TrAdaBoost.R2, both of which have their roots in existing classification transfer approaches. Both show promise, and two-stage TrAdaBoost.R2 in particular was consistently effective across a wide range of experimental domains.

There are a number of areas in which this work could be expanded. So far, we have only experimented with the domains and base learners described. Future work is needed to better understand which transfer algorithms are best suited for which domains, and how different choices of base learners and learning parameters interact with these algorithms. Also, additional methods of adapting boosting for regression could be explored, and additional techniques for improving boosting (such as regularization) could be tried. Finally, it would be interesting to see whether the extensions to ExpBoost and TrAdaBoost described here prove useful in the classification setting for which those algorithms were originally designed.

Acknowledgments

This work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (IIS-0917122), ONR (N00014-09-1-0658), DARPA (FA8650-08-C-7812), and the FHWA (DTFH61-07-H-00030).

References

Caruana, Rich. Multitask learning. In Machine Learning, pp. 41–75, 1997.

Dai, Wenyuan, Yang, Qiang, Xue, Gui-rong, and Yu, Yong. Boosting for transfer learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.

Drucker, Harris. Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 107–115, 1997.

Eriksson, Joakim, Finne, Niclas, and Janson, Sverker. Evolution of a supply chain management game for the Trading Agent Competition. AI Communications, 19:1–12, 2006.

Freund, Yoav and Schapire, Robert. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

Friedman, Jerome. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1–141, 1991.

Kolodner, Janet. Case-Based Reasoning. Morgan Kaufmann, 1993.

Mihalkova, Lilyana, Huynh, Tuyen, and Mooney, Raymond. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd Conference on Artificial Intelligence, pp. 608–614, July 2007.

Pan, Sinno Jialin and Yang, Qiang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 99, 2009. ISSN 1041-4347.

Pardoe, David and Stone, Peter. Adapting price predictions in TAC SCM. In AAMAS 2007 Workshop on Agent Mediated Electronic Commerce, 2007.

Rettinger, Achim, Zinkevich, Martin, and Bowling, Michael. Boosting expert ensembles for rapid concept recall. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, July 2006.

Schapire, Robert E., Stone, Peter, McAllester, David, Littman, Michael L., and Csirik, János A. Modeling auction price uncertainty using boosting-based conditional density estimation. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

Schlimmer, J. and Granger, R. Beyond incremental processing: Tracking concept drift. In Proceedings of the 5th National Conference on Artificial Intelligence, pp. 502–507, 1986.

Shrestha, D. L. and Solomatine, D. P. Experiments with AdaBoost.RT, an improved boosting scheme for regression. Neural Computation, 18(7):1678–1710, 2006.

Taylor, Matthew E. Transfer in Reinforcement Learning Domains. Springer Verlag, 2009.

Wellman, Michael P., Greenwald, Amy, and Stone, Peter. Autonomous Bidding Agents: Strategies and Lessons from the Trading Agent Competition. MIT Press, 2007.

Witten, Ian H. and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

Wolpert, David H. Stacked generalization. Neural Networks, 5:241–259, 1992.
