
In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML 2010), Haifa, Israel, June 2010.

Boosting for Regression Transfer

David Pardoe and Peter Stone {dpardoe, pstone}@cs.utexas.edu


The University of Texas at Austin, 1 University Station C0500, Austin, TX 78712 USA

Abstract

The goal of transfer learning is to improve the learning of a new target concept given knowledge of related source concept(s). We introduce the first boosting-based algorithms for transfer learning that apply to regression tasks. First, we describe two existing classification transfer algorithms, ExpBoost and TrAdaBoost, and show how they can be modified for regression. We then introduce extensions of these algorithms that improve performance significantly on controlled experiments in a wide range of test domains.

1. Introduction

The idea behind transfer learning (Pan & Yang, 2009) is that it is easier to learn a new concept (such as how to play the trombone) if you are already familiar with a similar concept (such as playing the trumpet). In the context of supervised learning, inductive transfer learning is often framed as the problem of learning a concept of interest, called the target concept, given data from multiple sources: a typically small amount of target data that reflects the target concept, and a larger amount of source data that reflects one or more different, but possibly related, source concepts. A number of algorithms have been developed to address this situation in classification settings, but much less attention has been paid to regression settings.

One general approach that has been applied successfully to classification transfer is boosting. In this paper, we introduce and evaluate the first boosting-based algorithms for regression transfer. These algorithms can be divided into two categories: algorithms that make use of models trained on the source data, and algorithms that use the source data directly. We first describe an existing classification transfer algorithm from each category (ExpBoost (Rettinger et al., 2006) and TrAdaBoost (Dai et al., 2007), respectively), and we show how these algorithms can be modified for a regression setting. Next, we present the primary contribution of this paper: two new algorithms designed to overcome shortcomings observed in these modified algorithms. Finally, we present experimental results for all algorithms in seven test domains.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. Regression Transfer

In this section, we specify our learning problem and outline two approaches to solving this problem. Then we provide necessary background on boosting for regression problems.

2.1. Problem Specification

Our goal is to learn a model of a concept c_target mapping feature vectors from the space X to labels in the space Y. In binary classification problems, Y = {0, 1}, while in the regression problems studied here, Y = R. We are given a set of training instances T_target = {(x_i, y_i)}, with x_i ∈ X and y_i ∈ Y for 1 ≤ i ≤ n, that reflect c_target. In addition, we are given data sets T_source^1, ..., T_source^B reflecting B different, but possibly related, concepts also mapping X to Y. In order to learn the most accurate possible model of c_target, we must decide how to use both the target and source data sets. If T_target is sufficiently large, we can likely learn a good model using only this data. However, if T_target is small and one or more of the source concepts is similar to c_target, then we may be able to use the source data to improve our model.

2.2. ExpBoost and TrAdaBoost

In this paper, we will consider regression transfer algorithms that fit into two categories: those that make use of models trained on the source data, and those that use the source data directly as training data. The algorithms we will present in these two categories are inspired by two boosting-based algorithms for classification transfer, ExpBoost (Rettinger et al., 2006) and TrAdaBoost (Dai et al., 2007). Boosting is an ensemble method in which a sequence of models (or hypotheses) h_1, ..., h_N, each mapping from X to Y, are iteratively fit to some transformation of a data set using a base learner. The outputs of these models are then combined into a final hypothesis h_f.
In ExpBoost, a separate hypothesis (or expert, hence the name) h^i is learned for each of the B source data sets, and learning is performed using only T_target. At each step of the boosting process, ExpBoost chooses to use either the hypothesis h_t learned from the weighted training data or one of the experts, depending on which is most accurate.

In contrast, TrAdaBoost uses the source data sets directly by combining them with T_target to form a single data set. At each boosting step, TrAdaBoost increases the relative weights of target instances that are misclassified. When a source instance is misclassified, however, its weight is decreased. In this way, TrAdaBoost aims to identify and make use of those source instances that are most similar to the target data while ignoring those that are dissimilar.

We provide additional details on these algorithms and their extensions below, but first we address the issue of applying boosting algorithms to regression problems.

2.3. AdaBoost and Regression

One of the best known boosting methods for classification, and the one upon which ExpBoost and TrAdaBoost are based, is AdaBoost (specifically, AdaBoost.M1) (Freund & Schapire, 1997). In AdaBoost, each training instance receives a weight w_i that is used when learning each hypothesis; this weight indicates the relative importance of each instance and is used in computing the error of a hypothesis on the data set. After each iteration, instances are reweighted, with those instances that are not correctly classified by the last hypothesis receiving larger weights (as in step 5 of Algorithm 1). Thus, as the process continues, learning focuses on those instances that are most difficult to classify.

A number of methods have been proposed for modifying AdaBoost for regression, and as TrAdaBoost and ExpBoost are based on AdaBoost, these modifications can be used on them as well. In our work, we explored two of these methods that have been shown to be generally effective and that can be applied to TrAdaBoost and ExpBoost in a straightforward way: AdaBoost.R2 and AdaBoost.RT.

The key to AdaBoost is the reweighting of those instances that are misclassified at each iteration. In regression problems, the output given by a hypothesis h_t for an instance x_i is not correct or incorrect, but has a real-valued error e_i = |y_i − h_t(x_i)| that may be arbitrarily large. Thus, we need a method of mapping an error e_i into an adjusted error e'_i that can be used in the reweighting formula used by AdaBoost.

The method used in AdaBoost.R2 (Drucker, 1997) is to express each error in relation to the largest error D = max_i |e_i| in such a way that each adjusted error e'_i is in the range [0, 1]. In particular, one of three possible loss functions is used: e'_i = e_i/D (linear), e'_i = e_i^2/D^2 (square), or e'_i = 1 − exp(−e_i/D) (exponential). The degree to which instance x_i is reweighted in iteration t thus depends on how large the error of h_t is on x_i relative to the error on the worst instance. AdaBoost.RT (Shrestha & Solomatine, 2006), on the other hand, continues to label each output as correct (e'_i = 0) or incorrect (e'_i = 1) using an error threshold φ. That is, if e_i > φ, then e'_i = 1; otherwise, e'_i = 0.

In preliminary experiments, we found AdaBoost.R2 with the linear loss function to work consistently well, and were unable to find values of φ that allowed AdaBoost.RT to regularly match this performance. In the remainder of this paper we consider only AdaBoost.R2 with the linear loss function, shown in Algorithm 1.

Algorithm 1 AdaBoost.R2 (Drucker, 1997)

Input: the labeled target data set T of size n, the maximum number of iterations N, and a base learning algorithm Learner. Unless otherwise specified, set the initial weight vector w^1 such that w_i^1 = 1/n for 1 ≤ i ≤ n.
For t = 1, ..., N:
  1. Call Learner with the training set T and the distribution w^t, and get a hypothesis h_t : X → R.
  2. Calculate the adjusted error e_i^t for each instance: let D_t = max_{j=1..n} |y_j − h_t(x_j)|; then e_i^t = |y_i − h_t(x_i)| / D_t.
  3. Calculate the adjusted error of h_t: ε_t = Σ_{i=1..n} e_i^t w_i^t; if ε_t ≥ 0.5, stop and set N = t − 1.
  4. Let β_t = ε_t / (1 − ε_t).
  5. Update the weight vector: w_i^{t+1} = w_i^t β_t^{(1 − e_i^t)} / Z_t, where Z_t is a normalizing constant.
Output the hypothesis: h_f(x) = the weighted median of h_t(x) for 1 ≤ t ≤ N, using ln(1/β_t) as the weight for hypothesis h_t.
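For concreteness, the following is a minimal Python/NumPy sketch of Algorithm 1 with the linear loss. The scikit-learn decision-tree base learner, the function names, and the default parameters are illustrative assumptions on our part; the experiments in this paper were run with WEKA learners, so this is a sketch rather than the reference implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaboost_r2(X, y, n_iter=30, make_learner=lambda: DecisionTreeRegressor(max_depth=4)):
    """Sketch of AdaBoost.R2 (Algorithm 1) with the linear loss."""
    n = len(y)
    w = np.full(n, 1.0 / n)                              # w_i^1 = 1/n
    models, betas = [], []
    for _ in range(n_iter):
        h = make_learner().fit(X, y, sample_weight=w)    # step 1
        abs_err = np.abs(y - h.predict(X))
        D = abs_err.max()
        if D == 0:                                       # perfect fit: keep it and stop
            models.append(h); betas.append(1e-10); break
        e = abs_err / D                                  # step 2: adjusted errors in [0, 1]
        eps = np.sum(e * w)                              # step 3
        if eps >= 0.5:
            break
        beta = eps / (1.0 - eps)                         # step 4
        models.append(h); betas.append(beta)
        w = w * beta ** (1.0 - e)                        # step 5
        w /= w.sum()                                     # Z_t normalization
    return models, np.log(1.0 / np.array(betas))

def weighted_median_predict(models, log_weights, X):
    """h_f(x): weighted median of the h_t(x), weighting h_t by ln(1/beta_t)."""
    preds = np.array([m.predict(X) for m in models])     # shape (T, num_points)
    order = np.argsort(preds, axis=0)
    cum = np.cumsum(log_weights[order], axis=0)
    idx = np.argmax(cum >= 0.5 * log_weights.sum(), axis=0)
    cols = np.arange(preds.shape[1])
    return preds[order[idx, cols], cols]

Typical use would be models, lw = adaboost_r2(X_train, y_train) followed by weighted_median_predict(models, lw, X_test).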
3. Using Source Models

In this section we describe four regression transfer algorithms based on making use of source models. In addition to the target data, each algorithm receives as input a set of experts H^B = {h^1, ..., h^B}, each corresponding to a source data set.

3.1. ExpBoost.R2

Combining the principles of AdaBoost.R2 with those of ExpBoost results in the new regression algorithm ExpBoost.R2.
The steps involving computing the adjusted error and outputting the final hypothesis correspond to the same steps from Algorithm 1. The primary difference is in step 1 of each boosting iteration. After obtaining h_t, ExpBoost.R2 computes the weighted errors of each expert in H^B on the current weighting of T, and if any expert has a lower weighted error than h_t, h_t is replaced with the best expert.
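As a concrete illustration of that modified step, here is a small Python sketch of one ExpBoost.R2 iteration; fit_base (a weighted fitting routine) and the expert objects (anything with a predict method) are assumed interfaces of our own, not code from the paper.

import numpy as np

def expboost_r2_step(X, y, w, fit_base, experts):
    """One ExpBoost.R2 boosting iteration: keep the newly fit hypothesis only if
    no source expert has a lower weighted error under the current weights w."""
    h_t = fit_base(X, y, w)                           # hypothesis from the weighted target data
    candidates = [h_t] + list(experts)
    weighted_err = [np.sum(w * np.abs(y - c.predict(X))) for c in candidates]
    return candidates[int(np.argmin(weighted_err))]   # h_t or the best expert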
3.2. (Boosted) Transfer Stacking

In ExpBoost.R2, the final hypothesis that is produced represents a combination of the provided experts and additional hypotheses learned with the base learner. However, at each boosting iteration, ExpBoost.R2 must choose between either the newly learned hypothesis or a single expert. We now consider relaxing this constraint by allowing a linear combination of hypotheses to be chosen at each iteration.

The details of our approach are very similar to those of stacking (or stacked generalization) (Wolpert, 1992). Stacking is an ensemble approach in which a meta-level model combines multiple base models, all trained independently on the same set of data using different learning algorithms. The meta-level model is learned (typically using linear regression) from a meta-level data set created as follows. For each instance in the original training set, a meta-level instance is created using the outputs of each base model as features and using the original label. Cross validation is performed for each base learner so that the output for each instance in the original training set is obtained when it is out-of-sample. Once the meta-level model is learned, a new instance is handled by using the model to combine the outputs of the base models on the instance.

Here, instead of multiple base learners, we consider a single base learner and (potentially) multiple experts previously trained on source data; hence cross validation is required only for the base learner and not for the experts. Thus, at each boosting iteration, we perform linear least squares regression to find a linear combination of the new hypothesis and experts that best fits the data for the current iteration, and store the result as the iteration's hypothesis. As a result of the similarity to stacking, we call this combination approach transfer stacking. Since our full boosting approach reduces to calling AdaBoost.R2 with transfer stacking as the base learner, we give details only for transfer stacking, shown as Algorithm 2.

Algorithm 2 Transfer Stacking

Input: a labeled data set T = {(x_i, y_i)} of size n, a set of experts H^B = {h^1, ..., h^B}, the number of folds F for cross validation, and a base learning algorithm Learner.
  1. Let O_{i,j} = h^j(x_i) for 1 ≤ i ≤ n and 1 ≤ j ≤ B.
  2. Perform F-fold cross validation on T using Learner. For 1 ≤ i ≤ n, let O_{i,B+1} equal the output of the learned model for the fold where instance i is in the validation set.
  3. Call Learner with the full training set T and get hypothesis h^{B+1}.
  4. Perform linear least squares regression on the system of equations (Σ_{j=1..B+1} a_j O_{i,j}) + a_{B+2} = y_i for 1 ≤ i ≤ n; that is, find the linear combination of hypotheses that minimizes squared error.
Output the hypothesis: h_f(x) = a_1 h^1(x) + ... + a_{B+1} h^{B+1}(x) + a_{B+2}.

We note that transfer stacking by itself (without the use of boosting) could be used as a transfer algorithm, and so in the experiments of Section 6 we evaluate both plain transfer stacking (using AdaBoost.R2 as its base learner for a fair comparison) and the full approach, which we call boosted transfer stacking.
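The following is a minimal scikit-learn/NumPy sketch of Algorithm 2; the decision-tree base learner, the function names, and the closure used for the final hypothesis are our own illustrative choices.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def transfer_stacking(X, y, experts, n_folds=10,
                      make_learner=lambda: DecisionTreeRegressor(max_depth=4)):
    """Sketch of Algorithm 2 (transfer stacking)."""
    n = len(y)
    O = np.column_stack([h.predict(X) for h in experts])         # step 1: expert outputs
    oof = np.empty(n)                                             # step 2: out-of-fold outputs
    for train_idx, val_idx in KFold(n_splits=n_folds).split(X):
        fold_model = make_learner().fit(X[train_idx], y[train_idx])
        oof[val_idx] = fold_model.predict(X[val_idx])
    h_new = make_learner().fit(X, y)                              # step 3: model on the full set
    A = np.column_stack([O, oof, np.ones(n)])                     # step 4: least squares fit of
    a, *_ = np.linalg.lstsq(A, y, rcond=None)                     # coefficients a_1 .. a_{B+2}
    hypotheses = list(experts) + [h_new]

    def h_f(X_new):
        feats = np.column_stack([h.predict(X_new) for h in hypotheses] + [np.ones(len(X_new))])
        return feats @ a
    return h_f

Boosted transfer stacking would then amount to handing a learner of this form to AdaBoost.R2 as its base learner, with the fits additionally made to respect the per-instance boosting weights.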
3.3. Best Expert

Finally, as a baseline, we test the algorithm that simply uses the best expert from H^B; that is, the expert with the lowest error on the target training data.

4. Using Source Data Directly

We now describe three algorithms that take as input both the target and all source data sets and that train on a combination of this data.

4.1. TrAdaBoost.R2

Combining the principles of AdaBoost.R2 with those of TrAdaBoost results in the new regression algorithm TrAdaBoost.R2. TrAdaBoost.R2 takes two data sets as input, T_target and T_source, of size m and n, respectively, and combines them into a single set T used in boosting. Although the original work on TrAdaBoost does not consider the issue of multiple sources, we are interested in cases where any number of sources may exist. When there is more than one source, we simply combine all source data sets into a single data set. As TrAdaBoost.R2 handles the reweighting of each training instance separately, there should be no harm in mixing data in this fashion, but care should be taken in setting the initial weight vector. Our experiments involve source data sets of (roughly) equal sizes, and so we simply assign all source instances (and target instances) the same weight. If one source data set were larger than another, however, setting weights uniformly would result in more emphasis being given to that source, at least in early boosting iterations.
As with ExpBoost.R2, the steps involving computing the adjusted error correspond to the same steps from Algorithm 1. The primary difference between TrAdaBoost.R2 and Algorithm 1 is in step 5 of each iteration. Instead of treating all data equally, TrAdaBoost.R2 increases the weights of target instances by setting w_i^{t+1} = w_i^t β_t^{−e_i^t} / Z_t and decreases the weights of source instances by setting w_i^{t+1} = w_i^t β^{e_i^t} / Z_t, where β = 1 / (1 + sqrt(2 ln n / N)). In addition, TrAdaBoost.R2 considers only the final ⌈N/2⌉ hypotheses when taking the weighted median to determine output (as a result of theoretical considerations in the original TrAdaBoost).
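A sketch of just that reweighting step, under our own naming (adj_err holds the adjusted errors e_i^t, is_source marks source instances, n_source is the number of source instances, and n_iter is the number of boosting iterations N):

import numpy as np

def tradaboost_r2_reweight(w, adj_err, is_source, beta_t, n_source, n_iter):
    """TrAdaBoost.R2 weight update: target weights grow with error via
    beta_t ** (-e_i), while source weights shrink with error via beta ** e_i."""
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_source) / n_iter))
    new_w = np.where(is_source,
                     w * beta ** adj_err,        # source: decrease with error
                     w * beta_t ** -adj_err)     # target: increase with error
    return new_w / new_w.sum()                   # Z_t normalization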
4.2. Two-stage TrAdaBoost.R2

In analyzing the performance of TrAdaBoost.R2, we observed it to be highly susceptible to overfitting (that is, beyond some point, accuracy decreased as the number of boosting iterations N increased). In contrast, AdaBoost.R2 and the algorithms of Section 3 do not appear to suffer from this problem. After experimenting with cross validation to select N, we still saw mixed performance from TrAdaBoost.R2. Closer inspection of the results revealed two problems. First, when the size of T_source is much larger than T_target, it can take many iterations for the total weight of the target instances to approach the total weight of the source instances, and by this time the weights of the target data may be heavily skewed – those target instances that are either outliers or most dissimilar to the source data may represent most of the weight. Second, even those source instances that are representative of the target concept tend to have their weights reduced to zero eventually. The use of the adjusted error scheme from AdaBoost.R2 is the reason. Whereas in TrAdaBoost the relevant source instances will generally be classified correctly and not have their weights reduced, in TrAdaBoost.R2 even small errors lead to weight reductions. The fact that TrAdaBoost.R2 uses only the hypotheses generated during the final half of boosting iterations exacerbates this problem. (We note that we also tried using all hypotheses, with mixed results.)

To address these problems, we designed a version of TrAdaBoost.R2 that adjusts instance weights in two stages. In stage one, the weights of source instances are adjusted downwards gradually until reaching a certain point (determined through cross validation). In stage two, the weights of all source instances are frozen while the weights of target instances are updated as normal in AdaBoost.R2. Only the hypotheses generated in stage two are stored and used to determine the output of the resulting model. We call this algorithm two-stage TrAdaBoost.R2, and show it in Algorithm 3. Note that the weighting factor β_t is not chosen based on the hypothesis error, as before, but is chosen to result in a certain total weight for the target instances. In this way, the total weight of the target instances increases uniformly from m/(n + m) to 1 in S steps. In our implementation, we approximated the value of β_t satisfying the conditions shown in Algorithm 3 using a binary search. In addition, it is not necessary to progress through all S steps once it has been determined that errors are increasing.

Algorithm 3 Two-stage TrAdaBoost.R2

Input: two labeled data sets T_source (of size n) and T_target (of size m), the number of steps S, the maximum number of boosting iterations N, the number of folds F for cross validation, and a base learning algorithm Learner. Let T be the combination of T_source and T_target such that the first n instances in T are those from T_source. Set the initial weight vector w^1 such that w_i^1 = 1/(n + m) for 1 ≤ i ≤ n + m.
For t = 1, ..., S:
  1. Call AdaBoost.R2' with T, distribution w^t, N, and Learner to obtain model_t, where AdaBoost.R2' is identical to AdaBoost.R2 except that the weights of the first n instances are never modified. Similarly, use F-fold cross validation to obtain an estimate error_t of the error of model_t.
  2. Call Learner with T and distribution w^t, and get a hypothesis h_t : X → R.
  3. Calculate the adjusted error e_i^t for each instance as in AdaBoost.R2.
  4. Update the weight vector:
       w_i^{t+1} = w_i^t β_t^{e_i^t} / Z_t   for 1 ≤ i ≤ n
       w_i^{t+1} = w_i^t / Z_t               for n + 1 ≤ i ≤ n + m
     where Z_t is a normalizing constant, and β_t is chosen such that the resulting weight of the target (final m) instances is m/(n + m) + (t/(S − 1)) · (1 − m/(n + m)).
Output model_t where t = argmin_i error_i.
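As an illustration of that binary search, here is a hedged Python sketch (our own naming and conventions, with source instances stored first as in Algorithm 3):

import numpy as np

def two_stage_beta_search(w, adj_err, n_source, target_frac, tol=1e-8):
    """Find beta_t in (0, 1) such that, after each source weight is scaled by
    beta_t ** e_i and the vector is renormalized, the target instances
    (indices >= n_source) hold a fraction target_frac of the total weight."""
    def target_fraction(beta):
        cand = w.copy()
        cand[:n_source] *= beta ** adj_err[:n_source]   # only source weights move
        return cand[n_source:].sum() / cand.sum()

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # a smaller beta shrinks source weights more, raising the target fraction
        if target_fraction(mid) > target_frac:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    new_w = w.copy()
    new_w[:n_source] *= beta ** adj_err[:n_source]
    return new_w / new_w.sum(), beta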
4.3. Best Uniform Initial Weight

Finally, as another baseline for comparison, we test an algorithm that simply calls AdaBoost.R2 with the combined source and target data, but attempts to find the best initial ratio of total weight between the source and target data. As in two-stage TrAdaBoost.R2, we try total target weights ranging from m/(n + m) to 1 in S steps and choose the best weighting using cross validation. However, in this case all source instances have equal initial weights (i.e., there is no attempt to set individual weights based on errors), and no distinction is made between source and target instances once AdaBoost.R2 is called – source instances with high errors will have their weights increased just like target instances will. (In fact, this is not a boosting-specific algorithm, as any learner could be used in place of AdaBoost.R2; we use AdaBoost.R2 as the learner only to allow a direct comparison between the results.)
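A small sketch of the corresponding search grid (illustrative names only, with source instances stored before target instances as elsewhere in this section):

import numpy as np

def candidate_initial_weights(n_source, n_target, n_steps):
    """Yield (target_fraction, initial_weight_vector) pairs, with the total
    target weight spaced evenly from m/(n+m) to 1 and spread uniformly
    within each block."""
    n, m = n_source, n_target
    for frac in np.linspace(m / (n + m), 1.0, n_steps):
        w = np.concatenate([np.full(n, (1.0 - frac) / n),   # uniform source block
                            np.full(m, frac / m)])          # uniform target block
        yield frac, w

Each candidate weighting would then be passed to AdaBoost.R2 and scored by cross validation.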
5. Data Transformation

In Section 2.1, we stated that both source and target concepts had labels in the same output space. In many regression settings in which we might consider transfer, however, different concepts might have labels with considerably different label distributions. While we largely view this as a data preparation issue (e.g., labels can be expressed in comparable terms, such as using relative instead of absolute prices in financial data) and thus beyond the scope of this paper, in our experiments we do take some simple measures to ensure similar label distributions.
In algorithms making use of experts trained on source data, we can directly modify the experts so that their outputs on the target data fall in an appropriate range. We do so by evaluating the experts on the target training data (thus making use only of data available to the learner) and performing linear regression to find the linear transformation that best fits the outputs to the true labels. This transformation is then applied whenever the expert is used by the learning algorithm. In algorithms using the source data directly, for each source data set we train an expert on the set, find the linear transformation in the same manner, and then apply this transformation to the labels in the source data set before passing it to the learning algorithm. On the data sets described in the following section, we found that this procedure was worthwhile, as it often resulted in a significant increase in accuracy while only occasionally producing a slight decrease in accuracy. We note that trying regression with higher degree polynomials tended to produce modest improvements at best and large decreases in accuracy at worst.
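For the expert-based case, this amounts to a univariate least-squares fit of the expert's outputs to the target training labels; the sketch below uses our own naming and wraps the expert in a closure. For the source-data case, the same fitted map would instead be applied to the source labels before training.

import numpy as np

def fit_output_transform(expert, X_target, y_target):
    """Fit a linear map a * h(x) + b from an expert's outputs to the target
    training labels, and return the expert with that map applied."""
    out = expert.predict(X_target)
    A = np.column_stack([out, np.ones(len(out))])
    (a, b), *_ = np.linalg.lstsq(A, y_target, rcond=None)
    return lambda X: a * expert.predict(X) + b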
6. Experiments

We now evaluate our boosting algorithms on seven different problems: four data sets from the UCI Repository, a space of artificial data sets created from a known function, and two prediction problems from multiagent systems. Experiments were performed using the WEKA 3.4 (Witten & Frank, 1999) machine learning package with default parameters for the base learners. In the first group of experiments, we tested two base learners. For the remainder, we used the regression algorithm in WEKA giving the lowest error when used alone as the base learner. The following parameters were used (where appropriate): N = 30, S = 30, and F = 10. Experts were generated by running AdaBoost.R2 on a complete source data set. We used AdaBoost.R2 as the baseline non-transfer algorithm in each experiment as it consistently produced lower errors than using the base learner alone and offers a fair comparison against boosting transfer algorithms. Results said to be significant are statistically significant (p < .05) according to paired t-tests.

6.1. Four UCI Data Sets

We begin by comparing the results of all eight algorithms described above on four data sets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html): concrete strength, housing, auto MPG, and automobile. (We chose the first four data sets that represented standard regression problems and had a few hundred instances; no other data sets were tried.) We divide these standard regression data sets into target and source sets by using a variation on the technique used by Dai et al. (2007) in a classification setting. For each data set, we identify a continuous feature that has a moderate degree of correlation (around 0.4) with the label. We then sort the instances by this feature, divide the set in thirds (low, medium, and high), and remove this feature from the resulting sets. By dividing based on a feature moderately correlated with the label, we hope to produce three data sets that represent slightly different concepts; if the correlation were zero, the concepts might be identical, and if the correlation were high, the concepts might be significantly different and have very different label ranges. In each experiment, we use one data set as the target and the other two as sources, for a total of 12 experiments.

Table 1 shows the results of all eight learning algorithms on all 12 experiments using both M5P model trees and neural networks as base learners. Target data training sets contained 25 instances. (Increasing this number resulted in qualitatively similar results.) Source data sets ranged from 68 to 343 instances. Each result represents the average RMS error over 30 runs. Numbers in bold represent results that are among the best – either the lowest error, or not significantly higher. Numbers in italics represent results that are not significantly better than AdaBoost.R2, that is, those where transfer failed.

The best expert is significantly better than AdaBoost.R2 exactly half of the time, but is sometimes much worse, suggesting that the degree of similarity between source and target data sets varies considerably across the range of experiments. Not surprisingly, the cases where the best expert fares worst are often those where other expert-based algorithms fare poorly.
ExpBoost.R2 performs poorly, beating AdaBoost.R2 significantly only five out of 24 times. Transfer stacking (performing stacking once with AdaBoost.R2 as a base learner) and boosted transfer stacking do much better, each beating AdaBoost.R2 significantly 15 times, suggesting that there is a benefit to considering linear combinations of models instead of only individual models. Interestingly, the error of transfer stacking is usually fairly close to that of boosted transfer stacking when both perform well. When both perform poorly, however, the error of boosted transfer stacking is typically close to that of AdaBoost.R2, while the error of transfer stacking is much worse. It may be the case that performing transfer stacking across multiple boosting iterations is not necessary for effective transfer but is effective in preventing overfitting when transfer is not possible.

TrAdaBoost.R2 (with the number of boosting iterations chosen using cross validation) gives promising but somewhat erratic results, beating AdaBoost.R2 significantly 16 times but performing much worse in a few cases. Two-stage TrAdaBoost.R2 produces much better results and is the clear winner in this set of experiments, finishing among the top algorithms 20 out of 24 times and failing to significantly beat AdaBoost.R2 only once. Interestingly, simply finding the best uniform initial weighting also performs well, significantly outperforming AdaBoost.R2 17 times.

Overall, these results suggest that making use of source data directly is more effective than using source experts. However, the expert-based algorithms still had the best performance in a few cases, and it is worth noting that they are much less computationally intensive, due to using smaller amounts of data (target data only) and not requiring extensive cross validation.

For the remaining experiments, we note that boosted transfer stacking and two-stage TrAdaBoost.R2 (the primary contributions of this paper) continue to perform as well as or (usually) better than their counterparts (algorithms using source experts or source data, respectively), and so for clarity we omit the results of the other transfer algorithms.

Table 1. RMS error on four UCI data sets, each divided into three concepts, using M5P model trees and neural networks as base learners. Bold: lowest error; italic: not significantly better than AdaBoost.R2 (95% confidence in each case).

Base learner M5P:
  Algorithm             | Concrete Strength    | Housing          | Auto MPG         | Automobile
  AdaBoost.R2           | 10.26  11.01  13.26  | 3.65  3.59  6.52 | 2.90  2.92  4.35 | 1963  3576  4893
  best expert           |  8.63   8.08   9.55  | 2.98  5.27  9.98 | 2.38  2.57  4.44 | 1374  3741  6059
  ExpBoost.R2           | 10.11   9.64  11.76  | 3.02  3.74  6.66 | 2.53  2.94  4.39 | 1791  3661  4932
  boosted t. stacking   |  8.47   7.48  10.03  | 3.03  3.99  7.24 | 2.30  2.75  4.48 | 1325  3480  4811
  transfer stacking     |  8.60   7.31  10.17  | 3.07  5.49  8.39 | 2.47  2.65  4.57 | 1327  3631  5640
  best unif. init. wt.  | 10.25   6.98   8.66  | 2.99  3.42  6.52 | 2.35  2.59  4.33 | 1734  2678  2940
  TrAdaBoost.R2 (CV)    | 10.76   7.04   9.71  | 3.38  3.57  7.03 | 2.19  2.58  4.24 | 1815  2851  3527
  2-Stage TrAdaBoost    |  8.74   6.49   8.66  | 2.99  3.12  6.12 | 2.14  2.52  4.21 | 1564  2555  3202

Base learner NN:
  Algorithm             | Concrete Strength    | Housing          | Auto MPG         | Automobile
  AdaBoost.R2           | 10.47  11.95  14.84  | 3.89  3.67  7.54 | 2.76  3.55  5.17 | 1593  3200  3836
  best expert           | 10.12   9.67  13.87  | 7.00  4.99  9.22 | 2.66  2.77  4.43 | 1481  2484  6119
  ExpBoost.R2           | 10.14  11.62  13.37  | 3.88  3.66  7.63 | 2.79  3.50  5.20 | 1392  3174  3829
  boosted t. stacking   |  9.49   9.80  12.93  | 3.75  3.58  7.74 | 2.48  3.00  4.40 | 1215  2640  3761
  transfer stacking     |  9.65   9.46  13.13  | 4.63  4.36  8.37 | 2.43  2.83  4.53 | 1144  2632  5081
  best unif. init. wt.  | 10.48   8.02  10.77  | 3.89  3.00  6.46 | 2.44  2.80  4.19 | 1312  2277  2858
  TrAdaBoost.R2 (CV)    | 11.34   9.05  11.91  | 4.02  3.29  7.68 | 2.33  2.80  4.37 | 1718  2573  3268
  2-Stage TrAdaBoost    | 10.43   8.09   9.92  | 3.27  2.99  6.45 | 2.14  2.60  4.18 | 1290  2276  2843

6.2. Friedman #1

Friedman #1 (Friedman, 1991) is a well known regression problem, and we use a modified version that allows us to generate a variety of related concepts. Each instance x is a feature vector of length ten, with each component x_i drawn independently from the uniform distribution [0, 1]. The label for each instance is dependent on only the first five features:

  y = a_1 · 10 sin(π(b_1 x_1 + c_1) · (b_2 x_2 + c_2)) + a_2 · 20(b_3 x_3 + c_3 − 0.5)^2 + a_3 · 10(b_4 x_4 + c_4) + a_4 · 5(b_5 x_5 + c_5) + N(0, 1)

where N is the normal distribution, and each a_i, b_i, and c_i is a fixed parameter. In the original Friedman #1 problem, each a_i and b_i is 1 while each c_i is 0, and we use these values when generating the target data set T_target. To generate each of the source data sets, we draw each a_i and b_i from N(1, 0.1d) and each c_i from N(0, 0.05d), where d is a parameter that controls how similar the source and target data sets are.
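A sketch of this generator follows; we read the second argument of N(·, ·) as a standard deviation, and the function names and defaults are our own.

import numpy as np

def friedman1_variant(n, a=None, b=None, c=None, rng=None):
    """Draw n instances of the modified Friedman #1 problem. With the defaults
    a_i = b_i = 1 and c_i = 0 this is the original (target) concept."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.ones(4) if a is None else a
    b = np.ones(5) if b is None else b
    c = np.zeros(5) if c is None else c
    X = rng.uniform(0.0, 1.0, size=(n, 10))              # only the first five features matter
    y = (a[0] * 10 * np.sin(np.pi * (b[0] * X[:, 0] + c[0]) * (b[1] * X[:, 1] + c[1]))
         + a[1] * 20 * (b[2] * X[:, 2] + c[2] - 0.5) ** 2
         + a[2] * 10 * (b[3] * X[:, 3] + c[3])
         + a[3] * 5 * (b[4] * X[:, 4] + c[4])
         + rng.normal(0.0, 1.0, size=n))                 # N(0, 1) noise
    return X, y

def source_concept_params(d, rng=None):
    """Perturbed parameters for one source concept: a_i, b_i ~ N(1, 0.1 d), c_i ~ N(0, 0.05 d)."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.normal(1.0, 0.1 * d, size=4),
            rng.normal(1.0, 0.1 * d, size=5),
            rng.normal(0.0, 0.05 * d, size=5))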
We performed experiments using several values of d and values of 1 and 5 for B (the number of source data sets), expecting that transfer would be most effective for smaller values of d and the larger value of B. For each value of d, we randomly generated 100 of each of the following: i) target training data sets (of varying sizes), ii) target testing sets (of size 10,000), and iii) groups of 5 source data sets (each of size 1000). Neural networks were chosen as the best base learner.

Figure 1 shows the results when d = 1; results for other values of d were qualitatively similar. As expected, using transfer increased accuracy the most for lower values of d and higher values of B. When we used one source, boosted transfer stacking significantly outperformed AdaBoost.R2 when there were 250 target instances or fewer, while two-stage TrAdaBoost.R2 was significantly better than either algorithm for up to 300 instances. With five sources, boosted transfer stacking actually performed slightly better than 2-stage TrAdaBoost.R2 (the difference was significant for at least 75 instances), and both transfer algorithms were significantly better than AdaBoost.R2 for all points plotted.

[Figure 1. Friedman #1 (NN): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2 (1 source and 5 sources), and boosted transfer stacking (1 source and 5 sources).]

6.3. TAC SCM and TAC Travel

While the previous data sets are useful for testing the performance of our algorithms, it is important to also experiment with naturally occurring data in domains where transfer would be applied in the real world. We now consider two such domains, taken from two e-commerce scenarios from the Trading Agent Competition: a supply chain management scenario (TAC SCM) (Eriksson et al., 2006), and a travel agent scenario (TAC Travel) (Wellman et al., 2007). In both scenarios, autonomous agents compete against each other in simulated economies to maximize profits. Many agents use some form of learning to make predictions about future prices, but the manner in which these prices change over time can depend heavily on the identities of the competing agents – essentially, different groups of agents represent different economies. This fact suggests the possibility of an agent using transfer learning to make use of past experience in different economies. In fact, many agents designed for the competition, while not explicitly casting the problem as transfer learning, deal in some way with the issue of making use of training data from these different sources. While these competitions are only abstractions of real-life markets, opportunities for applying transfer learning certainly exist in real markets as well, and these competitions represent valuable testbeds for research into these opportunities.

The first scenario we consider is TAC SCM, in which agents compete as computer manufacturers. We collected experience in three different economies as follows. We generated three source data sets using three different groups of agent binaries provided by competition participants. The target data set came from the final round of the 2006 competition. Each instance consists of 31 features of the economy at some point in time and is labeled with a particular change in future computer prices that would be of interest to an agent. Full details of these data sets are available in (Pardoe & Stone, 2007).

In TAC Travel, agents complete travel packages by bidding in simultaneous auctions for flights, hotels, and entertainment. We consider the problem of predicting the closing prices of hotel auctions given the current state of all auctions, represented by 51 features (as described in (Schapire et al., 2002)). We use data from the 2006 competition final round as the target data, and data from the 2004 and 2005 final rounds as the source data sets. (In each year's final round, all games consisted of the same agents, but between years the agents changed.)

Learning curves for 30 runs on each data set are shown in Figures 2 and 3. M5P model trees were chosen as the best base learner in both cases. Both two-stage TrAdaBoost.R2 and boosted transfer stacking significantly outperform AdaBoost.R2 for any number of target instances. Two-stage TrAdaBoost.R2 significantly outperforms boosted transfer stacking for any number of instances on the SCM data set and for 150 instances or less on the Travel data set.

[Figure 2. TAC SCM (M5P): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.]
[Figure 3. TAC Travel (M5P): RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.]

7. Related Work

The most closely related work to this research, that on boosting and classification transfer, is described in Section 2. One important item we have not discussed, as this paper is empirical in its focus, is the theoretical properties of the algorithms discussed here. One of the attractive features of AdaBoost is its theoretical guarantees (e.g., convergence to zero error on the training set) (Freund & Schapire, 1997). We note that no theoretical results currently exist for ExpBoost or AdaBoost.R2; however, analogues of the main properties of AdaBoost have been proven to apply to TrAdaBoost, and a straightforward transformation of these proofs shows that these properties also extend to the combination of TrAdaBoost and AdaBoost.RT (mentioned in Section 2.3). Developing theoretical guarantees for the other algorithms discussed here, in both classification and regression settings, is an important area for future work.
The lowest common denominator of transfer learning methods is the leveraging of information from a source domain to speed up or otherwise improve learning in a different target domain. Transfer learning bears resemblance to classic case-based reasoning (Kolodner, 1993), especially in the need to reason about the similarity between tasks and instances. More recently, transfer learning has been studied in a variety of different settings, including statistical relational learning (Mihalkova et al., 2007), reinforcement learning (Taylor, 2009), and classification as described in Section 2. A key property of the classification and our regression setting is that the source and target domains typically have the same input and output spaces (X and Y in our notation), which is not always the case, for example in the reinforcement learning setting. As such, the problem studied here could be considered one of concept drift (Schlimmer & Granger, 1986), in which the target concept changes over time. This property also differentiates our setting from multitask learning (Caruana, 1997), in which multiple related concepts sharing an input representation but with potentially unrelated outputs are to be learned simultaneously; however, some multitask learning methods could potentially be modified to address our setting.

8. Conclusions and Future Work

We explored a number of boosting-based regression transfer algorithms that make use of either models trained on source data or the source data itself. The primary contribution of this paper is the introduction of boosted transfer stacking and two-stage TrAdaBoost.R2, both of which have their roots in existing classification transfer approaches. Both show promise, and two-stage TrAdaBoost.R2 in particular was consistently effective across a wide range of experimental domains.

There are a number of areas in which this work could be expanded. So far, we have only experimented with the domains and base learners described. Future work is needed to better understand which transfer algorithms are best suited for which domains, and how different choices of base learners and learning parameters interact with these algorithms. Also, additional methods of adapting boosting for regression could be explored, and additional techniques for improving boosting (such as regularization) could be tried. Finally, it would be interesting to see whether the extensions to ExpBoost and TrAdaBoost described here prove useful in the classification setting for which those algorithms were originally designed.

Acknowledgments

This work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (IIS-0917122), ONR (N00014-09-1-0658), DARPA (FA8650-08-C-7812), and the FHWA (DTFH61-07-H-00030).

References

Caruana, Rich. Multitask learning. In Machine Learning, pp. 41–75, 1997.

Dai, Wenyuan, Yang, Qiang, Xue, Gui-rong, and Yu, Yong. Boosting for transfer learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.

Drucker, Harris. Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 107–115, 1997.

Eriksson, Joakim, Finne, Niclas, and Janson, Sverker. Evolution of a supply chain management game for the Trading Agent Competition. AI Communications, 19:1–12, 2006.

Freund, Yoav and Schapire, Robert. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

Friedman, Jerome. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1–141, 1991.

Kolodner, Janet. Case-Based Reasoning. Morgan Kaufmann, 1993.

Mihalkova, Lilyana, Huynh, Tuyen, and Mooney, Raymond. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd Conference on Artificial Intelligence, pp. 608–614, July 2007.

Pan, Sinno Jialin and Yang, Qiang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 99, 2009. ISSN 1041-4347.

Pardoe, David and Stone, Peter. Adapting price predictions in TAC SCM. In AAMAS 2007 Workshop on Agent Mediated Electronic Commerce, 2007.

Rettinger, Achim, Zinkevich, Martin, and Bowling, Michael. Boosting expert ensembles for rapid concept recall. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, July 2006.

Schapire, Robert E., Stone, Peter, McAllester, David, Littman, Michael L., and Csirik, János A. Modeling auction price uncertainty using boosting-based conditional density estimation. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

Schlimmer, J. and Granger, R. Beyond incremental processing: Tracking concept drift. In Proceedings of the 5th National Conference on Artificial Intelligence, pp. 502–507, 1986.

Shrestha, D. L. and Solomatine, D. P. Experiments with AdaBoost.RT, an improved boosting scheme for regression. Neural Computation, 18(7):1678–1710, 2006.

Taylor, Matthew E. Transfer in Reinforcement Learning Domains. Springer Verlag, 2009.

Wellman, Michael P., Greenwald, Amy, and Stone, Peter. Autonomous Bidding Agents: Strategies and Lessons from the Trading Agent Competition. MIT Press, 2007.

Witten, Ian H. and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

Wolpert, David H. Stacked generalization. Neural Networks, 5:241–259, 1992.
