Boosting for Regression Transfer
we largely view this as a data preparation issue (e.g., labels can be expressed in comparable terms, such as using relative instead of absolute prices in financial data) and thus beyond the scope of this paper, in our experiments we do take some simple measures to ensure similar label distributions.

In algorithms making use of experts trained on source data, we can directly modify the experts so that their outputs on the target data fall in an appropriate range. We do so by evaluating the experts on the target training data (thus making use only of data available to the learner) and performing linear regression to find the linear transformation that best fits the outputs to the true labels. This transformation is then applied whenever the expert is used by the learning algorithm. In algorithms using the source data directly, for each source data set we train an expert on the set, find the linear transformation in the same manner, and then apply this transformation to the labels in the source data set before passing it to the learning algorithm. On the data sets described in the following section, we found that this procedure was worthwhile, as it often resulted in a significant increase in accuracy while only occasionally producing a slight decrease in accuracy. We note that trying regression with higher degree polynomials tended to produce modest improvements at best and large decreases in accuracy at worst.
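To illustrate this correction step, the following is a minimal sketch in Python (our own illustration rather than code from the paper; `expert` stands for any trained regression model with a `predict` method, and the array names are placeholders):

    import numpy as np

    def fit_linear_correction(expert, X_target_train, y_target_train):
        # Evaluate the expert on the target training data and find, by least
        # squares, the slope/intercept that best maps its outputs to the true labels.
        preds = expert.predict(X_target_train)
        A = np.column_stack([preds, np.ones_like(preds)])
        slope, intercept = np.linalg.lstsq(A, y_target_train, rcond=None)[0]
        return slope, intercept

    def corrected_predict(expert, X, slope, intercept):
        # Applied whenever the expert is queried by the learning algorithm.
        return slope * expert.predict(X) + intercept

    def rescale_source_labels(y_source, slope, intercept):
        # For algorithms using source data directly: the transformation fitted for
        # an expert trained on a source set is applied to that set's labels.
        return slope * y_source + intercept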
6. Experiments

We now evaluate our boosting algorithms on seven different problems: four data sets from the UCI Repository, a space of artificial data sets created from a known function, and two prediction problems from multiagent systems. Experiments were performed using the WEKA 3.4 (Witten & Frank, 1999) machine learning package with default parameters for the base learners. In the first group of experiments, we tested two base learners. For the remainder, we used the regression algorithm in WEKA giving the lowest error when used alone as the base learner. The following parameters were used (where appropriate): N = 30, S = 30, and F = 10. Experts were generated by running AdaBoost.R2 on a complete source data set. We used AdaBoost.R2 as the baseline non-transfer algorithm in each experiment as it consistently produced lower errors than using the base learner alone and offers a fair comparison against boosting transfer algorithms. Results said to be significant are statistically significant (p < .05) according to paired t-tests.
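For concreteness, the kind of comparison described here can be sketched as follows (a hypothetical example with synthetic error values; scipy is assumed to be available):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical RMS errors from 30 paired runs (both algorithms see the same splits).
    errors_baseline = rng.normal(3.6, 0.3, size=30)  # e.g., AdaBoost.R2
    errors_transfer = rng.normal(3.1, 0.3, size=30)  # e.g., a transfer algorithm

    _, p_value = stats.ttest_rel(errors_baseline, errors_transfer)
    improved = errors_transfer.mean() < errors_baseline.mean() and p_value < 0.05
    print(f"p = {p_value:.4f}; significant improvement: {improved}")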
6.1. Four UCI Data Sets

We begin by comparing the results of all eight algorithms described above on four data sets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html): concrete strength, housing, auto MPG, and automobile. (We chose the first four data sets that represented standard regression problems and had a few hundred instances; no other data sets were tried.) We divide these standard regression data sets into target and source sets by using a variation on the technique used by Dai et al. (2007) in a classification setting. For each data set, we identify a continuous feature that has a moderate degree of correlation (around 0.4) with the label. We then sort the instances by this feature, divide the set in thirds (low, medium, and high), and remove this feature from the resulting sets. By dividing based on a feature moderately correlated with the label, we hope to produce three data sets that represent slightly different concepts; if the correlation were zero, the concepts might be identical, and if the correlation were high, the concepts might be significantly different and have very different label ranges. In each experiment, we use one data set as the target and the other two as sources, for a total of 12 experiments.
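As a sketch of this splitting procedure (our own NumPy illustration; the paper does not specify an implementation, and selecting the feature whose correlation is closest to 0.4 is our simplification):

    import numpy as np

    def split_into_concepts(X, y, desired_corr=0.4):
        # Pick the (continuous) feature whose correlation with the label is
        # closest in magnitude to the desired moderate value.
        corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        j = int(np.argmin(np.abs(np.abs(corrs) - desired_corr)))
        # Sort by that feature, drop it, and split the instances into thirds.
        order = np.argsort(X[:, j])
        X_sorted = np.delete(X[order], j, axis=1)
        y_sorted = y[order]
        parts = np.array_split(np.arange(len(y)), 3)  # low / medium / high
        return [(X_sorted[idx], y_sorted[idx]) for idx in parts]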
Table 1 shows the results of all eight learning algorithms on all 12 experiments using both M5P model trees and neural networks as base learners. Target data training sets contained 25 instances. (Increasing this number resulted in qualitatively similar results.) Source data sets ranged from 68 to 343 instances. Each result represents the average RMS error over 30 runs. Numbers in bold represent results that are among the best – either the lowest error, or not significantly higher. Numbers in italics represent results that are not significantly better than AdaBoost.R2, that is, those where transfer failed.

The best expert is significantly better than AdaBoost.R2 exactly half of the time, but is sometimes much worse, suggesting that the degree of similarity between source and target data sets varies considerably across the range of experiments. Not surprisingly, the cases where the best expert fares worst are often those where other expert-based algorithms fare poorly.

ExpBoost.R2 performs poorly, beating AdaBoost.R2 significantly only five out of 24 times. Transfer stacking (performing stacking once with AdaBoost.R2 as a base learner) and boosted transfer stacking do much better, each beating AdaBoost.R2 significantly 15 times, suggesting that there is a benefit to considering linear combinations of models instead of only individual models. Interestingly, the error of transfer stacking is usually fairly close to that of boosted transfer stacking when both perform well. When both perform poorly, however, the error of boosted transfer stacking
is typically close to that of AdaBoost.R2, while the error of transfer stacking is much worse. It may be the case that performing transfer stacking across multiple boosting iterations is not necessary for effective transfer but is effective in preventing overfitting when transfer is not possible.

TrAdaBoost.R2 (with the number of boosting iterations chosen using cross validation) gives promising but somewhat erratic results, beating AdaBoost.R2 significantly 16 times but performing much worse in a few cases. Two-stage TrAdaBoost.R2 produces much better results and is the clear winner in this set of experiments, finishing among the top algorithms 20 out of 24 times and failing to significantly beat AdaBoost.R2 only once. Interestingly, simply finding the best uniform initial weighting also performs well, significantly outperforming AdaBoost.R2 17 times.

Overall, these results suggest that making use of source data directly is more effective than using source experts. However, the expert-based algorithms still had the best performance in a few cases, and it is worth noting that they are much less computationally intensive, due to using smaller amounts of data (target data only) and not requiring extensive cross validation.

For the remaining experiments, we note that boosted transfer stacking and two-stage TrAdaBoost.R2 (the primary contributions of this paper) continue to perform as well as or (usually) better than their counterparts (algorithms using source experts or source data, respectively), and so for clarity we omit the results of the other transfer algorithms.

6.2. Friedman #1

Friedman #1 (Friedman, 1991) is a well known regression problem, and we use a modified version that allows us to generate a variety of related concepts. Each instance x is a feature vector of length ten, with each component xi drawn independently from the uniform distribution [0, 1]. The label for each instance is dependent on only the first five features:

y = a1 · 10 sin(π(b1 x1 + c1)(b2 x2 + c2)) + a2 · 20(b3 x3 + c3 − 0.5)^2 + a3 · 10(b4 x4 + c4) + a4 · 5(b5 x5 + c5) + N(0, 1)

where N is the normal distribution, and each ai, bi, and ci is a fixed parameter. In the original Friedman #1 problem, each ai and bi is 1 while each ci is 0, and we use these values when generating the target data set Ttarget. To generate each of the source data sets, we draw each ai and bi from N(1, 0.1d) and each ci from N(0, 0.05d), where d is a parameter that controls how similar the source and target data sets are.
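A minimal generator for this family of concepts might look as follows (our own sketch; we interpret the second argument of N(·, ·) as a standard deviation, and all names are ours):

    import numpy as np

    def make_friedman1_variant(n, d, rng, target=False):
        # target=True reproduces the original Friedman #1 setting
        # (each a_i and b_i equal to 1, each c_i equal to 0).
        if target:
            a, b, c = np.ones(4), np.ones(5), np.zeros(5)
        else:
            a = rng.normal(1.0, 0.1 * d, size=4)
            b = rng.normal(1.0, 0.1 * d, size=5)
            c = rng.normal(0.0, 0.05 * d, size=5)
        X = rng.uniform(0.0, 1.0, size=(n, 10))
        y = (a[0] * 10 * np.sin(np.pi * (b[0] * X[:, 0] + c[0]) * (b[1] * X[:, 1] + c[1]))
             + a[1] * 20 * (b[2] * X[:, 2] + c[2] - 0.5) ** 2
             + a[2] * 10 * (b[3] * X[:, 3] + c[3])
             + a[3] * 5 * (b[4] * X[:, 4] + c[4])
             + rng.normal(0.0, 1.0, size=n))
        return X, y

    # e.g., one target training set and one source set for a given d:
    rng = np.random.default_rng(0)
    X_tgt, y_tgt = make_friedman1_variant(100, d=1.0, rng=rng, target=True)
    X_src, y_src = make_friedman1_variant(1000, d=1.0, rng=rng)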
We performed experiments using several values of d and values of 1 and 5 for B (the number of source data sets), expecting that transfer would be most effective for smaller values of d and the larger value of B. For each value of d, we randomly generated 100 of each of the following: i) target training data sets (of varying sizes), ii) target testing sets (of size 10,000), and iii) groups of 5 source data sets (each of size 1000). Neural networks were chosen as the best base learner.

Figure 1 shows the results when d = 1; results for other values of d were qualitatively similar. As expected, using transfer increased accuracy the most for lower values of d and higher values of B. When we used one source, boosted transfer stacking significantly outperformed AdaBoost.R2 when there were 250 target instances or fewer, while two-stage TrAdaBoost.R2 was significantly better than either algorithm for up to 300 instances. With five sources, boosted transfer stacking actually performed slightly better than 2-stage TrAdaBoost.R2 (the difference was significant for at least 75 instances), and both transfer algorithms were significantly better than AdaBoost.R2 for all points plotted.

6.3. TAC SCM and TAC Travel

While the previous data sets are useful for testing the performance of our algorithms, it is important to also experiment with naturally occurring data in domains where transfer would be applied in the real world. We now consider two such domains, taken from two e-commerce scenarios from the Trading Agent Competition: a supply chain management scenario (TAC SCM) (Eriksson et al., 2006), and a travel agent scenario (TAC Travel) (Wellman et al., 2007). In both scenarios, autonomous agents compete against each other in simulated economies to maximize profits. Many agents use some form of learning to make predictions about future prices, but the manner in which these prices change over time can depend heavily on the identities of the competing agents – essentially, different groups of agents represent different economies. This fact suggests the possibility of an agent using transfer learning to make use of past experience in different economies. In fact, many agents designed for the competition, while not explicitly casting the problem as transfer learning, deal in some way with the issue of making use of training data from these different sources. While these competitions are only abstractions of real-life markets, opportunities for applying transfer learning certainly exist in real markets as well, and these competitions represent valuable testbeds for research into these opportunities.

The first scenario we consider is TAC SCM, in which agents compete as computer manufacturers.
Table 1. RMS error on four UCI datasets, each divided into three concepts, using M5P model trees and neural networks as base learners. Bold: lowest error; Italic: not significantly better than AdaBoost.R2 (95% confidence in each case). The three numbers under each data set correspond to its three concepts.

Base Lrnr. | Algorithm            | Concrete Strength | Housing        | Auto MPG       | Automobile
M5P        | AdaBoost.R2          | 10.26 11.01 13.26 | 3.65 3.59 6.52 | 2.90 2.92 4.35 | 1963 3576 4893
M5P        | best expert          |  8.63  8.08  9.55 | 2.98 5.27 9.98 | 2.38 2.57 4.44 | 1374 3741 6059
M5P        | ExpBoost.R2          | 10.11  9.64 11.76 | 3.02 3.74 6.66 | 2.53 2.94 4.39 | 1791 3661 4932
M5P        | boosted t. stacking  |  8.47  7.48 10.03 | 3.03 3.99 7.24 | 2.30 2.75 4.48 | 1325 3480 4811
M5P        | transfer stacking    |  8.60  7.31 10.17 | 3.07 5.49 8.39 | 2.47 2.65 4.57 | 1327 3631 5640
M5P        | best unif. init. wt. | 10.25  6.98  8.66 | 2.99 3.42 6.52 | 2.35 2.59 4.33 | 1734 2678 2940
M5P        | TrAdaBoost.R2 (CV)   | 10.76  7.04  9.71 | 3.38 3.57 7.03 | 2.19 2.58 4.24 | 1815 2851 3527
M5P        | 2-Stage TrAdaBoost   |  8.74  6.49  8.66 | 2.99 3.12 6.12 | 2.14 2.52 4.21 | 1564 2555 3202
NN         | AdaBoost.R2          | 10.47 11.95 14.84 | 3.89 3.67 7.54 | 2.76 3.55 5.17 | 1593 3200 3836
NN         | best expert          | 10.12  9.67 13.87 | 7.00 4.99 9.22 | 2.66 2.77 4.43 | 1481 2484 6119
NN         | ExpBoost.R2          | 10.14 11.62 13.37 | 3.88 3.66 7.63 | 2.79 3.50 5.20 | 1392 3174 3829
NN         | boosted t. stacking  |  9.49  9.80 12.93 | 3.75 3.58 7.74 | 2.48 3.00 4.40 | 1215 2640 3761
NN         | transfer stacking    |  9.65  9.46 13.13 | 4.63 4.36 8.37 | 2.43 2.83 4.53 | 1144 2632 5081
NN         | best unif. init. wt. | 10.48  8.02 10.77 | 3.89 3.00 6.46 | 2.44 2.80 4.19 | 1312 2277 2858
NN         | TrAdaBoost.R2 (CV)   | 11.34  9.05 11.91 | 4.02 3.29 7.68 | 2.33 2.80 4.37 | 1718 2573 3268
NN         | 2-Stage TrAdaBoost   | 10.43  8.09  9.92 | 3.27 2.99 6.45 | 2.14 2.60 4.18 | 1290 2276 2843
Figure 1. Friedman #1 (NN). RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2 (1 source, 5 sources), and boosted transfer stacking (1 source, 5 sources).
Figure 2. TAC SCM (M5P). RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.
Figure 3. TAC Travel (M5P). RMS error vs. number of target instances for AdaBoost.R2, 2-stage TrAdaBoost.R2, and boosted transfer stacking.
We collected experience in three different economies as follows. We generated three source data sets using three different groups of agent binaries provided by competition participants. The target data set came from the final round of the 2006 competition. Each instance consists of 31 features of the economy at some point in time and is labeled with a particular change in future computer prices that would be of interest to an agent. Full details of these data sets are available in (Pardoe & Stone, 2007).

In TAC Travel, agents complete travel packages by bidding in simultaneous auctions for flights, hotels, and entertainment. We consider the problem of predicting the closing prices of hotel auctions given the current state of all auctions, represented by 51 features (as described in (Schapire et al., 2002)). We use data from the 2006 competition final round as the target data, and data from the 2004 and 2005 final rounds as the source data sets. (In each year's final round, all games consisted of the same agents, but between years the agents changed.)

Learning curves for 30 runs on each data set are shown in Figures 2 and 3. M5P model trees were chosen as the best base learner in both cases. Both two-stage TrAdaBoost.R2 and boosted transfer stacking significantly outperform AdaBoost.R2 for any number of target instances. Two-stage TrAdaBoost.R2 significantly outperforms boosted transfer stacking for any number of instances on the SCM data set and for 150 instances or fewer on the Travel data set.

7. Related Work

The most closely related work to this research, that on boosting and classification transfer, is described in Section 2. One important item we have not discussed, as this paper is empirical in its focus, is the theoretical properties of the algorithms discussed here. One of the attractive features of AdaBoost is its theoretical guarantees (e.g., convergence to zero error on the training set) (Freund & Schapire, 1997). We note that no theoretical results currently exist for ExpBoost or AdaBoost.R2; however, analogues of the main properties of AdaBoost have been proven to apply to TrAdaBoost, and a straightforward transformation of these proofs shows that these properties also extend to the combination of TrAdaBoost and AdaBoost.RT (mentioned in Section 2.3). Developing theoretical guarantees for the other algorithms discussed here, in both classification and regression settings, is an important area for future work.

The lowest common denominator of transfer learning methods is the leveraging of information from a source domain to speed up or otherwise improve learning in a different target domain. Transfer learning bears resemblance to classic case-based reasoning (Kolodner, 1993), especially in the need to reason about