Functional Trees for Regression
João Gama
1 Introduction
The generalization capacity of a learning algorithm depends on the appropriate-
ness of its representation language to express a generalization of the examples
for the given task. Different learning algorithms employ different representations,
search heuristics, evaluation functions, and search spaces. It is now commonly
accepted that each algorithm has its own selective superiority [3]; each is best
for some but not all tasks. The design of algorithms that explore multiple representation languages and different search spaces has an intuitive appeal.
This paper presents one such algorithm.
In the context of supervised learning problems it is useful to distinguish
between classification problems and regression problems. In the former the target
variable takes values in a finite and pre-defined set of unordered values, and the
usual goal is to minimize a 0-1-loss function. In the latter the target variable
is ordered and takes values in a subset of ℝ. The usual goal is to minimize a squared error loss function. Mainly due to these differences in the type of the target variable, techniques that are successful for one type of problem are not directly applicable to the other.
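Stated as formulas, with $y$ the observed value and $\hat{y}$ the model's prediction, the two goals are to minimize, respectively,
$$L_{0/1}(y,\hat{y}) = \mathbf{1}[\hat{y} \neq y] \qquad \text{and} \qquad L_2(y,\hat{y}) = (y - \hat{y})^2.$$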
Gama [6] has presented a technique to combine classifiers that use distinct
representation languages using constructive induction. In this work we study
the applicability of a related technique for regression problems. In particular we
combine a linear model with a regression tree using constructive induction.
Generalized Linear Models. Generalized Linear Models (GLM) are the statistical technique most frequently applied to establish a relationship between several independent variables and a target variable. In the most general terms, a GLM has the form $w_0 + \sum_i w_i \times f_i(x_i)$. GLM estimation aims at minimizing the sum of squared deviations of the observed values of the dependent variable from those predicted by the model. One appealing characteristic is that this problem has an analytical solution: the coefficients that minimize the least squares error criterion are found by solving the equation $W = (X^T X)^{-1} X^T Y$. In this paper we assume that each $f_i$ is the identity function, which leads to multiple linear regression.
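As a concrete illustration, the following is a minimal sketch of this closed-form fit in Python with NumPy; the function name and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def least_squares_coefficients(X, y):
    """Closed-form least-squares fit, as described above.

    X : (n_examples, n_attributes) matrix of independent variables.
    y : (n_examples,) vector with the target variable.
    Returns the coefficient vector (w0, w1, ..., wd), with w0 the intercept.
    """
    # Prepend a column of ones so that w0 plays the role of the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    # W = (X^T X)^{-1} X^T Y; lstsq solves the same problem but is
    # numerically safer than forming the inverse explicitly.
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

# Toy example: fit a line to a few noisy points around y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(least_squares_coefficients(X, y))  # approximately [1.1, 1.9]
```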
The standard algorithm to build regression trees consists of two phases. In the
first phase a large decision tree is constructed. In the second phase this tree
is pruned back. The algorithm to grow the tree follows the standard divide-
and-conquer approach. The most relevant aspects are: the splitting rule, the
termination criterion, and the leaf assignment criterion. With respect to the last
criterion, the usual rule consists of assigning a constant to a leaf node. This
constant is usually the mean of the y values taken from the examples that fall
at this node. With respect to the splitting rule, each attribute value defines a
the linear-regression function is used later in the pruning phase. In this way, all decision nodes are based on the original attributes. Leaf nodes may contain a constructor model: a leaf node contains a constructor model if and only if, in the pruning algorithm, the estimated mean-squared error of the constructor model is lower than both the backed-up error and the estimated error the node would have if a leaf replaced it.
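Read as pseudo-code, this rule compares three error estimates at each node. The sketch below is only an illustration under my own naming; the handling of the constant-leaf versus subtree choice follows the standard pruning convention and is an assumption, not a detail given in the text.

```python
def prune_decision(mse_constructor, mse_constant_leaf, backed_up_error):
    """Illustrative sketch of the leaf-assignment rule discussed above.

    mse_constructor   -- estimated MSE of the constructor (linear-regression) model
    mse_constant_leaf -- estimated MSE if the node were replaced by a constant leaf
    backed_up_error   -- error backed up from the subtree rooted at this node
    """
    # The constructor model is kept only if it beats both alternatives.
    if mse_constructor < backed_up_error and mse_constructor < mse_constant_leaf:
        return "leaf with constructor (linear-regression) model"
    # Otherwise prune to a constant leaf when that is no worse than the subtree
    # (assumed standard behaviour, not stated explicitly in the text).
    if mse_constant_leaf <= backed_up_error:
        return "constant leaf (mean of the y values)"
    return "keep the subtree"
```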
3 An Illustrative Example
Fig. 1. (a) The univariate regression tree and (b) the top-down functional regression tree for the Housing problem.
Figure 2(b) presents the full functional regression tree using both the top-down and bottom-up multivariate approaches. In this case, decision nodes may (but need not) contain tests based on a linear combination of the original attributes, and leaf nodes may (but need not) predict values obtained from a linear-regression function built from the examples that fall at the node.
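The following is a minimal sketch of how prediction could be routed through such a functional tree; the Node fields and the predict function are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class Node:
    """Illustrative functional-tree node (an assumption, not the paper's code).

    A decision node tests either a single attribute (weights is None) or a
    linear combination of the attributes; a leaf predicts either a constant
    or the value of a linear model built from its training examples.
    """
    def __init__(self, attribute=None, weights=None, threshold=None,
                 left=None, right=None, constant=None, linear_model=None):
        self.attribute, self.weights, self.threshold = attribute, weights, threshold
        self.left, self.right = left, right
        self.constant, self.linear_model = constant, linear_model

def predict(node, x):
    """Route the example x down the tree and return the leaf prediction."""
    if node.left is None and node.right is None:      # leaf node
        if node.linear_model is not None:             # linear model: w0 + w . x
            w0, w = node.linear_model
            return w0 + float(np.dot(w, x))
        return node.constant                          # constant leaf (mean of y)
    # decision node: univariate test or linear-combination test
    value = x[node.attribute] if node.weights is None else float(np.dot(node.weights, x))
    return predict(node.left if value <= node.threshold else node.right, x)
```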
Fig. 2. (a) The bottom-up functional regression tree and (b) the functional regression tree for the Housing problem.
4 Related Work
Karalic [7] shows that employing linear regression in the tree leaves leads to smaller models with an increase in performance. Torgo [13] has presented an experimental study of functional models for regression tree leaves. Later, the same author [14] presented the system RT. When used with linear models at the leaves, RT builds and prunes a regular univariate tree; then, at each leaf, a linear model is built using the examples that fall at that leaf.
5 Experimental Evaluation
It is commonly accepted that multivariate regression trees should be competitive with univariate models. In this section we evaluate the proposed algorithm, its lesioned variants, and its components on a set of benchmark datasets. For comparative purposes we also evaluate the M5 system². The main goal of this experimental evaluation is to study how the position of the linear models inside a regression tree influences performance. We evaluate three situations:
– Trees that could use linear combinations at each internal node.
– Trees that could use linear combinations at each leaf.
– Trees that could use linear combinations at both internal nodes and leaves.
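To make the three situations concrete, one way they could be encoded is as a pair of flags; the names below are purely illustrative and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionalTreeConfig:
    linear_tests_at_internal_nodes: bool  # allow linear-combination tests in decision nodes
    linear_models_at_leaves: bool         # allow linear-regression models in leaf nodes

TOP_DOWN  = FunctionalTreeConfig(True, False)   # first situation above
BOTTOM_UP = FunctionalTreeConfig(False, True)   # second situation above
FULL      = FunctionalTreeConfig(True, True)    # third situation above
```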
All evaluated models are based on the same tree growing and pruning algorithm.
That is, they use exactly the same splitting criteria, stopping criteria, and prun-
ing mechanism. Moreover, they share many minor heuristics that individually are too small to mention but collectively can make a difference. In this way, the differences in the evaluation statistics are due to the differences in the conceptual model. In this work we estimate the performance of a learned model using the mean squared error (MSE) statistic.
² We have used M5 from the latest version of the Weka environment. Among the several regression systems we tried, M5 was the most competitive.
The results in terms of MSE and standard deviation are presented in Table
1. The first two columns refer to the results of the components of the hybrid
algorithm. The following three columns refer to the simplified versions of our
algorithm and the full model. The last column refers to the M5 system. For each
dataset, the algorithms are compared against the full multivariate tree using the
Wilcoxon signed-rank test. The null hypothesis is that the difference between
error rates has median value zero. A − (+) sign indicates that for this dataset
the performance of the algorithm was worse (better) than the full model with a p
value less than 0.01. It is interesting to note that the full model (MT) significantly
improves over both components (LR and UT) in 9 datasets out of 16. Table 1
also presents a comparative summary of the results. The first line presents the
geometric mean of the MSE statistic across all datasets. The second line shows
the average rank of all models, computed for each dataset by assigning rank 1
to the best algorithm, 2 to the second best and so on. The third line shows the
average ratio of MSE. This is computed for each dataset as the ratio between the
MSE of one algorithm and the MSE of M5. The fourth line shows the number of
significant differences using the signed-rank test taking the multivariate tree MT
as reference. We use the Wilcoxon Matched-Pairs Signed-Ranks Test to compare
the error rate of pairs of algorithms across datasets. The last line shows the p values associated with this test for the MSE results on all datasets, taking MT as reference.
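For concreteness, such a paired comparison could be computed with SciPy as sketched below; the MSE values are invented placeholders, not results from the paper.

```python
from scipy.stats import wilcoxon

# Paired MSE results of the full model (MT) and a competitor on the same
# datasets. These numbers are illustrative placeholders only.
mse_mt    = [12.3, 18.7, 9.4, 25.1, 14.0, 30.2]
mse_other = [13.1, 20.2, 9.2, 27.8, 15.5, 29.9]

# Null hypothesis: the paired differences have median zero.
stat, p_value = wilcoxon(mse_mt, mse_other)
print(f"W = {stat:.1f}, p = {p_value:.3f}")  # flag a significant difference when p < 0.01
```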
All the functional trees have similar performance. Using the significance test as criterion, FT is the best-performing algorithm. It is interesting to note that the bottom-up version is the most competitive algorithm. Nevertheless, there is a computational cost associated with the observed increase in performance: to run all the experiments reported here, FT requires almost 1.8 times more time than the univariate regression tree.
³ http://www.ncc.up.pt/∼ltorgo/Datasets
⁴ The actual implementation ignores missing values at learning time. At application time, if the value of the test attribute is unknown, all descendant branches produce a prediction; the final prediction is a weighted average of these predictions.
6 Conclusions
Acknowledgments
We gratefully acknowledge the financial support given by the FEDER project, the projects Sol Eu-Net, Metal, and ALES, and the Plurianual support of LIACC. We would like to thank Luis Torgo and the referees for their useful comments.
References
1. L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
2. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
3. Carla E. Brodley. Recursive automatic bias selection for classifier construction.
Machine Learning, 20:63–94, 1995.
4. Carla E. Brodley and Paul E. Utgoff. Multivariate decision trees. Machine Learn-
ing, 19:45–77, 1995.
5. João Gama. Probabilistic Linear Tree. In D. Fisher, editor, Machine Learning,
Proceedings of the 14th International Conference. Morgan Kaufmann, 1997.
6. João Gama and P. Brazdil. Cascade Generalization. Machine Learning, 41:315–
343, 2000.
7. Aram Karalic. Employing linear regression in regression tree leaves. In Bernard
Neumann, editor, European Conference on Artificial Intelligence, 1992.
8. S. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision
trees. Journal of Artificial Intelligence Research, 1994.
9. W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2nd edition. Cambridge University Press, 1992.
10. R. Quinlan. Learning with continuous classes. In Adams and Sterling, editors,
Proceedings of AI’92. World Scientific, 1992.
11. R. Quinlan. Combining instance-based and model-based learning. In P. Utgoff,
editor, ML93, Machine Learning, Proceedings of the 10th International Conference.
Morgan Kaufmann, 1993.
12. R. Quinlan. Data mining tools See5 and C5.0. Technical report, RuleQuest Re-
search, 1998.
13. Luis Torgo. Functional models for regression tree leaves. In D. Fisher, editor, Ma-
chine Learning, Proceedings of the 14th International Conference. Morgan Kauf-
mann, 1997.
14. Luis Torgo. Inductive Learning of Tree-based Regression Models. PhD thesis,
University of Porto, 2000.
15. P. Utgoff and C. Brodley. Linear machine decision trees. COINS Technical Report 91-10, University of Massachusetts, 1991.
16. Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.
17. D. Wolpert. Stacked generalization. In Neural Networks, volume 5, pages 241–260.
Pergamon Press, 1992.