Figure 14.4 Comparison of the squared error (green) with the absolute error (red), showing how the latter places much less emphasis on large errors and hence is more robust to outliers and mislabelled data points. [Plot of E(z) against z over the range −1 to 1.]
can be addressed by basing the boosting algorithm on the absolute deviation |y − t|
instead. These two error functions are compared in Figure 14.4.
14.4. Tree-based Models
There are various simple, but widely used, models that work by partitioning the
input space into cuboid regions, whose edges are aligned with the axes, and then
assigning a simple model (for example, a constant) to each region. They can be
viewed as a model combination method in which only one model is responsible
for making predictions at any given point in input space. The process of selecting
a specific model, given a new input x, can be described by a sequential decision
making process corresponding to the traversal of a binary tree (one that splits into
two branches at each node). Here we focus on a particular tree-based framework
called classification and regression trees, or CART (Breiman et al., 1984), although
there are many other variants going by such names as ID3 and C4.5 (Quinlan, 1986;
Quinlan, 1993).
Figure 14.5 shows an illustration of a recursive binary partitioning of the input
space, along with the corresponding tree structure. In this example, the first step
Figure 14.5 Illustration of a two-dimensional input space that has been partitioned into five regions using axis-aligned boundaries. [Diagram: axes x1 and x2; the region boundaries lie at the thresholds θ1, θ2, θ3 and θ4, and the regions are labelled as in Figure 14.6.]
Figure 14.6 Binary tree corresponding to the partitioning of input space shown in Figure 14.5. [Diagram: the internal nodes test the input variables x1 and x2 against the thresholds θ1, θ2, θ3 and θ4, starting with x1 > θ1 at the root; the leaf nodes correspond to the regions A, B, C, D and E.]
divides the whole of the input space into two regions according to whether x1 ≤ θ1
or x1 > θ1, where θ1 is a parameter of the model. This creates two subregions, each
of which can then be subdivided independently. For instance, the region x1 ≤ θ1
is further subdivided according to whether x2 ≤ θ2 or x2 > θ2, giving rise to the
regions denoted A and B. The recursive subdivision can be described by the traversal
of the binary tree shown in Figure 14.6. For any new input x, we determine which
region it falls into by starting at the top of the tree at the root node and following
a path down to a specific leaf node according to the decision criteria at each node.
Note that such decision trees are not probabilistic graphical models.
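To make this traversal concrete, here is a minimal Python sketch (illustrative code, not part of the original text): each internal node stores the index of the input variable to test and the corresponding threshold, and each leaf stores the simple model for its region. The names Node, Leaf and predict, the threshold values, and the left/right placement of the leaf labels are all assumptions of the sketch.

    class Node:
        """Internal node of a binary decision tree: tests one input variable
        against a threshold and routes the input to one of two subtrees."""
        def __init__(self, feature, threshold, left, right):
            self.feature = feature      # index of the input variable tested here
            self.threshold = threshold  # threshold parameter theta for the split
            self.left = left            # subtree for x[feature] <= threshold
            self.right = right          # subtree for x[feature] > threshold

    class Leaf:
        """Leaf node: stores the simple model (here just a constant) for its region."""
        def __init__(self, value):
            self.value = value

    def predict(node, x):
        """Route a new input x from the root node down to a leaf and return
        the prediction of the model stored at that leaf."""
        while isinstance(node, Node):
            node = node.right if x[node.feature] > node.threshold else node.left
        return node.value

    # A tree with the same shape as Figure 14.6; the threshold values and the
    # placement of the leaf labels are purely illustrative.
    t1, t2, t3, t4 = 0.5, 0.3, 0.7, 0.8
    tree = Node(0, t1,
                Node(1, t2, Leaf('A'), Leaf('B')),
                Node(1, t3, Node(0, t4, Leaf('C'), Leaf('D')), Leaf('E')))
    print(predict(tree, [0.2, 0.1]))   # routed to the x1 <= t1, x2 <= t2 leaf: 'A'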
Within each region, there is a separate model to predict the target variable. For
instance, in regression we might simply predict a constant over each region, or in
classification we might assign each region to a specific class. A key property of tree-
based models, which makes them popular in fields such as medical diagnosis, for
example, is that they are readily interpretable by humans because they correspond
to a sequence of binary decisions applied to the individual input variables. For in-
stance, to predict a patient’s disease, we might first ask “is their temperature greater
than some threshold?”. If the answer is yes, then we might next ask “is their blood
pressure less than some threshold?”. Each leaf of the tree is then associated with a
specific diagnosis.
In order to learn such a model from a training set, we have to determine the
structure of the tree, including which input variable is chosen at each node to form
the split criterion as well as the value of the threshold parameter θi for the split. We
also have to determine the values of the predictive variable within each region.
Consider first a regression problem in which the goal is to predict a single target
variable t from a D-dimensional vector x = (x1 , . . . , xD )T of input variables. The
training data consists of input vectors {x1 , . . . , xN } along with the corresponding
continuous labels {t1 , . . . , tN }. If the partitioning of the input space is given, and we
minimize the sum-of-squares error function, then the optimal value of the predictive
variable within any given region is just given by the average of the values of tn for
those data points that fall in that region (Exercise 14.10).
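As a small illustration of this result (a sketch under the assumption that the partition is supplied as a region index for each data point; the name region_predictions is illustrative), the optimal constant prediction for each region is simply the mean of the targets falling in it:

    import numpy as np

    def region_predictions(t, region_of):
        """Given targets t[n] and region_of[n], the index of the region containing
        x_n, return the sum-of-squares-optimal constant prediction for each region,
        namely the mean of the t_n falling in that region."""
        t = np.asarray(t, dtype=float)
        region_of = np.asarray(region_of)
        return {int(r): float(t[region_of == r].mean()) for r in np.unique(region_of)}

    print(region_predictions([1.0, 2.0, 4.0], [0, 0, 1]))   # {0: 1.5, 1: 4.0}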
Now consider how to determine the structure of the decision tree. Even for a
fixed number of nodes in the tree, the problem of determining the optimal structure
(including choice of input variable for each split as well as the corresponding thresh-
olds) to minimize the sum-of-squares error is usually computationally infeasible due
to the combinatorially large number of possible solutions. Instead, a greedy opti-
mization is generally done by starting with a single root node, corresponding to the
whole input space, and then growing the tree by adding nodes one at a time. At each
step there will be some number of candidate regions in input space that can be split,
corresponding to the addition of a pair of leaf nodes to the existing tree. For each
of these, there is a choice of which of the D input variables to split, as well as the
value of the threshold. The joint optimization of the choice of region to split, and the
choice of input variable and threshold, can be done efficiently by exhaustive search
noting that, for a given choice of split variable and threshold, the optimal choice of
predictive variable is given by the local average of the data, as noted earlier. This
is repeated for all possible choices of variable to be split, and the one that gives the
smallest residual sum-of-squares error is retained.
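The exhaustive search for a single split can be sketched as follows (illustrative Python, not from the text; best_split is an assumed name): for each input variable and each candidate threshold, the data reaching the node are divided in two, each side is predicted by the mean of its own targets, and the combination giving the smallest residual sum-of-squares is retained.

    import numpy as np

    def best_split(X, t):
        """Exhaustive search over input variables d and thresholds theta for the
        axis-aligned split of (X, t) that minimizes the residual sum-of-squares,
        with each side of the split predicted by the mean of its own targets.
        Returns (d, theta, error)."""
        X, t = np.asarray(X, dtype=float), np.asarray(t, dtype=float)
        best = (None, None, np.inf)
        for d in range(X.shape[1]):
            # Candidate thresholds: midpoints between consecutive distinct values.
            values = np.unique(X[:, d])
            for theta in (values[:-1] + values[1:]) / 2.0:
                left, right = t[X[:, d] <= theta], t[X[:, d] > theta]
                error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                if error < best[2]:
                    best = (d, theta, error)
        return best

Growing the tree then amounts to repeatedly applying this search to the candidate regions and splitting the one that gives the largest reduction in error.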
Given a greedy strategy for growing the tree, there remains the issue of when
to stop adding nodes. A simple approach would be to stop when the reduction in
residual error falls below some threshold. However, it is found empirically that often
none of the available splits produces a significant reduction in error, and yet after
several more splits a substantial error reduction is found. For this reason, it is com-
mon practice to grow a large tree, using a stopping criterion based on the number
of data points associated with the leaf nodes, and then prune back the resulting tree.
The pruning is based on a criterion that balances residual error against a measure of
model complexity. If we denote the starting tree for pruning by T0 , then we define
T ⊂ T0 to be a subtree of T0 if it can be obtained by pruning nodes from T0 (in
other words, by collapsing internal nodes by combining the corresponding regions).
Suppose the leaf nodes are indexed by τ = 1, . . . , |T|, with leaf node τ representing
a region Rτ of input space having Nτ data points, and |T| denoting the total number
of leaf nodes. The optimal prediction for region Rτ is then given by
$$y_\tau = \frac{1}{N_\tau} \sum_{\mathbf{x}_n \in \mathcal{R}_\tau} t_n \tag{14.29}$$
and the corresponding contribution to the residual sum-of-squares is then
$$Q_\tau(T) = \sum_{\mathbf{x}_n \in \mathcal{R}_\tau} \{t_n - y_\tau\}^2 . \tag{14.30}$$
The pruning criterion is then given by
$$C(T) = \sum_{\tau=1}^{|T|} Q_\tau(T) + \lambda|T| \tag{14.31}$$
The regularization parameter λ determines the trade-off between the overall residual
sum-of-squares error and the complexity of the model as measured by the number
|T| of leaf nodes, and its value is chosen by cross-validation.
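As an illustrative sketch of evaluating (14.31) for one candidate pruned tree (the name pruning_criterion is an assumption), given the target values grouped by the leaves of that tree:

    import numpy as np

    def pruning_criterion(targets_per_leaf, lam):
        """C(T) = sum_tau Q_tau(T) + lambda * |T|, following (14.29)-(14.31):
        each leaf tau predicts the mean of its targets, Q_tau is its residual
        sum-of-squares, and |T| is the number of leaves."""
        residual = sum(((np.asarray(t, dtype=float) - np.mean(t)) ** 2).sum()
                       for t in targets_per_leaf)
        return residual + lam * len(targets_per_leaf)

Among the candidate subtrees of T0, the one giving the smallest value of C(T) would then be retained.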
For classification problems, the process of growing and pruning the tree is sim-
ilar, except that the sum-of-squares error is replaced by a more appropriate measure
of performance. If we define pτk to be the proportion of data points in region Rτ
assigned to class k, where k = 1, . . . , K, then two commonly used choices are the
cross-entropy
$$Q_\tau(T) = -\sum_{k=1}^{K} p_{\tau k} \ln p_{\tau k} \tag{14.32}$$
and the Gini index
$$Q_\tau(T) = \sum_{k=1}^{K} p_{\tau k} \left(1 - p_{\tau k}\right) . \tag{14.33}$$
These both vanish for pτk = 0 and pτk = 1 and have a maximum at pτk = 0.5. They
encourage the formation of regions in which a high proportion of the data points are
assigned to one class. The cross-entropy and the Gini index are better measures than
the misclassification rate for growing the tree because they are more sensitive to the
node probabilities (Exercise 14.11). Also, unlike the misclassification rate, they are differentiable
and hence better suited to gradient-based optimization methods. For subsequent pruning
of the tree, the misclassification rate is generally used.
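Both measures are simple functions of the class proportions within a leaf. A minimal Python sketch (function names assumed, not from the text), using the convention that 0 ln 0 = 0 for the cross-entropy:

    import numpy as np

    def cross_entropy(p):
        """Q_tau(T) = -sum_k p_k ln p_k (14.32), computed as sum_k p_k ln(1/p_k)
        with the convention 0 ln 0 = 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return np.sum(p * np.log(1.0 / p))

    def gini(p):
        """Q_tau(T) = sum_k p_k (1 - p_k) (14.33)."""
        p = np.asarray(p, dtype=float)
        return np.sum(p * (1.0 - p))

    # Both vanish for a pure node and are largest when the classes are evenly mixed.
    print(cross_entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0 and 0.0
    print(cross_entropy([0.5, 0.5]), gini([0.5, 0.5]))   # about 0.693 and 0.5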
The human interpretability of a tree model such as CART is often seen as its
major strength. However, in practice it is found that the particular tree structure that
is learned is very sensitive to the details of the data set, so that a small change to the
training data can result in a very different set of splits (Hastie et al., 2001).
There are other problems with tree-based methods of the kind considered in
this section. One is that the splits are aligned with the axes of the feature space,
which may be very suboptimal. For instance, to separate two classes whose optimal
decision boundary runs at 45 degrees to the axes would need a large number of
axis-parallel splits of the input space as compared to a single non-axis-aligned split.
Furthermore, the splits in a decision tree are hard, so that each region of input space
is associated with one, and only one, leaf node model. The last issue is particularly
problematic in regression where we are typically aiming to model smooth functions,
and yet the tree model produces piecewise-constant predictions with discontinuities
at the split boundaries.
14.5. Conditional Mixture Models
We have seen that standard decision trees are restricted by hard, axis-aligned splits of
the input space. These constraints can be relaxed, at the expense of interpretability,
by allowing soft, probabilistic splits that can be functions of all of the input variables,
not just one of them at a time. If we also give the leaf models a probabilistic inter-
pretation, we arrive at a fully probabilistic tree-based model called the hierarchical
mixture of experts, which we consider in Section 14.5.3.
An alternative way to motivate the hierarchical mixture of experts model is to
start with a standard probabilistic mixture of unconditional density models, such as
Gaussians (Chapter 9), and replace the component densities with conditional distributions. Here
we consider mixtures of linear regression models (Section 14.5.1) and mixtures of