Insurance Analytics: Prof. Julien Trufin
Tree-based methods : Regression trees
Introduction
Tree-based models : one or more nested if-then statements for the features
that partition the data.
Example :
→ if Feature A (Gender) = "male" then
      if Feature B (Age) >= 35 then Outcome = 500
      else Outcome = 700
   else Outcome = 400
→ In the terminology of tree models : there are two splits of the dataset into
three terminal nodes or leaves of the tree.
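As an illustration, the toy rule above can be written as a small R function (a sketch ; the name predict_toy_tree is ours, purely for illustration) :
# The toy rule above, written as nested if-then statements (illustrative sketch)
predict_toy_tree <- function(gender, age) {
  if (gender == "male") {
    if (age >= 35) 500 else 700
  } else {
    400
  }
}
predict_toy_tree("male", 40)    # 500 (first leaf)
predict_toy_tree("male", 20)    # 700 (second leaf)
predict_toy_tree("female", 50)  # 400 (third leaf)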
Strengths :
→ Easy to interpret.
→ Do not require specifying the form of the regression function.
→ Handle missing data.
→ Implicitly conduct feature selection.
Weaknesses :
→ Model instability (i.e. slight changes in the data can drastically change the
structure of the tree).
→ Less-than-optimal predictive performance (because these models partition the
data into rectangular regions of the feature space).
→ Finite number of possible predicted outcomes, determined by the number of
terminal nodes (unlikely to capture all the nuances of the data).
Context
Regression trees : partition the feature space into subspaces that are more
homogeneous with respect to the response.
Regression trees determine
→ The feature to split on and the value of the split.
→ The depth or complexity of the tree.
→ The predictions in the terminal nodes.
Recursive partitioning
Algorithm :
→ Begin with $\mathcal{D} = \{(y_i, \boldsymbol{x}_i);\ i \in \mathcal{I}\}$.
→ Search every distinct value of every feature to find the feature and split value
that partitions the feature space $\chi$ into two subspaces $\chi_{t_1}$ and $\chi_{t_2}$ such that
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i \in \mathcal{I} : \boldsymbol{x}_i \in \chi_{t_1}} L(y_i, \widehat{c}_{t_1}) + \sum_{i \in \mathcal{I} : \boldsymbol{x}_i \in \chi_{t_2}} L(y_i, \widehat{c}_{t_2})$$
is minimized, where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_{t_1}) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_{t_2}).$$
We have
$$\widehat{\mu}(\boldsymbol{x}) = \widehat{c}_{t_1}\, I[\boldsymbol{x} \in \chi_{t_1}] + \widehat{c}_{t_2}\, I[\boldsymbol{x} \in \chi_{t_2}].$$
(A small R sketch of this split search is given right after the worked example below.)
→ Example with 2 features : Gender = {Female, Male} and AgePh = {18, 19, . . . , 85}.
Possibility 1 : Gender :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{Female}} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{Male}} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{Female}) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{Male}).$$
Possibility 2 : AgePh (split value = 18) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 18} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 18} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 18) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 18).$$
Possibility 3 : AgePh (split value = 19) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 19} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 19} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 19) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 19).$$
$$\vdots$$
Possibility 42 : AgePh (split value = 58) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 58} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 58} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 58) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 58).$$
$$\vdots$$
Possibility 68 : AgePh (split value = 84) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 84} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 84} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 84) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 84).$$
The best possibility = the one that minimizes $\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i))$.
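→ A minimal R sketch of this exhaustive split search, here with squared-error loss (the helper best_split() is ours and, for simplicity, treats categorical features as ordered ; it is not the rpart implementation) :
# Exhaustive search of the best split of one node (illustrative sketch).
# Loss: squared error L(y, c) = (y - c)^2; c_hat = mean of y in each subspace.
best_split <- function(data, features, response) {
  y <- data[[response]]
  best <- list(loss = sum((y - mean(y))^2), feature = NULL, value = NULL)
  for (f in features) {
    x <- data[[f]]
    if (is.factor(x)) x <- as.integer(x)   # simplification: categories taken as ordered
    for (v in sort(unique(x))[-1]) {       # every distinct value is a candidate split
      left <- x < v
      loss <- sum((y[left]  - mean(y[left]))^2) +
              sum((y[!left] - mean(y[!left]))^2)
      if (loss < best$loss) best <- list(loss = loss, feature = f, value = v)
    }
  }
  best  # feature and split value minimizing the total loss
}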
→ Repeat the previous step within each of the subspaces $\chi_{t_1}$ and $\chi_{t_2}$, and so on
(hence the name recursive partitioning).
→ Stop when the number of observations in the split falls below some threshold
(other stopping criteria can be used).
→ Output :
$$\widehat{\mu}(\boldsymbol{x}) = \sum_{t \in T} \widehat{c}_t \, I[\boldsymbol{x} \in \chi_t]$$
where
T is a set of indexes labelling the terminal nodes of the tree ;
$\widehat{c}_t = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_t)$, $t \in T$.
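→ Continuing the sketch above, the recursive partitioning itself and the resulting predictor $\widehat{\mu}(\boldsymbol{x})$ can be sketched as follows (grow_tree() and predict_tree() are illustrative helpers built on best_split(), with a minimum node size as stopping rule) :
# Recursive partitioning with a minimum node size as stopping rule (sketch).
grow_tree <- function(data, features, response, min_obs = 50) {
  s <- list(feature = NULL)
  if (nrow(data) >= min_obs) s <- best_split(data, features, response)
  if (is.null(s$feature)) {                       # terminal node (leaf)
    return(list(leaf = TRUE, c_hat = mean(data[[response]])))
  }
  x <- data[[s$feature]]
  if (is.factor(x)) x <- as.integer(x)
  left <- x < s$value
  list(leaf = FALSE, feature = s$feature, value = s$value,
       left  = grow_tree(data[left,  , drop = FALSE], features, response, min_obs),
       right = grow_tree(data[!left, , drop = FALSE], features, response, min_obs))
}

# Prediction: mu_hat(x) = c_hat of the leaf whose subspace contains x.
predict_tree <- function(tree, newrow) {
  if (tree$leaf) return(tree$c_hat)
  x <- newrow[[tree$feature]]
  if (is.factor(x)) x <- as.integer(x)
  if (x < tree$value) predict_tree(tree$left, newrow) else predict_tree(tree$right, newrow)
}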
Remarks :
→ Finding the optimal split :
Is straightforward for continuous features since the data can be ordered in a
natural way.
Is easy for binary features since there is only one possible split point.
Is more complicated for features with more than two categories.
Indeed, for a feature having q unordered categories, there are $2^{q-1} - 1$ possible
partitions of the q values into two groups (see the example below).
⇒ For large q : the computation can be very time consuming.
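For instance, a feature with q = 4 unordered categories {A, B, C, D} already admits $2^{3} - 1 = 7$ two-group partitions : {A}|{B,C,D}, {B}|{A,C,D}, {C}|{A,B,D}, {D}|{A,B,C}, {A,B}|{C,D}, {A,C}|{B,D} and {A,D}|{B,C} ; with q = 20 categories there are already $2^{19} - 1 = 524\,287$ candidate partitions.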
→ Other stopping criteria can be used :
The interaction depth (ID) :
ID = 1 : single-split regression tree ;
ID = 2 : two-way interactions ;
...
Stop when the empirical improvement of the optimization criterion (here the
deviance reduction) is less than a given threshold.
→ This particular tree methodology can handle missing data. Instead of
imputing the data, we can
For categorical features : create a new category “missing” (observations with
missing values could behave differently than those with non-missing values).
Construct surrogate variables : find another split based on another variable by
looking at all the splits using all the other variables and searching for the one
yielding a division of the training set most similar to the optimal split.
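→ In rpart, surrogate splits are the default way of handling missing feature values ; a sketch of the relevant control parameters (the formula y ~ . and the data frame d below are placeholders) :
library(rpart)
# maxsurrogate: number of surrogate splits retained at each node;
# usesurrogate = 2: observations with all surrogates missing follow the majority direction.
fit <- rpart(y ~ ., data = d, method = "anova",
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(fit)  # reports the surrogate splits retained at each node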
Right-sized tree
We have seen several rules to declare a node t terminal.
→ These rules have in common that they stop the growth of the tree early.
Another way to find the right-sized tree consists in fully developing the tree
and then pruning it.
→ Henceforth, we also denote a regression tree by T .
→ A branch $T^{(t)}$ is the part of T that is composed of node t and all its
descendant nodes.
→ Pruning a branch $T^{(t)}$ of a tree T means deleting from T all descendant
nodes of t. The resulting tree is denoted $T - T^{(t)}$.
For instance, we represent below a tree T , its branch $T^{(t_2)}$ and the subtree
$T - T^{(t_2)}$ obtained from T by pruning the branch $T^{(t_2)}$.
[Figure : the tree T (nodes $t_0, t_1, \ldots, t_{10}$), its branch $T^{(t_2)}$ (nodes $t_2, t_5, t_6, t_7, t_8, t_9, t_{10}$) and the pruned subtree $T - T^{(t_2)}$ (nodes $t_0, t_1, t_2, t_3, t_4$).]
⇒ The cost-complexity measure $R_\alpha(T) = D\left((\widehat{c}_t)_{t \in T}\right) + \alpha|T|$ (deviance plus a
penalty α per terminal node) used for model selection may now lead to choosing the
simpler tree T over T′.
→ For every value of α ≥ 0, we look for the subtree T(α) of $T_{\mathrm{init}}$ satisfying
$$R_\alpha(T(\alpha)) = \min_{T \preceq T_{\mathrm{init}}} R_\alpha(T)$$
and
if $R_\alpha(T) = R_\alpha(T(\alpha))$, then $T(\alpha) \preceq T$
(where $T \preceq T'$ means that T is a subtree of T′ obtained by pruning).
→ It is obvious that for every value of α there is at least one subtree of $T_{\mathrm{init}}$ that
minimizes $R_\alpha(T)$.
→ It can be shown that for every value of α ≥ 0, there exists a unique subtree
T(α) of $T_{\mathrm{init}}$ that minimizes $R_\alpha(T)$ and which satisfies $T(\alpha) \preceq T$ for all
subtrees T minimizing $R_\alpha(T)$.
$T_{\mathrm{init}}$ has only a finite number of subtrees. Hence, $\{T(\alpha)\}_{\alpha \geq 0}$ only contains a
finite number of subtrees of $T_{\mathrm{init}}$.
→ Furthermore, we have
$$|T_{\alpha_0}| = |T_{\alpha_0} - T_{\alpha_0}^{(t)}| + |T_{\alpha_0}^{(t)}| - 1.$$
→ The cost-complexity measure $R_\alpha(T_{\alpha_0})$ can then be rewritten as
$$R_\alpha(T_{\alpha_0}) = R_\alpha\!\left(T_{\alpha_0} - T_{\alpha_0}^{(t)}\right) + D\!\left((\widehat{c}_s)_{s \in T_{\alpha_0}^{(t)}}\right) + \alpha\,|T_{\alpha_0}^{(t)}| - D\!\left((\widehat{c}_s)_{s \in \{t\}}\right) - \alpha.$$
or equivalently
$$\alpha \geq \frac{D\!\left((\widehat{c}_s)_{s \in \{t\}}\right) - D\!\left((\widehat{c}_s)_{s \in T_{\alpha_0}^{(t)}}\right)}{|T_{\alpha_0}^{(t)}| - 1} = \alpha_1^{(t)}.$$
Let
$$\alpha_1 = \min_t \alpha_1^{(t)},$$
where the minimum is taken over the non-terminal nodes t of $T_{\alpha_0}$.
The non-terminal nodes t of $T_{\alpha_0}$ for which $\alpha_1^{(t)} = \alpha_1$ are the weakest links of $T_{\alpha_0}$ :
cutting at these nodes once α reaches the value $\alpha_1$ produces $T_{\alpha_1}$. Repeating the
computation on $T_{\alpha_1}$ yields values $\alpha_2^{(t)}$ and $\alpha_2 = \min_t \alpha_2^{(t)}$.
The non-terminal nodes t of $T_{\alpha_1}$ for which $\alpha_2^{(t)} = \alpha_2$ are called the weakest
links of $T_{\alpha_1}$ and it becomes better to cut at these nodes once α reaches the
value $\alpha_2$ in order to produce $T_{\alpha_2}$.
We continue the same process for $T_{\alpha_2}$, and so on until we reach the root
node $\{t_0\}$.
→ We obtain an increasing sequence $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_\kappa$ and the corresponding
nested sequence of trees $T_{\alpha_0} \succeq T_{\alpha_1} \succeq \cdots \succeq T_{\alpha_\kappa} = \{t_0\}$. We set
$$\tilde{\alpha}_k = \sqrt{\alpha_k \alpha_{k+1}}, \qquad k = 0, 1, \ldots, \kappa$$
(with the convention $\alpha_{\kappa+1} = \infty$), which is considered as a typical value for $[\alpha_k, \alpha_{k+1})$
and hence as the value corresponding to $T_{\alpha_k}$.
→ Notice that $\tilde{\alpha}_0 = 0$ and $\tilde{\alpha}_\kappa = \infty$.
The right-sized tree $T_{\mathrm{prune}}$ is then selected as the tree $T_{\alpha_{k^*}}$ of the sequence
$T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa}$ such that
$$\widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_{k^*}) = \min_{k \in \{0,1,\ldots,\kappa\}} \widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_k),$$
where $\widehat{\mathrm{Err}}^{CV}$ denotes the cross-validated estimate of the generalization error.
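→ In rpart, this pruning sequence and the corresponding cross-validated errors are stored in the cptable of the fitted tree, so the selection can be sketched as follows (here tree is assumed to be an rpart object grown with cp = 0 ; the CP column plays the role of the $\tilde{\alpha}_k$'s, up to rescaling by the root deviance) :
library(rpart)
# `tree` is assumed to be an rpart tree grown with cp = 0 (see the examples below)
printcp(tree)                      # columns CP, nsplit, rel error, xerror, xstd
cp.opt <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree, cp = cp.opt)   # subtree attaining the minimum CV error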
Example
Data set :
> set.seed(1)
>
> n <- 500000 # number of observations
>
> Gender <- factor(sample(c("m","f"),n,replace=TRUE))
> Age <- sample(c(18:65),n,replace=TRUE)
> Split <- factor(sample(c("yes","no"),n,replace=TRUE))
> Sport <- factor(sample(c("yes","no"),n,replace=TRUE))
>
> lambda <- 0.1*ifelse(Gender=="m",1.1,1)
> lambda <- lambda*ifelse(Age>=18 & Age<30,1.4,1)
> lambda <- lambda*ifelse(Age>=30 & Age<45,1.2,1)
> lambda <- lambda*ifelse(Sport=="yes",1.15,1)
> N <- rpois(n, lambda)
>
> data <- data.frame(N,Gender,Age,Sport,Split)
> head(data,10) # 10 first observations
N Gender Age Sport Split
1 0 m 46 yes yes
2 0 m 57 no no
3 0 f 34 yes no
4 0 f 27 no yes
5 0 m 42 yes yes
6 0 f 27 yes no
7 0 f 55 yes yes
8 0 f 23 yes no
9 0 f 33 no no
10 2 m 36 yes no
R package : rpart (for “recursive partitioning”).
rpart().
→ Description :
Fit a recursive partitioning model.
→ Usage :
rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...).
Fit rpart :
> # fit model
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 5))
> summary(tree)
Fit rpart (cp = 0) :
> summary(tree)
Variable importance
Age Sport Gender Split
73 18 9 1
> library(partykit)
> tree2 <- as.party(tree)   # conversion to a party object for plotting, as for tree.cp below
> plot(tree2)
⇒ cp = 0 : overfitting.
Fit rpart (cp = 0.00005) :
> summary(tree)
Variable importance
Age Sport Gender
73 18 9
> plot(tree2)
Fit rpart (cp = 0.00020) :
> summary(tree)
Variable importance
Age Sport
80 20
> plot(tree2)
Estimation of cp by cross-validation :
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp=0, maxdepth = 5))
> # cp = 0 : minimal value of cp when computing the cross-validated error
> printcp(tree) # xerror column = cross-validated error
> # idea: find cp that minimizes the xerror
> # rel error = 1 - R^2
⇒ Optimal cp = 3.2927e-05.
Optimal tree :
> tree.cp <- prune(tree,cp=3.2927e-05) # optimal tree
> print(tree.cp) # * means terminal node
n= 500000
> library(partykit)
> tree.cp2 <- as.party(tree.cp)
> plot(tree.cp2)
Predictions and risk classes :
> lambda <- predict(tree.cp, type = "vector")
> data$lambda <- lambda
>
> class <- partykit:::.list.rules.party(tree.cp2)
> data$class <- class[as.character(predict(tree.cp2, type = "node"))]
>
> head(data,10)
N Gender Age Sport Split lambda class
1 0 m 46 yes yes 0.1285280 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m")
2 0 m 57 no no 0.1084660 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m")
3 0 f 34 yes no 0.1365079 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f")
4 0 f 27 no yes 0.1422110 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f")
5 0 m 42 yes yes 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")
6 0 f 27 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
7 0 f 55 yes yes 0.1164381 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f")
8 0 f 23 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
9 0 f 33 no no 0.1206197 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f")
10 2 m 36 yes no 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")
Summary of risk classes and predictions :
> data$class.i <- data$class==class[1]
> lambda.class <- subset(data, data$class.i==TRUE, select=c(class,lambda))[1,]
>
> c <- 2
> while(c <= length(class)) {
+ data$class.i <- data$class==class[c]
+ lambda.class[nrow(lambda.class) + 1,] = subset(data, data$class.i==TRUE, select=c(class,lambda))[1,];
+ c <- c+1;
+ } # lambda.class <- lambda.class[order(lambda.class$lambda),]
> rownames(lambda.class) <- NULL
>
> lambda.class
class lambda
1 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("f") 0.1004673
2 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m") 0.1084660
3 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1164381
4 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1285280
5 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1206197
6 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1329900
7 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1365079
8 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1519777
9 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1422110
10 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1565837
11 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1602706
12 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1772329
Example MTPL
Data set :
→ MTPL insurance portfolio of a Belgian insurance company observed during
one year.
→ Description of data set :
> str(data) # description of the dataset
’data.frame’: 160944 obs. of 10 variables:
$ AgePh : int 50 64 60 77 28 26 26 58 59 57 ...
$ AgeCar : int 12 3 10 15 7 12 8 14 6 10 ...
$ Fuel : Factor w/ 2 levels "Diesel","Gasoline": 2 2 1 2 2 2 2 2 2 2 ...
$ Split : Factor w/ 4 levels "Half-Yearly",..: 2 4 4 4 1 3 1 3 1 1 ...
$ Cover : Factor w/ 3 levels "Comprehensive",..: 3 2 3 3 3 3 1 3 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 1 2 2 2 ...
$ Use : Factor w/ 2 levels "Private","Professional": 1 1 1 1 1 1 1 1 1 1 ...
$ PowerCat: Factor w/ 5 levels "C1","C2","C3",..: 2 2 2 2 2 2 2 2 1 1 ...
$ ExpoR : num 1 1 1 1 0.0466 ...
$ Nclaim : int 1 0 0 0 1 0 1 0 0 0 ...
Data set :
→ The data set comprises 160 944 insurance policies.
→ For each policy, we have 8 features :
- AgePh : policyholder’s age ;
- AgeCar : age of the car ;
- Fuel : fuel of the car, with two categories (gas or diesel) ;
- Split : splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly) ;
- Cover : extent of the coverage, with three categories (from compulsory
third-party liability cover to comprehensive) ;
- Gender : policyholder’s gender, with two categories (female or male) ;
- Use : use of the car, with two categories (private or professional) ;
- PowerCat : the engine’s power, with five categories.
→ For each policy, we have the number of claims (Nclaim), which is the target variable,
and exposure information (the exposure-to-risk (ExpoR), expressed in years).
> head(data,10) # 10 first observations
AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR Nclaim
1 50 12 Gasoline Monthly TPL.Only Female Private C2 1.00000000 1
2 64 3 Gasoline Yearly Limited.MD Male Private C2 1.00000000 0
3 60 10 Diesel Yearly TPL.Only Female Private C2 1.00000000 0
4 77 15 Gasoline Yearly TPL.Only Female Private C2 1.00000000 0
5 28 7 Gasoline Half-Yearly TPL.Only Male Private C2 0.04657534 1
6 26 12 Gasoline Quarterly TPL.Only Female Private C2 1.00000000 0
7 26 8 Gasoline Half-Yearly Comprehensive Female Private C2 1.00000000 1
8 58 14 Gasoline Quarterly TPL.Only Male Private C2 0.40273973 0
9 59 6 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
10 57 10 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
Descriptive statistics of the data :
[Figures : number of policies, and total exposure and empirical claim frequency by feature (in particular PowerCat, AgeCar and AgePh).]
Regression trees (Poisson models)
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 5000, xval = 1))
> # xval = 1: no cross validation => speed up the algorithm
> library(rpart.plot)
> prp(tree) #prp: plot rpart
[Figure : the fitted tree plotted with prp() ; root claim frequency 0.14, first split on AgePh >= 30, followed by splits on Split, AgePh, Fuel, Cover and AgeCar.]
Choice for minbucket :
→ Average claim frequency = 13.9%.
→ Average exposure per policy = 0.89.
→ Estimated confidence interval of ±2 standard deviations for a claim frequency
of 13.9% with minbucket=5000 :
$$\mathrm{CI} = \left[ 13.9\% - 2\sqrt{\frac{13.9\%}{0.89 \times 5000}},\ 13.9\% + 2\sqrt{\frac{13.9\%}{0.89 \times 5000}} \right] = [12.8\%, 15.0\%] = 13.9\% \times [92.0\%, 108.0\%].$$
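→ A quick numerical check of this interval in R (using the same approximation, variance of the empirical frequency ≈ frequency / exposure) :
freq <- 0.139; expo <- 0.89 * 5000            # claim frequency and exposure in a leaf
half <- 2 * sqrt(freq / expo)                 # two standard deviations
round(c(freq - half, freq + half), 3)         # 0.128 0.150
round(c(freq - half, freq + half) / freq, 2)  # 0.92 1.08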
Selection of cp :
→ The complexity parameter is set by 10-fold cross-validation :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 5000))
> # cp = 0 : minimal value of cp when computing the CV error
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))
[Figure : cross-validated relative error (xerror) as a function of cp, with the number of splits on the top axis (plotcp output).]
> printcp(tree)
Variables actually used in tree construction:
[1] AgeCar AgePh Cover Fuel Gender Split
n= 160944
Tree with minimum CV error :
> tree.minCV <- prune(tree,cp=8.6149e-05) # tree with min CV
> print(tree.minCV) # * means terminal node
n= 160944
> prp(tree.minCV)
[Figure : tree with the minimum CV error (15 leaves), plotted with prp() ; splits on Split, AgePh, Fuel and Cover.]
Observations :
→ The tree with the minimum CV error (15 leaves) is relatively large, close in size to the
biggest tree (cp = 0, 17 leaves).
Reason : the choice minbucket=5000 already limits the size of the biggest tree.
→ The tree with only 3 splits (i.e. 4 leaves) is within 1 standard deviation (SD) of
the tree with the minimum CV error.
[Figure : the tree with 3 splits (4 leaves), splitting on AgePh and Split.]
Comparing the tree within 1 SD (4 leaves) with the tree with the minimum
CV error (15 leaves) requires computing the generalization errors of these models
(⇒ a validation set is needed).
Validation set :
→ Training set : 80% of the data set.
→ Validation set : 20% of the data set.
> library(caret)
> inValidation = createDataPartition(data$Nclaim, p=0.2, list=FALSE)
> validation.set = data[inValidation,]
> training.set = data[-inValidation,]
Regression trees (Poisson models) :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 4000, xval = 1))
> prp(tree)
[Figure : tree fitted on the training set, plotted with prp().]
Remarks :
→ minbucket = 80% × 5000 = 4000 (scaled with the size of the training set).
→ The tree obtained is almost identical to the one obtained on the whole data
set with cp=0, maxdepth=10 and minbucket=5000 (only one different split at
the bottom).
Selection of cp :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 4000))
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))
[Figure : cross-validated relative error as a function of cp on the training set (plotcp output).]
> printcp(tree)
n= 128755
Tree with minimum CV error :
→ cp = 0 ⇒ biggest tree (= initial tree).
⇒ minbucket=4000 prevents overfitting.
> tree.minCV <- prune(tree,cp=0) # tree with min CV
> prp(tree.minCV)
[Figure : tree with the minimum CV error on the training set, plotted with prp() ; it coincides with the biggest tree (cp = 0).]
Tree within 1 SD :
→ Tree within 1 SD has 4 leaves :
> tree.1SD <- prune(tree,cp=1.5067e-03) # tree within 1 SD
> prp(tree.1SD)
[Figure : the tree within 1 SD (4 leaves), plotted with prp().]
Comparison :
→ Generalization error :
$$\widehat{\mathrm{Err}}^{\mathrm{val}}(\widehat{\mu}) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i) d_i),$$
with
$$L(y_i, \widehat{\mu}(\boldsymbol{x}_i) d_i) = 2 \left[ y_i \ln \frac{y_i}{\widehat{\mu}(\boldsymbol{x}_i) d_i} + \widehat{\mu}(\boldsymbol{x}_i) d_i - y_i \right],$$
where $d_i$ denotes the exposure-to-risk of policy i.
Model               $\widehat{\mathrm{Err}}^{\mathrm{val}}(\widehat{\mu})$
Tree with min CV    0.5452772
Tree within 1 SD    0.5464333
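→ A sketch of how these validation errors can be obtained (the helper poisson.deviance() is ours ; as earlier in the example, predict() on a Poisson rpart tree is assumed to return the expected claim frequency, which is then multiplied by the exposure ExpoR) :
# Mean Poisson deviance on the validation set (convention 0 * log(0) = 0)
poisson.deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, 0, y * log(y / mu)) + mu - y)
}
mu.minCV <- predict(tree.minCV, newdata = validation.set) * validation.set$ExpoR
mu.1SD   <- predict(tree.1SD,   newdata = validation.set) * validation.set$ExpoR
poisson.deviance(validation.set$Nclaim, mu.minCV)  # tree with min CV error
poisson.deviance(validation.set$Nclaim, mu.1SD)    # tree within 1 SD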
Model instability :
> data1 = data[1:100000,]
> tree.data1 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data1, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data1)
> tree.data1.minCV <- prune(tree.data1,cp=1.1727e-04)
> prp(tree.data1.minCV)
[Figure : pruned tree (minimum CV error) fitted on the first 100 000 observations (data1), plotted with prp().]
Model instability :
> data2 = data[10000:110000,]
> tree.data2 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data2, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data2)
> tree.data2.minCV <- prune(tree.data2,cp=8.7672e-05)
> prp(tree.data2.minCV)
[Figure : pruned tree (minimum CV error) fitted on observations 10 000 to 110 000 (data2), plotted with prp().]
Model instability : Trees with minimum CV error :
[Figure : side-by-side comparison of the two pruned trees ; already the root splits differ (AgePh >= 32 versus AgePh >= 30), as do several deeper splits, illustrating the instability of regression trees.]