
Insurance Analytics

Prof. Julien Trufin

Academic year 2020-2021

1
Tree-based methods : Regression trees

Regression trees

2
Tree-based methods : Regression trees

Introduction
Tree-based models : one or more nested if-then statements for the features
that partition the data.
Example :
→ if Feature A (Gender) = "male" then
if Feature B (Age) >= 35 then Outcome = 500
else Outcome = 700
else Outcome = 400
→ In the terminology of tree models : there are two splits of the dataset into
three terminal nodes or leaves of the tree.
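As a minimal illustration (not from the slides), the example above can be written directly as an R function made of nested if-then statements :

# A minimal sketch of the example tree : two splits (Gender, then Age)
# producing three terminal nodes.
predict_outcome <- function(gender, age) {
  if (gender == "male") {
    if (age >= 35) 500 else 700
  } else {
    400
  }
}

predict_outcome("male", 40)    # 500
predict_outcome("male", 20)    # 700
predict_outcome("female", 40)  # 400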

3
Tree-based methods : Regression trees

Introduction
Strengths :
→ Easy to interpret.
→ Do not require specifying the form of the regression function.
→ Handle missing data.
→ Implicitly conduct feature selection.

Weaknesses :
→ Model instability (i.e. slight changes in the data can drastically change the
structure of the tree).
→ Less-than-optimal predictive performance (because these models partition the
data into rectangular regions of the feature space).
→ Finite number of possible predicted outcomes, determined by the number of
terminal nodes (unlikely to capture all the nuances of the data).

4
Tree-based methods : Regression trees

Context
Regression trees : partition the feature space into subspaces that are more
homogeneous with respect to the response.
Regression trees determine
→ The feature to split on and the value of the split.
→ The depth or complexity of the tree.
→ The predictions in the terminal nodes.

The classification and regression tree (CART) methodology, also called recursive partitioning :
→ One of the most popular techniques to construct regression trees.

5
Tree-based methods : Regression trees

Recursive partitioning
Algorithm :
→ Begin with D = {(yi , x i ); i ∈ I}.
→ Search every distinct value of every feature to find the feature and split value that partitions the feature space χ into two subspaces χt1 and χt2 such that
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i \in I : x_i \in \chi_{t_1}} L(y_i, \hat{c}_{t_1}) + \sum_{i \in I : x_i \in \chi_{t_2}} L(y_i, \hat{c}_{t_2})$$
is minimized, where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid x_i \in \chi_{t_1}) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid x_i \in \chi_{t_2}).$$
We have
$$\hat{\mu}(x) = \hat{c}_{t_1} \, \mathbb{I}[x \in \chi_{t_1}] + \hat{c}_{t_2} \, \mathbb{I}[x \in \chi_{t_2}].$$
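A minimal R sketch of this first splitting step (not the lecture's code ; squared-error loss is used as an illustrative choice of L, and only numeric features are considered) :

# Exhaustive search for the best first split : for every numeric feature and
# every distinct value, split the data into chi_t1 (<= split) and chi_t2
# (> split), predict by the within-subspace averages, and keep the split with
# the smallest total loss.
best_split <- function(y, X) {
  best <- list(loss = Inf)
  for (feature in names(X)) {
    for (s in sort(unique(X[[feature]]))) {
      left <- X[[feature]] <= s            # chi_t1
      if (!any(left) || all(left)) next    # both subspaces must be non-empty
      c1 <- mean(y[left]); c2 <- mean(y[!left])
      loss <- sum((y[left] - c1)^2) + sum((y[!left] - c2)^2)
      if (loss < best$loss)
        best <- list(feature = feature, split = s, c1 = c1, c2 = c2, loss = loss)
    }
  }
  best
}

# Small illustration on simulated data : the search should recover a split on Age near 30.
set.seed(1)
d <- data.frame(Age = sample(18:65, 200, replace = TRUE), Km = runif(200, 0, 50))
y <- ifelse(d$Age < 30, 700, 500) + rnorm(200, sd = 20)
best_split(y, d)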

6
Tree-based methods : Regression trees

Recursive partitioning
Algorithm :
→ Example with 2 features : Gender = {Female, Male} and AgePh = {18, 19, . . . , 85}.
Possibility 1 : Gender :
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i : \text{Female}} L(y_i, \hat{c}_{t_1}) + \sum_{i : \text{Male}} L(y_i, \hat{c}_{t_2})$$
where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{Female}) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{Male}).$$

Possibility 2 : AgePh (split value = 18) :
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i : \text{AgePh} \le 18} L(y_i, \hat{c}_{t_1}) + \sum_{i : \text{AgePh} > 18} L(y_i, \hat{c}_{t_2})$$
where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \le 18) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 18).$$

7
Tree-based methods : Regression trees

Recursive partitioning
Algorithm :
Possibility 3 : AgePh (split value = 19) :
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i : \text{AgePh} \le 19} L(y_i, \hat{c}_{t_1}) + \sum_{i : \text{AgePh} > 19} L(y_i, \hat{c}_{t_2})$$
where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \le 19) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 19).$$
...
Possibility 42 : AgePh (split value = 58) :
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i : \text{AgePh} \le 58} L(y_i, \hat{c}_{t_1}) + \sum_{i : \text{AgePh} > 58} L(y_i, \hat{c}_{t_2})$$
where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \le 58) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 58).$$
...

8
Tree-based methods : Regression trees

Recursive partitioning
Algorithm :
Possibility 68 : AgePh (split value = 84) :
$$\sum_{i \in I} L(y_i, \hat{\mu}(x_i)) = \sum_{i : \text{AgePh} \le 84} L(y_i, \hat{c}_{t_1}) + \sum_{i : \text{AgePh} > 84} L(y_i, \hat{c}_{t_2})$$
where
$$\hat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \le 84) \quad \text{and} \quad \hat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 84).$$

The best possibility = the one that minimizes $\sum_{i \in I} L(y_i, \hat{\mu}(x_i))$.

9
Tree-based methods : Regression trees

Recursive partitioning
Algorithm :
→ Repeat the previous step within each of the subspaces χt1 and χt2 , and so on (hence the name recursive partitioning).
→ Stop when the number of observations in the split falls below some threshold (other stopping criteria can be used).
→ Output :
$$\hat{\mu}(x) = \sum_{t \in T} \hat{c}_t \, \mathbb{I}[x \in \chi_t]$$
where
• T is a set of indexes labelling the terminal nodes of the tree ;
• $\hat{c}_t = \mathrm{ave}(y_i \mid x_i \in \chi_t)$, t ∈ T .

10
Tree-based methods : Regression trees

Recursive partitioning
Remarks :
→ Finding the optimal split :
• Is straightforward for continuous features since the data can be ordered in a natural way.
• Is easy for binary features since there is only one possible split point.
• Is more complicated for features with more than two categories.
Indeed, for a feature with q unordered categories, there are $2^{q-1} - 1$ possible partitions of the q values into two groups (a small enumeration sketch is given at the end of this slide).
⇒ For large q : the computation can be very time consuming.

→ Why binary splits ?
• In general, multiway splits fragment the data too quickly ⇒ leaving insufficient data at the next level down.
• Also, multiway splits can be achieved by a series of binary splits.
⇒ Binary splits are preferred.
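As announced above, a small sketch (not from the slides) that enumerates the $2^{q-1} - 1$ binary partitions of q unordered categories, which illustrates why the search becomes expensive for large q :

# Enumerate all partitions of q unordered categories into two non-empty groups.
# Fixing the first category in the first group avoids counting each partition
# twice, which leaves 2^(q-1) - 1 distinct partitions.
binary_partitions <- function(categories) {
  q <- length(categories)
  others <- categories[-1]
  lapply(0:(2^(q - 1) - 2), function(k) {
    in_first <- as.logical(bitwAnd(k, 2^(0:(q - 2))))
    list(group1 = c(categories[1], others[in_first]),
         group2 = others[!in_first])
  })
}

length(binary_partitions(c("A", "B", "C", "D")))  # 2^(4-1) - 1 = 7
length(binary_partitions(LETTERS[1:10]))          # 2^9 - 1 = 511 partitions already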

11
Tree-based methods : Regression trees

Recursive partitioning
Remarks :
→ Other stopping criteria can be used :
• The interaction depth (ID) :
ID = 1 : single-split regression tree ;
ID = 2 : two-way interactions ;
...
• Stop when the empirical improvement of the optimization criterion (here the deviance reduction) is less than a given threshold.
→ This particular tree methodology can handle missing data. Instead of imputing the data, we can
• For categorical features : create a new category “missing” (observations with missing values could behave differently than those with non-missing values).
• Construct surrogate variables : find another split based on another variable, by looking at all the splits using all the other variables and searching for the one yielding a division of the training set most similar to the optimal split (see the sketch below).
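A hedged illustration (synthetic data, not the lecture's code) of how this is done in the rpart package : with na.action = na.rpart, observations with missing feature values are kept, and the usesurrogate / maxsurrogate options of rpart.control govern the surrogate splits used to send them down the tree :

library(rpart)
set.seed(1)

# Synthetic data (illustrative assumption) with missing values in one feature.
df <- data.frame(x1 = runif(2000), x2 = runif(2000))
df$y <- rpois(2000, lambda = 0.1 * exp(df$x1))
df$x1[sample(2000, 200)] <- NA   # 10% missing values in x1

fit <- rpart(y ~ x1 + x2, data = df, method = "poisson",
             na.action = na.rpart,                     # keep rows with missing features
             control = rpart.control(cp = 0.001,
                                     maxsurrogate = 5, # surrogate splits searched per node
                                     usesurrogate = 2))
summary(fit)  # the per-node output lists the surrogate splits that were retained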

12
Tree-based methods : Regression trees

Right-sized tree
We have seen several rules to declare a node t terminal.
→ These rules have in common that they stop the growth of the tree early.

Another way to find the right-sized tree consists in fully developing the tree and then pruning it.
→ Henceforth, we also denote a regression tree by T .
→ A branch T^(t) is the part of T that is composed of node t and all its descendant nodes.
→ Pruning a branch T^(t) of a tree T means deleting from T all descendant nodes of t. The resulting tree is denoted T − T^(t).

A tree T′ that is obtained from T by successively pruning branches is called a pruned subtree of T , or simply a subtree of T , and is denoted T′ ⪯ T .

13
Tree-based methods : Regression trees

Right-sized tree
For instance, we represent below a tree T , its branch T^(t2) and the subtree T − T^(t2) obtained from T by pruning the branch T^(t2).

[Figure : the tree T (root t0, children t1 and t2, internal nodes t3, t4, t5, t6, leaves t7, t8, t9, t10) ; its branch T^(t2) (rooted at t2, containing t5, t6, t7, t8, t9, t10) ; and the pruned subtree T − T^(t2) (nodes t0, t1, t2, t3, t4, where t2 is now terminal).]

14
Tree-based methods : Regression trees

Right-sized tree : pruning process


The first step of the pruning process is to grow the largest possible tree Tmax.
→ Notice that the initial tree Tinit can be smaller than Tmax, so that we start the pruning process with a sufficiently large tree Tinit.

The idea of pruning Tinit consists in constructing a sequence of smaller and smaller trees
$$T_{init}, \; T_{|T_{init}|-1}, \; T_{|T_{init}|-2}, \; \ldots, \; T_1,$$
where Tk is a subtree of Tinit with k terminal nodes.

We define the cost-complexity measure of a tree T as
$$R_\alpha(T) = D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big) + \alpha |T|,$$
where $\mathcal{T}(T)$ denotes the set of terminal nodes of T .
→ The number of terminal nodes |T| is called the complexity of the tree T .
→ α is the increase in the penalty for having one more terminal node.
15
Tree-based methods : Regression trees

Right-sized tree : pruning process


Let T′ be the tree obtained by splitting the terminal node t of T into two children nodes tL and tR ⇒ T ⪯ T′.
→ We have
$$D\big((\hat{c}_t)_{t \in \mathcal{T}(T')}\big) \le D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big).$$
→ Now,
$$\begin{aligned}
R_\alpha(T') &= D\big((\hat{c}_t)_{t \in \mathcal{T}(T')}\big) + \alpha |T'| \\
 &= D\big((\hat{c}_t)_{t \in \mathcal{T}(T')}\big) + \alpha(|T| + 1) \\
 &= R_\alpha(T) + \Big[ D\big((\hat{c}_t)_{t \in \mathcal{T}(T')}\big) - D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big) \Big] + \alpha,
\end{aligned}$$
so that $R_\alpha(T') \ge R_\alpha(T)$ if and only if
$$\alpha \ge D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big) - D\big((\hat{c}_t)_{t \in \mathcal{T}(T')}\big).$$
⇒ The cost-complexity measure for model selection may now lead to choose the simplest tree T over T′.

16
Tree-based methods : Regression trees

Right-sized tree : pruning process


Let Tinit be the large tree that is to be pruned to the right-sized tree Tprune.

Let T(α) be the subtree of Tinit such that
$$R_\alpha(T(\alpha)) = \min_{T \preceq T_{init}} R_\alpha(T)$$
and
if $R_\alpha(T) = R_\alpha(T(\alpha))$, then $T(\alpha) \preceq T$.

→ It is obvious that for every value of α there is at least one subtree of Tinit that minimizes Rα(T).
→ It can be shown that for every value of α ≥ 0, there exists a unique subtree T(α) of Tinit that minimizes Rα(T) and which satisfies T(α) ⪯ T for all subtrees T minimizing Rα(T).

Tinit has only a finite number of subtrees. Hence, {T(α)}α≥0 only contains a finite number of subtrees of Tinit.

17
Tree-based methods : Regression trees

Right-sized tree : pruning process


Let α = 0. We define α0 = 0 and we start from Tinit = Tα0.

It becomes optimal to prune the branch Tα0^(t) for a certain node t of Tα0 when
$$R_\alpha(T_{\alpha_0}) \ge R_\alpha\big(T_{\alpha_0} - T_{\alpha_0}^{(t)}\big).$$
→ The deviance of Tα0 can be written as
$$\begin{aligned}
D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0})}\big) &= \sum_{s \in \mathcal{T}(T_{\alpha_0})} D_{\chi_s}(\hat{c}_s) \\
 &= \sum_{s \in \mathcal{T}(T_{\alpha_0} - T_{\alpha_0}^{(t)})} D_{\chi_s}(\hat{c}_s) + \sum_{s \in \mathcal{T}(T_{\alpha_0}^{(t)})} D_{\chi_s}(\hat{c}_s) - D\big((\hat{c}_s)_{s \in \{t\}}\big) \\
 &= D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0} - T_{\alpha_0}^{(t)})}\big) + D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0}^{(t)})}\big) - D\big((\hat{c}_s)_{s \in \{t\}}\big).
\end{aligned}$$
→ Furthermore, we have
$$|T_{\alpha_0}| = |T_{\alpha_0} - T_{\alpha_0}^{(t)}| + |T_{\alpha_0}^{(t)}| - 1.$$
→ The cost-complexity measure Rα(Tα0) can then be rewritten as
$$R_\alpha(T_{\alpha_0}) = R_\alpha\big(T_{\alpha_0} - T_{\alpha_0}^{(t)}\big) + D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0}^{(t)})}\big) + \alpha |T_{\alpha_0}^{(t)}| - D\big((\hat{c}_s)_{s \in \{t\}}\big) - \alpha.$$

18
Tree-based methods : Regression trees

Right-sized tree : pruning process


 
Thus, we have $R_\alpha(T_{\alpha_0}) \ge R_\alpha\big(T_{\alpha_0} - T_{\alpha_0}^{(t)}\big)$ if and only if
$$D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0}^{(t)})}\big) + \alpha |T_{\alpha_0}^{(t)}| \ge D\big((\hat{c}_s)_{s \in \{t\}}\big) + \alpha$$
or equivalently
$$\alpha \ge \frac{D\big((\hat{c}_s)_{s \in \{t\}}\big) - D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_0}^{(t)})}\big)}{|T_{\alpha_0}^{(t)}| - 1} = \alpha_1^{(t)}.$$

Let
$$\alpha_1 = \min_{t \in \widetilde{\mathcal{T}}(T_{\alpha_0})} \alpha_1^{(t)},$$
where $\widetilde{\mathcal{T}}(T)$ denotes the set of the non-terminal nodes of T .

Once α reaches α1, it becomes preferable to prune Tα0 at its weakest links. The resulting tree is denoted Tα1.
19
Tree-based methods : Regression trees

Right-sized tree : pruning process


We repeat the same process for Tα1. For a non-terminal node t of Tα1, it will be preferable to cut the branch Tα1^(t) when
$$\alpha \ge \frac{D\big((\hat{c}_s)_{s \in \{t\}}\big) - D\big((\hat{c}_s)_{s \in \mathcal{T}(T_{\alpha_1}^{(t)})}\big)}{|T_{\alpha_1}^{(t)}| - 1} = \alpha_2^{(t)}.$$
Let
$$\alpha_2 = \min_{t \in \widetilde{\mathcal{T}}(T_{\alpha_1})} \alpha_2^{(t)}.$$

The non-terminal nodes t of Tα1 for which $\alpha_2^{(t)} = \alpha_2$ are called the weakest links of Tα1, and it becomes better to cut at these nodes once α reaches the value α2, in order to produce Tα2.

We continue the same process for Tα2, and so on until we reach the root node {t0}.

Finally, we come up with the sequence of trees
$$T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa} = \{t_0\}.$$
20
Tree-based methods : Regression trees

Right-sized tree : pruning process


Proposition : We have $0 = \alpha_0 < \alpha_1 < \ldots < \alpha_\kappa < \alpha_{\kappa+1} = \infty$. Furthermore, for all k = 0, 1, . . . , κ, we have
$$T(\alpha) = T_{\alpha_k} \quad \text{for all } \alpha \in [\alpha_k, \alpha_{k+1}).$$

We define the cost-complexity parameter as
$$cp = \frac{\alpha}{D\big((\hat{c}_t)_{t \in \{t_0\}}\big)},$$
where $D\big((\hat{c}_t)_{t \in \{t_0\}}\big)$ is the deviance of the root tree.
⇒ The cost-complexity measure Rα(T) can then be rewritten as
$$R_\alpha(T) = D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big) + \alpha |T| = D\big((\hat{c}_t)_{t \in \mathcal{T}(T)}\big) + cp \, D\big((\hat{c}_t)_{t \in \{t_0\}}\big) \, |T|.$$
→ The sequence αk, k = 0, . . . , κ, defines the sequence cpk, k = 0, . . . , κ, with
$$cp_k = \frac{\alpha_k}{D\big((\hat{c}_t)_{t \in \{t_0\}}\big)}.$$
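This cp is the complexity parameter reported by the R package rpart used later in the slides. As a small sketch (assuming `tree` is a fitted rpart object, e.g. one of the trees of the examples below), the αk sequence can be recovered from the cp table by undoing the scaling :

# cp_k = alpha_k / D(root)  <=>  alpha_k = cp_k * D(root)
root_dev <- tree$frame$dev[1]                # deviance of the root node {t0}
alpha_k  <- tree$cptable[, "CP"] * root_dev  # penalties alpha_k of the pruning sequence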

21
Tree-based methods : Regression trees

Right-sized tree : choice of the best pruned tree


We rely on cross-validation.

We set
$$\tilde{\alpha}_k = \sqrt{\alpha_k \, \alpha_{k+1}}, \quad k = 0, 1, \ldots, \kappa,$$
which is considered as a typical value for $[\alpha_k, \alpha_{k+1})$ and hence as the value corresponding to Tαk.
→ Notice that α̃0 = 0 and α̃κ = ∞.

The K-fold cross-validation estimate of the generalization error for the regularization parameter α̃k is given by
$$\widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_k) = \frac{1}{|I|} \sum_{j=1}^{K} \sum_{i \in I_j} L\big(y_i, \hat{\mu}_{T^{(j)}(\tilde{\alpha}_k)}(x_i)\big).$$

The right-sized tree Tprune is then selected as the tree Tαk* of the sequence Tα0, Tα1, Tα2, . . . , Tακ such that
$$\widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_{k^*}) = \min_{k \in \{0, 1, \ldots, \kappa\}} \widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_k).$$
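In practice with rpart (a minimal sketch, not the lecture's code), the cross-validated errors are stored in the xerror column of the cp table, so the selection rule above amounts to :

# assuming `tree` is a large rpart tree grown with cp = 0 (cross-validation is
# run automatically, with xval = 10 folds by default)
cp_table <- tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]  # cp with minimum CV error
tree.prune <- prune(tree, cp = best_cp)                      # right-sized tree T_prune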

22
Tree-based methods : Regression trees

Relative importance of the features


To assess the relative importance of the features to the outcome, the overall reduction in the optimization criterion for each feature is computed. The relative importance of feature xj is the total reduction of deviance obtained by using this feature throughout the tree.
→ Denoting by $\widetilde{\mathcal{T}}(T)(x_j)$ the set of all non-terminal nodes of tree T for which xj was selected as the splitting feature, the relative importance of feature xj is the sum of the deviance reductions $\Delta D_{\chi_t}$ over the non-terminal nodes $t \in \widetilde{\mathcal{T}}(T)(x_j)$, that is,
$$I(x_j) = \sum_{t \in \widetilde{\mathcal{T}}(T)(x_j)} \Delta D_{\chi_t}.$$

The features can be ordered with respect to their relative importances.
It is customary to assign the largest importance a value of 100 and then scale the others accordingly.
Another way is to scale the relative importances so that their sum equals 100 ; any relative importance can then be interpreted as the percentage contribution to the overall model.
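With rpart, the relative importances are stored in the fitted object ; a short sketch (not from the slides) of the two scalings described above :

imp <- tree$variable.importance      # assuming `tree` is a fitted rpart object
round(100 * imp / max(imp))          # largest importance set to 100
round(100 * imp / sum(imp))          # importances summing to 100 (percentage contributions)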

23
Tree-based methods : Regression trees

Example
Data set :
> set.seed(1)
>
> n <- 500000 # number of observations
>
> Gender <- factor(sample(c("m","f"),n,replace=TRUE))
> Age <- sample(c(18:65),n,replace=TRUE)
> Split <- factor(sample(c("yes","no"),n,replace=TRUE))
> Sport <- factor(sample(c("yes","no"),n,replace=TRUE))
>
> lambda <- 0.1*ifelse(Gender=="m",1.1,1)
> lambda <- lambda*ifelse(Age>=18 & Age<30,1.4,1)
> lambda <- lambda*ifelse(Age>=30 & Age<45,1.2,1)
> lambda <- lambda*ifelse(Sport=="yes",1.15,1)
> N <- rpois(n, lambda)
>
> data <- data.frame(N,Gender,Age,Sport,Split)

24
Tree-based methods : Regression trees

Example
Data set :
> head(data,10) # 10 first observations
N Gender Age Sport Split
1 0 m 46 yes yes
2 0 m 57 no no
3 0 f 34 yes no
4 0 f 27 no yes
5 0 m 42 yes yes
6 0 f 27 yes no
7 0 f 55 yes yes
8 0 f 23 yes no
9 0 f 33 no no
10 2 m 36 yes no

25
Tree-based methods : Regression trees

Example
R package : rpart (for “recursive partitioning”).
rpart().
→ Description :
• Fit a recursive partitioning model.
→ Usage :
• rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...).

Fit rpart :
> # fit model
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 5))
> summary(tree)

> # plot tree


> library(partykit) # package to produce "beautiful" plots for rpart trees
> tree2 <- as.party(tree)
> plot(tree2)

26
Tree-based methods : Regression trees

Example
Fit rpart (cp = 0) :
> summary(tree)
Variable importance
Age Sport Gender Split
73 18 9 1
> plot(tree2)

⇒ cp = 0 : overfitting. 27
Tree-based methods : Regression trees

Example
Fit rpart (cp = 0.00005) :
> summary(tree)
Variable importance
Age Sport Gender
73 18 9
> plot(tree2)

⇒ cp = 0.00005 : all relevant features are identified. 28


Tree-based methods : Regression trees

Example
Fit rpart (cp = 0.00020) :
> summary(tree)
Variable importance
Age Sport
80 20
> plot(tree2)

⇒ cp = 0.00020 : feature “Gender” is missing. 29


Tree-based methods : Regression trees

Example
Estimation of cp by cross-validation :
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp=0, maxdepth = 5))
> # cp = 0 : minimal value of cp when computing the cross-validated error
> printcp(tree) # xerror column = cross-validated error
> # idea: find cp that minimizes the xerror
> # rel error = 1 - R^2

CP nsplit rel error xerror xstd


1 3.5920e-03 0 1.00000 1.00001 0.0023034
2 9.3746e-04 1 0.99641 0.99642 0.0022959
3 5.6213e-04 2 0.99547 0.99549 0.0022950
4 3.1509e-04 3 0.99491 0.99494 0.0022945
5 2.6391e-04 4 0.99459 0.99469 0.0022941
6 1.1682e-04 5 0.99433 0.99437 0.0022934
7 1.1658e-04 6 0.99421 0.99437 0.0022939
8 9.5494e-05 7 0.99410 0.99428 0.0022938
9 8.4238e-05 8 0.99400 0.99419 0.0022939
10 7.7462e-05 9 0.99392 0.99408 0.0022937
11 6.0203e-05 10 0.99384 0.99396 0.0022935
12 3.2927e-05 11 0.99378 0.99390 0.0022934
13 2.3035e-05 12 0.99375 0.99395 0.0022936
14 1.7216e-05 13 0.99372 0.99395 0.0022936
15 1.2893e-05 14 0.99371 0.99401 0.0022940
16 1.1135e-05 16 0.99368 0.99403 0.0022940
17 1.0995e-05 17 0.99367 0.99404 0.0022940
18 9.2293e-06 18 0.99366 0.99407 0.0022942
19 5.5970e-06 20 0.99364 0.99412 0.0022945
20 5.4761e-06 21 0.99363 0.99415 0.0022947
21 5.3590e-06 22 0.99363 0.99414 0.0022947
22 5.3108e-06 23 0.99362 0.99415 0.0022947
23 4.9740e-06 24 0.99362 0.99415 0.0022947
24 4.2883e-06 25 0.99361 0.99417 0.0022948
25 3.7381e-06 26 0.99361 0.99417 0.0022948
26 3.4052e-06 27 0.99360 0.99417 0.0022948
27 4.3202e-09 30 0.99359 0.99418 0.0022949
28 0.0000e+00 31 0.99359 0.99418 0.0022949

⇒ cp optimal =3.2927e-05. 30
Tree-based methods : Regression trees

Example
Optimal tree :
> tree.cp <- prune(tree,cp=3.2927e-05) # optimal tree
> print(tree.cp) # * means terminal node
n= 500000

node), split, n, deviance, yval


* denotes terminal node

1) root 500000 279043.30 0.1317560


2) Age>=44.5 218846 111779.70 0.1134548
4) Sport=no 109659 53386.74 0.1044621
8) Gender=f 54866 26084.89 0.1004673 *
9) Gender=m 54793 27285.05 0.1084660 *
5) Sport=yes 109187 58236.07 0.1224878
10) Gender=f 54545 28257.29 0.1164381 *
11) Gender=m 54642 29946.18 0.1285280 *
3) Age< 44.5 281154 166261.40 0.1460015
6) Age>=29.5 156219 88703.96 0.1355531
12) Sport=no 77851 42515.21 0.1267940
24) Gender=f 38991 20701.11 0.1206197 *
25) Gender=m 38860 21790.59 0.1329900 *
13) Sport=yes 78368 46100.83 0.1442541
26) Gender=f 39133 22350.27 0.1365079 *
27) Gender=m 39235 23718.03 0.1519777 *
7) Age< 29.5 124935 77295.79 0.1590651
14) Sport=no 62475 37323.97 0.1493856
28) Gender=f 31298 18211.52 0.1422110 *
29) Gender=m 31177 19090.83 0.1565837 *
15) Sport=yes 62460 39898.18 0.1687435
30) Gender=f 31277 19474.59 0.1602706 *
31) Gender=m 31183 20396.94 0.1772329 *

31
Tree-based methods : Regression trees

Example
Optimal tree :
> tree.cp2 <- as.party(tree.cp)
> plot(tree.cp2)

32
Tree-based methods : Regression trees

Example
Predictions and risk classes :
> lambda <- predict(tree.cp, type = "vector")
> data$lambda <- lambda
>
> class <- partykit:::.list.rules.party(tree.cp2)
> data$class <- class[as.character(predict(tree.cp2, type = "node"))]
>
> head(data,10)
N Gender Age Sport Split lambda class
1 0 m 46 yes yes 0.1285280 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m")
2 0 m 57 no no 0.1084660 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m")
3 0 f 34 yes no 0.1365079 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f")
4 0 f 27 no yes 0.1422110 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f")
5 0 m 42 yes yes 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")
6 0 f 27 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
7 0 f 55 yes yes 0.1164381 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f")
8 0 f 23 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
9 0 f 33 no no 0.1206197 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f")
10 2 m 36 yes no 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")

33
Tree-based methods : Regression trees

Example
Summary of risk classes and predictions :
> data$class.i <- data$class==class[1]
> lambda.class <- subset(data, data$class.i==TRUE, select=c(class,lambda))[1,]
>
> c <- 2
> while(c <= length(class)) {
+ data$class.i <- data$class==class[c]
+ lambda.class[nrow(lambda.class) + 1,] = subset(data, data$class.i==TRUE, select=c(class,lambda))[1,];
+ c <- c+1;
+ } # lambda.class <- lambda.class[order(lambda.class$lambda),]
> rownames(lambda.class) <- NULL
>
> lambda.class
class lambda
1 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("f") 0.1004673
2 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m") 0.1084660
3 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1164381
4 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1285280
5 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1206197
6 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1329900
7 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1365079
8 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1519777
9 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1422110
10 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1565837
11 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1602706
12 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1772329
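As a side note (a sketch, not from the slides), the same table of risk classes and predictions can be obtained more directly with unique() instead of the while loop :

lambda.class <- unique(data[, c("class", "lambda")])
lambda.class <- lambda.class[order(lambda.class$lambda), ]  # sort by predicted frequency
rownames(lambda.class) <- NULL
lambda.class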

34
Tree-based methods : Regression trees

Example MTPL
Data set :
→ MTPL insurance portfolio of a Belgian insurance company observed during
one year.
→ Description of data set :
> str(data) # description of the dataset
’data.frame’: 160944 obs. of 10 variables:
$ AgePh : int 50 64 60 77 28 26 26 58 59 57 ...
$ AgeCar : int 12 3 10 15 7 12 8 14 6 10 ...
$ Fuel : Factor w/ 2 levels "Diesel","Gasoline": 2 2 1 2 2 2 2 2 2 2 ...
$ Split : Factor w/ 4 levels "Half-Yearly",..: 2 4 4 4 1 3 1 3 1 1 ...
$ Cover : Factor w/ 3 levels "Comprehensive",..: 3 2 3 3 3 3 1 3 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 1 2 2 2 ...
$ Use : Factor w/ 2 levels "Private","Professional": 1 1 1 1 1 1 1 1 1 1 ...
$ PowerCat: Factor w/ 5 levels "C1","C2","C3",..: 2 2 2 2 2 2 2 2 1 1 ...
$ ExpoR : num 1 1 1 1 0.0466 ...
$ Nclaim : int 1 0 0 0 1 0 1 0 0 0 ...

35
Tree-based methods : Regression trees

Example MTPL
Data set :
→ The data set comprises 160 944 insurance policies.
→ For each policy, we have 8 features :
- AgePh : policyholder’s age ;
- AgeCar : age of the car ;
- Fuel : fuel of the car, with two categories (gas or diesel) ;
- Split : splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly) ;
- Cover : extent of the coverage, with three categories (from compulsory
third-party liability cover to comprehensive) ;
- Gender : policyholder’s gender, with two categories (female or male) ;
- Use : use of the car, with two categories (private or professional) ;
- PowerCat : the engine’s power, with five categories.
→ For each policy, we have the number of claims (Nclaim), which is the target variable, and exposure information (the exposure-to-risk (ExpoR), expressed in years).

36
Tree-based methods : Regression trees

Example MTPL
Data set :
> head(data,10) # 10 first observations
AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR Nclaim
1 50 12 Gasoline Monthly TPL.Only Female Private C2 1.00000000 1
2 64 3 Gasoline Yearly Limited.MD Male Private C2 1.00000000 0
3 60 10 Diesel Yearly TPL.Only Female Private C2 1.00000000 0
4 77 15 Gasoline Yearly TPL.Only Female Private C2 1.00000000 0
5 28 7 Gasoline Half-Yearly TPL.Only Male Private C2 0.04657534 1
6 26 12 Gasoline Quarterly TPL.Only Female Private C2 1.00000000 0
7 26 8 Gasoline Half-Yearly Comprehensive Female Private C2 1.00000000 1
8 58 14 Gasoline Quarterly TPL.Only Male Private C2 0.40273973 0
9 59 6 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
10 57 10 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0

37
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :

[Bar chart : number of policies by exposure (in months, 1 to 12).]

38
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Bar charts : total exposure (totalExposure) and claim frequency (ClaimFrequency) by Gender (Female, Male).]

39
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Bar charts : total exposure and claim frequency by Fuel (Diesel, Gasoline).]

40
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :

[Bar charts : total exposure and claim frequency by Use (Private, Professional).]

41
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Bar charts : total exposure and claim frequency by Cover (Comprehensive, Limited.MD, TPL.Only).]

42
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Bar charts : total exposure and claim frequency by Split (Half-Yearly, Monthly, Quarterly, Yearly).]

43
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Bar charts : total exposure and claim frequency by PowerCat (C1 to C5).]

44
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :

[Charts : total exposure and claim frequency by age of the car (AgeCar, 0 to 20).]

45
Tree-based methods : Regression trees

Example MTPL
Descriptive statistics of the data :
[Charts : total exposure and claim frequency by policyholder's age (AgePh).]

46
Tree-based methods : Regression trees

Example MTPL
Regression trees (Poisson models)
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 5000, xval = 1))
> # xval = 1: no cross validation => speed up the algorithm
> library(rpart.plot)
> prp(tree) #prp: plot rpart

[Tree plot (prp) : the fitted tree splits first on AgePh >= 30, then on Split, AgePh >= 58, Fuel, Cover, Gender and AgeCar ; terminal-node claim frequencies range from about 0.082 to 0.24.]

47
Tree-based methods : Regression trees

Example MTPL
Choice for minbucket :
→ Average claim frequency = 13.9%.
→ Average exposure per policy = 0.89.
→ Estimated confidence interval of 2 standard deviations for a claim frequency of 13.9% with minbucket=5000 :
$$CI = \left[ 13.9\% - 2\sqrt{\frac{13.9\%}{0.89 \times 5000}}, \; 13.9\% + 2\sqrt{\frac{13.9\%}{0.89 \times 5000}} \right] = [12.8\%, 15.0\%] = 13.9\% \times [92.0\%, 108.0\%].$$

⇒ Gives an idea of the precision that can be achieved.
→ The choice minbucket=5000 seems relevant.
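A quick numerical check of this interval (a sketch, not the lecture's code) : under a Poisson assumption, the standard error of the claim frequency in a leaf with minbucket policies and average exposure 0.89 is roughly sqrt(frequency / (0.89 × minbucket)) :

freq <- 0.139; expo <- 0.89; minbucket <- 5000
se <- sqrt(freq / (expo * minbucket))
round(c(freq - 2 * se, freq + 2 * se), 3)         # 0.128 0.150
round(c(freq - 2 * se, freq + 2 * se) / freq, 2)  # 0.92 1.08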

48
Tree-based methods : Regression trees

Example MTPL
Selection of cp :
→ The complexity parameter is set by 10-fold cross-validation :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 5000))
> # cp = 0 : minimal value of cp when computing the CV error
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))
[Plot (plotcp) : cross-validated relative error (X-val Relative Error, roughly 0.975 to 1.010) against cp, with the number of splits (0 to 16) on the top axis.]

49
Tree-based methods : Regression trees

Example MTPL
Selection of cp :
→ The complexity parameter is set by 10-fold cross-validation :
> printcp(tree)
Variables actually used in tree construction:
[1] AgeCar AgePh Cover Fuel Gender Split

Root node error: 88664/160944 = 0.5509

n= 160944

CP nsplit rel error xerror xstd


1 9.3945e-03 0 1.00000 1.00003 0.0048524
2 4.0574e-03 1 0.99061 0.99115 0.0047885
3 2.1243e-03 2 0.98655 0.98719 0.0047662
4 1.4023e-03 3 0.98442 0.98524 0.0047565
5 5.1564e-04 4 0.98302 0.98376 0.0047441
6 4.5896e-04 5 0.98251 0.98344 0.0047427
7 4.1895e-04 6 0.98205 0.98312 0.0047437
8 2.3144e-04 7 0.98163 0.98254 0.0047440
9 1.9344e-04 8 0.98140 0.98244 0.0047428
10 1.6076e-04 9 0.98120 0.98233 0.0047405
11 1.5651e-04 10 0.98104 0.98225 0.0047400
12 1.4674e-04 11 0.98089 0.98225 0.0047400
13 1.0746e-04 12 0.98074 0.98213 0.0047390
14 8.6425e-05 13 0.98063 0.98200 0.0047385
15 8.6149e-05 14 0.98055 0.98189 0.0047374 <= minimum CV error
16 5.3416e-05 15 0.98046 0.98197 0.0047391
17 0.0000e+00 16 0.98041 0.98204 0.0047396
50
Tree-based methods : Regression trees

Example MTPL
Tree with minimum CV error :
> tree.minCV <- prune(tree,cp=8.6149e-05) # tree with min CV
> print(tree.minCV) # * means terminal node
n= 160944

node), split, n, deviance, yval


* denotes terminal node

1) root 160944 88663.810 0.13928480


2) AgePh>=30.5 137228 71351.560 0.12694240
4) Split=Half-Yearly,Yearly 107753 54500.910 0.11724870
8) AgePh>=57.5 36391 16497.170 0.09694367
16) Fuel=Gasoline 29376 12818.560 0.09123836
32) Gender=Female 23902 10226.330 0.08852467
64) AgePh< 73.5 18447 7712.091 0.08550212 *
65) AgePh>=73.5 5455 2506.580 0.09879058 *
33) Gender=Male 5474 2582.697 0.10325860 *
17) Fuel=Diesel 7015 3632.896 0.12107130 *
9) AgePh< 57.5 71362 37815.390 0.12784130
18) Fuel=Gasoline 46294 23864.900 0.12123980
36) Split=Yearly 27901 13844.700 0.11501100
72) AgePh>=47.5 11009 5198.219 0.10568820 *
73) AgePh< 47.5 16892 8633.468 0.12120700 *
37) Split=Half-Yearly 18393 9999.683 0.13072370
74) Cover=Limited.MD 5987 3025.772 0.11591810 *
75) Cover=Comprehensive,TPL.Only 12406 6959.656 0.13803250 *
19) Fuel=Diesel 25068 13909.790 0.14005510
38) Cover=Comprehensive,Limited.MD 11270 5964.095 0.12887710 *
39) Cover=TPL.Only 13798 7928.547 0.14936040 *
5) Split=Monthly,Quarterly 29475 16490.910 0.16819730
10) AgePh>=57.5 5020 2546.435 0.13402550 *
11) AgePh< 57.5 24455 13907.330 0.17557220
22) Cover=Comprehensive,Limited.MD 12130 6745.530 0.16458110 *
23) Cover=TPL.Only 12325 7147.918 0.18709190 *
3) AgePh< 30.5 23716 16479.300 0.21357760
6) Split=Yearly 9659 6033.823 0.17216140 *
7) Split=Half-Yearly,Monthly,Quarterly 14057 10321.140 0.24429200 *

51
Tree-based methods : Regression trees

Example MTPL
Tree with minimum CV error :
> prp(tree.minCV)

[Tree plot (prp) of the pruned tree with minimum CV error : first split on AgePh >= 30, then splits on Split, AgePh >= 58, Fuel, Cover, Gender and AgePh ; 15 leaves with claim frequencies from about 0.086 to 0.24.]
52
Tree-based methods : Regression trees

Example MTPL
Observations :
→ The tree with the minimum CV error (15 leaves) is relatively big compared to
the biggest tree model (cp = 0, 17 leaves).
Reason : the choice minbucket=5000 limits the size of the biggest tree.
→ The tree with only 3 splits (i.e. 4 leaves) is within 1 standard deviation (SD) of
the tree with the minimum CV error.

[Tree plot : the 4-leaf tree splits on AgePh >= 30, then Split = Half-Yearly/Yearly, then AgePh >= 58 ; leaf claim frequencies of about 0.097, 0.13, 0.17 and 0.21.]

Comparing the tree within 1 SD (4 leaves) and the tree with the minimum CV error (15 leaves) requires calculating the generalization errors of these models (⇒ a validation set is needed).
53
Tree-based methods : Regression trees

Example MTPL
Validation set :
→ Training set D : 80% of the data set.
→ Validation set D : 20% of the data set.
> library(caret)
> inValidation = createDataPartition(data$Nclaim, p=0.2, list=FALSE)
> validation.set = data[inValidation,]
> training.set = data[-inValidation,]

54
Tree-based methods : Regression trees

Example MTPL
Regression trees (Poisson models) :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 4000, xval = 1))
> prp(tree)

[Tree plot (prp) : tree fitted on the training set ; first split on AgePh >= 30, leaf claim frequencies from about 0.079 to 0.24, with a structure very close to the tree fitted on the whole data set.]

55
Tree-based methods : Regression trees

Example MTPL
Remarks :
→ minbucket=80% × 5000 = 4000.
→ The tree obtained is almost identical to the one obtained on the whole data
set with cp=0, maxdepth=10 and minbucket=5000 (only one different split at
the bottom).

56
Tree-based methods : Regression trees

Example MTPL
Selection of cp :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 4000))
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))

[Plot (plotcp) : cross-validated relative error against cp for the training-set tree, with the number of splits (0 to 16) on the top axis.]

57
Tree-based methods : Regression trees

Example MTPL
Selection of cp :
> printcp(tree)

Variables actually used in tree construction:


[1] AgeCar AgePh Cover Fuel Gender Split

Root node error: 70767/128755 = 0.54963

n= 128755

CP nsplit rel error xerror xstd


1 9.1497e-03 0 1.00000 1.00001 0.0054547
2 3.7595e-03 1 0.99085 0.99160 0.0053964
3 2.2473e-03 2 0.98709 0.98853 0.0053797
4 1.5067e-03 3 0.98484 0.98607 0.0053688
5 6.0567e-04 4 0.98334 0.98417 0.0053489
6 4.6174e-04 5 0.98273 0.98362 0.0053447
7 4.5173e-04 6 0.98227 0.98337 0.0053502
8 2.1482e-04 7 0.98182 0.98301 0.0053520
9 2.0725e-04 9 0.98139 0.98304 0.0053529
10 1.6066e-04 10 0.98118 0.98297 0.0053534
11 1.5325e-04 11 0.98102 0.98293 0.0053548
12 1.1692e-04 12 0.98087 0.98294 0.0053558
13 9.0587e-05 13 0.98075 0.98302 0.0053580
14 8.4270e-05 15 0.98057 0.98292 0.0053562
15 0.0000e+00 16 0.98048 0.98285 0.0053555 <= minimum CV error

58
Tree-based methods : Regression trees

Example MTPL
Tree with minimum CV error :
→ cp=0 ⇒ Biggest tree (= initial tree).
⇒ minbucket=4000 prevents overfitting.
> tree.minCV <- prune(tree,cp=0) # tree with min CV
> prp(tree.minCV)

[Tree plot (prp) : the tree with minimum CV error coincides with the biggest tree grown on the training set ; first split on AgePh >= 30, leaf claim frequencies from about 0.079 to 0.24.]

59
Tree-based methods : Regression trees

Example MTPL
Tree within 1 SD :
→ Tree within 1 SD has 4 leaves :
> tree.1SD <- prune(tree,cp=1.5067e-03) # tree within 1 SD
> prp(tree.1SD)

[Tree plot (prp) : the tree within 1 SD has 4 leaves, obtained by splitting on AgePh >= 30, Split = Half-Yearly/Yearly and AgePh >= 58 ; leaf claim frequencies of about 0.096, 0.13, 0.17 and 0.21.]

60
Tree-based methods : Regression trees

Example MTPL
Comparison :
→ Generalization error :
$$\widehat{\mathrm{Err}}^{\mathrm{val}}(\hat{\mu}) = \frac{1}{|I|} \sum_{i \in I} L\big(y_i, \hat{\mu}(x_i) d_i\big),$$
with
$$L\big(y_i, \hat{\mu}(x_i) d_i\big) = 2 \left[ \ln\left( \left( \frac{y_i}{\hat{\mu}(x_i) d_i} \right)^{y_i} \right) + \hat{\mu}(x_i) d_i - y_i \right]$$
and di = exposure-to-risk for policy i.

> valid.pred.minCV <- validation.set$ExpoR*predict(tree.minCV, newdata=validation.set, type = "vector")


> gen.error.minCV <- 1/nrow(validation.set)*2*(sum(valid.pred.minCV)-sum(validation.set$Nclaim)
+ +sum(log((validation.set$Nclaim/valid.pred.minCV)^(validation.set$Nclaim))))
>
> valid.pred.1SD <- validation.set$ExpoR*predict(tree.1SD, newdata=validation.set, type = "vector")
> gen.error.1SD <- 1/nrow(validation.set)*2*(sum(valid.pred.1SD)-sum(validation.set$Nclaim)
+ +sum(log((validation.set$Nclaim/valid.pred.1SD)^(validation.set$Nclaim))))

Model               Err^val(µ̂)
Tree with min CV    0.5452772
Tree within 1 SD    0.5464333
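Equivalently (a compact sketch, not the lecture's code), the Poisson deviance above can be wrapped in a small function ; the y ln(y/µd) term is set to 0 when y = 0, which is exactly what the log((y/pred)^y) trick in the code achieves :

poisson_deviance <- function(y, mu) {
  # mean Poisson deviance : 2/n * sum( y*ln(y/mu) + mu - y ), with 0*ln(0) := 0
  term <- ifelse(y == 0, 0, y * log(y / mu))
  2 * mean(term + mu - y)
}

poisson_deviance(validation.set$Nclaim, valid.pred.minCV)  # reproduces gen.error.minCV
poisson_deviance(validation.set$Nclaim, valid.pred.1SD)    # reproduces gen.error.1SD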

61
Tree-based methods : Regression trees

Example MTPL
Model instability :
> data1 = data[1:100000,]
> tree.data1 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data1, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data1)
> tree.data1.minCV <- prune(tree.data1,cp=1.1727e-04)
> prp(tree.data1.minCV)

[Tree plot (prp) of the tree with minimum CV error fitted on data1 (first 100 000 policies) : first split on AgePh >= 32, then splits on Split, AgePh >= 58, AgePh >= 26, Cover and Fuel ; 11 leaves with claim frequencies from about 0.093 to 0.29.]

62
Tree-based methods : Regression trees

Example MTPL
Model instability :
> data2 = data[10000:110000,]
> tree.data2 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data2, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data2)
> tree.data2.minCV <- prune(tree.data2,cp=8.7672e-05)
> prp(tree.data2.minCV)

[Tree plot (prp) of the tree with minimum CV error fitted on data2 (policies 10 000 to 110 000) : first split on AgePh >= 30, then splits on Split, AgePh >= 58, Fuel, Gender, Cover, AgePh >= 50 and PowerCat ; 14 leaves with claim frequencies from about 0.087 to 0.23.]

63
Tree-based methods : Regression trees

Example MTPL
Model instability : Trees with minimum CV error :
[Side-by-side tree plots of the two trees with minimum CV error. Figure – Training set : data1 (first split on AgePh >= 32). Figure – Training set : data2 (first split on AgePh >= 30). The two trees differ in several of their splits, illustrating the instability of tree models.]

64
