Insurance Analytics: Prof. Julien Trufin
Tree-based methods : Regression trees
Introduction
Tree-based models : one or more nested if-then statements for the features
that partition the data.
Example :
→ if Feature A (Gender) = "male" then
      if Feature B (Age) >= 35 then Outcome = 500
      else Outcome = 700
   else Outcome = 400
→ In the terminology of tree models : there are two splits of the dataset into
three terminal nodes or leaves of the tree.
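As an illustration, the toy rule above can be written as a small R function (a sketch ; the name predict_toy_tree is ours, purely for illustration) :
# The toy rule above, written as nested if-then statements (illustrative sketch)
predict_toy_tree <- function(gender, age) {
  if (gender == "male") {
    if (age >= 35) 500 else 700
  } else {
    400
  }
}
predict_toy_tree("male", 40)    # 500 (first leaf)
predict_toy_tree("male", 20)    # 700 (second leaf)
predict_toy_tree("female", 50)  # 400 (third leaf)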
Strengths :
→ Easy to interpret.
→ Do not require specifying the form of the regression function.
→ Handle missing data.
→ Implicitly conduct feature selection.
Weaknesses :
→ Model instability (i.e. slight changes in the data can drastically change the
structure of the tree).
→ Less-than-optimal predictive performance (because these models partition the
data into rectangular regions of the feature space).
→ Finite number of possible predicted outcomes, determined by the number of
terminal nodes (unlikely to capture all the nuances of the data).
Context
Regression trees : partition the feature space into subspaces that are more
homogeneous with respect to the response.
Regression trees determine
→ The feature to split on and the value of the split.
→ The depth or complexity of the tree.
→ The predictions in the terminal nodes.
Recursive partitioning
Algorithm :
→ Begin with $\mathcal{D} = \{(y_i, \boldsymbol{x}_i);\ i \in \mathcal{I}\}$.
→ Search every distinct value of every feature to find the feature and split value
that partitions the feature space $\chi$ into two subspaces $\chi_{t_1}$ and $\chi_{t_2}$ such that
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i \in \mathcal{I} : \boldsymbol{x}_i \in \chi_{t_1}} L(y_i, \widehat{c}_{t_1}) + \sum_{i \in \mathcal{I} : \boldsymbol{x}_i \in \chi_{t_2}} L(y_i, \widehat{c}_{t_2})$$
is minimized, where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_{t_1}) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_{t_2}).$$
We have
$$\widehat{\mu}(\boldsymbol{x}) = \widehat{c}_{t_1}\, I[\boldsymbol{x} \in \chi_{t_1}] + \widehat{c}_{t_2}\, I[\boldsymbol{x} \in \chi_{t_2}].$$
(A small R sketch of this split search is given right after the worked example below.)
→ Example with 2 features : Gender = {Female, Male} and AgePh = {18, 19, . . . , 85}.
Possibility 1 : Gender :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{Female}} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{Male}} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{Female}) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{Male}).$$
Possibility 2 : AgePh (split value = 18) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 18} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 18} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 18) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 18).$$
Possibility 3 : AgePh (split value = 19) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 19} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 19} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 19) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 19).$$
$$\vdots$$
Possibility 42 : AgePh (split value = 58) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 58} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 58} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 58) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 58).$$
$$\vdots$$
Possibility 68 : AgePh (split value = 84) :
$$\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i)) = \sum_{i : \text{AgePh} \leq 84} L(y_i, \widehat{c}_{t_1}) + \sum_{i : \text{AgePh} > 84} L(y_i, \widehat{c}_{t_2})$$
where
$$\widehat{c}_{t_1} = \mathrm{ave}(y_i \mid \text{AgePh} \leq 84) \quad \text{and} \quad \widehat{c}_{t_2} = \mathrm{ave}(y_i \mid \text{AgePh} > 84).$$
The best possibility = the one that minimizes $\sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i))$.
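→ A minimal R sketch of this exhaustive split search, here with squared-error loss (the helper best_split() is ours and, for simplicity, treats categorical features as ordered ; it is not the rpart implementation) :
# Exhaustive search of the best split of one node (illustrative sketch).
# Loss: squared error L(y, c) = (y - c)^2; c_hat = mean of y in each subspace.
best_split <- function(data, features, response) {
  y <- data[[response]]
  best <- list(loss = sum((y - mean(y))^2), feature = NULL, value = NULL)
  for (f in features) {
    x <- data[[f]]
    if (is.factor(x)) x <- as.integer(x)   # simplification: categories taken as ordered
    for (v in sort(unique(x))[-1]) {       # every distinct value is a candidate split
      left <- x < v
      loss <- sum((y[left]  - mean(y[left]))^2) +
              sum((y[!left] - mean(y[!left]))^2)
      if (loss < best$loss) best <- list(loss = loss, feature = f, value = v)
    }
  }
  best  # feature and split value minimizing the total loss
}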
→ Repeat the previous step within each of the subspaces $\chi_{t_1}$ and $\chi_{t_2}$, and so on
(hence the name recursive partitioning).
→ Stop when the number of observations in the split falls below some threshold
(other stopping criteria can be used).
→ Output :
$$\widehat{\mu}(\boldsymbol{x}) = \sum_{t \in T} \widehat{c}_t \, I[\boldsymbol{x} \in \chi_t]$$
where
T is a set of indexes labelling the terminal nodes of the tree ;
$\widehat{c}_t = \mathrm{ave}(y_i \mid \boldsymbol{x}_i \in \chi_t)$, $t \in T$.
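→ Continuing the sketch above, the recursive partitioning itself and the resulting predictor $\widehat{\mu}(\boldsymbol{x})$ can be sketched as follows (grow_tree() and predict_tree() are illustrative helpers built on best_split(), with a minimum node size as stopping rule) :
# Recursive partitioning with a minimum node size as stopping rule (sketch).
grow_tree <- function(data, features, response, min_obs = 50) {
  s <- list(feature = NULL)
  if (nrow(data) >= min_obs) s <- best_split(data, features, response)
  if (is.null(s$feature)) {                       # terminal node (leaf)
    return(list(leaf = TRUE, c_hat = mean(data[[response]])))
  }
  x <- data[[s$feature]]
  if (is.factor(x)) x <- as.integer(x)
  left <- x < s$value
  list(leaf = FALSE, feature = s$feature, value = s$value,
       left  = grow_tree(data[left,  , drop = FALSE], features, response, min_obs),
       right = grow_tree(data[!left, , drop = FALSE], features, response, min_obs))
}

# Prediction: mu_hat(x) = c_hat of the leaf whose subspace contains x.
predict_tree <- function(tree, newrow) {
  if (tree$leaf) return(tree$c_hat)
  x <- newrow[[tree$feature]]
  if (is.factor(x)) x <- as.integer(x)
  if (x < tree$value) predict_tree(tree$left, newrow) else predict_tree(tree$right, newrow)
}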
Remarks :
→ Finding the optimal split :
Is straightforward for continuous features since the data can be ordered in a
natural way.
Is easy for binary features since there is only one possible split point.
Is more complicated for features with more than two categories.
Indeed, for a feature having q unordered categories, there are $2^{q-1} - 1$ possible
partitions of the q values into two groups (see the example below).
⇒ For large q : the computation can be very time consuming.
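For instance, a feature with q = 4 unordered categories {A, B, C, D} already admits $2^{3} - 1 = 7$ two-group partitions : {A}|{B,C,D}, {B}|{A,C,D}, {C}|{A,B,D}, {D}|{A,B,C}, {A,B}|{C,D}, {A,C}|{B,D} and {A,D}|{B,C} ; with q = 20 categories there are already $2^{19} - 1 = 524\,287$ candidate partitions.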
→ Other stopping criteria can be used :
The interaction depth (ID) :
ID = 1 : single-split regression tree ;
ID = 2 : two-way interactions ;
...
Stop when the empirical improvement of the optimization criterion (here the
deviance reduction) is less than a given threshold.
→ This particular tree methodology can handle missing data. Instead of
imputing the data, we can
For categorical features : create a new category “missing” (observations with
missing values could behave differently than those with non-missing values).
Construct surrogate variables : find another split based on another variable by
looking at all the splits using all the other variables and searching for the one
yielding a division of the training set most similar to the optimal split.
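→ In rpart, surrogate splits are the default way of handling missing feature values ; a sketch of the relevant control parameters (the formula y ~ . and the data frame d below are placeholders) :
library(rpart)
# maxsurrogate: number of surrogate splits retained at each node;
# usesurrogate = 2: observations with all surrogates missing follow the majority direction.
fit <- rpart(y ~ ., data = d, method = "anova",
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(fit)  # reports the surrogate splits retained at each node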
Right-sized tree
We have seen several rules to declare a node t terminal.
→ These rules have in common that they stop the growth of the tree early.
Another way to find the right-sized tree consists in fully developing the tree
and then pruning it.
→ Henceforth, we also denote a regression tree by T .
→ A branch $T^{(t)}$ is the part of T that is composed of node t and all its
descendant nodes.
→ Pruning a branch $T^{(t)}$ of a tree T means deleting from T all descendant
nodes of t. The resulting tree is denoted $T - T^{(t)}$.
For instance, we represent below a tree T , its branch $T^{(t_2)}$ and the subtree
$T - T^{(t_2)}$ obtained from T by pruning the branch $T^{(t_2)}$.
[Figure : the tree T (nodes $t_0, t_1, \ldots, t_{10}$), its branch $T^{(t_2)}$ (nodes $t_2, t_5, t_6, t_7, t_8, t_9, t_{10}$) and the pruned subtree $T - T^{(t_2)}$ (nodes $t_0, t_1, t_2, t_3, t_4$).]
⇒ The cost-complexity measure $R_\alpha(T) = D\left((\widehat{c}_t)_{t \in T}\right) + \alpha|T|$ (deviance plus a
penalty α per terminal node) used for model selection may now lead to choosing the
simpler tree T over T′.
→ For every value of α ≥ 0, we look for the subtree T(α) of $T_{\mathrm{init}}$ satisfying
$$R_\alpha(T(\alpha)) = \min_{T \preceq T_{\mathrm{init}}} R_\alpha(T)$$
and
if $R_\alpha(T) = R_\alpha(T(\alpha))$, then $T(\alpha) \preceq T$
(where $T \preceq T'$ means that T is a subtree of T′ obtained by pruning).
→ It is obvious that for every value of α there is at least one subtree of $T_{\mathrm{init}}$ that
minimizes $R_\alpha(T)$.
→ It can be shown that for every value of α ≥ 0, there exists a unique subtree
T(α) of $T_{\mathrm{init}}$ that minimizes $R_\alpha(T)$ and which satisfies $T(\alpha) \preceq T$ for all
subtrees T minimizing $R_\alpha(T)$.
$T_{\mathrm{init}}$ has only a finite number of subtrees. Hence, $\{T(\alpha)\}_{\alpha \geq 0}$ only contains a
finite number of subtrees of $T_{\mathrm{init}}$.
→ Furthermore, we have
$$|T_{\alpha_0}| = |T_{\alpha_0} - T_{\alpha_0}^{(t)}| + |T_{\alpha_0}^{(t)}| - 1.$$
→ The cost-complexity measure $R_\alpha(T_{\alpha_0})$ can then be rewritten as
$$R_\alpha(T_{\alpha_0}) = R_\alpha\!\left(T_{\alpha_0} - T_{\alpha_0}^{(t)}\right) + D\!\left((\widehat{c}_s)_{s \in T_{\alpha_0}^{(t)}}\right) + \alpha\,|T_{\alpha_0}^{(t)}| - D\!\left((\widehat{c}_s)_{s \in \{t\}}\right) - \alpha.$$
or equivalently
$$\alpha \geq \frac{D\!\left((\widehat{c}_s)_{s \in \{t\}}\right) - D\!\left((\widehat{c}_s)_{s \in T_{\alpha_0}^{(t)}}\right)}{|T_{\alpha_0}^{(t)}| - 1} = \alpha_1^{(t)}.$$
Let
$$\alpha_1 = \min_t \alpha_1^{(t)},$$
where the minimum is taken over the non-terminal nodes t of $T_{\alpha_0}$.
The non-terminal nodes t of $T_{\alpha_0}$ for which $\alpha_1^{(t)} = \alpha_1$ are the weakest links of $T_{\alpha_0}$ :
cutting at these nodes once α reaches the value $\alpha_1$ produces $T_{\alpha_1}$. Repeating the
computation on $T_{\alpha_1}$ yields values $\alpha_2^{(t)}$ and $\alpha_2 = \min_t \alpha_2^{(t)}$.
The non-terminal nodes t of $T_{\alpha_1}$ for which $\alpha_2^{(t)} = \alpha_2$ are called the weakest
links of $T_{\alpha_1}$ and it becomes better to cut at these nodes once α reaches the
value $\alpha_2$ in order to produce $T_{\alpha_2}$.
We continue the same process for $T_{\alpha_2}$, and so on until we reach the root
node $\{t_0\}$.
→ We obtain an increasing sequence $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_\kappa$ and the corresponding
nested sequence of trees $T_{\alpha_0} \succeq T_{\alpha_1} \succeq \cdots \succeq T_{\alpha_\kappa} = \{t_0\}$. We set
$$\tilde{\alpha}_k = \sqrt{\alpha_k \alpha_{k+1}}, \qquad k = 0, 1, \ldots, \kappa$$
(with the convention $\alpha_{\kappa+1} = \infty$), which is considered as a typical value for $[\alpha_k, \alpha_{k+1})$
and hence as the value corresponding to $T_{\alpha_k}$.
→ Notice that $\tilde{\alpha}_0 = 0$ and $\tilde{\alpha}_\kappa = \infty$.
The right-sized tree $T_{\mathrm{prune}}$ is then selected as the tree $T_{\alpha_{k^*}}$ of the sequence
$T_{\alpha_0}, T_{\alpha_1}, T_{\alpha_2}, \ldots, T_{\alpha_\kappa}$ such that
$$\widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_{k^*}) = \min_{k \in \{0,1,\ldots,\kappa\}} \widehat{\mathrm{Err}}^{CV}(\tilde{\alpha}_k),$$
where $\widehat{\mathrm{Err}}^{CV}$ denotes the cross-validated estimate of the generalization error.
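→ In rpart, this pruning sequence and the corresponding cross-validated errors are stored in the cptable of the fitted tree, so the selection can be sketched as follows (here tree is assumed to be an rpart object grown with cp = 0 ; the CP column plays the role of the $\tilde{\alpha}_k$'s, up to rescaling by the root deviance) :
library(rpart)
# `tree` is assumed to be an rpart tree grown with cp = 0 (see the examples below)
printcp(tree)                      # columns CP, nsplit, rel error, xerror, xstd
cp.opt <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree, cp = cp.opt)   # subtree attaining the minimum CV error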
Example
Data set :
> set.seed(1)
>
> n <- 500000 # number of observations
>
> Gender <- factor(sample(c("m","f"),n,replace=TRUE))
> Age <- sample(c(18:65),n,replace=TRUE)
> Split <- factor(sample(c("yes","no"),n,replace=TRUE))
> Sport <- factor(sample(c("yes","no"),n,replace=TRUE))
>
> lambda <- 0.1*ifelse(Gender=="m",1.1,1)
> lambda <- lambda*ifelse(Age>=18 & Age<30,1.4,1)
> lambda <- lambda*ifelse(Age>=30 & Age<45,1.2,1)
> lambda <- lambda*ifelse(Sport=="yes",1.15,1)
> N <- rpois(n, lambda)
>
> data <- data.frame(N,Gender,Age,Sport,Split)
> head(data,10) # 10 first observations
N Gender Age Sport Split
1 0 m 46 yes yes
2 0 m 57 no no
3 0 f 34 yes no
4 0 f 27 no yes
5 0 m 42 yes yes
6 0 f 27 yes no
7 0 f 55 yes yes
8 0 f 23 yes no
9 0 f 33 no no
10 2 m 36 yes no
R package : rpart (for “recursive partitioning”).
rpart().
→ Description :
Fit a recursive partitioning model.
→ Usage :
rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...).
Fit rpart :
> # fit model
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 5))
> summary(tree)
Fit rpart (cp = 0) :
> summary(tree)
Variable importance
Age Sport Gender Split
73 18 9 1
> library(partykit)
> tree2 <- as.party(tree)   # conversion to a party object for plotting, as for tree.cp below
> plot(tree2)
⇒ cp = 0 : overfitting.
Fit rpart (cp = 0.00005) :
> summary(tree)
Variable importance
Age Sport Gender
73 18 9
> plot(tree2)
Fit rpart (cp = 0.00020) :
> summary(tree)
Variable importance
Age Sport
80 20
> plot(tree2)
Estimation of cp by cross-validation :
> tree <- rpart(N ~ Gender+Age+Split+Sport, data = data, method="poisson",
+ control = rpart.control(cp=0, maxdepth = 5))
> # cp = 0 : minimal value of cp when computing the cross-validated error
> printcp(tree) # xerror column = cross-validated error
> # idea: find cp that minimizes the xerror
> # rel error = 1 - R^2
⇒ Optimal cp = 3.2927e-05.
Optimal tree :
> tree.cp <- prune(tree,cp=3.2927e-05) # optimal tree
> print(tree.cp) # * means terminal node
n= 500000
> library(partykit)
> tree.cp2 <- as.party(tree.cp)
> plot(tree.cp2)
Predictions and risk classes :
> lambda <- predict(tree.cp, type = "vector")
> data$lambda <- lambda
>
> class <- partykit:::.list.rules.party(tree.cp2)
> data$class <- class[as.character(predict(tree.cp2, type = "node"))]
>
> head(data,10)
N Gender Age Sport Split lambda class
1 0 m 46 yes yes 0.1285280 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m")
2 0 m 57 no no 0.1084660 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m")
3 0 f 34 yes no 0.1365079 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f")
4 0 f 27 no yes 0.1422110 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f")
5 0 m 42 yes yes 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")
6 0 f 27 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
7 0 f 55 yes yes 0.1164381 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f")
8 0 f 23 yes no 0.1602706 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f")
9 0 f 33 no no 0.1206197 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f")
10 2 m 36 yes no 0.1519777 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m")
Summary of risk classes and predictions :
> data$class.i <- data$class==class[1]
> lambda.class <- subset(data, data$class.i==TRUE, select=c(class,lambda))[1,]
>
> c <- 2
> while(c <= length(class)) {
+ data$class.i <- data$class==class[c]
+ lambda.class[nrow(lambda.class) + 1,] = subset(data, data$class.i==TRUE, select=c(class,lambda))[1,];
+ c <- c+1;
+ } # lambda.class <- lambda.class[order(lambda.class$lambda),]
> rownames(lambda.class) <- NULL
>
> lambda.class
class lambda
1 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("f") 0.1004673
2 Age >= 44.5 & Sport %in% c("no") & Gender %in% c("m") 0.1084660
3 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1164381
4 Age >= 44.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1285280
5 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1206197
6 Age < 44.5 & Age >= 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1329900
7 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1365079
8 Age < 44.5 & Age >= 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1519777
9 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("f") 0.1422110
10 Age < 44.5 & Age < 29.5 & Sport %in% c("no") & Gender %in% c("m") 0.1565837
11 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("f") 0.1602706
12 Age < 44.5 & Age < 29.5 & Sport %in% c("yes") & Gender %in% c("m") 0.1772329
Example MTPL
Data set :
→ MTPL insurance portfolio of a Belgian insurance company observed during
one year.
→ Description of data set :
> str(data) # description of the dataset
’data.frame’: 160944 obs. of 10 variables:
$ AgePh : int 50 64 60 77 28 26 26 58 59 57 ...
$ AgeCar : int 12 3 10 15 7 12 8 14 6 10 ...
$ Fuel : Factor w/ 2 levels "Diesel","Gasoline": 2 2 1 2 2 2 2 2 2 2 ...
$ Split : Factor w/ 4 levels "Half-Yearly",..: 2 4 4 4 1 3 1 3 1 1 ...
$ Cover : Factor w/ 3 levels "Comprehensive",..: 3 2 3 3 3 3 1 3 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 1 2 2 2 ...
$ Use : Factor w/ 2 levels "Private","Professional": 1 1 1 1 1 1 1 1 1 1 ...
$ PowerCat: Factor w/ 5 levels "C1","C2","C3",..: 2 2 2 2 2 2 2 2 1 1 ...
$ ExpoR : num 1 1 1 1 0.0466 ...
$ Nclaim : int 1 0 0 0 1 0 1 0 0 0 ...
Data set :
→ The data set comprises 160 944 insurance policies.
→ For each policy, we have 8 features :
- AgePh : policyholder’s age ;
- AgeCar : age of the car ;
- Fuel : fuel of the car, with two categories (gas or diesel) ;
- Split : splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly) ;
- Cover : extent of the coverage, with three categories (from compulsory
third-party liability cover to comprehensive) ;
- Gender : policyholder’s gender, with two categories (female or male) ;
- Use : use of the car, with two categories (private or professional) ;
- PowerCat : the engine’s power, with five categories.
→ For each policy, we have the number of claims (Nclaim), which is the target variable,
and exposure information (the exposure-to-risk (ExpoR), expressed in years).
> head(data,10) # 10 first observations
AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR Nclaim
1 50 12 Gasoline Monthly TPL.Only Female Private C2 1.00000000 1
2 64 3 Gasoline Yearly Limited.MD Male Private C2 1.00000000 0
3 60 10 Diesel Yearly TPL.Only Female Private C2 1.00000000 0
4 77 15 Gasoline Yearly TPL.Only Female Private C2 1.00000000 0
5 28 7 Gasoline Half-Yearly TPL.Only Male Private C2 0.04657534 1
6 26 12 Gasoline Quarterly TPL.Only Female Private C2 1.00000000 0
7 26 8 Gasoline Half-Yearly Comprehensive Female Private C2 1.00000000 1
8 58 14 Gasoline Quarterly TPL.Only Male Private C2 0.40273973 0
9 59 6 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
10 57 10 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
Descriptive statistics of the data :
[Figures : number of policies, and total exposure and empirical claim frequency by feature (in particular PowerCat, AgeCar and AgePh).]
Regression trees (Poisson models)
> library(rpart) # package rpart (rpart for "recursive partitioning")
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 5000, xval = 1))
> # xval = 1: no cross validation => speed up the algorithm
> library(rpart.plot)
> prp(tree) #prp: plot rpart
[Figure : the fitted tree plotted with prp() ; root claim frequency 0.14, first split on AgePh >= 30, followed by splits on Split, AgePh, Fuel, Cover and AgeCar.]
Choice for minbucket :
→ Average claim frequency = 13.9%.
→ Average exposure per policy = 0.89.
→ Estimated confidence interval of ±2 standard deviations for a claim frequency
of 13.9% with minbucket=5000 :
$$\mathrm{CI} = \left[ 13.9\% - 2\sqrt{\frac{13.9\%}{0.89 \times 5000}},\ 13.9\% + 2\sqrt{\frac{13.9\%}{0.89 \times 5000}} \right] = [12.8\%, 15.0\%] = 13.9\% \times [92.0\%, 108.0\%].$$
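→ A quick numerical check of this interval in R (using the same approximation, variance of the empirical frequency ≈ frequency / exposure) :
freq <- 0.139; expo <- 0.89 * 5000            # claim frequency and exposure in a leaf
half <- 2 * sqrt(freq / expo)                 # two standard deviations
round(c(freq - half, freq + half), 3)         # 0.128 0.150
round(c(freq - half, freq + half) / freq, 2)  # 0.92 1.08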
Selection of cp :
→ The complexity parameter is set by 10-fold cross-validation :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 5000))
> # cp = 0 : minimal value of cp when computing the CV error
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))
[Figure : cross-validated relative error (xerror) as a function of cp, with the number of splits on the top axis (plotcp output).]
> printcp(tree)
Variables actually used in tree construction:
[1] AgeCar AgePh Cover Fuel Gender Split
n= 160944
Tree with minimum CV error :
> tree.minCV <- prune(tree,cp=8.6149e-05) # tree with min CV
> print(tree.minCV) # * means terminal node
n= 160944
> prp(tree.minCV)
[Figure : tree with the minimum CV error (15 leaves), plotted with prp() ; splits on Split, AgePh, Fuel and Cover.]
Observations :
→ The tree with the minimum CV error (15 leaves) is relatively large, close in size to the
biggest tree (cp = 0, 17 leaves).
Reason : the choice minbucket=5000 already limits the size of the biggest tree.
→ The tree with only 3 splits (i.e. 4 leaves) is within 1 standard deviation (SD) of
the tree with the minimum CV error.
[Figure : the tree with 3 splits (4 leaves), splitting on AgePh and Split.]
Comparing the tree within 1 SD (4 leaves) with the tree with the minimum
CV error (15 leaves) requires computing the generalization errors of these models
(⇒ a validation set is needed).
Validation set :
→ Training set : 80% of the data set.
→ Validation set : 20% of the data set.
> library(caret)
> inValidation = createDataPartition(data$Nclaim, p=0.2, list=FALSE)
> validation.set = data[inValidation,]
> training.set = data[-inValidation,]
Regression trees (Poisson models) :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, maxdepth = 10, minbucket = 4000, xval = 1))
> prp(tree)
[Figure : tree fitted on the training set, plotted with prp().]
Remarks :
→ minbucket = 80% × 5000 = 4000 (scaled with the size of the training set).
→ The tree obtained is almost identical to the one obtained on the whole data
set with cp=0, maxdepth=10 and minbucket=5000 (only one different split at
the bottom).
Selection of cp :
> tree <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = training.set, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 4000))
> plotcp(tree, minline = TRUE, upper = "split", ylim=c(0.975,1.01))
[Figure : cross-validated relative error as a function of cp on the training set (plotcp output).]
> printcp(tree)
n= 128755
Tree with minimum CV error :
→ cp = 0 ⇒ biggest tree (= initial tree).
⇒ minbucket=4000 prevents overfitting.
> tree.minCV <- prune(tree,cp=0) # tree with min CV
> prp(tree.minCV)
[Figure : tree with the minimum CV error on the training set, plotted with prp() ; it coincides with the biggest tree (cp = 0).]
Tree within 1 SD :
→ Tree within 1 SD has 4 leaves :
> tree.1SD <- prune(tree,cp=1.5067e-03) # tree within 1 SD
> prp(tree.1SD)
[Figure : the tree within 1 SD (4 leaves), plotted with prp().]
Comparison :
→ Generalization error :
$$\widehat{\mathrm{Err}}^{\mathrm{val}}(\widehat{\mu}) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} L(y_i, \widehat{\mu}(\boldsymbol{x}_i) d_i),$$
with
$$L(y_i, \widehat{\mu}(\boldsymbol{x}_i) d_i) = 2 \left[ y_i \ln \frac{y_i}{\widehat{\mu}(\boldsymbol{x}_i) d_i} + \widehat{\mu}(\boldsymbol{x}_i) d_i - y_i \right],$$
where $d_i$ denotes the exposure-to-risk of policy i.
Model               $\widehat{\mathrm{Err}}^{\mathrm{val}}(\widehat{\mu})$
Tree with min CV    0.5452772
Tree within 1 SD    0.5464333
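→ A sketch of how these validation errors can be obtained (the helper poisson.deviance() is ours ; as earlier in the example, predict() on a Poisson rpart tree is assumed to return the expected claim frequency, which is then multiplied by the exposure ExpoR) :
# Mean Poisson deviance on the validation set (convention 0 * log(0) = 0)
poisson.deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, 0, y * log(y / mu)) + mu - y)
}
mu.minCV <- predict(tree.minCV, newdata = validation.set) * validation.set$ExpoR
mu.1SD   <- predict(tree.1SD,   newdata = validation.set) * validation.set$ExpoR
poisson.deviance(validation.set$Nclaim, mu.minCV)  # tree with min CV error
poisson.deviance(validation.set$Nclaim, mu.1SD)    # tree within 1 SD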
Model instability :
> data1 = data[1:100000,]
> tree.data1 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data1, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data1)
> tree.data1.minCV <- prune(tree.data1,cp=1.1727e-04)
> prp(tree.data1.minCV)
[Figure : pruned tree (minimum CV error) fitted on the first 100 000 observations (data1), plotted with prp().]
Model instability :
> data2 = data[10000:110000,]
> tree.data2 <- rpart(Nclaim ~ AgePh+AgeCar+Fuel+Split+Cover+Gender+Use+PowerCat
+ +offset(log(ExpoR)), data = data2, method="poisson",
+ control = rpart.control(cp = 0, minbucket = 3000))
> printcp(tree.data2)
> tree.data2.minCV <- prune(tree.data2,cp=8.7672e-05)
> prp(tree.data2.minCV)
[Figure : pruned tree (minimum CV error) fitted on observations 10 000 to 110 000 (data2), plotted with prp().]
Model instability : Trees with minimum CV error :
[Figure : side-by-side comparison of the two pruned trees ; already the root splits differ (AgePh >= 32 versus AgePh >= 30), as do several deeper splits, illustrating the instability of regression trees.]