Sol Forests
Solution 1: Bagging
Let’s take a look at the average loss of individual base learner predictions. Since we are in the (theoretical) case of
an infinitely large ensemble, the average contains an infinite sum, which can equally be viewed as the expectation
over M:
\[
\begin{aligned}
\mathbb{E}_M\left[\left(y - b^{[m]}(x)\right)^2\right]
  &= \mathbb{E}_M\left[y^2 - 2y \, b^{[m]}(x) + b^{[m]}(x)^2\right] \\
  &= y^2 - 2y \, \mathbb{E}_M\left[b^{[m]}(x)\right] + \mathbb{E}_M\left[b^{[m]}(x)^2\right], \\
\mathbb{E}_M\left[\left(y - b^{[m]}(x)\right)^2\right]
  &\geq y^2 - 2y \, \mathbb{E}_M\left[b^{[m]}(x)\right] + \left(\mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2 \\
  &= \left(y - \mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2 \\
  &= \left(y - f^{[M]}(x)\right)^2.
\end{aligned}
\]
The ensemble loss is thus less than or equal to the average loss of individual base learners. How "sharp" this inequality is depends on how unequal both sides of
\[
\mathbb{E}_M\left[b^{[m]}(x)^2\right] \geq \left(\mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2
\]
are, which, by the definition of variance, is exactly the gap $\mathrm{Var}\left(b^{[m]}(x)\right)$. In other words: the more unstable the base learners (high variance), the more beneficial the ensembling procedure.
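The slack in this inequality can be made tangible with a small simulation. The numbers below (a single target value y and normally distributed base learner predictions) are purely illustrative and not part of the original solution:
# illustrative only: many noisy base learner predictions for one target value y
set.seed(1)
y <- 2
preds <- rnorm(1e5, mean = 1.8, sd = 0.5)
mean((y - preds)^2)  # average loss of individual base learners
(y - mean(preds))^2  # loss of the averaged (ensemble) prediction
var(preds)           # the gap between the two, up to sampling noise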
a) The spam data is a binary classification task where the aim is to classify an e-mail as spam or non-spam.
library(mlr3)
task_spam <- tsk("spam")
task_spam
b) library(rpart.plot)
## Loading required package: rpart
set.seed(123)
# assumed setup (not shown in the extract): fit a single classification tree on the spam task
learner <- lrn("classif.rpart")
learner$train(task_spam)
rpart.plot(learner$model, roundint = FALSE)
[rpart.plot output: tree trained on the full spam task, with root split charDollar >= 0.056 and further splits on hp, remove, charExclamation, capitalTotal and free]
set.seed(456)
subset_1 <- sample.int(task_spam$nrow, size = 0.6 * task_spam$nrow)
set.seed(789)
subset_2 <- sample.int(task_spam$nrow, size = 0.6 * task_spam$nrow)
# assumed follow-up (not shown in the extract): one tree per subset, plotted via rpart.plot()
tree_1 <- lrn("classif.rpart")$train(task_spam, row_ids = subset_1)
tree_2 <- lrn("classif.rpart")$train(task_spam, row_ids = subset_2)
lapply(list(tree_1, tree_2), function(l) rpart.plot(l$model, roundint = FALSE))
[rpart.plot output for the two trees trained on subset_1 and subset_2: one tree splits on hp, remove, edu, charExclamation, capitalTotal and num000; the other has charExclamation >= 0.079 at the root, followed by splits on capitalLong, remove, charDollar, hp and free]
Observation: trees trained on different samples differ considerably in their structure, both in the split variables and in the split thresholds (recall, though, that in a random forest the randomly drawn split candidates are a further source of randomness).
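To see that ensembling such unstable trees indeed pays off, one could, for instance, benchmark a single tree against a random forest on the spam task. The following sketch is not part of the original solution, and concrete results will vary with the seed and resampling:
library(mlr3learners)
set.seed(1)
design <- benchmark_grid(
  tasks = task_spam,
  learners = lrns(c("classif.rpart", "classif.ranger")),
  resamplings = rsmp("cv", folds = 3)
)
benchmark(design)$aggregate(msr("classif.ce"))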
c) i) This is actually quite easy when we recall that the exponential function at an arbitrary input $x$ can be characterized via
\[
e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n,
\]
which already resembles the limit expression we are looking for. Setting $x$ to $-1$ yields:
\[
\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} = \frac{1}{e}.
\]
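A quick numerical check (illustrative only, not part of the original solution) shows how fast the expression approaches 1/e ≈ 0.368:
n <- c(10, 100, 1000, 1e6)
cbind(n, prob_not_drawn = (1 - 1 / n)^n, limit = exp(-1))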
ii) library(mlr3learners)
# assumed reconstruction (not shown in the extract): OOB error of a ranger forest on spam
lrn("classif.ranger")$train(task_spam)$model$prediction.error
## [1] 0.04542491
d) Variable importance, in general, measures the contribution of features to a model. One way of computing the variable importance of the j-th variable is based on permuting it for the OOB observations and calculating the mean increase in OOB error this permutation entails.
In order to determine the variables with the biggest influence on prediction quality, we can choose the k variables with the highest importance score, e.g., for k = 5:
library(mlr3filters)
# assumed reconstruction (not shown in the extract): permutation importance via a ranger
# learner, wrapped in an mlr3filters importance filter
filter_imp <- flt("importance", learner = lrn("classif.ranger", importance = "permutation"))
filter_imp$calculate(task_spam)
head(as.data.table(filter_imp), 5L)
##            feature      score
## 1:     capitalLong 0.04523183
## 2:              hp 0.04099699
## 3: charExclamation 0.04018370
## 4:          remove 0.03975776
## 5:      capitalAve 0.03412908
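If we actually want to continue working with only these k features, the task can be reduced accordingly; this step is merely illustrative and not part of the original solution:
top_k <- head(as.data.table(filter_imp), 5L)$feature
task_top <- task_spam$clone()$select(top_k)
task_top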
Solution 3: Proximities
a) Using the treeInfo() output, we can follow the path of each sample through each tree.
The following table shows, for each observation (rows), the terminal node assigned by trees 1-3. For example, consider observation 1 in tree 1 (first cell): the observation has phenols > 1.94, putting it in node 2 (rightChild of node 0), and from there in node 6 (because it has alcohol > 13.04).
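One way to obtain such a terminal-node table from a fitted ranger model is its "terminalNodes" prediction type. In the sketch below, rf_wine and wine_data are placeholder names for the fitted forest and the underlying data; they are not taken from the original solution:
# terminal node per observation (rows) and tree (columns), restricted to trees 1-3
end_nodes <- predict(rf_wine, data = wine_data, type = "terminalNodes")$predictions[, 1:3]
end_nodes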
b) For the proximities, we consider each pair of observations and compute the relative frequency of trees assigning them to the same terminal node.
Observations 1 and 2: only tree 1 assigns them to the same node, so the proximity is 1/3.
Observations 1 and 3: all trees assign them to the same node, so the proximity is 1.
Observations 2 and 3: only tree 1 assigns them to the same node, so the proximity is 1/3.
c) We can put this information into a similarity matrix (as such matrices become large quite quickly for more data, it is common to store only the lower triangle; the rest is non-informative/redundant):
library(proxy)
# proximity of two observations: share of trees assigning both to the same terminal node
compute_prox <- function(i, j) sum(i == j) / length(i)
round(proxy::dist(end_nodes, method = compute_prox), 2L)
## 1 2
## 2 0.33
## 3 1.00 0.33