
Introduction to Machine Learning
Exercise sheet 9: Random Forests
https://slds-lmu.github.io/i2ml/

Solution 1: Bagging

Let’s take a look at the average loss of individual base learner predictions. Since we are in the (theoretical) case of
an infinitely large ensemble, the average contains an infinite sum, which can equally be viewed as the expectation
over M:
\[
\mathbb{E}_M\left[\left(y - b^{[m]}(x)\right)^2\right]
= \mathbb{E}_M\left[y^2 - 2y\,b^{[m]}(x) + \left(b^{[m]}(x)\right)^2\right]
= y^2 - 2y\,\mathbb{E}_M\left[b^{[m]}(x)\right] + \mathbb{E}_M\left[\left(b^{[m]}(x)\right)^2\right],
\]

where we use the linearity of the expectation.


Note that the average base learner prediction is simply the prediction of the ensemble: $\mathbb{E}_M\left[b^{[m]}(x)\right] = f^{[M]}(x)$. Plugging this into the above equation and using the "Verschiebungssatz" for the variance of a random variable $Z$, i.e., $\mathrm{Var}(Z) = \mathbb{E}(Z^2) - (\mathbb{E}(Z))^2$ with $\mathrm{Var}(Z) \geq 0$, which tells us that $\mathbb{E}(Z^2) \geq (\mathbb{E}(Z))^2$, we obtain:

\[
\begin{aligned}
\mathbb{E}_M\left[\left(y - b^{[m]}(x)\right)^2\right]
&\geq y^2 - 2y\,\mathbb{E}_M\left[b^{[m]}(x)\right] + \left(\mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2 \\
&= \left(y - \mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2 \\
&= \left(y - f^{[M]}(x)\right)^2.
\end{aligned}
\]

The ensemble loss is thus less than or equal to the average loss of the individual base learners. How "sharp" this inequality is depends on how unequal both sides of
\[
\mathbb{E}_M\left[\left(b^{[m]}(x)\right)^2\right] \geq \left(\mathbb{E}_M\left[b^{[m]}(x)\right]\right)^2
\]
are; their difference is, by the definition of the variance, exactly $\mathrm{Var}\left(b^{[m]}(x)\right)$. In other words: the more unstable the base learners (high variance), the more beneficial the ensembling procedure.
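
For intuition, here is a quick numerical illustration (not part of the original solution; the distribution of the simulated base-learner predictions is purely made up): for a single observation, the average loss of the individual predictions exceeds the loss of the averaged prediction by exactly the variance of the predictions.

# Numerical sketch of the inequality above (illustrative values only).
set.seed(1)
y <- 1                                  # true target for a fixed x
b <- rnorm(1e5, mean = 0.8, sd = 0.5)   # simulated base-learner predictions b^[m](x)

mean((y - b)^2)    # average individual loss, approx. (y - E[b])^2 + Var(b) = 0.29
(y - mean(b))^2    # loss of the averaged (ensemble) prediction, approx. 0.04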

Solution 2: Classifying spam

a) The spam data is a binary classification task where the aim is to classify an e-mail as spam or non-spam.

library(mlr3)
tsk("spam")

## <TaskClassif:spam> (4601 x 58): HP Spam Detection


## * Target: type
## * Properties: twoclass
## * Features (57):
## - dbl (57): address, addresses, all, business, capitalAve,
## capitalLong, capitalTotal, charDollar, charExclamation, charHash,
## charRoundbracket, charSemicolon, charSquarebracket, conference,
## credit, cs, data, direct, edu, email, font, free, george, hp, hpl,
## internet, lab, labs, mail, make, meeting, money, num000, num1999,
## num3d, num415, num650, num85, num857, order, original, our, over,
## parts, people, pm, project, re, receive, remove, report, table,
## technology, telnet, will, you, your

b) library(rpart.plot)
## Loading required package: rpart

task_spam <- tsk("spam")

learner <- lrn("classif.rpart")


learner$train(task_spam)

set.seed(123)
rpart.plot(learner$model, roundint = FALSE)

[rpart.plot output: decision tree on the full spam task. Root split: charDollar >= 0.056; further splits on hp, remove, charExclamation, capitalTotal, and free; leaves labeled spam/nonspam with class probabilities and node sizes.]

set.seed(456)
subset_1 <- sample.int(task_spam$nrow, size = 0.6 * task_spam$nrow)
set.seed(789)
subset_2 <- sample.int(task_spam$nrow, size = 0.6 * task_spam$nrow)

# train and plot a tree on each of the two random 60% subsets of the rows
for (i in list(subset_1, subset_2)) {
  learner$train(task_spam, row_ids = i)
  rpart.plot(learner$model, roundint = FALSE)
}
[rpart.plot output: tree trained on the first subset. Root split: charDollar >= 0.046; further splits on hp, remove, edu, charExclamation, capitalTotal, and num000.]

[rpart.plot output: tree trained on the second subset. Root split: charExclamation >= 0.079; further splits on capitalLong, remove, hp, george, charDollar, and free.]

Observation: trees trained on different samples differ considerably in their structure, both in the split variables chosen and in the split thresholds (recall, though, that in a random forest the random sampling of split candidates adds a further source of randomness).
c) i) This is actually quite easy when we recall that the exponential function at an arbitrary input $x$ can be characterized via
\[
e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n,
\]
which already resembles the limit expression we are looking for. Setting $x = -1$ yields:
\[
\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} = \frac{1}{e}.
\]
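
As a quick empirical check (not part of the original solution), we can simulate how large the out-of-bag share of a bootstrap sample actually is; it is already close to 1/e ≈ 0.368 for moderate n.

# Empirical sanity check of the limit (illustrative only).
set.seed(1)
n <- 1e5
boot <- sample(n, replace = TRUE)   # one bootstrap sample of size n
mean(!seq_len(n) %in% boot)         # share of observations never drawn, approx. 0.368
(1 - 1 / n)^n                       # analytic value for finite n, approx. exp(-1)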

ii) library(mlr3learners)

learner <- lrn("classif.ranger", "oob.error" = TRUE)


learner$train(tsk("spam"))
learner$model$prediction.error

## [1] 0.04542491

d) Variable importance, in general, measures how much each feature contributes to a model's predictive performance. One way of computing the variable importance of the j-th variable is to permute it for the OOB observations and calculate the mean increase in OOB error this permutation entails (a manual sketch of this permutation idea follows the code below).
In order to determine the variables with the biggest influence on prediction quality, we can choose the k variables with the highest importance score, e.g., for k = 5:

library(mlr3filters)

learner <- lrn("classif.ranger", importance = "permutation", "oob.error" = TRUE)


filter <- flt("importance", learner = learner)
filter$calculate(tsk("spam"))
head(as.data.table(filter), 5)

## feature score
## 1: capitalLong 0.04523183
## 2: hp 0.04099699
## 3: charExclamation 0.04018370
## 4: remove 0.03975776
## 5: capitalAve 0.03412908
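
To make the permutation idea concrete, here is a manual sketch using ranger directly. This is an illustration under simplifying assumptions, not the original solution's code: for convenience, it permutes a single feature on a held-out test set instead of on the per-tree OOB samples.

# Manual sketch of permutation importance (simplified: held-out set, one feature).
library(ranger)
library(mlr3)   # only used to load the spam data conveniently

spam <- as.data.frame(tsk("spam")$data())

set.seed(123)
idx <- sample(nrow(spam), floor(0.7 * nrow(spam)))   # simple train/test split
fit <- ranger(type ~ ., data = spam[idx, ])

misclass <- function(df) mean(predict(fit, data = df)$predictions != df$type)

test <- spam[-idx, ]
baseline <- misclass(test)

permuted <- test
permuted$charExclamation <- sample(permuted$charExclamation)   # break feature-target link
misclass(permuted) - baseline   # increase in error due to permuting charExclamation

The permutation importance reported by ranger above averages this kind of error increase over the OOB samples of all trees.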

Solution 3: Proximities

a) Using the treeInfo() output, we can follow the path of each sample through each tree.
The following table shows, for each observation (rows), the terminal node assigned by trees 1-3. For example, consider observation 1 in tree 1 (first cell): the observation has phenols > 1.94, putting it in node 2 (rightChild of node 0), and from there in node 6 (because it has alcohol > 13.04).

# end node each observation is placed in across trees


end_nodes

## tree_1 tree_2 tree_3


## 1: 6 6 6
## 2: 6 5 5
## 3: 6 6 6
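
The construction of end_nodes itself is not shown in this excerpt. A possible way to obtain it, assuming a fitted ranger forest rf with three trees and a data set obs holding the three observations from the exercise (both names are assumptions, not objects defined above), is:

# Sketch only: `rf` (3-tree ranger forest) and `obs` (the three observations)
# are assumed to exist as in the exercise setup.
library(ranger)
library(data.table)

# terminal-node ID of each observation in every tree
nodes <- predict(rf, data = obs, type = "terminalNodes")$predictions
end_nodes <- as.data.table(nodes)
setnames(end_nodes, paste0("tree_", seq_len(ncol(end_nodes))))

treeInfo(rf, tree = 1)   # split structure referenced in the text (phenols, alcohol, ...)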

b) For the proximities, we consider each pair of observations and compute the relative frequency of trees assigning them to the same terminal node.
• Observations 1 and 2: only tree 1 assigns them to the same node, so the proximity is 1/3.
• Observations 1 and 3: all trees assign them to the same node, so the proximity is 1.
• Observations 2 and 3: only tree 1 assigns them to the same node, so the proximity is 1/3.

c) We can put this information into a similarity matrix (as such matrices become quite large for more data, it is common to store only the lower triangle, since the rest is redundant):

library(proxy)
# proximity = share of trees in which two observations share a terminal node
compute_prox <- function(i, j) sum(i == j) / length(i)
round(proxy::dist(end_nodes, method = compute_prox), 2L)

## 1 2
## 2 0.33
## 3 1.00 0.33
