Supervised Classification & Trees


Supervised Learning: Classification

BIG DATA ANALYTICS

Introduction

Logistic Regressions

Decision trees

Random Forest

Extreme Gradient Boosting

K-nearest neighbors

Jeroen VK Rombouts

Decision trees

Decision or classification trees

• What: recursive partitioning of the feature space

• Graphically represented using trees; the terminal nodes correspond to the final partitions $R_1, R_2, \dots, R_S$

• Predictions for a new observation $x_0$: by simple majority vote over the training observations in the terminal region containing $x_0$,

  $\hat{y}_0 = \operatorname*{arg\,max}_{c \in \mathcal{C}} \sum_{x_i \in R(x_0)} I(y_i = c)$
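As an illustration, a minimal Python sketch of this majority vote (the helper name and the toy labels are invented for this example):

```python
from collections import Counter

def majority_vote(labels_in_region):
    """Majority vote over the training labels y_i that fall in the terminal
    region R(x0) containing the new observation x0."""
    # argmax over classes c of sum_i I(y_i = c): the most frequent class wins
    return Counter(labels_in_region).most_common(1)[0][0]

# Toy example: the region of x0 contains five training observations
print(majority_vote(["yes", "no", "yes", "yes", "no"]))   # -> "yes"
```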

The methodological questions

• What are the best variables to make those splits?
• How are those splits decided?
• When do we stop?
Classification Tree: in practice
[Figure: a toy training set plotted as Income versus Age, next to the classification tree learned from it. The root splits on Age at 50 (Age ≥ 50 → YES); the Age < 50 branch splits on Income at 1500 (Income ≥ 1500 → YES); the Income < 1500 branch splits on Age again at 20, giving terminal nodes labelled NO and YES. A new observation is predicted by dropping it down the tree from the root to a terminal node.]
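For illustration, a minimal scikit-learn sketch in the spirit of this example; the toy Age/Income data and labels are invented, so the fitted splits need not match the figure exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: features [age, income], target 1 = YES, 0 = NO (invented)
X = np.array([[55, 900], [62, 2000], [45, 1800], [30, 1200],
              [25, 800], [18, 700], [40, 1600], [22, 1400]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # text view of the splits

# Prediction for a new observation: age 35, income 1700
print(tree.predict([[35, 1700]]))
```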
Binary decision tree

• intermediate nodes: t1, t2, t3, t7;
• terminal nodes: t4, t5, t6, t8, t9;
• splits: s1, s2, s3, s4;
• labels: l1, l2, l3, l4, l5.

[Figure: the binary tree. The root t1 is split by s1 into t2 and t3; t2 is split by s2 into the terminal nodes t4 (label l1) and t5 (label l2); t3 is split by s3 into the terminal node t6 (label l3) and the intermediate node t7; t7 is split by s4 into the terminal nodes t8 (label l4) and t9 (label l5).]
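To make the notation concrete, one possible encoding of this particular tree as a plain Python dictionary (the representation is an illustration, not part of the slides):

```python
# Intermediate nodes carry a split and two children; terminal nodes carry a label.
binary_tree = {
    "t1": {"split": "s1", "children": ("t2", "t3")},   # root
    "t2": {"split": "s2", "children": ("t4", "t5")},
    "t3": {"split": "s3", "children": ("t6", "t7")},
    "t7": {"split": "s4", "children": ("t8", "t9")},
    "t4": {"label": "l1"},
    "t5": {"label": "l2"},
    "t6": {"label": "l3"},
    "t8": {"label": "l4"},
    "t9": {"label": "l5"},
}

# Terminal nodes are exactly those without children
print([t for t, node in binary_tree.items() if "children" not in node])
```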
Decision Trees Methods
Year | Method | Dependent variable (nature and number) | Splitting criterion
1959 | Belson | Dichotomous dependent variable | Maximum deviation between theoretical and observed counts
1963 | AID | Continuous dependent variable | Maximise "between groups" deviance, minimise "within groups" deviance
1970 | ELISEE | Nominal or ordinal dependent variable | Maximise distance between child nodes
1972 | THAID | Nominal dependent variable | Maximum explained variance & distance between centroids
1974 | MAID-M | A few continuous dependent variables | M2 & cRM
1980 | CHAID | Nominal dependent variable | Chi-square
1980 | DNP | Nominal dependent variable | Minimise the Bayes misclassification risk
1984 | CART | Dependent variable | Maximise the decrease of impurity (e.g. Gini coefficient)
1987 | RECPAM | 1 or more dependent variable(s) | Minimise the information loss
1991 | Two-stage splitting algorithm | Dependent variable | Maximise the prediction performance
2006 | Conditional Inference Trees | Dependent variable | Significant (p-value) split
Steps of a segmentation procedure
1. A set of binary questions: define, for each node, the set of admissible splits (we start with no nodes)

2. A splitting variable and criterion: define the splitting variable and a criterion to select the best split, based on minimizing an impurity function (a minimal split search is sketched after this list)

3. A stopping rule: define a rule to declare a node terminal or intermediate; often at least five observations are kept in a terminal node

4. An assignment rule for nodes: one of the J classes of the nominal response variable, or a value of the continuous response variable, has to be assigned to each terminal node (unanimously or by majority, with a rule to break ties)

5. Quality assessment of the decision rule: estimate the risk via the associated misclassification rate or prediction error
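A minimal sketch of steps 1 and 2 for a single feature, using the Gini impurity as impurity function (the helper names and the toy data are invented for illustration):

```python
import numpy as np

def gini(y):
    """Gini impurity of a node: 1 - sum_c p_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best binary question 'x <= t?' on one feature, by weighted impurity decrease."""
    parent, n = gini(y), len(y)
    best_t, best_gain = None, 0.0
    for t in np.unique(x)[:-1]:                 # admissible binary splits
        left, right = y[x <= t], y[x > t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

age = np.array([18, 22, 25, 30, 40, 45, 55, 62])
label = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(best_split(age, label))                   # threshold with the largest impurity decrease
```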
Remarks

1. Nonparametric procedure: no parameters to estimate

2. Difficult to say what the impact of a feature is on the response variable (we call this unsmooth)

3. Highly nonlinear, so difficult to interpret

4. Quality assessment of the decision rule is based only on prediction error

5. Easy to read for practitioners


Random forests


1. Many decision trees are applied to the same problem

2. Each tree is constructed using a bootstrap resampling technique and uses only a random subset of the features

3. For a classification problem, the final prediction is determined by the majority vote over all trees (see the sketch below)

4. The method was proposed relatively recently, by Breiman (2001)

5. Very popular prediction technique, often difficult to beat

6. Complex to interpret
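A minimal scikit-learn sketch on synthetic data (the dataset and the hyperparameter values are illustrative, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees, each grown on a bootstrap resample of the training data;
# every split considers only a random subset of the features (sqrt(p) here)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))   # accuracy of the majority-vote prediction
```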


Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost)

Gradient boosting, proposed by Friedman (2001), constructs the forecast by sequentially fitting small regression trees (weak learners) to the residuals of the ensemble of the previous trees.

This procedure results in one final model, constructed as a sum of trees, that is used for forecasting; in contrast, the random forest forecast is the average of many trees. The properties of gradient boosting have been well studied for iid data.

Extreme gradient boosting (XGBoost), introduced by Chen and Guestrin (2016), is an algorithm that optimizes the implementation of the gradient boosting framework in terms of speed and flexibility. Several tuning parameters can strongly impact the performance, and cross-validation techniques can be used to optimize them.
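A minimal sketch using the xgboost package, with cross-validation over a small, purely illustrative grid of tuning parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Tree depth and learning rate strongly affect performance,
# so they are tuned by 5-fold cross-validation
search = GridSearchCV(
    estimator=XGBClassifier(n_estimators=300, eval_metric="logloss"),
    param_grid={"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.3]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```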


Simple method: K-nearest neighbors

K-nearest neighbors

1. Given: a labeled dataset with target variable vector y and feature matrix X (containing p features)

2. For a new feature vector a, what is the predicted target value (0 or 1 for standard binary classification)?

3. Take the K nearest observations according to the (squared) Euclidean distance
   $d(x_i, a) = \sum_{j=1}^{p} (x_{ij} - a_j)^2$

4. The predicted target for a is the most common class among these K neighbours (see the sketch after this list)

5. K-NN predictions can be unstable with respect to K, and computationally expensive in big data settings
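A minimal NumPy sketch of steps 3 and 4 (the helper name and the synthetic data are invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, a, K=5):
    """Predict the class of a new feature vector a by majority vote
    among the K training observations closest to a."""
    d = np.sum((X - a) ** 2, axis=1)      # squared Euclidean distances d(x_i, a)
    nearest = np.argsort(d)[:K]           # indices of the K nearest observations
    return Counter(y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 observations, p = 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic 0-1 target
print(knn_predict(X, y, a=np.array([0.5, 0.5, 0.0]), K=5))
```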

