Supervised Classification & Trees


Supervised Learning: Classification

BIG DATA ANALYTICS

Introduction

Logistic Regressions

Decision trees

Random Forest

Extreme Gradient Boosting

K-nearest neighbors

Jeroen VK Rombouts

Decision trees

Decision or classification trees

• What: recursive partitioning of the feature space

• Graphically represented using trees; the terminal nodes correspond to the final partitions $R_1, R_2, \dots, R_S$

• Predictions for a new observation $x_0$: by simple majority vote over the training observations in the terminal region containing $x_0$,

  $\hat{y}_0 = \operatorname*{arg\,max}_{c \in \mathcal{C}} \sum_{x_i \in R(x_0)} I(y_i = c)$
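As an illustration, a minimal Python sketch of this majority vote (the helper name and the toy labels are invented for this example):

```python
from collections import Counter

def majority_vote(labels_in_region):
    """Majority vote over the training labels y_i that fall in the terminal
    region R(x0) containing the new observation x0."""
    # argmax over classes c of sum_i I(y_i = c): the most frequent class wins
    return Counter(labels_in_region).most_common(1)[0][0]

# Toy example: the region of x0 contains five training observations
print(majority_vote(["yes", "no", "yes", "yes", "no"]))   # -> "yes"
```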

The methodological questions

• What are the best variables to make those splits?
• How are those splits decided?
• When do we stop?
Classification Tree: in practice
[Figure: a toy training set plotted as Income versus Age, next to the classification tree learned from it. The root splits on Age at 50 (Age ≥ 50 → YES); the Age < 50 branch splits on Income at 1500 (Income ≥ 1500 → YES); the Income < 1500 branch splits on Age again at 20, giving terminal nodes labelled NO and YES. A new observation is predicted by dropping it down the tree from the root to a terminal node.]
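For illustration, a minimal scikit-learn sketch in the spirit of this example; the toy Age/Income data and labels are invented, so the fitted splits need not match the figure exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: features [age, income], target 1 = YES, 0 = NO (invented)
X = np.array([[55, 900], [62, 2000], [45, 1800], [30, 1200],
              [25, 800], [18, 700], [40, 1600], [22, 1400]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # text view of the splits

# Prediction for a new observation: age 35, income 1700
print(tree.predict([[35, 1700]]))
```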
Binary decision tree

• intermediate nodes: t1, t2, t3, t7;
• terminal nodes: t4, t5, t6, t8, t9;
• splits: s1, s2, s3, s4;
• labels: l1, l2, l3, l4, l5.

[Figure: the binary tree. The root t1 is split by s1 into t2 and t3; t2 is split by s2 into the terminal nodes t4 (label l1) and t5 (label l2); t3 is split by s3 into the terminal node t6 (label l3) and the intermediate node t7; t7 is split by s4 into the terminal nodes t8 (label l4) and t9 (label l5).]
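To make the notation concrete, one possible encoding of this particular tree as a plain Python dictionary (the representation is an illustration, not part of the slides):

```python
# Intermediate nodes carry a split and two children; terminal nodes carry a label.
binary_tree = {
    "t1": {"split": "s1", "children": ("t2", "t3")},   # root
    "t2": {"split": "s2", "children": ("t4", "t5")},
    "t3": {"split": "s3", "children": ("t6", "t7")},
    "t7": {"split": "s4", "children": ("t8", "t9")},
    "t4": {"label": "l1"},
    "t5": {"label": "l2"},
    "t6": {"label": "l3"},
    "t8": {"label": "l4"},
    "t9": {"label": "l5"},
}

# Terminal nodes are exactly those without children
print([t for t, node in binary_tree.items() if "children" not in node])
```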
Decision Trees Methods
Year | Method | Dependent variable (nature and number) | Splitting criterion
1959 | Belson | Dichotomous dependent variable | Maximum deviation between theoretical and observed counts
1963 | AID | Continuous dependent variable | Maximise "between groups" deviance, minimise "within groups" deviance
1970 | ELISEE | Nominal or ordinal dependent variable | Maximise distance between child nodes
1972 | THAID | Nominal dependent variable | Maximum explained variance & distance between centroids
1974 | MAID-M | A few continuous dependent variables | M2 & cRM
1980 | CHAID | Nominal dependent variable | Chi-square
1980 | DNP | Nominal dependent variable | Minimise the Bayes misclassification risk
1984 | CART | Dependent variable | Maximise the decrease of impurity (e.g. Gini coefficient)
1987 | RECPAM | 1 or more dependent variable(s) | Minimise the information loss
1991 | Two-stage splitting algorithm | Dependent variable | Maximise the prediction performance
2006 | Conditional Inference Trees | Dependent variable | Significant (p-value) split
Steps of a segmentation procedure
1. A set of binary questions: define, for each node, the set of admissible splits (we start with no nodes)

2. A splitting variable and criterion: define the splitting variable and a criterion to select the best split, based on minimizing an impurity function (a minimal split search is sketched after this list)

3. A stopping rule: define a rule to declare a node terminal or intermediate; often at least five observations are kept in a terminal node

4. An assignment rule for nodes: one of the J classes of the nominal response variable, or a value of the continuous response variable, has to be assigned to each terminal node (unanimously or by majority, with a rule to break ties)

5. Quality assessment of the decision rule: estimate the risk via the associated misclassification rate or prediction error
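A minimal sketch of steps 1 and 2 for a single feature, using the Gini impurity as impurity function (the helper names and the toy data are invented for illustration):

```python
import numpy as np

def gini(y):
    """Gini impurity of a node: 1 - sum_c p_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best binary question 'x <= t?' on one feature, by weighted impurity decrease."""
    parent, n = gini(y), len(y)
    best_t, best_gain = None, 0.0
    for t in np.unique(x)[:-1]:                 # admissible binary splits
        left, right = y[x <= t], y[x > t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

age = np.array([18, 22, 25, 30, 40, 45, 55, 62])
label = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(best_split(age, label))                   # threshold with the largest impurity decrease
```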
Remarks

1. Nonparametric procedure: no parameters to estimate

2. Difficult to say what the impact of a feature is on the response variable (we call this unsmooth)

3. Highly nonlinear, so difficult to interpret

4. Quality assessment of the decision rule is based only on prediction error

5. Easy to read for practitioners


Random forests


1. Many decision trees are applied to the same problem

2. Each tree is constructed using a bootstrap resampling technique and uses only a random subset of the features

3. For a classification problem, the final prediction is determined by the majority vote over all trees (see the sketch below)

4. The method was proposed relatively recently, by Breiman (2001)

5. Very popular prediction technique, often difficult to beat

6. Complex to interpret
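A minimal scikit-learn sketch on synthetic data (the dataset and the hyperparameter values are illustrative, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees, each grown on a bootstrap resample of the training data;
# every split considers only a random subset of the features (sqrt(p) here)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))   # accuracy of the majority-vote prediction
```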


Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost)

Gradient boosting, proposed by Friedman (2001), constructs the forecast by sequentially fitting small regression trees (weak learners) to the residuals of the ensemble of the previous trees.

This procedure results in one final model, constructed as a sum of trees, that is used for forecasting; in contrast, the random forest forecast is the average of many trees. The properties of gradient boosting have been well studied for iid data.

Extreme gradient boosting (XGBoost), introduced by Chen and Guestrin (2016), is an algorithm that optimizes the implementation of the gradient boosting framework in terms of speed and flexibility. Several tuning parameters can strongly impact the performance, and cross-validation techniques can be used to optimize them.
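A minimal sketch using the xgboost package, with cross-validation over a small, purely illustrative grid of tuning parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Tree depth and learning rate strongly affect performance,
# so they are tuned by 5-fold cross-validation
search = GridSearchCV(
    estimator=XGBClassifier(n_estimators=300, eval_metric="logloss"),
    param_grid={"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.3]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```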


Simple method: K-nearest neighbors

K-nearest neighbors

1. Given: a labeled dataset with target variable vector y and feature matrix X (containing p features)

2. For a new feature vector a, what is the predicted target value (0 or 1 for standard binary classification)?

3. Take the K nearest observations according to the (squared) Euclidean distance
   $d(x_i, a) = \sum_{j=1}^{p} (x_{ij} - a_j)^2$

4. The predicted target for a is the most common class among these K neighbours (see the sketch after this list)

5. K-NN predictions can be unstable with respect to K, and computationally expensive in big data settings
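A minimal NumPy sketch of steps 3 and 4 (the helper name and the synthetic data are invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, a, K=5):
    """Predict the class of a new feature vector a by majority vote
    among the K training observations closest to a."""
    d = np.sum((X - a) ** 2, axis=1)      # squared Euclidean distances d(x_i, a)
    nearest = np.argsort(d)[:K]           # indices of the K nearest observations
    return Counter(y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 observations, p = 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic 0-1 target
print(knn_predict(X, y, a=np.array([0.5, 0.5, 0.0]), K=5))
```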

