
Module 9

Classification and Regression Trees


CART

Prof. Pedram Jahangiry



Class Modules
• Module 1- Introduction to Machine Learning
• Module 2- Setting up Machine Learning Environment
• Module 3- Linear Regression (Econometrics approach)
• Module 4- Machine Learning Fundamentals
• Module 5- Linear Regression (Machine Learning approach)
• Module 6- Penalized Regression (Ridge, LASSO, Elastic Net)
• Module 7- Logistic Regression
• Module 8- K-Nearest Neighbors (KNN)
• Module 9- Classification and Regression Trees (CART)
• Module 10- Bagging and Boosting
• Module 11- Dimensionality Reduction (PCA)
• Module 12- Clustering (KMeans – Hierarchical)



Road map: ML Algorithms

• Supervised
  • Regression: Linear / Polynomial regression, Penalized regression, KNN, SVR, Tree-based regression models
  • Classification: Logistic regression, KNN, SVM / SVC, Tree-based classification models
• Unsupervised
  • Dimensionality Reduction: Principal Component Analysis (PCA)
  • Clustering: K-Means, Hierarchical

Tree-based models:
1. Decision Trees (DTs)
2. Bagging, Random Forest
3. Boosting


Topics

Part I
1. Decision Trees definitions
2. Decision Tree criteria
   • MSE
   • Error rate
   • Gini index
   • Entropy

Part II
1. Regression Trees
2. Classification Trees

Part III
1. Pruning a tree
2. Hyperparameters

Part IV
1. Pros and Cons
2. Applications in Finance


Part I
Decision Trees definitions and criteria



Decision Trees Definitions
• DTs are ML algorithms that progressively divide data sets into smaller data groups based on a
descriptive feature, until they reach sets that are small enough to be described by some label.
• DTs apply a top-down approach to data, trying to group and label observations that are similar.
[Figure: a toy example. The Height–Weight plane is partitioned at Height = 6 and Weight = 180 into regions R1, R2, R3; the corresponding tree first asks Height > 6, then Weight > 180, and labels each leaf Male or Female.]
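To make the example concrete, here is a minimal scikit-learn sketch of such a classifier. The height/weight numbers below are made up for illustration and are not the lecture's data.

```python
# Minimal sketch: a small decision tree classifier on made-up height/weight data.
# The numbers are hypothetical and only mirror the style of the slide's example.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[5.4, 130], [5.6, 150], [5.9, 160], [6.1, 170],   # [height (ft), weight (lb)]
     [6.2, 190], [6.4, 210], [5.5, 120], [6.0, 185]]
y = ["Female", "Female", "Male", "Male", "Male", "Male", "Female", "Male"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned rules: root node, decision nodes, and leaf (terminal) nodes.
print(export_text(clf, feature_names=["Height", "Weight"]))
```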
Decision Trees Definitions
• When the target variable consists of real numbers: regression trees
• When the target variable is categorical: classification trees
• Terminology:

✓ Root node
✓ Splitting
✓ Branch
✓ Decision node (internal node)
✓ Leaf node (terminal node)
✓ Sub-tree
✓ Depth (level)
✓ Pruning



Decision Trees Criteria
• Which split adds the most information gain (minimum impurity)? These criteria control how a decision tree decides to split the data.
• Regression trees: MSE
• Classification trees (they all measure impurity):
  1. Error rate
  2. Entropy
  3. Gini index


Decision Trees Criteria
• Entropy: measures the impurity or randomness (uncertainty) in the data points.
• Gini index: measures how often a randomly chosen element would be incorrectly labeled.
• For both entropy and Gini, a value of 0 means all elements belong to a single class (a pure node).
• Different decision tree algorithms use different impurity metrics.

$$\text{entropy} = -\sum_j p_j \log_2(p_j)$$

$$\text{Gini} = 1 - \sum_j p_j^2$$
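A minimal sketch of these two formulas in plain Python; the function names and the toy label lists are my own.

```python
# Minimal sketch: entropy and Gini impurity computed from a list of class labels.
import math
from collections import Counter

def entropy(labels):
    """sum over classes of -p_j * log2(p_j)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """1 - sum over classes of p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

pure  = ["A"] * 10              # a single class
mixed = ["A"] * 5 + ["B"] * 5   # 50/50 split

print(entropy(pure), gini(pure))    # 0.0 and 0.0 -> pure node
print(entropy(mixed), gini(mixed))  # 1.0 and 0.5 -> maximum impurity for two classes
```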


Part II
Regression / Classification Trees!
How does a decision tree work?



Regression Trees
• Baseball salary is color-coded from low (blue, green) to high (yellow, red).
• DTs apply a top-down approach to data, trying to group and label observations that are similar.
• The main questions in every decision-making process:
  1. Which feature to start with?
  2. Where to put the split (cut-off)?


Interpreting the results
• Based on color-coded salary, it seems that Years is the most important factor in determining salary.
• For less experienced players, the number of hits seems irrelevant.
• Among more experienced players, though, players with more hits tend to have higher salaries.
• As one can see, the model is very easy to display, interpret and explain.


Tree building process
• Divide the feature space into J distinct and non-overlapping regions.
• For every observation that falls into region $R_j$, we make the same prediction, which is simply the mean of the target values for the training observations in $R_j$.
• The goal is to find rectangles $R_1, R_2, \dots, R_J$ that minimize the RSS:

$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

• where $\hat{y}_{R_j}$ is the mean target for the training observations within the $j$th rectangle.
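The sketch below illustrates the "same prediction for every observation in a region" rule with scikit-learn on synthetic data (the data and the max_depth choice are arbitrary): each training point is mapped to its leaf, and the tree's prediction for that leaf equals the mean target of the training observations in it.

```python
# Minimal sketch: a regression tree predicts the mean target of its leaf (region).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)      # noisy nonlinear target

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)    # at most 4 regions

leaves = reg.apply(X)                                 # leaf (region) id of each training point
for leaf in np.unique(leaves):
    region_mean = y[leaves == leaf].mean()            # mean target in region R_m
    tree_pred = reg.predict(X[leaves == leaf][:1])[0] # prediction for any point in that region
    print(f"leaf {leaf}: region mean = {region_mean:.3f}, tree prediction = {tree_pred:.3f}")
```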


Tree building process: Recursive Binary Splitting

• How does the algorithm select the feature $X_j$ and the split point $s$?
• $X_j$ and $s$ are selected such that splitting the feature space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \ge s\}$ leads to the largest possible reduction in RSS:

$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \ge s\}$$

• We seek the values of $j$ and $s$ that minimize:

$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$

• The best split is made at that particular step (a greedy approach), rather than looking ahead and picking a split that would lead to a better tree in some future step.
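A minimal numpy sketch of this greedy search, assuming every observed feature value is tried as a candidate cut point; the function and variable names are mine, not from the lecture.

```python
# Minimal sketch: exhaustive greedy search for the split (j, s) that minimizes the
# total RSS of the two resulting regions {X | X_j < s} and {X | X_j >= s}.
import numpy as np

def best_split(X, y):
    best = None  # (rss, feature index j, cut point s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one region empty
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or rss < best[0]:
                best = (rss, j, s)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 3))
y = np.where(X[:, 1] > 4.0, 5.0, 1.0) + 0.1 * rng.normal(size=100)
print(best_split(X, y))   # should recover feature j = 1 with a cut point near 4.0
```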


Tree building process: Recursive Binary Splitting
• Next, the algorithm repeats the process, looking for the best feature and best split point in order to split the data further and minimize the RSS within each of the resulting regions.
• The process continues until a stopping criterion is reached; for instance, until no region contains more than a fixed number of observations.


A Five-Region Example of Recursive Binary Splitting

[Figure, three panels. Left: the output of recursive binary splitting on a two-dimensional example. Middle: a tree corresponding to the partition in the left panel. Right: a perspective plot of the prediction surface corresponding to that tree.]


Overfitting?



Classification Trees
• Classification trees are very similar to regression trees, except that they are used to predict a qualitative response rather than a quantitative one.
• The prediction of the algorithm at each terminal node is the category with the majority of the data points, i.e., the most commonly occurring class.


Classification Trees (details)
• Just as in the regression setting, recursive binary splitting is used to grow a classification tree. However, instead of RSS we use one of the following impurity criteria:

1. Classification error rate: $1 - \max_k(\hat{p}_{mk})$

2. Gini index: $1 - \sum_k \hat{p}_{mk}^2$

3. Cross entropy: $-\sum_k \hat{p}_{mk} \log(\hat{p}_{mk})$

• $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th region that are from the $k$th class.
• Classification error rate is not sufficiently sensitive to node purity, so in practice either the Gini index or cross entropy is preferred.


Decision Tree Metrics (Simple Example)
Setup: 30 training observations (10 in one class, 20 in the other). Two candidate root-node splits are compared: X1 > s, which produces child nodes with class counts (9, 11) and (1, 9), and X2 > s, which produces child nodes with class counts (2, 13) and (8, 7). For each node, Gini $= 1 - \sum_j p_j^2$, cross entropy $= -\sum_j p_j \log p_j$ (natural logarithm here), and error rate $= 1 - \max_j p_j$; the criterion for a split is the weighted average of the child-node values (weights = node size / 30).

Entire training data before split (10, 20):
• Gini: $1 - \left[\left(\tfrac{10}{30}\right)^2 + \left(\tfrac{20}{30}\right)^2\right] = 0.44$
• Cross entropy: $-\left[\tfrac{10}{30}\log\tfrac{10}{30} + \tfrac{20}{30}\log\tfrac{20}{30}\right] = 0.64$
• Error rate: $1 - \max\left(\tfrac{10}{30}, \tfrac{20}{30}\right) = 1 - \tfrac{20}{30} = 0.333$

Root node split X1 > s (children of size 20 and 10, with class counts (9, 11) and (1, 9)):
• Gini: $\tfrac{20}{30}\left[1 - \left(\tfrac{9}{20}\right)^2 - \left(\tfrac{11}{20}\right)^2\right] + \tfrac{10}{30}\left[1 - \left(\tfrac{1}{10}\right)^2 - \left(\tfrac{9}{10}\right)^2\right] = \tfrac{20}{30}(0.495) + \tfrac{10}{30}(0.18) = 0.39$
• Cross entropy: $\tfrac{20}{30}\left[-\tfrac{9}{20}\log\tfrac{9}{20} - \tfrac{11}{20}\log\tfrac{11}{20}\right] + \tfrac{10}{30}\left[-\tfrac{1}{10}\log\tfrac{1}{10} - \tfrac{9}{10}\log\tfrac{9}{10}\right] = \tfrac{20}{30}(0.69) + \tfrac{10}{30}(0.325) = 0.57$
• Error rate: $\tfrac{20}{30}\left[1 - \max\left(\tfrac{9}{20}, \tfrac{11}{20}\right)\right] + \tfrac{10}{30}\left[1 - \max\left(\tfrac{1}{10}, \tfrac{9}{10}\right)\right] = \tfrac{9}{30} + \tfrac{1}{30} = 0.333$

Root node split X2 > s (children of size 15 and 15, with class counts (2, 13) and (8, 7)):
• Gini: $\tfrac{15}{30}\left[1 - \left(\tfrac{2}{15}\right)^2 - \left(\tfrac{13}{15}\right)^2\right] + \tfrac{15}{30}\left[1 - \left(\tfrac{8}{15}\right)^2 - \left(\tfrac{7}{15}\right)^2\right] = \tfrac{15}{30}(0.231) + \tfrac{15}{30}(0.497) = 0.37$
• Cross entropy: $\tfrac{15}{30}\left[-\tfrac{2}{15}\log\tfrac{2}{15} - \tfrac{13}{15}\log\tfrac{13}{15}\right] + \tfrac{15}{30}\left[-\tfrac{8}{15}\log\tfrac{8}{15} - \tfrac{7}{15}\log\tfrac{7}{15}\right] = \tfrac{15}{30}(0.39) + \tfrac{15}{30}(0.69) = 0.54$
• Error rate: $\tfrac{15}{30}\left[1 - \max\left(\tfrac{2}{15}, \tfrac{13}{15}\right)\right] + \tfrac{15}{30}\left[1 - \max\left(\tfrac{8}{15}, \tfrac{7}{15}\right)\right] = \tfrac{2}{30} + \tfrac{7}{30} = 0.3$
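As a quick sanity check on the arithmetic above, the sketch below recomputes the three weighted measures for the X1 > s split; note that the cross-entropy values in this example come out as shown only with the natural logarithm.

```python
# Minimal sketch: weighted impurity of the X1 > s split from the example above.
# Child node 1: 20 observations (9 vs 11); child node 2: 10 observations (1 vs 9).
import numpy as np

def node_metrics(counts):
    p = np.array(counts) / sum(counts)
    gini = 1 - (p ** 2).sum()
    entropy = -(p * np.log(p)).sum()   # natural log, matching the slide's numbers
    error = 1 - p.max()
    return gini, entropy, error

children = [(9, 11), (1, 9)]
n_total = sum(sum(c) for c in children)   # 30 observations in total

for name, idx in [("Gini", 0), ("Cross entropy", 1), ("Error rate", 2)]:
    weighted = sum(sum(c) / n_total * node_metrics(c)[idx] for c in children)
    print(f"{name}: {weighted:.3f}")
# Expected: Gini ~ 0.39, Cross entropy ~ 0.57, Error rate ~ 0.333.
# Note: the error rate (0.333) is unchanged from the unsplit node, while Gini and
# entropy both fall -- this is why error rate is less sensitive to node purity.
```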


Trees Versus Linear Models

[Figure: left column, a linear model; right column, a tree-based model. Top row: the true decision boundary is linear. Bottom row: the true decision boundary is non-linear.]


Part III
Pruning a tree
Tuning hyperparameters


Pruning a tree
• A smaller tree with fewer splits may lead to lower variance and better interpretation, but at the cost of higher bias.
• Growing a small tree by stopping early is too short-sighted: "a seemingly worthless split early on in the tree might be followed by a very good split, a split that leads to a large reduction in RSS/impurity index later on."
• A better strategy is therefore to grow a very large tree first. This may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance.
• So we need to prune the large tree back in order to obtain a subtree.
• Cost complexity pruning is used to do this.


Cost complexity pruning (weakest link pruning)

• Consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.
• For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ that minimizes the following objective function:

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

• $|T|$ indicates the number of terminal nodes of the tree $T$,
• $R_m$ is the rectangle corresponding to the $m$th terminal node, and
• $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$.

• $\alpha$ controls the bias-variance trade-off and is determined by cross-validation.
• Lastly, we return to the full data set and obtain the subtree corresponding to the chosen $\alpha$.
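A scikit-learn sketch of this workflow on synthetic data: cost_complexity_pruning_path gives the sequence of effective alpha values, cross-validation picks one, and the final subtree is refit on the full data set. The data and settings below are illustrative only.

```python
# Minimal sketch: cost complexity (weakest link) pruning with scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.2 * rng.normal(size=300)

# 1. Compute the candidate alphas (one per subtree of the fully grown tree T0).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# 2. Cross-validate each alpha to trade off subtree size against fit.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]

# 3. Refit on the full data set with the chosen alpha to obtain the pruned subtree.
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"best alpha = {best_alpha:.4f}, leaves in pruned tree = {pruned.get_n_leaves()}")
```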


Salary example continued

[Figure: the unpruned tree that results from recursive binary splitting on the training data.]


Finding the optimal 𝛼 or T



The optimal (pruned) tree



Other hyperparameters
✓ To avoid overfitting, regularization parameters can be added to the model (see the tuning sketch after this list), such as:
• Maximum depth of the tree
• Minimum population at a node
• Maximum number of decision nodes
• Minimum impurity decrease (info gain)
• Alpha (complexity parameter)
✓ Other hyperparameters are:
• Criterion: gini, entropy
• Splitter: best, random
• Class weight: balanced, none
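In scikit-learn these knobs correspond to constructor arguments such as max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, min_impurity_decrease, ccp_alpha, criterion, splitter, and class_weight. Below is a minimal, illustrative tuning sketch with cross-validated grid search; the data and grid values are arbitrary.

```python
# Minimal sketch: tuning decision tree hyperparameters with cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],   # impurity measure used for splits
    "max_depth": [2, 3, 5, None],       # maximum depth of the tree
    "min_samples_leaf": [1, 5, 20],     # minimum population at a leaf node
    "ccp_alpha": [0.0, 0.001, 0.01],    # cost complexity (pruning) parameter
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```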



Part IV
Pros and Cons
Applications in finance



DTs’ Pros and Cons

Pros:
• Easy to interpret and visualize
• Can easily handle categorical data without the need to create dummy variables
• Can easily capture non-linear patterns
• Can handle data in its raw form (no preprocessing needed). Why?
• Makes no assumptions about the data distribution, because of the non-parametric nature of the algorithm

Cons:
• Relatively poor predictive accuracy
• Sensitive to noisy data: the tree can overfit noise, and small variations in the data can result in a very different decision tree*

*This can be reduced by bagging and boosting algorithms.


DTs’ Applications in finance

• Enhancing detection of fraud in financial statements
• Generating consistent decision processes in equity and fixed-income selection
• Simplifying communication of investment strategies to clients
• Portfolio allocation problems


Appendix A



Class Modules
✓ Module 1- Introduction to Machine Learning
✓ Module 2- Setting up Machine Learning Environment
✓ Module 3- Linear Regression (Econometrics approach)
✓ Module 4- Machine Learning Fundamentals
✓ Module 5- Linear Regression (Machine Learning approach)
✓ Module 6- Penalized Regression (Ridge, LASSO, Elastic Net)
✓ Module 7- Logistic Regression
✓ Module 8- K-Nearest Neighbors (KNN)
✓ Module 9- Classification and Regression Trees (CART)
• Module 10- Bagging and Boosting
• Module 11- Dimensionality Reduction (PCA)
• Module 12- Clustering (KMeans – Hierarchical)

