Decision Trees
Data pre-processing
• What to do with missing info
• Transform variables
• Get it ready for analysis
Predictive Models
• Linear regression (covered in STAT202)
• Logistic Regression
• Decision Trees
Evaluate Model
• Training and test sets
• Performance metrics
Classification
How can I estimate the category a target variable falls into?
◦ Similar to estimation, but NON-NUMERIC target
◦ Help to answer yes/no questions or place observations in mutually exclusive subsets
◦ Ex: classifying a credit card transaction as legitimate or fraudulent, diagnosing a patient with a particular disease, determining income brackets based on personal characteristics, etc.
Prediction
How can I predict the result of a FUTURE outcome?
◦ Similar to classification and estimation; however, the "outcome" is not yet known.
◦ Methods of classification and estimation can be used here if they appropriately fit the context
◦ Ex: what will the price of a stock be 3 months from now? What will the increase in car-related accidents be if I increase the speed limit?
Classification Tasks
Goal:
◦ To correctly classify an observation as one of multiple discrete categories
Predictor variables:
◦ Discrete numerical
◦ Categorical
◦ Continuous numerical (used by creating “categories” or break points)
Classification Tasks
Examine a data set containing:
◦ Predictor variables
◦ Already classified target variable outcomes
The model “learns” about links between predictors and a classification level of the target variable
This is the “training” portion
Consider new data:
◦ Predictor variables
◦ Target variable unclassified
The model then classifies the target variable (this is the “testing” portion)
We can evaluate error during this portion before applying the model to data whose target classifications are truly unknown
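A rough illustration of the training/testing idea (a minimal sketch, not from these notes, using R's built-in iris data and a classification tree from the rpart package, which is discussed later in these notes; the credit-risk data used later would work the same way):

# Hold out a test set, train on the rest, then estimate error on the held-out records
set.seed(1)
train_rows <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

# "Training" portion: the model learns links between predictors and the target
fit  <- rpart::rpart(Species ~ ., data = train, method = "class")

# "Testing" portion: classify records whose target was withheld and check the error
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$Species)   # misclassification rate on the held-out records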
Expected Value Decision Trees
A decision-making tool:
◦ Lay out the options in logical order
◦ Calculate the expected value of each option
◦ Select the "path" with the highest expected value
Model pieces:
◦ Root nodes: beginning of the decision tree
◦ Decision nodes: a node where observations are broken down by discrete categories
◦ Branches: represents one categorical level associated with a parent decision node
◦ Leaf nodes: a termination point on a tree that seeks to have as little variation in classifications of the
target variable as possible
◦ Pure leaf nodes: all observations have the same classification
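As a small illustration of the expected-value procedure above (the numbers are invented for illustration only): suppose option A pays 100 with probability 0.6 and loses 20 otherwise, while option B pays 40 for certain. Then

$$EV(A) = 0.6(100) + 0.4(-20) = 52, \qquad EV(B) = 40,$$

so the tree selects the path for option A.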
Decision Tree: Example
Predicting “good” versus “bad” credit risk
You have information regarding:
◦ Savings level
◦ Grouped into discrete categories (low, medium and high)
◦ Assets
◦ Grouped into discrete categories (low and high)
◦ Income
◦ Grouped into discrete categories via a break point (less than $30,000, and greater than or equal to $30,000)
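A minimal sketch of creating that income break point in R (the vector name income and the specific values are assumptions, not from the notes):

# Bin a numeric income variable at the $30,000 break point
income <- c(22000, 35000, 29000, 41000)
income_cat <- cut(income,
                  breaks = c(-Inf, 30000, Inf),
                  labels = c("< $30,000", ">= $30,000"),
                  right = FALSE)   # right = FALSE puts exactly $30,000 in the upper category
income_cat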
Decision Tree: Example
Root Node
Savings = Low, Med, High?
[Tree diagram: below the root node, the remaining decision nodes branch on yes/no splits down to the leaf nodes.]
Diverse leaf node: a terminal point on the decision tree where we have a mix of target variable outcomes but no further splits can be made
Decision Tree: Example
Diverse Leaf node:
Cust Savings Assets Income Credit Risk
004 High Low <=$30,000 Good
009 High Low <=$30,000 Good
027 High Low <=$30,000 Bad
031 High Low <=$30,000 Bad
104 High Low <=$30,000 Bad
How to build a decision tree
In our example:
◦ Why did we split on savings for the root node?
◦ Why not assets or income level?
Specifically:
◦ Classification and regression trees (CART algorithm)
◦ C4.5 algorithm (information gain algorithm)
Classification and Regression Trees (CART)
CART basics:
◦ Strictly binary (every decision node produces exactly two branches)
◦ Continuous variables and categorical/discrete variables are OK here
◦ The algorithm will search for the proper split
◦ Conducts an exhaustive search over all options of splits and selects an optimal split at each decision
node by maximizing the “goodness” over all potential splits
Let Φ(s|t) be a measure of the "goodness" of a candidate split s at node t, where

$$\Phi(s \mid t) = 2\, P_L\, P_R \sum_{j=1}^{\#\text{classes}} \left| P(j \mid t_L) - P(j \mid t_R) \right|$$

and where P_L and P_R are the proportions of records at node t sent to the left child t_L and right child t_R, and P(j | t_L), P(j | t_R) are the proportions of class-j records within each child node.
We will first use split #4: assets = low versus assets = medium and high
What do we notice about the optimality measure?
[Tree diagram fragment: Assets = High → Good Risk (Records 6); Assets = Medium → Bad Risk (Records 3); the full CART tree is shown after the C4.5 example.]
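As a worked check of the goodness measure (a sketch computed by hand, assuming the same eight-record credit-risk data set used in the C4.5 example below): for split #4, the left child (assets = low) holds records 2 and 7, both bad risks, and the right child holds the other six records, five good and one bad, so

$$P_L = \tfrac{2}{8},\quad P_R = \tfrac{6}{8},\quad \Phi(s \mid t) = 2 \cdot \tfrac{2}{8} \cdot \tfrac{6}{8}\left( \left|0 - \tfrac{5}{6}\right| + \left|1 - \tfrac{1}{6}\right| \right) = 0.375 \times \tfrac{10}{6} = 0.625.$$

A large value reflects child nodes whose class proportions differ sharply, which is exactly what the CART search is rewarding.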
C4.5 Algorithm (Information Gain)
C4.5 Basics:
NOT restricted to binary splits from decision nodes
◦ Leads to a tree of more variable shape (can have a wider tree than with CART)
By default, creates one branch for EACH level of a categorical variable if it selects that predictor
to branch on
◦ This may not be ideal if multiple levels of a category have similar relationships with the target
Conducts an exhaustive search over potential splits and selects an optimal split based on
information gain (or entropy reduction)
◦ Before we get into how this is calculated, we will first discuss what entropy is
What is Entropy Reduction?
Entropy is a measure of impurity or disorder in the dataset
◦ For a two-class target, entropy ranges between 0 and 1
◦ 0 means the node is perfectly pure (all records share one class)
◦ 1 means the classes are evenly distributed
Entropy
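The formula referenced below is the standard (Shannon) entropy, stated here because it is what the later calculations use: for a node T whose records fall into classes j with proportions p_j,

$$H(T) = -\sum_{j} p_j \log_2 p_j$$

and for a candidate split S that sends a fraction n_i / n of the records to child node T_i, the entropy remaining after the split is the weighted average

$$H_S(T) = \sum_{i} \frac{n_i}{n}\, H(T_i).$$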
What is Entropy Reduction?
Let’s apply the entropy formula to our situation
C4.5 Algorithm seeks to maximize the entropy reduction (or information gain) produced by a
potential split
◦ Reduction = H(T) - H_S(T)
C4.5 with an example
Cust Savings Assets Income (1000s) Credit Risk
1 Medium High 75 Good
2 Low Low 50 Bad
3 High Medium 25 Bad
4 Medium Medium 50 Good
5 Low Medium 100 Good
6 High High 25 Good
7 Low Low 25 Bad
8 Medium Medium 75 Good
C4.5 Using an Example
Potential Split Resulting Child Nodes
1 Savings=low Savings=medium Savings=high
2 Assets=low Assets=medium Assets=high
3 Income<=$25,000 Income>$25,000
4 Income<=$50,000 Income>$50,000
5 Income<=$75,000 Income>$75,000
Baseline entropy (5 good and 3 bad credit risks among the 8 records):
◦ H(T) = -(5/8) log2(5/8) - (3/8) log2(3/8) = 0.9544
C4.5 Using an Example
Consider split 1 (splitting by savings)
Low savings
◦ 3 observations; 1 is good and 2 are bad credit risk
Medium savings
◦ 3 observations; 3 are good and 0 are bad credit risk
High savings
◦ 2 observations, 1 is good and 1 is bad credit risk
Entropy after the split (weighted average of the child-node entropies):
◦ H_savings(T) = (3/8)(0.9183) + (3/8)(0) + (2/8)(1) = 0.5944
Entropy reduction = 0.9544 - 0.5944 = 0.36
C4.5 Using an Example
Consider split 3: Income<=$25,000
Income<=$25,000
◦ 3 observations; 1 good credit risk, 2 bad credit risk
Income>$25,000
◦ 5 observations; 4 good credit risk, 1 bad credit risk
Entropy after the split (weighted average of the child-node entropies):
◦ H_income(T) = (3/8)(0.9183) + (5/8)(0.7219) = 0.7956
Entropy reduction = 0.9544 - 0.7956 = 0.1588
C4.5 Using an Example
Potential split Baseline Entropy Split Entropy Entropy Reduction
1 0.9544 0.5944 0.36
2 0.9544 0.4057 0.5487
3 0.9544 0.7956 0.1588
4 0.9544 0.6069 0.3475
5 0.9544 0.8621 0.0923
We will first use a split along the assets dimension because it has the
highest entropy reduction (or information gain).
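A minimal sketch (not from the notes; the data-frame and column names are assumptions) that reproduces the entropy-reduction table above in R:

# Entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# The eight-record credit-risk example (income in $1000s)
credit <- data.frame(
  Savings = c("Medium", "Low", "High", "Medium", "Low", "High", "Low", "Medium"),
  Assets  = c("High", "Low", "Medium", "Medium", "Medium", "High", "Low", "Medium"),
  Income  = c(75, 50, 25, 50, 100, 25, 25, 75),
  Risk    = c("Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good")
)

# Weighted average entropy of the child nodes produced by a candidate split
split_entropy <- function(groups, y) {
  sum(sapply(split(y, groups), function(g) length(g) / length(y) * entropy(g)))
}

baseline <- entropy(credit$Risk)          # 0.9544
candidates <- list(
  savings   = credit$Savings,
  assets    = credit$Assets,
  income_25 = credit$Income <= 25,
  income_50 = credit$Income <= 50,
  income_75 = credit$Income <= 75
)
# Entropy reduction (information gain) for each candidate split
sapply(candidates, function(s) baseline - split_entropy(s, credit$Risk))

The assets split returns the largest reduction, matching the table above.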
C4.5 Using an Example
Root Node (All Records)
Assets = Low, Medium, High?
◦ Assets = Low: Bad Risk (Records 2, 7)
◦ Assets = High: Good Risk (Records 1, 6)
◦ Assets = Medium: Decision Node A (Records 3, 4, 5, 8), which then splits on savings:
◦ Savings = Low: Good Risk (Record 5)
◦ Savings = Medium: Good Risk (Records 4, 8)
◦ Savings = High: Bad Risk (Record 3)
Compare this tree to the one found using the CART method:
Root Node (All Records)
Assets = Low versus Assets = Medium or High?
◦ Assets = Low: Bad Risk (Records 2, 7)
◦ Assets = Medium or High: Decision Node A (Records 1, 3, 4, 5, 6, 8), which then splits on savings:
◦ Savings = Low or Medium: Good Risk (Records 1, 4, 5, 8)
◦ Savings = High: Decision Node B (Records 3, 6), which then splits on assets:
◦ Assets = High: Good Risk (Record 6)
◦ Assets = Medium: Bad Risk (Record 3)
Comparison of CART and C4.5
CART can also build regression trees for a numeric target; each terminal node then predicts the average value of the target (for example, average age) among the records within that node
Comparison of CART and C4.5
Pre-pruning: impose stopping rules like maximum depth or minimum leaf size while the tree grows
Post-pruning: let the tree grow to "full depth" and then go back and eliminate branches
◦ Use cost-complexity pruning (weakest-link pruning), which gives candidate trees a score that combines error and a penalty for tree size
◦ The penalty for tree size is determined using cross-validation
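A minimal sketch of cost-complexity pruning in R, using rpart (which implements this pruning scheme) and its bundled kyphosis data set; the xerror column reported by printcp comes from rpart's built-in cross-validation:

library(rpart)
# Grow a deliberately deep tree, then prune back to the subtree with the
# lowest cross-validated error
big_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(minsplit = 2, cp = 0, xval = 10))
printcp(big_tree)   # table of complexity parameter (cp) values and cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)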
Comparison of CART and C4.5
Both can be run easily in RStudio (a short sketch follows this list)
Either one can perform better on a given data set
Both can be adjusted to include weights
◦ For example, weights that favor correctly classifying a particular outcome
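A minimal sketch (not from the notes) of fitting both kinds of tree on the eight-record credit-risk example in R. rpart implements CART; J48 from the RWeka package (which requires Java) is an open-source implementation of C4.5. The control settings are relaxed only because the toy data set is tiny.

library(rpart)
library(RWeka)

credit <- data.frame(
  Savings = factor(c("Medium", "Low", "High", "Medium", "Low", "High", "Low", "Medium")),
  Assets  = factor(c("High", "Low", "Medium", "Medium", "Medium", "High", "Low", "Medium")),
  Income  = c(75, 50, 25, 50, 100, 25, 25, 75),
  Risk    = factor(c("Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"))
)

# CART: strictly binary splits
cart_fit <- rpart(Risk ~ Savings + Assets + Income, data = credit,
                  method = "class",
                  control = rpart.control(minsplit = 2, cp = 0))
print(cart_fit)

# C4.5 (J48): multiway splits on categorical predictors
c45_fit <- J48(Risk ~ Savings + Assets + Income, data = credit)
print(c45_fit)

# Weights favoring one outcome: a loss matrix that makes calling a "Bad"
# risk "Good" five times as costly as the reverse (rows = true class,
# columns = predicted class, ordered Bad then Good)
weighted_fit <- rpart(Risk ~ ., data = credit, method = "class",
                      parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2, byrow = TRUE)),
                      control = rpart.control(minsplit = 2, cp = 0))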