
Intro to Data Mining for Business
STAT331
DECISION TREES
What we will cover
Become familiar with a data set
• What variables are in it?
• Where did it come from?
• Any initial patterns or relationships?
Data pre-processing
• What to do with missing info
• Transform variables
• Get it ready for analysis
Cluster Analysis
• Generate useful segments within the data
Dimension Reduction
• Perform PCA to reduce # of variables in data set
Predictive Models
• Linear regression (covered in STAT202)
• Logistic Regression
• Decision Trees
Evaluate Model
• Training and test sets
• Performance metrics
Interpret
Draw conclusions
Provide recommendations
Classification
How can I estimate the category a target variable falls into?
◦ Similar to estimation, but NON-NUMERIC target
◦ Help to answer yes/no questions or place observations in mutually exclusive subsets
◦ Ex: classifying a credit card transaction as real or fraudulent, diagnosing a patient with a particular disease, determining income brackets based on personal characteristics, etc.
Prediction
How can I predict the results of a FUTURE outcome?
◦ Similar to classification and estimation; however, the “outcome” is not yet known.
◦ Methods of classification and estimation can be used here if they appropriately fit the context
◦ Ex: what will the price of a stock be 3 months from now? What will the increase be in car-related accidents if I increase the speed limit?
Classification Tasks
Goal:
◦ To correctly classify an observation as one of multiple discrete categories

Target variable: categorical


◦ Ex: high, medium, low; yes, no

Predictor variables:
◦ Discrete numerical
◦ Categorical
◦ Continuous numerical (used by creating “categories” or break points)
Classification Tasks
Examine a data set containing:
◦ Predictor variables
◦ Already classified target variable outcomes

The model “learns” about links between predictors and a classification level of the target variable
This is the “training” portion
Consider new data:
◦ Predictor variables
◦ Target variable unclassified

The model then classifies the target variable (this is the “testing” portion)
We can evaluate error during this portion before applying to truly unknown target variable
classification
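As a concrete illustration of the training and testing portions described above, here is a minimal Python sketch (scikit-learn and pandas are a tooling assumption; the slides do not prescribe software). It uses the small credit-risk records that appear later in these slides.

```python
# Minimal sketch of the train/test workflow: learn on already-classified
# records, then classify held-out records and evaluate the error.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# The eight credit-risk records used later in these slides.
df = pd.DataFrame({
    "savings": ["Medium", "Low", "High", "Medium", "Low", "High", "Low", "Medium"],
    "assets":  ["High", "Low", "Medium", "Medium", "Medium", "High", "Low", "Medium"],
    "income":  [75, 50, 25, 50, 100, 25, 25, 75],
    "risk":    ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"],
})
X = pd.get_dummies(df[["savings", "assets", "income"]], columns=["savings", "assets"])
y = df["risk"]

# "Training" portion: the model learns links between predictors and classes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# "Testing" portion: classify the held-out records and evaluate the error.
print(accuracy_score(y_test, model.predict(X_test)))
```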
Expected Value Decision Trees
Lay out options in logical order
Calculate expected value of each
Select a “path” with the highest value
Decision making tool

Data mining decision trees are different


Decision Trees
One form of classification model

Model pieces:
◦ Root nodes: beginning of the decision tree
◦ Decision nodes: a node where observations are broken down by discrete categories
◦ Branches: represents one categorical level associated with a parent decision node
◦ Leaf nodes: a termination point on a tree that seeks to have as little variation in classifications of the
target variable as possible
◦ Pure leaf nodes: all observations have the same classification
Decision Tree: Example
Predicting “good” versus “bad” credit risk
You have information regarding:
◦ Savings level
◦ Assets
◦ Income
Decision Tree: Example
Predicting “good” versus “bad” credit risk
You have information regarding:
◦ Savings level
◦ Grouped into discrete categories (low, medium and high)
◦ Assets
◦ Grouped into discrete categories (low and high)
◦ Income
◦ Grouped into discrete categories via a breakpoint (less than or equal to $30,000, or greater than $30,000; see the sketch below)
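A small sketch (assuming pandas; the breakpoint mirrors the income grouping above) of turning a continuous predictor into discrete categories:

```python
# Bin a continuous predictor at a breakpoint so it can be used as a category.
import pandas as pd

income = pd.Series([22000, 45000, 28000, 61000, 30000])
income_band = pd.cut(income, bins=[0, 30_000, float("inf")],
                     labels=["<= $30,000", "> $30,000"])
print(income_band.value_counts())
```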
Decision Tree: Example
Root node: Savings = Low, Med, or High?
◦ Savings = Low → Assets = Low?
  ◦ Yes → Bad Risk
  ◦ No → Good Risk
◦ Savings = Med → Good Risk
◦ Savings = High → Income <= $30K?
  ◦ Yes → Bad Risk
  ◦ No → Good Risk
Decision Tree: Example
Tree begins with a root node
Partitions data according to a single predictor variable
◦ If the resulting node only has one level of classification, it becomes a leaf node
◦ If the resulting node has multiple levels of classification, create another decision node to maximize the uniformity of the partitions with regard to the target variable

This is what occurs in a “perfect world”


◦ This cannot always be achieved
Decision Tree: Example
(Tree as above. Consider the leaf reached by Savings = High and Income <= $30K, labeled Bad Risk:)
Cust Savings Assets Income Credit Risk


004 high Low <=$30,000 Bad
009 High Low <=$30,000 Bad
027 High Low <=$30,000 Bad
031 High Low <=$30,000 Bad
104 high Low <=$30,000 Bad
Decision Tree: Example
(Tree as above. Now suppose the records reaching that same leaf had a mix of outcomes:)
Cust Savings Assets Income Credit Risk


004 high Low <=$30,000 Good
009 High Low <=$30,000 Good
027 High Low <=$30,000 Bad
031 High Low <=$30,000 Bad
104 high Low <=$30,000 Bad
Decision Tree: Example
Diverse leaf node: a terminal point on the decision tree where we have a mix of target variable outcomes but no further splits can be made
Decision Tree: Example
Diverse Leaf node:
Cust Savings Assets Income Credit Risk
004 high Low <=$30,000 Good
009 High Low <=$30,000 Good
027 High Low <=$30,000 Bad
031 High Low <=$30,000 Bad
104 high Low <=$30,000 Bad

◦ What we can say here is:
◦ A customer with high savings, low assets, and income less than or equal to $30,000 is a “bad credit risk” with 60% confidence
When can you use decision trees?
Must have a training data set
Must have background knowledge in the industry/domain
Training set must have a cross section of types of records
◦ If you are systematically lacking a certain subset of records, your tree will be biased

Target variable MUST BE CATEGORICAL


◦ You can create this if necessary using domain/industry knowledge or expert recommendations
◦ Some algorithms can also “create” this but you need to be ok with a non-continuous outcome
How to build a decision tree
In our example:
◦ Why did we split on savings for the root node?
◦ Why not assets or income level?

We are seeking to maximize the purity of our leaf nodes


◦ Leads to high confidence levels on leaf nodes

Specifically:
◦ Classification and regression trees (CART algorithm)
◦ C4.5 algorithm (information gain algorithm)
Classification and Regression Trees (CART)
CART basics:
◦ Strictly binary (every decision node has exactly two branches)
◦ Continuous variables and categorical/discrete are ok here
◦ The algorithm will search for the proper split
◦ Conducts an exhaustive search over all options of splits and selects an optimal split at each decision
node by maximizing the “goodness” over all potential splits

Let $\Phi(s \mid t)$ be a measure of the "goodness" of a candidate split $s$ at node $t$, where
$$\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{\#\text{classes}} \left| P(j \mid t_L) - P(j \mid t_R) \right|$$
CART
Let us examine what we are maximizing:
Let $\Phi(s \mid t)$ be a measure of the "goodness" of a candidate split $s$ at node $t$, where
$$\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{\#\text{classes}} \left| P(j \mid t_L) - P(j \mid t_R) \right|$$
and where

$t_L$ = left child node of node $t$
$t_R$ = right child node of node $t$
$P_L$ = (number of records at $t_L$) / (number of records in training set)
$P_R$ = (number of records at $t_R$) / (number of records in training set)
$P(j \mid t_L)$ = (number of class $j$ records at $t_L$) / (number of records at $t_L$)
$P(j \mid t_R)$ = (number of class $j$ records at $t_R$) / (number of records at $t_R$)
CART: Using an example
Cust Savings Assets Income (1000s) Credit Risk
1 Medium High 75 Good
2 Low Low 50 Bad
3 High Medium 25 Bad
4 Medium Medium 50 Good
5 Low Medium 100 Good
6 High High 25 Good
7 Low Low 25 Bad
8 Medium Medium 75 Good
CART: Using an example
Potential Split Left Child Node, tL Right Child Node, tR
1 Savings=low Savings=medium, high
2 Savings=medium Savings=low, high
3 Savings=high Savings=low, medium
4 Assets=low Assets=medium, high
5 Assets=medium Assets=low, high
6 Assets=high Assets=low, medium
7 Income<=$25,000 Income>$25,000
8 Income<=$50,000 Income>$50,000
9 Income<=$75,000 Income>$75,000
Split: Savings Low vs Medium/High
Using the data above: Savings = Low sends records 2, 5, and 7 to the left child node; Savings = Medium or High sends records 1, 3, 4, 6, and 8 to the right child node.
Split: Savings Med vs Low/High
Using the data above: Savings = Medium sends records 1, 4, and 8 to the left child node; Savings = Low or High sends records 2, 3, 5, 6, and 7 to the right child node.
Split: Savings High vs Med/Low
Using the data above: Savings = High sends records 3 and 6 to the left child node; Savings = Low or Medium sends records 1, 2, 4, 5, 7, and 8 to the right child node.
How do we evaluate each?
Split PL PR P(j|tL) P(j|tR) 2PLPR Q(s|t) Goodness
(where Q(s|t) = Σj |P(j|tL) − P(j|tR)| and Goodness = Φ(s|t) = 2PLPR · Q(s|t))
1 37.5% 62.5% 33%, 67% 80%, 20% 0.47 0.93 0.44
2 37.5% 62.5% 100%, 0 40%, 60% 0.47 1.2 0.56
3 25% 75% 50%, 50% 67%, 33% 0.38 0.33 0.13
4 25% 75% 0%, 100% 83%, 17% 0.38 1.67 0.62
5 50% 50% 75%, 25% 50%, 50% 0.5 0.5 0.25
6 25% 75% 100%, 0% 50%, 50% 0.38 1 0.36
7 37.5% 62.5% 33%, 67% 80%, 20% 0.47 0.93 0.44
8 62.5% 37.5% 40%, 60% 100%, 0% 0.47 1.2 0.56
9 87.5% 12.5% 57%, 43% 100%, 0% 0.22 0.86 0.19

We will first use split #4: assets=low versus assets=medium and high
What do we notice about the optimality measure?
$$\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{\#\text{classes}} \left| P(j \mid t_L) - P(j \mid t_R) \right|$$

When is this large?
◦ When the proportions of records sent to the left and right child nodes ($P_L$ and $P_R$) are equal
◦ When the class proportions in the left and right child nodes are as different as possible
First split
Assets=low
◦ Only records are 2 and 7
◦ Both have bad credit
◦ This is a pure leaf node

Assets=medium and high


◦ Records: 1, 3, 4, 5, 6, and 8
◦ How can we now split this partition?
What split comes next?
There is a tie between:
◦ Split 3: Savings=high and savings=low, medium
◦ Split 7: Income<=$25,000 and Income>$25,000

You can arbitrarily select in this instance


Split PL PR P(j|tL) P(j|tR) 2PLPR Q(s|t) Goodness
1 17% 83% 100%, 0% 80%, 20% 0.28 0.4 0.11
2 50% 50% 100%, 0% 67%, 33% 0.5 0.67 0.33
3 33% 67% 50%, 50% 100%, 0% 0.44 1 0.44
4 (not applicable: no remaining records have Assets = low)
5 67% 33% 75%, 25% 100%, 0% 0.44 0.5 0.22
6 33% 67% 100%, 0% 75%, 25% 0.44 0.5 0.22
7 33% 67% 50%, 50% 100%, 0% 0.44 1 0.44
8 50% 50% 67%, 33% 100%, 0% 0.5 0.67 0.33
9 83% 17% 80%, 20% 100%, 0% 0.28 0.4 0.11
(Proportions here are computed over the 6 records remaining at Decision Node A.)
Resulting CART tree:
Root node (all records): Assets = Low vs. Medium or High?
◦ Assets = Low → Bad Risk (records 2, 7)
◦ Assets = Medium or High → Decision Node A (records 1, 3, 4, 5, 6, 8): Savings = High?
  ◦ Savings = High → Decision Node B (records 3, 6): Assets = High?
    ◦ Assets = High → Good Risk (record 6)
    ◦ Assets = Medium → Bad Risk (record 3)
  ◦ Savings = Low or Medium → Good Risk (records 1, 4, 5, 8)

How do we explain these results?
Decision Rules
If                                        Then              Support  Confidence
Assets are low                            Bad credit risk   2/8      100%
Assets are high                           Good credit risk  2/8      100%
Assets are medium and savings are low     Good credit risk  1/8      100%
Assets are medium and savings are medium  Good credit risk  2/8      100%
Assets are medium and savings are high    Bad credit risk   1/8      100%
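For comparison, a short sketch of fitting a tree to the same eight records with scikit-learn (a tooling assumption; the slides do the calculation by hand). Note that scikit-learn's CART implementation uses Gini impurity by default rather than the Φ(s|t) goodness measure above, so the splits it chooses may not match the hand-built tree exactly.

```python
# Sketch: fit a CART-style tree to the eight-record example with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "savings": ["Medium", "Low", "High", "Medium", "Low", "High", "Low", "Medium"],
    "assets":  ["High", "Low", "Medium", "Medium", "Medium", "High", "Low", "Medium"],
    "income":  [75, 50, 25, 50, 100, 25, 25, 75],     # in $1000s
    "risk":    ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"],
})
# One-hot encode the categorical predictors; income stays numeric.
X = pd.get_dummies(data[["savings", "assets", "income"]], columns=["savings", "assets"])
tree = DecisionTreeClassifier(random_state=0).fit(X, data["risk"])
print(export_text(tree, feature_names=list(X.columns)))  # text view of the fitted splits
```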
How to build a decision tree
In our example:
◦ Why did we split on savings for the root node?
◦ Why not assets or income level?

We are seeking to maximize the purity of our leaf nodes


◦ Leads to high confidence levels on leaf nodes

Specifically:
◦ Classification and regression trees (CART algorithm)
◦ C4.5 algorithm (information gain algorithm)
C4.5 Algorithm (Information Gain)
C4.5 Basics:
NOT restricted to binary splits from decision nodes
◦ Leads to a tree of more variable shape (can have a wider tree than with CART)

By default, creates one branch for EACH level of a categorical variable if it selects that predictor
to branch on
◦ This may not be ideal if multiple levels of a category have similar relationships with the target

Conducts an exhaustive search over potential splits and selects an optimal split based on
information gain (or entropy reduction)
◦ Before we get into how this is calculated, we will first discuss what entropy is
What is Entropy Reduction?
Entropy is a measure of impurity or disorder in the data set
◦ Measured between 0 and 1 (for a two-class target)
◦ 0 is perfectly pure
◦ 1 is evenly distributed
Entropy:
$$H(T) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $k$ is the number of classes and $p_i$ is the proportion of records in class $i$.
What is Entropy Reduction?
Let’s apply the entropy formula to our situation
◦ Where S is a potential split
◦ And each $T_i$ is a partition (subset) of the data created by making split S
◦ And $P_i$ is the proportion of records in subset i
◦ The average disorder/impurity is then the weighted sum of the entropies for all the subsets created by the split:
$$H_S(T) = \sum_{i} P_i \, H(T_i)$$

C4.5 Algorithm seeks to maximize the entropy reduction (or information gain) produced by a potential split (see the sketch below)
◦ Reduction = H(T) − H_S(T)
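As a sanity check, here is a small Python sketch (the function names are my own) of the entropy and information-gain calculation; it reproduces the baseline entropy and the savings-split reduction worked out on the next slides.

```python
# Sketch of the C4.5 scoring step: entropy of a node and the information gain
# (entropy reduction) produced by a candidate split.
import math
from collections import Counter

def entropy(labels):
    """H(T) = -sum_i p_i * log2(p_i) over the class proportions in T."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """H(T) minus the weighted average entropy of the subsets created by a split."""
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - weighted

# Credit risks for the eight example records, and the split on savings
# (low = records 2, 5, 7; medium = 1, 4, 8; high = 3, 6).
risks = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
by_savings = [["Bad", "Good", "Bad"], ["Good", "Good", "Good"], ["Bad", "Good"]]
print(round(entropy(risks), 4))                       # ~0.9544 (baseline)
print(round(information_gain(risks, by_savings), 2))  # ~0.36, as on the next slides
```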
C4.5 with an example
Cust Savings Assets Income (1000s) Credit Risk
1 Medium High 75 Good
2 Low Low 50 Bad
3 High Medium 25 Bad
4 Medium Medium 50 Good
5 Low Medium 100 Good
6 High High 25 Good
7 Low Low 25 Bad
8 Medium Medium 75 Good
C4.5 Using an Example
Potential Split Resulting Child Nodes
1 Savings=low Savings=medium Savings=high
2 Assets=low Assets=medium Assets=high
3 Income<=$25,000 Income>$25,000
4 Income<=$50,000 Income>$50,000
5 Income<=$75,000 Income>$75,000

• Note the difference in number of potential splits compared to a


CART decision tree
C4.5 Using an Example
First, we will calculate the entropy of the initial data set
◦ We will use this for comparison to determine greatest entropy reduction

Our variable has 2 outcomes (good versus bad credit risk)


◦ Therefore our k=2 here
◦ There are 5 good credit risks and 3 bad credit risks

Entropy = −(5/8) log2(5/8) − (3/8) log2(3/8) ≈ 0.9544
Split on savings
Using the data above: Savings = Low → records 2, 5, 7; Savings = Medium → records 1, 4, 8; Savings = High → records 3, 6.
C4.5 Using an Example
Consider split 1 (splitting by savings)
Low savings
◦ 3 observations; 1 is good and 2 are bad credit risk

Medium savings
◦ 3 observations; 3 are good and 0 are bad credit risk

High savings
◦ 2 observations, 1 is good and 1 is bad credit risk

Entropy for this split:
◦ HS(T) = (3/8)·H(low) + (3/8)·H(medium) + (2/8)·H(high) = (3/8)(0.9183) + (3/8)(0) + (2/8)(1) ≈ 0.5944

Entropy reduction = 0.9544 − 0.5944 ≈ 0.36
Split: Income <= $25K vs Income > $25K
Using the data above: Income <= $25,000 → records 3, 6, 7; Income > $25,000 → records 1, 2, 4, 5, 8.
C4.5 Using an Example
Consider split 3: Income<=$25,000
Income<=$25,000
◦ 3 observations; 1 good credit risk, 2 bad credit risk

Income>$25,000
◦ 5 observations; 4 good credit risk, 1 bad credit risk

Entropy for this split:
◦ HS(T) = (3/8)·H(income <= $25K) + (5/8)·H(income > $25K) = (3/8)(0.9183) + (5/8)(0.7219) ≈ 0.7956

Entropy reduction = 0.9544 − 0.7956 = 0.1588
C4.5 Using an Example
Potential split Baseline Entropy Split Entropy Entropy Reduction
1 0.9544 0.5944 0.36
2 0.9544 0.4057 0.5487
3 0.9544 0.7956 0.1588
4 0.9544 0.6069 0.3475
5 0.9544 0.8621 0.0923

We will first use a split along the assets dimension because it has the
highest entropy reduction (or information gain).
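A quick self-contained check (standard library only; the helper H is my own) that the assets split yields the 0.5487 entropy reduction shown in the table above.

```python
# Verify the entropy reduction for the three-way assets split (low / medium / high).
import math

def H(good, bad):
    """Entropy of a node containing the given counts of good and bad credit risks."""
    total = good + bad
    return -sum(p * math.log2(p) for p in (good / total, bad / total) if p > 0)

baseline = H(5, 3)                                              # all 8 records: ~0.9544
weighted = (2/8) * H(0, 2) + (4/8) * H(3, 1) + (2/8) * H(2, 0)  # low, medium, high assets
print(round(baseline - weighted, 4))                            # ~0.5487
```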
C4.5 Using an Example
Resulting tree after the first split:
Root node (all records): Assets = Low, Medium, or High?
◦ Assets = Low → Bad Risk (records 2, 7)
◦ Assets = Medium → Decision Node A (records 3, 4, 5, 8)
◦ Assets = High → Good Risk (records 1, 6)

Here we have 2 pure leaf nodes


The remaining decision node contains 4 observations
◦ These records represent all levels of savings and all levels of income, therefore we must compute
entropy again to determine the next predictor variable to branch on
C4.5 Using an Example
Root node (all records): Assets = Low, Medium, or High?
◦ Assets = Low → Bad Risk (records 2, 7)
◦ Assets = Medium → Decision Node A (records 3, 4, 5, 8): Savings = Low, Medium, or High?
  ◦ Savings = Low → Good Risk (record 5)
  ◦ Savings = Medium → Good Risk (records 4, 8)
  ◦ Savings = High → Bad Risk (record 3)
◦ Assets = High → Good Risk (records 1, 6)

Compare this tree to the one found using the CART method
Root Node (All
Records)
Assets = Low,
Medium, High? Assets =
Assets =
Low Medium or High Root Node (All
Records)
Bad Risk Decision Node A Assets = Low,
(Records 2, 7) (Records 1, 3, 4, 5, Medium, High?
6, 8)
Assets = Assets = Assets =
Savings = Savings = Low or Low Medium High
High Medium Bad Risk Good Risk
(Records 2, Decision Node (Records 1,
Decision Node B Good Risk A
(Records 3, 6)
7) 6)
(Records 1, 4, (Records 3, 4,
5, 8)
5, 8)

Assets = Assets =
High Medium Savings = Savings Savings =
Good Risk Bad Risk Low =Medium High
(Records 6) (Records 3)
Good Risk Good Risk Bad Risk
(Records 5) (Records 4, (Records 3)
8)
Comparison of CART and C4.5
◦ CART is used for both classification and regression; C4.5 is primarily classification
What does it mean to use CART for regression?
Suppose my example was trying to estimate a bank account owner's age instead of their credit risk
◦ This is a numerical target variable

CART can build a tree whose leaf nodes predict the average age of the training records that fall into them
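A sketch of CART used for regression (scikit-learn's DecisionTreeRegressor is a tooling assumption, and the age values below are made up for illustration). Each leaf predicts the mean of the target values of the training records that land in it.

```python
# Regression tree: same predictors as the example, but a numerical target (age).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.DataFrame({
    "savings": ["Medium", "Low", "High", "Medium", "Low", "High", "Low", "Medium"],
    "assets":  ["High", "Low", "Medium", "Medium", "Medium", "High", "Low", "Medium"],
    "age":     [42, 23, 55, 38, 31, 60, 27, 45],   # hypothetical numeric target
})
X = pd.get_dummies(data[["savings", "assets"]])
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, data["age"])
print(export_text(reg, feature_names=list(X.columns)))  # leaves show the mean age
```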
Comparison of CART and C4.5
◦ CART is used for both classification and regression; C4.5 is primarily classification
◦ CART can handle a mix of categorical and numerical predictor variables; C4.5 would need to adjust for numerical predictor variables
◦ CART creates only binary splits (leads to a certain shape of tree); C4.5 creates multi-way splits (can lead to a wide tree with nodes of small representation)
◦ CART naturally handles missing values with a separate branch; C4.5 requires imputation for missing values
◦ CART's criteria for splitting favors more balanced data sets; C4.5's favors more imbalanced data sets
◦ CART must do post-pruning; C4.5 allows for pre-pruning
What is pruning?
Just like with other models, we worry about overfitting – decision trees are susceptible to this
Pre-pruning
Prevent a tree from reaching its “full depth” by imposing “early stopping”
◦ Stops the tree from continuing to split in ways that would create very small leaf nodes
◦ Tactically, you impose rules like maximum depth or minimum leaf size
Post-pruning
Let the tree grow to “full depth” and then go back to eliminate branches
◦ Use cost complexity pruning (weakest link pruning) by giving trees a score that combines error and a penalty for tree size
◦ The size of the penalty is chosen using cross-validation (see the sketch below)
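A sketch of both pruning styles using scikit-learn (a tooling assumption; the dataset and parameter values below are purely illustrative):

```python
# Pre-pruning (early stopping) versus post-pruning (cost-complexity) sketches.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop early by capping depth and requiring a minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X_train, y_train)

# Post-pruning (weakest-link / cost-complexity): grow the tree, then prune using a
# penalty on tree size (ccp_alpha); in practice alpha is chosen by cross-validation.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                                     random_state=0).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```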
Comparison of CART and C4.5
Both can be run easily in RStudio
Either one can perform better on a certain data set
Both can be adjusted to include weights
◦ Favoring classification of a particular outcome

Both are fast


Other Algorithms
ID3 – precursor to C4.5
C5.0 – successor of C4.5
CHAID – chi-squared automatic interaction detection
Random Forest
