2000 CART
• Each dot represents the results of a student: success (green triangles) or failure
(red circles).
Building Intuition
• Goal: predict whether a student with a given GPA and SAT score will be
successfully admitted into at least one “reach” school.
Building Intuition
• Method: draw a minimal number of horizontal and vertical lines that partition
the predictor space into regions containing just one value of the target.
Building Intuition
• Condition: keep drawing until you hit the boundary or a line you've already drawn.
Building Intuition
• Would this new person (blue square) be predicted to gain entry or not into a
“reach” school?
Interpreting Trees
• This partition of the predictor space can be represented as a sequence of
simple decisions in the form of a decision tree.
Interpreting Trees
[Figure: a tree node displaying Current Prediction, Percent Target Value 1, and
Percent of Total Data Set]
• Every node has the same representation; the one shown is from a
classification tree.
• For a regression tree, there is no “Percent Target Value 1.” Why?
Interpreting Trees
• In the top node, we would predict 1 (Success), because 62% of people in this
group were successes; this node represents 100% of the data set.
Interpreting Trees
• Below, we have a rule that splits the node's records, e.g., GPA < 3.2.
• Left is always “yes,” and right is always “no.”
Interpreting Trees
• Our prediction at this stage (the node where the answer to “GPA < 3.8” is
“no”) is Success; 83% of records in this node are Successes.
• This node represents 75% of the total training set.
Interpreting Trees
• This condition is the same as drawing a vertical line at GPA = 3.8.
• This line stops at SAT = 1285.
Interpreting Trees
• In the tree context, this prediction is just following a path from the top to the
bottom, making the appropriate choices along the way.
Building Trees in R
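As a minimal sketch, a classification tree like the one above can be fit with the rpart package; the data frame `admit` and its columns (GPA, SAT, Success) are hypothetical stand-ins for the admissions example.

```r
# install.packages("rpart")  # if not already installed
library(rpart)

# 'admit' is an assumed data frame with columns GPA, SAT, and a binary Success
fit <- rpart(Success ~ GPA + SAT,
             data = admit,
             method = "class")  # "class" requests a classification tree

print(fit)                      # text view of the splits
plot(fit); text(fit, use.n = TRUE)  # base-graphics view of the tree
```

For a regression tree (numeric target), `method = "anova"` would be used instead.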
Splitting
• The leaves here contain more than one value of the target!
• How do we know if we should split? How do we know where to split?
Splitting
• For categorical targets, a perfectly pure node contains only one value of the target.
• The closer we can get to this goal the better.
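One common way to measure how close a node is to purity is Gini impurity, which rpart uses by default for classification; this small function is our own illustration, not part of any package.

```r
# Gini impurity of a vector of class labels: 0 for a perfectly pure node
gini <- function(labels) {
  p <- table(labels) / length(labels)  # proportion of each class
  1 - sum(p^2)
}

gini(c(1, 1, 1, 1))  # pure node -> 0
gini(c(1, 1, 0, 0))  # maximally mixed binary node -> 0.5
```

A good split is one whose child nodes have much lower impurity than the parent.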
Splitting
Even given a splitting rule, how do we know when to stop building the tree?
There are several such stopping rules:
• The node contains records that all have the same value of the target, i.e., the
node is perfectly pure.
• minsplit: The node is below a user-specified minimum size, typically something
like 2% of the total number of records.
• minbucket: One of the nodes that would result from the split is below a
user-specified minimum size, typically something like 1% of the total number of
records.
• cp (complexity parameter): Performing a split wouldn't improve the predictive
power of the model by some user-specified amount.
Stopping Rules and Pruning
• As we add more complexity to our tree model, its performance on training gets
better and better -- with a big enough tree, every node is perfectly pure!
Stopping Rules and Pruning
• The performance on test gets better for a while, then starts to get worse.
• At some point, the splits in our tree are just modeling noise in training.
Stopping Rules and Pruning
• We want to find the tree size that minimizes error rate on test.
• We want to find the tree that is neither overfit nor underfit.
Stopping Rules and Pruning
• The pruned tree is just the original with all the unnecessary branches removed.
• Pruned trees often perform worse on training and better on test than unpruned trees.
Stopping Rules and Pruning
• Earlier, we learned there's a problem if we're using test data to build our
model.
• The whole point of test was to hold it out until we had made our model!
• There's a clever way to get around this called k-fold cross-validation.
Cross Validation
• We repeat using pieces 1, 2, and 4 as our “training” and piece 3 as our
“test” . . .
Cross Validation
• . . . and finally using pieces 1, 2, and 3 as our “training” and piece 4 as
our “test.”
Cross Validation
• Each fold grows a different tree -- notice that the splits might not be the same!
• After each split, the error rates across all k trees are averaged.
Cross Validation
So how can we use k-fold cross-validation to build a tree of the “best” size?
• We build k different trees, each one using one of the k pieces of the real
training set as its “test.”
• Each time we add a split in each of the k trees, we keep track of the
performance of this new tree on its “test” data.
• For a given tree size, we average the error rate across all k trees.
• We define the “best” tree size to be the one with the lowest average error rate.
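rpart runs this cross-validation automatically (the xval argument, 10 folds by default) and records the cross-validated error for each tree size in the fitted model's cp table; `fit` here is a tree fit earlier with rpart.

```r
# Cross-validated error ("xerror") at each tree size
printcp(fit)   # table of CP, nsplit, rel error, xerror, xstd
plotcp(fit)    # plot of xerror against tree size

# The "best" cp is the one with the lowest cross-validated error
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
```

A common variant is the one-standard-error rule: choose the smallest tree whose xerror is within one xstd of the minimum.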
Cross Validation
• Cross-validation allows us to estimate the error on test without ever using the
test data.
Stopping Rules and Pruning in R
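Putting the pieces together, pruning in R is a single call to prune() with the cp value chosen by cross-validation; `fit` is assumed to be a tree fit earlier with rpart.

```r
# Prune back to the tree size chosen by cross-validation
best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)

plot(pruned); text(pruned, use.n = TRUE)
```

The pruned tree is the original with the branches below the chosen cp removed, which typically hurts training error slightly while improving test error.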