Data Science

Lecture # 22
Decision Tree
• Let's look at the Decision Tree model, a popular method used for classification
• By the end of this lecture, you should be able to:
• Explain how a decision tree is used for classification
• Describe the process of constructing a decision tree for classification
• Interpret how a decision tree arrives at a classification decision

Note: All Images are taken from edx.org



Decision Tree Overview
• The idea behind a decision tree is to split the data into subsets where each subset belongs to only one class
• This is accomplished by dividing the input space into pure regions
• i.e. regions with samples from only one class
• With real data, completely pure subsets may not be possible, so we divide the data into subsets that are as pure as possible
• A decision tree makes classification decisions based on decision boundaries



Classification Using Decision Tree
• The root and internal nodes have test conditions
• Each leaf node has a class label associated with it
• A decision is made by traversing the decision tree
• At each node, the answer to the test condition determines which branch to traverse
• When a leaf node is reached, the category at the leaf node determines the decision
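To make this concrete, here is a minimal sketch of classification by tree traversal. The Node class, its fields, and the example tree are hypothetical illustrations, not taken from the lecture.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # Internal nodes carry a test condition; leaf nodes carry a class label.
    test: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None      # branch taken when the test is true
    no: Optional["Node"] = None       # branch taken when the test is false
    label: Optional[str] = None       # set only on leaf nodes

def classify(node: Node, sample: dict) -> str:
    # Traverse from the root, following the branch chosen by each test,
    # until a leaf is reached; the leaf's label is the decision.
    while node.label is None:
        node = node.yes if node.test(sample) else node.no
    return node.label

# Hypothetical two-level tree in the style of the mammal example.
tree = Node(
    test=lambda s: s["warm_blooded"],
    yes=Node(
        test=lambda s: s["gives_birth"],
        yes=Node(label="mammal"),
        no=Node(label="not mammal"),
    ),
    no=Node(label="not mammal"),
)

print(classify(tree, {"warm_blooded": True, "gives_birth": True}))  # mammal
```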



Classification Using Decision Tree
• The depth of a node is the number of edges from the root to that node
• The depth of the root node is zero
• The depth of the tree is the number of edges in the longest path
• The size of the tree is the number of nodes in the tree
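As a small illustration of these definitions, the sketch below computes depth and size for a toy tree represented as nested dicts; this representation is hypothetical, not the lecture's.

```python
def depth(node: dict) -> int:
    # Depth of a tree: number of edges on the longest root-to-leaf path.
    children = node.get("children", [])
    return 0 if not children else 1 + max(depth(c) for c in children)

def size(node: dict) -> int:
    # Size of a tree: total number of nodes.
    return 1 + sum(size(c) for c in node.get("children", []))

# Root with two children; one child has a child of its own.
toy = {"children": [{"children": []}, {"children": [{"children": []}]}]}
print(depth(toy))  # 2 (root -> child -> grandchild)
print(size(toy))   # 4 nodes
```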



Example Decision Tree
• This decision tree is used to classify an animal as a mammal or not a mammal


Constructing Decision Tree
• Constructing a decision tree consists of the following steps:
• Start with all samples at a node
• i.e. start with all samples at the root node
• Add additional nodes when the data is split into subsets
• Partition the samples based on an input variable to create the purest subsets
• i.e. each subset contains as many samples as possible belonging to just one class
• Repeat to partition the data into successively purer subsets
• Continue this process until the stopping criteria are satisfied
• An algorithm for constructing a decision tree model is called an induction algorithm
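For instance, scikit-learn's DecisionTreeClassifier implements such an induction algorithm. The sketch below fits one on a tiny made-up dataset; the feature values and labels are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [income, debt] per applicant, with made-up labels.
X = [[20, 30], [25, 10], [60, 5], [80, 40], [90, 10], [30, 50]]
y = ["no", "no", "yes", "yes", "yes", "no"]   # "yes" = likely to repay

# Induction: repeatedly split on the variable/threshold that gives the
# purest subsets (Gini impurity by default), until stopping criteria hit.
model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
model.fit(X, y)

print(export_text(model, feature_names=["income", "debt"]))
print(model.predict([[70, 15]]))  # ['yes']
```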


Greedy Approach
• At each split, the induction algorithm considers only the best way to split that particular portion of the data
• This is referred to as a greedy approach



How to Determine Best Split?
• Again, the goal is to partition the data into subsets that are as pure as possible
• In this example, the right partition produces more homogeneous subsets, since these contain more samples belonging to a single class



Impurity Measure
• Therefore, we need to measure the purity of a split
• The impurity measure of a node specifies how mixed the resulting subsets are
• We want the split that minimizes the impurity measure
• A commonly used impurity measure is the Gini index; other impurity measures are entropy and misclassification rate
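As a concrete illustration, the sketch below computes the Gini impurity of a node from its class labels, and the weighted impurity of a candidate split; the class counts are made up for the example.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # 0 means the node is pure; higher values mean more mixing.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    # Impurity of a split: weighted average of the children's impurities.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure  = ["red"] * 8                     # gini = 0.0
mixed = ["red"] * 4 + ["blue"] * 4      # gini = 0.5
print(gini(pure), gini(mixed))

# A split that mostly separates the classes has low weighted impurity.
print(split_impurity(["red"] * 5 + ["blue"], ["blue"] * 5 + ["red"]))
```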



What Variable to Split On?
• The other factor in determining the best way to partition a node is which variable to split on
• The decision tree algorithm tests all variables to determine the best way to split each node, using a purity measure such as the Gini index to compare the various possibilities
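A minimal sketch of this search, assuming numeric features stored as rows of feature values plus a label: try every feature and every candidate threshold, and keep the split with the lowest weighted Gini impurity. The data and helper names are illustrative, not from the lecture.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    # Exhaustively test every (feature, threshold) pair and keep the one
    # whose children have the lowest weighted Gini impurity.
    best = None  # (weighted_gini, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left  = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            n = len(labels)
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Toy data: columns are [income, debt]; labels invented for illustration.
rows   = [[20, 30], [25, 10], [60, 5], [80, 40], [90, 10], [30, 50]]
labels = ["no", "no", "yes", "yes", "yes", "no"]
print(best_split(rows, labels))  # splits on feature 0 (income)
```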



When to Stop Splitting a Node?
• Recall that the tree induction algorithm repeatedly splits nodes to get more and more homogeneous subsets
• So when does this process stop?
• All (or x% of) samples have the same class label
• The number of samples in the node reaches a minimum value
• The change in the impurity measure is smaller than a threshold
• The maximum tree depth is reached
• Others… (not discussed here)
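Several of these stopping criteria map directly onto scikit-learn hyperparameters; the values below are arbitrary examples, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=4,                 # stop when the maximum tree depth is reached
    min_samples_split=10,        # do not split nodes with fewer samples than this
    min_impurity_decrease=0.01,  # stop if the impurity improvement is below this threshold
)
```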



Tree Induction Example: Split 1
• Let's say we want to classify loan applicants as being likely to repay a loan, or not likely to repay a loan, based on their income and the amount of debt they have



Tree Induction Example: Split 1
• Building a decision tree for this classification problem could proceed as follows
• Consider the input space of this problem, as shown in the left figure
• One way to split this dataset into more homogeneous subsets is to consider the decision boundary where income is t1
• To the right of this decision boundary are mostly red samples
• The subsets are not completely homogeneous, but this is the best way to split the original dataset based on the variable income


Tree Induction Example: Split 2
• Income > t1 is represented at the root node
• This is the condition used to split the original dataset
• Samples with income > t1 are placed in the right subset and those with income < t1 in the left subset
• Because the right subset is almost pure, it is labeled RED
Tree Induction Example: Split 2
• RED means loan applicants likely to repay the loan
• The second step, then, is to determine how to split the region outlined in red
• The best way to split this data is specified by the second decision boundary, where debt equals t2
• This is represented in the decision tree on the right by adding a node with the condition debt > t2
• This region contains all blue samples, meaning that the loan applicant is not likely to repay the loan
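The finished two-split tree amounts to two nested tests; the sketch below writes it out directly, with made-up threshold values standing in for t1 and t2, and an assumed label for the remaining region.

```python
# Hypothetical thresholds standing in for t1 (income) and t2 (debt).
T1_INCOME = 50_000
T2_DEBT = 20_000

def classify_applicant(income: float, debt: float) -> str:
    # Root node: income > t1 -> the region labeled RED (likely to repay).
    if income > T1_INCOME:
        return "likely to repay"
    # Second split: debt > t2 -> the all-blue region (not likely to repay).
    if debt > T2_DEBT:
        return "not likely to repay"
    # Remaining region: label assumed here for illustration (majority class).
    return "likely to repay"

print(classify_applicant(income=60_000, debt=30_000))  # likely to repay
print(classify_applicant(income=30_000, debt=30_000))  # not likely to repay
```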



Decision Boundaries
• The final decision tree implements the decision boundaries shown as dashed lines in the left diagram
• The label for each region is determined by the label of the majority of the samples
• These labels are reflected in the leaf nodes of the decision tree shown on the right



Decision Boundaries
• Notice that the decision boundaries are parallel to the axes, referred to as rectilinear
• The boundaries are rectilinear because each split considers only a single variable
• Some algorithms can consider more than one variable per split
• However, each such split has to consider all combinations of the combined variables
• Such induction algorithms are more computationally intensive



Decision Tree for Classification
• There are a few important things to note about the decision tree classifier
• The resulting tree is often simple and easy to understand
• Induction is computationally inexpensive, so training a decision tree for classification can be relatively fast
• The greedy approach does not guarantee the best solution
• Decision boundaries are rectilinear, which means it may not be able to solve complicated classification problems that require complex decision boundaries
• Discuss Week 7 notebooks

