Lec 16, 17

Decision Trees

Prof. Navneet Goyal


BITS, Pilani
Classification Techniques
§ Decision Tree based Methods
§ Distance-based Methods
§ Neural Networks
§ Naïve Bayes and Bayesian Belief Networks
§ Support Vector Machines
General Approach
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the learning algorithm is applied to the training set to learn a model.

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is then applied to the test set to predict the unknown class labels.

Figure taken from text book (Tan, Steinbach, Kumar)


Classification by Decision
Tree Induction
§ A decision tree is a classification scheme
§ Represents a model of the different classes
§ Generates a tree & a set of rules
§ A node without children is a leaf node; otherwise it is an internal node
§ Each internal node has an associated splitting predicate, e.g. binary predicates
§ Example predicates:
Age <= 20
Profession in {student, teacher}
5000*Age + 3*Salary – 10000 > 0
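As an illustrative sketch (not from the slides), such splitting predicates can be written as boolean functions of a record; the field names Age, Profession and Salary follow the slide's examples, while the record values are made up:

```python
# Illustrative sketch: the slide's splitting predicates as boolean functions of a record.
record = {"Age": 18, "Profession": "student", "Salary": 2000}

predicates = [
    lambda r: r["Age"] <= 20,
    lambda r: r["Profession"] in {"student", "teacher"},
    lambda r: 5000 * r["Age"] + 3 * r["Salary"] - 10000 > 0,
]

for p in predicates:
    print(p(record))   # True, True, True for this record
```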
Classification by Decision
Tree Induction
t Decision tree
t A flow-chart-like tree structure
t Internal node denotes a test on an attribute
t Branch represents an outcome of the test
t Leaf nodes represent class labels or class distribution
t Decision tree generation consists of two phases
t Tree construction
t At start, all the training examples are at the root
t Partition examples recursively based on selected
attributes
t Tree pruning
t Identify and remove branches that reflect noise or outliers

t Use of decision tree: Classifying an unknown sample


t Test the attribute values of the sample against the decision
tree
Classification by Decision
Tree Induction
Decision tree classifiers are very popular.
WHY?
§It does not require any domain knowledge or
parameter setting, and is therefore suitable for
exploratory knowledge discovery
§DTs can handle high dimensional data
§Representation of acquired knowledge in tree
form is intuitive and easy to assimilate by humans
§Learning and classification steps are simple & fast
§Good accuracy
Classification by Decision
Tree Induction
Main Algorithms
§Hunt’s algorithm
§ID3
§C4.5
§CART
§SLIQ,SPRINT
Example of a Decision Tree
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes)
Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO


Figure taken from text book (Tan, Steinbach, Kumar)
Another Example of a Decision Tree

Same training data (Refund, Marital Status, Taxable Income -> Cheat), different model:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!

Figure taken from text book (Tan, Steinbach, Kumar)


Some Questions
§ Which tree is better and why?
§ How many decision trees?
§ How to find the optimal tree?
§ Is it computationally feasible?
(Try constructing a suboptimal tree in
reasonable amount of time – greedy
algorithm)
§ What should be the order of split?
§ Look for answers in “20 questions” &
“Guess Who” games!
Apply Model to Test Data

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch selected by each test condition:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

The test record takes the Refund = No branch, then the MarSt = Married branch, and reaches the
leaf NO: assign Cheat to "No".

Figure taken from text book (Tan, Steinbach, Kumar)
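The traversal above can be written as a short sketch; the nested-dict encoding is an assumption, but the splits and the test record come from the figure:

```python
# A minimal sketch: the tree from the figure as nested dicts, where an internal node is
# {"attr": ..., "branches": {...}} and a leaf is just the predicted class label string.
tree = {"attr": "Refund", "branches": {
    "Yes": "No",
    "No": {"attr": "MarSt", "branches": {
        "Married": "No",
        "Single": {"attr": "TaxInc", "branches": {"< 80K": "No", "> 80K": "Yes"}},
        "Divorced": {"attr": "TaxInc", "branches": {"< 80K": "No", "> 80K": "Yes"}},
    }},
}}

def classify(tree, record):
    """Start from the root and follow the branch selected by each attribute test."""
    node = tree
    while isinstance(node, dict):
        value = record[node["attr"]]
        if node["attr"] == "TaxInc":                   # continuous attribute: binarize at 80K
            value = "< 80K" if record["TaxInc"] < 80 else "> 80K"
        node = node["branches"][value]
    return node

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))    # -> "No": assign Cheat to "No"
```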


Decision Trees: Example
Training Data Set
Outlook   Temp  Humidity  Windy  Class
Sunny     79    90        True   No Play
Sunny     56    70        False  Play
Sunny     79    75        True   Play
Sunny     60    90        True   No Play
Overcast  88    88        False  Play
Overcast  63    75        True   Play
Overcast  88    95        False  Play
Rain      78    60        False  Play
Rain      66    70        False  No Play
Rain      68    60        True   No Play

Numerical attributes: Temperature, Humidity
Categorical attributes: Outlook, Windy
Class label: Class (Play / No Play)
Decision Trees: Example
Sample Decision Tree

Outlook?
  sunny    -> Humidity?
                <= 75 -> Play
                > 75  -> No Play
  overcast -> Play
  rain     -> Windy?
                true  -> No Play
                false -> Play

Five leaf nodes – each represents a rule

Decision Trees: Example
Rules corresponding to the given tree
1. If it is a sunny day and humidity is not above 75%,
then play.
2. If it is a sunny day and humidity is above 75%, then do
not play.
3. If it is overcast, then play.
4. If it is rainy and not windy, then play.
5. If it is rainy and windy, then do not play.

Is it the best classification ????


Decision Trees: Example
Classification of new record
New record: outlook=rain, temp =70, humidity=65,
windy=true.
Class: “No Play”
Accuracy of the classifier

determined by the percentage of the test data set that is


correctly classified
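As a sketch, the five rules can be written as a single function; applied to the new record it reproduces the "No Play" classification above:

```python
def classify(outlook, temp, humidity, windy):
    """The five rules read off the sample decision tree (temperature is not used by any rule)."""
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "No Play"      # rules 1 and 2
    if outlook == "overcast":
        return "Play"                                        # rule 3
    if outlook == "rain":
        return "No Play" if windy else "Play"                # rules 5 and 4
    raise ValueError("unknown outlook value")

print(classify("rain", 70, 65, True))   # -> "No Play"
```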
Decision Trees: Example

Test Data Set


Outlook   Temp  Humidity  Windy  Class
Sunny     79    90        True   Play
Sunny     56    70        False  Play
Sunny     79    75        True   No Play
Sunny     60    90        True   No Play
Overcast  88    88        False  No Play
Overcast  63    75        True   Play
Overcast  88    95        False  Play
Rain      78    60        False  Play
Rain      66    70        False  No Play
Rain      68    60        True   Play

Rule 1 (sunny & humidity <= 75): two records, one correctly classified – accuracy = 50%
Rule 2 (sunny & humidity > 75): accuracy = 50%
Rule 3 (overcast): accuracy = 66%
Practical Issues of
Classification
t Underfitting and Overfitting
t Missing Values
t Costs of Classification
Overfitting the Data
t A classification model commits two kinds of
errors:
t Training Errors (TE) (resubstitution, apparent errors)
t Generalization Errors (GE)
t A good classification model must have low TE
as well as low GE
t A model that fits the training data too well can
have a higher GE than a model with a higher TE
t This problem is known as model overfitting
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large. TE & GE
are large when the size of the tree is very small.
This occurs because the model has yet to learn the true structure of the data, and as a result it
performs poorly on both the training and test sets.
Figure taken from text book (Tan, Steinbach, Kumar)
Overfitting the Data

t When a decision tree is built, many of the


branches may reflect anomalies in the training
data due to noise or outliers.
t We may grow the tree just deeply enough to
perfectly classify the training data set.
t This problem is known as overfitting the data.
Overfitting the Data
t TE of a model can be reduced by increasing
the model complexity
t Leaf nodes of the tree can be expanded until
it perfectly fits the training data
t TE for such a complex tree = 0
t GE can be large because the tree may
accidently fit noise points in the training set
t Overfitting & underfitting are two pathologies
that are related to model complexity
Occam’s Razor
t Given two models of similar generalization
errors, one should prefer the simpler
model over the more complex model
t For complex models, there is a greater
chance that the model was fitted accidentally by
errors in the data
t Therefore, one should include model
complexity when evaluating a model
Definition

A decision tree T is said to overfit the training


data if there exists some other tree T’ which is
a simplification of T, such that T has smaller
error than T’ over the training set but T’ has a
smaller error than T over the entire distribution
of the instances.
Problems of Overfitting
Overfitting can lead to many difficulties:
t Overfitted models are incorrect.
t Require more space and more computational
resources
t Require collection of unnecessary features
t They are more difficult to comprehend
Overfitting
Overfitting can be due to:
1. Presence of Noise
2. Lack of representative samples
Overfitting: Example
Presence of Noise:Training Set
Name           Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Porcupine      Warm Blooded      Y            Y         Y           Y
Cat            Warm Blooded      Y            Y         N           Y
Bat            Warm Blooded      Y            N         Y           N
Whale          Warm Blooded      Y            N         N           N
Salamander     Cold Blooded      N            Y         Y           N
Komodo Dragon  Cold Blooded      N            Y         N           N
Python         Cold Blooded      N            N         Y           N
Salmon         Cold Blooded      N            N         N           N
Eagle          Warm Blooded      N            N         N           N
Guppy          Cold Blooded      Y            N         N           N

Table taken from text book (Tan, Steinbach, Kumar)


Overfitting: Example
Presence of Noise:Test Set
Name            Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Human           Warm Blooded      Y            N         N           Y (classified as N)
Pigeon          Warm Blooded      N            N         N           N
Elephant        Warm Blooded      Y            Y         N           Y
Leopard Shark   Cold Blooded      Y            N         N           N
Turtle          Cold Blooded      N            Y         N           N
Penguin         Cold Blooded      N            N         N           N
Eel             Cold Blooded      N            N         N           N
Dolphin         Warm Blooded      Y            N         N           Y (classified as N)
Spiny Anteater  Warm Blooded      N            Y         Y           Y
Gila Monster    Cold Blooded      N            Y         Y           N

Table taken from text book (Tan, Steinbach, Kumar)


Overfitting: Example
Presence of Noise: Models
Model M1 (TE = 0%, GE = 30%):
Body Temp?
  Warm blooded -> Gives Birth?
                    Yes -> 4-legged?
                             Yes -> Mammals
                             No  -> Non-mammals
                    No  -> Non-mammals
  Cold blooded -> Non-mammals

Model M2 (TE = 20%, GE = 10%):
Body Temp?
  Warm blooded -> Gives Birth?
                    Yes -> Mammals
                    No  -> Non-mammals
  Cold blooded -> Non-mammals

Find out why!
Figure taken from text book (Tan, Steinbach, Kumar)
Overfitting: Example
Lack of representative samples: Training Set

Name        Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Salamander  Cold Blooded      N            Y         Y           N
Eagle       Warm Blooded      N            N         N           N
Guppy       Cold Blooded      Y            N         N           N
Poorwill    Warm Blooded      N            N         Y           N
Platypus    Warm Blooded      N            Y         Y           Y

Table taken from text book (Tan, Steinbach, Kumar)


Overfitting: Example
Lack of representative samples: Model M3 (TE = 0%, GE = 30%) – find out why!

Body Temp?
  Warm blooded -> Hibernates?
                    Yes -> 4-legged?
                             Yes -> Mammals
                             No  -> Non-mammals
                    No  -> Non-mammals
  Cold blooded -> Non-mammals

Humans, elephants, and dolphins are misclassified because the DT classifies all warm-blooded
vertebrates which do not hibernate as non-mammals. The DT arrives at this decision because
there is only one training record with such characteristics.
Figure taken from text book (Tan, Steinbach, Kumar)


Overfitting due to Noise

Decision boundary is distorted by noise point

Figure taken from text book (Tan, Steinbach, Kumar)


Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision
tree to predict the test examples using other training records that are
irrelevant to the classification task
Figure taken from text book (Tan, Steinbach, Kumar)
How to Address Overfitting
t Pre-Pruning (Early Stopping Rule)
t Stop the algorithm before it becomes a fully-grown tree
t Typical stopping conditions for a node:
t Stop if all instances belong to the same class
t Stop if all the attribute values are the same
t More restrictive conditions:
t Stop if number of instances is less than some user-specified
threshold
t Stop if the class distribution of instances is independent of the
available features (e.g., using the χ² test)
t Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
How to Address Overfitting…
t Post-pruning
t Grow decision tree to its entirety
t Trim the nodes of the decision tree in a
bottom-up fashion
t If generalization error improves after
trimming, replace sub-tree by a leaf node.
t Class label of leaf node is determined
from majority class of instances in the
sub-tree
t Can use MDL for post-pruning
Post-pruning

Post-pruning removes the branches of a fully grown tree.
t Subtree replacement replaces a subtree with
a single leaf node

(Figure: the subtree rooted at Price, with leaves Yes ($), Yes ($$), No ($$$), is replaced by a
single leaf labelled Yes.)
Post-pruning
t Subtree raising moves a subtree to a higher
level in the decision tree, subsuming its parent

(Figure: the subtree rooted at Price ($ / $$ / $$$) is raised to replace its parent node,
subsuming it.)
Overfitting: Example
Presence of Noise:Training Set
Name           Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Porcupine      Warm Blooded      Y            Y         Y           Y
Cat            Warm Blooded      Y            Y         N           Y
Bat            Warm Blooded      Y            N         Y           N*
Whale          Warm Blooded      Y            N         N           N*
Salamander     Cold Blooded      N            Y         Y           N
Komodo Dragon  Cold Blooded      N            Y         N           N
Python         Cold Blooded      N            N         Y           N
Salmon         Cold Blooded      N            N         N           N
Eagle          Warm Blooded      N            N         N           N
Guppy          Cold Blooded      Y            N         N           N

Table taken from text book (Tan, Steinbach, Kumar)


Post-pruning: Techniques
t Cost Complexity pruning Algorithm: pruning
operation is performed if it does not increase the
estimated error rate.
§ Of course, error on the training data is not the useful
estimator (would result in almost no pruning)
t Minimum Description Length Algorithm: states
that the best tree is the one that can be encoded
using the fewest number of bits.
§ The challenge for the pruning phase is to find the subtree
that can be encoded with the least number of bits.
Hunt’s Algorithm
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

t Let Dt be the set of training records that reach a node t
t Let y = {y1, y2, ..., yc} be the class labels

Step 1:
t If Dt contains records that all belong to the same class yt, then t is a leaf
node labeled as yt. If Dt is an empty set, then t is a leaf node labeled by
the default class, yd

Step 2:
t If Dt contains records that belong to more than one class, use an
attribute test to split the data into smaller subsets. Recursively apply
the procedure to each child node

Figure taken from text book (Tan, Steinbach, Kumar)


Hunt's Algorithm
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Growing the tree step by step:
1. Single leaf: Don't Cheat (the majority class).
2. Split on Refund:
   Refund? Yes -> Don't Cheat
           No  -> Don't Cheat
3. Split the Refund = No branch on Marital Status:
   Refund? Yes -> Don't Cheat
           No  -> Marital Status? Single, Divorced -> Cheat
                                  Married          -> Don't Cheat
4. Split the Single/Divorced branch on Taxable Income:
   Refund? Yes -> Don't Cheat
           No  -> Marital Status? Single, Divorced -> Taxable Income? < 80K  -> Don't Cheat
                                                                      >= 80K -> Cheat
                                  Married          -> Don't Cheat
Figure taken from text book (Tan, Steinbach, Kumar)
Hunt’s Algorithm
t Should handle the following additional conditions:
t Child nodes created in step 2 are empty. When can this happen?
Declare the node as leaf node (majority class label of the training
records of parent node)
t In step 2, if all the records associated with Dt have identical attributes
(except for the class label), then it is not possible to split these
records further. Declare the node as leaf with the same class label
as the majority class of training records associated with this node.
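A compact, illustrative sketch of Hunt's algorithm as described above, assuming records are dictionaries; the split-selection step is left as a placeholder (a real implementation would pick the attribute test that optimizes an impurity measure), and the helper name majority_class is not from the slides:

```python
from collections import Counter

def majority_class(records, label):
    return Counter(r[label] for r in records).most_common(1)[0][0]

def hunt(records, attributes, label, default=None):
    """Recursive skeleton of Hunt's algorithm; split selection is deliberately abstract."""
    if not records:                                   # empty Dt: leaf labelled with the parent majority
        return {"leaf": default}
    classes = {r[label] for r in records}
    if len(classes) == 1:                             # Step 1: all records belong to the same class
        return {"leaf": classes.pop()}
    if not attributes or all(                         # identical attribute values: cannot split further
        len({r[a] for r in records}) == 1 for a in attributes
    ):
        return {"leaf": majority_class(records, label)}
    attr = attributes[0]                              # placeholder choice; a real implementation picks
                                                      # the test that optimizes Gini / information gain
    node = {"attr": attr, "branches": {}}
    parent_majority = majority_class(records, label)
    for value in {r[attr] for r in records}:          # Step 2: partition and recurse on each child
        subset = [r for r in records if r[attr] == value]
        rest = [a for a in attributes if a != attr]
        node["branches"][value] = hunt(subset, rest, label, default=parent_majority)
    return node
```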
Tree Induction
t Greedy strategy.
t Split the records based on an attribute
test that optimizes certain criterion.

t Issues
t Determine how to split the records
t How to specify the attribute test condition?
t How to determine the best split?

t Determine when to stop splitting


Hunt’s Algorithm
t Design Issues of Decision Tree Induction
t How should the training records be split?
At each recursive step, an attribute test condition must be selected.
Algorithm must provide a method for specifying the test condition for
diff. attrib. types as well as an objective measure for evaluating the
goodness of each test condition
t How should the splitting procedure stop?
Stopping condition is needed to terminate the tree-growing process.
Stop when:
- all records belong to the same class
- all records have identical values
- both conditions are sufficient to stop any DT induction algo., other
criterion can be imposed to terminate the procedure early (do we
need to do this? Think of model over-fitting!)
How to determine the Best Split?
Before splitting: 10 records of class C0, 10 records of class C1

Candidate test conditions:
  Own Car?     Yes: C0: 6, C1: 4         No: C0: 4, C1: 6
  Car Type?    Family: C0: 1, C1: 3      Sports: C0: 8, C1: 0      Luxury: C0: 1, C1: 7
  Student ID?  c1 ... c10: C0: 1, C1: 0 each      c11 ... c20: C0: 0, C1: 1 each

Which test condition is the best?

Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
How to determine the Best Split?

t Greedy approach:
t Nodes with homogeneous class
distribution are preferred
t Need a measure of node impurity:
C0: 5 C0: 9
C1: 5 C1: 1

Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity

Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Measures of Node Impurity
- Based on the degree of impurity of the child nodes
- Less impurity means a more skewed class distribution:
  a node with class distribution (1, 0) has zero impurity, whereas a node with
  class distribution (0.5, 0.5) has the highest impurity
t Gini Index
t Entropy
t Misclassification error
Measures of Node Impurity
t Gini Index
    GINI(t) = 1 - Σ_j [p(j|t)]²

t Entropy
    Entropy(t) = - Σ_j p(j|t) log p(j|t)

t Misclassification error
    Error(t) = 1 - max_i P(i|t)
Comparison among
Splitting Criteria
For a 2-class problem:

(Figure: the three impurity measures plotted against the fraction of records belonging to one of
the two classes; the y-axis is impurity.)
Figure taken from text book (Tan, Steinbach, Kumar)


How to find the best split?
t Example:
  Node N1: C0 = 0, C1 = 6    GINI = 0       Entropy = 0      Error = 0
  Node N2: C0 = 1, C1 = 5    GINI = 0.278   Entropy = 0.650  Error = 0.167
  Node N3: C0 = 3, C1 = 3    GINI = 0.5     Entropy = 1      Error = 0.5

  N1 has the lowest impurity value, followed by N2 and N3
How to find the best split?
t The 3 measures have similar characteristic curves
t Despite this, the attribute chosen as the test condition may vary
depending on the choice of the impurity measure
t Need to normalize these measures!
t Introducing the gain, Δ:

    Δ = I(parent) - Σ_{j=1}^{k} [ N(v_j) / N ] · I(v_j)

  where I = impurity measure of a given node
        N = total no. of records at the parent node
        k = no. of attribute values
        N(v_j) = no. of records associated with the child node v_j
t I(parent) is the same for all test conditions
t When entropy is used as I, Δ is called the Information Gain, Δ_info
t The larger the Gain, the better is the split
t Is it the best measure?
How to Find the Best Split?
Before splitting:  C0: N00, C1: N01  ->  impurity M0

Candidate split A? (Yes / No)              Candidate split B? (Yes / No)
  Node N1: C0: N10, C1: N11                  Node N3: C0: N30, C1: N31
  Node N2: C0: N20, C1: N21                  Node N4: C0: N40, C1: N41
  impurities M1, M2 -> weighted M12          impurities M3, M4 -> weighted M34

Compare Gain = M0 – M12 vs. M0 – M34
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Measure of Impurity: GINI
t Gini Index for a given node t:

    GINI(t) = 1 - Σ_j [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t)

t Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally
  distributed among all classes, implying the least interesting information
t Minimum (0.0) when all records belong to one class, implying the most interesting
  information

  C1: 0          C1: 1          C1: 2          C1: 3
  C2: 6          C2: 5          C2: 4          C2: 3
  Gini = 0.000   Gini = 0.278   Gini = 0.444   Gini = 0.500

Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
Examples for computing GINI
    GINI(t) = 1 - Σ_j [p(j|t)]²

C1: 0, C2: 6    P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
                Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5    P(C1) = 1/6       P(C2) = 5/6
                Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4    P(C1) = 2/6       P(C2) = 4/6
                Gini = 1 – (2/6)² – (4/6)² = 0.444
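A small sketch that reproduces the three computations above:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
```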
Splitting Based on GINI
t Used in CART, SLIQ, SPRINT.
t When a node p is split into k partitions (children), the
quality of split is computed as,
    GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

  where n_i = number of records at child i,
        n = number of records at node p.
Binary Attributes: Computing
GINI Index
! Splits into two partitions
! Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split on A?            N1 (Yes)   N2 (No)
  C1                   4          2
  C2                   3          3
  Gini(N1) = 1 – (4/7)² – (3/7)² = 0.490
  Gini(N2) = 1 – (2/5)² – (3/5)² = 0.480
  Gini(Children) = 7/12 × 0.490 + 5/12 × 0.480 = 0.486

Split on B?            N1 (Yes)   N2 (No)
  C1                   5          1
  C2                   2          4
  Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
  Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Attribute B is preferred over A
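The weighted Gini of each candidate split can be checked with a short sketch; the class counts are the ones in the two tables above:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i / n) * GINI(i)."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# (C1, C2) counts in nodes N1 and N2 for each candidate attribute, from the slides.
print(round(gini_split([(4, 3), (2, 3)]), 3))   # A: 0.486
print(round(gini_split([(5, 2), (1, 4)]), 3))   # B: 0.371 -> attribute B is preferred
```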
Categorical Attributes: Computing Gini Index

t For each distinct value, gather counts for each


class in the dataset
t Use the count matrix to make decisions
Multi-way split:
            Family  Sports  Luxury
  C1        1       2       1
  C2        4       1       1
  Gini = 0.393

Two-way splits (find the best partition of values):
            {Sports,Luxury}  {Family}               {Sports}  {Family,Luxury}
  C1        3                1                 C1   2         2
  C2        2                4                 C2   1         5
  Gini = 0.400                                 Gini = 0.419

GINI favours multiway splits!!


Continuous Attributes: Computing Gini Index

t Use binary decisions based on one value
t Several choices for the splitting value
  t Number of possible splitting values = number of distinct values
t Each splitting value v has a count matrix associated with it
  t Class counts in each of the partitions, A < v and A >= v
t Simple method to choose the best v
  t For each v, scan the database to gather the count matrix and compute its Gini index
  t Computationally inefficient! Repetition of work.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Example test condition: Taxable Income > 80K?  (Yes / No)
Continuous Attributes: Computing Gini Index...

t For efficient computation: for each attribute,


t Sort the attribute on values
t Linearly scan these values, each time updating the count matrix
and computing gini index
t Choose the split position that has the least gini index

Sorted Taxable Income values and class labels (Cheat):

  Value:   60   70   75   85   90   95   100  120  125  220
  Cheat:   No   No   No   Yes  Yes  Yes  No   No   No   No

Candidate split positions (midpoints) and Gini of the resulting binary split:

  Split:   55    65    72    80    87    92    97    110   122   172   230
  Gini:    0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

For each split position the count matrix holds the Yes/No counts in the <= and > partitions;
the best split is at 97, with Gini = 0.300.

Exercise: find the time complexity in terms of the number of records!
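A sketch of the efficient scan described above, assuming the Taxable Income values and Cheat labels from the training table: sort once, then move one record at a time from the right partition to the left, updating the class counts and recomputing the Gini at each candidate midpoint.

```python
def best_split(values, labels):
    """Sort once, then linearly scan, updating the running class counts of both partitions."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {"Yes": 0, "No": 0}
    right = {"Yes": 0, "No": 0}
    for _, lab in pairs:
        right[lab] += 1

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    best = (float("inf"), None)
    for i in range(n - 1):
        value, lab = pairs[i]
        left[lab] += 1                                # move one record from right to left
        right[lab] -= 1
        split = (value + pairs[i + 1][0]) / 2         # candidate position between adjacent values
        weighted = ((i + 1) / n * gini(left, i + 1)
                    + (n - i - 1) / n * gini(right, n - i - 1))
        best = min(best, (weighted, split))
    return best

values = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(values, labels))   # -> (0.3, 97.5): the split at position 97 with Gini 0.300
```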


Alternative Splitting Criteria based on INFO
t Entropy at a given node t:

    Entropy(t) = - Σ_j p(j|t) log p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t)

t Measures the homogeneity of a node
  t Maximum (log n_c) when records are equally distributed among all classes,
    implying the least information
  t Minimum (0.0) when all records belong to one class, implying the most information
t Entropy based computations are similar to the
GINI index computations
Examples for computing
Entropy
    Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

C1: 0, C2: 6    P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
                Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1: 1, C2: 5    P(C1) = 1/6       P(C2) = 5/6
                Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4    P(C1) = 2/6       P(C2) = 4/6
                Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
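The same three nodes with entropy, as a small sketch (log base 2, with 0·log 0 taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t), skipping empty classes (0*log 0 = 0)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))   # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```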
Splitting Based on INFO...
t Information Gain:

    GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i / n) Entropy(i)

  Parent node p is split into k partitions;
  n_i is the number of records in partition i
t Measures Reduction in Entropy achieved because of
the split. Choose the split that achieves most reduction
(maximizes GAIN)
t Used in ID3 and C4.5
t Disadvantage: Tends to prefer splits that result in large
number of partitions, each being small but pure.
Splitting Based on INFO...
t Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = - Σ_{i=1}^{k} (n_i / n) log (n_i / n)

  Parent node p is split into k partitions;
  n_i is the number of records in partition i

t Adjusts Information Gain by the entropy of the


partitioning (SplitINFO). Higher entropy partitioning
(large number of small partitions) is penalized!
t Used in C4.5
t Designed to overcome the disadvantage of Information
Gain
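A short illustrative sketch of the adjustment; the parent and child class counts below are made up for illustration. The 20-way, ID-like split has the higher information gain but the lower gain ratio, because SplitINFO penalizes a partitioning into many tiny pieces.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain_ratio(parent, children):
    """GainRATIO = (Entropy(parent) - weighted child entropy) / SplitINFO."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children if sum(c))
    return gain / split_info

parent = (10, 10)
two_way = [(9, 1), (1, 9)]                        # two fairly pure partitions
many_way = [(1, 0)] * 10 + [(0, 1)] * 10          # 20 pure singleton partitions (ID-like split)
print(round(gain_ratio(parent, two_way), 3))      # ~0.53
print(round(gain_ratio(parent, many_way), 3))     # ~0.23: lower, despite a higher raw gain
```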
Splitting Criteria based on Classification Error

t Classification error at a node t:

    Error(t) = 1 - max_i P(i|t)

t Measures misclassification error made by a node.


t Maximum (1 - 1/n_c) when records are equally
distributed among all classes, implying least interesting
information
t Minimum (0.0) when all records belong to one class,
implying most interesting information
Examples for Computing
Error
    Error(t) = 1 - max_i P(i|t)

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Misclassification Error vs Gini
Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?            N1 (Yes)   N2 (No)
  C1                   3          4
  C2                   0          3

  Gini(N1) = 1 – (3/3)² – (0/3)² = 0
  Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves!! Note that the weighted misclassification error does not: it is 0.3 at the parent
and 3/10 × 0 + 7/10 × 3/7 = 0.3 after the split, so error is less sensitive to this split than Gini.
Decision Tree Based
Classification
t Advantages:
t Inexpensive to construct
t Extremely fast at classifying unknown
records
t Easy to interpret for small-sized trees
t Accuracy is comparable to other
classification techniques for many simple
data sets
Example: C4.5
t Simple depth-first construction.
t Uses Information Gain
t Sorts Continuous Attributes at each
node.
t Needs entire data to fit in memory.
t Unsuitable for Large Datasets.
t Needs out-of-core sorting.

t You can download the software from:


http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Example
t Web robot or crawler
t Based on access patterns, distinguish between
human user and web robots
t Web Usage Mining
t BUILD a MODEL – use web log data
Summary: DT Classifiers
t Does not require any prior assumptions about the probability
distributions satisfied by the classes
t Finding the optimal DT is NP-complete
t Construction of a DT is fast even for large data sets
t Testing is also fast: O(w), where w = max. depth of the tree
t Robust to noise
t Irrelevant attributes can cause problems. (use
feature selection)
t Data fragmentation problem (leaf nodes having
very few records)
t Tree pruning has greater impact on the final tree
than choice of impurity measure
Decision Boundary
(Figure: 2-D data with attributes x and y in [0, 1], and the induced tree:
   x < 0.43?
     Yes -> y < 0.33?  (each branch leads to a pure leaf)
     No  -> y < 0.47?  (each branch leads to a pure leaf))

• The border line between two neighboring regions of different classes is known as the
  decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single
  attribute at a time
Oblique Decision Trees

(Figure: a single oblique split, x + y < 1, separates Class = + from Class = –.)

• The test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Tree Replication
(Figure: a tree rooted at P whose two branches, Q and R, both contain the same subtree
rooted at S.)

• The same subtree appears in multiple branches
• Is the split using P redundant? If so, remove P in post-pruning
Metrics for Performance Evaluation
t Focus on the predictive capability of a model
t Rather than how fast it takes to classify or build models,
scalability, etc.
t Confusion Matrix:

PREDICTED CLASS
Class=Yes Class=No a: TP (true positive)
b: FN (false negative)
Class=Yes a b
ACTUAL Class=No c: FP (false positive)
c d
CLASS d: TN (true negative)

t TP: predicted to be in YES, and is actually in it


t FP: predicted to be in YES, but is not actually in it
t TN: predicted not to be in YES, and is not actually in it
t FN: predicted not to be in YES, but is actually in it
Metrics for Performance
Evaluation…
PREDICTED CLASS
Class=Yes Class=No

Class=Yes a b
ACTUAL (TP) (FN)
CLASS Class=No c d
(FP) (TN)

t Most widely-used metric:


    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
t Consider a 2-class problem
t Number of Class 0 examples = 9990
t Number of Class 1 examples = 10

t If model predicts everything to be class


0, accuracy is 9990/10000 = 99.9 %
t Accuracy is misleading because model
does not detect any class 1 example
Cost Matrix
PREDICTED CLASS

C(i|j) Class=Yes Class=No

Class=Yes C(Yes|Yes) C(No|Yes)


ACTUAL
CLASS Class=No C(Yes|No) C(No|No)

C(i|j): Cost of misclassifying class j example as class i


Computing Cost of Classification
Cost Matrix:
  C(i|j)        PREDICTED +   PREDICTED -
  ACTUAL +      -1            100
  ACTUAL -      1             0

Model M1:
                PREDICTED +   PREDICTED -
  ACTUAL +      150           40
  ACTUAL -      60            250
  Accuracy = 80%, Cost = 3910

Model M2:
                PREDICTED +   PREDICTED -
  ACTUAL +      250           45
  ACTUAL -      5             200
  Accuracy = 90%, Cost = 4255
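A small sketch that reproduces the accuracies and costs above from the cost matrix and the two confusion matrices:

```python
# cost[(actual, predicted)] and confusion[(actual, predicted)] counts from the slides.
cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

def total_cost(confusion):
    return sum(confusion[k] * cost[k] for k in confusion)

def accuracy(confusion):
    correct = confusion[("+", "+")] + confusion[("-", "-")]
    return correct / sum(confusion.values())

for name, m in [("M1", M1), ("M2", M2)]:
    print(name, accuracy(m), total_cost(m))   # M1: 0.8, 3910   M2: 0.9, 4255
```

Despite its higher accuracy, M2 has the higher total cost because the cost matrix penalizes false negatives 100 times more heavily than false positives.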
Cost-Sensitive Measures
    Precision (p) = a / (a + c)

    Recall (r) = a / (a + b)

    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

! Precision is biased towards C(Yes|Yes) & C(Yes|No)
! Recall is biased towards C(Yes|Yes) & C(No|Yes)
! F-measure is biased towards all except C(No|No)

    Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
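A sketch of these measures for a generic confusion matrix, using the a/b/c/d labelling above; the example numbers are model M1's confusion matrix from the previous slide:

```python
def metrics(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN, following the slide's confusion-matrix labelling."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)           # harmonic mean of precision and recall
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f_measure, accuracy

p, r, f, acc = metrics(150, 40, 60, 250)          # model M1's counts
print(round(p, 3), round(r, 3), round(f, 3), round(acc, 3))   # 0.714 0.789 0.75 0.8
```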
Handling Overfitting in DTs
t Pre-pruning (Early Stopping Rule)
Stop the tree growing algorithm before generating a
fully grown tree that perfectly fits the entire training
data
t Post-Pruning
Grow the tree to its maximum size and then prune it
in a bottom-up fashion
Handling Overfitting in DTs
Pre-pruning (Early Stopping Rule)
t How it can be done?
t Stop expanding a leaf node when it becomes
“sufficiently” pure
t OR improvement in the GE falls below a threshold
t Adv – avoids generating overly complex subtrees
that overfit the training data
t Issue – difficult to choose the right threshold for
early termination
t High threshold – underfitted models
t Low threshold – not sufficient to overcome the
overfitting
Handling Overfitting in DTs
Post-pruning
t Grow the tree fully
t Prune the tree in a bottom up fashion
t Replace a subtree with a new leaf node whose
class label is determined from the majority class of
records affiliated with the subtree
t OR replace the subtree with the most frequently
used branch of the subtree
t Stop tree pruning when no further improvement is
observed
Handling Overfitting in DTs
Pre-pruning vs. Post-pruning
t Post-pruning tends to give better results than pre-
pruning as it makes pruning decisions based on a
fully grown tree
t Pre-pruning can suffer from premature termination
of the tree growing process
t Post-pruning can lead to wastage of additional
computations needed to grow the tree fully when
the tree is pruned
Decision Tree Example
Age Income Student Credit_rating Class:Buys_comp
Youth HIGH N FAIR N
Youth HIGH N EXCELLENT N
Middle_aged HIGH N FAIR Y
Senior MEDIUM N FAIR Y
Senior LOW Y FAIR Y
Senior LOW Y EXCELLENT N
Middle_aged LOW Y EXCELLENT Y
Youth MEDIUM N FAIR N
Youth LOW Y FAIR Y
Senior MEDIUM Y FAIR Y
Youth MEDIUM Y EXCELLENT Y
Middle_aged MEDIUM N EXCELLENT Y
Middle_aged HIGH Y FAIR Y
Senior MEDIUM N EXCELLENT N

Use Gain (with Entropy) to build DT
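A sketch of the first step of the exercise: compute the information gain of each attribute over the 14 records and select the root split. Age should come out on top, with a gain of about 0.246 bits.

```python
from math import log2
from collections import Counter

rows = [  # (Age, Income, Student, Credit_rating, Buys_comp) from the table above
    ("Youth","HIGH","N","FAIR","N"), ("Youth","HIGH","N","EXCELLENT","N"),
    ("Middle_aged","HIGH","N","FAIR","Y"), ("Senior","MEDIUM","N","FAIR","Y"),
    ("Senior","LOW","Y","FAIR","Y"), ("Senior","LOW","Y","EXCELLENT","N"),
    ("Middle_aged","LOW","Y","EXCELLENT","Y"), ("Youth","MEDIUM","N","FAIR","N"),
    ("Youth","LOW","Y","FAIR","Y"), ("Senior","MEDIUM","Y","FAIR","Y"),
    ("Youth","MEDIUM","Y","EXCELLENT","Y"), ("Middle_aged","MEDIUM","N","EXCELLENT","Y"),
    ("Middle_aged","HIGH","Y","FAIR","Y"), ("Senior","MEDIUM","N","EXCELLENT","N"),
]
attrs = ["Age", "Income", "Student", "Credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(col):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)                       # entropy of the whole set (9 Y, 5 N): ~0.940
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for i, name in enumerate(attrs):
    print(name, round(info_gain(i), 3))   # Age has the largest gain (~0.246) and becomes the root
```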
