Chap4 - Basic - Classification - Class Teaching
A Couple of Questions:
− What is this?
− Why do you think that?
− How have you come to that knowledge?
Warning: models are only approximating examples! They are not guaranteed to be correct or complete!
(Figure: two pictures, each asking "Tree?")
A decision tree is a flowchart-like structure.
Each internal node represents a “test” on an attribute
(e.g. whether a coin flip comes up heads or tails),
Each branch represents the outcome of the test, and
Each leaf node represents a class label
(decision taken after computing all attributes).
The paths from root to leaf represent classification rules.
Tree-based learning algorithms are among the best and most widely used supervised learning
methods.
Tree-based methods give predictive models high accuracy, stability and ease of interpretation.
Unlike linear models, they map non-linear relationships quite well.
They are adaptable to either kind of problem at hand (classification or
regression).
Decision Tree algorithms are referred to as CART (Classification and Regression
Trees).
Classification techniques are most suited for predicting or describing data sets with
binary or nominal categories.
They are less effective for ordinal categories (e.g., to classify a person as a member of
high-, medium-, or low income group) because they do not consider the implicit order
among the categories.
Decision Node: when a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: a node that does not split is called a leaf or terminal node.
Parent and Child Node: a node that is divided into sub-nodes is called the parent node
of those sub-nodes, and the sub-nodes are the children of the parent node.
How a decision tree is built (greedy, top-down):
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets, such that each subset contains data with the same value for an attribute.
3. Repeat steps 1 and 2 on each subset until leaf nodes are reached in all branches of the tree.
A minimal sketch of these steps is shown below.
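To make the steps concrete, here is a minimal sketch in Python; it is not the exact algorithm from the slides, the helper names gini, best_attribute and build_tree and the toy records are our own, and real implementations add stopping criteria and pruning.

from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    # Step 1: pick the attribute whose split gives the lowest weighted impurity
    def weighted_gini(attr):
        total = len(rows)
        score = 0.0
        for v in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == v]
            score += len(subset) / total * gini(subset)
        return score
    return min(attributes, key=weighted_gini)

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:        # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]    # leaf = majority class
    attr = best_attribute(rows, attributes, target)    # step 1: best attribute at this node
    children = {}
    for v in {r[attr] for r in rows}:                  # step 2: one subset per attribute value
        subset = [r for r in rows if r[attr] == v]
        children[v] = build_tree(subset, [a for a in attributes if a != attr], target)
    return (attr, children)                            # step 3: recurse on every subset

data = [  # hypothetical toy training set
    {"Refund": "Yes", "Marital": "Single",   "Cheat": "No"},
    {"Refund": "No",  "Marital": "Married",  "Cheat": "No"},
    {"Refund": "No",  "Marital": "Single",   "Cheat": "Yes"},
    {"Refund": "No",  "Marital": "Divorced", "Cheat": "Yes"},
]
print(build_tree(data, ["Refund", "Marital"], "Cheat"))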
In decision trees, to predict a class label for a record we start from the root of the tree. We
compare the value of the root attribute with the record's attribute value. On the basis of this
comparison, we follow the branch corresponding to that value and jump to the next node.
We continue comparing the record's attribute values with the internal nodes of the tree until we
reach a leaf node with the predicted class value. The modeled decision tree can then be used to
predict the target class or value, as in the sketch below.
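A small sketch of this traversal, hard-coding as plain if/else tests the Refund / Marital Status / Taxable Income tree used later in this chapter; the function name and dictionary keys are our own choice, and how the boundary value 80K is handled is an assumption.

def classify(record):
    if record["Refund"] == "Yes":                       # root test
        return "No"
    # Refund = No: test Marital Status next
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: test Taxable Income (slides show < 80K -> NO, > 80K -> YES)
    return "Yes" if record["TaxableIncome"] > 80_000 else "No"

# The test record from the slides: Refund = No, Married, income 80K -> Cheat = "No"
print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}))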
Feature values are preferred to be categorical. If the values are continuous then they
are discretized prior to building the model.
The order in which attributes are placed as the root or internal nodes of the tree is determined
using a statistical approach (such as information gain or the Gini index, discussed below).
Overfitting: decision-tree learners can create over-complex trees that do not
generalize the data well. Overfitting is one of the most practical difficulties
for decision tree models. This problem is addressed by setting
constraints on model parameters and by pruning.
Not fit for continuous variables: While working with continuous numerical variables,
decision tree loses information, when it categorizes variables in different categories.
Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This is called variance, which needs to be
lowered by methods like bagging and boosting.
Greedy algorithms cannot guarantee to return the globally optimal decision tree. This
can be mitigated by training multiple trees, where the features and samples are
randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the data set prior to fitting with the decision tree.
Information gain in a decision tree with categorical variables gives a biased response
for attributes with greater no. of categories.
Generally, it gives low prediction accuracy for a dataset as compared to other machine
learning algorithms.
Calculations can become complex when there are many class labels.
Regression Trees vs Classification Trees
The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision
trees are typically drawn upside down, so that the leaves are at the bottom and the root is at the top.
Both types of trees work in a very similar way (a small illustration follows after the list below).
The primary differences and similarities between Classification and Regression Trees are:
Regression trees are used when the dependent variable is continuous. Classification trees are
used when the dependent variable is categorical.
In the case of a Regression Tree, the value obtained by a terminal node in the training data is the
mean response of the observations falling in that region. Thus, if an unseen data observation falls
in that region, we make its prediction with the mean value.
In the case of a Classification Tree, the value (class) obtained by a terminal node in the training data
is the mode of the observations falling in that region. Thus, if an unseen data observation falls in
that region, we make its prediction with the mode value.
Both the trees divide the predictor space (independent variables) into distinct and non-
overlapping regions.
Both trees follow a top-down greedy approach known as recursive binary splitting. It is called
'top-down' because it begins at the top of the tree, when all the observations are in a single
region, and successively splits the predictor space into two new branches further down the
tree. It is called 'greedy' because the algorithm cares only about the current split (it looks for
the best variable available now), not about future splits that might lead to a better tree.
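As a small, hedged illustration of this difference with scikit-learn (the synthetic data and parameters below are made up for the example): the classification tree's leaves predict the mode of their training points, the regression tree's leaves predict the mean.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # one numeric predictor
y_class = (X[:, 0] > 5).astype(int)              # categorical target -> classification tree
y_reg = 2.0 * X[:, 0] + rng.normal(0, 1, 200)    # continuous target -> regression tree

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)  # leaves predict the mode
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_reg)     # leaves predict the mean

print(clf.predict([[7.3]]))   # class label = mode of the training points in that leaf
print(reg.predict([[7.3]]))   # value = mean of the training points in that leaf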
(Figure: the overall flow: a model is induced from a labeled Training Set and then applied to a Test Set of unlabeled records, e.g. Tid 11 (Small, 55K, ?) and Tid 15 (Large, 67K, ?).)

Splitting Attributes: example training set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One decision tree that fits this data tests Refund at the root (Yes → NO); if Refund = No, it tests MarSt (Married → NO); if Single or Divorced, it tests TaxInc (< 80K → NO, > 80K → YES).
Another tree could start with MarSt instead: there could be more than one tree that fits the same data!
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree: Refund = No, so follow the No branch to the MarSt node. Marital Status = Married, so follow the Married branch, which is a leaf labeled NO. Assign Cheat to "No".
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
Hunt's Algorithm: all training records (the table above) start at the root. A first test on Refund sends the Refund = Yes records to a Don't Cheat leaf, while the Refund = No records (??) still need further splitting.
1. We calculate the purity of the resulting subsets for all possible splits:
• Purity of split on Refund
• Purity of split on Marital Status
• Purity of split on Taxable Income
Each recursive step of the tree-growing process must select an attribute test
condition to divide the records into smaller subsets. To implement this step, the
algorithm must provide a method for specifying the test condition for different
attribute types as well as an objective measure for evaluating the goodness of
each test condition.
Design Issues of Decision Tree Induction
Greedy strategy:
– Split the records based on an attribute test
that optimizes a certain criterion.
Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?
What about this split? (Figure: splitting the ordinal attribute Size into {Small, Large} vs {Medium}, a grouping that violates the order of the categories; and splitting the continuous attribute Taxable Income either with a binary test (Taxable Income > 80K? Yes / No) or with multi-way ranges such as < 10K, ..., > 80K.)
Greedy approach: test all possible splits and use the one that
results in the most homogeneous (= pure) nodes.
Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high degree of impurity),
whereas a node with C0: 9, C1: 1 is homogeneous (low degree of impurity).
Gini Index
Entropy
Misclassification error
Gain of a split: Gain = P − M, where P is the impurity measured before splitting (parent node) and M is the weighted impurity measured after splitting (children).
(Figure: two candidate splits, A? and B?, each with Yes/No children. Splitting on A gives child impurities M1 and M2, combined into M12; splitting on B gives M3 and M4, combined into M34. Higher purity gain: compare P − M12 with P − M34 and choose the larger.)
Measure of Impurity: GINI
GINI(t) = 1 − Σ_j [ p(j | t) ]²
where p(j | t) is the relative frequency of class j at node t.
Example: computing the Gini index of a split B?
Parent node: C1 = 6, C2 = 6, Gini = 0.500.
Splitting on B gives Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4).
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
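The same arithmetic can be checked with a few lines of Python (our own sketch):

def gini(counts):
    # Gini impurity of a node given its class counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

g_parent = gini([6, 6])                   # 0.500
g_n1 = gini([5, 2])                       # ~0.408
g_n2 = gini([1, 4])                       # 0.320
g_children = 7/12 * g_n1 + 5/12 * g_n2    # ~0.371 (weighted by node sizes)
print(round(g_parent, 3), round(g_n1, 3), round(g_n2, 3), round(g_children, 3))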
(Table: for a continuous attribute such as Taxable Income, the values are sorted, each midpoint between consecutive values is a candidate split position, the class counts on each side are tabulated, and the Gini index is computed for every candidate; here the candidate Gini values range from 0.420 down to the best value of 0.300.)
Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)
Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) × Entropy(i)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i / n) × log (n_i / n)
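A short sketch of these formulas in Python (the helper names are our own), applied to the same split as in the Gini example above:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    children = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - children

def gain_ratio(parent_counts, children_counts):
    n = sum(parent_counts)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children_counts)
    return info_gain(parent_counts, children_counts) / split_info

parent = [6, 6]                 # class counts before the split
children = [[5, 2], [1, 4]]     # class counts in the two child nodes
print(info_gain(parent, children), gain_ratio(parent, children))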
Example: splitting on A?
Parent node: C1 = 7, C2 = 3, Gini = 0.42.
Splitting on A gives Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3).
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves, but the misclassification error does not: it is 3/10 both before and after the split.
Missing Values
Costs of Classification
Data Fragmentation
Search Strategy
Expressiveness
Tree Replication
Other strategies?
– Bottom-up
– Bi-directional
(Figure: tree replication, where the same subtree, rooted at S, appears in more than one branch of the tree.)
(Figure: a two-dimensional data set on the unit square together with the decision tree that partitions it; the root test is x < 0.43?, followed by tests on y, and each leaf region contains points of a single class.)
• Border line between two neighboring regions of different classes is
known as decision boundary
• Decision boundary is parallel to axes because test condition involves
a single attribute at-a-time
(Figure: a data set whose two classes, + and −, are separated by the oblique boundary x + y < 1; a test condition that involves a single attribute at a time cannot represent this boundary directly.)
Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret by humans for small-sized trees (eager
learning)
• Can easily handle redundant or irrelevant attributes
• Accuracy is comparable to other classification techniques
for many low dimensional data sets (not texts and images)
Disadvantages:
• Space of possible decision trees is exponentially large
• Greedy approaches are often unable to find the best tree
• Trees do not take into account interactions between
attributes.
Overfitting
• Think of someone who tries to master both engineering and cricket at the same time: they will
end up knowing neither engineering nor cricket particularly well, because they never had their
heart in either and have only shallow knowledge of everything.
Now think of subject matter experts, who know a lot about one particular field.
Ask them anything about the functionality of their tool (even in detail),
and they will probably be able to answer you quite precisely.
But ask them why the oil price fluctuates, and they will probably make an
informed guess and say something peculiar.
In terms of machine learning, the second case corresponds to too much focus on the
training set: the model learns complex relations which may not be valid
in general for new data (the test set).
To address this, we can split our initial dataset into separate training and test subsets. This method
can approximate how well our model will perform on new data.
If our model does much better on the training set than on the test set, then we’re likely
overfitting.
For example, it would be a big red flag if our model saw 95% accuracy on the training set but only
48% accuracy on the test set.
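A hedged sketch of this train/test comparison with scikit-learn; the dataset and the depth values are illustrative, not prescriptive.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (None, 3):   # unlimited depth vs. a constrained tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# A large gap between training and test accuracy is the red flag described above;
# limiting max_depth (or pruning) usually narrows it.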
How to Prevent Overfitting:
Detecting overfitting is useful, but it doesn’t solve the problem. Fortunately, you have several
options to try.
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Notes on Overfitting
(Figure: a synthetic two-class data set in which the 'o' class has 5,200 instances generated from a uniform distribution, together with the decision trees fitted to it.)
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small, but test error is
large
Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most
number of correct predictions
Test errors
– Errors committed on the test set
Generalization errors
– Expected error of a model over random
selection of records from same distribution
Thus, out of the many possible trees, the one with the better pessimistic
error is selected.
Also, a node should not be expanded into its child nodes unless doing so
reduces the misclassification error by more than one training record.
(Figure: two candidate trees, TL and TR, built from the same 24 training records, with the class counts (+ / −) shown at each leaf; their training errors are e(TL) = 4/24 and e(TR) = 6/24.)
Resubstitution Estimate:
– Using training error as an optimistic estimate of
generalization error
– Referred to as optimistic error estimate
(Figure: for the left tree above, the resubstitution estimate is e(TL) = 4/24 ≈ 0.167, computed from the class counts at its leaves.)
Drawback:
– Less data available for training
(Figure: 3-fold cross-validation.)
Computing the impurity measure with a missing value: the record with the missing attribute value
is left out of the split counts, and the gain is scaled by the fraction of records with known values.
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) ≈ 0.297
(Figure: Distribute Instances, i.e. how training records with a missing attribute value are passed down to the child nodes.)
Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
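For example, a minimal k-fold cross-validation sketch with scikit-learn (the dataset and parameters are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, plus the average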
Negative (N) (No) (−) (F): the observation is not positive (for example: it is
not an apple).
Confusion matrix (PREDICTED CLASS Yes/No vs ACTUAL CLASS Yes/No):
a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
If we look carefully, the proportion of Class 0 examples is very high (9,990) while the
proportion of Class 1 examples is very low (10) in this 2-class problem.
Hence, in this case accuracy is not an appropriate measure for evaluating the
performance of the model.
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Precision:
To get the value of precision we divide the total number of correctly classified
positive examples by the total number of predicted positive examples. When it
predicts yes, how often is it correct?
High Precision indicates an example labeled as positive is indeed positive
(small number of FP).
Precision is given by the relation: Precision = TP / (TP + FP)
The cost function comes into play in deciding which of the incorrect predictions
can be more detrimental — the false positive or the false negative (in other
words, which performance measure is important — precision or recall).
It is difficult to compare two models with low precision and high recall or vice
versa. So to make them comparable, we use F-Score. F-score helps to measure
Recall and Precision at the same time.
F-measure:
Since we have two measures (Precision and Recall), it helps to have a
measurement that represents both of them. We calculate an F-measure, which
uses the Harmonic Mean in place of the Arithmetic Mean, as it punishes extreme
values more:
F-measure = 2 × Precision × Recall / (Precision + Recall)
The F-measure will always be nearer to the smaller value of Precision or Recall.
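A quick worked example of these formulas (the confusion-matrix counts below are made up):

TP, FN, FP, TN = 70, 30, 20, 880

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
print(accuracy, precision, recall, f_measure)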
Cost-sensitive measures weight the cells of the confusion matrix (PREDICTED CLASS vs ACTUAL CLASS):
Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
The following diagram illustrates the pattern of accuracy growth when sample size
increases.
Note that:
▪ Accuracy is 100% when the entire population has been examined (as in the case of a
census).
▪ The pattern of accuracy growth is not linear. The accuracy of a sample equal to half
the data population size is not 50% but very near to 100%.
▪ Good accuracy levels can be achieved at relatively small sample sizes, provided that
the samples are representative.
▪ The result of this relationship is that beyond a certain sample size the gains in
accuracy are negligible, while sampling costs increase significantly
At threshold t:
TP=0.5, FN=0.5, FP=0.12, TN=0.88
ROC Curve
(TPR, FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite of
the true class
Neither model consistently outperforms the other:
M1 is better for
small FPR
M2 is better for
large FPR
ROC curve construction: counts at each threshold (5 positive and 5 negative instances):
TP   5    4    4    3    3    3    3    2    2    1    0
FP   5    5    4    4    3    2    1    1    0    0    0
TN   0    0    1    1    2    3    4    4    5    5    5
FN   0    1    1    2    2    2    2    3    3    4    5
TPR  1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
FPR  1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0
ROC Curve:
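A small sketch (with made-up scores and labels) of how such TPR/FPR pairs are produced by sweeping a threshold over predicted scores:

def roc_points(scores, labels):
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = [s >= t for s in scores]                       # predict positive above threshold
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        tn = sum((not p) and not y for p, y in zip(pred, labels))
        points.append((fp / (fp + tn), tp / (tp + fn)))       # (FPR, TPR)
    return points

scores = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.43, 0.42, 0.25]
labels = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]   # 1 = positive class
print(roc_points(scores, labels))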
Confidence interval for accuracy: for N test records with observed accuracy acc, the true accuracy p satisfies
P( Z_{α/2} ≤ (acc − p) / √( p(1 − p) / N ) ≤ Z_{1−α/2} ) = 1 − α
and solving for p gives the confidence interval
p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · √( Z²_{α/2} + 4·N·acc − 4·N·acc² ) ) / ( 2 (N + Z²_{α/2}) )
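For example, plugging acc = 0.8 observed on N = 100 test records at 95% confidence (Z ≈ 1.96) into the interval formula (a quick numeric check, not from the slides):

from math import sqrt

N, acc, Z = 100, 0.80, 1.96
center = 2 * N * acc + Z**2
half   = Z * sqrt(Z**2 + 4 * N * acc - 4 * N * acc**2)
lower, upper = (center - half) / (2 * (N + Z**2)), (center + half) / (2 * (N + Z**2))
print(lower, upper)   # roughly (0.71, 0.87)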
Comparing the performance of two models: given error rates e1 and e2 measured on two independent test sets of sizes n1 and n2, for large samples the errors are approximately normally distributed:
e1 ~ N(μ1, σ1), e2 ~ N(μ2, σ2)
– Approximate variance: σ̂_i² = e_i (1 − e_i) / n_i
For the observed difference d = e1 − e2, the variance is σ̂_t² = σ̂_1² + σ̂_2², and the confidence interval is
d_t = d ± Z_{α/2} · σ̂_t
Comparing two algorithms with k-fold cross-validation: let d_j be the difference in error on fold j and d̄ the average difference; then
σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / ( k (k − 1) )
d_t = d̄ ± t_{1−α, k−1} · σ̂_t
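A short numeric sketch of the k-fold comparison (the fold-by-fold differences below are hypothetical):

from math import sqrt
from scipy import stats

d = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.02, 0.01, 0.03, 0.02]  # e1_j - e2_j per fold
k = len(d)
d_bar = sum(d) / k
var_t = sum((x - d_bar) ** 2 for x in d) / (k * (k - 1))
t_crit = stats.t.ppf(0.975, k - 1)                   # two-sided 95% interval
print(d_bar - t_crit * sqrt(var_t), d_bar + t_crit * sqrt(var_t))
# If the interval does not contain 0, the difference between the models is significant.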