
Fakultät für Elektrotechnik und Informatik

Institut für Verteilte Systeme


Fachgebiet Wissensbasierte Systeme (KBS)

Data Mining I
Summer semester 2017

Lecture 5 & 6: Classification


Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Tai Le Quy and Damianos Melidis
Recap from previous lecture

Apriori improvements
FPGrowth
Compact forms of frequent itemsets
Closed frequent itemsets
Maximal frequent itemsets
FIM and ARM beyond binary, asymmetric data
Categorical data
Continuous data

Data Mining I: Classification 1 & 2 2


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 3


The classification problem

Given:
a dataset of instances D = {t1, t2, …, tn} and
a set of classes C = {c1, …, ck},
classification is the task of learning a target function/mapping f: D → C that assigns each ti to one of the cj.
The mapping or target function is known as the classification model.

ID  Age  Car type  Risk
1   23   Family    high
2   17   Sports    high
3   43   Sports    high
4   68   Family    low
5   32   Truck     low

Predictor attributes: Age, Car type. Class attribute: Risk = {high, low}
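As an illustrative sketch (assuming scikit-learn, which these slides reference later, and a hand-made one-hot encoding of Car type), the target function f for this toy table can be learned and applied as follows:

# Illustrative sketch: learning f: D -> C for the toy risk table.
# Car type is one-hot encoded by hand: columns = Age, Family, Sports, Truck.
from sklearn.tree import DecisionTreeClassifier

X = [[23, 1, 0, 0],
     [17, 0, 1, 0],
     [43, 0, 1, 0],
     [68, 1, 0, 0],
     [32, 0, 0, 1]]
y = ["high", "high", "high", "low", "low"]

model = DecisionTreeClassifier().fit(X, y)   # the learned classification model f
print(model.predict([[30, 0, 1, 0]]))        # classify a new instance (30-year-old, Sports car)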


Data Mining I: Classification 1 & 2 4
A supervised learning task

Classification is a supervised learning task
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
(Example: the risk table above, where the class attribute Risk = {high, low} is given for every training record)

Clustering is an unsupervised learning task
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the goal is to group the data into groups of similar data (clusters)
Data Mining I: Classification 1 & 2 5


Applications

Credit approval
Classify bank loan applications as e.g. safe or risky.
Fraud detection
e.g., in credit cards
Churn prediction
E.g., in telecommunication companies
Target marketing
Is the customer a potential buyer for a new computer?
Medical diagnosis
Character recognition

Data Mining I: Classification 1 & 2 6


Classification techniques

Decision trees
Bayesian classifiers
Neural networks
Nearest neighbors
Support vector machines
Boosting
Bagging
Random forests
…
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
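As a rough companion to the comparison linked above, a minimal sketch (not the linked script; the dataset and hyperparameters are illustrative assumptions) that cross-validates a few of these classifier families:

# Illustrative sketch: comparing a few classifier families with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
classifiers = {
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF)": SVC(gamma="scale"),
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # mean accuracy over 5 folds
    print(f"{name:15s} mean accuracy = {scores.mean():.3f}")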

Data Mining I: Classification 1 & 2 7


General approach for building a classification model

(The accompanying figure, not reproduced in this extract, shows the workflow: a training set with known class labels is fed to a learning algorithm, which induces a model; the model is then applied to a testing set of previously unseen records.)

Different learning algorithms or classifiers:
Decision trees
kNN
Neural networks
SVMs
…

Induction: makes broad generalizations from specific observations
- Generates new theory emerging from the data (here: learning the model from the training set)
Deduction: from general to specific
- Tests the theory (here: applying the model to the testing set)

Data Mining I: Classification 1 & 2 15


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 16


Decision tree (DTs) classifiers
Training set
One of the most popular classification methods

DTs are included in many commercial systems nowadays

Easy to interpret, human readable, intuitive

Simple and fast methods

Many algorithms have been proposed

ID3 (Quinlan 1986)

C4.5 (Quinlan 1993)

CART (Breiman et al 1984)


Data Mining I: Classification 1 & 2 17
Representation 1/2

The learned function is represented by a decision tree!


Representation:
Each internal node specifies a test of some predictor attribute
Each branch descending from a node corresponds to one of the possible values for this attribute
Each leaf node assigns a class label
(The slide figure shows the play-tennis training set and an example tree, with the attribute test, the attribute values and the class values annotated.)

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.

Data Mining I: Classification 1 & 2 18


Representation 2/2

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of the
instances
Each path from the root to a leaf node corresponds to a conjunction of attribute tests
The tree corresponds to a disjunction of these conjunctions, i.e., (… ∧ … ∧ …) ∨ (… ∧ … ∧ …) ∨ …
We can translate each path into IF-THEN rules (human readable)

IF ((Outlook = Sunny) ^ (Humidity = Normal)),


THEN (Play tennis=Yes)

IF ((Outlook = Rain) ^ (Wind = Weak)),


THEN (Play tennis=Yes)
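For illustration, the two rules above can be written directly as code; a minimal sketch (the fall-through default below is an assumption for illustration, not part of the learned tree shown on the slides):

# The two example root-to-leaf paths written as IF-THEN rules.
# Attribute names and values follow the play-tennis example.
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    if outlook == "Rain" and wind == "Weak":
        return "Yes"
    return "No"   # illustrative default, not the full learned tree

print(play_tennis("Sunny", "Normal", "Strong"))  # -> Yes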

Data Mining I: Classification 1 & 2 19


How to build a decision tree?

Training set

What decisions do I have to make in order to build a decision tree?

Data Mining I: Classification 1 & 2 20


The basic decision tree learning algorithm

Basic algorithm (ID3, Quinlan 1986), illustrated on the play-tennis training set:
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root node (#14 in the example).
The question is: which attribute should be tested at the root? (Outlook, Temperature, Humidity or Wind?)
Attributes are evaluated using some statistical measure, which determines how well each attribute alone classifies the training examples.
The best splitting attribute is selected and used as the test attribute at the root (Outlook in the example).
For each possible value of the test attribute, a descendant of the root node is created and the instances are mapped to the appropriate descendant node (Sunny: #5, Overcast: #4, Rain: #5).
The procedure is repeated for each descendant node, so instances are partitioned recursively (the next question at a child node is again: Temperature, Humidity or Wind?).

Data Mining I: Classification 1 & 2 27
The basic decision tree learning algorithm

Pseudocode
(The slide shows the pseudocode alongside the play-tennis training set; the listing itself is not reproduced in this extract, see the sketch below.)

When do we stop partitioning?
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning (in this case, majority voting is used to label the leaf)

Data Mining I: Classification 1 & 2 30
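A compact illustrative sketch of the recursion described above, assuming categorical attributes and information gain (defined in the following slides) as the selection measure; function and variable names are illustrative, not taken from the original pseudocode:

import math
from collections import Counter

# rows: list of dicts mapping attribute name -> value; labels: list of class labels
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                       # stop: all samples in one class
        return labels[0]
    if not attributes:                              # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):    # one branch per attribute value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree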


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 31


Which attribute is the best?

Which attribute to choose for splitting? A1 or A2?

The goal is to select the attribute that is most useful for classifying examples, i.e. the one that makes us most certain about the class after the split.
By useful we mean that the resulting partitioning is as pure as possible.
A partition is pure if all its instances belong to the same class.

Different attribute selection measures:
Information gain, gain ratio, Gini index, …
All are based on the degree of impurity of the parent node (before splitting) vs. the children nodes (after splitting).

Data Mining I: Classification 1 & 2 34
Entropy for measuring impurity of a set of instances

Let S be a collection of positive and negative examples for a binary classification problem, C = {+, -}.
p_+: the percentage of positive examples in S
p_-: the percentage of negative examples in S
Entropy measures the impurity of S:
Entropy(S) = -p_+ · log2(p_+) - p_- · log2(p_-)

Entropy = 0, when all members belong to the same class


Entropy = 1, when there is an equal number of positive and negative examples

Entropy comes from information theory: the higher the entropy, the higher the information content.

Data Mining I: Classification 1 & 2 35


Entropy for measuring impurity of a set of instances

Examples: What is the entropy?

Let S: [9+,5-]:  Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940
Let S: [7+,7-]:  Entropy(S) = -(7/14)·log2(7/14) - (7/14)·log2(7/14) = 1
Let S: [14+,0-]: Entropy(S) = -(14/14)·log2(14/14) - (0/14)·log2(0/14) = 0

In the general case (k-classification problem):

Entropy(S) = - Σ_{i=1..k} p_i · log2(p_i)

Data Mining I: Classification 1 & 2 40


Attribute selection measure: Information gain

Used in ID3 (Quinlan, 1986)


It uses entropy, a measure of the purity of the data
The information gain Gain(S, A) of an attribute A relative to a collection of examples S measures the
entropy reduction in S due to splitting on A:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

(The first term is the impurity before splitting; the weighted sum is the impurity after splitting on A.)

Information gain measures the expected reduction in entropy due to splitting on A.
The attribute with the highest entropy reduction is chosen for splitting.

Data Mining I: Classification 1 & 2 41


Information Gain example 1

Humidity or Wind? Which attribute to choose for splitting?

Which attribute is chosen?
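The class counts behind this example appear only in the slide figure; assuming the standard play-tennis counts (S: 9 Yes / 5 No; Wind = Weak: 6/2, Strong: 3/3; Humidity = High: 3/4, Normal: 6/1), a short sketch of the computation:

import math

def entropy(pos, neg):
    e, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

S = entropy(9, 5)                                                 # 0.940
gain_wind     = S - 8/14 * entropy(6, 2) - 6/14 * entropy(3, 3)   # ~0.048
gain_humidity = S - 7/14 * entropy(3, 4) - 7/14 * entropy(6, 1)   # ~0.151
print(gain_wind, gain_humidity)   # Humidity has the larger gain and is chosen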

Data Mining I: Classification 1 & 2 42


Information Gain example 2

Repeat recursively
(This sequence of slides steps through the recursion on the play-tennis training set, growing the tree one split at a time; the intermediate trees are shown only as figures.)
At each newly created node the same questions are asked: which attribute should we choose for splitting here, and which attribute is chosen?

Data Mining I: Classification 1 & 2 49


Attribute selection measure: Information Gain

Information gain is biased towards attributes with a large number of distinct values.
Consider unique identifiers like an ID or a credit card number:
Such attributes have a high information gain, because they uniquely identify each instance, but we do not want to include them in the decision tree.
E.g., deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before.

Measures have been proposed that correct this issue


Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.
Gini index

Data Mining I: Classification 1 & 2 50


Lecture 6 starts here

Data Mining I: Classification 1 & 2 51


Attribute selection measure: Gain ratio

C4.5 (a successor of ID3) uses the gain ratio to overcome this problem: it normalizes the information gain of an attribute A by the split information of A:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

Gain(S, A) measures the information gained with respect to the classification; SplitInfo(S, A) measures the information generated by splitting S into |Values(A)| partitions:

SplitInfo(S, A) = - Σ_{v ∈ Values(A)} P_v · log2(P_v) = - Σ_{v ∈ Values(A)} (|S_v| / |S|) · log2(|S_v| / |S|)

High split info: the partitions have more or less the same size (uniform)
Low split info: few partitions hold most of the tuples (peaks)
If an attribute produces many splits → high SplitInfo() → low GainRatio(). This is the case for, e.g., an ID attribute.

The attribute with the maximum gain ratio is selected as the splitting attribute.

Data Mining I: Classification 1 & 2 55


Example: Gain ratio - Split information

Example (play-tennis training set):

SplitInfo(S, A) = - Σ_{v ∈ Values(A)} (|S_v| / |S|) · log2(|S_v| / |S|)

Humidity = {High, Normal}
SplitInformation(S, Humidity) = -(7/14)·log2(7/14) - (7/14)·log2(7/14) = 1

Wind = {Weak, Strong}
SplitInformation(S, Wind) = -(8/14)·log2(8/14) - (6/14)·log2(6/14) = 0.9852

Outlook = {Sunny, Overcast, Rain}
SplitInformation(S, Outlook) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.5774

Which attribute will be selected?

Data Mining I: Classification 1 & 2 59
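The information gains needed to answer this appear earlier only in figures; assuming the standard play-tennis gains (Outlook ≈ 0.246, Humidity ≈ 0.151, Wind ≈ 0.048), a quick sketch of the comparison:

# Illustrative only: gains are assumed from the standard play-tennis example,
# split infos are the values computed above.
gain       = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048}
split_info = {"Outlook": 1.5774, "Humidity": 1.0, "Wind": 0.9852}

gain_ratio = {a: gain[a] / split_info[a] for a in gain}
print(gain_ratio)                           # Outlook ~0.156, Humidity ~0.151, Wind ~0.049
print(max(gain_ratio, key=gain_ratio.get))  # Outlook is still selected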


Attribute selection measure: Gini Index

Used in CART (Breiman et al, 1984)


Let S be a dataset containing examples from k classes, and let pj be the probability of class j in S. The Gini Index of S is given by:

Gini(S) = 1 - Σ_{j=1..k} p_j²

Gini index considers a binary split for each attribute.
If S is split based on attribute A into two subsets S1 and S2:

Gini(S, A) = (|S1| / |S|) · Gini(S1) + (|S2| / |S|) · Gini(S2)

Reduction in impurity:

ΔGini(S, A) = Gini(S) - Gini(S, A)

The attribute A that provides the smallest Gini(S, A) (or, equivalently, the largest reduction in impurity) is chosen to split the node.

Data Mining I: Classification 1 & 2 60


Attribute selection measure: Gini Index a small example

Let D has 14 instances


9 of class buys_computer = yes
5 in buys_computer = no

The Gini Index of D is:

Gini(D) = 1 - (9/14)² - (5/14)² = 0.459

Data Mining I: Classification 1 & 2 61


Attribute selection measure: Gini Index for non-binary data

Gini index considers a binary split for each attribute


How to find the binary splits?
For a categorical attribute A, we consider all possible subsets that can be formed by values of A (next slides)
For a numerical attribute A, we find the split points of A (next slides)

Data Mining I: Classification 1 & 2 62


Attribute selection measure: Gini index for categorical attributes

Let the categorical attribute Income = {low, medium, high} . How can we convert it
into a binary attribute?

Data Mining I: Classification 1 & 2 63


Attribute selection measure: Gini index for categorical attributes

Let the categorical attribute Income = {low, medium, high} .


To generate the binary splits for Income, we check all possible subsets:
({low,medium} and {high})
({low,high} and {medium})
({medium,high} and {low})

For each subset, we check the Gini Index of setting up a split in that subset
Gini{low,medium} and {high}(D) = ?
Gini{low,high} and {medium}(D) = ?
Gini{medium,high} and {low}(D) = ?

The split that provides the smallest Gini(S,Asplit) (or the largest reduction in impurity) is chosen to
split the node

Data Mining I: Classification 1 & 2 64


Attribute selection measure: Gini index for categorical attributes

For each subset, we check the Gini Index.
For example, the ({low, medium} and {high}) split results in D1 (#10 instances: 6+, 4-) and D2 (#4 instances: 1+, 3-):

Gini_{low,medium},{high}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2)

The same computation is done for the remaining binary split partitions. Which split should we take?
The split with the smallest Gini value (largest reduction in impurity) is taken: here, the best binary split for Income is on ({medium, high} and {low}).

Data Mining I: Classification 1 & 2 66
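A short sketch of the computation for the split spelled out above; the class counts for the other two candidate splits appear only in the slide figures, so only this split is computed here:

# Gini of the ({low,medium} | {high}) split, using the counts given above:
# D1: 10 instances (6+, 4-), D2: 4 instances (1+, 3-).
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

g_split = 10/14 * gini(6, 4) + 4/14 * gini(1, 3)
print(round(g_split, 3))   # ~0.45, to be compared against the other candidate splits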


Attribute selection measure: Gini index for numerical attributes

Let attribute A be a continuous-valued attribute


Must determine the best split point t for A
Sort the values of A in increasing order
Identify adjacent examples that differ in their target classification
Typically, every such pair suggests a potential split threshold t = (a_i + a_{i+1}) / 2
Select the threshold t that yields the best value of the splitting criterion.

t = (48 + 60)/2 = 54 and t = (80 + 90)/2 = 85

2 potential thresholds: Temperature > 54, Temperature > 85


Compute the attribute selection measure (e.g. information gain) for both
Choose the best (Temperature>54 here)
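As a sketch, the threshold search can be written as follows; the temperature values and labels are assumed from the textbook example that the thresholds 54 and 85 above correspond to:

import math

# Assumed example data: sorted temperatures with their play-tennis labels.
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(labs):
    n = len(labs)
    return -sum((c / n) * math.log2(c / n) for c in (labs.count(l) for l in set(labs)))

def gain_for_threshold(t):
    left  = [l for v, l in zip(temps, labels) if v <= t]
    right = [l for v, l in zip(temps, labels) if v > t]
    n = len(labels)
    return entropy(labels) - len(left)/n*entropy(left) - len(right)/n*entropy(right)

# Candidate thresholds: midpoints between adjacent values with different labels.
candidates = [(a + b) / 2 for a, b, la, lb in
              zip(temps, temps[1:], labels, labels[1:]) if la != lb]
print(candidates)                               # [54.0, 85.0]
print(max(candidates, key=gain_for_threshold))  # 54.0 yields the higher gain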

Data Mining I: Classification 1 & 2 67


Attribute selection measure: Gini index for numerical attributes

Let t be the threshold chosen from the previous step


Create a boolean attribute based on A and threshold t with two possible outcomes: yes, no
S1 is the set of tuples in S satisfying (A > t), and S2 is the set of tuples in S satisfying (A ≤ t)

How it looks: either a node testing "Temperature > 54" with branches yes/no, or a node "Temperature" with branches "> 54" and "≤ 54"

An example of a tree for the


play tennis problem when
attributes Humidity and Wind
are continuous

Data Mining I: Classification 1 & 2 68


Comparing Attribute Selection Measures

The three measures are commonly used and, in general, return good results, but:
Information gain Gain(S,A):
biased towards multivalued attributes

Gain ratio GainRatio(S,A) :


tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:
biased towards multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in both partitions

Other measures also exist


"Most previously published empirical results concluded that it is not possible to decide which one of the two tests to prefer" (Theoretical Comparison between the Gini Index and Information Gain Criteria, Raileanu and Stoffel, 2004).
https://link.springer.com/article/10.1023/B:AMAI.0000018580.96245.c6

Data Mining I: Classification 1 & 2 69


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 70


Training vs generalization errors

The errors of a classifier are divided into:
Training errors (also called resubstitution error or apparent error): the errors committed on the training set
Generalization errors: the expected error of the model on previously unseen examples

A good classifier must
1. Fit the training data &
2. Accurately classify records never seen before
i.e., a good model has both a low training error and a low generalization error
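A minimal sketch of measuring both kinds of error empirically (scikit-learn is assumed; the dataset and depth values are illustrative), with the held-out test error standing in for the generalization error:

# Training error vs. test error for decision trees of increasing complexity.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, 5, 10, None):   # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"max_depth={depth}: training error={train_err:.3f}, test error={test_err:.3f}")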

Data Mining I: Classification 1 & 2 72


Model overfitting

Model overfitting: a model that fits the training data well (low training error) but has poor generalization power (high generalization error).

Consider a hypothesis h and the error of h over:
The training set: errortrain(h)
The entire distribution D of the data: errorD(h)

Hypothesis h overfits the training data if there is an alternative hypothesis h′ ∈ H such that:
errortrain(h) < errortrain(h′) and errorD(h′) < errorD(h)
Data Mining I: Classification 1 & 2 73


Decision trees overfitting

An induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers
Very good performance on the training (already seen) samples
Poor accuracy for unseen samples

Example: let us add a noisy/outlier training example (D15) to the training set:
D15  Sunny  Hot  Normal  Strong  No
How would the earlier tree (built upon training examples D1-D14) be affected?

Data Mining I: Classification 1 & 2 74


Underfitting & Overfitting

The training error can be decreased by increasing the model complexity
But a complex model, tailored to the training data, will also have a high generalization error
(The figure shows how the error on both training and test data evolves with the tree complexity.)

Model underfitting: the model has yet to learn the true structure from the training data.
Model overfitting: the model overspecializes to the training data.

Data Mining I: Classification 1 & 2 76


Potential causes of model overfitting

Overfitting due to presence of noise


Overfitting due to lack of representative samples

Data Mining I: Classification 1 & 2 77


Overfitting due to presence of noise

The decision boundary is distorted by the noise point.

Data Mining I: Classification 1 & 2 78


Overfitting due to presence of noise an example

(The figures show a training set, where * marks misclassified instances, and a test set, classified by two trees M1 and M2.)

M1: training error 0%, test error 30%
M2: training error 20%, test error 10%
Data Mining I: Classification 1 & 2 79
Overfitting due to lack of representative samples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class
labels of that region
Insufficient number of training records in the region causes the decision tree to predict the test examples
using other training records that are irrelevant to the classification task
Data Mining I: Classification 1 & 2 80
Avoiding overfitting in Decision Trees

Two approaches to avoid overfitting:

Pre-pruning: stop growing the tree when the data split is not statistically significant
do not split a node if this would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold

Post-pruning: Grow full tree, then prune it


Get a sequence of progressively pruned trees

How to select best pruned tree?


Measure performance over training data

Measure performance over a separate validation dataset

Add complexity penalty to performance measure
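For concreteness, a sketch of both strategies with scikit-learn (an assumption for illustration; note that scikit-learn's post-pruning uses cost-complexity pruning, i.e., a complexity penalty, rather than reduced-error pruning):

# Pre-pruning vs. post-pruning of decision trees (parameter values are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via thresholds on depth, node size and impurity gain.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             min_impurity_decrease=0.01, random_state=0).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with a complexity penalty (ccp_alpha).
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, clf in (("pre-pruned", pre), ("post-pruned", post)):
    print(name, "leaves:", clf.get_n_leaves(), "test accuracy:", round(clf.score(X_test, y_test), 3))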

Data Mining I: Classification 1 & 2 82


Reduced-error pruning

Split data into training and validation set


Do until further pruning is harmful
Evaluate impact on validation set of pruning each possible node (plus those below it)
Greedily remove the one that most improves the performance on the validation set

Data Mining I: Classification 1 & 2 83


Effect of reduced-error pruning?

How the error in both training and test data evolves with the tree complexity; with and without
pruning

Data Mining I: Classification 1 & 2 84


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 85


Hypothesis space search in decision tree learning

In classification we want to learn a target function/mapping f: D → C


In case of decision trees, f is represented by a decision tree
Hypothesis space: set of possible decision trees
Search method:
hill-climbing
from simple to complex (top-down)
Only a single current hypothesis is maintained
No backtracking: split attributes are fixed
Local minima risk
Greedy approach
Evaluation function: Information gain
Batch learning: use all training data
Data Mining I: Classification 1 & 2 86
Inductive bias in decision tree learning

Inductive bias: the set of assumptions that, together with the training data, deductively justify the
classifications assigned by the learner to future instances.
What is the policy by which ID3 generalizes from observed training examples to classify unseen instances?
Inductive bias in ID3: it chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through the space of possible trees.
Shorter trees are preferred over larger trees.
Trees that place high-information-gain attributes close to the root are preferred over those that do not.
The inductive bias of ID3 follows from its search strategy (search bias or preference bias)

Data Mining I: Classification 1 & 2 87


Why prefer shorter hypothesis?

Occam's Razor: Prefer the simplest hypothesis that fits the data
Scientists seem to do that: Physicists, for example, prefer simple explanations for the motions of the
planets, over more complex explanations.
Argument:
Since there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis
that coincidentally fits the training data
In contrast there are often many very complex hypotheses that fit the current training data but fail to
generalize correctly to subsequent data.

Data Mining I: Classification 1 & 2 88


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 89


Decision tree decision boundaries

DTs partition the feature space into axis-parallel hyper-rectangles, and label each rectangle with one of the class labels.

These rectangles are called decision regions

Decision boundary: the border line between two neighboring regions of different classes
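To see these axis-parallel regions, a small plotting sketch (scikit-learn, matplotlib and the iris data are assumptions for illustration):

# Visualizing the decision regions of a depth-limited tree on two features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # two features -> a 2D feature space
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Evaluate the tree on a grid; each area of constant prediction is a decision region.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)             # axis-parallel rectangles
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("sepal length"); plt.ylabel("sepal width")
plt.show()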

Data Mining I: Classification 1 & 2 90


Decision tree decision boundaries

Data Mining I: Classification 1 & 2 91


Decision tree decision boundaries

Data Mining I: Classification 1 & 2 92


When to consider decision trees

Instances are represented by attribute-value pairs


Instances are represented by a fixed number of attributes, e.g. outlook, humidity, wind and their values, e.g.
(wind=strong, outlook =rainy, humidity=normal)
The easiest situation for a DT is when attributes take a small number of disjoint possible values, e.g. wind={strong, weak}
There are extensions for numerical attributes also, e.g. temperature, income.

The class attribute has discrete output values


Usually binary classification, e.g. {yes, no}, but also for more class values, e.g. {pos, neg, neutral}

The training data might contain errors


DTs are robust to errors: both errors in the class values of the training examples and in the attribute values of these
examples

The training data might contain missing values


DTs can be used even when some training examples have some unknown attribute values

Data Mining I: Classification 1 & 2 93


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 94


Reading material

This lecture (Decision trees) reading material:


Chapter 3: Decision tree learning, Machine Learning book by Tom Mitchell
Chapter 4: Classification, Introduction to Data Mining book by Tan et al.

Next lecture reading material:


Evaluation of classifiers: Section 4.5, Tan et al. book
Lazy learners (kNN): Section 5.2, Tan et al. book
Bayesian classifiers: Section 5.3, Tan et al. book

Data Mining I: Classification 1 & 2 95


Outline

Classification basics
Decision tree classifiers
Splitting attributes
Hypothesis space
Decision tree decision boundaries
Overfitting
Reading material
Things you should know from this lecture

Data Mining I: Classification 1 & 2 96


Outline

Decision tree learning


Measures for attribute selection
Overfitting
Pruning

Data Mining I: Classification 1 & 2 97
