Business Analytics & Machine Learning: Decision Tree Classifiers
• Introduction
• Regression Analysis
• Regression Diagnostics
• Logistic and Poisson Regression
• Naive Bayes and Bayesian Networks
• Decision Tree Classifiers
• Data Preparation and Causal Inference
• Dimensionality Reduction
• Association Rules and Recommenders
• Convex Optimization
• Neural Networks
[Figure: course overview diagram of supervised learning methods (linear, logistic, Poisson, and lasso regression; naive Bayes; PCR; ensemble methods; neural networks), grouped into regression and classification]
2
Recommended Literature
• Data Mining: Practical Machine Learning Tools and
Techniques
− Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher Pal
− http://www.cs.waikato.ac.nz/ml/weka/book.html
− Section: 4.3, 6.1
Alternative literature:
• Machine Learning
− Tom M. Mitchell, 1997
3
Agenda for Today
• Choosing a splitting attribute in decision trees
− Information gain
− Gain ratio
− Gini index
• Numeric attributes
• Missing values
• Relational rules
4
The Weather Data
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
5
Example Tree for “Play?”
Outlook:
• sunny → Humidity (high → No, normal → Yes)
• overcast → Yes
• rainy → Windy (true → No, false → Yes)
6
Regression Tree
[Figure: regression tree splitting first on CHMIN (≤ 7.5 / > 7.5), then on CACH and MMAX]
• High accuracy
• Large and possibly awkward
7
Model Trees
[Figure: model tree with the same structure — CHMIN (≤ 7.5 / > 7.5), then CACH and MMAX — but with linear models at the leaves]
At each node, one attribute is chosen to split training examples into distinct classes
as much as possible.
9
Building a Decision Tree
Top-down tree construction
• At start, all training examples are at the root
• Partition the examples recursively by choosing one attribute each time
10
Choosing the Splitting Attribute
At each node, available attributes are evaluated on the basis of separating the
classes of the training examples.
11
Which Attribute to Select?
12
A Criterion for Attribute Selection
Which is the best attribute?
• The one which will result in the smallest tree.
• Heuristic: choose the attribute that produces the “purest” nodes.
13
Example: attribute “Outlook”
“Outlook” = “Sunny”:
info([2,3]) = entropy(2/5, 3/5) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971 bits
14
Computing the Information Gain
Information gain: (information before split) – (information after split)
gain(Outlook) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits
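A minimal Python check of these numbers (the helper name info and the rounding are mine, not from the slides):

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as a list of counts, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(f"{info([9, 5]):.3f}")                       # 0.940  (info before the split)

subsets = [[2, 3], [4, 0], [3, 2]]                 # sunny, overcast, rainy
n = sum(sum(s) for s in subsets)
after = sum(sum(s) / n * info(s) for s in subsets)
print(f"{after:.3f}")                              # 0.694  (the slide rounds this to 0.693)
print(f"{info([9, 5]) - after:.3f}")               # 0.247  (information gain)
```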
15
Computing Information
Information is measured in bits
• given a probability distribution, the info required to encode/predict an event.
• entropy gives the information required in bits (this can involve fractions of bits!)
Note: In the exercises, we will use the natural logarithm for convenience (easier with calculators in the exam), which turns “bits” into “nits”.
16
Expected Information Gain
gain(S, a) = entropy(S) − Σ_{v ∈ Values(a)} (|S_v| / |S|) · entropy(S_v)
where S_v = {s ∈ S : a(s) = v} and Values(a) is the set of all possible values of attribute a.
Problems?
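A sketch of gain(S, a) applied to the full weather data from the table above (the Python representation and function names are my own):

```python
from math import log2
from collections import Counter

# Weather data: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Sunny", "Hot", "High", False, "No"),     ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),  ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Windy": 3}

def entropy(rows):
    """Class entropy of a set of rows; the class label is the last element."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(rows, attr):
    """Expected information gain of splitting rows on the given attribute."""
    i = ATTRS[attr]
    after = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

for a in ATTRS:
    print(a, f"{gain(data, a):.3f}")   # Outlook 0.247, Temp 0.029, Humidity 0.152, Windy 0.048
```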
17
Wish List for a Purity Measure
Properties we require from a purity measure:
• When a node is pure, the measure should be zero.
• When impurity is maximal (i.e., all n classes are equally likely, where n is the number of classes), the measure should be maximal (e.g., 1 for boolean values).
• Multistage property: info([2,3,4]) = info([2,7]) + (7/9) · info([3,4])
18
The Weather Data
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
19
Continuing to Split
20
The Final Decision Tree
21
Claude Shannon
Born: 30 April 1916, Died: 23 February 2001
Shannon is famous for having founded
information theory with one landmark paper
published in 1948 (A Mathematical Theory of
Communication).
Information theory was developed to find fundamental limits on compressing and reliably
communicating data. Communication over a channel was the primary motivation. Channels
(such as a phone line) often fail to produce exact reconstruction of a signal; noise, periods of
silence, and other forms of signal corruption often degrade quality. How much information can
one hope to communicate over a noisy (or otherwise imperfect) channel?
An important application of information theory is coding theory:
• Data compression (by removing redundancy in data)
• Error-correcting codes add just the right kind of redundancy (i.e. error correction) needed to
transmit the data efficiently and faithfully across a noisy channel.
22
Background on Entropy and Information Theory
• Suppose there are n possible states or messages.
• If the messages are equally likely, then the probability of each message is p = 1/n, i.e., n = 1/p.
• The information (number of bits) of a message is
  log₂(n) = log₂(1/p) = −log₂(p)
• Example: with 16 possible messages, the information is log₂(16) = 4, and we require 4 bits for each message.
• If the following probability distribution is given:
  P = (p₁, p₂, …, pₙ)
  then the information (or entropy of P) conveyed by the distribution can be computed as follows:
  I(P) = −(p₁ · log(p₁) + p₂ · log(p₂) + … + pₙ · log(pₙ)) = −Σᵢ pᵢ · log(pᵢ) = Σᵢ pᵢ · log(1/pᵢ)
23
Entropy
• Example: P = (75% sun, 25% rain)
  I(P) = −(p₁ · log₂(p₁) + p₂ · log₂(p₂)) = −(0.75 · log₂(0.75) + 0.25 · log₂(0.25)) = 0.81 bits
• Example: 8 equally likely states
  I(P) = −Σᵢ pᵢ · log₂(pᵢ) = −8 · (1/8 · (−3)) = 3 bits
24
Detour: Cross-Entropy
Suppose you have a classifier that produces a discrete probability distribution, and you have
the true underlying distribution. For example, the probabilities for high, medium, and low risk of
a customer might be:
Outcome distribution q    Ground truth p
0.7                       1
0.2                       0
0.1                       0
The cross-entropy between the two distributions p and q is used to quantify the difference between these two distributions:
H(p, q) = −Σᵢ pᵢ · log(qᵢ)
25
Detour: KL-Divergence (or Relative Entropy)
The Kullback-Leibler-Divergence between two distributions p and q is a measure of how
one probability distribution is different from another. If KL-divergence is 0, there is no
difference.
KL-divergence is defined as the difference between the cross-entropy and the entropy:
D_KL(p ‖ q) = H(p, q) − H(p) = Σᵢ pᵢ · log(pᵢ / qᵢ)
Both cross-entropy and KL-divergence are used as loss functions in machine learning.
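A small numeric check using the risk example from the previous slide, with the natural logarithm as in the exercises (function names are mine):

```python
from math import log

p = [1.0, 0.0, 0.0]   # ground truth
q = [0.7, 0.2, 0.1]   # classifier output

def cross_entropy(p, q):
    # terms with p_i = 0 contribute nothing (0 * log q_i is taken as 0)
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * log(pi) for pi in p if pi > 0)

H_pq = cross_entropy(p, q)      # ≈ 0.357 nits
kl = H_pq - entropy(p)          # entropy(p) = 0 here, so KL ≈ 0.357 nits
print(f"H(p,q) = {H_pq:.3f}, KL(p||q) = {kl:.3f}")
```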
26
Which splitting attribute would be selected in the following example?
27
Split for ID Code Attribute
Entropy of the split = 0, since each leaf node is “pure”, containing only one case.
28
Highly-Branching Attributes
Problematic: attributes with a large number of values
(extreme case: ID code)
29
Gain Ratio
Gain ratio: a modification of the information gain that reduces its bias toward attributes with many values (high branching).
Gain ratio takes number and size of branches into account when choosing an
attribute.
It corrects the information gain by taking the intrinsic information of a split into
account (i.e. how much info do we need to tell which branch an instance belongs to).
30
Computing the Gain Ratio
Example: intrinsic information for the ID code attribute
intrinsic_info([1,1,…,1]) = 14 × (−1/14 × log₂(1/14)) = 3.807 bits
Gain ratio: gainRatio(S, a) = gain(S, a) / intrinsic_info(S, a)
Example: gainRatio(ID_Code) = 0.940 bits / 3.807 bits = 0.246
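A quick check of these numbers in Python (the helper intrinsic_info is my own name; the gain values 0.940 and 0.247 are taken from the earlier slides):

```python
from math import log2

def intrinsic_info(branch_sizes):
    """Split information: entropy of the branch-size distribution, in bits."""
    n = sum(branch_sizes)
    return -sum(s / n * log2(s / n) for s in branch_sizes if s > 0)

# ID code: 14 branches with one instance each
print(f"{intrinsic_info([1] * 14):.3f}")           # 3.807
print(f"{0.940 / intrinsic_info([1] * 14):.3f}")   # 0.247 (the slide rounds to 0.246)

# Outlook: branches of size 5, 4, 5
print(f"{intrinsic_info([5, 4, 5]):.3f}")          # 1.577
print(f"{0.247 / intrinsic_info([5, 4, 5]):.3f}")  # 0.157, matching the slide's 0.156 up to rounding
```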
31
Gain Ratios for Weather Data
Attribute     Info    Gain                     Split info              Gain ratio
Outlook       0.693   0.940 − 0.693 = 0.247    info([5,4,5]) = 1.577   0.247 / 1.577 = 0.156
Temperature   0.911   0.940 − 0.911 = 0.029    info([4,6,4]) = 1.557   0.029 / 1.557 = 0.019
Humidity      0.788   0.940 − 0.788 = 0.152    info([7,7]) = 1.000     0.152 / 1.000 = 0.152
Windy         0.892   0.940 − 0.892 = 0.048    info([8,6]) = 0.985     0.048 / 0.985 = 0.049
32
More on the Gain Ratio
“Outlook” still comes out top.
However:
• “ID code” still has a greater gain ratio (0.246).
• Standard fix: ad hoc test to prevent splitting on that type of attribute.
33
The Splitting Criterion in CART
• Classification and Regression Tree (CART)
• developed 1974-1984 by 4 statistics professors
− Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley),
Richard Olshen (Stanford)
• Gini Index is used as a splitting criterion
• both C4.5 and CART are robust tools
• no method is always superior – experiment!
34
Gini Index for 2 Attribute Values
For example, consider two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements. The relative frequencies of the positive and negative class are:
P = p / (p + n)
N = n / (p + n)
The Gini index of S is then Gini(S) = 1 − P² − N².
35
Example
Split by attribute I or II?
[Figure: two candidate splits (attribute I and attribute II) of a dataset containing 400 instances of class A and 400 instances of class B; the resulting child nodes are evaluated in the table on the Gini Index Example slide]
36
Example
Gini(p) = 1 − Σⱼ pⱼ²
Select the split that decreases the Gini Index most. This is done over all possible
places for a split and all possible variables to split.
37
Gini Index Example
Split by attribute I:
Cases (A, B)   pA, pB       pA², pB²         Gini = 1 − pA² − pB²   Weighted (|node|/|S| × Gini)
(300, 100)     0.75, 0.25   0.5625, 0.0625   0.375                  0.5 × 0.375 = 0.1875
(100, 300)     0.25, 0.75   0.0625, 0.5625   0.375                  0.5 × 0.375 = 0.1875
Total: 0.375

Split by attribute II:
(200, 400)     0.33, 0.67   0.1111, 0.4444   0.4444                 0.75 × 0.4444 = 0.3333
(200, 0)       1, 0         1, 0             0                      0.25 × 0 = 0
Total: 0.3333
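The same comparison in a few lines of Python (node counts from the table above; function names are mine):

```python
def gini(counts):
    """Gini index 1 - sum_j p_j^2 for a node given as class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(split):
    """Size-weighted Gini index of a candidate split (list of child nodes)."""
    n = sum(sum(node) for node in split)
    return sum(sum(node) / n * gini(node) for node in split)

split_I  = [[300, 100], [100, 300]]   # (A, B) counts per child node
split_II = [[200, 400], [200, 0]]

print(f"{weighted_gini(split_I):.4f}")   # 0.3750
print(f"{weighted_gini(split_II):.4f}")  # 0.3333 -> attribute II is preferred
```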
38
Generate_DT(samples, attribute_list)
• Create a new node N.
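A minimal sketch of the standard top-down construction that Generate_DT follows, assuming nominal attributes and information gain as the splitting criterion (the code and names are my own, not the slide's exact pseudocode):

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Class entropy of a list of rows; the class label is the last element of each row."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def generate_dt(samples, attribute_list):
    """Return a nested dict {attribute_index: {value: subtree_or_class_label}}."""
    labels = [r[-1] for r in samples]
    if len(set(labels)) == 1:                 # node is pure -> leaf with that class
        return labels[0]
    if not attribute_list:                    # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        after = 0.0
        for v in {r[a] for r in samples}:
            subset = [r for r in samples if r[a] == v]
            after += len(subset) / len(samples) * entropy(subset)
        return entropy(samples) - after

    best = max(attribute_list, key=gain)      # splitting attribute with the highest gain
    node = {best: {}}
    for v in {r[best] for r in samples}:
        subset = [r for r in samples if r[best] == v]
        remaining = [a for a in attribute_list if a != best]
        node[best][v] = generate_dt(subset, remaining)
    return node

# Usage, e.g. with the weather data (attribute indices 0..3, class in the last column):
# tree = generate_dt(weather_rows, [0, 1, 2, 3])
```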
39
Time Complexity of Basic Algorithm
• Let m be the number of attributes and n the number of training instances.
• For each level of the tree, all n instances are considered.
  − 𝒪(n log n) work for a single attribute over the entire tree
• Total cost is 𝒪(m · n · log n), since all attributes are eventually considered.
  − without pruning (see next class)
40
Scalability of DT Algorithms
• need to design for large amounts of data
− some new algorithms do not require all the data to be memory resident
41
C4.5 History
The above procedure is the basis for Ross Quinlan's ID3 algorithm (which so far works only for nominal attributes).
• ID3, CHAID – 1960s
The algorithm was improved and is now most widely used as C4.5 or C5.0, available in most DM software packages.
• Commercial successor: C5.0
Witten et al. write “a landmark decision tree program that is probably the machine
learning workhorse most widely used in practice to date”.
42
C4.5 An Industrial-Strength Algorithm
For an algorithm to be useful in a wide range of real-world applications it must:
• permit numeric attributes
• allow missing values
• be robust in the presence of noise
43
How would you deal with missing values in the training data?
44
Outline for today
• Choosing a splitting attribute in decision trees
− Information gain
− Gain ratio
− Gini index
• Numeric attributes
• Missing values
• Relational rules
45
Numeric Attributes
Unlike nominal attributes, numeric attributes have many possible split points.
• Standard method: binary splits
• E.g. temp < 45
Numerical attributes can be used several times in a decision tree, nominal attributes
only once.
46
Example
Split on temperature attribute:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
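A sketch of how the best binary split point for this attribute could be found by scoring the midpoint between each pair of adjacent distinct values (values and labels from the slide; helper names are mine):

```python
from math import log2
from collections import Counter

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

def entropy(ls):
    n = len(ls)
    return -sum(c / n * log2(c / n) for c in Counter(ls).values())

def split_gain(threshold):
    """Information gain of the binary split temp < threshold vs. temp >= threshold."""
    left  = [l for t, l in zip(temps, labels) if t < threshold]
    right = [l for t, l in zip(temps, labels) if t >= threshold]
    after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - after

# candidate thresholds: midpoints between adjacent distinct values
values = sorted(set(temps))
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = max(candidates, key=split_gain)
print(best, f"{split_gain(best):.3f}")   # best threshold and its gain; here 84.0 with ≈ 0.113 bits
```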
47
Binary Splits on Numeric Attributes
Splitting (multi-way) on a nominal attribute exhausts all information in that attribute.
• nominal attribute is tested (at most) once on any path in the tree
Remedy:
• pre-discretize numeric attributes, or
• allow for multi-way splits instead of binary ones using the Information Gain
criterion
48
Outline for today
• Choosing a splitting attribute in decision trees
− Information gain
− Gain ratio
− Gini index
• Numeric attributes
• Missing values
• Relational rules
49
Handling Missing Values / Training
Ignore instances with missing values.
• pretty harsh, and the missing value might not be important
50
Handling Missing Values / Classification
Follow the leader.
• an instance with a missing value for a tested attribute (temp) is sent down the
branch with the most instances
Temp: < 75 → branch with 5 instances; ≥ 75 → branch with 3 instances
51
Handling Missing Values / Classification
“Partition” the instance.
• branches show # of instances
• Send down parts of the instance (e.g. 3/8 on Windy and 5/8 on Sunny)
proportional to the number of training instances
• Resulting leaf nodes get weighted in the result
[Figure: Outlook node — 5 training instances go down the left branch (to a Temp. subtree) and 3 down the right branch (to a Windy subtree); an instance with a missing Outlook value is sent down both branches with weights 5/8 and 3/8]
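A minimal sketch of this weighting scheme with the 5/8 and 3/8 fractions from the slide; the class distributions returned by the two subtrees are assumed purely for illustration:

```python
# A test instance with a missing Outlook value is sent down both branches,
# weighted by the fraction of training instances in each branch (5/8 and 3/8).
branch_weight = {"Temp.": 5 / 8, "Windy": 3 / 8}
subtree_distribution = {
    "Temp.": {"Yes": 0.8, "No": 0.2},      # assumed for illustration, not from the slide
    "Windy": {"Yes": 1 / 3, "No": 2 / 3},  # assumed for illustration, not from the slide
}

combined = {"Yes": 0.0, "No": 0.0}
for branch, w in branch_weight.items():
    for cls, p in subtree_distribution[branch].items():
        combined[cls] += w * p             # weight each subtree's answer by its share

for cls, p in combined.items():
    print(cls, f"{p:.3f}")   # Yes 0.625, No 0.375 -> predict "Yes"
```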
52
Overfitting
Two sources of abnormalities
• noise (randomness)
• outliers (measurement errors)
53
Decision Trees - Summary
Decision trees are a classification technique.
The output of decision trees can be used for descriptive as well as predictive
purposes.
54
Outline for today
• Choosing a splitting attribute in decision trees
− Information gain
− Gain ratio
− Gini index
• Numeric attributes
• Missing values
• Relational rules
55
The Shapes Problem
Shaded=standing
Unshaded=lying
56
Instances
Width Height Sides Class
2 4 4 Standing
3 6 4 Standing
4 3 4 Lying
7 8 3 Standing
7 6 3 Lying
2 9 4 Standing
9 1 4 Lying
10 2 3 Lying
57
Classification Rules
If width ≥ 3.5 and height < 7.0 then lying.
If height ≥ 3.5 then standing.
These rules classify the given instances well ... but not necessarily new ones.
Problems?
58
Relational Rules
If width > height then lying
If height > width then standing
As a workaround for some cases, one can introduce additional attributes describing, e.g., whether width > height (see the sketch below).
• allows using conventional “propositional” learners
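A tiny sketch of that workaround on the shapes instances above (the tuple layout and names are mine):

```python
# Shapes data: (width, height, sides, class)
shapes = [
    (2, 4, 4, "Standing"), (3, 6, 4, "Standing"), (4, 3, 4, "Lying"),
    (7, 8, 3, "Standing"), (7, 6, 3, "Lying"),    (2, 9, 4, "Standing"),
    (9, 1, 4, "Lying"),    (10, 2, 3, "Lying"),
]

# Derived propositional attribute: is the shape wider than it is tall?
augmented = [(w, h, s, w > h, cls) for (w, h, s, cls) in shapes]

for row in augmented:
    print(row)   # the new boolean column separates Lying from Standing perfectly
```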
59
Propositional Logic
Essentially, decision trees can represent any function in propositional logic.
• A, B, C: propositional variables
• and, or, not, => (implies), <=> (equivalent): connectives
60