DMML Unit 2
Generating C3 from L2 (continuing the example):
L2:  itemset   sup
     {B, C}    2
     {B, E}    3
     {C, E}    2
Join step: {B, C} joined with {B, E} gives the candidate {B, C, E}; every 2-item subset of {B, C, E} ({B, C}, {B, E}, {C, E}) is in L2, so the candidate survives pruning.
C3:  {B, C, E}
3rd scan of D
L3:  itemset     sup
     {B, C, E}   2
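The join-and-prune step above can be sketched as follows (Python; the name `apriori_gen` and the encoding of itemsets as sets are illustrative assumptions, not part of the slides):

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join frequent (k-1)-itemsets that agree on their first k-2 items,
    then prune candidates that contain an infrequent (k-1)-subset."""
    prev = sorted(sorted(s) for s in prev_frequent)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:                      # join condition
                cand = sorted(set(a) | set(b))
                # prune: every (k-1)-subset must already be frequent
                if all(list(sub) in prev for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates

L2 = [{"B", "C"}, {"B", "E"}, {"C", "E"}]
print(apriori_gen(L2, 3))    # [['B', 'C', 'E']]
```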
The Apriori Algorithm — Example
Database D (minimum support count = 2):
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D -> C1:                 L1:
  itemset   sup                 itemset   sup
  {1}       2                   {1}       2
  {2}       3                   {2}       3
  {3}       3                   {3}       3
  {4}       1                   {5}       3
  {5}       3

C2 (from L1), scan D:         L2:
  itemset   sup                 itemset   sup
  {1 2}     1                   {1 3}     2
  {1 3}     2                   {2 3}     2
  {1 5}     1                   {2 5}     3
  {2 3}     2                   {3 5}     2
  {2 5}     3
  {3 5}     2
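The whole generate-and-count loop on this toy database can be sketched as follows (Python; the dictionary encoding of D and the helper names are assumptions made for illustration):

```python
def count_support(candidates, D):
    """One scan of D: count the transactions containing each candidate itemset."""
    return {c: sum(1 for t in D.values() if set(c) <= t) for c in candidates}

# Transaction database from the example, minimum support count = 2
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MIN_SUP = 2

items = sorted({i for t in D.values() for i in t})
C1 = [(i,) for i in items]
levels = [{c: s for c, s in count_support(C1, D).items() if s >= MIN_SUP}]   # L1

k = 2
while True:
    prev = list(levels[-1])
    # naive candidate generation: unions of two frequent (k-1)-itemsets of size k
    Ck = sorted({tuple(sorted(set(a) | set(b)))
                 for a in prev for b in prev if len(set(a) | set(b)) == k})
    Lk = {c: s for c, s in count_support(Ck, D).items() if s >= MIN_SUP}
    if not Lk:
        break
    levels.append(Lk)
    k += 1

for n, Ln in enumerate(levels, start=1):
    print(f"L{n}: {Ln}")
# L1: {(1,): 2, (2,): 3, (3,): 3, (5,): 3}
# L2: {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
# L3: {(2, 3, 5): 2}
```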
Drawbacks of Apriori Algorithm
Apriori can generate a very large number of candidate itemsets, and it needs a full scan of the database on every pass (one scan per candidate length), which is expensive for large databases.
DECISION TREES
Example of a Decision Tree

Refund?
  Yes: NO
  No:  MarSt?
         Single, Divorced: TaxInc?
                             < 80K: NO
                             > 80K: YES
         Married: NO

(Splitting attributes: Refund, MarSt, TaxInc; the leaf labels NO/YES are values of the class attribute Cheat.)
Apply Model to Test Data

Test Data:
  Refund   Marital Status   Taxable Income   Cheat
  No       Married          80K              ?

Traversing the tree for this record: Refund = No, so test MarSt; MarSt = Married, so the record reaches the leaf NO.
Assign Cheat to “No”.
Each node represents a test on an attribute of the instance to be classified, and each outgoing arc a possible outcome, leading to a further test.
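As a concrete rendering, the example tree above corresponds to nested attribute tests; a small illustrative Python sketch (the function name and encodings are assumptions, not from the slides):

```python
def classify(refund, marital_status, taxable_income):
    """Each 'if' is a node test; each branch is an outgoing arc of the tree above."""
    if refund == "Yes":
        return "NO"
    # Refund = No: test marital status
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: test taxable income
    return "NO" if taxable_income < 80_000 else "YES"

# Test record from the slides: Refund = No, Married, Taxable Income = 80K
print(classify("No", "Married", 80_000))    # NO -> assign Cheat to "No"
```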
When to use Decision Trees
Problem characteristics:
Instances can be described by attribute-value pairs
Top-down induction of Decision Trees
ID3 (Quinlan, 1986) is a basic algorithm for learning decision trees (DTs).
Given a training set of examples, the algorithm for building the DT performs a search in the space of decision trees.
The construction of the tree is top-down, and the algorithm is greedy.
The fundamental question is: “which attribute should be tested next? Which question gives us the most information?”
Select the best attribute.
A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to that value.
The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left.
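A minimal sketch of this greedy, top-down procedure (Python; function names such as `id3` and `info_gain`, and the dictionary encoding of examples, are illustrative assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on attr."""
    total = entropy([e[target] for e in examples])
    n = len(examples)
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

def id3(examples, attributes, target):
    """Greedy top-down induction: pick the best attribute, split, recurse."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # all examples classified correctly
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```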
Which attribute is the best classifier?
Entropy in binary classification
For a collection S of positive and negative examples, with p+ the proportion of positive examples and p− the proportion of negative examples:
Entropy(S) = − p+ log2 p+ − p− log2 p−
Entropy
More generally, for a target attribute with c classes, where pi is the proportion of examples of class i:
Entropy(S) = Σ (i = 1..c) − pi log2 pi
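A quick numerical check of the binary case (Python; an illustrative snippet, not part of the slides):

```python
import math

def entropy_binary(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy_binary(9, 5))    # ~0.940, the S = [9+, 5-] collection used below
print(entropy_binary(7, 7))    # 1.0, maximally impure
print(entropy_binary(14, 0))   # 0.0, pure
```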
Example: expected information gain
Let
Values(Wind) = {Weak, Strong}
S = [9+, 5−]
SWeak = [6+, 2−]
SStrong = [3+, 3−]
Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − (8/14) Entropy(SWeak) − (6/14) Entropy(SStrong)
             = 0.94 − (8/14) · 0.811 − (6/14) · 1.00
             = 0.048
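The same computation in code (Python; a self-contained illustrative snippet):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)

# S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))    # 0.048
```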
Which attribute is the best classifier?
First step: which attribute to test at the root?
The information gain is computed for every attribute. Gain(S, Outlook) = 0.246 is higher than the gain of any other attribute (compare Gain(S, Wind) = 0.048 computed above), so Outlook is selected as the root attribute.
Second and third steps
The process is repeated on each branch below the root; the resulting partitions and their leaf labels are:
{D1, D2, D8} → No,  {D9, D11} → Yes,  {D4, D5, D10} → Yes,  {D6, D14} → No