DM Lec8

Decision Tree Construction

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value.
• Several choices for the splitting value:
  – Number of possible splitting values = number of distinct values.
• Each splitting value v has a count matrix associated with it:
  – Class counts in each of the partitions, A < v and A ≥ v.
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index (a brute-force sketch follows below).
  – Computationally inefficient! Repetition of work.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Figure: example binary split node "Taxable Income > 80K?" with Yes and No branches]
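As an illustration of the brute-force method described above (not part of the original slides), the sketch below rescans the whole dataset for every candidate value, which is exactly the repeated work the slide warns about. Names such as naive_best_split and gini are invented for this example; the data is the Taxable Income column and Cheat labels from the table.

```python
# Brute-force Gini split for a continuous attribute: for each candidate
# value v, rescan the whole dataset to build the count matrix A <= v vs. A > v.
from collections import Counter

def gini(counts):
    """Gini index of a class-count mapping, e.g. {'Yes': 3, 'No': 3} -> 0.5."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def naive_best_split(values, labels):
    """Try every distinct value as threshold; re-scans all records per candidate."""
    best_v, best_gini = None, float("inf")
    n = len(values)
    for v in sorted(set(values)):
        left = Counter(lab for x, lab in zip(values, labels) if x <= v)
        right = Counter(lab for x, lab in zip(values, labels) if x > v)
        weighted = (sum(left.values()) / n) * gini(left) \
                 + (sum(right.values()) / n) * gini(right)
        if weighted < best_gini:
            best_v, best_gini = v, weighted
    return best_v, best_gini

# Taxable Income values and Cheat labels from the table above
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(naive_best_split(incomes, cheat))   # -> (95, ~0.300)
```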



Continuous Attributes: Computing Gini Index...

For efficient computation, for each continuous attribute:
  – Sort the records on the attribute's values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index (see the sketch after the table below).
  – Choose the split position that has the least Gini index.

Records sorted by Taxable Income:

Cheat:          No  No  No  Yes  Yes  Yes  No   No   No   No
Sorted values:  60  70  75  85   90   95   100  120  125  220

Candidate split positions and their count matrices:

Split v | Yes ≤ v | Yes > v | No ≤ v | No > v | Gini
   55   |    0    |    3    |   0    |   7    | 0.420
   65   |    0    |    3    |   1    |   6    | 0.400
   72   |    0    |    3    |   2    |   5    | 0.375
   80   |    0    |    3    |   3    |   4    | 0.343
   87   |    1    |    2    |   3    |   4    | 0.417
   92   |    2    |    1    |   3    |   4    | 0.400
   97   |    3    |    0    |   3    |   4    | 0.300  <- lowest Gini: best split
  110   |    3    |    0    |   4    |   3    | 0.343
  122   |    3    |    0    |   5    |   2    | 0.375
  172   |    3    |    0    |   6    |   1    | 0.400
  230   |    3    |    0    |   7    |   0    | 0.420
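Below is a minimal sketch (not from the original slides) of the single-pass scan just described: sort once, then sweep the sorted values while maintaining running class counts, so each candidate split only needs a constant-time update. Function and variable names (best_split_sorted, incomes, cheat) are illustrative.

```python
# Efficient version: one sort, then a linear sweep that keeps cumulative
# class counts, evaluating a candidate split between each pair of distinct values.
from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_sorted(values, labels):
    data = sorted(zip(values, labels))            # one O(n log n) sort
    total, left, n = Counter(labels), Counter(), len(data)
    best_v, best_gini = None, float("inf")
    for i in range(n - 1):
        x, lab = data[i]
        left[lab] += 1                            # record i moves into the left partition
        if data[i + 1][0] == x:                   # only split between distinct values
            continue
        right = total - left                      # remaining records form the right partition
        w = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if w < best_gini:
            best_v, best_gini = (x + data[i + 1][0]) / 2, w
    return best_v, best_gini

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_sorted(incomes, cheat))  # -> (97.5, ~0.300), the table's minimum at split 97
```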



Measures of Node Impurity

• Gini index
• Entropy
• Misclassification error



Entropy

• Information is measured in bits.
  – Given a probability distribution, the information required to predict an event is the distribution's entropy.
  – Entropy gives the information required in bits (this can involve fractions of bits!).
• Formula for computing the entropy:

  $\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$



Entropy

• Entropy at a given node t:

  $\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Measures the homogeneity of a node.
    ◆ Maximum (log nc, where nc is the number of classes) when records are equally distributed among all classes, implying least information.
    ◆ Minimum (0.0) when all records belong to one class, implying most information.
  – Entropy-based computations are similar to the Gini index computations.
Examples for computing Entropy

$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

Node with class counts C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

Node with class counts C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

Node with class counts C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
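The small Python sketch below (not from the slides) reproduces these three worked examples; the helper name entropy is illustrative, and 0 · log(0) is treated as 0 by skipping empty classes.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node given its class counts; 0*log(0) is treated as 0."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]          # skip empty classes
    return sum(-p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

print(round(entropy([0, 6]), 2))   # 0.0  -> all records in one class
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```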



Information Gain

• Information Gain:

  $\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$

  where parent node p is split into k partitions and n_i is the number of records in partition i.

  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN); a small code sketch follows below.
  – Used in ID3 and C4.5.
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
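A minimal sketch of the GAIN_split formula, assuming class counts are already available for the parent node and for each partition. The function names are illustrative, and the quick check uses the Outlook counts from the weather-data example that follows.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)."""
    n = sum(parent_counts)
    after = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - after

# Quick check with the Outlook split of the weather data shown below:
# parent node [9 Yes, 5 No], partitions sunny [2,3], overcast [4,0], rainy [3,2]
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```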



Weather Data: Play or not Play?

Note: "Outlook" here is the weather forecast; it has no relation to the Microsoft email program.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No



Which attribute to select?



Example: attribute “Outlook”

• "Outlook" = "Sunny":
  info([2,3]) = entropy(2/5, 3/5) = – (2/5) log2(2/5) – (3/5) log2(3/5) = 0.971 bits

• "Outlook" = "Overcast":
  info([4,0]) = entropy(1, 0) = – 1 log2(1) – 0 log2(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0 · log(0) as zero.)

• "Outlook" = "Rainy":
  info([3,2]) = entropy(3/5, 2/5) = – (3/5) log2(3/5) – (2/5) log2(2/5) = 0.971 bits

• Expected information for the attribute:
  info([3,2], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing the information gain

Information gain = (information before split) – (information after split)

gain("Outlook") = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits

Information gain for the attributes of the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
Continuing to split

gain(" Humidity" ) = 0.971 bits


gain(" Temperatur e" ) = 0.571 bits

gain(" Windy" ) = 0.020 bits



The final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes.
• Splitting stops when the data can't be split any further.
Highly-branching attributes

• Problematic: attributes with a large number of values (extreme case: an ID code).
• Subsets are more likely to be pure if there is a large number of values.
• Information gain is therefore biased towards choosing attributes with a large number of values.
• This may result in overfitting (selection of an attribute that is non-optimal for prediction).



Weather Data with ID code

ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No
Split for ID Code Attribute

• Entropy of the split = 0 (since each leaf node is "pure", having only one case).
• Information gain is therefore maximal for the ID code: 0.940 bits, the full entropy of the parent node.
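A short sketch (illustrative, not from the slides) of why the ID code wins: every ID value isolates a single record, so the expected entropy after the split is 0 and the gain equals the parent's full 0.940 bits.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

parent = [9, 5]                       # 9 "Yes" / 5 "No" at the root
id_partitions = [[1]] * 14            # each ID value isolates exactly one record
split_entropy = sum(1 / 14 * entropy(p) for p in id_partitions)
print(split_entropy)                                   # 0.0
print(f"{entropy(parent) - split_entropy:.3f}")        # 0.940 -> larger than any real attribute's gain
```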
