Classification: Introduction to Decision Trees

[Figure: using the model in prediction. A classification algorithm learns a classifier from the training data; the classifier is then evaluated on testing data and applied to unseen data.]

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning
• Supervised learning (classification): the training data are accompanied by class labels, and new data is classified based on the model learned from the training set
• Unsupervised learning (clustering): the class labels of the training data are unknown; the goal is to establish the existence of classes or clusters in the data
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no samples left
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
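As a concrete illustration of this greedy procedure, here is a minimal sketch in Python, assuming categorical attributes stored as dictionaries and information gain as the selection heuristic; the function and variable names are illustrative and not taken from the slides.

import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple drawn from this label list."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(attr) = Info(D) - Info_attr(D) for a categorical attribute."""
    n = len(labels)
    info_a = 0.0
    for value in set(row[attr] for row in rows):
        part = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        info_a += len(part) / n * entropy(part)
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    """Greedy, top-down, recursive divide-and-conquer induction."""
    if len(set(labels)) == 1:                 # all samples in one class: leaf
        return labels[0]
    if not attrs:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node

Applied to the 14-tuple buys_computer table shown later in these notes, this sketch should reproduce the age-rooted tree pictured below.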
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
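The three formulas translate directly into code. The following sketch (illustrative, not from the slides) works on class counts and partition counts; the example numbers are the 9/5 class split of the buys_computer data that appears later in these notes, partitioned by age.

import math

def info(counts):
    """Info(D) = -sum_i p_i log2 p_i, with p_i estimated as |C_i,D| / |D|."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_after_split(partitions):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j); each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_after_split(partitions)

# Example: 9 "yes" and 5 "no" tuples, split by age into (2,3), (4,0), (3,2)
print(round(info([9, 5]), 3))                            # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247 (the slides round intermediate values and report 0.246)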
Attribute Selection: Information Gain

For the 14-tuple buys_computer training set (9 "yes" and 5 "no" tuples; the full table appears with the Gini computation below):
    Info(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940
Among the four attributes, age yields the largest gain, Gain(age) = 0.246, so it is chosen as the splitting attribute at the root.
Output: A Decision Tree for “buys_computer”

[Figure: decision tree rooted at age? with branches <=30, 31..40, and >40; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch is a "yes" leaf, and the >40 branch tests credit_rating? (excellent → no, fair → yes).]
Reference values of log2(numerator/denominator):

log2(n/d) | n=1   | n=2   | n=3   | n=4   | n=5   | n=6
d=2       | -1.00 |  0.00 |       |       |       |
d=3       | -1.60 | -0.58 |  0.00 |       |       |
d=4       | -2.00 | -1.00 | -0.42 |  0.00 |       |
d=5       | -2.32 | -1.32 | -0.74 | -0.32 |  0.00 |
d=6       | -2.58 | -1.60 | -1.00 | -0.58 | -0.26 |  0.00
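To check the hand calculation, the sketch below (not part of the slides) recomputes the information gain of every attribute from the buys_computer table that accompanies the Gini computation further on:

import math
from collections import Counter

# The 14 buys_computer tuples (age, income, student, credit_rating, class),
# transcribed from the table shown with the Gini computation below.
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, attr in enumerate(attrs):
    info_a = 0.0
    for value in set(row[i] for row in data):
        part = [row[-1] for row in data if row[i] == value]
        info_a += len(part) / len(data) * entropy(part)
    print(f"Gain({attr}) = {entropy(labels) - info_a:.3f}")
# age has the highest gain (about 0.247), matching the tree rooted at age? above.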
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values, such as a Student id attribute:
Student id | Gender | Rating    | Class label
1          | M      | Excellent | Y
2          | M      | Fair      | N
3          | F      | Fair      | Y
4          | F      | Good      | Y
5          | M      | Fair      | N
6          | F      | Excellent | N
[Figure: splitting on Student id creates six branches (ids 1-6), each containing a single tuple (Y, N, Y, Y, N, N); every partition is pure, so Info_{Student id}(D) = 0 and the gain is maximal even though the attribute is useless for classifying new students.]
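A quick sketch (assumed column order: id, Gender, Rating, class; not part of the slides) makes the bias concrete:

import math
from collections import Counter

# Toy table from above: (student_id, gender, rating, class)
rows = [(1, "M", "Excellent", "Y"), (2, "M", "Fair", "N"), (3, "F", "Fair", "Y"),
        (4, "F", "Good", "Y"), (5, "M", "Fair", "N"), (6, "F", "Excellent", "N")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    info_a = 0.0
    for v in set(r[col] for r in rows):
        part = [r[-1] for r in rows if r[col] == v]
        info_a += len(part) / len(rows) * entropy(part)
    return entropy(labels) - info_a

# Student id produces six pure single-tuple partitions, so Info_id(D) = 0 and its
# gain equals Info(D) = 1 bit, the maximum possible, although the attribute is
# useless for predicting new students.
print(gain(0), gain(1), gain(2))   # id: 1.0, gender: ~0.08, rating: ~0.21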
Gain Ratio for Attribute Selection (C4.5)
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
– The gain ratio normalizes the information gain by the number and size of the partitions (child nodes) into which an attribute splits the data set.
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
– GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. The attribute income splits the 14 tuples into partitions of size 4 (High), 6 (Medium), and 4 (Low):
    SplitInfo_income(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557
    GainRatio(income) = 0.029 / 1.557 = 0.019
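The same numbers in code form (a sketch, with the partition sizes taken from the income distribution above):

import math

def split_info(sizes):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|), from the partition sizes."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(sizes)

# income splits the 14 tuples into partitions of size 4 (high), 6 (medium), 4 (low)
print(round(split_info([4, 6, 4]), 3))         # 1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # 0.019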
Gain Ratio for Attribute Selection (C4.5)
• Unfortunately, in some situations the gain ratio modification overcompensates and can lead to preferring an attribute just because its intrinsic information (SplitInfo) is much lower than for the other attributes.
– GainRatio(A) = Gain(A) / SplitInfo_A(D), with SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
– Ex. A1 splits the data 99/1 (High: 99, Low: 1), so SplitInfo_{A1} ≈ 0.08; A2 splits it 50/50 (High: 50, Low: 50), so SplitInfo_{A2} = 1. Dividing by such a small SplitInfo inflates A1's gain ratio even if its information gain is modest.
• A standard fix is to choose the attribute that maximizes the gain
ratio, provided that the information gain for that attribute is at
least as great as the average information gain for all the
attributes examined.
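One way to read that rule as code (a sketch with made-up gain and gain-ratio numbers; select_attribute is a hypothetical helper, not something defined in the slides):

def select_attribute(stats):
    """stats: dict attr -> (gain, gain_ratio).  Pick the attribute with the best
    gain ratio, but only among attributes whose gain is at least average."""
    avg_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = {a: gr for a, (g, gr) in stats.items() if g >= avg_gain}
    return max(candidates, key=candidates.get)

# Hypothetical numbers: A1 has a tiny gain but a large ratio (its SplitInfo is near 0);
# the average-gain filter keeps it from being chosen over A2.
print(select_attribute({"A1": (0.01, 0.13), "A2": (0.20, 0.20), "A3": (0.15, 0.16)}))  # A2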
Gini Index (CART, IBM IntelligentMiner)
The Gini index of a data set D is gini(D) = 1 - \sum_j p_j^2. For a binary split of D into D_1 and D_2 on attribute A,
    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2),
and the reduction in impurity is \Delta gini(A) = gini(D) - gini_A(D).

Ex. Splitting the buys_computer data on student:

student | p_i (# yes) | n_i (# no) | gini
yes     | 6           | 1          | 1 - (6/7)^2 - (1/7)^2 = 0.246
no      | 3           | 4          | 1 - (3/7)^2 - (4/7)^2 = 0.490

    gini_student(D) = (7/14)(0.246) + (7/14)(0.490) = 0.368
    \Delta gini(student) = 0.459 - 0.368 = 0.091
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
    gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
The training data:

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Suppose income partitions D into D_1 = {low, medium} (10 tuples) and D_2 = {high} (4 tuples):
    gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = 0.443
The other binary groupings give
    {(low, high), medium}: 0.458
    {(medium, high), low}: 0.450
so {low, medium} versus {high} is the best binary split on income.
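The binary-split search can be written directly from the per-value class counts in the table (a sketch, not the slides' code):

from itertools import combinations

# buys_computer class counts per income value, taken from the table above.
counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}   # (yes, no)

def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(group):
    """Weighted Gini index of the binary split {group} vs. the remaining values."""
    other = [v for v in counts if v not in group]
    splits = []
    for side in (group, other):
        yes = sum(counts[v][0] for v in side)
        no = sum(counts[v][1] for v in side)
        splits.append((yes + no, gini(yes, no)))
    total = sum(n for n, _ in splits)
    return sum(n / total * g for n, g in splits)

for group in combinations(counts, 2):            # every 2-vs-1 grouping of income
    print(set(group), round(gini_split(group), 3))
# {low, medium} vs {high} gives ~0.443, the smallest, so it is the best binary
# split on income; gini(D) itself is 1 - (9/14)^2 - (5/14)^2 ~ 0.459.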
income | age   | rating | class
l      | 20-29 | good   | T
l      | 30-39 | good   | F
l      | 40-49 | good   | F
m      | 20-29 | fair   | F
m      | 30-39 | fair   | F
m      | 40-49 | fair   | T
h      | 40-49 | good   | F
h      | 20-29 | good   | T
h      | 30-39 | fair   | T
h      | 40-49 | fair   | T
Comparing Attribute Selection Measures
• The three measures generally return good results, but each has a bias:
– Information gain is biased towards multi-valued attributes
– Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index is biased towards multi-valued attributes and tends to favour tests that result in equal-sized, pure partitions
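As an exercise, the small table above can be used to compare the three measures. The sketch below assumes the columns are income level, age group, and rating; the source table does not name them.

import math
from collections import Counter

# Ten tuples from the small table above; column meanings are an assumption.
rows = [("l", "20-29", "good", "T"), ("l", "30-39", "good", "F"), ("l", "40-49", "good", "F"),
        ("m", "20-29", "fair", "F"), ("m", "30-39", "fair", "F"), ("m", "40-49", "fair", "T"),
        ("h", "40-49", "good", "F"), ("h", "20-29", "good", "T"), ("h", "30-39", "fair", "T"),
        ("h", "40-49", "fair", "T")]
attrs = {"income": 0, "age": 1, "rating": 2}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = [r[-1] for r in rows]
for name, col in attrs.items():
    parts = [[r[-1] for r in rows if r[col] == v] for v in set(r[col] for r in rows)]
    weights = [len(p) / len(rows) for p in parts]
    gain = entropy(labels) - sum(w * entropy(p) for w, p in zip(weights, parts))
    split_info = -sum(w * math.log2(w) for w in weights)
    dgini = gini(labels) - sum(w * gini(p) for w, p in zip(weights, parts))
    print(f"{name}: gain={gain:.3f}  gain_ratio={gain / split_info:.3f}  gini_drop={dgini:.3f}")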
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
Overfitting and Tree Pruning
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data
to decide which is the “best pruned tree”
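The slides describe the two strategies only in general terms. As one concrete (assumed) mapping, scikit-learn's DecisionTreeClassifier exposes prepruning thresholds as constructor parameters and supports postpruning via cost-complexity pruning, with a held-out validation set choosing the "best pruned tree":

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via thresholds on depth, node size, impurity drop.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01, random_state=0).fit(X_train, y_train)

# Postpruning: grow a full tree, then pick one tree from the cost-complexity
# pruning sequence using the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(pre.score(X_val, y_val), best.score(X_val, y_val), best.tree_.node_count)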
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why is decision tree induction popular?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– comparable classification accuracy with other methods
Entropy
• Categories: a, b
    Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)
– S1 = {a, a, a, a, a, a}: Entropy(S1) = -(1 \log_2 1) = 0
– S2 = {a, a, a, b, b, b}: Entropy(S2) = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1
Example
Age | att1 | att2 | Class label
5   | …    | …    | N
10  | …    | …    | N
20  | …    | …    | Y
38  | …    | …    | Y
52  | …    | …    | N

Splitting on Age at threshold T1 (between 5 and 10) gives partitions {5} and {10, 20, 38, 52}:
    E1 = -(1 \log_2 1 + 0 \log_2 0) = 0
    E2 = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1
    I(S, T1) = \frac{1}{5} \cdot 0 + \frac{4}{5} \cdot 1 = 0.8
Example
Using the same data, splitting on Age at threshold T2 (between 10 and 20) gives partitions {5, 10} and {20, 38, 52}:
    E1 = -(1 \log_2 1 + 0 \log_2 0) = 0
    E2 = -(\frac{2}{3}\log_2\frac{2}{3} + \frac{1}{3}\log_2\frac{1}{3}) = 0.92
    I(S, T2) = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot 0.92 = 0.552
Computing Information-Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
• (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
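Putting the split-point search together for the Age example above (a sketch; the data and thresholds match the I(S, T1) and I(S, T2) calculations):

import math
from collections import Counter

# Age values and class labels from the example above.
values = [5, 10, 20, 38, 52]
labels = ["N", "N", "Y", "Y", "N"]

def info(lab):
    n = len(lab)
    return -sum(c / n * math.log2(c / n) for c in Counter(lab).values())

pairs = sorted(zip(values, labels))                 # sort A in increasing order
best = None
for i in range(len(pairs) - 1):
    split = (pairs[i][0] + pairs[i + 1][0]) / 2     # midpoint (a_i + a_{i+1}) / 2
    left = [l for v, l in pairs if v <= split]
    right = [l for v, l in pairs if v > split]
    expected = len(left) / len(pairs) * info(left) + len(right) / len(pairs) * info(right)
    if best is None or expected < best[0]:
        best = (expected, split)
print(best)   # ~ (0.551, 15.0): the midpoint between 10 and 20, matching I(S, T2)
              # above (0.552 with the slides' rounding)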