
Classification

Lecture Notes for Chapters 4 & 5


Classification: Definition
 Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class label
 Find a model for the class label as a function of the values of the other attributes
 Goal: previously unseen records should be assigned a class as accurately as possible
 A test set is used to estimate the accuracy of the model
 Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it
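To make the setup concrete, here is a minimal sketch (not part of the original notes) using scikit-learn; the iris data set simply stands in for "a collection of records with attributes and a class label", and the library and its DecisionTreeClassifier are assumed to be available:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                  # records: attribute values + class label
    # Divide the data set into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier()                   # model for the class label as a function of the attributes
    model.fit(X_train, y_train)                        # build the model from the training set
    print(accuracy_score(y_test, model.predict(X_test)))   # validate on previously unseen records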
Illustrating Classification Task

Training Set (Tid, Attrib1, Attrib2, Attrib3, Class):
 1   Yes  Large   125K  No
 2   No   Medium  100K  No
 3   No   Small    70K  No
 4   Yes  Medium  120K  No
 5   No   Large    95K  Yes
 6   No   Medium   60K  No
 7   Yes  Large   220K  No
 8   No   Small    85K  Yes
 9   No   Medium   75K  No
 10  No   Small    90K  Yes

Test Set (Tid, Attrib1, Attrib2, Attrib3, Class):
 11  No   Small    55K  ?
 12  Yes  Medium   80K  ?
 13  Yes  Large   110K  ?
 14  No   Small    95K  ?
 15  No   Large    67K  ?

Flow: the Training Set is fed to a Learning algorithm (Induction: Learn Model), which produces a Model; the Model is then applied to the Test Set to predict the unknown class labels (Deduction).
Examples of Classification
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
 Classifying credit card transactions as legitimate or fraudulent
 Categorizing news stories as finance, weather, entertainment, sports, etc. (e.g., Yahoo! news categories)
Classification Techniques
 “Decision Tree”-based Methods
 k Nearest Neighbors
 Rule-based Methods
 Case-based Reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
Example of a Decision Tree

Training Data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1   Yes  Single    125K  No
 2   No   Married   100K  No
 3   No   Single     70K  No
 4   Yes  Married   120K  No
 5   No   Divorced   95K  Yes
 6   No   Married    60K  No
 7   Yes  Divorced  220K  No
 8   No   Single     85K  Yes
 9   No   Married    75K  No
 10  No   Single     90K  Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
 Refund?
   Yes → NO
   No  → MarSt?
           Single, Divorced → TaxInc?
                                < 80K → NO
                                > 80K → YES
           Married → NO
Another Example (built from the same training data)
 MarSt?
   Married → NO
   Single, Divorced → Refund?
                        Yes → NO
                        No  → TaxInc?
                                < 80K → NO
                                > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification

The same framework as before, with a decision tree as the model: the Training Set (Tids 1-10) is fed to a Tree Induction algorithm (Induction: Learn Model), which produces a Decision Tree; the tree is then applied to the Test Set (Tids 11-15) to predict the missing class labels (Deduction).
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Using the tree from the earlier example (Refund → MarSt → TaxInc), start from the root of the tree and, at each internal node, follow the branch that matches the record:
 Refund? The record has Refund = No, so take the No branch to the MarSt node.
 MarSt? The record has Marital Status = Married, so take the Married branch.
 That branch ends in a leaf labeled NO, so assign Cheat to "No".
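A small Python sketch of this traversal (an illustration, not the notes' code); the nested-dictionary tree encoding and the classify() helper are assumptions made here:

    # Decision tree from the example as nested dicts: internal nodes name the
    # attribute tested and map branch outcomes to subtrees; leaves are class labels.
    tree = {
        "attr": "Refund",
        "branches": {
            "Yes": "No",                                  # leaf: Cheat = No
            "No": {
                "attr": "MarSt",
                "branches": {
                    "Married": "No",                      # leaf: Cheat = No
                    "Single": {"attr": "TaxInc", "branches": {"< 80K": "No", ">= 80K": "Yes"}},
                    "Divorced": {"attr": "TaxInc", "branches": {"< 80K": "No", ">= 80K": "Yes"}},
                },
            },
        },
    }

    def classify(record, node):
        """Follow branches until a leaf (a plain class-label string) is reached."""
        while isinstance(node, dict):
            attr = node["attr"]
            if attr == "TaxInc":                          # continuous attribute: threshold test
                key = "< 80K" if record["TaxInc"] < 80 else ">= 80K"
            else:                                         # categorical attribute: match the value
                key = record[attr]
            node = node["branches"][key]
        return node

    test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}   # Taxable Income in K
    print(classify(test_record, tree))                                 # -> No (assign Cheat = "No")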
Decision Tree Classification (recap)

Induction learns the Decision Tree model from the Training Set (Tids 1-10); Deduction applies that tree to the Test Set (Tids 11-15) to fill in the unknown class labels.
Decision Tree Induction
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

 General Procedure:
 If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
 If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
 Recursively apply the procedure to each subset

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training records, Tids 1-10.)
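A minimal recursive sketch of this procedure (an illustrative assumption, not the notes' pseudocode); choose_split stands for whatever attribute-test selection criterion is used, and must return None when no useful test remains, in which case a majority-class leaf is created:

    def hunt(records, choose_split):
        """Grow a decision tree with Hunt's procedure.
        records: list of (attributes_dict, class_label) pairs.
        choose_split: assumed helper returning (test_name, branch_fn) or None,
        where branch_fn maps a record's attributes to a branch key."""
        labels = [label for _, label in records]
        if len(set(labels)) == 1:                 # Dt contains a single class -> leaf labeled yt
            return labels[0]
        split = choose_split(records)
        if split is None:                         # nothing useful left to split on -> majority-class leaf
            return max(set(labels), key=labels.count)
        name, branch_fn = split
        subsets = {}
        for attrs, label in records:              # the attribute test partitions Dt into smaller subsets
            subsets.setdefault(branch_fn(attrs), []).append((attrs, label))
        # Recursively apply the procedure to each subset.
        return {name: {branch: hunt(subset, choose_split)
                       for branch, subset in subsets.items()}}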
Hunt's Algorithm (applied to the Cheat data)

 Step 1: all ten records are at the root; the majority class gives the initial prediction Don't Cheat.
 Step 2: split on Refund: Yes → Don't Cheat (pure leaf); No → still mixed (?), so split further.
 Step 3: under Refund = No, split on Marital Status: Married → Don't Cheat; Single, Divorced → still mixed (?).
 Step 4: under Single, Divorced, split on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
Tree Induction
 Greedy strategy
 Split the records based on an attribute test that optimizes a certain criterion
 Greedy algorithms work in phases
 At each phase, a decision is made that looks best for the current state, without regard for future consequences
 They can therefore get stuck in local optima (e.g., a step-by-step greedy choice need not find the globally shortest path)

 Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split (e.g., which attribute)?
 Determine when to stop splitting
How to Specify Test Condition?
 Depends on attribute types
 Categorical
 Continuous

 Depends on number of ways to split
 2-way split
 Multi-way split
Splitting on Nominal Attributes
 Multi-way split: use as many partitions as there are distinct values
   CarType: {Family} | {Sports} | {Luxury}

 Binary split: divides the values into two subsets; need to find the optimal partitioning
   CarType: {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}
Splitting on Ordinal Attributes
 Multi-way split: use as many partitions as there are distinct values
   Size: {Small} | {Medium} | {Large}

 Binary split: divides the values into two subsets
   Size: {Small, Medium} vs. {Large}   OR   {Medium, Large} vs. {Small}

 What about this split?  Size: {Small, Large} vs. {Medium} – it violates the order of the values, so it is not a valid split for an ordinal attribute
Splitting on Continuous Attributes
 Different ways of handling
 Binary Decision: (A < v) or (A >= v)
 consider all possible splits and find the best cut
 can be more computationally intensive
   Example (i), binary split: Taxable Income > 80K?  Yes / No


Splitting on Continuous Attributes
 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic
   Example (ii), multi-way split: Taxable Income?  < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K
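For illustration, a tiny sketch (assumed code, not from the notes) that statically discretizes Taxable Income into the ordinal bins above; the cut points are just the slide's example boundaries:

    import bisect

    boundaries = [10_000, 25_000, 50_000, 80_000]                     # cut points in dollars
    bins = ["< 10K", "[10K,25K)", "[25K,50K)", "[50K,80K)", "> 80K"]  # ordinal categories

    def discretize(income):
        """Map a continuous income to its ordinal bin."""
        return bins[bisect.bisect_right(boundaries, income)]

    print(discretize(67_000))   # -> [50K,80K)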


Tree Induction
 Greedy strategy
 Split the records based on an attribute test that optimizes a certain criterion

 Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting
How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Candidate test conditions:
 Own Car?    Yes: C0 = 6, C1 = 4   No: C0 = 4, C1 = 6
 Car Type?   Family: C0 = 1, C1 = 3   Sports: C0 = 8, C1 = 0   Luxury: C0 = 1, C1 = 7
 Student ID? c1: C0 = 1, C1 = 0 ... c10: C0 = 1, C1 = 0   c11: C0 = 0, C1 = 1 ... c20: C0 = 0, C1 = 1

Which test condition is the best?
How to Determine the Best Split
 Greedy approach:
 Nodes with a homogeneous class distribution are preferred
 Need a measure of node impurity:
   C0 = 5, C1 = 5: non-homogeneous, high degree of impurity
   C0 = 9, C1 = 1: homogeneous, low degree of impurity
Measures of Node Impurity
 Gini Index

 Entropy

 Misclassification error
How to Find the Best Split

Before splitting, the parent node has class counts C0 = X, C1 = Y and impurity M0.

 Splitting on A? yields nodes N1 (counts X_A, Y_A) and N2 (counts X_!A, Y_!A) with impurities M1 and M2, combined (weighted by size) into M12.
 Splitting on B? yields nodes N3 (counts X_B, Y_B) and N4 (counts X_!B, Y_!B) with impurities M3 and M4, combined into M34.

Gain = M0 - M12 vs. M0 - M34: choose the test with the larger gain, i.e., the larger drop in impurity.
Measure of Impurity: GINI
 Gini Index for a given node t:

   GINI(t) = 1 - Σ_j [ p(j|t) ]^2

 (NOTE: p(j|t) is the relative frequency of class j at node t.)

 GINI measures impurity, so we want to minimize it
 Minimum (0.0) when all records belong to one class, implying the most interesting information
 Maximum when records are equally distributed among all classes, implying the least interesting information

Examples for Computing GINI

GINI(t) = 1 - Σ_j [ p(j|t) ]^2

 C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
   Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
 C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
   Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
 C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
   Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
 C1 = 3, C2 = 3:  P(C1) = 3/6, P(C2) = 3/6
   Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
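A short sketch of this computation (illustrative, not the notes' code) that reproduces the four examples above:

    def gini(counts):
        """Gini index of a node from its class counts: 1 minus the sum of squared relative frequencies."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
        print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5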
Splitting Based on GINI
 When a node p is split into k partitions (children), the quality of the split is computed as

   GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

 where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
 Splits into two partitions
 Effect of weighting partitions: larger and purer partitions are sought

 Parent: C1 = 6, C2 = 6, Gini = 0.500
 Split on B?  →  Node N1 (Yes): C1 = 5, C2 = 2   Node N2 (No): C1 = 1, C2 = 4

   Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
   Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
   Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
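Continuing the same sketch (again an assumption, not the notes' code), the weighted Gini of the children for the B? split above:

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """GINI_split = sum over children of (n_i / n) * GINI(child i)."""
        n = sum(sum(child) for child in children)
        return sum(sum(child) / n * gini(child) for child in children)

    # N1: C1 = 5, C2 = 2    N2: C1 = 1, C2 = 4
    print(round(gini_split([(5, 2), (1, 4)]), 3))   # 0.371, down from the parent's 0.5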
Continuous Attributes: Computing Gini Index
 Use binary decisions based on one value (e.g., Taxable Income > 80K?  Yes / No)
 Several choices for the splitting value
 Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it: class counts in each of the two partitions, A < v and A >= v

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training records, Tids 1-10.)
Continuous Attributes: Computing Gini Index...
 For each continuous attribute:
 Sort the attribute values
 Choose a split position midway between any two adjacent values
 Linearly scan these values, each time updating the count matrix and computing the Gini index
 Choose the split position that has the least Gini index

Taxable Income example (records sorted by income):

 Cheat            No    No    No    Yes   Yes   Yes   No    No    No    No
 Sorted values    60    70    75    85    90    95    100   120   125   220
 Split positions     55    65    72    80    87    92    97    110   122   172   230
 Yes (<= | >)        0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
 No  (<= | >)        0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
 Gini                0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The smallest Gini (0.300) is obtained at split position 97, i.e., Taxable Income <= 97 vs. > 97.
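A sketch of this scan (assumed code, not from the notes) applied to the Taxable Income column; for clarity it re-partitions the records at every candidate threshold, whereas the slides' linear scan would instead update a running count matrix:

    incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                    # Taxable Income (in K), Tids 1-10
    cheats  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

    def gini(counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def best_split(values, labels):
        pairs = sorted(zip(values, labels))                    # sort the attribute values
        classes = sorted(set(labels))
        best = (float("inf"), None)                            # (weighted Gini, threshold)
        for i in range(len(pairs) - 1):
            v = (pairs[i][0] + pairs[i + 1][0]) / 2            # candidate: midway between adjacent values
            left  = [c for x, c in pairs if x <= v]
            right = [c for x, c in pairs if x > v]
            weighted = (len(left) * gini([left.count(c) for c in classes]) +
                        len(right) * gini([right.count(c) for c in classes])) / len(pairs)
            best = min(best, (weighted, v))
        return best

    print(best_split(incomes, cheats))   # -> (0.3, 97.5); the slides round the threshold to 97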
Another Example

 #   Outlook    Temperature  Humidity  Windy  Play
 1   sunny      100          high      no     N
 2   sunny      110          high      yes    N
 3   overcast   110          high      no     Y
 4   rainy      75           high      no     Y
 5   rainy      40           normal    no     Y
 6   rainy      40           normal    yes    N
 7   overcast   45           normal    yes    Y
 8   sunny      70           high      no     N
 9   sunny      40           normal    no     Y
 10  rainy      70           normal    no     Y
 11  sunny      70           normal    yes    Y
 12  overcast   70           high      yes    Y
 13  overcast   95           normal    no     Y
 14  rainy      65           high      yes    N
Decision Tree Induction


 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis):

 age    income  student  credit_rating  buys_computer
 <=30   high    no       fair           no
 <=30   high    no       excellent      no
 31…40  high    no       fair           yes
 >40    medium  no       fair           yes
 >40    low     yes      fair           yes
 >40    low     yes      excellent      no
 31…40  low     yes      excellent      yes
 <=30   medium  no       fair           no
 <=30   low     yes      fair           yes
 >40    medium  yes      fair           yes
 <=30   medium  yes      excellent      yes
 31…40  medium  no       excellent      yes
 31…40  high    yes      fair           yes
 >40    medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

 age?
   <=30   → student?
              no  → no
              yes → yes
   31..40 → yes
   >40    → credit rating?
              excellent → no
              fair      → yes
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down, recursive, divide-and-conquer manner
 At the start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
 There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

   Info(D) = - Σ_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

   Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

 Information gained by branching on attribute A:

   Gain(A) = Info(D) - Info_A(D)
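These three formulas are easy to express directly; the following is an illustrative sketch (not from the notes):

    import math
    from collections import Counter

    def info(labels):
        """Info(D) = -sum of p_i * log2(p_i) over the class distribution of D."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_after_split(partitions):
        """Info_A(D): size-weighted entropy of the partitions produced by attribute A."""
        n = sum(len(p) for p in partitions)
        return sum(len(p) / n * info(p) for p in partitions)

    def gain(labels, partitions):
        """Gain(A) = Info(D) - Info_A(D)."""
        return info(labels) - info_after_split(partitions)

    labels = ["yes"] * 9 + ["no"] * 5
    print(round(info(labels), 3))   # 0.94, i.e. Info(D) = I(9,5) from the next slide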
Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

   Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

 Splitting on age gives three partitions:

   age     p_i  n_i  I(p_i, n_i)
   <=30    2    3    0.971
   31…40   4    0    0
   >40     3    2    0.971

   Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

 Here (5/14) I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes'es and 3 no's. Hence

   Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly,
   Gain(income) = 0.029
   Gain(student) = 0.151
   Gain(credit_rating) = 0.048

 so age, having the highest gain, is chosen as the splitting attribute at the root.
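For readers who want to check the arithmetic, a short assumed sketch (not from the notes) recomputes these gains from the buys_computer table; small differences in the last digit come from the slides rounding intermediate values:

    import math
    from collections import Counter, defaultdict

    rows = [  # (age, income, student, credit_rating, buys_computer), transcribed from the training table
        ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
    ]
    attrs = ["age", "income", "student", "credit_rating"]

    def info(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    labels = [r[-1] for r in rows]
    for i, name in enumerate(attrs):
        parts = defaultdict(list)
        for r in rows:
            parts[r[i]].append(r[-1])              # partition class labels by this attribute's values
        g = info(labels) - sum(len(p) / len(rows) * info(p) for p in parts.values())
        print(name, round(g, 3))   # age 0.247, income 0.029, student 0.152, credit_rating 0.048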
