Classification DMKD

Classification vs. Prediction

• Classification:
  • predicts categorical class labels
  • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
• Prediction:
  • models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications:
  • credit approval
  • target marketing
  • medical diagnosis
  • treatment effectiveness analysis
• Large data sets: disk-resident rather than memory-resident data
Supervised vs. Unsupervised Learning

• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction

• Classification
  • predicts categorical class labels (discrete or nominal)
  • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
• Numeric prediction
  • models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  • Credit/loan approval
  • Medical diagnosis: is a tumor cancerous or benign?
  • Fraud detection: is a transaction fraudulent?
  • Web page categorization: which category does a page belong to?
Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  • Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set is independent of the training set (otherwise overfitting occurs)
  • If the accuracy is acceptable, use the model to classify new data
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces a classifier (model).

Training data:

Student name  Maths  Physics  Chemistry  Grade
Ram           90     80       70         A
Siva          70     75       80         B
Mani          99     68       98         A
Sanjay        76     79       74         B

Learned classifier (model):

IF maths > 80 OR physics > 80 OR chemistry > 80
THEN Grade = 'A'
ELSE Grade = 'B'
Process (2): Using the Model in Prediction

The classifier is applied to testing data (to estimate accuracy) and then to unseen data.

Testing data:

Student name  Maths  Physics  Chemistry
Gaurav        78     79       56
Ankith        90     91       80
Manoj         70     68       77
Rakesh        90     81       82

Unseen data: (Manoj, 70, 68, 77) -> Grade: B
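To make the two steps concrete, here is a minimal Python sketch (my own illustration, not from the slides) that encodes the learned rule as a function and applies it to the unseen record:

# Step 1 (model construction) produced this rule; we encode it directly.
def classify(maths, physics, chemistry):
    # IF maths > 80 OR physics > 80 OR chemistry > 80 THEN 'A' ELSE 'B'
    if maths > 80 or physics > 80 or chemistry > 80:
        return 'A'
    return 'B'

# Step 2 (model usage): classify the unseen record.
print(classify(70, 68, 77))  # Manoj -> B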
Entropy

• Entropy (information theory): a measure of the uncertainty associated with a random variable
• Interpretation:
  • Higher entropy -> higher uncertainty: the events being measured are less predictable
  • Lower entropy -> lower uncertainty: the events being measured are more predictable
Attribute Selection Measure: Information Gain

• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
• Expected information (entropy) needed to classify a tuple in D:

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

• Information needed (after using A to split D into v partitions) to classify D:

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

• Information gained by branching on attribute A:

  $Gain(A) = Info(D) - Info_A(D)$
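These formulas translate almost line for line into code. A minimal sketch in Python (the helper names info and info_gain are my own):

import math
from collections import Counter

def info(labels):
    # Info(D) = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    # Gain(A) = Info(D) - Info_A(D), where attribute index `attr` partitions the rows
    n = len(labels)
    info_a = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        info_a += (len(subset) / n) * info(subset)
    return info(labels) - info_a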
Attribute Selection: Information Gain

• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Here I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence:

$I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$
[Figure: the decision tree grown from the table above — root split on age (<=30, 31-40, >40), with lower-level splits on income, student, and credit rating (fair/excellent), and leaves predicting buys_computer = yes/no.]
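As a check on the worked example, this self-contained Python sketch recomputes the four gains from the table (it prints 0.247, 0.029, 0.152, 0.048, matching the slide up to rounding):

import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

data = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30','high','no','fair','no'), ('<=30','high','no','excellent','no'),
    ('31…40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'), ('>40','low','yes','excellent','no'),
    ('31…40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'), ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31…40','medium','no','excellent','yes'),
    ('31…40','high','yes','fair','yes'), ('>40','medium','no','excellent','no'),
]
labels = [row[-1] for row in data]
for i, name in enumerate(['age', 'income', 'student', 'credit_rating']):
    # Info_A(D): weighted entropy of the partitions induced by attribute i
    info_a = sum((cnt / len(data)) * info([lab for row, lab in zip(data, labels) if row[i] == v])
                 for v, cnt in Counter(row[i] for row in data).items())
    print(name, round(info(labels) - info_a, 3))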
Weather  Temperature  Humidity  Wind  Play Golf
fine     hot          high      none  no
fine     hot          high      few   no
cloudy   hot          high      none  yes
rain     warm         high      none  yes
rain     cold         medium    none  yes
rain     cold         medium    few   no
cloudy   cold         medium    few   yes
fine     warm         high      none  no
fine     cold         medium    none  yes
rain     warm         medium    none  yes
fine     warm         medium    few   yes
cloudy   warm         high      few   yes
cloudy   hot          medium    none  yes
rain     warm         high      few   no
S1 (class 1, 120 tuples):

gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

S2 (class 2, 130 tuples):

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24

$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$

For major = "Science":     s11 = 84, s21 = 42, I(s11, s21) = 0.9183
For major = "Engineering": s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major = "Business":    s13 = 0,  s23 = 42, I(s13, s23) = 0

$E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$

$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$

Bayes' Theorem: Basics

• Bayes' theorem:

  $P(H|X) = \frac{P(X|H) P(H)}{P(X)}$

• Let X be a data sample ("evidence") whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
  • E.g., X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  • E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Prediction Based on Bayes' Theorem

• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  $P(H|X) = \frac{P(X|H) P(H)}{P(X)}$

• Informally, this can be viewed as: posterior = likelihood × prior / evidence
• Predict that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
• Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
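A one-line numeric illustration of posterior = likelihood × prior / evidence, with made-up probabilities (not from the slides):

p_x_given_h, p_h, p_x = 0.3, 0.5, 0.2   # hypothetical likelihood, prior, evidence
p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
print(p_h_given_x)                      # 0.75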
Classification Is to Derive the Maximum Posteriori

• Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

  $P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}$

• Since P(X) is constant for all classes, only

  $P(X|C_i) P(C_i)$

  needs to be maximized
Naïve Bayes Classifier

• A simplifying assumption: attributes are conditionally independent given the class (i.e., there is no dependence relation between attributes):

  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$

• This greatly reduces the computation cost: only the class distributions need to be counted
Naïve Bayes Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <=30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayes Classifier: An Example

• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• X = (age <=30, income = medium, student = yes, credit_rating = fair)
• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore, X belongs to class buys_computer = "yes"
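A minimal sketch, assuming plain Python over the 14-row table above, that reproduces these numbers by counting (no smoothing, exactly as in the worked example):

from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30','high','no','fair','no'), ('<=30','high','no','excellent','no'),
    ('31…40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'), ('>40','low','yes','excellent','no'),
    ('31…40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'), ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31…40','medium','no','excellent','yes'),
    ('31…40','high','yes','fair','yes'), ('>40','medium','no','excellent','no'),
]
X = ('<=30', 'medium', 'yes', 'fair')

class_counts = Counter(r[-1] for r in rows)
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(rows)                  # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):            # P(X|Ci) = product of P(xk|Ci)
        n_match = sum(1 for r in rows if r[-1] == c and r[k] == value)
        likelihood *= n_match / n_c
    scores[c] = prior * likelihood           # P(X|Ci) * P(Ci)

print(scores)                                # {'no': ~0.007, 'yes': ~0.028}
print('class:', max(scores, key=scores.get)) # class: yes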
Using IF-THEN Rules for Classification

• Represent the knowledge in the form of IF-THEN rules
  R: IF age = "<=30" AND student = yes THEN buys_computer = yes
• Assessment of a rule: coverage and accuracy
  • ncovers = number of tuples covered by R
  • ncorrect = number of tuples correctly classified by R
  • coverage(R) = ncovers / |D|   /* D: training data set */
  • accuracy(R) = ncorrect / ncovers
• Another example: IF age = "<=30" AND student = no AND credit_rating = fair THEN buys_computer = no
• If more than one rule is triggered, we need conflict resolution:
  • Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  • Class-based ordering: decreasing order of prevalence or misclassification cost per class
  • Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Example 1:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Rule R: IF age = "<=30" AND student = no THEN buys_computer = no

Writing R as A -> B, with |D| = total number of records:
  Coverage(R) = |A| / |D|       (records that satisfy the rule antecedent)
  Accuracy(R) = |A ∩ B| / |A|   (records that satisfy both antecedent and consequent)

ncovers = number of tuples covered by R = 3
ncorrect = number of tuples correctly classified by R = 3
coverage(R) = ncovers / |D| = 3/14
accuracy(R) = ncorrect / ncovers = 3/3 (i.e., 100%)
Example 2 (the same table, but with the ninth tuple changed to student = no):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     no       fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Rule R: IF age = "<=30" AND student = no THEN buys_computer = no

ncovers = number of tuples covered by R = 4
ncorrect = number of tuples correctly classified by R = 3
coverage(R) = ncovers / |D| = 4/14
accuracy(R) = ncorrect / ncovers = 3/4 (i.e., 75%)
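A small Python sketch (the helper name rule_stats is my own) that computes both measures for Example 1's rule over the original table:

rows = [  # (age, income, student, credit_rating, buys_computer)
    ('<=30','high','no','fair','no'), ('<=30','high','no','excellent','no'),
    ('31…40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'), ('>40','low','yes','excellent','no'),
    ('31…40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'), ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31…40','medium','no','excellent','yes'),
    ('31…40','high','yes','fair','yes'), ('>40','medium','no','excellent','no'),
]

def rule_stats(rows, antecedent, consequent):
    # coverage(R) = ncovers / |D|; accuracy(R) = ncorrect / ncovers
    covered = [r for r in rows if antecedent(r)]
    n_correct = sum(1 for r in covered if consequent(r))
    return len(covered) / len(rows), n_correct / len(covered)

cov, acc = rule_stats(rows,
                      lambda r: r[0] == '<=30' and r[2] == 'no',  # A: age <=30 AND student = no
                      lambda r: r[4] == 'no')                     # B: buys_computer = no
print(cov, acc)  # 3/14 ≈ 0.214, 3/3 = 1.0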
Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive

[Figure: the buys_computer decision tree — root split on age? (<=30 -> student?, 31..40 -> yes, >40 -> credit rating?); student? branches to no -> no and yes -> yes; credit rating? branches to excellent -> no and fair -> yes.]

• Example: rule extraction from our buys_computer decision tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method

• Sequential covering algorithm: extracts rules directly from the training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
  • Rules are learned one at a time
  • Each time a rule is learned, the tuples covered by the rule are removed
  • Repeat the process on the remaining tuples until a termination condition is met, e.g., when no training examples remain or when the quality of a rule returned is below a user-specified threshold
• Compare with decision-tree induction, which learns a set of rules simultaneously
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: positive examples being covered in turn by Rule 1, Rule 2, and Rule 3.]
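A toy Python sketch of this loop, under strong simplifying assumptions of mine: rules are single attribute = value tests, learn_one_rule greedily picks the test with the best accuracy, and min_acc is an invented quality threshold (a real learner grows conjunctions of tests, as described under Learn-One-Rule below):

def learn_one_rule(positives, examples, target, min_acc=0.9):
    # Greedy: pick the single attribute=value test with the best accuracy
    # over the examples it covers.
    best, best_acc = None, 0.0
    for e in positives:
        for attr, value in e.items():
            if attr == 'class':
                continue
            covered = [x for x in examples if x[attr] == value]
            acc = sum(x['class'] == target for x in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, value), acc
    return best if best_acc >= min_acc else None

def sequential_covering(examples, target):
    rules = []
    positives = [e for e in examples if e['class'] == target]
    while positives:                                  # enough target tuples left
        rule = learn_one_rule(positives, examples, target)
        if rule is None:                              # rule quality below threshold
            break
        rules.append(rule)
        attr, value = rule
        positives = [e for e in positives if e[attr] != value]  # remove covered tuples
    return rules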
Rule Generation

To generate a rule:

while (true)
    find the best predicate p
    if coverage > threshold then add p to the current rule
    else break

[Figure: rule specialization — starting from A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, each step separating the positive from the negative examples more cleanly.]
How to Learn-One-Rule?

• Start with the most general rule possible: condition = empty
• Add new attribute tests by adopting a greedy depth-first strategy
  • Pick the test that most improves the rule quality
• Rule-quality measures consider both coverage and accuracy
Model Evaluation

• Metrics for performance evaluation
  • How to evaluate the performance of a model?
• Methods for performance evaluation
  • How to obtain reliable estimates?
• Methods for model comparison
  • How to compare the relative performance among competing models?
Metrics for Performance Evaluation

• Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
• Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes    a (TP)      b (FN)
  CLASS     Class=No     c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation (cont.)

                         PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes    a (TP)      b (FN)
  CLASS     Class=No     c (FP)      d (TN)

• The most widely used metric:

  $Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Limitation of Accuracy

• Consider a 2-class problem
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%
  • Accuracy is misleading because the model does not detect any Class 1 example
Cost Matrix

                         PREDICTED CLASS
  C(i|j)                 Class=Yes     Class=No
  ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
  CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): the cost of misclassifying a class j example as class i
Cost-Sensitive Measures

$Precision\ (p) = \frac{a}{a + c}$

$Recall\ (r) = \frac{a}{a + b}$

$F\text{-measure}\ (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$

• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)
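Computing all four metrics from the confusion-matrix counts takes only a few lines of Python; the counts below are hypothetical, chosen only to exercise the formulas:

a, b, c, d = 40, 10, 5, 45   # hypothetical TP, FN, FP, TN counts

accuracy  = (a + d) / (a + b + c + d)
precision = a / (a + c)
recall    = a / (a + b)
f_measure = 2 * recall * precision / (recall + precision)   # = 2a / (2a + b + c)

print(accuracy, precision, recall, f_measure)   # 0.85 0.888... 0.8 0.842...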
K-Nearest Neighbor Algorithm

• K-nearest neighbor is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., Euclidean distance)
• A case is classified by a majority vote of its neighbors, the case being assigned to the class most common among its K nearest neighbors as measured by a distance function:

  $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

• K is an integer; if K = 1, the case is simply assigned to the class of its nearest neighbor
K-Nearest Neighbor Algorithm: Example

Name    Acid Durability  Strength  Class
Type 1  7                7         Bad
Type 2  7                4         Bad
Type 3  3                4         Good
Type 4  1                4         Good

Test data: acid durability = 3 and strength = 7; class = ?

Using the Euclidean distance $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$:

Name    Acid Durability  Strength  Class  Distance
Type 1  7                7         Bad    sqrt((7-3)² + (7-7)²) = 4
Type 2  7                4         Bad    5
Type 3  3                4         Good   3
Type 4  1                4         Good   3.6


Name Acid Strength Class Distance Rank
Durability
Type 1 7 7 Bad 4 3
Type 2 7 4 Bad 5 4
Type 3 3 4 Good 3 1
Type 4 1 4 Good 3.6 2
K=1

Name Acid Strength Class Distance Rank


Durability
Type 1 7 7 Bad 4 3
Type 2 7 4 Bad 5 4
Type 3 3 4 Good 3 1
Type 4 1 4 Good 3.6 2

Test Data  Acid durability = 3 and strength = 7 class = Good


K=2

Name Acid Strength Class Distance Rank


Durability
Type 1 7 7 Bad 4 3
Type 2 7 4 Bad 5 4
Type 3 3 4 Good 3 1
Type 4 1 4 Good 3.6 2

Test Data  Acid durability = 3 and strength = 7 class = Good


K=3

Name Acid Strength Class Distance Rank


Durability
Type 1 7 7 Bad 4 3
Type 2 7 4 Bad 5 4
Type 3 3 4 Good 3 1
Type 4 1 4 Good 3.6 2

Test Data  Acid durability = 3 and strength = 7 class = 2 Good and 1 Bad majority = Good
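A self-contained Python sketch of the whole example (the function name knn_classify is my own):

import math
from collections import Counter

train = [  # ((acid durability, strength), class) from the table above
    ((7, 7), 'Bad'), ((7, 4), 'Bad'), ((3, 4), 'Good'), ((1, 4), 'Good'),
]

def knn_classify(query, train, k):
    # Sort stored cases by Euclidean distance to the query, then take a
    # majority vote among the k nearest neighbors.
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(train, key=lambda case: dist(case[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

for k in (1, 2, 3):
    print(k, knn_classify((3, 7), train, k))   # Good for all three values of K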
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values
Underfitting and Overfitting (Example)

[Figure: a synthetic two-class dataset of 500 circular and 500 triangular data points.]

Underfitting and Overfitting

[Figure: training and test error as a function of model complexity, illustrating underfitting and overfitting.]

• Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Noise

[Figure: a decision boundary distorted by a noise point.]

Overfitting due to Insufficient Examples

• A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
• An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Computing Impurity Measure

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes    (missing Refund value)

Before splitting:
  Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Class counts by Refund value:
               Class=Yes  Class=No
  Refund=Yes   0          3
  Refund=No    2          4
  Refund=?     1          0

Split on Refund:
  Entropy(Refund=Yes) = 0
  Entropy(Refund=No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
  Gain = 0.9 × (0.8813 - 0.551) = 0.2973

The factor 0.9 is the fraction of records (9 out of 10) whose Refund value is known; the gain is scaled down accordingly.
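A sketch of this computation in Python, following the convention above (child weights taken over all 10 records, and the gain scaled by the fraction of records with a known Refund value):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

n = 10
parent = entropy([3, 7])                     # 3 Yes, 7 No over all records: 0.8813
children = [[0, 3], [2, 4]]                  # [Yes, No] counts for Refund=Yes, Refund=No
known = sum(map(sum, children))              # 9 records have a known Refund value

e_children = sum((sum(c) / n) * entropy(c) for c in children)   # 0.3*0 + 0.6*0.9183 = 0.551
gain = (known / n) * (parent - e_children)   # 0.9 * (0.8813 - 0.551)
print(round(gain, 4))                        # 0.2973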
Distribute Instances

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No

The nine records with a known Refund value are partitioned as usual:

  Refund=Yes:  Class=Yes 0, Class=No 3
  Refund=No:   Class=Yes 2, Class=No 4

The record with the missing value (Tid 10: Refund = ?, Single, 90K, Class=Yes) is then distributed to both children in proportion to these counts:

  Refund=Yes:  Class=Yes 0 + 3/9, Class=No 3
  Refund=No:   Class=Yes 2 + 6/9, Class=No 4

The probability that Refund=Yes is 3/9 and that Refund=No is 6/9, so the record is assigned to the left (Yes) child with weight 3/9 and to the right (No) child with weight 6/9.