Unit-IV Classification Part 1
Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data.
[Figure: model construction and usage — a classification algorithm learns a classifier from the training data; the classifier is then applied to testing/unseen data, e.g. the unseen tuple (X, Professor, 4) is checked for Tenured = ?]

Training data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
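As a small illustration of this two-step process (not part of the slides), the sketch below fits a decision tree to the tenure table and then classifies the unseen tuple (Professor, 4 years). The use of scikit-learn and the ordinal encoding of RANK are my assumptions.

```python
# Minimal sketch of model construction + model usage, assuming scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Training data: (RANK, YEARS) -> TENURED, with RANK ordinally encoded
# (Assistant Prof = 0, Associate Prof = 1, Professor = 2) -- an assumption.
rank = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
X_train = [[rank["Assistant Prof"], 2],
           [rank["Associate Prof"], 7],
           [rank["Professor"], 5],
           [rank["Assistant Prof"], 7]]
y_train = ["no", "no", "yes", "yes"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # model construction

# Model usage: classify the unseen tuple (X, Professor, 4).
# The slide's example expects 'yes'; with only 4 tuples the learned tree
# depends on tie-breaking among equally good splits.
print(clf.predict([[rank["Professor"], 4]]))
```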
III (b): Classification and Prediction
Classification by Decision Tree Induction
Bayesian Classification
Rule-based Classification
Prediction: Linear Regression
Criteria for comparing classification methods:
i. Accuracy
ii. Speed
iii. Robustness
iv. Scalability
v. Interpretability
To classify an unknown tuple, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. A plain-Python sketch of this path tracing follows the applications list below.
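The path-tracing step can be sketched with the tree stored as nested dicts (a hypothetical representation, not from the slides), mirroring the buys_computer tree shown later:

```python
# Decision tree as nested dicts: internal nodes test an attribute,
# leaves hold the class prediction.
tree = {
    "attr": "age",
    "branches": {
        "<=30":   {"attr": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"attr": "credit_rating",
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(node, tuple_):
    """Trace a path from the root to a leaf; the leaf is the prediction."""
    while isinstance(node, dict):                  # internal node: test attribute
        node = node["branches"][tuple_[node["attr"]]]
    return node                                    # leaf: class label

print(classify(tree, {"age": "<=30", "student": "yes",
                      "income": "medium", "credit_rating": "fair"}))  # -> yes
```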
Applications
Medicine
Manufacturing
Production
Financial analysis
Astronomy
Molecular Biology
[Figure: decision tree for buys_computer — the student? node branches no → No, yes → Yes; the credit_rating? node branches excellent → No, fair → Yes]
Why decision tree induction in data mining?
relatively faster learning speed (than other classification methods)
convertible to simple and easy to understand classification rules
can use SQL queries for accessing databases
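For instance, a rule extracted from the tree translates directly into a SQL predicate. The sqlite3 sketch below (table name and schema are my assumptions) counts training tuples covered by one rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age TEXT, student TEXT, buys_computer TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("<=30", "yes", "yes"), ("<=30", "no", "no"), (">40", "no", "no")])

# Rule: IF age <= 30 AND student = yes THEN buys_computer = yes
row = conn.execute("SELECT COUNT(*) FROM customers "
                   "WHERE age = '<=30' AND student = 'yes' "
                   "AND buys_computer = 'yes'").fetchone()
print(row[0])  # number of training tuples covered by the rule -> 1
```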
a. Bayes' Theorem
Given a tuple X, Bayes' theorem gives the posterior probability of a hypothesis H as P(H|X) = P(X|H) P(H) / P(X). To classify X, we predict the class Ci with the highest posterior P(Ci|X); since P(X) is the same for all classes, only the product P(X|Ci) P(Ci) needs to be maximized.
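As a quick numeric illustration of this maximization, using the priors and likelihoods that appear in the buys_computer example later in this section:

```python
# Pick the class maximizing P(X|Ci) * P(Ci); P(X) cancels out.
priors = {"yes": 0.643, "no": 0.357}        # P(Ci), from the example below
likelihoods = {"yes": 0.044, "no": 0.019}   # P(X|Ci), from the example below

posteriors = {c: likelihoods[c] * priors[c] for c in priors}  # proportional to P(Ci|X)
print(max(posteriors, key=posteriors.get))  # -> 'yes'
```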
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
This greatly reduces the computation cost: Only counts
the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
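This counting-based estimate for a categorical attribute can be sketched as follows (attribute and class names are illustrative assumptions):

```python
def p_xk_given_ci(data, attr, value, cls):
    """P(xk|Ci) = (# tuples of class Ci with attr == value) / |Ci,D|."""
    ci = [row for row in data if row["buys_computer"] == cls]
    return sum(row[attr] == value for row in ci) / len(ci)

data = [{"age": "<=30", "buys_computer": "no"},
        {"age": "<=30", "buys_computer": "yes"},
        {"age": ">40",  "buys_computer": "yes"}]
print(p_xk_given_ci(data, "age", "<=30", "yes"))  # -> 0.5
```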
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

and P(xk|Ci) is

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
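The Gaussian density itself is a one-liner; a sketch, with μ and σ values assumed for illustration (in practice they are estimated from the class's training tuples):

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# E.g., a continuous attribute with class mean 38 and std 12 (assumed values):
print(gaussian(35, 38, 12))  # P(x = 35 | Ci) under the Gaussian assumption
```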
Naïve Bayesian Classifier: Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no
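To make the numbers on the next slide reproducible, here is the same table as a Python list, with the priors computed by counting (a sketch, not from the slides):

```python
# (age, income, student, credit_rating, buys_computer) -- the 14 training tuples.
D = [("<=30","high","no","fair","no"),      ("<=30","high","no","excellent","no"),
     ("31..40","high","no","fair","yes"),   (">40","medium","no","fair","yes"),
     (">40","low","yes","fair","yes"),      (">40","low","yes","excellent","no"),
     ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
     ("<=30","low","yes","fair","yes"),     (">40","medium","yes","fair","yes"),
     ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
     ("31..40","high","yes","fair","yes"),  (">40","medium","no","excellent","no")]

n_yes = sum(t[-1] == "yes" for t in D)
print(n_yes / len(D), (len(D) - n_yes) / len(D))  # 9/14 = 0.643, 5/14 = 0.357
```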
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class:
P(age <= 30 | yes) = 2/9 = 0.222, P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444, P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667, P(student = yes | no) = 1/5 = 0.200
P(credit_rating = fair | yes) = 6/9 = 0.667, P(credit_rating = fair | no) = 2/5 = 0.400
P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028; P(X | no) P(no) = 0.019 × 0.357 = 0.007
Therefore, X belongs to class buys_computer = 'yes'.
Note: if any conditional probability is zero, the whole product collapses to zero; the Laplacian correction (adding 1 to each count) avoids this, and the corrected estimates stay close to their uncorrected counterparts.
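The whole calculation can be checked in a few lines over the list D defined above (a sketch; the Laplacian correction is indicated in a comment rather than applied):

```python
ATTRS = ["age", "income", "student", "credit_rating"]
x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

def posterior(cls):
    ci = [t for t in D if t[-1] == cls]
    p = len(ci) / len(D)                                  # prior P(Ci)
    for k, attr in enumerate(ATTRS):
        # Laplacian correction would use (count + 1) / (len(ci) + n_values).
        p *= sum(t[k] == x[attr] for t in ci) / len(ci)   # P(xk|Ci)
    return p

print(posterior("yes"), posterior("no"))  # ~0.028 vs ~0.007 -> predict 'yes'
```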
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement; good results (low error rate) obtained in most cases
Disadvantages
Assumption: class conditional independence, which can cause loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be modeled by a Naïve Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf; each attribute-value pair along the path forms a conjunction, and the leaf node holds the class prediction
[Figure: buys_computer decision tree — age? branches to <=30 (test student?), 31..40 (leaf: yes), >40 (test credit_rating?)]
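A sketch of this path-to-rule conversion over the nested-dict tree from the earlier traversal example (the dict representation is my assumption):

```python
def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):                     # leaf: emit the rule
        yield "IF " + " AND ".join(conditions) + f" THEN buys_computer = {node}"
        return
    for value, child in node["branches"].items():      # one branch per attribute value
        yield from extract_rules(child, conditions + (f"{node['attr']} = {value}",))

for rule in extract_rules(tree):   # 'tree' as defined in the earlier sketch
    print(rule)
# IF age = <=30 AND student = no THEN buys_computer = no
# IF age = <=30 AND student = yes THEN buys_computer = yes
# IF age = 31..40 THEN buys_computer = yes
# ...
```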