05 Classification
Part 1
Outline
◘ What Is Classification?
◘ Classification Examples
◘ Classification Methods
– Decision Trees
– Bayesian Classification
– K-Nearest Neighbor
– Neural Network
– Support Vector Machines (SVM)
– Fuzzy Set Approaches
What Is Classification?
◘ Classification
– Construction of a model to classify data
– When constructing the model, use the training set and the class labels
(e.g., yes/no) in the target column
1. Model construction
– Each tuple is assumed to belong to a predefined class
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, trees, or mathematical formulae
2. Model testing
– Using the test set, estimate the accuracy rate of the model
• Accuracy rate is the percentage of test set samples that are correctly classified
by the model
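The two steps above can be sketched in a few lines of Python. The "model" here is a single hand-written salary rule standing in for a real learner, and the attribute, threshold, and data are hypothetical:

```python
# Hypothetical (salary in K, class label) tuples -- not from the slides.
training_set = [(30, "no"), (85, "yes"), (25, "no"), (90, "yes")]
test_set = [(40, "no"), (80, "yes"), (95, "yes"), (20, "no")]

# 1. Model construction: a stand-in for a learned classifier -- here just
#    a fixed threshold that separates the training classes.
def model(salary):
    return "yes" if salary >= 50 else "no"

# 2. Model testing: accuracy = correctly classified test samples / all test samples
correct = sum(1 for salary, label in test_set if model(salary) == label)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0: all four test samples are classified correctly
```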
[Figure: the DM engine learns a mining model (classifier) from the training set; the same model is then applied to the test data to predict class labels]
◘ Given old data about customers and payments, predict a new applicant’s
loan eligibility.
– Good Customers
– Bad Customers
Decision Trees
Bayesian Classification
K-Nearest Neighbor
…
Decision Trees
[Figure: example decision tree with root test “Salary < 1 M”]
◘ Write a rule for each path in the decision tree from the root to a leaf.
Decision Tree Algorithms
◘ ID3
– Quinlan (1981)
– Tries to reduce the expected number of comparisons
◘ C4.5
– Quinlan (1993)
– It is an extension of ID3
– Just starting to be used in data mining applications
– Also used for rule induction
◘ CART
– Breiman, Friedman, Olshen, and Stone (1984)
– Classification and Regression Trees
◘ CHAID
– Kass (1980)
– Oldest decision tree algorithm
– Well established in database marketing industry
◘ QUEST
– Loh and Shih (1997)
Decision Tree Construction
◘ Entropy (m classes, p_i = proportion of S in class i):
Entropy(S) = − Σ_{i=1}^{m} p_i log2(p_i)
– For two classes: Entropy(S) = − p1 log2(p1) − p2 log2(p2)
◘ Information gain of attribute A (S_i = subset of S with the i-th value of A):
Gain(S, A) = Entropy(S) − Σ_{i ∈ Values(A)} (|S_i| / |S|) · Entropy(S_i)
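The entropy formula translates directly into code. A minimal sketch (the helper name `entropy` is ours), checked on the two-class case with 9 positive and 5 negative tuples:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

# Two-class example: 9 positive, 5 negative tuples.
s = ["yes"] * 9 + ["no"] * 5
print(entropy(s))  # ≈ 0.940, the Entropy(S) used in the gain calculations
```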
Decision Tree Construction
Outlook:
– sunny: [2+, 3−], E = 0.971
– overcast: [4+, 0−], E = 0.0
– rain: [3+, 2−], E = 0.971
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Gain(S, Humidity) = Entropy(S) − (|S_High| / |S|)·Entropy(S_High) − (|S_Normal| / |S|)·Entropy(S_Normal)
= 0.940 − (7/14)·0.985 − (7/14)·0.592 ≈ 0.151
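The worked gains above can be checked in code. The sketch below uses the standard play-tennis table implied by the day labels D1..D14 and the [2+, 3−] class counts on the slide (Outlook, Humidity, Wind → Play):

```python
from math import log2
from collections import Counter

# Rows D1..D14: (Outlook, Humidity, Wind, Play).
data = [
    ("sunny", "High", "weak", "no"),      ("sunny", "High", "strong", "no"),
    ("overcast", "High", "weak", "yes"),  ("rain", "High", "weak", "yes"),
    ("rain", "Normal", "weak", "yes"),    ("rain", "Normal", "strong", "no"),
    ("overcast", "Normal", "strong", "yes"), ("sunny", "High", "weak", "no"),
    ("sunny", "Normal", "weak", "yes"),   ("rain", "Normal", "weak", "yes"),
    ("sunny", "Normal", "strong", "yes"), ("overcast", "High", "strong", "yes"),
    ("overcast", "Normal", "weak", "yes"), ("rain", "High", "strong", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Gain(S, A) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)."""
    g = entropy([row[-1] for row in data])
    for value in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == value]
        g -= len(subset) / len(data) * entropy(subset)
    return g

print(gain(0))  # Outlook:  ≈ 0.247, as on the slide
print(gain(1))  # Humidity: ≈ 0.152 (the slide rounds to 0.151)
```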
Decision Tree Construction
Outlook:
– sunny: ?
– overcast: yes
– rain: ?
Outlook:
– sunny → Humidity:
  – High → no [D1, D2, D8]
  – Normal → yes [D9, D11]
– overcast → yes [D3, D7, D12, D13]
– rain → Wind:
  – weak → yes [D4, D5, D10]
  – strong → no [D6, D14]
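The finished tree can be held as a nested dict, which also makes it easy to write one rule per root-to-leaf path, as suggested earlier. The representation and helper names are our own sketch:

```python
# The finished tree from the slide; inner dicts map an attribute to its
# branches, and leaves are class labels.
tree = {"Outlook": {
    "sunny":    {"Humidity": {"High": "no", "Normal": "yes"}},
    "overcast": "yes",
    "rain":     {"Wind": {"weak": "yes", "strong": "no"}},
}}

def classify(node, example):
    """Follow the branches that match the example until a leaf is reached."""
    while isinstance(node, dict):
        attr, branches = next(iter(node.items()))
        node = branches[example[attr]]
    return node

def rules(node, path=()):
    """Yield one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):
        yield " AND ".join(f"{a}={v}" for a, v in path) + f" => {node}"
        return
    attr, branches = next(iter(node.items()))
    for value, subtree in branches.items():
        yield from rules(subtree, path + ((attr, value),))

print(classify(tree, {"Outlook": "sunny", "Humidity": "High"}))  # no
for rule in rules(tree):  # 5 rules, one per leaf
    print(rule)
```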
Another Example
At the weekend, choose one of four activities (the class to predict):
– go shopping,
– watch a movie,
– play tennis, or
– just stay in.
Classification Techniques
2- Bayesian Classification
◘ A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities.
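A minimal sketch of such a probabilistic prediction in the naive-Bayes style, estimating P(C) and P(x_i | C) by counting over categorical attributes; the data and helper name are hypothetical illustrations:

```python
def predict(rows, labels, x):
    """Pick the class C maximizing P(C) * product_i P(x_i | C)."""
    n = len(labels)
    best, best_p = None, -1.0
    for c in set(labels):
        idx = [j for j, label in enumerate(labels) if label == c]
        p = len(idx) / n  # prior P(C)
        for i, value in enumerate(x):
            # conditional P(x_i = value | C), estimated by counting
            p *= sum(1 for j in idx if rows[j][i] == value) / len(idx)
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical (Outlook, Humidity) training tuples with class labels.
rows = [("sunny", "high"), ("sunny", "high"), ("overcast", "high"),
        ("rain", "normal"), ("rain", "normal"), ("sunny", "normal")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]
print(predict(rows, labels, ("sunny", "high")))  # no
```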
K-Nearest Neighbor (k-NN)
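A minimal k-NN sketch, using Euclidean distance and a majority vote over the k nearest training points; the points and labels are hypothetical:

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Two hypothetical clusters of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))  # A: its 3 nearest neighbours are all A
```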