06-Classification Part1
Part 1
◘ What Is Classification?
◘ Classification Examples
◘ Classification Methods
– Decision Trees
– Bayesian Classification
– K-Nearest Neighbor
– Neural Network
– Support Vector Machines (SVM)
– Fuzzy Set Approaches
What Is Classification?
◘ Classification
– Construction of a model to classify data
– When constructing the model, use the training set and the class labels (e.g., yes/no) in the target column
1. Model construction
– Each tuple is assumed to belong to a predefined class
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, trees, or mathematical formulas
2. Model testing
– Using the test set, estimate the accuracy rate of the model
• The accuracy rate is the percentage of test-set samples that are correctly classified by the model
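To make the two steps concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the dataset, the 70/30 split, and the choice of a decision tree are illustrative placeholders, not part of the slides.

# Step 1: model construction on the training set;
# Step 2: accuracy-rate estimation on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)               # hold out a test set

model = DecisionTreeClassifier().fit(X_train, y_train) # build the model
y_pred = model.predict(X_test)                         # classify the test set
print("Accuracy rate:", accuracy_score(y_test, y_pred))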
[Figure: the training set is fed to a learning algorithm, which produces the classifier model]
Decision Trees
Bayesian Classification
K-Nearest Neighbor
…
Decision Trees
◘ Write a rule for each path in the decision tree from the root to a leaf.
Entropy
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
For the two-class case: $\mathrm{Entropy}(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
Information Gain
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i \in \mathrm{Values}(A)} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$
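These two formulas translate into a few lines of Python; a minimal sketch, assuming class labels are plain values in a list and examples are tuples (the function names are illustrative):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the m classes
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    return entropy(labels) - sum(
        len(part) / n * entropy(part) for part in partitions.values())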
Decision Tree Algorithms
◘ ID3
– Quinlan (1981)
– Tries to reduce the expected number of comparisons
◘ C 4.5
– Quinlan (1993)
– It is an extension of ID3
– Just starting to be used in data mining applications
– Also used for rule induction
◘ CART
– Breiman, Friedman, Olshen, and Stone (1984)
– Classification and Regression Trees
◘ CHAID
– Kass (1980)
– Oldest decision tree algorithm
– Well established in database marketing industry
◘ QUEST
– Loh and Shih (1997)
ID3 Algorithm
[Figure: candidate attribute splits with class counts and subset entropies]
Outlook: Sunny [2+, 3-] E=0.971; Overcast [4+, 0-] E=0.0; Rain [3+, 2-] E=0.971
Humidity: High [3+, 4-] E=0.985; Normal [6+, 1-] E=0.592
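Plugging these entropies into the gain formula, and assuming the standard 14-example PlayTennis set (9 positive, 5 negative, so $\mathrm{Entropy}(S) \approx 0.940$):
$$\mathrm{Gain}(S, \mathrm{Outlook}) = 0.940 - \tfrac{5}{14}(0.971) - \tfrac{4}{14}(0.0) - \tfrac{5}{14}(0.971) \approx 0.246$$
$$\mathrm{Gain}(S, \mathrm{Humidity}) = 0.940 - \tfrac{7}{14}(0.985) - \tfrac{7}{14}(0.592) \approx 0.151$$
Outlook yields the larger gain, so ID3 selects it as the root attribute.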
Decision Tree Construction
[Figure: partial tree after choosing Outlook as the root]
Outlook = Sunny → ? [2+, 3-]; Outlook = Overcast → yes [4+, 0-]; Outlook = Rain → ? [3+, 2-]
(the "?" branches still need to be split further)
[Figure: the final decision tree, after splitting the remaining subsets on Humidity and Wind]
Outlook = Sunny → Humidity: High → no [D1, D2, D8]; Normal → yes [D9, D11]
Outlook = Overcast → yes [D3, D7, D12, D13]
Outlook = Rain → Wind: Weak → yes [D4, D5, D10]; Strong → no [D6, D14]
Converting the Tree to Rules
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
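The five rules translate directly into code; a minimal sketch, assuming each example is a dict keyed by attribute name (the function name is illustrative):

def play_tennis(example):
    # Apply rules R1-R5 extracted from the decision tree
    if example["Outlook"] == "Sunny":
        return "No" if example["Humidity"] == "High" else "Yes"   # R1, R2
    if example["Outlook"] == "Overcast":
        return "Yes"                                              # R3
    if example["Outlook"] == "Rain":
        return "No" if example["Wind"] == "Strong" else "Yes"     # R4, R5

print(play_tennis({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No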
Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
$$\mathrm{GainRatio}(A) = \mathrm{Gain}(A) / \mathrm{SplitInfo}_A(D)$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
Computation of Gain Ratio
◘ Suppose the attribute “Wind” partitions D into 8 tuples in D1: {Weak} and 6 tuples in D2: {Strong}:
$$\mathrm{SplitInfo}_{\mathrm{Wind}}(D) = -\frac{8}{14}\log_2\frac{8}{14} - \frac{6}{14}\log_2\frac{6}{14} = 0.9852$$
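A quick check of this number in Python; the gain value 0.048 for Wind is an assumption carried over from the standard PlayTennis example, not computed on these slides:

import math

def split_info(sizes):
    # SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes)

si = split_info([8, 6])        # Wind: 8 Weak, 6 Strong
print(round(si, 4))            # -> 0.9852
print(round(0.048 / si, 3))    # GainRatio(Wind) = Gain(Wind)/SplitInfo(Wind) -> 0.049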
Memory consumption grows exponentially with the depth d of the trees.
Classification Techniques
Decision Trees
Bayesian Classification
K-Nearest Neighbor
…
2- Bayesian Classification
◘ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
◘ Foundation: based on Bayes’ theorem.
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
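As an illustration, a minimal naive Bayes sketch in Python, assuming categorical attributes and the usual conditional-independence assumption; the tiny dataset and function names are illustrative, not from the slides:

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    # Estimate P(H) from class counts and P(x_i | H) from per-class value counts
    priors = Counter(labels)
    cond = defaultdict(Counter)           # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            cond[(label, i)][value] += 1
    return priors, cond

def predict(priors, cond, row):
    # Choose the class H maximizing P(H) * prod_i P(x_i | H); the constant P(X) is dropped
    n = sum(priors.values())
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n
        for i, value in enumerate(row):
            score *= cond[(label, i)][value] / count   # no smoothing, for brevity
        if score > best_score:
            best, best_score = label, score
    return best

rows = [("Sunny", "Weak"), ("Rain", "Strong"), ("Overcast", "Weak"), ("Rain", "Weak")]
labels = ["No", "No", "Yes", "Yes"]
priors, cond = train_naive_bayes(rows, labels)
print(predict(priors, cond, ("Overcast", "Weak")))     # -> Yes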
Decision Trees
Bayesian Classification
K-Nearest Neighbor
…
K-Nearest Neighbor (k-NN)
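The slide gives no further k-NN detail here, so below is a minimal sketch of the technique, assuming Euclidean distance and a majority vote among the k closest training points (all names and data are illustrative):

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Sort training points by distance to the query, then majority-vote the k nearest
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
train_y = ["Positive", "Positive", "Negative", "Negative"]
print(knn_predict(train_X, train_y, (1.1, 0.9)))       # -> Positive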
Confusion matrix for a classifier evaluated on the test set:

                    Predicted Positive       Predicted Negative
Actual Positive     TP = 45                  FN = 20 (Type II error)
Actual Negative     FP = 5 (Type I error)    TN = 30
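From this matrix, the accuracy rate defined earlier is
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{45 + 30}{100} = 75\%$$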