Classification and prediction
Data Mining Concepts and Techniques
Chapter 8.1-8.3, 9.5.1 Partly based on slides prepared by Jiawei Han
Type of method
Infrastructure preparation exploration analysis interpretation - exploration Supervised unsupervised Classification - prediction
Process
Process (1): Model Construction
Classification Algorithms
Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)
IF rank = professor OR years > 6 THEN tenured = yes
Process (2): Using the Model in Prediction
Classifier
Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
(Jeff, Professor, 4)
Tenured?
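The two process steps can be sketched in Python. This is only an illustration: the function name `predict_tenured` and the tuple layout are assumptions, while the rule and the data are the ones shown on the slides.

```python
# Hypothetical helper encoding the rule produced by model construction:
# IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 1: estimate accuracy on data that was not used to build the model
correct = sum(
    predict_tenured(rank, years) == actual
    for _, rank, years, actual in testing_data
)
accuracy = correct / len(testing_data)

# Step 2: classify the unseen tuple (Jeff, Professor, 4)
jeff_prediction = predict_tenured("Professor", 4)

print(f"accuracy on testing data: {accuracy:.2f}")
print(f"Jeff tenured? {jeff_prediction}")
```

Merlisa (Associate Prof, 7 years, not tenured) is the one testing tuple the rule gets wrong, so the accuracy is 3 out of 4; Jeff is predicted as tenured because his rank is Professor.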
Decision trees
Information gain
Information gain:
$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$
Information before split:
$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Information after split:
$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$
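A minimal Python sketch of these formulas, applied to the tenure training data from the model construction slide. The function names `info`, `info_after_split` and `gain` are chosen here for illustration, and attributes are treated as nominal (each distinct value forms one partition D_j).

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_after_split(rows, attr, label):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr], []).append(row[label])
    return sum(len(part) / len(rows) * info(part) for part in partitions.values())

def gain(rows, attr, label):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([row[label] for row in rows]) - info_after_split(rows, attr, label)

# Tenure training data: (rank, years, tenured)
rows = [
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor", 2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]

info_d = info([row[2] for row in rows])
gain_rank = gain(rows, attr=0, label=2)
print(f"Info(D) = {info_d:.3f} bits, Gain(rank) = {gain_rank:.3f} bits")
```

With three tenured and three non-tenured tuples, Info(D) is 1 bit, and splitting on rank gains roughly 0.21 bits. In decision tree induction the attribute with the highest gain is chosen for the split.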
Try it: decision tree induction
Concepts
Overfitting
Pruning: postpruning and prepruning
Naïve Bayes
Bayes theorem:
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
Naïve Bayes classification:
Class C_i is hypothesis H
Other attributes are evidence X
Independence assumption:
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$
Estimate from training set
P(Ci) from class frequency
Nominal attributes:
P(xk|Ci) from occurrence of xk with instances in Ci
Numerical continuous attributes:
Gaussian distribution with mean μ_Ci and standard deviation σ_Ci
μ_Ci and σ_Ci from values of xk with instances in Ci
$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}$
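The Gaussian estimate for a numeric attribute can be sketched as follows. The attribute values are made up for illustration (they do not come from the slides), and using the sample standard deviation (`statistics.stdev`, which divides by n - 1) is an assumption; the population version would also be a reasonable choice.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian(x, mu, sigma):
    """Density g(x, mu, sigma) of a normal distribution."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Hypothetical values of one numeric attribute x_k for the instances of class C_i
values_in_class = [64.0, 68.0, 70.0, 72.0, 76.0]

mu = mean(values_in_class)      # estimate of the class mean
sigma = stdev(values_in_class)  # sample standard deviation (assumption, see above)

# Used in place of P(x_k | C_i) for a new value x_k = 66
density = gaussian(66.0, mu, sigma)
print(f"mu = {mu}, sigma = {sigma:.3f}, g(66) = {density:.4f}")
```

The result is a density, not a probability, so it can exceed 1 for small standard deviations; this is fine because only the comparison between classes matters.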
Try it:
Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

New day: Predict play

Outlook    Temp.   Humidity   Windy   Play
Sunny      Cool    High       True    ?
Outlook       Yes (count)   No (count)   Yes (fraction)   No (fraction)
Sunny         2             3            2/9              3/5
Overcast      4             0            4/9              0/5
Rainy         3             2            3/9              2/5

Temperature   Yes (count)   No (count)   Yes (fraction)   No (fraction)
Hot           2             2            2/9              2/5
Mild          4             2            4/9              2/5
Cool          3             1            3/9              1/5

Humidity      Yes (count)   No (count)   Yes (fraction)   No (fraction)
High          3             4            3/9              4/5
Normal        6             1            6/9              1/5

Windy         Yes (count)   No (count)   Yes (fraction)   No (fraction)
False         6             2            6/9              2/5
True          3             3            3/9              3/5

Play          Yes (count)   No (count)   Yes (fraction)   No (fraction)
              9             5            9/14             5/14
Outlook    Temp.   Humidity   Windy   Play
Sunny      Cool    High       True    ?

Likelihood of the two classes
For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
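The same calculation can be reproduced with a short Python sketch. The function names `train` and `posterior` and the tuple layout are assumptions made for this example; the data is the weather table from the "Try it" slide, and no smoothing is applied.

```python
from collections import Counter, defaultdict

# Weather data: (outlook, temp, humidity, windy, play)
data = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]

def train(rows):
    """Count class frequencies and attribute-value frequencies per class."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        for k, value in enumerate(row[:-1]):
            value_counts[(k, row[-1])][value] += 1
    return class_counts, value_counts

def posterior(x, class_counts, value_counts):
    """Multiply P(Ci) by each P(xk | Ci), then normalize over the classes."""
    total = sum(class_counts.values())
    likelihood = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for k, value in enumerate(x):
            score *= value_counts[(k, c)][value] / n_c
        likelihood[c] = score
    norm = sum(likelihood.values())
    return {c: score / norm for c, score in likelihood.items()}

class_counts, value_counts = train(data)
probs = posterior(("Sunny", "Cool", "High", True), class_counts, value_counts)
print({c: round(p, 3) for c, p in probs.items()})
```

For the new day this gives about 0.205 for Yes and 0.795 for No, matching the hand calculation, so the prediction is Play = No.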
Concepts
Zero-frequency problem
Smoothing / Laplacian correction
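In the weather data, Outlook = Overcast never occurs with Play = No, so the estimate 0/5 would force the whole product for class No to zero. A minimal sketch of the Laplacian correction follows; the function name `laplace_estimate` and the `alpha` parameter are assumptions made for this example.

```python
def laplace_estimate(count, class_total, n_values, alpha=1):
    """Smoothed P(x_k | C_i): add alpha to every value count of the attribute.

    alpha = 1 corresponds to the Laplacian correction (add-one smoothing).
    """
    return (count + alpha) / (class_total + alpha * n_values)

# Outlook counts for Play = No, taken from the weather frequency table
counts_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}
class_total = sum(counts_no.values())

unsmoothed = {v: n / class_total for v, n in counts_no.items()}
smoothed = {
    v: laplace_estimate(n, class_total, len(counts_no))
    for v, n in counts_no.items()
}

print("unsmoothed:", unsmoothed)
print("smoothed:  ", smoothed)
```

After the correction Overcast gets probability 1/8 instead of 0, and the three estimates still sum to 1.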
K-nearest neighbor
Concepts
Lazy learner
Distance function
Which ones?
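A minimal k-nearest-neighbor sketch to make the two concepts concrete. The points and labels are invented for illustration, and Euclidean distance (`math.dist`, available from Python 3.8) is one possible choice of distance function, not the only one. Being a lazy learner, the method has no training step: all work happens when a query arrives.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with class labels
train = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"),
    ((7.0, 5.5), "b"),
    ((6.5, 7.0), "b"),
]

prediction = knn_predict(train, (2.0, 2.0), k=3)
print(prediction)
```

Swapping in another distance function (for example Manhattan distance) only requires changing the `key` used for sorting. With attributes on different scales, normalizing the values first is usually advisable so that no single attribute dominates the distance.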
And now
Assignment classification, classification 2