Data Mining for Academics

This document provides an overview of classification and prediction techniques in data mining:
1. Supervised and unsupervised learning methods for classification and prediction problems. The classification techniques covered include decision trees and Naive Bayes classifiers.
2. The classification process: a model is constructed from a training dataset and then applied to predict class labels for new, unlabeled data.
3. Key concepts for classification algorithms, such as overfitting, pruning techniques for decision trees, and dealing with zero probabilities in Naive Bayes. Other techniques such as k-nearest neighbors are also introduced.

Uploaded by Dimitar Georgiev
Copyright: © Attribution Non-Commercial (BY-NC)

Classification and prediction

Data Mining Concepts and Techniques


Chapter 8.1-8.3, 9.5.1
Partly based on slides prepared by Jiawei Han

Type of method
Infrastructure - preparation - exploration - analysis - interpretation - exploration
Supervised - unsupervised
Classification - prediction

Process

Process (1): Model Construction


Classification Algorithms

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)

IF rank = professor OR years > 6
THEN tenured = yes

Process (2): Using the Model in Prediction

Classifier

Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data

(Jeff, Professor, 4)
Tenured?
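The two process steps can be sketched in a few lines of Python. This is only an illustration: the function name `predict_tenured` and the tuple layout are assumptions, while the rule and the data are the ones shown on the slides.

```python
# Rule taken from the Model Construction slide:
# IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 1: estimate accuracy on the labelled testing data.
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)

# Step 2: apply the model to the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)

print(accuracy, jeff)
```

On the four testing tuples the rule gets three right (Merlisa is predicted tenured but is not), and Jeff is predicted to be tenured.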

Decision trees

Information gain
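Information gain:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

Information before split:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information after split:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$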
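As a rough sketch of how these quantities could be computed, the Python below evaluates the information gain of the RANK attribute on the six training tuples from the Model Construction slide. The helper names (`info`, `info_after_split`, `gain`) are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy in bits of the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr_index, label_index=-1):
    """Info_A(D): size-weighted entropy of the partitions induced by attribute A."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[label_index])
    n = len(rows)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(rows, attr_index, label_index=-1):
    """Gain(A) = Info(D) - Info_A(D)."""
    labels = [row[label_index] for row in rows]
    return info(labels) - info_after_split(rows, attr_index, label_index)

# Training data from the slide: (rank, years, tenured)
training_data = [
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor", 2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]

gain_rank = gain(training_data, attr_index=0)
print(round(gain_rank, 4))
```

With three tenured and three non-tenured tuples, Info(D) is exactly 1 bit, so the gain of RANK is whatever part of that bit the split removes.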

Try it: decision tree induction

Concepts
Overfitting
Pruning: postpruning and prepruning

Naïve Bayes

Naïve Bayes
Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

Naïve Bayes classification:

Class Ci is hypothesis H
Other attributes are evidence X

Independence assumption:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$

Estimate from training set

P(Ci) from class frequency

Nominal attributes:
P(xk|Ci) from occurrence of xk with instances in Ci

Numerical continuous attributes:
Gaussian distribution with a mean μ_Ci and standard deviation σ_Ci
μ_Ci and σ_Ci from values of xk with instances in Ci

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}$$
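A minimal sketch of the Gaussian estimate for a continuous attribute follows. The attribute values are hypothetical and not taken from the slides, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the slides do not say which estimator is intended.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Hypothetical values of one continuous attribute x_k for the instances of one class C_i.
values_in_class = [25.0, 30.0, 35.0, 40.0]

mu = mean(values_in_class)      # mean of x_k within C_i
sigma = stdev(values_in_class)  # sample standard deviation (assumed estimator)

# Density used in place of P(x_k | C_i) for a new value x_k = 32.
likelihood = gaussian(32.0, mu, sigma)
print(mu, sigma, likelihood)
```

The result is a density, not a probability, so it can exceed 1 for small standard deviations; this is harmless because only the relative sizes across classes matter.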


Try it:

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

New day: Predict play

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?


Counts and relative frequencies per class:

Attribute     Value      Yes        No
Outlook       Sunny      2 (2/9)    3 (3/5)
Outlook       Overcast   4 (4/9)    0 (0/5)
Outlook       Rainy      3 (3/9)    2 (2/5)
Temperature   Hot        2 (2/9)    2 (2/5)
Temperature   Mild       4 (4/9)    2 (2/5)
Temperature   Cool       3 (3/9)    1 (1/5)
Humidity      High       3 (3/9)    4 (4/5)
Humidity      Normal     6 (6/9)    1 (1/5)
Windy         False      6 (6/9)    2 (2/5)
Windy         True       3 (3/9)    3 (3/5)
Play                     9 (9/14)   5 (5/14)

New day: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?

Likelihood of the two classes
For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
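The worked example can be checked with a short script. The sketch below counts frequencies on the 14-day weather table from the "Try it" slide and multiplies them under the independence assumption; the function names `train` and `classify` and the tuple encoding (Windy kept as a string) are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Weather data from the slide: (Outlook, Temp, Humidity, Windy, Play)
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def train(rows):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        for k, value in enumerate(row[:-1]):
            value_counts[(k, row[-1])][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Return unnormalized likelihoods and normalized probabilities per class."""
    total = sum(class_counts.values())
    likelihood = {}
    for c, n_c in class_counts.items():
        score = n_c / total                             # P(Ci)
        for k, value in enumerate(x):
            score *= value_counts[(k, c)][value] / n_c  # P(xk | Ci)
        likelihood[c] = score
    norm = sum(likelihood.values())
    return likelihood, {c: s / norm for c, s in likelihood.items()}

class_counts, value_counts = train(DATA)
new_day = ("Sunny", "Cool", "High", "True")
likelihood, probability = classify(new_day, class_counts, value_counts)
print(likelihood, probability)
```

No smoothing is applied here, so a value that never occurs with a class would force that class's likelihood to zero.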

Concepts
Zero-frequency problem
Smoothing / Laplacian correction
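In the weather example, Outlook = Overcast never occurs with Play = No (0/5), so any overcast day would get a likelihood of zero for "no". A minimal sketch of add-one (Laplacian) smoothing on those counts is shown below; the function name `smoothed` and the parameter `alpha` are assumptions for illustration.

```python
def smoothed(count, class_total, n_values, alpha=1):
    """Add `alpha` to every value count so that no estimate is exactly zero."""
    return (count + alpha) / (class_total + alpha * n_values)

# Outlook counts for class Play = No, from the weather example.
outlook_given_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}
class_total = sum(outlook_given_no.values())

smoothed_probs = {
    value: smoothed(count, class_total, len(outlook_given_no))
    for value, count in outlook_given_no.items()
}
print(smoothed_probs)
```

The estimates become 4/8, 1/8 and 3/8 instead of 3/5, 0/5 and 2/5, and they still sum to 1.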


K-nearest neighbor


Concepts
Lazy learner
Distance function
Which ones?
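A small sketch of a k-nearest-neighbor classifier illustrates both concepts: no model is built in advance (the "lazy" part), and the distance function is a pluggable choice. Euclidean and Manhattan distance are shown as two common options; the data points and names below are hypothetical and not taken from the slides.

```python
from collections import Counter
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(query, examples, k=3, distance=euclidean):
    """Majority vote among the k stored examples closest to the query."""
    neighbours = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional training examples: (features, class label)
examples = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]

prediction = knn_predict((1.2, 1.4), examples)
prediction_manhattan = knn_predict((6.4, 6.4), examples, distance=manhattan)
print(prediction, prediction_manhattan)
```

All the work happens at prediction time, which is why the whole training set has to be kept and scanned for every query.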


And now
Assignment classification, classification 2

