Classification lecture 1
Bayesian classification
Rule-based classification
Target marketing
Medical diagnosis
Fraud detection (e.g., predicting whether fraud will occur)
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction
[Figure, Process (1) Model Construction: Training Data are fed to the Classification Algorithm, which produces the Classifier (model). Process (2) Using the Model: the Classifier is then applied to Testing Data and to Unseen Data, e.g. the tuple (Jeff, Professor, 4): Tenured?]

Test data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
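As an illustration of the two phases (model construction on training data, then classifying tuples with unknown labels), here is a minimal sketch; the training rows and the choice of scikit-learn's DecisionTreeClassifier are assumptions made for the example, not part of the lecture.

```python
# Sketch of the classification process: build a model from labeled training
# tuples, then use it on a tuple whose class label is unknown.
# The training rows below are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

RANK = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

train = [                                  # (rank, years, tenured)
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor",      2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]
X = [[RANK[rank], years] for rank, years, _ in train]
y = [label for _, _, label in train]

model = DecisionTreeClassifier().fit(X, y)      # Process (1): model construction

# Process (2): classify an unseen tuple, e.g. (Jeff, Professor, 4)
print(model.predict([[RANK["Professor"], 4]]))  # e.g. ['yes']
```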
Supervised vs. Unsupervised Learning
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
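A minimal sketch of these preparation steps for one numeric attribute, assuming mean imputation for missing values and min-max normalization (both are illustrative choices, not prescribed by the lecture):

```python
# Data cleaning (fill missing values with the attribute mean) followed by
# data transformation (min-max normalization to [0, 1]).
def preprocess(column):
    """column: list of numbers, with None marking missing values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    cleaned = [mean if v is None else v for v in column]   # cleaning
    lo, hi = min(cleaned), max(cleaned)
    if hi == lo:                                           # constant attribute
        return [0.0] * len(cleaned)
    return [(v - lo) / (hi - lo) for v in cleaned]         # normalization

print(preprocess([30, None, 45, 25, 60]))
```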
Accuracy
classifier accuracy: predicting class label
[Figure: example decision tree, rooted at the test "age?" with branches <=30, 31..40, and >40 leading to further tests or to the leaf class labels no / yes.]
gini_A(D) = |D1|/|D|·gini(D1) + |D2|/|D|·gini(D2), where gini(D) = 1 − Σ_j p_j² and D is split on attribute A into D1 and D2;
but gini_{income ∈ {medium,high}} is 0.30 and thus the best split, since it is the lowest (see the sketch below)
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
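A small sketch of how the Gini quantities used in the split example are computed; the 9-yes / 5-no class distribution at the end is only an illustrative data set:

```python
# gini(D) = 1 - sum_j p_j^2 for the class distribution in D, and for a binary
# split of D into D1 and D2 on attribute A:
#   gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

D = ["yes"] * 9 + ["no"] * 5      # illustrative class distribution
print(round(gini(D), 3))          # 0.459
```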
Overfitting:
Overfitting results in decision trees that are more
complex than necessary
An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or
outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below a threshold (one practical realization is sketched below)
Difficult to choose an appropriate threshold
Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees; use a data set different from the training data to decide which is the "best pruned tree"
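One way the prepruning threshold is realized in practice is a minimum impurity decrease; the sketch below uses scikit-learn's min_impurity_decrease parameter with an arbitrary value of 0.01 as a stand-in for the goodness-measure threshold, on synthetic data:

```python
# Prepruning: refuse any split whose impurity decrease falls below a threshold.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)    # noisy labels

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(min_impurity_decrease=0.01,   # prepruning threshold
                                random_state=0).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)         # pruned tree is much smaller
```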
SLIQ keeps a class list that must stay memory-resident and can grow large.
When the class list becomes too large, the performance of SLIQ degrades.
SPRINT: constructs an attribute list data structure.
Example: classify the unseen sample X = (rain, hot, high, false)
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286
Since P(X|n)·P(n) > P(X|p)·P(p), sample X is classified as class n (do not play)
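Re-computing the two products directly confirms the arithmetic and the resulting class; this is only a restatement of the example above:

```python
# P(X|p)·P(p) vs. P(X|n)·P(n) for X = (rain, hot, high, false).
from math import prod

cond_p = [3/9, 2/9, 3/9, 6/9]     # P(rain|p), P(hot|p), P(high|p), P(false|p)
cond_n = [2/5, 2/5, 4/5, 2/5]     # the same conditionals for class n
prior_p, prior_n = 9/14, 5/14

score_p = prod(cond_p) * prior_p
score_n = prod(cond_n) * prior_n
print(round(score_p, 6), round(score_n, 6))             # 0.010582 0.018286
print("class n" if score_n > score_p else "class p")    # class n (do not play)
```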
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be
modeled by a naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Belief Networks
Bayesian classification
Rule-based classification
Non-linear regression
For the linear model y = w0 + w1·x, the coefficients are estimated by the method of least squares:

w1 = Σ_{i=1..|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1..|D|} (x_i − x̄)²
w0 = ȳ − w1·x̄

Example prediction from a fitted line: Y = $58.6
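A minimal sketch of the least-squares formulas above. The (x, y) pairs are illustrative years-of-experience / salary values (in thousands) chosen for the example, not data given in this text:

```python
# w1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,   w0 = ȳ − w1·x̄
def fit_line(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # illustrative data
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = fit_line(xs, ys)
print(round(w1, 1), round(w0, 1))                # slope ≈ 3.5, intercept ≈ 23.2
print(round(w0 + w1 * 10, 1))                    # prediction at x = 10: 58.6
```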
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial
function
A polynomial regression model can be transformed into a linear regression
model. For example,
y = w0 + w1·x + w2·x^2 + w3·x^3
is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
y = w0 + w1·x + w2·x2 + w3·x3
Other functions, such as the power function, can also be transformed to a
linear model
Some models are intractably nonlinear (e.g., a sum of exponential terms); it is
still possible to obtain least-squares estimates through more extensive
computation on more complex formulae
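A sketch of the variable transformation in code: with x^2 and x^3 added as extra predictor columns, ordinary linear least squares recovers the polynomial coefficients (the synthetic data and the coefficient values 1.0, 2.0, -0.5, 0.3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(0, 0.1, 50)

# new variables x2 = x^2 and x3 = x^3 turn the cubic into a linear model
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary linear least squares
print(np.round(w, 2))                         # ≈ [1.0, 2.0, -0.5, 0.3]
```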
[Figure: holdout method: the available data are randomly split into a training set, used to derive the classifier, and a test set, used to estimate its accuracy.]
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies
obtained
Evaluating the Accuracy of a Classifier or Predictor (I)
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets of approximately
equal size; at the i-th iteration, use subset Di as the test set and the
remaining subsets together as the training set
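A minimal sketch of k-fold cross-validation; `train` and `evaluate` are hypothetical placeholders for whatever model-construction and accuracy-estimation routines are in use:

```python
# Partition the data into k folds, train on k-1 folds, test on the held-out
# fold, and average the k accuracy estimates.
import random

def k_fold_accuracy(data, k, train, evaluate):
    data = data[:]
    random.shuffle(data)                       # random partition
    folds = [data[i::k] for i in range(k)]     # k mutually exclusive subsets
    scores = []
    for i in range(k):
        test = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        scores.append(evaluate(model, test))
    return sum(scores) / k                     # accuracy = avg. of the k accuracies
```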
Ensemble methods
Use a combination of models to increase accuracy
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
The bagged classifier M* counts the votes and assigns the class with the most
votes to X
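A minimal sketch of bagging with majority voting; `train` and `predict` are hypothetical placeholders for the base learner's interface:

```python
# Train each of k models on a bootstrap sample, then let the bagged
# classifier M* assign the class with the most votes.
import random
from collections import Counter

def bagging(training_data, k, train):
    models = []
    for _ in range(k):
        sample = [random.choice(training_data) for _ in training_data]  # bootstrap
        models.append(train(sample))
    return models

def bagged_predict(models, predict, x):
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]          # class with the most votes
```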
t_pos ("cancer" samples that were correctly classified as such)
t_neg ("not_cancer" samples that were correctly classified as such)
False positives ("not_cancer" samples that were incorrectly classified as "cancer")
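These counts can be tallied directly from pairs of actual and predicted labels; the example labels below are made up, and false negatives (the remaining case) are included for completeness:

```python
# Count t_pos, t_neg, false positives and false negatives, with "cancer"
# as the positive class.
def confusion_counts(actual, predicted, positive="cancer"):
    t_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    t_neg = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    f_pos = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    f_neg = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return t_pos, t_neg, f_pos, f_neg

actual    = ["cancer", "cancer", "not_cancer", "not_cancer", "not_cancer"]
predicted = ["cancer", "not_cancer", "not_cancer", "cancer", "not_cancer"]
print(confusion_counts(actual, predicted))     # (1, 2, 1, 1)
```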
Mean absolute error:      Σ_{i=1..d} |y_i − y_i'| / d
Mean squared error:       Σ_{i=1..d} (y_i − y_i')² / d
Relative absolute error:  Σ_{i=1..d} |y_i − y_i'| / Σ_{i=1..d} |y_i − ȳ|
Relative squared error:   Σ_{i=1..d} (y_i − y_i')² / Σ_{i=1..d} (y_i − ȳ)²
where y_i is the actual value, y_i' the predicted value, and ȳ the mean of the y_i
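The same error measures computed directly from their definitions; the numeric example is made up:

```python
def error_measures(y_true, y_pred):
    d = len(y_true)
    y_bar = sum(y_true) / d
    mae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / d     # mean absolute error
    mse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / d   # mean squared error
    rae = (sum(abs(y - yp) for y, yp in zip(y_true, y_pred))
           / sum(abs(y - y_bar) for y in y_true))                   # relative absolute error
    rse = (sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
           / sum((y - y_bar) ** 2 for y in y_true))                 # relative squared error
    return mae, mse, rae, rse

print(error_measures([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))
```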