
3. Classification (12 hours)

3.1 Basics and Algorithms
3.2 Decision Tree Classifier
3.3 Rule Based Classifier
3.4 Nearest Neighbor Classifier
3.5 Bayesian Classifier
3.6 Artificial Neural Network Classifier
3.7 Issues: Overfitting, Validation, Model Comparison

3.7 Issues: Overfitting, Validation, Model Comparison
Overfitting
- Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
- Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
- A model that has been overfit will generally have poor predictive performance.
- Overfitting depends not only on the number of parameters and the amount of data, but also on the conformability of the model structure to the data.
- To avoid overfitting, it is necessary to use additional techniques, e.g. cross-validation or pruning (pre- or post-pruning), as sketched below.
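A minimal sketch of how overfitting shows up in practice. The dataset and classifier here (scikit-learn's breast-cancer data and a decision tree) are illustrative assumptions, not part of the original slides: as the tree is allowed to grow deeper, training accuracy keeps rising while held-out accuracy stalls or drops.

```python
# Illustrative sketch: training vs. test accuracy diverge as model
# complexity (tree depth) grows -- a telltale sign of overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

for depth in (1, 3, 5, 10, None):            # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"depth={depth}: train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")
```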
Reasons for overfitting:
- Noise in the training data.
- Incomplete training data.
- Flaws in the assumed theory.
Validation
- Validation techniques are motivated by two fundamental problems in pattern recognition:
  - model selection, and
  - performance estimation.
- Validation approaches:
  - One approach is to use the entire training data to select the classifier and estimate the error rate, but the resulting estimate is optimistic because the final model will normally overfit the training data.
  - A much better approach is to split the data into disjoint subsets: cross-validation (the holdout method).
Cross-Validation (The Holdout Method)
- The data set is divided into two groups:
  - Training set: used to train the classifier, and
  - Test set: used to estimate the error rate of the trained classifier.
- Total number of examples = training set + test set.
- Approach 1: Random Subsampling
  - Random subsampling performs K data splits of the dataset.
  - Each split randomly selects a (fixed) number of examples without replacement.
  - For each data split we retrain the classifier from scratch with the training examples and estimate its error with the test examples. A sketch of this procedure appears below.
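A sketch of random subsampling, assuming scikit-learn's ShuffleSplit (which performs repeated random train/test splits) and the iris dataset as stand-ins; the classifier choice is likewise an illustrative assumption.

```python
# Sketch of random subsampling: K independent random train/test splits,
# retraining the classifier from scratch each time and averaging the error.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
splitter = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)   # K = 5 splits

errors = []
for train_idx, test_idx in splitter.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                        # retrain from scratch
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))   # error on this split's test set

print("estimated error rate:", np.mean(errors))
```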
- Approach 2: K-Fold Cross-Validation
  - K-fold cross-validation is similar to random subsampling.
  - Create a K-fold partition of the dataset. For each of the K experiments, use K-1 folds for training and the remaining fold for testing.
  - The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing.
  - The true error is estimated as the average error rate over the K experiments (see the sketch below).
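A minimal K-fold sketch under the same assumptions as above (scikit-learn, iris data, a decision tree): every example is used for testing exactly once, and the error estimate is the average over the K folds.

```python
# Sketch of K-fold cross-validation with K = 5.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)
print("per-fold accuracy:", scores)
print("estimated error rate:", 1.0 - scores.mean())   # average over the K folds
```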
- Approach 3: Leave-One-Out Cross-Validation
  - Leave-one-out is the degenerate case of K-fold cross-validation, where K equals the total number of examples: one sample is left out for testing at each experiment.
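For completeness, a leave-one-out sketch (same illustrative assumptions; the nearest-neighbor classifier ties back to Section 3.4):

```python
# Sketch of leave-one-out cross-validation: K equals the number of examples,
# so each example is held out once while the model trains on all the others.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("LOO error estimate:", 1.0 - scores.mean())   # average over one-example test sets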
Example: 5-Fold Cross-Validation (figure not preserved).
Model Comparison
- Models can be evaluated based on their output using different methods:
  i. Confusion Matrix
  ii. ROC Analysis
  iii. Others, such as Gain and Lift Charts and K-S Charts
i. Confusion Matrix (Contingency Table)
- A confusion matrix contains information about the actual and predicted classifications produced by a classifier.
- The performance of such a system is commonly evaluated using the data in the matrix.
- Also known as a contingency table or an error matrix, it is a specific table layout that allows visualization of the performance of an algorithm.
- Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

  Actual \ Predicted | Predicted C1         | Predicted ¬C1
  Actual C1          | True Positives (TP)  | False Negatives (FN)
  Actual ¬C1         | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

  Actual \ Predicted  | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes  | 6954               | 46                | 7000
  buy_computer = no   | 412                | 2588              | 3000
  Total               | 7366               | 2634              | 10000

- Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
- The matrix may have extra rows/columns to provide totals.
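A short sketch of building such a matrix with scikit-learn's confusion_matrix; the tiny label vectors below are made-up illustrations, not the slide's buy_computer data. With rows as actual classes and columns as predicted classes, the layout matches the table above.

```python
# Sketch: confusion matrix with scikit-learn (rows = actual, columns = predicted).
from sklearn.metrics import confusion_matrix

y_actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
y_predicted = ["yes", "no",  "yes", "no", "yes", "no", "no", "yes"]

cm = confusion_matrix(y_actual, y_predicted, labels=["yes", "no"])
print(cm)
# [[TP FN]
#  [FP TN]]   when "yes" is treated as the positive class
```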
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

  A \ P | C  | ¬C |
  C     | TP | FN | P
  ¬C    | FP | TN | N
        | P' | N' | All

- Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified.
  Accuracy = (TP + TN) / All
- Error rate: 1 - accuracy, or
  Error rate = (FP + FN) / All
- Class Imbalance Problem:
  - One class may be rare, e.g. fraud or HIV-positive.
  - There is a significant majority of the negative class and a minority of the positive class.
  - Sensitivity: true positive recognition rate; Sensitivity = TP / P
  - Specificity: true negative recognition rate; Specificity = TN / N
  - FPR = 1 - TNR (specificity)
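A quick sketch applying these definitions to the buy_computer confusion matrix shown earlier (TP = 6954, FN = 46, FP = 412, TN = 2588); the variable names are just for illustration.

```python
# Sketch: accuracy, error rate, sensitivity, specificity from the counts above.
TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN           # actual positives and negatives
ALL = P + N

accuracy    = (TP + TN) / ALL     # 0.9542
error_rate  = (FP + FN) / ALL     # 0.0458
sensitivity = TP / P              # true positive recognition rate, ~0.9934
specificity = TN / N              # true negative recognition rate, ~0.8627
fpr         = 1 - specificity     # false positive rate

print(accuracy, error_rate, sensitivity, specificity, fpr)
```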
Classifier Evaluation Metrics: Precision, Recall, and F-measures
- Precision (exactness): what % of the tuples that the classifier labeled as positive are actually positive?
- Recall (completeness): what % of positive tuples did the classifier label as positive?
- A perfect score is 1.0.
- There is an inverse relationship between precision and recall.
- F measure (F1 or F-score): the harmonic mean of precision and recall.
- Fß: a weighted measure of precision and recall that assigns ß times as much weight to recall as to precision.
- The standard formulas are written out below.
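The slide's formula images did not survive extraction; the following are the standard textbook definitions consistent with the descriptions above, written in LaTeX.

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}
\qquad
\mathrm{Recall} = \frac{TP}{TP+FN} = \frac{TP}{P}

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\qquad
F_\beta = \frac{(1+\beta^2)\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}
```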
Classifier Evaluation Metrics: Example

  Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes       | 90           | 210         | 300   | 30.00 (sensitivity)
  cancer = no        | 140          | 9560        | 9700  | 98.56 (specificity)
  Total              | 230          | 9770        | 10000 | 96.40 (accuracy)

- Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
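As an illustrative check, the example's precision and recall (and the F1 score they imply, which the slide does not list) can be recomputed directly from the table's counts:

```python
# Sketch: verifying the example's precision and recall, plus the implied F1.
TP, FN, FP, TN = 90, 210, 140, 9560   # from the cancer table above

precision = TP / (TP + FP)            # 90/230 = 0.3913
recall    = TP / (TP + FN)            # 90/300 = 0.3000
f1        = 2 * precision * recall / (precision + recall)   # roughly 0.34

print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```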
ii. ROC Analysis
- The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
- The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
- The ROC curve plots sensitivity (TPR) versus FPR.
- ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.
- ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
Model Selection: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models.
- Originated from signal detection theory.
- Show the trade-off between the true positive rate and the false positive rate.
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate.
- The area under the ROC curve is a measure of the accuracy of the model; a model with perfect accuracy will have an area of 1.0.
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list.
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
The figure shows the ROC curves of two classification models. The diagonal line representing random guessing is also shown. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model.

If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list. Thus, the curve moves steeply up from zero. Later, as we start to encounter fewer and fewer true positives and more and more false positives, the curve eases off and becomes more horizontal.

To assess the accuracy of a model, we can measure the area under the curve; several software packages can perform this calculation (a sketch follows). The closer the area is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.
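A sketch of computing an ROC curve and its area with scikit-learn. The dataset and the naive Bayes classifier (cf. Section 3.5) are illustrative assumptions; the classifier's predicted probabilities serve as the ranking scores described above.

```python
# Sketch: ROC curve points and area under the curve (AUC).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]         # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # points of the ROC curve
print("AUC:", roc_auc_score(y_test, scores))       # 1.0 = perfect, 0.5 = random guessing
```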
Issues Affecting Model Selection
- Accuracy
  - classifier accuracy: predicting class label
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Other measures, e.g. goodness of rules, such as decision tree size or compactness of classification rules
Summary (I)
- Classification is a form of data analysis that extracts models describing important data classes.
- Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, rule-based classification, and many other classification methods.
- Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure.
- Stratified k-fold cross-validation is recommended for accuracy estimation.
Summary (II)
- Significance tests and ROC curves are useful for model selection.
- There have been numerous comparisons of the different classification methods; the matter remains a research topic.
- No single method has been found to be superior over all others for all data sets.
- Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method.
