0% found this document useful (0 votes)

8 views24 pages

Unit3 7 Issues

Uploaded by

Nishan shah Thakuri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views24 pages

Unit3 7 Issues

Uploaded by

Nishan shah Thakuri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

3.

Classification (12 hours)

3.1 Basics and Algorithms

3.2 Decision Tree Classifier [humanoriented]
3.3 Rule Based Classifier
3.4 Nearest Neighbor Classifier
3.5 Bayesian Classifier
3.6 Artificial Neural Network Classifier
3.7 Issues : Overfitting, Validation, Model Comparison
3.7 Issues : Overfitting, Validation, Model
Comparison

2
Overfitting
n Overfitting occurs when a statistical model describes
random error or noise instead of the underlying
relationship.
n Overfitting generally occurs when a model is excessively
complex, such as having too many parameters relative to
the number of observations.
n A model which has been overfit will generally have poor
predictive performance.
n Overfitting depends not only on the number of parameters
and data but also the conformability of the model
structure.
n In order to avoid overfitting, it is necessary to use
additional techniques (e.g. crossvalidation, pruning (Pre or
3
4
n Reason
n Noise in training data.

n Incomplete training data.

n Flaw in assumed theory.

5
Validation
n Validation techniques are motivated by two fundamental
problems in pattern recognition:
n model selection and

n performance estimation

n Validation Approaches:
n One approach is to use the entire training data to

select our classifier and estimate the error rate, but

the final model will normally overfit the training data.
n A much better approach is to split the training data

into disjoint subsets cross validation ( The Holdout

Method)
6
Cross Validation (The holdout method)
n Data set divided into two groups.
n Training set: used to train the classifier and

n Test set: used to estimate the error rate of the trained

classifier
n Total number of examples = Training Set +Test Set
n Approach1: Random Sub sampling
n Random Sub sampling performs K data splits of the

dataset
n Each split randomly selects (fixed) no. examples without

replacement
n For each data split we retrain the classifier from scratch

with the training examples and estimate error with the

7
n Approach2: K-Fold Cross-Validation
n K-Fold Cross validation is similar to Random Sub

sampling.
n Create a K-fold partition of the dataset, For each of

K experiments, use K-1 folds for training and the

remaining one for testing.
n The advantage of K-Fold Cross validation is that all

the examples in the dataset are eventually used for

both training and testing.
n The true error is estimated as the average error

rate
8
n Approach3: Leave-one-out Cross-Validation
n Leave-one-out is the degenerate case of K-Fold

Cross Validation, where K is chosen as the total

number of examples where one sample is left out
at each experiment.

9
Example – 5 Fold Cross Validation

10
Model Comparison

n Models can be evaluated based on the output using

different method :
n i. Confusion Matrix

n ii. ROC Analysis

n iii. Others such as: Gain and Lift Charts, K-S Charts

11
i. Confusion Matrix (Contigency Table):
n A confusion matrix contains information about actual
and predicted classifications done by classifier.
n Performance of such system is commonly evaluated
using data in the matrix.
n It is also known as a contingency table or an error
matrix, is a specific table layout that allows
visualization of the performance of an algorithm.
n Each column of the matrix represents the instances in
a predicted class, while each row represents the
instances in an actual class.

12
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:
Actual class\Predicted class Predicted C1 Predicted ¬ C1
Actual C1 True Positives (TP) False Negatives (FN)
Actual ¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

n Given m classes, an entry, CMi,j in a confusion matrix indicates

# of tuples in class i that were labeled by the classifier as class j
n May have extra rows/columns to provide totals
13
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
A\P C ¬C n Class Imbalance Problem:
C TP FN P
n One class may be rare, e.g.
¬C FP TN N
fraud, or HIV-positive
P’ N’ All
n Significant majority of the

n Classifier Accuracy, or negative class and minority of

recognition rate: percentage of the positive class
test set tuples that are correctly n Sensitivity: True Positive
classified recognition rate
Accuracy = (TP + TN)/All n Sensitivity = TP/P

n Error rate: 1 – accuracy, or n Specificity: True Negative

Error rate = (FP + FN)/All recognition rate

n Specificity = TN/N

n FPR = 1- TNR(specificity) 14
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
n Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

n Recall: completeness – what % of positive tuples did the

classifier label as positive?
n Perfect score is 1.0
n Inverse relationship between precision & recall
n F measure (F1 or F-score): harmonic mean of precision and
recall,

n Fß: weighted measure of precision and recall

n assigns ß times as much weight to recall as to precision

15
Classifier Evaluation Metrics: Example

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

cancer = yes 90 210 300 30.00 (sensitivity
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)

n Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

16
ii. ROC Analysis
n Receiver Operating Characteristic (ROC), or ROC curve, is a
graphical plot that illustrates the performance of a binary
classifier system as its discrimination threshold is varied.
n The curve is created by plotting the true positive rate
against the false positive rate at various threshold
settings.
n The ROC curve plots sensitivity (TPR) versus FPR
n ROC analysis provides tools to select possibly optimal
models and to discard suboptimal ones independently
from (and prior to specifying) the cost context or the class
distribution.
17
n ROC analysis is related in a direct and natural way to
cost/benefit analysis of diagnostic decision making.

18
Model Selection: ROC Curves
n ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
n Originated from signal detection theory
n Shows the trade-off between the true
positive rate and the false positive
rate
n The area under the ROC curve is a n Vertical axis represents
measure of the accuracy of the model the true positive rate
n Rank the test tuples in decreasing n Horizontal axis rep. the
order: the one that is most likely to false positive rate
belong to the positive class appears at n A model with perfect
the top of the list accuracy will have an
area of 1.0
n The closer to the diagonal line (i.e., the
closer the area is to 0.5), the less
19
20
Figure shows the ROC curves of two classification models. The
diagonal line representing random guessing is also shown. Thus,
the closer the ROC curve of a model is to the diagonal line, the
less accurate the model.
If the model is really good, initially we are more likely to
encounter true positives as we move down the ranked
list.
Thus, the curve moves steeply up from zero. Later, as we start to
encounter fewer and fewer true positives, and more and more
false positives, the curve eases off and becomes more horizontal.
To assess the accuracy of a model, we can measure the area
under the curve. Several software packages are able to perform
such calculation.
The closer the area is to 0.5, the less accurate the
corresponding model is. A model with perfect accuracy
will have an area of 1.0.

21
Issues Affecting Model Selection
n Accuracy
n classifier accuracy: predicting class label
n Speed
n time to construct the model (training time)
n time to use the model (classification/prediction time)
n Robustness: handling noise and missing values
n Scalability: efficiency in disk-resident databases
n Interpretability
n understanding and insight provided by the model
n Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
22
Summary (I)
n Classification is a form of data analysis that extracts models
describing important data classes.
n Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
n Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
n Stratified k-fold cross-validation is recommended for accuracy
estimation

23
Summary (II)
n Significance tests and ROC curves are useful for model selection.
n There have been numerous comparisons of the different
classification methods; the matter remains a research topic
n No single method has been found to be superior over all others
for all data sets
n Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method

Int3209 - Data Mining: Week 5: Classification Model Improvements
No ratings yet
Int3209 - Data Mining: Week 5: Classification Model Improvements
56 pages
Data Analytics Using R (DA-R)
100% (1)
Data Analytics Using R (DA-R)
67 pages
Lecture 5 Evaluation - Classifer
No ratings yet
Lecture 5 Evaluation - Classifer
61 pages
Chp8 Classification Basic Concepts - Lecture#8
No ratings yet
Chp8 Classification Basic Concepts - Lecture#8
40 pages
6.data Mining - Classification
No ratings yet
6.data Mining - Classification
37 pages
9b. Evaluation of Classifiers
No ratings yet
9b. Evaluation of Classifiers
4 pages
Lecture 3b - Evaluation
No ratings yet
Lecture 3b - Evaluation
37 pages
Week 5
No ratings yet
Week 5
72 pages
2 Supervised Learning
No ratings yet
2 Supervised Learning
52 pages
TE - DWM Module No 3
No ratings yet
TE - DWM Module No 3
48 pages
ML Model Evaluation
No ratings yet
ML Model Evaluation
17 pages
UNIT-1-2.Binary Classification and Related Tasks
No ratings yet
UNIT-1-2.Binary Classification and Related Tasks
22 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
37 pages
Classification - Performance Evlaution
No ratings yet
Classification - Performance Evlaution
13 pages
Module 4
No ratings yet
Module 4
12 pages
Data Mining Final
No ratings yet
Data Mining Final
25 pages
ClassificationandPrediction Module3
No ratings yet
ClassificationandPrediction Module3
88 pages
Evaluation Method Holdout
No ratings yet
Evaluation Method Holdout
14 pages
FALLSEM2024-25 BCSE334L TH VL2024250101768 2024-10-08 Reference-Material-I
No ratings yet
FALLSEM2024-25 BCSE334L TH VL2024250101768 2024-10-08 Reference-Material-I
18 pages
Model Performance Assessment
No ratings yet
Model Performance Assessment
13 pages
Module 5 Advanced Classification Techniques
No ratings yet
Module 5 Advanced Classification Techniques
40 pages
DM - Ch4 - Classification (Part1)
No ratings yet
DM - Ch4 - Classification (Part1)
20 pages
Unit 6-Feature Engineering and Sensitivity Analysis
No ratings yet
Unit 6-Feature Engineering and Sensitivity Analysis
63 pages
DM 09 Classification and Prediction 19112024 102854am
No ratings yet
DM 09 Classification and Prediction 19112024 102854am
21 pages
NLP Unit-1 Notes
No ratings yet
NLP Unit-1 Notes
59 pages
Evaluation Matrix
No ratings yet
Evaluation Matrix
29 pages
ML Unit-3 - RTU
No ratings yet
ML Unit-3 - RTU
20 pages
A10 Model Performance v2 2up
No ratings yet
A10 Model Performance v2 2up
11 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
49 pages
An Introduction To Structural Health Monitoring:, 303-315 2007 Charles R Farrar and Keith Worden
No ratings yet
An Introduction To Structural Health Monitoring:, 303-315 2007 Charles R Farrar and Keith Worden
14 pages
6 Evaluarea Performantei
No ratings yet
6 Evaluarea Performantei
43 pages
MACHINELEARNING
No ratings yet
MACHINELEARNING
20 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
41 pages
Classification Metrics
No ratings yet
Classification Metrics
39 pages
MMPC-5 Imp
100% (1)
MMPC-5 Imp
32 pages
CH-5 ML
No ratings yet
CH-5 ML
36 pages
Lesson 6 Analytics Methods
No ratings yet
Lesson 6 Analytics Methods
12 pages
Unit6 - 7 Issues
No ratings yet
Unit6 - 7 Issues
53 pages
Ai DS 2 Book-Chpt-5
No ratings yet
Ai DS 2 Book-Chpt-5
17 pages
CST 42315 Dam - L9 1
No ratings yet
CST 42315 Dam - L9 1
15 pages
Clase10 11
No ratings yet
Clase10 11
18 pages
04 - Model Selection
No ratings yet
04 - Model Selection
62 pages
Classification Unit-4
No ratings yet
Classification Unit-4
19 pages
Module 6
No ratings yet
Module 6
24 pages
2-Training and Testing Models, Evaluation Metrics-01-07-2023
No ratings yet
2-Training and Testing Models, Evaluation Metrics-01-07-2023
23 pages
Module 5 ML
No ratings yet
Module 5 ML
12 pages
Chap3 Part1 Classification
No ratings yet
Chap3 Part1 Classification
38 pages
CS 620 / DASC 600 Introduction To Data Science & Analytics: Lecture 8-Performance Evaluation
No ratings yet
CS 620 / DASC 600 Introduction To Data Science & Analytics: Lecture 8-Performance Evaluation
62 pages
Lecture11evaluationmetricsforclassification 240913060639 0c766554
No ratings yet
Lecture11evaluationmetricsforclassification 240913060639 0c766554
28 pages
CSC4316 9
No ratings yet
CSC4316 9
40 pages
ML Metrics
No ratings yet
ML Metrics
9 pages
DL IT324a 4
No ratings yet
DL IT324a 4
52 pages
Xchapter 1
No ratings yet
Xchapter 1
31 pages
V1-CH-6-Classification and Prediction
No ratings yet
V1-CH-6-Classification and Prediction
38 pages
DWDM - Unit - V
No ratings yet
DWDM - Unit - V
93 pages
L 13 Choose Your Own Algorithm D 07062024 111828am
No ratings yet
L 13 Choose Your Own Algorithm D 07062024 111828am
36 pages
Machine Learning Project Report (Group 3) Shahbaz Khan
No ratings yet
Machine Learning Project Report (Group 3) Shahbaz Khan
11 pages
DM Unit - 3
No ratings yet
DM Unit - 3
21 pages
Unit-6: Classification and Prediction
No ratings yet
Unit-6: Classification and Prediction
63 pages
DWDM Unit IV Note
No ratings yet
DWDM Unit IV Note
21 pages
4 Classification
No ratings yet
4 Classification
20 pages
ML Unit 3
No ratings yet
ML Unit 3
127 pages
DWDM Unit-3: What Is Classification? What Is Prediction?
No ratings yet
DWDM Unit-3: What Is Classification? What Is Prediction?
12 pages
AIML Curriculum Powered by IBM - Pregrad-Merged
No ratings yet
AIML Curriculum Powered by IBM - Pregrad-Merged
66 pages
Machine Learning Lab (CIE 421P)
No ratings yet
Machine Learning Lab (CIE 421P)
49 pages
Land Use Mapping Using Remote Sensing & ML Tools
No ratings yet
Land Use Mapping Using Remote Sensing & ML Tools
14 pages
Rangkuman Data Science
No ratings yet
Rangkuman Data Science
9 pages
Unit 3
No ratings yet
Unit 3
13 pages
Types of Variables: Qualitative Attribute Variable
No ratings yet
Types of Variables: Qualitative Attribute Variable
7 pages
Data Mining and Data Warehousing
No ratings yet
Data Mining and Data Warehousing
15 pages
Doan Uccs 0892D 10279
No ratings yet
Doan Uccs 0892D 10279
147 pages
Credit Card Fraud Detection
No ratings yet
Credit Card Fraud Detection
27 pages
Comprehensive Resource
No ratings yet
Comprehensive Resource
47 pages
Classification of ECG Signals Using Machine Learning Techniques: A Survey
No ratings yet
Classification of ECG Signals Using Machine Learning Techniques: A Survey
8 pages
SMS Spam Detection Using Machine Learning
No ratings yet
SMS Spam Detection Using Machine Learning
68 pages
The Implementation of Decision Tree Algorithm C4.5 Using Rapidminer in Analyzing Dropout Students
No ratings yet
The Implementation of Decision Tree Algorithm C4.5 Using Rapidminer in Analyzing Dropout Students
6 pages
A Brief Survey of Machine Learning Methods and Their Sensor and IoT Applications
No ratings yet
A Brief Survey of Machine Learning Methods and Their Sensor and IoT Applications
8 pages
Overview of The HASOC Subtrack at FIRE 2023: Identification of Conversational Hate-Speech
No ratings yet
Overview of The HASOC Subtrack at FIRE 2023: Identification of Conversational Hate-Speech
9 pages
Support Vector Machine Classification Algorithm and Its Application
No ratings yet
Support Vector Machine Classification Algorithm and Its Application
8 pages
FabioGalbusera Spine
No ratings yet
FabioGalbusera Spine
20 pages
Cluster Validity
No ratings yet
Cluster Validity
18 pages
C2C - Predictive Analysis of Student Campus Placement PDF
No ratings yet
C2C - Predictive Analysis of Student Campus Placement PDF
16 pages
Surg-3M: A Dataset and Foundation Model For Perception in Surgical Settings
No ratings yet
Surg-3M: A Dataset and Foundation Model For Perception in Surgical Settings
15 pages
Program - 9
No ratings yet
Program - 9
7 pages
Handout 1
No ratings yet
Handout 1
5 pages
666 3353 1 PB
No ratings yet
666 3353 1 PB
6 pages
Logistic Regression
No ratings yet
Logistic Regression
4 pages
Animal Intrusion Detection System Using CNN and Image Processing
No ratings yet
Animal Intrusion Detection System Using CNN and Image Processing
5 pages
Products Reviews and Sentimental Analysis System For Ecommerce Website
No ratings yet
Products Reviews and Sentimental Analysis System For Ecommerce Website
3 pages
Acceptance-Rejection Sampling and Multi-dimensional Monte Carlo Integrations Utilizing Mathematica®
From Everand
Acceptance-Rejection Sampling and Multi-dimensional Monte Carlo Integrations Utilizing Mathematica®
SUJAUL CHOWDHURY
No ratings yet

Unit3 7 Issues

Uploaded by

Unit3 7 Issues

Uploaded by

3.

Classification (12 hours)

3.1 Basics and Algorithms

n Incomplete training data.

n Flaw in assumed theory.

select our classifier and estimate the error rate, but

into disjoint subsets cross validation ( The Holdout

n Test set: used to estimate the error rate of the trained

with the training examples and estimate error with the

K experiments, use K-1 folds for training and the

the examples in the dataset are eventually used for

Cross Validation, where K is chosen as the total

n Models can be evaluated based on the output using

n ii. ROC Analysis

Example of Confusion Matrix:

n Given m classes, an entry, CMi,j in a confusion matrix indicates

n Classifier Accuracy, or negative class and minority of

n Error rate: 1 – accuracy, or n Specificity: True Negative

Error rate = (FP + FN)/All recognition rate

n Recall: completeness – what % of positive tuples did the

n Fß: weighted measure of precision and recall

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

n Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

You might also like