
Data mining

Evaluating classification methods

• Predictive accuracy

• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandability and the insight provided by the model
• Compactness of the model: size of the tree, or the number of rules.

Evaluation methods
• Holdout set: The available data set D is divided into two disjoint
subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: training set should not be used in testing and the test
set should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (the examples in the
original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
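
A minimal sketch of the holdout method, assuming scikit-learn and an already labeled feature matrix X with class labels y (the variable names and the 70/30 split are illustrative):

```python
# Holdout evaluation: split D into disjoint Dtrain and Dtest,
# learn on Dtrain only, and estimate accuracy on the unseen Dtest.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # Dtrain / Dtest are disjoint

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                  # learning uses Dtrain only
y_pred = model.predict(X_test)               # testing uses Dtest only
print("holdout accuracy:", accuracy_score(y_test, y_pred))
```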



Evaluation methods (cont…)
• n-fold cross-validation: a resampling technique used to evaluate the
  performance of a machine learning model. It assesses how well a model
  generalizes to an independent dataset by using multiple training and
  testing splits: the available data is partitioned into n equal-size
  disjoint subsets.
• Each subset is used in turn as the test set, with the remaining n-1 subsets
  combined as the training set to learn a classifier.
• The procedure is run n times, giving n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validation are commonly used.
• This method is used when the available data is not large (see the sketch below).
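
A minimal sketch of n-fold cross-validation with scikit-learn, again assuming labeled data X, y (illustrative names; n = 10 as mentioned above):

```python
# n-fold cross-validation: train on n-1 folds, test on the held-out fold,
# repeat n times, and average the n accuracies.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # n = 10
print("per-fold accuracies:", scores)
print("estimated accuracy:", scores.mean())
```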
Precision and recall measures
• Used in information retrieval and text classification.
• We use a confusion matrix to introduce them.



true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
Precision and recall measures (cont…)

p = TP / (TP + FP)          r = TP / (TP + FN)

• Precision p is the number of correctly classified positive examples divided by
  the total number of examples that are classified as positive.
• Recall r is the number of correctly classified positive examples divided by the
  total number of actual positive examples in the test set.
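
A minimal sketch of both formulas in plain Python; the counts are illustrative (they match the worked confusion-matrix example later in these slides):

```python
# Precision and recall from confusion-matrix counts.
TP, FP, FN = 100, 10, 5                      # assumed example counts

precision = TP / (TP + FP)   # correct positives / all predicted positives
recall    = TP / (TP + FN)   # correct positives / all actual positives
print(f"p = {precision:.2f}, r = {recall:.2f}")   # p = 0.91, r = 0.95
```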
An example

• This confusion matrix gives
  – precision p = 100% and
  – recall r = 1%,
  because we classified only one positive example correctly and misclassified no
  negative examples.
• Note: precision and recall only measure classification
on the positive class.



F1-value (also called F1-score)
• It is hard to compare two classifiers using two measures. The F1-score combines
  precision and recall into one measure:

      F1 = 2pr / (p + r)

• It is the harmonic mean of p and r; the harmonic mean of two numbers tends to be
  closer to the smaller of the two.
• For the F1-value to be large, both p and r must be large.
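
A small sketch of the F1 computation, using the precision and recall values from the confusion-matrix example later in these slides:

```python
# F1 is the harmonic mean of precision p and recall r.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f1(0.91, 0.95))   # ≈ 0.93
```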



Receiver operating characteristic (ROC) curve

• It is commonly called the ROC curve.


• It is a plot of the true positive rate (TPR) against the
false positive rate (FPR).
• True positive rate: TPR = TP / (TP + FN)

• False positive rate: FPR = FP / (FP + TN)
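
A minimal sketch of plotting an ROC curve with scikit-learn and matplotlib, assuming a trained classifier that outputs a score or probability for the positive class (y_test and y_score are illustrative names):

```python
# ROC curve: TPR plotted against FPR as the decision threshold varies.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # y_score: positive-class scores
plt.plot(fpr, tpr)
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.show()
print("area under the curve:", roc_auc_score(y_test, y_score))
```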


Sensitivity and Specificity
• In statistics, there are two other evaluation
measures:
– Sensitivity: same as TPR, i.e. TP / (TP + FN)
– Specificity: also called the True Negative Rate (TNR), i.e. TN / (TN + FP)

• Then we have FPR = 1 - Specificity, so the ROC curve can also be read as
  Sensitivity plotted against 1 - Specificity.
Confusion Matrix:

A confusion matrix is a technique for summarizing the performance of a
classification algorithm.
Confusion Matrix and Evaluation Parameters

The worked example below uses this confusion matrix (165 examples in total):

                 Predicted: No    Predicted: Yes
Actual: No           TN = 50          FP = 10
Actual: Yes          FN = 5           TP = 100

Accuracy: Overall, how often is the classifier correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) = 0.91

Misclassification Rate: Overall, how often is it wrong?


(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy
also known as "Error Rate"
True Positive Rate/Recall: When it's actually yes, how often does it
predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
Specificity: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
F1-Score:
F-measure = (2 * Recall * Precision) / (Recall + Precision)
          = (2 * 0.95 * 0.91) / (0.95 + 0.91) ≈ 0.93
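
A short sketch that reproduces the numbers above from the confusion-matrix counts (plain Python, no libraries needed):

```python
# Evaluation parameters for the example confusion matrix (TP=100, TN=50, FP=10, FN=5).
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                            # 165

accuracy    = (TP + TN) / total                      # 0.91
error_rate  = (FP + FN) / total                      # 0.09
recall      = TP / (TP + FN)                         # 0.95  (TPR / sensitivity)
fpr         = FP / (FP + TN)                         # 0.17
specificity = TN / (TN + FP)                         # 0.83
precision   = TP / (TP + FP)                         # 0.91
prevalence  = (TP + FN) / total                      # 0.64
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.93
```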
What is the Naive Bayes algorithm?
• It is used in classification, especially in text mining.
• It performs well on large data sets.
• It is based on probability (Bayes' theorem).
• Naive Bayes can be extremely fast relative to other classification algorithms.
What is Naive Bayes algorithm?

Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probabilities, e.g.
P(Outlook = Overcast) = 4/14 ≈ 0.29 and P(Play = Yes) = 9/14 ≈ 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
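
A minimal sketch of Steps 1 and 2 with pandas, assuming a small play-tennis style table (the data frame below is an illustrative fragment, not the full 14-example dataset):

```python
import pandas as pd

# Step 1: frequency table of Outlook vs. Play.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny"],
    "Play":    ["No",    "No",    "Yes",      "Yes",  "No",   "Yes",      "Yes"],
})
freq = pd.crosstab(df["Outlook"], df["Play"])

# Step 2: likelihood table, i.e. P(Outlook = v | Play = c) for each value v and class c.
likelihood = freq / freq.sum(axis=0)
print(freq)
print(likelihood)
```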
Example
• Example: Play Tennis

Today's outlook is sunny, the temperature is cool, the humidity is high, and the
wind is strong. Using Naive Bayes, what is the probability that she will be playing
tennis today?
Example
• Learning Phase

Outlook     Play=Yes  Play=No       Temperature  Play=Yes  Play=No
Sunny          2/9      3/5         Hot             2/9      2/5
Overcast       4/9      0/5         Mild            4/9      2/5
Rain           3/9      2/5         Cool            3/9      1/5

Humidity    Play=Yes  Play=No       Wind         Play=Yes  Play=No
High           3/9      4/5         Strong          3/9      3/5
Normal         6/9      1/5         Weak            6/9      2/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14
Example

P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14

P(Yes|x') ∝ P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Play=Yes) = 0.0053
P(No|x')  ∝ P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(Play=No)  = 0.0206

P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
P(X) = (5/14) * (4/14) * (7/14) * (6/14) = 0.02186

Then, dividing the results by this value:

P(Play=Yes | X) = 0.0053 / 0.02186 = 0.2424
P(Play=No | X)  = 0.0206 / 0.02186 = 0.9421

Since 0.9421 is greater than 0.2424, the answer is 'No': she will not be playing
tennis today.
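
A minimal sketch reproducing this calculation in plain Python, with the conditional probabilities taken from the learning-phase tables above:

```python
# Naive Bayes posteriors for x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Evidence P(X) estimated from the attribute marginals.
p_x = (5/14) * (4/14) * (7/14) * (6/14)          # ≈ 0.0219

print("P(Play=Yes | x') ≈", round(p_yes / p_x, 2))   # ≈ 0.24
print("P(Play=No  | x') ≈", round(p_no  / p_x, 2))   # ≈ 0.94
print("Prediction:", "Yes" if p_yes > p_no else "No")
```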
Thank you
