
Data mining

Evaluating classification methods

• Predictive accuracy

• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandability and the insight provided by the model
• Compactness of the model: size of the tree, or the number of rules.

Evaluation methods
• Holdout set: The available data set D is divided into two disjoint
subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: training set should not be used in testing and the test
set should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (the examples in the
original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
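
A minimal sketch of the holdout method, assuming scikit-learn and an already labeled feature matrix X with class labels y (the variable names and the 70/30 split are illustrative):

```python
# Holdout evaluation: split D into disjoint Dtrain and Dtest,
# learn on Dtrain only, and estimate accuracy on the unseen Dtest.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # Dtrain / Dtest are disjoint

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                  # learning uses Dtrain only
y_pred = model.predict(X_test)               # testing uses Dtest only
print("holdout accuracy:", accuracy_score(y_test, y_pred))
```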



Evaluation methods (cont…)
• n-fold cross-validation: a resampling technique used to evaluate the
  performance of a machine learning model. It assesses how well a model
  generalizes to an independent dataset by using multiple training and
  testing splits: the available data is partitioned into n equal-size
  disjoint subsets.
• Each subset is used in turn as the test set, with the remaining n-1 subsets
  combined as the training set to learn a classifier.
• The procedure is run n times, giving n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validation are commonly used.
• This method is used when the available data is not large (see the sketch below).
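
A minimal sketch of n-fold cross-validation with scikit-learn, again assuming labeled data X, y (illustrative names; n = 10 as mentioned above):

```python
# n-fold cross-validation: train on n-1 folds, test on the held-out fold,
# repeat n times, and average the n accuracies.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # n = 10
print("per-fold accuracies:", scores)
print("estimated accuracy:", scores.mean())
```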
Precision and recall measures
• Used in information retrieval and text classification.
• We use a confusion matrix to introduce them.



true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
Precision and recall measures (cont…)

p = TP / (TP + FP)          r = TP / (TP + FN)

• Precision p is the number of correctly classified positive examples divided by
  the total number of examples that are classified as positive.
• Recall r is the number of correctly classified positive examples divided by the
  total number of actual positive examples in the test set.
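
A minimal sketch of both formulas in plain Python; the counts are illustrative (they match the worked confusion-matrix example later in these slides):

```python
# Precision and recall from confusion-matrix counts.
TP, FP, FN = 100, 10, 5                      # assumed example counts

precision = TP / (TP + FP)   # correct positives / all predicted positives
recall    = TP / (TP + FN)   # correct positives / all actual positives
print(f"p = {precision:.2f}, r = {recall:.2f}")   # p = 0.91, r = 0.95
```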
An example

• This confusion matrix gives
  – precision p = 100% and
  – recall r = 1%,
  because we classified only one positive example correctly and misclassified no
  negative examples.
• Note: precision and recall only measure classification
on the positive class.



F1-value (also called F1-score)
• It is hard to compare two classifiers using two measures. The F1-score combines
  precision and recall into one measure:

      F1 = 2pr / (p + r)

• It is the harmonic mean of p and r; the harmonic mean of two numbers tends to be
  closer to the smaller of the two.
• For the F1-value to be large, both p and r must be large.
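
A small sketch of the F1 computation, using the precision and recall values from the confusion-matrix example later in these slides:

```python
# F1 is the harmonic mean of precision p and recall r.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f1(0.91, 0.95))   # ≈ 0.93
```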



Receiver operating characteristic (ROC) curve

• It is commonly called the ROC curve.


• It is a plot of the true positive rate (TPR) against the
false positive rate (FPR).
• True positive rate: TPR = TP / (TP + FN)

• False positive rate: FPR = FP / (FP + TN)
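
A minimal sketch of plotting an ROC curve with scikit-learn and matplotlib, assuming a trained classifier that outputs a score or probability for the positive class (y_test and y_score are illustrative names):

```python
# ROC curve: TPR plotted against FPR as the decision threshold varies.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # y_score: positive-class scores
plt.plot(fpr, tpr)
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.show()
print("area under the curve:", roc_auc_score(y_test, y_score))
```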


Sensitivity and Specificity
• In statistics, there are two other evaluation
measures:
– Sensitivity: same as TPR, i.e. TP / (TP + FN)
– Specificity: also called the True Negative Rate (TNR), i.e. TN / (TN + FP)

• Then we have FPR = 1 - Specificity, so the ROC curve can also be read as
  Sensitivity plotted against 1 - Specificity.
Confusion Matrix:

A confusion matrix is a technique for summarizing the performance of a
classification algorithm.
Confusion Matrix and Evaluation Parameters

The worked example below uses this confusion matrix (165 examples in total):

                 Predicted: No    Predicted: Yes
Actual: No           TN = 50          FP = 10
Actual: Yes          FN = 5           TP = 100

Accuracy: Overall, how often is the classifier correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) = 0.91

Misclassification Rate: Overall, how often is it wrong?


(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy
also known as "Error Rate"
True Positive Rate/Recall: When it's actually yes, how often does it
predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
Specificity: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
F1-Score:
F-measure = (2 * Recall * Precision) / (Recall + Precision)
          = (2 * 0.95 * 0.91) / (0.95 + 0.91) ≈ 0.93
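
A short sketch that reproduces the numbers above from the confusion-matrix counts (plain Python, no libraries needed):

```python
# Evaluation parameters for the example confusion matrix (TP=100, TN=50, FP=10, FN=5).
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                            # 165

accuracy    = (TP + TN) / total                      # 0.91
error_rate  = (FP + FN) / total                      # 0.09
recall      = TP / (TP + FN)                         # 0.95  (TPR / sensitivity)
fpr         = FP / (FP + TN)                         # 0.17
specificity = TN / (TN + FP)                         # 0.83
precision   = TP / (TP + FP)                         # 0.91
prevalence  = (TP + FN) / total                      # 0.64
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.93
```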
What is the Naive Bayes algorithm?
• It is used in classification, especially in text mining.
• It performs well on large data sets.
• It is based on probability (Bayes' theorem).
• Naive Bayes can be extremely fast relative to other classification algorithms.
What is Naive Bayes algorithm?

Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probabilities, e.g.
P(Outlook = Overcast) = 4/14 ≈ 0.29 and P(Play = Yes) = 9/14 ≈ 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
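
A minimal sketch of Steps 1 and 2 with pandas, assuming a small play-tennis style table (the data frame below is an illustrative fragment, not the full 14-example dataset):

```python
import pandas as pd

# Step 1: frequency table of Outlook vs. Play.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny"],
    "Play":    ["No",    "No",    "Yes",      "Yes",  "No",   "Yes",      "Yes"],
})
freq = pd.crosstab(df["Outlook"], df["Play"])

# Step 2: likelihood table, i.e. P(Outlook = v | Play = c) for each value v and class c.
likelihood = freq / freq.sum(axis=0)
print(freq)
print(likelihood)
```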
Example
• Example: Play Tennis

Today's outlook is sunny, the temperature is cool, the humidity is high, and the
wind is strong. Using Naive Bayes, what is the probability that she will be playing
tennis today?
Example
• Learning Phase

Outlook     Play=Yes  Play=No       Temperature  Play=Yes  Play=No
Sunny          2/9      3/5         Hot             2/9      2/5
Overcast       4/9      0/5         Mild            4/9      2/5
Rain           3/9      2/5         Cool            3/9      1/5

Humidity    Play=Yes  Play=No       Wind         Play=Yes  Play=No
High           3/9      4/5         Strong          3/9      3/5
Normal         6/9      1/5         Weak            6/9      2/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14
Example

P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14

P(Yes|x') ∝ P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Play=Yes) = 0.0053
P(No|x')  ∝ P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(Play=No)  = 0.0206

P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
P(X) = (5/14) * (4/14) * (7/14) * (6/14) = 0.02186

Then, dividing the results by this value:

P(Play=Yes | X) = 0.0053 / 0.02186 = 0.2424
P(Play=No | X)  = 0.0206 / 0.02186 = 0.9421

Since 0.9421 is greater than 0.2424, the answer is 'No': she will not be playing
tennis today.
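
A minimal sketch reproducing this calculation in plain Python, with the conditional probabilities taken from the learning-phase tables above:

```python
# Naive Bayes posteriors for x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Evidence P(X) estimated from the attribute marginals.
p_x = (5/14) * (4/14) * (7/14) * (6/14)          # ≈ 0.0219

print("P(Play=Yes | x') ≈", round(p_yes / p_x, 2))   # ≈ 0.24
print("P(Play=No  | x') ≈", round(p_no  / p_x, 2))   # ≈ 0.94
print("Prediction:", "Yes" if p_yes > p_no else "No")
```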
Thank you
