Sensitivity Analysis
(CS40003)
Lecture #11
Sensitivity Analysis
Estimation Strategies
Accuracy Estimation
Error Estimation
Statistical Estimation
Performance Estimation
ROC Curve
There is a need to estimate the accuracy and performance of a classifier with respect to a few controlling parameters that influence data sensitivity. This involves:
Estimation strategy
Metrics for measuring accuracy
Metrics for measuring performance
Estimation Strategy
Usually, training data and test data are drawn from a large pool of data already available.
[Figure: the data set is split into training data and test data; the learning technique builds the classifier from the training data, and the classifier is then evaluated on the test data.]
The commonly used estimation strategies are:
Holdout method
Random subsampling
Cross-validation
Bootstrap approach
Holdout method: Given a data set, it is partitioned into two disjoint sets called the training set and the testing set.
The classifier is learned from the training set and evaluated on the testing set.
If the training set is too large, the model may be good enough, but the accuracy estimate may be less reliable because the testing set is small, and vice versa.
Random subsampling: In this method, the holdout method is repeated k times, and each time the two disjoint sets are chosen at random with predefined sizes.
Cross-validation comes in two main variants:
k-fold cross-validation
N-fold cross-validation
In k-fold cross-validation, the data set is partitioned into k mutually exclusive folds of (approximately) equal size. A series of k runs is carried out with this decomposition; in the i-th iteration, fold Di is used as the test data and the other folds as the training data.
Thus, each tuple is used the same number of times for training and exactly once for testing.
[Figure: the data set is partitioned into folds D1, ..., Dk; in run i, fold Di is held out as test data while the remaining folds are passed to the learning technique to build the classifier.]
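A minimal sketch of k-fold cross-validation, reusing the toy data, majority_class_learner and accuracy helpers from the previous sketch (all of them illustrative stand-ins, not part of the lecture); setting k equal to the number of records gives N-fold (leave-one-out) cross-validation:

```python
def k_fold_cross_validation(records, k, learner, accuracy_fn):
    """Partition records into k disjoint folds; fold i is the test set in run i."""
    folds = [records[i::k] for i in range(k)]        # k roughly equal, disjoint folds
    estimates = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        estimates.append(accuracy_fn(learner(train), test))
    return sum(estimates) / k

# Reusing the toy data, learner and accuracy function from the previous sketch:
print(k_fold_cross_validation(data, k=10,
                              learner=majority_class_learner,
                              accuracy_fn=accuracy))
# k = len(data) would give N-fold (leave-one-out) cross-validation.
```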
In N-fold cross-validation (also called leave-one-out), the data set is divided into as many folds as there are instances; thus, almost every tuple takes part in each training set, and N classifiers are built.
In this method, therefore, each of the N classifiers is built from N-1 instances, and each classifier is used to classify a single test instance.
The test sets are mutually exclusive and effectively cover the entire data set (in sequence). The net effect is as if the classifier were trained on the entire data set and also tested on the entire data set.
In practice, the method is beneficial mainly for very small data sets, where as much data as possible needs to be used to train the classifier.
Bootstrap approach: Each time a record is selected for the training set, it is put back into the original pool of records, so that it is equally likely to be redrawn in the next draw.
In other words, the bootstrap method samples the given data set uniformly with replacement.
The rationale of this strategy is that it lets some records occur more than once in the training sample, while records that are never drawn form the test set.
What is the probability that a record will be selected more than once?
Bootstrap Method
Suppose we are given a data set of N records. The data set is sampled N times with replacement, resulting in a bootstrap sample (i.e., a training set) of N samples.
Note that these N draws together are called a bootstrap sample in this method.
There is a certain chance (i.e., probability) that a particular tuple occurs one or more times in the training set.
The tuples that do not appear in the training set end up in the test set.
Each tuple has a probability $\frac{1}{N}$ of being selected (and the probability of not being selected is $1 - \frac{1}{N}$).
We select $N$ times, so the probability that a record will not be chosen during the whole run is $\left(1 - \frac{1}{N}\right)^N$.
Thus, the probability that a record is chosen by a bootstrap sample is $1 - \left(1 - \frac{1}{N}\right)^N$.
For a large value of $N$, it can be shown that $\left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.368$; hence, the probability that a record is chosen in a bootstrap sample is $1 - e^{-1} \approx 0.632$.
This is why the bootstrap method is also known as the 0.632 bootstrap method.
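A minimal sketch of bootstrap sampling that also checks the 0.632 figure empirically (the data set size and seed are arbitrary choices):

```python
import random

def bootstrap_sample(records, seed=None):
    """Draw len(records) samples uniformly with replacement (the bootstrap sample);
    records that are never drawn form the test set."""
    rng = random.Random(seed)
    n = len(records)
    drawn = [rng.randrange(n) for _ in range(n)]
    drawn_set = set(drawn)
    train = [records[i] for i in drawn]
    test = [records[i] for i in range(n) if i not in drawn_set]
    return train, test

records = list(range(1000))
train, test = bootstrap_sample(records, seed=42)
print(len(set(train)) / len(records))   # fraction of distinct records chosen: ~0.632
print(len(test) / len(records))         # fraction never chosen: ~(1 - 1/N)^N ~ 0.368
```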
A classifier is evaluated on two aspects: accuracy and performance.

Accuracy Estimation
If $N$ is the number of instances with which a classifier is tested and $p$ is the number of correctly classified instances, the accuracy can be defined as
$$\epsilon = \frac{p}{N}$$
Also, the error rate (i.e., the misclassification rate), denoted by $\bar{\epsilon}$, is given by
$$\bar{\epsilon} = 1 - \epsilon = \frac{N - p}{N}$$
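As a quick worked check of these two formulas, assume a hypothetical test set of N = 200 instances of which p = 180 are classified correctly:

```python
N, p = 200, 180            # hypothetical test-set size and number of correct predictions
acc = p / N                # epsilon = p / N = 0.9
error_rate = 1 - acc       # (N - p) / N = 0.1
print(acc, error_rate)
```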
Accuracy: True and Predictive
Now, this accuracy may be the true (or absolute) accuracy or the predictive (or optimistic) accuracy.
The predictive accuracy, when estimated with a given test set, is accepted as an estimate of the true accuracy without any objection.
Predictive Accuracy
Example 11.1: Universality of predictive accuracy
Consider a classifier model $M_D$ developed with a training set $D$ using an algorithm $M$. The predictive accuracy of $M_D$ measured on one particular test set need not hold universally for all possible test data.
In the next few slides, we discuss two further estimations: error estimation using loss functions and statistical estimation using confidence levels.
Suppose the classifier is tested with a test set $T$ of $N$ instances. Each instance has $n$ attribute values $x_i$ and an actual class label $y_i$, so $T$ can be viewed as an $N \times (n+1)$ matrix; let $\hat{y}_i$ denote the class predicted by the classifier for $x_i$.
Also, assume that $\lambda(y_i, \hat{y}_i)$ denotes a difference between $y_i$ and $\hat{y}_i$ (following a certain difference (or similarity) measure), e.g., $\lambda = 0$ if there is a match, else $1$.
The two loss functions that measure the error between $y_i$ (the actual value) and $\hat{y}_i$ (the predicted value) are:
Absolute error: $|y_i - \hat{y}_i|$
Squared error: $(y_i - \hat{y}_i)^2$
Error Estimation using Loss Functions
Based on the two loss functions, the test error (rate), also called the generalization error, is defined as the average loss over the test set $T$. The two measures of test error are:
Mean Absolute Error (MAE): $\frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$
Mean Squared Error (MSE): $\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
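A minimal sketch of the two test-error measures, assuming the actual and predicted classes are encoded numerically (0/1) so that the loss functions are well defined; the label vectors are made up for illustration:

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(y - yh) for y, yh in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((y - yh) ** 2 for y, yh in zip(actual, predicted)) / len(actual)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # actual classes of the test set T
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]    # classes predicted by the classifier
print(mean_absolute_error(y_true, y_pred))   # 2 mismatches out of 8 -> 0.25
print(mean_squared_error(y_true, y_pred))    # same value here, since each loss is 0 or 1
```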
Confidence level: The concept of a "confidence level" can be better understood with the following two experiments, related to tossing a coin.
Experiment 1: When a coin is tossed, there is a probability that a head will occur. We have to estimate this probability value. A simple experiment is to toss the coin many times and record the numbers of heads and tails.
Run    H      T
1      0.30   0.70
2      0.58   0.42
3      0.54   0.46
4      0.54   0.46
5      0.48   0.52
6      0.49   0.51
Thus, we can say that after a large number of trials in each experiment,
$$p \approx \frac{v}{N}$$
where $N$ = number of trials, $v$ = number of outcomes in which the event occurs, and $p$ = probability that the event occurs.
Note:
Also, we may note that if the coin is tossed $N = 50$ times with $p = 0.5$, the number of heads follows a binomial distribution, so
Mean $= N \times p = 50 \times 0.5 = 25$ and Variance $= N \times p \times (1-p) = 50 \times 0.5 \times 0.5 = 12.5$.
Let $\tau^L_\alpha$ and $\tau^U_\alpha$ denote the lower and upper bounds at a confidence level $\alpha$. Then the confidence interval for the true accuracy $\tilde{\epsilon}$ is given by
$$P\left(\tau^L_\alpha \le \frac{\tilde{\epsilon} - \epsilon}{\sqrt{\epsilon(1-\epsilon)/N}} \le \tau^U_\alpha\right) = \alpha$$
If $\tau_\alpha$ is the mean of $\tau^L_\alpha$ and $\tau^U_\alpha$, then we can write
$$\tilde{\epsilon} = \epsilon \pm \tau_\alpha \times \sqrt{\epsilon(1-\epsilon)/N}$$
Statistical Estimation using Confidence Level
$$\tilde{\epsilon} = \epsilon \pm \tau_\alpha \times \sqrt{\epsilon(1-\epsilon)/N}$$
A table of $\tau_\alpha$ for different values of $\alpha$ can be obtained from any book on statistics. A small part of such a table is given below.

Confidence level (α)   0.99   0.98   0.95   0.90   0.80
τ_α                    2.58   2.33   1.96   1.65   1.28
Thus, given a confidence level $\alpha$, we know the value of $\tau_\alpha$ and hence can estimate the true accuracy $\tilde{\epsilon}$, provided we have the value of the observed accuracy $\epsilon$.
Thus, knowing a test data set of size $N$, it is possible to estimate the true accuracy!
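A minimal sketch of this interval; the observed accuracy and test-set size below are illustrative values, and τ_α = 1.96 corresponds to a 95% confidence level:

```python
from math import sqrt

def accuracy_confidence_interval(eps, N, tau=1.96):
    """Interval for the true accuracy: eps +/- tau * sqrt(eps * (1 - eps) / N)."""
    margin = tau * sqrt(eps * (1 - eps) / N)
    return eps - margin, eps + margin

# Observed accuracy 0.85 on a test set of 250 instances, 95% confidence (tau = 1.96)
print(accuracy_confidence_interval(0.85, 250))   # roughly (0.806, 0.894)
```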
Note:
Suppose a classifier is tested $k$ times with $k$ different test sets. If $\epsilon_i$ denotes the predicted accuracy when tested with a test set of size $N_i$ in the $i$-th run ($1 \le i \le k$), then the overall predicted accuracy is
$$\bar{\epsilon} = \frac{\sum_{i=1}^{k} \epsilon_i N_i}{\sum_{i=1}^{k} N_i}$$
Thus, $\bar{\epsilon}$ is the weighted average of the $\epsilon_i$ values. The standard error and the true accuracy at a confidence level $\alpha$ are
$$SE = \sqrt{\frac{\bar{\epsilon}(1-\bar{\epsilon})}{\sum_{i=1}^{k} N_i}}, \qquad \tilde{\epsilon} = \bar{\epsilon} \pm \tau_\alpha \times SE$$
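A small sketch of the weighted-average accuracy over k test sets, using the standard-error form given above; the per-run accuracies and test-set sizes are illustrative:

```python
from math import sqrt

def overall_accuracy(accs, sizes):
    """Weighted average of per-run accuracies, weighted by the test-set sizes N_i."""
    total = sum(sizes)
    eps_bar = sum(a * n for a, n in zip(accs, sizes)) / total
    se = sqrt(eps_bar * (1 - eps_bar) / total)      # standard error of the estimate
    return eps_bar, se

eps_bar, se = overall_accuracy([0.82, 0.88, 0.85], [100, 150, 250])
print(eps_bar, eps_bar - 1.96 * se, eps_bar + 1.96 * se)   # estimate and 95% bounds
```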
In fact, data sets with imbalanced class distributions are quite common in many real-life applications.
When a classifier classifies a test data set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.
For example, consider a stock-market data set in which only 2 records belong to the class "worst" and the rest to the class "good". With this data set, even if the classifier's predictive accuracy is 0.98, a very high value, there is a high chance that the 2 "worst" stock-market records are incorrectly classified as "good".
Performance Estimation of a Classifier
Thus, when a classifier classifies a test data set with an imbalanced class distribution, predictive accuracy on its own is not a reliable indicator of the classifier's effectiveness.
There are four quadrants in the confusion matrix (CM), which are symbolized as below.
True Positive (TP: f++): The number of instances that were positive (+) and correctly classified as positive (+).
False Negative (FN: f+-): The number of instances that were positive (+) and incorrectly classified as negative (-). It is also known as a Type 2 Error.
False Positive (FP: f-+): The number of instances that were negative (-) and incorrectly classified as positive (+). It is also known as a Type 1 Error.
True Negative (TN: f--): The number of instances that were negative (-) and correctly classified as negative (-).
Confusion Matrix
Note:
Np = TP (f++) + FN (f+-) is the total number of positive instances.
Nn = FP (f-+) + TN (f--) is the total number of negative instances.
N = Np + Nn is the total number of instances.
Predictive accuracy?
Example 11.5: The following table shows the confusion matrix of a classification problem with six classes labeled C1, C2, C3, C4, C5 and C6 (rows denote the actual class, columns the predicted class).
Class   C1   C2   C3   C4   C5   C6
C1      52   10    7    0    0    1
C2      15   50    6    2    1    2
C3       5    6    6    0    0    0
C4       0    2    0   10    0    1
C5       0    1    0    0    7    1
C6       1    3    0    1    0   24
Predictive accuracy?
Thus, a large confusion matrix of size m×m can be condensed into a 2×2 matrix.
Example 11.6: For example, the CM shown in Example 11.5 is transformed into a CM of size 2×2 by considering the class C1 as the positive class and the classes C2, C3, C4, C5 and C6 combined together as the negative class.
                Predicted
                 +     -
Actual   +      52    18
         -      21   123
How can we calculate the predictive accuracy of the classifier model in this case?
Is the predictive accuracy the same in Example 11.5 and Example 11.6?
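As a worked check of both questions, the sketch below computes the predictive accuracy from the confusion matrices of Examples 11.5 and 11.6 exactly as given above:

```python
# 6-class confusion matrix of Example 11.5 (rows: actual class, columns: predicted class)
cm6 = [
    [52, 10,  7,  0,  0,  1],   # C1
    [15, 50,  6,  2,  1,  2],   # C2
    [ 5,  6,  6,  0,  0,  0],   # C3
    [ 0,  2,  0, 10,  0,  1],   # C4
    [ 0,  1,  0,  0,  7,  1],   # C5
    [ 1,  3,  0,  1,  0, 24],   # C6
]
total = sum(sum(row) for row in cm6)                  # 214 instances
acc6 = sum(cm6[i][i] for i in range(6)) / total       # (52+50+6+10+7+24)/214 ~ 0.696

# 2x2 matrix of Example 11.6: C1 is positive, C2..C6 combined as negative
tp = cm6[0][0]                                        # 52
fn = sum(cm6[0][1:])                                  # 18
fp = sum(cm6[i][0] for i in range(1, 6))              # 21
tn = total - tp - fn - fp                             # 123
acc2 = (tp + tn) / total                              # 175/214 ~ 0.818

print(acc6, acc2)   # the two predictive accuracies are not the same
```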
Performance Evaluation Metrics
We now define a number of metrics for measuring the performance of a classifier.
In our discussion, we shall make the assumption that there are only two classes: + (positive) and - (negative).
Nevertheless, the metrics can easily be extended to multi-class classifiers (with some modifications).
True Positive Rate (TPR): the fraction of positive examples predicted correctly by the classifier, i.e., $TPR = \frac{TP}{TP + FN}$. It is also called Recall (r).
False Positive Rate (FPR): the fraction of negative examples classified as positive by the classifier, i.e., $FPR = \frac{FP}{FP + TN}$.
Precision (p or PPV, Positive Predictive Value): the fraction of examples predicted as positive that are actually positive, i.e., $p = \frac{TP}{TP + FP}$.
F1 Score (F1): Recall (r) and Precision (p) are two widely used metrics in analyses where the detection of one of the classes is considered more significant than the other. The F1 score is defined in terms of r (or TPR) and p (or PPV) as
$$F_1 = \frac{2rp}{r + p} = \frac{2\,TP}{2\,TP + FP + FN}$$
Note:
F1 represents the harmonic mean between recall and precision.
A high value of the F1 score ensures that both precision and recall are reasonably high.
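A minimal sketch of these metrics computed directly from the four confusion-matrix counts; the counts reuse the 2×2 matrix of Example 11.6 above:

```python
def classification_metrics(tp, fn, fp, tn):
    tpr = tp / (tp + fn)             # True Positive Rate = Recall (r)
    fpr = fp / (fp + tn)             # False Positive Rate
    precision = tp / (tp + fp)       # Positive Predictive Value (p)
    f1 = 2 * precision * tpr / (precision + tpr)   # harmonic mean of p and r
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tpr, fpr, precision, f1, accuracy

print(classification_metrics(tp=52, fn=18, fp=21, tn=123))
```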
Performance Evaluation Metrics
More generally, the $F_\beta$ score can be used to control the trade-off between recall and precision:
$$F_\beta = \frac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r}$$
Both precision and recall are special cases of $F_\beta$, obtained when $\beta = 0$ and $\beta \to \infty$, respectively.
These measures can also be written as weighted combinations of the confusion-matrix counts, of the form $\frac{w_1\,TP + w_4\,TN}{w_1\,TP + w_2\,FN + w_3\,FP + w_4\,TN}$, with the following weights:
Metric      w1       w2     w3   w4
Recall      1        1      0    0
Precision   1        0      1    0
F_β         β²+1     β²     1    0
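A short check of the F_β special cases, using illustrative precision and recall values (roughly those of Example 11.6):

```python
def f_beta(precision, recall, beta):
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

p, r = 0.712, 0.743                 # illustrative precision and recall values
print(f_beta(p, r, beta=0))         # equals the precision
print(f_beta(p, r, beta=1))         # the F1 score
print(f_beta(p, r, beta=100))       # approaches the recall for large beta
```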
Note:
In fact, given TPR, FPR, p and r, we can derive all the other measures.
Consider the best (perfect) classifier, where every instance is correctly classified. In this case, FN = 0 and FP = 0, and the CM is

                  Predicted class
                    +     -
Actual class   +    P     0
               -    0     N

Thus,
TPR = P / (P + 0) = 1
FPR = 0 / (0 + N) = 0
Precision = P / (P + 0) = 1
F1 Score = (2 × 1 × 1) / (1 + 1) = 1
Accuracy = (P + N) / (P + N) = 1
When every instance is wrongly classified, it is called the worst classifier. In this case, TP = 0, TN = 0, and the CM is

                  Predicted class
                    +     -
Actual class   +    0     P
               -    N     0

Thus,
TPR = 0 / (0 + P) = 0
FPR = N / (N + 0) = 1
Precision = 0 / (0 + N) = 0
F1 Score = not applicable (as Recall + Precision = 0)
Accuracy = 0 / (P + N) = 0
The ultra-liberal classifier always predicts the + class. Here, the False Negative (FN) and True Negative (TN) counts are zero, and the CM is

                  Predicted class
                    +     -
Actual class   +    P     0
               -    N     0

Thus,
TPR = P / (P + 0) = 1
FPR = N / (N + 0) = 1
Precision = P / (P + N)
F1 Score = 2P / (2P + N)
Accuracy = P / (P + N)
The ultra-conservative classifier always predicts the - class. Here, the True Positive (TP) and False Positive (FP) counts are zero, and the CM is

                  Predicted class
                    +     -
Actual class   +    0     P
               -    0     N

Thus,
TPR = 0 / (0 + P) = 0
FPR = 0 / (0 + N) = 0
Precision = not applicable (as TP + FP = 0)
F1 Score = not applicable
Accuracy = N / (P + N)
Note that TPR and FPR are not affected by the relative sizes of P and N; the same is also applicable to FNR, TNR and the other measures computed from a single row of the CM.
In contrast, the predictive accuracy, precision, error rate, F1 score, etc. are affected by the relative sizes of P and N.
FPR, TPR, FNR and TNR are each calculated from a single row of the CM, whereas predictive accuracy, etc. are derived from the values in both rows.
This suggests that FPR, TPR, FNR and TNR are more effective than predictive accuracy, etc. when the class distribution is imbalanced.
In the context of a classifier, the ROC (Receiver Operating Characteristic) plot is a useful tool to study the behaviour of a classifier or to compare two or more classifiers.
The plot shows FPR on the x-axis and TPR on the y-axis. Since the values of FPR and TPR vary from 0 to 1, both inclusive, the two axes range from 0 to 1 only.
Each point (x, y) on the plot indicates that the FPR has value x and the TPR has value y.
Note that the four corner points correspond to four extreme cases of classifiers:
A: TPR = 1, FPR = 0: the ideal model, i.e., the perfect classifier; no false results.
B: TPR = 0, FPR = 1: the worst classifier; not able to predict a single instance correctly.
C: TPR = 0, FPR = 0: the model predicts every instance to be of the Negative class, i.e., it is an ultra-conservative classifier.
D: TPR = 1, FPR = 1: the model predicts every instance to be of the Positive class, i.e., it is an ultra-liberal classifier.
The diagonal line joining points C(0, 0) and D(1, 1) corresponds to random guessing.
Random guessing means that a record is classified as positive (or negative) with a certain probability.
Suppose a test set contains N+ positive and N- negative instances, and the classifier guesses any instance to be positive with probability p.
Then the random classifier is expected to correctly classify p·N+ of the positive instances and to misclassify p·N- of the negative instances as positive.
Hence, TPR = FPR = p.
Since TPR = FPR, the random classifier's results lie on the main diagonal.
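In practice, the (FPR, TPR) points of a single classifier are often traced by sweeping a decision threshold over the scores it assigns to instances; that thresholding view is an assumption of the sketch below, and the scores and labels are made up for illustration:

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a decision threshold over the scores;
    an instance is predicted '+' when its score is >= the threshold."""
    P = sum(1 for l in labels if l == "+")
    N = len(labels) - P
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "+")
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "-")
        points.append((fp / N, tp / P))
    return [(0.0, 0.0)] + points          # (0, 0): predict everything as negative

scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
labels = ["+", "+", "-", "+", "-", "+", "-", "-"]
print(roc_points(scores, labels))
```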
Interpretation of Different Points in ROC Plot
Let us interpret the different points in the ROC plot.
All points that lie in the upper-diagonal region correspond to "good" classifiers, as their TPRs are higher than their FPRs (i.e., FPRs are lower than TPRs).
Here, X is better than Z, as X has a higher TPR and a lower FPR than Z.
The lower-diagonal triangle corresponds to classifiers that are worse than random guessing.
Note: For a classifier that is worse than random guessing, simply by reversing its predictions we can get a classifier that is better than random guessing.
W'(0.2, 0.4) is the better version of W(0.4, 0.2); W' is the mirror reflection of W about the diagonal.
Tuning a Classifier through ROC Plot
Using an ROC plot, we can compare two or more classifiers by their TPR and FPR values, and the plot also depicts the trade-off between the TPR and FPR of a classifier.
Examining ROC curves can give insight into the best way of tuning the parameters of a classifier.
For example, on curve C2 the result degrades after point P. Similarly, for curve C1, the settings beyond Q are not acceptable.
Comparing Classifiers through ROC Plot
The two curves C1 and C2 correspond to experiments with two classifiers and their parameter settings.
A model that is strictly better than another would have a larger value of AUC (Area Under the ROC Curve).
Here, C(fpr, tpr) denotes a classifier, and $\delta(C)$ denotes the Euclidean distance between the best classifier (0, 1) and C. That is,
$$\delta(C) = \sqrt{fpr^2 + (1 - tpr)^2}$$
We could hypothesise that the smaller the value of $\delta$, the better the classifier.
$\delta$ is a useful measure, but it does not take into account the relative importance of the true and false positive rates.
We can specify the relative importance of making TPR as close to 1 and FPR as close to 0 as possible by a weight $w$ between 0 and 1, giving the weighted distance
$$\delta_w(C) = \sqrt{w\,(1 - tpr)^2 + (1 - w)\,fpr^2}$$
Note:
If $w = 0$, $\delta_w$ reduces to fpr, i.e., the FP rate; if $w = 1$, it reduces to $1 - tpr$, i.e., the FN rate.
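A small sketch of the two distance measures, using the symbol δ introduced above and an illustrative classifier at (FPR, TPR) = (0.2, 0.8):

```python
from math import sqrt

def delta(fpr, tpr):
    """Euclidean distance from classifier C(fpr, tpr) to the best classifier (0, 1)."""
    return sqrt(fpr**2 + (1 - tpr)**2)

def delta_weighted(fpr, tpr, w):
    """Weighted variant: w stresses TPR close to 1, (1 - w) stresses FPR close to 0."""
    return sqrt(w * (1 - tpr)**2 + (1 - w) * fpr**2)

print(delta(0.2, 0.8))                  # a classifier at (FPR, TPR) = (0.2, 0.8)
print(delta_weighted(0.2, 0.8, w=0))    # reduces to FPR = 0.2
print(delta_weighted(0.2, 0.8, w=1))    # reduces to 1 - TPR = 0.2
```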