Evaluation

Instructor: Junghye Lee

School of Management Engineering


[email protected]
Evaluation of test set
Generalized Evaluation
Evaluation on “Small” Data
• The holdout method reserves a certain amount for
testing and uses the remainder for training
• Usually, one third for testing, the rest for training
• For small or “unbalanced” datasets, samples might
not be representative
• For instance, few or no instances of some classes
• Stratified sample
• Advanced version of balancing the data
• Make sure that each class is represented with approximately equal proportions in both subsets (see the sketch below)
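A minimal sketch of a stratified holdout split with scikit-learn, assuming a feature matrix X and label vector y (the iris data is just a stand-in, and the one-third test fraction follows the slide):

```python
# Stratified holdout: one third reserved for testing, the rest for training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for your own data

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=1/3,    # one third for testing
    stratify=y,       # keep class proportions roughly equal in both subsets
    random_state=0,
)
```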
Repeated Holdout Method
• Holdout estimate can be made more reliable by
repeating the process with different subsamples
• In each iteration, a certain proportion is randomly
selected for training (possibly with stratification)
• The error rates on the different iterations are averaged
to yield an overall error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
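A sketch of the repeated holdout idea, assuming a generic scikit-learn classifier; the error rates from several random stratified splits are averaged:

```python
# Repeated holdout: average the error rate over several random stratified splits.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
errors = []
for seed in range(10):                            # 10 different subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))    # error rate on this split
print("repeated-holdout error estimate:", np.mean(errors))
```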
Cross-Validation
• Avoids overlapping test sets
• First step: data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and
the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-
validation is performed
• The error estimates are averaged to yield an overall
error estimate
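A minimal k-fold cross-validation sketch with scikit-learn (stratified folds as suggested above; the logistic regression model is only a placeholder):

```python
# k-fold cross-validation: each fold is used once for testing, the rest for training.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("mean accuracy:", scores.mean(), "-> error estimate:", 1 - scores.mean())
```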
More on Cross-Validation
• Standard method for evaluation
• Stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that
this is the best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• e.g. ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance)
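Repeated stratified cross-validation can be sketched the same way with RepeatedStratifiedKFold (10 folds repeated 10 times, as in the example above):

```python
# Ten-fold cross-validation repeated ten times; the 100 results are averaged.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("mean accuracy over 100 train/test runs:", scores.mean())
```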
Leave-One-Out Cross-Validation
• It is a particular form of cross-validation
• Set number of folds to number of training instances
• i.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive
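A leave-one-out sketch: with n training instances the classifier is fit n times, so this is only practical for small datasets:

```python
# Leave-one-out cross-validation: n folds for n instances.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())   # one held-out prediction per instance
```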
Evaluation criteria
• Predictive accuracy: this refers to the ability of the
model to correctly predict the target of new or
previously unseen data
• Time & Memory: this refers to the computation
costs involved in generating and using the model
• Robustness: this is the ability of the model to make
correct predictions given noisy data or data with
missing values
• Scalability: this refers to the ability to construct the
model efficiently given a large amount of data
Evaluation criteria
• Interpretability: this refers to the level of
understanding and insight that is provided by the
model
• Simplicity:
• decision tree size
• rule compactness

• Domain-dependent quality indicators


Prediction Model

• Regression

• Classification
Evaluation of Prediction Model
• BIAS - The arithmetic mean of the errors
BIAS = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)}{n} = \frac{\sum_{i=1}^{n} error_i}{n}
• n is the number of test samples.

• Mean Absolute Deviation - MAD


MAD = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} = \frac{\sum_{i=1}^{n} |error_i|}{n}
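A small NumPy sketch of BIAS and MAD, assuming arrays of test-set targets and predictions (the numbers are illustrative only):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 9.0])   # illustrative test targets
y_pred = np.array([11.0, 11.5, 14.0, 10.0])  # illustrative model predictions

error = y_true - y_pred
bias = error.mean()           # BIAS: arithmetic mean of the errors
mad = np.abs(error).mean()    # MAD: mean absolute deviation
print("BIAS:", bias, "MAD:", mad)
```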
Evaluation of Prediction Model
• Mean Square Error – MSE (most popular)
MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n} = \frac{\sum_{i=1}^{n} error_i^2}{n}
• The standard error is the square root of the MSE, also known as the root mean squared error (RMSE)

• Mean Absolute Percentage Error – MAPE

MAPE = \frac{\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%}{n}
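The same arrays give MSE, RMSE and MAPE; a sketch in plain NumPy (scikit-learn's mean_squared_error could be used instead):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 9.0])
y_pred = np.array([11.0, 11.5, 14.0, 10.0])

error = y_true - y_pred
mse = np.mean(error ** 2)                      # mean square error
rmse = np.sqrt(mse)                            # root mean square error
mape = np.mean(np.abs(error / y_true)) * 100   # mean absolute percentage error, in %
print("MSE:", mse, "RMSE:", rmse, "MAPE:", mape)
```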
Evaluation of Prediction Model
• Root relative squared error - RRSE

RRSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

• In general, the lower the error measure (BIAS, MAD, MSE, MAPE, RRSE), or the higher R² or r, the better the forecasting model
Which measure?
• Best to look at all of them
• Often it doesn’t matter
• Example: see the illustrative sketch below
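As an illustrative stand-in, the sketch below computes all of the measures above for two hypothetical models on the same small test set:

```python
import numpy as np

y_true  = np.array([100.0, 50.0, 80.0, 120.0, 60.0])   # illustrative test targets
model_a = np.array([102.0, 55.0, 78.0, 118.0, 66.0])   # hypothetical model A
model_b = np.array([ 90.0, 50.0, 80.0, 120.0, 60.0])   # hypothetical model B

def report(y, yhat):
    e = y - yhat
    return {
        "BIAS": e.mean(),
        "MAD":  np.abs(e).mean(),
        "MSE":  np.mean(e ** 2),
        "RMSE": np.sqrt(np.mean(e ** 2)),
        "MAPE": np.mean(np.abs(e / y)) * 100,
        "RRSE": np.sqrt(np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)),
    }

print("A:", report(y_true, model_a))
print("B:", report(y_true, model_b))
```

In this made-up case MAD favors model B while MSE favors model A, which is why it is best to look at several measures rather than a single one.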
Prediction Model

• Regression

• Classification
Evaluation of Prediction Model
• Two-class case (yes, no)

• Four different outcomes


• true positive, true negative, false positive, false negative

• We display these outcomes in the following confusion matrix:

                  Predicted: yes     Predicted: no
  Actual: yes     true positive      false negative
  Actual: no      false positive     true negative

Evaluation of Prediction Model
• Accuracy = \frac{TP + TN}{TP + FP + TN + FN}
• Sensitivity = Recall = \frac{TP}{TP + FN}
• Specificity = \frac{TN}{TN + FP}
• Precision = \frac{TP}{TP + FP}
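A two-class sketch with scikit-learn, assuming y_true holds the actual labels and y_pred the model's predictions (1 = yes, 0 = no; the labels are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])  # actual classes
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)        # = recall
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
print(accuracy, sensitivity, specificity, precision)
```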
AUC
• Stands for "Area Under the Receiver Operating Characteristic curve"
• It shows the trade-off between true positives and false positives (the ROC curve originates from signal detection over noisy channels)
• The ROC curve plots sensitivity against (1 − specificity)
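A sketch of computing AUC from predicted probabilities with scikit-learn; roc_curve returns the (1 − specificity, sensitivity) pairs that make up the ROC curve (scores are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])  # predicted P(yes)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # fpr = 1 - specificity, tpr = sensitivity
print("AUC:", roc_auc_score(y_true, y_score))
```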
F-Measure
• It captures the trade-off between precision and recall in a single number

F_\beta = \frac{1}{\frac{\beta^2}{\beta^2 + 1} \cdot \frac{1}{Recall} + \frac{1}{\beta^2 + 1} \cdot \frac{1}{Precision}} = \frac{(\beta^2 + 1) \, P \cdot R}{\beta^2 P + R}

• The most popular measure is


F_1 = \frac{2PR}{P + R}
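A sketch with scikit-learn's f1_score and fbeta_score, reusing the hypothetical labels from the confusion-matrix example:

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

P = precision_score(y_true, y_pred)
R = recall_score(y_true, y_pred)
print("F1 (direct):", f1_score(y_true, y_pred))
print("F1 (from P and R):", 2 * P * R / (P + R))
print("F2 (recall weighted more heavily):", fbeta_score(y_true, y_pred, beta=2))
```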
Cross-validation and AUC, F1
• Collect probabilities for instances in the test folds
• Sort instances according to these probabilities
• Generate an AUC or an F1 for each fold
• Average them
• Generate an AUC or an F1 for each repetition
• Average them
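A sketch of per-fold AUCs averaged over a repeated stratified cross-validation, using scoring="roc_auc" (a binary problem is assumed; the breast-cancer data and scaled logistic regression are just stand-ins):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

# One AUC per test fold (computed from the fold's predicted probabilities);
# the mean over folds and repetitions is the reported estimate.
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("mean AUC over folds and repetitions:", aucs.mean())
```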
