This document discusses techniques for evaluating predictive performance in predictive analytics. It describes three main types of predictive outcomes and several measures for assessing prediction accuracy based on a validation data set. These include measures like mean absolute error, mean percentage error, and root mean squared error. The document also discusses classification accuracy measures like misclassification rate that can be derived from a confusion matrix. It describes how propensity scores and cutoffs can be used to classify cases or rank them by probability of class membership. Benchmarking predictions against a naïve average model is also covered.
Performance Evaluation
MODULE 3
PERFORMANCE EVALUATION
Learning Objectives
At the end of the topic, the learner should be able to:
• Learn and understand the different techniques used in evaluating predictive performance
• Identify and understand the differences among these techniques for evaluating the performance of predictive analytics models

Three main types of outcomes of interest are:
• Predicted numerical value: when the outcome variable is numerical (e.g., house price)
• Predicted class membership: when the outcome variable is categorical (e.g., buyer/nonbuyer)
• Propensity: the probability of class membership, when the outcome variable is categorical (e.g., the propensity to default)

For assessing prediction performance, several measures are used. In all cases the measures are based on the validation set, which serves as a more objective ground than the training set for assessing predictive accuracy.

Naïve Benchmark: The Average
• The benchmark criterion in prediction is using the average outcome value. In other words, the prediction for a new record is simply the average across the outcome values of the records in the training set.

Prediction Accuracy Measures
The prediction error for record i is defined as the difference between its actual outcome value and its predicted outcome value: ei = yi − ŷi. A few popular numerical measures of predictive accuracy are listed below (a short computational sketch of these measures appears below):
• MAE (mean absolute error/deviation). This gives the magnitude of the average absolute error.
• Mean Error. This measure is similar to MAE except that it retains the sign of the errors, so that negative errors cancel out positive errors of the same magnitude.
• MPE (mean percentage error). This gives a percentage score of how predictions deviate from the actual values (on average), taking into account the direction of the error.
• MAPE (mean absolute percentage error). This measure gives a percentage score of how predictions deviate from the actual values, regardless of the direction of the error.
• RMSE (root mean squared error). This is similar to the standard error of estimate in linear regression, except that it is computed on the validation data rather than on the training data.
• Errors that are based on the training set tell us about model fit, whereas those that are based on the validation set (called “prediction errors”) measure the model’s ability to predict new data (predictive performance).
• We expect training errors to be smaller than the validation errors (because the model was fitted using the training set), and the more complex the model, the greater the likelihood that it will overfit the training data (indicated by a greater difference between the training and validation errors).
• In an extreme case of overfitting, the training errors would be zero (a perfect fit of the model to the training data), while the validation errors would be non-zero and non-negligible.
• A natural criterion for judging the performance of a classifier is the probability of making a misclassification error.
• Misclassification means that the record belongs to one class but the model classifies it as a member of a different class.
• A classifier that makes no errors would be perfect, but we do not expect to be able to construct such classifiers in the real world, due to “noise” and to not having all the information needed to classify records precisely.
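As a minimal illustration (not part of the original module), the following Python sketch computes the prediction accuracy measures listed above on a handful of made-up validation values and compares a fitted model's predictions against the naïve average benchmark; the function name, the numbers, and the assumed training-set mean are all illustrative.

```python
import numpy as np

def prediction_accuracy(y_actual, y_pred):
    """Compute common prediction accuracy measures on a validation set."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_actual - y_pred                              # e_i = y_i - yhat_i

    return {
        "ME":   errors.mean(),                              # mean error (signed)
        "MAE":  np.abs(errors).mean(),                      # mean absolute error
        "MPE":  100 * (errors / y_actual).mean(),           # mean percentage error
        "MAPE": 100 * np.abs(errors / y_actual).mean(),     # mean absolute percentage error
        "RMSE": np.sqrt((errors ** 2).mean()),              # root mean squared error
    }

# Hypothetical validation outcomes and model predictions (illustrative values)
y_valid = [230, 180, 310, 250, 199]
y_hat   = [215, 190, 290, 260, 210]

# Naïve benchmark: predict the training-set average for every validation record
y_train_mean = 240.0                     # assumed average of the training outcomes
naive_pred = [y_train_mean] * len(y_valid)

print("Model:          ", prediction_accuracy(y_valid, y_hat))
print("Naive benchmark:", prediction_accuracy(y_valid, naive_pred))
```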
Figure 5.2 Cumulative gains chart (a) and decile lift chart (b) for a continuous outcome variable (sales of Toyota cars)

Judging Classifier Performance
• Benchmark: The Naïve Rule
• Class Separation
• The Confusion (Classification) Matrix
• Using the Validation Data
• Accuracy Measures
• Propensities and Cutoff for Classification

Benchmark: The Naïve Rule
• A very simple rule for classifying a record into one of m classes, ignoring all predictor information (x1, x2, …, xp) that we may have, is to classify the record as a member of the majority class.
• In other words, “classify as belonging to the most prevalent class.”
• The naïve rule is used mainly as a baseline or benchmark for evaluating the performance of more complicated classifiers.
• Clearly, a classifier that uses external predictor information (on top of the class membership allocation) should outperform the naïve rule.

Class Separation
• If the classes are well separated by the predictor information, even a small dataset will suffice in finding a good classifier, whereas if the classes are not separated at all by the predictors, even a very large dataset will not help.

Figure 5.3 High (a) and low (b) levels of separation between two classes, using two predictors

The Confusion (Classification) Matrix
• This matrix summarizes the correct and incorrect classifications that a classifier produced for a certain dataset. Rows and columns of the confusion matrix correspond to the predicted and true (actual) classes, respectively.
• The confusion matrix gives estimates of the true classification and misclassification rates.

Table 5.2 Confusion matrix based on 3000 records and two classes

Accuracy Measures
• Different accuracy measures can be derived from the confusion matrix. Consider a two-class case with classes C1 and C2 (e.g., buyer/nonbuyer). The schematic confusion matrix in Table 5.3 uses the notation ni,j to denote the number of records that are class Ci members and were classified as Cj members. Of course, if i ≠ j, these are counts of misclassifications. The total number of records is n = n1,1 + n1,2 + n2,1 + n2,2.
• The main accuracy measure is the estimated misclassification rate, also called the overall error rate. It is given by
err = (n1,2 + n2,1) / n, where n is the total number of cases in the validation dataset.
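To make the notation concrete, here is a minimal sketch (using hypothetical actual and predicted labels, not the 3000 records of Table 5.2) that tabulates a two-class confusion matrix and computes the overall error rate; pandas is used only for the cross-tabulation.

```python
import pandas as pd

# Hypothetical actual and predicted class labels for a validation set
actual    = ["C1", "C1", "C2", "C2", "C1", "C2", "C2", "C1", "C2", "C2"]
predicted = ["C1", "C2", "C2", "C2", "C1", "C1", "C2", "C1", "C2", "C1"]

# Confusion matrix: rows = predicted class, columns = actual class
# (matching the layout described above)
confusion = pd.crosstab(
    pd.Series(predicted, name="Predicted"),
    pd.Series(actual, name="Actual"),
)
print(confusion)

# Overall error rate: err = (n1,2 + n2,1) / n
n = len(actual)
misclassified = sum(a != p for a, p in zip(actual, predicted))
error_rate = misclassified / n
accuracy = 1 - error_rate
print(f"overall error rate = {error_rate:.2f}, accuracy = {accuracy:.2f}")
```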
Table 5.3 Confusion matrix: meaning of each cell

Propensities and Cutoff for Classification
• Propensities are typically used either as an interim step for generating predicted class membership (classification), or for rank-ordering the records by their probability of belonging to a class of interest.
• If overall classification accuracy (involving all the classes) is of interest, the record can be assigned to the class with the highest probability.
• In many cases, a single class is of special interest, so we will focus on that particular class and compare the propensity of belonging to that class to a cutoff value set by the analyst.
• This approach can be used with two classes or with more than two classes, though in the latter case it may make sense to consolidate classes so that you end up with two: the class of interest and all other classes. (A minimal cutoff-classification sketch follows Table 5.4.)

Table 5.4 24 records with their actual class and the probability (propensity) of them being class “owner” members, as estimated by a classifier
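The following minimal sketch (with made-up propensities rather than the 24 records of Table 5.4) shows how a cutoff turns propensities into class assignments, and how accuracy changes as the cutoff moves; the class labels "owner"/"nonowner" follow the Riding Mower example, while the numbers are illustrative.

```python
import numpy as np

# Hypothetical propensities (estimated probabilities of belonging to the class
# of interest, "owner"); these values are illustrative, not those in Table 5.4.
propensity = np.array([0.98, 0.85, 0.61, 0.52, 0.47, 0.33, 0.25, 0.04])
actual     = np.array(["owner", "owner", "owner", "nonowner",
                       "owner", "nonowner", "nonowner", "nonowner"])

def classify(propensity, cutoff=0.5):
    """Assign the class of interest when the propensity meets the cutoff."""
    return np.where(propensity >= cutoff, "owner", "nonowner")

# Lowering the cutoff catches more members of the class of interest
# at the price of more false positives, and vice versa.
for cutoff in (0.25, 0.5, 0.75):
    predicted = classify(propensity, cutoff)
    accuracy = (predicted == actual).mean()
    print(f"cutoff={cutoff:.2f}  predicted={list(predicted)}  accuracy={accuracy:.2f}")
```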
Figure 5.6 Classification metrics based on cutoffs of 0.5, 0.25, and 0.75 for the Riding Mower data

Lift Curves
• Lift curves (also called lift charts, gain curves, or gain charts) are used with models involving categorical outcomes.
• The lift curve helps us determine how effectively we can “skim the cream” by selecting a relatively small number of cases and getting a relatively large portion of the responders.
• It is often the case that the rarer events are the more interesting or important ones: responders to a mailing, those who commit fraud, defaulters on debt, and the like.

Oversampling
• When the class of interest is rare, the training set is often enriched with extra records from that class. This stratified sampling procedure is sometimes called weighted sampling or undersampling, the latter referring to the fact that the more plentiful class is undersampled, relative to the rare class. (A minimal sketch of the procedure appears at the end of the module, after the references.)
Step 1. The response and nonresponse data are separated into two distinct sets, or strata.
Step 2. Records are randomly selected for the training set from each stratum. Typically, one might select half the (scarce) responders for the training set, then an equal number of nonresponders.
Step 3. The remaining responders are put in the validation set.
Step 4. Nonresponders are randomly selected for the validation set in sufficient numbers to maintain the original ratio of responders to nonresponders.
Step 5. If a test set is required, it can be taken randomly from the validation set.

References
• Shmueli, G., et al. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd ed. John Wiley & Sons, Inc.
• Bruce, P., et al. Data Mining for Business Analytics: Concepts, Techniques and Applications. John Wiley & Sons, Inc., 2020.
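As a closing sketch (not part of the module itself), the function below implements Steps 1–4 of the stratified sampling procedure above, assuming a pandas DataFrame with a binary response column; the column name, the 50% responder fraction, and the commented usage lines are illustrative assumptions.

```python
import pandas as pd

def stratified_split(df, response_col="response", rare_value=1,
                     frac_rare_train=0.5, seed=1):
    """Split data into training and validation sets, oversampling the rare
    (responder) class in the training set, following Steps 1-4 above."""
    # Step 1: separate responders and nonresponders into two strata
    rare = df[df[response_col] == rare_value]
    common = df[df[response_col] != rare_value]

    # Step 2: take half of the (scarce) responders for training,
    # plus an equal number of nonresponders
    rare_train = rare.sample(frac=frac_rare_train, random_state=seed)
    common_train = common.sample(n=len(rare_train), random_state=seed)
    train = pd.concat([rare_train, common_train])

    # Step 3: remaining responders go to the validation set
    rare_valid = rare.drop(rare_train.index)

    # Step 4: add nonresponders so the validation set keeps the original
    # ratio of responders to nonresponders
    original_ratio = len(common) / len(rare)
    n_common_valid = min(int(round(original_ratio * len(rare_valid))),
                         len(common) - len(common_train))
    common_valid = common.drop(common_train.index).sample(n=n_common_valid,
                                                          random_state=seed)
    valid = pd.concat([rare_valid, common_valid])

    # Step 5 (drawing a test set out of the validation set) is omitted here.
    return train, valid

# Example usage (hypothetical file and column name):
# df = pd.read_csv("mailing.csv")
# train, valid = stratified_split(df, response_col="response")
```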