.. automodule:: Orange.evaluation.scoring
Scoring plays an integral role in the evaluation of any prediction model. Orange implements various scores for the evaluation of classification, regression and multi-label models. Most of the methods need to be called with an instance of :obj:`~Orange.evaluation.testing.ExperimentResults`.
.. literalinclude:: code/scoring-example.py
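For orientation, here is a minimal sketch of the usual pattern (the ``voting`` data set ships with Orange; the learner choices are only illustrative): cross-validation builds the :obj:`~Orange.evaluation.testing.ExperimentResults`, which are then passed to a scoring function::

    import Orange

    data = Orange.data.Table("voting")
    learners = [Orange.classification.bayes.NaiveLearner(),
                Orange.classification.tree.TreeLearner()]
    # cross_validation returns an ExperimentResults instance
    res = Orange.evaluation.testing.cross_validation(learners, data, folds=10)
    # most scoring functions return one score per learner
    print("CA: %s" % Orange.evaluation.scoring.CA(res))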
Many scores for the evaluation of classification models measure whether the model assigns the correct class value to the test instances. Many of these scores can be computed solely from the confusion matrix, constructed manually with the :obj:`confusion_matrices` function. If the class variable has more than two values, the index of the value for which the confusion matrix is computed should be passed as well.
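A hedged sketch of this, again with illustrative choices of data and learner; ``iris`` has a three-valued class, so the index of the target value is passed along with the results::

    import Orange

    data = Orange.data.Table("iris")
    res = Orange.evaluation.testing.cross_validation(
        [Orange.classification.bayes.NaiveLearner()], data, folds=5)
    # one matrix per learner; the second argument is the index of the
    # class value ("Iris-versicolor") treated as the positive class
    cm = Orange.evaluation.scoring.confusion_matrices(res, 1)[0]
    print("TP=%s FP=%s FN=%s TN=%s" % (cm.TP, cm.FP, cm.FN, cm.TN))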
.. autofunction:: CA
.. autofunction:: Sensitivity
.. autofunction:: Specificity
.. autofunction:: PPV
.. autofunction:: NPV
.. autofunction:: Precision
.. autofunction:: Recall
.. autofunction:: F1
.. autofunction:: Falpha
.. autofunction:: MCC
.. autofunction:: AP
.. autofunction:: IS
.. autofunction:: confusion_chi_square
Scores that measure how well the prediction model separates instances of different classes are called discriminatory scores.
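For example, a minimal sketch computing two such scores from the same results (data set and learners are again illustrative)::

    import Orange

    data = Orange.data.Table("voting")
    learners = [Orange.classification.bayes.NaiveLearner(),
                Orange.classification.majority.MajorityLearner()]
    res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
    print("AUC:   %s" % Orange.evaluation.scoring.AUC(res))
    print("Brier: %s" % Orange.evaluation.scoring.Brier_score(res))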
.. autofunction:: Brier_score
.. autofunction:: AUC
.. autofunction:: AUC_for_single_class
.. autofunction:: AUC_matrix
.. autofunction:: AUCWilcoxon
.. autofunction:: compute_ROC
.. autofunction:: confusion_matrices
.. autoclass:: ConfusionMatrix
.. autofunction:: McNemar
.. autofunction:: McNemar_of_two
Several alternative measures, listed below, can be used to evaluate the success of numeric prediction:
.. autofunction:: MSE
.. autofunction:: RMSE
.. autofunction:: MAE
.. autofunction:: RSE
.. autofunction:: RRSE
.. autofunction:: RAE
.. autofunction:: R2
The following script (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to score several regression methods. It produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj      84.585   9.197   6.653   1.002   1.001   1.001  -0.002
    rt       40.015   6.326   4.592   0.474   0.688   0.691   0.526
    knn      21.248   4.610   2.870   0.252   0.502   0.432   0.748
    lr       24.092   4.908   3.425   0.285   0.534   0.515   0.715
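A condensed sketch of the same pattern, with illustrative learner choices (the exact numbers will differ from the table above)::

    import Orange

    data = Orange.data.Table("housing")
    # MajorityLearner predicts the average value of a continuous class
    learners = [Orange.classification.majority.MajorityLearner(),
                Orange.classification.knn.kNNLearner()]
    res = Orange.evaluation.testing.cross_validation(learners, data, folds=10)
    for name, score in [("MSE", Orange.evaluation.scoring.MSE),
                        ("RMSE", Orange.evaluation.scoring.RMSE),
                        ("MAE", Orange.evaluation.scoring.MAE),
                        ("R2", Orange.evaluation.scoring.R2)]:
        print("%-5s %s" % (name, score(res)))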
.. autofunction:: graph_ranks
The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot such a graph:
.. literalinclude:: code/statExamplesGraphRanks.py
The code produces the following graph:
.. autofunction:: compute_CD
.. autofunction:: compute_friedman
.. autofunction:: split_by_iterations
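A sketch of one way to use it (the setup mirrors the illustrative examples above): split cross-validation results by iteration and score each fold separately::

    import Orange

    data = Orange.data.Table("voting")
    res = Orange.evaluation.testing.cross_validation(
        [Orange.classification.bayes.NaiveLearner()], data, folds=5)
    # one ExperimentResults instance per cross-validation iteration
    for i, fold in enumerate(Orange.evaluation.scoring.split_by_iterations(res)):
        print("fold %i CA: %s" % (i, Orange.evaluation.scoring.CA(fold)))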
Multi-label classification requires different metrics than those used in traditional single-label classification. This module presents the various metrics that have been proposed in the literature. Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples :math:`(x_i, Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let :math:`H` be a multi-label classifier and :math:`Z_i = H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`.
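As a guide to the four measures below, these are the standard formulations from the literature cited at the end of this section (:math:`\Delta` denotes the symmetric difference of two sets); see the individual function documentation for the exact variants implemented:

.. math::

    \mathit{HammingLoss}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
        \frac{|Y_i \,\Delta\, Z_i|}{|L|}, \qquad
    \mathit{Accuracy}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
        \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

.. math::

    \mathit{Precision}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
        \frac{|Y_i \cap Z_i|}{|Z_i|}, \qquad
    \mathit{Recall}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|}
        \frac{|Y_i \cap Z_i|}{|Y_i|}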
.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall
The following script demonstrates the use of those evaluation measures:
.. literalinclude:: code/mlc-evaluate.py
The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]
Boutell, M. R., Luo, J., Shen, X. & Brown, C. M. (2004), 'Learning multi-label scene classification', Pattern Recognition, vol. 37, no. 9, pp. 1757-1771.
Godbole, S. & Sarawagi, S. (2004), 'Discriminative methods for multi-labeled classification', in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004).
Schapire, R. E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization', Machine Learning, vol. 39, no. 2/3, pp. 135-168.