Classification Metrics For Generalized Results
Classification Metrics: A Comprehensive Guide
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report, log_loss,
                             roc_auc_score, roc_curve, matthews_corrcoef)
from sklearn.linear_model import (LogisticRegression, RidgeClassifier, RidgeClassifierCV, Perceptron,
                                  SGDClassifier, PassiveAggressiveClassifier)
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.experimental import enable_hist_gradient_boosting  # needed only on older scikit-learn versions where HistGradientBoosting was experimental
from sklearn.ensemble import HistGradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.model_selection import train_test_split, cross_val_score
import category_encoders as ce
import warnings
warnings.filterwarnings('ignore')
# `titanic` is assumed to be an already loaded and preprocessed Titanic DataFrame
# (categorical features encoded, missing values handled) with a binary 'Survived' target
X = titanic.drop(['Survived'], axis=1)
y = titanic['Survived']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=21)
# Define models
models = {
'Logistic Regression': LogisticRegression(),
'Random Forest': RandomForestClassifier(criterion='entropy', n_estimators=100),
'LightGBM': lgb.LGBMClassifier(),
'Ridge Classifier CV': RidgeClassifierCV(),
'XGBoost': XGBClassifier(),
'Nearest Centroid': NearestCentroid(),
'Quadratic Discriminant Analysis': QuadraticDiscriminantAnalysis(),
'Calibrated Classifier CV': CalibratedClassifierCV(),
'Bernoulli NB': BernoulliNB(),
'Bagging Classifier': BaggingClassifier(),
'SVC': SVC(),
'Linear SVC': LinearSVC(),
'KNeighbors Classifier': KNeighborsClassifier(),
'Gaussian NB': GaussianNB(),
'Perceptron': Perceptron(),
'SGD Classifier': SGDClassifier(),
'Decision Tree': DecisionTreeClassifier(),
'MLP Classifier': MLPClassifier(),
'Extra Trees': ExtraTreesClassifier(),
'AdaBoost': AdaBoostClassifier(),
'Nu SVC': NuSVC(),
'Gaussian Process': GaussianProcessClassifier(kernel=RBF()),
'Ridge Classifier': RidgeClassifier(),
'Passive Aggressive': PassiveAggressiveClassifier(),
'Hist Gradient Boosting': HistGradientBoostingClassifier()
}
# Calculate metrics (see the complete evaluation loop sketched below)
CM = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = CM.ravel()
results.append([name, acc, balanced_acc, prec, rec, specificity, f1, roc, loss_log, mathew])
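The fragment above assumes a fitted model and precomputed metric values; one reasonable reconstruction of the full evaluation loop (reusing the imports and objects defined earlier, not necessarily the author's exact code) is:

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    CM = confusion_matrix(y_test, y_pred)
    TN, FP, FN, TP = CM.ravel()
    specificity = TN / (TN + FP)  # true negative rate, not provided directly by sklearn

    results.append([name,
                    accuracy_score(y_test, y_pred),
                    balanced_accuracy_score(y_test, y_pred),
                    precision_score(y_test, y_pred),
                    recall_score(y_test, y_pred),
                    specificity,
                    f1_score(y_test, y_pred),
                    roc_auc_score(y_test, y_pred),
                    log_loss(y_test, y_pred),      # computed on hard 0/1 predictions here; prefer predict_proba when available
                    matthews_corrcoef(y_test, y_pred)])

model_results = pd.DataFrame(results, columns=['Model', 'Accuracy', 'Balanced Accuracy', 'Precision',
                                               'Recall', 'Specificity', 'F1 Score', 'ROC AUC',
                                               'Log Loss', 'MCC'])
model_results = model_results.sort_values('Accuracy', ascending=False).reset_index(drop=True)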
Accuracy
Definition: The ratio of correct predictions to the total number of predictions.
Importance: A general measure of how well the model performs overall. However, it can be misleading in imbalanced datasets.
Balanced Accuracy
Definition: The average of sensitivity and specificity, providing a more balanced view in imbalanced datasets.
Importance: Useful when dealing with imbalanced classes, as it accounts for both positive and negative predictions.
Precision
Definition: The ratio of true positive predictions to the total number of positive predictions.
Importance: Measures how many of the positive predictions made by the model were actually correct.
Sensitivity (Recall)
Definition: The ratio of true positive predictions to the total number of actual positive instances.
Importance: Measures how well the model can identify positive instances.
Specificity
Definition: The ratio of true negative predictions to the total number of actual negative instances.
Importance: Measures how well the model can identify negative instances.
F1 Score
Definition: The harmonic mean of precision and recall.
Importance: A balanced metric that considers both precision and recall, making it useful when both are important.
Imbalanced datasets: Balanced accuracy, F1 score, MCC, and ROC AUC are often preferred.
When both precision and recall are important: F1 score is a good choice.
When probability predictions are important: Log loss is useful.
In many cases, it's helpful to consider multiple metrics to get a comprehensive understanding of the model's performance.
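To make these definitions concrete, here is a small sketch (with hypothetical confusion-matrix counts) of how each quantity is derived from the four cells:

# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 70, 10, 20, 100

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # sensitivity, true positive rate
specificity = TN / (TN + FP)            # true negative rate
balanced_accuracy = (recall + specificity) / 2
f1  = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5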
model_results
(columns, in order: Model, Accuracy, Balanced Accuracy, Precision, Recall, Specificity, F1 Score, ROC AUC, Log Loss, MCC)
0 Random Forest 0.832402 0.817246 0.843750 0.729730 0.904762 0.782609 0.817246 6.040836 0.651925
1 XGBoost 0.832402 0.821236 0.823529 0.756757 0.885714 0.788732 0.821236 6.040836 0.651851
2 Extra Trees 0.832402 0.823230 0.814286 0.770270 0.876190 0.791667 0.823230 6.040836 0.652365
3 LightGBM 0.821229 0.807722 0.818182 0.729730 0.885714 0.771429 0.807722 6.443558 0.628185
4 Gaussian NB 0.815642 0.802960 0.805970 0.729730 0.876190 0.765957 0.802960 6.644919 0.616566
5 Hist Gradient Boosting 0.810056 0.798198 0.794118 0.729730 0.866667 0.760563 0.798198 6.846281 0.605103
6 Ridge Classifier 0.810056 0.794208 0.812500 0.702703 0.885714 0.753623 0.794208 6.846281 0.604584
7 Logistic Regression 0.810056 0.794208 0.812500 0.702703 0.885714 0.753623 0.794208 6.846281 0.604584
8 Ridge Classifier CV 0.810056 0.794208 0.812500 0.702703 0.885714 0.753623 0.794208 6.846281 0.604584
9 MLP Classifier 0.804469 0.795431 0.774648 0.743243 0.847619 0.758621 0.795431 7.047642 0.594779
10 Quadratic Discriminant Analysis 0.804469 0.787452 0.809524 0.689189 0.885714 0.744526 0.787452 7.047642 0.592797
11 Calibrated Classifier CV 0.798883 0.780695 0.806452 0.675676 0.885714 0.735294 0.780695 7.249003 0.581014
12 Bagging Classifier 0.798883 0.782690 0.796875 0.689189 0.876190 0.739130 0.782690 7.249003 0.580914
13 AdaBoost 0.798883 0.788674 0.771429 0.729730 0.847619 0.750000 0.788674 7.249003 0.582621
14 Nu SVC 0.798883 0.788674 0.771429 0.729730 0.847619 0.750000 0.788674 7.249003 0.582621
15 Perceptron 0.787709 0.783140 0.736842 0.756757 0.809524 0.746667 0.783140 7.651725 0.564179
16 Linear SVC 0.770950 0.776834 0.689655 0.810811 0.742857 0.745342 0.776834 8.255809 0.545515
17 Decision Tree 0.765363 0.752124 0.735294 0.675676 0.828571 0.704225 0.752124 8.457170 0.511609
18 SGD Classifier 0.709497 0.736422 0.600000 0.891892 0.580952 0.717391 0.736422 10.470782 0.478418
19 Gaussian Process 0.703911 0.663835 0.744186 0.432432 0.895238 0.547009 0.663835 10.672143 0.377698
20 KNeighbors Classifier 0.698324 0.671042 0.678571 0.513514 0.828571 0.584615 0.671042 10.873504 0.363327
21 Passive Aggressive 0.692737 0.642342 0.787879 0.351351 0.933333 0.485981 0.642342 11.074866 0.361527
22 SVC 0.664804 0.616538 0.694444 0.337838 0.895238 0.454545 0.616538 12.081672 0.286344
23 Nearest Centroid 0.648045 0.612227 0.612245 0.405405 0.819048 0.487805 0.612227 12.685755 0.247894
24 Bernoulli NB 0.558659 0.506113 0.428571 0.202703 0.809524 0.275229 0.506113 15.907534 0.015181
When selecting metrics for generalized results, consider the following factors:
1. Dataset Characteristics:
Imbalanced Classes: If the dataset is imbalanced (one class has significantly more instances than the other), metrics like balanced
accuracy, F1 score, and MCC are more appropriate.
Noise: If the dataset contains noise or outliers, metrics that are less sensitive to individual errors, such as ROC AUC, can be helpful.
2. Problem Domain:
False Positives vs. False Negatives: Consider the consequences of different types of errors. If false positives are more costly,
focus on precision. If false negatives are more costly, focus on recall.
Probability Predictions: If the model outputs probabilities, metrics like log loss can be used to evaluate the quality of the probability
distribution.
3. Evaluation Goals:
Overall Performance: Accuracy can be a good starting point, but be aware of its limitations in imbalanced datasets.
Class-Specific Performance: If you need to evaluate performance for specific classes, calculate metrics for each class.
Trade-offs: If there is a trade-off between different metrics, consider using a composite metric like the F1 score or creating a
weighted average based on the relative importance of each metric.
4. Interpretability:
Prefer metrics that stakeholders can easily understand and act on; simple rates such as accuracy, precision, and recall are usually easier to communicate than composite scores.
Additional Considerations:
Multiple Metrics: It's often beneficial to use multiple metrics to get a comprehensive understanding of the model's performance.
Domain Knowledge: Leverage domain experts to understand the specific requirements and constraints of the problem.
Cross-Validation: Use cross-validation to assess the model's generalization performance on unseen data.
By carefully considering these factors and choosing appropriate metrics, you can obtain more generalized and reliable results
from your classification models.
# Bar chart of model scores for each metric (assumes `model_results` is the DataFrame built above,
# with a 'Model' column and one column per metric; the list of metrics below is illustrative)
for metric in ['Accuracy', 'Balanced Accuracy', 'F1 Score', 'ROC AUC']:
    fig, ax = plt.subplots(figsize=(12, 4))
    model_results.plot.bar(x='Model', y=metric, ax=ax, legend=False)
    # Set title and labels with increased font size and space before the plot
    ax.set_title(f"{metric}", fontsize=20, color='Blue')
    ax.set_xlabel("")
    ax.set_ylabel(f"{metric}")
    # Increase space above the plot using `top` prop in `plt.subplots_adjust`
    plt.subplots_adjust(top=0.95)  # Adjust value as needed
    plt.show()
https://sites.google.com/view/aiml-deepthought/home
List of 30 common metrics used in evaluating Large Language Models (LLMs)
This list covers many of the metrics used in LLM evaluation, but it's not
exhaustive. The field is evolving rapidly, and new metrics are being developed
to address specific aspects of language model performance.
BLEU (Bilingual Evaluation Understudy) scores a machine translation by comparing its n-grams against one or more reference translations. Two of its key components are:
Precision: How many of the words (and n-grams) in the machine translation are also
present in the reference translations?
Brevity Penalty: Precision alone favors very short translations, so a penalty is applied
to discourage translations that are shorter than the reference.
BLEU calculates a score based on these factors. A perfect match between the
machine translation and the reference translations results in a BLEU score of
1.0, while a complete mismatch results in a score of 0.0.
Example
Reference: "The quick brown fox jumps over the lazy dog."
Limitations of BLEU
Despite these limitations, BLEU remains a valuable tool for evaluating machine
translation systems, especially when used in conjunction with other metrics.
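As an illustrative sketch (the candidate sentence is hypothetical, and the library choice is an assumption), BLEU can be computed with NLTK's implementation:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The quick brown fox jumps over the lazy dog .".split()
candidate = "A fast brown fox jumps over the lazy dog .".split()   # hypothetical MT output

# Smoothing avoids a zero score when some higher-order n-grams have no match
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)   # 1.0 = perfect match, 0.0 = no overlap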
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap of n-grams and longest common subsequences between a system-generated summary and one or more reference summaries.
Example
Reference summary: "The quick brown fox jumps over the lazy dog."
System summary: "Quick brown fox jumps."
A higher ROUGE score indicates better overlap between the system output and
reference summaries.
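A minimal sketch using the rouge-score package (a library choice assumed here, not named in the source), applied to the example above:

from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
system    = "Quick brown fox jumps."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, system)   # score(target, prediction)
for name, s in scores.items():
    print(name, f"P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")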
Advantages of ROUGE
Simple to compute
Widely used in the evaluation of text summarization and machine
translation systems
Limitations of ROUGE
Conclusion
METEOR (Metric for Evaluation of Translation with Explicit ORdering) aligns the candidate translation with the reference using exact, stem, and synonym matches, then combines unigram precision and recall with a fragmentation penalty.
Example
Reference: "The quick brown fox jumps over the lazy dog."
Machine Translation: "A fast brown fox jumps over the lazy dog."
METEOR would:
Match "quick" with "fast" using synonymy.
Match "brown", "fox", "jumps", "over", "the", "lazy", and "dog" as exact
matches.
Align the matched words to create an alignment.
Calculate a penalty based on the alignment, considering the substitution
of "quick" with "fast".
Compute the final METEOR score based on precision, recall, and the
penalty.
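A hedged sketch with NLTK's METEOR implementation for this example (recent NLTK versions expect pre-tokenized input and WordNet data):

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")    # required for synonym matching
nltk.download("omw-1.4")

reference = "The quick brown fox jumps over the lazy dog .".split()
candidate = "A fast brown fox jumps over the lazy dog .".split()

print(meteor_score([reference], candidate))   # higher is better, maximum 1.0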
Advantages of METEOR
Limitations of METEOR
More computationally expensive than BLEU
Still relies on n-gram matching and might not capture all aspects of
translation quality
TER (Translation Edit Rate) is a metric used to evaluate the quality of machine
translation. Unlike metrics like BLEU which focus on n-gram overlap, TER
measures the amount of editing required to transform a machine-translated
output into a human-quality reference translation.
A lower TER score indicates a better translation, as it implies fewer edits are
needed to make it correct.
Example
Reference: "The quick brown fox jumps over the lazy dog."
Machine Translation: "Quick brown fox jumps lazy dog."
Advantages of TER
Limitations of TER
Example
Reference: "The quick brown fox jumps over the lazy dog."
Generated: "Quick brown fox jumps lazy dog."
To calculate GLEU:
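As a quick, hedged sketch, NLTK's sentence-level implementation of Google-BLEU can be applied to the example above (the library choice is an assumption):

from nltk.translate.gleu_score import sentence_gleu

reference = "The quick brown fox jumps over the lazy dog .".split()
candidate = "Quick brown fox jumps lazy dog .".split()

print(sentence_gleu([reference], candidate))   # between 0 and 1, higher is better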
Advantages of GLEU
Limitations of GLEU
Reference descriptions:
o "A tabby cat is sitting on a blue mat."
o "A cat is sleeping on a mat."
o "A furry cat is resting on a blue mat."
Generated description: "A cat is on a mat."
CIDEr would calculate the similarity between the generated description and
each reference description based on n-gram overlap and TF-IDF weights. The
final CIDEr score would reflect how well the generated description aligns with
the consensus among the human-written descriptions.
Advantages of CIDEr
Limitations of CIDEr
1. Scene graph generation: Both the generated caption and the reference
captions are converted into scene graphs. A scene graph represents an
image as a set of objects, their attributes, and relationships between them.
2. Semantic proposition matching: The scene graphs of the generated and
reference captions are compared based on their semantic propositions
(e.g., "person holding ball").
3. Matching score calculation: A matching score is calculated based on the
number of matched semantic propositions and their attributes.
4. SPICE score: The final SPICE score is computed as the average
matching score across all reference captions.
Reference captions:
o "A person is holding a red ball."
o "The man is playing with a red ball."
Generated caption: "A woman is holding a ball."
SPICE would convert each caption into a scene graph, identifying objects
(person, ball), attributes (red), and relationships (holding). The scene graphs of
the generated and reference captions would then be compared to calculate the
matching score based on shared semantic propositions.
Advantages of SPICE
Limitations of SPICE
1. Token Embedding: Both the reference and candidate text are tokenized,
and each token is converted into a contextual embedding using a pre-
trained BERT model.
2. Matching: Each token in the candidate text is matched with the most
similar token in the reference text based on cosine similarity between
their embeddings.
3. Precision, Recall, and F1:
o Precision: Measures how many tokens in the candidate text are
correctly matched to the reference text.
o Recall: Measures how many tokens in the reference text are
correctly matched by the candidate text.
o F1: The harmonic mean of precision and recall.
4. Scoring: The final BERTScore is computed as the average of the F1
scores across all tokens in the candidate text.
Example
Reference: "The quick brown fox jumps over the lazy dog."
Candidate: "A fast brown fox jumps over the sleepy dog."
Advantages of BERTScore
Limitations of BERTScore
Example
BLEURT would process both sentences using its pre-trained BERT model,
capturing the semantic and syntactic similarities between the two. It would then
output a score indicating how close the generated text is to the quality of the
reference text.
Advantages of BLEURT
Limitations of BLEURT
MAUVE (Measuring the Gap Between Neural Text and Human Text) is a
metric designed to assess the quality of text generated by neural models. It
measures how closely the generated text distribution matches the distribution of
human-written text. Unlike traditional metrics that focus on specific aspects like
n-gram overlap or lexical similarity, MAUVE provides a more holistic
evaluation.
1. Embedding: Both the human-written text and the generated text are
converted into embeddings using a pre-trained language model.
2. Quantization: The embeddings are quantized into a fixed number of
clusters using k-means clustering. This reduces the dimensionality of the
data and makes the calculations more efficient.
3. Divergence Calculation: The Kullback-Leibler (KL) divergence is
calculated between the probability distributions of the quantized
embeddings for the human and generated text. This measures how
different the two distributions are.
4. Divergence Curve: By varying the number of clusters (quantization
level), MAUVE creates a divergence curve that captures the trade-off
between different types of errors in the generated text.
5. MAUVE Score: The final MAUVE score is the area under the
divergence curve, providing a single scalar value to represent the overall
similarity between the two text distributions.
Example
A higher MAUVE score indicates that the generated text distribution is closer to
the human text distribution, suggesting better quality.
Advantages of MAUVE
Limitations of MAUVE
In summary, MAUVE offers a valuable tool for assessing the quality of text
generated by neural models. By focusing on the overall distribution of text, it
provides a more nuanced evaluation than traditional metrics.
Perplexity (PPL) measures how well a language model predicts a sample of text; lower perplexity means the model assigns higher probability to the text.
Formula:
Perplexity = exp( -(1/N) * Σ log P(w_i | w_1, ..., w_{i-1}) ), where N is the number of tokens.
Interpretation
Example
Imagine two language models, A and B. When presented with the sentence "The
quick brown fox", model A assigns higher probabilities to common word
sequences like "jumps over", while model B assigns higher probabilities to less
common sequences. Model A will likely have a lower perplexity, indicating it is
better at capturing language patterns.
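As an illustrative sketch (not from the original text), perplexity for a pre-trained causal language model such as GPT-2 can be estimated with the Hugging Face transformers library by exponentiating the average per-token loss:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)   # exp of the average negative log-likelihood per token
print(perplexity.item())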
Word Error Rate (WER) is a metric used to evaluate the accuracy of speech
recognition or machine translation systems. It measures the number of errors
(insertions, deletions, and substitutions) in a generated text compared to a
reference text. A lower WER indicates higher accuracy.
Insertion: A word is added in the generated text that doesn't exist in the
reference text.
Deletion: A word is missing in the generated text that exists in the
reference text.
Substitution: A word is replaced by a different word in the generated
text.
WER = (S + D + I) / N
Where:
S = number of substitutions
D = number of deletions
I = number of insertions
N = total number of words in the reference text
Example
In this example:
Note:
WER is often used in conjunction with other metrics like Character Error
Rate (CER) for a more comprehensive evaluation.
While WER is a common metric, it has limitations. For example, it
doesn't consider semantic similarity between words.
CER = (S + D + I) / N
Where:
S = number of substitutions
D = number of deletions
I = number of insertions
N = total number of characters in the reference text
Example
In this example:
Note:
CER is often used in conjunction with Word Error Rate (WER) for a
more comprehensive evaluation.
While CER is a useful metric, it doesn't consider the semantic or syntactic
structure of the text.
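As a hedged sketch, the jiwer package (an assumed choice; the source names no library) computes both WER and CER directly from strings:

import jiwer

reference  = "The quick brown fox jumps over the lazy dog"
hypothesis = "Quick brown fox jumps lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))   # word-level edits / reference words
print("CER:", jiwer.cer(reference, hypothesis))   # character-level edits / reference characters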
Before we dive into F1 score, let's quickly recap precision and recall:
F1 Score
The F1 score is the harmonic mean of precision and recall. It provides a single
metric to evaluate a model's performance.
Formula:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Example
TP = 80
FP = 20
FN = 10
Precision = TP / (TP + FP) = 80 / 100 = 0.80
Recall = TP / (TP + FN) = 80 / 90 ≈ 0.889
F1 = 2 * (0.80 * 0.889) / (0.80 + 0.889) ≈ 0.842
The harmonic mean is used instead of the arithmetic mean because it gives
more weight to lower values. If either precision or recall is low, the F1 score
will also be low. This ensures that the model performs well in both aspects.
1. Reciprocal Rank (RR): For each query, the reciprocal rank is calculated
as 1 divided by the position of the first correct answer.
o If the correct answer is in the first position, RR = 1/1 = 1.
o If the correct answer is in the second position, RR = 1/2 = 0.5.
o If no correct answer is found, RR = 0.
2. Mean Reciprocal Rank (MRR): The MRR is the average of the
reciprocal ranks across all queries.
Example
Consider a search engine that returns a ranked list of results for a query.
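A small from-scratch sketch (with made-up queries and results) shows how MRR is computed:

def mean_reciprocal_rank(ranked_results, relevant):
    # ranked_results: one ranked result list per query; relevant: one set of correct answers per query
    reciprocal_ranks = []
    for results, rel in zip(ranked_results, relevant):
        score = 0.0
        for rank, item in enumerate(results, start=1):
            if item in rel:
                score = 1.0 / rank   # reciprocal rank of the first correct answer
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Three queries: first correct answer at rank 1, rank 2, and not found at all
print(mean_reciprocal_rank([["a", "b"], ["x", "y"], ["p", "q"]],
                           [{"a"}, {"y"}, {"z"}]))   # (1 + 0.5 + 0) / 3 = 0.5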
Interpretation
NDCG is a metric used to evaluate the quality of a ranked list of items, such as
search results or recommendations. It considers both the relevance of each item
and its position in the list.
Example
Imagine a search engine returning results for the query "best smartphones". We
assign relevance scores to each result (4 for most relevant, 3 for less relevant,
etc.).
DCG:
With the results returned in the order Phone A (relevance 4), Phone B (relevance 3), Phone C (relevance 2):
DCG = 4 + (3 / log2(2)) + (2 / log2(3)) ≈ 8.26
IDCG:
Assuming Phone A is the most relevant, Phone B the second, and Phone C the third:
IDCG = 4 + (3 / log2(2)) + (2 / log2(3)) ≈ 8.26 (same as the DCG in this case, as the list is already perfectly ordered)
NDCG:
NDCG = DCG / IDCG = 8.26 / 8.26 = 1.0
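scikit-learn's ndcg_score can check this kind of calculation; a sketch using the relevance scores above (scikit-learn discounts the first position too, using rel_i / log2(i + 1), so its raw DCG differs from the hand formula, but a perfectly ordered list still gives NDCG = 1.0):

from sklearn.metrics import ndcg_score

true_relevance   = [[4, 3, 2]]   # Phone A, Phone B, Phone C
predicted_scores = [[3, 2, 1]]   # ranking produced by the search engine (same order)

print(ndcg_score(true_relevance, predicted_scores))   # 1.0, the list is already perfectly ordered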
Key points
1. Average Precision (AP): For each query, calculate the precision at each
relevant result retrieved. Then, calculate the average of these precision
values.
2. Mean Average Precision (MAP): Calculate the average of the AP
values for all queries.
Example
Consider a search engine with three queries: "apple", "banana", and "orange".
Query: apple
o Relevant results: A, B, C, D
o Retrieved results: A, E, C, D, B
o Precision at each relevant result: 1/1, 2/3, 3/4, 4/5
o Average Precision (AP) for "apple" = (1 + 2/3 + 3/4 + 4/5) / 4
Query: banana
o Relevant results: X, Y, Z
o Retrieved results: X, W, Y
o Average Precision (AP) for "banana" = (1/1 + 2/3) / 3
Query: orange
o Relevant results: P, Q
o Retrieved results: R, S
o Average Precision (AP) for "orange" = 0
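A from-scratch sketch of AP and MAP on the three queries above (note that AP here divides by the number of relevant documents, matching the example):

def average_precision(retrieved, relevant):
    hits, precisions = 0, []
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant result
    return sum(precisions) / len(relevant) if relevant else 0.0

queries = {
    "apple":  (["A", "E", "C", "D", "B"], {"A", "B", "C", "D"}),
    "banana": (["X", "W", "Y"],           {"X", "Y", "Z"}),
    "orange": (["R", "S"],                {"P", "Q"}),
}
aps = [average_precision(retrieved, relevant) for retrieved, relevant in queries.values()]
print(sum(aps) / len(aps))   # Mean Average Precision over the three queries, ≈ 0.45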
Interpretation
ROC stands for Receiver Operating Characteristic. It's a graphical plot that
illustrates the diagnostic ability of a binary classifier system as its
discrimination threshold is varied. The curve is created by plotting the true
positive rate (TPR) against the false positive rate (FPR) at various threshold
settings.
AUC stands for Area under the Curve. It is a numerical value representing the
two-dimensional area under the ROC curve.
Example
By varying the classification threshold, you can calculate TPR and FPR for
different scenarios. Plotting these points creates the ROC curve. The area under
this curve is the AUC.
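A minimal sketch with scikit-learn (hypothetical labels and predicted probabilities) that traces the curve and computes the area:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                       # hypothetical ground-truth labels
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]     # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_prob)        # points of the ROC curve
print(roc_auc_score(y_true, y_prob))                    # area under that curve, ≈ 0.94 here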
Visual Representation: ROC curve plotting TPR against FPR at varying thresholds (figure).
Limitations of AUC-ROC
How it works
Interpretation
AUROC ranges from 0 to 1.
An AUROC of 1 indicates perfect classification.
An AUROC of 0.5 means the model is no better than random guessing.
Higher AUROC values indicate better performance.
Advantages of AUROC
Independent of class distribution: Works well with imbalanced
datasets.
Considers all classification thresholds: Provides a comprehensive view
of model performance.
Limitations of AUROC
Doesn't directly optimize a specific metric: Doesn't directly optimize
precision, recall, or F1-score.
How it works
1. Calculate the difference: For each data point, subtract the predicted
value from the actual value.
2. Square the difference: Square the difference calculated in step 1.
3. Calculate the mean: Find the average of all the squared differences.
Formula:
MSE = (1/n) * Σ (actual_i - predicted_i)^2
Where:
n = number of data points
actual_i = actual value for the i-th data point
predicted_i = predicted value for the i-th data point
Example
Let's say we're trying to predict house prices based on square footage. We have
the following data:
Actual Price   Predicted Price
200,000        190,000
300,000        320,000
Interpretation
How it works
1. Calculate the absolute difference: For each data point, find the absolute
difference between the predicted and actual value.
2. Calculate the mean: Find the average of all the absolute differences.
Formula:
MAE = (1/n) * Σ |actual_i - predicted_i|
Where:
n = number of data points
actual_i = actual value for the i-th data point
predicted_i = predicted value for the i-th data point
Example
Actual Price   Predicted Price
200,000        190,000
300,000        320,000
Interpretation
How it works
1. Calculate the difference: For each data point, subtract the predicted
value from the actual value.
2. Square the difference: Square the difference calculated in step 1.
3. Calculate the mean: Find the average of all the squared differences (this
is MSE).
4. Take the square root: Calculate the square root of the MSE.
Formula:
RMSE = sqrt( (1/n) * Σ (actual_i - predicted_i)^2 )
Where:
n = number of data points
actual_i = actual value for the i-th data point
predicted_i = predicted value for the i-th data point
Example
Actual Price   Predicted Price
200,000        190,000
Interpretation
How it works
1. Calculate the absolute percentage error: For each data point, calculate
the absolute difference between the predicted and actual value, divide it
by the actual value, and then multiply by 100 to get a percentage.
2. Calculate the mean: Find the average of all the absolute percentage
errors.
Formula:
MAPE = (100 / n) * Σ |(actual_i - predicted_i) / actual_i|
Where:
n = number of data points
actual_i = actual value for the i-th data point
predicted_i = predicted value for the i-th data point
Example
Actual Price   Predicted Price
200,000        190,000
300,000        320,000
Interpretation
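To tie the four error metrics together, here is a small sketch using scikit-learn on the two-row house-price example above (mean_absolute_percentage_error requires scikit-learn 0.24 or newer and returns a fraction, not a percentage):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

actual    = np.array([200_000, 300_000])
predicted = np.array([190_000, 320_000])

mse  = mean_squared_error(actual, predicted)               # 250,000,000
mae  = mean_absolute_error(actual, predicted)              # 15,000
rmse = np.sqrt(mse)                                        # ≈ 15,811
mape = mean_absolute_percentage_error(actual, predicted)   # ≈ 0.0583 (about 5.8%)
print(mse, mae, rmse, mape)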
Example
Calculation
Important Considerations
Example:
Precise measurements: 3.2 kg, 3.21 kg, 3.19 kg, 3.2 kg, 3.22 kg
Less precise measurements: 2.8 kg, 3.5 kg, 4.1 kg, 2.9 kg, 3.3 kg
In the first example, the measurements are very close to each other, indicating
high precision. In the second example, the measurements are spread out,
indicating low precision.
Important to note:
Example:
In essence, recall answers the question: "Of all the actual positive cases, how
many did the model correctly identify?"
A high recall means that the model is good at finding all relevant instances, but
it might also include some irrelevant ones (which is where precision comes into
play).
Example
Machine Translation Output: "The quick brown fox jumps over the
lazy dog."
Reference Translation: "The fast brown fox leaps over the sleepy dog."
Human-Edited Output: "The fast brown fox jumps over the lazy dog."
Advantages of HTER
Example
Advantages of COMET
Limitations of COMET
Example
Character n-grams in reference: "the", "he ", "e q", " qu", "qui", "uic",
"ick", "ck ", "k b", " br", "bro", "row", "own", "wn ", "n f", " fo", "fox"
Character n-grams in machine translation: "the", "he ", "e q", " qv", "qvi",
"vic", "ick", "ck ", "k b", " br", "bro", "row", "own", "wn ", "n f", " fo",
"fox"
Advantages of chrF
Limitations of chrF
Let's say you have a machine translation output and its corresponding reference
translation. You can use the SacreBLEU library to calculate the BLEU score:
import sacrebleu
# One reference stream containing one reference sentence
refs = [["The quick brown fox jumps over the lazy dog"]]
# System output; the exact candidate sentence from the original example is not recoverable, so this one is illustrative
sys = ["The fast brown fox jumps over the lazy dog"]
score = sacrebleu.corpus_bleu(sys, refs)
print(score)
Output:
Advantages of SacreBLEU
Metric Description
Perplexity (PPL) Measures the perplexity of a language model.
BLEU Bilingual Evaluation Understudy for machine translation.
ROUGE Recall-Oriented Understudy for Gisting Evaluation.
METEOR Metric for Evaluation of Translation with Explicit Ordering.
TER Translation Edit Rate.
GLEU Google BLEU.
SacreBLEU Standardized BLEU.
chrF Character n-gram F-score.
COMET Crosslingual Optimized Metric for Evaluation of Translation.
HTER Human-Targeted Translation Edit Rate.
BERTScore BERT-based Scoring.
BLEURT BERT-based Language Understanding Evaluation Reference Tool.
MAUVE Measuring the Gap between Neural Text and Human Text.
Metric Description
Accuracy Proportion of correct predictions.
Precision Proportion of positive predictions that are truly positive.
Recall Proportion of actual positives that are correctly identified.
F1 Score Harmonic mean of precision and recall.
AUC-ROC Area Under the Receiver Operating Characteristic curve.
AUROC Area Under the Receiver Operating Characteristic curve.
MSE Mean Squared Error.
MAE Mean Absolute Error.
RMSE Root Mean Square Error.
MAPE Mean Absolute Percentage Error.
Metric Description
MRR Mean Reciprocal Rank.
NDCG Normalized Discounted Cumulative Gain.
MAP Mean Average Precision.
Metric Description
CIDEr Consensus-based Image Description Evaluation.
SPICE Semantic Propositional Image Caption Evaluation.
Note: Some metrics can be used in multiple contexts. For example, BLEU can be used for both
machine translation and text summarization.
https://sites.google.com/view/aiml-deepthought/home