Evaluation Metrics - BLEU Score
BLEU
How good is the predicted output?
In order to evaluate the performance of our model, we need a quantitative metric to measure the quality
of its predictions.
Motivation
• BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a
candidate translation of text to one or more reference translations.
• The score was developed for evaluating the predictions made by automatic
machine translation systems.
• It was proposed by Kishore Papineni et al. in their 2002 paper
"BLEU: a Method for Automatic Evaluation of Machine Translation".
BLEU: a Method for Automatic Evaluation of
Machine Translation
• The paper "BLEU: a Method for Automatic Evaluation of Machine
Translation" by Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu introduces BLEU (Bilingual Evaluation Understudy), an
automatic method for evaluating machine translation systems.
• BLEU aims to replace human evaluations, which are costly and time-
consuming, by providing a quick, inexpensive, language-independent
alternative that correlates well with human judgment.
• The main goal of BLEU is to determine how close a machine-
generated translation is to one or more high-quality human
translations using a numerical metric.
Methodology
BLEU relies on comparing a candidate machine translation with one or
more reference human translations using n-grams. The key innovations
in BLEU's method are:
• Modified n-gram precision
• Brevity Penalty
• Geometric Mean of n-gram Scores
• BLEU scores are between 0 and 1.
• In practice, a score of 0.6 or 0.7 is considered about the best achievable: even two
humans would come up with different sentence variants for the same problem and
would rarely produce a perfect match.
• Here, an n-gram is simply a sequence of consecutive words, with "n" specifying
the number of words; the three components above combine as shown in the formula below.
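Putting these pieces together: with clipped n-gram precisions p_n for n = 1 … N (usually N = 4, with uniform weights w_n = 1/N) and the brevity penalty BP defined later in this section, the standard BLEU formulation is

\[
\text{BLEU} = \text{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
\]

that is, the brevity penalty times the geometric mean of the clipped n-gram precisions.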
Typical applications beyond machine translation include:
• Caption Generation: evaluate a generated caption against one or more reference captions.
• Text Summarization: evaluate how close a generated summary is to a human-written summary.
• Automatic Speech Recognition: although not a pure NLP application, it is based on a
speech-to-text model, and the BLEU score can be used to evaluate the generated transcript.
• In simple terms, precision measures the fraction of words in the predicted text that
also appear in the reference text.
• On its own, however, precision does not penalize repetition: in the example below, the
predicted text "It It It is raining heavily" still gets a precision of 5/6, because every
repeated "It" counts as a match, which overstates the quality of the prediction.
• As we saw earlier, there are multiple ways to write the same sentence, so there can be
multiple reference texts for the same output. BLEU handles both issues with clipped
precision, worked through below.
Clipped Precision:
• Reference Text 1: It was a rainy day
• Reference Text 2: It was raining heavily today
• Predicted Text: It It It is raining heavily
Now,
• We compare each word in the predicted text against all the reference texts.
• We limit (clip) the count of each correct word to the maximum number of times it
occurs in any one reference text.
Refer to the table below:

Word       Found in      Count   Clipped count
It         Both          3       1
is         None          0       0
raining    Ref Text 2    1       1
heavily    Ref Text 2    1       1
Total                    5       3

Now, as we can see, the clipped precision here is clipped count / total number of predicted words = 3/6 = 1/2.
Plain precision would have been 5/6. Hence, clipping overcomes the shortcoming of using precision alone.
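A minimal Python sketch of the unigram case, reproducing the plain and clipped precisions above:

```python
from collections import Counter

def unigram_precision(candidate, references, clipped=True):
    """Unigram precision of a candidate against one or more references.

    With clipped=True, each candidate word is credited at most as many
    times as it appears in any single reference (modified precision).
    """
    cand_counts = Counter(candidate)
    # Maximum count of each word across the references.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    matches = 0
    for word, count in cand_counts.items():
        if clipped:
            matches += min(count, max_ref_counts[word])
        else:
            matches += count if word in max_ref_counts else 0
    return matches / len(candidate)

references = ["It was a rainy day".split(),
              "It was raining heavily today".split()]
candidate = "It It It is raining heavily".split()

print(unigram_precision(candidate, references, clipped=False))  # 5/6 ≈ 0.833
print(unigram_precision(candidate, references, clipped=True))   # 3/6 = 0.5
```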
N-grams
Simply put, an n-gram is a sequence of n consecutive words.
• For example, in "It is raining heavily."
• 1-gram (unigram): "It", "is", "raining", "heavily"
• 2-gram (bigram): "It is", "is raining", "raining heavily"
• 3-gram (trigram): "It is raining", "is raining heavily"
• 4-gram: "It is raining heavily"
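A small Python sketch of n-gram extraction, assuming simple whitespace tokenization:

```python
def ngrams(words, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "It is raining heavily".split()
print(ngrams(tokens, 2))
# [('It', 'is'), ('is', 'raining'), ('raining', 'heavily')]
```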
Calculating BLEU Score
The Brevity Penalty (BP) penalizes predictions that are shorter than the reference:

\[
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]

where
r = number of words in the reference text
c = number of words in the predicted text
In our example, r = 5 and c = 6. Since c > r, the Brevity Penalty = 1.
This ensures that the brevity penalty cannot be larger than 1, even if the predicted text
is longer than the reference text.
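The same rule as a minimal Python sketch:

```python
import math

def brevity_penalty(c, r):
    """Brevity penalty: 1 if the candidate is longer than the reference,
    otherwise exp(1 - r/c)."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(c=6, r=5))  # 1.0 (candidate longer than reference)
print(brevity_penalty(c=4, r=5))  # ≈ 0.779 (short candidate is penalized)
```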
Calculating BLEU Score based on all these values
• To calculate the BLEU score, we simply multiply the Brevity Penalty by the geometric
mean of the n-gram precisions. For our example, using the 1-gram and 2-gram clipped
precisions (3/6 = 0.5 and 1/5 = 0.2), the geometric mean is √(0.5 × 0.2) ≈ 0.316, so
BLEU = 1 × 0.316 = 0.316.
• The standard metric combines n-grams up to n = 4 and is therefore also called BLEU-4,
so the results will vary slightly when we compute the BLEU score using code.
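As a sketch in code, using NLTK's sentence_bleu (one of several implementations; the exact value depends on the n-gram weights and the smoothing chosen):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = ["It was a rainy day".split(),
              "It was raining heavily today".split()]
candidate = "It It It is raining heavily".split()

# Unigram + bigram weights only, matching the hand calculation above (≈ 0.316).
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))

# Default BLEU-4 (weights 0.25 each); smoothing avoids a zero score
# when some higher-order n-grams have no matches.
smooth = SmoothingFunction().method1
print(sentence_bleu(references, candidate, smoothing_function=smooth))
```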
• BLEU and similar automatic metrics also do poorly at comparing very different
kinds of systems, such as comparing human-aided translation against machine
translation, or different machine translation architectures against each other
(Callison-Burch et al., 2006).
BERTScore
• BLEU only rewards exact word or n-gram matches. BERTScore instead compares
contextual token embeddings (from BERT) using cosine similarity, so paraphrases
can still score well.
• In BLEU, the words "cat" and "feline", as well as "mat" and "rug", would
not be considered matches, leading to a lower score.
• In BERTScore, since "cat" and "feline" are semantically similar, and "mat"
and "rug" have close meanings, their embeddings will be similar, resulting
in a higher score.
Source: https://arxiv.org/pdf/1904.09675.pdf
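For illustration, a minimal sketch using the bert-score package released with the paper; the sentence pair below is an assumed example chosen to match the cat/feline discussion above, and exact scores depend on the underlying model:

```python
# pip install bert-score
from bert_score import score

candidates = ["A feline rested on the rug"]   # assumed candidate sentence
references = ["The cat sat on the mat"]       # assumed reference sentence

# lang="en" selects the package's default English model.
P, R, F1 = score(candidates, references, lang="en")
print(f"Precision={P.item():.3f}  Recall={R.item():.3f}  F1={F1.item():.3f}")
# BLEU would score this pair very low, since almost no n-grams match exactly,
# while BERTScore credits the semantic similarity of cat/feline and mat/rug.
```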
• R_BERT (BERT Recall): This equation calculates the recall score. It
sums the maximum cosine similarity between each token in the
reference text (x_i) and all tokens in the candidate text (x̂_j), then
normalizes by dividing by the length of the reference text (|x|).
• P_BERT (BERT Precision): This equation calculates the precision score.
It's similar to recall, but sums the maximum cosine similarity between
each token in the candidate text (x̂_j) and all tokens in the reference
text (x_i), then normalizes by dividing by the length of the candidate
text (|x̂|).
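Written out, the two equations described above (token embeddings are pre-normalized, so cosine similarity reduces to a dot product), together with the F1 combination used to report a single score:

\[
R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j,
\qquad
P_{\text{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j,
\qquad
F_{\text{BERT}} = 2\,\frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}
\]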
References
• Ketan Doshi, "Foundations of NLP Explained — Bleu Score and WER Metrics", Towards Data Science
• "BLEU", a Hugging Face Space by evaluate-metric
• Gaurav Sharma, "Understanding BLEU Score: A Metric for Evaluating Machine Translation Quality", Medium
• Abonia Sojasingarayar, "BERTScore Explained in 5 minutes: Evaluating Text Generation with BERT", Medium