
Evaluation Metrics: BLEU

How good is the predicted output?

When translating a sentence, for instance, two different people may come up
with two slightly different answers, both of which are completely correct.

e.g. “The ball is blue” and “The ball has a blue color”.

In order to evaluate the performance of our model, we need a quantitative
metric to measure the quality of its predictions.
Motivation

Goal of BLEU: To correlate with human judgment


BLEU: Bilingual Evaluation Understudy Score
• The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for
evaluating a generated sentence against a reference sentence.

• The score was developed for evaluating the predictions made by automatic
machine translation systems.

• The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper
“BLEU: A Method for Automatic Evaluation of Machine Translation“.
• BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a
candidate translation of text to one or more reference translations.
BLEU: a Method for Automatic Evaluation of
Machine Translation
• The paper "BLEU: a Method for Automatic Evaluation of Machine
Translation" by Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu introduces BLEU (Bilingual Evaluation Understudy), an
automatic method for evaluating machine translation systems.
• BLEU aims to replace human evaluations, which are costly and time-
consuming, by providing a quick, inexpensive, language-independent
alternative that correlates well with human judgment.
• The main goal of BLEU is to determine how close a machine-
generated translation is to one or more high-quality human
translations using a numerical metric.
Methodology
BLEU relies on comparing a candidate machine translation with one or
more reference human translations using n-grams. The key innovations
in BLEU's method are:
• Modified n-gram precision
• Brevity Penalty
• Geometric Mean of n-gram Scores
• BLEU scores are between 0 and 1.

• A score of 0.6 or 0.7 is considered about the best you can achieve. Even two
humans would likely produce different sentence variants for the same
problem, and would rarely achieve a perfect match.

• For this reason, a score very close to 1 is unrealistic in practice and should
raise a flag that your model is overfitting.
Candidate Translation and Reference Translation

• Candidate translation: the translation automatically generated by the
machine when performing machine translation.

• Reference translation: a translation of the input sequence produced by
human experts (linguists).
Strengths of BLEU
• Speed and Automation: BLEU can be run frequently without the need
for human intervention, making it ideal for rapid system evaluation.
• Language-Independence: Since BLEU focuses on n-grams rather than
any language-specific rules, it can be applied across languages with
minimal adjustments.
• Correlation with Human Judgment: BLEU's high correlation with
human judgments over large test sets makes it a reliable alternative
to human evaluation.
Working of BLEU score
• The approach works by counting n-grams in the candidate translation that
match n-grams in the reference text.

• A 1-gram (unigram) comparison looks at each token, while a bigram
comparison looks at each word pair. The comparison is made regardless of
word order.

• Here, n-grams are simply contiguous sequences of words, with "n"
specifying the number of words.

• BLEU combines precision computed over n-grams with something called
a brevity penalty.
When to use BLEU score?
Originally, the BLEU score was used for machine-translation tasks. But over time, the
metric has been adopted for many more tasks, such as:

• Caption Generation: Evaluate a generated caption against a reference caption.

• Chatbots: Evaluate the generated text against that of an actual conversation.

• Text Summarization: Evaluate the quality of a summary against a human-written
summary.

• Automatic Speech Recognition: Although not a direct NLP application, it is still based
on a speech-to-text model and can use the BLEU score to evaluate the generated output.

• Machine Translation: Evaluate the generated translation against reference translations.


Terminologies in BLEU Score: Precision

• In simple terms, precision measures the number of words in the predicted text that
also appear in the reference text.

• For example, say:
• Reference text: It was raining today
• Predicted text: It is raining today

• The precision formula is:

Precision = Number of correct predicted words / total number of words in the predicted text
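
A minimal Python sketch of this plain word-level precision for the example above (the
variable names are illustrative, not from the slides):

reference = "It was raining today".split()
predicted = "It is raining today".split()

# Count predicted words that also appear in the reference.
correct = sum(1 for word in predicted if word in reference)
precision = correct / len(predicted)
print(precision)  # 3 correct out of 4 predicted words -> 0.75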
Problem With Precision:
• Hence, here the precision will be 3/4. But using precision like this can lead to a few
problems:

• It does not handle repetition. Say the predicted text is "It It It": every predicted
word appears in the reference, so the precision becomes 3/3 = 1, which is clearly
misleading.

• As we saw earlier, there are multiple ways to write the same sentence,
hence there can be multiple reference texts for the same output.

• To work around these two issues, we use a modified version of precision,
called "Clipped Precision," for computing the BLEU score in NLP.
Clipped Precision:
• Reference Text 1: It was a rainy day
• Reference Text 2: It was raining heavily today
• Predicted Text: It It It is raining heavily

Now,
• We compare each word in the predicted text to all the reference texts.
• We limit the count for each correct word to the maximum number of
times it occurs in any single reference text.

Refer to the table below for more clarity:

Word      | Matched Text | Match Count | Clipped Count
It        | Both         | 3           | 1
is        | None         | 0           | 0
raining   | Ref Text 2   | 1           | 1
heavily   | Ref Text 2   | 1           | 1
Total     |              | 5           | 3

Now, as we can see, the clipped precision here will be clipped count / total number of
predicted words = 3/6 = 1/2.

The plain precision would have been 5/6. Hence, we were able to overcome the
shortcomings of using only precision.
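
As a rough sketch, clipped precision can be computed with Python's collections.Counter;
the function name clipped_precision is mine, not from the slides:

from collections import Counter

def clipped_precision(predicted, references):
    # Each predicted word is credited at most as many times as it
    # appears in any single reference (the "clipping" step).
    pred_counts = Counter(predicted)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in pred_counts.items())
    return clipped / len(predicted)

refs = ["It was a rainy day".split(), "It was raining heavily today".split()]
pred = "It It It is raining heavily".split()
print(clipped_precision(pred, refs))  # 3 / 6 = 0.5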
N-grams
Simply put, an n-gram is just a sequence of "n consecutive
words."
• For example, in "It is raining heavily."
• 1-gram (unigram): "It", "is", "raining", "heavily"
• 2-gram (bigram): "It is", "is raining", "raining heavily"
• 3-gram (trigram): "It is raining", "is raining heavily"
• 4-gram: "It is raining heavily"
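
A small helper that extracts these n-grams, shown as a sketch (the function name is
illustrative):

def ngrams(words, n):
    # All sequences of n consecutive words in the sentence.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "It is raining heavily".split()
print(ngrams(words, 1))  # [('It',), ('is',), ('raining',), ('heavily',)]
print(ngrams(words, 2))  # [('It', 'is'), ('is', 'raining'), ('raining', 'heavily')]
print(ngrams(words, 4))  # [('It', 'is', 'raining', 'heavily')]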
Calculating BLEU Score

• Now, based on these, how can we calculate the BLEU score in NLP?

• Let's consider the earlier predicted and reference text.
• Reference Text: It was raining heavily today
• Predicted Text: It It It is raining heavily
• We will take two cases, unigram and bigram, here for a simpler
calculation, though usually 1-gram to 4-gram is used.
Calculating BLEU Score

Now we must calculate the clipped precision for unigrams and bigrams.

• Unigram: the clipped precision, as we saw from the table earlier, is
3/6 = 1/2 for this case.
• Bigram: bigrams for the reference text: ["It was", "was raining", "raining heavily",
"heavily today"]
• Bigrams for the predicted text: ["It It", "It It", "It is", "is raining", "raining heavily"]
• Only "raining heavily" matches, so the bigram clipped precision is 1/5.
Now we combine these precision scores using their geometric mean:
sqrt(1/2 × 1/5) = sqrt(0.1) ≈ 0.316 (see the sketch below).
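
The sketch below reproduces these two clipped precisions and their geometric mean for the
single-reference example; names and structure are mine, shown only to make the arithmetic
concrete:

from collections import Counter
from math import sqrt

ref = "It was raining heavily today".split()
pred = "It It It is raining heavily".split()

def clipped_ngram_precision(pred, ref, n):
    # Count n-grams in the prediction, clip each count by its count in the reference.
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
    return clipped / sum(pred_ngrams.values())

p1 = clipped_ngram_precision(pred, ref, 1)  # 3/6 = 0.5
p2 = clipped_ngram_precision(pred, ref, 2)  # 1/5 = 0.2
print(p1, p2, sqrt(p1 * p2))                # geometric mean ≈ 0.316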
Brevity Penalty
• Suppose our predicted text is just a single word, "raining".
For this, the clipped precision would be 1.

• This is misleading, as it encourages the model to output very few
words with high precision.

• To penalize such cases, we use the Brevity Penalty.


Brevity Penalty

The brevity penalty (BP) compares the predicted length c with the reference length r:

BP = 1              if c > r
BP = e^(1 - r/c)    if c ≤ r

where
r = number of words in the reference text
c = number of words in the predicted text

Here, in our example, r = 5 and c = 6. Since c > r, the brevity penalty = 1.

This ensures that the brevity penalty cannot be larger than 1, even if the predicted
text is longer than the reference text.
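
A direct translation of this rule into Python, as a sketch:

from math import exp

def brevity_penalty(c, r):
    # c = number of words in the predicted text, r = in the reference text.
    return 1.0 if c > r else exp(1 - r / c)

print(brevity_penalty(6, 5))  # our example: prediction longer than reference -> 1.0
print(brevity_penalty(1, 5))  # the single word "raining" -> exp(1 - 5) ≈ 0.018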
Calculating BLEU Score based on all these values

• Now, to calculate the BLEU score, we simply multiply the brevity penalty by
the Global Average Precision (the geometric mean of the n-gram precisions):
1 × 0.316 = 0.316.

• Thus, the BLEU score for our predicted text is 0.316.

• In practice, 1-gram to 4-gram precisions are used, which is why the standard
metric is also called BLEU-4, so the results will vary slightly when we
compute the BLEU score in code (see the sketch below).

• What we did here, using only unigrams and bigrams, is BLEU-2.
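
If NLTK is installed (pip install nltk), the hand calculation can be cross-checked with
its sentence_bleu function; this snippet is only a sketch, not the code referred to in
the slides:

from nltk.translate.bleu_score import sentence_bleu

reference = ["It was raining heavily today".split()]  # a list of references
candidate = "It It It is raining heavily".split()

# weights=(0.5, 0.5) restricts the score to unigram and bigram precision (BLEU-2);
# the default weights (0.25, 0.25, 0.25, 0.25) correspond to standard BLEU-4.
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print(round(score, 3))  # ≈ 0.316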


Another example of clipped precision:
• Target Sentence 1: She plays the piano.
• Target Sentence 2: She is playing piano.
• Predicted Sentence: She She She plays playing keyboard.
Clipped precision = number of clipped correct predicted words
/ total number of predicted words
In this case, the clipped count is 3 ("She", "plays" and "playing", each clipped to 1)
and the predicted sentence has 6 words, so the clipped precision = 3/6 = 1/2.
Limitations and Criticisms
• No Semantic Understanding: BLEU focuses purely on n-gram matches and
does not account for synonymy or paraphrasing, which means it may
penalize valid translations that use different phrasing than the reference.
• Overemphasis on Short Phrases: The use of short n-grams, while capturing
word-level precision, may miss out on evaluating longer, more complex
sentence structures.
• Effectiveness with Multiple References: BLEU scores are notably higher
with multiple reference translations. However, acquiring multiple high-
quality references can be costly and may not always be feasible.
• Not Ideal for Single Sentence Evaluation: BLEU works best when averaged
over many sentences. On a sentence-by-sentence basis, it may produce
scores that do not align perfectly with human intuition, especially for
idiomatic or highly contextual translations.
BLEU: Limitations
• BLEU is very local: A large phrase that is moved around might not change the
BLEU score at all, and BLEU can’t evaluate cross-sentence properties of a
document like its discourse coherence

• BLEU and similar automatic metrics also do poorly at comparing very different
kinds of systems, such as comparing human-aided translation against machine
translation, or different machine translation architectures against each other
(Callison-Burch et al., 2006).

• Such automatic metrics are probably most appropriate when evaluating
changes to a single system.
BERTScore: An Embedding-Based Alternative
Unlike BLEU, which focuses on exact word matching, BERTScore leverages
pre-trained models like BERT to compare the semantic meaning of words in
the reference and candidate translations. Instead of matching words directly,
BERTScore measures the similarity between their embeddings—vector
representations that capture the context and meaning of each word.
• Each word in the reference translation is converted into an embedding.
• Each word in the candidate translation is also converted into an
embedding.
• The score is computed by comparing these embeddings using cosine
similarity, which measures how close the meanings of the words are.
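
A toy illustration of the cosine-similarity comparison, using made-up vectors in place of
real BERT embeddings:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented 3-dimensional stand-ins for contextual word embeddings.
cat = np.array([0.9, 0.1, 0.3])
feline = np.array([0.85, 0.15, 0.35])
print(cosine_similarity(cat, feline))  # close to 1.0 for words with similar meaning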
Precision, Recall, and F1 in BERTScore
BERTScore calculates:
• Precision: Measures how many words in the candidate translation
have similar embeddings to the words in the reference translation.
• Recall: Measures how many words in the reference translation have
similar embeddings in the candidate.
• F1-Score: Combines precision and recall, providing a balanced view of
translation quality.
• Because BERTScore uses embeddings, it is much more flexible than
BLEU:
• Synonyms: If the candidate translation uses a word that is a synonym
of a word in the reference, their embeddings will likely be similar,
leading to a high BERTScore.
• Paraphrases: BERTScore can also handle different phrasing or word
orders better than BLEU, as long as the meaning remains similar.
The following example illustrates the flexibility of BERTScore:

• Reference: "The cat is on the mat."
• Candidate: "The feline is on the rug."

• In BLEU, the words "cat" and "feline", as well as "mat" and "rug", would
not be considered matches, leading to a lower score.
• In BERTScore, since "cat" and "feline" are semantically similar, and "mat"
and "rug" have close meanings, their embeddings will be similar, resulting
in a higher score.
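
Assuming the bert-score package is available (pip install bert-score), the example could
be scored along these lines; treat this as a sketch, since the defaults (model choice,
rescaling) depend on the package version:

from bert_score import score

references = ["The cat is on the mat."]
candidates = ["The feline is on the rug."]

# Returns precision, recall and F1 tensors, one entry per candidate sentence.
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P.item():.3f}  R={R.item():.3f}  F1={F1.item():.3f}")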
Source: https://arxiv.org/pdf/1904.09675.pdf
• R_BERT (BERT Recall): This equation calculates the recall score. It
sums the maximum cosine similarity between each token in the
reference text (x_i) and all tokens in the candidate text (x̂_j), then
normalizes by dividing by the length of the reference text (|x|).
• P_BERT (BERT Precision): This equation calculates the precision score.
It's similar to recall, but sums the maximum cosine similarity between
each token in the candidate text (x̂_j) and all tokens in the reference
text (x_i), then normalizes by dividing by the length of the candidate
text (|x̂|).
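
For reference, the equations described above can be written out (reconstructed here from
the cited BERTScore paper, with x the reference tokens, x̂ the candidate tokens, and
embeddings pre-normalized so the inner product equals cosine similarity):

R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j,
\qquad
F_{\mathrm{BERT}} = 2 \cdot \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}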
References
• Foundations of NLP Explained — Bleu Score and WER Metrics, by Ketan Doshi, Towards Data Science
• BLEU - a Hugging Face Space by evaluate-metric
• Understanding BLEU Score: A Metric for Evaluating Machine Translation Quality, by Gaurav Sharma, Medium
• BERTScore Explained in 5 minutes: Evaluating Text Generation with BERT, by Abonia Sojasingarayar, Medium
