
GLUE

General Language Understanding Evaluation


What is GLUE?

The General Language Understanding Evaluation (GLUE) benchmark is designed to measure the performance of language understanding models across a range of natural language processing (NLP) tasks.

It provides a standardized set of diverse NLP tasks, allowing researchers and practitioners
to evaluate and compare the effectiveness of different language models on these tasks.

The collection consists of nine “difficult and diverse” task datasets designed to test a
model’s language understanding and is crucial to understanding how transfer learning
models like BERT are evaluated.
Why was GLUE created?

In the past, NLP models were nearly always designed around performing well on a single, specific task.

The best-performing models had specialized architectures (often a very specific type of LSTM) designed with this one task in mind, were trained end-to-end on this task, and rarely had the same level of success generalizing to other tasks or datasets.
Why Do We Need GLUE?
• Standardization Across Tasks: GLUE standardizes evaluation across a diverse set
of tasks, offering a unified benchmark to measure general language
understanding.
• Generalization: GLUE tests a model's robustness and ability to generalize across a
variety of language tasks, rather than excelling at just one.
• Diverse Task Set: GLUE includes multiple NLP tasks (e.g., sentiment analysis,
paraphrase detection, natural language inference, etc.), covering different
aspects of language understanding.
• Unified Leaderboard: GLUE offers a unified leaderboard, allowing researchers to
compare different models on the same set of tasks, with consistent scoring and
evaluation standards.
• Holistic Model Comparison: GLUE aggregates performance across multiple
datasets, allowing researchers to see how well a model performs on a broad range
of tasks, making model comparison easier and more reliable.
How Does GLUE Work?
• GLUE consists of a collection of nine representative NLP tasks, including sentence classification,
sentiment analysis, and question answering.

• Each task in the benchmark comes with:
• a training set,
• a development set for fine-tuning the models,
• and an evaluation set for testing the performance of the models.

• Participants in the benchmark can submit their models and evaluate their performance on the
GLUE leaderboard, which tracks the progress and advancements in language understanding.
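As a rough illustration of these splits, here is a minimal sketch assuming the Hugging Face datasets library (which hosts the GLUE tasks); the choice of the MRPC task is arbitrary.

from datasets import load_dataset

# Load one GLUE task; any of the nine task names works here ("cola", "sst2", "mrpc", ...).
mrpc = load_dataset("glue", "mrpc")

print(mrpc)               # DatasetDict with train / validation / test splits
print(mrpc["train"][0])   # one labelled sentence pair
# The public test split ships without gold labels; scoring on it is done by submitting
# predictions to the GLUE leaderboard, as described above.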
The Most Important GLUE Use Cases

GLUE has various applications in the field of NLP and machine learning. Some of the important use cases
include:

• Sentiment Analysis: Assessing the sentiment of a given text, such as determining whether a customer review is
positive or negative (a short sketch follows this list).

• Text Classification: Categorizing text into predefined classes or categories based on its content.

• Named Entity Recognition: Identifying and classifying named entities in text, such as person names,
organizations, and locations.

• Text Similarity: Measuring the similarity between two pieces of text, which has applications in information
retrieval and recommendation systems.

• Question Answering: Automatically finding relevant answers to user questions based on a given context or a
set of documents.
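For instance, the sentiment analysis use case maps directly onto SST-2, one of the GLUE tasks. The sketch below assumes the Hugging Face transformers library and uses a publicly available SST-2 fine-tuned checkpoint as an example model choice.

from transformers import pipeline

# An example checkpoint fine-tuned on SST-2 (the GLUE sentiment task);
# any SST-2-style classifier could be substituted here.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The product arrived late, but the quality is outstanding."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]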
What is an NLP task and what is its relevance?

NLP tasks all seek to test a model’s understanding of a specific aspect of language.

For example:
•Named Entity Recognition: which words in a sentence are a proper name, organization name, or entity?

•Textual Entailment: given two sentences, does the first sentence entail or contradict the second sentence?

•Coreference Resolution: given a pronoun like “it” in a sentence that discusses multiple objects, which object does “it”
refer to?
For Example

Suppose you're mainly interested in using BERT to predict the sentiment of customer reviews, so coreference resolution benchmarking isn't obviously something that should interest you.

However, it's helpful to understand that all these discrete tasks, including your own application and a coreference resolution task, are connected aspects of language.

So while you may not care about coreference resolution or textual entailment in isolation, a model that can effectively determine things like which object "it" refers to will, in the aggregate, likely be more effective when it comes time to evaluate an ambiguously worded customer review.

For that reason, when a model like BERT is shown to be effective at understanding and predicting a broad range of these
canonical and fine-grained linguistic tasks, it’s an indication that the model will probably be effective for your application.
Because these models were designed with a specific task or even a specific dataset in
mind, evaluating a model was as simple as evaluating it on the task it was trained for.

Is the NER model good at NER?

However, as people began experimenting with transfer learning and its success in NLP
took off, a new method of evaluation was needed.
Unlike single task models that are designed for and trained end-to-end on a specific
task, transfer learning models extensively train their large network on a generalized
language understanding task.

The idea is that with some extensive generalized pretraining the model gains a good
“understanding” of language in general.

Then, this generalized “understanding” gives us an advantage when it comes time to adjust
our model to tackle a specific task.
Creation and use of GLUE
A team at NYU put together a collection of tasks called GLUE.

When researchers want to evaluate their model, they train and then score the model on all nine tasks, and the resulting average score of those nine tasks is the model's final performance score.

It doesn't matter exactly what the model looks like or how it works, so long as it can process inputs and output predictions on all the tasks.
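As a minimal sketch of that scoring scheme, the snippet below averages nine per-task numbers into a single GLUE score; the values are illustrative placeholders, not real results.

# Hypothetical per-task scores (illustrative only).
task_scores = {
    "CoLA": 55.0, "SST-2": 92.0, "MRPC": 87.0, "STS-B": 85.0, "QQP": 71.0,
    "MNLI": 83.0, "QNLI": 90.0, "RTE": 68.0, "WNLI": 65.0,
}

# The final GLUE score is the unweighted average over the nine tasks.
glue_score = sum(task_scores.values()) / len(task_scores)
print(f"GLUE score: {glue_score:.1f}")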
What are the tasks?
The tasks are meant to cover a diverse and difficult range of NLP problems and are mostly adopted from existing
datasets.
Single-Sentence Tasks

1. CoLA (Corpus of Linguistic Acceptability)


•Goal: determine if a sentence is grammatically correct or not.
•Dataset: it consists of English acceptability judgments drawn from books and journal
articles. Each example is a sequence of words annotated with whether it is a correct
grammatical English sentence or not.

2. SST-2 (Stanford Sentiment Treebank)


•Goal: determine if the sentence has a positive or negative sentiment.
•Dataset: it consists of sentences from movie reviews and binary human annotations
of their sentiment.
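As a minimal sketch (assuming the Hugging Face datasets library), the two single-sentence tasks can be loaded and inspected like this:

from datasets import load_dataset

cola = load_dataset("glue", "cola")   # acceptability: label 1 = acceptable, 0 = unacceptable
sst2 = load_dataset("glue", "sst2")   # sentiment: label 1 = positive, 0 = negative

print(cola["train"][0])   # {'sentence': ..., 'label': ..., 'idx': ...}
print(sst2["train"][0])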
Similarity and Paraphrase Tasks
3. MRPC (Microsoft Research Paraphrase Corpus)
•Goal: determine if two sentences are paraphrases of one another.
•Dataset: it’s a corpus of sentence pairs automatically extracted from online news sources, with human annotations
indicating whether the sentences in the pair are semantically equivalent (i.e. paraphrases).

4. QQP (Quora Question Pairs)


•Goal: determine if two questions are semantically equivalent or not.
•Dataset: it’s a collection of question pairs from the community question-answering website Quora, with human
annotations indicating whether the questions in the pair are actually the same question.

5. STS-B (Semantic Textual Similarity Benchmark)


•Goal: determine the similarity of two sentences with a score from one to five.
•Dataset: it’s a collection of sentence pairs drawn from news headlines, video and image captions, and natural
language inference data. Each pair is annotated by humans with a similarity score from one to five.
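Unlike the other tasks, STS-B is a regression task; GLUE scores it with the Pearson and Spearman correlation between predicted and gold similarity scores. Here is a minimal sketch assuming scipy; the numbers are made-up placeholders.

from scipy.stats import pearsonr, spearmanr

gold        = [5.0, 3.8, 1.2, 1.0, 4.5]   # human similarity annotations (placeholders)
predictions = [4.7, 3.5, 1.9, 1.4, 4.8]   # hypothetical model outputs on the same pairs

pearson, _ = pearsonr(predictions, gold)
spearman, _ = spearmanr(predictions, gold)
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")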
Inference Tasks

6. MNLI (Multi-Genre Natural Language Inference)


•Goal: determine if a sentence entails, contradicts, or is unrelated to another sentence.
•Dataset: it’s a crowdsourced collection of sentence pairs with textual entailment annotations. The premise sentences are
gathered from ten different sources, including transcribed speech, fiction, and government reports. The dataset has two
test sets: a matched (in-domain) and a mismatched (cross-domain) test set. The scores on the matched and mismatched test
sets are then averaged together to give the final score on the MNLI task (a small scoring sketch follows at the end of this section).

7. QNLI (Question-answering Natural Language Inference)


•Goal: determine if the answer to a question is contained in a second sentence or not.
•Dataset: it’s a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the
paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator).
8. RTE (Recognizing Textual Entailment)
•Goal: determine if a sentence entails a given hypothesis or not.
•Dataset: it is a combination of data from annual textual entailment challenges (i.e. from RTE1, RTE2, RTE3, and RTE5).
Examples are constructed based on news and Wikipedia text.

9. WNLI (Winograd Natural Language Inference)


•Goal: determine whether a sentence containing an ambiguous pronoun entails a second sentence in which that pronoun has
been replaced by a candidate referent.
•Dataset: this dataset is built from the Winograd Schema Challenge, a reading comprehension task in which a system must
read a sentence with a pronoun and select the referent of that pronoun from a list of choices. To convert the problem into
sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun
with each possible referent. The examples are manually constructed to foil simple statistical methods: each one is
contingent on contextual information provided by a single word or phrase in the sentence.
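Returning to MNLI's two test sets described above, here is a minimal sketch (assuming the Hugging Face datasets library) of its matched and mismatched splits and of averaging their scores; the accuracies are hypothetical placeholders.

from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
print(mnli["validation_matched"][0])      # in-domain example: premise, hypothesis, label
print(mnli["validation_mismatched"][0])   # cross-domain example

# Hypothetical accuracies from some fine-tuned model, averaged as described above.
acc_matched, acc_mismatched = 0.84, 0.83
mnli_score = (acc_matched + acc_mismatched) / 2
print(f"MNLI score: {mnli_score:.3f}")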
