GLUE
GLUE (the General Language Understanding Evaluation benchmark) provides a standardized set of diverse NLP tasks, allowing researchers and practitioners
to evaluate and compare the effectiveness of different language models on these tasks.
The collection consists of nine “difficult and diverse” task datasets designed to test a
model’s language understanding and is crucial to understanding how transfer learning
models like BERT are evaluated.
Why was GLUE created?
GLUE was created to give the field a single, model-agnostic yardstick for general language understanding, rather than evaluating each model only on the one task it was built for.
• Participants in the benchmark can submit their models and evaluate their performance on the
GLUE leaderboard, which tracks progress and advances in language understanding.
The Most Important GLUE Use Cases
GLUE has various applications in the field of NLP and machine learning. Some of the important use cases
include (a short data-loading sketch follows this list):
• Sentiment Analysis: Assessing the sentiment of a given text, such as determining whether a customer review is
positive or negative.
• Text Classification: Categorizing text into predefined classes or categories based on its content.
• Named Entity Recognition: Identifying and classifying named entities in text, such as person names,
organizations, and locations.
• Text Similarity: Measuring the similarity between two pieces of text, which has applications in information
retrieval and recommendation systems.
• Question Answering: Automatically finding relevant answers to user questions based on a given context or a
set of documents.
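To make these use cases concrete, here is a minimal sketch of loading GLUE task data. It assumes the Hugging Face `datasets` library, which is one common way to access the benchmark (the data is also available directly from gluebenchmark.com):

```python
from datasets import load_dataset

# "sst2" is GLUE's sentiment analysis task; other configs include
# "mnli"/"rte" (textual entailment), "stsb" (text similarity),
# and "qnli" (question answering).
sst2 = load_dataset("glue", "sst2")

# Each split is a labeled dataset ready for training or evaluation.
print(sst2["train"][0])
# e.g. {'sentence': 'hide new secretions ...', 'label': 0, 'idx': 0}
```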
What is an NLP task, and what is its relevance?
NLP tasks all seek to test a model’s understanding of a specific aspect of language.
For example:
• Named Entity Recognition: which words in a sentence are a proper name, organization name, or entity?
• Textual Entailment: given two sentences, does the first sentence entail or contradict the second? (See the sketch after this list.)
• Coreference Resolution: given a pronoun like “it” in a sentence that discusses multiple objects, which object does “it”
refer to?
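Textual entailment, for instance, can be probed directly with an off-the-shelf model. The sketch below uses the publicly available roberta-large-mnli checkpoint (MNLI is one of GLUE's entailment tasks); the example sentences are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# roberta-large-mnli is a RoBERTa model fine-tuned on GLUE's MNLI task.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Entailment models take the two sentences as a pair.
inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# This checkpoint's labels: CONTRADICTION, NEUTRAL, ENTAILMENT.
print(model.config.id2label[logits.argmax(dim=-1).item()])  # -> ENTAILMENT
```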
For Example
Suppose you're mainly interested in using BERT to predict the sentiment of customer reviews; coreference resolution
benchmarking isn't obviously something that should interest you.
However, it's helpful to understand that all of these discrete tasks, including your own application and a coreference
resolution task, are connected aspects of language. So while you may not care about coreference resolution or textual
entailment in themselves, a model that can effectively determine things like which object “it” refers to will, in the
aggregate, likely be more effective when it comes time to evaluate an ambiguously worded customer review.
For that reason, when a model like BERT is shown to be effective at understanding and predicting a broad range of these
canonical and fine-grained linguistic tasks, it’s an indication that the model will probably be effective for your application.
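As a quick illustration of that end goal, here is a minimal sentiment check using the Hugging Face transformers pipeline; by default it loads a DistilBERT checkpoint fine-tuned on GLUE's SST-2 sentiment task (the review text is made up):

```python
from transformers import pipeline

# The default "sentiment-analysis" pipeline is a DistilBERT model
# fine-tuned on SST-2, GLUE's sentiment analysis dataset.
clf = pipeline("sentiment-analysis")

# A deliberately ambiguous, made-up customer review.
print(clf("The packaging was lovely, but it broke the first time I used it."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```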
Historically, models were designed with a specific task, or even a specific dataset, in
mind, so evaluating a model was as simple as evaluating it on the task it was trained for.
However, as people began experimenting with transfer learning and its success in NLP
took off, a new method of evaluation was needed.
Unlike single task models that are designed for and trained end-to-end on a specific
task, transfer learning models extensively train their large network on a generalized
language understanding task.
The idea is that with some extensive generalized pretraining the model gains a good
“understanding” of language in general.
Then, this generalized “understanding” gives us an advantage when it comes time to adjust
our model to tackle a specific task.
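Here is a minimal sketch of that pretrain-then-fine-tune recipe, assuming the Hugging Face transformers and datasets libraries (the hyperparameters are illustrative, not tuned):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a generically pretrained BERT; a fresh 2-way classification
# head is added on top for the downstream task.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Fine-tune on SST-2, GLUE's sentiment task.
ds = load_dataset("glue", "sst2")
ds = ds.map(lambda b: tok(b["sentence"], truncation=True,
                          padding="max_length", max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
)
trainer.train()            # adjusts the pretrained weights to the task
print(trainer.evaluate())  # metrics on the validation split
```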
Creation and use of GLUE
To meet this need, a team at NYU put together a collection of tasks called GLUE.