
GLUE

General Language Understanding Evaluation


What is GLUE?

The General Language Understanding Evaluation (GLUE) benchmark is designed to measure the performance of language understanding models across a range of natural language processing (NLP) tasks.

It provides a standardized set of diverse NLP tasks, allowing researchers and practitioners
to evaluate and compare the effectiveness of different language models on these tasks.

The collection consists of nine “difficult and diverse” task datasets designed to test a
model’s language understanding and is crucial to understanding how transfer learning
models like BERT are evaluated.
Why was GLUE created?

In the past, NLP models were nearly always designed around performing well on a single, specific task.

The best-performing models had specialized architectures (often a very specific type of LSTM) designed with this one task in mind, were trained end-to-end on this task, and rarely had the same level of success generalizing to other tasks or datasets.
Why Do We Need GLUE?
• Standardization Across Tasks: GLUE standardizes evaluation across a diverse set
of tasks, offering a unified benchmark to measure general language
understanding.
• Generalization: GLUE tests a model's robustness and ability to generalize across a
variety of language tasks, rather than excelling at just one.
• Diverse Task Set: GLUE includes multiple NLP tasks (e.g., sentiment analysis,
paraphrase detection, natural language inference, etc.), covering different
aspects of language understanding.
• Unified Leaderboard: GLUE offers a unified leaderboard, allowing researchers to
compare different models on the same set of tasks, with consistent scoring and
evaluation standards.
• Holistic Model Comparison: GLUE aggregates performance across multiple
datasets, allowing researchers to see how well a model performs on a broad range
of tasks, making model comparison easier and more reliable.
How Does GLUE Work?
• GLUE consists of a collection of nine representative NLP tasks, including sentence classification,
sentiment analysis, and question answering.

• Each task in the benchmark comes with:
• a training set,
• a development set for fine-tuning the models,
• and an evaluation set for testing the performance of the models.

• Participants in the benchmark can submit their models and evaluate their performance on the
GLUE leaderboard, which tracks the progress and advancements in language understanding.
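As a rough illustration of these splits, here is a minimal sketch assuming the Hugging Face datasets library (which hosts the GLUE tasks); the choice of the MRPC task is arbitrary.

from datasets import load_dataset

# Load one GLUE task; any of the nine task names works here ("cola", "sst2", "mrpc", ...).
mrpc = load_dataset("glue", "mrpc")

print(mrpc)               # DatasetDict with train / validation / test splits
print(mrpc["train"][0])   # one labelled sentence pair
# The public test split ships without gold labels; scoring on it is done by submitting
# predictions to the GLUE leaderboard, as described above.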
The Most Important GLUE Use Cases

GLUE has various applications in the field of NLP and machine learning. Some of the important use cases
include:

• Sentiment Analysis: Assessing the sentiment of a given text, such as determining whether a customer review is
positive or negative (a short sketch follows this list).

• Text Classification: Categorizing text into predefined classes or categories based on its content.

• Named Entity Recognition: Identifying and classifying named entities in text, such as person names,
organizations, and locations.

• Text Similarity: Measuring the similarity between two pieces of text, which has applications in information
retrieval and recommendation systems.

• Question Answering: Automatically finding relevant answers to user questions based on a given context or a
set of documents.
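For instance, the sentiment analysis use case maps directly onto SST-2, one of the GLUE tasks. The sketch below assumes the Hugging Face transformers library and uses a publicly available SST-2 fine-tuned checkpoint as an example model choice.

from transformers import pipeline

# An example checkpoint fine-tuned on SST-2 (the GLUE sentiment task);
# any SST-2-style classifier could be substituted here.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The product arrived late, but the quality is outstanding."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]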
What is an NLP task and what is its relevance?

NLP tasks all seek to test a model’s understanding of a specific aspect of language.

For example:
•Named Entity Recognition: which words in a sentence are a proper name, organization name, or entity?

•Textual Entailment: given two sentences, does the first sentence entail or contradict the second sentence?

•Coreference Resolution: given a pronoun like “it” in a sentence that discusses multiple objects, which object does “it”
refer to?
For Example

Suppose you're mainly interested in using BERT to predict the sentiment of customer reviews, so coreference resolution benchmarking isn't obviously something that should interest you.

However, it's helpful to understand that all these discrete tasks, including your own application and a coreference resolution task, are connected aspects of language.

So while you may not care about coreference resolution or textual entailment in isolation, a model that can effectively determine things like which object "it" refers to will, in the aggregate, likely be more effective when it comes time to evaluate an ambiguously worded customer review.

For that reason, when a model like BERT is shown to be effective at understanding and predicting a broad range of these
canonical and fine-grained linguistic tasks, it’s an indication that the model will probably be effective for your application.
Because these models were designed with a specific task or even a specific dataset in
mind, evaluating a model was as simple as evaluating it on the task it was trained for.

Is the NER model good at NER?

However, as people began experimenting with transfer learning and its success in NLP
took off, a new method of evaluation was needed.
Unlike single task models that are designed for and trained end-to-end on a specific
task, transfer learning models extensively train their large network on a generalized
language understanding task.

The idea is that with some extensive generalized pretraining the model gains a good
“understanding” of language in general.

Then, this generalized “understanding” gives us an advantage when it comes time to adjust
our model to tackle a specific task.
Creation and use of GLUE
A team at NYU put together a collection of tasks called GLUE.

When researchers want to evaluate their model, they train and then score the model on all nine tasks, and the resulting average score of those nine tasks is the model's final performance score.

It doesn't matter exactly what the model looks like or how it works, so long as it can process inputs and output predictions on all the tasks.
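As a minimal sketch of that scoring scheme, the snippet below averages nine per-task numbers into a single GLUE score; the values are illustrative placeholders, not real results.

# Hypothetical per-task scores (illustrative only).
task_scores = {
    "CoLA": 55.0, "SST-2": 92.0, "MRPC": 87.0, "STS-B": 85.0, "QQP": 71.0,
    "MNLI": 83.0, "QNLI": 90.0, "RTE": 68.0, "WNLI": 65.0,
}

# The final GLUE score is the unweighted average over the nine tasks.
glue_score = sum(task_scores.values()) / len(task_scores)
print(f"GLUE score: {glue_score:.1f}")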
What are the tasks?
The tasks are meant to cover a diverse and difficult range of NLP problems and are mostly adopted from existing
datasets.
Single-Sentence Tasks

1. CoLA (Corpus of Linguistic Acceptability)


•Goal: determine if a sentence is grammatically correct or not.
•Dataset: it consists of English acceptability judgments drawn from books and journal
articles. Each example is a sequence of words annotated with whether it is a correct
grammatical English sentence or not.

2. SST-2 (Stanford Sentiment Treebank)


•Goal: determine if the sentence has a positive or negative sentiment.
•Dataset: it consists of sentences from movie reviews and binary human annotations
of their sentiment.
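As a minimal sketch (assuming the Hugging Face datasets library), the two single-sentence tasks can be loaded and inspected like this:

from datasets import load_dataset

cola = load_dataset("glue", "cola")   # acceptability: label 1 = acceptable, 0 = unacceptable
sst2 = load_dataset("glue", "sst2")   # sentiment: label 1 = positive, 0 = negative

print(cola["train"][0])   # {'sentence': ..., 'label': ..., 'idx': ...}
print(sst2["train"][0])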
Similarity and Paraphrase Tasks
3. MRPC (Microsoft Research Paraphrase Corpus)
•Goal: determine if two sentences are paraphrases of one another.
•Dataset: it’s a corpus of sentence pairs automatically extracted from online news sources, with human annotations
indicating whether the sentences in the pair are semantically equivalent (i.e. paraphrases).

4. QQP (Quora Question Pairs)


•Goal: determine if two questions are semantically equivalent or not.
•Dataset: it’s a collection of question pairs from the community question-answering website Quora, with human
annotations indicating whether the questions in the pair are actually the same question.

5. STS-B (Semantic Textual Similarity Benchmark)


•Goal: determine the similarity of two sentences with a score from one to five.
•Dataset: it’s a collection of sentence pairs drawn from news headlines, video and image captions, and natural
language inference data. Each pair is annotated by humans with a similarity score from one to five.
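Unlike the other tasks, STS-B is a regression task; GLUE scores it with the Pearson and Spearman correlation between predicted and gold similarity scores. Here is a minimal sketch assuming scipy; the numbers are made-up placeholders.

from scipy.stats import pearsonr, spearmanr

gold        = [5.0, 3.8, 1.2, 1.0, 4.5]   # human similarity annotations (placeholders)
predictions = [4.7, 3.5, 1.9, 1.4, 4.8]   # hypothetical model outputs on the same pairs

pearson, _ = pearsonr(predictions, gold)
spearman, _ = spearmanr(predictions, gold)
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")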
Inference Tasks

6. MNLI (Multi-Genre Natural Language Inference)


•Goal: determine if a sentence entails, contradicts, or is unrelated to another sentence.
•Dataset: it’s a crowdsourced collection of sentence pairs with textual entailment annotations. The premise sentences are
gathered from ten different sources, including transcribed speech, fiction, and government reports. The dataset has two
test sets: a matched (in-domain) and a mismatched (cross-domain) test set. The scores on the matched and mismatched test
sets are then averaged together to give the final score on the MNLI task (a small scoring sketch follows at the end of this section).

7. QNLI (Question-answering Natural Language Inference)


•Goal: determine if the answer to a question is contained in a second sentence or not.
•Dataset: it’s a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the
paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator).
8. RTE (Recognizing Textual Entailment)
•Goal: determine if a sentence entails a given hypothesis or not.
•Dataset: it is a combination of data from annual textual entailment challenges (i.e. from RTE1, RTE2, RTE3, and RTE5).
Examples are constructed based on news and Wikipedia text.

9. WNLI (Winograd Natural Language Inference)


•Goal: determine whether a sentence containing an ambiguous pronoun entails a second sentence in which that pronoun has
been replaced by a candidate referent.
•Dataset: this dataset is built from the Winograd Schema Challenge, a reading comprehension task in which a system must
read a sentence with a pronoun and select the referent of that pronoun from a list of choices. To convert the problem into
sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun
with each possible referent. The examples are manually constructed to foil simple statistical methods: each one is
contingent on contextual information provided by a single word or phrase in the sentence.
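Returning to MNLI's two test sets described above, here is a minimal sketch (assuming the Hugging Face datasets library) of its matched and mismatched splits and of averaging their scores; the accuracies are hypothetical placeholders.

from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
print(mnli["validation_matched"][0])      # in-domain example: premise, hypothesis, label
print(mnli["validation_mismatched"][0])   # cross-domain example

# Hypothetical accuracies from some fine-tuned model, averaged as described above.
acc_matched, acc_mismatched = 0.84, 0.83
mnli_score = (acc_matched + acc_mismatched) / 2
print(f"MNLI score: {mnli_score:.3f}")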
