
ml6team/keyphrase-generation-keybart-inspec · Hugging Face

Text2Text Generation · Transformers · PyTorch · midas/inspec · English · bart · keyphrase-generation · arxiv:2112.08547 · License: MIT

Inference API example (Text2Text Generation)

Input:
The contextualized representation of pre-trained language models (PLMs) has been highly successful in Keyphrase Extraction (KPE) tasks, achieving state-of-the-art (SotA) results. However, due to the limited context within which they operate, extracting keyphrases from lengthy documents poses a challenge, as the importance of the keyphrase may rely on long-term dependencies that the PLM is not able to capture. To overcome this limitation, we present an attention expansion mechanism that leverages pre-trained word embeddings to allow the PLM to consider words beyond its contextual boundaries, thereby enhancing the representation of words for KPE. To evaluate the efficacy of our approach, we fine-tuned multiple PLMs on publicly available long document KPE datasets, comparing results with and without our attention expansion mechanism. PLMs with the expansion mechanism consistently outperformed state-of-the-art models, exhibiting significant improvements in their F1 score (a metric that harmoniously combines precision and recall to provide a comprehensive measure of a model's accuracy) across all datasets.

Generated keyphrases: contextualized representation ; pre-trained language models ; keyphrase extraction ;


Evaluation results (self-reported)

F1@M (Present) on Inspec: 0.361
F1@O (Present) on Inspec: 0.329
F1@M (Absent) on Inspec: 0.083
F1@O (Absent) on Inspec: 0.080


🔑 Keyphrase Generation Model: KeyBART-inspec


Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Keyphrase extraction was originally done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.

Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.

📓 Model Description

https://huggingface.co/ml6team/keyphrase-generation-keybart-inspec?text=The+contextualized+representation+of+pre-trained+language+models+%… 2/9
3/13/24, 7:51 AM ml6team/keyphrase-generation-keybart-inspec · Hugging Face

This model uses KeyBART as its base model and fine-tunes it on the Inspec dataset. KeyBART focuses on learning a better representation of keyphrases in a generative setting: it reproduces the keyphrases associated with the input document from a corrupted version of that input, where the corruption consists of token masking, keyphrase masking and keyphrase replacement. This model can already be used without any fine-tuning, but can be fine-tuned if needed. You can find more information about the architecture in this paper.

Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning
Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).
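
The checkpoint is a standard seq2seq model on the Hugging Face Hub, so as a minimal sketch (not taken from the model card) it can also be loaded with the generic transformers text2text-generation pipeline; the custom pipeline shown under "How To Use" below only adds splitting of the generated string on the separator token.

# Minimal sketch, not from the model card: load the checkpoint with the generic
# text2text-generation pipeline and split the generated string on ";" manually.
from transformers import pipeline

generator = pipeline(
    "text2text-generation",
    model="ml6team/keyphrase-generation-keybart-inspec",
)

abstract = (
    "Keyphrase extraction is a technique in text analysis where you extract "
    "the important keyphrases from a document."
)
generated = generator(abstract)[0]["generated_text"]
keyphrases = [kp.strip() for kp in generated.split(";")]
print(keyphrases)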

✋ Intended Uses & Limitations

🛑 Limitations

This keyphrase generation model is very domain-specific and will perform very well on abstracts of scientific papers. It is not recommended to use this model for other domains, but you are free to test it out.

It only works for English documents.

❓ How To Use

# Model parameters
from transformers import (
    Text2TextGenerationPipeline,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)


# Custom pipeline: generates one string of keyphrases per document and splits it
# on the separator token into a list of keyphrases.
class KeyphraseGenerationPipeline(Text2TextGenerationPipeline):
    def __init__(self, model, keyphrase_sep_token=";", *args, **kwargs):
        super().__init__(
            model=AutoModelForSeq2SeqLM.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )
        self.keyphrase_sep_token = keyphrase_sep_token

    def postprocess(self, model_outputs):
        results = super().postprocess(
            model_outputs=model_outputs
        )
        return [
            [
                keyphrase.strip()
                for keyphrase in result.get("generated_text").split(
                    self.keyphrase_sep_token
                )
            ]
            for result in results
        ]


# Load pipeline
model_name = "ml6team/keyphrase-generation-keybart-inspec"
generator = KeyphraseGenerationPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time.

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = generator(text)

print(keyphrases)

# Output
[['keyphrase extraction', 'text analysis', 'keyphrases', 'human annotators', ...]]
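
Because the custom pipeline inherits from Text2TextGenerationPipeline, standard generation arguments can be passed at call time. A small sketch (the parameter values are illustrative choices, not settings from the model card):

# Illustrative only: beam search and a larger output budget can surface more
# keyphrases; these values are assumptions, not recommendations from the card.
keyphrases = generator(
    text,
    num_beams=4,     # beam search instead of greedy decoding
    max_length=64,   # allow a longer generated keyphrase string
)
print(keyphrases)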


📚 Training Dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the scientific domains of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases are annotated by professional indexers or editors.

You can find more information in the paper.
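
As a quick way to inspect the data (a minimal sketch; the subset name and column names are the ones used in the preprocessing code below), each sample stores the document as a list of words together with its extractive and abstractive keyphrases:

# Sketch: load the Inspec dataset and look at one test sample.
from datasets import load_dataset

inspec = load_dataset("midas/inspec", "raw")

sample = inspec["test"][0]
print(sample["document"][:20])            # the document is a list of words
print(sample["extractive_keyphrases"])    # keyphrases that appear in the text
print(sample["abstractive_keyphrases"])   # keyphrases that do not appear verbatim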

👷‍♂️ Training Procedure

Training Parameters

Parameter                  Value
Learning Rate              5e-5
Epochs                     15
Early Stopping Patience    1
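
The training script itself is not part of the model card; a minimal sketch of how these parameters could be wired into a Hugging Face Seq2SeqTrainer (everything beyond the three values above is an assumption):

# Minimal sketch only; the actual training setup is not documented in the card.
from transformers import (
    AutoModelForSeq2SeqLM,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")  # base model

training_args = Seq2SeqTrainingArguments(
    output_dir="keyphrase-generation-keybart-inspec",
    learning_rate=5e-5,           # Learning Rate from the table above
    num_train_epochs=15,          # Epochs from the table above
    evaluation_strategy="epoch",  # assumption: evaluate and save once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],      # produced by the preprocessing code below
    eval_dataset=tokenized_dataset["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # patience from the table
)
# trainer.train()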

Preprocessing

The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only thing that still needs to be done is tokenization and joining all keyphrases into one string with a chosen separator (here: ;).

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART", add_prefix_space=True)

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"

keyphrase_sep_token = ";"


def preprocess_keyphrases(text_ids, kp_list):
    # Order the keyphrases by their first occurrence in the (lowercased) text
    # and separate present from absent keyphrases.
    kp_order_list = []
    kp_set = set(kp_list)
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    text = text.lower()
    for kp in kp_set:
        kp = kp.strip()
        kp_index = text.find(kp.lower())
        kp_order_list.append((kp_index, kp))

    kp_order_list.sort()
    present_kp, absent_kp = [], []

    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp


def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = preprocess_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        keyphrases = present_kp
        keyphrases += absent_kp

        target_text = f" {keyphrase_sep_token} ".join(keyphrases)

        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                target_text, max_length=40, padding="max_length", truncation=True
            )
            targets["input_ids"] = [
                (t if t != tokenizer.pad_token_id else -100)
                for t in targets["input_ids"]
            ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples


# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)
# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Postprocessing

For the post-processing, you will need to split the string based on the keyphrase
separator.

def extract_keyphrases(examples):
    return [example.split(keyphrase_sep_token) for example in examples]
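
For example (the input strings below are made up for illustration, not actual model output):

# Illustrative usage of extract_keyphrases with keyphrase_sep_token = ";".
generated = ["keyphrase extraction;deep learning", "semantic search;embeddings"]
print(extract_keyphrases(generated))
# [['keyphrase extraction', 'deep learning'], ['semantic search', 'embeddings']]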

📝 Evaluation results

Traditional evaluation methods are precision, recall and F1-score @k,m, where k stands for the first k predicted keyphrases and m for the average number of predicted keyphrases. In keyphrase generation you also look at F1@O, where O stands for the number of ground-truth keyphrases.
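
A toy sketch of how F1@k is computed (the lists and scores below are illustrative, not evaluation data):

# Toy illustration of F1@k; the keyphrase lists are made up.
def f1_at_k(predicted, gold, k):
    topk = predicted[:k]
    matched = len(set(topk) & set(gold))
    if matched == 0:
        return 0.0
    precision = matched / len(topk)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["keyphrase extraction", "deep learning", "semantic search", "embeddings"]
predicted = ["keyphrase extraction", "nlp", "deep learning", "text", "models"]

print(round(f1_at_k(predicted, gold, 5), 2))          # F1@5 -> 0.44
print(round(f1_at_k(predicted, gold, len(gold)), 2))  # F1@O (O = 4) -> 0.5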

The model achieves the following results on the Inspec test set:

Extractive Keyphrases


Dataset           P@5    R@5    F1@5   P@10   R@10   F1@10   P@M    R@M    F1@M   P@O    R@O    F1@O
Inspec Test Set   0.40   0.37   0.35   0.20   0.37   0.24    0.42   0.37   0.36   0.33   0.33   0.33

Abstractive Keyphrases

Dataset           P@5    R@5    F1@5   P@10   R@10   F1@10   P@M    R@M    F1@M   P@O    R@O    F1@O
Inspec Test Set   0.07   0.12   0.08   0.03   0.12   0.05    0.08   0.12   0.08   0.08   0.12   0.08

🚨 Issues

Please feel free to start discussions in the Community Tab.
