ml6team/keyphrase-generation-keybart-inspec
License: mit
Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. With deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.
📓 Model Description
This model uses KeyBART as its base model and fine-tunes it on the Inspec dataset.
KeyBART focuses on learning a better representation of keyphrases in a generative
setting. It produces the keyphrases associated with the input document from a
corrupted input. The input is changed by token masking, keyphrase masking and
keyphrase replacement. KeyBART itself can already be used without any fine-tuning, but it can be fine-tuned if needed. You can find more information about the architecture in the paper referenced below.
Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning
Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).
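As a rough illustration of the corruption scheme described above, the sketch below shows what the three corruption operations could look like on a toy document. This is illustrative only, not the actual KeyBART pre-training code, and all example strings are invented.

# Toy illustration of KeyBART-style input corruption (not the official implementation)
document = "keyphrase extraction is a technique in text analysis"
keyphrases = ["keyphrase extraction", "text analysis"]

# Token masking: individual tokens are replaced with a mask token
token_masked = "keyphrase extraction is a <mask> in text analysis"

# Keyphrase masking: an entire keyphrase is replaced with a mask token
keyphrase_masked = "<mask> is a technique in text analysis"

# Keyphrase replacement: a keyphrase is swapped for a different phrase
keyphrase_replaced = "topic modeling is a technique in text analysis"

# From such corrupted inputs, the model learns to generate the original keyphrase sequence
target = "keyphrase extraction;text analysis"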
🛑 Limitations
This keyphrase generation model is very domain-specific and will perform very
well on abstracts of scientific papers. It's not recommended to use this model for
other domains, but you are free to test it out.
❓ How To Use
# Imports
from transformers import (
Text2TextGenerationPipeline,
AutoModelForSeq2SeqLM,
AutoTokenizer,
)
class KeyphraseGenerationPipeline(Text2TextGenerationPipeline):
    def __init__(self, model, keyphrase_sep_token=";", *args, **kwargs):
        super().__init__(
            model=AutoModelForSeq2SeqLM.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )
        self.keyphrase_sep_token = keyphrase_sep_token

    def postprocess(self, model_outputs):
        # Split the generated text on the separator token to return a list of keyphrases
        results = super().postprocess(model_outputs=model_outputs)
        return [
            [kp.strip() for kp in result["generated_text"].split(self.keyphrase_sep_token) if kp.strip()]
            for result in results
        ]
# Load pipeline
model_name = "ml6team/keyphrase-generation-keybart-inspec"
generator = KeyphraseGenerationPipeline(model=model_name)
# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract
important keyphrases from a document. Thanks to these keyphrases humans
understand the content of a text very quickly and easily without readin
completely. Keyphrase extraction was first done primarily by human anno
who read the text in detail and then wrote down the most important keyp
The disadvantage is that if you work with a lot of documents, this proc
can take a lot of time.
keyphrases = generator(text)
print(keyphrases)
# Output
[['keyphrase extraction', 'text analysis', 'keyphrases', 'human annotators']]
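The pipeline also accepts a list of documents and forwards extra generation keyword arguments to the underlying Text2TextGenerationPipeline. A usage sketch; the second document string below is a placeholder, not from the model card:

# Batch inference over several documents (second text is a placeholder)
documents = [text, "Another abstract about convolutional neural networks for image classification."]
all_keyphrases = generator(documents)

# Generation parameters such as beam search can be tuned per call
keyphrases_beam = generator(text, num_beams=5)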
📚 Training Dataset
The model was fine-tuned on the midas/inspec dataset, which contains abstracts of scientific papers annotated with extractive and abstractive keyphrases.
Training Parameters

| Parameter | Value |
|:----------|------:|
| Epochs    | 15    |
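A minimal sketch of how the epoch count above could be plugged into a standard Hugging Face Seq2SeqTrainingArguments object. Only num_train_epochs comes from the table; the output directory is a placeholder and the remaining training settings are not documented here.

from transformers import Seq2SeqTrainingArguments

# Only num_train_epochs is taken from the table above; output_dir is a placeholder
training_args = Seq2SeqTrainingArguments(
    output_dir="keyphrase-generation-keybart-inspec",
    num_train_epochs=15,
)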
Preprocessing
The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only things that still need to be done are tokenization and joining all keyphrases into one string with a certain separator of choice (;).
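For example, joining with the ; separator produces a target string like the following (the keyphrases here are made up for illustration):

# Illustrative example of the target string format (made-up keyphrases)
keyphrase_sep_token = ";"
target = keyphrase_sep_token.join(["keyphrase extraction", "text analysis"])
# -> "keyphrase extraction;text analysis"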
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART", add_prefix_space=True)
# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
keyphrase_sep_token = ";"
# Helper used below: splits the annotated keyphrases into those that occur in the
# (decoded) document text and those that do not, ordering present ones by position
def preprocess_keyphrases(text_ids, kp_list):
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    ).lower()
    kp_order_list = []
    for kp in set(kp_list):
        kp = kp.strip()
        kp_order_list.append((text.find(kp.lower()), kp))

    kp_order_list.sort()
    present_kp, absent_kp = [], []
    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp
def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        # The documents are stored as lists of words
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = preprocess_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        keyphrases = present_kp
        keyphrases += absent_kp
        # Join all keyphrases into one target string with the separator token
        target_text = keyphrase_sep_token.join(keyphrases)
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                target_text, max_length=40, padding="max_length", truncation=True
            )
        # Ignore padding tokens in the loss by setting their label ids to -100
        targets["input_ids"] = [
            (t if t != tokenizer.pad_token_id else -100)
            for t in targets["input_ids"]
        ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples
# Load dataset
from datasets import load_dataset

dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
Postprocessing
For the post-processing, you will need to split the string based on the keyphrase
separator.
def extract_keyphrases(examples):
    return [example.split(keyphrase_sep_token) for example in examples]
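Applied to the raw generated strings, this returns one list of keyphrases per example. A small usage sketch with a made-up generated string:

# Example usage with a made-up generated string
raw_outputs = ["keyphrase extraction;text analysis;keyphrases"]
print(extract_keyphrases(raw_outputs))
# [['keyphrase extraction', 'text analysis', 'keyphrases']]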
📝 Evaluation results
Traditional evaluation metrics are precision, recall and F1-score @k,m, where k stands for the first k predicted keyphrases that are considered and m for the average number of predicted keyphrases. In keyphrase generation you also look at F1@O, where O stands for the number of ground-truth keyphrases.
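A toy computation of these metrics for a single document; the predictions and ground truth below are made up and not taken from the evaluation:

# Toy example of precision/recall/F1 at k for one document (made-up data)
predicted = ["neural networks", "deep learning", "keyphrase extraction", "text mining", "nlp"]
ground_truth = {"keyphrase extraction", "deep learning", "semantics"}

k = 5
top_k = predicted[:k]
hits = sum(kp in ground_truth for kp in top_k)

precision_at_k = hits / len(top_k)       # 2 / 5 = 0.40
recall_at_k = hits / len(ground_truth)   # 2 / 3 ≈ 0.67
f1_at_k = 2 * precision_at_k * recall_at_k / (precision_at_k + recall_at_k)  # = 0.50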
The model achieves the following results on the Inspec test set:
Extractive Keyphrases
| Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M | P@O | R@O |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Inspec Test Set | 0.40 | 0.37 | 0.35 | 0.20 | 0.37 | 0.24 | 0.42 | 0.37 | 0.36 | 0.33 | 0.33 |
Abstractive Keyphrases
| Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M | P@O | R@O |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Inspec Test Set | 0.07 | 0.12 | 0.08 | 0.03 | 0.12 | 0.05 | 0.08 | 0.12 | 0.08 | 0.08 | 0.12 |
🚨 Issues