BERT fine-tuned CORD-19 NER Dataset

Citation Author(s):
Shin Thant (Asian Institute of Technology)
Chutiporn Anutariya (Asian Institute of Technology)
Frederic Andres (National Institute of Informatics)
Teeradaj Racharak (Japan Advanced Institute of Science and Technology)
Submitted by:
Frederic Andres
DOI:
10.21227/m7gj-ks21

Abstract

This named-entity dataset was built by applying the widely used Large Language Model (LLM) BERT to the CORD-19 biomedical literature corpus. Fine-tuning the pre-trained BERT on the CORD-NER dataset gives the model an understanding of the context and semantics of biomedical named entities; the fine-tuned model is then applied to CORD-19 to extract more contextually relevant and up-to-date named entities. Fine-tuning LLMs on large datasets is computationally challenging, however, so two distinct sampling strategies are used. First, for the NER task on CORD-19, Latent Dirichlet Allocation (LDA) topic modeling selects content concentrated on related topics while preserving sentence structure. Second, a straightforward greedy method gathers the most informative examples covering 25 entity types from the CORD-NER dataset.
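
To make the setup concrete, here is a minimal sketch of fine-tuning BERT for token classification, as the abstract describes, using the Hugging Face transformers library. The checkpoint, label subset, toy sentence, and hyperparameters are illustrative assumptions rather than the authors' exact configuration (the released dataset covers 25 entity types).

```python
"""Minimal sketch: fine-tune BERT for token classification (NER).
Labels, data, and hyperparameters are illustrative only."""
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label subset; the actual dataset covers 25 entity types.
labels = ["O", "B-DISEASE_OR_SYNDROME", "I-DISEASE_OR_SYNDROME"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

# One toy sentence with word-level tags (CoNLL-style).
words = ["COVID-19", "causes", "severe", "pneumonia"]
tags = ["B-DISEASE_OR_SYNDROME", "O", "O", "B-DISEASE_OR_SYNDROME"]

# Tokenize and align word-level labels to subword tokens;
# continuations of a word get -100 so the loss ignores them.
enc = tokenizer(words, is_split_into_words=True,
                truncation=True, return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else label2id[tags[wid]])
    prev = wid
enc["labels"] = torch.tensor([aligned])

# One gradient step; a real run iterates over the sampled corpus.
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**enc).loss
loss.backward()
optim.step()
print(f"loss: {loss.item():.4f}")
```

The LDA-based sampling step can be sketched in the same spirit. Using scikit-learn (an assumed library choice; the abstract does not name one), a topic model is fit over the corpus and only documents whose dominant topic matches a target topic are kept, concentrating the fine-tuning data on related content:

```python
"""Sketch of LDA-based corpus sampling; the documents and the
choice of target topic are illustrative."""
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "coronavirus spike protein binds the ACE2 receptor",
    "patients presented with fever cough and pneumonia",
    "the clinical trial evaluated remdesivir efficacy",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic weights

# Keep whole documents whose dominant topic is the target one,
# so the sentence structure inside each document stays intact.
target = 0
sampled = [d for d, w in zip(docs, doc_topics) if w.argmax() == target]
print(sampled)
```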

Instructions:

This NER dataset can be used with any supervised, unsupervised, or deep learning approach.

Because the dataset is auto-generated by the fine-tuned BERT model, its annotations are not 100% accurate. This makes it well suited to approaches designed to handle noisy or weakly labeled data.
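
Since the exact file format is not stated above, here is a minimal loader sketch assuming a CoNLL-style layout (one token and tag per line, blank lines separating sentences); the filename is hypothetical, so check the downloaded files for the actual format.

```python
"""Minimal loader sketch for a CoNLL-style NER file
(assumed layout; verify against the actual download)."""
from pathlib import Path

def read_conll(path):
    """Yield (tokens, tags) pairs, one pair per sentence."""
    tokens, tags = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:              # a blank line ends the current sentence
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: entity tag
    if tokens:                    # file may lack a trailing blank line
        yield tokens, tags

# Usage (hypothetical filename):
# for tokens, tags in read_conll("cord19_bert_ner.conll"):
#     print(list(zip(tokens, tags)))
```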