1. Introduction
This paper contributes to the creation of a dataset citation network, a knowledge graph linking datasets to the scientific articles in which they are used. Unlike the citation network of papers, the dataset citation infrastructure is still primitive, due to the limited referencing of dataset usage in scientific articles [1,2,3,4]. The use and value of such a dataset citation network is similar to that of the ordinary scientific citation network: giving recognition to dataset providers by computing citation-based impact scores of datasets [2,4], ranking datasets in dataset search engines by impact [1], representing a dataset by its use instead of its metadata and content [4,5], studying co-occurrences of datasets, etc. According to Kratz and Strasser [2], researchers believe that the citation count is the most valuable way to measure the impact of a dataset.
Creating the dataset citation network from a collection of articles involves three main steps: parsing the scientific PDFs, recognizing and extracting the mentioned datasets, and cross-document coreference resolution (“dataset name de-duplication”). This paper is only concerned with the dataset extraction step, which we view as a Named Entity Recognition (NER) task. Looking at the articles that use an NER method for this task, it becomes clear that almost every article uses a different approach and a different dataset. Not only do these approaches differ, they also deviate from the core dataset NER task, as every approach adds something extra on top of it [3,6,7,8,9,10,11,12,13,14]. This makes it hard to compare which method or component fits the task best. According to Beltagy et al. [15], SciBERT has shown state-of-the-art results on one of the datasets (SciERC), while other methods have outperformed this score on a similar task and dataset [9,10]. To be able to fully compare the performance and annotation costs of each (basic) model, we train and test them all on the same dataset. This results in the following research question:
- RQ: Which Named Entity Recognition model is best suited for the dataset name recognition task in scientific articles, considering both performance and annotation costs?
Comparing the performance of each method in a single run is not sufficient, as [16] showed that annotation choices have a considerable impact on a system’s performance. This effect was neglected in the aforementioned papers on the dataset extraction task. These choices can affect not only the models’ performance but also the annotation costs. We consider a number of factors that could influence both. First, domain transfer is considered, as it has been shown to impact NER performance [17,18,19] and has become a trending topic in NER in an effort to reduce the amount of training data needed [20]. Another factor taken into account is the training set size, as multiple sources have shown that a small amount of training data can lead to performance problems [7,21]. Next to the size of the training set, the effect of the distribution of positive and negative samples is considered, as this, too, has been shown to influence performance [22,23,24,25]. These choices all influence the amount of training data needed to achieve the best performance, and thus the annotation costs, since adding ‘real examples’ is costly. To further reduce the annotation costs, we investigate the effect of adding weakly supervised examples to the training data of the best performing model. Summing up, we answer the following questions:
- RQ1: What is the performance of rule-based, CRF, BiLSTM, BiLSTM-CRF [26,27,28], BERT [29] and SciBERT [15] models on the dataset name recognition task?
- RQ2: How well do the models perform when tested on a scientific (sub)domain that is not included in the training data?
- RQ3: How does the amount of training data impact the models’ performance?
- RQ4: How is the models’ performance affected by the ratio of negative to positive examples in the training and test data?
- RQ5: Does adding weakly supervised examples further improve the scores of the best performing model? Additionally, how well does the best performing model perform without any manually annotated labels?
- RQ6: Is there a difference in the performance of NER models when predicting easy (short) or hard (long) dataset mentions?
To answer these questions on realistic input data, we created a hand-annotated dataset of 6000 sentences based on four sets of conferences in the fields of neural machine learning, data mining, information retrieval and computer vision (see Section 3.1). NER can be evaluated in many ways. We mostly use the strictest and most realistic one, that is, exact match on a zero-shot test set. We note, however, that, due to enumerations and ellipses, many NER hits contain several datasets, which makes the partial and B-match scores also useful (as the found NER hits have to be post-processed anyway).
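To illustrate the difference between the evaluation modes, the following minimal sketch computes entity-level exact-match F1 with the seqeval library and a looser B-match score that only requires the entity start to be correct; the toy sentences and the B-DATA/I-DATA label names are placeholders, not our actual annotation scheme.

```python
# Exact match vs. B-match on IOB2 tags (toy example, made-up labels).
from seqeval.metrics import f1_score

gold = [["O", "B-DATA", "I-DATA", "O"], ["O", "O", "B-DATA"]]
pred = [["O", "B-DATA", "O", "O"],      ["O", "O", "B-DATA"]]

# Exact match: an entity only counts if its full span is reproduced.
print("exact-match F1:", f1_score(gold, pred))

# B-match: count a hit whenever the predicted entity start (B tag) is correct.
def b_match_f1(gold, pred):
    tp = sum(g == p == "B-DATA" for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
    n_gold = sum(t == "B-DATA" for s in gold for t in s)
    n_pred = sum(t == "B-DATA" for s in pred for t in s)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print("B-match F1:", b_match_f1(gold, pred))
```

On this toy example, the exact-match F1 is 0.5 (one span is truncated) while the B-match F1 is 1.0, showing how the two scores diverge for long mentions.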
Our main findings are that SciBERT performs best, particularly on realistic test data (with >90% of sentences not mentioning a dataset). Surprisingly, our own rule-based system (using POS tags and keywords) performed almost as well, and all other models, except BERT, perform (much) worse than this rule-based system. SciBERT was also robust in the other tests we performed, regarding domain adaptability, the negative sample ratio and the training set size. However, nothing comes for free: we did not succeed in training SciBERT to outperform the rule-based system when we gave it only weakly supervised training examples (obtained without manual annotation).
2. Related Work
The overwhelming volume of scientific papers has made extracting knowledge from them an unmanageable task [30], making automatic IE especially relevant for this domain [31]. Scientific IE has been of interest since the early 1990s [32]. Despite the growing interest in the automatic extraction of scientific information, research on this topic is still narrow even now [7]. The reason for the limited research in scientific IE, compared to the general domain, is the specific set of challenges associated with the scientific domain. The main challenge is the expertise needed to annotate data, making annotated data costly and hard to obtain and resulting in very limited available data [7]. There is, however, a significant focus on this kind of research in the scientific sub-domains of medicine and biology [30].
Whereas early scientific IE focused mainly on citations and topic analyses [13], the focus has since broadened and shifted toward scientific fact extraction (for example, population statistics, genomic variants, material properties, etc.) [30]. Research on the dataset name extraction task uses a great variety of methods from across the NER spectrum, including, but not limited to, rule-based, BiLSTM-CRF and BERT models [3,6,7,8,9,10,11,12,13,14].
For dataset extraction, it was found that the verbs surrounding a dataset mention provide information about its role or function; for example, the words use, apply or adopt indicate a ‘use’ function [10]. However, not only the surrounding verbs play an important role: dataset detection requires a wide range of context [6], indicating that a model’s ability to grasp context could play a significant role in its performance.
We briefly go through the NER models that we tested for dataset extraction.
The rule-based approach was the most prominent one in the early stages of NER [33]. Although most state-of-the-art results are now achieved by machine learning methods, the rule-based model is still attractive due to its transparency [34]. The authors of [34] conclude that rule-based methods can achieve state-of-the-art extraction performance, but note that rule development is a very time-consuming and manual task. This method is not only used as a stand-alone classification method, but is also suitable as a form of weak supervision [35], providing training examples for the other methods as an alternative to manually labeling data [36,37].
Conditional Random Fields (CRF) is a probabilistic model for labeling sequential data, which has proven its effectiveness in NER, producing state-of-the-art results around the year 2007 [38]. A dataset extraction model based solely on CRF is missing, but CRF has been used for other tasks in scientific IE. A well-known example is the GROBID parser, which extracts bibliographic data from scientific texts (such as the title, headers, references, etc.) [39].
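A minimal sketch of a CRF sequence tagger using the sklearn-crfsuite package is shown below; the hand-crafted token features (lowercased form, shape, neighbouring words) are illustrative and not the exact feature set used in our experiments.

```python
# CRF tagger sketch with sklearn-crfsuite; features and data are toy examples.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_upper": w.isupper(),
        "is_title": w.istitle(),
        "is_digit": w.isdigit(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

# Toy training data in IOB2 format (real training uses the annotated corpus).
X_train = [sent_features(["We", "use", "the", "SQuAD", "dataset", "."])]
y_train = [["O", "O", "O", "B-DATA", "I-DATA", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```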
The BiLSTM-CRF is a hybrid model, combining LSTM layers with a CRF layer on top [40], joining the advantages of both models. The advantage of the BiLSTM is that it is better at predicting long sequences, predicting every word individually [41], while the CRF predicts based on the joint probability of the whole sentence, ensuring that the optimal sequence of tags is produced [41,42,43]. To date, a BiLSTM-CRF based model produces the best performance on the dataset extraction task, with an F1 score of 0.85 [8].
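The combination of per-token emission scores from a BiLSTM with joint sequence scoring by a CRF can be sketched as follows; this is a minimal PyTorch sketch assuming the third-party pytorch-crf package for the CRF layer, with placeholder sizes and toy inputs rather than our actual configuration.

```python
# BiLSTM-CRF sketch: BiLSTM emits per-token scores, CRF scores the whole sequence.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)    # joint sequence scoring

    def loss(self, tokens, tags, mask):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags, mask=mask, reduction="mean")

    def decode(self, tokens, mask):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions, mask=mask)

# Toy usage: one sentence of 6 tokens, 3 tags (O, B-DATA, I-DATA).
model = BiLSTMCRF(vocab_size=1000, num_tags=3)
tokens = torch.randint(1, 1000, (1, 6))
tags = torch.tensor([[0, 0, 0, 1, 2, 0]])
mask = torch.ones(1, 6, dtype=torch.bool)
print(model.loss(tokens, tags, mask))   # training objective: negative log-likelihood
print(model.decode(tokens, mask))       # Viterbi-decoded tag sequence
```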
BERT produces state-of-the-art results in a range of NLP tasks [29]. It is based on a transformer network, which is praised for its context-aware word representations, improving prediction ability [44]. BERT has revolutionized classical NLP. However, its performance as a base for the dataset extraction task differs greatly between studies: one study found an F1 score of 0.68 [13], while another found an F1 score of 0.79 [10]. Beltagy et al. [15] developed the SciBERT model based on BERT. The only difference between the two models is the training corpus: unlike BERT, which is trained on general texts, SciBERT is trained on 1.14 M scientific papers from Semantic Scholar, consisting of 18% computer science papers and 82% papers from the biomedical domain. This model, specially created for knowledge extraction in the scientific domain, indeed achieves better performance than BERT in the computer science domain.
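A minimal sketch of loading SciBERT for token classification with the HuggingFace transformers library is given below; the public allenai/scibert_scivocab_uncased checkpoint is used, while the three-label BIO scheme is a simplified stand-in for our tag set.

```python
# Load SciBERT with a token-classification head for BIO tagging (sketch).
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-DATA", "I-DATA"]
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

enc = tokenizer("We train on the SQuAD dataset.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits              # (1, num_subword_tokens, 3)
pred_ids = logits.argmax(-1)[0].tolist()
print([labels[i] for i in pred_ids])          # untrained head: predictions are random
# Fine-tuning then proceeds with the usual cross-entropy loss over BIO labels,
# e.g. via the transformers Trainer on the annotated sentences.
```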
4. Results
We report the results grouped by the six subquestions.
Appendix D contains additional results (e.g., precision and recall scores, and scores for the B(eginning) and I(nternal) tags).
4.1. RQ1, Overall Performances
Table 2 contains the F1 performance scores for the six different NER models we tested. This is the only experiment we conducted with (5-fold) cross validation on the complete set of 6000 sentences. BERT and SciBERT perform almost the same on both scores and (much) better than all the others, except that the rule-based system performs equally well on the partial match score.
Notice that the partial and exact match scores are closest for SciBERT. Due to conjunctions, ellipses and the annotation guidelines used, NER phrases can be quite long, so a large difference between the two scoring methods could be expected. An error analysis shows that SciBERT is especially good at learning the beginning of a dataset mention.
The two most interesting systems seem to be SciBERT and the rule-based one, and thus we will mostly report results on the other subquestions for these two.
4.2. RQ2, Domain Adaptability
The models’ ability to adapt to differences within the scientific domain is shown in Table 3. These scores are achieved by using one corpus as the test set while training on the other three corpora; the test corpus is indicated in the table header. We expected the scores to be lower than the cross-validation scores, but we only found a small negative effect when testing on the VISION conferences. We note that the VISION set differs in that its sentences come from the last three years, while the others cover the last two decades.
4.3. RQ3, Amount of Training Data
Here, we look at the major cost factor: the size of the training set. Figure 1 shows the exact match F1 score on the zero-shot set for varying amounts of training sentences, ranging from 500 to 4500. We see a clear difference between CRF and the two BERT models on the one hand and the two BiLSTM models on the other. We now zoom in on the most stably behaving models, CRF and SciBERT.
Figure 2 zooms in on both precision and recall, also for the (supposedly easier) test set. Both models show remarkably robust behavior: only a slight influence of the number of training examples and hardly any difference in performance between the test and zero-shot test sets. It is noticeable that CRF can be seen as a precision-oriented system, while for SciBERT, precision and recall are very similar. We see this as evidence that these two systems learn the structure of a dataset mention well and do not overfit on the dataset names themselves.
4.4. RQ4, Negative/Positive Ratio
Recall that about half of the sentences in our dataset do not mention a dataset, while containing one of the trigger words, such as dataset, corpus, collection, etc. We can decide whether or not to use those in training. As noted in previous research, the ratio of positive and negative sentences was found to be important for NER models trained on a dataset mention extraction task [6]. We see a slight improvement in F1 scores for all models when negative training examples are also added, but the effect is quite small.
What is more interesting is testing on a set in which sentences mentioning a dataset are very rare, just like in a real scientific article. Using the developed rule-based system, we added sentences that most probably do not mention a dataset (i.e., they did not contain any of the trigger words; more precisely, they did not match the regex in Appendix A) to the test set until we obtained a 1 in 100 ratio.
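The dilution step can be sketched as follows; the trigger regex below is a simplified stand-in for the regex in Appendix A, and the sentences are toy examples.

```python
# Dilute the test set with trigger-free sentences until ~1 in 100 is positive.
import re, random

TRIGGER = re.compile(r"\b(data\s?set|dataset|corpus|collection|benchmark)\b", re.I)

def dilute(positive_sents, filler_sents, ratio=100, seed=0):
    """Return a test set with roughly 1 positive sentence per `ratio` sentences."""
    rng = random.Random(seed)
    fillers = [s for s in filler_sents if not TRIGGER.search(s)]
    n_needed = (ratio - 1) * len(positive_sents)
    diluted = list(positive_sents) + rng.sample(fillers, min(n_needed, len(fillers)))
    rng.shuffle(diluted)
    return diluted

# Toy usage with made-up sentences.
pos = ["We evaluate on the VOC 2007 dataset."]
fill = [f"Sentence {i} about something else." for i in range(500)]
print(len(dilute(pos, fill)))   # ~100 sentences, only 1 of which mentions a dataset
```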
Table 4 shows the results. We see that all F1 scores drop compared to those in Table 2. This is expected, as the task becomes harder. However, note that the recall remains very high for the two BERT models, indicating that the drop in F1 is caused mainly by extra false positives. (Of course, it might be that the SciBERT model discovered genuine dataset mentions not containing one of the trigger terms; we did not check for this.)
4.5. RQ5, Weakly Annotated Data
We now see how much SciBERT can learn from positive training examples discovered by the rule-based system. As these examples are not hand annotated, we call them weakly supervised. We created a weakly supervised training set, SSC (for Silver Standard Corpus), with the same number of positive and negative sentences as in the manually annotated train set.
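The assembly of such a silver standard corpus can be sketched as follows; the one-line weak tagger is only a toy stand-in for the actual rule-based system, and the balancing mirrors the equal positive/negative split described above.

```python
# Build a Silver Standard Corpus (SSC) from weakly tagged, unlabelled sentences.
import random

def weak_tag(tokens):
    """Toy stand-in: tag a capitalized token directly before a trigger word."""
    tags = ["O"] * len(tokens)
    for i, t in enumerate(tokens):
        if t.lower() in {"dataset", "corpus", "collection"} and i > 0 and tokens[i - 1][0].isupper():
            tags[i - 1], tags[i] = "B-DATA", "I-DATA"
    return tags

def build_ssc(unlabelled_sentences, seed=0):
    rng = random.Random(seed)
    tagged = [(s, weak_tag(s)) for s in unlabelled_sentences]
    positives = [x for x in tagged if "B-DATA" in x[1]]
    negatives = [x for x in tagged if "B-DATA" not in x[1]]
    rng.shuffle(negatives)
    return positives + negatives[: len(positives)]   # balanced, like the train set

sentences = [["We", "use", "the", "SQuAD", "dataset", "."],
             ["This", "is", "a", "different", "sentence", "."]]
for sent, tags in build_ssc(sentences):
    print(list(zip(sent, tags)))
```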
Table 5 shows that the performance of SciBERT is substantially lower when trained on these ‘cost-free’ training examples alone than when trained on the hand-annotated data (train set). A reason for this is that the SSC can contain false negatives and false positives, and learning from such noisy labels harms the model’s predictions and thus the scores.
According to [55], weakly supervised negative examples harm performance. To test this effect, SciBERT was also trained on a combination of the manually labeled data and only the positive examples from the SSC. The differences were very small: a 0.01 improvement for both partial and exact match on the zero-shot test set, and on the test set no difference for the partial match and a 0.02 decrease for the exact match.
4.6. RQ6, Easy vs. Hard Sentences
Sentences enumerating a number of named datasets are common in scientific articles. According to the guidelines, these are tagged as one entity, leading to long BI+ tag sequences. We wanted to test whether SciBERT is able to learn these more complex long entities just as well as the easier ones. So we split both the train and the zero-shot test sets into a hard and an easy set, with a sentence counted as hard if it contained a BI+ tag sequence of length four or more; a sketch of this split is given below. We then performed all four possible train on hard/easy, test on hard/easy experiments. Only with train on easy, test on hard did we see an expected but still remarkable difference in scores (a drop in F1 of 42%). (We also saw a 6% drop in F1 when training on hard and testing on easy, but this may be due to the much smaller number of training sentences.) This means that the network is also able to understand and interpret ellipses and enumerations: these more complex structures are not harder for the network to identify than simple one- or two-word dataset mentions. Such structures and patterns are difficult even for human annotators to parse and classify consistently, making the network’s ability to grasp the nuances of the labeling task significant.
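A minimal sketch of the hard/easy split, assuming sentences are stored with IOB2 tags; the B-DATA/I-DATA label names are placeholders.

```python
# Split examples into hard (longest entity >= 4 tokens) and easy (all shorter).
def max_entity_length(tags):
    longest, current = 0, 0
    for t in tags:
        if t.startswith("B-"):
            current = 1
        elif t.startswith("I-") and current > 0:
            current += 1
        else:
            current = 0
        longest = max(longest, current)
    return longest

def split_hard_easy(examples):
    """examples: list of (tokens, tags) pairs; returns (hard, easy)."""
    hard = [ex for ex in examples if max_entity_length(ex[1]) >= 4]
    easy = [ex for ex in examples if max_entity_length(ex[1]) < 4]
    return hard, easy

tags_long = ["O", "B-DATA", "I-DATA", "I-DATA", "I-DATA", "O"]   # length-4 entity -> hard
tags_short = ["O", "B-DATA", "I-DATA", "O"]                      # length-2 entity -> easy
print(max_entity_length(tags_long), max_entity_length(tags_short))  # 4 2
```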
5. Discussion
We have created a large and varied annotated set of sentences likely to contain a dataset name, with about half actually containing one or more datasets. We have shown that extracting these datasets using traditional NER techniques is feasible but clearly not straightforward or solved. We believe our results show that the created gold standard is a valuable asset for the scientific document parsing community. The set stands out because its sentences come from all sections of scientific articles and come with exact links to the articles. Except for those coming from SIGIR, all articles are openly available in PDF format.
Analysis of the errors of the NER systems and the disagreements among the annotators revealed that dataset entity recognition in scientific articles is complicated by the use of enumerations, conjunctions and ellipsis in sentences. For example, in the sentence ‘We used the VOC 2007, 2008 and 2009 collections.’, the phrase ‘VOC 2007, 2008 and 2009 collections’ is tagged as one dataset entity mention, as individual elements of the enumeration are nonsensical without the context provided by the other elements [7]. We think it is this aspect that makes the task exciting and different from standard NER. Post-processing the found mentions, extracting all dataset entities and completing the information hidden by the use of ellipses is an NLP task needed on top of dataset NER before we can create a dataset citation network. Of course, cross-document coreference resolution of the found dataset names is then needed for the obtained network to be useful [56]. Expanding the provided set of sentences with this extra information, i.e., linking every sentence to a set of unique dataset identifiers, would not be that much work and would also make the dataset applicable for training the dataset reconciliation task.
We wanted to know which NER system performs well and at what cost. Not surprisingly, the best performing systems were BERT and SciBERT; unsupervised pretraining also helps for this task. Both systems (and CRF) already worked almost optimally with relatively few training examples. They were robust in our domain adaptation experiments, and kept a high recall at the cost of some loss in precision when we diluted the test set to a realistic 1 in 100 ratio of sentences with a dataset.
We found the performance of our quite simple rule-based system to be remarkable. In fact, this system can be seen as a formalization of the annotation guidelines, and having those carefully spelled out made it almost effortless to create; this is, in our opinion, the reason for its strong performance. The experiment in which we trained SciBERT with extra examples found by the rule-based system was inconclusive in that we saw hardly any change in performance. However, there may be more clever ways to combine these two models.
Future Directions
We think a gold standard dataset for the end-to-end task of dataset mention extraction from scientific PDFs could lead to a big step forward in this field. In particular, it would allow training and testing end-to-end systems that link dataset DOIs to article DOIs.
Additionally, the articles from the four chosen ML/DM/IR/CV conferences are relatively easy for the dataset extraction task, as they do not contain that many named entities. The task is likely harder with papers from biological, chemical or medical domains.
A different approach to this task is to start with a knowledge base of existing research datasets containing their names and some metadata and then to use that in an informed dataset mention extraction system.