
Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models

Andrea Papaluca (1), Daniel Krefl (2), Sergio J. Rodríguez Méndez (1), Artem Lensky (3,4), Hanna Suominen (1,5,6)

(1) School of Computing, The Australian National University, Canberra, ACT, Australia
(2) Independent
(3) School of Engineering and Technology, The University of New South Wales, ACT, Australia
(4) School of Biomedical Engineering, The University of Sydney, NSW, Australia
(5) School of Medicine and Psychology, The Australian National University, Canberra, ACT, Australia
(6) Department of Computing, University of Turku, Turku, Finland

Correspondence: [email protected]
Abstract

In this work, we tested the Triplet Extraction (TE) capabilities of a variety of Large Language Models (LLMs) of different sizes in the Zero- and Few-Shots settings. In detail, we proposed a pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both in the form of context triplets and of (sentence, triplets) pairs as examples, and provides it to the LLM through a prompt. The additional context allowed the LLMs to be competitive with all the older fully trained baselines based on the Bidirectional Long Short-Term Memory (BiLSTM) Network architecture. We further conducted a detailed analysis of the quality of the gathered KB context, finding it to be strongly correlated with the final TE performance of the model. In contrast, the size of the model appeared to only logarithmically improve the TE capabilities of the LLMs. We release the code on GitHub [1] for reproducibility.

1 Introduction

The task of Triplet Extraction (TE) (Nayak et al., 2021) is of fundamental importance for Natural Language Processing (NLP). This is because the core meaning of a sentence is usually carried by a set of (subject, predicate, object) triplets. Therefore, the capability to identify such triplets is a key ingredient for being able to understand the sentence.

Currently, the State-Of-The-Art (SOTA) for TE is achieved by models that approach the TE task in an end-to-end fashion (Zheng et al., 2017; Zeng et al., 2018; Fu et al., 2019; Zeng et al., 2019; Tang et al., 2022). That is, they are trained to perform all the TE sub-tasks, namely, Named Entity Recognition (NER (Yadav and Bethard, 2018)), Entity Linking (EL (Alam et al., 2022)), and Relation Extraction (RE (Detroja et al., 2023)), together. These SOTA models follow the classic NLP paradigm, i.e., they are trained by supervision on specific TE datasets. However, this dependence on labeled data restricts their generality and, therefore, limits the applicability of such models to the real world.

While several labeled datasets for the TE task are publicly available (Riedel et al., 2010; Gardent et al., 2017), these cover only part of the spectrum of possible entities and relations. This means that a supervised model trained on these public data will be restricted to the closed set of entities and relations seen during training, implying that it may lack generalization capabilities. Producing a tailored dataset for training a model for particular applications is, however, in general expensive (Johnson et al., 2018).

For this reason, the recent language understanding and reasoning capabilities demonstrated by Large Language Models (LLMs), such as the Generative Pre-trained Transformer 4 (GPT-4) (OpenAI, 2023), LLM Meta AI (LLaMA) (Touvron et al., 2023), and Falcon (Penedo et al., 2023), to name a few, have led researchers (Chia et al., 2022; Kim et al., 2023; Wadhwa et al., 2023; Wei et al., 2023b; Zhu et al., 2023) to investigate whether they represent a viable option to overcome the limitations imposed by supervised models for TE. In detail, the new approach is that, at inference time, the LLMs are prompted to extract the triplets contained in a sentence while being provided with only a few labeled examples (or no example at all in the Zero-Shot setting). This LLM approach largely limits the amount of data needed to perform the task and, in particular, lifts the restriction of adhering to a predefined closed set of relations. However, the investigations so far indicated that the Zero- and Few-Shots performance of the LLMs appears to be often underwhelming compared to the classic fully trained NLP models (Wadhwa et al., 2023; Wei et al., 2023b; Zhu et al., 2023).

[1] https://github.com/BrunoLiegiBastonLiegi/KG-TE-with-LLMs

Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024), pages 12–23, August 15, 2024. ©2024 Association for Computational Linguistics
In order to enhance the abilities of LLMs in the TE task, we propose in this work to aid them with the addition of a Knowledge Base (KB). We demonstrate that augmenting LLMs with KB information, i.e., dynamically gathering contextual information from the KB, largely improves their TE capabilities, thereby making them more competitive with classic NLP baselines. In particular, we show that when the retrieved information is presented to the LLMs in the form of complete TE examples relevant to the input sentence, their performance gets closer to the fully trained SOTA models.

2 Related Work

Classical end-to-end fully supervised models currently hold the best performance in the TE task. Starting from the older baseline, Zheng et al. (2017), which also introduced the commonly used revised version of the WebNLG dataset for Natural Language Generation (NLG), several other architectures based on bidirectional Recurrent Neural Networks (RNNs) (Zeng et al., 2018; Fu et al., 2019; Zeng et al., 2019) have steadily improved the SOTA over the years. More recently, Transformer-based models achieved a big leap forward in performance, with the recent UniRel model being the current SOTA (Tang et al., 2022) in the datasets we consider. A further class of fully supervised models, such as Huguet Cabot and Navigli (2021) and Josifoski et al. (2022), treats the TE problem as a sequence-to-sequence generation task, which is more similar to the LLM approach adopted here, but still requires some training or finetuning.

With the advent of LLMs, Chia et al. (2022) and Kim et al. (2023) tested the use of such models for those TE cases where the availability of examples to train on is low. The first work proposed to use an LLM to generate training examples to finetune a Relation Extractor model to recognize relations for which labels were not available. The latter work, instead, suggested using relation templates of the form 〈X〉 relation 〈Y〉 and finetuning an LLM to fill out 〈X〉 and 〈Y〉 with the entities appearing in the sentence. Wadhwa et al. (2023), Wei et al. (2023b), and Zhu et al. (2023) investigated the general TE task in both Zero- and Few-Shots settings. These studies proposed different approaches based on LLM prompting. The first work tested the Few-Shots performance of GPT-3 (Brown et al., 2020a) and the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2023) under the inclusion of manually crafted, dataset-dependent contextual information in the prompt. The second work proposed to perform TE by sequentially prompting ChatGPT in two stages: asking it to individuate the possible relation types first and then extracting the entities participating in each relation. The procedure demonstrated better results than a one-stage approach where the model is prompted to extract the triplets directly. Finally, the third work evaluated GPT-3 (Brown et al., 2020b) and GPT-4 (OpenAI, 2023) on some standard benchmarks in the Zero- and One-Shot settings. However, classical fine-tuned models proved to be superior in the majority of the cases.

In our study, we similarly test the Zero- and Few-Shots capabilities of LLMs on two standard TE datasets that have not been covered by these previous works. In contrast to Wadhwa et al. (2023), who manually crafted static dataset-specific context to be fed to the LLM, we propose here to dynamically gather, from a KB, contextual information useful for extracting the triplets. This makes our approach more flexible and less data-dependent, as the KB does not require any manual operation and can be easily switched depending on the need. Also, in contrast to other works, we investigate a wide range of Language Models of varying sizes. This allows us to provide an in-depth analysis of the scaling of the performance, both from the perspective of the model chosen and of the quality of the contextual KB information included in the prompt.

3 The Pipeline

In this section, we provide a detailed illustration of the pipeline used to test the TE capabilities of LLMs.

3.1 Task Formulation

Given a sentence composed of tokens (t_1, t_2, ..., t_N), the TE task consists of identifying all the relations expressed in it and extracting them in the form of triplets (s, p, o). Here, s = (t_i, ..., t_{i+n_s}) and o = (t_k, ..., t_{k+n_o}) represent a subject and an object of length n_s and n_o tokens, and p is the predicate. Usually, the task is related to a specific KB, i.e., a graph of the form G = (V, E), composed of entities e ∈ V as vertices and relations r ∈ E as directed edges. Therefore, s and o of the sentence correspond to vertices e_s, e_o ∈ V. The predicate p is mapped to a relation included in the closed set of possible edge types of the KB.
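To make this formulation concrete, the following minimal sketch (illustrative only; the type and variable names are hypothetical and not taken from the released code) encodes a sentence with its gold triplets and a toy KB in Python:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Triplet:
        subject: str    # surface form of the subject entity e_s
        predicate: str  # relation type r_p, drawn from the KB's closed set of edge types
        object: str     # surface form of the object entity e_o

    # Toy KB G = (V, E): entities as vertices, labelled directed edges as relations.
    kb: set[Triplet] = {
        Triplet("abilene texas", "country", "united states"),
        Triplet("united states", "ethnic group", "african americans"),
    }

    # The TE task maps a token sequence (t_1, ..., t_N) to the set of triplets it expresses.
    sentence = "Abilene, Texas is in the United States."
    gold_triplets = {Triplet("abilene texas", "country", "united states")}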
Figure 1: The TE pipeline. Left: illustration of the pipeline. A KB is constructed from the training and validation splits of a given dataset. For each test sentence, the relevant contextual information is retrieved from the KB and included in the prompt for a LLM-based TE. Right: summary of information retrieval from the KB. Either the sentence-triplets pairs or the single triplets alone are encoded by a sentence encoder and compared to the encoding of the input sentence by cosine similarity.

3.2 LLMs as Triplet Generators

In order to perform TE, we can prompt LLMs to generate, for a given input sentence, a sequence of tokens corresponding to the set of entity-relation triplets { (e_s^i, r_p^i, e_o^i) }_{i=1}^{n}. As demonstrated by Wadhwa et al. (2023), Wei et al. (2023b), and Zhu et al. (2023), LLMs are, in principle, able to extract the knowledge triplets contained in a text without a need for task-specific training, under a suitable choice of prompt. In general, successful LLM prompts follow a fixed schema that provides a detailed explanation of what the task consists of, a clear indication of the sentence to process, and some hints or examples for the desired result.

In this work, we tested the use of three different prompts: a simple baseline and two slight variations of it. However, preliminary testing in TE showed no significant difference in the F1 scores among them. Therefore, we opted for using only the base prompt reported in Figure 2 in the main experiments. The details of the prompts tested and their results can be found in Appendix A.3.

3.3 KB-aided Triplet Extraction

In order to support LLMs in the TE task, we propose the pipeline illustrated in Figure 1. The pipeline augments the LLM with external KB information. In detail, for each input sentence, relevant context information contained in the KB is retrieved and attached to the LLM prompt described above. The context-enriched prompt is then fed to the LLM for the knowledge triplet generation.

We prepare the information coming from the KB in two different forms: either as simple context triplets

    T_c = { (e_s^i, r_p^i, e_o^i) ∈ G }_{i=1}^{N_KB},    (1)

or as sentence-triplets pairs

    E_c = { (S_c^i, T_c^i) }_{i=1}^{N_KB}.    (2)

The latter provides factual examples of triplets to be extracted for specific sentences. Note that we indicate with N_KB the number of triplets, respectively, sentence-triplets pairs retrieved from the KB.

    Triplet Extraction Prompt
    Some text is provided below. Extract up to {max_triplets} knowledge
    triplets in the form (subject, predicate, object) from the text.
    ---------------------------------------------------------------------
    Examples:
    Text: Abilene, Texas is in the United States.
    Triplets:
    (abilene texas, country, united states)
    Text: The United States includes the ethnic group of African
    Americans and is the birthplace of Abrahm A Ribicoff
    who is married to Casey Ribicoff.
    Triplets:
    (abrahm a. ribicoff, spouse, casey ribicoff)
    (abrahm a. ribicoff, birth places, united states)
    (united states, ethnic group, african americans)
    ---------------------------------------------------------------------
    Text: {text}
    Triplets:

Figure 2: The base prompt we experimented with. At inference time the {text} and {max_triplets} variables are substituted with the sentence to process, respectively, the maximum number of triplets found in a sentence in the corresponding dataset.
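For illustration, a minimal sketch of how such a prompt can be filled and the generated triplets recovered from the model's completion is given below. It assumes plain-string prompts with {text} and {max_triplets} placeholders and completions that list one (subject, predicate, object) tuple per line; the helper names are hypothetical and the released code may differ.

    import re

    BASE_PROMPT = """Some text is provided below. Extract up to {max_triplets} knowledge
    triplets in the form (subject, predicate, object) from the text.
    ---------------------------------------------------------------------
    Text: {text}
    Triplets:
    """

    def build_prompt(text: str, max_triplets: int) -> str:
        # {text} and {max_triplets} are substituted as described in the Figure 2 caption.
        return BASE_PROMPT.format(text=text, max_triplets=max_triplets)

    def parse_triplets(completion: str) -> set[tuple[str, ...]]:
        # Accept lines such as "(abilene texas, country, united states)".
        pattern = re.compile(r"\(([^,()]+),([^,()]+),([^,()]+)\)")
        return {
            tuple(part.strip().lower() for part in match.groups())
            for match in pattern.finditer(completion)
        }

    print(parse_triplets("(Abilene Texas, country, United States)"))
    # -> {('abilene texas', 'country', 'united states')}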
In the first case, augmentation is achieved by simply attaching the retrieved triplets T_c as an additional "Context Triplets" argument to the base prompt reported in Figure 2. For the second approach, instead, we substitute the two static examples provided in the base prompt with the input-relevant examples E_c retrieved from the KB.

The relevant context information to build the T_c triplets set for each input sentence is retrieved as follows. Given the KB, we isolate all the triplets (e_s^i, r_p^i, e_o^i) ∈ G contained therein, and store them in a node-based vector store index (Liu, 2022). In detail, each node of this index corresponds to one and only one of the triplets and stores the embedding obtained by running a small-scale sentence encoder, MiniLM (Wang et al., 2020), on the corresponding (subject, predicate, object) string. To a first approximation, this should be enough to provide a meaningful embedding for each triplet. During inference (i.e., TE), we first encode the input sentence using the MiniLM. This is followed by comparing the obtained sentence embedding with all the triplet embeddings contained in the index to retrieve the top N_KB most similar triplets to the input sentence. Out of this N_KB-dimensional sample, we further select the first two triplets for each relation type present in the sample. This is done to obtain a more diverse set of context triplets with a more homogeneous distribution over the relations. In some cases, indeed, there is a risk of obtaining a distribution highly biased towards a specific relation type, which is sub-optimal for those sentences that contain several different relationships.

Note that a similar procedure can be followed to prepare the E_c examples set. However, in this case, the focus is shifted to the example sentences we wish to include. Namely, each node of the vector store index consists of both the example sentence and the KB triplets to be extracted from it. Then, the embedding vector is obtained by running the sentence encoder on either the example sentence alone, or the sentence and triplets combined. As before, at inference time the top N_KB most similar (sentence, triplets) pairs to the input sentence are retrieved and included in the prompt as Few-Shots examples.
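The following is a minimal sketch of this retrieval step. It substitutes a brute-force cosine search for the LlamaIndex vector store used in the actual pipeline and assumes the sentence-transformers MiniLM checkpoint as the encoder; function names and defaults are illustrative.

    from collections import defaultdict
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def index_triplets(kb_triplets):
        # One node per KB triplet: embed its "(subject, predicate, object)" string.
        texts = [f"({s}, {p}, {o})" for s, p, o in kb_triplets]
        return list(kb_triplets), encoder.encode(texts, convert_to_tensor=True)

    def retrieve_context(sentence, triplets, embeddings, n_kb=5, per_relation=2):
        # Rank the indexed triplets by cosine similarity to the input sentence
        # and keep the top n_kb most similar ones.
        query = encoder.encode(sentence, convert_to_tensor=True)
        scores = util.cos_sim(query, embeddings)[0]
        top = [triplets[int(i)] for i in scores.argsort(descending=True)[:n_kb]]
        # Within the retrieved sample, keep at most `per_relation` triplets per
        # relation type so that no single relation dominates the context.
        kept, seen = [], defaultdict(int)
        for s, p, o in top:
            if seen[p] < per_relation:
                kept.append((s, p, o))
                seen[p] += 1
        return kept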
            Train    Validation    Test    Relations   Max    Avg
    WebNLG   5,019       500         703      171        7    2.29
    NYT     56,195     5,000       5,000       24       22    1.72

Table 1: Statistics of the WebNLG and NYT datasets. The number of training, validation, and testing sentences is reported, together with the number of relation types in the dataset and the maximum and average number of triplets contained in a sentence.

                                      Parameters [B]   Context
    GPT-2 (Radford et al., 2019)         0.1 | 1.5      1,024
    Falcon (Penedo et al., 2023)           7 | 40       2,048
    LLaMA (Touvron et al., 2023)          13 | 65       2,048

Table 2: The number of parameters (in billions [B]) and context window size of the selected LLMs.

4 Experiments

In this section, we first provide details about the datasets and models we tested. This is followed by the presentation of the main results for the TE task.

4.1 Datasets and Models

In order to test the TE capabilities of a selected set of LLMs (see Table 2 for their comparison), we experimented with two standard benchmarks for the TE task: the aforementioned WebNLG (Gardent et al., 2017) and the New York Times (NYT) (Riedel et al., 2010) dataset (see Table 1 for their basic statistics). The former was initially proposed as a benchmark for the NLG task, but has been successively adapted to the TE task and included in the WebNLG challenge (Castro Ferreira et al., 2020). As the revision provided by Zheng et al. (2017) appears to be the most widely used in the literature, we decided to run our tests on that particular version of WebNLG. The NYT benchmark is a dataset created by distant supervision, aligning more than 1.8 million articles from the NYT newspaper with the Freebase KB. For each dataset, we used the training and validation splits to build the corresponding KB following the procedure outlined in Section 3.3.

We selected the LLMs reported in Table 2 for testing. We ran all the models locally in their 8-bit quantized version provided by the HuggingFace (Wolf et al., 2020) library. We tested the use of OpenAI models through their provided API as well. However, as their results were often inconsistent and given the limited access and control we had over them, we decided to exclude these models from the main report. All the experiments regarding them can be found in Appendix A.1. The temperature was set to τ = 0.1 for all the experiments. We experimented with higher temperatures but observed that they were detrimental to the TE performance of the model. For Falcon and LLaMA LLMs, we also explored their instructed counterparts, i.e., models that were fine-tuned for chat applications, either through Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2023) or supervision from other LLMs (Taori et al., 2023). However, as the instructed models always performed on par, or worse, in our tests, we decided to present the base variants.

We made use of the LlamaIndex (Liu, 2022), LangChain (Chase, 2022), and HuggingFace Transformers (Wolf et al., 2020) Python libraries for the implementation of the pipeline.

    Model                                WebNLG    NYT
    NovelTagging (Zheng et al., 2017)     0.283   0.420
    CopyRE (Zeng et al., 2018)            0.371   0.587
    GraphRel (Fu et al., 2019)            0.429   0.619
    OrderCopyRE (Zeng et al., 2019)       0.616   0.721
    UniRel (Tang et al., 2022)            0.947   0.937

Table 3: Micro-averaged F1 of some finetuned models selected from the literature.

4.2 Zero- and 2-Shots without the KB

As a baseline, we test the Zero- and 2-Shots capabilities of the LLMs without any additional information supplemented from a KB. As described in Section 3, we prompt the LLM with the base prompt of Figure 2 to extract all the triplets for a sentence in the form (subject, predicate, object). In particular, for the 2-Shots setting, two standard examples are included in the prompt but not changed over the different sentences (c.f. Figure 2).

                     WebNLG               NYT
    Model        0-Shot   2-Shots   0-Shot   2-Shots
    GPT-2 base    0.000    0.006     0.000    0.000
    GPT-2 xl      0.000    0.037     0.000    0.000
    Falcon 7b     0.000    0.066     0.000    0.002
    Falcon 40b    0.021    0.158     0.000    0.007
    LLaMA 13b     0.006    0.129     0.000    0.002
    LLaMA 65b     0.041    0.219     0.000    0.017

Table 4: Zero- and 2-Shots micro-averaged F1 performance of the LLMs tested with the prompt of Figure 2 and without any context coming from the KB.

In general, the LLMs queried by the base prompt do not seem capable of performing well in the TE task (Table 4). The two static examples included in the 2-Shots setting help to clarify the task and substantially improve the performance over the Zero-Shot. However, all models struggle to achieve the performance of the classical baseline NLP models (Table 3). The sole exception is the LLaMA 65B model, which achieves an F1 score close to the one obtained by Zheng et al. (2017) on the WebNLG dataset with 2-Shots. In particular, the NYT benchmark appears to be challenging for LLMs, as they have difficulties even reaching a mere 1% F1 score. This discrepancy in performance between the datasets could potentially be explained as follows: in contrast to the WebNLG dataset, which features more linear and simple sentences, NYT articles frequently contain quite complex structures, with several subordinate clauses and implicit relations. In particular, the triplet labels of the NYT dataset often cover only a subset of the actual relations found in the sentence. Therefore, without training examples available, LLMs cannot infer which relations are and are not supposed to be extracted.
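All scores reported here are micro-averaged F1 over extracted triplets. As a reference for how such a score is typically computed, the following minimal sketch assumes exact matching of normalized (subject, predicate, object) tuples; the paper does not spell out its precise matching criteria, so this is illustrative only.

    def micro_f1(predictions, references):
        """Micro-averaged F1 over exact (subject, predicate, object) matches.

        `predictions` and `references` are lists (one entry per sentence) of sets
        of normalized triplets, e.g. {("abilene texas", "country", "united states")}.
        """
        tp = sum(len(pred & gold) for pred, gold in zip(predictions, references))
        n_pred = sum(len(pred) for pred in predictions)
        n_gold = sum(len(gold) for gold in references)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0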
4.3 Zero-Shot with KB Triplets (0.5-Shots)

If we supplement the LLMs with context triplets retrieved from the KB, as described in Section 3.3 and illustrated in Figure 1, the performance of the LLM in the TE task increases substantially (see Table 5). We refer to this setting, where only a set of context triplets, but no example sentence, is provided to the model, as 0.5-Shots. The additional triplets hint at which relations and entities the LLM should expect, but they do not give any indication of which sentence pattern they could arise from.

                      WebNLG                  NYT
    Model         0.5-Shot   5-Shots   0.5-Shot   5-Shots
    GPT-2 base      0.249     0.430      0.175     0.375
    GPT-2 xl        0.297     0.517      0.193     0.448
    Falcon 7b       0.381     0.567      0.250     0.519
    Falcon 40b      0.345     0.615      0.226     0.547
    LLaMA 13b       0.374     0.609      0.247     0.582
    LLaMA 65b       0.377     0.677      0.243     0.647

Table 5: 0.5- and 5-Shots micro-averaged F1 performance of the LLMs tested with the prompt of Figure 2 augmented with N_KB = 5 triplets, respectively, sentence-triplets pairs retrieved from the KB.

In this case, the smallest model we tested, namely GPT-2 base, is competitive with the LLaMA 65B model without context triplets, both for the WebNLG and the NYT dataset. Furthermore, the bigger models (Falcon, LLaMA) perform better than or on par with some of the classical NLP baselines for the WebNLG dataset given in Table 3. Even though a large improvement is obtained for the NYT dataset under the addition of context triplets, the LLMs are not able to reach scores competitive with the classical NLP models. The reason behind this might be related to the lower capability of the KB retriever to gather relevant context for NYT (c.f. Figure 3), discussed below, and to the specific difficulties associated with the NYT dataset discussed in the previous section.

In general, it is interesting to observe that performance with the addition of the context triplets appears to be less dependent on the particular LLM used in the case of the 0.5-Shot setting. Quite remarkably, the small GPT-2 xl is able to retain most of the performance of the larger models. This is particularly evident for the NYT dataset, where all the LLMs are not able to perform better than a 25% F1 threshold. This could be seen as a symptom of the TE accuracy being mainly driven by the added context triplets in this case. Indeed, we also tested this KB triplets augmentation combined with the inclusion of the two static examples used in Section 4.2, but no significant differences were observed.

4.4 Few-Shots with KB Sentence-Triplets Pairs

To further aid LLMs in the TE task, we experiment with the inclusion in the prompt of input-specific (sentence, triplets) example pairs retrieved from the KB, as detailed in Section 3.3. Such updated prompts should provide a much stronger signal to the LLM, as they not only suggest which entities and relations the LLM should expect, but also which kind of patterns in the sentence correspond to a specific relation. In particular, as will be discussed in Section 4.5, the measured train-test overlap seems to be large for both datasets (c.f. Figure 3) and, therefore, the updated prompts are likely to include examples of similar sentences. Therefore, performance improvements are expected, and in fact, looking at Table 5, we see that including 5 of these examples in the prompt makes the LLM competitive with most of the classical baselines reported in Table 3 (except the most recent SOTA from Tang et al. (2022)).

Interestingly, the performance gap between the two datasets narrowed under the updated prompt. In particular, the NYT corpus seems to have become far easier now for the LLMs. As discussed in Section 4.2, this dataset consists of sentences with a much more complex structure and more implicit relations. Therefore, having available examples of similarly constructed sentences might have helped the models to more easily identify the correct triplets.

Figure 3: Probability that the correct triplet is present inside the retrieved KB context consisting of (left) triplets alone or (right) sentence-triplet example pairs, plotted against the amount of context gathered, N_KB.

4.5 Quality of the KB Context

To evaluate the effectiveness of the KB retriever and the quality of the included KB context, we plot in Figure 3 the probability of finding the correct triplets, i.e., the solution to the TE task, inside the gathered KB context with increasing N_KB. Namely, for each test sentence contained in the two datasets, we looped over every labeled triplet and counted the number of times it was contained inside the context provided by the retriever. We repeated this procedure for different values of N_KB.
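A minimal sketch of this measurement (exact matching of retrieved and gold triplets is assumed; names are illustrative):

    def context_hit_probability(test_set, retrieve, n_kb):
        """Fraction of gold triplets found verbatim in the retrieved KB context.

        `test_set` yields (sentence, gold_triplets) pairs and `retrieve(sentence, n_kb)`
        returns the n_kb context triplets gathered from the KB for that sentence.
        """
        hits, total = 0, 0
        for sentence, gold_triplets in test_set:
            context = set(retrieve(sentence, n_kb))
            for triplet in gold_triplets:
                total += 1
                hits += triplet in context
        return hits / total if total else 0.0

    # Sweeping n_kb produces curves analogous to Figure 3, e.g.:
    # probs = {n: context_hit_probability(test_set, retrieve, n) for n in (1, 5, 10, 20, 50)}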
Figure 3(left) suggests that N_KB ∼ 10-20 retrieved triplets already almost maximize the probability of retrieving a useful context, as, beyond that, the improvement is only marginal. However, as few as five triplets worked the best in our tests. Probably, a greater number of retrieved context triplets leads to a marginally increased likelihood of including relevant information, but at the cost of a larger dilution. Conversely, as illustrated by Figure 3(right), for the sentence-triplets augmentation convergence is not yet reached with N_KB = 8. However, in our experiments, the final TE performance only marginally improved going from 5 to 8 sentence-triplets examples included. Still, it is interesting to note that LLM performance increases with N_KB in this case, providing further evidence that the examples composed of sentence-triplets pairs are much more informative. Adding several of them does not lead to a dilution of useful information, but rather contributes to widening the spectrum of examples the LLM can take "inspiration" from.

In general, the probability of providing the correct triplet to the LLM through the context appears to be large: greater than 50% in the majority of the cases, and even approaching 70-80% for the WebNLG dataset. This is symptomatic of the substantial overlap that exists between the training, validation, and test splits for both datasets, to the point that even a stochastic model, which randomly sampled the triplets out of the retrieved KB context, was able to achieve performance competitive with many of the LLMs and baselines of Table 3 in some cases (see Appendix A.2 for more details).

Figure 4: Probability that the correct triplet is present among the retrieved KB context for the WebNLG dataset, as in Figure 3, but with different scaled-down versions of the original KB.

4.6 Ablation Study

To further investigate the impact of the additional knowledge retrieved from the KB, we revisit in this section the performance of one of our best performing LLMs, LLaMA-65b. In detail, we construct a scaled-down version of the KB by randomly sampling from the original training and validation splits, keeping only a fraction of the original sentences and triplets. For this reduced KB, the probability of having the correct triplet answer already within the retrieved information is reduced (c.f. Figure 4). This allows us to evaluate how the accuracy of the model is impacted by the quality of the retrieved data.

We decided to conduct this test on the WebNLG dataset. As P(N_KB) for the full-scale KB is larger than for the NYT dataset (c.f. Figure 3), a wider range of values can be explored. Nonetheless, a preliminary test on the NYT dataset yielded similar results. In Figure 5a we report the variation of the final F1 score obtained by LLaMA-65b with prompts augmented by N_KB = 5 triplets and sentence-triplets pairs gathered from a KB of different scales S = 0, 0.1, 0.25, 0.5, 1. Here, the scale refers to the fraction of left-over data from the original KB. Note that S = 0 corresponds to the original prompt without any additional information from the KB. The F1 score is plotted against the probability P_S(N_KB = 5) of having the correct triplet inside the retrieved data with N_KB = 5 for the different KB sizes. This corresponds to the probability curves of Figure 4 evaluated at N_KB = 5. We observe that the performance degrades as the probability P_S(N_KB) shrinks with decreasing S, as expected. In particular, the relation appears to be linear:

    F1_triplets ∼ 0.25 · P_S(N_KB = 5) + 0.21,
    F1_sentence-triplets ∼ 0.55 · P_S(N_KB = 5) + 0.21,

with measured determination coefficients r² = 0.98 and r² = 0.96, respectively. This suggests that there is a strong correlation between the TE capabilities of the model and the quality of the retrieved data.

Furthermore, we investigated how the final TE performance scales with the size of the model. In Figure 5b, the F1 score is plotted against the number of parameters N_par in log scale for all the models we tested. The plot includes the results obtained for both the WebNLG and NYT datasets, for all settings considered. We observe that, for each of the three settings, the models' performance grows linearly in log scale with respect to their sizes. The scaling in the number of parameters N_par can be approximated by

    F1_norm ∼ m · log N_par.    (3)

The slope parameters of the linear fit for WebNLG are m = 0.0456, 0.0304, and 0.0871 for, respectively, the 2-Shots, 0.5-Shot (KB), and 5-Shots (KB) settings, and for the NYT the corresponding parameters are m = 0.0028, 0.0257, and 0.0906. The determination coefficients for the WebNLG and NYT datasets are, respectively, r² = 0.67, 0.62, and 0.97, and r² = 0.18, 0.7, and 0.90. Interestingly, the F1 score increase with the size of the model is steeper for the few-shots prompt (c.f. Figure 5b right). This suggests that larger models might be more capable of making use of several examples included inside of the prompt.

Therefore, the F1 score, and thus the TE accuracy, appears to scale linearly with the size of the KB (c.f. Figure 5a), but only logarithmically with the size of the model (c.f. Figure 5b). This suggests that it could be better to invest resources to improve the quality of the KB and its associated information retriever, rather than in training larger models.
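For illustration, the kind of fit behind Eq. (3) can be reproduced with a simple least-squares regression on log10 of the parameter count. The sketch below uses the parameter counts of Table 2 and the WebNLG 5-Shots column of Table 5; since the exact fitting protocol (and the normalization implied by F1_norm) is not detailed here, the resulting slope need not match the reported values exactly.

    import numpy as np

    # Parameter counts (Table 2) and WebNLG 5-Shots F1 scores (Table 5) for
    # GPT-2 base/xl, Falcon 7B/40B, and LLaMA 13B/65B.
    n_par = np.array([0.1, 1.5, 7, 40, 13, 65]) * 1e9
    f1 = np.array([0.430, 0.517, 0.567, 0.615, 0.609, 0.677])

    # Linear fit F1 ~ m * log10(N_par) + c, as in Eq. (3).
    m, c = np.polyfit(np.log10(n_par), f1, deg=1)
    r2 = np.corrcoef(np.log10(n_par), f1)[0, 1] ** 2
    print(f"slope m = {m:.4f}, intercept c = {c:.3f}, r^2 = {r2:.2f}")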
Figure 5: (Left) Triplets (orange) and sentence-triplets (green) KB-augmented performance of the LLaMA-65b model with different scaled-down versions of the KB built for WebNLG, S = 0, 0.1, 0.25, 0.5, 1. The F1 score is plotted against the probability of retrieving the correct triplet with N_KB = 5 for each S (namely P(N_KB = 5) for each curve of Figure 4). (Right) F1 score obtained by the tested models, plotted against the corresponding log of the number of parameters, for WebNLG (blue) and NYT (orange) in the three settings: 2-Shots, 0.5-Shots with KB triplets (N_KB = 5), and 5-Shots with KB sentence-triplets pairs (N_KB = 5). The outliers (GPT-4 and GPT-3.5 turbo) are shown in green.

5 Conclusion

In this work, a pipeline for Zero- and Few-Shots TE from sentences was presented and tested for various LLMs. We showed that the inclusion of KB information into the LLM prompting can substantially improve the TE performance. In particular, small models were often able to outperform their bigger siblings without access to the additional KB information. Furthermore, with the information from the KB organized as sentence-triplets pair examples relevant to the input sentence, the accuracy of the LLMs improved further. In this setting, the larger LLMs were getting closer to the classical SOTA models and outperformed most of the older baselines. However, even for the largest models, TE remains a challenging task without any finetuning. LLMs were still no match for SOTA classical finetuned models in the two standard benchmark datasets we tested as part of our work, in agreement with Wadhwa et al. (2023); Wei et al. (2023b); Zhu et al. (2023).

Moreover, the performed investigation of the quality of the retrieved KB context showed that the solution to the TE task was often already contained inside it. This first indicated that a large overlap between the train, validation, and test sets exists for both WebNLG and NYT, leading us to reconsider their generality for benchmarking TE capabilities and suggesting that a revision with better test isolation might be helpful. Secondly, it demonstrated that, while LLMs are capable of correctly individuating the relevant information in the context, they do not yet shine in re-elaborating such information, generalizing, and making use of it for different examples. Indeed, the investigation of the impact of the quality of the retrieved KB context showed that the performance of the LLaMA-65b model decreased linearly with the probability of finding the solution of the task already within the context, indicating that the intrinsic incompleteness of KBs might represent a big limiting factor of this approach. Concurrently, we found that the TE performance improved only approximately logarithmically with the size of the model. This suggests that improving the quality of the KB and the associated information retriever might be more effective than increasing the modeling power of the LLM for TE.

Acknowledgements

Andrea Papaluca was supported by an Australian Government Research Training Program International Scholarship. Artem Lensky was partially supported by the Commonwealth Department of Defence, Defence Science and Technology Group.

References

Mehwish Alam, Davide Buscaldi, Michael Cochez, Francesco Osborne, Diego Reforgiato Recupero, Harald Sack, Özge Sevgili, Artem Shelmanov, Mikhail Arkhipov, Alexander Panchenko, Chris Biemann, Mehwish Alam, Davide Buscaldi, Michael Cochez, Francesco Osborne, Diego Refogiato Recupero, and Harald Sack. 2022. Neural entity linking: A survey of models based on deep learning. Semant. Web, 13(3):527–570.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language models are few-shot learners. Preprint, arXiv:2005.14165.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020b. Language models are few-shot learners. Preprint, arXiv:2005.14165.

Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 55–76, Dublin, Ireland (Virtual). Association for Computational Linguistics.

Harrison Chase. 2022. LangChain.

Yew Ken Chia, Lidong Bing, Soujanya Poria, and Luo Si. 2022. RelationPrompt: Leveraging prompts to generate synthetic data for zero-shot relation triplet extraction. In Findings of the Association for Computational Linguistics: ACL 2022, pages 45–57, Dublin, Ireland. Association for Computational Linguistics.

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep reinforcement learning from human preferences. Preprint, arXiv:1706.03741.

Kartik Detroja, C.K. Bhensdadia, and Brijesh S. Bhatt. 2023. A survey on relation extraction. Intelligent Systems with Applications, 19:200244.

Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1409–1418, Florence, Italy. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370–2381, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mark Johnson, Peter Anderson, Mark Dras, and Mark Steedman. 2018. Predicting accuracy on large datasets from smaller pilot data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 450–455, Melbourne, Australia. Association for Computational Linguistics.

Martin Josifoski, Nicola De Cao, Maxime Peyrard, Fabio Petroni, and Robert West. 2022. GenIE: Generative information extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4626–4643, Seattle, United States. Association for Computational Linguistics.

Bosung Kim, Hayate Iso, Nikita Bhutani, Estevam Hruschka, Ndapa Nakashole, and Tom Mitchell. 2023. Zero-shot triplet extraction by template infilling. Preprint, arXiv:2212.10708.

Jerry Liu. 2022. LlamaIndex.

Tapas Nayak, Navonil Majumder, Pawan Goyal, and Soujanya Poria. 2021. Deep neural approaches to relation triplets extraction: a comprehensive survey. Cognitive Computation, 13(5):1215–1232.

OpenAI. 2023. GPT-4 technical report. Preprint, arXiv:2303.08774.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. Preprint, arXiv:2306.01116.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint, arXiv:1910.10683.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.
Wei Tang, Benfeng Xu, Yuyue Zhao, Zhendong Mao, Yifeng Liu, Yong Liao, and Haiyong Xie. 2022. UniRel: Unified representation and interaction for joint relational triple extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7087–7099, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. Preprint, arXiv:2302.13971.

Somin Wadhwa, Silvio Amir, and Byron C. Wallace. 2023. Revisiting relation extraction in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 15566–15589. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Preprint, arXiv:2002.10957.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023a. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. 2023b. Zero-shot information extraction via chatting with ChatGPT. Preprint, arXiv:2302.10205.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377, Hong Kong, China. Association for Computational Linguistics.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514, Melbourne, Australia. Association for Computational Linguistics.

Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236, Vancouver, Canada. Association for Computational Linguistics.

Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. Preprint, arXiv:2305.13168.

A Appendix

A.1 OpenAI Models Results

We report here the results obtained by the OpenAI models listed in Table 6. We ran them remotely through the OpenAI API, always setting the temperature to τ = 0.1. We were not able to find any information regarding the parameter precision they used. Note that GPT-3.5 and GPT-4 are instructed models. For some experiments we also tested the use of text-davinci-002, which is a non-instructed model based on GPT-3 and, apparently, the only base variant OpenAI provides through their API.

                                              Parameters [B]   Context
    text-davinci-002 (Brown et al., 2020b)        175*          2,048
    GPT-3.5 (Brown et al., 2020b)                 175*          4,096
    GPT-4 (OpenAI, 2023)                        1,760*          8,192

Table 6: The number of parameters (in billions [B]) and context window size of the OpenAI LLMs. We indicate by * the numbers that are not officially confirmed.
Tables 7 and 8 report the results obtained by the OpenAI models on the WebNLG and NYT datasets in all the different settings. The comparison with the other models (c.f. Tables 4 and 5) shows them to be comparable to the Falcon 40B model in the majority of the cases. However, they recorded very underwhelming results on the NYT dataset, both in the 0.5-Shots and Few-Shots settings. Our manual inspection of the triplets they provided as an answer suggested that they were less keen to adhere to the entities and relations appearing in the provided KB context, often paraphrasing or reformulating them in a more prolix form that lowered the accuracy. This might be a consequence of the instructed training they had gone through, as discussed in Section 4.1. In contrast, the results provided by the non-instructed text-davinci-002 model were more in line with all the other LLMs.

                   WebNLG               NYT
    Model      0-Shot   2-Shots   0-Shot   2-Shots
    GPT-3.5     0.000    0.144     0.000    0.008
    GPT-4       0.007    0.156     0.000    0.007

Table 7: Zero- and 2-Shots micro-averaged F1 performance of the OpenAI LLMs tested with the prompt of Figure 2 and without any context coming from the KB.

                         WebNLG                  NYT
    Model            0.5-Shot   5-Shots   0.5-Shot   5-Shots
    text-davinci-002   0.403     0.491      0.144     0.418
    GPT-3.5            0.336     0.520      0.088     0.184
    GPT-4              0.394     0.510      0.096     0.151

Table 8: 0.5- and 5-Shots micro-averaged F1 performance of the OpenAI LLMs tested with the prompt of Figure 2 augmented with N_KB = 5 triplets, respectively, sentence-triplets pairs retrieved from the KB.

A.2 Random Model

In order to better understand the results obtained by the KB-augmented LLMs, we considered the following simple random TE model: first, we randomly select the number of triplets n ∈ [1, max_triplets] to extract, with max_triplets indicating the maximum number of triplets contained in a sentence of the dataset. Then, we uniformly sample n triplets out of the retrieved KB context.
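A minimal sketch of this baseline (illustrative names; the actual implementation may differ):

    import random

    def random_te_model(context_triplets, max_triplets, rng=random):
        """Random baseline: extract n ~ U[1, max_triplets] triplets drawn
        uniformly, without replacement, from the retrieved KB context."""
        n = min(rng.randint(1, max_triplets), len(context_triplets))
        return set(rng.sample(list(context_triplets), n))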
Surprisingly, the random model is very competitive with the KB-augmented LLM for small N_KB on the WebNLG dataset (see Figure 6), and similar results were observed for the NYT dataset. This can be explained with the help of Figure 3. In detail, we infer from the figure that the KB-augmented prompt has a large probability of already containing the correct triplets to extract; therefore, even randomly selecting a subset of them yields a relatively high accuracy. This provides further confirmation that the TE performance is largely driven by the KB retriever.

Figure 6: Degradation of the random model performance with the increase of the context information included, N_KB. The LLaMA-65b, instead, is able to retain most of its performance when more triplets are added (left panel), and sees a significant F1 rise with an increasing number of sentence-triplets pairs (right panel). For reference, we also report the fit of (4) as a dashed orange line.

However, the performance of the random model decreases polynomially with N_KB, as the probability of randomly sampling the correct triplets follows the empirical scaling relation

    F1_rand(N_KB) ∼ ( P(N_KB) / N_KB )^n,    (4)

with n the number of triplets to extract and P(N_KB) the probability of retrieving the correct triplet from the KB (Figure 3).

In contrast, the LLM is able to retain much of its original performance for a larger number of triplets provided (c.f. Figure 6(left)), or even improve under the inclusion of more sentence-triplets examples (c.f. Figure 6(right)).

A.3 Prompts

Here we report all the TE prompts that we tested. Figures 7 and 8 report two variations of the base prompt of Figure 2. The first one implements a Chain-of-Thought (Wei et al., 2023a) approach where multi-step reasoning is enforced. The second tries to provide the LLM with more information about the task, describing in more detail the role of each one of the core components of TE. In Table 9 the three prompts of Figures 2, 7, and 8 are compared for the WebNLG and NYT datasets under the use of two different LLMs, GPT-2 xl and LLaMA 65B. The three prompts yield similar micro-averaged F1 scores, with small deviations.

Figure 9 reports the prompt that we used in the 0.5-Shots setting. The prompt consists of a simple adaptation of the base prompt of Figure 2 to accommodate the additional triplets retrieved from the KB.

    Chain-of-Thought Prompt
    Some text is provided below. Procede step by step:
    - Identify a predicate expressed in the text
    - Identify the subject of that predicate
    - Identify the object of that predicate
    - Extract the corresponding (subject, predicate, object)
      knowledge triplet
    - Repeat until all predicates contained in the text have been extracted,
      but no more than {max_triplets} times
    ---------------------------------------------------------------------
    Text: {text}
    Triplets:

Figure 7: Prompt implementing the Chain-of-Thought approach (Wei et al., 2023a).

    Documented Prompt
    Some text is provided below. The text might contain one or more
    predicates expressing a relation between a subject and an object.
    The subject is the entity that takes or undergo the action expressed by
    the predicate.
    The object is the entity which is the factual object of the action.
    The information provided by each predicate can be summarized as a
    knowledge triplet of the form (subject, predicate, object).
    Extract all the information contained in the text in the form of
    knowledge triplets. Extract no more than {max_triplets} knowledge
    triplets.
    ---------------------------------------------------------------------
    Text: {text}
    Triplets:

Figure 8: Prompt providing more details about the core components of the TE task, namely, including definitions of subject, object, predicate, and triplet.

    Prompt               WebNLG     NYT
    GPT-2 xl
      base                0.037    0.0002
      documented          0.034    0.0003
      chain-of-thought    0.039    0.0004
      (std)               0.002    0.0001
    LLaMA-65b
      base                0.219    0.017
      documented          0.213    0.012
      chain-of-thought    0.219    0.015
      (std)               0.003    0.002

Table 9: Comparison of the 2-Shots TE micro-averaged F1 performance with the three different prompts of Figures 2, 7, and 8. The standard deviation of the performance across the three prompts is reported below each column.

    0.5-Shots Prompt
    Some text and some context triplets in the form
    (subject, predicate, object) are provided below.
    Firstly, select the context triplets that are relevant to the input text.
    Then, extract up to {max_triplets} knowledge triplets in the form
    (subject, predicate, object) contained in the text taking inspiration
    from the context triplets selected.
    ---------------------------------------------------------------------
    Text: {text}
    Context Triplets:
    {context_triplets}
    Triplets:

Figure 9: Adaptation of the base prompt found in Figure 2 to the 0.5-Shots setting. An additional {context_triplets} argument is included to accommodate the context triplets retrieved from the KB.
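As an illustration of how the retrieved context is injected into this prompt, a minimal sketch (hypothetical helper name; the released code may differ) follows:

    HALF_SHOT_PROMPT = """Some text and some context triplets in the form
    (subject, predicate, object) are provided below.
    Firstly, select the context triplets that are relevant to the input text.
    Then, extract up to {max_triplets} knowledge triplets in the form
    (subject, predicate, object) contained in the text taking inspiration
    from the context triplets selected.
    ---------------------------------------------------------------------
    Text: {text}
    Context Triplets:
    {context_triplets}
    Triplets:
    """

    def build_half_shot_prompt(text, context_triplets, max_triplets=5):
        # The context triplets retrieved from the KB are rendered one per line.
        rendered = "\n".join(f"({s}, {p}, {o})" for s, p, o in context_triplets)
        return HALF_SHOT_PROMPT.format(
            text=text, context_triplets=rendered, max_triplets=max_triplets
        )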
