Zero- and Few-Shots Knowledge Graph Triplet Extraction With LLM
3.2 LLMs as Triplet Generators

In order to perform TE, we can prompt LLMs to generate, for a given input sentence, a sequence of tokens corresponding to the set of entity-relation triplets {(e^i_s, r^i_p, e^i_o)}_{i=1}^{n}. As demonstrated by Wadhwa et al. (2023), Wei et al. (2023b), and Zhu et al. (2023), LLMs are, in principle, able to extract the knowledge triplets contained in a text without the need for task-specific training, under a suitable choice of prompt. In general, successful LLM prompts follow a fixed schema that provides a detailed explanation of what the task consists of, a clear indication of the sentence to process, and some hints or examples for the desired result.

In this work, we tested the use of three different prompts: a simple baseline and two slight variations of it. However, preliminary testing in TE showed no significant difference in the F1 scores among them. Therefore, we opted for using only the base prompt reported in Figure 2 in the main experiments. The details of the prompts tested and their results can be found in Appendix A.3.

3.3 KB-aided Triplet Extraction

In order to support LLMs in the TE task, we propose the pipeline illustrated in Figure 1. The pipeline augments the LLM with external KB information. In detail, for each input sentence, relevant context information contained in the KB is retrieved and attached to the LLM prompt described above. The context-enriched prompt is then fed to the LLM for the knowledge triplet generation.

We prepare the information coming from the KB in two different forms: either as simple context triplets

    T_c = { (e^i_s, r^i_p, e^i_o) ∈ G }_{i=1}^{N_KB},    (1)

or as sentence-triplets pairs

    E_c = { (S^i_c, T^i_c) }_{i=1}^{N_KB}.    (2)

The latter provides factual examples of triplets to be extracted for specific sentences. Note that we indicate with N_KB the number of triplets, respectively, sentence-triplets pairs retrieved from the KB. In the first case, augmentation is achieved by simply attaching the retrieved triplets T_c as an additional "Context Triplets" argument to the base prompt reported in Figure 2. For the second approach, instead, we substitute the two static examples provided in the base prompt with the input-relevant examples E_c retrieved from the KB.

Triplet Extraction Prompt
Some text is provided below. Extract up to {max_triplets} knowledge triplets in the form (subject, predicate, object) from the text.
---------------------------------------------------------------------
Examples:
Text: Abilene, Texas is in the United States.
Triplets:
(abilene texas, country, united states)
Text: The United States includes the ethnic group of African Americans and is the birthplace of Abrahm A Ribicoff who is married to Casey Ribicoff.
Triplets:
(abrahm a. ribicoff, spouse, casey ribicoff)
(abrahm a. ribicoff, birth places, united states)
(united states, ethnic group, african americans)
---------------------------------------------------------------------
Text: {text}
Triplets:

Figure 2: The base prompt we experimented with. At inference time the {text} and {max_triplets} variables are substituted with the sentence to process, respectively, the maximum number of triplets found in a sentence in the corresponding dataset.
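In practice, filling the two template variables and reading the triplets back out of the generated text amounts to a few lines. The following is a minimal sketch: the template abbreviates Figure 2, and the helper names and parsing regex are our own conventions, not part of the original pipeline.

import re

BASE_PROMPT = (
    "Some text is provided below. Extract up to {max_triplets} knowledge\n"
    "triplets in the form (subject, predicate, object) from the text.\n"
    "---------\n"
    "Examples:\n"
    "Text: Abilene, Texas is in the United States.\n"
    "Triplets:\n"
    "(abilene texas, country, united states)\n"
    "---------\n"
    "Text: {text}\n"
    "Triplets:\n"
)

def build_prompt(text, max_triplets):
    # Substitute the {text} and {max_triplets} variables of the base prompt (Figure 2).
    return BASE_PROMPT.format(text=text, max_triplets=max_triplets)

def parse_triplets(generation):
    # Collect every "(subject, predicate, object)" pattern emitted by the LLM.
    matches = re.finditer(r"\(([^()]*)\)", generation)
    return [tuple(field.strip() for field in m.group(1).split(","))
            for m in matches if m.group(1).count(",") == 2]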
The relevant context information to build the T_c triplets set for each input sentence is retrieved as follows. Given the KB, we isolate all the triplets (e^i_s, r^i_p, e^i_o) ∈ G contained therein and store them in a node-based vector store index (Liu, 2022). In detail, each node of this index corresponds to one and only one of the triplets and stores the embedding obtained by running a small-scale sentence encoder, MiniLM (Wang et al., 2020), on the corresponding (subject, predicate, object) string. To a first approximation, this should be enough to provide a meaningful embedding for each triplet. During inference (i.e., TE), we first encode the input sentence using MiniLM. We then compare the obtained sentence embedding with all the triplet embeddings contained in the index to retrieve the top N_KB triplets most similar to the input sentence. Out of this N_KB-dimensional sample, we further select the first two triplets for each relation type present in the sample. This is done to obtain a more diverse set of context triplets with a more homogeneous distribution over the relations. In some cases, indeed, there is a risk of obtaining a distribution highly biased towards a specific relation type, which is sub-optimal for those sentences that contain several different relationships.
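A minimal sketch of this retrieval step is given below. The paper builds the index with a vector store (Liu, 2022); here we emulate it directly with sentence-transformers and numpy, and the exact MiniLM checkpoint and helper names are our own illustrative choices.

import numpy as np
from sentence_transformers import SentenceTransformer

# Small-scale sentence encoder; the paper only specifies MiniLM (Wang et al., 2020),
# the checkpoint below is an assumption.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_triplet_index(kb_triplets):
    # One node per KB triplet: embed its "(subject, predicate, object)" string.
    texts = ["({}, {}, {})".format(s, p, o) for s, p, o in kb_triplets]
    return encoder.encode(texts, normalize_embeddings=True)

def retrieve_context(sentence, kb_triplets, index, n_kb=5, per_relation=2):
    # Encode the input sentence and rank all KB triplets by similarity;
    # embeddings are L2-normalized, so the dot product is the cosine similarity.
    query = encoder.encode([sentence], normalize_embeddings=True)[0]
    top = np.argsort(index @ query)[::-1][:n_kb]
    # Keep at most `per_relation` triplets per relation type, so that the
    # retrieved context is not dominated by a single relation.
    kept, seen = [], {}
    for i in top:
        s, p, o = kb_triplets[i]
        if seen.get(p, 0) < per_relation:
            kept.append((s, p, o))
            seen[p] = seen.get(p, 0) + 1
    return kept

Normalizing the embeddings lets a plain dot product act as cosine similarity, which is what vector-store indices typically rank by; the same helpers are reused in the sketches further below.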
Note that a similar procedure can be followed to prepare the E_c examples set. However, in this case, the focus shifts to the example sentences we wish to include. Namely, each node of the vector store index consists of both the example sentence and the KB triplets to be extracted from it. The embedding vector is then obtained by running the sentence encoder on either the example sentence alone or the sentence and triplets combined. As before, at inference time the top N_KB (sentence, triplets) pairs most similar to the input sentence are retrieved and included in the prompt as Few-Shots examples.
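The few-shot index can be built with the same encoder; the only design choice, as noted above, is which string each node embeds. A sketch under the same assumptions as before (kb_pairs and the flag name are ours):

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_example_index(kb_pairs, embed_triplets_too=False):
    # kb_pairs: list of (sentence, [(subject, predicate, object), ...]) from the KB.
    texts = []
    for sentence, triplets in kb_pairs:
        if embed_triplets_too:
            # Embed the sentence and its triplets combined.
            joined = " ".join("({}, {}, {})".format(s, p, o) for s, p, o in triplets)
            texts.append("{} {}".format(sentence, joined))
        else:
            # Embed the example sentence alone.
            texts.append(sentence)
    return encoder.encode(texts, normalize_embeddings=True)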
4 Experiments

In this section, we first provide details about the datasets and models we tested. This is followed by the presentation of the main results for the TE task.

4.1 Datasets and Models

In order to test the TE capabilities of a selected set of LLMs (see Table 2 for their comparison), we experimented with two standard benchmarks for the TE task: the aforementioned WebNLG (Gardent et al., 2017) and the New York Times (NYT) (Riedel et al., 2010) dataset (see Table 1 for their basic statistics). The former was initially proposed as a benchmark for the NLG task, but has successively been adapted to the TE task and included in the WebNLG challenge (Castro Ferreira et al., 2020). As the revision provided by Zheng et al. (2017) appears to be the most widely used in the literature, we decided to run our tests on that particular version of WebNLG. The NYT benchmark is a dataset created by distant supervision, aligning more than 1.8 million articles from the NYT newspaper with the Freebase KB. For each dataset, we used the training and validation splits to build the corresponding KB following the procedure outlined in Section 3.3.

Dataset   Train    Validation   Test    Relations   Max   Avg
WebNLG    5,019    500          703     171         7     2.29
NYT       56,195   5,000        5,000   24          22    1.72

Table 1: Statistics of the WebNLG and NYT datasets. The number of training, validation, and testing sentences is reported, together with the number of relation types in the dataset and the maximum and average number of triplets contained in a sentence.

Model                             Parameters [B]   Context
GPT-2 (Radford et al., 2019)      0.1 | 1.5        1,024
Falcon (Penedo et al., 2023)      7 | 40           2,048
LLaMA (Touvron et al., 2023)      13 | 65          2,048

Table 2: The number of parameters (in billions [B]) and context window size of the selected LLMs.

We selected the LLMs reported in Table 2 for testing. We ran all the models locally in their 8-bit quantized version provided by the HuggingFace (Wolf et al., 2020) library. We tested the use of OpenAI models through their provided API as well. However, as their results were often inconsistent and given the limited access and control we had over them, we decided to exclude these models from the main report. All the experiments regarding them can be found in Appendix A.1. The temperature was set to τ = 0.1 for all the experiments. We experimented with higher temperatures but observed that they were detrimental to the TE performance.
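For illustration, the sketch below loads one of the selected checkpoints in 8-bit quantized form through the HuggingFace stack and generates at τ = 0.1. The hub id and the generation length are our assumptions, not values reported in the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-13b"  # assumed hub id; the paper only names LLaMA-13b
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",
)

prompt = (
    "Some text is provided below. Extract up to 7 knowledge triplets\n"
    "in the form (subject, predicate, object) from the text.\n"
    "Text: Abilene, Texas is in the United States.\n"
    "Triplets:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, temperature=0.1, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))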
Model                                WebNLG   NYT
NovelTagging (Zheng et al., 2017)    0.283    0.420
CopyRE (Zeng et al., 2018)           0.371    0.587
GraphRel (Fu et al., 2019)           0.429    0.619
OrderCopyRE (Zeng et al., 2019)      0.616    0.721
UniRel (Tang et al., 2022)           0.947    0.937

Table 3: Micro-averaged F1 of some finetuned models selected from the literature.

                 WebNLG                NYT
Model            0.5-Shot   5-Shots    0.5-Shot   5-Shots
GPT-2   base     0.249      0.430      0.175      0.375
        xl       0.297      0.517      0.193      0.448
Falcon  7b       0.381      0.567      0.250      0.519
        40b      0.345      0.615      0.226      0.547
LLaMA   13b      0.374      0.609      0.247      0.582
        65b      0.377      0.677      0.243      0.647

Table 5: 0.5-Shot and 5-Shots micro-averaged F1 performance of the tested LLMs with the prompt of Figure 2 augmented with N_KB = 5 triplets, respectively, sentence-triplets pairs retrieved from the KB.
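For reference, the micro-averaged F1 reported in these tables treats every gold triplet across the whole test set as one instance. A minimal sketch of the metric, under the assumption of exact (subject, predicate, object) matching (the helper is ours):

def micro_f1(predicted, gold):
    # predicted, gold: one set of (subject, predicate, object) tuples per sentence.
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    total_pred = sum(len(p) for p in predicted)
    total_gold = sum(len(g) for g in gold)
    precision = tp / total_pred if total_pred else 0.0
    recall = tp / total_gold if total_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0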
The probability of finding the correct triplet among the retrieved KB context (c.f. Figure 3) turned out to be large: greater than 50% in the majority of the cases, and even approaching 70∼80% for the WebNLG dataset. This is symptomatic of the substantial overlap that exists between the training, validation, and test splits of both datasets, to the point that even a stochastic model that randomly sampled the triplets out of the retrieved KB context was able to achieve performance competitive with many of the LLMs and baselines of Table 3 in some cases (see Appendix A.2 for more details).

Figure 4: Probability that the correct triplet is present among the retrieved KB context for the WebNLG dataset, as in Figure 3, but with different scaled-down versions of the original KB.
4.6 Ablation Study

To further investigate the impact of the additional knowledge retrieved from the KB, we revisit in this section the performance of one of our best-performing LLMs, LLaMA-65b. In detail, we construct a scaled-down version of the KB by randomly sampling from the original training and validation splits, keeping only a fraction of the original sentences and triplets. For this reduced KB, the probability of having the correct triplet answer already within the retrieved information is reduced (c.f. Figure 4). This allows us to evaluate how the accuracy of the model is impacted by the quality of the retrieved data.
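The probability shown in Figure 4 can be estimated empirically as a simple hit rate over the evaluation sentences. A sketch reusing retrieve_context and the triplet index from the Section 3.3 sketch (the evaluation loop and names are ours):

def hit_probability(eval_pairs, kb_triplets, index, n_kb=5):
    # eval_pairs: list of (sentence, gold_triplets) tuples.
    # Counts how often at least one gold triplet appears among the n_kb
    # context triplets retrieved from the (possibly scaled-down) KB.
    hits = 0
    for sentence, gold in eval_pairs:
        retrieved = retrieve_context(sentence, kb_triplets, index, n_kb=n_kb)
        hits += bool(set(retrieved) & set(gold))
    return hits / len(eval_pairs)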
We decided to conduct this test on the WebNLG dataset. As P(N_KB) for the full-scale KB is larger for WebNLG than for the NYT dataset (c.f. Figure 3), a wider range of values can be explored; nonetheless, a preliminary test on the NYT dataset yielded similar results. In Figure 5a we report the variation of the final F1 score obtained by LLaMA-65b with prompts augmented by N_KB = 5 triplets and sentence-triplets pairs gathered from a KB of different scales S = 0, 0.1, 0.25, 0.5, 1. Here, the scale refers to the fraction of left-over data from the original KB. Note that S = 0 corresponds to the empty-KB case, in which no context can be retrieved (the corresponding retrieved context for NYT, c.f. Figure 3, is discussed below). The different KB sizes correspond to the probability curves of Figure 4 evaluated at N_KB = 5. We observe that the performance degrades as the probability P_S(N_KB) shrinks with decreasing S, as expected. In particular, the relation appears to be linear:

    F1_triplets ∼ 0.25 · P_S(N_KB = 5) + 0.21,
    F1_sentence-triplets ∼ 0.55 · P_S(N_KB = 5) + 0.21,

with measured determination coefficients r^2 = 0.98 and r^2 = 0.96, respectively. This suggests that there is a strong correlation between the TE capabilities of the model and the quality of the retrieved data.

Furthermore, we investigated how the final TE performance scales with the size of the model. In Figure 5b, the F1 score is plotted against the number of parameters N_par in log scale for all the models we tested. The plot includes the results obtained for both the WebNLG and NYT datasets, for all settings considered. We observe that for each of the three settings, the models' performance grows linearly in log scale with respect to their sizes. The scaling in the number of parameters N_par in log scale can be approximated by

    F1_norm ∼ m · log N_par.    (3)

The slope parameters of the linear fit for WebNLG are m = 0.0456, 0.0304, and 0.0871 for, respectively, the 2-Shot, 0.5-Shot(KB), and 5-Shots(KB) settings; for NYT the corresponding parameters are m = 0.0028, 0.0257, and 0.0906. The determination coefficients for the WebNLG and NYT datasets are, respectively, r^2 = 0.67, 0.62, and 0.97, and r^2 = 0.18, 0.7, and 0.90. Interestingly, the F1 score increase with the size of the model is steeper for the few-shots prompt (c.f. Figure 5b, right). This suggests that larger models might be more capable of making use of several examples included inside the prompt.

Therefore, the F1 score, and thus the TE accuracy, appears to scale linearly with the size of the KB (c.f. Figure 5a), but only logarithmically with the size of the model (c.f. Figure 5b). This suggests that it could be better to invest resources in improving the quality of the KB and its associated information retriever, rather than in training larger models.
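As a concrete illustration of Eq. (3), the slope and determination coefficient can be obtained with an ordinary least-squares fit in log-parameter space. Here we use the 5-Shots WebNLG column of Table 5 together with the model sizes of Table 2; a base-10 logarithm matches the magnitude of the reported slopes, but since the paper fits the normalized F1 across settings, the value below only approximately reproduces m = 0.0871.

import numpy as np

# Model sizes (Table 2) and 5-Shots(KB) WebNLG F1 scores (Table 5).
n_par = np.array([0.1, 1.5, 7, 13, 40, 65]) * 1e9
f1 = np.array([0.430, 0.517, 0.567, 0.609, 0.615, 0.677])

m, b = np.polyfit(np.log10(n_par), f1, deg=1)   # F1 ≈ m · log N_par + b, c.f. Eq. (3)
pred = m * np.log10(n_par) + b
r2 = 1.0 - np.sum((f1 - pred) ** 2) / np.sum((f1 - f1.mean()) ** 2)
print("m = {:.4f}, r^2 = {:.2f}".format(m, r2))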
Figure 5: (Left) Triplets (orange) and sentence-triplets (green) KB-augmented performance of the LLaMA-65b model with different scaled-down versions of the KB built for WebNLG, S = 0, 0.1, 0.25, 0.5, 1. The F1 score is plotted against the probability of retrieving the correct triplet with N_KB = 5 for each S (namely P(N_KB = 5) for each curve of Figure 4). (Right) F1 score obtained by the tested models, plotted against the corresponding log of the number of parameters, for WebNLG (blue) and NYT (orange) in the three settings: 2-Shot, 0.5-Shot with KB triplets (N_KB = 5), and 5-Shots with KB sentence-triplets pairs (N_KB = 5). The outliers (GPT-4 and GPT-3.5 turbo) are shown in green.
References

Kartik Detroja, C.K. Bhensdadia, and Brijesh S. Bhatt. 2023. A survey on relation extraction. Intelligent Systems with Applications, 19:200244.

Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1409–1418, Florence, Italy. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint, arXiv:1910.10683.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.
Wei Tang, Benfeng Xu, Yuyue Zhao, Zhendong Mao, Yifeng Liu, Yong Liao, and Haiyong Xie. 2022. UniRel: Unified representation and interaction for joint relational triple extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7087–7099, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. Preprint, arXiv:2302.13971.

Somin Wadhwa, Silvio Amir, and Byron C. Wallace. 2023. Revisiting relation extraction in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15566–15589, Toronto, Canada. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Preprint, arXiv:2002.10957.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023a. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. 2023b. Zero-shot information extraction via chatting with ChatGPT. Preprint, arXiv:2302.10205.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377, Hong Kong, China. Association for Computational Linguistics.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514, Melbourne, Australia. Association for Computational Linguistics.

Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236, Vancouver, Canada. Association for Computational Linguistics.

Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. Preprint, arXiv:2305.13168.

A Appendix

A.1 OpenAI Models results
We report here the results obtained by the OpenAI models listed in Table 6. We ran them remotely through the OpenAI API, always setting a temperature τ = 0.1. We were not able to find any information regarding the parameter precision they used. Note that GPT-3.5 and GPT-4 are instructed models. For some experiments we also tested the use of text-davinci-002, which is a non-instructed model based on GPT-3 and, apparently, the only base variant OpenAI provides through their API.

Model                                     Parameters [B]   Context
text-davinci-002 (Brown et al., 2020b)    175*             2,048
GPT-3.5 (Brown et al., 2020b)             175*             4,096
GPT-4 (OpenAI, 2023)                      1,760*           8,192

Table 6: The number of parameters (in billions [B]) and context window size of the OpenAI LLMs. We indicate by * the numbers that are not officially confirmed.
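For reference, the remote calls were of the following form under the legacy (pre-1.0) openai Python client, which was the interface contemporary with these experiments; build_prompt is the helper from the earlier sketch, and the max_tokens value is our assumption.

import openai  # legacy 0.x client interface

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=build_prompt("Abilene, Texas is in the United States.", max_triplets=7),
    temperature=0.1,   # τ = 0.1, as for all the experiments reported
    max_tokens=128,
)
print(response["choices"][0]["text"])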
Tables 7 and 8 report the results obtained by the OpenAI models on the WebNLG and NYT datasets in all the different settings. The comparison with the other models, c.f. Tables 4 and 5, shows them to be comparable to the Falcon 40B model in the majority of the cases. However, they recorded very underwhelming results on the NYT dataset, both in the 0.5-Shot and Few-Shots settings. Our manual inspection of the triplets they provided as an answer suggested that they were less keen to adhere to the entities and relations appearing in the provided KB context, often paraphrasing or reformulating them in a more prolix form that lowered the accuracy. This might be a consequence of the instructed training they had gone through, as discussed in Section 4.1. In contrast, the results provided by the non-instructed text-davinci-002 adhered more closely to the retrieved context, in line with its stronger NYT performance in Table 8.

                     WebNLG                NYT
Model                0.5-Shot   5-Shots    0.5-Shot   5-Shots
text-davinci-002     0.403      0.491      0.144      0.418
GPT-3.5              0.336      0.520      0.088      0.184
GPT-4                0.394      0.510      0.096      0.151

Table 8: 0.5-Shot and 5-Shots micro-averaged F1 performance of the OpenAI LLMs tested with the prompt of Figure 2 augmented with N_KB = 5 triplets, respectively, sentence-triplets pairs retrieved from the KB.

[Figure: F1 of the Random baseline and LLaMA-65b, together with the fit of Eq. (3), for the triplets and sentence-triplets settings.]
Documented Prompt
Some text is provided below. The text might contain one or more predicates expressing a relation between a subject and an object.
The subject is the entity that takes or undergoes the action expressed by the predicate.
The object is the entity which is the factual object of the action.
The information provided by each predicate can be summarized as a knowledge triplet of the form (subject, predicate, object).
Extract all the information contained in the text in the form of knowledge triplets. Extract no more than {max_triplets} knowledge triplets.
---------------------------------------------------------------------
Text: {text}
Triplets:

Figure 8: Prompt providing more details about the core components of the TE task, namely, including definitions of subject, object, predicate, and triplet.

[Table: F1 of the prompts tested, for GPT-2 xl and LLaMA-65b, on WebNLG and NYT.]

Model        Prompt             WebNLG   NYT
GPT-2 xl     base               0.037    0.0002
             documented         0.034    0.0003
             chain-of-thought   0.039    0.0004
LLaMA-65b    …                  0.002    0.0001
0.5-Shots Prompt
Some text and some context triplets in the form (subject, predicate, object) are provided below.
Firstly, select the context triplets that are relevant to the input text.
Then, extract up to {max_triplets} knowledge triplets in the form (subject, predicate, object) contained in the text, taking inspiration from the context triplets selected.
---------------------------------------------------------------------
Text: {text}
Context Triplets:
{context_triplets}
Triplets:
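Assembling this prompt from the retrieved context is then mechanical. A sketch (the template is abbreviated and the helper name is ours):

HALF_SHOT_PROMPT = (
    "Some text and some context triplets in the form\n"
    "(subject, predicate, object) are provided below.\n"
    "Firstly, select the context triplets that are relevant to the input text.\n"
    "Then, extract up to {max_triplets} knowledge triplets in the form\n"
    "(subject, predicate, object) contained in the text taking inspiration\n"
    "from the context triplets selected.\n"
    "---------\n"
    "Text: {text}\n"
    "Context Triplets:\n"
    "{context_triplets}\n"
    "Triplets:\n"
)

def build_half_shot_prompt(text, context_triplets, max_triplets):
    # context_triplets: the T_c set retrieved from the KB for this sentence.
    context = "\n".join("({}, {}, {})".format(s, p, o) for s, p, o in context_triplets)
    return HALF_SHOT_PROMPT.format(
        text=text, context_triplets=context, max_triplets=max_triplets
    )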