KGLM: Integrating Knowledge Graph Structure in Language Models For Link Prediction

arXiv:2211.02744v2 [cs.CL] 17 May 2023

Abstract

The ability of knowledge graphs to represent complex relationships at scale has led to their adoption in many applications, including knowledge representation, question-answering, and recommendation systems. Knowledge graphs are often incomplete in the information they represent, necessitating the need for knowledge graph completion tasks. Pre-trained and fine-tuned language models have shown promise in these tasks, although these models ignore the intrinsic information encoded in the knowledge graph, namely the entity and relation types. In this work, we propose the Knowledge Graph Language Model (KGLM) architecture, where we introduce a new entity/relation embedding layer that learns to differentiate distinctive entity and relation types, therefore allowing the model to learn the structure of the knowledge graph. We show that further pre-training the language models with this additional embedding layer using the triples extracted from the knowledge graph, followed by the standard fine-tuning phase, sets a new state-of-the-art performance for the link prediction task on the benchmark datasets.

[Figure 1: Sample knowledge graph with 6 triples. The graph contains three unique entity types (circle for person, triangle for company, and square for location) and 5 unique relation types, or 10 if considering both the forward and inverse relations. The task of knowledge graph completion is to complete the missing links in the graph, e.g., (Bill Gates, bornIn?, Washington), using the existing knowledge graph.]

1 Introduction

Knowledge graph (KG) is defined as a directed, multi-relational graph where entities (nodes) are connected with one or more relations (edges) (Wang et al., 2017). It is represented with a set of triples, where a triple consists of (head entity, relation, tail entity), or (h, r, t) for short, for example (Bill Gates, founderOf, Microsoft) as shown in Figure 1. Due to their effectiveness in identifying patterns among data and gaining insights into the mechanisms of action, associations, and testable hypotheses (Li and Chen, 2014; Silvescu et al., 2012), both manually curated KGs like DBpedia (Auer et al., 2007), WordNet (Miller, 1998), KIDS (Youn et al., 2022), and CARD (Alcock et al., 2020), and automatically curated ones like FreeBase (Bollacker et al., 2008), Knowledge Vault (Dong et al., 2014), and NELL (Carlson et al., 2010) exist. However, these KGs often suffer from incompleteness. For example, 71% of the people in FreeBase have no known place of birth (West et al., 2014). To address this issue, knowledge graph completion (KGC) methods aim at connecting the missing links in the KG.

Graph feature models like the path ranking algorithm (PRA) (Lao and Cohen, 2010; Lao et al., 2011) attempt to solve KGC tasks by extracting features from the observed edges over the KG to predict the existence of a new edge (Nickel et al., 2015). For example, the existence of the path Jennifer Gates --daughterOf--> Melinda French <--divorcedWith-- Bill Gates in Figure 1 can be used as a clue to infer the triple (Jennifer Gates, daughterOf, Bill Gates). Other popular types of models are latent feature models such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014), and RotatE (Sun et al., 2019), where entities and relations are converted into a latent space using embeddings. TransE, a representative latent feature model, models the relationship between the entities by interpreting it as a translational operation. That is, the model optimizes the embeddings by enforcing the vector operation of head entity embedding h plus the relation embedding r to be close to the tail entity embedding t for a given fact in the KG, or simply h + r ≈ t.

Recently, pre-trained language models like BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have shown state-of-the-art performance across natural language processing (NLP) tasks. As a natural extension, models like KG-BERT (Yao et al., 2019) and BERTRL (Zha et al., 2021) that utilize these pre-trained language models by treating a triple in the KG as a textual sequence, e.g., (Bill Gates, founderOf, Microsoft) as 'Bill Gates founder of Microsoft', have also shown state-of-the-art results on the downstream KGC tasks. Although such textual encoding (Wang et al., 2021) models are generalizable to unseen entities or relations (Zha et al., 2021), they still fail to learn the intrinsic structure of the KG as the models are only trained on the textual sequence. To solve this issue, a hybrid approach like StAR (Wang et al., 2021) has recently been proposed to take advantage of both latent feature models and textual encoding models by enforcing a translation-based graph embedding approach to train the textual encoders. Yet, current textual encoding models still suffer from entity ambiguity problems (Cucerzan, 2007), where an entity Apple, for example, can refer to either the company Apple Inc. or the fruit. Moreover, there is no way to distinguish the forward relation (Jennifer Gates, daughterOf, Melinda French) from the inverse relation (Melinda French, daughterOf^-1, Jennifer Gates).

In this paper, we propose the Knowledge Graph Language Model (KGLM) (Figure 2), a simple yet effective language model pre-training approach that learns from both the textual and structural information of the knowledge graph. We continue pre-training a language model that has already been pre-trained on other large natural language corpora, using a corpus generated by converting the triples in the knowledge graph into textual sequences, while enforcing the model to better understand the underlying graph structure by adding an additional entity/relation-type embedding layer. Testing our model on the WN18RR dataset for the link prediction task shows that our model improved the mean rank by 21.2% compared to the previous state-of-the-art method (40.18 vs. 51, respectively). All code and instructions on how to reproduce the results are available online.¹

¹ https://github.com/ibpa/KGLM

[Figure 2: Proposed pre-training approach of the KGLM. First, both the forward and inverse triples are extracted from the knowledge graph to serve as the pre-training corpus. We then continue pre-training the language model, RoBERTa in our case, using the masked language model training objective, with an additional entity/relation-type embedding layer. The entity/relation-type embedding scheme shown here corresponds to the KGLM_GER, the most fine-grained version where both the entity and relation types are considered unique. Note that the inverse relation denoted by -1 is different from its forward counterpart. For demonstration purposes, we assume all entities and relations to have a single token.]

2 Background

Link Prediction. The link prediction (LP) task, one of the commonly researched knowledge graph completion tasks, attempts to predict the missing head entity (h) or tail entity (t) of a triple (h, r, t) given a KG G = (E, R), where h, t ∈ E (the set of all entities) and r ∈ R (the set of all relations). Specifically, given a single positive test triple (h, r, t), its corresponding link prediction test dataset can be constructed by corrupting either the head or the tail entity in the filtered setting (Bordes
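The translational objective h + r ≈ t can be sketched in a few lines of NumPy. The vectors below are illustrative placeholders rather than trained embeddings; in practice they would be learned by minimizing a margin-based ranking loss.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    Scores closer to 0 indicate a more plausible triple."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
dim = 50
# Toy vectors standing in for trained embeddings of a known fact,
# e.g., (Bill Gates, founderOf, Microsoft).
h = rng.normal(size=dim)
r = rng.normal(size=dim)

t_true = h + r + rng.normal(scale=0.01, size=dim)  # near h + r, as trained
t_random = rng.normal(size=dim)                    # an unrelated entity

assert transe_score(h, r, t_true) > transe_score(h, r, t_random)
```

After training, candidate tails for a query (h, r, ?) are simply ranked by this score.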
Table 1: Statistics of the benchmark knowledge graphs used for link prediction.

Dataset     # ent    # rel   # train   # val    # test
WN18RR      40,943   11      86,835    3,034    3,134
FB15k-237   14,951   237     272,115   17,535   20,466
UMLS        135      46      5,216     652      661

et al., 2013) as

D_LP^(h,r,t) = {(h, r, t′) | t′ ∈ (E − {h, t}) ∧ (h, r, t′) ∉ D}
             ∪ {(h′, r, t) | h′ ∈ (E − {h, t}) ∧ (h′, r, t) ∉ D}
             ∪ {(h, r, t)},   (1)
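The filtered construction of Eq. (1) can be sketched directly with sets; the entity and triple names below are illustrative.

```python
def lp_candidates(triple, entities, known_triples):
    """Filtered link-prediction candidate set of Eq. (1): corrupt the
    tail, then the head, dropping corruptions that are themselves known
    triples, and keep the positive triple itself."""
    h, r, t = triple
    candidates = {(h, r, t2) for t2 in entities - {h, t}
                  if (h, r, t2) not in known_triples}
    candidates |= {(h2, r, t) for h2 in entities - {h, t}
                   if (h2, r, t) not in known_triples}
    candidates.add((h, r, t))
    return candidates

entities = {"BillGates", "MelindaFrench", "JenniferGates", "Microsoft"}
known = {("BillGates", "founderOf", "Microsoft"),
         ("MelindaFrench", "founderOf", "Microsoft")}
cands = lp_candidates(("BillGates", "founderOf", "Microsoft"),
                      entities, known)
# 2 tail corruptions + 1 surviving head corruption + the positive = 4;
# the known triple (MelindaFrench, founderOf, Microsoft) is filtered out.
assert len(cands) == 4
```

The positive triple is then ranked against every candidate in this set.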
where D = D_train ∪ D_val ∪ D_test is the complete dataset. Evaluation of the link prediction task is measured with mean rank (MR), mean reciprocal rank (MRR), and hits@N (Rossi et al., 2021). MR is defined as

MR = ( Σ_{(h,r,t) ∈ D_test} rank((h, r, t) | D_LP^(h,r,t)) ) / |D_test|,   (2)

where rank(·|·) is the rank of the positive triple among its corrupted versions and |D_test| is the number of positive test triples. MRR is the same as MR except that the reciprocal rank 1/rank(·|·) is used. Hits@N is defined as

hits@N = ( Σ_{(h,r,t) ∈ D_test} 1[rank((h, r, t) | D_LP^(h,r,t)) ≤ N] ) / |D_test|,   (3)

where 1[·] is the indicator function and N ∈ {1, 3, 10} is commonly reported. Higher MRR and hits@N values are better, whereas, for MR, lower values denote higher performance.

3 Proposed Approach

In this work, we propose to continue pre-training, instead of pre-training from scratch, the language model RoBERTa_LARGE (Liu et al., 2019) that has already been trained on English-language corpora of varying sizes and domains, using both the forward and inverse knowledge graph textual sequences (Figure 2). Following the convention used in KG-BERT and StAR (see Appendix A), we use a textual representation of a given triple, e.g., (Bill Gates, founderOf, Microsoft) as 'Bill Gates founder of Microsoft', to generate the pre-training corpus. However, instead of extracting only the forward triple as done in the previous work, we extract both the forward and inverse versions of the triple, e.g., (Jennifer Gates, daughterOf, Bill Gates) and (Bill Gates, daughterOf^-1, Jennifer Gates), where the ^-1 notation denotes the inverse direction of the corresponding relation.

To enforce the model to learn the knowledge graph structure, we introduce a new embedding layer, the entity/relation-type embedding (ER-type embedding), in addition to the pre-existing token and position embeddings of RoBERTa, as shown in Figure 2. This additional layer aims to embed the tokens in the input sequence with their corresponding entity/relation-type, where the set of entities E in the knowledge graph can have t_E different entity types depending on the schema of the knowledge graph (e.g., t_E = 3 for person, company, and location in Figure 1). Note that many knowledge graphs do not specify the entity types, in which case t_E = 1. For the set of relations R, there exist t_R = 2n_R relation types, where n_R is the number of unique relations in the knowledge graph and the multiplier of 2 comes from the forward and inverse directions (e.g., t_R = 10 for the sample knowledge graph in Figure 1).

In this work, we propose three different variations of ER-type embeddings. KGLM_Base is the simplified version where all entities are assigned a single entity type and relations are assigned either the forward or inverse relation type regardless of their unique relation types, resulting in a total of 3 ER-type embeddings. KGLM_GR is a version with granular relation types, with t_R + 1 ER-type embeddings. KGLM_GER is the most granular version, where we utilize all t_E + t_R ER-type embeddings; in other words, all entity types as well as all relation types, including both directions, are considered.

To be specific, we convert a triple (h, r, t) to a sequence of tokens w^(h,r,t) = ⟨[s] w_a^h w_b^r w_c^t [/s] : a ∈ {1..|h|}, b ∈ {1..|r|}, c ∈ {1..|t|}⟩ ∈ R^(|h|+|r|+|t|+2), where [s] and [/s] are special tokens denoting the beginning and end of the sequence, respectively. The input to the RoBERTa model is then constructed by adding the ER-type embedding t^(h,r,t) and the position embeddings p^(h,r,t) to the
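Given the rank of each positive test triple among its candidates, the three evaluation metrics above can be computed directly; the ranks below are illustrative.

```python
def mr(ranks):
    """Mean rank, Eq. (2): average rank of the positive triples."""
    return sum(ranks) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """hits@N, Eq. (3): fraction of positives ranked within the top N."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 12, 2, 40]          # rank of each positive test triple
assert mr(ranks) == 11.6
assert hits_at(ranks, 10) == 0.6   # 3 of the 5 ranks are within the top 10
```

Lower MR and higher MRR/hits@N indicate better performance, as noted above.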
Table 2: Link prediction results on the benchmark datasets WN18RR, FB15k-237, and UMLS. Bold numbers denote the best performance for a given metric and class of models. Underlined numbers denote the best performance for a given metric regardless of the model type. Note that we do not report KGLM_GER performance since the tested datasets do not specify entity types in their schema.

Table 3: Ablation results on WN18RR comparing the two claims against KGLM_GR.

Model     Continue      ER-type embeddings       Hits@1   Hits@3   Hits@10   MR      MRR
          pre-training  Pre-train   Fine-tune
Claim 1   o             x           x            0.331    0.529    0.728     53.5    0.462
Claim 2   x             -           o            0.322    0.489    0.672     66.4    0.439
KGLM_GR   o             o           o            0.330    0.538    0.741     40.18   0.467
and AdamW optimizer (Loshchilov and Hutter, 2017). For fine-tuning training data, we sampled 10 negative triples for each positive triple by corrupting the head and tail entity 5 times each. We used the validation set to find the optimal learning rate ∈ {1e-06, 5e-07}, batch size ∈ {16, 32}, epochs ∈ {1, 2, 3, 4, 5} for WN18RR and FB15k-237 and ∈ {25, 50, 75, 100} for UMLS, and α from 0.0 to 1.0 with an increment of 0.1. For all experiments, we set α = 0.5 based on the WN18RR validation set performance. Both pre-training and fine-tuning were performed on 3 × Nvidia Quadro RTX 6000 GPUs in a distributed manner using 16-bit mixed precision and the DeepSpeed (Rasley et al., 2020; Rajbhandari et al., 2020) library in the stage-2 setting. We used the Transformers library (Wolf et al., 2019).

4.3 Link Prediction Results

The hypothesis behind the KGLM was that learning the ER-type embedding layers in the pre-training stage, using the corpus generated from the knowledge graph, followed by fine-tuning, yields the best performance. To test this hypothesis, we broke it down into two separate claims. For the first claim, we only continued pre-training RoBERTa_LARGE followed by fine-tuning, without the ER-type embeddings. This test removes the contribution of the ER-type embeddings and solely tests the performance gained by further pre-training the model with the knowledge graph as input. Table 3 shows that claim 1 falls behind KGLM_GR in all metrics except for hits@1 (0.331 vs. 0.330, respectively). For the second claim, we did not continue pre-training and instead used the RoBERTa_LARGE pre-trained weights as-is; we then learned the ER-type embeddings in the fine-tuning stage. This test shows whether the ER-type embeddings can be learned only during the fine-tuning stage. Table 3 shows that KGLM_GR outperforms the second claim on all metrics. This result shows that the combination of these two claims works in a non-linear fashion to maximize performance.

The results of performing link prediction on the benchmark datasets are shown in Table 2. Compared to StAR, which had the best performance on MR and hits@10 on WN18RR, KGLM_GR outperformed it on all metrics, with a 21.2% improved MR (40.18 vs. 51, respectively) and a 4.5% increased hits@10 (0.741 vs. 0.709, respectively). Although still inferior to the graph embedding approaches, KGLM_GR has a 35.8% improved hits@1 compared to the best language model-based approach, StAR (0.330 vs. 0.243, respectively). Across all model types, KGLM_GR has the best performance on all metrics for WN18RR except for hits@1. Although we did not observe any improvement compared to StAR for the FB15k-237 dataset, we had the best performance on all metrics for UMLS, with a 21.2% improved MR compared to ComplEx (1.19 vs. 1.51, respectively). KGLM_GR outperformed KGLM_Base in all metrics.

5 Conclusion

In this work, we presented KGLM, which introduces a new entity/relation (ER)-type embedding layer for learning the structure of the knowledge graph. Compared to the previous language model-based methods that only fine-tune for a given task, we found that learning the ER-type embeddings in the pre-training stage followed by fine-tuning resulted in better performance. In future work, we plan to further test the version of KGLM that takes into account entity types, KGLM_GER, on domain-specific knowledge graphs like KIDS (Youn et al., 2022) that specify entity types in their schema.

Limitations

Although KGLM outperforms state-of-the-art models when the training set includes full sentences (e.g., UMLS and WN18RR), the model performed similarly to the state-of-the-art in cases where the training dataset had only ontological relationships, such as the /music/artist/origin relation present in the FB15k-237 dataset. One major limitation of the proposed method is the long training and inference time, which we plan to alleviate by adopting Siamese-style textual encoders (Wang et al., 2021; Li et al., 2022) in future work.

Ethics Statement

The authors declare no competing interests.

References

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
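The negative sampling used above for fine-tuning (5 head and 5 tail corruptions per positive triple) might be sketched as follows; the entity names are illustrative placeholders.

```python
import random

def sample_negatives(triple, entities, known_triples, per_side=5, seed=0):
    """Corrupt the head and the tail `per_side` times each, skipping
    corruptions that are themselves known (positive) triples."""
    h, r, t = triple
    rng = random.Random(seed)
    pool = list(entities - {h, t})
    negatives = []
    for corrupt_head in (True, False):
        count = 0
        while count < per_side:
            e = rng.choice(pool)
            neg = (e, r, t) if corrupt_head else (h, r, e)
            if neg not in known_triples:
                negatives.append(neg)
                count += 1
    return negatives

entities = {f"e{i}" for i in range(20)}
negs = sample_negatives(("e0", "r0", "e1"), entities,
                        {("e0", "r0", "e1")})
assert len(negs) == 10  # 5 corrupted heads + 5 corrupted tails
```

Each positive thus contributes an 11-triple mini-batch (1 positive, 10 negatives) to the binary triple-classification objective.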
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

George A Miller. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.

Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, and Paolo Merialdo. 2021. Knowledge graph embedding for link prediction: A comparative analysis. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2):1–49.

Adrian Silvescu, Doina Caragea, and Anna Atramentov. 2012. Graph databases. Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66.

Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung, et al. 2019. A capsule network-based embedding model for knowledge graph completion and search personalization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2180–2189.

Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Ying Wang, and Yi Chang. 2021. Structure-augmented text representation learning for efficient knowledge graph completion. In Proceedings of the Web Conference 2021, pages 1737–1748.

Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.

Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In Proceedings of the 23rd International Conference on World Wide Web, pages 515–526.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.

Jason Youn, Navneet Rai, and Ilias Tagkopoulos. 2022. Knowledge integration and decision support for accelerated discovery of antibiotic resistance genes. Nature Communications, 13(1):1–11.

Hanwen Zha, Zhiyu Chen, and Xifeng Yan. 2021. Inductive relation prediction by BERT. arXiv preprint arXiv:2103.07102.

A Previous Work

A.1 KG-BERT

KG-BERT (Yao et al., 2019) is a fine-tuning method that utilizes the base version of the pre-trained language model BERT (BERT_BASE) (Devlin et al., 2018) as an encoder for entities and relations of the knowledge graph. Specifically, KG-BERT first converts a triple (h, r, t) to a sequence of tokens w^(h,r,t) = ⟨[CLS] w_a^h [SEP] w_b^r [SEP] w_c^t [SEP] : a ∈ {1..|h|}, b ∈ {1..|r|}, c ∈ {1..|t|}⟩, where w_n denotes the nth token of either entity or relation, [CLS] and [SEP] are the special tokens, while |h|, |r|, and |t| denote the number of tokens in the head entity, relation, and tail entity, respectively. This textual token sequence is then converted to a sequence of token embeddings w^(h,r,t) ∈ R^(d×(|h|+|r|+|t|+4)), where d is the dimension of the embeddings and 4 is from the special tokens. Then the segment embeddings s^(h,r,t) = ⟨(s_e)^×(|h|+2) (s_r)^×(|r|+1) (s_e)^×(|t|+1)⟩, where s_e and s_r are used to differentiate entities from relations, respectively, as well as the position embeddings p^(h,r,t) = ⟨p_i : i ∈ {1..(|h|+|r|+|t|+4)}⟩, are added to the token embeddings w^(h,r,t) to form a final input representation X^(h,r,t) ∈ R^(d×(|h|+|r|+|t|+4)) that is fed to BERT as input. Then, the score of how likely a given triple (h, r, t) is to be true is computed by

score_KG-BERT(h, r, t) = SeqCls(X^(h,r,t)).   (6)

KG-BERT significantly improved the MR of the link prediction task compared to the previous state-of-the-art approach CapsE (Vu et al., 2019) (97 compared to 719, an 86.5% decrease), but suffered from a poor hits@1 of 0.041 due to the entity ambiguity problem and the lack of structural learning (Wang et al., 2021; Cucerzan, 2007).
A.2 StAR

StAR (Wang et al., 2021) is a hybrid model that learns both the contextual and structural information of the knowledge graph by augmenting the structured knowledge in the encoder. It divides a triple into two parts, (h, r) and (t), and applies a Siamese-style transformer with a sequence classification head to generate u = Pool(X^(h,r)) ∈ R^(d×(|h|+|r|+3)) and v = Pool(X^(t)) ∈ R^(d×(|t|+2)), respectively, where Pool(·) is the output of RoBERTa's pooling layer. The first scoring module focuses on classifying the triple by applying a