Graph Language Models
Abstract
While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched.
(a) Original GoT. (b) Extended Levi graph of GoT (with relative distances P for dog).
Figure 2: Example of graph preprocessing in our GLM. Fig. 2b shows relative distances for dog, i.e., when dog is attending to other tokens. The red Graph-to-Graph (G2G) connections only exist for the gGLM, not for the ℓGLM.
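For illustration, the sketch below builds an extended Levi graph from (head, relation, tail) triplets with networkx and derives graph-level relative distances via shortest paths. It is a simplified, hypothetical rendering under our own assumptions (one dedicated node per relation mention, unit hop lengths, token-level offsets omitted); it is not the GLM implementation itself.

# Minimal sketch (not the paper's implementation): build an extended Levi graph
# from knowledge-graph triplets and compute hop distances between nodes, which
# can serve as coarse graph-level relative positions as in Fig. 2b.
import networkx as nx

triplets = [
    ("dog", "chases", "cat"),    # illustrative triplets; "dog"/"cat" follow Fig. 2
    ("cat", "is a", "animal"),
    ("dog", "is a", "animal"),
]

def build_levi_graph(triplets):
    """Turn each triplet (h, r, t) into h -> r-node -> t, so relations become nodes."""
    g = nx.DiGraph()
    for i, (h, r, t) in enumerate(triplets):
        rel_node = f"{r}#{i}"                    # one node per relation mention
        g.add_node(h, kind="concept")
        g.add_node(t, kind="concept")
        g.add_node(rel_node, kind="relation", label=r)
        g.add_edge(h, rel_node)
        g.add_edge(rel_node, t)
    return g

def relative_distances(g, source):
    """Hop distances from `source` to all reachable nodes (undirected view)."""
    return nx.single_source_shortest_path_length(g.to_undirected(), source)

levi = build_levi_graph(triplets)
print(relative_distances(levi, "dog"))           # e.g., {'dog': 0, 'chases#0': 1, ...}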
                 graph radius r (increasing →)                         mask size m (increasing →)
Linear Probing
gGLM       55.4±0.3  58.6±0.7  58.8±0.6  59.3±0.7  59.5±0.4   41.8±0.8  25.6±0.9  22.0±0.6  19.4±0.5  17.0±0.2
T5 (list)  53.7±0.3  56.8±1.1  56.5±1.2  55.8±0.6  55.3±0.5   20.3±0.6  19.9±0.4  15.3±0.6  14.0±1.1  10.2±1.2
T5 (set)   53.1±0.6  52.8±1.2  54.6±0.6  53.9±0.5  53.1±0.8   18.2±0.6  16.7±0.5  13.1±0.7  12.3±0.6   9.7±0.9

Finetuning
ℓGLM       64.0±1.3  64.0±1.0  64.4±0.7  64.1±0.9  64.2±1.1   47.9±0.4  26.8±0.8  23.8±0.9  19.8±1.1  18.1±0.7
gGLM       63.2±0.9  64.4±1.1  64.6±1.2  64.1±1.3  65.3±0.7   48.0±0.6  27.2±0.7  24.2±0.7  20.2±1.4  19.2±0.7
T5 (list)  64.9±1.0  64.9±1.2  64.9±1.3  63.9±0.9  64.0±0.6   40.4±0.8  21.8±0.8  17.8±1.0  15.4±0.3  12.8±0.5
T5 (set)   63.9±0.7  65.8±0.8  64.0±0.3  64.1±1.2  64.3±1.1   40.3±1.2  21.8±0.7  18.0±0.6  15.5±0.6  13.1±0.7
GCN        44.3±0.9  37.1±1.0  34.4±1.2  36.5±0.6  36.8±1.4   22.2±1.2  21.9±0.8  12.1±3.5   9.0±4.3   5.9±0.0
GAT        44.5±0.9  40.6±1.3  36.3±1.3  37.0±0.8  37.0±0.8   20.0±0.7  20.8±0.2  14.0±0.6  13.8±0.8  11.0±0.6
ℓGT        24.2±3.4  35.0±1.2  34.7±1.3  32.7±2.9  34.5±2.8   30.1±2.6  12.8±2.4  15.5±0.3   9.5±1.3  10.0±1.6
gGT        27.6±1.9  29.0±0.8  23.4±1.2  19.2±1.2  15.6±1.5   18.6±0.7  13.2±1.1  14.5±0.6  12.4±1.3  12.1±1.7

Table 1: Relation label classification accuracy on CN in %. Results are shown for Linear Probing and Finetuning.
come with pretrained weights, we only apply them in finetuning, when training all parameters.

Graph transformer
Finally, we compare GLMs to models with the same architecture, but random weight initialization (normal graph transformers). This allows us to assess the impact of weight initialization from a LM with two further baselines: ℓGT and gGT. We only consider GTs with finetuning.

5.1.3 Results

Linear probing
Tab. 1 shows the relation label prediction accuracy for linear probing, i.e., when training only the classification head. Our first observation is that gGLM is consistently the best, outperforming ℓGLM and the LM baselines. For a radius of r = 1 we have exactly one triplet, which has almost the same representation in the GLM and LM approaches. The only difference is that the LM baselines have an end-of-sentence token, which the GLM does not have. Surprisingly, not having the end-of-sentence token seems to be an advantage with linear probing, but we will see later that this changes when updating model weights.

For r ≥ 3, LM baselines show decreasing performance with increasing radii. By contrast, both ℓGLM and gGLM show increasing performance with increasing radii. This indicates that GLMs can utilize the additional context. But LM baselines don't have any inbuilt methods to grasp distances in the graph, which could cause them to fail at distinguishing relevant from less relevant information.

The performance gap between gGLM and LM models tends to increase for larger m, i.e., when larger sub-structures are masked. However, the ℓGLM underperforms for large m, highlighting the advantage of the global view in gGLM when long-ranged connections are necessary.

The overall high performance of GLMs confirms our assumption that GLMs are compatible with LM weights, even without any training. Increasing performance with increasing radii further shows that GLMs have good inductive graph biases. When long-range connections are relevant, the representations learned by gGLM outperform the locally constrained ℓGLM, which showcases the strength of the global view that the gGLM is able to take.
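As a concrete illustration of the two training regimes compared in Tab. 1, the snippet below contrasts linear probing (training only a classification head on a frozen encoder) with finetuning (updating all parameters). The backbone name and the absence of pooling are illustrative choices; the 17 relation classes and the AdamW setting with learning rate 5e-3 follow Appendix B.1, but this is a generic sketch, not the training code used in this work.

# Generic sketch of linear probing vs. finetuning with a frozen/unfrozen encoder.
import torch
from torch import nn
from transformers import T5EncoderModel

encoder = T5EncoderModel.from_pretrained("t5-small")
head = nn.Linear(encoder.config.d_model, 17)      # 17 relation classes on ConceptNet

def set_linear_probing(enabled: bool):
    """Freeze the encoder for linear probing; unfreeze it for finetuning."""
    for p in encoder.parameters():
        p.requires_grad = not enabled

set_linear_probing(True)                          # linear probing: only `head` is trained
trainable = [p for p in list(encoder.parameters()) + list(head.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-3)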
Finetuning
Tab. 1 shows results when training all parameters. In this setting, models can adjust to the task and learn to reason over graphs through parameter updates. In addition, GLMs can tune parameters to better match the novel input structure. The GLM and LM variants are consistently better than GNN and GT methods, which indicates that linguistic understanding is potentially more important than graph reasoning for this task. Models outperform their linear probing scores, which shows that finetuning is, as expected, beneficial.

Overall, the GLMs perform best, while GTs perform the worst. The only difference between the two model groups is weight initialization: the GLMs are initialized from T5, while the GTs are randomly initialized. Further, we observe that for r ≥ 1 and m = 0 the local GT (ℓGT) significantly outperforms its global counterpart gGT. For the GLM, the global version is on par with, or even better than, the local one. This shows the effectiveness of T5's attention mechanism: thanks to its weight initialization, gGLM attends to relevant tokens even in large context windows, while gGT suffers from potentially distracting long-ranged information.

For m = 0 the differences between GLM and LM approaches are small, with a slight trend for GLMs to outperform LMs on large graphs, and vice versa for small graphs. However, when graph reasoning is more important due to masking (m ≥ 1), then GLMs consistently and significantly outperform all other baselines. This indicates that LMs can learn to do simple graph reasoning through parameter updates, but underperform in more complex graph reasoning tasks where either graphs are larger, or long-ranged connections are required.

For m ≥ 1, the gGLM outperforms ℓGLM due to its global connections. In contrast to the linear probing setting, the ℓGLM outperforms other baselines for all non-zero levels of masking. This indicates that ℓGLM can learn to use long-ranged information during training, if the task requires it.

Impact of model size
To investigate the effect of model size, we train the most promising approaches (GLM and LM) in 3 different sizes. Tab. 6 in §B.1.3 shows that overall larger models perform better. Surprisingly, the base models sometimes outperform the larger models for settings that require more graph reasoning, i.e., larger m. However, these differences are small and non-significant. In most cases, gGLM large or base is the best model.

5.2 Jointly representing Graph and Text
We now investigate GLM capabilities to process interleaved inputs of text and graph in a KG population setup, i.e., extending a KG with new relation instances. Subtask 1 performs text-guided relation classification, where some relations may be inferrable from the text, while others may exploit graph knowledge to make predictions. In Subtask 2, models classify the source of a predicted relation, i.e., whether it can be inferred from the text, or whether it requires (additional) graph knowledge.

We construct our data from Huguet Cabot and Navigli (2021), who offer a corpus of Wikipedia abstracts that are linked to Wikidata via entity linking. Their focus is relation extraction, so they filter the graphs using NLI, such that all triplets are entailed by the text. We augment the entailed triplets with further triplets from Wikidata that are not entailed by the text. For a given text, subgraph, head and tail entity, models jointly predict the relation and the source. We adopt the 220 most common relations in our train graphs and a "no-relation" label. For source labels we have 3 classes: entailed by the text, not entailed, and no-relation. No-relation is the correct source label iff the relation is also no-relation. §B.2.1 shows statistics and construction details.

(a) Relation label classification. (b) Source classification.
Figure 4: KG population test results during training. gGLM outperforms T5 set by up to 6 points in 4a.

5.2.1 Experimental setup and baselines
Unlike §5.1.1, models now receive text and graph data as input. We train two distinct classification heads on the mask's embedding for relation and source classification. While the mask is part of the graph, its embedding depends on both modalities. The final loss is the sum of the relation classification and the source prediction loss, weighted by 0.9 and 0.1. We use T5-large, but otherwise baselines are as in §5.1.2. §B.2.2 shows the training details.
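For illustration, the weighted joint objective described above can be written compactly; the sketch below is a schematic PyTorch rendering in which `mask_embedding`, the two linear heads, and the label tensors are placeholders for whatever the actual pipeline produces. The 0.9/0.1 weighting, the 221 relation classes, the 3 source classes, and the T5-large hidden size of 1024 follow the text; the rest is illustrative.

# Schematic joint objective for KG population: two classification heads share the
# mask token's embedding; losses are weighted 0.9 (relation) and 0.1 (source).
import torch
from torch import nn

d_model = 1024                             # hidden size of T5-large
relation_head = nn.Linear(d_model, 221)    # 220 relations + "no-relation"
source_head = nn.Linear(d_model, 3)        # entailed / not entailed / no-relation
ce = nn.CrossEntropyLoss()

def joint_loss(mask_embedding, relation_labels, source_labels):
    """mask_embedding: (batch, d_model) embedding of the masked relation token."""
    rel_loss = ce(relation_head(mask_embedding), relation_labels)
    src_loss = ce(source_head(mask_embedding), source_labels)
    return 0.9 * rel_loss + 0.1 * src_loss

# Example with random placeholders:
emb = torch.randn(4, d_model)
loss = joint_loss(emb, torch.randint(0, 221, (4,)), torch.randint(0, 3, (4,)))
loss.backward()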
5.2.2 Results
Fig. 4 and Tab. 8 show test set performance for a) relation and b) source classification, at different training stages. gGLM performs the best overall, followed by ℓGLM. LM baselines are competitive, but lag behind at early stages and for source prediction. Again, GT baselines perform poorly, showcasing the advantage of weight initialization in GLMs, even with large-scale training data. For all models, training plateaus beyond ∼500k seen instances (cf. Fig. 8 in §B.2.3), so we stop training at this cut-off.

Tab. 2 gives results for ablating different input modalities to GLMs. Since source prediction always requires text input, we test relation classification w/o source prediction. Ablating the text or graph lowers performance by similar amounts, indicating that GLMs utilize both modalities. Training curves in Fig. 10 reveal that first, the model almost exclusively utilizes text data, but quickly learns to make use of the graph. For textually entailed triplets, text is more impactful than the graph, and vice versa for other triplets (cf. Tab. 9). Ablating graphs lowers source prediction by ∼4.5 points, which shows that GLMs benefit from graph information even for predominantly text-oriented tasks.

Ablation             Relation classification    Source classification
                     ℓGLM      gGLM             ℓGLM      gGLM
w/ text & graph      82.63     82.25            83.39     83.21
w/o text             -6.22     -5.84            –         –
w/o graph            -6.05     -5.10            -4.67     -4.49
w/o text & graph     -19.62    -19.24           –         –

Table 2: Ablations for KG population in macro F1.

The results show that GLMs can efficiently reason over interleaved inputs of graph and text, especially with limited training data. This makes GLMs a promising new model type for knowledge-intense NLP tasks, such as KG population or Q&A.
6 Conclusion
We present the Graph Language Model (GLM): a graph transformer initialized with weights from a LM. It excels at graph reasoning, while simultaneously encoding textual triplets in the graph as LMs do, thereby bridging the gap between LMs and GNNs. GLMs can natively reason over joint inputs from texts and graphs, leveraging and enhancing each modality. Experiments show the GLM's advantage over LM- and GNN-based baselines, even in a linear probing setting. In particular, GLMs greatly outperform graph transformers. This highlights the need for pretrained LM weights, even for graph reasoning. We therefore advocate GLMs as a valuable tool for advancing research in embedding and leveraging knowledge graphs for NLP tasks.

Limitations
While GLMs are designed as general-purpose tools for knowledge-intense NLP tasks, our evaluation is limited to English knowledge graphs. However, we explore various types of knowledge graphs (commonsense and factual) and tasks (relation classification, text-guided relation classification, and source prediction), broadening our empirical assessment. Confirming GLMs' improved text and graph reasoning skills for different languages, domains and tasks is left for future work.

Our GLM framework supports instantiation from any LM with relative positional encoding, including rotary positional encoding. Comprehensive comparisons to determine the most suitable models for the GLM framework remain for future investigation. Nonetheless, bidirectional LMs are expected to perform best in the novel framework, because unidirectional LMs necessitate additional inverse relations, as discussed in §4.

Ethical considerations
We do not foresee immediate ethical concerns for our research, as we rely on well-established datasets. However, even established datasets can contain undesirable biases which our method could potentially spread and amplify.

Looking ahead, our focus lies in enriching knowledge graph integration within language models, with the aim of enhancing factuality and mitigating hallucination. This advancement is expected to bolster the reliability and controllability of LMs, leading to positive societal impacts. Furthermore, LMs relying on knowledge graphs may facilitate easier maintenance, potentially reducing the need for frequent retraining of deployed models, thereby promoting sustainability in NLP practices.

Acknowledgements
We want to thank Letiția Pârcălăbescu for providing feedback on our manuscript.

This work was funded by DFG, the German Research Foundation, within the project "ACCEPT: Perspectivized Argument Knowledge Graphs for Deliberation", as part of the priority program "RATIO: Robust Argumentation Machines" (SPP-1999).

References
Uri Alon and Eran Yahav. 2021. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. 2021. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478.

Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3438–3445.

Philipp Dufter, Martin Schmitt, and Hinrich Schütze. 2022. Position information in transformers: An overview. Computational Linguistics, 48(3):733–763.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 123–135, Vancouver, Canada. Association for Computational Linguistics.

Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370–2381, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. Comet-atomic 2020: On symbolic and neural commonsense knowledge graphs. In AAAI.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge Graphs with Graph Transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2284–2293, Minneapolis, Minnesota. Association for Computational Linguistics.

Junyi Li, Tianyi Tang, Wayne Xin Zhao, Zhicheng Wei, Nicholas Jing Yuan, and Ji-Rong Wen. 2021. Few-shot Knowledge Graph-to-Text Generation with Pretrained Language Models. In ACL Findings.

Shujie Li, Liang Li, Ruiying Geng, Min Yang, Binhua Li, Guanghu Yuan, Wanwei He, Shao Yuan, Can Ma, Fei Huang, and Yongbin Li. 2024. Unifying Structured Data as Graph for Data-to-Text Pre-Training. Transactions of the Association for Computational Linguistics, 12:210–228.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020. Commonsense knowledge base completion with structural and semantic context. Proceedings of the 34th AAAI Conference on Artificial Intelligence.

Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu, Kangfei Zhao, Wenbing Huang, Peilin Zhao, Junzhou Huang, Sophia Ananiadou, and Yu Rong. 2022. Transformer for graphs: An overview from architecture perspective. ArXiv, abs/2202.08455.

Luis Müller, Christopher Morris, Mikhail Galkin, and Ladislav Rampášek. 2023. Attending to Graph Transformers. arXiv preprint.

Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering (TKDE).

Moritz Plenz, Philipp Heinisch, Anette Frank, and Philipp Cimiano. 2024. Pakt: Perspectivized argumentation knowledge graph and tool for deliberation analysis.

Moritz Plenz, Juri Opitz, Philipp Heinisch, Philipp Cimiano, and Anette Frank. 2023. Similarity-weighted construction of contextualized commonsense knowledge graphs for knowledge-intense argumentation tasks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6130–6158, Toronto, Canada. Association for Computational Linguistics.

Ofir Press, Noah Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2021. Investigating pretrained language models for graph-to-text generation. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, pages 211–227, Online. Association for Computational Linguistics.

Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, and Iryna Gurevych. 2020. Modeling global and local node contexts for text generation from knowledge graphs. Transactions of the Association for Computational Linguistics, 8:589–604.

Martin Schmitt, Leonardo F. R. Ribeiro, Philipp Dufter, Iryna Gurevych, and Hinrich Schütze. 2021. Modeling graph structure via relative position for text generation from knowledge graphs. In Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15), pages 10–21, Mexico City, Mexico. Association for Computational Linguistics.

Martin Schmitt, Sahand Sharifzadeh, Volker Tresp, and Hinrich Schütze. 2020. An unsupervised joint system for text generation from knowledge graphs and semantic parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7117–7130, Online. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, page 4444–4451. AAAI Press.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85.

Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren. 2020a. Connecting the dots: A knowledgeable path generator for commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4129–4140, Online. Association for Computational Linguistics.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2021. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418, Online. Association for Computational Linguistics.

Tianming Wang, Xiaojun Wan, and Hanqi Jin. 2020b. AMR-to-text generation with graph transformer. Transactions of the Association for Computational Linguistics, 8:19–33.

Peter West, Ronan Bras, Taylor Sorensen, Bill Lin, Liwei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel, Ashutosh Baheti, Chandra Bhagavatula, and Yejin Choi. 2023. NovaCOMET: Open commonsense foundation models with symbolic knowledge distillation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1127–1149, Singapore. Association for Computational Linguistics.

Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems.

Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. 2022. GreaseLM: Graph REAsoning enhanced language models. In International Conference on Learning Representations.

Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, and Jian Tang. 2023. Learning on large-scale text-attributed graphs via variational inference. In The Eleventh International Conference on Learning Representations.
A Model

(Figure panels: relative position matrix P and mask matrix M, shown for ℓGLM and for gGLM.)
Figure 6: Relative positions P and masking M for ℓGLM and gGLM when encoding text and graph jointly. The example sentence is "The dog chased the cat."
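As a rough, hypothetical sketch of how such matrices could be assembled from pairwise graph distances between tokens: the cut-off and the handling of unconnected ("G2G") pairs below are our own simplifications and not the exact rules used by the GLM.

# Hedged sketch: assemble a relative-position matrix P and an attention mask M
# from pairwise hop distances between tokens. np.inf marks unconnected pairs.
import numpy as np

def build_p_and_m(dist, local_cutoff=np.inf, global_variant=False, g2g_bucket=0):
    """dist: (n, n) array of hop distances between tokens."""
    n = dist.shape[0]
    P = np.zeros((n, n), dtype=int)
    M = np.zeros((n, n), dtype=bool)          # True = attention allowed
    for i in range(n):
        for j in range(n):
            if np.isfinite(dist[i, j]) and dist[i, j] <= local_cutoff:
                P[i, j] = int(dist[i, j])
                M[i, j] = True
            elif global_variant:              # gGLM-style: keep G2G connections open
                P[i, j] = g2g_bucket          # placeholder "far away" position bucket
                M[i, j] = True                # the local variant leaves these pairs masked
    return P, M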
B Experiments

B.1 ConceptNet

B.1.1 Dataset
We experiment on randomly selected subgraphs from the largest connected component of the English part of CN version 5.7 (Speer et al., 2017), which consists of 125,661 concepts and 1,025,802 triplets. We select 17 distinct relation label classes (cf. Tab. 4), ensuring sufficient frequency and semantic dissimilarity. For each class, we randomly sample 1,000 triplets, allowing only cases where exactly one triplet connects the head and tail entities, to reduce label ambiguity. These 1,000 instances are split into train (800), dev (100), and test (100). This creates a balanced dataset of 13,600 train, 1,700 dev, and 1,700 test instances. To predict relation labels, we replace them with <extra_id_0>, T5's first mask token. For our experiments, we replace CN (unmasked) relations with more natural verbalizations. Tab. 4 shows the static verbalization for each relation.

During graph construction we control the graph size, parameterized by the radius r. We start with a radius of r = 1, when we consider only the two concepts (head and tail) in the target triplet. To create a larger graph context, we randomly select 4 adjacent triplets: 2 for the head, and 2 for the tail entity of the original triplet. A graph with radius r = 2 is formed by the subgraph spanned by all entities used in these 5 triplets. For r = 3 we again randomly select 2 triplets for each of the outer (up to) 4 entities, yielding (up to) 13 triplets. To avoid accidentally adding more short-ranged information, we restrict the new triplets to triplets that actually extend the radius of the graph. This enables us to control graph size and complexity, while still enabling sufficient diversity in the graph structure. Further, the graphs are created such that graphs for smaller radii are strict subgraphs of graphs with larger radii. This ensures that performance changes with increasing radii are due to long-ranged connections, and not due to potentially different short-ranged information. Tab. 3 shows structural properties of CN subgraphs, depending on their radius.

When masking subgraphs, we mask complete subgraphs of a certain size around the target to be predicted. The size of the masked subgraph is denoted by m, where m = 0 means no masking, m = 1 masks neighboring concepts, m = 2 masks neighboring concepts and the next relations, and so on. Formally, m denotes the radius of the masked graph in Levi representation, which should not be confused with the extended Levi graph, nor the normal graph representation. We replace each concept and relation in the masked subgraph with a different mask token. This in principle enables LM baselines to internally reconstruct the graph.
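The radius-controlled sampling described above can be sketched as follows. The snippet assumes the CN subgraph is available as a networkx MultiGraph whose edges carry a "relation" attribute, and it glosses over the additional constraints (e.g., that new triplets must strictly extend the radius, and that smaller-radius graphs are strict subgraphs of larger ones); it is an approximation rather than the exact construction.

# Approximate sketch of radius-controlled subgraph sampling around a target triplet.
import random
import networkx as nx

def sample_context(kg: nx.MultiGraph, head, relation, tail, radius, per_node=2):
    """Start from the target triplet (radius 1) and grow the context outwards."""
    triplets = {(head, relation, tail)}
    frontier = [head, tail]
    for _ in range(radius - 1):
        next_frontier = []
        for node in frontier:
            neighbors = list(kg.neighbors(node))
            random.shuffle(neighbors)
            for nb in neighbors[:per_node]:            # 2 adjacent triplets per frontier node
                edge_data = next(iter(kg.get_edge_data(node, nb).values()))
                triplets.add((node, edge_data["relation"], nb))
                next_frontier.append(nb)
        frontier = next_frontier
    return triplets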
Metric        r = 1          r = 2          r = 3           r = 4           r = 5
#nodes        2.00 ± 0.00    5.77 ± 0.46    12.28 ± 1.67    23.47 ± 4.33    42.90 ± 9.57
#edges        1.00 ± 0.00    8.25 ± 2.74    19.19 ± 5.33    36.41 ± 9.09    66.06 ± 16.77
mean degree   1.00 ± 0.00    2.87 ± 0.96    3.14 ± 0.82     3.11 ± 0.59     3.08 ± 0.42

Table 3: Structural properties of CN subgraphs (§5.1), depending on their radius r.
Relation            Verbalization
Used as relation label:
Antonym             is an antonym of
AtLocation          is in
CapableOf           is capable of
Causes              causes
CausesDesire        causes desire
HasSubevent         has subevent
IsA                 is a
MannerOf            is a manner of
MotivatedByGoal     is motivated by
PartOf              is a part of
Synonym             is a synonym of
UsedFor             is used for
Not used as relation label:
CreatedBy           is created by
DefinedAs           is defined as
Desires             desires
Entails             entails

Table 4: Static verbalization for each relation (§B.1.1).

B.1.2 Experimental setup and baselines
Tab. 5 shows our hyperparameters. For the GNNs, we tested different numbers of layers (2, 3, 4, 5), hidden channel dimensions (32, 64, 128), and non-linearities (ReLU, leaky ReLU) in preliminary experiments.

Batchsize                    32
Max. # epochs                50
Early stopping criterion     dev loss
Early stopping # epochs      5
# parameters in small        35M (FT) & 8k (LP)
# parameters in base         110M (FT) & 13k (LP)
# parameters in large        335M (FT) & 17k (LP)
# encoder layers in small    6
# encoder layers in base     12
# encoder layers in large    24
Loss                         cross entropy loss
Optimizer                    AdamW
Learning rate                5e-3

GNN
Batchsize                    32
Max. # epochs                50

Table 5: Hyperparameters (§B.1.2).

B.1.3 Results
Tab. 6 shows performance on CN for different model sizes.

Table 6: Relation label classification accuracy on ConceptNet (§5.1) when training all parameters. Best score per model family is boldfaced, and best score overall is highlighted in yellow.

B.2 Wikidata and Wikipedia

B.2.1 Dataset
Huguet Cabot and Navigli (2021) propose a large-scale corpus of aligned Wikipedia abstracts and Wikidata (Vrandečić and Krötzsch, 2014) triplets. They first extract Wikidata entities from the abstract, and then link these entities with triplets in Wikidata. They are interested in triplets that are entailed by the text, so they use an NLI model to filter out all other triplets. They publicly released the extracted entities and the filtered triplets.

For our purpose, we are interested in aligned graphs and texts, but triplets in the graph do not necessarily have to be entailed by the text. Hence, we find all triplets between the extracted entities using the Wikidata Query Service (https://query.wikidata.org, accessed in Jan. 2024). From Huguet Cabot and Navigli (2021) we know which triplets in our graphs are entailed by the text.

Similar to Huguet Cabot and Navigli (2021), we consider the 220 most common relations in the train split as our relation labels. Additionally, we add a "no-relation" label, yielding 221 relation classes. For 10% of the graphs we randomly add a new triplet between previously unconnected head and tail entity, and the mask token as relation. For these graphs "no-relation" is the correct relation label. For the other 90% of graphs we replace a random existing relation with the mask token, while making sure that (i) the existing relation is in our 220 labels and that (ii) there is no other triplet connecting the respective head and tail entities. We remove
instances where no suitable triplet is available. This yields a dataset with 2,449,582 train, 135,828 val and 135,923 test instances.

Fig. 7 shows the label distributions for relation and source for train and test. Out of the 221 relations, only 195 and 194 relations occur in the train and test set, respectively. All relations in the test set also occur in the train set.

(a) Relation label. (b) Source label.
Figure 7: Label distributions for Wikidata (§5.2) train and test sets.

Tab. 7 shows graph statistics. Compared to CN subgraphs (cf. Tab. 3) the graphs are relatively small, matching the size of r = 2. On CN we found that LMs can perform well on such small graphs, so we expect that the performance gap between GLMs and LM baselines on Wikidata would be larger if Wikidata subgraphs were larger.

Metric        train            test
#nodes        5.59 ± 3.77      5.60 ± 3.78
#edges        8.71 ± 11.99     8.71 ± 12.01
mean degree   2.66 ± 1.58      2.66 ± 1.58

Table 7: Structural statistics of Wikidata (§5.2) subgraphs.

B.2.2 Experimental setup and baselines
For these experiments we omit GNNs as a baseline, since they can't natively process texts.

The other models all compute an embedding of the mask token, and then two separate classification heads produce predictions for the relations (221 classes) and the source (3 classes). For each prediction, we compute the cross entropy loss. The final loss is the weighted sum of these losses, weighted by 0.9 and 0.1, respectively. The relation classification has the higher weight since it has many more classes and hence is potentially more difficult. This means that model parameters are optimized for both objectives jointly, while only the linear classification heads can specialize on their respective task.

The dataset is unbalanced (cf. Fig. 7), so we report macro F1 scores instead of accuracy. This means that models only achieve high scores if they perform well on all classes, including minority classes.

We assume that classifying one out of 221 relations requires fine-grained text understanding, so we initialize models from T5-large instead of T5-small. To reduce computational load, we only train one model per setting. Further, we enable efficient batching by restricting inputs to a maximum of 512 tokens. This truncates 2.8% of train instances for GLMs and 5.1% for LM baselines due to their less efficient graph encoding.

Hyperparameters are identical to Tab. 5, except that (i) we reduce the batch size to 8, (ii) train for at most 1 epoch and (iii) don't use early stopping.

B.2.3 Results
Fig. 8 shows the training curve when training for an entire epoch, i.e., 2,449,582 train instances. We observe that performances plateau beyond ∼0.2 epochs, so we stop training after 524,288 instances in our other experiments.

(a) Evaluation on train set. (b) Evaluation on test set.
Figure 8: Training curves (§5.2) when training for a whole epoch, i.e., 2,449,582 train instances. Performances are for relation classification. On the train set we did not compute macro F1, so we report accuracy instead.

Tab. 8 shows concrete numbers for the models in Figures 4 and 8.

Fig. 9 shows confusion matrices for source prediction.

Fig. 10 shows the test performance in relation classification of ablated models during different training steps. Table 9 shows relation classification scores for (i) triplets entailed by text and for (ii) other triplets.

Figure 10: Ablation of different input modalities to GLMs. All runs are done without source prediction (besides ℓGLM and gGLM). Scores are for relation classification on Wikidata (§5.2).

C Usage of AI assistants
We use GitHub Copilot (https://github.com/features/copilot) for speeding up programming, and ChatGPT 3.5 (https://chat.openai.com) to aid with reformulations. The content of this work is our own, and not inspired by AI assistants.