
Graph Language Models

Moritz Plenz and Anette Frank
Computational Linguistics, Heidelberg University
[email protected], [email protected]

arXiv:2401.07105v3 [cs.CL] 3 Jun 2024

Abstract

While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched. Current methods for encoding such graphs typically either (i) linearize them for embedding with LMs – which underutilize structural information, or (ii) use Graph Neural Networks (GNNs) to preserve the graph structure – but GNNs cannot represent text features as well as pretrained LMs. In our work we introduce a novel LM type, the Graph Language Model (GLM), that integrates the strengths of both approaches and mitigates their weaknesses. The GLM parameters are initialized from a pretrained LM to enhance understanding of individual graph concepts and triplets. Simultaneously, we design the GLM's architecture to incorporate graph biases, thereby promoting effective knowledge distribution within the graph. This enables GLMs to process graphs, texts, and interleaved inputs of both. Empirical evaluations on relation classification tasks show that GLM embeddings surpass both LM- and GNN-based baselines in supervised and zero-shot setting, demonstrating their versatility.¹

Figure 1: The GLM inherits its architecture from a Graph Transformer, and its parameters from a LM. This enables it to jointly reason over graphs and language.

1 Introduction

Knowledge Graphs (KGs) are essential for organizing vast data, to facilitate information retrieval, or revealing hidden insights for decision-making (Plenz et al., 2024). KGs excel in explicitly representing manifold relationships, so with an expanding wealth of information they become crucial tools in the digital age.

Many KGs consist of knowledge triplets, where nodes are entities and edges represent relationships holding between them. Each triplet represents a fact in pseudo-natural language, e.g., (Thailand; Capital; Bangkok) in DBpedia (Auer et al., 2007). Despite the (usual) simplicity of each individual triplet, complex structures emerge in KGs. We refer to such KGs as Graphs of Triplets (GoTs).

To use GoTs effectively, we need meaningful encodings of their components. A natural choice is leveraging LMs, as they can capture the semantics of textually encoded entities, relations or entire triplets. But LMs are not prepared to capture graph-structured information and cannot model complex interactions in a GoT. To alleviate this problem, one can leverage graph NNs (GNNs). But GNNs are not well suited to capture meanings associated with text, and hence often LMs are used to convert nodes (and possibly edges) to language-based semantic embeddings. But in such settings, semantic encoding leveraged from LMs and structural reasoning performed by GNNs are separated and are driven by distinct underlying principles. We expect this to limit model performance if both textual and structural information are important for a task.

In this work we introduce a Graph Language Model (GLM) that resolves this tension through early fusion of textual and structural information. Most LMs today are transformers. Since transformers operate on sets, Positional Encoding (PE) is used to inform them about the inherent sequential ordering of linguistic inputs. In our GLM formulation, we modify PE and self-attention to convert LMs (i.e., sequence transformers) to graph transformers that natively operate on graphs, while preserving their LM capabilities.

¹ https://github.com/Heidelberg-NLP/GraphLanguageModels
Usually, a new architecture requires pretraining from scratch, which is extremely costly. By adopting some non-invasive changes in the LM's self-attention module, we transform the LM to a Graph Transformer (GT) – while maintaining compatibility with its pretrained LM parameters. When encoding a graph, LM-like attention patterns process linearly organized textual information of individual triplets, while GT-like attention patterns aggregate information along the graph structure. Hence, the GLM inherits text understanding of triplets from the LM, while its GT architecture allows it to directly perform structural reasoning, without additional GNN layers.

Importantly, for text sequences – which can be seen as a special type of graph – the GLM is identical to the original LM. This allows the GLM to process interleaved inputs of text and GoT jointly, handling both modalities in a single framework.

Our main contributions are: (i) We propose Graph Language Models (GLMs) and a theoretical framework to construct them. GLMs are graph transformers, which enables graph reasoning. Simultaneously, they inherit and exploit LM weights, enabling them to represent and contextualize triplets in a GoT. Further, by encoding texts and graph components alike, it can naturally take graph and text data as interleaved inputs. (ii) Experiments on relation classification in ConceptNet subgraphs show that GLMs outperform LM- and GNN-based methods for encoding GoTs – even when the inherited LM parameters are not updated during GLM training. (iii) KG population experiments on Wikidata subgraphs and corresponding Wikipedia abstracts show that GLMs can reason over interleaved inputs of GoTs and text – again, outperforming strong LM-based methods.

2 Related Work

LMs One way to augment LMs with knowledge from KGs (Pan et al., 2024) is to formulate pretraining objectives that operate on a KG. E.g., LMs can be trained to generate parts of a KG, encouraging the LM to store KG content in its parameters. Typically, single triplets are used for pretraining (Bosselut et al., 2019; Wang et al., 2021; Hwang et al., 2021; West et al., 2023). In such cases, the graph structure is not a target of pretraining. Some works generate larger substructures, such as paths or linearized subgraphs (Wang et al., 2020a; Schmitt et al., 2020; Huguet Cabot and Navigli, 2021). In either case, the LM needs to memorize the KG, as it will not be part of the input during inference.

Another approach is to provide the linearized KG as part of the input during inference. This is common for KG-to-text generation (Schmitt et al., 2020; Ribeiro et al., 2021; Li et al., 2021), where models learn to take the linearized (typically small-sized) graphs as input. A recent trend is retrieval augmented generation, where relevant parts of a knowledge base, or KG, are retrieved, linearized and provided as part of a prompt (Gao et al., 2024).²

In both options the graph must be linearized to fit the input or output of a sequence-to-sequence LM. Hence, no graph priors can be enforced – instead, the LM has to learn the graph structure implicitly. By contrast, GLMs model a graph as a true graph and have inductive graph priors instilled in their architecture. This prepares a GLM for more proficient graph reasoning, compared to a LM approach.

GNNs LMs excel at representing single triplets, but struggle with structural reasoning. To alleviate this problem, LMs can be combined with GNNs. Many approaches get node and edge features from LMs and aggregate this information in the graph with GNNs (Lin et al., 2019; Malaviya et al., 2020; Zhao et al., 2023). Zhang et al. (2022); Yasunaga et al. (2022) train models consisting of a LM and a GNN that encode interleaved text and graph inputs jointly. They also use a LM to obtain node features. While some approaches jointly train for textual understanding and graph reasoning, none offer a unified method. By contrast, our GLM formulation seamlessly integrates both in a holistic framework for embedding language and KGs.

Graph Transformers GTs, a special type of GNN (Bronstein et al., 2021), gain popularity in NLP and beyond (Min et al., 2022; Müller et al., 2023). E.g., Koncel-Kedziorski et al. (2019) and Wang et al. (2020b) train GTs to generate text from KGs and AMRs, respectively. Most relevant to our work is Schmitt et al. (2021), who use GTs for KG-to-text generation. Similar to us, they employ PE matrices, but train their model from scratch, which limits its applicability: while their model trained on WebNLG (Gardent et al., 2017) has a vocabulary size of 2,100, initializing a GLM from T5 equips it with T5's full vocabulary of 32,128 tokens.

Concurrently, and independently from our work, Li et al. (2024) also convert a LM to a graph transformer.

² Cf. www.llamaindex.ai and www.langchain.com
They focus on data-to-text generation, where they unify table, key-value and KG structures in a unified graph format, and apply structure-enhanced pre-training to support data-to-text generation with their structure-enhanced transformer model. They apply attention maps similar to ours to better capture the graph-structured input, which the pre-trained model rewrites into natural language. Contrary to their work, we do not resort to structure-enhanced pre-training – which is restricted in resources – but instead assess the GLMs' innate capabilities. We showcase the versatility of the inherited LM parameters in conjunction with our graph transformer architecture, by applying them to challenging reasoning tasks, where the model needs to reason over complementary inputs from text and graphs, and where it needs to infer information not present in the input, unlike data-to-text generation. Moreover, we demonstrate that our architectural changes are highly compatible with the original LM weights, via linear probing experiments, where the GLM outperforms conventional LM and Graph Transformer models.

3 Preliminary: Graph Transformers (GT)

This section briefly introduces graph transformers, focusing on architectural choices relevant for our work. We also discuss some general properties of GNNs that motivate our design choices in §4.

The attention in self-attention can be written as

    softmax( QK^T / √d + B_P + M ) V,    (1)

where Q, K and V are the query, key and value matrices, and d is the query and key dimension. The B_P and M matrices can be used for positional encoding and masking. Setting B_P = M = 0 yields the standard formulation (Vaswani et al., 2017).

Positional Encoding The self-attention mechanism of transformer models is permutation invariant, i.e., it doesn't have any notion of the order of its input elements. Thus, Positional Encoding (PE) is used to inform LMs of the ordering of tokens in a text (Dufter et al., 2022). Most approaches employ either absolute PE, where absolute token positions are encoded (Vaswani et al., 2017; Gehring et al., 2017), or relative PE, which encodes the relative position between pairs of tokens (Shaw et al., 2018; Raffel et al., 2020; Su et al., 2021; Press et al., 2022). Absolute PE is typically combined with the input sequence and hence, the PE does not need to be encoded in self-attention (B_P = 0). For relative PE, B_P encodes a bias depending on the relative distances between pairs of tokens – for example, by learning one scalar for each possible distance:

    B_P = f(P),    (2)

where P is a matrix of relative distances and f(·) an elementwise function.

Similarly, GTs use PEs to encode the structure of the input, and hence, their PE has to encode a graph structure, as opposed to a sequence. This can again be done with absolute or relative PEs. However, defining an "absolute position" of a node or edge in a graph is not straightforward. While many methods exist, they are not directly compatible with the usual (absolute) "counting position" known from sequence encoding in LMs. In this work we thus focus on relative PE. Given a directed acyclic path in a graph, we can define the (signed) distance between any pair of nodes along a path simply as the number of hops between the nodes. The sign can be set by the direction of the path. Thus, by finding a consistent set of such paths in §4, we obtain relative distances and hence the graph's PE.

Masked Attention In a vanilla transformer, self-attention is computed for all possible pairs of tokens in the input. By contrast, nodes typically only attend to adjacent nodes in GNNs. Therefore, information between more distant nodes has to be propagated across multiple GNN layers. For graphs, such sparse message passing approaches are sometimes preferred, as in most graphs the neighborhood size increases exponentially with increasing radius, which can cause loss of information due to over-smoothing (Chen et al., 2020). Thus, in GTs it can be beneficial to introduce graph priors, for example by restricting self-attention to local neighborhoods. This can be realized by setting elements of M to 0 for pairs of tokens that should be connected, and to −∞ otherwise.

On the other hand, it has been shown that a global view of the graph can enable efficient, long-ranged information flow (Alon and Yahav, 2021; Ribeiro et al., 2020). We will therefore present two model variants in §4 – a local and a global GLM.

4 Graph Language Model

GLM vs. GT We aim to design an architecture that can efficiently and jointly reason over text and graph-structured data. GTs can offer desired graph priors, but they lack language understanding.
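For concreteness, the attention variant of Eq. (1) can be written down as a short PyTorch sketch. This is our own illustration, not the released implementation; the function name and tensor shapes are assumptions. It shows that the GT-style ingredients are just two additive matrices: a positional bias B_P and a mask M applied before the softmax.

```python
import math
import torch

def biased_masked_attention(Q, K, V, B_P=None, M=None):
    """Self-attention as in Eq. (1): softmax(QK^T / sqrt(d) + B_P + M) V.

    Q, K, V: (batch, seq_len, d) query, key and value matrices.
    B_P:     (seq_len, seq_len) additive positional bias, or None for 0.
    M:       (seq_len, seq_len) mask, 0 where attention is allowed and
             -inf where it is blocked, or None for no masking.
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (batch, seq_len, seq_len)
    if B_P is not None:
        scores = scores + B_P
    if M is not None:
        scores = scores + M
    return torch.softmax(scores, dim=-1) @ V

# Passing B_P = M = None recovers the standard formulation of Vaswani et al. (2017).
```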
Figure 2: Example of graph preprocessing in our GLM. (a) Original GoT. (b) Extended Levi graph of the GoT, with relative distances P for dog, i.e., when dog is attending to other tokens. The red Graph-to-Graph (G2G) connections only exist for the gGLM, not for the ℓGLM.

One intuitive approach to bridge this gap is to pretrain a GT from scratch (Schmitt et al., 2021). But pretraining is costly and the necessary data bound to be scarce. We thus take a different avenue. We hypothesize that for reasoning over GoTs a model needs language understanding capabilities similar to those used for reasoning over text. Intuitively this should be the case, since (i) GoTs are designed to be understandable by humans and (ii) literate people can "read" and understand GoTs. By initializing a GT with parameters from a compatible LM, we obtain our Graph Language Model (GLM). The GT architecture introduces graph priors, while parameter initialization from the LM gives it language understanding capabilities. In the following we explain the necessary modifications to the input graph and to the model, to make this work. The general idea is that (verbalized) triplets should resemble natural language as much as possible to enable LM weights to capture them, while graph reasoning should work via message passing.

Graph preprocessing A LM tokenizer converts text into a sequence of tokens from the LM vocabulary. Similarly, we process GoTs, such that the GLM can process the graphs "as a LM would do" (cf. Fig. 2). To achieve this, we first convert the GoT to its so-called Levi graph (Schmitt et al., 2021), i.e., we replace each edge with a node that contains the relation name as text feature, and connect the new node to the head and tail of the original edge via unlabeled edges, preserving the direction of the original edge. Next, we tokenize each node and split each node into multiple nodes, such that every new node corresponds to a single token. New edges connect adjacent nodes, again preserving the original direction. This yields the extended Levi graph (see Fig. 2b). In this representation, each triplet is represented as a sequence of tokens – just as it would be for a standard LM.³

Figure 3: Relative position matrix P for tokens in Fig. 2b. Entries with G2G have no relative position (ℓGLM) or are initialized from +∞ (gGLM). Cf. §A.

Positional Encodings As discussed in §3, we prefer PEs that encode the relative position between pairs of tokens, determined by their signed distance. We can directly adopt this method to encode the relative position between pairs of tokens occurring within the same triplet – by simply considering the triplet as a piece of text, and counting the token distance in this text.

³ Note that the token sequence of the converted GoT is not necessarily perfectly identical to the token sequence that corresponds to the input triplets. We tokenize each node in the Levi graph individually, to ensure consistent tokenization of concepts shared by multiple triplets. This removes whitespace between concepts and edges, which impacts tokenization. We leave investigation of the impact of this effect to future work.
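The preprocessing described above – Levi graph conversion followed by token-level splitting – can be pictured with a small sketch. This is a simplified stand-in, not the released implementation: it uses a toy whitespace tokenizer in place of the T5 tokenizer, and the function and variable names are ours.

```python
from collections import defaultdict

def extended_levi_graph(triplets, tokenize=str.split):
    """Convert (head, relation, tail) triplets into an extended Levi graph:
    relations become nodes, and every node is split into one node per token.
    Concept token-nodes are shared across triplets; relation token-nodes are
    created freshly for each triplet.  Returns (nodes, edges), where nodes maps
    node-id -> token and edges is a list of directed (src, dst) pairs between
    adjacent tokens."""
    nodes, edges = {}, []
    concept_nodes = {}  # cache so that shared concepts reuse the same token-nodes

    def add_concept(text):
        if text not in concept_nodes:
            ids = []
            for tok in tokenize(text):
                nid = len(nodes)
                nodes[nid] = tok
                ids.append(nid)
            edges.extend(zip(ids, ids[1:]))  # connect adjacent tokens of the concept
            concept_nodes[text] = ids
        return concept_nodes[text]

    for head, relation, tail in triplets:
        h_ids, t_ids = add_concept(head), add_concept(tail)
        r_ids = []
        for tok in tokenize(relation):  # relation tokens are unique per triplet
            nid = len(nodes)
            nodes[nid] = tok
            r_ids.append(nid)
        edges.extend(zip(r_ids, r_ids[1:]))
        # head -> relation -> tail, preserving the original edge direction
        edges.append((h_ids[-1], r_ids[0]))
        edges.append((r_ids[-1], t_ids[0]))
    return nodes, edges

triplets = [("black poodle", "is a", "dog"),
            ("dog", "is a", "animal"),
            ("cat", "is a", "animal")]
nodes, edges = extended_levi_graph(triplets)
```

On the Fig. 2 example, the token-nodes of dog and animal are shared by two triplets each, which is exactly what allows information to propagate between triplets later on.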
Note that a single token can occur in multiple triplets, leading to, e.g., multiple "left-hand side neighbors" (cf. animal in Fig. 2b and 3). While this does not occur in ordinary sequential text, it does not impose a problem for relative PE.

Yet, the approach above breaks when tokens do not belong to the same triplet. To determine the distance between such pairs of tokens, previous work considered, e.g., the length of the shortest path between them (Schmitt et al., 2021). However, this results in PEs that do not come natural to a LM, since a triplet would appear in reversed order, if it is traversed in "the wrong direction" in the shortest path.⁴ We therefore omit structure-informed PE between tokens that do not belong to the same triplet and instead propose two GLM variants: a local (ℓGLM) and a global (gGLM) one.

Local and global GLM Fig. 3 shows the relative token position matrix P for the graph in Fig. 2b. In the ℓGLM the self-attention mechanism is restricted to tokens from the same triplet. This means that attention to any token located beyond the local triplet is set to 0 – and hence does not require PE. Still, in such configurations, messages can propagate through the graph across multiple layers, since tokens belonging to a concept can be shared by multiple triplets. This is analogous to standard message passing in GNNs, where non-adjacent nodes have no direct connection, but can still share information via message passing. For example, the representation of dog is contextualized by the triplets black poodle is a dog and dog is a animal after the first ℓGLM layer. Hence in the second layer, when animal attends to dog, the animal embedding gets impacted by black poodle, even though there is no direct connection from animal to black poodle.

However, it has been shown that a global view can have benefits (Ribeiro et al., 2020). Hence, we also formalize the gGLM, as an alternative where self-attention can connect any node to every other node. For this setting we need to assign a PE to any pair of tokens, including those that do not occur within the same triplet. For these pairs we introduce a new graph-to-graph (G2G) relative position. LMs don't have learned parameters for G2G connections, so we initialize the parameters with the corresponding parameters of a relative position of +∞. In a LM a relative position of +∞ means that the respective tokens occur somewhere "far" away in a remote text passage. LMs learn a proximity bias during pretraining, i.e., they tend to have higher attention scores between tokens that are close to each other in the text. This means that tokens with a high relative distance tend to have low attention scores. For our gGLM this corresponds to a graph bias where distant nodes are less important, but are still accessible.⁵ Note that unlike in the ℓGLM, this bias is not part of the architecture. It originates from the pretrained parameters, meaning that the gGLM can learn to attend to distant tokens.

Along with P and M, the GLM takes a sequence of all tokens in the extended Levi graph as input. For this, we technically need to "linearize" the graph. However, the order of tokens in the resulting sequence does not matter: relative positions in P are determined by distances in the graph, not in the sequence. Permuting the input sequence simply means that rows and columns of P and M need to be permuted accordingly, but the resulting token embeddings remain unchanged. See example matrices for P and M for ℓGLM and gGLM in §A.

Being transformers, GLMs have the same computational complexity as their respective LM. For sparse graphs the ℓGLM could make use of sparse matrix multiplication, making it more efficient than a corresponding LM or gGLM. However, for our experiments this was not necessary.

Joint graph and text encoding If we use normal matrices for P and M, the GLM is identical to its underlying LM. Hence, GLMs can be applied to texts and – more interestingly – interleaved inputs of text and graph. In this joint setting, P and M each consists of four sub-matrices that correspond to self-attention between tokens from (i) graph-to-graph, (ii) text-to-text, (iii) text-to-graph and (iv) graph-to-text. Graph-to-graph sub-matrices are formatted as described above for ℓGLM and gGLM, respectively. Text-to-text sub-matrices are standard matrices from conventional sequence transformers. We introduce new T2G and G2T relative positions for text-to-graph, and graph-to-text connections, respectively. With this, the model can learn interaction strength between the two modalities. Similar to G2G in gGLM, we initialize T2G and G2T parameters from +∞. See example matrices in §A.

⁴ For example, cat would see the graph as the following sequence: cat is a animal a is dog a is poodle black.
⁵ Preliminary experiments showed that initializing G2G parameters from +∞ outperforms random initialization, which outperforms initialization from 0.
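To illustrate how the two variants differ in their inputs, the sketch below (our own simplified code, not the official repository) builds the relative-position matrix P and the mask M from the per-triplet token-id sequences of the extended Levi graph. The learned "+∞" G2G position is represented here by a sentinel constant, which is an assumption of this sketch.

```python
import numpy as np

G2G = 10_000  # sentinel standing in for the "+inf" G2G relative position

def position_and_mask(triplet_token_ids, num_tokens, variant="local"):
    """Build P (relative positions) and M (attention mask) for a GoT.

    triplet_token_ids: list of triplets, each a list of global token ids in
                       the triplet's natural reading order.
    variant:           "local" (ℓGLM) or "global" (gGLM).
    """
    P = np.full((num_tokens, num_tokens), G2G, dtype=int)
    M = np.full((num_tokens, num_tokens), -np.inf)

    for ids in triplet_token_ids:
        for i, qi in enumerate(ids):       # query token
            for j, kj in enumerate(ids):   # key token
                P[qi, kj] = j - i          # signed token distance within the triplet
                M[qi, kj] = 0.0            # within-triplet attention is allowed

    if variant == "global":
        # gGLM: every token may attend to every other token; pairs that share
        # no triplet keep the dedicated G2G position.
        M[:] = 0.0
    return P, M
```

For the ℓGLM, M confines attention to triplet-internal pairs (cross-triplet entries stay at −∞ and need no position), while for the gGLM attention is unmasked and cross-triplet pairs carry the G2G position whose bias is initialized from the LM's largest-distance bucket.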
Uni- and Bidirectional LMs If a LM's self-attention is unidirectional, information can only propagate along the direction of arrows in Fig. 2b for the ℓGLM. This means that, e.g., the representation of the node black poodle is independent of the rest of the graph. We could augment the graph with inverse relations to enable bidirectional information flow with unidirectional LMs, but in this work, we restrict our analysis to bidirectional models.

T5 We use T5 (Raffel et al., 2020) – a bidirectional encoder with unidirectional decoder – as base LM to instantiate GLMs. In T5, relative distances in P group into so-called buckets, and each bucket maps to one learned positional bias in B_P for each head. Positional biases are shared across layers. The decoder is not needed to encode graphs, but can be used to generate sequences, such as text or linearized graphs in future work.

5 Experiments

We assess the GLMs' capabilities for embedding GoTs in two experiments on relation (label) classification, i.e., classifying which relation belongs to a given head and tail entity. One experiment uses ConceptNet (CN; Speer et al., 2017) subgraphs that we construct to enable analysis of the impact of structural graph properties. In a second experiment on Wikidata (Vrandečić and Krötzsch, 2014) subgraphs and associated Wikipedia abstracts we test GLMs on interleaved inputs of text and graph.

5.1 Representing and reasoning over Graphs

We construct a balanced dataset of English CN subgraphs consisting of 13,600 train, 1,700 dev and 1,700 test instances with 17 distinct relations as labels. We replace the relation to be predicted with <extra_id_0>, T5's first mask token.

To investigate the impact of varying graph complexities, we experiment with different graph sizes denoted by their radius r. We ensure that small graphs are strict subgraphs of larger graphs, such that potential performance gains in larger graphs must stem from additional long-ranged context.

To evaluate model effectiveness when long-ranged connections are crucial, we mask complete subgraphs around the relation to be predicted. The size of a masked subgraph is m, where m = 0 means no mask, m = 1 masks neighboring concepts, m = 2 masks neighboring concepts and the next relations, etc. We replace each masked concept and relation with a different mask token. Construction details and statistics are shown in §B.1.1.

5.1.1 Experimental setup

The input to our model is a CN subgraph. The relation to be predicted is replaced with <extra_id_0>. The GLM encodes the graphs as in §4, producing an embedding for each token. A linear classification head gives the final prediction from the mask's embedding. We verbalize unmasked relations using static templates (Plenz et al., 2023), shown in §B, Table 4.

In a finetuning setting we train the GLM and the classification head jointly. However, since the GLM is initialized from a LM, we hypothesize that it should produce meaningful embeddings, even without any training. To test this hypothesis, we train only the classification head, i.e., we only train a linear probe. In this setting, the GLM was never trained on any graph data, similar to a zero-shot setting. The linear probe only extracts linear features and hence, can only achieve high performance if the GLM embeddings show expressive features.

We report mean accuracy across 5 different runs. See §B.1.2 for hyperparameters. Unless stated otherwise, we use T5-small to allow many baselines.

5.1.2 Baselines

We compare to several baselines inspired by related work. For all baselines we utilize the T5 encoder as underlying LM. This allows us to focus on the architectural design of different model types.

LM For LM-based approaches we linearize the input graphs to a sequence, by concatenating the verbalized triplets. There are structured ways to linearize graphs, but such graph traversals generally require the graph to be directed and acyclic – which makes them inapplicable to linearizing GoTs. Instead, we order the triplets either randomly (T5 set) or alphabetically (T5 list). For T5 set, triplets are shuffled randomly in every training epoch such that the model can learn to generalize to unseen orderings. The concatenated triplets are passed to the T5 encoder, and the embedding of <extra_id_0> is presented to the classification head.

GNN For GNN baselines we encode each node of the original graph (cf. Fig. 2a) with the T5 encoder, and train a GNN using these static embeddings. After the final layer, the GNN returns 17 logits for each node. As final logits, we take the mean logit of the two nodes adjacent to the relation to be predicted. We experiment with different variants as GNN layers: GCN (Kipf and Welling, 2017) and GAT (Veličković et al., 2018). Since GNNs do not come with pretrained weights, we only apply them in finetuning, when training all parameters.

Graph transformer Finally we compare GLMs to models with the same architecture, but random weight initialization (normal graph transformers). This allows us to assess the impact of weight initialization from a LM with two further baselines: ℓGT and gGT. We only consider GTs with finetuning.
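For concreteness, the classification setup shared by the GLM and LM approaches (§5.1.1) – a single linear head on the embedding of the <extra_id_0> token, trained either alone (linear probing) or jointly with the encoder (finetuning) – can be sketched as follows. This is an illustrative stand-in with assumed module and argument names, not the released code; in particular, the encoder call signature is hypothetical.

```python
import torch
import torch.nn as nn

class MaskRelationClassifier(nn.Module):
    """Linear head on the embedding of the mask token (<extra_id_0>).

    With freeze_encoder=True only the head is trained (linear probing);
    with freeze_encoder=False encoder and head are finetuned jointly.
    """

    def __init__(self, encoder, hidden_dim, num_relations=17, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder  # stand-in for a GLM / T5 / GT encoder
        self.head = nn.Linear(hidden_dim, num_relations)
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, encoder_inputs, mask_position):
        # token_embeddings: (batch, seq_len, hidden_dim), assumed encoder output
        token_embeddings = self.encoder(**encoder_inputs)
        batch_idx = torch.arange(token_embeddings.size(0))
        mask_embedding = token_embeddings[batch_idx, mask_position]
        return self.head(mask_embedding)  # (batch, num_relations)
```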
Model        r    1        2        3        4        5        4        4        4        4        4
             m    0        0        0        0        0        1        2        3        4        5

Linear Probing
ℓGLM            55.4±0.3 57.1±0.3 56.8±0.6 56.9±0.4 57.0±0.4 30.4±0.4 17.8±0.2 14.0±0.3 11.4±0.5 11.9±0.3
gGLM            55.4±0.3 58.6±0.7 58.8±0.6 59.3±0.7 59.5±0.4 41.8±0.8 25.6±0.9 22.0±0.6 19.4±0.5 17.0±0.2
T5 (list)       53.7±0.3 56.8±1.1 56.5±1.2 55.8±0.6 55.3±0.5 20.3±0.6 19.9±0.4 15.3±0.6 14.0±1.1 10.2±1.2
T5 (set)        53.1±0.6 52.8±1.2 54.6±0.6 53.9±0.5 53.1±0.8 18.2±0.6 16.7±0.5 13.1±0.7 12.3±0.6  9.7±0.9

Finetuning
ℓGLM            64.0±1.3 64.0±1.0 64.4±0.7 64.1±0.9 64.2±1.1 47.9±0.4 26.8±0.8 23.8±0.9 19.8±1.1 18.1±0.7
gGLM            63.2±0.9 64.4±1.1 64.6±1.2 64.1±1.3 65.3±0.7 48.0±0.6 27.2±0.7 24.2±0.7 20.2±1.4 19.2±0.7
T5 (list)       64.9±1.0 64.9±1.2 64.9±1.3 63.9±0.9 64.0±0.6 40.4±0.8 21.8±0.8 17.8±1.0 15.4±0.3 12.8±0.5
T5 (set)        63.9±0.7 65.8±0.8 64.0±0.3 64.1±1.2 64.3±1.1 40.3±1.2 21.8±0.7 18.0±0.6 15.5±0.6 13.1±0.7
GCN             44.3±0.9 37.1±1.0 34.4±1.2 36.5±0.6 36.8±1.4 22.2±1.2 21.9±0.8 12.1±3.5  9.0±4.3  5.9±0.0
GAT             44.5±0.9 40.6±1.3 36.3±1.3 37.0±0.8 37.0±0.8 20.0±0.7 20.8±0.2 14.0±0.6 13.8±0.8 11.0±0.6
ℓGT             24.2±3.4 35.0±1.2 34.7±1.3 32.7±2.9 34.5±2.8 30.1±2.6 12.8±2.4 15.5±0.3  9.5±1.3 10.0±1.6
gGT             27.6±1.9 29.0±0.8 23.4±1.2 19.2±1.2 15.6±1.5 18.6±0.7 13.2±1.1 14.5±0.6 12.4±1.3 12.1±1.7

Table 1: Relation label classification accuracy on CN in %. Results are shown for Linear Probing and Finetuning.

5.1.3 Results

Linear probing Tab. 1 shows the relation label prediction accuracy for linear probing, i.e., when training only the classification head. Our first observation is that gGLM is consistently the best, outperforming ℓGLM and the LM baselines. For a radius of r = 1 we have exactly one triplet, which has almost the same representation in the GLM and LM approaches. The only difference is that the LM baselines have an end-of-sentence token, which the GLM does not have. Surprisingly, not having the end-of-sentence token seems to be an advantage with linear probing, but we will see later that this changes when updating model weights.

For r ≥ 3, LM baselines show decreasing performance with increasing radii. By contrast, both ℓGLM and gGLM show increasing performances with increasing radii. This indicates that GLMs can utilize the additional context. But LM baselines don't have any inbuilt methods to grasp distances in the graph, which could cause them to fail at distinguishing relevant from less relevant information.

The performance gap between gGLM and LM models tends to increase for larger m, i.e., when larger sub-structures are masked. However, the ℓGLM underperforms for large m, highlighting the advantage of the global view in gGLM when long-ranged connections are necessary.

The overall high performance of GLMs confirms our assumption that GLMs are compatible with LM weights, even without any training. Increasing performance with increasing radii further shows that GLMs have good inductive graph biases. When long-range connections are relevant, the representations learned by gGLM outperform the locally constrained ℓGLM – which showcases the strength of the global view that the gGLM is able to take.

Finetuning Tab. 1 shows results when training all parameters. In this setting, models can adjust to the task and learn to reason over graphs through parameter updates. In addition, GLMs can tune parameters to better match the novel input structure.

The GLM and LM variants are consistently better than GNN and GT methods, which indicates that linguistic understanding is potentially more important than graph reasoning for this task. Models outperform their linear probing scores, which shows that finetuning is, as expected, beneficial.

Overall, the GLMs perform best, while GTs perform the worst. The only difference between the two model groups is weight initialization – the GLMs are initialized from T5, while the GTs are randomly initialized. Further, we observe that for r ≥ 1 and m = 0 the local GT (ℓGT) significantly outperforms its global counterpart gGT. For the GLM the global version is on par, or even better than the local one. This shows the effectiveness of T5's attention mechanism: thanks to its weight initialization, gGLM attends to relevant tokens even in large context windows, while gGT suffers from potentially distracting long-ranged information.

For m = 0 the differences between GLM and LM approaches are small, with a slight trend for GLMs to outperform LMs on large graphs, and vice versa for small graphs.
However, when graph reasoning is more important due to masking (m ≥ 1), then GLMs consistently and significantly outperform all other baselines. This indicates that LMs can learn to do simple graph reasoning through parameter updates, but underperform in more complex graph reasoning tasks where either graphs are larger, or long-ranged connections are required.

For m ≥ 1, the gGLM outperforms ℓGLM due to its global connections. In contrast to the linear probing setting, the ℓGLM outperforms other baselines for all non-zero levels of masking. This indicates that ℓGLM can learn to use long-ranged information during training, if the task requires it.

Impact of model size To investigate the effect of model size, we train the most promising approaches (GLM and LM) in 3 different sizes. Tab. 6 in §B.1.3 shows that overall larger models perform better. Surprisingly, the base models sometimes outperform the larger models for settings that require more graph reasoning, i.e., larger m. However, these differences are small and non-significant. In most cases, gGLM large or base are the best model.

Figure 4: KG population test results during training. (a) Relation label classification. (b) Source classification. gGLM outperforms T5 set by up to 6 points in 4a.

5.2 Jointly representing Graph and Text

We now investigate GLM capabilities to process interleaved inputs of text and graph in a KG population setup, i.e., extending a KG with new relation instances. Subtask 1 performs text-guided relation classification where some relations may be inferrable from the text, while others may exploit graph knowledge to make predictions. In Subtask 2, models classify the source of a predicted relation, i.e., whether it can be inferred from the text, or whether it requires (additional) graph knowledge.

We construct our data from Huguet Cabot and Navigli (2021), who offer a corpus of Wikipedia abstracts that are linked to Wikidata via entity linking. Their focus is relation extraction, so they filter the graphs using NLI, such that all triplets are entailed by the text. We augment the entailed triplets with further triplets from Wikidata that are not entailed by the text. For a given text, subgraph, head and tail entity, models will jointly predict the relation and the source. We adopt the 220 most common relations in our train graphs and a "no-relation" label. For source labels we have 3 classes: entailed by the text, not entailed and no-relation. No-relation is the correct label iff the relation is also no-relation. §B.2.1 shows statistics and construction details.

5.2.1 Experimental setup and baselines

Unlike §5.1.1, models now receive text and graph data as input. We train two distinct classification heads on the mask's embedding for relation and source classification. While the mask is part of the graph, its embedding depends on both modalities. The final loss is the sum of the relation classification and the source prediction loss, weighted by 0.9 and 0.1. We use T5-large, but otherwise baselines are as in §5.1.2. §B.2.2 shows the training details.

5.2.2 Results

Fig. 4 and Tab. 8 show test set performance for a) relation and b) source classification, at different training stages. gGLM performs the best overall, followed by ℓGLM. LM baselines are competitive, but lag behind at early stages and for source prediction. Again, GT baselines perform poorly, showcasing the advantage of weight initialization in GLM – even with large-scale training data. For all models, training plateaus beyond ∼ 500k seen instances (cf. Fig. 8 in §B.2.3), so we stop training at this cut-off.
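The joint objective described in §5.2.1 amounts to a weighted sum of two cross-entropy terms. The sketch below is our own paraphrase of that description with hypothetical function names, not the released training code.

```python
import torch
import torch.nn.functional as F

def kg_population_loss(relation_logits, source_logits,
                       relation_labels, source_labels,
                       w_relation=0.9, w_source=0.1):
    """Joint KG-population objective: relation-classification and
    source-classification cross-entropy, weighted 0.9 and 0.1."""
    loss_relation = F.cross_entropy(relation_logits, relation_labels)
    loss_source = F.cross_entropy(source_logits, source_labels)
    return w_relation * loss_relation + w_source * loss_source

# Example with random logits over the 221 relation classes and 3 source classes.
relation_logits = torch.randn(4, 221)
source_logits = torch.randn(4, 3)
loss = kg_population_loss(relation_logits, source_logits,
                          torch.randint(221, (4,)), torch.randint(3, (4,)))
```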
Tab. 2 gives results for ablating different input modalities to GLMs. Since source prediction always requires text input, we test relation classification w/o source prediction. Ablating the text or graph lowers performance by similar amounts, indicating that GLMs utilize both modalities. Training curves in Fig. 10 reveal that first, the model almost exclusively utilizes text data, but quickly learns to make use of the graph. For textually entailed triplets, text is more impactful than the graph, and vice versa for other triplets (cf. Tab. 9). Ablating graphs lowers source prediction by ∼4.5 points, which shows that GLMs benefit from graph information even for predominantly text oriented tasks.

                        Relation classification    Source classification
Ablation                ℓGLM      gGLM             ℓGLM      gGLM
w/ text & graph         82.63     82.25            83.39     83.21
w/o text                -6.22     -5.84            –         –
w/o graph               -6.05     -5.10            -4.67     -4.49
w/o text & graph        -19.62    -19.24           –         –

Table 2: Ablations for KG population in macro F1.

The results show that GLMs can efficiently reason over interleaved inputs of graph and text, especially with limited training data. This makes GLMs a promising new model type for knowledge-intense NLP tasks, such as KG population or Q&A.

6 Conclusion

We present the Graph Language Model (GLM) – a graph transformer initialized with weights from a LM. It excels at graph reasoning, while simultaneously encoding textual triplets in the graph as LMs do, thereby bridging the gap between LMs and GNNs. GLMs can natively reason over joint inputs from texts and graphs, leveraging and enhancing each modality. Experiments show the GLM's advantage over LM and GNN based baselines, even in a linear probing setting. In particular, GLMs greatly outperform graph transformers. This highlights the need for pretrained LM weights, even for graph reasoning. We therefore advocate GLMs as a valuable tool for advancing research in embedding and leveraging knowledge graphs for NLP tasks.

Limitations

While GLMs are designed as general purpose tools for knowledge-intense NLP tasks, our evaluation is limited to English knowledge graphs. However, we explore various types of knowledge graphs (commonsense and factual) and tasks (relation classification, text-guided relation classification, and source prediction), broadening our empirical assessment. Confirming GLMs' improved text and graph reasoning skills for different languages, domains and tasks is left for future work.

Our GLM framework supports instantiation from any LM with relative positional encoding, including rotary positional encoding. Comprehensive comparisons to determine the most suitable models for the GLM framework remain for future investigation. Nonetheless, bidirectional LMs are expected to perform best in the novel framework, because unidirectional LMs necessitate additional inverse relations, as discussed in §4.

Ethical considerations

We do not foresee immediate ethical concerns for our research, as we rely on well-established datasets. However, even established datasets can contain undesirable biases which our method could potentially spread and amplify.

Looking ahead, our focus lies in enriching knowledge graph integration within language models, with the aim of enhancing factuality and mitigating hallucination. This advancement is expected to bolster the reliability and controllability of LMs, leading to positive societal impacts. Furthermore, LMs relying on knowledge graphs may facilitate easier maintenance, potentially reducing the need for frequent retraining of deployed models, thereby promoting sustainability in NLP practices.

Acknowledgements

We want to thank Letiția Pârcălăbescu for providing feedback on our manuscript.

This work was funded by DFG, the German Research Foundation, within the project "ACCEPT: Perspectivized Argument Knowledge Graphs for Deliberation", as part of the priority program "RATIO: Robust Argumentation Machines" (SPP-1999).

References

Uri Alon and Eran Yahav. 2021. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi.
2019. COMET: Commonsense transformers for auto- Transformers. In Proceedings of the 2019 Confer-
matic knowledge graph construction. In Proceedings ence of the North American Chapter of the Associ-
of the 57th Annual Meeting of the Association for ation for Computational Linguistics: Human Lan-
Computational Linguistics, pages 4762–4779, Flo- guage Technologies, Volume 1 (Long and Short Pa-
rence, Italy. Association for Computational Linguis- pers), pages 2284–2293, Minneapolis, Minnesota.
tics. Association for Computational Linguistics.

Michael M Bronstein, Joan Bruna, Taco Cohen, and Junyi Li, Tianyi Tang, Wayne Xin Zhao, Zhicheng Wei,
Petar Veličković. 2021. Geometric deep learning: Nicholas Jing Yuan, and Ji-Rong Wen. 2021. Few-
Grids, groups, graphs, geodesics, and gauges. arXiv shot Knowledge Graph-to-Text Generation with Pre-
preprint arXiv:2104.13478. trained Language Models. In ACL Findings.
Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Shujie Li, Liang Li, Ruiying Geng, Min Yang, Binhua
Xu Sun. 2020. Measuring and relieving the over- Li, Guanghu Yuan, Wanwei He, Shao Yuan, Can Ma,
smoothing problem for graph neural networks from Fei Huang, and Yongbin Li. 2024. Unifying Struc-
the topological view. Proceedings of the AAAI Con- tured Data as Graph for Data-to-Text Pre-Training.
ference on Artificial Intelligence, 34(04):3438–3445. Transactions of the Association for Computational
Linguistics, 12:210–228.
Philipp Dufter, Martin Schmitt, and Hinrich Schütze.
2022. Position information in transformers: An
overview. Computational Linguistics, 48(3):733– Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang
763. Ren. 2019. KagNet: Knowledge-aware graph net-
works for commonsense reasoning. In Proceedings
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, of the 2019 Conference on Empirical Methods in Nat-
Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, ural Language Processing and the 9th International
and Haofen Wang. 2024. Retrieval-augmented gen- Joint Conference on Natural Language Processing
eration for large language models: A survey. (EMNLP-IJCNLP), pages 2829–2839, Hong Kong,
China. Association for Computational Linguistics.
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
and Laura Perez-Beltrachini. 2017. The WebNLG Chaitanya Malaviya, Chandra Bhagavatula, Antoine
challenge: Generating text from RDF data. In Pro- Bosselut, and Yejin Choi. 2020. Commonsense
ceedings of the 10th International Conference on knowledge base completion with structural and se-
Natural Language Generation, pages 124–133, San- mantic context. Proceedings of the 34th AAAI Con-
tiago de Compostela, Spain. Association for Compu- ference on Artificial Intelligence.
tational Linguistics.
Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu,
Jonas Gehring, Michael Auli, David Grangier, and Yann Kangfei Zhao, Wen bing Huang, Peilin Zhao, Jun-
Dauphin. 2017. A convolutional encoder model for zhou Huang, Sophia Ananiadou, and Yu Rong. 2022.
neural machine translation. In Proceedings of the Transformer for graphs: An overview from architec-
55th Annual Meeting of the Association for Compu- ture perspective. ArXiv, abs/2202.08455.
tational Linguistics (Volume 1: Long Papers), pages
123–135, Vancouver, Canada. Association for Com- Luis Müller, Christopher Morris, Mikhail Galkin, and
putational Linguistics. Ladislav Rampášek. 2023. Attending to Graph Trans-
formers. Arxiv preprint.
Pere-Lluís Huguet Cabot and Roberto Navigli. 2021.
REBEL: Relation extraction by end-to-end language
Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Ji-
generation. In Findings of the Association for Com-
apu Wang, and Xindong Wu. 2024. Unifying large
putational Linguistics: EMNLP 2021, pages 2370–
language models and knowledge graphs: A roadmap.
2381, Punta Cana, Dominican Republic. Association
IEEE Transactions on Knowledge and Data Engi-
for Computational Linguistics.
neering (TKDE).
Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras,
Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Moritz Plenz, Philipp Heinisch, Anette Frank, and
Yejin Choi. 2021. Comet-atomic 2020: On sym- Philipp Cimiano. 2024. Pakt: Perspectivized argu-
bolic and neural commonsense knowledge graphs. In mentation knowledge graph and tool for deliberation
AAAI. analysis.

Thomas N. Kipf and Max Welling. 2017. Semi- Moritz Plenz, Juri Opitz, Philipp Heinisch, Philipp Cimi-
supervised classification with graph convolutional ano, and Anette Frank. 2023. Similarity-weighted
networks. In International Conference on Learning construction of contextualized commonsense knowl-
Representations (ICLR). edge graphs for knowledge-intense argumentation
tasks. In Proceedings of the 61st Annual Meeting of
Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, the Association for Computational Linguistics (Vol-
Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text ume 1: Long Papers), pages 6130–6158, Toronto,
Generation from Knowledge Graphs with Graph Canada. Association for Computational Linguistics.
Ofir Press, Noah Smith, and Mike Lewis. 2022. Train you need. In Advances in Neural Information Pro-
short, test long: Attention with linear biases enables cessing Systems, volume 30. Curran Associates, Inc.
input length extrapolation. In International Confer-
ence on Learning Representations. Petar Veličković, Guillem Cucurull, Arantxa Casanova,
Adriana Romero, Pietro Liò, and Yoshua Bengio.
Colin Raffel, Noam Shazeer, Adam Roberts, Kather- 2018. Graph Attention Networks. International Con-
ine Lee, Sharan Narang, Michael Matena, Yanqi ference on Learning Representations.
Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the
limits of transfer learning with a unified text-to-text Denny Vrandečić and Markus Krötzsch. 2014. Wiki-
transformer. Journal of Machine Learning Research, data: A free collaborative knowledgebase. Commun.
21(140):1–67. ACM, 57(10):78–85.
Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro
Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Szekely, and Xiang Ren. 2020a. Connecting the dots:
Schütze, and Iryna Gurevych. 2021. Investigating A knowledgeable path generator for commonsense
pretrained language models for graph-to-text genera- question answering. In Findings of the Association
tion. In Proceedings of the 3rd Workshop on Natural for Computational Linguistics: EMNLP 2020, pages
Language Processing for Conversational AI, pages 4129–4140, Online. Association for Computational
211–227, Online. Association for Computational Lin- Linguistics.
guistics.
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei,
Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, and Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin
Iryna Gurevych. 2020. Modeling global and local Jiang, and Ming Zhou. 2021. K-Adapter: Infusing
node contexts for text generation from knowledge Knowledge into Pre-Trained Models with Adapters.
graphs. Transactions of the Association for Compu- In Findings of the Association for Computational
tational Linguistics, 8:589–604. Linguistics: ACL-IJCNLP 2021, pages 1405–1418,
Online. Association for Computational Linguistics.
Martin Schmitt, Leonardo F. R. Ribeiro, Philipp Dufter,
Iryna Gurevych, and Hinrich Schütze. 2021. Mod- Tianming Wang, Xiaojun Wan, and Hanqi Jin. 2020b.
eling graph structure via relative position for text AMR-to-text generation with graph transformer.
generation from knowledge graphs. In Proceedings Transactions of the Association for Computational
of the Fifteenth Workshop on Graph-Based Methods Linguistics, 8:19–33.
for Natural Language Processing (TextGraphs-15),
pages 10–21, Mexico City, Mexico. Association for Peter West, Ronan Bras, Taylor Sorensen, Bill Lin, Li-
Computational Linguistics. wei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel,
Ashutosh Baheti, Chandra Bhagavatula, and Yejin
Martin Schmitt, Sahand Sharifzadeh, Volker Tresp, and Choi. 2023. NovaCOMET: Open commonsense
Hinrich Schütze. 2020. An unsupervised joint sys- foundation models with symbolic knowledge distil-
tem for text generation from knowledge graphs and lation. In Findings of the Association for Computa-
semantic parsing. In Proceedings of the 2020 Con- tional Linguistics: EMNLP 2023, pages 1127–1149,
ference on Empirical Methods in Natural Language Singapore. Association for Computational Linguis-
Processing (EMNLP), pages 7117–7130, Online. As- tics.
sociation for Computational Linguistics.
Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren,
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Xikun Zhang, Christopher D Manning, Percy Liang,
Self-attention with relative position representations. and Jure Leskovec. 2022. Deep bidirectional
In Proceedings of the 2018 Conference of the North language-knowledge graph pretraining. In Advances
American Chapter of the Association for Computa- in Neural Information Processing Systems.
tional Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 464–468, New Or- Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga,
leans, Louisiana. Association for Computational Lin- Hongyu Ren, Percy Liang, Christopher D Manning,
guistics. and Jure Leskovec. 2022. GreaseLM: Graph REA-
Soning enhanced language models. In International
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conference on Learning Representations.
Conceptnet 5.5: An open multilingual graph of gen-
Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian
eral knowledge. In Proceedings of the Thirty-First
Liu, Rui Li, Xing Xie, and Jian Tang. 2023. Learning
AAAI Conference on Artificial Intelligence, AAAI’17,
on large-scale text-attributed graphs via variational
page 4444–4451. AAAI Press.
inference. In The Eleventh International Conference
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng on Learning Representations.
Liu. 2021. Roformer: Enhanced transformer with
rotary position embedding. CoRR, abs/2104.09864.
A Model

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob


Fig. 5 shows the matrices P and M for ℓGLM and
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz gGLM. Fig. 6 shows the same matrices for joint
Kaiser, and Illia Polosukhin. 2017. Attention is all encoding of text and graph data.
Figure 5: Relative positions P and masking M for ℓGLM and gGLM. (a) Relative positions P for dog in ℓGLM. (b) Relative position matrix P for ℓGLM. (c) Mask matrix M for ℓGLM. (d) Relative positions P for dog in gGLM. (e) Relative position matrix P for gGLM. (f) Mask matrix M for gGLM.

Figure 6: Relative positions P and masking M for ℓGLM and gGLM when encoding text and graph jointly. The example sentence is "The dog chased the cat." (a) Relative position matrix P for ℓGLM. (b) Mask matrix M for ℓGLM. (c) Relative position matrix P for gGLM. (d) Mask matrix M for gGLM.
B Experiments on. Formally, m denotes the radius of the masked
graph in Levi representation, which should not be
B.1 ConceptNet confused with the extended Levi graph, nor the
B.1.1 Dataset normal graph representation. We replace each con-
cept and relation in the masked subgraph with a
We experiment on randomly selected subgraphs different mask token. This in principle enables LM
from the largest connected component of the En- baselines to internally reconstruct the graph.
glish part of CN version 5.7 (Speer et al., 2017),
which consists of 125,661 concepts and 1,025,802 B.1.2 Experimental setup and baselines
triplets. We select 17 distinct relation label classes Tab. 5 shows our hyperparameters. For the GNNs,
(cf. Tab. 4), ensuring sufficient frequency and se- we tested different numbers of layers (2, 3, 4, 5),
mantic dissimilarity. For each class, we randomly hidden channel dimensions (32, 64, 128), and non-
sample 1,000 triplets, allowing only cases where ex- linearities (ReLU, leaky ReLU) in preliminary ex-
actly one triplet connects the head and tail entities, periments.
to reduce label ambiguity. These 1,000 instances
are split into train (800), dev (100), and test (100). B.1.3 Results
This creates a balanced dataset of 13,600 train, Tab. 6 shows performance on CN for different mod-
1,700 dev, and 1,700 test instances. To predict re- elsizes.
lation labels, we replace them with <extra_id_0>,
T5’s first mask token. For our experiments, we B.2 Wikidata and Wikipedia
replace CN (unmasked) relations with more natural B.2.1 Dataset
verbalizations. Tab. 4 shows the static verbalization
Huguet Cabot and Navigli (2021) propose a large-
for each relation.
scale corpus of aligned Wikipedia abstracts and
During graph construction we control the graph
Wikidata (Vrandečić and Krötzsch, 2014) triplets.
size, parameterized by the radius r. We start with
They first extract Wikidata entities from the ab-
a radius of r = 1, when we consider only the two
stract, and then link these entities with triplets in
concepts (head and tail) in the target triplet. To
Wikidata. They are interested in triplets that are
create a larger graph context, we randomly select 4
entailed by the text, so they use a NLI model to
adjacent triplets – 2 for the head, and 2 for the tail
filter out all other triplets. They publicly released
entity of the original triplet. A graph with radius
the extracted entities and the filtered triplets.
r = 2 is formed by the subgraph spanned by all
For our purpose, we are interested in aligned
entities used in these 5 triplets. For r = 3 we again
graphs and texts, but triplets in the graph do not nec-
randomly select 2 triplets for each of the outer (up
essarily have to be entailed by the text. Hence, we
to) 4 entities, yielding (up to) 13 triplets. To avoid
find all triplets between the extracted entities using
accidentally adding more short-ranged information,
the Wikidata Query Service.6 From Huguet Cabot
we restrict the new triplets to triplets that actually
and Navigli (2021) we know which triplets in our
extend the radius of the graph. This enables us
graphs are entailed by the text.
to control graph size and complexity, while still
Similar to Huguet Cabot and Navigli (2021) we
enabling sufficient diversity in the graph structure.
consider the 220 most common relations in the train
Further, the graphs are created such that graphs for
split as our relation labels. Additionally, we add a
smaller radii are strict subgraphs of graphs with
“no-relation” label, yielding 221 relation classes.
larger radii. This ensures that performance changes
For 10 % of the graphs we randomly add a new
with increasing radii are due to long-ranged con-
triplet between previously unconnected head and
nections, and not due to potentially different short-
tail entity, and the mask token as relation. For these
ranged information. Tab. 3 shows structural prop-
graphs “no-relation” is the correct relation label.
erties of CN subgraphs, depending on their radius.
For the other 90 % graphs we replace a random
When masking subgraphs, we mask complete
existing relation with the mask token, while making
subgraphs of a certain size around the target to
sure that (i) the existing relation is in our 220 labels
be predicted. The size of the masked subgraph is
and that (ii) there is no other triplet connecting
denoted by m, where m = 0 means no masking,
the respective head and tail entities. We remove
m = 1 masks neighboring concepts, m = 2 masks
6
neighboring concepts and the next relations, and so https://query.wikidata.org, accessed in Jan. 2024.
Metric r=1 r=2 r=3 r=4 r=5
#nodes 2.00 ± 0.00 5.77 ± 0.46 12.28 ± 1.67 23.47 ± 4.33 42.90 ± 9.57
#edges 1.00 ± 0.00 8.25 ± 2.74 19.19 ± 5.33 36.41 ± 9.09 66.06 ± 16.77
mean degree 1.00 ± 0.00 2.87 ± 0.96 3.14 ± 0.82 3.11 ± 0.59 3.08 ± 0.42

Table 3: Structural statistics of ConceptNet (§5.1) train graphs.

Relation            Verbalization

Used as relation label:
Antonym             is an antonym of
AtLocation          is in
CapableOf           is capable of
Causes              causes
CausesDesire        causes desire
DistinctFrom        is distinct from
FormOf              is a form of
HasContext          has context
HasPrerequisite     has prerequisite
HasProperty         is
HasSubevent         has subevent
IsA                 is a
MannerOf            is a manner of
MotivatedByGoal     is motivated by
PartOf              is a part of
Synonym             is a synonym of
UsedFor             is used for

Not used as relation label:
CreatedBy           is created by
DefinedAs           is defined as
Desires             desires
Entails             entails
HasA                has
HasFirstSubevent    starts with
HasLastSubevent     ends with
InstanceOf          is an instance of
LocatedNear         is near
MadeOf              is made of
NotCapableOf        is not capable of
NotDesires          does not desire
NotHasProperty      is not
ReceivesAction      receives action
RelatedTo           is related to
SymbolOf            is a symbol of

Table 4: Verbalization templates for relations in ConceptNet. The upper part of the relations are the 17 classes in the classification task.

Parameter                    Value

GLM, LM & GT:
Loss                         cross entropy loss
Optimizer                    AdamW
Learning rate                1e−4 (FT) & 5e−3 (LP)
Batchsize                    32
Max. # epochs                50
Early stopping criterion     dev loss
Early stopping # epochs      5
# parameters in small        35M (FT) & 8k (LP)
# parameters in base         110M (FT) & 13k (LP)
# parameters in large        335M (FT) & 17k (LP)
# encoder layers in small    6
# encoder layers in base     12
# encoder layers in large    24

GNN:
Loss                         cross entropy loss
Optimizer                    AdamW
Learning rate                5e−3
Batchsize                    32
Max. # epochs                50
Early stopping criterion     dev loss
Early stopping # epochs      5
# layers                     3
hidden channel dimension     64
non-linearity                ReLU

Table 5: Hyperparameters for §5.1. FT stands for fine-tuning and LP stands for linear probing. “# parameters” is the number of trainable parameters.
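As a purely illustrative sketch (not the authors' training code), the GLM/LM/GT settings from Table 5 could be wired up roughly as follows; the placeholder model and all names are our own:

import torch

# Placeholder model standing in for a GLM/LM/GT encoder plus classification head.
model = torch.nn.Linear(16, 17)
criterion = torch.nn.CrossEntropyLoss()                      # loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # 5e-3 for linear probing

best_dev_loss, patience, bad_epochs = float("inf"), 5, 0     # early stopping on dev loss
for epoch in range(50):                                      # max. # epochs
    # ... one training epoch with batch size 32 would run here ...
    dev_loss = float("inf")                                  # placeholder for the dev-set loss
    if dev_loss < best_dev_loss:
        best_dev_loss, bad_epochs = dev_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break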
r                1         2         3         4         5         4         4         4         4         4
m                0         0         0         0         0         1         2         3         4         5
ℓGLM small       64.0±1.3  64.0±1.0  64.4±0.7  64.1±0.9  64.2±1.1  47.9±0.4  26.8±0.8  23.8±0.9  19.8±1.1  18.1±0.7
ℓGLM base        67.6±0.8  69.6±0.9  69.8±0.5  69.8±1.3  69.6±0.7  49.2±0.8  29.3±0.8  24.4±0.3  20.8±0.9  19.6±0.8
ℓGLM large       72.0±1.0  71.4±1.5  72.2±1.0  72.7±0.8  71.5±1.8  48.4±1.1  29.7±1.6  24.8±1.6  20.0±0.9  20.3±0.5
gGLM small       63.2±0.9  64.4±1.1  64.6±1.2  64.1±1.3  65.3±0.7  48.0±0.6  27.2±0.7  24.2±0.7  20.2±1.4  19.2±0.7
gGLM base        67.8±0.7  71.3±1.0  70.5±1.2  71.5±1.1  71.1±0.4  49.7±1.2  30.2±0.8  25.5±0.8  21.4±1.2  20.1±0.2
gGLM large       72.1±1.1  73.9±0.7  74.2±0.6  74.8±0.8  73.9±0.7  50.1±0.5  31.9±1.2  24.4±1.5  21.2±0.6  19.6±0.8
T5 list small    64.9±1.0  64.9±1.2  64.9±1.3  63.9±0.9  64.0±0.6  40.4±0.8  21.8±0.8  17.8±1.0  15.4±0.3  12.8±0.5
T5 list base     71.2±0.9  69.5±0.7  69.5±1.0  70.4±1.6  70.4±0.7  40.7±0.9  25.5±1.2  17.8±0.2  16.4±1.3  13.9±0.7
T5 list large    74.5±0.4  73.7±0.4  73.5±0.6  73.6±0.8  73.3±1.0  41.2±1.5  27.9±1.0  18.3±0.9  17.0±0.5  13.0±0.9
T5 set small     63.9±0.7  65.8±0.8  64.0±0.3  64.1±1.2  64.3±1.1  40.3±1.2  21.8±0.7  18.0±0.6  15.5±0.6  13.1±0.7
T5 set base      71.2±0.6  69.8±0.6  69.5±0.6  70.1±0.7  69.8±1.4  40.4±0.9  23.9±1.1  18.5±1.1  16.3±0.3  14.3±0.7
T5 set large     74.9±0.3  73.0±0.5  73.1±0.8  72.5±1.1  73.5±0.4  41.2±1.3  25.1±1.3  17.4±0.9  15.9±0.5  13.2±0.8

Table 6: Relation label classification accuracy on ConceptNet (§5.1) when training all parameters. Best score per model family is boldfaced, and best score overall is highlighted in yellow.
Metric         train            test
#nodes         5.59 ± 3.77      5.60 ± 3.78
#edges         8.71 ± 11.99     8.71 ± 12.01
mean degree    2.66 ± 1.58      2.66 ± 1.58

Table 7: Structural statistics of Wikidata (§5.2) subgraphs.

Fig. 7 shows the label distributions for relation and source for train and test. Out of the 221 relations, only 195 and 194 relations occur in the train and test set, respectively. All relations in the test set also occur in the train set.

Tab. 7 shows graph statistics. Compared to CN subgraphs (cf. Tab. 3) the graphs are relatively small, matching the size of r = 2. On CN we found that LMs can perform well on such small graphs, so we expect that the performance gap between GLMs and LM baselines on Wikidata would be larger if Wikidata subgraphs were larger.

B.2.2 Experimental setup and baselines

For these experiments we omit GNNs as a baseline, since they cannot natively process texts.

The other models all compute an embedding of the mask token, and then two separate classification heads produce predictions for the relations (221 classes) and the source (3 classes). For each prediction, we compute the cross entropy loss. The final loss is the weighted sum of these losses, weighted by 0.9 and 0.1, respectively. The relation classification has a higher weight since it has many more classes and hence is potentially more difficult. This means that model parameters are optimized for both objectives jointly, while only the linear classification heads can specialize on their respective task.

The dataset is unbalanced (cf. Fig. 7), so we report macro F1 scores instead of accuracy. This means that models only achieve high scores if they perform well on all classes, including minority classes. We assume that classifying one out of 221 relations requires fine-grained text understanding, so we initialize models from T5-large instead of T5-small. To reduce computational load, we only train one model per setting. Further, we enable efficient batching by restricting inputs to a maximum of 512 tokens. This truncates 2.8 % of train instances for GLMs and 5.1 % for LM baselines due to their less efficient graph encoding.

Hyperparameters are identical to Tab. 5, except that (i) we reduce the batch size to 8, (ii) train for at most 1 epoch and (iii) do not use early stopping.

B.2.3 Results

Fig. 8 shows the training curve when training for an entire epoch, i.e., 2,449,582 train instances. We observe that performances plateau beyond ∼ 0.2 epochs, so we stop training after 524,288 instances in our other experiments.

Tab. 8 shows concrete numbers for the models in Figures 4 and 8. Fig. 9 shows confusion matrices for source prediction. Fig. 10 shows the test performance in relation classification of ablated models during different training steps. Table 9 shows relation classification scores for (i) triplets entailed by text and for (ii) other triplets.
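To illustrate the setup from §B.2.2, a minimal sketch of the two classification heads and the weighted loss could look as follows; the hidden size, class counts, and all names are our own stand-ins, and the mask-token embedding would come from the respective GLM or LM encoder:

import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    def __init__(self, hidden_dim=1024, n_relations=221, n_sources=3):
        super().__init__()
        self.relation_head = nn.Linear(hidden_dim, n_relations)
        self.source_head = nn.Linear(hidden_dim, n_sources)

    def forward(self, mask_embedding):
        # one shared mask-token embedding, two task-specific linear heads
        return self.relation_head(mask_embedding), self.source_head(mask_embedding)

def joint_loss(rel_logits, src_logits, rel_labels, src_labels, w_rel=0.9, w_src=0.1):
    ce = nn.CrossEntropyLoss()
    return w_rel * ce(rel_logits, rel_labels) + w_src * ce(src_logits, src_labels)

# toy usage with random stand-in embeddings and labels
heads = TwoHeadClassifier()
emb = torch.randn(8, 1024)
rel_logits, src_logits = heads(emb)
loss = joint_loss(rel_logits, src_logits,
                  torch.randint(0, 221, (8,)), torch.randint(0, 3, (8,)))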
Figure 7: Label distributions for Wikidata (§5.2) train and test sets. (a) Relation label. (b) Source label.
                 524,288 train instances        2,449,582 train instances
Model            Relation      Source           Relation      Source
ℓGLM             82.35         83.39            85.06         86.20
gGLM             81.98         83.21            85.28         86.17
T5 list          81.45         82.17            85.36         85.83
T5 set           81.29         82.00            85.04         85.53
ℓGT               3.19         39.81             1.50         37.83
gGT               3.47         39.58             3.40         39.37

Table 8: Macro F1 scores on Wikidata test set for relation classification and source classification. Scores are shown for models after training on different numbers of train instances.
                     Entailed               Not entailed
Ablation             ℓGLM      gGLM         ℓGLM      gGLM
w/ text & graph      85.46     84.85        78.47     78.46
w/o text             -8.40     -6.75        -4.57     -4.28
w/o graph            -4.56     -3.94        -7.56     -7.55
w/o text & graph     -20.52    -19.90       -20.08    -20.07

Table 9: Ablations for KG population (§5.2). Scores are macro F1 for relation label classification on (i) triplets that are entailed by the text and (ii) all other triplets. Models are trained w/o source prediction.
C Usage of AI assistants

We use GitHub Copilot (https://github.com/features/copilot) for speeding up programming, and ChatGPT 3.5 (https://chat.openai.com) to aid with reformulations. The content of this work is our own, and not inspired by AI assistants.
Figure 8: Training curves (§5.2) when training for a whole epoch, i.e., 2,449,582 train instances. Performances are for relation classification. On the train set we did not compute macro F1, so we report accuracy instead. (a) Evaluation on train set. (b) Evaluation on test set.
Figure 9: Confusion matrices for source prediction on Wikidata (§5.2). (a) ℓGLM. (b) gGLM.
Figure 10: Ablation of different input modalities to GLMs. All runs are done without source prediction (besides ℓGLM and gGLM). Scores are for relation classification on Wikidata (§5.2). (a) ℓGLM. (b) gGLM.