
Text-To-KG Alignment:

Comparing Current Methods on Classification Tasks

Sondre Wold, Lilja Øvrelid, Erik Velldal


University of Oslo, Language Technology Group
{sondrewo, liljao, erikve}@ifi.uio.no

Abstract

In contrast to large text corpora, knowledge graphs (KG) provide dense and structured representations of factual information. This makes them attractive for systems that supplement or ground the knowledge found in pre-trained language models with an external knowledge source. This has especially been the case for classification tasks, where recent work has focused on creating pipeline models that retrieve information from KGs like ConceptNet as additional context. Many of these models consist of multiple components, and although they differ in the number and nature of these parts, they all have in common that for some given text query, they attempt to identify and retrieve a relevant subgraph from the KG. Due to the noise and idiosyncrasies often found in KGs, it is not known how current methods compare to a scenario where the aligned subgraph is completely relevant to the query. In this work, we try to bridge this knowledge gap by reviewing current approaches to text-to-KG alignment and evaluating them on two datasets where manually created graphs are available, providing insights into the effectiveness of current methods. We release our code for reproducibility.¹

1 Introduction

There is a growing interest in systems that combine the implicit knowledge found in large pre-trained language models (PLMs) with external knowledge. The majority of these systems use knowledge graphs (KG) like ConceptNet (Speer et al., 2017) or Freebase (Bollacker et al., 2008) and either inject information from the graph directly into the PLM (Peters et al., 2019; Chang et al., 2020; Wang et al., 2020; Lauscher et al., 2020; Kaur et al., 2022) or perform some type of joint reasoning between the PLM and the graph, for example by using a graph neural network on the graph and later intertwining the produced representations (Sun et al., 2022; Yasunaga et al., 2021; Zhang et al., 2022; Yasunaga et al., 2022). Beyond their competitive performance, these knowledge-enhanced systems are often upheld as more interpretable, as their reliance on structured information can be reverse-engineered in order to explain predictions or used to create reasoning paths.

One of the central components in these systems is the identification of the most relevant part of a KG for each natural language query. Given that most KGs are noisy and contain idiosyncratic phrasings, which leads to graph sparsity (Sun et al., 2022; Jung et al., 2022), it is non-trivial to align entities from text with nodes in the graph. Despite this, existing work often uses relatively simple methods and does not isolate and evaluate the effect of this component on the overall classification pipeline. Furthermore, due to the lack of datasets that contain manually selected relevant graphs, it is not known how well current methods perform relative to a potential upper bound where the graph provides a structured explanation as to why the sample under classification belongs to a class. Given that this problem applies to a range of typical NLP tasks, and consequently can be found under a range of different names, such as grounding, there is much to be gained from reviewing current approaches and assessing their effect in isolation.

In this paper, we address these issues by providing an overview of text-to-KG alignment methods. We also evaluate a sample of the current main approaches to text-to-KG alignment on two downstream NLP tasks, comparing them to manually created graphs that we use for estimating a potential upper bound. For evaluation, we use the tasks of binary stance prediction (Saha et al., 2021), transformed from a graph generation problem in order to get gold reference alignments, and a subset of the Choice of Plausible Alternatives (COPA) (Roemmele et al., 2011) that contains additional explanation graphs (Brassard et al., 2022). As the focus of this work is not how to best combine structured data with PLMs, but rather to report on how current text-to-KG alignment methods compare to manually created graphs, we use a rather simple integration technique to combine the graphs with a pre-trained language model. Through this work, we hope to motivate more research into methods that align unstructured and structured data sources for a range of tasks within NLP, not only for QA.

¹ https://github.com/SondreWold/graph_impact

Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023), pages 1–13, July 13, 2023 ©2023 Association for Computational Linguistics

Figure 1: An example of a multi-relational knowledge graph.

2 Background

Combining text with structured knowledge is a long-standing challenge in NLP. While earlier work focused more on the text-to-KG alignment itself, using rule-based systems and templates, recent work often approaches the problem as part of a system intended for other NLP tasks than the alignment itself, such as question answering (Yasunaga et al., 2021), language modelling (Kaur et al., 2022) and text summarization (Feng et al., 2021).

As a consequence, approaches to what is essentially the same problem, namely to align some relevant subspace of a large KG with a piece of text, can be found under a range of terms, such as: retrieval (Feng et al., 2021; Kaur et al., 2022; Sun et al., 2022; Wang et al., 2020), extraction (Huang et al., 2021; Feng et al., 2020), KG-to-text alignment (Agarwal et al., 2021), linking (Gao et al., 2022; Becker et al., 2021), grounding (Shu et al., 2022; Lin et al., 2019), and mapping (Yu et al., 2022). Although it is natural to use multiple of these terms to describe a specific technique, we argue that it would be beneficial to refer to the task itself under a common name and propose the term text-to-KG alignment. The following sections formalise the task and discuss current approaches found in the literature.

2.1 Task definition

The task of text-to-KG alignment involves two input elements: a piece of natural text and a KG. The KG is often a multi-relational graph, G = (V, E), where V is a set of entity nodes and E is the set of edges connecting the nodes in V. The task is to align the text with a subset of the KG that is relevant to the text. What defines relevance is dependent on the specific use case. For example, given the question "Where is the most famous church in France located?" and the KG found in Figure 1, a well-executed text-to-KG alignment could link the spans church and France from the text to their corresponding entity nodes in the KG and return a subgraph that contains the minimal number of nodes and edges required in order to guide any downstream system towards the correct behaviour.

2.2 Current approaches

Although the possibilities are many, most current approaches to text-to-KG alignment base themselves on some form of lexical overlap. As noted in Aglionby and Teufel (2022); Becker et al. (2021); Sun et al. (2022), the idiosyncratic phrasings often found in KGs make this problematic. One specific implementation based on lexical overlap is the one found in Lin et al. (2019), which has later been reused in a series of other works on QA without any major modifications (Feng et al., 2020; Yasunaga et al., 2021; Zhang et al., 2022; Yasunaga et al., 2022; Sun et al., 2022).

In the approach of Lin et al. (2019), a schema graph is constructed from each question-answer pair. The first step involves recognising concepts mentioned in the text that exist in the KG. Although they note that exact n-gram matches are not ideal, due to idiosyncratic phrasings and sparsity, they do little to improve on this naive approach besides lemmatisation and filtering of stop words,
Figure 2: An example of the different graph construction approaches for COPA-SSE (Brassard et al., 2022). Here, the premise and answer options are: P: The bodybuilder lifted weights; A1: The gym closed; A2: Her muscles became fatigued. From left to right: Purple: Gold annotation, Brown: Approach 3, Green: Approach 2, and Blue: Approach 1.

leaving it for future work. The enhanced n-gram matching produces two sets of entities, one from the question and one from the answer, Vq and Va. The graph itself is then constructed by adding the k-hop paths between the nodes in these two sets, with k often being 2 or 3. This returns a graph that contains a lot of noise in terms of irrelevant nodes found in the k-hop neighbourhoods of Vq and Va, and motivates some form of pruning applied to Gsub before it is used together with the PLM, such as node relevance scoring (Yasunaga et al., 2021), dynamic pruning via LM-to-KG attention (Kaur et al., 2022), and ranking using sentence representations of the question and answer pair and a linearized version of Gsub (Kaur et al., 2022).

Another approach based on lexical matching is from Becker et al. (2021), which is specifically developed for ConceptNet. Candidate phrases are first extracted from the text using a constituency parser, limited to noun, verb and adjective phrases. These are then lemmatized and filtered for articles, pronouns, conjunctions, interjections and punctuation. The same process is also applied to all the nodes in ConceptNet. This makes it possible to match the two modalities better, as both are normalised using the same pre-processing pipeline. Results on two QA datasets show that the proposed method is able to align more meaningful concepts and that the ratio between informative and uninformative concepts is superior to simple string matching. For the language modelling task, Kaur et al. (2022) use a much simpler technique where a Named Entity Recognition model identifies named entity mentions in text and selects entities with the maximum overlap in the KG.

For the tasks of text summarisation and story ending generation, Feng et al. (2021) and Guan et al. (2019) use RNN-based architectures that read a text sequence word by word; at each time step, the current word is aligned to a triple from ConceptNet (we assume by lexical overlap). Each triple, and also its neighbours in the KG, is encoded using word embeddings and then combined with the context vector from the RNN using different attention-style mechanisms.

As an alternative to these types of approaches based on some form of lexical matching for the alignment, Aglionby and Teufel (2022) experimented with embedding each entity in the KG using a PLM, and then, for each question-answer pair, finding the most similar concepts using Euclidean distance. They conclude that this leads to graphs that are more specific to the question-answer pair, and that this helps performance in some cases. Wang et al. (2020) also experimented with using a PLM to generate the graphs instead of aligning them, relying on KGs such as ConceptNet as a fine-tuning dataset for the PLM instead of as a direct source during alignment. In a QA setting, the model is trained to connect entities from question-answer pairs with a multi-hop path. The generated paths can then later be used for knowledge-enhanced systems. This has the benefit of being able to use all the knowledge acquired during the PLM's pre-training, which might result in concepts that are not present in KGs.

3 KG and Datasets

This section explains the data used in our own experiments.

ConceptNet As our knowledge graph, we use ConceptNet (Speer et al., 2017) — a general-domain KG that contains 799,273 nodes and
Figure 4: An example of a manually created graph from
COPA-SSE (Brassard et al., 2022) for the premise and
options: P: The man felt ashamed of a scar on his face;
A1: He hid the scar with makeup; A2: He explained the
scar to strangers.

2,487,810 edges. The graph is structured as a collection of triples, each containing a head and tail entity connected via a relation from a pre-defined set of types.

Figure 3: An example graph from ExplaGraphs (Saha et al., 2021) generated by a PLM for the belief-argument pair: Organ transplant is important; A patient with failed kidneys might not die if he gets organ donation.

ExplaGraphs ExplaGraphs (Saha et al., 2021) is originally a graph generation task for binary stance prediction. Given a belief and argument pair (b, a), models should both classify whether the argument counters or supports the belief and construct a structured explanation as to why this is the correct label. An example of this can be seen in Figure 3. The original dataset provides a train (n = 2367) and validation (n = 397) split, as well as a test set that is kept private for evaluation on a leaderboard. The node labels have been written by humans using free-form text, but the edge labels are limited to the set of relation types used in ConceptNet. We concatenate the train and validation splits and partition the data into a new train, validation and test split with an 80–10–10 ratio.

COPA-SSE Introduced in Brassard et al. (2022), COPA-SSE adds semi-structured explanations created by human annotators to 1500 samples from Balanced COPA (Kavumba et al., 2019), which is an extension of the original COPA dataset from Roemmele et al. (2011). In this task, given a scenario as a premise, models have to select the alternative that more plausibly stands in a causal relation with the premise. An example with a manually constructed explanation graph can be seen in Figure 4. As with ExplaGraphs, COPA-SSE uses free-form text for the head and tail entities of the triples and limits the relation types to the ones found in ConceptNet.

The dataset provides on average over six explanation graphs per sample. Five annotators have also rated the quality of each graph with respect to how well it captures the relationship between the premise and the correct answer choice. As we only need one graph per sample, we select the one with the highest average rating. As the official COPA-SSE set does not contain any training data, we keep the official development split as our training data and split the official test data in half for our in-house development and testing sets.

4 Alignment approaches

As mentioned, the general procedure for grounding text to a graph is three-fold: we first have to identify entities mentioned in the text, then link them to entities in the graph, and lastly construct a graph object that is returned to the inference model as additional context to be used together with the original text. For QA, the text aligned with the graph is typically a combination of the question and answer choices. As our two downstream tasks are not QA, and are also different from each other, we have to rely on different pre-processing techniques than previous work. The following sections present the implementation of three different text-to-KG alignment approaches that we compare against manually created graphs. An illustration of the different approaches applied to the same text sample can be seen in Figure 2.

4.1 Approach 1: Basic String Matching

Our first approach establishes a simple baseline based on naive string matching. For ExplaGraphs, we first word-tokenize the belief and argument on whitespace, and then for each word we check whether or not it is a concept in ConceptNet by exact lexical overlap. This gives us two sets of entities: Cq and Ca. The graph is constructed by finding paths in ConceptNet between the concepts in Cq and Ca. For COPA-SSE, we do the same but create Cq from a concatenation of the premise and the first answer choice, and Ca from a concatenation of the premise and the second answer choice. We use Dijkstra's algorithm to find the paths (Dijkstra, 1959).² The reason to use this rather simple approach, as also pointed out by Lin et al. (2019) and Aglionby and Teufel (2022), is that finding a minimal spanning graph that covers all the concepts from Cq and Ca, which seems like a more obvious choice, would amount to solving the NP-complete Steiner tree problem (Garey and Johnson, 1977), and this would be too resource-demanding given the size of ConceptNet.

As many of the retrieved paths are irrelevant to the original text, it is common to implement some sort of pruning. We follow Kaur et al. (2022) and linearize the subject-relation-object triples to normal text and then embed them into the same vector space as the original context using SentenceTransformer (Reimers and Gurevych, 2019). We then calculate the cosine similarity between the linearized graphs and the original text context and select the one with the highest score.

4.2 Approach 2: Enhanced String Matching

Our second approach is based on the widely used method from Lin et al. (2019), found in the works of Feng et al. (2020); Yasunaga et al. (2021); Zhang et al. (2022); Yasunaga et al. (2022); Sun et al. (2022), but modified to our use case. We construct the sets of entities Cq and Ca using n-gram matching enhanced with lemmatisation and filtering of stop words.³ As in Approach 1, for ExplaGraphs, Cq is constructed from the belief, and Ca from the argument; for COPA-SSE, Cq is based on a concatenation of the premise and the first answer choice, while Ca is based on a concatenation of the premise and the second answer choice.

The graph is constructed by finding paths in ConceptNet between concepts in Cq and Ca using the same method as in Approach 1. However, we limit the length of the paths to a variable k. In the aforementioned works, k is set so as to retrieve either two- or three-hop paths, essentially finding the 2-hop or 3-hop neighbourhoods of the identified concepts. For our experiments, we set k = 3.

As with Approach 1, many of the retrieved paths are irrelevant to the original text, which warrants some sort of pruning strategy. In the aforementioned works, this is done by node relevance scoring. We follow Approach 1 and use sentence representations via linearization and cosine similarity in order to prune irrelevant paths from the graph.

4.3 Approach 3: Path Generator

Our third approach is based on a method where a generative LM is fine-tuned on the task of generating paths between concepts found in two sets. We use the implementation and already trained path generator (PG) from Wang et al. (2020) for this purpose. This model is a GPT-2 model (Radford et al., 2019) fine-tuned on generating paths between two nodes in ConceptNet.⁴ One advantage of this method is that since GPT-2 already has unstructured knowledge encoded in its parameters from its original pre-training, it is able to generate paths between entities that might not exist in the original graph.

For both ExplaGraphs and COPA-SSE, we take the first and last entity identified by the entity linker from Approach 2 as the start and end points of the PG. As the model only returns one generated path, we do not perform any pruning. For the following example from COPA-SSE, P: The man felt ashamed of a scar on his face; A1: He hid the scar with makeup; A2: He explained the scar to strangers, the PG constructs the following path: masking tape used for hide scar, masking tape is a makeup.

4.3.1 Start and end entities

We also experiment with the same setup, but with the first and last entity from the gold annotations as the start and end points for the PG. We do this to assess the importance of having nodes that are at least somewhat relevant to the original context as input to the PG. In our experiments, we refer to this sub-method as Approach 3-G.

4.4 Integration technique

As the focus of this work is not how to best combine structured data with PLMs, but rather to report on how current text-to-KG alignment methods compare to manually created graphs, we use a rather simple integration technique to combine the graphs with a pre-trained language model and use it uniformly for the different alignment approaches. We conjecture that the ranking of the different linking approaches with this technique would be similar to that of a more complex method for reasoning over the graph structures, for example using GNNs. By not

² Using the implementation from https://networkx.org
³ We use the implementation from Yasunaga et al. (2021) to construct Cq and Ca.
⁴ See Wang et al. (2020) for details on the fine-tuning procedure.
relying on another deep learning model for the integration, we can better control the effect of the graph quality itself.

For each text and graph pair, we linearize the graph to text as in Kaur et al. (2022). For example, the graph in Figure 4 is transformed to the string masking tape used for hide scar, masking tape is a makeup. As linearization does not provide any natural way to capture the information provided by having directed edges, we transform all the graphs to undirected graphs before integrating them with the PLM.⁵ For a different integration technique, such as GNNs, it would probably be reasonable to maintain information about the direction of edges.

For ExplaGraphs, which consists of belief and argument pairs, we feed the model the following sequence: BELIEF [SEP] ARGUMENT [SEP] GRAPH [SEP], where [SEP] is a model-dependent separation token, and the model classifies the sequence as either support or counter.

For COPA-SSE, which has two options for each premise, we use the following format: PREMISE + GRAPH [SEP] A1 [SEP] and PREMISE + GRAPH [SEP] A2 [SEP], where + simply appends the linearized graph to the premise as a string, and the model has to select the most likely sequence of the two.

5 Graph quality

The following section provides an analysis of the quality of the different approaches when used to align graphs for both ExplaGraphs and COPA-SSE. Table 1 and Table 2 show the average number of triples per sample identified or created by the different approaches for the two datasets, as well as how many triples we count as containing some form of error ("Broken triples" in the tables). The criteria for marking a triple as broken include missing head or tail entities inside the triple, having more than one edge between the head and tail, and returning nothing from ConceptNet. It is, of course, natural that not all samples contain an entity that can be found in ConceptNet, and consequently, we decided not to discard the broken triples but rather to include them to showcase the expected performance in a realistic setting.

As can be seen from the tables, the approach based on the Path Generator (PG) from Wang et al. (2020) (Approach 3) returns fewer triples than the other approaches for both ExplaGraphs and COPA-SSE. When using the entities from Approach 2 as the start and end points, denoted by the abbreviation Approach 3, the number of triples containing some form of alignment error is over twenty percent. When using the gold annotation as the start and end point of the PG, abbreviated Approach 3-G, this goes down a bit but is still considerably higher than for the approaches based on lexical overlap. Approach 2 is able to identify some well-formatted triple in all of the cases for both tasks, while Approach 1 fails to retrieve anything for five percent of the samples in COPA-SSE and two percent for ExplaGraphs.

In order to get some notion of the semantic similarity between the different approaches and the original context they are meant to be a structural representation of, we calculate the cosine similarity between the context and a linearized (see Section 4.4 for details on this procedure) version of the graphs. The scores can be found in Table 3. Unsurprisingly, the similarity increases with the complexity of the approach. The basic string matching technique of Approach 1 creates the least similar graphs, followed by the slightly more sophisticated Approach 2, while the generative approaches are able to create somewhat more similar graphs despite having a low average number of triples per graph. All of the approaches are still far from the manually created graphs — which are also linearized using the same procedure as the others.

Approach        Avg. number of triples    Broken triples
Approach 1      2.90                      0.05
Approach 2      2.90                      0.00
Approach 3      1.39                      0.20
Approach 3-G    1.64                      0.12
Gold            2.12                      0.00

Table 1: Statistics for the different approaches on the training set of COPA-SSE. The number of broken triples is reported as percentages.

Approach        Avg. number of triples    Broken triples
Approach 1      2.99                      0.02
Approach 2      3.03                      0.00
Approach 3      1.34                      0.21
Approach 3-G    1.58                      0.15
Gold            4.23                      0.00

Table 2: Statistics for the different approaches on the training set of ExplaGraphs. The number of broken triples is reported as percentages.

⁵ In practice, this is done by simply removing the underscore prepended to all reversed directions.
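The retrieve-and-prune recipe shared by Approaches 1 and 2 — find a shortest path between matched concepts, linearize the candidates, and keep the one most similar to the context — can be condensed into a short sketch. This is an illustrative toy, not the experimental code: the miniature graph and all function names are ours, and a bag-of-words vector stands in for the SentenceTransformer embeddings (the experiments use networkx's Dijkstra implementation and the real ConceptNet).

```python
import math
from collections import Counter
from heapq import heappop, heappush

# Toy stand-in for ConceptNet: {head: [(relation, tail), ...]}.
KG = {
    "church": [("is a", "building"), ("has context", "religion")],
    "building": [("at location", "city")],
    "religion": [("related to", "islam")],
}

def dijkstra_path(kg, start, goal):
    """Unit-weight Dijkstra returning a list of (head, relation, tail) triples."""
    queue, seen = [(0, start, [])], set()
    while queue:
        cost, node, triples = heappop(queue)
        if node == goal:
            return triples
        if node in seen:
            continue
        seen.add(node)
        for relation, neighbour in kg.get(node, []):
            heappush(queue, (cost + 1, neighbour, triples + [(node, relation, neighbour)]))
    return None

def linearize(triples):
    """Flatten triples to plain text, e.g. 'church is a building'."""
    return ", ".join(f"{h} {r} {t}" for h, r, t in triples)

def embed(text):
    """Bag-of-words stand-in for the sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def align(kg, context, c_q, c_a):
    """Collect all C_q -> C_a paths and keep the one most similar to the context."""
    paths = [p for s in c_q for g in c_a if (p := dijkstra_path(kg, s, g))]
    return max(paths, key=lambda p: cosine(embed(linearize(p)), embed(context)), default=None)
```

Swapping `embed` for `SentenceTransformer.encode` and `dijkstra_path` for networkx's shortest-path routine recovers the setup described above.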
Approach        ExplaGraphs    COPA-SSE
Approach 1      0.39           0.32
Approach 2      0.45           0.42
Approach 3      0.48           0.45
Approach 3-G    0.55           0.46
Gold            0.75           0.57

Table 3: The different graphs and their average cosine similarity with the original text.

Approach        ExplaGraphs    COPA-SSE
No graph        69.67±3.36     67.05±2.07
Approach 1      66.46±8.48     51.20±2.08
Approach 2      70.03±2.71     53.33±1.80
Approach 3      73.55±1.66     56.20±8.39
Approach 3-G    70.57±3.27     85.86±0.75
Gold            80.28±2.31     96.60±0.28

Table 4: Results of the different approaches on ExplaGraphs and COPA-SSE. Results are reported as average accuracy over ten runs together with standard deviations after outlier removal, if any.

Figure 5: The graph aligned with ConceptNet for both of the approaches based on lexical overlap. The original COPA-SSE context is Premise: The women met for coffee; Alt 1: The cafe reopened in a new location; Alt 2: They wanted to catch up with each other.

6 Experiments

We now present experiments where we compare the discussed approaches to text-to-KG alignment for ExplaGraphs and COPA-SSE. As our PLM, we use BERT (Devlin et al., 2019) for all experiments. We use the base version and conduct a hyperparameter grid search for both tasks. We do the same search both with and without any appended graphs, as the former naturally makes it easier to overfit the data, especially since both ExplaGraphs and COPA-SSE are relatively small in size. The grid search settings can be found in Appendix A.2 and the final hyperparameters in Appendix A.3. We run all experiments over ten epochs with early stopping on validation loss with a patience value of five.

As few-sample fine-tuning with BERT is known to show instability (Zhang et al., 2021), we run all experiments with ten random seeds and report the mean accuracy scores together with standard deviations. We use the same random seeds for both tasks; they can be found in Appendix A.4. We find that the experiments are highly susceptible to seed variation. Although we are able to match the performance of some previous work for the same PLM on some runs, this does not hold across seeds. Consequently, we also perform outlier detection and removal. Details on this procedure can be found in Appendix A.5.

7 Results

Table 4 shows the results on ExplaGraphs and COPA-SSE. For both datasets, we observe the following: methods primarily based on lexical overlap provide no definitive improvement. The performance of Approach 1 (string matching) and Approach 2 (string matching with added lemmatisation and stop word filtering) is within the standard deviation of the experiments without any appended graph data, and might even impede performance by introducing noise from the KG that is not relevant for the classification at hand, making it harder to fit the data.

For Approach 3, based on a generative model, we see that it too provides little benefit for ExplaGraphs, but that when it has access to the gold annotation entities as the start and end point of the paths, it performs significantly better than having access to no graphs at all for COPA-SSE. For both tasks, having access to manually created graphs improves performance significantly.

8 Discussion

The most striking result is perhaps the performance of Approach 3-G on COPA-SSE. We hypothesise that this can be explained by the fact that annotators probably used exact spans from both the premise and the correct alternative from the text in their graphs, and consequently, they provide a strong signal as to why there is a relation between the premise and the correct answer choice and not the wrong one. This is easily picked up by the model. For ExplaGraphs, which is a text classification problem, this is not the case: the appended graph might provide some inductive bias, but it does not provide a direct link to the correct choice, as the task is to assign a label to the whole sequence, not to choose the most probable sequence out of
Figure 6: The train loss curves for the different approaches on COPA-SSE.

by the observation that appending the manually whether or not an argument counters or supports
constructed graphs in their entirety has a much a belief, in the case of ExplaGraphs, or if it can
larger effect on COPA-SSE than ExplaGraphs. aid in the selection of the most likely follow-up
Furthermore, for COPA-SSE, as pointed out in Table 1, the average triple length for the generative approaches is rather low, so the majority of the aligned graphs from Approach 3-G actually come from the manually written text, not from the model itself.

The key finding of our experiments is that having access to structured knowledge relevant to the sample at hand, here represented by the gold annotations, provides a significant increase in performance even with a simple injection technique and, by today's standards, a small pre-trained language model. The experiments also show that for datasets with low sample sizes, such as ExplaGraphs and COPA-SSE, the results are susceptible to noise. As the approaches based on lexical overlap are within the standard deviations of the experiments without any appended graphs, it is not possible to conclude that they add any useful information to the model. Based on Figure 6, we think it is fair to conclude that these methods based on lexical overlap only provide a signal that has no relation to the correct label. As to why the approaches based on lexical matching have no effect here but reportedly have an effect in previous work on QA, there is one major reason that has not been discussed so far: namely that both datasets require knowledge that is not represented in ConceptNet. As shown by Bauer and Bansal (2021), matching the task with the right KG is important. It is reasonable to question whether or not ConceptNet, which aims to represent commonsense and world knowledge, does indeed contain information useful for deciding the most likely follow-up scenario to a situation, in the case of COPA-SSE. In Figure 5, both approaches based on lexical overlap (1 & 2) align the exact same graph with the text context, and judging from the result, it is clear that the aligned graph has little to offer in terms of guiding the model towards the most likely follow-up.

9 Conclusion

In this work, we observe that the process of identifying and retrieving the most relevant information in a knowledge graph appears under a range of different names in the literature, and we propose the term text-to-KG alignment. We systematise current approaches for text-to-KG alignment and evaluate a selection of them on two different tasks where manually created graphs are available, providing insights into how they compare to a scenario where the aligned graph is completely relevant to the text. Our experiments show that having access to such a graph could help performance significantly, and that current approaches based on lexical overlap are unsuccessful under our experimental setup, but that a generative approach using a PLM to generate a graph based on manually written text as start and end entities yields a significant increase in performance for multiple-choice type tasks, such as COPA-SSE. For the approaches based on lexical overlap, we hypothesise that the lack of performance increase can be attributed to the choice of knowledge graph, in our case ConceptNet, which might not contain any information useful for solving the two tasks.
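To make this failure mode concrete, the matching step that lexical-overlap alignment relies on can be sketched as below. The triple store and the query are invented for the illustration and are not real ConceptNet entries; the point is that surface-form matching retrieves every triple whose head appears in the text, relevant or not:

```python
# Toy ConceptNet-style triples (invented for this illustration).
toy_triples = [
    ("man", "CapableOf", "knock over glass"),
    ("man", "IsA", "adult"),
    ("glass", "MadeOf", "sand"),
    ("milk", "AtLocation", "refrigerator"),
]

def align_by_overlap(text, triples):
    """Return every triple whose head concept occurs verbatim in the text."""
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return [t for t in triples if t[0] in tokens]

query = "The man knocked over the glass of milk."
aligned = align_by_overlap(query, toy_triples)
# All four triples are retrieved on surface form alone, although only the
# first says anything about the situation described in the query.
```

A model given the full `aligned` set must itself learn to ignore distractors such as the refrigerator triple, which is exactly the kind of uninformative signal discussed above.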
Limitations

While there is a lot of work on creating and making available large pre-trained language models for a range of languages, there are, to our knowledge, not many knowledge graphs for languages other than English — especially general knowledge ones, like ConceptNet. This is a major limitation, as it restricts research to a single language and to the structured representation of knowledge found in the culture associated with that specific group of language users. Creating commonsense KGs from unstructured text is a costly process that requires financial resources for annotation as well as available corpora to extract the graph from.

Ethics Statement

We do not foresee that combining knowledge graphs with pre-trained language models in the way done here adds to any of the existing ethical challenges associated with language models. However, this rests on the assumption that the knowledge graph does not contain any harmful information that might inject or amplify unwanted behaviour in the language model.

References

Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

Guy Aglionby and Simone Teufel. 2022. Identifying relevant common sense information in knowledge graphs. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 1–7, Dublin, Ireland. Association for Computational Linguistics.

Lisa Bauer and Mohit Bansal. 2021. Identify, align, and integrate: Matching knowledge graphs to commonsense reasoning tasks. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2259–2272, Online. Association for Computational Linguistics.

Maria Becker, Katharina Korfhage, and Anette Frank. 2021. COCO-EX: A tool for linking concepts from texts to ConceptNet. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 119–126, Online. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Ana Brassard, Benjamin Heinzerling, Pride Kavumba, and Kentaro Inui. 2022. COPA-SSE: Semi-structured explanations for commonsense reasoning. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3994–4000, Marseille, France. European Language Resources Association.

Ting-Yun Chang, Yang Liu, Karthik Gopalakrishnan, Behnam Hedayatnia, Pei Zhou, and Dilek Hakkani-Tur. 2020. Incorporating commonsense knowledge graph in pretrained models for social commonsense tasks. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 74–79, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Edsger Wybe Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271.

Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2021. Incorporating commonsense knowledge into abstractive dialogue summarization via heterogeneous graph networks. In Chinese Computational Linguistics: 20th China National Conference, CCL 2021, Hohhot, China, August 13–15, 2021, Proceedings, pages 127–142. Springer.

Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. 2020. Scalable multi-hop relational reasoning for knowledge-aware question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1295–1309, Online. Association for Computational Linguistics.

Silin Gao, Jena D. Hwang, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, and Antoine Bosselut. 2022. ComFact: A benchmark for linking contextual commonsense knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1656–1675, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Michael R. Garey and David S. Johnson. 1977. The rectilinear Steiner tree problem is NP-complete. SIAM Journal on Applied Mathematics, 32(4):826–834.

Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6473–6480.

Canming Huang, Weinan He, and Yongmei Liu. 2021. Improving unsupervised commonsense reasoning using knowledge-enabled natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4875–4885, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yong-Ho Jung, Jun-Hyung Park, Joon-Young Choi, Mingyu Lee, Junho Kim, Kang-Min Kim, and SangKeun Lee. 2022. Learning from missing relations: Contrastive learning with commonsense knowledge graphs for commonsense inference. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1514–1523, Dublin, Ireland. Association for Computational Linguistics.

Jivat Kaur, Sumit Bhatia, Milan Aggarwal, Rachit Bansal, and Balaji Krishnamurthy. 2022. LM-CORE: Language models with contextually relevant external knowledge. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 750–769, Seattle, United States. Association for Computational Linguistics.

Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, Clever Hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 33–42, Hong Kong, China. Association for Computational Linguistics.

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 43–49, Online. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Swarnadeep Saha, Prateek Yadav, Lisa Bauer, and Mohit Bansal. 2021. ExplaGraphs: An explanation graph generation task for structured commonsense reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7716–7740, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yiheng Shu, Zhiwei Yu, Yuhan Li, Börje Karlsson, Tingting Ma, Yuzhong Qu, and Chin-Yew Lin. 2022. TIARA: Multi-grained retrieval for robust question answering over large knowledge base. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8108–8121, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Yueqing Sun, Qi Shi, Le Qi, and Yu Zhang. 2022. JointLK: Joint reasoning with language models and knowledge graphs for commonsense question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5049–5060, Seattle, United States. Association for Computational Linguistics.

Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren. 2020. Connecting the dots: A knowledgeable path generator for commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4129–4140, Online. Association for Computational Linguistics.
Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D. Manning, Percy Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems.

Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 535–546, Online. Association for Computational Linguistics.

Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. 2022. KG-FiD: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4961–4974, Dublin, Ireland. Association for Computational Linguistics.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, and Jure Leskovec. 2022. GreaseLM: Graph REASoning enhanced language models. In International Conference on Learning Representations.

A Appendix A

A.1 SentenceTransformer

We use the model with id all-mpnet-base-v2 to prune the different paths and to calculate similarity.

A.2 Grid search

Based on the following values, we do a grid search checking every possible combination.

Hyperparameter   Values
Learning rate    4e-5, 3e-5, 5e-5, 6e-6, 4e-6, 1e-6
Weight decay     0.01, 0.1
Batch size       4, 8, 16
Dropout          0.2, 0.3

Table 5: The values used for the grid search.

A.3 Hyperparameters

Based on the grid search, we select the following hyperparameters:

Hyperparameter   With graphs   Without graphs
Learning rate    3e-5          4e-5
Dropout          0.3           0.3
Weight decay     0.01          0.1
Batch size       16            8

Table 6: The hyperparameters used for ExplaGraphs.

Hyperparameter   With graphs   Without graphs
Learning rate    4e-5          4e-5
Dropout          0.2           0.3
Weight decay     0.01          0.1
Batch size       8             16

Table 7: The hyperparameters used for COPA-SSE.

A.4 Seeds

Seeds used for both tasks during fine-tuning:
[9, 119, 7230, 4180, 6050, 257, 981, 1088, 416, 88]
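The sweep defined by the grid values in Table 5 and these seeds can be enumerated as sketched below. This is bookkeeping only; the fine-tuning step itself is left out, and only the size of the search space is shown:

```python
# Enumerate every hyperparameter combination from the grid-search values
# (copied from Table 5) and list the ten fine-tuning seeds from A.4.
from itertools import product

grid = {
    "lr": [4e-5, 3e-5, 5e-5, 6e-6, 4e-6, 1e-6],
    "weight_decay": [0.01, 0.1],
    "batch_size": [4, 8, 16],
    "dropout": [0.2, 0.3],
}
seeds = [9, 119, 7230, 4180, 6050, 257, 981, 1088, 416, 88]

# One dict per combination: 6 * 2 * 3 * 2 = 72 configurations per task.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# The selected configuration is then fine-tuned once per seed, so the
# reported scores average over len(seeds) = 10 runs.
```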

A.5 Outliers
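The circular dots in the box plots follow the conventional 1.5 × IQR whisker rule; whether this exact rule produced the removed outliers is an assumption on our part, but it can be sketched as follows (the run scores below are invented for illustration):

```python
# Flag outliers with the standard 1.5 * IQR box-plot rule. The quartiles
# are estimated crudely from sorted indices, which is sufficient for a
# sketch over ten runs.
def iqr_outliers(scores):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = sorted(scores)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in scores if x < lo or x > hi]

# Ten hypothetical accuracies from ten seeds; one run collapsed.
runs = [75.0, 76.1, 74.8, 75.5, 76.0, 75.2, 74.9, 75.7, 75.3, 60.0]
flagged = iqr_outliers(runs)  # only the 60.0 run is flagged
```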

[Figure 7, six panels: (a) Manually created graphs, (b) No graphs appended to original context, (c) Approach 2, (d) Approach 1, (e) Approach 3, (f) Approach 3-G]

Figure 7: Outliers from the different runs for all graph configurations for ExplaGraphs. Circular dots mark outliers that were removed, if any.
[Figure 8, six panels: (a) Manually created graphs, (b) No graphs appended to original context, (c) Approach 2, (d) Approach 1, (e) Approach 3, (f) Approach 3-G]

Figure 8: Outliers from the different runs for all graph configurations for COPA-SSE. Circular dots mark outliers that were removed, if any.
