[Figure: an example knowledge-graph fragment with nodes such as Paris, Country, Notre-Dame, Capital, Islam, Church, and Religion, connected by relations such as is a, is located at, has a, and has context.]
leaving it for future work. The enhanced n-gram matching produces two sets of entities, one from the question and one from the answer, Vq and Va. The graph itself is then constructed by adding the k-hop paths between the nodes in these two sets, with k often being 2 or 3. This returns a graph that contains a lot of noise in terms of irrelevant nodes found in the k-hop neighbourhoods of Vq and Va, and motivates some form of pruning applied to Gsub before it is used together with the PLM, such as node relevance scoring (Yasunaga et al., 2021), dynamic pruning via LM-to-KG attention (Kaur et al., 2022), and ranking using sentence representations of the question-answer pair and a linearized version of Gsub (Kaur et al., 2022).

Another approach based on lexical matching comes from Becker et al. (2021) and is specifically developed for ConceptNet. Candidate phrases are first extracted from the text using a constituency parser, limited to noun, verb and adjective phrases. These are then lemmatized and filtered for articles, pronouns, conjunctions, interjections and punctuation. The same process is also applied to all the nodes in ConceptNet. This makes it possible to match the two modalities better, as both are normalised using the same pre-processing pipeline. Results on two QA datasets show that the proposed method aligns more meaningful concepts and that the ratio between informative and uninformative concepts is superior to simple string matching. For the language modelling task, Kaur et al. (2022) use a much simpler technique where a Named Entity Recognition model identifies named entity mentions in text and selects the entities with the maximum overlap in the KG.

For the tasks of text summarisation and story ending generation, Feng et al. (2021) and Guan et al. (2019) use RNN-based architectures that read a text sequence word by word, where at each time step the current word is aligned to a triple from ConceptNet (we assume by lexical overlap). Each triple, and also its neighbours in the KG, is encoded using word embeddings and then combined with the context vector from the RNN using different attention-style mechanisms.

As an alternative to these types of approaches based on some form of lexical matching for the alignment, Aglionby and Teufel (2022) experimented with embedding each entity in the KG using a PLM, and then, for each question-answer pair, finding the most similar concepts using Euclidean distance. They conclude that this leads to graphs that are more specific to the question-answer pair, and that this helps performance in some cases. Wang et al. (2020) also experimented with using a PLM to generate the graphs instead of aligning them, relying on KGs such as ConceptNet as a fine-tuning dataset for the PLM instead of as a direct source during alignment. In a QA setting, the model is trained to connect entities from question-answer pairs with a multi-hop path. The generated paths can then later be used for knowledge-enhanced systems. This has the benefit of being able to use all the knowledge acquired during the PLM's pre-training, which might result in concepts that are not present in KGs.

3 KG and Datasets

This section explains the data used in our own experiments.

ConceptNet As our knowledge graph, we use ConceptNet (Speer et al., 2017) — a general-domain KG that contains 799,273 nodes and
Figure 4: An example of a manually created graph from
COPA-SSE (Brassard et al., 2022) for the premise and
options: P: The man felt ashamed of a scar on his face;
A1: He hid the scar with makeup; A2: He explained the
scar to strangers.
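Several of the approaches evaluated below rely on lexical overlap between the text and KG node names. A minimal sketch of such an alignment step, where the stop-word list, the naive normalisation, and the node set are illustrative assumptions loosely modelled on the pipeline of Becker et al. (2021), not a reproduction of any specific approach:

```python
# Minimal sketch of text-to-KG alignment via lexical overlap.
# Stop words, normalisation, and the node set are illustrative
# assumptions; a real pipeline would lemmatise with a parser.

STOP_WORDS = {"the", "a", "an", "for", "of", "to", "with", "each", "up"}

def normalise(text):
    # Lowercase and drop stop words.
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

def align(text, kg_nodes):
    # Return every KG node whose name shares a token with the text.
    tokens = set(normalise(text))
    return sorted(node for node in kg_nodes
                  if set(node.lower().split()) & tokens)

nodes = {"women", "catch", "coffee", "computing"}
print(align("The women met for coffee", nodes))
# ['coffee', 'women']
```

Anything not matched by surface overlap is simply missed, which is one source of the noise discussed later.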
relying on another deep learning model for the integration, we can better control the effect of the graph quality itself.

For each text and graph pair, we linearize the graph to text as in Kaur et al. (2022). For example, the graph in Figure 4 is transformed to the string masking tape used for hide scar, masking tape is a makeup. As linearization does not provide any natural way to capture the information provided by having directed edges, we transform all the graphs to undirected graphs before integrating them with the PLM.⁵ For a different integration technique, such as GNNs, it would probably be reasonable to maintain information about the direction of edges.

For ExplaGraphs, which consists of belief and argument pairs, we feed the model the following sequence: BELIEF [SEP] ARGUMENT [SEP] GRAPH [SEP], where [SEP] is a model-dependent separation token, and the model classifies the sequence as either support or counter.

For COPA-SSE, which has two options for each premise, we use the following format: PREMISE + GRAPH [SEP] A1 [SEP] and PREMISE + GRAPH [SEP] A2 [SEP], where + simply appends the linearized graph to the premise as a string, and the model has to select the most likely sequence of the two.

5 Graph quality

The following section provides an analysis of the quality of the different approaches when used to align graphs for both ExplaGraphs and COPA-SSE. Table 1 and Table 2 show the average number of triples per sample identified or created by the different approaches for the two datasets, as well as how many triples we count as containing some form of error ('Broken triples' in the tables). The criteria for marking a triple as broken include missing head or tail entities inside the triple, having more than one edge between the head and tail, and returning nothing from ConceptNet. It is, of course, natural that not all samples contain an entity that can be found in ConceptNet, and consequently, we decided not to discard the broken triples but rather to include them to showcase the expected performance in a realistic setting.

As can be seen from the tables, the approach based on the Path Generator (PG) from Wang et al. (2020) (Approach 3) returns fewer triples than the other approaches for both ExplaGraphs and COPA-SSE. When using the entities from Approach 2 as the start and end points, denoted by the abbreviation Approach 3, the number of triples containing some form of alignment error is over twenty percent. When using the gold annotation as the start and end points of the PG, abbreviated Approach 3-G, this goes down a bit but is still considerably higher than for the approaches based on lexical overlap. Approach 2 is able to identify at least one well-formed triple in all of the cases for both tasks, while Approach 1 fails to retrieve anything for five percent of the samples in COPA-SSE and two percent for ExplaGraphs.

In order to get some notion of the semantic similarity between the different approaches and the original context they are meant to be a structural representation of, we calculate the cosine similarity between the context and a linearized (see Section 4.4 for details on this procedure) version of the graphs. The scores can be found in Table 3. Unsurprisingly, the similarity increases with the complexity of the approach. The basic string matching technique of Approach 1 creates the least similar graphs, followed by the slightly more sophisticated Approach 2, while the generative approaches are able to create somewhat more similar graphs despite having a low average number of triples per graph. All of the approaches are still far from the manually created graphs — which are also linearized using the same procedure as the others.

Approach        Avg. number of triples    Broken triples
Approach 1      2.90                      0.05
Approach 2      2.90                      0.00
Approach 3      1.39                      0.20
Approach 3-G    1.64                      0.12
Gold            2.12                      0.00

Table 1: Statistics for the different approaches on the training set of COPA-SSE. The number of broken triples is reported as a proportion.

Approach        Avg. number of triples    Broken triples
Approach 1      2.99                      0.02
Approach 2      3.03                      0.00
Approach 3      1.34                      0.21
Approach 3-G    1.58                      0.15
Gold            4.23                      0.00

Table 2: Statistics for the different approaches on the training set of ExplaGraphs. The number of broken triples is reported as a proportion.

⁵In practice, this is done by simply removing the underscore prepended to all reversed directions.
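The linearization and the model-input formats described in Section 4.4 can be sketched in a few lines; the (head, relation, tail) tuple representation of a graph is an assumption based on the example string above:

```python
# Sketch of graph linearization and model-input construction.
# The triple format is an assumption; separator handling mirrors
# the BELIEF [SEP] ARGUMENT [SEP] GRAPH [SEP] scheme in the text.

def linearize(triples):
    # ("masking tape", "used for", "hide scar") -> "masking tape used for hide scar"
    return ", ".join(f"{h} {r} {t}" for h, r, t in triples)

def explagraphs_input(belief, argument, graph, sep="[SEP]"):
    # BELIEF [SEP] ARGUMENT [SEP] GRAPH [SEP]
    return f"{belief} {sep} {argument} {sep} {linearize(graph)} {sep}"

def copa_sse_inputs(premise, graph, alternatives, sep="[SEP]"):
    # PREMISE + GRAPH [SEP] A1 [SEP]  /  PREMISE + GRAPH [SEP] A2 [SEP]
    prefix = f"{premise} {linearize(graph)}"
    return [f"{prefix} {sep} {alt} {sep}" for alt in alternatives]

graph = [("masking tape", "used for", "hide scar"),
         ("masking tape", "is a", "makeup")]
print(linearize(graph))
# masking tape used for hide scar, masking tape is a makeup
```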
Approach        ExplaGraphs    COPA-SSE
Approach 1      0.39           0.32
Approach 2      0.45           0.42
Approach 3      0.48           0.45
Approach 3-G    0.55           0.46
Gold            0.75           0.57

Table 3: The different graphs and their average cosine similarity with the original text.

Approach        ExplaGraphs    COPA-SSE
No graph        69.67±3.36     67.05±2.07
Approach 1      66.46±8.48     51.20±2.08
Approach 2      70.03±2.71     53.33±1.80
Approach 3      73.55±1.66     56.20±8.39
Approach 3-G    70.57±3.27     85.86±0.75
Gold            80.28±2.31     96.60±0.28

Table 4: Results of the different approaches on ExplaGraphs and COPA-SSE. Results are reported as average accuracy over ten runs together with standard deviations after outlier removal, if any.
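The similarity scores in Table 3 are computed with sentence embeddings (all-mpnet-base-v2, see Appendix A.1). As an illustration of the computation itself, the sketch below substitutes simple bag-of-words count vectors for the sentence embeddings; it is a stand-in, not the procedure used for the reported numbers:

```python
import math
from collections import Counter

# Illustrative cosine similarity between a context and a linearized
# graph. Bag-of-words count vectors stand in for the sentence
# embeddings actually used in our experiments.

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

context = "The women met for coffee"
graph_text = "women distinct from man, catch has context computing"
print(cosine(context, graph_text))
```

With embeddings, near-synonyms also contribute to the score; with raw counts, only exact token overlap does, which is why the sketch understates the similarity of paraphrased graphs.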
[Figure 5: aligned triples including (Women, distinct from, Man) and (Catch, has context, Computing)]

Figure 5: The graph aligned with ConceptNet for both of the approaches based on lexical overlap. The original COPA-SSE context is Premise: The women met for coffee; Alt 1: The cafe reopened in a new location; Alt 2: They wanted to catch up with each other.

6 Experiments

We now present experiments where we compare the discussed approaches to text-to-KG alignment for ExplaGraphs and COPA-SSE. As our PLM, we use BERT (Devlin et al., 2019) for all experiments. We use the base version and conduct a hyperparameter grid search for both tasks. We do the same search both with and without any appended graphs, as the former naturally makes it easier to overfit the data, especially since both ExplaGraphs and COPA-SSE are relatively small in size. The grid search settings can be found in Appendix A.2 and the final hyperparameters in Appendix A.3. We run all experiments over ten epochs with early stopping on validation loss with a patience value of five.

As few-sample fine-tuning with BERT is known to show instability (Zhang et al., 2021), we run all experiments with ten random seeds and report the mean accuracy scores together with standard deviations. We use the same random seeds for both tasks; they can be found in Appendix A.4.

We find that the experiments are highly susceptible to seed variation. Although we are able to match the performance of some previous work for the same PLM on some runs, this does not hold across seeds. Consequently, we also perform outlier detection and removal. Details on this procedure can be found in Appendix A.5.

7 Results

Table 4 shows the results on ExplaGraphs and COPA-SSE. For both datasets, we observe the following: Methods primarily based on lexical overlap provide no definitive improvement. The performance of Approach 1 (string matching) and Approach 2 (string matching with added lemmatisation and stop word filtering) is within the standard deviation of the experiments without any appended graph data, and might even impede performance by introducing noise from the KG that is not relevant for the classification at hand, making it harder to fit the data.

For Approach 3, based on a generative model, we see that it too provides little benefit for ExplaGraphs, but that when it has access to the gold annotation entities as the start and end points of the paths, it performs significantly better than having access to no graphs at all for COPA-SSE.

For both tasks, having access to manually created graphs improves performance significantly.

8 Discussion

The most striking result is perhaps the performance of Approach 3-G on COPA-SSE. We hypothesise that this can be explained by the fact that annotators probably used exact spans from both the premise and the correct alternative in their graphs, and consequently, the graphs provide a strong signal as to why there is a relation between the premise and the correct answer choice and not the wrong one. This is easily picked up by the model. For ExplaGraphs, which is a text classification problem, this is not the case: the appended graph might provide some inductive bias, but it does not provide a direct link to the correct choice, as the task is to assign a label to the whole sequence, not to choose the most probable sequence out of two options. This conclusion is further supported
Figure 6: The train loss curves for the different approaches on COPA-SSE.
by the observation that appending the manually constructed graphs in their entirety has a much larger effect on COPA-SSE than on ExplaGraphs. Furthermore, for COPA-SSE, as pointed out in Table 1, the average number of triples for the generative approaches is rather low, so the majority of the aligned graph content from Approach 3-G actually comes from the manually written text, not from the model itself.

The key finding of our experiments is that having access to structured knowledge relevant to the sample at hand, here represented by the gold annotations, provides a significant increase in performance, even with a simple injection technique and, judging by today's standards, a small pre-trained language model. They also show that for datasets with low sample sizes, such as ExplaGraphs and COPA-SSE, the results are susceptible to noise. As the approaches based on lexical overlap are within the standard deviations of the experiments without any appended graphs, it is not possible to conclude that they add any useful information to the model. Based on Figure 6, we think it is fair to conclude that these methods based on lexical overlap only provide a signal that has no relation to the correct label. As to why the approaches based on lexical matching have no effect here but reportedly do in previous work on QA, there is one major reason that has not been discussed so far: both datasets require knowledge that is not represented in ConceptNet. As shown by Bauer and Bansal (2021), matching the task with the right KG is important. It is reasonable to question whether ConceptNet, which aims to represent commonsense and world knowledge, does indeed contain information useful for deciding whether an argument counters or supports a belief, in the case of ExplaGraphs, or whether it can aid in the selection of the most likely follow-up scenario to a situation, in the case of COPA-SSE. In Figure 5, both of the approaches based on lexical overlap (1 & 2) align the exact same graph with the text context, and judging from the result, it is pretty clear that the aligned graph has little to offer in terms of guiding the model towards the most likely follow-up.

9 Conclusion

In this work, we find that the process of identifying and retrieving the most relevant information in a knowledge graph appears under a range of different names in the literature, and we propose the term text-to-KG alignment. We systematise current approaches for text-to-KG alignment and evaluate a selection of them on two different tasks where manually created graphs are available, providing insights into how they compare to a scenario where the aligned graph is completely relevant to the text. Our experiments show that having access to such a graph can help performance significantly, and that current approaches based on lexical overlap are unsuccessful under our experimental setup, but that a generative approach using a PLM to generate a graph with entities from the manually written text as start and end points adds a significant increase in performance for multiple-choice type tasks such as COPA-SSE. For the approaches based on lexical overlap, we hypothesise that the lack of performance increase can be attributed to the choice of knowledge graph, in our case ConceptNet, which might not contain any information useful for solving the two tasks.
Limitations

While there is a lot of work on creating and making available large pre-trained language models for a range of languages, there are, to our knowledge, not that many knowledge graphs for languages other than English — especially general-knowledge ones like ConceptNet. This is a major limitation, as it restricts research to one single language and the structured representation of knowledge found in the culture associated with that specific group of language users. Creating commonsense KGs from unstructured text is a costly process that requires financial resources for annotation as well as available corpora to extract the graph from.

Ethics Statement

We do not foresee that combining knowledge graphs with pre-trained language models in the way done here adds to any of the existing ethical challenges associated with language models. However, this rests on the assumption that the knowledge graph does not contain any harmful information that might inject or amplify unwanted behaviour in the language model.

References

Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

Guy Aglionby and Simone Teufel. 2022. Identifying relevant common sense information in knowledge graphs. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 1–7, Dublin, Ireland. Association for Computational Linguistics.

Lisa Bauer and Mohit Bansal. 2021. Identify, align, and integrate: Matching knowledge graphs to commonsense reasoning tasks. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2259–2272, Online. Association for Computational Linguistics.

Maria Becker, Katharina Korfhage, and Anette Frank. 2021. COCO-EX: A tool for linking concepts from texts to ConceptNet. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 119–126, Online. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Ana Brassard, Benjamin Heinzerling, Pride Kavumba, and Kentaro Inui. 2022. COPA-SSE: Semi-structured explanations for commonsense reasoning. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3994–4000, Marseille, France. European Language Resources Association.

Ting-Yun Chang, Yang Liu, Karthik Gopalakrishnan, Behnam Hedayatnia, Pei Zhou, and Dilek Hakkani-Tur. 2020. Incorporating commonsense knowledge graph in pretrained models for social commonsense tasks. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 74–79, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Edsger Wybe Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271.

Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2021. Incorporating commonsense knowledge into abstractive dialogue summarization via heterogeneous graph networks. In Chinese Computational Linguistics: 20th China National Conference, CCL 2021, Hohhot, China, August 13–15, 2021, Proceedings, pages 127–142. Springer.

Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. 2020. Scalable multi-hop relational reasoning for knowledge-aware question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1295–1309, Online. Association for Computational Linguistics.

Silin Gao, Jena D. Hwang, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, and Antoine Bosselut. 2022. ComFact: A benchmark for linking contextual commonsense knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1656–1675, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Michael R. Garey and David S. Johnson. 1977. The rectilinear Steiner tree problem is NP-complete. SIAM Journal on Applied Mathematics, 32(4):826–834.

Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6473–6480.

Canming Huang, Weinan He, and Yongmei Liu. 2021. Improving unsupervised commonsense reasoning using knowledge-enabled natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4875–4885, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yong-Ho Jung, Jun-Hyung Park, Joon-Young Choi, Mingyu Lee, Junho Kim, Kang-Min Kim, and SangKeun Lee. 2022. Learning from missing relations: Contrastive learning with commonsense knowledge graphs for commonsense inference. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1514–1523, Dublin, Ireland. Association for Computational Linguistics.

Jivat Kaur, Sumit Bhatia, Milan Aggarwal, Rachit Bansal, and Balaji Krishnamurthy. 2022. LM-CORE: Language models with contextually relevant external knowledge. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 750–769, Seattle, United States. Association for Computational Linguistics.

Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, clever Hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 33–42, Hong Kong, China. Association for Computational Linguistics.

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 43–49, Online. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Swarnadeep Saha, Prateek Yadav, Lisa Bauer, and Mohit Bansal. 2021. ExplaGraphs: An explanation graph generation task for structured commonsense reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7716–7740, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yiheng Shu, Zhiwei Yu, Yuhan Li, Börje Karlsson, Tingting Ma, Yuzhong Qu, and Chin-Yew Lin. 2022. TIARA: Multi-grained retrieval for robust question answering over large knowledge base. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8108–8121, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Yueqing Sun, Qi Shi, Le Qi, and Yu Zhang. 2022. JointLK: Joint reasoning with language models and knowledge graphs for commonsense question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5049–5060, Seattle, United States. Association for Computational Linguistics.

Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren. 2020. Connecting the dots: A knowledgeable path generator for commonsense question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4129–4140, Online. Association for Computational Linguistics.
Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D. Manning, Percy Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems.

Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 535–546, Online. Association for Computational Linguistics.

Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren, Yiming Yang, and Michael Zeng. 2022. KG-FiD: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4961–4974, Dublin, Ireland. Association for Computational Linguistics.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, and Jure Leskovec. 2022. GreaseLM: Graph REASoning enhanced language models. In International Conference on Learning Representations.

A Appendix A

A.1 SentenceTransformer

We use the model with id all-mpnet-base-v2 to prune the different paths and to calculate similarity.

A.2 Grid search

Based on the following values, we do a grid search checking every possible combination.

Hyperparameter    Values
Learning rate     4×10⁻⁵, 3×10⁻⁵, 5×10⁻⁵, 6×10⁻⁶, 4×10⁻⁶, 1×10⁻⁶
Weight decay      0.01, 0.1
Batch size        4, 8, 16
Dropout           0.2, 0.3

Table 5: The values used for the grid search.

A.3 Hyperparameters

Based on the grid search, we select the following hyperparameters:

Hyperparameter    With graphs    w/o graphs
Learning rate     3×10⁻⁵         4×10⁻⁵
Dropout           0.3            0.3
Weight decay      0.01           0.1
Batch size        16             8

A.4 Seeds

Seeds used for both tasks during fine-tuning:
[9, 119, 7230, 4180, 6050, 257, 981, 1088, 416, 88]

A.5 Outliers
Figure 7: Outliers from the different runs for all graph configurations for ExplaGraphs. Circular dots mark outliers
that were removed, if any.
Figure 8: Outliers from the different runs for all graph configurations for COPA-SSE. Circular dots mark outliers
that were removed, if any.
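The circular dots in Figures 7 and 8 mark runs removed as outliers; the exact procedure is the one given in Appendix A.5. As an illustration only, the standard 1.5×IQR box-plot rule below is one common way to flag such runs, and is not necessarily identical to our procedure:

```python
import statistics

# Illustrative outlier removal over per-seed accuracy scores using
# the standard 1.5 * IQR box-plot rule. This is a generic sketch,
# not necessarily the exact procedure of Appendix A.5.

def remove_outliers(scores):
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s in scores if lo <= s <= hi]

# Hypothetical accuracies from ten seeds, one of them a clear outlier.
runs = [69.2, 70.1, 68.8, 69.5, 70.4, 55.0, 69.9, 70.0, 69.3, 69.7]
print(remove_outliers(runs))
```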