Combining Pre-Trained Language Models and Structured Knowledge

Cynthia Breazeal
MIT Media Lab
In recent years, transformer-based language models have achieved state-of-the-art performance
on a variety of NLP benchmarks. These models extract mostly distributional information, along
with some semantics, from unstructured text; however, it has proven challenging to integrate
structured information, such as knowledge graphs, into these models. We examine a variety
of approaches to integrating structured knowledge into current language models and identify
challenges, as well as possible opportunities, for leveraging both structured and unstructured
information sources. From our survey, we find that there are still opportunities in exploiting
adapter-based injections and that it may be possible to further combine several of the explored
approaches into one system.
1. Introduction
sequence of tokens that is then converted into some initial context-independent embeddings.
To a word sequence $S$ we can apply a tokenization function $\mathcal{T}$ that converts the word
sequence into a token sequence $T$, i.e., $\mathcal{T}(S) = T$. This sequence $T$ is used as a lookup
into an embedding layer $\mathcal{E}$ to produce context-independent token vector embeddings:
$\mathcal{E}(T) = E$. These are then passed sequentially through various contextualization layers
(i.e., transformer layers), which we define as the set $H = (H_1, \ldots, H_n)$. The successive
application of these ultimately produces a sequence of contextual embeddings $C$:
$C = H_n(H_{n-1}(\ldots H_1(E)))$. We additionally define $\mathbf{G}$ as the graph embeddings of a
knowledge graph $\mathcal{G}$ that are the result of some embedding function $E_g$: $\mathbf{G} = E_g(\mathcal{G})$.
The final sequence $C$ is run through a final layer $L_{LM}$ that is used to calculate the language
modeling loss $\mathcal{L}$, which is optimized through back-propagation. The notation we utilize is
intentionally loose about the definitions of these functions, so that it can accommodate the
different works that we survey.
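To make this notation concrete, the following is a minimal sketch of the pipeline $C = H_n(\ldots H_1(\mathcal{E}(\mathcal{T}(S))))$, with toy, hypothetical components standing in for a real tokenizer and real transformer layers:

```python
# Minimal sketch of the notation above with toy, hypothetical components:
# a tokenizer T, a context-independent embedding lookup E, and a stack of
# contextualization layers H_1..H_n producing contextual embeddings C.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def tokenize(sentence):                     # T: word sequence S -> token sequence T
    return [vocab.get(w, vocab["[UNK]"]) for w in sentence.lower().split()]

embedding_table = rng.normal(size=(len(vocab), 8))   # E: lookup table

def embed(token_ids):                       # E(T) = E, context-independent
    return embedding_table[token_ids]

def make_layer(dim=8):                      # stand-in for one transformer layer H_i
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return lambda X: np.tanh(X @ W)         # (real layers use self-attention)

layers = [make_layer() for _ in range(3)]   # H = (H_1, ..., H_n)

S = "the cat sat"
C = embed(tokenize(S))                      # E
for H_i in layers:                          # C = H_n(...H_1(E))
    C = H_i(C)
print(C.shape)                              # (num_tokens, hidden_dim)
```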
In the following sections we will look at attempts to inject knowledge graph
information that fall into the aforementioned categories. Additionally, we will highlight
relevant benefits of these approaches. We conclude with possible opportunities and
future directions for injecting structured knowledge into language models.
Figure 1
Visualization of boundaries of the different categories of knowledge injections. Combination
injections involve combinations of the three categories.
2. Input Injections

In this section we describe knowledge injections whose techniques center around
modifying either the structure of the input or the data that is selected to be fed into the
base transformer models. A common approach to injecting information from a knowledge
graph is to convert its assertions into a set of words (possibly including separator
tokens) and to pre-train or fine-tune a language model on these inputs. We discuss
two particular papers that focus on structuring the input in different ways so as to capture
the semantic information from triples found in a KG. These approaches start from a pre-
trained model and fine-tune it on their knowledge-infusing datasets. A summary of these
approaches can be found in Table 1.
Input-focused injections can be seen as any technique whose output is a modified $E$,
hereafter referred to as $E'$. This modification can be achieved by modifying $S$, $T$,
$\mathcal{T}$, $\mathcal{E}$, or directly $E$ (i.e., the word sequence, the token sequence, the
tokenization function, the context-less embedding function, or the actual context-less
embeddings). The hope of input-focused injections is that the knowledge in $E'$ will be
distributed and contextualized through $H$ as the language models are trained.
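As a concrete illustration of this strategy, the sketch below verbalizes KG assertions into word sequences that a standard pre-training or fine-tuning loop could consume; the relation templates and separator token are illustrative assumptions rather than any particular paper's format.

```python
# Sketch: turn KG triples into text sequences for LM fine-tuning (input injection).
# The templates and separator token are illustrative assumptions.
TEMPLATES = {
    "AtLocation": "{subj} is located at {obj}",
    "UsedFor":    "{subj} is used for {obj}",
}

def verbalize(triple, use_template=True, sep=" [SEP] "):
    subj, rel, obj = triple
    if use_template and rel in TEMPLATES:
        return TEMPLATES[rel].format(subj=subj, obj=obj)
    # fall back to raw concatenation, which can yield awkward phrasing
    return sep.join([subj, rel, obj])

triples = [("cat", "AtLocation", "house"), ("knife", "UsedFor", "cutting")]
corpus = [verbalize(t) for t in triples]
# `corpus` would then be tokenized and used for MLM or causal-LM fine-tuning.
print(corpus)
```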
COMET is a GPT-based (Radford et al. 2018) system which is trained on triples from KGs
(ConceptNet and Atomic) to learn to predict the object of a triple (triples being
defined as (subject, relation, object)). The triples are fed into the model as a concatenated
sequence of words (i.e., the words for the subject, the relation, and the object),
along with some separators.
The authors initialize the GPT model with the final weights from the training of
Radford et al. (2018) and proceed to train it to predict the words that
belong to the object of the triple. A very interesting aspect of this work is that it is
directly capable of performing knowledge graph completion, in the form of sentences,
for nodes and relations that may not have been seen during training.
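A minimal sketch of this training signal, under the assumption of a PyTorch-style cross-entropy loss that ignores label positions marked with -100: the triple is fed as a single token sequence, and only the object tokens contribute to the language-modeling loss.

```python
# Sketch: build a COMET-style training example where only the object tokens
# contribute to the language-modeling loss. Token ids are illustrative.
IGNORE_INDEX = -100   # ignored by e.g. torch.nn.CrossEntropyLoss by default

def make_example(subj_ids, rel_ids, obj_ids, sep_id):
    input_ids = subj_ids + [sep_id] + rel_ids + [sep_id] + obj_ids
    # mask out everything except the object, so the loss only trains
    # the model to generate the object given the subject and relation
    labels = [IGNORE_INDEX] * (len(subj_ids) + 1 + len(rel_ids) + 1) + obj_ids
    return input_ids, labels

input_ids, labels = make_example(subj_ids=[11, 12], rel_ids=[57], obj_ids=[88, 89], sep_id=3)
print(input_ids)  # [11, 12, 3, 57, 3, 88, 89]
print(labels)     # [-100, -100, -100, -100, -100, 88, 89]
```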
Some plausible shortcomings of this work are that the model is still left to extract
the semantic information from the distributional information, possibly suffering
from the same bias as AMS (Align, Mask, Select). In addition, by training on the text
version of these triples, it may be the case that some of the syntax the model has learned
is lost due to awkwardly formatted inputs (i.e., "cat located at housing" rather than
"a cat is located at a house"); however, further testing of these two points needs to be performed.
There is some relevant derivative work on COMET by Bosselut and Choi (2019), which
looks into how effective COMET is at building KGs on the fly given
a certain context, a question, and a proposed answer. They combine the context with a
relation from ATOMIC and feed it into COMET to represent a reasoning hop. They do
this for multiple relations and repeat the process with the generated outputs to represent
a reasoning chain for which they can derive a probability. They use this in a zero-shot
evaluation of a question-answering system and find it to be effective. Overall, some
highlights of COMET are:
Table 1
Input Injection System Comparisons
3. Architecture Injections
KnowBERT modifies BERT's architecture by integrating layers that the authors call the
Knowledge Attention and Recontextualization (KAR) component. These layers take graph entity
embeddings, which are based on Tucker tensor decompositions for KG completion (Balažević,
Allen, and Hospedales 2019), and run them through an attention mechanism to
generate entity-span embeddings. These span embeddings are then added to the regular
BERT contextual representations. The summed representations are then uncompressed
and passed on to the next layer of a regular BERT. Once the KAR entity linker has been
trained, the rest of the BERT model is unfrozen and trained further during pre-training. These
KAR layers are trained for every KG that is to be injected; in this work, the authors use data
from Wikipedia and WordNet.
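The following is a heavily simplified sketch of the recontextualization idea: pool a mention span, attend over candidate KG entity embeddings, and add the resulting entity vector back into the span's contextual representations. The single-head dot-product attention, mean pooling, and dimensions are assumptions; the actual KAR component is more involved.

```python
# Simplified sketch of a KAR-like step: pool a mention span, attend over
# candidate entity embeddings, and add the result back to the span's tokens.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inject_entity(contextual, span, entity_embs):
    """contextual: (seq_len, d); span: (start, end); entity_embs: (k, d)."""
    start, end = span
    span_vec = contextual[start:end].mean(axis=0)   # pooled mention representation
    attn = softmax(entity_embs @ span_vec)          # attention over candidate entities
    entity_vec = attn @ entity_embs                 # weighted entity embedding
    out = contextual.copy()
    out[start:end] += entity_vec                    # add back into the span tokens
    return out

rng = np.random.default_rng(1)
C = rng.normal(size=(6, 16))      # contextual representations from some layer H_i
E_g = rng.normal(size=(4, 16))    # candidate KG entity embeddings for the mention
C_prime = inject_entity(C, span=(2, 4), entity_embs=E_g)
print(C_prime.shape)              # (6, 16)
```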
An interesting observation is that the injection happens in the later layers, which
means that the contextual representation up to that point may be unaltered by the
injected knowledge. This is done to stabilize training, but it could present an opportunity
to inject knowledge at earlier layers. Additionally, in the way the system is trained,
the entity linker is trained first, and then the whole system is unfrozen to incorporate
the additional knowledge into BERT. This strategy could lead to the catastrophic forgetting
problem (Kirkpatrick et al. 2017), where the knowledge from the underlying BERT
model or from the additional structured injection may be forgotten or ignored.
This technique falls into a broader category of what are called adapters (Houlsby
et al. 2019). Adapters are layers that are added to a language model and are subsequently
fine-tuned for a specific task. The interesting aspect of adapters is that they add
a minimal number of additional parameters and freeze the original model weights.
The added parameters are also initialized to produce a close-to-identity output.
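Since adapters recur in several of the works surveyed below, the following is a minimal sketch of a Houlsby-style bottleneck adapter: a small down-/up-projection with a residual connection, initialized (here via a zeroed up-projection) so that it starts close to the identity while the base model stays frozen. The dimensions are illustrative assumptions.

```python
# Minimal sketch of a bottleneck adapter layer (Houlsby et al. 2019 style):
# few extra parameters, residual connection, near-identity initialization.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        # near-zero up-projection => the adapter starts close to the identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# During adapter training, the base model is frozen and only adapters update:
# for p in base_model.parameters(): p.requires_grad = False
adapter = Adapter()
x = torch.randn(2, 10, 768)           # (batch, seq_len, hidden_dim)
print(adapter(x).shape)               # torch.Size([2, 10, 768])
```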
It is worth noting that KnowBERT is not strictly an adapter technique, as the model is
unfrozen during training. Some highlights of KnowBERT are the following:
This work explores what kinds of knowledge are infused by fine-tuning an adapter-equipped
version of BERT on ConceptNet. The authors generate and test models trained on
sentences from the Open Mind Common Sense (OMCS) corpus (Singh et al. 2002) and
on walks over the ConceptNet graph. They note that with simple adapters and as little
as 25k/100k update steps on their training sentences, they are able to greatly improve the
encoded "world knowledge" (another name for the knowledge found in ConceptNet).
However, it is worth noting that the information is presented as sentences to which
the adapters are fine-tuned. This may mean that the model has possible shortcomings
similar to those of the input-focused approaches (the model may rely more on the
distributional than on the semantic information), although testing needs to be
performed to confirm this. Overall, some highlights of this work are the following:
Table 2
Architecture Injection System Comparisons

Model | Injected in Pre-Training | Injected in Fine-Tuning | Summary of Injection
KnowBERT | Yes | Yes | Sandwich adapter-like layers which sum the contextual representation of a layer with graph representations of entities and distribute the result over an entity span
Common sense or world knowledge? [...] | No | Yes | Uses sandwich adapters to fine-tune on a KG
4. Output Injections
In this section we describe approaches that focus on changing either the output structure
or the losses that were used in the base model in some way to incorporate knowledge.
Only one model falls strictly under this category; it injects entity embeddings
into the output of a BERT model.
5. Combination Injections

Here we describe approaches that use combinations of injection types, such as
input/output injections or architecture/output injections. We start by looking at
models that perform input injections and reinforce these with output injections (LIBERT,
KALM). We then look at models that manipulate the attention mechanism to mimic
graph connections (BERT-MK, K-BERT). We follow this by looking into KG-BERT, a
model that operates on KG triples, and K-Adapter, a modification of RoBERTa that
encodes KGs into adapter layers and fuses them. After this, we look into the approach
presented as Cracking the Contextual Commonsense Code [...], which determines that
there are areas lacking in BERT that could be addressed by supplying appropriate data,
and at ERNIE 2.0, a framework for multi-task training of semantically aware
models. Lastly, we look at two hybrid approaches which extract LM knowledge and
leverage it for different tasks. A summary of these injections can be found in Table 3.
5.1 Knowledge-Aware Language Model (KALM) Pre-Training (Rosset et al. 2020)
KALM is a system that does not modify the internal architecture of the model into which it
injects knowledge; rather, it modifies the input of the model by
fusing entity embeddings with the normal word embeddings that the language model
(in KALM's case, GPT-2) uses. The model is then enforced in the output to uphold the
entity information through an additional loss component in pre-training: a max-margin
loss between the cosine distance of the output contextual representation to
the input entity embedding and the cosine distance of the contextual representation to
a confounder entity. Altogether, this forces the model to notice when
there is an entity and pushes the contextual representation toward the semantics of
the correct input entity. Some highlights of KALM are:
• Sends an entity signal at the input and enforces it at the output of
a generative model so that the model notices the entity's semantics
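A plausible sketch of the max-margin objective described above, written here with cosine similarities; the margin value and the single-example formulation are assumptions.

```python
# Sketch of a max-margin loss over cosine similarities, as described for KALM:
# encourage the contextual representation c to be closer to the correct entity
# embedding e_pos than to a confounder e_neg. Margin value is an assumption.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_entity_loss(c, e_pos, e_neg, margin=0.5):
    return max(0.0, margin - cosine(c, e_pos) + cosine(c, e_neg))

rng = np.random.default_rng(2)
c, e_pos, e_neg = rng.normal(size=(3, 32))
print(max_margin_entity_loss(c, e_pos, e_neg))
```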
This work masks informative entities, drawn from a knowledge graph, in
BERT's MLM objective. In addition, they have an auxiliary objective which uses
a max-margin loss for a ranking task, built on a bilinear model that
calculates a similarity score between the contextual representation of an entity mention
and the representation of the [CLS] token for the text. This is used to determine whether
an entity is relevant or a distractor. Both KALM and this work are very similar, but a key
difference is that KALM uses a generative model without any kind of MLM objective,
and KALM does not do any kind of filtering of the entities. Some highlights of this
work are:
LIBERT converts batches of lexical constraints and negative examples into a BERT-compatible
format. The lexical constraints are synonyms and direct hyponym-hypernym pairs
(specific, broad) and take the form of a set of word tuples: $C = \{(w_1, w_2)_i\}_{i=1}^{N}$.
In addition to this set, the authors generate negative examples
by finding the words that are semantically closest to $w_1$ and $w_2$ in a given batch. They
then format the examples into something BERT can use, which is simply the wordpieces
that pertain to the words in the batch, separated by the separator token. They pass this input
through BERT and use the [CLS] token as input to a softmax classifier to determine
whether the example is a valid lexical relation or not.
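A minimal sketch of how such a constraint example might be packed into a BERT-style input and scored from the [CLS] position; the token ids, the stand-in encoder, and the classifier head are illustrative assumptions.

```python
# Sketch: format a lexical-constraint pair (w1, w2) as a BERT-style input and
# classify it from the [CLS] position. Vocabulary and ids are illustrative.
import torch
import torch.nn as nn

CLS, SEP = 101, 102                      # special token ids (illustrative)

def pack_pair(w1_pieces, w2_pieces):
    """Wordpieces of w1 and w2 separated by [SEP], as a single input sequence."""
    return [CLS] + w1_pieces + [SEP] + w2_pieces + [SEP]

input_ids = torch.tensor([pack_pair([2012, 2013], [3055])])

# Stand-in for a BERT encoder: any module mapping ids -> (batch, seq, hidden).
hidden = torch.randn(1, input_ids.shape[1], 768)

classifier = nn.Linear(768, 2)           # valid lexical relation vs. not
logits = classifier(hidden[:, 0])        # score from the [CLS] representation
print(logits.shape)                      # torch.Size([1, 2])
```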
During pre-training they alternate between a batch of sentences and a batch of
constraints. LIBERT outperforms BERT with fewer (1M) iterations of pre-training. It
is worth noting that as the number of training iterations increases, the gap between
the two systems, although still present, becomes smaller. This may indicate that, although
the additional training objective is effective, it may be getting overshadowed by the
regular MLM coupled with large amounts of data; however, more testing needs to be
performed. It is also worth noting that the authors do not align the sentences with the
constraint batches or combine the training tuples, which may hinder training as BERT has
to alternate between different training input structures; lastly, they do not incorporate
antonymy constraints in their confounder selection, so further experimentation
would be required to verify the effects of these choices. Some highlights of LIBERT are the
following:
K-BERT uses a combination of input injection and architecture injection. For a given
sentence, they inject relevant triples from a KG for the entities that are present in the
sentence. They inject these triples in between the actual text and utilize soft-position
embeddings to determine the order in which the triples are evaluated. These soft-position
embeddings simply add positional embeddings to the injected triple tokens. This in turn
creates a problem: the tokens are injected where entities appear in the sentence, and hence
the ordering of the tokens is altered.
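To illustrate, the sketch below shows soft positions for a sentence with one injected triple, together with a simplified visibility mask of the kind the masked self-attention discussed next relies on; the example sentence, indexing scheme, and masking rule are illustrative assumptions.

```python
# Sketch of K-BERT-style soft positions and a visibility mask for a sentence
# with an injected triple. Indexing scheme simplified for illustration.
import numpy as np

# Sentence tokens: "tim cook visits beijing"; triple injected after "cook":
# (cook, CEO_of, apple). Flattened input with the injected branch in the middle.
tokens = ["tim", "cook", "CEO_of", "apple", "visits", "beijing"]
is_injected = [False, False, True, True, False, False]

# Soft positions: injected tokens continue from their anchor entity ("cook" at 1),
# while the original sentence keeps its own ordering (visits=2, beijing=3).
soft_pos = [0, 1, 2, 3, 2, 3]

# Visibility mask: sentence tokens do not attend to injected tokens (and vice
# versa), except through the shared anchor entity token that links the branch.
n = len(tokens)
visible = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(n):
        same_branch = is_injected[i] == is_injected[j]
        anchored = i == 1 or j == 1          # "cook" bridges the two branches
        visible[i, j] = same_branch or anchored
print(visible.astype(int))
```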
To remedy this, the authors utilize a masked self-attention similar to BERT-MK. What
this means is that the attention mechanism should only be able to see everything up to
the entity that matched the injected triple. This attention mechanism helps the model
focus on what relevant knowledge it should incorporate. It would have been good to
see a comparison against simply adding these triples as sentences in the input, rather than
having to fix the attention mechanism to compensate for the erratic placement. Some highlights
of K-BERT are:
The authors present a combination approach which fine-tunes a BERT model with the
text of triples from a KG, similar to COMET. The authors also feed confounders, in the
form of random samples of entities, into the training of the system. It utilizes a binary
classification task to determine whether a triple is valid and a relationship-type prediction
task to determine which relations are present between pairs of entities. Although this
system is useful for KG completion, there is no evidence of its performance on other
tasks. Additionally, they train on one triple at a time, which may limit the model's ability to
learn the extended relationships for a given set of entities. Some highlights of KG-BERT
are the following:
A work based on adapters, K-Adapter works by adding projection layers before and
after a subset of transformer layers. They do this only for some specific layers in a pre-trained
RoBERTa model (the first layer, the middle layer, and the last layer). They then
freeze RoBERTa, as per the adapter work in (Houlsby et al. 2019), and train two adapters:
one to learn factual knowledge from Wikipedia triples (Elsahar et al. 2019) and one to learn
linguistic knowledge from the outputs of the Stanford parser (Chen and Manning 2014).
They train the adapters with a triple classification task (whether a triple is true or not),
similar to KG-BERT.
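A rough sketch of this structure: adapters read hidden states tapped from a few layers of a frozen base model, and their outputs are concatenated with the base model's final representation. The tapped layer indices, adapter internals, and fusion by concatenation are assumptions about details the summary above leaves open.

```python
# Rough sketch of K-Adapter-style fusion: adapters read hidden states from a
# few frozen transformer layers and their outputs are concatenated with the
# base model's output representation. Details are illustrative.
import torch
import torch.nn as nn

hidden_dim = 768
tap_layers = [0, 11, 23]                      # e.g. first, middle, last layer

class KnowledgeAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, tapped_states):
        # combine the tapped hidden states (one tensor per selected layer)
        return torch.relu(self.proj(torch.stack(tapped_states).mean(dim=0)))

factual_adapter = KnowledgeAdapter()          # trained on triple classification
linguistic_adapter = KnowledgeAdapter()       # trained on dependency-parse data

# Stand-ins for frozen RoBERTa hidden states: (batch, seq, hidden) per layer.
all_layers = [torch.randn(2, 10, hidden_dim) for _ in range(24)]
tapped = [all_layers[i] for i in tap_layers]

fused = torch.cat([all_layers[-1],
                   factual_adapter(tapped),
                   linguistic_adapter(tapped)], dim=-1)
print(fused.shape)                            # torch.Size([2, 10, 2304])
```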
It is worth noting that the authors compare RoBERTa and their K-Adapter approach
against BERT, and BERT has considerably better performance on the LAMA probes. The
authors attribute the major performance delta between their approach and BERT to
RoBERTa's byte-pair encoding (BPE) (Shibata et al. 1999). Another possible reason may be
that they only perform injection in a few layers rather than throughout the entire model,
although testing needs to be done to confirm this. Some highlights of K-Adapter are:
The authors analyze BERT and determine that it is deficient in certain attribute representations
of entities. The authors use the RACE (Lai et al. 2017) dataset and, based
on five attribute categories (Visual, Encyclopedic, Functional, Perceptual, Taxonomic),
select samples from the dataset that may help a BERT model compensate for deficiencies
in these areas. They then fine-tune on this data. In addition, the authors concatenate
the fine-tuned BERT embeddings with knowledge graph embeddings. These
graph embeddings are generated based on assertions that involve the entities
present in the questions and passages on which they train their final joint model (MCScript
2.0 (Ostermann, Roth, and Pinkal 2019)). Their selection of additional fine-tuning data
for BERT improves their performance on MCScript 2.0, highlighting that their selection
addressed missing knowledge.
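A minimal sketch of the concatenation step described above, with illustrative dimensions: a fine-tuned BERT representation is joined with a graph embedding before a task classifier.

```python
# Sketch: concatenate a (fine-tuned) BERT representation with a KG embedding
# for the entities in the passage/question, then classify. Dims illustrative.
import torch
import torch.nn as nn

bert_cls = torch.randn(4, 768)        # [CLS]-style representation per example
kg_emb = torch.randn(4, 100)          # graph embedding of the relevant assertions

classifier = nn.Linear(768 + 100, 2)  # e.g. answer correct / incorrect
logits = classifier(torch.cat([bert_cls, kg_emb], dim=-1))
print(logits.shape)                   # torch.Size([4, 2])
```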
It is worth noting that the graph embeddings that they concatenate boost the
performance of their system, which shows that there is still some information in KGs
that is not in BERT. We classify this approach as a combination approach because they
concatenate the BERT embeddings and the KG embeddings and fine-tune both
at the same time. The authors, however, give no insight as to how the KG embeddings
could have been incorporated in the fine-tuning/pre-training of BERT on the RACE
dataset. Some highlights of this work are:
The authors develop a framework that constructs pre-training tasks centered around
word-aware, structure-aware, and semantic-aware pre-training,
and proceeds to train a transformer-based model on these tasks. An interesting aspect
is that as they finish training on new tasks, they keep training on older tasks so that
the model does not forget what it has learned. In ERNIE 2.0 the authors do not incorporate
KG information explicitly. They do, however, have a sub-task within the word-aware pre-training
that masks entities and phrases, with the hope that the model learns the dependencies of the
masked elements, which may help to incorporate assertion information.
A possible shortcoming of this model is that some of the tasks intended to infuse
semantic information into the model (i.e., the semantic-aware tasks, which are a discourse
relation task and an information retrieval (IR) relevance task) rely on the model to
pick this information up from distributional examples. This could suffer from the same possible
issue as the input injections and would need to be investigated further. Additionally, they
do not explicitly use KGs in this work. Some highlights of ERNIE 2.0 are:
5.10 Graph-based reasoning over heterogeneous external knowledge for common-
sense question answering (Lv et al. 2020)
This is a hybrid approach in which the authors do not inject knowledge into a language model
(namely XLNet (Yang et al. 2019)); rather, they utilize the language model as a way to unify
graph knowledge and contextual information. They use XLNet embeddings as node
representations in a Graph Convolutional Network (GCN) to answer questions.
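A minimal sketch of this idea, omitting the subgraph construction and attention pooling described below: language-model contextual representations serve as node features for a standard graph convolution step. Shapes and the toy edge list are illustrative assumptions.

```python
# Sketch: use LM contextual representations as node features for one graph
# convolution step (A_hat = normalized adjacency). Shapes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
num_nodes, d = 5, 32
H = rng.normal(size=(num_nodes, d))       # node features, e.g. XLNet span reps

A = np.eye(num_nodes)                     # adjacency with self-loops
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # evidence-graph edges (illustrative)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))   # symmetric normalization

W = rng.normal(size=(d, d)) / np.sqrt(d)
H_next = np.maximum(A_hat @ H @ W, 0.0)   # one GCN layer: ReLU(A_hat H W)
print(H_next.shape)                       # (5, 32)
```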
They generate relevant subgraphs of ConceptNet and Wikipedia (from ConceptNet,
the relations that include entities in a question/answer exercise, and from Wikipedia,
the top 10 most relevant sentences retrieved with ElasticSearch). They then perform a
topological sort on the combined graphs and pass the result as input to XLNet. XLNet then
generates contextual representations that are used as representations for the nodes of a
Graph Convolutional Network (GCN). They then utilize graph attention to generate a graph-level
representation and combine it with XLNet's input ([CLS] token) representation to
determine whether an answer is valid for a question. In this model they do not fine-tune XLNet,
which could have been done on the dataset to produce better contextual representations;
additionally, they do not leverage the different levels of representation present in
XLNet. Some highlights of this work are the following:
5.11 Commonsense knowledge base completion with structural and semantic con-
text(Malaviya et al. 2020)
In another hybrid approach, the authors fine-tune a BERT model on a list of the unique
phrases that are used to represent nodes in a KG. They then take the embeddings
from BERT and from a subgraph (in the form of a GCN) and run them through an
encoder/decoder structure to determine the validity of an assertion.2
They then take this input and concatenate it with node representations for
a subgraph (in this case, a combination of ConceptNet and Atomic). They treat this
concatenation as an encoded representation and run combinations of these through
a convolutional decoder that additionally takes an embedding of a relation type.
The result of the convolutional decoder is run through a bilinear model and a
sigmoid function to determine the validity of the assertion. It is interesting that the
authors only run the convolution over one side, i.e., the convolution of $(e_i, e_{rel})$ rather than
both $(e_i, e_{rel})$ and $(e_{rel}, e_j)$ followed by a concatenation (where $e_i$ and $e_j$ are the entity
embeddings for entities $i$ and $j$ respectively, and $e_{rel}$ is the embedding for a specific
relationship); they instead rely on the bilinear model to join the two representations.
Some highlights of this work are the following:
2 It is worth noting that the two hybrid projects possibly benefited from the ability of these language
models to encode assertions, as shown by Feldman et al. (Davison, Feldman, and Rush 2019) and Petroni
et al. (Petroni et al. 2019).
• Use BERT to generate contextual embeddings for nodes
• Use an encoder-decoder structure to learn triples
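As a heavily simplified sketch of the scoring step described in this subsection, the snippet below replaces the convolutional decoder with a plain projection and keeps only the bilinear-plus-sigmoid validity score; every component here is an illustrative stand-in, not the authors' architecture.

```python
# Heavily simplified sketch of the scoring step: encode (e_i, e_rel) (stand-in
# for the convolutional decoder), then a bilinear form with e_j and a sigmoid.
import torch
import torch.nn as nn

d = 64
decoder = nn.Linear(2 * d, d)                  # stand-in for the conv decoder
bilinear = nn.Bilinear(d, d, 1)                # bilinear validity score

def score(e_i, e_rel, e_j):
    left = torch.relu(decoder(torch.cat([e_i, e_rel], dim=-1)))
    return torch.sigmoid(bilinear(left, e_j))  # probability the triple is valid

e_i, e_rel, e_j = torch.randn(3, 1, d)         # batched entity/relation embeddings
print(score(e_i, e_rel, e_j))
```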
6. Future Directions
Most input injections simply format KG information into whatever form a transformer
model can ingest. Although KALM has explored incorporating an entity signal into the input
representations, it would be interesting to add additional information, such as the lexical
constraints mentioned in LIBERT, to the word embeddings that are trained with
transformer-based models like BERT. A possible approach could be to build a post-specialization
system that generates retrofitted (Faruqui et al. 2014) representations
that can then be fed into language models.
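A minimal sketch of the retrofitting update of Faruqui et al. (2014) that such a post-specialization step could build on; the weighting choice (α = 1 for the original vector and β = 1/degree for neighbors) follows a common configuration and is an assumption here.

```python
# Minimal sketch of retrofitting (Faruqui et al. 2014): nudge word vectors
# toward their neighbors in a semantic lexicon while staying close to the
# original distributional vectors. alpha=1, beta=1/degree is a common choice.
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """vectors: {word: np.array}; lexicon: {word: [related words]}."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in lexicon.items():
            neighbors = [n for n in neighbors if n in new_vecs]
            if word not in new_vecs or not neighbors:
                continue
            beta = 1.0 / len(neighbors)
            num = vectors[word] + beta * sum(new_vecs[n] for n in neighbors)
            new_vecs[word] = num / (1.0 + beta * len(neighbors))
    return new_vecs

rng = np.random.default_rng(3)
vecs = {w: rng.normal(size=8) for w in ["car", "automobile", "banana"]}
retrofitted = retrofit(vecs, {"car": ["automobile"], "automobile": ["car"]})
print(np.linalg.norm(retrofitted["car"] - retrofitted["automobile"]))
```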
There are a variety of combined approaches, but none of them tackle all three areas
(input, architecture, and output) at the same time. It seems promising to test a
signaling method such as KALM's and see how it would work with an adapter-based
method similar to KnowBERT, the idea being that the input signal could help the
entity embeddings contextualize better within the injected layers. Additionally, it would
be interesting to see how the aforementioned combination would look with a system
similar to LIBERT, such that one could fuse entity embeddings with some semantic
information.
7. Conclusion
References
Balažević, Ivana, Carl Allen, and Timothy M Hospedales. 2019. Tucker: Tensor factorization for
knowledge graph completion. arXiv preprint arXiv:1901.09590.
Bodenreider, Olivier. 2004. The unified medical language system (umls): integrating biomedical
terminology. Nucleic acids research, 32(suppl_1):D267–D270.
Bollacker, Kurt, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a
collaboratively created graph database for structuring human knowledge. In Proceedings of the
2008 ACM SIGMOD international conference on Management of data, pages 1247–1250.
Bosselut, Antoine and Yejin Choi. 2019. Dynamic knowledge graph construction for zero-shot
commonsense question answering. arXiv preprint arXiv:1911.03876.
Bosselut, Antoine, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and
Yejin Choi. 2019. Comet: Commonsense transformers for automatic knowledge graph
construction. arXiv preprint arXiv:1906.05317.
Chen, Danqi and Christopher D Manning. 2014. A fast and accurate dependency parser using
neural networks. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), pages 740–750.
Da, Jeff and Jungo Kasai. 2019. Cracking the contextual commonsense code: Understanding
commonsense reasoning aptitude of deep contextual representations. arXiv preprint
arXiv:1910.01157.
Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov.
2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
arXiv:1901.02860.
Davison, Joe, Joshua Feldman, and Alexander M Rush. 2019. Commonsense knowledge mining
from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1173–1178.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Elsahar, Hady, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Elena
Simperl, and Frederique Laforest. 2019. T-rex: A large scale alignment of natural language
with knowledge base triples.
Faruqui, Manaal, Jesse Dodge, Sujay K Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith.
2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
Gabrilovich, Evgeniy, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase
annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction
level 0). http://lemurproject.org/clueweb09/FACC1/.
He, Bin, Di Zhou, Jinghui Xiao, Qun Liu, Nicholas Jing Yuan, Tong Xu, et al. 2019. Integrating
graph contextualized knowledge into pre-trained language models. arXiv preprint
arXiv:1912.00147.
Hewitt, John and Christopher D Manning. 2019. A structural probe for finding syntax in word
representations. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers), pages 4129–4138.
Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio
Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al.
2020. Knowledge graphs. arXiv preprint arXiv:2003.02320.
Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe,
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer
learning for nlp. arXiv preprint arXiv:1902.00751.
Kirkpatrick, James, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,
Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,
et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national
academy of sciences, 114(13):3521–3526.
Lai, Guokun, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale
reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Lauscher, Anne, Olga Majewska, Leonardo FR Ribeiro, Iryna Gurevych, Nikolai Rozanov, and
Goran Glavaš. 2020. Common sense or world knowledge? investigating adapter-based
knowledge injection into pretrained transformers. arXiv preprint arXiv:2005.11787.
Lauscher, Anne, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2019.
Informing unsupervised pretraining with external linguistic knowledge. arXiv preprint
arXiv:1909.02339.
Liu, Weijie, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020.
K-bert: Enabling language representation with knowledge graph. In AAAI, pages 2901–2908.
Lv, Shangwen, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin
Jiang, Guihong Cao, and Songlin Hu. 2020. Graph-based reasoning over heterogeneous
external knowledge for commonsense question answering. In AAAI, pages 8449–8456.
Malaviya, Chaitanya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020.
Commonsense knowledge base completion with structural and semantic context. In AAAI,
pages 2925–2933.
Màrquez, Lluís, Xavier Carreras, Kenneth C Litkowski, and Suzanne Stevenson. 2008. Semantic
role labeling: an introduction to the special issue.
Miller, George A. 1998. WordNet: An electronic lexical database. MIT press.
Munkhdalai, Tsendsuren, Alessandro Sordoni, Tong Wang, and Adam Trischler. 2019.
Metalearned neural memory. In Advances in Neural Information Processing Systems, pages
13331–13342.
Ostermann, Simon, Michael Roth, and Manfred Pinkal. 2019. MCScript2.0: A machine
comprehension corpus focused on script events and participants. arXiv preprint
arXiv:1905.09531.
Peters, Matthew E, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer
Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv
preprint arXiv:1909.04164.
Petroni, Fabio, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller,
and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint
arXiv:1909.01066.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving
language understanding by generative pre-training.
Rosset, Corby, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020.
Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655.
Sap, Maarten, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah
Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine
commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 3027–3035.
Shen, Tao, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. 2020.
Exploiting structured knowledge in text via graph-guided representation learning. arXiv
preprint arXiv:2004.14224.
Shibata, Yusuxke, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara,
Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme
that accelerates pattern matching. Technical report, Technical Report DOI-TR-161,
Department of Informatics, Kyushu University.
Singh, Push, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002.
Open mind common sense: Knowledge acquisition from the general public. In OTM
Confederated International Conferences "On the Move to Meaningful Internet Systems", pages
1223–1237, Springer.
Speer, Robyn, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual
graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
Speer, Robyn and Catherine Havasi. 2012. Representing general relational knowledge in
conceptnet 5. In LREC, pages 3679–3686.
Sun, Yu, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020.
Ernie 2.0: A continual pre-training framework for language understanding. In AAAI, pages
8968–8975.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008.
Wang, Ruize, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin
Jiang, Ming Zhou, et al. 2020. K-adapter: Infusing knowledge into pre-trained models with
adapters. arXiv preprint arXiv:2002.01808.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances
in neural information processing systems, pages 5753–5763.
Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. Kg-bert: Bert for knowledge graph
completion. arXiv preprint arXiv:1909.03193.
Ye, Zhi-Xiu, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple
method for incorporating commonsense knowledge into language representation models.
arXiv preprint arXiv:1908.06725.
Zhang, Zhuosheng, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang
Zhou. 2019. Semantics-aware bert for language understanding. arXiv preprint
arXiv:1909.02209.
Table 3
Combination Injection Systems Comparisons

Model | Injected in Pre-Training | Injected in Fine-Tuning | Summary of Input Injection | Summary of Architecture Injection | Summary of Output Injection
KALM | Yes | No | Combines an entity embedding with the model's word embeddings | N/A | Incorporates a max-margin loss with cosine distances to enforce semantic information
Exploiting structured knowledge in text [...] | Yes | No | Uses a KG-informed masking scheme to exploit MLM learning | N/A | Incorporates a max-margin loss with distractors from a KG and a bilinear scoring model for the max-margin loss
LIBERT | Yes | No | Alternates between batches of sentences and batches of tuples that are lexically related | N/A | Adds a binary classifier as a third training task to determine if the tuples form a valid lexical relation
BERT-MK | Yes | No | N/A | Combines a base transformer with KG transformer modules which are trained to learn contextual entity representations and have an attention mechanism that mimics the connections of a graph | Uses a triple reconstruction loss, similar to MLM, but for triples
K-BERT | Yes | No | Incorporates, as part of the training batches, assertions for entities present in a sample | Modifies the attention mechanism and position embeddings to reorder injected information and mimic KG connections | N/A
KG-BERT | No | Yes | Feeds triples from a knowledge graph as input examples | N/A | Uses a binary classification objective to determine if a triple is correct and a multi-class classification objective to determine what kind of relation the triple has
K-Adapter | No | Yes | N/A | Fine-tunes adapter layers to store information from a KG | Combines the model and different adapter layers to give a contextual representation with information from different sources; additionally uses a relation classification loss for each trained adapter
Cracking the contextual commonsense code [...] | Yes | No | Pre-processes the data to address commonsense relation properties that are deficient in BERT | N/A | Concatenates a graph embedding to the output of the BERT model
ERNIE 2.0 | Yes | No | Constructs data for pre-training tasks | N/A | Provides a battery of tasks that are trained in parallel to enforce different semantic areas in a model
Table 4
Knowledge Injection Models Overview

Knowledge Injection Approach | Underlying Language Model | Type of Injection | Knowledge Sources | Training Objective
Align, Mask, Select | BERT | Input | ConceptNet | Binary Cross Entropy
COMET | GPT | Input | Atomic, ConceptNet | Language Modeling
KnowBERT | BERT | Architecture | Wikipedia, WordNet | Masked Language Modeling
Common sense or world knowledge? [...] | BERT | Architecture | ConceptNet | Masked Language Modeling
SemBERT | BERT | Output | Semantic Role Labeling of pre-training data | Masked Language Modeling
KALM | GPT-2 | Combination (Input + Output) | FACC1 and FAKBA entity annotations (Gabrilovich, Ringgaard, and Subramanya 2013) | Language Modeling + Max Margin
Exploiting Structured [...] | BERT | Combination (Input + Output) | ConceptNet | Masked Language Modeling + Max Margin
LIBERT | BERT | Combination (Input + Output) | WordNet, Roget's Thesaurus | Masked Language Modeling + Max Margin
Graph-based reasoning over heterogeneous external knowledge | XLNet + Graph Convolutional Network | Hybrid (Language Model + Graph Reasoning) | Wikipedia, ConceptNet | Cross Entropy
Commonsense knowledge base completion with structural and semantic context | BERT + Graph Convolutional Network | Hybrid (Language Model + GCN Embeddings) | Atomic, ConceptNet | Binary Cross Entropy
ERNIE 2.0 | Transformer-based model | Combination (Input + Output) | Wikipedia, BookCorpus, Reddit, Discovery Data (various types of relationships extracted from these datasets) | Various tasks, among them Knowledge Masking, Token-Document Relation Prediction, Sentence Distance Task, IR Relevance Task
BERT-MK | BERT | Combination (Architecture + Output) | Unified Medical Language System (UMLS) (Bodenreider 2004) | Masked Language Modeling, Max Margin
K-BERT | BERT | Combination (Input + Architecture) | TBD | Same as BERT
KG-BERT | BERT | Combination (Input + Output) | Freebase, WordNet, UMLS | Binary + Categorical Cross Entropy
K-Adapter | RoBERTa | Combination (Architecture + Output) | Wikipedia, dependency parsing of BookCorpus | Binary Cross Entropy
Cracking the Commonsense Code | BERT | Combination (Input + Output) | N/A: fine-tuning on a RACE dataset subset | Binary Cross Entropy
Table 5
Knowledge Injection Models Performance Comparison

Knowledge Injection Approach | Benchmark Name | Base Model | Base Model Benchmark Performance | Knowledge-Injected Model Performance | Percent Difference
BERT-CS_base (Align, Mask, Select) | GLUE (Average) | BERT-Base | 78.975 | 79.612 | 0.81%
BERT-CS_large (Align, Mask, Select) | GLUE (Average) | BERT-Base-large | 81.5 | 81.45 | -0.06%
LIBERT (2M) | GLUE (Average) | BERT baseline trained with 2M examples | 72.775 | 74.275 | 2.06%
SemBERT_base | GLUE (Average) | BERT-Base | 78.975 | 80.35 | 1.74%
SemBERT_large | GLUE (Average) | BERT-Base | 81.5 | 84.262 | 3.39%
K-Adapter (F+L) | CosmosQA, TACRED | RoBERTa + multitask training | 81.19, 71.62 | 81.83, 71.93 | 1.54%, 0.95%
ERNIE 2.0 (large) | GLUE (Average) | BERT-Base-Large | 81.5 | 84.65 | 3.87%
BERT-MK | Entity Typing, Rel. Classification (using UMLS) | BERT-base | 96.55, 77.75 | 97.26, 83.02 | 0.74%, 6.78%
K-BERT | XNLI | BERT-base | 75.4 | 76.1 | 0.93%
Cracking the contextual commonsense code (BERT-Large + KB + RACE) | MCScript 2.0 | BERT-Large | 82.3 | 85.5 | 3.89%
KnowBERT | TACRED | BERT-Base | 66 | 71.5 | 8%
Common Sense or World Knowledge? (OM-ADAPT 100K) | GLUE (Average) | BERT-Base | 78.975 | 79.225 | 0.40%
Common Sense or World Knowledge? (CN-ADAPT 50K) | GLUE (Average) | BERT-Base | 78.975 | 79.225 | 0.32%