
Knowledge Graph Embedding and OpenKE

Yingying Chen, Xiaozhe Yao


[email protected]

December 2019

Abstract
OpenKE[1] is an open toolkit for knowledge embedding, which provides a unified framework and various
fundamental models to embed knowledge graphs into a continuous low-dimensional space. It prioritizes
operational efficiency to support fast model validation and large-scale knowledge representation learning. At
the same time, OpenKE maintains sufficient modularity and extensibility to easily incorporate new models into
the framework. In this article, we first introduce several algorithms provided by OpenKE, which we then
evaluate in more detail in Section 2. Beyond that, we discuss several directions for possible future work in
Section 3. All code and the parsed datasets can be found at Yaonotes (https://yaonotes.org/datasets/nlp/).

1 Introduction

A knowledge graph (KG) is a model of a knowledge domain created by domain experts with the help of machine learning algorithms. It provides a structural representation of knowledge and a unified interface for the structured data, which enables the creation of smart multilateral relations throughout the database. In the past few years, a large number of KGs have been created and successfully applied to many real-world applications, from semantic parsing to information extraction. A KG is a multi-relational graph composed of entities (nodes) and relations (edges). Each edge is represented by a triple of the form (head, relation, tail), also called a fact, indicating that two entities are connected by a specific relation. Although effective in representing structured data, the underlying symbolic nature of such triples usually makes KGs hard to manipulate.

To tackle this issue, a new research direction known as knowledge graph embedding has quickly gained massive attention. The key idea is to embed the components of a KG, including entities and relations, into continuous vector spaces, so as to simplify manipulation while preserving the inherent structure of the KG[2]. These entity and relation embeddings can further be used to benefit all kinds of tasks, such as entity classification, relation extraction, etc. Algorithms for KG embedding include TransE[3], TransR[4], and DistMult[5].

1.1 Knowledge Embedding Algorithms

1.1.1 TransE

TransE (Translating Embeddings) is a method that models relationships by interpreting them as translation operations, not on the graph structure directly but on a learned low-dimensional embedding of the knowledge graph entities.

In TransE, if a fact (subject, relation, object) holds, then the embedding of the object entity should be close to the embedding of the subject entity plus some vector representing the relation type. All we expect from the learned embedding is that Subject + Relation ≈ Object if the fact holds, and that Subject + Relation is far away from Object if the fact is wrong.

To evaluate the quality of such embeddings, we need a measurement of how close a predicted edge or vertex is to the ground truth. Here we can use the ℓ1 or ℓ2 norm as the distance measure. Hence, we define d(y', y) = ||h + r − t||. The distance is illustrated below.

Figure 1: Illustration of problems in TransE
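As a concrete illustration of the translation-based distance just described, the following is a minimal numpy sketch (not OpenKE's implementation; the randomly initialized lookup tables are purely hypothetical) that scores a triple under the ℓ1 or ℓ2 norm:

    import numpy as np

    # Hypothetical toy embedding tables: one row per entity / relation.
    ENT_DIM = 50
    entity_emb = np.random.randn(14541, ENT_DIM)   # e.g. FB15k-237 entity count
    relation_emb = np.random.randn(237, ENT_DIM)   # e.g. FB15k-237 relation count

    def transe_distance(head_id, rel_id, tail_id, norm=2):
        """d(h, r, t) = ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm.
        A small distance means the fact (head, relation, tail) is plausible."""
        h = entity_emb[head_id]
        r = relation_emb[rel_id]
        t = entity_emb[tail_id]
        return np.linalg.norm(h + r - t, ord=norm)

    # After training, a true triple should score lower than a corrupted one.
    print(transe_distance(0, 0, 1), transe_distance(0, 0, 2))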
To perform the training, we need to define a loss function. As we have discussed, intuitively, we want the embeddings to be as close as possible if the relation holds, and as far apart as possible if the relation does not hold. Hence there are two distances, d_hold and d_otherwise, which we call d_pos and d_neg. All we want is for d_neg to be as large as possible and d_pos to be as small as possible, so we could take L = (d_pos − d_neg). If we train against this directly, some weird results may occur because there is no constraint on the loss: for example, the gap may grow towards ∞, which is not what we want. So we add a constraint: once the gap d_neg − d_pos is large enough (for example, d_neg = 1 while d_pos = 0, so d_neg − d_pos = 1 in our case), the pair should contribute no further loss. This eventually becomes a hinge loss of the form L = max(0, margin + d_pos − d_neg). Using gradient descent, we can train against this loss function.

There are some other training tricks presented in the original paper, such as generating negative samples by fixing the head (resp. tail) and changing the tail (resp. head).

The most important drawback of the TransE approach is that it can only handle one-to-one relations. To tackle this problem, many different members of the Trans* family have appeared, which we discuss in the following sections.

1.1.2 TransR

One important assumption in TransE is that the entities and relations live in the same semantic vector space R^k (that is why we can perform vector addition). However, an entity may in fact have multiple different aspects, and the corresponding relations may vary. For instance, tiger as an entity may have two completely different aspects and relations with shark. From the aspect of the tuple (tiger, rdf:type, mammal), tiger will be far from shark because a shark is not a mammal, but if we view it from the aspect (tiger, rdf:type, large animals), then shark will be very close to tiger. Therefore, viewed from different, incomparable vector spaces, some entities may look quite different depending on the aspect. In the illustration below, we assume two independent entities, tiger and shark; the relation vector in blue indicates that we want shark to be close to tiger since both are huge animals, while the relation vector in red indicates that shark and tiger should not be too close, since they are not both mammals.

Figure 2: Illustration of TransR

The limitation of TransE comes from the assumption that all relations are located in a single semantic vector space. If we instead assume they live in different spaces, and map different relations into different relation spaces, we can calculate the distance in the specific relation space.

In TransR, the entities live in the entity space, i.e. h, t ∈ R^k, while the relations live in a relation space R^d; in general k ≠ d. For each relation r, we define a projection matrix M_r ∈ R^{k×d}. Then we can project an entity into the relation space as h_r = h M_r, t_r = t M_r. A simple illustration of TransR can be seen in the following figure.

Figure 3: Illustration of TransR

With these projections, we can then adopt exactly the same distance measure and loss function as in TransE.

1.1.3 Other Trans* Family

There are many other translating models, such as translating on hyperplanes (TransH)[6], translating via a dynamic mapping matrix (TransD)[7], and many others. Due to the page limit, we only briefly introduce their ideas.

TransH was proposed to handle the one-to-many and many-to-one problem, another shortcoming of TransE. For each relation, TransH maps the head and tail onto a relation-specific hyperplane, obtaining h⊥ and t⊥. On the hyperplane, there is a vector representing the relation. By doing this factorization, we eventually use only a part of h; the projected part captures a specific aspect of an entity and helps to solve one-to-many problems.

TransD was proposed to handle a problem of TransR. In TransR, different relations are projected into different vector spaces, which makes sense but is not enough. Take tiger and mammal as an example: tigers and mammals are objects of different kinds (tiger is a specific animal while mammal is a category), and [7] proposed that such objects should also be projected into different vector spaces. Meanwhile, TransR needs many different vector spaces, which leads to very high computational complexity. In TransD, every relation and entity is represented by two vectors, one for semantic information and one for building the projection matrix; the projection matrix is then formed from both the entity and relation vectors.
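To tie the pieces above together, here is a small numpy sketch (a toy illustration under assumed dimensions, not OpenKE's actual training code) of the margin-based objective with head/tail corruption from Section 1.1.1, using the TransR-style projected distance from Section 1.1.2:

    import numpy as np

    rng = np.random.default_rng(0)
    NUM_ENT, NUM_REL, K, D = 1000, 20, 50, 30    # hypothetical sizes
    ent = rng.normal(size=(NUM_ENT, K))          # entity embeddings, in R^k
    rel = rng.normal(size=(NUM_REL, D))          # relation embeddings, in R^d
    M = rng.normal(size=(NUM_REL, K, D))         # one projection matrix M_r per relation

    def transr_distance(h, r, t):
        """||h M_r + r - t M_r||: project entities into the relation space, then translate."""
        h_r = ent[h] @ M[r]
        t_r = ent[t] @ M[r]
        return np.linalg.norm(h_r + rel[r] - t_r)

    def corrupt(triple):
        """Negative sampling: fix the head (resp. tail) and replace the tail (resp. head)."""
        h, r, t = triple
        if rng.random() < 0.5:
            return (rng.integers(NUM_ENT), r, t)
        return (h, r, rng.integers(NUM_ENT))

    def hinge_loss(pos_triple, margin=1.0):
        """L = max(0, margin + d_pos - d_neg); zero once the gap is large enough."""
        neg_triple = corrupt(pos_triple)
        d_pos = transr_distance(*pos_triple)
        d_neg = transr_distance(*neg_triple)
        return max(0.0, margin + d_pos - d_neg)

    print(hinge_loss((0, 3, 42)))

Setting M_r to the identity (with k = d) recovers plain TransE, which is why the same loss can be reused across the Trans* family.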
1.1.4 LFM and DistMult

In addition to translational models, several other embeddings are feasible for representing a knowledge graph, one of which is the Latent Factor Model (LFM)[8][9].

The latent factor model assumes that the head's vector is orthogonal to the tail's vector, and uses a matrix to map the head's vector to the tail's vector. Therefore we can consider a score function of the form f_r(h, t) = l_h^T M_r l_t, where M_r ∈ R^{k×k}. Since the score combines every element of l_h with every element of l_t linearly through M_r, it can be regarded as a bilinear transformation of l_h and l_t.

The problem with LFM is that the computational complexity is O(k^2), which may take too long to compute. To tackle this, [5] proposed reducing the number of relation parameters by restricting M_r to be a diagonal matrix. This results in the same number of relation parameters as TransE.

1.2 OpenKE

OpenKE is an open toolkit developed by Tsinghua University, and it provides a unified underlying platform to organize data and memory. It has three major advantages compared with other knowledge embedding toolkits:

• Efficient Implementation. OpenKE uses C++ for data preprocessing and negative sampling. As discussed for TransE, most methods use similar approaches, alternating the head or tail, to acquire negative samples. By writing this module in C++, OpenKE is considered to be faster than other pure-Python implementations.

• Python Interface. The supported models are implemented in TensorFlow[10] and PyTorch[11], which provides a Python interface for programming and debugging. It also enables the models to run in parallel on GPUs without programming in CUDA.

• Multiple Models. It has built-in models like RESCAL, DistMult, TransE/H/R/D, etc. These models can be used directly without any modification.

1.3 Datasets

1.3.1 FB15K-237

Several studies[12][13] have found that there is a severe bias in FB15k, as it contains many pairs of (h, r, t) and (t, r^-1, h) where r^-1 is the inverse of r. In the following experiments, we therefore use a subset of FB15k, called FB15k-237, which is constructed from FB15k with redundant relations removed.

The dataset is already built into OpenKE, and we do not perform any preprocessing on it.

1.3.2 Yago Film Dataset

YAGO is a huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames. At present, YAGO has knowledge of more than 10 million entities (persons, organizations, cities, etc.) and contains more than 120 million facts about these entities. The dataset is represented as RDF triples, with one triple (head, tail, relation) per line.

We perform preprocessing on the original Yago Film dataset. According to the requirements of OpenKE, we extract the entities, relations and triples, index them, and save them into entity2id.txt, relation2id.txt and train2id.txt respectively. From train2id.txt, we randomly choose 20% of the triples for testing; these triples are moved into test2id.txt and removed from the train file. After that, we generate the type-constraint.txt file using the provided n-n.py script.

2 The Training Setup

2.1 Setup

We run our experiments on a single NVIDIA T4 GPU with 12 GB of CUDA memory, hosted by Google Colab, falling back to a Tesla P100 when no T4 GPU is available.

We run TransE, TransR and DistMult on the FB15k-237 and YAGO Film datasets. Two hyper-parameters (dimension and training epochs) are varied to repeat the training process. We set up a controlled experiment, varying dimension and training epochs separately: when one of them is chosen as the independent variable, the other is fixed in the control group.

The FB15k-237 dataset consists of 237 relations and 14541 entities. We use 272115 triples for training and 20466 for testing. After preprocessing, the Yago Film dataset uses 426285 triples for training, 10 for validation and 104094 for testing; the dataset itself consists of 14 relations and 90088 entities. The preprocessing scripts and processed datasets can be downloaded from the link on the first page.

2.2 Results and Analysis

In the following result tables, we use T1 to denote link prediction and T2 to denote triple classification. In the link prediction task, we use hit@10 (the model returns its top 10 candidates, and the result counts as correct if the correct entity is among them) as the metric. In the triple classification task, which can be regarded as a binary classification task, we use accuracy and the decision threshold as metrics.
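As a small illustration of how these two metrics are computed (a generic sketch with made-up numbers, not OpenKE's evaluation code), consider:

    import numpy as np

    def hit_at_10(ranked_entity_ids, correct_id):
        """Link prediction: 1 if the correct entity appears among the top-10 candidates."""
        return int(correct_id in ranked_entity_ids[:10])

    def triple_classification_accuracy(distances, labels, threshold):
        """Triple classification: a triple is predicted true if its distance falls below
        the (validation-tuned) threshold; accuracy is the fraction of correct predictions."""
        predictions = distances < threshold
        return float(np.mean(predictions == labels))

    # Toy example with made-up numbers.
    print(hit_at_10([7, 3, 99, 12, 5, 0, 1, 2, 8, 4], correct_id=12))       # -> 1
    print(triple_classification_accuracy(
        np.array([2.1, 9.8, 3.0, 12.5]),        # distances d(h, r, t)
        np.array([True, False, True, False]),   # ground-truth labels
        threshold=5.0))                         # -> 1.0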
Table 1: Experimental Results on FB15k-237

Dataset: FB15k-237
                         Configs            T1        T2
Approach     Dims.   epochs       hit10     acc    thres
TransE        150      500         46.5    93.1    12.5
TransE        150      800         46.4    92.9    12.7
TransE        150     1000         46.8    93.1    12.6
TransE        100      500         44.2    94.2    9.45
TransE        100      800         44.1    94.1    9.51
TransE        100     1000         44.3    94.1    9.54
TransE         50      500         37.8    94.5    5.94
TransE         50      800         37.5    94.7    5.91
TransE         50     1000         37.0    94.8    5.87
TransR        150      500         50.2    91.4    13.4
TransR        150      800         49.6    91.0    13.4
TransR        150     1000         49.5       0       0
TransR        100      500         49.2    91.9    10.7
TransR        100      800         48.9    91.4    10.7
TransR        100     1000         48.7       0       0
DistMult      150      500         33.0    93.1    1.40
DistMult      150      800         32.3    92.8    1.68
DistMult      150     1000         32.4    92.7    1.67
DistMult      100      500         29.7    92.6    0.62
DistMult      100      800         30.6    92.9    0.98
DistMult      100     1000         29.5    92.8    0.291

Dataset: Yago Film
Approach     Dims.   epochs       hit10     acc    thres
TransE        150       50         19.0    88.3    12.8
TransE        150      150         21.4    89.5    12.94
TransE        150      200         21.8    89.6    13.11
TransE        100       50         17.7    89.2    9.98
TransE        100      150         19.5    90.1    10.1
TransE        100      200         19.9    90.4    10.0
TransE         50       50         13.7    89.4    6.25
TransE         50      150         15.4    90.2    6.17
TransE         50      200         15.5    90.4    6.10
TransR        150      500            0       0       0
TransR        150      800            0       0       0
TransR        150     1000            0       0       0
TransR        100      500            0       0       0
TransR        100      800            0       0       0
TransR        100     1000            0       0       0
DistMult      150       50         12.4    87.8    1.46
DistMult      150      150            0       0       0
DistMult      150      200        13.43   88.18    1.65
DistMult      100       50         9.16    87.5    1.07
DistMult      100      150         9.47   88.18    1.06
DistMult      100      200        10.97   88.52    0.87
DistMult       50       50         4.96    84.7    0.32
DistMult       50      150         5.89    86.7    0.28
DistMult       50      200         6.10    86.2    0.48

2.2.1 Analysis on FB15K-237

In the TransE experiments, we found that in general the more dimensions we use, the better the results, which agrees with our assumption that higher dimensions can better describe the relations. Besides this, we found that TransR outperforms TransE, supporting the theoretical argument that projecting relations into separate spaces can improve accuracy. However, training and testing TransR costs much more time than TransE.

Among the three models, to our surprise, DistMult performs worst on the link prediction task. This may be because its hyper-parameters were not tuned; it turns out that we should use more dimensions to train the DistMult models and test their accuracy, rather than reusing the same configuration as the translating models.

2.2.2 Analysis on Yago Film

On the Yago Film dataset, due to limited time and GPU resources, several experiments have not been completed successfully. They will be added this week, if time permits.

In our configuration, we set a quite small validation set (10 samples) and a relatively large test set (104094 samples). In the experiments, we found that these models do not work well on link prediction tasks, which may be caused by the early-stopping settings: the model may stop training soon after it works well on the validation set while still generalizing poorly. It turns out that we should use a larger validation set to decide when training should be stopped.

3 Conclusion and Summary

In this short article, we introduced several methods for knowledge graph embedding, such as TransE, TransR, LFM, DistMult, TransD and TransH. We used FB15k-237 and Yago Film to conduct experiments comparing the different algorithms. We have found that TransR generally achieves the best link prediction results at the cost of much longer training time, while DistMult performs worst under the same configuration.

4 Discussion

Comparing the trend within the translating models, it is clear that they follow a common route towards better describing relations in a knowledge graph: TransE uses a simple distance-based optimization, then TransR proposed that relations should live in different spaces, and after that TransD proposed that the mapping function should depend not only on the relation but also on the entities.

A potential direction for future work is a better way to find the mapping matrix between the original space and the projection space. Here "better" has two aspects: one is performance, as DistMult has done (making the matrix diagonal), and the other is accuracy.
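To make the diagonal restriction mentioned above concrete, here is a tiny numpy sketch (illustrative only, with made-up dimensions) contrasting the full bilinear LFM score with the DistMult variant, which keeps only the diagonal of M_r:

    import numpy as np

    k = 100
    l_h, l_t = np.random.randn(k), np.random.randn(k)  # head / tail embeddings
    M_r = np.random.randn(k, k)                        # full bilinear matrix: O(k^2) parameters
    r_diag = np.random.randn(k)                        # DistMult relation vector: O(k) parameters

    lfm_score = l_h @ M_r @ l_t                  # f_r(h, t) = l_h^T M_r l_t
    distmult_score = np.sum(l_h * r_diag * l_t)  # same form with M_r = diag(r)

    print(lfm_score, distmult_score)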
Regarding the huge number of parameters in TransR, a question arises: is it possible to perform a "group mapping" of relations to new spaces? This should make sense, since many relations may be similar. If we pre-train with TransE and then project similar relations to the same space, we would reduce the number of parameters in the model. This would hurt accuracy somewhat but could improve performance.

References
[1] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li, “Openke: An open toolkit for knowledge embed-
ding,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pp. 139–144, 2018.
[2] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applica-
tions,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017.

[3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling
multi-relational data,” in Advances in neural information processing systems, pp. 2787–2795, 2013.
[4] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph
completion,” in Twenty-ninth AAAI conference on artificial intelligence, 2015.
[5] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference
in knowledge bases,” arXiv preprint arXiv:1412.6575, 2014.
[6] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph embedding by translating on hyperplanes,” in
Twenty-Eighth AAAI conference on artificial intelligence, 2014.
[7] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, “Knowledge graph embedding via dynamic mapping matrix,” in Pro-
ceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 687–696, 2015.
[8] I. Sutskever, J. B. Tenenbaum, and R. R. Salakhutdinov, “Modelling relational data using bayesian clustered
tensor factorization,” in Advances in neural information processing systems, pp. 1821–1828, 2009.
[9] R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, “A latent factor model for highly multi-relational
data,” in Advances in Neural Information Processing Systems, pp. 3167–3175, 2012.
[10] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al.,
“Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 16), pp. 265–283, 2016.

[11] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga,
et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Informa-
tion Processing Systems, pp. 8024–8035, 2019.
[12] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon, “Representing text for joint
embedding of text and knowledge bases,” in Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, pp. 1499–1509, 2015.
[13] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2d knowledge graph embeddings,” in
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
