Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao,
Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
{lunadong|gabr|geremy|wilko|nlao|kpmurphy|tstrohmann|sunsh|weizh}@google.com
Table 1: Comparison of knowledge bases. KV, DeepDive, NELL, and PROSPERA rely solely on extraction; Freebase and KG rely on human curation and structured sources; YAGO2 uses both strategies. Confident facts are those with a probability of being true at or above 0.9.
(a) Ce Zhang (U Wisconsin), private communication.
(b) Bryan Kiesel (CMU), private communication.
(c) Core facts, http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html
(d) This is the number of non-redundant base triples, excluding reverse predicates and "lazy" triples derived from flattening CVTs (complex value types).
(e) http://insidesearch.blogspot.com/2012/12/get-smarter-answers-from-knowledge_4.html
An alternative approach to constructing the test set would have been to leave out all edges emanating from a particular node. However, in such a case, the graph-based models would have no signal to leverage. For example, suppose we omitted all facts about Barack Obama, and asked the system to predict where he lives, and who his children are. This would be possible given text extractions, but impossible given just a prior graph of facts. A compromise would be to omit all edges of a given type; for example, we could omit connections to all his children, but leave in other relations. However, we think the random sampling scenario more accurately reflects our actual use-case, which consists of growing an existing KB, where arbitrary facts may be missing.

2.2 Local closed world assumption (LCWA)

All the components of our system use supervised machine learning methods to fit probabilistic binary classifiers, which can compute the probability of a triple being true. We give the details on how these classifiers are constructed in the following sections. Here, we describe how we determine the labels (we use the same procedure for the training and test set).

For (s, p, o) triples that are in Freebase, we assume the label is true. For triples that do not occur in Freebase, we could assume the label is false (corresponding to a closed world assumption), but this would be rather dangerous, since we know that Freebase is very incomplete. So instead, we make use of a somewhat more refined heuristic that we call the local closed world assumption.

To explain this heuristic, let us define O(s, p) as the set of existing object values for a given s and p. This set will be a singleton for functional (single-valued) predicates such as place of birth, but can have multiple values for general relations, such as children; of course, the set can also be empty. Now, given a candidate triple (s, p, o), we assign its label as follows: if o ∈ O(s, p), we say the triple is correct; if o ∉ O(s, p), but |O(s, p)| > 0, we say the triple is incorrect (because we assume the KB is locally complete for this subject-predicate pair); if O(s, p) is empty, we do not label the triple, and we throw it out of our training / test set.

This heuristic is also used in previous works such as [15]. We empirically evaluate its adequacy in Section 6, by comparing to human-labeled data. There are more sophisticated methods for training models that don't make this assumption (such as [28, 36]), but we leave the integration of such methods into KV to future work.

3. FACT EXTRACTION FROM THE WEB

In this section, we summarize the extractors that we use to build KV, and then we evaluate their relative performance.

3.1 Extraction methods

3.1.1 Text documents (TXT)

We use relatively standard methods for relation extraction from text (see [16] for a recent overview), but we do so at a much larger scale than previous systems.

We first run a suite of standard NLP tools over each document. These perform named entity recognition, part of speech tagging, dependency parsing, co-reference resolution (within each document), and entity linkage (which maps mentions of proper nouns and their co-references to the corresponding entities in the KB). The in-house named entity linkage system we use is similar to the methods described in [18].

Next, we train relation extractors using distant supervision [29]. Specifically, for each predicate of interest, we extract a seed set of entity pairs that have this predicate, from an existing KB. For example, if the predicate is married_to, the pairs could be (BarackObama, MichelleObama) and (BillClinton, HillaryClinton). We then find examples of sentences in which this pair is mentioned, and extract features/patterns (either from the surface text or the dependency parse) from all of these sentences. The features that we use are similar to those described in [29].

In a bootstrapping phase, we look for more examples of sentences with these patterns occurring between pairs of entities of the correct type. We use the local closed world assumption to derive labels for the resulting set of extractions. Once we have a labeled training set, we fit a binary classifier (we use logistic regression) for each predicate independently in parallel, using a MapReduce framework. We have trained extractors for 4469 predicates, which is many more than previous machine reading systems.
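The LCWA labeling rule used above is mechanical enough to sketch in a few lines. The snippet below is a minimal illustration, not the production pipeline; the kb mapping from (subject, predicate) to the known object values O(s, p) is an assumed toy data structure.

from typing import Optional

def lcwa_label(kb: dict, s: str, p: str, o: str) -> Optional[bool]:
    """Label a candidate triple (s, p, o) under the local closed world assumption.

    kb maps (subject, predicate) pairs to the set of object values O(s, p)
    already known for that pair (e.g., taken from Freebase).
    Returns True / False, or None if the triple should be discarded.
    """
    known_objects = kb.get((s, p), set())
    if o in known_objects:
        return True       # triple already in the KB: positive label
    if known_objects:
        return False      # KB has other values for (s, p): assume locally complete
    return None           # O(s, p) is empty: leave the triple unlabeled

# Example with a toy KB:
kb = {("BarackObama", "children"): {"MaliaObama", "SashaObama"}}
print(lcwa_label(kb, "BarackObama", "children", "MaliaObama"))   # True
print(lcwa_label(kb, "BarackObama", "children", "JoeBiden"))     # False
print(lcwa_label(kb, "BarackObama", "favorite_color", "blue"))   # None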
3.1.2 HTML trees (DOM)
A somewhat different way to extract information from
Web pages is to parse their DOM trees. These can either
come from text pages, or from “deep web” sources, where
data are stored in underlying databases and queried by fill-
ing HTML forms; these sources together generate more than
1B pages of data in DOM tree format [7]. To extract triples
from DOM trees, we train classifiers as in the text case, ex-
cept that we derive features connecting two entities from the
DOM trees instead of from the text. Specifically, we use the
lexicalized path (along the tree) between the two entities as
a feature vector. The score of the extracted triples is the output of the classifier.

Figure 1: True probability vs. estimated probability for each triple in KV.
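To make the DOM-based features of Section 3.1.2 more concrete, here is a minimal sketch of deriving a tree path between two entity mentions; the node representation and the tag-only path are illustrative assumptions (a real lexicalized path would also include tokens such as anchor text), not the production feature set.

class DomNode:
    """Toy DOM node: a tag, optional text, and a parent pointer."""
    def __init__(self, tag, text="", parent=None):
        self.tag, self.text, self.parent = tag, text, parent

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain  # the node itself first, the root last

def dom_path(m1: DomNode, m2: DomNode) -> str:
    """Path feature between two entity mentions: climb from m1 to the lowest
    common ancestor, then descend to m2, recording the tags along the way."""
    a1, a2 = ancestors(m1), ancestors(m2)
    lca = next(n for n in a1 if n in a2)
    up = [n.tag for n in a1[:a1.index(lca) + 1]]
    down = [n.tag for n in reversed(a2[:a2.index(lca)])]
    return "/".join(up) + "\\" + "\\".join(down)

# Example: two <td> cells in the same table row.
tr = DomNode("tr")
td1 = DomNode("td", "Barack Obama", tr)
td2 = DomNode("td", "Michelle Obama", tr)
print(dom_path(td1, td2))   # td/tr\td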
Table 3: Some of the paths learned by PRA for predicting where someone went to college. Rules are sorted
by decreasing precision. Column headers: F1 is the harmonic mean of precision and recall, P is the precision,
R is the recall, W is the weight given to this feature by logistic regression.
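The paragraph below describes how paths such as those in Table 3 become features: each feature is the probability of reaching the object o from the subject s by a random walk that follows a fixed sequence of predicates. As a minimal sketch (assuming, for illustration only, that the KB graph is stored as an adjacency map and that walks use uniform transition probabilities):

from collections import defaultdict

def path_probability(graph, s, o, path):
    """Probability of reaching node o from node s by following the given
    sequence of predicates, choosing uniformly among matching edges at each step.

    graph: dict mapping (node, predicate) -> list of neighbor nodes.
    path:  tuple of predicates, e.g. ("parents", "place_of_birth").
    """
    probs = {s: 1.0}                       # distribution over current nodes
    for predicate in path:
        nxt = defaultdict(float)
        for node, p in probs.items():
            neighbors = graph.get((node, predicate), [])
            for nb in neighbors:
                nxt[nb] += p / len(neighbors)
        probs = nxt
    return probs.get(o, 0.0)

# Toy example: probability of reaching a parent's birthplace from Alice.
graph = {
    ("Alice", "parents"): ["Bob", "Carol"],
    ("Bob", "place_of_birth"): ["Springfield"],
    ("Carol", "place_of_birth"): ["Shelbyville"],
}
print(path_probability(graph, "Alice", "Springfield",
                       ("parents", "place_of_birth")))   # 0.5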
Since multiple rules or paths might apply for any given pair of entities, we can combine them by fitting a binary classifier (we use logistic regression). In PRA, the features are the probabilities of reaching O from S following different types of paths, and the labels are derived using the local closed world assumption. We can fit a classifier for each predicate independently in parallel. We have trained prior predictors for 4469 predicates using Freebase as training data. At test time, given a new (s, p, o) triple, we look up all the paths for predicate p chosen by the learned model, and perform a walk (on the training graph) from s to o via each such path; this gives us a feature value that can be plugged in to the classifier.

The overall AUC is 0.884, which is less than that of the fused extractor system (0.927), but is still surprisingly high.

4.2 Neural network model (MLP)

An alternative approach to building the prior model is to view the link prediction problem as matrix (or rather, tensor) completion. In particular, the original KB can be viewed as a very sparse E × P × E 3d matrix G, where E is the number of entities, P is the number of predicates, and G(s, p, o) = 1 if there is a link of type p from s to o, and G(s, p, o) = 0 otherwise. We can perform a low-rank decomposition of this tensor by associating a latent low dimensional vector to each entity and predicate, and then computing the elementwise inner product:

    Pr(G(s, p, o) = 1) = σ( Σ_{k=1}^{K} u_{sk} w_{pk} v_{ok} )    (1)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid or logistic function, and K ∼ 60 is the number of hidden dimensions. Here u_s, w_p and v_o are K-dimensional vectors, which embed the discrete tokens into a low dimensional "semantic" space. If we ignore the sigmoid transform (needed to produce binary responses), this is equivalent to the PARAFAC method of tensor decomposition [14, 5, 11].

A more powerful model was recently proposed in [37]; this associates a different tensor with each relation, and hence has the form

    Pr(G(s, p, o) = 1) = σ( β_p^T f( u_s^T W_p^{1:M} v_o ) )    (2)

where f() is a nonlinear function such as tanh, β_p is a K × 1 vector, and W_p^m is a K × K matrix. Unfortunately, this model requires O(KE + K^2 M P) parameters, where M is the number of "layers" in the tensor W.

In this paper, we considered a simpler approach where we associate one vector per predicate, as in Equation 1, but then use a standard multi-layer perceptron (MLP) to capture interaction terms. More precisely, our model has the form

    Pr(G(s, p, o) = 1) = σ( β^T f( A [u_s, w_p, v_o] ) )    (3)

where A is an L × (3K) matrix (the 3K term arises from the K-dimensional u_s, w_p and v_o) representing the first layer weights (after the embeddings), and β is an L × 1 vector representing the second layer weights. (We set L = K = 60.) This has only O(L + LK + KE + KP) parameters, but achieves essentially the same performance as the one in Equation 2 on their dataset.^6

^6 More precisely, [37] reported an 88.9% accuracy on the subset of Freebase data they have worked with (75,043 entities, 13 relations) when they replaced entities such as BarackObama by their constituting words Barack and Obama. Applying the same technique of replacing entities with constituting words, our simpler model got an accuracy of 89.1%.

Having established that our MLP model is comparable to the state of the art, we applied it to the KV data set. Surprisingly, we find that the neural model has about the same performance as PRA when evaluated using ROC curves (the AUC for the MLP model is 0.882, and for PRA is 0.884).

To illustrate that the neural network model learns a meaningful "semantic" representation of the entities and predicates, we can compute the nearest neighbors of various items in the K-dimensional space. It is known from previous work (e.g., [27]) that related entities cluster together in the space, so here we focus on predicates. The results are shown in Table 4. We see that the model learns to put semantically related (but not necessarily similar) predicates near each other. For example, we see that the closest predicates (in the w embedding space) to the 'children' predicate are 'parents', 'spouse' and 'birth-place'.

4.3 Fusing the priors

We can combine the different priors together using the fusion method described in Section 3.2. The only difference is the features that we use, since we no longer have any extractions. Instead, the feature vector contains the vector of confidence values from each prior system, plus indicator values specifying if the prior was able to predict or not. (This lets us distinguish a missing prediction from a prediction score of 0.0.) We train a boosted classifier using these signals, and calibrate it with Platt Scaling, as before. Fusing the two prior methods helps performance, since they have complementary strengths and weaknesses (different inductive biases): the AUC of the fused system is 0.911.
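As an illustration of the fusion step just described, the sketch below assembles the feature vector (one confidence value per prior, plus a was-a-prediction-made indicator per prior) and feeds it to a boosted classifier with sigmoid (Platt) calibration. The use of scikit-learn and the toy data here are purely illustrative assumptions; the paper does not specify its implementation.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

def fusion_features(prior_scores):
    """prior_scores: list of confidence values, one per prior system,
    with None when that prior made no prediction for the triple.
    Returns [score_1, ..., score_n, has_pred_1, ..., has_pred_n]."""
    scores = [s if s is not None else 0.0 for s in prior_scores]
    indicators = [1.0 if s is not None else 0.0 for s in prior_scores]
    return scores + indicators

# Toy training data: (PRA score, MLP score) per triple, with LCWA-derived labels.
raw = [
    (0.90, 0.80), (0.70, None), (0.95, 0.60), (0.80, 0.90),   # positives
    (0.10, 0.05), (None, 0.20), (0.30, None), (None, None),   # negatives
]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X = np.array([fusion_features(list(r)) for r in raw])

# Boosted classifier, wrapped with sigmoid (Platt) calibration.
fused = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=2)
fused.fit(X, y)
print(fused.predict_proba(X)[:, 1])   # calibrated probabilities per triple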
Table 4: Nearest neighbors (and distances) of some predicates in the embedding space learned by the MLP model.

Predicate   | Neighbor 1       | Neighbor 2      | Neighbor 3
children    | parents (0.4)    | spouse (0.5)    | birth-place (0.8)
birth-date  | children (1.24)  | gender (1.25)   | parents (1.29)
edu-end     | job-start (1.41) | edu-end (1.61)  | job-end (1.74)
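The neighbors in Table 4 come from the predicate embeddings w_p that are learned jointly with the scoring function in Equation 3. Below is a minimal numpy sketch of that forward pass and of the nearest-neighbor computation; the random initialization and the array names (U, W, V, A, beta) are placeholders, and training (fitting the parameters to the LCWA labels) is omitted.

import numpy as np

rng = np.random.default_rng(0)
E, P = 1000, 4469           # numbers of entities and predicates (illustrative sizes)
K = L = 60                  # embedding and hidden-layer dimensions, as in the paper

# Model parameters (randomly initialized here; in practice they are learned).
U = rng.normal(size=(E, K))       # subject embeddings u_s
W = rng.normal(size=(P, K))       # predicate embeddings w_p
V = rng.normal(size=(E, K))       # object embeddings v_o
A = rng.normal(size=(L, 3 * K))   # first-layer weights
beta = rng.normal(size=L)         # second-layer weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_score(s, p, o):
    """Pr(G(s, p, o) = 1) = sigmoid( beta^T tanh( A [u_s, w_p, v_o] ) ), as in Equation 3."""
    x = np.concatenate([U[s], W[p], V[o]])   # 3K-dimensional input
    hidden = np.tanh(A @ x)                  # L-dimensional hidden layer
    return sigmoid(beta @ hidden)

def predicate_neighbors(p, topn=3):
    """Nearest predicates to p by Euclidean distance in the w embedding space
    (the kind of computation behind Table 4)."""
    d = np.linalg.norm(W - W[p], axis=1)
    return np.argsort(d)[1:topn + 1]         # skip p itself (distance 0)

print(mlp_score(0, 1, 2))          # probability for an arbitrary triple
print(predicate_neighbors(1))      # indices of the 3 closest predicates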