
Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion


Xin Luna Dong∗, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao,
Kevin Murphy†, Thomas Strohmann, Shaohua Sun, Wei Zhang
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
{lunadong|gabr|geremy|wilko|nlao|kpmurphy|tstrohmann|sunsh|weizh}@google.com

∗ Authors are listed alphabetically.
† Corresponding author.

ABSTRACT

Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft's Satori, and Google's Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.

Keywords

Knowledge bases; information extraction; probabilistic models; machine learning

1. INTRODUCTION

"The acquisition of knowledge is always of use to the intellect, because it may thus drive out useless things and retain the good. For nothing can be loved or hated unless it is first known."
– Leonardo da Vinci

In recent years, several large-scale knowledge bases (KBs) have been constructed, including academic projects such as YAGO [39], NELL [8], DBpedia [3], and Elementary/DeepDive [32], as well as commercial projects, such as those by Microsoft¹, Google², Facebook³, Walmart [9], and others. (See Section 7 for a detailed discussion of related work.) These knowledge repositories store millions of facts about the world, such as information about people, places and things (generically referred to as entities).

Despite their seemingly large size, these repositories are still far from complete. For example, consider Freebase, the largest open-source knowledge base [4]: 71% of people in Freebase have no known place of birth, and 75% have no known nationality⁴. Furthermore, coverage for less common relations/predicates can be even lower.

¹ http://www.bing.com/blogs/site_blogs/b/search/archive/2013/03/21/satorii.aspx
² http://www.google.com/insidesearch/features/search/knowledge.html
³ http://www.insidefacebook.com/2013/01/14/facebook-builds-knowledge-graph-with-info-modules-on-community-pages/
⁴ Numbers current as of October 2013, cf. [28]. Freebase data is publicly available at https://developers.google.com/freebase/data.

Previous approaches for building knowledge bases primarily relied on direct contributions from human volunteers as well as integration of existing repositories of structured knowledge (e.g., Wikipedia infoboxes). However, these methods are more likely to yield head content, namely, frequently mentioned properties of frequently mentioned entities. Suh et al. [41] also observed that Wikipedia growth has essentially plateaued, hence unsolicited contributions from human volunteers may yield a limited amount of knowledge going forward. Therefore, we believe a new approach is necessary to further scale up knowledge base construction. Such an approach should automatically extract facts from the whole Web, to augment the knowledge we collect from human input and structured data sources. Unfortunately, standard methods for this task (cf. [44]) often produce very noisy, unreliable facts. To alleviate the amount of noise in the automatically extracted data, the new approach should automatically leverage already-cataloged knowledge to build prior models of fact correctness.

In this paper, we propose a new way of automatically constructing a Web-scale probabilistic knowledge base, which we call the Knowledge Vault, or KV for short. Like many other knowledge bases, KV stores information in the form
of RDF triples (subject, predicate, object). An example is <​/m/02mjmr, /people/person/place_of_birth, /m/02hrh0_>, where /m/02mjmr is the Freebase id for Barack Obama, and /m/02hrh0_ is the id for Honolulu. Associated with each such triple is a confidence score, representing the probability that KV "believes" the triple is correct.
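To make this data model concrete, the following is one minimal way such scored triples could be represented in code; the class and field names (and the confidence value) are illustrative, not KV's actual implementation.

```python
from collections import namedtuple

# A KV assertion: an RDF triple plus the system's belief that it is true.
# Entity ids follow Freebase's /m/... convention; confidence is a probability.
ScoredTriple = namedtuple("ScoredTriple", ["subject", "predicate", "obj", "confidence"])

obama_birthplace = ScoredTriple(
    subject="/m/02mjmr",                        # Barack Obama
    predicate="/people/person/place_of_birth",
    obj="/m/02hrh0_",                           # Honolulu
    confidence=0.99,                            # illustrative score
)
print(obama_birthplace)
```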
Entity types and predicates come from a fixed ontology, which is similar to that used in other systems, such as YAGO [39], NELL [8], DeepDive [32], and various systems participating in the TAC-KBP slot-filling competition [22]. Knowledge bases that use a fixed ontology should be contrasted with open information extraction (Open IE) approaches, such as Reverb [12], which work at the lexical level. Open IE systems usually have multiple redundant facts that are worded differently, such as <Barack Obama, was born in, Honolulu> and <Obama, place of birth, Honolulu>. In contrast, KV separates facts about the world from their lexical representation. This makes KV a structured repository of knowledge that is language independent.

The contributions of this paper are threefold. First, the Knowledge Vault is different from previous works on automatic knowledge base construction as it combines noisy extractions from the Web together with prior knowledge, which is derived from existing knowledge bases (in this paper, we use Freebase as our source of prior data). This approach is analogous to techniques used in speech recognition, which combine noisy acoustic signals with priors derived from a language model. KV's prior model can help overcome errors due to the extraction process, as well as errors in the sources themselves. For example, suppose an extractor returns a fact claiming that Barack Obama was born in Kenya, and suppose (for illustration purposes) that the true place of birth of Obama was not already known in Freebase. Our prior model can use related facts about Obama (such as his profession being US President) to infer that this new fact is unlikely to be true. The error could be due to mistaking Barack Obama for his father (entity resolution or co-reference resolution error), or it could be due to an erroneous statement on a spammy Web site (source error).

Second, KV is much bigger than other comparable KBs (see Table 1). In particular, KV has 1.6B triples, of which 324M have a confidence of 0.7 or higher, and 271M have a confidence of 0.9 or higher. This is about 38 times more than the largest previous comparable system (DeepDive [32]), which has 7M confident facts (Ce Zhang, personal communication). To create a knowledge base of such size, we extract facts from a large variety of sources of Web data, including free text, HTML DOM trees, HTML Web tables, and human annotations of Web pages. (Note that about 1/3 of the 271M confident triples were not previously in Freebase, so we are extracting new knowledge not contained in the prior.)

Third, we perform a detailed comparison of the quality and coverage of different extraction methods, as well as different prior methods. We also demonstrate the benefits of using multiple extraction sources and systems. Finally, we evaluate the validity of the closed world assumption, which is often used to automatically evaluate newly extracted facts given an existing knowledge base (see Section 6).

In the following sections, we describe the components of KV in more detail. We then study the performance of each part of the system in isolation and in combination, and show that fusion of multiple complementary systems and data sources considerably improves precision at a given recall level.

2. OVERVIEW

KV contains three major components:

• Extractors: these systems extract triples from a huge number of Web sources. Each extractor assigns a confidence score to an extracted triple, representing uncertainty about the identity of the relation and its corresponding arguments.

• Graph-based priors: these systems learn the prior probability of each possible triple, based on triples stored in an existing KB.

• Knowledge fusion: this system computes the probability of a triple being true, based on agreement between different extractors and priors.

Abstractly, we can view the KV problem as follows: we are trying to construct a weighted labeled graph, which we can view as a very sparse E × P × E 3d matrix G, where E is the number of entities, P is the number of predicates, and G(s, p, o) = 1 if there is a link of type p from s to o, and G(s, p, o) = 0 otherwise. We want to compute Pr(G(s, p, o) = 1 | ·) for candidate (s, p, o) triples, where the probability is conditional on different sources of information. When using extractions, we condition on text features about the triple. When using graph-based priors, we condition on known edges in the Freebase graph (obviously we exclude the edge we are trying to predict!). Finally, in knowledge fusion, we condition on both text extractions and prior edges.

We describe each of the three components in more detail in the following sections. Before that, we discuss our training and test procedure, which is common to all three approaches.

2.1 Evaluation protocol

Using the methods to be described in Section 3, we extract about 1.6B candidate triples, covering 4469 different types of relations and 1100 different types of entities. About 271M of these facts have an estimated probability of being true above 90%; we call these "confident facts". The resulting KB is much larger than other automatically constructed KBs, as summarized in Table 1.

To evaluate the quality of our methods, we randomly split this data into a training set (80% of the data) and a test set (20% of the data); we infer labels for these triples using the method described below. To ensure that certain common predicates (e.g., relating to geographical containment) did not dominate the performance measures, we took at most 10k instances of each predicate when creating the test set. We then pooled the samples from each predicate to get a more balanced test set.

If the test set contains the triple (s, p, o), then the training set is guaranteed not to contain the same triple. However, it may contain (s, p, o′) or (s, p′, o) or (s′, p, o). For example, suppose s is Barack Obama, p is father-of, and o is Sasha Obama. Then the training set may contain the fact that Barack is the father of Malia Obama, or that Barack lives in the same place as Sasha Obama, etc. In graph terminology, we are leaving out edges at random from the training set, and asking how well we can predict their presence or absence.
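A minimal sketch of this splitting procedure, assuming candidate triples are available as (s, p, o) tuples; the 80/20 ratio and the 10k per-predicate cap are from the text, while the function name and the use of Python's random module are illustrative.

```python
import random
from collections import defaultdict

def split_train_test(triples, test_frac=0.2, cap_per_predicate=10_000, seed=0):
    """Randomly hold out a test set, capping each predicate's share so that
    common predicates (e.g., geographical containment) do not dominate."""
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    target_test_size = int(test_frac * len(triples))
    taken = defaultdict(int)  # test instances taken per predicate
    train, test = [], []
    for s, p, o in triples:
        if len(test) < target_test_size and taken[p] < cap_per_predicate:
            test.append((s, p, o))
            taken[p] += 1
        else:
            train.append((s, p, o))
    return train, test
```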
| Name | # Entity types | # Entity instances | # Relation types | # Confident facts (relation instances) |
| Knowledge Vault (KV) | 1100 | 45M | 4469 | 271M |
| DeepDive [32] | 4 | 2.7M | 34 | 7M (a) |
| NELL [8] | 271 | 5.19M | 306 | 0.435M (b) |
| PROSPERA [30] | 11 | N/A | 14 | 0.1M |
| YAGO2 [19] | 350,000 | 9.8M | 100 | 4M (c) |
| Freebase [4] | 1,500 | 40M | 35,000 | 637M (d) |
| Knowledge Graph (KG) | 1,500 | 570M | 35,000 | 18,000M (e) |

Table 1: Comparison of knowledge bases. KV, DeepDive, NELL, and PROSPERA rely solely on extraction, Freebase and KG rely on human curation and structured sources, and YAGO2 uses both strategies. "Confident facts" means facts with a probability of being true at or above 0.9.

(a) Ce Zhang (U Wisconsin), private communication.
(b) Bryan Kiesel (CMU), private communication.
(c) Core facts, http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html
(d) This is the number of non-redundant base triples, excluding reverse predicates and "lazy" triples derived from flattening CVTs (complex value types).
(e) https://insidesearch.blogspot.com/2012/12/get-smarter-answers-from-knowledge_4.html

An alternative approach to constructing the test set would have been to leave out all edges emanating from a particular node. However, in such a case, the graph-based models would have no signal to leverage. For example, suppose we omitted all facts about Barack Obama, and asked the system to predict where he lives, and who his children are. This would be possible given text extractions, but impossible given just a prior graph of facts. A compromise would be to omit all edges of a given type; for example, we could omit connections to all his children, but leave in other relations. However, we think the random sampling scenario more accurately reflects our actual use-case, which consists of growing an existing KB, where arbitrary facts may be missing.

2.2 Local closed world assumption (LCWA)

All the components of our system use supervised machine learning methods to fit probabilistic binary classifiers, which can compute the probability of a triple being true. We give the details on how these classifiers are constructed in the following sections. Here, we describe how we determine the labels (we use the same procedure for the training and test set).

For (s, p, o) triples that are in Freebase, we assume the label is true. For triples that do not occur in Freebase, we could assume the label is false (corresponding to a closed world assumption), but this would be rather dangerous, since we know that Freebase is very incomplete. So instead, we make use of a somewhat more refined heuristic that we call the local closed world assumption.

To explain this heuristic, let us define O(s, p) as the set of existing object values for a given s and p. This set will be a singleton for functional (single-valued) predicates such as place of birth, but can have multiple values for general relations, such as children; of course, the set can also be empty. Now, given a candidate triple (s, p, o), we assign its label as follows: if o ∈ O(s, p), we say the triple is correct; if o ∉ O(s, p), but |O(s, p)| > 0, we say the triple is incorrect (because we assume the KB is locally complete for this subject-predicate pair); if O(s, p) is empty, we do not label the triple, and we throw it out of our training/test set.

This heuristic is also used in previous works such as [15]. We empirically evaluate its adequacy in Section 6, by comparing to human-labeled data. There are more sophisticated methods for training models that don't make this assumption (such as [28, 36]), but we leave the integration of such methods into KV to future work.
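The labeling rule is compact enough to state directly in code. A minimal sketch, assuming the existing KB is available as a mapping from (s, p) pairs to the set O(s, p) of known object values; all names are illustrative.

```python
def lcwa_label(kb, s, p, o):
    """Label a candidate triple under the local closed world assumption.
    Returns True/False, or None when the triple should be discarded."""
    known = kb.get((s, p), set())  # O(s, p)
    if not known:
        return None      # O(s, p) is empty: no local evidence, drop the example
    return o in known    # known value -> true; any other value -> locally false

# Freebase knows Obama's place of birth, so any other value is labeled false.
kb = {("/m/02mjmr", "/people/person/place_of_birth"): {"/m/02hrh0_"}}
assert lcwa_label(kb, "/m/02mjmr", "/people/person/place_of_birth", "/m/02hrh0_") is True
assert lcwa_label(kb, "/m/02mjmr", "/people/person/place_of_birth", "/m/other") is False
assert lcwa_label(kb, "/m/02mjmr", "/people/person/children", "/m/anyone") is None
```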
3. FACT EXTRACTION FROM THE WEB

In this section, we summarize the extractors that we use to build KV, and then we evaluate their relative performance.

3.1 Extraction methods

3.1.1 Text documents (TXT)

We use relatively standard methods for relation extraction from text (see [16] for a recent overview), but we do so at a much larger scale than previous systems.

We first run a suite of standard NLP tools over each document. These perform named entity recognition, part of speech tagging, dependency parsing, co-reference resolution (within each document), and entity linkage (which maps mentions of proper nouns and their co-references to the corresponding entities in the KB). The in-house named entity linkage system we use is similar to the methods described in [18].

Next, we train relation extractors using distant supervision [29]. Specifically, for each predicate of interest, we extract a seed set of entity pairs that have this predicate from an existing KB. For example, if the predicate is married_to, the pairs could be (BarackObama, MichelleObama) and (BillClinton, HillaryClinton). We then find examples of sentences in which this pair is mentioned, and extract features/patterns (either from the surface text or the dependency parse) from all of these sentences. The features that we use are similar to those described in [29].

In a bootstrapping phase, we look for more examples of sentences with these patterns occurring between pairs of entities of the correct type. We use the local closed world assumption to derive labels for the resulting set of extractions. Once we have a labeled training set, we fit a binary classifier (we use logistic regression) for each predicate independently in parallel, using a MapReduce framework. We have trained extractors for 4469 predicates, which is much more than previous machine reading systems.
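The per-predicate training step can be sketched as follows, assuming sentence-level evidence has already been aggregated into one feature dictionary per entity pair; the toy features and the scikit-learn classifier stand in for the paper's in-house extractors and MapReduce pipeline.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_predicate_extractor(pair_features, labels):
    """Fit one binary relation extractor for a single predicate.
    pair_features: one dict of pattern counts per candidate entity pair,
    aggregated over all sentences mentioning that pair.
    labels: LCWA-derived booleans for each pair."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(pair_features)
    model = LogisticRegression().fit(X, labels)
    return vectorizer, model

# Toy illustration for married_to (the pattern names are invented):
pair_features = [
    {"dep:nsubj<-married->to": 2, "surf:X and his wife Y": 1},
    {"dep:nsubj<-married->to": 1},
    {"surf:X met Y": 3},
    {"surf:X defeated Y": 2},
]
labels = [True, True, False, False]
vectorizer, model = train_predicate_extractor(pair_features, labels)
```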
3.1.2 HTML trees (DOM)

A somewhat different way to extract information from Web pages is to parse their DOM trees. These can either come from text pages, or from "deep web" sources, where data are stored in underlying databases and queried by filling HTML forms; these sources together generate more than 1B pages of data in DOM tree format [7]. To extract triples from DOM trees, we train classifiers as in the text case, except that we derive features connecting two entities from the DOM trees instead of from the text. Specifically, we use the lexicalized path (along the tree) between the two entities as a feature vector. The score of the extracted triples is the output of the classifier.

3.1.3 HTML tables (TBL)

There are over 570M tables on the Web that contain relational information (as opposed to just being used for visual formatting) [6]. Unfortunately, fact extraction techniques developed for text and trees do not work very well for tables, because the relation between two entities is usually contained in the column header, rather than being close by in the text/tree. Instead, we use the following heuristic technique. First, we perform named entity linkage, as in the text case. Then we attempt to identify the relation that is expressed in each column of the table by looking at the entities in each column, and reasoning about which predicate each column could correspond to, by matching to Freebase, as in standard schema matching methods [42]. Ambiguous columns are discarded. The score of the extracted triple reflects the confidence returned by the named entity linkage system.

3.1.4 Human Annotated pages (ANO)

There are a large number of webpages where the webmaster has added manual annotations following ontologies from schema.org, microformats.org, openGraphProtocol.org, etc. In this paper, we use schema.org annotations. Many of these annotations are related to events or products, etc. Such information is not currently stored in the knowledge vault. So instead, in this paper we focus on a small subset of 14 different predicates, mostly related to people. We define a manual mapping from schema.org to the Freebase schema for these different predicates. The score of the extracted triple reflects the confidence returned by the named entity linkage system (the same one we use for TXT triples).

3.2 Fusing the extractors

We have described 4 different fact extraction methods. A simple way to combine these signals is to construct a feature vector f(t) for each extracted triple t = (s, p, o), and then to apply a binary classifier to compute Pr(t = 1 | f(t)). For simplicity and speed, we fit a separate classifier for each predicate.

The feature vector is composed of two numbers for each extractor: the square root⁵ of the number of sources that the extractor extracted this triple from, and the mean score of the extractions from this extractor, averaging over sources (or 0 if the system did not produce this triple).

⁵ The motivation for using √n, where n is the number of sources, is to reduce the effect of very commonly expressed facts (such as the birth place of Barack Obama). Results are similar if we use log(1 + n). Note that we perform de-duplication of sources before running the extraction pipelines.

The classifier learns a different weight for each component of this feature vector, and hence can learn the relative reliabilities of each system. In addition, since we fit a separate classifier per predicate, we can model their different reliabilities, too.

The labels for training the fusion system come from applying the local closed world assumption to the training set. Since this is a very low-dimensional classification problem, we initially used a linear logistic regression model. However, we observed considerably better performance by using boosted decision stumps [35]. This kind of classifier can learn to quantize the features into bins, and thus learn a non-linear decision boundary.
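A sketch of the per-triple feature construction, assuming the per-extractor confidence scores (one per de-duplicated source) have been gathered; the extractor names are from Section 3.1, everything else is illustrative. The boosted-decision-stump classifier itself is not shown, since the paper does not name a specific implementation.

```python
import math

EXTRACTORS = ["TXT", "DOM", "TBL", "ANO"]

def fusion_features(scores_by_extractor):
    """Two numbers per extractor: sqrt of the number of (de-duplicated)
    sources that yielded the triple, and the mean confidence score."""
    features = []
    for name in EXTRACTORS:
        scores = scores_by_extractor.get(name, [])
        features.append(math.sqrt(len(scores)))  # dampened source count
        features.append(sum(scores) / len(scores) if scores else 0.0)
    return features

# A triple extracted by TXT from 1 source and by DOM from 9 sources:
print(fusion_features({"TXT": [0.4], "DOM": [0.8] * 9}))
# -> [1.0, 0.4, 3.0, 0.8, 0.0, 0.0, 0.0, 0.0]
```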
3.3 Calibration of the probability estimates

The confidence scores from each extractor (and/or the fused system) are not necessarily on the same scale, and cannot necessarily be interpreted as probabilities. To alleviate this problem, we adopt the standard technique known as Platt Scaling (named after [33]), which consists of fitting a logistic regression model to the scores, using a separate validation set. Figure 1 shows that our (fused) probability estimates are well-calibrated, in the following sense: if we collect all the triples that have a predicted probability of 0.9, then we find that about 90% of them are indeed true. Each individual extractor is also calibrated (results not shown).

Figure 1: True probability vs. estimated probability for each triple in KV.
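Platt Scaling itself amounts to a one-dimensional logistic regression from raw scores to labels. A sketch using scikit-learn, assuming a held-out validation set of raw scores and LCWA labels; the toy data is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_scaler(raw_scores, labels):
    """Fit p(true | score) = sigmoid(a * score + b) on a validation set."""
    model = LogisticRegression()
    model.fit(np.asarray(raw_scores).reshape(-1, 1), labels)
    return lambda scores: model.predict_proba(
        np.asarray(scores).reshape(-1, 1))[:, 1]

# Toy validation data: higher raw scores are more often true.
val_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
val_labels = [0, 0, 1, 0, 1, 1]
calibrate = fit_platt_scaler(val_scores, val_labels)
print(calibrate([0.25, 0.8]))  # calibrated probabilities
```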
3.4 Comparison of the methods

Using the four extractors described earlier applied to a very large web corpus, we extract about 1.6B triples. Table 2 shows the number of triples from each system. We see that the DOM system extracts the largest number of triples overall (about 1.2B), of which about 94M (or 8%) are high confidence (with a probability of being true at or above 0.9; see the penultimate column of Table 2). The TBL system extracts the smallest number of triples overall (about 9.4M). One reason for this is that very few columns in webtables (only 18% according to [17]) map to a corresponding Freebase predicate. The ANO and TXT systems both produce hundreds of millions of triples.

| System | # triples | # > 0.7 | # > 0.9 | Frac. > 0.9 | AUC |
| TBL | 9.4M | 3.8M | 0.59M | 0.06 | 0.856 |
| ANO | 140M | 2.4M | 0.25M | 0.002 | 0.920 |
| TXT | 330M | 20M | 7.1M | 0.02 | 0.867 |
| DOM | 1200M | 150M | 94M | 0.08 | 0.928 |
| FUSED-EX | 1600M | 160M | 100M | 0.06 | 0.927 |

Table 2: Performance of different extraction systems.

In addition to measuring the number of triples at different confidence levels, it is interesting to consider the area under the ROC curve (AUC score). This score is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. We computed the AUC scores for the different extraction methods on different test sets, namely, only using the extractions produced by each system (obviously, the test sets were distinct from the training sets). The test set for computing the AUC score for the fused extractors was the union of all the test sets of the individual systems.

We see that the DOM system has the highest AUC score, so although it produces a large number of low confidence triples, the system "knows" that these are likely to be false. The table also illustrates the benefits of fusing multiple extractors: we get about 7% more high confidence triples, while maintaining a high AUC score (see the last row of the table). Not surprisingly, however, the performance of the fusion system is dominated by that of the DOM system. In Section 5, we shall show much greater gains from fusion, when we combine graph priors with extractors.

3.5 The beneficial effects of adding more evidence

Figure 2 shows how the overall predicted probability of each triple changes as more systems extract it. When no systems extract a given triple, we rely on our prior model (described in Section 4); averaging over all the triples, we see that the prior probability for the true triples is about 0.5, whereas the prior probability for the false triples is close to 0. As we accumulate more evidence in favor of the triple, our belief in its correctness increases to near 1.0 for the true triples; for the false triples, our belief also increases, although it stays well below 0.5.

Figure 2: Predicted probability of each triple vs. the number of systems that predicted it. Solid blue line: correct (true) triples. Dotted red line: incorrect (false) triples.

Figure 3 shows how the probability of a triple increases with the number of unique web sources where the triple is seen. Again, our final belief in true triples is much higher than in false triples. To prevent over-counting of evidence, we only count each triple once per domain, as opposed to once per URL; for example, if we extract a triple asserting that Barack Obama was born in Kenya from myblogger.com/page1 and myblogger.com/page2, we only count this once.

Figure 3: Predicted probability of each triple vs. the number of unique web sources that contain this triple (axis truncated at 50 for clarity).
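The per-domain counting can be sketched in a few lines, assuming each extraction carries its source URL; using urlparse's netloc as the domain is a simplification (e.g., it does not merge subdomains).

```python
from urllib.parse import urlparse

def count_supporting_domains(source_urls):
    """Count each web domain once, so that myblogger.com/page1 and
    myblogger.com/page2 together contribute a single piece of evidence."""
    return len({urlparse(url).netloc for url in source_urls})

urls = ["http://myblogger.com/page1",
        "http://myblogger.com/page2",
        "http://example.org/bio"]
print(count_supporting_domains(urls))  # -> 2
```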
4. GRAPH-BASED PRIORS

As mentioned in the introduction, facts extracted from the Web can be unreliable. A good way to combat this is to use prior knowledge, derived from other kinds of data. In this paper, we exploit existing triples in Freebase to fit prior models, which can assign a probability to any possible triple, even if there is no corresponding evidence for this fact on the Web (cf. [2]). This can be thought of as link prediction in a graph. That is, we observe a set of existing edges (representing predicates that connect different entities), and we want to predict which other edges are likely to exist. We have tried two different approaches to solving this problem, which we describe below.

4.1 Path ranking algorithm (PRA)

One way to perform link prediction is to use the path ranking algorithm of [24]. Similar to distant supervision, we start with a set of pairs of entities that are connected by some predicate p. PRA then performs a random walk on the graph, starting at all the subject (source) nodes. Paths that reach the object (target) nodes are considered successful. For example, the algorithm learns that pairs (X, Y) which are connected by a marriedTo edge often also have a path of the form X --parentOf--> Z <--parentOf-- Y, since if two people share a common child, they are likely to be married. The quality of these paths can be measured in terms of their support and precision, as in association rule mining (cf. [15]).

The paths that PRA learns can be interpreted as rules. For example, consider the task of predicting where someone went to college. The algorithm discovers several useful rules, shown in Table 3. In English, the first rule says: a person X is likely to have attended school S if X was drafted from sports team T, and T is from school S. The second rule says: a person is likely to attend the same school as their sibling.
| F1 | P | R | W | Path |
| 0.03 | 1.00 | 0.01 | 2.62 | /sports/drafted-athlete/drafted, /sports/sports-league-draft-pick/school |
| 0.05 | 0.55 | 0.02 | 1.88 | /people/person/sibling-s, /people/sibling-relationship/sibling, /people/person/education, /education/education/institution |
| 0.06 | 0.41 | 0.02 | 1.87 | /people/person/spouse-s, /people/marriage/spouse, /people/person/education, /education/education/institution |
| 0.04 | 0.29 | 0.02 | 1.37 | /people/person/parents, /people/person/education, /education/education/institution |
| 0.05 | 0.21 | 0.02 | 1.85 | /people/person/children, /people/person/education, /education/education/institution |
| 0.13 | 0.10 | 0.38 | 6.4 | /people/person/place-of-birth, /location/location/people-born-here, /people/person/education, /education/education/institution |
| 0.05 | 0.04 | 0.34 | 1.74 | /type/object/type, /type/type/instance, /people/person/education, /education/education/institution |
| 0.04 | 0.03 | 0.33 | 2.19 | /people/person/profession, /people/profession/people-with-this-profession, /people/person/education, /education/education/institution |

Table 3: Some of the paths learned by PRA for predicting where someone went to college. Rules are sorted by decreasing precision. Column headers: F1 is the harmonic mean of precision and recall, P is the precision, R is the recall, W is the weight given to this feature by logistic regression.

Since multiple rules or paths might apply for any given pair of entities, we can combine them by fitting a binary classifier (we use logistic regression). In PRA, the features are the probabilities of reaching O from S following different types of paths, and the labels are derived using the local closed world assumption. We can fit a classifier for each predicate independently in parallel. We have trained prior predictors for 4469 predicates using Freebase as training data. At test time, given a new (s, p, o) triple, we look up all the paths for predicate p chosen by the learned model, and perform a walk (on the training graph) from s to o via each such path; this gives us a feature value that can be plugged in to the classifier.

The overall AUC is 0.884, which is less than that of the fused extractor system (0.927), but is still surprisingly high.
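The path features can be sketched as follows, assuming the training graph is an adjacency map and that the walk chooses uniformly among outgoing edges; this is a simplification of the PRA implementation in [24].

```python
from collections import defaultdict

def path_feature(graph, source, target, path):
    """Probability of reaching `target` from `source` by following the given
    sequence of predicates, choosing uniformly among outgoing edges."""
    reach = {source: 1.0}
    for predicate in path:
        nxt = defaultdict(float)
        for node, prob in reach.items():
            neighbors = graph.get((node, predicate), [])
            for n in neighbors:
                nxt[n] += prob / len(neighbors)
        reach = nxt
    return reach.get(target, 0.0)

# Toy graph for X --parentOf--> Z <--parentOf-- Y, with an inverse edge:
graph = {
    ("X", "parentOf"): ["Z"],
    ("Z", "parentOf_inv"): ["X", "Y"],
}
print(path_feature(graph, "X", "Y", ["parentOf", "parentOf_inv"]))  # -> 0.5
```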
4.2 Neural network model (MLP)

An alternative approach to building the prior model is to view the link prediction problem as matrix (or rather, tensor) completion. In particular, the original KB can be viewed as a very sparse E × P × E 3d matrix G, where E is the number of entities, P is the number of predicates, and G(s, p, o) = 1 if there is a link of type p from s to o, and G(s, p, o) = 0 otherwise. We can perform a low-rank decomposition of this tensor by associating a latent low dimensional vector to each entity and predicate, and then computing the elementwise inner product:

\Pr(G(s, p, o) = 1) = \sigma\Big( \sum_{k=1}^{K} u_{sk} \, w_{pk} \, v_{ok} \Big) \qquad (1)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid or logistic function, and K ≈ 60 is the number of hidden dimensions. Here u_s, w_p and v_o are K-dimensional vectors, which embed the discrete tokens into a low dimensional "semantic" space. If we ignore the sigmoid transform (needed to produce binary responses), this is equivalent to the PARAFAC method of tensor decomposition [14, 5, 11].

A more powerful model was recently proposed in [37]; this associates a different tensor with each relation, and hence has the form

\Pr(G(s, p, o) = 1) = \sigma\Big( \beta_p^T f\big( u_s^T W_p^{1:M} v_o \big) \Big) \qquad (2)

where f() is a nonlinear function such as tanh, β_p is a K × 1 vector, and W_p^m is a K × K matrix. Unfortunately, this model requires O(KE + K²MP) parameters, where M is the number of "layers" in the tensor W.

In this paper, we considered a simpler approach where we associate one vector per predicate, as in Equation 1, but then use a standard multi-layer perceptron (MLP) to capture interaction terms. More precisely, our model has the form

\Pr(G(s, p, o) = 1) = \sigma\Big( \beta^T f\big( A \, [u_s, w_p, v_o] \big) \Big) \qquad (3)

where A is an L × (3K) matrix (where the 3K term arises from the K-dimensional u_s, w_p and v_o) representing the first layer weights (after the embeddings), and β is an L × 1 vector representing the second layer weights. (We set L = K = 60.) This has only O(L + LK + KE + KP) parameters, but achieves essentially the same performance as the one in Equation 2 on their dataset.⁶

⁶ More precisely, [37] reported an 88.9% accuracy on the subset of Freebase data they have worked with (75,043 entities, 13 relations) when they replaced entities such as BarackObama by their constituting words Barack and Obama. Applying the same technique of replacing entities with constituting words, our simpler model got an accuracy of 89.1%.

Having established that our MLP model is comparable to the state of the art, we applied it to the KV data set. Surprisingly, we find that the neural model has about the same performance as PRA when evaluated using ROC curves (the AUC for the MLP model is 0.882, and for PRA is 0.884).

To illustrate that the neural network model learns a meaningful "semantic" representation of the entities and predicates, we can compute the nearest neighbors of various items in the K-dimensional space. It is known from previous work (e.g., [27]) that related entities cluster together in the space, so here we focus on predicates. The results are shown in Table 4. We see that the model learns to put semantically related (but not necessarily similar) predicates near each other. For example, we see that the closest predicates (in the w embedding space) to the 'children' predicate are 'parents', 'spouse' and 'birth-place'.
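Equation 3 transcribes directly into NumPy. The sketch below uses randomly initialized (untrained) embeddings for a toy vocabulary, with f = tanh and L = K = 60 as in the text; in the real system these parameters would be learned from Freebase triples.

```python
import numpy as np

K = L = 60
E = P = 3  # toy numbers of entities and predicates
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(E, K))      # subject embeddings u_s
W = rng.normal(scale=0.1, size=(P, K))      # predicate embeddings w_p
V = rng.normal(scale=0.1, size=(E, K))      # object embeddings v_o
A = rng.normal(scale=0.1, size=(L, 3 * K))  # first-layer weights
beta = rng.normal(scale=0.1, size=L)        # second-layer weights

def prior_prob(s, p, o):
    """Equation 3: sigmoid(beta^T tanh(A [u_s; w_p; v_o]))."""
    x = np.concatenate([U[s], W[p], V[o]])  # (3K,) stacked embeddings
    hidden = np.tanh(A @ x)                 # (L,) hidden layer
    return 1.0 / (1.0 + np.exp(-(beta @ hidden)))

print(prior_prob(0, 1, 2))  # an (untrained) prior probability
```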
4.3 Fusing the priors

We can combine the different priors together using the fusion method described in Section 3.2. The only difference is the features that we use, since we no longer have any extractions. Instead, the feature vector contains the vector of confidence values from each prior system, plus indicator values specifying if the prior was able to predict or not. (This lets us distinguish a missing prediction from a prediction score of 0.0.) We train a boosted classifier using these signals, and calibrate it with Platt Scaling, as before. Fusing the two prior methods helps performance, since they have complementary strengths and weaknesses (different inductive biases): the AUC of the fused system is 0.911.
| Predicate | Neighbor 1 | Neighbor 2 | Neighbor 3 |
| children | parents (0.4) | spouse (0.5) | birth-place (0.8) |
| birth-date | children (1.24) | gender (1.25) | parents (1.29) |
| edu-end | job-start (1.41) | edu-start (1.61) | job-end (1.74) |

Table 4: Nearest neighbors for some predicates in the 60d embedding space learned by the neural network. Numbers represent squared Euclidean distance. Edu-start and edu-end represent the start and end dates of someone attending a school or college. Similarly, job-start and job-end represent the start and end dates of someone holding a particular job.

5. FUSING EXTRACTORS AND PRIORS

We have described several different fact extraction methods, and several different priors. We can combine all these systems together using the fusion method described in Section 3.2. Figure 4 shows the benefits of fusion quantitatively. We see that combining prior and extractor together results in a significant boost in performance.

Figure 4: ROC curves for the fused extractor, fused prior, and fused prior + extractor. The numbers in the legend are the AUC scores.

To more clearly illustrate the effect of adding the prior, Figure 5 plots the number of triples in each confidence bin for the (fused) extractor, the (fused) prior, and the overall system. We see that compared with considering only extractions, combining priors and extractors increases the number of high confidence facts (those with a probability greater than 0.9) from about 100M to about 271M. Of these, about 33% are new facts that were not yet in Freebase.

Figure 5: Number of triples in KV in each confidence bin.

Figure 5 illustrates another interesting effect: when we combine prior and extractor, the number of triples about which we are uncertain (i.e., the predicted probability falling in the range of [.3, .7]) has gone down; some of these triples we now believe to be true (as we discussed previously), but many we now believe to be false. This is a visual illustration that the prior can reduce the false positive rate.

We now give a qualitative example of the benefits of combining the prior with the extractor. The extraction pipeline extracted the following triple:⁷

<Barry Richter (/m/02ql38b), /people/person/edu./edu/edu/institution, University of Wisconsin-Madison (/m/01yx1b)>

The (fused) extraction confidence for this triple was just 0.14, since it was based on the following two rather indirect statements:⁸

"In the fall of 1989, Richter accepted a scholarship to the University of Wisconsin, where he played for four years and earned numerous individual accolades..."

"The Polar Caps' cause has been helped by the impact of knowledgable coaches such as Andringa, Byce and former UW teammates Chris Tancill and Barry Richter."

However, we know from Freebase that Barry Richter was born and raised in Madison, WI. This increases our prior belief that he went to school there, resulting in a final fused belief of 0.61.

⁷ Here the predicate is a conjunction of two primitive predicates, /people/person/education and /education/education/institution, obtained by passing through a complex value type (CVT) node, aka an anonymous or "blank" node, representing the temporal event that Barry attended Madison.
⁸ Sources: http://www.legendsofhockey.net/LegendsOfHockey/jsp/SearchPlayer.jsp?player=11377 and http://host.madison.com/sports/high-school/hockey/numbers-dwindling-for-once-mighty-madison-high-school-hockey-programs/article_95843e00-ec34-11df-9da9-001cc4c002e0.html

6. EVALUATING LCWA

So far, we have been relying on the local closed world assumption (LCWA) to train and test our system. However, we know that this is just an approximation to the truth. For example, Freebase often lists the top 5 or so actors for any given movie, but it is unreasonable to assume that this list is complete (since most movies have a cast of 10–20 actors); this can result in false negatives (if our system predicts the name of an actor that is not on the list). Conversely (but less frequently), Freebase can contain errors, which can result in false positives.

To assess the severity of this problem, we manually labeled a subset of our balanced test set, using an in-house team of raters. This subset consisted of 1000 triples for 10
different predicates. We asked each rater to evaluate each such triple, and to determine (based on their own research, which can include web searches, looking at Wikipedia, etc.) whether each triple is true or false or unknown; we discarded the 305 triples with unknown labels.

We then computed the performance of our systems on this test set, using both the LCWA labels and the human labels. In both cases, the system was trained on our full training set (i.e., 80% of the 1.6B triples) using LCWA labels. The results are shown in Table 5. We see that the performance on the human-labeled data is lower, although not by that much, indirectly justifying our use of the LCWA.

| Labels | Prior | Extractor | Prior+extractor |
| LCWA | 0.943 | 0.872 | 0.959 |
| Human | 0.843 | 0.852 | 0.869 |

Table 5: AUC scores for the fused prior, extractor and prior+extractor using different labels on the 10k test set.

7. RELATED WORK

There is a growing body of work on automatic knowledge base construction [44, 1]. This literature can be clustered into 4 main groups: (1) approaches such as YAGO [39], YAGO2 [19], DBpedia [3], and Freebase [4], which are built on Wikipedia infoboxes and other structured data sources; (2) approaches such as Reverb [12], OLLIE [26], and PRISMATIC [13], which use open information (schema-less) extraction techniques applied to the entire web; (3) approaches such as NELL/ReadTheWeb [8], PROSPERA [30], and DeepDive/Elementary [32], which extract information from the entire web, but use a fixed ontology/schema; and (4) approaches such as Probase [47], which construct taxonomies (is-a hierarchies), as opposed to general KBs with multiple types of predicates.

The Knowledge Vault is most similar to methods of the third kind, which extract facts, in the form of disambiguated triples, from the entire web. The main difference from this prior work is that we fuse together facts extracted from text with prior knowledge derived from the Freebase graph.

There is also a large body of work on link prediction in graphs. This can be thought of as creating a joint probability model over a large set of binary random variables, where G(s, p, o) = 1 if and only if there is a link of type p from s to o. The literature can be clustered into three main kinds of methods: (1) methods that directly model the correlation between the variables, using discrete Markov random fields (e.g., [23]) or continuous relaxations thereof (e.g., [34]); (2) methods that use latent variables to model the correlations indirectly, using either discrete factors (e.g., [48]) or continuous factors (e.g., [31, 11, 20, 37]); and (3) methods that approximate the correlation using algorithmic approaches, such as random walks [24].

In the knowledge vault, we currently employ graph priors of the second and third kind. In particular, our neural tensor model is a continuous latent variable model, which is similar to, but slightly different from, [37] (see Section 4.2 for a discussion). Our PRA model is similar to the method described in [24], except it is trained on Freebase instead of on NELL. In addition, it uses a more scalable implementation.

Another related literature is on the topic of probabilistic databases (see e.g., [40, 43]). KV is a probabilistic database, and it can support simple queries, such as "BarackObama BornIn ?", which returns a distribution over places where KV thinks Obama was born. However, we do not yet support sophisticated queries, such as JOIN or SELECT.

Finally, there is a small set of papers on representing uncertainty in information extraction systems (see e.g., [45, 25]). KV also represents uncertainty in the facts it has extracted. Indeed, we show that its uncertainty estimates are well-calibrated. We also show how they change as a function of the amount of evidence (see Figure 2).

8. DISCUSSION

Although Knowledge Vault is a large repository of useful knowledge, there are still many ways in which it can be improved. We discuss some of these issues below.

Modeling mutual exclusion between facts. Currently (for reasons of scalability) we treat each fact as an independent binary random variable, that is either true or false. However, in reality, many triples are correlated. For example, for a functional relation such as born-in, we know there can only be one true value, so the (s, p, o_i) triples representing different values o_i for the same subject s and predicate p become correlated due to the mutual exclusion constraint. A simple way to handle this is to collect together all candidate values, and to force the distribution over them to sum to 1 (possibly allowing for some "extra" probability mass to account for the fact that the true value might not be amongst the extracted set of candidates). This is similar to the notion of an X-tuple in probabilistic databases [40]. Preliminary experiments of this kind did not work very well, since the different o_i often represent the same entity at different levels of granularity. For example, we might have a fact that Obama was born in Honolulu, and another one stating he was born in Hawaii. These are not mutually exclusive, so the naive approach does not work. We are currently investigating more sophisticated methods.
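The simple renormalization described above can be sketched as follows; the reserved "extra" mass for an unseen true value is an illustrative parameter, not a number from the paper.

```python
def normalize_candidates(beliefs, unseen_mass=0.1):
    """beliefs: unnormalized scores for candidate object values of a
    functional predicate (one subject, one predicate). Forces the result to
    sum to 1 - unseen_mass, reserving the rest for an unextracted true value."""
    total = sum(beliefs.values())
    if total == 0:
        return dict.fromkeys(beliefs, 0.0)
    return {o: (1.0 - unseen_mass) * b / total for o, b in beliefs.items()}

# Two competing birthplace candidates for one subject:
print(normalize_candidates({"Honolulu": 0.8, "Kenya": 0.4}))
# -> {'Honolulu': 0.6, 'Kenya': 0.3}
```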
Modeling soft correlation between facts. For some kinds of relations, there will be soft constraints on their values. For example, we know that people usually have between 0 and 5 children; there is of course a long tail to this distribution, but it would still be surprising (and indicative of a potential error) if we extracted 100 different children for one person. Similarly, we expect the date of birth of a person to be about 15 to 50 years earlier than the date of birth of their child. Preliminary experiments using joint Gaussian models to represent correlations amongst numerical values show some promise, but we still need to fully integrate this kind of joint prior into KV.

Values can be represented at multiple levels of abstraction. We can represent the world at different levels of granularity. For example, we can say that Obama is born in Honolulu, or in Hawaii, or in the USA. When matching extracted facts with those stored in Freebase, we use prior geographical knowledge to reason about compatibility. For example, if we extract that Obama was born in Hawaii, and we already know he was born in Honolulu, we consider this a correct extraction. In the future, we would like to generalize this approach to other kinds of values. For example, if we extract that Obama's profession is politician, and we already know his profession is president, we should regard the extracted fact as true, since it is implied by what we
already know.

Dealing with correlated sources. In Figure 3, we showed how our belief in a triple increased as we saw it extracted from more sources. This is of course problematic if we have duplicated or correlated sources. Currently we have a very simple solution to this, based on counting each domain only once. In the future, we plan to deploy more sophisticated copy detection mechanisms, such as those in [10].

Some facts are only temporarily true. In some cases, the "truth" about a fact can change. For example, Google's current CEO is Larry Page, but from 2001 to 2011 it was Eric Schmidt. Both facts are correct, but only during the specified time interval. For this reason, Freebase allows some facts to be annotated with beginning and end dates, by use of the CVT (compound value type) construct, which represents n-ary relations via auxiliary nodes. (An alternative approach is to reify the pairwise relations, and add extra assertions to them, as in the YAGO2 system [19].) In the future, we plan to extend KV to model such temporal facts. However, this is non-trivial, since the duration of a fact is not necessarily related to the timestamp of the corresponding source (cf. [21]).

Adding new entities and relations. In addition to missing facts, there are many entities that are mentioned on the Web but are not in Freebase, and hence not in KV either. In order to represent such information, we need to automatically create new entities (cf. [46]); this is work in progress. Furthermore, there are many relations that are mentioned on the Web but cannot be represented in the Freebase schema. To capture such facts, we need to extend the schema, but we need to do so in a controlled way, to avoid the problems faced by open IE systems, which have many redundant and synonymous relations. See [17] for one possible approach to this problem.

Knowledge representation issues. The RDF triple format seems adequate for representing factual assertions (assuming a suitably rich schema), but it might be less appropriate for other kinds of knowledge (e.g., representing the difference between running and jogging, or between jazz music and blues). There will always be a long tail of concepts that are difficult to capture in any fixed ontology. Our neural network is one possible way to provide semantically plausible generalizations, but extending it to represent richer forms of knowledge is left to future work.

Inherent upper bounds on the potential amount of knowledge that we can extract. The goal of KV is to become a large-scale repository of all of human knowledge. However, even if we had a perfect machine reading system, not all of human knowledge is available on the Web. In particular, common sense knowledge may be hard to acquire from text sources. However, we may be able to acquire such knowledge using crowdsourcing techniques (cf. [38]).

9. CONCLUSIONS

In this paper we described how we built a Web-scale probabilistic knowledge base, which we call Knowledge Vault. In contrast to previous work, we fuse together multiple extraction sources with prior knowledge derived from an existing KB. The resulting knowledge base is about 38 times bigger than existing automatically constructed KBs. The facts in KV have associated probabilities, which we show are well-calibrated, so that we can distinguish what we know with high confidence from what we are uncertain about. In the future, we hope to continue to scale KV, to store more knowledge about the world, and to use this resource to help various downstream applications, such as question answering, entity-based search, etc.

Acknowledgments

We thank John Giannandrea, Fernando Pereira, and Amar Subramanya for constructive suggestions that helped us improve this paper, and Kevin Lerman, Abhijit Mahabal, and Oksana Yakhnenko for developing most of the extraction pipeline. We also thank Eric Altendorf and Alex Smola, who were instrumental in the early phases of the project.

10. REFERENCES

[1] AKBC-WEKEX. The Knowledge Extraction Workshop at NAACL-HLT, 2012.
[2] G. Angeli and C. Manning. Philosophers are mortal: Inferring the truth of unseen facts. In CoNLL, 2013.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, 2007.
[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM, 2008.
[5] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AI/Statistics, 2012.
[6] M. Cafarella, A. Halevy, Z. D. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. VLDB, 1(1):538–549, 2008.
[7] M. J. Cafarella, A. Y. Halevy, and J. Madhavan. Structured data on the web. Commun. ACM, 54(2):72–79, 2011.
[8] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[9] O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
[10] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009.
[11] L. Drumond, S. Rendle, and L. Schmidt-Thieme. Predicting RDF Triples in Incomplete Knowledge Bases with Tensor Factorization. In 10th ACM Intl. Symp. on Applied Computing, 2012.
[12] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[13] J. Fan, D. Ferrucci, D. Gondek, and A. Kalyanpur. Prismatic: Inducing knowledge from a large scale lexicalized relation resource. In First Intl. Workshop on Formalisms and Methodology for Learning by Reading, pages 122–127. Association for Computational Linguistics, 2010.
[14] T. Franz, A. Schultz, S. Sizov, and S. Staab. TripleRank: Ranking Semantic Web Data by Tensor Decomposition. In ISWC, 2009.
[15] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413–422, 2013.
[16] R. Grishman. Information extraction: Capabilities and challenges. Technical report, NYU Dept. CS, 2012.
[17] R. Gupta, A. Halevy, X. Wang, S. Whang, and F. Wu. Biperpedia: An Ontology for Search Applications. In VLDB, 2014.
[18] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. Curran. Evaluating entity linking with Wikipedia. Artificial Intelligence, 194:130–150, 2013.
[19] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence Journal, 2012.
[20] R. Jenatton, N. L. Roux, A. Bordes, and G. Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.
[21] H. Ji, T. Cassidy, Q. Li, and S. Tamang. Tackling Representation, Annotation and Classification Challenges for Temporal Knowledge Base Population. Knowledge and Information Systems, pages 1–36, August 2013.
[22] H. Ji and R. Grishman. Knowledge base population: successful approaches and challenges. In Proc. ACL, 2011.
[23] S. Jiang, D. Lowd, and D. Dou. Learning to refine an automatically extracted knowledge base using Markov logic. In Intl. Conf. on Data Mining, 2012.
[24] N. Lao, T. Mitchell, and W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011.
[25] X. Li and R. Grishman. Confidence estimation for knowledge base population. In Recent Advances in NLP, 2013.
[26] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning for information extraction. In EMNLP, 2012.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[28] B. Min, R. Grishman, L. Wan, C. Wang, and D. Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In NAACL, 2013.
[29] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proc. Conf. Recent Advances in NLP, 2009.
[30] N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, 2011.
[31] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: scalable machine learning for linked data. In WWW, 2012.
[32] F. Niu, C. Zhang, and C. Re. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. Intl. J. On Semantic Web and Information Systems, 2012.
[33] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schoelkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 2000.
[34] J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge graph identification. In International Semantic Web Conference (ISWC), 2013.
[35] L. Reyzin and R. Schapire. How boosting the margin can also boost classifier complexity. In Intl. Conf. on Machine Learning, 2006.
[36] A. Ritter, L. Zettlemoyer, Mausam, and O. Etzioni. Modeling missing data in distant supervision for information extraction. Trans. Assoc. Comp. Linguistics, 1, 2013.
[37] R. Socher, D. Chen, C. Manning, and A. Ng. Reasoning with Neural Tensor Networks for Knowledge Base Completion. In NIPS, 2013.
[38] R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In Proc. of LREC Conference, 2012.
[39] F. Suchanek, G. Kasneci, and G. Weikum. YAGO - A Core of Semantic Knowledge. In WWW, 2007.
[40] D. Suciu, D. Olteanu, C. Re, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.
[41] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The singularity is not near: slowing growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, WikiSym '09, pages 8:1–8:10, 2009.
[42] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. In Proc. of the VLDB Endowment, 2012.
[43] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. Hellerstein. BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models. In VLDB, 2008.
[44] G. Weikum and M. Theobald. From information to knowledge: harvesting entities and relationships from web sources. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 65–76. ACM, 2010.
[45] M. Wick, S. Singh, A. Kobren, and A. McCallum. Assessing confidence of knowledge base content with an experimental study in entity resolution. In AKBC Workshop, 2013.
[46] M. Wick, S. Singh, H. Pandya, and A. McCallum. A Joint Model for Discovering and Linking Entities. In AKBC Workshop, 2013.
[47] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, pages 481–492. ACM, 2012.
[48] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In UAI, 2006.